Abstract
Peptides have a plethora of activities in biological systems that can potentially be exploited biotechnologically. Several peptides are used clinically, as well as in industry and agriculture. The increase in available ’omics data has recently provided a large opportunity for mining novel enzymes, biosynthetic gene clusters, and molecules. While these data primarily consist of DNA sequences, other types of data provide important complementary information. Because peptides are short, the approaches that have proven successful at discovering novel proteins of canonical size cannot be naïvely applied to their discovery. Peptides can be encoded directly in the genome as short open reading frames (smORFs), or they can be derived from larger proteins by proteolysis. Both of these peptide classes pose challenges, as simple methods for their prediction result in large numbers of false positives. Similarly, functional annotation of larger proteins, traditionally based on using sequence similarity to infer orthology and then transferring functions from characterized proteins to uncharacterized ones, cannot be directly applied to short sequences: the use of these techniques is much more limited, and alternative approaches based on machine learning are used instead. Here, we review the limitations of traditional methods as well as the alternative methods that have recently been developed for discovering novel bioactive peptides, with a focus on prokaryotic genomes and metagenomes.
Keywords: bioinformatics, biomedicine, data mining, infectious diseases
1 ∣. INTRODUCTION
Organisms from all domains of life produce bioactive peptides [1]. This includes humans, which produce peptides for signaling as well as for defense against pathogens [2, 3]. While organisms produce them for their fitness benefits, once discovered, they can be harnessed and used for biotechnological applications [4]. Peptides with antimicrobial activity constitute a potential source of alternative antibiotic classes with clinical uses [5-7], while other peptides are potentially useful for cancer therapy [8-10]. Outside the clinic, bioactive peptides have broad applications in the food industry, improving the safety of production and preservation of food [11]. Additionally, peptides are being explored as an alternative to chemical pesticides, offering safe and sustainable options to improve crop production [12, 13], as well as to improve the health of poultry and livestock [11] and for use in aquaculture [14].
Sequencing projects provide a source of novel sequences which can be explored for discovering novel bioactive peptides (Figure 1). In the last few years, microbiome sequencing in particular has provided a large source of novel sequences for discovering natural products from the global microbiome [15, 16] (see Table 1). The available space of genomes to mine for novel sequences has exploded: modern genome databases contain close to 1 million genomes – the proGenomes3 database, released in 2022 [17], represents a >10-fold growth compared to the previous version, proGenomes2, from 2020 [18] – and metagenomics contributes an even larger number of genomes and genes [19]. Although the most widely available data source is metagenomic deoxyribonucleic acid (DNA) sequencing, complementary ’omics data are increasingly collected as well, which brings its own challenges but can also enhance our ability to discover novel peptides of interest. Exploration of the human gut microbiome has already yielded antimicrobial and anticancer peptide candidates [9, 20, 21], an effort that we have recently expanded to cover the global microbiome [22].
FIGURE 1.
Different approaches for genomic mining of bioactive peptides. The schematic illustrates the typical computational framework for identifying functional peptides within bacterial genomes specifically, which differs from the one applied to canonical-size proteins. As can be seen, the ortholog-based approach does not work well with small proteins, which makes machine learning a viable option to attribute function to peptides. This characteristically multidisciplinary approach usually reduces considerably the number of identifiable small proteins, but makes their detection more reliable.
TABLE 1.
General genomic resources which can serve as sources for finding novel bioactive peptides.
| Database | Description | Link |
|---|---|---|
| GMGCv1 [19] | A database that provides a comprehensive catalog of microbial genes and genomes from diverse environments, including metagenomes, single-cell genomes, and isolate genomes | https://gmgc.embl.de/ |
| Earth Microbiome Catalog [23] | A database of genes and genomes collected from several microbiomes across the earth | https://portal.nersc.gov/GEM/ |
| Human microbiome blueprint [24] | ~100k MAGs catalog of microbes from human microbiota | https://ftp.ebi.ac.uk/pub/databases/metagenomics/umgs_analyses/ |
| Genomes OnLine Database (GOLD) [25] | An open-access repository of genome and metagenome sequencing projects with their associated metadata. Provides login-free access to a growing catalogue of manually curated public projects from all over the world | https://gold.jgi.doe.gov/ |
| NCBI [26] | A database that contains metagenomic classification tools to match sequences against a database of microbial genomes to identify the microbial community structure and function | https://ncbi.nlm.nih.gov/ |
| ProGenomes [17] | A database of high-quality, complete, and draft prokaryotic genomes that can be used for comparative genomics, metagenomics, and other studies | https://progenomes.embl.de |
| GTDB [27] | A database that provides a standardized and comprehensive genome-based taxonomy for bacteria and archaea | https://gtdb.ecogenomic.org |
| Animal Metagenomes Database [28] | A database of animal metagenomes that allows users to browse, search, and download animal metagenomic data of interest based on different attributes of the metadata such as animal species | https://figshare.com/articles/dataset/AnimalMetagenome_DB_a_database_for_animal_metagenomes/19728619 |
| MGnify [29] | Database containing metagenomes, genomes, MAGs and other information at the global level with their own annotation pipeline. | https://www.ebi.ac.uk/metagenomics |
| IMG/M [30] | A database that provides tools for analyzing microbial communities sequenced by the Joint Genome Institute (JGI) | https://img.jgi.doe.gov |
| PATRIC [31] | A database that provides bacterial genomic data and tools for analyzing bacterial pathogens | https://www.bv-brc.org/ |
| MG-RAST [32] | A database that provides tools for analyzing metagenomic data and a repository for metagenomic datasets | https://www.mg-rast.org/ |
GMGCv1, Global Microbial Gene Catalog version 1; GTDB, Genome Taxonomy Database; IMG/M, Integrated Microbial Genomes & Microbiomes; MG-RAST, Metagenomics Rapid Annotation using Subsystem Technology; NCBI, National Center for Biotechnology Information; PATRIC, Pathosystems Resource Integration Center.
However, while this wealth of data represents an opportunity to find novel bioactive sequences, the techniques used for finding functional genes of longer size, comprising bioinformatics approaches developed over several decades, are not applicable to smaller gene and protein sequences. Instead, specific techniques and methods are required (Figure 1). Here, we review both the difficulties in adapting existing approaches as well as some of the ongoing efforts to overcome them. This is still an emerging and active field of research, but approaches based on artificial intelligence rather than pure sequence similarity appear to be more successful [33-35].
2 ∣. TRADITIONAL GENE MINING METHODS DO NOT WORK FOR PEPTIDES
At a high level, the most widely used approach starts with genomic sequencing (including, nowadays, metagenomes, i.e., sequenced microbial communities), followed by the prediction of protein-coding genes [36] (see Figure 1). Proteins with functions of interest are then identified for further exploration using computational methods, mostly based on homology (or orthology [37]), or by resorting to data types beyond sequencing such as transcriptomics [16] or proteomics [38].
When considering small sequences, traditional methods underperform at all these steps (Figure 1). For gene finding, we first need to take into consideration that peptides can either be derived from longer precursor proteins by proteolysis, in which case they are sometimes referred to as being encrypted, or be encoded as their own genes (small/short open reading frame [smORF]-encoded peptides, or SEPs) [39-42]. These two categories pose different challenges.
If we consider first the case of SEPs, then the problem is gene prediction from sequence. For prokaryotes, Prodigal (including its metagenomic mode, sometimes referred to as metaProdigal) is one of the most widely used tools [36]. It uses models learned from model organisms to identify features of protein coding genes which can be used to identify novel genes. However, it only returns predictions of at least 90 bp (base pairs, corresponding to 30 amino acids), and those below 250 bp (corresponding to 83 amino acids) are internally penalized by the model and less likely to be output. These thresholds and penalties are necessary to reduce the number of false positive predictions [36]. Other tools, such as Glimmer3 [43] or MetaGeneMark [44], use a similar combination of hard thresholds and penalties against short genes to reduce false positive predictions. Furthermore, smORFs can often even be found overlapping canonical genes [45], which traditional tools rarely consider. Thus, while predicting canonical-length protein-coding genes can be done automatically with few errors in a fully ab initio form (i.e., based only on features of the sequence), the same methods cannot be naïvely applied to predicting small proteins. The challenge of finding reliable smORFs in genomic sequences has been known for a long time [46], but there are fundamental limitations: for example, while a long coding sequence without a stop codon can be indicative of selection, smORFs can appear purely by chance [47].
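How easily short stop-free stretches arise by chance can be made concrete with a short calculation. The sketch below is only an illustration under a simplifying assumption that is not part of the cited work: every one of the 64 codons is taken to be equally likely and independent of its neighbours, so a random codon avoids the three stop codons with probability 61/64. The function name is illustrative.

```python
# Probability that a stretch of random DNA contains no stop codon.
# Assumption (for illustration only): all 64 codons are equally likely and
# independent, and 3 of them are stop codons. Real genomes have skewed
# composition, so these numbers are indicative rather than exact.

P_SENSE = 61 / 64  # probability that a random codon is not a stop codon


def prob_stop_free(n_codons: int) -> float:
    """Probability that n_codons consecutive random codons contain no stop."""
    return P_SENSE ** n_codons


if __name__ == "__main__":
    # 30 codons  ~ the 90 bp minimum length mentioned above
    # 83 codons  ~ the 250 bp penalty threshold mentioned above
    # 300 codons ~ a canonical-length gene (illustrative choice)
    for n in (30, 83, 300):
        print(f"{n:>4} codons: P(no stop) = {prob_stop_free(n):.2e}")
```

Under this simple model, roughly one in four random 30-codon windows is free of stop codons, whereas a 300-codon window is stop-free by chance less than once in a million. This is why length alone is strong evidence for canonical genes but not for smORFs.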
When it comes to finding encrypted peptides, it is similarly difficult to avoid high rates of false positives. A further complication is that, in the case of encrypted peptides, the function of the precursor protein may be unrelated to the function of the peptide [41, 48].
After having predicted a potential peptide, the next step is functional prediction (Figure 1). For larger sequences, this is most often performed using concepts related to homology, that is, by finding another better studied protein (or a group of proteins) with a shared common ancestor and transferring the knowledge from the better characterized protein to a new one [37, 49-51]. The development of algorithms to perform this reliably and efficiently has been one of the achievements of the field of bioinformatics, including fundamental tools such as BLAST [49], and more modern approaches such as DIAMOND [50] and MMSeqs2 [52] that can scale to the very large databases available now.
These methods cannot be naïvely applied to short sequences, as finding statistically reliable matches is difficult unless one is dealing with very close relatives. In the case of finding a small peptide match in a large database, even a high-identity match can occur by chance and this is not sufficient evidence of shared ancestry. Therefore, when applied to short sequences, these methods only work for finding sequences that are close to already characterized ones [53]. Thus, these methods cannot be extended to searching for matches in large datasets. Finally, short proteins appear to be under lower purifying selection than traditional genes (at least in humans), which additionally complicates the process of finding homologous genes [54].
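The dependence on database size can be illustrated with the standard relationship between a bit score S' and the expected number of chance hits, E = m · n · 2^(−S'), where m is the query length and n the total database length in residues. The sketch below is a simplified illustration: it ignores the effective-length corrections that search tools apply, and the bit score and database sizes are hypothetical values chosen only for the example.

```python
# Expected number of chance alignments (E-value) for a given bit score.
# Uses the textbook approximation E = m * n * 2**(-S'); real tools apply
# additional corrections (e.g., effective lengths) that are omitted here.


def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value for a hit with the given bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical example: a 15-residue peptide whose best possible match
    # reaches a bit score of 30 (assumed value, not taken from the text).
    peptide_len = 15
    best_bit_score = 30.0
    for db_residues in (10**6, 10**9):  # small curated set vs. large catalog
        e = expected_chance_hits(best_bit_score, peptide_len, db_residues)
        print(f"database of {db_residues:.0e} residues: E = {e:.3g}")
```

Because a short peptide caps the achievable score, the same match that looks significant against a small curated database (E well below 1) is expected to occur by chance many times over in a catalog a thousand times larger.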
In summary, the traditional sequence to gene to function pipeline that has been developed by the field of bioinformatics over several decades is not necessarily adequate for discovering bioactive peptides.
3 ∣. MACHINE LEARNING APPROACHES FOR PREDICTION OF PEPTIDES
For predicting smORFs in genomes with high confidence, one approach is to partly abandon the ab initio approach and rely on homology and signals of conservation. sORF finder combines a trained hexamer-based model with the identification of homologous smORFs to detect signatures of purifying selection [55]. In their survey of smORFs from the global microbiome, Sberro et al. [56] used a similar concept implemented in the tool RNACode, which likewise relies on multiple homologous sequences as inputs [57]. Later, the same group used this database to build predictive models which form the basis of the tool SmORFinder [58]. Alternatively, tools such as Random forest-based tool for the prediction of Small Encoded Proteins (RanSEPs) and ProsmORF-pred employ machine learning models based on features of the sequence [59, 60].
A few approaches have previously been employed for functional prediction of peptides. For example, smORFunction uses co-expression to predict the function of small proteins [61]. This approach uses expression profiling to correlate the expression of a particular smORF with other annotated proteins, which is a classical approach for functional annotation in longer proteins [62, 63]. This can also be performed in communities, using metatranscriptomics in parallel with metagenomics [64].
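The idea behind co-expression-based annotation can be sketched as follows. This is a minimal illustration of the general principle (ranking annotated genes by the Pearson correlation of their expression profiles with that of a smORF), not the implementation of smORFunction or any other cited tool; the gene names and expression values are invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_coexpressed(target, annotated):
    """Return (gene, r) pairs sorted by decreasing correlation with target."""
    scores = [(gene, pearson(target, profile)) for gene, profile in annotated.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical expression values across five conditions (made-up numbers).
smorf_profile = [1.0, 2.0, 8.0, 9.0, 1.5]
annotated_profiles = {
    "stress_response_gene": [0.8, 2.5, 7.0, 9.5, 1.0],
    "housekeeping_gene": [5.0, 5.2, 5.1, 4.9, 5.0],
    "anticorrelated_gene": [9.0, 8.0, 2.0, 1.0, 8.5],
}

ranking = rank_coexpressed(smorf_profile, annotated_profiles)
for gene, r in ranking:
    print(f"{gene}: r = {r:.2f}")
```

The functions of the top-ranked annotated genes then serve as hypotheses for the function of the smORF. In practice, many more conditions and appropriate normalization and multiple-testing control would be needed.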
Given that it is not trivial to obtain transcriptomics data and that some of the functions that peptides are used for may only be triggered in specific conditions, it is still important to attempt to obtain functional predictions from sequence alone. For this task, orthology finding is the major computational approach for canonical-length proteins. As discussed above, these approaches are not directly applicable for shorter sequences and machine-learning based methods are instead the most widely used approaches [65-67]. In this framework, a dataset of annotated peptides is used to train a model which is then applied to other candidate sequences.
For users of prediction tools, it is crucial to understand the quality of the predictions. For the purposes of mining sequences from large-scale data, the crucial metric is the precision: the fraction of peptides predicted as active that are truly active. It is important to note that the classifier will be applied to the outputs of the earlier steps in the pipeline, which – as we noted above – are themselves often false positives. Thus, the classifier will be applied to several times more negative inputs than positive ones. Nonetheless, many methods have been evaluated on test datasets where there is an equal number of positive and negative examples. In this context, a model can exhibit apparent high accuracy, but only achieve mediocre results in practice. For example, if the true fraction of active peptides is 1% and a model has 98% accuracy on a balanced test set, with an equal propensity for both false positives and false negatives (i.e., 98% sensitivity and 98% specificity), then out of 10,000 candidates it will correctly flag 98 of the 100 active peptides but also flag 198 of the 9,900 inactive ones, so that precision will be only about 33%.
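The effect of class imbalance on precision follows directly from Bayes' rule and can be checked numerically. The sketch below assumes that a classifier is characterized by its sensitivity (recall on active peptides) and specificity (fraction of inactive peptides correctly rejected); the function name and the chosen operating points are illustrative.

```python
# Precision of a classifier as a function of the fraction of true positives
# (prevalence) among its inputs, given fixed sensitivity and specificity.


def precision_at_prevalence(sensitivity: float, specificity: float,
                            prevalence: float) -> float:
    """Fraction of positive predictions that are truly positive."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Compare a balanced benchmark (50% active) with a mining scenario
    # in which only 1% of candidates are active.
    for quality in (0.95, 0.98):
        balanced = precision_at_prevalence(quality, quality, 0.50)
        mining = precision_at_prevalence(quality, quality, 0.01)
        print(f"sens = spec = {quality:.2f}: "
              f"precision {balanced:.2f} (balanced) vs {mining:.2f} (1% active)")
```

A model with 98% sensitivity and specificity has 98% precision on a balanced benchmark but only about one-third precision when 1% of inputs are truly active, and a model at 95% drops to about one in six.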
For use in gene mining applications, appropriate validation datasets need to be selected [68] to avoid pervasive overestimation [69]. Alternatively, methods can be tuned to provide conservative estimates, minimizing the rate of false positives even at the cost of lower recall [20, 70]. When the goal is the discovery of new bioactive peptides, the computational process aims to produce a set of candidates that will then be validated experimentally: the goal is to obtain a small number of very high-quality predictions.
For prediction of antimicrobial peptides (AMPs), many combinations of features, and machine learning algorithms have been tried [71-74]. Rather than sequence similarity, these methods exploit properties characteristic to AMPs, such as amino acid characteristics and distribution. Anticancer or antiviral peptide predictors have also been proposed based on similar principles [75] and some tools aim to classify multiple classes simultaneously [76-78].
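As an illustration of what sequence-derived features can look like, the sketch below computes three simple descriptors of a peptide: its length, an approximate net charge, and the fraction of hydrophobic residues. These are deliberately simplified assumptions for the example (charge counts only K/R as +1 and D/E as −1, and the hydrophobic set is one common grouping); they are not the feature set of any of the cited predictors, and the example sequence is made up.

```python
# Simple sequence-derived descriptors of the kind that can be fed to a
# machine learning classifier. Simplifications (assumptions of this sketch):
# - net charge counts K and R as +1, D and E as -1, and ignores histidine
#   and the termini;
# - "hydrophobic" is defined by the fixed residue set below.

HYDROPHOBIC = set("AVILMFWC")


def simple_features(peptide: str) -> dict:
    """Return length, approximate net charge, and hydrophobic fraction."""
    seq = peptide.upper()
    if not seq:
        raise ValueError("empty peptide sequence")
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    hydrophobic = sum(1 for aa in seq if aa in HYDROPHOBIC)
    return {
        "length": len(seq),
        "net_charge": positive - negative,
        "hydrophobic_fraction": hydrophobic / len(seq),
    }


if __name__ == "__main__":
    # Made-up cationic, amphipathic-style sequence used only as an example.
    print(simple_features("KKLLKKLLKK"))
```

Feature vectors of this kind, computed for known active and inactive peptides, form the training input for a classifier, which is then applied to candidate sequences.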
Structure prediction has undergone a revolution with the advent of deep learning-based methods, in particular AlphaFold [79] and AlphaFold2 (AF2) [80]. The best performances are obtained with multiple-sequence alignments requiring homology finding, which, as discussed earlier, is not trivial with small proteins.
Results on small peptides have shown that AF2 performs well in certain simple folds such as membrane-associated α-helices and β-hairpins, with mixed results in other classes [81]. Overall, though, it appears that although folding methods may require some adaptation to work consistently well on small peptides, unlike some of the other problems listed before, there is no fundamental reason why they should not perform well once trained on peptide sequences, except that compared to longer sequences, small peptides may have a less well-defined structure.
4 ∣. ’OMICS DATA BEYOND DNA SEQUENCES
Other data sources can be used to provide evidence for genomic predictions. In this context, transcriptomics (high-throughput RNA sequencing [RNA-seq]), proteomics, and ribosome profiling (Ribo-seq, where molecules being actively translated are identified) are the most widely used methods. They can be used individually or combined together [82, 83].
Transcriptomics can be particularly valuable when studying complex organisms where production of active secreted peptides can be localized to a particular tissue or cell type. In that case, targeted transcriptomics of those tissues can provide a source enriched in active peptides [84]. In the case of communities, it is possible to obtain meta-transcriptomes, but these are not as widely available as metagenomes given the difficulty in obtaining them in the lab.
Proteomics can, in principle, provide direct detection of peptides, particularly if protocols are adapted to this task [85, 86]. It is important to note, however, that processing proteomics data is often done by matching to databases. As most extant databases do not contain short peptides, novel peptides will not be reported. Furthermore, in the case of encrypted peptides, the detection of the peptide may be reported as a detection of the original full-length protein. Even for independent smORFs, short sequences may not produce enough unique spectra so that a confident detection can be reported. Using machine learning, data processing pipelines can be adapted to the task of finding small peptides in proteomics data [87] (Figure 1).
Ribo-seq, or ribosome profiling, is the sequencing of messenger RNA (mRNA) transcripts that are attached to ribosomes. The result is that transcripts that are being actively translated get captured and can then be mapped back to the genome to identify actively translated sequences. With this approach, translation of noncanonical proteins can be observed, including smORFs. For this class of peptides, Ribo-seq (and meta-Ribo-seq) can indeed provide strong evidence of translation. From the point of view of mining and peptide discovery, the paucity of datasets is currently the main limitation in this field.
5 ∣. DATABASES OF ACTIVE PEPTIDES
Databases of active peptides are useful for cataloguing known peptides for human exploration. Furthermore, they are crucial for training machine learning methods as described above. Many databases of AMPs have been published [88-96] and this field has been reviewed elsewhere [1].
Several groups have performed smORF screens in select species [97], while others have collated smORFs from the published literature to build aggregate databases (Table 2). The web resource sORFs.org uses Ribo-seq data [98, 99], Small Proteins Identified from Ribosome Profiling Database (smPROT) incorporated mass spectrometry data and literature mining [100], ARA-PEPs collated putative SEPs in Arabidopsis thaliana [101], and Plant small ORFs database (PsORF) applied automated detection methods to 36 different plant species [102]. The MetamORF database reworked and standardized data from several projects on human and mouse smORFs [103]. Crucially, these databases are focused on a relatively small number of eukaryotic species. One exception is the database DBsmORF, which was built from prokaryotic genomes and metagenomes, namely the NCBI Reference Sequence Database (RefSeq) and the Human Microbiome Project (HMP), using the tool SmORFinder [58]. The recently released Global Microbial smORF Catalogue (GMSC) explores the global microbiome [104, 105].
TABLE 2.
Databases dedicated to smORFs.
| Database | Description | Link |
|---|---|---|
| MetamORF [103] | A repository of unique short open reading frames identified by both experimental and computational approaches | https://metamorf.hb.univ-amu.fr/ |
| SmProt [100, 106] | A repository with comprehensive annotation of small proteins identified from ribosome profiling | http://bigdata.ibp.ac.cn/SmProt/ |
| sORFs.org [98, 99] | A repository of small ORFs identified by ribosome profiling | https://www.sorfs.org |
| DBsmORF [58] | Database and web portal designed to help you identify small open reading frames (smORFs) in your microbial sequencing datasets | http://104.154.134.205:3838/DBsmORF/ |
| GMSC [104, 105] | Database collating smORFs from the global microbiome | https://gmsc.big-data-biology.org/ |
GMSC, Global Microbial smORF Catalogue; ORF, Open Reading Frame; smORFs, small/short Open Reading Frames; smPROT, Small Proteins Identified from Ribosome Profiling Database.
6 ∣. CONCLUSIONS
Mining high-throughput ’omics data (primarily genomic sequences) represents an exciting opportunity to find novel bioactive peptides, which are much needed in several biotechnological and medical applications. However, to do so requires a set of computational tools and protocols very different (or at least adapted) from the ones that have been so successful for longer sequences. In particular, machine learning and artificial intelligence play a bigger role in this area compared to canonical proteins where sequence similarity (to infer homology and orthology) is the main tool for analyzing sequences.
Genomic (DNA) information, including metagenomics, is the most abundant data type. Combining multiple sources of information (multi’omics) is one approach to reduce the issue of false positive predictions. Ribo-seq, transcriptomics, and proteomics can all be used to increase confidence in the activity of a specific sequence. In the case of analyzing complex communities, these methods are still not widely used as they are much more challenging to execute. Computational methods to analyze these data sources also need to be adapted to small sequences, and their results can be less reliable than is the case for canonical-length sequences.
Despite the challenges, there is immense potential in exploiting the vast genomic information and other resources that have become available in the last several years, which will continue to grow. Ground-truth experimental validation will always be needed, but computational methods can help us accelerate the discovery of novel, functional peptides.
ACKNOWLEDGMENTS
Cesar de la Fuente-Nunez holds a Presidential Professorship at the University of Pennsylvania, is a recipient of the Langer Prize by the AIChE Foundation, and acknowledges funding from the IADR Innovation in Oral Care Award, the Procter & Gamble Company, United Therapeutics, a BBRF Young Investigator Grant, the Nemirovsky Prize, Penn Health-Tech Accelerator Award, the Dean’s Innovation Fund from the Perelman School of Medicine at the University of Pennsylvania, the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM138201, and the Defense Threat Reduction Agency (DTRA; HDTRA11810041, HDTRA1-21-1-0014, and HDTRA1-23-1-0001). Luis Pedro Coelho is supported by the Australian Research Council (grant number FT230100724). We thank the Coelho and de la Fuente Lab members for insightful discussions.
Abbreviations:
- AF2: AlphaFold2
- AMP: antimicrobial peptide
- bp: base pairs
- DNA: deoxyribonucleic acid
- GMGCv1: Global Microbial Gene Catalog version 1
- GMSC: Global Microbial smORF Catalogue
- GOLD: Genomes OnLine Database
- GTDB: Genome Taxonomy Database
- HMP: Human Microbiome Project
- IMG/M: Integrated Microbial Genomes & Microbiomes
- MAGs: metagenome assembled genomes
- MG-RAST: Metagenomics Rapid Annotation using Subsystem Technology
- mRNA: messenger RNA (ribonucleic acid)
- NCBI: National Center for Biotechnology Information
- ORF: Open Reading Frame
- PATRIC: Pathosystems Resource Integration Center
- RanSEPs: Random forest-based tool for the prediction of Small Encoded Proteins in bacterial genomes
- Ribo-seq: ribosome profiling
- RNA-seq: high-throughput RNA sequencing
- SEP: smORF-encoded peptide
- smORF: small/short Open Reading Frame
- smPROT: Small Proteins Identified from Ribosome Profiling Database
- RefSeq: Reference Sequence Database (NCBI)
- PsORF: Plant small ORFs database
Footnotes
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.
REFERENCES
- 1.Ramazi S, Mohammadi N, Allahverdi A, Khalili E, & Abdolmaleki P (2022). A review on antimicrobial peptides databases and the computational tools. Database (Oxford), baac011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Magana M, Pushpanathan M, Santos AL, Leanse L, Fernandez M, Ioannidis A, Giulianotti MA, Apidianakis Y, Bradfute S, Ferguson AL, Cherkasov A, Seleem MN, Pinilla C, de la Fuente-Nunez C, Lazaridis T, Dai T, Houghten RA, Hancock REW, & Tegos GP (2020). The value of antimicrobial peptides in the age of resistance. The Lancet Infectious Diseases, 20, e216–e230. [DOI] [PubMed] [Google Scholar]
- 3.Torres MDT, & de la Fuente-Nunez C (2019). Reprogramming biological peptides to combat infectious diseases. Chemical Communications (Cambridge, England), 55, 15020–15032. [DOI] [PubMed] [Google Scholar]
- 4.Fosgerau K, & Hoffmann T (2015). Peptide therapeutics: Current status and future directions. Drug Discovery Today, 20, 122–128. [DOI] [PubMed] [Google Scholar]
- 5.Soltani S, Hammami R, Cotter PD, Rebuffat S, Said LB, Gaudreau H, Bédard F, Biron E, Drider D, & Fliss I (2021). Bacteriocins as a new generation of antimicrobials: Toxicity aspects and regulations. FEMS Microbiology Review, 45, fuaa039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.de la Fuente-Núñez C, Korolik V, Bains M, Nguyen U, Breidenstein EB, Horsman S, Lewenza S, Burrows L, & Hancock RE (2012). Inhibition of bacterial biofilm formation and swarming motility by a small synthetic cationic peptide. Antimicrobial Agents and Chemotherapy, 56, 2696–2704. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Porto WF, Irazazabal L, Alves ESF, Ribeiro SM, Matos CO, Pires ÁS, Fensterseifer ICM, Miranda VJ, Haney EF, Humblot V, Torres MDT, Hancock REW, Liao LM, Ladram A, Lu TK, de la Fuente-Nunez C, & Franco OL (2018). In silico optimization of a guava antimicrobial peptide enables combinatorial exploration for peptide design. Nature Communications, 9, 1490. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Tyagi A, Tuknait A, Anand P, Gupta S, Sharma M, Mathur D, Joshi A, Singh S, Gautam A, & Raghava GP (2015). CancerPPD: A database of anticancer peptides and proteins. Nucleic Acids Research, 43, D837–843. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Ma Y, Liu X, Zhang X, Yu Y, Li Y, Song M, & Wang J (2023). Efficient mining of anticancer peptides from gut metagenome. Advanced Science (Weinh), 10, e2300107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Torres MDT, Silva AF, Andrade GP, Pedron CN, Cerchiaro G, Ribeiro AO, Oliveira VX Jr, & de la Fuente-Nunez C (2020). The wasp venom antimicrobial peptide polybia-CP and its synthetic derivatives display antiplasmodial and anticancer properties. Bioengineering & Translational Medicine, 5, e10167. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Tian T, Xie W, Liu L, Fan S, Zhang H, Qin Z, & Yang C (2023). Industrial application of antimicrobial peptides based on their biological activity and structure-activity relationship. Critical Reviews in Food Science and Nutrition, 63, 5430–5445. [DOI] [PubMed] [Google Scholar]
- 12.Ormancey M, Guillotin B, San Clemente H, Thuleau P, Plaza S, & Combier JP (2021). Use of microRNA-encoded peptides to improve agronomic traits. Plant Biotechnology Journal, 19, 1687–1689. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Lin S, Chen X, Chen H, Cai X, Chen X, & Wang S (2022). The bioprospecting of microbial-derived antimicrobial peptides for sustainable agriculture. Engineering, 10.1016/j.eng.2022.08.011 [DOI] [Google Scholar]
- 14.Lu J, Zhang Y, Wu J, & Wang J (2022). Intervention of antimicrobial peptide usage on antimicrobial resistance in aquaculture. Journal of Hazardous Materials, 427, 128154. [DOI] [PubMed] [Google Scholar]
- 15.Youngblut ND, de la Cuesta-Zuluaga J, Reischer GH, Dauser S, Schuster N, Walzer C, Stalder G, Farnleitner AH, & Ley RE (2020). Large scale metagenome assembly reveals novel animal-associated microbial genomes, biosynthetic gene clusters, and other genetic diversity. mSystems, 5, 10.1128/msystems.01045-20 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Paoli L, Ruscheweyh H-J, Forneris CC, Hubrich F, Kautsar S, Bhushan A, Lotti A, Clayssen Q, Salazar G, Milanese A, Carlström CI, Papadopoulou C, Gehrig D, Karasikov M, Mustafa H, Larralde M, Carroll LM, Sánchez P, Zayed AA, … Sunagawa S (2022). Biosynthetic potential of the global ocean microbiome. Nature, 607, 111–118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Fullam A, Letunic I, Schmidt TSB, Ducarmon QR, Karcher N, Khedkar S, Kuhn M, Larralde M, Maistrenko OM, Malfertheiner L, Milanese A, Rodrigues JFM, Sanchis-López C, Schudoma C, Szklarczyk D, Sunagawa S, Zeller G, Huerta-Cepas J, von Mering C, … Mende DR (2023). proGenomes3: Approaching one million accurately and consistently annotated high-quality prokaryotic genomes. Nucleic Acids Research, 51, D760–D766.
- 18. Mende DR, Letunic I, Maistrenko OM, Schmidt TSB, Milanese A, Paoli L, Hernández-Plaza A, Orakov AN, Forslund SK, Sunagawa S, Zeller G, Huerta-Cepas J, Coelho LP, & Bork P (2020). proGenomes2: An improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Research, 48, D621–D625.
- 19. Coelho LP, Alves R, del Río ÁR, Myers PN, Cantalapiedra CP, Giner-Lamia J, Schmidt TS, Mende DR, Orakov A, Letunic I, Hildebrand F, Van Rossum T, Forslund SK, Khedkar S, Maistrenko OM, Pan S, Jia L, Ferretti P, Sunagawa S, … Bork P (2022). Towards the biogeography of prokaryotic genes. Nature, 601, 252–256.
- 20. Ma Y, Guo Z, Xia B, Zhang Y, Liu X, Yu Y, Tang N, Tong X, Wang M, Ye X, Feng J, Chen Y, & Wang J (2022). Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nature Biotechnology, 1–11.
- 21. Torres MDT, Brooks E, Cesaro A, Sberro H, Nicolaou C, Bhatt AS, & de la Fuente-Nunez C (2023). Human gut metagenomic mining reveals an untapped source of peptide antibiotics. BioRxiv, 2023.08.31.555711.
- 22. Santos-Júnior CD, Torres MDT, Duan Y, Del Río ÁR, Schmidt TSB, Chong H, Fullam A, Kuhn M, Zhu C, Houseman A, Somborski J, Vines A, Zhao XM, Bork P, Huerta-Cepas J, de la Fuente-Nunez C, & Coelho LP (2023). Computational exploration of the global microbiome for antibiotic discovery. BioRxiv, 2023.08.31.555663.
- 23. Nayfach S, Roux S, Seshadri R, Udwary D, Varghese N, Schulz F, Wu D, Paez-Espino D, Chen IM, Huntemann M, Palaniappan K, Ladau J, Mukherjee S, Reddy TBK, Nielsen T, Kirton E, Faria JP, Edirisinghe JN, Henry CS, … Eloe-Fadrosh EA (2021). A genomic catalog of Earth’s microbiomes. Nature Biotechnology, 39, 499–509.
- 24. Almeida A, Mitchell AL, Boland M, Forster SC, Gloor GB, Tarkowska A, Lawley TD, & Finn RD (2019). A new genomic blueprint of the human gut microbiota. Nature, 568, 499–504.
- 25. Mukherjee S, Stamatis D, Li CT, Ovchinnikova G, Bertsch J, Sundaramurthi JC, Kandimalla M, Nicolopoulos PA, Favognano A, Chen IA, Kyrpides NC, & Reddy TBK (2023). Twenty-five years of Genomes OnLine Database (GOLD): Data updates and new features in v.9. Nucleic Acids Research, 51, D957–D963.
- 26. Sayers EW, Bolton EE, Brister JR, Canese K, Comeau DC, Funk K, Kim S, Klimke W, Marchler-Bauer A, Landrum M, Lathrop S, Lu Z, Madden TL, O’Leary N, Phan L, Rangwala SH, Schneider VA, Skripchenko Y, … Sherry ST (2023). Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Research, 51, D29–D38.
- 27. Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil P-A, & Hugenholtz P (2022). GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50, D785–D794.
- 28. Hu R, Yao R, Li L, Xu Y, Lei B, Tang G, Liang H, Lei Y, Li C, Li X, Liu K, Wang L, Zhang Y, Wang Y, Cui Y, Dai J, Ni W, Zhou P, Yu B, & Hu S (2022). A database of animal metagenomes. Scientific Data, 9, 312.
- 29. Richardson L, Allen B, Baldi G, Beracochea M, Bileschi ML, Burdett T, Burgin J, Caballero-Pérez J, Cochrane G, Colwell LJ, Curtis T, Escobar-Zepeda A, Gurbich TA, Kale V, Korobeynikov A, Raj S, Rogers AB, Sakharova E, Sanchez S, … Finn RD (2023). MGnify: The microbiome sequence data analysis resource in 2023. Nucleic Acids Research, 51, D753–D759.
- 30. Chen I-MA, Chu K, Palaniappan K, Ratner A, Huang J, Huntemann M, Hajek P, Ritter S, Varghese N, Seshadri R, Roux S, Woyke T, Eloe-Fadrosh EA, Ivanova NN, & Kyrpides NC (2021). The IMG/M data management and analysis system v.6.0: New tools and advanced capabilities. Nucleic Acids Research, 49, D751–D763.
- 31. Olson RD, Assaf R, Brettin T, Conrad N, Cucinell C, Davis JJ, Dempsey DM, Dickerman A, Dietrich EM, Kenyon RW, Kuscuoglu M, Lefkowitz EJ, Lu J, Machi D, Macken C, Mao C, Niewiadomska A, Nguyen M, Olsen GJ, … Stevens RL (2023). Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): A resource combining PATRIC, IRD and ViPR. Nucleic Acids Research, 51, D678–D689.
- 32. Meyer F, Bagchi S, Chaterji S, Gerlach W, Grama A, Harrison T, Paczian T, Trimble WL, & Wilke A (2019). MG-RAST version 4-lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis. Briefings in Bioinformatics, 20, 1151–1159.
- 33. Wan F, Kontogiorgos-Heintz D, & de la Fuente-Nunez C (2022). Deep generative models for peptide design. Digital Discovery, 1, 195–208.
- 34. Wong F, de la Fuente-Nunez C, & Collins JJ (2023). Leveraging artificial intelligence in the fight against infectious diseases. Science, 381, 164–170.
- 35. Wan F, Torres MDT, Peng J, & de la Fuente-Nunez C (2023). Molecular de-extinction of antibiotics enabled by deep learning. BioRxiv, 2023.10.01.560353.
- 36. Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, & Hauser LJ (2010). Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119.
- 37. Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, & Bork P (2017). Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Molecular Biology and Evolution, 34, 2115–2122.
- 38. Wilmes P, Wexler M, & Bond PL (2008). Metaproteomics provides functional insight into activated sludge wastewater treatment. PLoS ONE, 3, e1778.
- 39. Saghatelian A, & Couso JP (2015). Discovery and characterization of smORF-encoded bioactive polypeptides. Nature Chemical Biology, 11, 909–916.
- 40. Torres MDT, Melo MCR, Flowers L, Crescenzi O, Notomista E, & de la Fuente-Nunez C (2022). Mining for encrypted peptide antibiotics in the human proteome. Nature Biomedical Engineering, 6, 67–75.
- 41. Cesaro A, Torres MDT, Gaglione R, Dell’Olmo E, Bosso A, Pizzo E, Haagsman HP, Veldhuizen EJA, de la Fuente-Nunez C, & Arciello A (2022). Synthetic antibiotic derived from sequences encrypted in a protein from human plasma. ACS Nano, 16, 1880–1895.
- 42. Maasch JRMA, Torres MDT, Melo MCR, & de la Fuente-Nunez C (2023). Molecular de-extinction of ancient antimicrobial peptides enabled by machine learning. Cell Host & Microbe, 31, 1260–1274.e6.
- 43. Delcher AL, Bratke KA, Powers EC, & Salzberg SL (2007). Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics, 23, 673–679.
- 44. Zhu W, Lomsadze A, & Borodovsky M (2010). Ab initio gene identification in metagenomic sequences. Nucleic Acids Research, 38, e132.
- 45. Zehentner B, Ardern Z, Kreitmeier M, Scherer S, & Neuhaus K (2020). Evidence for numerous embedded antisense overlapping genes in diverse E. coli strains. BioRxiv, 2020.11.18.388249.
- 46. Ochman H. (2002). Distinguishing the ORFs from the ELFs: Short bacterial genes and the annotation of genomes. Trends in Genetics, 18, 335–337.
- 47. Fickett JW (1995). ORFs and genes: How strong a connection? Journal of Computational Biology, 2, 117–123.
- 48. Fingerhut LCHW, Miller DJ, Strugnell JM, Daly NL, & Cooke IR (2020). ampir: An R package for fast genome-wide prediction of antimicrobial peptides. Bioinformatics, 36, 5262–5263.
- 49. Altschul SF, Gish W, Miller W, Myers EW, & Lipman DJ (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410.
- 50. Buchfink B, Xie C, & Huson DH (2014). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12, 59–60.
- 51. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, & Huerta-Cepas J (2021). eggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Molecular Biology and Evolution, 38, 5825–5829.
- 52. Steinegger M, & Söding J (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026–1028.
- 53. Scheetz T, Bartlett JA, Walters JD, Schutte BC, Casavant TL, & McCray PB Jr (2002). Genomics-based approaches to gene discovery in innate immunity. Immunological Reviews, 190, 137–145.
- 54. Slavoff SA, Mitchell AJ, Schwaid AG, Cabili MN, Ma J, Levin JZ, Karger AD, Budnik BA, Rinn JL, & Saghatelian A (2013). Peptidomic discovery of short open reading frame–encoded peptides in human cells. Nature Chemical Biology, 9, 59–64.
- 55. Hanada K, Akiyama K, Sakurai T, Toyoda T, Shinozaki K, & Shiu SH (2010). sORF finder: A program package to identify small open reading frames with high coding potential. Bioinformatics, 26, 399–400.
- 56. Sberro H, Fremin BJ, Zlitni S, Edfors F, Greenfield N, Snyder MP, Pavlopoulos GA, Kyrpides NC, & Bhatt AS (2019). Large-scale analyses of human microbiomes reveal thousands of small, novel genes. Cell, 178, 1245–1259.e14.
- 57. Washietl S, Findeiß S, Müller SA, Kalkhof S, von Bergen M, Hofacker IL, Stadler PF, & Goldman N (2011). RNAcode: Robust discrimination of coding and noncoding regions in comparative sequence data. RNA, 17, 578–594.
- 58. Durrant MG, & Bhatt AS (2021). Automated prediction and annotation of small open reading frames in microbial genomes. Cell Host & Microbe, 29, 121–131.e4.
- 59. Miravet-Verde S, Ferrar T, Espadas-García G, Mazzolini R, Gharrab A, Sabido E, Serrano L, & Lluch-Senar M (2019). Unraveling the hidden universe of small proteins in bacterial genomes. Molecular Systems Biology, 15, e8290.
- 60. Khanduja A, Kumar M, & Mohanty D (2023). ProsmORF-pred: A machine learning-based method for the identification of small ORFs in prokaryotic genomes. Briefings in Bioinformatics, 24, bbad101.
- 61. Ji X, Cui C, & Cui Q (2020). smORFunction: A tool for predicting functions of small open reading frames and microproteins. BMC Bioinformatics, 21, 455.
- 62. Korbel JO, Jensen LJ, von Mering C, & Bork P (2004). Analysis of genomic context: Prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nature Biotechnology, 22, 911–917.
- 63. Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, & Altschuler SJ (2002). Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics, 31, 255–265.
- 64. Salazar G, Paoli L, Alberti A, Huerta-Cepas J, Ruscheweyh H-J, Cuenca M, Field CM, Coelho LP, Cruaud C, Engelen S, Gregory AC, Labadie K, Marec C, Pelletier E, Royo-Llonch M, Roux S, Sánchez P, Uehara H, Zayed AA, … Wincker P (2019). Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell, 179, 1068–1083.e21.
- 65. Basith S, Manavalan B, Hwan Shin T, & Lee G (2020). Machine intelligence in peptide therapeutics: A next-generation tool for rapid disease screening. Medicinal Research Reviews, 40, 1276–1314.
- 66. Wang G, Vaisman II, & van Hoek ML (2022). In Simonson T (Ed.), Computational peptide science: Methods and protocols (pp. 1–37). Springer US.
- 67. Bárcenas O, Pintado-Grima C, Sidorczuk K, Teufel F, Nielsen H, Ventura S, & Burdukiewicz M (2022). The dynamic landscape of peptide activity prediction. Computational and Structural Biotechnology Journal, 20, 6526–6533.
- 68. Sidorczuk K, Gagat P, Pietluch F, Kała J, Rafacz D, Bąkała L, Słowik J, Kolenda R, Rödiger S, Fingerhut LCHW, Cooke IR, Mackiewicz P, & Burdukiewicz M (2022). Benchmarks in antimicrobial peptide prediction are biased due to the selection of negative data. Briefings in Bioinformatics, 23, bbac343.
- 69. Gabere MN, & Noble WS (2017). Empirical comparison of web-based antimicrobial peptide prediction tools. Bioinformatics, 33, 1921–1929.
- 70. Santos-Júnior CD, Pan S, Zhao X-M, & Coelho LP (2020). Macrel: Antimicrobial peptide screening in genomes and metagenomes. PeerJ, 8, e10555.
- 71. Lata S, Sharma B, & Raghava G (2007). Analysis and prediction of antibacterial peptides. BMC Bioinformatics, 8, 263.
- 72. Bhadra P, Yan J, Li J, Fong S, & Siu SWI (2018). AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Scientific Reports, 8, 1697.
- 73. Lawrence TJ, Carper DL, Spangler MK, Carrell AA, Rush TA, Minter SJ, Weston DJ, & Labbé JL (2021). amPEPpy 1.0: A portable and accurate antimicrobial peptide prediction tool. Bioinformatics, 37, 2058–2060.
- 74. Li C, Sutherland D, Hammond SA, Yang C, Taho F, Bergman L, Houston S, Warren RL, Wong T, Hoang LMN, Cameron CE, Helbing CC, & Birol I (2022). AMPlify: Attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BMC Genomics, 23, 77.
- 75. Thi Phan L, Woo Park H, Pitti T, Madhavan T, Jeon YJ, & Manavalan B (2022). MLACP 2.0: An updated machine learning tool for anticancer peptide prediction. Computational and Structural Biotechnology Journal, 20, 4473–4480.
- 76. Wei L, Zhou C, Su R, & Zou Q (2019). PEPred-Suite: Improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinformatics, 35, 4272–4280.
- 77. Zhang YP, & Zou Q (2020). PPTPP: A novel therapeutic peptide prediction method using physicochemical property encoding and adaptive feature representation learning. Bioinformatics, 36, 3982–3987.
- 78. Tang W, Dai R, Yan W, Zhang W, Bin Y, Xia E, & Xia J (2022). Identifying multi-functional bioactive peptide functions using multi-label deep learning. Briefings in Bioinformatics, 23, bbab414.
- 79. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AWR, Bridgland A, Penedones H, Petersen S, Simonyan K, Crossan S, Kohli P, Jones DT, Silver D, Kavukcuoglu K, & Hassabis D (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577, 706–710.
- 80. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes A, Nikolov S, Jain R, Adler J, … Hassabis D (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
- 81. McDonald EF, Jones T, Plate L, Meiler J, & Gulsevin A (2023). Benchmarking AlphaFold2 on peptide structure prediction. Structure (London, England), 31, 111–119.e2.
- 82. Agüero-Chapin G, Galpert-Cañizares D, Domínguez-Pérez D, Marrero-Ponce Y, Pérez-Machado G, Teijeira M, & Antunes A (2022). Emerging computational approaches for antimicrobial peptide discovery. Antibiotics (Basel), 11, 936.
- 83. Omasits U, Varadarajan AR, Schmid M, Goetze S, Melidis D, Bourqui M, Nikolayeva O, Québatte M, Patrignani A, Dehio C, Frey JE, Robinson MD, Wollscheid B, & Ahrens CH (2017). An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics. Genome Research, 27, 2083–2095.
- 84. Wei D, Tian C-B, Liu S-H, Wang T, Smagghe G, Jia F-X, Dou W, & Wang J-J (2016). Transcriptome analysis to identify genes for peptides and proteins involved in immunity and reproduction from male accessory glands and ejaculatory duct of Bactrocera dorsalis. Peptides, 80, 48–60.
- 85. Bartel J, Varadarajan AR, Sura T, Ahrens CH, Maaß S, & Becher D (2020). Optimized proteomics workflow for the detection of small proteins. Journal of Proteome Research, 19, 4004–4018.
- 86. Petruschke H, Schori C, Canzler S, Riesbeck S, Poehlein A, Daniel R, Frei D, Segessemann T, Zimmerman J, Marinos G, Kaleta C, Jehmlich N, Ahrens CH, & von Bergen M (2021). Discovery of novel community-relevant small proteins in a simplified human intestinal microbiome. Microbiome, 9, 55.
- 87. Madsen CT, Refsgaard JC, Teufel FG, Kjærulff SK, Wang Z, Meng G, Jessen C, Heljo P, Jiang Q, Zhao X, Wu B, Zhou X, Tang Y, Jeppesen JF, Kelstrup CD, Buckley ST, Tullin S, Nygaard-Jensen J, Chen X, … de Lichtenberg U (2022). Combining mass spectrometry and machine learning to discover bioactive peptides. Nature Communications, 13, 6235.
- 88. Wang Z, & Wang G (2004). APD: The antimicrobial peptide database. Nucleic Acids Research, 32, D590–D592.
- 89. Wang G, Li X, & Wang Z (2009). APD2: The updated antimicrobial peptide database and its application in peptide design. Nucleic Acids Research, 37, D933–D937.
- 90. Zhao X, Wu H, Lu H, Li G, & Huang Q (2013). LAMP: A database linking antimicrobial peptides. PLoS ONE, 8, e66557.
- 91. Waghu FH, Barai RS, Gurung P, & Idicula-Thomas S (2016). CAMPR3: A database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Research, 44, D1094–D1097.
- 92. Thomas S, Karnik S, Barai RS, Jayaraman VK, & Idicula-Thomas S (2010). CAMP: A useful resource for research on antimicrobial peptides. Nucleic Acids Research, 38, D774–D780.
- 93. Waghu FH, Gopi L, Barai RS, Ramteke P, Nizami B, & Idicula-Thomas S (2014). CAMP: Collection of sequences and structures of antimicrobial peptides. Nucleic Acids Research, 42, D1154–D1158.
- 94. Gawde U, Chakraborty S, Waghu FH, Barai RS, Khanderkar A, Indraguru R, Shirsat T, & Idicula-Thomas S (2023). CAMPR4: A database of natural and synthetic antimicrobial peptides. Nucleic Acids Research, 51, D377–D383.
- 95. Fan L, Sun J, Zhou M, Zhou J, Lao X, Zheng H, & Xu H (2016). DRAMP: A comprehensive data repository of antimicrobial peptides. Scientific Reports, 6, 24482.
- 96. Wang G, Li X, & Wang Z (2016). APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Research, 44, D1087–D1093.
- 97. Garai P, & Blanc-Potard A (2020). Uncovering small membrane proteins in pathogenic bacteria: Regulatory functions and therapeutic potential. Molecular Microbiology, 114, 710–720.
- 98. Olexiouk V, Crappé J, Verbruggen S, Verhegen K, Martens L, & Menschaert G (2016). sORFs.org: A repository of small ORFs identified by ribosome profiling. Nucleic Acids Research, 44, D324–D329.
- 99. Olexiouk V, Van Criekinge W, & Menschaert G (2018). An update on sORFs.org: A repository of small ORFs identified by ribosome profiling. Nucleic Acids Research, 46, D497–D502.
- 100. Hao Y, Zhang L, Niu Y, Cai T, Luo J, He S, Zhang B, Zhang D, Qin Y, Yang F, & Chen R (2018). SmProt: A database of small proteins encoded by annotated coding and non-coding RNA loci. Briefings in Bioinformatics, 19, 636–643.
- 101. Hazarika RR, De Coninck B, Yamamoto LR, Martin LR, Cammue BPA, & van Noort V (2017). ARA-PEPs: A repository of putative sORF-encoded peptides in Arabidopsis thaliana. BMC Bioinformatics, 18, 37.
- 102. Chen Y, Li D, Fan W, Zheng X, Zhou Y, Ye H, Liang X, Du W, Zhou Y, & Wang K (2020). PsORF: A database of small ORFs in plants. Plant Biotechnology Journal, 18, 2158–2160.
- 103. Choteau SA, Wagner A, Pierre P, Spinelli L, & Brun C (2021). MetamORF: A repository of unique short open reading frames identified by both experimental and computational approaches for gene and metagene analyses. Database (Oxford), 2021, baab032.
- 104. Schmidt TSB, Fullam A, Ferretti P, Orakov A, Maistrenko OM, Ruscheweyh HJ, Letunic I, Duan Y, Van Rossum T, Sunagawa S, Mende DR, Finn RD, Kuhn M, Coelho LP, & Bork P (2023). SPIRE: A Searchable, Planetary-scale mIcrobiome REsource. Nucleic Acids Research, gkad943.
- 105. Duan Y, Santos-Júnior CD, Schmidt TS, Fullam A, de Almeida BLS, Zhu C, Kuhn M, Zhao XM, Bork P, & Coelho LP (2023). A catalogue of small proteins from the global microbiome. BioRxiv, 2023.12.27.573469.
- 106. Li Y, Zhou H, Chen X, Zheng Y, Kang Q, Hao D, Zhang L, Song T, Luo H, Hao Y, Chen R, Zhang P, & He S (2021). SmProt: A reliable repository with comprehensive annotation of small proteins identified from ribosome profiling. Genomics, Proteomics & Bioinformatics, 19, 602–610.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.

