Abstract
Small proteins (SPs) are typically characterized as eukaryotic proteins shorter than 100 amino acids and prokaryotic proteins shorter than 50 amino acids. Historically, they were disregarded because of the arbitrary size thresholds to define proteins. However, recent research has revealed the existence of many SPs and their crucial roles. Despite this, the identification of SPs and the elucidation of their functions are still in their infancy. To pave the way for future SP studies, we briefly introduce the limitations and advancements in experimental techniques for SP identification. We then provide an overview of available computational tools for SP identification, their constraints, and their evaluation. Additionally, we highlight existing resources for SP research. This survey aims to initiate further exploration into SPs and encourage the development of more sophisticated computational tools for SP identification in prokaryotes and microbiomes.
Keywords: small proteins, ribosome profiling, mass spectrometry, computational tools
Introduction
Small proteins (SPs) are prevalent across all three domains of life [1, 2]. They are usually defined as proteins with fewer than 50 amino acids in prokaryotes and fewer than 100 amino acids in eukaryotes [3, 4]. While these length thresholds for defining regular proteins may seem arbitrary, they were essential for managing false positive rates in predicting genes and proteins during early research [5]. It is now widely acknowledged that the open reading frames (ORFs) encoding SPs are dispersed throughout genomes, spanning both intergenic and genic regions. In genic regions, small ORFs (sORFs) that encode SPs can occur on the same strand as coding genes or on the antisense strand [6].
Historically, SPs were overlooked because of the difficulty of distinguishing true sORFs from spurious ones and the limitations of experimental methods for sORF detection, among other reasons [1–4]. Recent studies have challenged the arbitrary length cutoffs used to define coding ORFs, revealing SP-encoding sORFs within noncoding regions and sORFs that overlap traditional protein-coding genes, and prompting a reevaluation of our understanding of genomes [3, 7–11]. Notably, HOXB-AS3, initially classified as a long noncoding ribonucleic acid (lncRNA), was found to encode an SP that suppresses colon cancer growth [12]. Similarly, Ruiz et al. [13] suggest that, despite low conservation, many lncRNAs exhibit ribosome binding akin to SP synthesis.
Despite being in its early stages, the discovery and annotation of SPs are crucial, given their diverse roles in cellular processes. Beyond the well-known antimicrobial activity of antimicrobial peptides, SPs contribute to stabilizing protein complexes, modulating large protein activities, and participating in various cellular functions such as morphogenesis, transport, metabolism, cell division, cell growth, signal transduction, stress response, etc. [4, 14, 15]. For instance, myoregulin, a conserved SP in the mouse genome, regulates calcium intake by the sarcoplasmic reticulum, impacting exercise capacity in mice [16]. Another study identified essential SPs in Mycoplasma pneumoniae, with 53% vital for the growth of this bacterium [17]. Identifying SPs within genomes holds promise for enhancing our understanding of cellular mechanisms and providing insights into disease and drug development [3, 4, 8, 9, 16, 18–20].
Experimental identification of SPs primarily relies on ribosome profiling or mass spectrometry (MS). Ribosome profiling captures ribosomes stalled at the start codon, at the stop codon, or along the elongating peptide, which can pinpoint the start, end, or middle regions of SPs [21–23]. MS measures the mass-to-charge ratio of peptide segments to identify and quantify them and thereby reconstruct the original proteins, including SPs, in a sample [24–26]. Through different variants of these two types of experimental approaches, over 22 000 eukaryotic and 33 000 prokaryotic SPs have been curated in the UniProt database (2024_01) [27]. Among them are 770 reviewed human SPs, most with unknown functions, and >39 000 predicted human SPs that have not been reviewed. Note that although ribosome profiling shows the engagement of short transcripts with the translational machinery, it captures only transcripts expressed under specific conditions and provides only indirect evidence of SP expression, because mRNA translation is condition-specific and the translation of some SPs is unstable or yields nonfunctional products. MS, on the other hand, can identify SPs, but many SPs may go undetected because of the limited precision of MS protocols, the small size and low abundance of SPs, their transient activities, their sample-specific expression, etc. [24].
With the limitations of experimental approaches and the anticipated large number of unknown SPs, computational prediction has become indispensable. While numerous computational methods exist for predicting proteins from genomic sequences, few are tailored specifically for SP identification [28–34]. Moreover, many methods for SP identification in eukaryotes are designed to cover only one or a few specific eukaryotic species, with limited options available for SP identification in prokaryotic species and only two for SP identification in microbiomes [28, 31–33]. Although these methods have demonstrated good performance in predicting SPs in their original studies, there remains ample room for further improvement in their performance.
With SP identification a largely unexplored field, especially in prokaryotes, this study aims to survey the latest computational approaches for SP identification. We will first briefly summarize the recent improvement in SP experimental identification. We will then focus on reviewing the computational approaches for SP identification, emphasizing their rationale, features used, and training data. Subsequently, we will highlight the available resources for future SP studies. Finally, we discuss the future directions of SP studies.
Experimental approaches for small protein identification
As mentioned above, experimental approaches for SP identification are mainly based on MS and ribosome profiling. These approaches were originally created for regular proteins, that is, proteins at least 100 amino acids long. They thus need to be tailored and adapted for SP identification.
Traditional MS techniques involve multiple steps for protein identification and quantification, including protein digestion, chromatography, and MS analysis. These steps can inadvertently exclude SPs through sample preparation filters, ineffective digestion, limitations of the MS platform, etc. [24, 35, 36]. Advancements such as suspension trapping methods [37], multi-protease digestion [38], matrix-assisted laser desorption/ionization (MALDI)-MS with specialized matrices [39], MALDI-tandem MS (MS/MS) [40], SP enrichment [41–43], low molecular weight cut-off filters [44], and fragmentation methods [44–47] have been developed to address these issues. These innovations enhance the retention, digestion, and separation of SPs and collectively address the limitations of traditional methods, providing more accurate and reliable SP identification. More details about the improved MS for SP identification can be found in a recent review [24].
Ribosome profiling is a technique that identifies proteins by capturing ribosome-protected fragments of ORFs that are actively being translated. However, it encounters challenges with sORFs due to low ribosome density and distortions caused by translation inhibitors [21, 48, 49]. Recent advancements have addressed these issues by using antibiotics like Onc112 and retapamulin to stall ribosomes at start codons without affecting elongation [21–23] and by employing flash-freezing methods to preserve the translation state before lysis, thereby avoiding biases introduced by inhibitors [50, 51]. Additionally, computational methods enhance the prediction and validation of sORF translation by analyzing ribosome activity and sequence features [28, 29, 34, 52]. These improvements collectively enhance the accuracy and reliability of SP identification in ribosome profiling. We recommend a recent review [23] for more details about the application and improvement of ribosome profiling for SP identification in prokaryotes.
Although MS and ribosome profiling are powerful for SP identification and validation, certain SPs may escape detection because of the limited resolution of these methods [53]. Additionally, the activity of SPs is condition-specific, so SPs must be identified under the specific conditions in which they are expressed [1–4]. Because these experimental approaches cannot be carried out under every experimental condition, computational identification of SPs is an indispensable complement.
Computational approaches
Despite the important roles SPs play and the large number of SPs that remain unknown, only a handful of computational tools identify SPs from genomic sequences without requiring additional experimental data. Most of these tools are for SP identification in eukaryotes, while only three are for prokaryotes. With the development of metagenomics and the accumulation of microbiome data, two methods have also attempted to identify SPs in metagenomic datasets (Fig. 1). The following sections describe the tools and methods for eukaryotes, prokaryotes, and microbiome data.
Figure 1.
A classification of the existing tools specifically designed for SP identification.
Tools for small protein identification in eukaryotes
Five tools have been developed for SP identification in eukaryotes: sORF finder, MiPepid, CPPred, DeepCPP, and csORF-finder [28–30, 34, 54]. We selected these tools because each has functionality specifically designed for SP identification from genomic sequences without requiring specific types of experimental data. Other tools are not specifically designed for SP identification but may work or help with it, such as uPEPperoni [55], RNAsamba [56], and CodAn [57]. We do not include them because tools specifically designed for SP identification generally achieve much higher accuracy on this task [28–30, 34, 58]. In addition, some tools analyze MS/MS data or ribosome profiling data [59–61] to identify SPs; these are restricted to the experimental conditions for which such data are available and cannot be applied to systematically identify SPs in a species or a microbiome. Below, we outline each tool, detailing its input, principles, uniqueness, performance, training data, and any comparative insights, as applicable (refer to Table 1). Finally, we conclude with a summary of these tools.
Table 1.
Tools for SP identification
| Tools | Features | Training data |
|---|---|---|
| sORF finder (2010), eukaryotes | Nucleotide composition frequencies | Positive: random known coding sequences for each organism. Negative: random known noncoding sequences for each organism |
| RanSEPs (2019), prokaryotes | Hexamer score, codon adaptation index, −10 score, +20 score, coding potential prediction, start codon, GC content, ribosome binding site stacking energy, ribosome presence | Positive: annotated proteins with known SPs of target species. Negative: randomly generated sORFs in intergenic regions with no known homologs in the NCBI database |
| CPPred (2019), eukaryotes | Coding potential prediction, ORF length, ORF coverage, ORF integrity, Fickett score, hexamer score, isoelectric point, GRAVY, instability index, CTD features | Positive: randomly selected human, mouse, zebrafish, fruit fly, Saccharomyces cerevisiae, nematode, and Arabidopsis thaliana coding RNAs. Negative: randomly selected noncoding RNAs in these species |
| MiPepid (2019), prokaryotes, eukaryotes | Coding potential prediction for sORFs, 4-mer | Positive: nonredundant known human SPs from the SmProt database. Negative: human miRNA, rRNA, snRNA, snoRNA, tRNA, and scaRNA from the Ensembl database |
| DeepCPP (2020), eukaryotes | RNA coding potential prediction, max ORF length, ORF coverage, Fickett score, hexamer score, nucleotide bias, k-mer, g-gap | Positive: sORF type human training dataset from CPPred, vertebrate training dataset, and insect training dataset from NCBI and Ensembl databases. Negative: normal type human training dataset from CPPred, vertebrate training dataset, and insect training dataset from NCBI and Ensembl databases |
| SmORFinder (2021), prokaryotes | Shine–Dalgarno sequence identification, deprioritize wobble position, CSS score, DeepLIFT importance scores | Positive: true positive sORFs from Sberro et al. Negative: true negative sORFs from a thorough exclusion search |
| PsORF (2021), prokaryotes, eukaryotes | Codon usage frequency | Positive: nonredundant positive sORFs of human, mouse, Arabidopsis thaliana, and prokaryotic genomes from NCBI RefSeq and sORFs.org databases. Negative: randomly generated negatives |
| csORF-finder (2022), eukaryotes | i-framed-3mer, i-framed-CKSNAP, i-framed-TDE, CDS prediction, non-CDS prediction, AAC, AAPC, CTD, ORF length, hexamer score | Positive: coding sORFs of human, mouse, and fruit fly from SmProt, Ensembl, NONCODE, and NCBI RefSeq databases. Negative: noncoding sORFs of the above species from SmProt, Ensembl, NONCODE, and NCBI RefSeq databases |
Table 2.
Results of the three tools when tested on UniProtKB eukaryotic SPs
| Tools | Precision | Sensitivity | Specificity | F1 | AUROC | AUPR |
|---|---|---|---|---|---|---|
| csORF-finder | 0.381 | 0.863 | 0.145 | 0.528 | 0.546 | 0.436 |
| MiPepid | 0.431 | 0.955 | 0.030 | 0.594 | 0.428 | 0.410 |
| DeepCPP | 0.568 | 0.265 | 0.850 | 0.361 | 0.465 | 0.488 |
sORF finder
sORF finder [54], published in 2010 and written in Perl, is the first SP identification tool. It uses the difference in hexamer frequencies between the coding and noncoding sequences of a species to determine the coding potential of an input nucleotide sequence. The training data are the known coding and noncoding sequences in a genome. sORF finder further assesses the function of a potential SP by using the Basic Local Alignment Search Tool (BLAST) to search for homologous sequences of the input sequence and testing whether the number of synonymous sites is greater than the number of nonsynonymous sites in the homologous sequences. The tool and its code are no longer available.
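To make the idea of a hexamer-based coding potential concrete, the following minimal Python sketch scores a sequence by the mean log-odds of its hexamers under a "coding" versus a "noncoding" frequency table. It is not the sORF finder implementation, which is no longer available; the function names, the use of overlapping hexamers, and the pseudocount are assumptions made for illustration.

```python
import math
from collections import Counter


def hexamer_freqs(seqs, pseudocount=1.0):
    """Return a lookup giving the smoothed frequency of any overlapping 6-mer."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    # Pseudocount over all 4**6 possible hexamers avoids zero frequencies.
    total = sum(counts.values()) + pseudocount * 4 ** 6
    return lambda hexamer: (counts[hexamer] + pseudocount) / total


def hexamer_score(seq, coding_freq, noncoding_freq):
    """Mean log-odds over the hexamers of seq; positive values lean 'coding'."""
    seq = seq.upper()
    hexamers = [seq[i:i + 6] for i in range(len(seq) - 5)]
    if not hexamers:
        return 0.0
    log_odds = (math.log(coding_freq(h) / noncoding_freq(h)) for h in hexamers)
    return sum(log_odds) / len(hexamers)
```

In practice, the two frequency tables would be estimated from the annotated coding and noncoding sequences of the species under study, as described above.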
MiPepid (https://github.com/MindAI/MiPepid)
MiPepid is a machine learning-based tool specifically designed for predicting SPs from nucleotide sequences. In contrast to sORF finder, MiPepid learns its features from SPs instead of regular proteins, and it is the first tool to do so. Its training dataset comprises positive SPs from SmProt, after removal of redundant or highly similar SPs, and negatives (non-SPs) drawn from traditional noncoding RNAs, including microRNAs, ribosomal RNAs, and small nuclear RNAs [28, 62–64]. MiPepid counts 4-mers in the positive and negative nucleotide sequences and uses their frequencies, normalized by sequence length, as features. It then applies logistic regression to distinguish positive from negative SPs based on these normalized 4-mer frequencies. Compared with traditional tools that measure coding potential to determine whether a sequence codes for a protein, such as CPC, CPC2, and CPAT, MiPepid shows much higher sensitivity and comparable specificity. Although trained primarily on human data, MiPepid’s utility extends to related mammalian species such as mouse, according to the authors [28].
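The 4-mer feature vector described above can be sketched in a few lines of Python. This is an illustration rather than MiPepid's own code: the function name is hypothetical, overlapping windows are assumed, and the counts are divided by the number of 4-mer windows, which is one reasonable reading of normalizing by sequence length.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers in a fixed order, so every sequence maps to the
# same feature layout.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def kmer_features(seq, k=4):
    """Length-normalized frequencies of overlapping k-mers (k=4 by default)."""
    seq = seq.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    n = len(windows)
    return [counts[kmer] / n if n else 0.0 for kmer in KMERS]
```

The resulting 256-dimensional vectors could then be passed to any logistic regression implementation to separate positive from negative sequences.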
CPPred (http://www.rnabinding.com/CPPred/)
CPPred is a support vector machine-based tool that predicts SPs in nucleotide sequences. It considers both nucleotide and peptide sequence features of the input nucleotide sequence and its six possible ORFs. The nucleotide features include the ORF length, coverage, integrity, Fickett score, and hexamer score. The peptide features include the isoelectric point, grand average of hydropathicity (GRAVY) of the peptide, instability index, and the commonly used composition, transition, and distribution (CTD) features. Note that these features are widely used for measuring the coding potential of regular proteins rather than SPs [34]. The CPPred authors reported that the CTD features are critical for its good performance in predicting SPs, and CPPred is the first tool to use CTD features for eukaryotic SP identification. CPPred uses known coding sequences as positives and known noncoding sequences as negatives for training and testing. It outperformed CPAT, CPC2, sORF finder, and PLEK overall, trailing only slightly behind PLEK in the human testing groups.
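As an illustration of the ORF-level nucleotide features listed above, the sketch below computes ORF length, coverage, and integrity for the longest ORF of a transcript. It is a simplification and not CPPred's code: only the three forward frames are scanned, ATG is the only start codon considered, and the exact definitions (coverage as ORF length divided by transcript length, integrity as the presence of both a start and an in-frame stop codon) are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, complete) for the longest ATG-initiated ORF.

    Coordinates are 0-based with an exclusive end; a complete ORF includes
    its stop codon, an incomplete one runs to the last full codon.
    """
    seq = seq.upper()
    best = (0, 0, False)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3, True)
                start = None
        if start is not None:  # ORF still open at the end of the sequence
            end = start + ((len(seq) - start) // 3) * 3
            if end - start > best[1] - best[0]:
                best = (start, end, False)
    return best


def orf_features(seq):
    """ORF length, coverage, and integrity of the longest forward-frame ORF."""
    start, end, complete = longest_orf(seq)
    length = end - start
    return {
        "orf_length": length,
        "orf_coverage": length / len(seq) if seq else 0.0,
        "orf_integrity": int(complete),
    }
```

Scanning the reverse complement in the same way would extend the sketch to all six reading frames.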
DeepCPP (https://github.com/yuuuuzhang/DeepCPP)
DeepCPP is a convolutional neural network-based tool designed to predict the coding potential of input nucleotide sequences. It is particularly adept at identifying coding sORFs [29, 65, 66]. It considers the following features: maximum ORF length, ORF coverage, mean hexamer score, Fickett score, k-mers, gapped dinucleotides, and nucleotide bias around the start codon. The nucleotide bias measures the nucleotide preference from the three nucleotides before the start codon to the three nucleotides after the start codon of an input sequence, which previous studies have shown to differ clearly between coding and noncoding sequences. DeepCPP used the same training data as CPPred, with coding sequences as positives and noncoding sequences as negatives. It trained two models for measuring the coding potential of ORFs, one for regular ORFs and the other for sORFs, both on the above training data but with different numbers of features. The authors compared DeepCPP against Hugo’s SVM method, mRNN, lncRNAnet, LncADeep, LncFinder, RNAsamba, and CPPred using both normal and sORF-type human datasets. DeepCPP consistently demonstrated superior accuracy, sensitivity, F-scores, Matthews correlation coefficient (MCC), and harmonic mean on the human sORF test data.
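The gapped dinucleotide (g-gap) feature mentioned above counts pairs of nucleotides separated by a fixed number of intervening positions. A minimal sketch is given below; the function name and the normalization by the total number of pairs are assumptions for illustration and are not taken from the DeepCPP source code.

```python
from collections import Counter
from itertools import product


def ggap_dinucleotide_freqs(seq, gap=1):
    """Frequencies of nucleotide pairs separated by `gap` positions.

    With gap=0 this reduces to ordinary adjacent dinucleotide frequencies.
    """
    seq = seq.upper()
    pairs = Counter(seq[i] + seq[i + gap + 1] for i in range(len(seq) - gap - 1))
    total = sum(pairs.values())
    return {
        a + b: (pairs[a + b] / total if total else 0.0)
        for a, b in product("ACGT", repeat=2)
    }
```

Computing the 16 frequencies for several gap values and concatenating them gives a fixed-length vector regardless of sequence length, which is convenient for short sequences such as sORFs.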
csORF-finder (https://github.com/mengzhanggggg/csORF-finder_webserver)
csORF-finder predicts whether a nucleotide sequence codes for an SP using a collection of ensemble models [30]. It uses both nucleotide and peptide sequence features. The nucleotide features are in-frame features, such as the frequency of trimers, the frequency of spaced dimers separated by k nucleotides, and a new feature called trinucleotide deviation from the expected mean (TDE). Here, in-frame features are features computed within one of the six reading frames of the input nucleotide sequence, between the start and stop codons. For the peptide features, it considers amino acid composition, amino acid pair composition, and the CTD features. csORF-finder is trained and tested on human, mouse, and fruit fly data. The positive data are taken from SmProt, a database of curated SPs, and the negative data are sORFs that do not overlap with the positive data. csORF-finder showed that incorporating sequence properties, especially the TDE features, improves the accuracy of SP prediction. Moreover, it showed superior performance in terms of sensitivity, accuracy, and F1 scores when predicting sORFs in test datasets of rat, zebrafish, Saccharomyces cerevisiae, and Arabidopsis thaliana. The comparison included other sORF prediction tools: CPPred, DeepCPP, CPAT, RNAsamba, and MiPepid.
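Among the peptide features above, amino acid composition is the simplest to illustrate. The sketch below translates an sORF with the standard genetic code and computes the fraction of each of the 20 amino acids. The helper names are hypothetical and the code is not taken from csORF-finder; translation stops at the first in-frame stop codon, and codons containing ambiguous bases are translated as X.

```python
from collections import Counter

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(orf):
    """Translate an ORF in frame 0 up to (not including) the first stop codon."""
    orf = orf.upper()
    peptide = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3], "X")
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


def amino_acid_composition(peptide):
    """Fraction of each of the 20 standard amino acids in the peptide."""
    counts = Counter(peptide)
    n = len(peptide)
    return {aa: (counts[aa] / n if n else 0.0) for aa in "ACDEFGHIKLMNPQRSTVWY"}
```

Amino acid pair composition can be obtained analogously by counting adjacent residue pairs instead of single residues.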
Concluding remarks
While the aforementioned five tools are tailored for eukaryotic sequences, one could potentially adapt their methodologies to train analogous models on prokaryotic data. Nonetheless, due to significant differences between eukaryotic and prokaryotic genomes, such as variations in gene density, the features crucial for eukaryotic SP identification might not be directly applicable to prokaryotic SP identification. Hence, although these tools can be applied to prokaryotic data, their predictions on such data should be interpreted with caution.
Primarily, all five tools for eukaryotes rely on conventional sequence features, like k-mers, ORF length, Fickett score, and hexamer score, for SP identification. Yet, it is apparent that there must exist novel features that distinguish SPs from regular proteins, essential for precise SP prediction. Indeed, if SPs lacked unique sequence characteristics, they would resemble protein domains or motifs found in regular proteins, which is not the case [3]. Some tools start to explore gapped k-mers, which is a good start [29]. However, a thorough investigation into new SP features is imperative for more accurate identification and a deeper understanding of their functions.
With regard to training data, all five tools except MiPepid [28] used the coding sequences as positive sequences to train the SP prediction models. There may be certain properties shared by regular coding sequences and sORFs that code SPs. However, the latter have their own unique properties. Such a choice of positive training data is thus problematic, as demonstrated below when we tested three of the five tools on known SP sequences. Because the goal is to predict SPs, ideally, the models must be trained on the SP sequences to reflect their characteristics.
In addition to positive training data, there are also concerns about the negative training data these tools used. All tools except MiPepid used noncoding sequences as negatives to train their models. It is evident that many long noncoding sequences contain sORFs that code for SPs. MiPepid noticed this issue and chose microRNAs and other RNAs that are unlikely to contain coding sORFs. However, given the limited number of these noncoding RNAs, the negatives may not be enough to represent the true negatives. In fact, using these noncoding RNAs is likely to yield a high sensitivity and a very low specificity, as shown in our test below.
In summary, the existing tools for eukaryotic SP identification have demonstrated efficacy in their respective studies. However, there is still room for improvement, including the new features and training and testing datasets. We expect that new studies will address these issues and new tools will advance our understanding of eukaryotic SPs.
Tools for small protein identification in prokaryotes
Only three tools for SP identification in prokaryotes are available: RanSEPs, SmORFinder, and PsORF. RanSEPs and SmORFinder identify sORFs in a prokaryotic genome, whereas PsORF, like the eukaryotic tools above, predicts sORFs from input nucleotide sequences that are about the length of a gene. The details of the three tools are below.
RanSEPs (http://ranseps.crg.es/)
RanSEPs is a random forest-based tool written in Python that predicts coding sORFs in a bacterial genome [31]. It requires as input the genome sequence under consideration together with the annotated coding sequences in the genome. It considers the following features in its random forest models: quasi-sequence-order-coupling numbers based on the Schneider–Wrede physicochemical distance matrix, hydrophobicity, secondary structure, start codon, GC content, the existence of a Shine–Dalgarno sequence, stacking energy and preference scores around the start codon, hexamers, dimer amino acids at the start and stop codons, and codon adaptation index. These features were chosen from >1500 features and are of specific biological interest. To train RanSEPs, the positive data are known coding sequences in the genome, including a fraction of known sORFs. The negative data are chosen from noncoding regions of the same genome that do not have homologous sequences in other species. RanSEPs was compared with other predictive software, including Prodigal, GeneMarkS, BASys, CPC, and Glimmer, and demonstrated superior performance compared to these programs.
SmORFinder (https://github.com/bhattlab/SmORFinder)
SmORFinder [52] is a specialized tool for identifying sORFs by leveraging profile-hidden Markov models and deep-learning classifiers. It takes three inputs: the ORF itself and the sequences 100 base pairs upstream and downstream of the ORF. SmORFinder can identify Shine–Dalgarno sequences, deprioritize wobble position, and calculate a codon synonym similarity (CSS) score in 26 different bacterial species by conducting feature importance analysis. The tool was trained using predicted true positive sORFs identified in a metagenome study conducted by Sberro et al. (2019) [18]. Predicted true negative sORFs were meticulously identified through a rigorous exclusion protocol, ensuring high-confidence noncoding ORFs for training purposes. Overall, SmORFinder offers high-confidence sORF identification while allowing flexibility in applying various filters to suit the user’s needs.
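The three inputs described above (the ORF plus 100 base pairs of upstream and downstream sequence) can be extracted from a genome with a small helper such as the one sketched below. The function names and the coordinate convention are hypothetical and are not part of the SmORFinder interface; the sketch assumes a linear sequence, 0-based coordinates on the forward strand, and an exclusive end.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orf_with_flanks(genome, start, end, strand="+", flank=100):
    """Return (upstream, orf, downstream) relative to the ORF's own strand.

    Flanks are truncated at the sequence boundaries.
    """
    up = genome[max(0, start - flank):start]
    orf = genome[start:end]
    down = genome[end:end + flank]
    if strand == "-":
        # On the minus strand, the forward-strand downstream region is the
        # ORF's upstream region, and vice versa.
        up, orf, down = (
            reverse_complement(down),
            reverse_complement(orf),
            reverse_complement(up),
        )
    return up, orf, down
```

The upstream flank is where a Shine–Dalgarno sequence would be expected, which is why the strand-aware orientation matters.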
PsORF (http://211.64.32.111:8888/)
PsORF is a protein-coding sORF prediction tool specifically designed for prokaryotic sORFs [32]. The tool is written in Matlab and is no longer accessible. PsORF employs a random forest-based approach for sORF prediction and uses the frequencies of the 64 codons in a sequence as features. It has been trained with both eukaryotic and prokaryotic sequence data. The positive training data are from the sORFs.org database [67], which stores annotated SPs in different species. The negative training data are generated randomly, as few noncoding sequences exist in prokaryotic genomes. The program was compared with nine other protein-coding sORF prediction software tools, including CPC2, CPPred, DeepCPP, CPAT, CNCI, PLEK, LGC, CPPred-sORF, and MiPepid. An evaluation was conducted using test datasets from human, mouse, Arabidopsis thaliana, prokaryotes, and an experimentally verified dataset from another publication. In the original study, PsORF exhibited superior performance in terms of accuracy and MCC across the human, mouse, and Arabidopsis thaliana datasets, while also demonstrating higher sensitivity, accuracy, and MCC on the prokaryotic dataset.
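The 64-codon frequency feature described above can be sketched as follows. Because PsORF itself is no longer accessible, this is only an illustration under stated assumptions: codons are read in frame 0 from the first nucleotide, the start and stop codons are counted like any other codon, and codons containing ambiguous bases are ignored.

```python
from itertools import product

# The 64 codons in a fixed order, defining the feature layout.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(orf):
    """Relative frequency of each of the 64 codons, read in frame 0."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    valid = [c for c in codons if set(c) <= set("ACGT")]
    n = len(valid)
    return [valid.count(c) / n if n else 0.0 for c in CODONS]
```

Such vectors could then serve as input to a random forest classifier from any standard machine learning library.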
Concluding remarks
While we have introduced three tools for prokaryotic SP identification, there is no tool for predicting prokaryotic SPs across diverse species. The third tool is no longer accessible, and the first two are designed for specific prokaryotic species, lacking validation for application across other prokaryotic species. Furthermore, the necessity for detailed genome knowledge of the target species restricts the utility of these first two tools.
The features used by the three tools are still mainly the standard features for regular proteins. With these features, we essentially identify regular proteins instead of SPs if we do not have the length constraints. Notably, two tools incorporate a common feature: Shine–Dalgarno sequences. These sequences represent ribosomal binding sites in bacterial and archaeal mRNAs, which are thus biologically meaningful. Because this feature alone is still weak, future studies may identify additional features to improve the accuracy of prokaryotic SP identification.
Small protein identification in microbiomes
More than 99% of microbial species are not culturable [68]. To study these species, we must study them together with other species in their communities. It is thus necessary to identify SPs directly from nucleotide sequences generated in microbiomes. In a typical microbiome, we often do not have the genome sequences of its present species. Instead, we only have mixed shotgun reads from unknown species in this community. A read is often too short to contain the ORF of an entire SP, and we usually do not know which species a read is from. Moreover, the abundance of many species may be too low to have their genome represented in the reads. Because of these challenges, only two studies identify SPs in microbiomes.
One is the comparative genomics study conducted by Sberro et al. in 2019 [18]. The researchers analyzed 1773 human-associated metagenomes from four distinct body sites (mouth, skin, gut, and vagina) of 263 individuals. Previous studies have pre-assembled reads in these metagenomes into contigs. Employing a reference-free approach to avoid limitations in sequenced genome search, the study utilized MetaProdigal [69], a metagenomic gene prediction program, for annotating all ORFs in the contigs. The analysis focused on ORFs of proteins less than 50 amino acids in length, resulting in 2 514 099 sORFs available for further investigation. The researchers then employed CD-Hit [70], a clustering program to cluster the sORFs based on amino acid similarity and protein length. Subsequently, the resulting clusters were queried against the Conserved Domain Database [71] to filter clusters corresponding to known coding sequences. Finally, the researchers evaluated the coding potential of the remaining clusters with at least eight sequences and successfully identified 4539 clusters as conserved SP families. Notably, the study underscores the necessity for further classification of SP domains, as over 90% of the protein families identified contained unknown domains [18]. The methodologies employed in this study offer a framework to explore potential SPs specific to other microbiomes.
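The two threshold-based steps of this workflow, keeping predicted proteins shorter than 50 amino acids and retaining clusters with at least eight sequences, are simple to express in code. The sketch below is not the authors' pipeline; it assumes that gene prediction and clustering have already been run with external programs, and the function names and the dictionary-based data structures are hypothetical.

```python
def filter_small_orfs(proteins, max_len=50):
    """Keep predicted proteins shorter than max_len amino acids.

    `proteins` maps an identifier to its peptide sequence; a trailing '*'
    (stop symbol emitted by some gene predictors) is not counted.
    """
    return {
        name: seq
        for name, seq in proteins.items()
        if len(seq.rstrip("*")) < max_len
    }


def keep_large_clusters(clusters, min_size=8):
    """Keep clusters with at least min_size member sequences.

    `clusters` maps a cluster identifier to a list of member identifiers.
    """
    return {
        cid: members
        for cid, members in clusters.items()
        if len(members) >= min_size
    }
```

The domain search against known coding sequences and the coding potential evaluation would sit between and after these two filters, respectively.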
The other is the MetaBP study [33]. This study developed the MetaBP tool (https://github.com/yao-laboratory/metaBP and https://github.com/yao-laboratory/metaBP_ML), which is specifically engineered to profile and annotate community-specific bacterial peptides extracted from metagenomic samples and provides a comprehensive toolkit for metagenomic analysis annotation. MetaBP ingests raw sequence reads, typically from paired-end shotgun sequencing, and outputs protein clusters with mutations, SP annotations, and a protein copy number table. MetaBP focuses on characterizing functional bacterial peptides from metagenomic data, considering potential mutations, and providing insights into the unique landscape of microbial communities. MetaBP enhances protein recovery and search from ORFs by leveraging protein-level assembly from metagenomic sequencing data while identifying SP sequences and potential mutations within diverse homologous clusters using state-of-the-art protein sequence clustering techniques. The tool’s capabilities were evaluated with data recovered from mouse gut microbial samples.
The above two studies open a new avenue to discover SPs. The majority of the SPs identified by Sberro et al. are likely authentic SPs due to their conservation in at least eight “species”. Although the MetaBP study did not provide a pipeline to identify new SPs, it raised an important problem: how to identify SPs from shotgun reads. It also provides useful utilities to annotate SPs in microbiomes. Note that despite its value, the pipeline from this study may generate many false positive predictions, as the assembly and binning of shotgun reads are still challenging for existing computational tools [72].
Tool evaluation
We evaluated the efficacy of three eukaryotic tools: MiPepid [28], DeepCPP [29], and csORF-finder [30]. We omitted sORF finder and CPPred from our assessment; the former was unavailable, while the latter demonstrated inferior performance compared to DeepCPP and csORF-finder [29, 30]. Other mentioned tools or methods were not applicable in this context.
For our evaluation, we utilized the annotated SPs from the UniProtKB database [27]. By querying proteins shorter than 100 amino acids in UniProtKB with taxonomy ID 2759, we retrieved 22 075 SPs, with average and median lengths of 57 and 62 amino acids, respectively. These SPs served as positive data. To create negative data, we randomly permuted the corresponding sORF sequences of these SPs while preserving their start and stop codons. If a stop codon was generated in the middle of a permuted sequence, it was replaced with a randomly chosen nonstop codon. Utilizing these 22 075 pairs of positive and negative SPs, we ran the tools with their default parameters.
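The construction of the negative sequences can be sketched as follows. This is one plausible reading of the procedure rather than the exact script used in the evaluation: the permutation is assumed to act on individual nucleotides between the start and stop codons, internal in-frame stop codons are replaced by a sense codon drawn uniformly at random, and the function name is hypothetical.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [
    a + b + c
    for a in "ACGT" for b in "ACGT" for c in "ACGT"
    if a + b + c not in STOPS
]


def shuffled_negative(sorf, rng=None):
    """Permute the interior of an sORF, keeping its start and stop codons."""
    rng = rng or random.Random()
    sorf = sorf.upper()
    if len(sorf) < 6 or len(sorf) % 3 != 0:
        raise ValueError("sORF length must be a multiple of 3 and at least 6")
    start, body, stop = sorf[:3], list(sorf[3:-3]), sorf[-3:]
    rng.shuffle(body)  # nucleotide-level permutation (an assumption)
    codons = ["".join(body[i:i + 3]) for i in range(0, len(body), 3)]
    # Replace any stop codon created by the shuffle with a random sense codon.
    codons = [rng.choice(SENSE_CODONS) if c in STOPS else c for c in codons]
    return start + "".join(codons) + stop
```

Passing a seeded `random.Random` instance makes the negative set reproducible. Note that replacing internal stop codons slightly changes the nucleotide composition of the affected sequences.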
As expected, the tools exhibited rather modest performance, with the area under the receiver operating characteristic curve (AUROC) falling below 0.550 and the area under the precision-recall curve (AUPR) below 0.500. csORF-finder performed comparably to DeepCPP, and DeepCPP outperformed MiPepid in both AUROC and AUPR. MiPepid demonstrated the highest sensitivity, while DeepCPP exhibited the highest specificity. This suboptimal performance suggests that the standard features used for regular protein identification are insufficient for accurate prediction of SPs.
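For readers who wish to reproduce such metrics, the following Python sketch shows one common way to compute them from per-sequence scores. It is an illustration under stated assumptions, not the evaluation code: AUROC is computed from ranks with ties averaged, and AUPR is estimated by average precision, which is only one of several estimators. The function names and toy scores are hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUROC; tied scores receive the average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: precision weighted by the recall gained at each threshold."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    tp = fp = 0
    ap = prev_recall = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Process all sequences sharing the same score as one threshold.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        recall = tp / n_pos
        ap += (recall - prev_recall) * tp / (tp + fp)
        prev_recall = recall
        i = j
    return ap


# Toy example: 1 = positive SP, 0 = permuted negative.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(labels, scores), average_precision(labels, scores))
```

With equal numbers of positives and negatives, as in the evaluation above, a predictor that assigns uninformative scores is expected to reach about 0.5 on both metrics, which gives a reference point for the reported values.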
Resources
Only three dedicated databases exclusively store information on SPs and their genes. These databases have gathered information via literature mining, previous publications, and other databases containing SPs and sORFs. We will discuss their features, applications, data sources and the tools within them. Additionally, we will discuss UniProt, which, while not specializing in SPs, contains a vast library of confirmed and potential SPs and their sORFs accumulated from other databases.
sORFs.org (https://sorfs.ugent.be/)
sORFs.org [67, 73] is a dedicated database for collecting sORFs identified through ribosome profiling. This database hosts over four million entries across six species: human, fruit fly, rat, mouse, Caenorhabditis elegans, and zebrafish. These entries are compiled from 90 different datasets from various publications and users and undergo a systematic review before being integrated into the database. sORFs.org can be used in two ways: through a simple, quick search feature or through a more advanced search option via its “BioMart” interface. Additionally, users can manually inspect the ribosome profiling data associated with each result, enabling access to all attributes of an sORF.
SmProt (http://bigdata.ibp.ac.cn/SmProt/)
SmProt addresses the need for specialized databases to compile data and findings related to SPs [74]. It is a comprehensive database focusing on SPs encoded by annotated coding and noncoding RNAs. The data in SmProt has been gathered through diverse sources such as literature mining, database searching, ribosome profiling, and MS/MS [74, 75]. Currently, SmProt includes 255 010 SPs identified across 291 cell types and tissues from human, mouse, rat, fruit fly, zebrafish, yeast, Caenorhabditis elegans, and Escherichia coli samples. To enhance the utility of the collected data, SmProt has undergone re-annotation, providing NONCODE, SmProt, and Ensembl IDs to genes. Additionally, the data has been reorganized, categorizing SPs based on their location relative to the genes encoding the proteins [74, 75]. SmProt goes beyond data compilation and offers predictions regarding the function of SPs. This is achieved through information from various databases, leveraging high-throughput literature mining, and incorporating ribosome profiles via InterProScan [74–76]. Researchers can use SmProt in conjunction with experimental methods to gain insights into the SP features and characteristics. The database also serves as a valuable resource for those exploring SPs across organisms and cell types, facilitating a deeper understanding of their roles and functions in various biological contexts.
PsORF (http://psorf.whu.edu.cn/#/)
PsORF [77] is a database dedicated to plant sORFs. This database currently hosts over 412 000 entries across 35 different plant species. Notably, it offers specialized results from MS, ribosome profiling, genomes, and transcriptomes for five of these species: Arabidopsis thaliana, Chlamydomonas reinhardtii, Gossypium arboreum, Oryza sativa, and Zea mays. The sORFs in the remaining 30 plant species were identified through BLAST homology searches with the sORFs from the five main species. PsORF provides users with various functionalities, including the ability to use BLAST for sequence similarity searches, interact with ribosome and RNA profiling data of sORFs using its “JBrowser”, view MS/MS fragmentation spectra of SPs, access phylogenetic trees for information on conservation between species, and gather information about other publications in which the searched sORFs are mentioned.
UniProt (https://www.uniprot.org/)
UniProt [27] is a database not specializing in SP identification, yet it hosts valuable information on SPs akin to SmProt. Drawing from diverse sources, UniProt amalgamates proteomic data from over 515 000 organisms, encompassing an extensive repository of SPs starting at four amino acids in length. Additionally, UniProt offers a suite of tools, including a BLAST tool for conducting sequence similarity searches [27, 78], a Clustal Omega alignment tool for sequence alignment [27, 79], and the retrieve/ID mapping tool facilitating the retrieval of proteins using identifiers and mapping between UniProt and other external databases [27, 80]. Researchers can leverage UniProt to corroborate predictions or annotations of SPs and sORFs, utilizing the resource to compare findings with those confirmed by the community via Swiss-Prot [81]. This database is a crucial resource for cross-referencing and validating SPs and sORFs, contributing to the collective understanding of these biologically significant entities.
Discussion
We conducted a comprehensive survey of experimental and computational approaches for SP identification. Experimental approaches such as ribosome profiling and MS have established our rudimentary understanding of SPs. However, they come with inherent limitations and cannot cover all physiological conditions. Computational SP identification is thus essential, especially for SPs in microbiomes.
We surveyed nine computational tools or methods for SP identification. It is worth pointing out that other tools may also be used for SP identification, including traditional gene prediction tools [55–57, 82–84]. We did not include these tools because they are not specifically tailored for SP identification. Even among the tools we discussed above, some were originally designed for regular protein identification and later extended to identify SPs. They are essentially a simple application of the regular protein identification methods, with performance inferior to that of tools specifically designed for SP identification [30, 58].
Further exploration in several directions is warranted. First, there is a great need for tools designed for SP identification in prokaryotes and microbiomes. Most existing tools are tailored for eukaryotes, and their performance is still suboptimal even on eukaryotic sequences. Because eukaryotic SPs are likely to differ from prokaryotic ones in their sequence features, for example in lacking Shine–Dalgarno sequences, the tools developed for eukaryotes are not suitable for prokaryotes. It is thus necessary to develop tools specifically for prokaryotes and microbiomes.
Second, we still lack a reliable set of features to distinguish SPs from regular proteins. Most tools use coding features learned from regular proteins to predict SPs. We believe that SPs may have different features from regular proteins [85]. For instance, several studies point out that SPs may not have well-defined secondary or tertiary structures [86, 87]. Some studies show that gapped k-mers are enriched in SPs compared with regular proteins [29]. All these studies suggest that we are probably still at the beginning of defining the coding potential of sORFs. They also suggest that the coding potential of regular ORFs may be misleading when defining the coding potential of sORFs. If we could identify the features unique to SPs, we might create tools that capture the essential properties common to all SPs, whether prokaryotic or eukaryotic. Such tools would be much more useful and could be applied universally to identify SPs in various prokaryotic or eukaryotic species.
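To make the notion of a gapped k-mer feature concrete, the following Python sketch counts gapped k-mers in a sequence. It is only an illustration: the mask-based definition used here (positions marked 1 are kept, positions marked 0 are wildcards) is an assumption and may differ from the definition used in the cited study, and the function name `gapped_kmer_counts` is hypothetical.

```python
from collections import Counter


def gapped_kmer_counts(seq: str, mask: str) -> Counter:
    """Count gapped k-mers in seq.

    mask is a string of '1' (position kept) and '0' (wildcard); for example,
    '101' pairs each residue with the residue two positions downstream.
    """
    if not mask or set(mask) - {"0", "1"} or mask[0] != "1" or mask[-1] != "1":
        raise ValueError("mask must consist of 0/1 and start and end with 1")
    counts = Counter()
    for start in range(len(seq) - len(mask) + 1):
        window = seq[start:start + len(mask)]
        # Wildcard positions are written as '_' so that they are ignored in the key.
        key = "".join(c if m == "1" else "_" for c, m in zip(window, mask))
        counts[key] += 1
    return counts


# Toy peptide; the same function works on nucleotide sequences.
print(gapped_kmer_counts("MKLMAL", "101"))
```

Counts of this kind, normalized by the number of windows, could be compared between SPs and regular proteins to test for enrichment.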
Third, we still do not have a good negative dataset for predicting SPs. Most existing tools use noncoding RNAs as negatives, which have been shown to potentially encode SPs [27, 32]. Some studies use microRNAs, snRNAs, etc., which may be better than the above noncoding negatives but may not represent all negative SPs well. One may consider using permuted positive sequences as negatives, which should work better than noncoding RNAs. However, they must be used carefully, as a limited number of permuted sequences may still give a biased representation of the true negatives in a study.
In summary, the study of SPs is still in the early stages. Future endeavors aiming at elucidating SP features, establishing more representative negative datasets, and exploring prokaryotic SPs in microbiomes will be instrumental in advancing this field.
Key Points
Experimental approaches for SP identification have limitations.
Computational SP identification is indispensable.
Most computational methods for SP identification are designed for specific eukaryotic species.
The advancement of computational SP identification necessitates the integration of novel features, enhanced training datasets, and a specific emphasis on applications within prokaryotic organisms and microbiomes.
Data availability
The aforementioned data are in the public databases. The tool links are provided in the paper.
Contributor Information
Joshua Beals, Burnett School of Biomedical Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL 32816, United States.
Haiyan Hu, Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL 32816, United States.
Xiaoman Li, Burnett School of Biomedical Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL 32816, United States.
Author contributions
H.H. and X.L. conceived the idea. J.B. implemented the idea and generated results. J.B., H.H., and X.L. analyzed the results and wrote the manuscript. All authors reviewed the manuscript.
Conflicts of interest: None declared.
Funding
This work was supported by the National Science Foundation (2120907, 2015838).
References
1. Jordan B, Weidenbach K, Schmitz RA. The power of the small: the underestimated role of small proteins in bacterial and archaeal physiology. Curr Opin Microbiol 2023;76:102384.
2. Weidenbach K, Gutt M, Cassidy L. et al. Small proteins in archaea, a mainly unexplored world. J Bacteriol 2022;204:e0031321.
3. Steinberg R, Koch H-G. The largely unexplored biology of small proteins in pro- and eukaryotes. FEBS J 2021;288:7002–24.
4. Su M, Ling Y, Yu J. et al. Small proteins: untapped area of potential biological importance. Front Genet 2013;4:286. 10.3389/fgene.2013.00286.
5. Storz G, Wolf YI, Ramamurthi KS. Small proteins can no longer be ignored. Annu Rev Biochem 2014;83:753–77.
6. Harrison PM, Kumar A, Lang N. et al. A question of size: the eukaryotic proteome and the problems in defining it. Nucleic Acids Res 2002;30:1083–90.
7. Pueyo JI, Magny EG, Couso JP. New peptides under the s(ORF)ace of the genome. Trends Biochem Sci 2016;41:665–78.
8. Ladoukakis E, Pereira V, Magny EG. et al. Hundreds of putatively functional small open reading frames in Drosophila. Genome Biol 2011;12:R118.
9. Leslie M. Outsize impact. Science 2019;366:296–9.
10. Ransohoff JD, Wei Y, Khavari PA. The functions and unique features of long intergenic non-coding RNA. Nat Rev Mol Cell Biol 2018;19:143–57.
11. Dhamija S, Menon MB. Non-coding transcript variants of protein-coding genes—what are they good for? RNA Biol 2018;15:1025–31.
12. Huang JZ, Chen M, Chen D. et al. A peptide encoded by a putative lncRNA HOXB-AS3 suppresses colon cancer growth. Mol Cell 2017;68:171–184.e6.
13. Ruiz-Orera J, Messeguer X, Subirana JA. et al. Long non-coding RNAs as a source of new peptides. Elife 2014;3:e03523.
14. Fuchs S, Kucklick M, Lehmann E. et al. Towards the characterization of the hidden world of small proteins in Staphylococcus aureus, a proteogenomics approach. PLoS Genet 2021;17:e1009585.
15. Zhang Y, Wang S, Hu H. et al. A systematic study of HIF1A cofactors in hypoxic cancer cells. Sci Rep 2022;12:18962.
16. Anderson DM, Anderson KM, Chang CL. et al. A micropeptide encoded by a putative long noncoding RNA regulates muscle performance. Cell 2015;160:595–606.
17. Lluch-Senar M, Delgado J, Chen WH. et al. Defining a minimal cell: essentiality of small ORFs and ncRNAs in a genome-reduced bacterium. Mol Syst Biol 2015;11:780.
18. Sberro H, Fremin BJ, Zlitni S. et al. Large-scale analyses of human microbiomes reveal thousands of small, novel genes. Cell (Cambridge, Mass) 2019;178:1245–1259.e14.
19. Wang S, Zheng H, Choi JS. et al. A systematic evaluation of the computational tools for ligand-receptor-based cell–cell interaction inference. Brief Funct Genomics 2022;21:339–56.
20. Wang Y, Goodison S, Li X. et al. Prognostic cancer gene signatures share common regulatory motifs. Sci Rep 2017;7:4750.
21. Brar GA, Weissman JS. Ribosome profiling reveals the what, when, where and how of protein synthesis. Nat Rev Mol Cell Biol 2015;16:651–64.
22. Power L. Beginners guide to ribosome profiling. Biochem 2022;44:30–4.
23. Vazquez-Laslop N, Sharma Cynthia M, Mankin A. et al. Identifying small open reading frames in prokaryotes with ribosome profiling. J Bacteriol 2022;204:e00294–21.
24. Ahrens CH, Wade JT, Champion MM. et al. A practical guide to small protein discovery and characterization using mass spectrometry. J Bacteriol 2022;204:e0035321.
25. McCammon MG, Robinson CV. Me, my cell, and I: the role of the collision cell in the tandem mass spectrometry of macromolecules. Biotechniques 2005;39:447–53.
26. Kaltashov IA, Bobst CE, Abzalimov RR. Mass spectrometry-based methods to study protein architecture and dynamics. Protein Sci 2013;22:530–44.
27. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res 2023;51:D523–D531.
28. Zhu M, Gribskov M. MiPepid: MicroPeptide identification tool using machine learning. BMC Bioinformatics 2019;20:559.
29. Zhang Y, Jia C, Fullwood MJ. et al. DeepCPP: a deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction. Brief Bioinform 2020;22:2073–84.
30. Zhang M, Zhao J, Li C. et al. csORF-finder: an effective ensemble learning framework for accurate identification of multi-species coding short open reading frames. Brief Bioinform 2022;23:bbac392. 10.1093/bib/bbac392.
31. Miravet-Verde S, Ferrar T, Espadas-García G. et al. Unraveling the hidden universe of small proteins in bacterial genomes. Mol Syst Biol 2019;15:e8290.
32. Yu J, Guo L, Dou X. et al. Comprehensive evaluation of protein-coding sORFs prediction based on a random sequence strategy. FBL 2021;26:272–8.
33. Vajjala M, Johnson B, Kasparek L. et al. Profiling a community-specific function landscape for bacterial peptides through protein-level meta-assembly and machine learning. Front Genet 2022;13:935351.
34. Tong X, Liu S. CPPred: coding potential prediction based on the global description of RNA sequence. Nucleic Acids Res 2019;47:e43.
35. Tyanova S, Temu T, Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc 2016;11:2301–19.
36. Wang P, Wilson SR. Mass spectrometry-based protein identification by integrating de novo sequencing with database searching. BMC Bioinformatics 2013;14:S24.
37. Zougman A, Selby PJ, Banks RE. Suspension trapping (STrap) sample preparation method for bottom-up proteomics analysis. Proteomics 2014;14:1006–0.
38. Kaulich PT, Cassidy L, Bartel J. et al. Multi-protease approach for the improved identification and molecular characterization of small proteins and short open reading frame-encoded peptides. J Proteome Res 2021;20:2895–903.
39. Gu H, Ma K, Zhao W. et al. A general purpose MALDI matrix for the analyses of small organic, peptide and protein molecules. Analyst 2021;146:4080–6.
40. Meier-Credo J, Preiss L, Wullenweber I. et al. Top–down identification and sequence analysis of small membrane proteins using MALDI-MS/MS. J Am Soc Mass Spectrom 2022;33:1293–302.
41. Harney DJ, Hutchison AT, Su Z. et al. Small-protein enrichment assay enables the rapid, unbiased analysis of over 100 low abundance factors from human plasma. Mol Cell Proteomics 2019;18:1899–915.
42. Harney DJ, Larance M. The small-protein enrichment assay (SPEA) for analysis of low abundance peptide hormones in plasma. Methods Mol Biol 2023;2628:265–76.
43. Cassidy L, Kaulich PT, Tholey A. Depletion of high-molecular-mass proteins for the identification of small proteins and short open reading frame encoded peptides in cellular proteomes. J Proteome Res 2019;18:1725–34.
44. Fabre B, Combier J-P, Plaza S. Recent advances in mass spectrometry–based peptidomics workflows to identify short-open-reading-frame-encoded peptides and explore their functions. Curr Opin Chem Biol 2021;60:122–30.
45. Fuchs S, Engelmann S. Small proteins in bacteria—big challenges in prediction and identification. Proteomics 2023;23:2200421.
46. Zubarev RA. Electron-capture dissociation tandem mass spectrometry. Curr Opin Biotechnol 2004;15:12–6.
47. Ma J, Diedrich JK, Jungreis I. et al. Improved identification and analysis of small open reading frame encoded polypeptides. Anal Chem 2016;88:3967–75.
48. Subramaniam AR, Zid BM, O'Shea EK. An integrated approach reveals regulatory controls on bacterial translation elongation. Cell 2014;159:1200–11.
49. Gerashchenko MV, Gladyshev VN. Translation inhibitors cause abnormalities in ribosome profiling experiments. Nucleic Acids Res 2014;42:e134.
50. Ingolia NT, Brar GA, Rouskin S. et al. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments. Nat Protoc 2012;7:1534–50.
51. Glaub A, Huptas C, Neuhaus K. et al. Recommendations for bacterial ribosome profiling experiments based on bioinformatic evaluation of published data. J Biol Chem 2020;295:8999–9011.
52. Durrant MG, Bhatt AS. Automated prediction and annotation of small open reading frames in microbial genomes. Cell Host Microbe 2021;29:121–131.e4.
53. Hsu PY, Calviello L, Wu HL. et al. Super-resolution ribosome profiling reveals unannotated translation events in Arabidopsis. Proc Natl Acad Sci U S A 2016;113:E7126–35.
54. Hanada K, Akiyama K, Sakurai T. et al. sORF finder: a program package to identify small open reading frames with high coding potential. Bioinformatics 2009;26:399–400.
55. Skarshewski A, Stanton-Cook M, Huber T. et al. uPEPperoni: an online tool for upstream open reading frame location and analysis of transcript conservation. BMC Bioinformatics 2014;15:36.
56. Camargo AP, Sourkov V, Pereira Gonçalo AG. et al. RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences. NAR Genom Bioinform 2020;2:lqz024. 10.1093/nargab/lqz024.
57. Nachtigall PG, Kashiwabara AY, Durham AM. CodAn: predictive models for precise identification of coding regions in eukaryotic transcripts. Brief Bioinform 2020;22:bbaa045. 10.1093/bib/bbaa045.
58. Gelhausen R, Müller T, Svensson SL. et al. RiboReport—benchmarking tools for ribosome profiling-based identification of open reading frames in bacteria. Brief Bioinform 2022;23:bbab549. 10.1093/bib/bbab549.
59. Bunk B, Kucklick M, Jonas R. et al. MetaQuant: a tool for the automatic quantification of GC/MS-based metabolome data. Bioinformatics 2006;22:2962–5.
60. Bartholomäus A, Kolte B, Mustafayeva A. et al. smORFer: a modular algorithm to detect small ORFs in prokaryotes. Nucleic Acids Res 2021;49:e89.
61. Platon L, Zehraoui F, Bendahmane A. et al. IRSOM, a reliable identifier of ncRNAs based on supervised self-organizing maps with rejection. Bioinformatics 2018;34:i620–8.
62. Ratti M, Lampis A, Ghidini M. et al. MicroRNAs (miRNAs) and long non-coding RNAs (lncRNAs) as new tools for cancer therapy: first steps from bench to bedside. Target Oncol 2020;15:261–78.
63. Matera AG, Terns RM, Terns MP. Non-coding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nat Rev Mol Cell Biol 2007;8:209–20.
64. Kaliatsi EG, Giarimoglou N, Stathopoulos C. et al. Non-coding RNA-driven regulation of rRNA biogenesis. Int J Mol Sci 2020;21:9738. 10.3390/ijms21249738.
65. Leong AZ, Lee PY, Mohtar MA. et al. Short open reading frames (sORFs) and microproteins: an update on their identification and validation measures. J Biomed Sci 2022;29:19.
66. Schlesinger D, Elsässer SJ. Revisiting sORFs: overcoming challenges to identify and characterize functional microproteins. FEBS J 2022;289:53–74.
67. Olexiouk V, Crappé J, Verbruggen S. et al. sORFs.org: a repository of small ORFs identified by ribosome profiling. Nucleic Acids Res 2016;44:D324–9.
68. Ventolero MF, Wang S, Hu H. et al. Computational analyses of bacterial strains from shotgun reads. Brief Bioinform 2022;23:bbac013. 10.1093/bib/bbac013.
69. Hyatt D, LoCascio PF, Hauser LJ. et al. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 2012;28:2223–30.
70. Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2.
71. Marchler-Bauer A, Lu S, Anderson JB. et al. CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res 2011;39:D225–9.
72. Miller RM, Millikin RJ, Hoffmann CV. et al. Improved protein inference from multiple protease bottom-up mass spectrometry data. J Proteome Res 2019;18:3429–38.
73. Olexiouk V, Van Criekinge W, Menschaert G. An update on sORFs.org: a repository of small ORFs identified by ribosome profiling. Nucleic Acids Res 2017;46:D497–502.
74. Li Y, Zhou H, Chen X. et al. SmProt: a reliable repository with comprehensive annotation of small proteins identified from ribosome profiling. Genomics Proteomics Bioinformatics 2021;19:602–10.
75. Hao Y, Zhang L, Niu Y. et al. SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci. Brief Bioinform 2017;19:636–43.
76. Jones P, Binns D, Chang HY. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 2014;30:1236–40.
77. Chen Y, Li D, Fan W. et al. PsORF: a database of small ORFs in plants. Plant Biotechnol J 2020;18:2158–60.
78. Wheeler D, Bhagwat M. BLAST QuickStart: example-driven web-based BLAST tutorial. In: Bergman NH (ed). Comparative Genomics: Volumes 1 and 2. Totowa (NJ): Humana Press, 2007.
79. Sievers F, Higgins DG. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci 2018;27:135–45.
80. Pundir S, Martin MJ, O’Donovan C. et al. UniProt tools. Curr Protoc Bioinformatics 2016;53:1.29.21–21.29.15.
81. Boeckmann B, Bairoch A, Apweiler R. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003;31:365–70.
82. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997;268:78–94.
83. Rey J, Deschavanne P, Tuffery P. BactPepDB: a database of predicted peptides from a exhaustive survey of complete prokaryote genomes. Database (Oxford) 2014;2014:bau106. 10.1093/database/bau106.
84. Hyatt D, Chen GL, Locascio PF. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 2010;11:119.
85. Dill KA, MacCallum JL. The protein-folding problem, 50 years on. Science 2012;338:1042–6.
86. Kubatova N, Pyper DJ, Jonker HRA. et al. Rapid biophysical characterization and NMR spectroscopy structural analysis of small proteins from bacteria and archaea. Chembiochem 2020;21:1178–87.
87. Neidigh JW, Fesinmeyer RM, Andersen NH. Designing a 20-residue protein. Nat Struct Biol 2002;9:425–30.

