2016 Aug 12;13(11):1051–1059.

Decoding sORF translation – from small proteins to gene regulation

Luis Enrique Cabrera-Quio 1,*, Sarah Herberg 1,*, Andrea Pauli 1,
PMCID: PMC5100344  PMID: 27653973


Translation is best known as the fundamental mechanism by which the ribosome converts a sequence of nucleotides into a string of amino acids. Extensive research over many years has elucidated the key principles of translation, and the majority of translated regions were thought to be known. The recent discovery of widespread translation outside of annotated protein-coding open reading frames (ORFs) therefore came as a surprise, raising the intriguing possibility that these newly discovered translated regions might have unrecognized protein-coding or gene-regulatory functions. Here, we highlight recent findings that provide evidence that some of these newly discovered translated short ORFs (sORFs) encode functional, previously missed small proteins, while others have regulatory roles. Based on known examples, we also speculate about putative additional roles and the potentially much wider impact that these translated regions might have on cellular homeostasis and gene regulation.

KEYWORDS: Ribosome, short proteins, sORF, translation, translational regulation, uORF


Traditionally, translation has been assumed to be largely restricted to protein-coding open reading frames (ORFs). This long-standing view has been challenged by a series of studies emerging from technological advances that have enabled researchers to analyze translation genome-wide at unprecedented depth and detail. It is now possible not only to predict even very short protein-coding ORFs based on homology,1,2 but also to analyze the translational state of ORFs genome-wide by ribosome profiling.3-5 Moreover, peptide products can be detected with increased sensitivity by improved mass-spectrometry.6-12 Based on these studies it is becoming increasingly clear that many regions outside of annotated protein-coding ORFs are translated. New translated regions have not only been identified in transcripts thought to be non-coding, but also upstream of a large fraction of protein-coding ORFs (so-called upstream ORFs, uORFs).4,13-21 These usually short translated ORFs were likely missed in previous mutagenesis screens due to their small size, and remained unannotated in genomes for the same reason and for lack of evidence of coding potential.22-24 In analogy to the term ‘pervasive transcription’,25 this unforeseen prevalence of short ORF (sORF) translation has spurred the notion of ‘pervasive translation’.23,26

It should be pointed out that a certain fraction of the detected translation events will likely comprise ‘noise’, which can be of either technical or biological nature. For example, ribosome profiling enriches for 80S ribosome-protected mRNA fragments, yet other protected mRNA fragments of similar size and sedimentation behavior might be co-purified, which will generate technical noise.27 Moreover, the use of translation inhibitors as well as differences in sample preparation and downstream analyses can introduce biases in ribosome footprinting assays that might not accurately reflect the translational state in an unperturbed cell.28-30 To distinguish actual translation events from technical noise, a series of data analysis tools have been developed. These computational approaches use certain features like Ribosome Protected Fragment (RPF) abundance, length and trinucleotide periodicity, positioning of the ORF within a transcript and responsiveness to translation inhibitors to help detect “real” translation and eliminate technical noise.13,17,20,26,31,32 Apart from ‘technical noise’, there is ‘biological noise’ that originates from genuine translation, yet neither the peptide product nor the act of translation of this particular sequence might serve any specific purpose. Because there is ample evidence that at least some sORFs do have essential functions in vivo,19,33-42 we will omit further discussions of noise and focus on functional aspects of sORF translation.
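
As an illustration of one of the features named above, trinucleotide periodicity can be summarized as the fraction of footprints whose inferred P-site falls in each of the 3 reading frames of a candidate ORF. The following Python lines are a minimal sketch under stated assumptions, not the method of any of the cited tools: the function names, the input format (0-based P-site coordinates on the transcript) and the 0.6 cutoff are hypothetical choices made for illustration only.

```python
from collections import Counter


def frame_fractions(psite_positions, orf_start, orf_end):
    """Return the fraction of P-sites in reading frames 0, 1 and 2 of an ORF.

    psite_positions: 0-based transcript coordinates of inferred P-sites
                     (hypothetical input; real pipelines derive these from
                     footprint 5' ends plus a length-specific offset).
    orf_start:       0-based coordinate of the first nucleotide of the start codon.
    orf_end:         0-based coordinate one past the last nucleotide of the ORF.
    """
    counts = Counter(
        (p - orf_start) % 3
        for p in psite_positions
        if orf_start <= p < orf_end
    )
    total = sum(counts.values())
    if total == 0:
        # No footprints inside the ORF: nothing can be said about periodicity.
        return (0.0, 0.0, 0.0)
    return tuple(counts[frame] / total for frame in range(3))


def looks_periodic(fractions, min_frame0=0.6):
    """Crude check: is frame 0 enriched above an arbitrary, assumed cutoff?"""
    return fractions[0] >= min_frame0
```

A strong enrichment in frame 0 is consistent with footprints from elongating 80S ribosomes, whereas a roughly even split across the 3 frames is what co-purified, non-ribosomal fragments would be expected to produce. Published tools combine such a periodicity measure with the other features listed above rather than relying on a single cutoff.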

For the sake of simplicity, we have divided this review into 2 main parts, namely roles of sORFs as 1) hidden sources of functional short proteins, and as 2) widespread regulatory elements conferring post-transcriptional control of gene expression. This division does not, however, preclude that the act of translation of regions encoding short proteins might also have regulatory roles and vice versa. At the end, we summarize the challenges and opportunities that this newly emerging research area presents.

sORFs as source of functional short proteins

Despite previous medium- to large-scale forward genetic screens in organisms ranging from yeast,43,44 plants,45,46 worms47,48 and flies49,50 to zebrafish51,52 and mice,53-55 several new, essential short proteins were discovered during the last decade in different organisms.19,35-42 The majority of these newly identified short proteins show a higher degree of amino acid sequence conservation than observed in known regulatory translation events. The modes of action of these short proteins are diverse and comprise intra- as well as extracellular functions (Fig. 1). In the following part we will highlight recent examples of newly discovered short proteins regulating key processes during embryogenesis, in cell-cell communication or in cell physiology. Based on known examples, combined with the extensive body of knowledge on protein functions in general, we will also speculate about other possible roles that this potentially large source of putatively bioactive short proteins might have.

Figure 1.

sORFs as source of functional short proteins. Translation of short open reading frames (sORFs) can generate short proteins (light blue sphere) with diverse intracellular and extracellular roles. Functions range from cytoplasmic regulation of enzymes, protein-protein interactions, enzymatic activities to regulation of transcription factors in the nucleus and extracellular roles as signals, membrane-associated proteins or antigens presented on MHC-type molecules.

Short proteins as regulators of protein-protein interaction and enzymatic activity

Due to their small size, short proteins can easily fit into the binding pockets of other proteins, which makes them candidate regulators of protein-protein interactions and enzymatic activities. For example, the binding of a peptide to an allosteric site of a protein could induce a conformational change that alters the interaction surface or the enzymatic activity of the protein. The former mode of action was proposed for the Drosophila polished-rice or tarsal-less (pri) sORF peptides, which are among the best-studied examples of short peptides with an essential function during embryogenesis.36,37,56 The pri locus is transcribed into a polycistronic mRNA encoding 4 evolutionarily conserved short ORFs of only 11 to 32 amino acids (aa).36 These short peptides control the binding of the E3 ubiquitin ligase Ubr3 to the transcriptional regulator Shavenbaby (Svb) by inducing a conformational change in Ubr3 that leads to the exposure of a Svb-recognition site in Ubr3.56 Pri-induced binding of Ubr3 to Svb leads to ubiquitination of Svb, which is then N-terminally truncated by the proteasome, resulting in a switch in Svb's activity from a transcriptional repressor to a transcriptional activator.37,56 By promoting the formation of Svb activator, Pri-peptides thus induce trichome formation and epidermal differentiation in Drosophila.

By modifying protein-protein interactions, short proteins are also able to alter protein localization and recruitment, as has been proposed for the Drosophila Polar granule component (Pgc). Initially thought to function as a non-coding RNA,57 the germ-cell expressed pgc transcript was subsequently shown to encode a 71 aa long protein that inhibits the phosphorylation of serine 2 residues in the C-terminal domain of RNA polymerase II by preventing the recruitment of the kinase P-TEFb (positive transcription elongation factor b) to transcription sites. As a result, RNA polymerase II-dependent transcription is repressed in germ cells, protecting them from differentiation into somatic cells.58,59

The binding of a short protein to an enzyme can also directly affect enzymatic activities. This is the case for a group of recently identified, short, α-helical transmembrane proteins (all < 50 aa) that bind to and thereby control the activity of SERCA (sarco-endoplasmic reticulum Ca2+ adenosine triphosphatase). SERCA is an ATPase located in the membrane of the sarcoplasmic reticulum (SR), where it pumps Ca2+ back into the SR after Ca2+ release upon muscle contraction. SERCA was recently shown to be inhibited by 3 short proteins, namely phospholamban (PLN/PLB), sarcolipin (SLN) and myoregulin (MLN),38,40 while the short protein DWORF (dwarf open reading frame) enhances SERCA activity by displacing those inhibitory proteins.41 Apart from allosteric regulation, it is also feasible that short proteins could negatively regulate enzyme activities by competing with enzymatic substrates for binding to the active site. While direct evidence is still missing, it is intriguing to speculate that pseudogenes, about one-third of which were recently shown to be translated into proteins of various lengths in human cells,15 might provide a source for such competitive inhibitory peptides.60

Future work might also reveal that some of the newly discovered proteins have specific enzymatic activities themselves. Examples of such small enzymes exist, such as Cytochrome C (105 aa) or the smallest known enzyme, 4-oxalocrotonate tautomerase (62 aa per monomer).61,62 Moreover, short proteins can also act as part of large protein assemblies. For example, advances in mass spectrometry have recently led to the identification of APC15 and APC16, 2 new constitutive components of the extensively studied, 1.5-MDa multi-protein complex Anaphase Promoting Complex/Cyclosome (APC/C).63-65 APC15 was subsequently shown to mediate auto-ubiquitination of Cdc20 by APC/C^MCC and disassembly of the mitotic checkpoint complex (MCC).66 Another well-known large molecular machine containing a number of short proteins as structural components is the ribosome itself: of approximately 80 eukaryotic ribosomal proteins, 21 proteins consist of only 25–100 aa.67 Based on these 2 notable examples, small proteins can indeed play prominent roles as subunits of well-studied large protein complexes.

Short proteins in signaling and cell-cell communication

Besides such intracellular functions, recent studies have revealed important roles for short proteins as secreted peptides and hormones. Especially in plants, different classes of polypeptide hormones have been identified, several of which are involved in defense mechanisms. For example, Systemin, the first plant polypeptide hormone identified, consists of only 18 aa.68,69 Many other secreted short plant proteins are involved in developmental processes, growth control and stress response (for an overview see ref. 70). This diverse group of signals includes the recently discovered ESF1 (embryo surrounding factor 1), which regulates early embryo tissue patterning,39 and the secreted phytosulfokine pentapeptides (PSK), which are encoded by 7 different precursor genes, each specifying a precursor of approximately 100 aa, and regulate plant growth and stress responses. PSK binds to the extracellular leucine-rich repeats of the PSK receptor PSKR1 and might thereby allosterically activate the Ser/Thr kinase activity of PSKR1.71,72

Signaling functions for short proteins are not limited to plants and have also been discovered in vertebrates. For example, the mature, 36 aa long secreted polypeptide Apelin binds to the G protein-coupled Apelin receptor APJ and has been implicated in the regulation of various physiological processes, ranging from angiogenesis to energy metabolism, neuroendocrine stress response, cardiovascular function and fluid homeostasis (reviewed in ref. 73). Recently, a second conserved APJ ligand named Apela/ELABELA/Toddler was identified that had been mis-annotated as a non-coding RNA.19,35 Functional analyses in zebrafish embryos revealed that Apela/ELABELA/Toddler promotes the movement of ventral and lateral mesendodermal cells during gastrulation19,35 and the migration of angioblasts during vasculogenesis.74

Apart from signaling to other cells as secreted factors, it is also feasible that membrane-bound short proteins could act as (co-)receptors or as cell adhesion molecules. While specific examples for such functionalities have to our knowledge not yet been identified, the SERCA regulators PLN, SLN, MLN and DWORF38,40,41 provide the proof-of-principle that functional, cell membrane-embedded short proteins exist.

Cell-cell communication can also occur via an entirely different route, e.g. by the presentation of peptides on the surface of a cell by Major Histocompatibility Complex class I (MHC I) molecules. Many antigens that are presented on MHC I molecules were recently shown to derive from non-conventional peptide sources, such as untranslated regions (UTRs), unannotated ORFs, introns, non-AUG start codons or from alternative translational reading frames.75-78 Short peptides presented as self-antigens on MHC-type molecules could play a role in the negative selection of T-cells during T-cell development in the thymus, or might shape the immune response during viral infection, cancer progression and autoimmune disease.26,79 Because the presentation of peptides by MHC-type molecules is largely independent of the encoded amino acid sequence, a certain fraction - if not even the entire cohort - of sORF-originating short proteins might be co-opted for such an immunological functionality.

Given the range of possible activities and the paucity of functional studies to date, it is clear that we are only beginning to grasp the full impact that short proteins have.

sORFs as post-transcriptional regulators of gene expression

One hallmark of most sORFs discussed so far is that they encode evolutionarily conserved short proteins. However, this is not the case for the majority of newly identified short translated ORFs. Instead, there is increasing support for the idea that the gene-regulatory effect imposed by the act of uORF translation is conserved across vertebrates.16,20,80-83 Although translated sORFs also occur in presumably non-coding transcripts, and to a much lower extent also in 3’UTRs,8,13-15,17,21 we will focus on uORFs as the most abundant and best-studied class of regulatory sORFs.

Post-transcriptional regulation plays an important role in controlling the composition of the proteome of each cell. Because it does not require transcriptional changes, this mode of regulation stands out as being fast and – if it does not trigger transcript degradation – reversible. It is well known from classical, mostly single-gene studies that uORFs can repress translation of downstream ORFs by 2 means.34,84-91 uORF translation can impact the stability of the mRNA by triggering co-translational RNA decay pathways,21,92-95 and it can interfere with ribosome access to downstream ORFs.91,96 In order to reach a downstream (coding) ORF, scanning ribosomes have to either scan through uORFs without initiating (‘leaky scanning’), or re-start translation after having already translated a uORF (‘reinitiation’) (Fig. 2). These mechanisms enable the cell to cope with uORFs and have been known for decades, yet the extent and possibly widespread impact of uORF-mediated regulation has only become evident over the past few years.

Figure 2.

uORFs as post-transcriptional regulators of gene expression. Translation of sORFs upstream of the main coding sequence (CDS, dark blue), so-called upstream ORFs (uORFs, light blue), is generally repressive by blocking the ribosome from accessing the downstream CDS. Two strategies allow the cell to bypass the inhibitory effect of uORFs (dashed arrows): leaky scanning and reinitiation. 80S, actively translating ribosome; 60S, large ribosomal subunit; 40S, small ribosomal subunit; 43S, preinitiation complex.

uORFs as widespread repressive genetic elements

A large fraction of ribosomal footprints outside of the annotated coding sequences originates from uORFs within transcript leaders, which are conventionally yet rather misleadingly called 5’UTRs (5’ untranslated region).4,13,14,16,20,97-99 The generally repressive effect of uORFs on downstream translation has been studied at the level of individual genes,34,90,100-102 in reporter studies,16,83 and more recently also genome-wide.16,20,83

One of the best-studied examples of regulatory uORFs is the yeast transcriptional activator GCN4 (ATF4 in vertebrates).33,90 The GCN4 mRNA contains 4 short uORFs that function as translational barriers for the main coding sequence under normal conditions when GCN4 protein is not required. Upon stress, translation of the GCN4 coding sequence is induced by a mechanism that harnesses uORF1 translation to bypass the 3 strictly inhibitory uORFs 2-4. Thus, differential translation of the 4 uORFs causes differential translation of GCN4 in response to stress.

In support of uORFs being repressive genetic elements, genome-wide analyses of vertebrate ribosome profiling data revealed an inverse correlation between the number of uORFs within a transcript and the efficiency of CDS translation: CDSes of transcripts lacking uORFs are more efficiently translated than those of transcripts with uORFs; furthermore, the more uORFs a transcript has, the less the CDS is translated.16,20,103 Apart from reduced translation downstream, the presence of uORFs also correlates with reduced steady-state levels of transcripts,16 which is indicative of reduced stability of uORF-containing transcripts. Consistently, uORF-containing transcripts have been shown to be enriched in targets of nonsense-mediated mRNA decay (NMD),21,92,94,104-107 which induces rapid degradation of transcripts with premature termination codons.
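
The comparison described above can be sketched in a few lines of Python. The sketch assumes the convention, common in ribosome profiling analyses, of expressing translation efficiency (TE) as the ratio of ribosome footprint density over the CDS to mRNA abundance; the function names and the record layout are hypothetical and are not taken from the cited studies.

```python
from collections import defaultdict
from statistics import median


def translation_efficiency(cds_rpf_density, mrna_density):
    """TE as footprint density over the CDS divided by mRNA abundance (assumed convention)."""
    if mrna_density <= 0:
        raise ValueError("mRNA density must be positive to compute TE")
    return cds_rpf_density / mrna_density


def median_te_by_uorf_count(records):
    """Group transcripts by their number of uORFs and return the median TE per group.

    records: iterable of (n_uorfs, cds_rpf_density, mrna_density) tuples,
             one per transcript (hypothetical input layout).
    """
    groups = defaultdict(list)
    for n_uorfs, rpf, rna in records:
        groups[n_uorfs].append(translation_efficiency(rpf, rna))
    # Sort by uORF count so that a downward trend is easy to read off.
    return {n: median(values) for n, values in sorted(groups.items())}
```

Under the inverse correlation reported above, the median TE returned by such a function would be expected to decrease as the number of uORFs per transcript increases. The same grouping applied to mRNA abundance alone would address the reduced steady-state transcript levels.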

A certain fraction of uORFs is likely to act constitutively. Without invoking any further regulatory mechanisms, the mere presence of a uORF amenable to translational initiation (e.g., located within a permissive sequence context) is expected to cause dampening of downstream translation and/or destabilization of the associated transcript - independent of cell type, cellular condition or environmental change. The strength of the imposed regulatory effect of constitutive uORFs is determined by cis-encoded sequence features. For example, certain scenarios like a sub-optimal sequence context around the initiation codon and close proximity to the 5’ cap interfere with recognition of initiation codons, promote leaky scanning and thus correlate with weak repressiveness of the uORF.108,109 On the other hand, long uORFs and short distances between uORF and main ORF impede reinitiation and thus correlate with increased repressiveness of the uORF.91,110,111
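
The cis-encoded features named above (initiation context, distance from the 5’ cap, uORF length and distance to the main ORF) can be read directly from a transcript sequence. The Python lines below are a minimal sketch under stated assumptions: the function names and the output fields are hypothetical, only AUG start codons are considered, and the context rule (a purine at position -3 and a G at position +4 relative to the A of the AUG) is a common simplification of the Kozak consensus rather than the scoring used in the cited studies.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def kozak_strength(seq, aug):
    """Classify the AUG context by the -3 and +4 positions (simplified rule)."""
    minus3 = seq[aug - 3] if aug >= 3 else None
    plus4 = seq[aug + 3] if aug + 3 < len(seq) else None
    score = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return ("weak", "adequate", "strong")[score]


def find_uorfs(transcript, cds_start):
    """List AUG-initiated ORFs that start upstream of the main start codon.

    transcript: RNA or DNA sequence, 5' to 3' (T is converted to U).
    cds_start:  0-based index of the A of the main ORF's AUG.
    """
    seq = transcript.upper().replace("T", "U")
    uorfs = []
    for aug in range(cds_start):
        if seq[aug:aug + 3] != "AUG":
            continue
        # Walk codon by codon in the frame of this AUG until a stop codon.
        for pos in range(aug + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3  # one past the stop codon
                uorfs.append({
                    "start": aug,
                    "end": end,
                    "codons": (pos - aug) // 3,     # sense codons, stop excluded
                    "cap_distance": aug,            # nucleotides between cap and AUG
                    "gap_to_cds": cds_start - end,  # negative if the uORF overlaps the CDS
                    "kozak": kozak_strength(seq, aug),
                })
                break
    return uorfs
```

In terms of the correlations described above, a uORF reported with a weak context and a small `cap_distance` would be a candidate for frequent leaky scanning, whereas a large `codons` value or a small `gap_to_cds` would argue for inefficient reinitiation and stronger repression.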

Regulating the regulator – how uORF translation can be controlled

Apart from such a constitutive mode of action, there is increasing evidence that uORF-mediated regulation itself can be dynamically regulated. In its simplest form of regulation, uORFs can either be selectively included or excluded during the production of the mRNA. Mechanisms generating transcript isoforms differing in the number and/or location of uORFs include alternative transcription start site (TSS) usage92,107,112,113 and alternative splicing.114-118 The impact that a different transcript leader can have on protein production is exemplified by the disease-causing splice donor mutation in the Thrombopoietin (THPO) transcript leader.119 Exon-skipping generates a THPO transcript leader lacking uORFs. Consistent with a repressive effect of the THPO uORFs, the resultant transcript variant shows increased translation of THPO and leads to an over-production of TPO protein. More interesting from a developmental point of view are regulated changes in transcript leaders that can contribute to the cell-type specific proteome. For example, differential splicing of the 5’UTR of Elk-1 removes the STOP-containing exon of the first uORF, which concomitantly places the AUG initiation codon of the second uORF in-frame within the first uORF. While the underlying molecular mechanism is not yet clear, this alternatively spliced transcript shows decreased sensitivity to mTOR inhibition by Rapamycin.118 Another example stems from differential TSS usage during myoblast differentiation, which generates a Cryab transcript isoform in myoblasts with an additional 5’-most uORF-containing exon. Lack of this 5’-most exon in differentiated myotubes could contribute to the increased Cryab protein production.120,121 Because alternative TSS usage is common during vertebrate embryogenesis122,123 and in different cell types,124-127 such a transcription-based control of uORF regulation could have broad implications in developmental regulation of gene expression.

Other mechanisms of uORF regulation require additional RNA-binding factors or translation machinery associated proteins that modulate the extent of leaky scanning or reinitiation.128 A factor that has been implicated in promoting upstream AUG start codon selection is the DExH-box helicase DHX29.129 In vitro experiments revealed that DHX29, in association with the initiation factor eIF1A and the pre-initiation complex, reduces leaky scanning, presumably by inducing a conformational change and stabilization of the pre-initiation complex. Moreover, changing the activity and concentration of eIF2 affects the general ability of ribosomes to reinitiate translation after uORFs.91,110,111 Similarly, a study carried out in Arabidopsis thaliana has identified the target-of-rapamycin (TOR) pathway in conjunction with the S6 kinase (S6K) and the plant reinitiation factor eIF3h as more general regulators of reinitiation.130 Induction of the TOR pathway by the plant hormone auxin leads to S6K-mediated phosphorylation of eIF3h, which promotes the assembly of reinitiating ribosomes.

In contrast to these globally acting trans-factors, there are several examples of RNA-binding factors that affect translation of specific uORF-containing transcripts. A good example is the trans-acting RNA binding protein Sex Lethal (SXL). SXL binds downstream of a short uORF in the male-specific lethal (msl)-2 transcript and reduces leaky scanning by promoting translation initiation at the uORF, which augments the repressive effect of the uORF on downstream translation by about 9-fold.102 Another transcript-specific modulator is the non-canonical translation initiation factor DENR-MCT (density regulated protein and multiple copies in T-cell lymphoma) complex, which promotes reinitiation at ∼100 transcripts characterized by the presence of short uORFs with strong Kozak sequences.131 DENR knockout mutants in Drosophila exhibit defects in larval growth and reduced accumulation rates of DENR-regulated proteins, especially in proliferating tissues.

Challenges and opportunities

The biggest challenge in the field of sORFs will be to tease apart those that have a specific function from those that do not. Evolutionary conservation of either the encoded protein sequence or the regulatory effect is clearly a hint toward functionality, yet for most sORFs the most rigorous assessment of functionality, namely mutagenesis of the sORF and analyses of resultant in vivo phenotypes, remains to be done. In light of the recent finding that some uORF-, pseudogene- and dORF- (downstream ORF) encoded peptides are conserved,15 these analyses might reveal protein-coding functions for some sORFs that are currently classified as regulatory. In general, classifying sORFs into likely protein-coding versus regulatory regions is aided by the analysis of nucleotide and amino acid conservation. However, in the absence of functional data this division remains artificial and does not exclude a possible dual (regulatory and coding) role for a single sORF. Precedent for such uORF-encoded functional peptides exists from studies in plants.132,133

While not all of the conserved sORFs are expected to reveal discernible phenotypes when mutated, the chances of identifying even subtle phenotypes indicating potentially essential roles can be greatly increased by specific hypotheses that can be tested in targeted functional assays. For example, flies and mice lacking SERCA-regulating short protein-encoding sORFs are viable and do not show overt behavioral or morphological muscle phenotypes, yet detailed analyses of muscle physiology and functionality revealed altered muscle contractions and aberrant calcium flux in mutant animals.38,40,41 Therefore, the particular challenge for identifying functions for small protein-encoding sORFs will be to narrow down the vast range of potentially affected cellular processes to formulate specific, testable hypotheses. On the other hand, big outstanding questions in the field of regulatory sORFs concern 1) the extent to which sORF translation really matters, 2) the extent to which sORFs are dynamically regulated, and 3) what the underlying mechanisms are. While the combination of currently available techniques like ribosome profiling, RNA-IPs and CRISPR/Cas9-based mutagenesis of potential regulators will likely address several of these questions, genome-wide technologies that can assess translational regulation at the level of individual transcripts are still missing. As such, it is currently unclear to what extent ribosome protected fragments in transcript leaders originate from the same transcript that also translates the downstream ORF, or from a different transcript at which the downstream ORF is not associated with translating ribosomes. Answers to this long-standing question will open new possibilities to globally assess the dynamics of translational regulation in unprecedented detail.

Disclosure of potential conflicts of interest.

No potential conflicts of interest were disclosed.


