Protein Science: A Publication of the Protein Society. 2023 Aug 1;32(8):e4708. doi: 10.1002/pro.4708

Shining a light on the dark proteome: Non‐canonical open reading frames and their encoded miniproteins as a new frontier in cancer biology

Zoe Posner 1, Ian Yannuzzi 1, John R Prensner 2,3
PMCID: PMC10357943  PMID: 37350227

Abstract

In the decades following the discovery that genes encode proteins, scientists have tried to comprehensively characterize the human genome. Recent advances in computational methods along with transcriptomic and proteomic techniques have now shown that historically non‐coding genomic regions may contain non‐canonical open reading frames (ncORFs), which may encode functional miniproteins or otherwise exert regulatory activity through coding‐independent functions. Increasingly, it is clear that these ncORFs may play critical roles in major human diseases such as cancer. In this review, we summarize the history and current progress of ncORF research and explore the known functions of ncORFs and the miniproteins they may encode. We particularly highlight the emerging body of evidence supporting a role for ncORFs and miniproteins in cancer. Finally, we provide a blueprint for high‐priority areas of future research for ncORFs in cancer, focusing on ncORF detection, functional characterization, and therapeutic intervention.

Keywords: cancer, miniprotein, non‐coding genome, open reading frame

1. INTRODUCTION

Gregor Mendel's proposal that heredity is transmitted in discrete units spurred a long‐term interest in dissecting, quantifying, and characterizing such units, known today as genes. Some 60 years after Mendel's death, a major breakthrough came when George Beadle and Edward Tatum performed their famed bread mold experiments to form the “one gene, one enzyme hypothesis” (Beadle & Tatum, 1941). Their work not only won the 1958 Nobel Prize but also inspired a generation of leading researchers. By 1963, only 10 years after the advent of the double helix model of DNA, much about the central dogma was already appreciated: DNA is transcribed into mRNA, which encodes proteins via triplet codons with help from transfer RNAs and ribosomes (Nirenberg, 1963).

Which nucleotides, then, does the ribosome choose to translate? Seminal work from Joan Steitz defined that ribosomes engage RNAs in ~29 nucleotide fragments (Steitz, 1969; Steitz et al., 1970), and comprehensive efforts by Marilyn Kozak worked to compile the first catalog of 211 known messenger RNAs (Kozak, 1984), leading to the scanning model of ribosome initiation (Kozak, 1989). Yet, it was the advent of DNA sequencing and the efforts of the Human Genome Project that resulted in the assembly of the first complete human genome, bringing the possibility of quantifying and characterizing all human genes nearer. Taking this a step further, the Encyclopedia of DNA Elements (ENCODE), formed shortly after the Human Genome Project's completion, sought to characterize all functional genomic elements in humans (ENCODE Project Consortium et al., 2007).

The Human Genome Project first estimated that the human genome contains approximately 30,000–40,000 protein‐coding genes (Lander et al., 2001). Later scrutiny trimmed this list to ~25,000 (International Human Genome Sequencing Consortium, 2004), and since the mid‐2000s, this list has remained fairly static, settling at an estimation of 19,600 protein‐coding genes (Harrow et al., 2012). By and large, identification of these genes in humans and other mammals relied on a number of parametric assumptions; namely, a protein size (>100 amino acids), an AUG start codon, monocistronic transcript requirements, and no overlap in open reading frames (Mouse Genome Sequencing Consortium et al., 2002; Lander et al., 2001). In rare individual cases, these guidelines were modified, such as for the annotation of the CDKN2A gene, which was previously well‐characterized to encode both the INK4a and p14‐ARF proteins (Ouelle et al., 1995), as well as for the transcript encoding the RPP14 and HTD2 proteins (Autio et al., 2008).

In this historical context, the two decades since the completion of the Human Genome Project have witnessed a tremendous expansion of research on RNA translation. Collectively, this work has spurred the nomination of thousands of human genetic elements that interact with the ribosome but are not annotated as protein coding in major reference databases. These elements are termed non‐canonical open reading frames (ncORFs). ncORFs are often small (<100 amino acids), can have near‐cognate start codons, may exist on a polycistronic transcript, and may overlap with other open reading frames (Mohaupt et al., 2022). Since these features defy many of the annotation guidelines set following the Human Genome Project, many ncORFs were missed during early gene discovery efforts. A subset of these nominated ncORFs likely encode stable small protein species termed miniproteins, which may have important implications for the human proteome. Representing a new aspect of human genome functionality, ncORFs promise to yield important insights into diseases like cancer. Because cancer is a disease of dysregulated signaling and developmental pathways, ncORFs and miniproteins represent a class of as‐yet unexplored potential cancer vulnerabilities and predictive biomarkers through intrinsic and regulatory functions.

In this review, we will provide an overview of the state of ncORF research, describe the implications of these ncORFs in cancer, and provide a future outlook for ncORFs as new players in the landscape of cancer research.

2. EARLY INVESTIGATIONS OF NCORFS IN THE HUMAN GENOME

2.1. Miniprotein discovery before the Human Genome Project

The origins of ncORF research date back to early observations of small open reading frames. In 1987, Marilyn Kozak explored “nonfunctional” upstream AUG codons in known vertebrate mRNAs (Kozak, 1987). Kozak interpreted these upstream open reading frames (uORFs) as a mechanism of translational regulation through ribosomal hogging, leading to decreased translation of the downstream protein‐coding sequence (CDS). While Kozak's work on uORFs suggested ribosome engagement at non‐canonical sites, it did not address whether ncORF translation results in stable and bioactive miniproteins, despite interest in whether this was the case. Instead, CDS discovery was largely based on early computational approaches, which struggled to distinguish real, functional miniproteins from ORFs that might arise by chance among human DNA sequences. Consequently, early protein/miniprotein research relied on the use of cDNA libraries, ribosome binding models, and serial analysis of gene expression (SAGE) experiments to nominate candidate ncORFs (Hemm et al., 2008; Kastenmayer et al., 2006; Kondo et al., 2007; Velculescu et al., 1995, 1997).

Despite these difficulties, the possibility of miniproteins persistently attracted attention from groups interested in genome characterization, particularly in non‐mammalian organisms such as Drosophila, Escherichia coli, and Saccharomyces cerevisiae (Hemm et al., 2008; Kastenmayer et al., 2006; Kondo et al., 2007). Yeast was the first eukaryote in which ncORFs gained general acceptance as being widespread throughout the genome. Using early cDNA analyses of the yeast transcriptome along with homology‐based inferences, one study identified 299 ncORFs within the S. cerevisiae genome, representing approximately 5% of all S. cerevisiae genes (Kastenmayer et al., 2006). Many of these genes were essential for yeast viability (Kastenmayer et al., 2006). Additional work on ncORFs in unicellular organisms was also pursued in E. coli, where early investigations used sequence conservation and ribosome binding site models to identify potential ncORFs of 16–50 aa in the intergenic regions of the genome (Hemm et al., 2008). This work not only observed multiple candidate ncORFs, but also validated 20 previously identified and 18 newly predicted ncORFs using endogenous epitope tagging of the genomic site (Hemm et al., 2008).

Drosophila was the first multicellular organism to be extensively interrogated for ncORFs. Here, studies first identified candidate ncORFs through cDNA sequencing libraries that aimed to uncover non‐coding RNA (ncRNA) transcripts (Inagaki et al., 2005; Rubin et al., 2000; Tupy et al., 2005). For example, initial analyses identified a putative ncRNA termed MRE29, which was nominated as a non‐coding RNA due to the absence of a long coding sequence >100 amino acids. Yet, this transcript was later found to encode four embryonic lethal ncORFs involved in regulation of actin‐based cell morphogenesis (Kondo et al., 2007). Likewise, the tarsal‐less gene was characterized as a carefully regulated and locally expressed polycistronic mRNA encoding several ncORFs of 11 aa involved in development (Galindo et al., 2007). Thus, while appreciation for ncORFs in non‐mammalian genomes grew, there remained challenges in annotation and validation of ncORFs in the human genome.

2.2. Early methods in protein‐coding gene annotation

Initial efforts to identify the locations of protein‐coding genes in mammals relied on genetic (or linkage) mapping methods (Bell & Haldane, 1937; Robson, 1988), which determined the location of genes on a chromosome through analyzing patterns of inheritance in families or by comparing the DNA of different, unrelated individuals. However, genetic mapping, particularly through classical pedigree analysis, is low‐throughput and impractical for large‐scale annotation efforts. With the advent of Sanger sequencing, large‐scale projects became practical, prompting the development of methods to determine protein‐coding regions of the genome.

The earliest of these methods involved bioinformatic approaches, including neural network and information theory techniques to predict protein‐coding regions from DNA (Farber et al., 1992; Lapedes et al., 1990). However, such work was dependent on training computational models on the limited numbers of previously annotated protein‐coding DNA sequences, making model performance highly dependent on the length of the ORF to the exclusion of ORFs shorter than 90 codons (Farber et al., 1992). Therefore, cDNA libraries were used as an alternative method to identify functional miniproteins (Inagaki et al., 2005; Rubin et al., 2000; Stallmeyer et al., 1999; Tupy et al., 2005). These projects aimed to annotate cDNA clones containing the full‐length ORF sequence for each gene (Rubin et al., 2000). These cDNA libraries became important resources for identifying bicistronic transcripts (Stallmeyer et al., 1999) and putative lncRNAs (Inagaki et al., 2005; Tupy et al., 2005)—many of which were later determined to encode miniproteins (Hartford & Lal, 2020).

Despite early successes, cDNA libraries had significant limitations: they were prone to false positive and false negative results, and they represented only sequences from mature mRNA. For example, the lncRNA Xist, which plays a major role in X‐chromosome inactivation, was originally thought to be a protein‐coding gene on the basis of a cDNA library experiment (Borsani et al., 1991). Since start codons and ORFs are inferred based on cDNA sequence, it can be challenging to differentiate true lncRNAs from transcripts harboring non‐canonical ORFs. Moreover, cDNA libraries may over‐represent the most abundant mRNA transcripts, thus underrepresenting or completely missing lowly expressed transcripts (Liang & Pardee, 1992). Since the entire human genome was not fully sequenced at this point, early cDNA library projects reconstructed transcripts by aligning cDNAs to each other and concatenating cDNA fragments. This approach, however, cast doubt on small ORFs, which had a greater likelihood of representing individual portions of a single larger ORF once the cDNA fragments were concatenated. This challenge led early methods to establish a 100 aa cutoff to decrease the chance of such artifacts, but this made false negative results for ncORFs more common (Burge & Karlin, 1997). Ultimately, early sequence prediction approaches and use of cDNA libraries yielded results that were challenging to interpret, necessitating new and improved methods to accurately identify miniproteins.

2.3. Miniprotein discovery after the Human Genome Project

2.3.1. Sequence‐based computational miniprotein prediction

Following the Human Genome Project, miniprotein discovery and protein‐coding annotation relied heavily on computational methods (Figure 1) (Wang et al., 2004), particularly evolutionary‐based alignment techniques (Lin et al., 2011; Mudge et al., 2019; Wang et al., 2013). Pressure to develop these methods increased with the rise in lncRNA discovery through the invention of RNA sequencing (Cabili et al., 2011; Guttman et al., 2009; Trapnell et al., 2010). From work in bacterial genomes, the CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) program was developed as one of the earliest ncORF annotation tools (Badger & Olsen, 1999; Frith et al., 2006). CRITICA combines a comparative sequence homology alignment analysis with a statistical analysis of hexanucleotide frequencies in predicted coding and non‐coding frames. Applying this method to large mouse cDNA collections identified more than 3000 potential ORFs <100 aa (Frith et al., 2006). Despite these advances, computational prediction methods still suffered from high false‐positive rates, casting doubt on the true biological relevance of ncORFs.
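The hexanucleotide‐frequency idea behind such tools can be illustrated with a minimal sketch. It is not CRITICA's implementation: the function names, the toy training sequences, and the pseudocount are assumptions chosen for illustration, and the comparative‐alignment component is omitted entirely.

```python
import math
from collections import Counter

def hexamer_counts(seqs):
    """Count in-frame hexamers (two adjacent codons), stepping one codon at a time."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts

def hexamer_score(seq, coding, noncoding, pseudo=1.0):
    """Mean log-odds (coding vs. non-coding) over the in-frame hexamers of seq.

    pseudo is a hypothetical smoothing constant so unseen hexamers do not
    produce log(0); 4096 is the number of possible hexamers (4**6).
    """
    n_coding = sum(coding.values())
    n_noncoding = sum(noncoding.values())
    total, n = 0.0, 0
    for i in range(0, len(seq) - 5, 3):
        h = seq[i:i + 6]
        p_c = (coding[h] + pseudo) / (n_coding + pseudo * 4096)
        p_n = (noncoding[h] + pseudo) / (n_noncoding + pseudo * 4096)
        total += math.log(p_c / p_n)
        n += 1
    return total / n if n else 0.0

# Toy training data (illustrative only, far too small for real use).
coding_table = hexamer_counts(["ATGGCTGCTGCTGCTGCTGCTTAA"])
noncoding_table = hexamer_counts(["ATATATATATATATATATATATAT"])

coding_like = hexamer_score("GCTGCTGCTGCT", coding_table, noncoding_table)
noncoding_like = hexamer_score("ATATATATATAT", coding_table, noncoding_table)
```

A positive score means the candidate's hexamer usage resembles the coding training set more than the non‐coding one. With short ORFs only a handful of hexamers contribute to the mean, which is one intuition for why purely statistical predictors become noisy for ncORFs.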

FIGURE 1.

Methods for the discovery or validation of non‐canonical open reading frames (ncORFs). Non‐canonical open reading frames are putatively identified through computational methods, ribosome profiling, or mass spectrometry approaches. The computational approaches include sequence‐based approaches that predict the coding potential of a non‐canonical open reading frame based on sequence conservation or by analyzing a nucleotide sequence for codon bias and coding potential. Other computational approaches assess ribosome‐profiling data to identify the non‐canonical open reading frame that gave rise to a given protected (translated) RNA fragment. Finally, translated miniproteins can be studied through mass spectrometry and other functional approaches, such as CRISPR‐based functional screens, protein structure–function studies, and novel approaches like BONCAT. Ultimately, the successful identification and validation of a non‐canonical ORF will require the simultaneous use of multiple methodological approaches. Figure created with biorender.com.

Methods involving conservation‐based comparative analyses significantly improved bioinformatic protein prediction and include conservation‐analysis at both the nucleotide and amino‐acid levels (Lin et al., 2011; Trapnell et al., 2010; Mudge et al., 2019; Mackowiak et al., 2015). To illustrate, the program PhyloCSF uses a multispecies nucleotide alignment to calculate the likelihood that a particular sequence represents a conserved protein‐coding region (Lin et al., 2011). PhyloCSF has been applied to whole‐genome sequencing data for multiple model organisms, facilitating the identification of some novel protein‐coding genes (Mudge et al., 2019). However, ncORFs are generally less well conserved than canonical protein‐coding ORFs and these models still rely on previously identified ncORFs for training and validation; both caveats necessitate continued optimization of these methods for ncORFs (Dinger et al., 2008).

Outside of strict conservation‐based methods of miniprotein prediction, other algorithms include AUGUSTUS, sORF finder, and Coding‐Potential Assessment Tool (CPAT). AUGUSTUS establishes a generalized Hidden Markov Model and combines ab initio gene prediction with a comparative genomics approach to accurately identify gene structures (Stanke et al., 2006). AUGUSTUS uses a species‐specific training set of high‐quality gene annotations to develop parameters to predict new genes by first identifying potential transcription start sites and promoter elements. Furthermore, AUGUSTUS uses alignments to closely related species to detect regions of conservation and provide further statistical power for accurate gene prediction. Unlike AUGUSTUS, sORF finder was developed specifically for the prediction of small ORFs (Hanada et al., 2010). sORF finder uses a high‐quality nucleotide sequence as input, scanning the input sequence in all reading frames and searching for ORFs that meet certain criteria, such as minimum length, absence of internal stop codons, sequence conservation, codon usage bias, and secondary structure predictions. Lastly, the CPAT program combines machine learning with sequence features to differentiate between coding and non‐coding sequences (Wang et al., 2013). It utilizes a training set of known coding and non‐coding sequences to establish positive and negative training examples. CPAT extracts sequence features from the training sequences, namely ORF length, ORF coverage, the Fickett score (a measure of position‐specific nucleotide composition), and hexamer usage bias, to capture discriminatory patterns between coding and non‐coding regions. Using the extracted features, CPAT trains a logistic regression model. Once the model is trained, CPAT can be applied to assess the coding potential of novel transcript sequences. 
For additional information on these and other computational tools, there are more comprehensive reviews suggested here (Wang et al., 2004; Do & Choi, 2006; Alioto, 2012; Goel et al., 2013; Klasberg et al., 2016).
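The frame‐by‐frame scan that small‐ORF finders perform can be sketched as follows. This is an illustrative simplification, not the sORF finder algorithm: the function name, the default length bounds, and the restriction to the forward strand are assumptions, and the conservation, codon‐bias, and structure filters described above are left out.

```python
def find_orfs(seq, min_codons=10, max_codons=100, starts=("ATG",)):
    """Scan the three forward reading frames for start-to-stop ORFs.

    Returns (start, end, frame) tuples with 0-based, half-open coordinates;
    end includes the stop codon. Lengths are counted in codons, excluding
    the stop. The first start codon after the previous stop is used.
    """
    stops = {"TAA", "TAG", "TGA"}
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in stops:
                n_codons = (i - start) // 3
                if min_codons <= n_codons <= max_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy sequence: an 11-codon ORF (ATG + 10 x GCT) beginning at position 2.
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
hits = find_orfs(toy)
```

Passing `starts=("ATG", "CTG", "GTG")` would admit near‐cognate initiation, at the cost of many more candidates. That trade‐off is the reason sequence scanning alone tends to over‐predict and is usually combined with the additional filters named in the text.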

2.3.2. Mass spectrometry for miniprotein discovery

To address the limitations of phylogenetic and other purely computational approaches, advances in mass spectrometry (MS) provided crucial experimental evidence of miniprotein translation (Oyama et al., 2004; Svensson et al., 2003). Mass spectrometry, which can directly identify amino acid sequences from miniproteins, provided the first high‐throughput experimental evidence of miniproteins (Oyama et al., 2004, 2007; Slavoff et al., 2012). Importantly, shotgun MS techniques are biased against miniproteins due to fewer tryptic cleavage sites, greater protein instability, and lack of comparative databases for analysis (Yewdell, 2022). Optimization of MS techniques through biochemical enrichment, less stringent cutoffs, and other prior isolation steps helps to address these limitations (Orr et al., 2020). For example, Slavoff et al. (2012) combined MS with RNA sequencing, enriching for ncORF‐encoded peptides by inhibiting degradation and using electrostatic repulsion hydrophilic interaction chromatography to separate peptides prior to MS. Leveraging a custom RNA‐seq database, Slavoff and colleagues identified evidence of protein translation for 86 novel ncORFs (Slavoff et al., 2012). Further optimization of the MS fractionation step has facilitated discovery of additional novel ncORFs (Ma et al., 2014, 2016).
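The bias arising from fewer tryptic cleavage sites can be made concrete with a small in silico digest. This is a sketch under stated assumptions: the cleavage rule is the common simplification (cut after K or R unless the next residue is P), the 8 to 30 residue detectability window is a hypothetical choice, and the sequences are invented rather than real miniproteins.

```python
import re

def tryptic_peptides(protein, min_len=8, max_len=30):
    """Digest a protein in silico and keep peptides in an assumed detectable size range.

    Cleaves C-terminal to K or R, except when followed by P.
    min_len and max_len are illustrative thresholds, not instrument limits.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if min_len <= len(p) <= max_len]

# Invented 11-residue miniprotein: only one peptide survives the length filter.
mini = tryptic_peptides("MKRLLSAVEGK")

# A K followed by P is not cleaved, so this sequence yields a single 15-mer.
proline_case = tryptic_peptides("AAAAAAKPAAAAAAR")
```

A miniprotein that yields one or zero peptides in the detectable window offers little opportunity for confident identification, whereas a several‐hundred‐residue protein typically yields many. This is one reason enrichment steps or alternative proteases are attractive for ncORF proteomics.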

Slavoff and colleagues’ use of a custom RNA‐seq database highlights the broader point that curating protein sequence databases with improved genomic and transcriptomic data increases the ability to identify ncORF‐encoded miniproteins through MS (Nesvizhskii, 2014). A variety of computational programs, including customProDB, Galaxy Integrated Omics, and QUILTs can aid in the design of these custom databases, and are comprehensively reviewed elsewhere (Ruggles et al., 2017). As newer proteomics techniques improve in sensitivity (Brunner et al., 2022), these methods will become increasingly important for characterizing the landscape of ncORFs and clarifying intrinsic versus regulatory functions.

2.4. Ribosome profiling

Ribosome profiling, or Ribo‐seq, has been the most successful method for identifying ncORFs. This technique uses deep sequencing of ribosome footprints, the fragments of mRNA that are protected from nucleolytic digestion by the ribosome, resulting in an indirect measurement of active translation (Ingolia et al., 2009). A wide variety of distinct classes of ncORFs have been described with Ribo‐seq (Figure 2). While this technique can provide high‐resolution mapping of ORFs, the quality of ribosome profiling data and fluctuations in periodicity can alter the ncORFs detectable through this approach (Hsu et al., 2016). As Ribo‐seq has grown in popularity, a large number of computational tools have been developed to analyze profiling data, particularly for ORF identification. Many of these tools, such as RiboTools and RiboSeqR, are notable for their ease of use by researchers with limited computational background. For a more thorough review of computational tools and approaches for Ribo‐seq analysis, we direct our readers to recent comprehensive reviews (Calviello & Ohler, 2017; Wang et al., 2019; Kiniry et al., 2020).
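The periodicity signal that ORF‐calling tools rely on can be illustrated with a minimal frame‐counting sketch. It assumes that footprints have already been converted to P‐site positions in transcript coordinates; the function name and the example positions are hypothetical, and real tools apply formal statistical tests rather than a simple fraction.

```python
def frame_fractions(psite_positions, orf_start, orf_end):
    """Fraction of P-site positions in each codon frame of the ORF [orf_start, orf_end).

    Frame 0 is the frame of the candidate ORF's start codon. Positions
    outside the ORF are ignored.
    """
    counts = [0, 0, 0]
    for pos in psite_positions:
        if orf_start <= pos < orf_end:
            counts[(pos - orf_start) % 3] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0, 0.0, 0.0]

# Invented P-site positions for a candidate ORF spanning 100..130.
psites = [100, 103, 106, 109, 110, 112, 115, 118, 121, 124]
fractions = frame_fractions(psites, 100, 130)
```

Actively translated ORFs tend to concentrate footprints in frame 0, whereas background or non‐ribosomal fragments spread roughly evenly across the three frames. For a short ncORF covered by only a few reads, the fractions are estimated from very little data, which is how data quality and fluctuations in periodicity alter which ncORFs are detectable.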

FIGURE 2.

Major categories of ncORFs. Non‐canonical open reading frames are genomic regions bookended by start and stop codons, which confer the potential for translation of these genomic regions. ncORFs can exist within a variety of genomic regions including: upstream of a protein‐coding sequence (CDS) in the 5′UTR, downstream of a CDS in the 3′UTR, contained internally within a canonical CDS, overlapping by spanning from a 5′UTR into the CDS or from the CDS into the 3′UTR. Outside of annotated protein‐coding genes, ncORFs can also exist within long non‐coding RNA (lncRNA), within retroviral genes, and within pseudogenes. Finally, non‐canonical back splicing of pre‐mRNA transcripts can give rise to circular RNAs that encode unique translational products. ncORF transcription does not guarantee translation into miniproteins. However, ncORF functionality can be a product of coding‐independent activity, coding‐dependent activity, or both. Finally, cancer‐associated ncORFs, where applicable, are highlighted herein according to their classification. Figure created with biorender.com.
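The positional categories in Figure 2 can be expressed as a simple comparison of ORF and CDS coordinates on a transcript. The sketch below is illustrative only: the function name, the label strings, and the half‐open coordinate convention are assumptions, reading frame is not checked, and categories that depend on transcript biotype (lncRNA, pseudogene, retroviral, circular RNA) are collapsed into a single label for transcripts without an annotated CDS.

```python
def classify_orf(orf_start, orf_end, cds_start=None, cds_end=None):
    """Classify an ORF by its position relative to the annotated CDS.

    All coordinates are transcript positions with half-open intervals.
    cds_start/cds_end are None for transcripts without an annotated CDS.
    """
    if cds_start is None or cds_end is None:
        return "ORF on transcript without annotated CDS"
    if (orf_start, orf_end) == (cds_start, cds_end):
        return "annotated CDS"
    if orf_end <= cds_start:
        return "upstream ORF (5'UTR)"
    if orf_start >= cds_end:
        return "downstream ORF (3'UTR)"
    if orf_start < cds_start:
        return "upstream overlapping ORF"
    if orf_end > cds_end:
        return "downstream overlapping ORF"
    return "internal ORF"

# Invented coordinates for a transcript with a CDS spanning 100..400.
labels = [
    classify_orf(10, 40, 100, 400),
    classify_orf(80, 130, 100, 400),
    classify_orf(150, 210, 100, 400),
    classify_orf(380, 450, 100, 400),
    classify_orf(420, 480, 100, 400),
    classify_orf(10, 40),
]
```

In practice an internal ORF is usually of interest only when it lies in a different reading frame from the CDS, so a fuller version would also compare `orf_start % 3` with `cds_start % 3`.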

Ribo‐seq is also not restricted to AUG start codons and can detect non‐AUG initiation, both for ncORFs and for annotated CDSs (Andreev et al., 2022; Fedorova et al., 2022; Ingolia et al., 2011). Application of drugs that stall ribosomes at translation initiation sites, such as homoharringtonine or lactimidomycin, enables further resolution of non‐AUG start codon usage (Lee et al., 2012). Notably, Ribo‐seq has the potential to describe very tiny ncORFs with putative functional miniproteins (Sandmann et al., 2023), whereas MS may be unable to resolve very small ORFs due to challenges with uniquely aligning peptides less than 8 amino acids. Given the utility of Ribo‐seq, it has been used as the basis for genomic annotation of ncORFs in standardized databases (Mudge et al., 2022). Finally, efforts to expand ncORF databases are ongoing. Especially notable are recent efforts to define the landscape of human small ORF (smORF) translation in primary cells and human tissues (Chothani et al., 2022).

3. NCORFS IN CANCER

With advances in Ribo‐seq and MS enabling identification of a growing number of ncORFs, cancer biologists have increasingly asked whether these ncORFs, and the miniproteins they encode, may expand knowledge of cancer pathogenesis and treatment options. With the possibility of both protein‐coding and coding‐independent activities, ncORFs may act as crucial regulators of transcription, translation, and post‐translational protein function (Liu et al., 2022), and characterization of ncORFs and miniproteins has revealed that many are tightly associated with cancer and frequently dysregulated early in oncogenesis (Erady et al., 2021; Zou et al., 2019). Extending this further, CRISPR/Cas9 loss‐of‐function screens targeting ncORFs have identified multiple miniproteins essential for cancer cell viability (Chen et al., 2020; Prensner et al., 2021). Here, we review the major coding functions of ncORFs and their regulatory functions, highlighting their contributions to cancer development and progression (Figure 3).

FIGURE 3.

Miniprotein involvement in major cancer signaling pathways. Miniproteins can suppress or promote pro‐oncogenic signaling pathways in diverse manners. At the plasma membrane, AKT3‐174aa, encoded by a circular RNA isoform of AKT3, competes with p‐PDK‐1 to inhibit AKT activation. Consistent with this tumor‐suppressive function, circAKT3 is downregulated in glioblastoma (GBM) samples (Xia et al., 2019). By contrast, C‐E‐Cad, which is overexpressed in GBM, is a secreted miniprotein encoded by a circular isoform of E‐cadherin that binds the CR2 domain of EGFR to activate it (Gao et al., 2021). The 59aa SMIM30 miniprotein mediates the SRC/YES1 complex anchoring to the cytoplasmic surface of the plasma membrane. SRC/YES1 membrane‐anchoring enables phosphorylation and subsequent activation of the complex, which induces downstream MAPK signaling (Pang et al., 2020). SMIM30 is regulated by c‐MYC and overexpressed in hepatocellular carcinoma. In the cytoplasm, the tumor‐suppressive FBXW7‐185aa promotes c‐MYC degradation by suppressing USP28 inhibition of the canonical FBXW7 protein (Yeh et al., 2018). Further, CIP2A‐BP competes with PP2A for binding to CIP2A (Guo et al., 2020), which prevents CIP2A stabilization, thereby inhibiting downstream AKT/NFkB pathway activation. Localized to the lysosome, the lncRNA‐encoded SPAR miniprotein interacts with v‐ATPase to prevent mTORC1 recruitment to the complex, attenuating growth‐factor‐independent mTORC1 activation (Matsumoto et al., 2017). Nuclear‐localized miniproteins are implicated in transcriptional regulation. For instance, SEHBP tightly binds histone H2B and interacts with scaffolding proteins that modulate chromatin accessibility, strongly suggesting a role in transcriptional regulation (Koh et al., 2021). Figure created with biorender.com.

3.1. ncORF‐intrinsic functions

An intriguing conclusion from ncORF discovery efforts is that some ncORFs indeed expand the known proteome by encoding functional miniproteins (Anderson et al., 2015; Pauli et al., 2014). These miniproteins, as a consequence of their small size, tend to contain few functional domains. Yet, despite this relative lack of domain architecture, they frequently interact with protein complexes (Chen et al., 2020) and can act as allosteric regulators and peptide hormones. Miniproteins also localize to specific cellular compartments (Na et al., 2022) and often reside in membranes (Pang et al., 2020; Senís et al., 2021). In this section, we review different mechanisms underlying miniprotein activity, while highlighting cancer‐relevant miniproteins and ones with known involvement in tumor signaling pathways.

3.1.1. Miniprotein interactions with protein complexes

Miniprotein characterization studies have elucidated that some modulate the activity of protein complexes. Intriguingly, complex‐associated miniproteins are mitochondrially‐enriched and may comprise over 28% of proteins involved in the electron transport chain (ETC) (Liang et al., 2022). For example, the ncORF‐encoded BRAWNIN protein functions as an assembly factor, essential for Complex III formation and stability (Zhang et al., 2020). BRAWNIN may further operate with two other ncORF‐encoded miniproteins to regulate the biogenesis of a critical protein subunit of Complex III (Liang et al., 2022). LYRM2 is another mitochondrial miniprotein and associates with Complex I in the ETC to promote oxidative phosphorylation (Huang et al., 2019). In colorectal cancer, oxidative phosphorylation is overactive and LYRM2 is consistently upregulated. Correspondingly, miniproteins may contribute to metabolic reprogramming in cancer, a process known to impact tumor cell proliferation, metastasis, and drug adaptation (Nayak et al., 2018).

Nuclear‐localized miniproteins may act in protein complexes to regulate DNA transcription and repair. For instance, the miniprotein SEHBP tightly binds histone H2B and interacts with scaffolding proteins that modulate chromatin accessibility, indicating a role in transcriptional regulation (Koh et al., 2021). Tumor‐suppressive transcriptional regulators have also been identified. Zhang and coworkers investigated an lncRNA‐derived circRNA that generates the functional miniprotein PINT87aa. This miniprotein interacts with the PAF1 complex to modulate promoter binding and potentially stall RNA Pol II, thereby inhibiting transcriptional elongation of oncogenes (Zhang et al., 2018). While not involved in transcriptional regulation, pTINCR is a ubiquitin‐like and primarily nuclear‐localized miniprotein that stabilizes SUMO1 and SUMO2/3 (Boix et al., 2022). Consistent with this function, overexpression of pTINCR results in SUMOylation of CDC42, promoting its activation to mediate maintenance of epithelial cell state (Boix et al., 2022). As such, pTINCR likely acts as a tumor suppressor by suppressing epithelial‐to‐mesenchymal transition.

Membrane‐associated miniproteins frequently regulate protein complexes. For instance, the 59aa SMIM30 miniprotein mediates the SRC/YES1 complex anchoring to the cytoplasmic surface of the plasma membrane, necessary for downstream MAPK signaling (Pang et al., 2020). SMIM30 is regulated by c‐MYC and overexpressed in hepatocellular carcinoma (HCC), potentially explaining the mechanism underlying MAPK signaling dysregulation in HCCs. Thus, SMIM30 provides an example of oncogene hijacking of ncORF expression and illustrates the subsequent pro‐tumorigenic effects that ncORFs can orchestrate. In other cases, membrane‐associated miniproteins are downregulated in cancers, suggesting potential anti‐tumor roles. pTUNAR is an lncRNA‐encoded transmembrane protein that localizes to the endoplasmic reticulum (ER) and maintains calcium homeostasis through activation of the SERCA2 complex, which transports calcium into the ER (Senís et al., 2021). Downregulation of TUNAR lncRNA in glioblastoma likely contributes to calcium dysregulation in these cancers.

3.1.2. Allosteric regulation by miniproteins

Miniproteins can also regulate cell function by acting as allosteric activators or inhibitors. For instance, Wang et al. characterized an lncRNA‐encoded miniprotein, termed ASRPS, that binds the coiled‐coil domain of STAT3, inhibiting its phosphorylation and subsequent transcriptional activity (Wang et al., 2020). Given the role of STAT3 in promoting metastasis, ASRPS acts as a tumor‐suppressing miniprotein and is downregulated in triple‐negative breast cancer (Wang et al., 2020). Another lncRNA‐encoded miniprotein, CIP2A‐BP, was also found to act as a tumor suppressor in triple‐negative breast cancer. CIP2A‐BP acts by competing with PP2A for CIP2A binding (Guo et al., 2020). PP2A binding stabilizes the CIP2A protein, allowing for its activation and subsequent signaling in AKT/NFkB pathways (Wang et al., 2017). Thus, CIP2A‐BP, by inhibiting downstream CIP2A signaling, reduces pro‐tumorigenic signaling in triple‐negative breast cancer. Interrogating binding sites of both oncogenic and anti‐tumor miniproteins offers therapeutic opportunities to develop small‐molecule binders.

3.1.3. Miniproteins as peptide hormones

Due to their small size, miniproteins are logical candidates to act as peptide hormones to mediate both short‐ and long‐range intercellular communication. Despite this expectation, thus far, few peptide‐hormone miniproteins are known. ELABELA is a 54aa peptide hormone that activates PI3K/AKT signaling via the apelin receptor to stimulate hESC growth and maintain self‐renewal through inhibition of apoptotic pathways (Ho et al., 2015). Interestingly, ELABELA also has coding‐independent functions: its lncRNA transcript can sequester the p53 inhibitor hnRNPL (Ho et al., 2015). A second peptide, apelin, is likewise an agonist of the apelin receptor, binding it through a two‐site mechanism (Ma et al., 2017). Intriguingly, Gao et al. have suggested that a circular form of E‐cadherin produces a distinct, secreted miniprotein that activates EGFR in glioblastoma (Gao et al., 2021). While it has been suggested that up to 80 additional miniproteins may be secreted, whether they act as signaling peptides is unknown (Hu et al., 2022). Finally, ncORFs, particularly those associated with lncRNAs and miRNAs, can act in hormone signaling through coding‐independent roles via RNA‐based regulation, as summarized elsewhere (Pardini & Calin, 2019).

3.1.4. Molecular mimicry

Some ncORFs may share sequence similarities with annotated ORFs, and sufficient sequence homology may permit a non‐canonical peptide to bind its cognate protein's target or regulator. In either instance, the binding mode determines whether this interaction reinforces or competes with the canonical protein's function. As an example, STORM, encoded by linc00689, is upregulated by stress‐induced eIF4E phosphorylation. STORM mimics SRP19 to inhibit SRP complex assembly (Min et al., 2017). In addition, a formerly uncharacterized gene, Cxorf67, was found to encode EZHIP, a protein whose overexpression is a biomarker in posterior fossa ependymoma and midline gliomas H3‐WT (Antin et al., 2020). EZHIP inhibits a gene‐silencing protein complex, PRC2, by mimicking the sequence of the K27M onco‐histone (Antin et al., 2020; Jain et al., 2019).

3.1.5. Translational regulation by miniproteins

A major mode by which miniproteins regulate translation is by modulating mRNA stability. The NoBody miniprotein interacts with mRNA decapping proteins to remove the 5′ cap from mRNA transcripts, triggering nonsense‐mediated decay (D'Lima et al., 2017). Other miniproteins enhance mRNA stability. For instance, some miniproteins can sponge miRNA. This was demonstrated by Xu and coworkers, who identified the HOXB‐AS3 miniprotein, which sponges the tumor‐suppressive miR‐378a‐3p to upregulate lactate dehydrogenase‐A (LDHA) expression (Xu et al., 2021). Further, miniproteins can interact with m6A readers to strengthen target recognition, in turn bolstering mRNA stability and target translation. The onco‐peptide RBRP is a cancer‐relevant miniprotein that binds the m6A reader IGF2BP1 and strengthens its recognition of m6A on c‐Myc mRNA, contributing to c‐Myc overexpression (Zhu et al., 2020).

Miniproteins can also drive alternative splicing of mRNA transcripts. The SRSP miniprotein interacts with the splicing factor SRSF3 to promote alternative splicing of target transcripts, including SP4 and likely others (Meng et al., 2020). SRSP upregulation in colorectal cancer lends support to the potential role for transcript splicing in this disease.

It is worth noting that, while uORFs have well defined coding‐independent roles in translational regulation (Johnstone et al., 2016), the impact of miniproteins on mRNA translation remains more controversial. One challenge has been the lack of scalable high‐throughput technologies to address this question comprehensively, and single‐gene studies remain time‐consuming to complete.

3.2. Mechanisms of regulation by ncORFs

ncORFs are increasingly recognized as major translational regulators. Here, we focus on the coding‐independent regulatory roles of ncORFs, review how the intricate regulation of ncORFs allows them to finely tune translation, and summarize how the regulatory properties of ncORFs are hijacked during oncogenesis to promote cancer.

3.2.1. Regulation of main CDS translation by uORFs and dORFs

The most widely known model of translational regulation by an ncORF is through uORFs, which may negatively regulate the translation of their downstream (cognate) ORF (Johnstone et al., 2016). uORFs exert their suppressive effect by causing ribosome stalling and dissociation, and have the capacity to reduce main CDS translation substantially in some cases (Chen & Tarn, 2019). When uORFs are present, translation of the main CDS requires that ribosomes re‐initiate at the cognate CDS or otherwise scan past the uORFs through a process termed “leaky scanning” (Wright et al., 2022). This process itself is influenced by sequence features of the uORF that govern either a weaker or stronger ability to bind the ribosomal pre‐initiation complex (Chen & Tarn, 2019). Other crucial uORF properties that influence its impact on the cognate ORF include the length of the uORF and intercistronic sequences, uORF secondary structure, and the position of the uORF termination codon (Barbosa et al., 2013; Lin et al., 2019).
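The uORF properties listed above (uORF length, intercistronic distance, and the position of the termination codon relative to the main CDS) can be read directly from a transcript sequence. The following is a minimal sketch, not a published annotation method: the function name `find_uorfs`, the toy transcript, and the restriction to AUG start codons are all assumptions made for illustration, and real uORF calling would also consider non‐AUG starts and Ribo‐seq evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(transcript, cds_start):
    """List AUG-initiated ORFs whose start codon lies upstream of the main CDS.

    transcript: mRNA sequence in the DNA alphabet, 5'->3' (hypothetical input).
    cds_start:  0-based index of the first base of the main CDS start codon.
    """
    seq = transcript.upper()
    uorfs = []
    # Only consider start codons that lie entirely within the 5'UTR.
    for start in range(0, cds_start - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3  # index just past the stop codon
                uorfs.append({
                    "start": start,
                    "end": end,
                    "codons": (pos - start) // 3,  # sense codons, stop excluded
                    "overlaps_cds": end > cds_start,
                    "intercistronic_nt": max(cds_start - end, 0),
                })
                break
    return uorfs


# Toy transcript: 2 nt leader, a 2-codon uORF, a 5 nt spacer, then the main CDS.
toy_transcript = "GG" + "ATGAAATAG" + "CCCCC" + "ATGGCTTAA"
toy_uorfs = find_uorfs(toy_transcript, cds_start=16)
```

In this toy case the single uORF starts at index 2, ends at index 11, and leaves a 5 nt intercistronic spacer before the main CDS. An upstream AUG that is in frame with the CDS and has no intervening stop would be reported with `overlaps_cds` set to true, which corresponds to an N‐terminal extension or overlapping uORF and not a discrete uORF.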

Consequently, modifications of uORFs may modulate their regulatory activities. For example, m6A methylation sites may modulate ribosome scanning and start codon selection to favor ribosome binding to the canonical start site (Vasudevan et al., 2020), and Zhou et al. found that during the integrated stress response (ISR), m6A methylation of the 5′UTR of ATF4 promoted translation of the cognate CDS. There is additional evidence that, when uORF translation is highly favored, specialized translation factors are necessary to promote cognate CDS translation (Weber et al., 2022). Specifically, DAP5 interacts with eIF4A to modulate ribosome scanning and shift ribosomes toward the main CDS. Significantly, DAP5 translational targets are enriched for proteins involved in cell migration, proliferation, and development, including multiple proto‐oncogenes (Liberman et al., 2009). Moreover, uORFs can regulate mRNA stability by promoting ribosome reinitiation at a downstream AUG rather than an annotated start codon, resulting in an erroneously truncated protein isoform and consequent NMD (Arribere & Gilbert, 2013).

In addition, there is evidence that uORFs can promote the translation and stability of the main CDS, particularly as an adaptive response to stress conditions. In yeast, translation of the GCN4 transcription factor is largely controlled by four uORFs located in the GCN4 mRNA (Hinnebusch, 1997). Under non‐starved states, the uORFs downregulate GCN4 translation. However, under starved conditions, phosphorylation of eukaryotic translation initiation factor 2 (eIF2) causes a reduction in the eIF2·GTP·Met‐tRNAiMet ternary complex formation. This reduces the ability of the ribosome to bind to uORFs, resulting in the ribosome scanning through the uORFs and initiating translation at the main GCN4 reading frame instead (Hinnebusch, 1997). Work by Andreev and colleagues indicates that uORFs overcome eIF2 phosphorylation‐mediated translational downregulation in other contexts as well (Andreev et al., 2015). Here, the authors induced the eIF2 stress response and found that mRNAs resistant to eIF2 inhibition, with only one exception, contain a uORF that is highly translated under normal conditions (Andreev et al., 2015). These results suggest that uORFs play an important role in enhancing the translation of the main CDS upon eIF2 phosphorylation in the stress response.

Since ribosomes terminate at stop codons in the CDS, downstream ORFs (dORFs) are less prevalent in mammalian transcripts (Mudge et al., 2022) and therefore less studied. In some cases, dORFs may enhance main CDS translation. Using Ribo‐seq to identify dORFs, Wu and coworkers found that all mRNAs containing dORFs had higher translation efficiency, and that this effect was not influenced by dORF length or sequence (Wu et al., 2020). The mechanism underlying this enhancement of the main CDS translation is unresolved. The prevailing hypothesis is that ribosome recruitment through cap‐independent mechanisms (including IRES) drives dORF translational activity (Ruiz Cuevas et al., 2021), although further work is needed.

3.3. ncORF dysregulation and mutations in cancer

3.3.1. ncORF dysregulation is both a response to and driver of oncogenesis

Cancer is a disease of rapid and unchecked cell proliferation resulting from gene dysregulation via genetic and epigenetic mechanisms. These mechanisms may impact ncORFs by altering their transcription, translation, or both. Typically, in instances where ncORFs lie within a coding gene, transcription of the ncORF and main CDS are regulated by the same elements. Thus, if a given gene is aberrantly transcribed during oncogenesis, the associated ncORF is as well. Following transcription, modifications to ncRNA transcripts can further regulate their ultimate expression, often independently from the main CDS.

The N6‐methyladenosine (m6A) modification occurs on ncORFs, lncRNAs, and circRNAs (Ma et al., 2022). This modification functions in cap‐independent translation during the integrated stress response (ISR). In this context, eIF3 binds m6A residues to recruit the 43S ribosomal complex to initiate translation at these sites (Meyer et al., 2015). m6A modifications are prevalent in ncORFs, potentially explaining why their expression persists during ISR despite global reductions in protein translation. Given that ISR is frequently induced during tumorigenesis (Meyer, 2019; Tian et al., 2021), ncORF translation is plausibly enhanced in this context, though further work is needed in this area.

In addition to promoting the translation and activity of oncogenes while inhibiting tumor suppressors in cancer, ncORFs themselves are also dysregulated in cancer settings. Indeed, Erady and coworkers performed a pan‐cancer differential expression analysis of ncORF transcriptional levels using RNA‐sequencing data from The Cancer Genome Atlas (TCGA) and Genotype‐Tissue Expression (GTEx) and found that commonly expressed ncORF transcripts are often dysregulated across cancer types (Erady et al., 2021). This dysregulation of ncORF expression also had prognostic value, hinting at their potential as biomarkers. Finally, characterization of differentially expressed ncORFs revealed that many are enriched in sites amenable to post‐translational modifications, suggesting a major potential mechanism for their (dys)regulation. Aberrant ncORF expression in cancer is corroborated by other studies. For instance, Zhang and coworkers demonstrated that the Ras oncoprotein upregulates Orilnc1 in BRAF‐mutant cancers to further drive RAS/RAF activation (Zhang et al., 2017).

Importantly, the relationship between ncORF dysregulation and cancer is bidirectional: cancers can drive dysregulation of ncORFs, and ncORFs can act during tumor initiation to upregulate pro‐tumorigenic translation. This striking finding was demonstrated by Sendoel and coworkers using a SOX2‐inducible mouse model (Sendoel et al., 2017). SOX2 induction drove significant shifts in mRNA translation efficiency and ribosome occupancy, with enhanced occupancy at 5′UTRs. Further, they observed eIF2 repression and eIF2A de‐repression. eIF2A preferentially directs initiator‐tRNA to uORFs, driving increased uORF expression. Finally, examination of the uORFs upregulated upon SOX2 induction showed enrichment for uORFs whose downstream ORFs have cancer‐associated functions. Taken together, these findings strongly support a role for ncORFs in early stages of tumorigenesis (Sendoel et al., 2017).

3.3.2. Mutations of ncORFs and their relevance in cancer

The elucidation of functional miniproteins has motivated efforts to uncover cancer‐associated ncORF mutations. Conceptually, such mutations fall into two categories: cancer variants that impact existing ncORFs and those that generate novel ncORFs.

Mutations occurring in endogenous ncORFs can alter miniprotein translation or translation of the main CDS. As a general proof of concept, Whiffin and colleagues demonstrated that uORF point mutations that either generate start codons or premature termination codons are under strong negative selection, implying that uORF disruption by way of mutation can be deleterious and potentially disease causing (Whiffin et al., 2020). In a more cancer‐specific context, Occhi and coworkers identified a 4 bp deletion in an endogenous and highly conserved uORF within the 5′UTR of CDKN1B, a known tumor‐suppressor gene. The deletion results in an elongated miniprotein and shortens the intercistronic space, blocking ribosome reinitiation at the cognate CDS and ultimately suppressing CDKN1B expression (Occhi et al., 2013).
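The reasoning behind the CDKN1B example can be illustrated with a toy calculation: a 4 bp deletion inside a uORF shifts its reading frame, so the original stop codon is no longer read and translation continues to the next in‐frame stop, which both lengthens the uORF and shrinks the spacer before the main CDS. The sketch below uses an invented sequence and hypothetical helper names (`orf_end`, `delete`, `uorf_summary`); it does not reproduce the actual CDKN1B 5′UTR.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_end(seq, start):
    """Return the index just past the first in-frame stop after `start`, or None."""
    for pos in range(start + 3, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos + 3
    return None


def delete(seq, pos, length):
    """Remove `length` bases beginning at 0-based position `pos`."""
    return seq[:pos] + seq[pos + length:]


def uorf_summary(seq, uorf_start, cds_start):
    """Report uORF length (stop included) and the spacer before the main CDS."""
    end = orf_end(seq, uorf_start)
    return {"uorf_nt": end - uorf_start, "intercistronic_nt": cds_start - end}


# Invented reference: uORF (15 nt) + spacer (9 nt) + main CDS starting at index 24.
reference = "ATGGCCGCCGCCTAA" + "GGCTTGACC" + "ATGGCTTAA"
ref_summary = uorf_summary(reference, uorf_start=0, cds_start=24)

# A 4 bp deletion inside the uORF; the main CDS start shifts 4 nt upstream.
mutant = delete(reference, pos=3, length=4)
mut_summary = uorf_summary(mutant, uorf_start=0, cds_start=24 - 4)
```

In this toy case the uORF grows from 15 to 18 nt while the intercistronic spacer falls from 9 to 2 nt, the same qualitative pattern described for the CDKN1B deletion. Whether such a change actually blocks reinitiation depends on factors that this sketch does not model.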

Cancer mutations can also produce novel ncORFs that disrupt main CDS expression. For instance, Liu and coworkers characterized a 5′UTR point mutation that generates a novel AUG‐start codon to cause a melanoma‐predisposing CDKN2A loss of function mutation (Liu et al., 1999). Likewise, a mutation in the NF2 5′UTR that produces a novel upstream overlapping ORF was shown to produce NF2 loss of function to drive neurofibromatosis (Whiffin et al., 2020). These examples notwithstanding, there are few known cancer‐driving mutations in ncORFs. Myriad factors account for this under‐discovery. Principally, cancer‐variants are typically identified using exome sequencing. Inherently, the majority of ncORF variants—which reside in flanking regions or outside exons—are missed by this approach (Bailey et al., 2020). Capturing these ncORFs instead requires the use of whole‐genome sequencing.
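As a rough illustration of how a 5′UTR variant can be screened for a newly created upstream start codon, the sketch below substitutes a single base and checks the three overlapping trinucleotide windows for a gained ATG. The function name, the toy sequence, and the assumption that the supplied 5′UTR ends immediately before the main CDS start codon are all hypothetical; a full assessment would also locate the downstream stop codon and evaluate the Kozak context.

```python
def creates_upstream_atg(utr5, pos, alt):
    """Return new ATG sites created by substituting `alt` at 0-based `pos`.

    utr5 is assumed to end immediately before the main CDS start codon, so
    the CDS begins at index len(utr5).
    """
    ref = utr5.upper()
    mut = ref[:pos] + alt.upper() + ref[pos + 1:]
    hits = []
    # Only the three windows that contain the substituted base can change.
    for start in range(max(pos - 2, 0), min(pos, len(mut) - 3) + 1):
        if mut[start:start + 3] == "ATG" and ref[start:start + 3] != "ATG":
            # In frame if the distance to the CDS start is a multiple of 3.
            in_frame = (len(mut) - start) % 3 == 0
            hits.append({"start": start, "in_frame_with_cds": in_frame})
    return hits


# Toy 5'UTR in which a C>T change at index 4 converts ACG to ATG.
toy_utr = "GCCACGGCC"
gain = creates_upstream_atg(toy_utr, pos=4, alt="T")
no_gain = creates_upstream_atg(toy_utr, pos=0, alt="A")
```

Here the substitution creates an ATG at index 3 that is in frame with the main CDS, whereas the control substitution at index 0 creates none. An out‐of‐frame gain would be the starting point for an upstream overlapping ORF of the kind described for NF2, provided no stop codon intervenes before the CDS.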

The International Cancer Genome Consortium (ICGC) has undertaken a massive effort to identify unique cancer driver mutations using whole‐genome sequencing, integrating their own sequencing data with that from the Pan‐Cancer Analysis of Whole Genomes (PCAWG) and The Cancer Genome Atlas (TCGA) (ICGC/TCGA Pan‐Cancer Analysis of Whole Genomes Consortium, 2020). Leveraging this combined dataset, Rheinbay and colleagues analyzed whole‐genome sequencing data from 2658 samples to define cancer driver mutations in non‐coding genomic regions (Rheinbay et al., 2020). This analysis identified non‐coding driver mutations affecting TP53, NFKBIZ, TOB1, and BRD4 (among other protein‐coding genes). The authors also found that cancer‐driving mutations in non‐coding regions are far rarer than those in coding regions. Nonetheless, since the non‐coding genome vastly exceeds the coding genome, the total cancer burden of non‐coding mutations is likely still significant (Schipper & Posthuma, 2022). Crucially, the Rheinbay study focused on variants near transcriptional start sites, an approach that is not well suited to detecting ncORF variants. In a later attempt to better annotate cancer drivers in non‐coding regions, Dietlein and coworkers developed a computational method that stratifies variants based on their genomic region. This accounts for the varied likelihood that a mutation is cancer‐driving based on genomic context (Dietlein et al., 2022). Through this novel approach, they identified significantly more non‐coding driver mutations relative to findings from other studies. This suggests that as the community continues to develop improved computational approaches, more ncORF driver mutations will be detected.

Even as methods to detect ncORF variants improve, evaluating the functional effects of ncORF mutations poses an additional challenge. Most methods to analyze the functional effects of a mutation utilize conservation of the affected amino acid residue. Specifically, variant annotation methods assess whether a mutation occurs at a conserved residue such that it would drastically impact protein structure (e.g., by altering key side chains, changing the isoelectric point, etc.). For young or de novo ORFs, evolutionary conservation analysis cannot reliably predict the effects of an ncORF variant on its coding or regulatory function (Vakirlis et al., 2022). Given that ncORFs are generally non‐conserved, new approaches are needed to better scrutinize the effects of variants.

There are further difficulties in computationally mining for cancer‐associated ncORF mutations. Statistical models identify cancer‐driving mutations based on recurrence in coding regions only (Cibulskis et al., 2013). As a result, these models cannot be applied to the classification of somatic mutations in ncORFs or non‐coding regions. Efforts to develop new analytical approaches to identify somatic mutations outside of coding regions are ongoing (Dietlein et al., 2022).

Given the multi‐faceted role that ncORFs play in translational regulation and cell signaling, it is likely that there are cancer‐relevant mutations in ncORFs. However, their elucidation is complicated by a lack of computational approaches to facilitate their detection and analyze their functional relevance. The development of annotation methods that do not rely on evolutionary conservation will be crucial to these efforts.

3.3.3. Cancer‐specific ncORF annotations

An overarching challenge for ncORFs remains their inconsistent annotation in human genome databases. To this end, an international collaborative effort is working to standardize the reference annotation database for ncORFs (Mudge et al., 2022). Yet, reference gene annotations are, by definition, designed for the reference human genome, rather than disease‐specific genomes. Beyond employing copy number variation and translocations that may physically disrupt the genome, cancers also deregulate normally quiescent regions of the genome. Cancer‐specific transcriptomes are therefore well‐established (Hu et al., 2022); many RNA transcripts are induced uniquely in certain cancer states and those transcripts are not annotated in the reference human genome. Likewise, cancer‐specific ncORF translations are an emerging topic (Vibert et al., 2022) and disease‐specific ncORFs are likely to reflect the activity of oncogenic drivers.

Ongoing research into ncORFs uniquely expressed in cancer promises to elucidate new mechanisms underlying oncogenesis and open therapeutic avenues. Molecular studies of cancer‐unique ncORFs and their potential functional products may uncover new signaling and regulatory events that drive tumor progression. Clinically, these ncORFs have promise as biomarkers or as tumor specific antigens.

3.4. Miniproteins and the immunopeptidome

Tumor immunosurveillance relies heavily on CD8+ T‐cell recognition of tumor cells, largely mediated by the MHC‐I system, wherein MHC‐I molecules present antigens of cytosolic origin to CD8+ cells. Antigen presentation occurs when endogenous proteins are degraded into small fragments (8–10aa) and subsequently bind to the peptide‐binding site of MHC‐I (Hewitt, 2003). Cancer cells, due to the expression of proteins bearing non‐synonymous mutations, often present aberrant peptide fragments, or tumor‐associated neoantigens. Recognition of these neoantigens by CD8+ cells facilitates the T‐cell mediated antitumor response (Smith et al., 2019).
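To make the 8–10aa fragment size concrete, the sketch below enumerates every contiguous 8‐, 9‐, and 10‐residue window of a protein sequence, which is a common first step before any binding prediction. The function name and the 12‐residue toy sequence are hypothetical, and the sketch makes no claim about which windows would actually be processed, bind MHC‐I, or be presented.

```python
def candidate_mhc1_peptides(protein, min_len=8, max_len=10):
    """Enumerate all contiguous windows of min_len..max_len residues."""
    peptides = []
    for k in range(min_len, max_len + 1):
        for i in range(len(protein) - k + 1):
            peptides.append(protein[i:i + k])
    return peptides


# Invented 12-residue miniprotein used only to show the window counts.
toy_miniprotein = "MKTLLVAGSPQR"
windows = candidate_mhc1_peptides(toy_miniprotein)
```

For a sequence of length n, the number of windows of length k is n − k + 1, so this 12‐residue example yields 5 + 4 + 3 = 12 candidate fragments. Even a short miniprotein therefore contributes a non‐trivial number of candidate peptides.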

The “immunopeptidome” describes the set of peptides presented by MHC molecules (Yewdell, 2022). Intriguingly, the immunopeptidome is not representative of the proteome, meaning that the abundance of an antigenic peptide does not necessarily correlate with its protein abundance (Dersh et al., 2021). In fact, miniproteins—despite comprising only a small fraction of translated proteins—make up a relatively higher proportion of the MHC‐I immunopeptidome, approximately 7.5% (Yewdell, 2022). Consistent with this finding, non‐coding ORFs generate MHC‐I associated peptides (MAPs) at a five‐fold higher efficiency compared to canonical transcripts (Ruiz Cuevas et al., 2021).

Since cancer‐associated genetic alterations can drive upregulated expression of miniproteins (Ruiz Cuevas et al., 2021), tumor cells may display antigens deriving from miniproteins at a higher rate than non‐tumor cells. Indeed, Ouspenskaia and coworkers identified cancer‐enriched ncORF MAPs in glioblastoma and melanomas (Ouspenskaia et al., 2022). Relatedly, Chong et al. identified an immunogenic tumor‐specific antigen derived from the dORF of the ABCB5 gene, which is pro‐oncogenic and differentially expressed in melanoma cell lines (Chong et al., 2020). Their finding suggests that upregulated cancer‐promoting genes containing ncORFs may also drive increased tumor antigen production. Importantly, the function and stability of a miniprotein are irrelevant to its efficiency in generating MAPs. As a whole, ncORFs represent an understudied source of tumor antigens that could be leveraged in immunotherapeutic applications, including for the development of cancer vaccines.

4. CONCLUSION

Studies of ncORFs are poised to reveal new dimensions in cancer biology. Collectively, ncORFs may reveal new aspects of cancer pathogenesis, both through gene regulation as well as through the production of proteins or peptide products. To deliver on this promise, the set of computational and experimental tools to find and functionally validate ncORFs and their miniprotein products requires ongoing expansion and improvement. While current identification of ncORFs mainly relies on Ribo‐seq and mass spectrometry, new computational methods (Nabi et al., 2023) potentiate higher‐confidence and higher‐throughput discovery. For instance, the TIS Transformer program, which uses artificial intelligence to map the human genome, was able to detect ncORFs and predict those that encode miniproteins with high performance (Clauwaert et al., 2023).

Experimental functional validation efforts will benefit from new approaches such as CRISPR/Cas9 based screening (Prensner et al., 2021), bio‐orthogonal non‐canonical amino acid tagging (BONCAT) (Cao et al., 2023), proximity biotinylation to determine miniprotein subcellular localization (Na et al., 2022), and use of AlphaFold for structural predictions (Perrakis & Sixma, 2021). Recent technologies that adapt DNA‐sequencing approaches for high‐throughput protein analysis (Layton et al., 2019; Yu et al., 2023) may also accelerate miniprotein characterization.

While specific ncORF‐directed therapeutics remain a future goal, active work in this area lends credibility to the hope for new disease‐relevant therapeutic insights through the study of ncORFs (Figure 4). Cancer‐ or tissue‐specific ncORFs are attractive therapeutic targets to disrupt downstream translation of undruggable oncoproteins with greater selectivity than drugging the relevant transcription factors. Miniproteins themselves may also be utilized as biomarkers or in drug design. For instance, miniproteins that scaffold protein complexes could be used in molecular glues. Miniprotein‐based drugs are not simply theoretical: the miniprotein Myc inhibitor, Omomyc, recently advanced through Phase I clinical trials (Llombart & Mansour, 2022; Peptomyc, n.d.). Finally, the immuno‐oncology field is actively exploring the use of miniproteins as tumor‐specific neoantigens. While these therapeutic applications will require additional investigation, they highlight the exciting translational potential of ncORFs.

FIGURE 4.

Clinical applications of miniproteins, ranging from drug development to biomarkers. Cancer cells are enriched for ncORF MHC‐I associated peptides (MAPs), suggesting that they could serve as a source of tumor antigens for immunotherapeutic applications. Additionally, pharmacological targeting of cancer‐specific ncRNAs to silence pro‐oncogenic miniprotein translation could enable downstream inhibition of undruggable targets. Miniproteins can also be used to inform drug design. Miniprotein characterization could be used to identify protein–protein interaction hotspots or inhibitor binding sites. Miniprotein conjugation to other drugs could be explored for potential application in molecular glues or heterobifunctional compounds. Miniproteins are also potentially suitable as non‐antibody binding scaffolds. Finally, differential expression of ncORFs and miniproteins in cancer supports their use as biomarkers, especially in cases of differentially expressed secreted miniproteins. Figure created with biorender.com.

Meanwhile, the fundamental question of which ncORFs operate as cancer drivers compared to bystanders in oncogenesis remains unanswered, and will require substantial efforts by the cancer research community. Other essential aspects of ncORF biology also require further elucidation—for instance, it remains unknown whether specific ribosome binding proteins are associated with ncORFs and their regulation. Similarly, greater mechanistic understanding of the translation initiation of ncORFs is still needed. Much of the dark proteome remains shrouded in mystery; like all great unknowns, unraveling ncORF and miniprotein biology offers to enrich our understanding of the complexities underlying cell signaling and regulation while also engendering new therapeutic strategies and targets.

AUTHOR CONTRIBUTIONS

Zoe Posner: Conceptualization; Data Curation; Methodology; Visualization; Writing – Original Draft Preparation; Writing – Review & Editing.

Ian Yannuzzi: Conceptualization; Data Curation; Methodology; Visualization; Writing – Original Draft Preparation; Writing – Review & Editing.

John R. Prensner: Conceptualization; Funding Acquisition; Supervision; Writing – Review & Editing.

FUNDING INFORMATION

John R. Prensner acknowledges funding from the National Institutes of Health/National Cancer Institute (K08‐CA263552‐01A1), the Alex's Lemonade Stand Foundation Young Investigator Award (#21‐23983), the St. Baldrick's Foundation Scholar Award (#931638), The DIPG/DMG Research Funding Alliance, and a Collaborative Pediatric Cancer Research Awards Program/Kids Join the Fight award (#22FN23).

CONFLICT OF INTEREST STATEMENT

The authors declare no competing interests.

ACKNOWLEDGMENTS

We would like to acknowledge members of the Golub lab at the Broad Institute of MIT and Harvard for helpful conversations and discussions. We acknowledge www.biorender.com for the generation of figures.

Posner Z, Yannuzzi I, Prensner JR. Shining a light on the dark proteome: Non‐canonical open reading frames and their encoded miniproteins as a new frontier in cancer biology. Protein Science. 2023;32(8):e4708. 10.1002/pro.4708

Review Editor: Aitziber L. Cortajarena

REFERENCES

  1. Alioto T. Gene prediction. Methods Mol Biol. 2012;855:175–201.
  2. Anderson DM, Anderson KM, Chang C‐L, Makarewich CA, Nelson BR, McAnally JR, et al. A micropeptide encoded by a putative long noncoding RNA regulates muscle performance. Cell. 2015;160:595–606. 10.1016/j.cell.2015.01.009
  3. Andreev DE, Loughran G, Fedorova AD, Mikhaylova MS, Shatsky IN, Baranov PV. Non‐AUG translation initiation in mammals. Genome Biol. 2022;23(1):111.
  4. Andreev DE, O'Connor PBF, Fahey C, Kenny EM, Terenin IM, Dmitriev SE, et al. Translation of 5′ leaders is pervasive in genes resistant to eIF2 repression. Elife. 2015;4:e03971.
  5. Antin C, Tauziède‐Espariat A, Debily M‐A, Castel D, Grill J, Pagès M, et al. EZHIP is a specific diagnostic biomarker for posterior fossa ependymomas, group PFA and diffuse midline gliomas H3‐WT with EZHIP overexpression. Acta Neuropathol Commun. 2020;8:183. 10.1186/s40478-020-01056-8
  6. Arribere JA, Gilbert WV. Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing. Genome Res. 2013;23(6):977–87.
  7. Autio KJ, Kastaniotis AJ, Pospiech H, Miinalainen IJ, Schonauer MS, Dieckmann CL, et al. An ancient genetic link between vertebrate mitochondrial fatty acid synthesis and RNA processing. FASEB J. 2008;22(2):569–78.
  8. Badger JH, Olsen GJ. CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol. 1999;16(4):512–24.
  9. Bailey MH, Meyerson WU, Dursi LJ, Wang L‐B, Dong G, Liang W‐W, et al. Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples. Nat Commun. 2020;11(1):4748.
  10. Barbosa C, Peixeiro I, Romão L. Gene expression regulation by upstream open reading frames and human disease. PLoS Genet. 2013;9(8):e1003529.
  11. Beadle GW, Tatum EL. Genetic control of biochemical reactions in Neurospora. Proc Natl Acad Sci. 1941;27:499–506. 10.1073/pnas.27.11.499
  12. Bell J, Haldane JBS. The linkage between the genes for colour‐blindness and haemophilia in man. Proc Biol Sci R Soc. 1937;123(831):119–50.
  13. Boix O, Martinez M, Vidal S, Giménez‐Alejandre M, Palenzuela L, Lorenzo‐Sanz L, et al. pTINCR microprotein promotes epithelial differentiation and suppresses tumor growth through CDC42 SUMOylation and activation. Nat Commun. 2022;13(1):6840.
  14. Borsani G, Tonlorenzi R, Simmler MC, Dandolo L, Arnaud D, Capra V, et al. Characterization of a murine gene expressed from the inactive X chromosome. Nature. 1991;351(6324):325–9.
  15. Brunner A‐D, Thielert M, Vasilopoulou C, Ammar C, Coscia F, Mund A, et al. Ultra‐high sensitivity mass spectrometry quantifies single‐cell proteome changes upon perturbation. Mol Syst Biol. 2022;18(3):e10798.
  16. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997;268(1):78–94.
  17. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon‐Vega B, Regev A, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011;25(18):1915–27.
  18. Calviello L, Ohler U. Beyond read‐counts: ribo‐seq data analysis to understand the functions of the transcriptome. Trends Genet. 2017;33(10):728–44.
  19. Cao X, Chen Y, Khitun A, Slavoff SA. BONCAT‐based profiling of nascent small and alternative open reading frame‐encoded proteins. Bio‐Protocol. 2023;13(1):e4585.
  20. Chen J, Brunner A‐D, Cogan JZ, Nuñez JK, Fields AP, Adamson B, et al. Pervasive functional translation of noncanonical human open reading frames. Science. 2020;367(6482):1140–6.
  21. Chen H‐H, Tarn W‐Y. uORF‐mediated translational control: recently elucidated mechanisms and implications in cancer. RNA Biol. 2019;16(10):1327–38.
  22. Chong C, Müller M, Pak H, Harnett D, Huber F, Grun D, et al. Integrated proteogenomic deep sequencing and analytics accurately identify non‐canonical peptides in tumor immunopeptidomes. Nat Commun. 2020;11(1):1293.
  23. Chothani SP, Adami E, Widjaja AA, Langley SR, Viswanathan S, Pua CJ, et al. A high‐resolution map of human RNA translation. Mol Cell. 2022;82(15):2885–99.e8.
  24. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol. 2013;31(3):213–9.
  25. Clauwaert J, Gupta R, McVey Z, Menschaert G. TIS transformer: remapping the human proteome using deep learning. NAR Genom Bioinform. 2023;5(1):lqad021. 10.1093/nargab/lqad021
  26. Dersh D, Hollý J, Yewdell JW. A few good peptides: MHC class I‐based cancer immunosurveillance and immunoevasion. Nat Rev Immunol. 2021;21(2):116–28.
  27. Dietlein F, Wang AB, Fagre C, Tang A, Besselink NJM, Cuppen E, et al. Genome‐wide analysis of somatic noncoding mutation patterns in cancer. Science. 2022;376(6589):eabg5601.
  28. Dinger ME, Pang KC, Mercer TR, Mattick JS. Differentiating protein‐coding and noncoding RNA: challenges and ambiguities. PLoS Comput Biol. 2008;4(11):e1000176.
  29. D'Lima NG, Ma J, Winkler L, Chu Q, Loh KH, Corpuz EO, et al. A human microprotein that interacts with the mRNA decapping complex. Nat Chem Biol. 2017;13(2):174–80.
  30. Do JH, Choi D‐K. Computational approaches to gene prediction. J Microbiol. 2006;44(2):137–44.
  31. ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447(7146):799–816.
  32. Erady C, Boxall A, Puntambekar S, Suhas Jagannathan N, Chauhan R, Chong D, et al. Pan‐cancer analysis of transcripts encoding novel open‐reading frames (nORFs) and their potential biological functions. NPJ Genom Med. 2021;6(1):4.
  33. Farber R, Lapedes A, Sirotkin K. Determination of eukaryotic protein coding regions using neural networks and information theory. J Mol Biol. 1992;226(2):471–9.
  34. Fedorova AD, Kiniry SJ, Andreev DE, Mudge JM, Baranov PV. Thousands of human non‐AUG extended proteoforms lack evidence of evolutionary selection among mammals. Nat Commun. 2022;13(1):7910.
  35. Frith MC, Forrest AR, Nourbakhsh E, Pang KC, Kai C, Kawai J, et al. The abundance of short proteins in the mammalian proteome. PLoS Genet. 2006;2(4):e52.
  36. Galindo MI, Pueyo JI, Fouix S, Bishop SA, Couso JP. Peptides encoded by short ORFs control development and define a new eukaryotic gene family. PLoS Biol. 2007;5(5):e106.
  37. Gao X, Xia X, Li F, Zhang M, Zhou H, Wu X, et al. Circular RNA‐encoded oncogenic E‐cadherin variant promotes glioblastoma tumorigenicity through activation of EGFR–STAT3 signalling. Nat Cell Biol. 2021;23(3):278–91.
  38. Goel N, Singh S, Aseri TC. A comparative analysis of soft computing techniques for gene prediction. Anal Biochem. 2013;438(1):14–21.
  39. Guo B, Wu S, Zhu X, Zhang L, Deng J, Li F, et al. Micropeptide CIP2A‐BP encoded by LINC00665 inhibits triple‐negative breast cancer progression. EMBO J. 2020;39:e102190. 10.15252/embj.2019102190
  40. Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, et al. Chromatin signature reveals over a thousand highly conserved large non‐coding RNAs in mammals. Nature. 2009;458(7235):223–7.
  41. Hanada K, Akiyama K, Sakurai T, Toyoda T, Shinozaki K, Shiu S‐H. sORF finder: a program package to identify small open reading frames with high coding potential. Bioinformatics. 2010;26(3):399–400. [DOI] [PubMed] [Google Scholar]
  42. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. 2012;22(9):1760–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Hartford CCR, Lal A. When long noncoding becomes protein coding. Mol Cell Biol. 2020;40(6). 10.1128/MCB.00528-19 [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Hemm MR, Paul BJ, Schneider TD, Storz G, Rudd KE. Small membrane proteins found by comparative genomics and ribosome binding site models. Mol Microbiol. 2008;70(6):1487–501. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Hewitt EW. The MHC class I antigen presentation pathway: strategies for viral immune evasion. Immunology. 2003;110(2):163–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Hinnebusch AG. Translational regulation of yeast GCN4. A window on factors that control initiator‐trna binding to the ribosome. J Biol Chem. 1997;272(35):21661–4. [DOI] [PubMed] [Google Scholar]
  47. Ho L, Tan SYX, Wee S, Wu Y, Tan SJC, Ramakrishna NB, et al. ELABELA is an endogenous growth factor that sustains hESC self‐renewal via the PI3K/AKT pathway. Cell Stem Cell. 2015;17(4):435–47. [DOI] [PubMed] [Google Scholar]
  48. Hsu PY, Calviello L, Wu H‐YL, Li F‐W, Rothfels CJ, Ohler U, et al. Super‐resolution ribosome profiling reveals unannotated translation events in Arabidopsis. Proc Natl Acad Sci U S A. 2016;113(45):E7126–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Hu F, Lu J, Munoz MD, Saveliev A, Turner M. ORFLine: a bioinformatic pipeline to prioritise small open reading frames identifies candidate secreted small proteins from lymphocytes. Bioinformatics. 2022;38(9):2673–3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Hu W, Wu Y, Shi Q, Wu J, Kong D, Wu X, et al. Systematic characterization of cancer transcriptome at transcript resolution. Nat Commun. 2022;13(1):6803. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Huang Q, Chen Z, Cheng P, Jiang Z, Wang Z, Huang Y, et al. LYRM2 directly regulates complex I activity to support tumor growth in colorectal cancer by oxidative phosphorylation. Cancer Lett. 2019;455:36–47. [DOI] [PubMed] [Google Scholar]
  52. ICGC/TCGA Pan‐Cancer Analysis of Whole Genomes Consortium . Pan‐cancer analysis of whole genomes. Nature. 2020;578(7793):82–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Inagaki S, Numata K, Kondo T, Tomita M, Yasuda K, Kanai A, et al. Identification and expression analysis of putative mRNA‐like non‐coding RNA in drosophila. Genes Cells. 2005;10(12):1163–73. [DOI] [PubMed] [Google Scholar]
  54. Ingolia NT, Ghaemmaghami S, Newman JRS, Weissman JS. Genome‐wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 2009;324(5924):218–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011;147(4):789–802. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. International Human Genome Sequencing Consortium . Finishing the euchromatic sequence of the human genome. Nature. 2004;431(7011):931–45. [DOI] [PubMed] [Google Scholar]
  57. Jain SU, Do TJ, Lund PJ, Rashoff AQ, Diehl KL, Cieslik M, et al. PFA ependymoma‐associated protein EZHIP inhibits PRC2 activity through a H3 K27M‐like mechanism. Nat Commun. 2019;10(1):2146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Johnstone TG, Bazzini AA, Giraldez AJ. Upstream ORF s are prevalent translational repressors in vertebrates. EMBO J. 2016;35:706–23. 10.15252/embj.201592759 [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Kastenmayer JP, Ni L, Chu A, Kitchen LE, Au W‐C, Yang H, et al. Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Res. 2006;16(3):365–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Kiniry SJ, Michel AM, Baranov PV. Computational methods for ribosome profiling data analysis. Wiley Interdiscip Rev RNA. 2020;11(3):e1577. [DOI] [PubMed] [Google Scholar]
  61. Klasberg S, Bitard‐Feildel T, Mallet L. Computational identification of novel genes: current and future perspectives. Bioinform Biol Insights. 2016;10:121–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Koh M, Ahmad I, Ko Y, Zhang Y, Martinez TF, Diedrich JK, et al. A short ORF‐encoded transcriptional regulator. Proc Natl Acad Sci U S A. 2021;118(4). 10.1073/pnas.2021943118 [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Kondo T, Hashimoto Y, Kato K, Inagaki S, Hayashi S, Kageyama Y. Small peptide regulators of actin‐based cell morphogenesis encoded by a polycistronic mRNA. Nat Cell Biol. 2007;9(6):660–5. [DOI] [PubMed] [Google Scholar]
  64. Kozak M. Compilation and analysis of sequences upstream from the translational start site in eukaryotic mRNAs. Nucleic Acids Res. 1984;12:857–72. 10.1093/nar/12.2.857 [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Kozak M. An analysis of 5′‐noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res. 1987;15(20):8125–48. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Kozak M. The scanning model for translation: an update. J Cell Biol. 1989;108:229–41. 10.1083/jcb.108.2.229 [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860–921. [DOI] [PubMed] [Google Scholar]
  68. Lapedes A, Barnes C, Burks C, Farber R, Sirotkin K. Application of neural networks and other machine learning algorithms to DNA sequence analysis. Computers and DNA. 1st ed. New York, NY, USA: Routledge; 1990. 10.4324/9780429501463-15 [DOI] [Google Scholar]
  69. Layton CJ, McMahon PL, Greenleaf WJ. Large‐scale, quantitative protein assays on a high‐throughput DNA sequencing chip. Mol Cell. 2019;73(5):1075–1082.e4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Lee S, Liu B, Lee S, Huang S‐X, Shen B, Qian S‐B. Global mapping of translation initiation sites in mammalian cells at single‐nucleotide resolution. Proc Natl Acad Sci U S A. 2012;109(37):E2424–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Liang P, Pardee AB. Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science. 1992;257(5072):967–71. [DOI] [PubMed] [Google Scholar]
  72. Liang C, Zhang S, Robinson D, Ploeg MV, Wilson R, Nah J, et al. Mitochondrial microproteins link metabolic cues to respiratory chain biogenesis. Cell Rep. 2022;40(7):111204. [DOI] [PubMed] [Google Scholar]
  73. Liberman N, Marash L, Kimchi A. The translation initiation factor DAP5 is a regulator of cell survival during mitosis. Cell Cycle. 2009;8:204–9. 10.4161/cc.8.2.7384 [DOI] [PubMed] [Google Scholar]
  74. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non‐coding regions. Bioinformatics. 2011;27(13):i275–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Lin Y, May GE, Kready H, Nazzaro L, Mao M, Spealman P, et al. Impacts of uORF codon identity and position on translation regulation. Nucleic Acids Res. 2019;47(17):9358–67. [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Liu L, Dilworth D, Gao L, Monzon J, Summers A, Lassam N, et al. Mutation of the CDKN2A 5’ UTR creates an aberrant initiation codon and predisposes to melanoma. Nat Genet. 1999;21(1):128–32. [DOI] [PubMed] [Google Scholar]
  77. Liu Y, Zeng S, Wu M. Novel insights into noncanonical open reading frames in cancer. Biochim Biophys Acta Rev Cancer. 2022;1877(4):188755. [DOI] [PubMed] [Google Scholar]
  78. Llombart V, Mansour MR. Therapeutic targeting of “undruggable” MYC. EBioMedicine. 2022;75:103756. [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Ma J, Diedrich JK, Jungreis I, Donaldson C, Vaughan J, Kellis M, et al. Improved identification and analysis of small open reading frame encoded polypeptides. Anal Chem. 2016;88(7):3967–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Ma J, Ward CC, Jungreis I, Slavoff SA, Schwaid AG, Neveu J, et al. Discovery of human sORF‐encoded polypeptides (SEPs) in cell lines and tissue. J Proteome Res. 2014;13:1757–65. 10.1021/pr401280w [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Ma M, Ye T, Wang J, Zhao H, Zhang S, Li P, et al. N6‐methyladenosine modification of noncoding RNAs: mechanisms and clinical applications in cancer. Diagnostics (Basel, Switzerland). 2022;12(12):2996. 10.3390/diagnostics12122996 [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Ma Y, Yue Y, Ma Y, Zhang Q, Zhou Q, Song Y, et al. Structural basis for apelin control of the human apelin receptor. Structure. 2017;25(6):858–866.e4. [DOI] [PubMed] [Google Scholar]
  83. Mackowiak SD, Zauber H, Bielow C, Thiel D, Kutz K, Calviello L, et al. Extensive identification and analysis of conserved small ORFs in animals. Genome Biol. 2015;16:179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Matsumoto A, Clohessy JG, Pandolfi PP. SPAR, a lncRNA encoded mTORC1 inhibitor. Cell Cycle. 2017;16:815–6. 10.1080/15384101.2017.1304735 [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Meng N, Chen M, Chen D, Chen X‐H, Wang J‐Z, Zhu S, et al. Small protein hidden in lncRNA LOC90024 promotes “cancerous” RNA splicing and tumorigenesis. Adv Sci. 2020;7(10):1903233. [DOI] [PMC free article] [PubMed] [Google Scholar]
  86. Meyer KD. m6A‐mediated translation regulation. Biochim Biophys Acta Gene Regul Mech. 2019;1862(3):301–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Meyer KD, Patil DP, Zhou J, Zinoviev A, Skabkin MA, Elemento O, et al. 5’ UTR m(6)A promotes cap‐independent translation. Cell. 2015;163(4):999–1010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Min K‐W, Davila S, Zealy RW, Lloyd LT, Lee IY, Lee R, et al. eIF4E phosphorylation by MST1 reduces translation of a subset of mRNAs, but increases lncRNA translation. Biochim Biophys Acta Gene Regul Mech. 2017;1860(7):761–72. [DOI] [PubMed] [Google Scholar]
  89. Mohaupt P, Roucou X, Delaby C, Vialaret J, Lehmann S, Hirtz C. The alternative proteome in neurobiology. Front Cell Neurosci. 2022;16:1019680. [DOI] [PMC free article] [PubMed] [Google Scholar]
  90. Mouse Genome Sequencing Consortium , Waterston RH, Lindblad‐Toh K, Birney E, Rogers J, Abril JF, et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–62. 10.1038/nature01262 [DOI] [PubMed] [Google Scholar]
  91. Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright JC, Kay M, et al. Discovery of high‐confidence human protein‐coding genes and exons by whole‐genome PhyloCSF helps elucidate 118 GWAS loci. Genome Res. 2019;29(12):2073–87. [DOI] [PMC free article] [PubMed] [Google Scholar]
  92. Mudge JM, Ruiz‐Orera J, Prensner JR, Brunet MA, Calvet F, Jungreis I, et al. Standardized annotation of translated open reading frames. Nat Biotechnol. 2022;40(7):994–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  93. Na Z, Dai X, Zheng S‐J, Bryant CJ, Loh KH, Su H, et al. Mapping subcellular localizations of unannotated microproteins and alternative proteins with MicroID. Mol Cell. 2022;82(15):2900–2911.e7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  94. Nabi A, Dilekoglu B, Adebali O, Tastan O. Discovering misannotated lncRNAs using deep learning training dynamics. Bioinformatics. 2023;39(1). 10.1093/bioinformatics/btac821 [DOI] [PMC free article] [PubMed] [Google Scholar]
  95. Nayak A, Kapur A, Barroilhet L, Patankar M. Oxidative phosphorylation: a target for novel therapeutic strategies against ovarian cancer. Cancer. 2018;10:337. 10.3390/cancers10090337 [DOI] [PMC free article] [PubMed] [Google Scholar]
  96. Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods. 2014;11(11):1114–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  97. Nirenberg MW. The genetic code: II. Sci Am. 1963;208:80–95. 10.1038/scientificamerican0363-80 [DOI] [PubMed] [Google Scholar]
  98. Occhi G, Regazzo D, Trivellin G, Boaretto F, Ciato D, Bobisse S, et al. A novel mutation in the upstream open reading frame of the CDKN1B gene causes a MEN4 phenotype. PLoS Genet. 2013;9(3):e1003350. [DOI] [PMC free article] [PubMed] [Google Scholar]
  99. Orr MW, Mao Y, Storz G, Qian S‐B. Alternative ORFs and small ORFs: shedding light on the dark proteome. Nucleic Acids Res. 2020;48(3):1029–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
  100. Ouelle DE, Zindy F, Ashmun RA, Sherr CJ. Alternative reading frames of the INK4a tumor suppressor gene encode two unrelated proteins capable of inducing cell cycle arrest. Cell. 1995;83:993–1000. 10.1016/0092-8674(95)90214-7 [DOI] [PubMed] [Google Scholar]
  101. Ouspenskaia T, Law T, Clauser KR, Klaeger S, Sarkizova S, Aguet F, et al. Unannotated proteins expand the MHC‐I‐restricted immunopeptidome in cancer. Nat Biotechnol. 2022;40(2):209–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  102. Oyama M, Itagaki C, Hata H, Suzuki Y, Izumi T, Natsume T, et al. Analysis of small human proteins reveals the translation of upstream open reading frames of mRNAs. Genome Res. 2004;14(10B):2048–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
  103. Oyama M, Kozuka‐Hata H, Suzuki Y, Semba K, Yamamoto T, Sugano S. Diversity of translation start sites may define increased complexity of the human short ORFeome. Mol Cell Proteomics. 2007;6(6):1000–6. [DOI] [PubMed] [Google Scholar]
  104. Pang Y, Liu Z, Han H, Wang B, Li W, Mao C, et al. Peptide SMIM30 promotes HCC development by inducing SRC/YES1 membrane anchoring and MAPK pathway activation. J Hepatol. 2020;73(5):1155–69. [DOI] [PubMed] [Google Scholar]
  105. Pardini B, Calin GA. MicroRNAs and long non‐coding RNAs and their hormone‐like activities in cancer. Cancer. 2019;11(3):378. 10.3390/cancers11030378 [DOI] [PMC free article] [PubMed] [Google Scholar]
  106. Pauli A, Norris ML, Valen E, Chew G‐L, Gagnon JA, Zimmerman S, et al. Toddler: an embryonic signal that promotes cell movement via apelin receptors. Science. 2014;343(6172):1248636. [DOI] [PMC free article] [PubMed] [Google Scholar]
  107. Peptomyc SL. A phase 1/2 study to evaluate the safety, pharmacokinetics, and anti‐tumour activity of the MYC inhibitor OMO‐103 administered intravenously in patients with advanced solid tumours (Clinical Trial Registration No. NCT04808362). n.d.
  108. Perrakis A, Sixma TK. AI revolutions in biology: the joys and perils of AlphaFold. EMBO Rep. 2021;22(11):e54046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  109. Prensner JR, Enache OM, Luria V, Krug K, Clauser KR, Dempster JM, et al. Noncanonical open reading frames encode functional proteins essential for cancer cell survival. Nat Biotechnol. 2021;39:697–704. [DOI] [PMC free article] [PubMed] [Google Scholar]
  110. Rheinbay E, Nielsen MM, Abascal F, Wala JA, Shapira O, Tiao G, et al. Analyses of non‐coding somatic drivers in 2,658 cancer whole genomes. Nature. 2020;578(7793):102–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  111. Robson EB. The human gene map. Philos Trans R Soc Lond B Biol Sci. 1988;319(1194):229–37. [DOI] [PubMed] [Google Scholar]
  112. Rubin GM, Hong L, Brokstein P, Evans‐Holm M, Frise E, Stapleton M, et al. A drosophila complementary DNA resource. Science. 2000;287(5461):2222–4. [DOI] [PubMed] [Google Scholar]
  113. Ruggles KV, Krug K, Wang X, Clauser KR, Wang J, Payne SH, et al. Methods, tools and current perspectives in proteogenomics*. Mol Cell Proteomics. 2017;16(6):959–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
  114. Ruiz Cuevas MV, Hardy M‐P, Hollý J, Bonneil É, Durette C, Courcelles M, et al. Most non‐canonical proteins uniquely populate the proteome or immunopeptidome. Cell Rep. 2021;34(10):108815. [DOI] [PMC free article] [PubMed] [Google Scholar]
  115. Sandmann C‐L, Schulz JF, Ruiz‐Orera J, Kirchner M, Ziehm M, Adami E, et al. Evolutionary origins and interactomes of human, young microproteins and small peptides translated from short open reading frames. Mol Cell. 2023;83:994–1011.e18. 10.1016/j.molcel.2023.01.023 [DOI] [PMC free article] [PubMed] [Google Scholar]
  116. Schipper M, Posthuma D. Demystifying non‐coding GWAS variants: an overview of computational tools and methods. Hum Mol Genet. 2022;31(R1):R73–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
  117. Sendoel A, Dunn JG, Rodriguez EH, Naik S, Gomez NC, Hurwitz B, et al. Translation from unconventional 5′ start sites drives tumour initiation. Nature. 2017;541(7638):494–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  118. Senís E, Esgleas M, Najas S, Jiménez‐Sábado V, Bertani C, Giménez‐Alejandre M, et al. TUNAR lncRNA encodes a microprotein that regulates neural differentiation and neurite formation by modulating calcium dynamics. Front Cell Dev Biol. 2021;9:747667. [DOI] [PMC free article] [PubMed] [Google Scholar]
  119. Slavoff SA, Mitchell AJ, Schwaid AG, Cabili MN, Ma J, Levin JZ, et al. Peptidomic discovery of short open reading frame‐encoded peptides in human cells. Nat Chem Biol. 2012;9(1):59–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
  120. Smith CC, Selitsky SR, Chai S, Armistead PM, Vincent BG, Serody JS. Alternative tumour‐specific antigens. Nat Rev Cancer. 2019;19(8):465–78. [DOI] [PMC free article] [PubMed] [Google Scholar]
  121. Stallmeyer B, Drugeon G, Reiss J, Haenni AL, Mendel RR. Human molybdopterin synthase gene: identification of a bicistronic transcript with overlapping reading frames. Am J Hum Genet. 1999;64(3):698–705. [DOI] [PMC free article] [PubMed] [Google Scholar]
  122. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34:W435–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  123. Steitz JA. Polypeptide chain initiation: nucleotide sequences of the three ribosomal binding sites in bacteriophage R17 RNA. Nature. 1969;224:957–64. 10.1038/224957a0 [DOI] [PubMed] [Google Scholar]
  124. Steitz JA, Dube SK, Rudland PS. Control of translation by T4 phage: altered ribosome binding at R17 initiation sites. Nature. 1970;226:824–7. 10.1038/226824a0 [DOI] [PubMed] [Google Scholar]
  125. Svensson M, Sköld K, Svenningsson P, Andren PE. Peptidomics‐based discovery of novel neuropeptides. J Proteome Res. 2003;2(2):213–9. [DOI] [PubMed] [Google Scholar]
  126. Tian X, Zhang S, Zhou L, Seyhan AA, Hernandez Borrero L, Zhang Y, et al. Targeting the integrated stress response in cancer therapy. Front Pharmacol. 2021;12:747837. [DOI] [PMC free article] [PubMed] [Google Scholar]
  127. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA‐seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  128. Tupy JL, Bailey AM, Dailey G, Evans‐Holm M, Siebel CW, Misra S, et al. Identification of putative noncoding polyadenylated transcripts in Drosophila melanogaster. Proc Natl Acad Sci U S A. 2005;102(15):5495–500. [DOI] [PMC free article] [PubMed] [Google Scholar]
  129. Vakirlis N, Vance Z, Duggan KM, McLysaght A. De novo birth of functional microproteins in the human lineage. Cell Rep. 2022;41(12):111808. [DOI] [PMC free article] [PubMed] [Google Scholar]
  130. Vasudevan D, Neuman SD, Yang A, Lough L, Brown B, Bashirullah A, et al. Translational induction of ATF4 during integrated stress response requires noncanonical initiation factors eIF2D and DENR. Nat Commun. 2020;11(1):4677. [DOI] [PMC free article] [PubMed] [Google Scholar]
  131. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science. 1995;270:484–7. 10.1126/science.270.5235.484 [DOI] [PubMed] [Google Scholar]
  132. Velculescu VE, Zhang L, Zhou W, Vogelstein J, Basrai MA, Bassett DE Jr, et al. Characterization of the yeast transcriptome. Cell. 1997;88(2):243–51. [DOI] [PubMed] [Google Scholar]
  133. Vibert J, Saulnier O, Collin C, Petit F, Borgman KJE, Vigneau J, et al. Oncogenic chimeric transcription factors drive tumor‐specific transcription, processing, and translation of silent genomic regions. Mol Cell. 2022;82(13):2458–2471.e9. [DOI] [PubMed] [Google Scholar]
  134. Wang Z, Chen Y, Li Y. A brief review of computational gene prediction methods. Genomics Proteomics Bioinformatics. 2004;2(4):216–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  135. Wang J, Okkeri J, Pavic K, Wang Z, Kauko O, Halonen T, et al. Oncoprotein CIP 2A is stabilized via interaction with tumor suppressor PP 2A/B56. EMBO Rep. 2017;18:437–50. 10.15252/embr.201642788 [DOI] [PMC free article] [PubMed] [Google Scholar]
  136. Wang L, Park HJ, Dasari S, Wang S, Kocher J‐P, Li W. CPAT: coding‐potential assessment tool using an alignment‐free logistic regression model. Nucleic Acids Res. 2013;41(6):e74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  137. Wang H, Wang Y, Xie Z. Computational resources for ribosome profiling: from database to web server and software. Brief Bioinform. 2019;20(1):144–55. [DOI] [PubMed] [Google Scholar]
  138. Wang Y, Wu S, Zhu X, Zhang L, Deng J, Li F, et al. LncRNA‐encoded polypeptide ASRPS inhibits triple‐negative breast cancer angiogenesis. J Exp Med. 2020;217(3). 10.1084/jem.20190950 [DOI] [PMC free article] [PubMed] [Google Scholar]
  139. Weber R, Kleemann L, Hirschberg I, Chung M‐Y, Valkov E, Igreja C. DAP5 enables main ORF translation on mRNAs with structured and uORF‐containing 5′ leaders. Nat Commun. 2022;13(1):7510. [DOI] [PMC free article] [PubMed] [Google Scholar]
  140. Whiffin N, Karczewski KJ, Zhang X, Chothani S, Smith MJ, Evans DG, et al. Characterising the loss‐of‐function impact of 5′ untranslated region variants in 15,708 individuals. Nat Commun. 2020;11(1):2523. [DOI] [PMC free article] [PubMed] [Google Scholar]
  141. Wright BW, Yi Z, Weissman JS, Chen J. The dark proteome: translation from noncanonical open reading frames. Trends Cell Biol. 2022;32(3):243–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  142. Wu Q, Wright M, Gogol MM, Bradford WD, Zhang N, Bazzini AA. Translation of small downstream ORFs enhances translation of canonical main open reading frames. EMBO J. 2020;39(17):e104763. [DOI] [PMC free article] [PubMed] [Google Scholar]
  143. Xia X, Li X, Li F, Wu X, Zhang M, Zhou H, et al. A novel tumor suppressor protein encoded by circular AKT3 RNA inhibits glioblastoma tumorigenicity by competing with active phosphoinositide‐dependent Kinase‐1. Mol Cancer. 2019;18(1):131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  144. Xu S, Jia G, Zhang H, Wang L, Cong Y, Lv M, et al. LncRNA HOXB‐AS3 promotes growth, invasion and migration of epithelial ovarian cancer by altering glycolysis. Life Sci. 2021;264:118636. [DOI] [PubMed] [Google Scholar]
  145. Yeh C‐H, Bellon M, Nicot C. FBXW7: a critical tumor suppressor of human cancers. Mol Cancer. 2018;17(1):115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  146. Yewdell JW. MHC class I immunopeptidome: past, present, and future. Mol Cell Proteomics. 2022;21(7):100230. [DOI] [PMC free article] [PubMed] [Google Scholar]
  147. Yu L, Kang X, Li F, Mehrafrooz B, Makhamreh A, Fallahi A, et al. Unidirectional single‐file transport of full‐length proteins through a nanopore. Nat Biotechnol. 2023. 10.1038/s41587-022-01598-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  148. Zhang S, Reljić B, Liang C, Kerouanton B, Francisco JC, Peh JH, et al. Mitochondrial peptide BRAWNIN is essential for vertebrate respiratory complex III assembly. Nat Commun. 2020;11(1):1312. [DOI] [PMC free article] [PubMed] [Google Scholar]
  149. Zhang D, Zhang G, Hu X, Wu L, Feng Y, He S, et al. Oncogenic RAS regulates long noncoding RNA Orilnc1 in human cancer. Cancer Res. 2017;77(14):3745–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
  150. Zhang M, Zhao K, Xu X, Yang Y, Yan S, Wei P, et al. A peptide encoded by circular form of LINC‐PINT suppresses oncogenic transcriptional elongation in glioblastoma. Nat Commun. 2018;9(1):4475. [DOI] [PMC free article] [PubMed] [Google Scholar]
  151. Zhu S, Wang J‐Z, Chen D, He Y‐T, Meng N, Chen M, et al. An oncopeptide regulates m6A recognition by the m6A reader IGF2BP1 and tumorigenesis. Nat Commun. 2020;11(1):1685. [DOI] [PMC free article] [PubMed] [Google Scholar]
  152. Zou Q, Xiao Z, Huang R, Wang X, Wang X, Zhao H, et al. Survey of the translation shifts in hepatocellular carcinoma with ribosome profiling. Theranostics. 2019;9(14):4141–55. [DOI] [PMC free article] [PubMed] [Google Scholar]

