Nucleic Acids Research. 2026 Apr 21;54(7):gkag303. doi: 10.1093/nar/gkag303

Computational prediction-combined proteogenomics unveils widespread non-AUG translation initiation events in plants

Yuqian Zhang 1,2,#, Baiyang Chang 3,#, Shunxi Wang 4, Lei Tian 5, Huina Zhao 6, Weiwei Luo 7, Hanxue Zhang 8, Shoudong Zhang 9, Shubiao Wu 10, Liuji Wu 11,12,
PMCID: PMC13096810  PMID: 42011784

Abstract

Non-AUG translation initiation can generate N-terminally extended proteoforms, contributing to proteome complexity and regulatory diversity. While well characterized in mammals, its identification in plants remains limited, hindering both functional investigations and cross-species comparisons. Here, we applied a computational prediction-combined proteogenomic strategy to systematically explore non-AUG translation initiation events in the monocots maize and rice and the dicot soybean, identifying 879 transcripts potentially producing 3 938 N-terminally extended proteoforms. These events exhibited both conserved and lineage-specific mechanistic features, including stable RNA secondary structures flanking upstream translation initiation sites (uTISs), codon and sequence context preferences between monocot and dicot species, and a lack of evolutionary conservation. Plant N-terminal extensions were predicted to encode diverse targeting signals, implicating them in subcellular localization and functional diversification. Comparative analysis revealed both conserved trends and plant-specific features relative to humans. Collectively, this study provides a foundational resource and conceptual framework to advance understanding of plant non-AUG translation within a cross-kingdom evolutionary context. It also offers new opportunities to elucidate the roles of non-AUG translation in regulatory networks, proteome diversification, and adaptive biological functions across eukaryotic systems.


Introduction

Translation is a crucial step in gene expression, determining when, where, what, and how the proteins are synthesized. In eukaryotes, protein translation typically follows the canonical mode, initiating at the AUG codon and terminating at the stop codon of a coding sequence (CDS) [1]. For decades, extensive research has focused on proteins synthesized through this canonical mechanism. However, recent discoveries have revealed an increasing number of non-canonical translation events beyond canonical translation rules, such as non-AUG translation initiation [2], ribosomal frameshifting [3], and stop codon readthrough [4]. Collectively, these events have reshaped our understanding of gene expression complexity and expanded the functional proteome landscape [2, 5, 6]. Among them, non-AUG translation initiation represents a major class of non-canonical translation events, wherein alternative AUGs or near-cognate start codons (triplets differing from AUG by a single nucleotide) serve as translation initiation sites (TISs). Non-AUG translation initiation upstream of the annotated AUG is expected to generate novel proteoforms with N-terminal extensions (NTEs) beyond the canonical proteins [7, 8].

Non-AUG translation initiation giving rise to NTEs was first described in microorganisms, such as the A-protein gene in bacteriophage and the lactose repressor gene in Escherichia coli [9, 10]. Subsequent studies identified more individual genes with non-AUG-initiated NTEs across diverse species. In yeast, several tRNA synthetase genes were found to encode dual-localized proteoforms with distinct NTEs [11–13, 14]. In mammals, such non-AUG translation initiation events have been frequently reported in tumor-associated genes, including c-Myc [15, 16, 17], BAG-1 [18, 19, 20], VEGF-A [21], PTEN [22–24, 25], and bFGF [26, 27]. Many of these genes possess multiple active uTISs, generating N-terminally extended proteoforms with distinct subcellular localizations or functions. Recent systems biology approaches have enabled genome-wide identification of non-AUG translation initiation with NTEs in mammals, such as humans [5, 28, 29, 30] and mice [2, 31, 32]. Beyond dual localization [33, 34, 35], the resulting proteoforms contribute to various cellular processes, including signaling [22], mitochondrial energy metabolism [23], rDNA transcription [36], and stress responses [21, 37, 38], thereby greatly enhancing proteome complexity and functional versatility.

Despite these advances, the identification and characterization of non-AUG translation initiation with NTEs in plants remain limited, with no such events reported in monocots to date. Although we have identified other types of non-canonical translation events in plants [4, 39], the prevalence, mechanistic features, and potential functional relevance of non-AUG translation initiation with NTEs remain largely unexplored in the plant kingdom. Moreover, comparative analyses of non-AUG translation initiation events with NTEs between plants and animals are lacking, limiting broader insights into the evolutionary conservation and divergence of these events across kingdoms.

To address these gaps, we employed a computational prediction-combined proteogenomic strategy to systematically investigate non-AUG translation initiation events with NTEs in the monocots maize and rice and the dicot soybean. We identified a total of 879 transcripts undergoing non-AUG translation initiation events, potentially leading to the generation of 3938 N-terminally extended proteoforms. We further characterized the sequence, structural, and transcriptomic features of these events, explored their potential functional implications, and conducted a cross-kingdom comparative analysis with humans. Importantly, this study aims to fill the current gap in understanding the prevalence, sequence characteristics, and biological significance of non-AUG translation initiation in plants, and to provide new insights into the evolutionary conservation and divergence of non-canonical translation across eukaryotes. By uncovering this previously underappreciated layer of translational regulation, our work lays the foundation for future investigations into the biological roles of N-terminally extended proteoforms in plant physiology and adaptation.

Materials and methods

Plant materials and growth conditions

Maize (Zea mays, B73 inbred line), rice (Oryza sativa L. ssp. japonica cv. Nipponbare), and soybean (Glycine max L., ZH13) plants were grown in a temperature- and light-controlled greenhouse. The growth conditions for maize were a 15-h light/9-h dark photoperiod with a 28/25°C day/night air temperature, and samples were collected at the three-leaf stage. For rice, the growth conditions were a 12-h light/12-h dark photoperiod with a 28/25°C day/night air temperature, and samples were collected on day 56. Soybean plants were grown under a 16-h light/8-h dark photoperiod with a 22/21°C day/night air temperature, and samples were collected once the trifoliate leaves were fully expanded. Three replicate leaf samples were collected for each of the three species, frozen in liquid nitrogen, and stored at −80°C before use.

Protein extraction

Total proteins from plant leaf samples were extracted as previously reported [40]. In brief, the leaf tissues (1 g) from each species were pulverized in liquid nitrogen, and the powder was precipitated in 10% (w/v) trichloroacetic acid/acetone solution containing 65 mM dithiothreitol (DTT) at −20°C for 1 h. The solution was then centrifuged at 10 000 × g at 4°C for 45 min, and the supernatant was discarded. The precipitate was vacuum-dried and solubilized in 1/10 volumes of SDT buffer (4% [w/v] SDS, 100 mM DTT, and 150 mM Tris-HCl, pH 8.0). After a 3-min incubation in boiling water, the suspensions were ultrasonicated (80 W; 10 s per pulse, every 15 s, repeated 10 times) and incubated at 100°C for 3 min. The extract was centrifuged at 12 000 × g at 25°C for 10 min, and the total protein content was quantified using the bicinchoninic acid (BCA) Protein Assay Reagent (Sigma-Aldrich, USA). The supernatants were stored at −80°C for further analysis.

Protein digestion and peptide fractionation

The FASP procedure was used to digest 250 μg per sample with trypsin (Promega Corporation, USA) as previously described [40]. Following digestion, the peptide filtrate was fractionated by strong cation exchange (SCX) chromatography using an AKTA Purifier system (GE Healthcare, USA). The dried peptide mixture was reconstituted and acidified with 2 ml of buffer A (10 mM KH2PO4 in 25%[v/v] of ACN, pH = 2.7) and loaded onto a PolySULFOETHYL 4.6 × 100 mm column (5 µm, 200 Å, PolyLC Inc, USA). Peptides were eluted at a flow rate of 1 ml/min with buffer B (500 mM KCl, 10 mM KH2PO4 in 25% of ACN, pH = 2.7) in a gradient of 0–10% (v/v) for 2 min, 10–20% (v/v) for 25 min, 20–45% (v/v) for 5 min, and 50–100% (v/v) for 5 min. The elution was monitored by absorbance at 214 nm, and fractions were collected every 1 min. The collected fractions were combined into 15 pools and desalted on C18 Cartridges (Empore SPE Cartridges C18 standard density, bed I.D. 7 mm, volume 3 ml, Sigma-Aldrich, Germany). Each fraction was concentrated by vacuum centrifugation and reconstituted in 40 µl of 0.1% (v/v) trifluoroacetic acid. All samples were stored at −80°C for LC-MS/MS analysis.

Liquid chromatography-tandem mass spectrometry analysis

A Q-Exactive HF-X mass spectrometer coupled to the Easy nLC 1200 (Thermo Fisher Scientific, USA) was used to analyze the samples. A total of 2 μg peptide was loaded onto a C18-reversed phase column (25 cm long, 75 μm inner diameter) packed with 3 μm resin (EASY-Column Capillary Columns, Thermo Fisher Scientific, USA) in buffer A (2% [v/v] acetonitrile and 0.1% [v/v] formic acid) and separated with a linear gradient of buffer B (80% [v/v] acetonitrile and 0.1% formic acid) at a flow rate of 250 nL/min controlled by IntelliFlow technology over a duration of 90 min. MS data were acquired using the data-dependent acquisition (DDA) mode, dynamically selecting the top 10 most abundant precursor ions from the survey scan (m/z = 300–1800) for higher-energy collision dissociation (HCD) fragmentation. The determination of the target value was based on predictive Automatic Gain Control (pAGC). The dynamic exclusion duration was 25 s. Survey scans were acquired at a resolution of 70 000 (m/z = 200), and the resolution for HCD spectra was set to 17 500 (m/z = 200). The normalized collision energy was set at 30 eV, and the underfill ratio, specifying the minimum percentage of the target value likely to be reached at maximum fill time, was defined as 0.1%. The instrument operated with the peptide recognition mode enabled. Triplicate experiments were conducted for each sample in MS.

Customized database construction

The complete transcriptome sequences for the three plant species used in this study were obtained from the following available databases: Ensembl Plants (http://ftp.ensemblgenomes.org/pub/plants/release-41/fasta/zea_mays/) for maize, MSU Rice Genome Annotation Project (http://rice.uga.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/) for rice, and the National Genomics Data Center (https://download.cncb.ac.cn/gwh/Plants/Glycine_max_Gmax_ZH13_v2.0_GWHAAEV00000000.1/GWHAAEV00000000.1.RNA.fasta.gz) for soybean. Databases of theoretical N-terminally extended proteoforms were constructed for each species as previously described [41]. Briefly, for each transcript, upstream in-frame AUG codons and seven near-cognate start codons differing from AUG by a single nucleotide (CTG, TTG, GTG, ACG, ATC, ATA, and ATT) were searched in the 5′ direction from the annotated AUG. The search continued until an in-frame upstream stop codon was encountered, or until the start of the transcript was reached if no in-frame stop codon was present in the 5′ UTR. These seven near-cognate start codons were selected out of the total nine because they have been observed more frequently as natural non-AUG start codons and show higher initiation efficiencies in various assays than the remaining two codons (AGG and AAG) [41, 42, 43]. All codons found in this search were designated as potential uTISs, and in silico translation was performed starting from each of them and ending at the annotated stop codon of the transcript. The resulting database of N-terminally extended proteoforms for each species, combined with the database of CDS-encoded proteins, served as the customized database for subsequent identification of non-AUG translation initiation events in the MS experiment.
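The upstream scan and in silico translation described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names, the toy transcript, and the choice to translate the initiating codon literally (rather than as methionine) are assumptions made for the example.

```python
# Minimal sketch of the uTIS scan; all names and the toy transcript are hypothetical.
from typing import List, Tuple

# Upstream AUG plus the seven near-cognate codons listed in the text (cDNA alphabet).
START_LIKE = {"ATG", "CTG", "TTG", "GTG", "ACG", "ATC", "ATA", "ATT"}
STOPS = {"TAA", "TAG", "TGA"}

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_utis(transcript: str, cds_start: int) -> List[int]:
    """Walk 5' from the annotated AUG in codon steps and collect in-frame
    candidate start codons until an in-frame stop or the transcript start."""
    sites = []
    pos = cds_start - 3
    while pos >= 0:
        codon = transcript[pos:pos + 3]
        if codon in STOPS:
            break
        if codon in START_LIKE:
            sites.append(pos)
        pos -= 3
    return sites


def translate(seq: str) -> str:
    """Translate until the first stop codon; unknown codons become 'X'.
    Assumption: the initiating codon is translated literally."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def extended_proteoforms(transcript: str, cds_start: int) -> List[Tuple[int, str]]:
    """One N-terminally extended proteoform per candidate uTIS."""
    return [(pos, translate(transcript[pos:])) for pos in find_utis(transcript, cds_start)]


# Toy transcript: TAA CTG GCC GTG AAA | ATG GCT TGA, annotated AUG at index 15.
toy = "TAACTGGCCGTGAAAATGGCTTGA"
proteoforms = extended_proteoforms(toy, 15)
```

In the toy transcript the scan finds GTG and CTG upstream of the annotated AUG and stops at the in-frame TAA, so two extended proteoforms are produced for a single transcript.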

Proteogenomic identification of non-AUG translation initiation events using MaxQuant

The MaxQuant software (v.1.6.1.10) [44] was used to analyze the MS/MS raw data. The MS/MS spectra were searched against the constructed customized databases for the three plant species. The initial search was set at a precursor mass window of 6 ppm. The search followed the trypsin enzymatic cleavage rule with a maximum of two missed cleavage sites allowed and a mass tolerance of 20 ppm for fragment ions. Carbamidomethylation of cysteines was defined as a fixed modification, and methionine oxidation was defined as a variable modification. The cutoff of global FDR was set to 0.01 for both peptide and protein identification. For each species, the resulting MS/MS data were searched against the corresponding customized database in conjunction with the remaining CDS-encoded proteins without NTEs using MaxQuant. All identified peptides were mapped back to the corresponding positions of the N-terminally extended proteoform sequences in the database and then to their transcript locations. According to the mapping positions, peptides were categorized into three classes: peptides fully mapping to the NTE regions were defined as “before” peptides; peptides spanning the translation start site at the annotated AUG were defined as “through” peptides; peptides mapping to the coding regions were defined as “after” peptides.
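The three peptide classes can be expressed as a simple coordinate rule. The sketch below assumes 0-based, end-exclusive residue coordinates within the extended proteoform and a known NTE length in amino acids; the function name and coordinate convention are illustrative choices, not taken from the study.

```python
# Minimal sketch; the coordinate convention and names are assumptions.

def classify_peptide(start: int, end: int, nte_len: int) -> str:
    """Classify a peptide occupying residues [start, end) of an extended
    proteoform whose first nte_len residues form the N-terminal extension."""
    if end <= nte_len:
        return "before"   # lies entirely within the NTE
    if start >= nte_len:
        return "after"    # lies entirely within the annotated coding region
    return "through"      # spans the annotated start position


# Example: an NTE of 5 residues followed by the annotated protein.
labels = [
    classify_peptide(0, 4, 5),
    classify_peptide(3, 9, 5),
    classify_peptide(5, 12, 5),
]
```

Under this convention a through peptide always overlaps the NTE by at least one residue, because its first residue lies upstream of the annotated start.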

We next applied a stringent filtering strategy to reduce potential false positives. First, for all identified peptides, only unique peptides were retained for further analysis. This step ensures that the peptides mapping to the NTE regions or spanning the annotated AUGs have a single position within the N-terminally extended proteoforms, thus excluding peptides that could derive from upstream ORFs (uORFs), downstream ORFs (dORFs), upstream overlapping ORFs (uoORFs), pseudogenes, or lncRNAs. Second, we further filtered the unique peptides by retaining only those meeting the following criteria: (1) detected in at least two of the three replicates in the MS experiment; (2) scores exceeding the threshold of 20. This step greatly improved the quality and reproducibility of the remaining identified peptides. Third, a contaminant database was included in the MaxQuant search to exclude possible contaminants. The minimum peptide length required to support NTE identification was set to seven amino acids. As calculated, the average length of the portion of through peptides overlapping the NTE region was seven amino acids, with one amino acid as the minimum overlap. N-terminally extended proteoforms were retained if they were supported by through and/or before peptides, and their corresponding transcripts were identified as undergoing non-AUG translation initiation events with NTEs. In this context, “AUG” refers to the annotated AUG start codon, and uTISs in non-AUG translation initiation events include both upstream AUGs and the seven near-cognate start codons. Multiple uTISs were defined for a single transcript from proteomic evidence when each site was supported by downstream before or through peptides.
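The filtering criteria above can be summarized as a single predicate. The following sketch uses a hypothetical record type and example values; in practice the fields would come from the search engine output, whose column names are not specified here.

```python
# Minimal sketch of the filtering rules; the record type and example data are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class PeptideHit:
    sequence: str
    score: float
    n_replicates: int        # replicates (out of three) in which the peptide was detected
    is_unique: bool          # maps to a single position in the customized database
    peptide_class: str       # "before", "through", or "after"
    is_contaminant: bool = False


def passes_filters(hit: PeptideHit, min_replicates: int = 2,
                   score_threshold: float = 20.0, min_length: int = 7) -> bool:
    """Unique, non-contaminant, reproducible, score above threshold, long enough."""
    return (
        hit.is_unique
        and not hit.is_contaminant
        and hit.n_replicates >= min_replicates
        and hit.score > score_threshold
        and len(hit.sequence) >= min_length
    )


def supporting_peptides(hits: List[PeptideHit]) -> List[PeptideHit]:
    """Only before and through peptides count as evidence for an NTE."""
    return [h for h in hits
            if passes_filters(h) and h.peptide_class in ("before", "through")]


example_hits = [
    PeptideHit("LAVKMASTR", 45.2, 3, True, "through"),   # kept
    PeptideHit("AVKMAS", 60.0, 3, True, "before"),       # too short
    PeptideHit("GGSLAVKR", 20.0, 2, True, "before"),     # score not above 20
    PeptideHit("TTPLAVKR", 33.0, 1, True, "before"),     # seen in one replicate only
    PeptideHit("SSPLAVKR", 33.0, 2, False, "before"),    # not unique
    PeptideHit("MASTRLLK", 80.0, 3, True, "after"),      # passes QC but is not NTE evidence
]
kept = [h.sequence for h in supporting_peptides(example_hits)]
```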

Identification of before and through peptides using publicly available plant proteomics datasets

The MS-identified before and through peptides in maize and rice were further corroborated by searching against publicly available plant proteomics datasets [45, 46]. For maize, MS raw data from juvenile leaf samples were used. The complete amino acid sequences of the identified before and through peptides were used for the search. To ensure data quality, peptides mapping to the reverse database and those with scores below 20 were discarded during screening. Due to the limited availability of proteomics data for cultivar ZH13, no corresponding soybean data were obtained for the validation of these peptides.

Genome distribution of non-AUG-initiated transcripts in the three plant species

The genome-wide density of non-AUG-initiated transcripts in chromosomes for the three species was plotted using a window size of 20 Mb with a step of 2 Mb based on the annotated maize genome (V4.0) in Ensembl (https://plants.ensembl.org/index.html) for maize, a window size of 4 Mb with a step of 400 kb based on the annotated rice genome from MSU (http://rice.uga.edu/), and a window size of 10 Mb with a 1 Mb step based on the ZH13 reference genome (https://download.cncb.ac.cn/gwh/Plants/Glycine_max_Gmax_ZH13_v2.0_GWHAAEV00000000.1/GWHAAEV00000000.1.gff.gz) for soybean. The hotspot region was used to reflect the distribution density of non-AUG translation initiation events. Hotspot regions were defined as ≥ 10 non-AUG transcripts in a window size of 20 Mb for maize, ≥5 non-AUG transcripts in a window size of 4 Mb for rice, and ≥3 non-AUG transcripts in a window size of 10 Mb for soybean.
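The window-based density and hotspot definition can be illustrated with a short sketch. The function names are hypothetical, and truncating the last window at the chromosome end is an assumption, since the handling of chromosome ends is not described.

```python
# Minimal sketch of sliding-window counting; names and end handling are assumptions.
from typing import List, Sequence, Tuple


def window_counts(positions: Sequence[int], chrom_len: int,
                  window: int, step: int) -> List[Tuple[int, int, int]]:
    """Count transcript positions in half-open windows [start, end) that
    slide along one chromosome; the final window is clipped to chrom_len."""
    counts = []
    start = 0
    while start < chrom_len:
        end = min(start + window, chrom_len)
        n = sum(start <= p < end for p in positions)
        counts.append((start, end, n))
        if end == chrom_len:
            break
        start += step
    return counts


def hotspots(counts: List[Tuple[int, int, int]], min_count: int) -> List[Tuple[int, int, int]]:
    """Windows with at least min_count transcripts."""
    return [w for w in counts if w[2] >= min_count]


# Toy chromosome of length 30 with window 10 and step 5.
toy_counts = window_counts([1, 2, 3, 15, 16], chrom_len=30, window=10, step=5)
toy_hotspots = hotspots(toy_counts, min_count=3)
```

For maize, the corresponding call would use a window of 20 000 000, a step of 2 000 000, and a hotspot threshold of 10, with positions taken from the annotated AUG coordinates.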

RNA-seq and Ribo-seq analysis of non-AUG translation initiation

The RNA-seq and Ribo-seq datasets of maize samples under normal conditions were downloaded from the NCBI Sequence Read Archive under the accession number SRP052520 [47]. The raw transcriptome data were filtered using fastp to obtain clean data [48]. The processed data were then mapped to the non-AUG-initiated and the background transcript databases using bowtie2 [49]. Transcript expression was quantified using eXpress software [50], generating FPKM values for each transcript. Ribo-seq data analyses were conducted as described in the previous study [47]: the raw data were preprocessed using fastx to obtain clean data, and the clean data were mapped to the non-AUG-initiated and background transcript databases using bowtie2. Transcript expression was again quantified using eXpress software to generate RPKM values for each transcript. For each transcript, reads mapped to the CDS were quantified. Only transcripts with expression levels of FPKM/RPKM ≥ 1 were included in the analyses for non-AUG-initiated and background transcripts. The expression levels of the two groups were compared using a permutation test with the number of permutations set to 10 000. For each permutation, the same number of background transcripts as non-AUG transcripts was randomly selected, and the mean expression levels of the two groups were calculated for comparison.
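The resampling comparison can be sketched as follows. How the empirical P value was computed is not stated, so the one-sided formula with a +1 correction used here is an assumption, as are the function name and the toy expression values.

```python
# Minimal sketch of a resampling comparison of mean expression; the P-value
# formula, names, and toy data are assumptions for illustration.
import random
from statistics import mean
from typing import Sequence, Tuple


def resampling_test(target: Sequence[float], background: Sequence[float],
                    n_perm: int = 10_000, seed: int = 0) -> Tuple[float, float]:
    """Draw len(target) background transcripts without replacement n_perm times
    and count how often their mean reaches the observed target mean."""
    rng = random.Random(seed)
    observed = mean(target)
    k = len(target)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(list(background), k)
        if mean(sample) >= observed:
            hits += 1
    # One-sided empirical P value with a +1 correction so it is never zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Toy expression values (e.g. FPKM) for the two groups.
non_aug = [10.0, 12.0, 11.0]
bg = [1.0, 2.0, 3.0, 1.5, 2.5, 0.5, 1.2, 2.2]
obs_mean, p = resampling_test(non_aug, bg, n_perm=1000)
```

The background group must contain at least as many transcripts as the target group, which holds when a few hundred candidates are compared against the full transcriptome.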

To further investigate the proteogenomics-identified non-AUG translation initiation events in plant species using Ribo-seq data, we downloaded publicly available Ribo-seq datasets for rice [51]. The raw Ribo-seq data for maize and rice were processed using STAR (2.7.3a) [52] for genomic alignment, producing BAM files for downstream analysis. uTISs were predicted for the non-AUG-initiated transcripts using Ribo-TISH [53], with the default settings and the additional parameter “-altcodons,” incorporating ATG, CTG, GTG, ACG, TTG, ATA, ATC, and ATT as potential TIS codons. The putative protein sequences derived from the predicted uTISs were generated by Ribo-TISH and subsequently compared against the MS-identified proteoforms to identify overlaps. Due to the lack of publicly available Ribo-seq data for the soybean cultivar ZH13, no validation was performed for this species.

GO and KEGG enrichment analysis

The GO and KEGG enrichment analyses of genes involved in non-AUG translation initiation events were performed using the TBtools software package [54]. P-values were adjusted using the Benjamini-Hochberg method, and GO terms and KEGG pathways with a corrected P-value < 0.05 were considered significantly enriched.
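For reference, the Benjamini-Hochberg adjustment applied to a list of enrichment P-values can be written in a few lines. This is a generic sketch of the standard procedure, not the TBtools implementation, and the example P-values are invented.

```python
# Generic Benjamini-Hochberg adjustment; example values are invented.
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return BH-adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
```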

Minimum free energy calculation

Gibbs free energy (dG) was calculated using the RNAfold program [55] as previously described with minor modifications [29], within a sliding window of 22 nt with a 1-nt step across the region from 10 nt upstream to 80 nt downstream of each putative uTIS and annotated AUG.
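The windowing scheme can be sketched independently of the folding step. In the example below the energy function is a placeholder supplied by the caller (a toy GC count here); in a real analysis it would wrap RNAfold. The exact boundary convention for "80 nt downstream" is an assumption, as are all names.

```python
# Minimal sketch of the sliding-window layout around a start site; the energy
# function is a stand-in and the boundary convention is an assumption.
from typing import Callable, Iterator, List, Tuple


def sliding_windows(seq: str, tis: int, upstream: int = 10, downstream: int = 80,
                    window: int = 22, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (offset relative to the TIS, window sequence) for every full
    window inside [tis - upstream, tis + downstream), clipped to the transcript."""
    lo = max(0, tis - upstream)
    hi = min(len(seq), tis + downstream)
    for start in range(lo, hi - window + 1, step):
        yield start - tis, seq[start:start + window]


def energy_profile(seq: str, tis: int,
                   fold_energy: Callable[[str], float]) -> List[Tuple[int, float]]:
    """Apply a folding-energy function to each window."""
    return [(offset, fold_energy(w)) for offset, w in sliding_windows(seq, tis)]


def toy_energy(window_seq: str) -> float:
    """Placeholder for a minimum free energy call: minus the G+C count."""
    return -float(window_seq.count("G") + window_seq.count("C"))


toy_seq = "ACGU" * 30          # 120-nt toy transcript
windows = list(sliding_windows(toy_seq, tis=20))
profile = energy_profile(toy_seq, 20, toy_energy)
```

Averaging such profiles over all uTISs and over all annotated AUGs gives position-wise curves that can be compared between the two site types.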

Prediction of functional signals in N-terminal extensions of non-AUG-initiated proteoforms in plant species

Functional signals were predicted for proteoforms in non-AUG translation initiation events in maize, rice, and soybean. Full protein sequences were scanned for secretory signal peptides, mitochondrial targeting signals, and chloroplast targeting signals using TargetP 2.0 [56] with settings of “plant” and “long output.” Transmembrane domains were predicted using TMHMM (v.2.0) [57] with default settings. Full proteoform sequences were also searched for peroxisomal targeting signal 2 (PTS2) according to previously reported PTS2 consensus motifs [58].

Evolutionary analysis of non-AUG translation initiation in plant species

The protein sequences of maize, rice, and soybean were downloaded from the public databases described earlier. The protein sequences of 13 other plant species were downloaded from Ensembl Plants, including Populus trichocarpa (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/populus_trichocarpa/pep/Populus_trichocarpa.Pop_triv4.pep.all.fa.gz), Malus domestica (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/malus_domestica_golden/pep/Malus_domestica_golden.ASM211411v1.pep.all.fa.gz), Arabidopsis thaliana (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/arabidopsis_thaliana/pep/Arabidopsis_thaliana.TAIR10.pep.all.fa.gz), Arabidopsis lyrata (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/arabidopsis_lyrata/pep/Arabidopsis_lyrata.v.1.0.pep.all.fa.gz), Medicago truncatula (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/medicago_truncatula/pep/Medicago_truncatula.MtrunA17r5.0_ANR.pep.all.fa.gz), Vitis vinifera (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/vitis_vinifera/pep/Vitis_vinifera.ASM3070453v1.pep.all.fa.gz), Triticum aestivum (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/triticum_aestivum/pep/Triticum_aestivum.IWGSC.pep.all.fa.gz), Brachypodium distachyon (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/brachypodium_distachyon/pep/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.pep), Oryza brachyantha (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/oryza_brachyantha/pep/Oryza_brachyantha.Oryza_brachyantha.v1.4b.pep.all.fa.gz), Setaria italica (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/setaria_italica/pep/Setaria_italica.Setaria_italica_v2.0.pep.all.fa.gz), Sorghum bicolor (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/sorghum_bicolor/pep/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.pep.all.fa.gz), Brassica rapa (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/brassica_rapa/pep/Brassica_rapa.Brapa_1.0.pep.all.fa.gz), and Hordeum vulgare (https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-62/plants/fasta/Hordeum_vulgare/pep/Hordeum_vulgare.MorexV3_pseudomolecules_assembly.pep.all.fa.gz).

Orthogroups containing genes with non-AUG NTEs were first predicted across maize, rice, and soybean using OrthoFinder (v3.1.0) [59] with default parameters. OrthoFinder was then used to conduct the conservation analysis by searching for the orthologs of the orthogroups with non-AUG NTEs in the other 13 plant species, and the sequences were examined with FigTree (v1.4.4). For orthologs in each species, the longest protein sequence from each gene was retained to reduce redundancy according to the transcript annotation available for these proteins. The most closely related sequence was manually selected for further alignments. The cDNA sequences were then translated in the upstream direction from the annotated start codon position until the nearest stop codon, and the corresponding putative NTE sequences were obtained. Finally, MEGA11 [60] was used to construct the phylogenetic tree and perform multiple sequence alignments. Species without orthologs or with orthologs lacking 5′ UTRs were excluded from the final sequence alignment.

Feature analyses of non-AUG translation initiation events in humans

We downloaded the datasets of non-AUG-initiated genes with NTEs from three animal studies [29, 61, 62]. In the study of Rodriguez et al. [62], we used the 171 genes with translation evidence in their NTEs at the protein level for analysis. In the study of Fedorova et al. [29], we used the RiboSET dataset containing 390 non-AUG-initiated genes with Ribo-seq evidence in NTEs for analysis. In the study of Na et al. [61], we used the 55 non-redundant genes with N-terminal peptides in NTEs for analysis. According to the sample types (Mix samples or cell lines) and the year of the study, the collection of candidates in different studies was named as: Mix_2024 for the study of Rodriguez et al. [62], Cell_2022 for the study of Fedorova et al. [29], and Mix_2018 for the study of Na et al. [61]. We analyzed the different features of these non-AUG-initiated candidates using the same methods applied to plant species as described above.

Results

Computational prediction-combined proteogenomic workflow for identifying non-AUG translation initiation events with NTEs in maize

To identify genome-wide non-AUG translation initiation events resulting in NTEs in plant species, we used maize as the primary example. This was achieved by implementing a computational prediction-integrated proteogenomic strategy, which enables direct translational evidence of non-AUG initiation at the protein level (Fig. 1). As detailed in the Materials and methods section, we constructed a database of potential N-terminally extended proteoforms by assuming that upstream AUGs and seven near-cognate start codons (CUG, UUG, GUG, ACG, AUC, AUA, and AUU) in-frame with the annotated AUG could serve as alternative start sites. Each transcript was systematically scanned to identify these codon sites, and in silico translation initiating from these sites was performed to generate a collection of putatively translated N-terminally extended proteoforms. For transcripts with more than one upstream putative start site, each was used to generate a separate N-terminally extended proteoform. Liquid chromatography-tandem mass spectrometry (LC-MS/MS) was performed using a bottom-up proteomics pipeline to acquire MS/MS spectra data. The predicted N-terminally extended proteoforms, combined with canonical CDS-encoded proteins, constituted the customized database as the search space for MS/MS spectra data with the MaxQuant search engine. Experimental spectra were matched against the theoretical peptide spectra generated from the combined customized database (Fig. 1A).

Figure 1.

Computational prediction-combined proteogenomic workflow for identifying non-AUG translation initiation events with N-terminal extensions (NTEs) in maize. (A) The workflow integrates transcriptome-wide computational prediction of N-terminally extended proteoforms with liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based proteogenomics. First, plant transcript sequences were downloaded from the database. For each transcript, upstream in-frame AUG and seven near-cognate start codons were systematically scanned in the 5′ direction from the annotated AUG until an in-frame stop codon was reached, or to the start of the transcript if no in-frame stop codon was encountered. These searched putative upstream translation initiation sites (uTISs) were then used as start sites for in-silico translation to the annotated stop codon of each transcript, generating a database of N-terminally extended proteoforms. Second, total proteins were extracted from plant leaves and subjected to enzymatic digestion, fragmentation, and LC-MS/MS analysis to generate spectra datasets. The resulting MS/MS spectra datasets were searched against a customized database comprising N-terminally extended proteoforms and canonical coding sequence (CDS)-encoded proteins. N-terminally extended proteoforms detected with peptides uniquely mapping to regions spanning (“through” peptides) or preceding (“before” peptides) the annotated AUG translation start site were retained, and their corresponding transcripts were identified as undergoing non-AUG translation initiation events with NTEs. (B) Peptide annotation and selection in MS identification. In the annotation step, according to the mapping positions, peptides fully mapping to the NTE regions were defined as “before” peptides; peptides spanning the position of the translation site of annotated AUGs were defined as “through” peptides; peptides mapping to the coding regions were defined as “after” peptides. In the selection step, only unique peptides with a score >20 and reproducibility ≥2 replicates were retained for further analysis.

Identified peptides were mapped to their genomic coordinates and annotated according to their positions (Fig. 1B, left). N-terminally extended proteoforms detected with unique peptides mapping to NTE regions (“before” peptides) or AUG-spanning regions (“through” peptides) were retained, and their corresponding transcripts were identified as undergoing non-AUG translation initiation events that lead to NTEs (Fig. 1A). For further quality control, unique peptides detected in fewer than two experimental replicates or with scores not exceeding the threshold of 20 were filtered out (Fig. 1B, right). These filtering steps retained 640 high-confidence transcripts identified as non-AUG-initiated candidates with NTEs, potentially generating an additional 3011 N-terminally extended proteoforms (Supplementary Table S1). We also explored publicly available maize proteomics datasets to validate the identified peptides. The results showed that 69 peptides supporting NTEs were recovered from the previous data (Supplementary Table S2), demonstrating the reproducibility of our proteogenomic findings in independent datasets. In addition, we retrieved available maize Ribo-seq data and analyzed the identified non-AUG-initiated transcripts using Ribo-TISH. Our analysis revealed that 158 non-AUG-initiated transcripts identified through proteogenomics were corroborated by Ribo-seq analysis (Supplementary Table S1). Of these, 22 transcripts contained multiple putative uTISs identified by Ribo-TISH.

Collectively, the above results reveal the widespread presence of non-AUG translation initiation events with NTEs in the monocot species maize, uncovering a previously underappreciated source of proteome diversity generated by non-canonical translation initiation.

Global characteristics of non-AUG translation initiation events with NTEs in maize

To characterize the identified non-AUG initiation events in maize, we analyzed the genomic and transcript features of the identified non-AUG-initiated transcripts with NTEs. We first profiled the genomic distribution of these transcripts to determine whether they exhibited any chromosomal or regional preference at the genomic level. The results revealed that these non-AUG-initiated candidates were broadly distributed across all maize chromosomes (Fig. 2A). Among the 640 high-confidence transcript candidates with NTEs, approximately 34% possessed a single putative upstream translation initiation site (uTIS), while 66% harbored multiple putative uTISs (Fig. 2B). Analysis of codon usage at uTISs showed that GUG was the most prevalent codon (21%), followed by CUG (19%) and AUC (17%), whereas AUA was the least represented (7%) (Fig. 2C). Examination of NTE lengths revealed that over 90% of the NTEs in non-AUG-initiated transcripts were shorter than 500 nucleotides (Fig. 2D).

Figure 2.

Figure 2 showing the global characteristics of non-AUG translation initiation events with NTEs in maize, including their genomic distribution, uTIS composition, NTE length distribution, transcript features, and expression comparisons with background transcripts and proteins.

Global characteristics of non-AUG translation initiation events with NTEs in maize. (A) Genomic distribution of non-AUG-initiated transcripts with NTEs across maize chromosomes. Dark green bars in each chromosome represent the location of non-AUG-initiated transcripts, marked with the positions of the corresponding annotated AUG codons. Grey dots are centromeres. Orange curves represent the distribution patterns of non-AUG-initiated transcripts using a 20-Mb sliding window with a 2-Mb step based on the B73 reference genome. Values of the curve range from 0 to 56. Purple asterisks show the distribution of hotspot regions for non-AUG translation initiation events, defined as ≥10 non-AUG-initiated transcripts in a 20-Mb window. (B) Distribution of the number of predicted uTISs per transcript with NTEs. (C) Pie chart showing the codon composition of putative uTISs. (D) Length distribution and cumulative frequency of NTEs for non-AUG-initiated transcripts. For transcripts with multiple putative uTISs, all uTISs were used to determine the NTE length. (E) Boxplots describing the lengths of primary transcripts, CDSs, introns, and 5′ UTRs for transcripts with NTEs and background transcripts. Primary transcripts refer to the full-length pre-mRNAs transcribed from the genomic DNA, including 5′ UTRs, coding sequences (CDSs), introns, and 3′ UTRs. The number of background transcripts is 61 168. (F) Number of exons for transcripts with NTEs and background transcripts. P values were calculated using a two-sided Student’s t-test, ***P < 0.001. A permutation test was also performed for the comparisons in (E) and (F) (N = 10 000), and all differences remained statistically significant. (G-H) Violin plots showing the expression levels (log2) of transcripts with NTEs and background transcripts at the transcriptome (G) and translatome (H) levels under normal conditions of maize seedlings. Reads mapped to CDS were quantified. FPKM and RPKM values from RNA-seq and Ribo-seq datasets with two biological replicates were calculated. Only transcripts with FPKM or RPKM ≥ 1 were included. Whiskers within violin plots represent 1.5 × IQR, and dark dots in the internal boxplot represent the median values. P values were calculated using a permutation test (N = 10 000), *P < 0.05. FPKM, fragments per kilobase of transcript per million mapped reads. RPKM, reads per kilobase per million mapped reads. (I) Box plots showing the expression levels (log10) for non-AUG-associated canonical proteins and background proteins. P values were calculated using a permutation test (N = 10 000), ***P < 0.001. For all boxplots, box edges represent the 0.25 and 0.75 quantiles. Lines within the boxes represent the median values, hollow squares represent the mean values, and whiskers represent 1.5 × IQR.

We next compared transcript features between non-AUG-initiated candidates and background transcripts. The average length of primary transcripts for non-AUG-initiated candidates with NTEs was significantly shorter than the background (P < 0.001, Fig. 2E), and their corresponding canonical proteins also exhibited significantly shorter average lengths and lower molecular weights (Supplementary Fig. S1). Similar trends were observed for CDS and intron lengths (P < 0.001, Fig. 2E). In contrast, the average 5′ UTR length for non-AUG-initiated candidates with NTEs was significantly longer than that of the background transcripts (P < 0.001, Fig. 2E). Additionally, candidates with NTEs had significantly fewer exons on average than their background counterparts (P < 0.001, Fig. 2F). A permutation test was also performed to compare the transcript features, and the results showed that all these comparisons remained statistically significant (P < 0.001), indicating that the observed differences were not due to sample-size imbalance.
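A label-permutation test of the kind mentioned above can be sketched as follows. The test statistic (absolute difference in group means), the two-sided formulation, and the add-one correction to the P value are assumptions of this sketch; the source states only that 10 000 permutations were used.

```python
import random


def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    Group labels are shuffled n_perm times; the P value is the fraction of
    shuffles whose absolute mean difference is at least the observed one,
    with an add-one correction so that the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = abs_mean_diff(pooled)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs_mean_diff(pooled) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)
```

Because the null distribution is built from the observed group sizes, the procedure does not rely on the two groups being of similar size.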

To explore a potential relationship between gene expression levels and non-AUG initiation, we compared the expression levels of transcripts with NTEs to those of background transcripts using available RNA-seq and Ribo-seq data. Transcripts with NTEs exhibited significantly lower expression levels than the background transcripts at both the transcriptomic and translatomic levels under normal growth conditions (P < 0.05, Fig. 2G, H), with this trend becoming even more pronounced under drought stress (P < 0.01, Supplementary Fig. S2A-D). By contrast, MS data revealed that the proteins associated with non-AUG translation initiation events had significantly higher expression levels than the background proteins (P < 0.001, Fig. 2I). To address potential protein-size effects, we also evaluated the protein-level comparison using iBAQ quantification, which showed consistent results (P < 0.001). It should be noted that although amino acid composition may also influence peptide detectability in MS, no standard protein-level correction is routinely applied in label-free quantification pipelines [44]. Discrepancies between the transcript and protein expression levels might be attributed to a range of factors, including differing stabilities of mRNA and protein, translational control, delays in protein synthesis, and protein transport [45, 63]. Together, these results highlight the widespread occurrence and the associated expression features of non-AUG translation initiation events with NTEs in maize, providing a foundation for further exploration of their functional significance.
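For reference, the expression units used in Fig. 2G-H follow the standard per-kilobase, per-million normalization, and the figure legend states that only transcripts with FPKM or RPKM ≥ 1 were included before log2 transformation. A minimal sketch is given below; the function and parameter names are illustrative, and how the two biological replicates were combined is not specified in the source and is therefore not modeled here.

```python
import math


def fpkm(fragments, cds_length_nt, total_mapped):
    """Fragments per kilobase of CDS per million mapped fragments.

    The same formula gives RPKM when read counts are supplied instead of
    fragment counts.
    """
    return fragments * 1e9 / (cds_length_nt * total_mapped)


def log2_expressed(values, threshold=1.0):
    """Keep values at or above the inclusion threshold and log2-transform them."""
    return [math.log2(v) for v in values if v >= threshold]
```

For example, 500 fragments on a 1000-nt CDS in a library of 20 million mapped fragments corresponds to an FPKM of 25.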

Sequence context features of uTISs in non-AUG translation initiation events with NTEs in maize

To investigate the potential mechanisms underlying non-AUG translation initiation in maize, we analyzed the nucleotide sequence context surrounding the putative uTISs of non-AUG-initiated transcripts. We first examined nucleotide frequencies at positions from −4 to +4 relative to the putative uTISs and the corresponding annotated AUGs. The uTISs exhibited a strong enrichment of purines at key flanking positions, with guanine (G) being the most abundant nucleotide at position −3 (35%), adenine (A) at position −2 (33%), G at position −1 (33%), and G at position +4 (36%) (Fig. 3A, top). This pattern closely resembles that of annotated AUGs, which showed higher frequencies of G at position −3 (43%) and position +4 (48%) and a similar frequency of A at position −2 (32%) (Fig. 3A, bottom). In addition, A accounted for the highest proportion (48%) at position +1 for uTISs. Analysis of AG content from positions −3 to +4 revealed that 72% of the putative uTISs exhibited high AG content (AG content ≥ 50%), although this was still significantly lower than the 80% observed for annotated AUGs (P < 0.01, Fig. 3B). Based on nucleotide identity at positions −3 (A/G) and +4 (G), we evaluated the Kozak context strength of these TISs. Nearly half (49%) of the putative uTISs possessed a medium Kozak sequence, followed by weak (29%) and strong Kozak sequences (22%) (Fig. 3C). In comparison, annotated AUGs showed a higher proportion of medium (53%) and strong (33%) Kozak contexts, with only 14% classified as weak (Fig. 3C). These results suggest that many non-AUG uTISs are embedded in favorable sequence contexts that may support efficient initiation, similar to those surrounding canonical AUGs.
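The Kozak classification and AG-content measure described above can be written compactly. The sketch below follows the stated rule (strong: both −3R and +4G; medium: either; weak: neither), with the first nucleotide of the start codon at +1 and no position 0. The function names are illustrative, and the assumption that the −3 to +4 window includes the start codon itself (seven nucleotides in total) is ours.

```python
PURINES = {"A", "G"}


def _check_bounds(seq, tis):
    # Positions -3 and +4 must both exist within the sequence.
    if tis < 3 or tis + 3 >= len(seq):
        raise ValueError("sequence does not cover positions -3 to +4")


def kozak_class(seq, tis):
    """Classify the Kozak context of a start codon.

    seq: RNA sequence (A/C/G/U); tis: 0-based index of position +1.
    Position -3 is seq[tis - 3] and position +4 is seq[tis + 3].
    """
    _check_bounds(seq, tis)
    minus3_purine = seq[tis - 3] in PURINES
    plus4_g = seq[tis + 3] == "G"
    if minus3_purine and plus4_g:
        return "strong"
    if minus3_purine or plus4_g:
        return "medium"
    return "weak"


def ag_content(seq, tis):
    """Fraction of A/G among the seven nucleotides from -3 to +4."""
    _check_bounds(seq, tis)
    window = seq[tis - 3 : tis + 4]
    return sum(base in PURINES for base in window) / len(window)
```

A site would then be counted as having high AG content when `ag_content(seq, tis) >= 0.5`.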

Figure 3.

Figure 3 showing the sequence context features of uTISs for non-AUG translation initiation events with NTEs in maize, including surrounding nucleotide frequencies, AG/GC content, Kozak context, and minimum free energy distribution.

Sequence context features of uTISs in non-AUG translation initiation events with NTEs in maize. (A) Weblogo showing nucleotide frequencies at positions −4 to +4 surrounding putative uTISs (top) and annotated AUGs (bottom) in non-AUG-initiated transcripts. The first nucleotide of the TIS codon is assigned as +1. The annotated AUGs refer to the canonical translation initiation sites of the transcripts. (B) AG-content distribution (positions −3 to +4) for uTISs and annotated AUGs. P values were calculated using a two-sided chi-squared test, ***P < 0.001. (C) Distribution of Kozak context classes (strong, medium, and weak) surrounding uTISs and annotated AUGs. Classification is based on −3R and +4G (R = A/G): strong (both −3R and +4G), medium (either −3R or +4G), weak (neither −3R nor +4G). (D) GC-content of different transcript regions in non-AUG-initiated candidates, including the canonical CDS, 5′ UTR, and NTE regions. GC content was calculated as (G + C)/(G + C + A + U) × 100% within a given region. For transcripts with multiple putative uTISs, all uTISs were used for GC-content calculation. Lines within the boxes represent the median values, hollow squares represent the mean values, and whiskers represent 1.5 × IQR. P values were calculated using one-way ANOVA with Tukey’s multiple comparisons test, *P < 0.05, ***P < 0.001. (E) Distribution of minimum free energy (MFE) values (dG, Gibbs free energy) for RNA secondary structures downstream of uTISs and annotated AUGs. The region spans from 10-nt upstream to 80-nt downstream of the TIS codons. Inset: comparison of average MFE (dG) for each nucleotide position around the putative uTISs and annotated AUGs. P values were calculated using a two-sided Student’s t-test, ***P < 0.001.

GC-content is known as a crucial factor influencing DNA structural stability and gene expression [64]. We assessed the GC-content in different transcript regions of the non-AUG-initiated candidates. The results showed that the average GC-content of the NTEs and 5′ UTRs of non-AUG-initiated candidates was 54% and 54.7%, respectively, both of which were significantly higher than that of the CDS regions (48.9%) (Fig. 3D) and also exceeded the typical genome-wide GC-content range reported in plants (33–48%) [65]. We also examined the GC-content in different regions of the background transcripts. The results showed that the average GC-content of 5′ UTRs for background transcripts was significantly higher than that of the CDS regions (P < 0.01), suggesting that the higher GC-content of 5′ UTRs compared to the CDSs is not specific to the non-AUG-initiated transcripts in maize. Next, we evaluated the stability of RNA secondary structures located downstream of the putative uTISs and the annotated AUGs by calculating minimum free energy. The distribution of minimum free energy surrounding the putative uTISs was more narrowly concentrated and significantly lower than that of annotated AUGs (P < 0.001, Fig. 3E, inset), indicating more stable local RNA secondary structures around non-AUG start sites. These findings support a model in which stable RNA structures may facilitate recognition or accessibility of non-AUG TISs, potentially compensating for their weaker sequence identity.
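The regional GC-content values compared above are percentages of G and C among all four nucleotides in a region. A minimal sketch of this calculation is shown below; accepting either DNA (T) or RNA (U) input and ignoring ambiguous bases such as N are choices made for this illustration, not details taken from the source.

```python
def gc_content(seq):
    """Percentage of G and C among the unambiguous nucleotides of a region."""
    rna = seq.upper().replace("T", "U")
    counts = {base: rna.count(base) for base in "ACGU"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous nucleotides in sequence")
    return 100.0 * (counts["G"] + counts["C"]) / total
```

Applied separately to the NTE, 5′ UTR, and CDS sequence of each transcript, this yields the per-region values that are then compared across groups.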

Functional implications of N-terminally extended proteoforms in maize

To investigate the biological relevance of non-AUG translation initiation events identified in maize, we first conducted Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses on genes involved in non-AUG translation initiation events. In the biological process category, significant enrichment was primarily observed for terms related to translation, response to abiotic stimulus, and catabolic process (Fig. 4A; Supplementary Table S3). In the molecular function category, the top three enriched terms included “structural molecule activity,” “nucleotide binding,” and “translation regulator activity.” For the cellular component, the most significantly enriched three terms were “cytosol,” “cytoplasm,” and “chloroplast” (Fig. 4A; Supplementary Table S3). KEGG pathway analysis further identified “cellular processes,” “transport and catabolism,” “exosome,” “glycolysis/gluconeogenesis,” and “translation” as the top five significantly enriched pathways (Fig. 4B; Supplementary Table S4). These findings suggest that non-AUG translation initiation plays a role in fundamental cellular processes, including protein synthesis, metabolic regulation, and stress response in maize.

Figure 4.

Figure 4 showing the functional implications of N-terminally extended proteoforms in maize, including GO and KEGG enrichment analyses, as well as predictions of diverse functional signals.

Functional implications of N-terminally extended proteoforms in maize. (A) Gene Ontology (GO) enrichment analysis of genes undergoing non-AUG translation initiation. The GO enrichment terms include biological process (BP), cellular component (CC), and molecular function (MF). Bubble size represents the number of enriched genes; color represents the adjusted P-value. The top 10 enriched terms for BP and MF, and the top five for CC, are shown. (B) Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of the same gene set. Color represents the adjusted P-value. (C) Functional peptide signals predicted within the NTEs of proteoforms, including secretory signal peptides, mitochondrial transit peptides, chloroplast transit peptides, thylakoid luminal transit peptides, and transmembrane domains. (D–G) Examples of predicted functional signals in individual N-terminally extended proteoforms. (D) Predicted signal peptide in the NTE of proteoform Zm00001d010048_P003-bf147 using TargetP 2.0, with a cleavage site at residues 23–24. (E) Mitochondrial transit peptide signal predicted in the NTE of proteoform Zm00001d053453_P010-bf177 using TargetP 2.0, with a cleavage site at residues 20–21. (F) Chloroplast transit peptide signal predicted in the NTE of proteoform Zm00001d021715_P001-bf582 using TargetP 2.0, with a cleavage site at residues 57–58. CS: cleavage site. (G) Transmembrane domain predicted in the NTE of proteoform Zm00001d018004_P006-bf222 using TMHMM 2.0. The orange, magenta, and blue lines represent residues 1–9 outside the membrane, 10–29 as the transmembrane domain, and 30–74 inside the membrane, respectively.

To explore the potential functional impact of NTEs, we predicted the presence of different functional signals for those proteoforms. The analysis showed that a total of 18 proteoforms were predicted to contain secretory signal peptides (SPs) in NTE regions, indicating possible roles in protein translocation across membranes (Fig. 4C; Supplementary Table S5). For instance, the proteoform Zm00001d010048_P003-bf147 was predicted to harbor a signal peptide in the NTE with a cleavage site located 23–24 residues from the N-terminus (Fig. 4D). In addition, 31 and 34 N-terminally extended proteoforms were predicted to contain mitochondrial and chloroplast transit peptides in their NTEs, respectively, suggesting possible relocalization to organelles (Fig. 4C; Supplementary Table S5). Representative examples include Zm00001d053453_P010-bf177 and Zm00001d021715_P001-bf582, which were predicted to harbor mitochondrial and chloroplast transit peptides, respectively (Fig. 4E, F). Two proteoforms were predicted to contain thylakoid luminal transit peptides (Fig. 4C; Supplementary Table S5). Moreover, 58 proteoforms were predicted to contain transmembrane domains within their NTEs (Fig. 4C; Supplementary Table S6), such as Zm00001d018004_P006-bf222 with a predicted transmembrane domain shown in Fig. 4G. These results collectively highlight the functional versatility of N-terminally extended proteoforms generated through non-AUG translation initiation and suggest that these extensions may alter the localization, stability, or regulatory roles of proteins within diverse biological pathways.

Identification and characterization of non-AUG translation initiation events with NTEs in rice and soybean

Using the same computational prediction-combined proteogenomic strategy, we extended our analysis of non-AUG translation initiation events with NTEs to two additional plant species, namely the monocot rice and the dicot soybean. This effort identified 149 transcripts undergoing non-AUG translation initiation events with NTEs in rice, potentially generating an additional 590 N-terminally extended proteoforms (Supplementary Table S7). As in maize, we examined publicly available rice proteomics datasets, and 74 “through” or “before” peptides were corroborated (Supplementary Table S8). Further analysis using the available Ribo-seq data for rice showed that 121 non-AUG-initiated transcripts had putative uTISs identified by Ribo-TISH (Supplementary Table S7). In soybean, we identified 90 transcripts undergoing non-AUG translation initiation events, potentially generating 337 N-terminally extended proteoforms (Supplementary Table S9).

Genome-wide distribution analysis revealed that the identified non-AUG translation events were dispersed in all chromosomes of rice (Fig. 5A, left), and across all soybean chromosomes except chromosome 11 (Fig. 5A, right). In rice, 31% of the non-AUG-initiated transcripts harbored a single putative uTIS, while the majority contained multiple putative uTISs (Fig. 5B, pink). Similarly, 32% of the non-AUG-initiated transcripts in soybean contained a single putative uTIS, and the remaining harbored multiple uTISs (Fig. 5B, blue). Codon composition analysis of uTISs showed that in rice, upstream GUG was the most frequent codon (22%), followed by AUC (15.8%) and CUG (14.9%), and upstream AUG was the least frequent codon (2%) (Fig. 5C, left). For soybean, GUG, AUU, and UUG were equally represented as the most frequent uTIS codons, each accounting for 18% of the total (Fig. 5C, right). Length distribution analysis of NTEs showed that over 90% of the NTEs in non-AUG-initiated transcripts were shorter than 400 nucleotides in both species (Fig. 5D). In rice, the average GC-content of NTEs (54.1%) and the 5′ UTRs (55.3%) was significantly higher than that of CDSs (49.9%, Fig. 5E). Interestingly, a distinct pattern in soybean was observed compared to the two monocots: the average GC-content of NTEs (43.4%) was comparable to that of the CDSs (43.9%), and significantly higher than that of the 5′ UTRs (41.3%) (Fig. 5F).

Figure 5.

Figure 5 showing the identification and characterization of non-AUG translation initiation events with NTEs in rice and soybean, including their genomic distribution, uTIS composition, NTE length distribution, GC/AG content, surrounding nucleotide frequencies, and Kozak context.

Identification and characterization of non-AUG translation initiation events with NTEs in rice and soybean. (A) Circos plot showing the genomic distribution of non-AUG-initiated transcripts with NTEs in the chromosomes of rice (left) and soybean (right). The circos plot contains four layers: the outermost layer represents chromosomes; the second layer represents the location of non-AUG-initiated transcripts, marked by the chromosomal position of the corresponding annotated AUG codons; the third layer represents the distribution patterns of non-AUG-initiated transcripts using a 4-Mb sliding window with a 400-Kb step based on the rice reference genome and a 10-Mb sliding window with a 1-Mb step based on the soybean reference genome; and the innermost layer shows the distribution of hotspot regions for non-AUG translation initiation events, defined as ≥5 non-AUG-initiated transcripts for rice and ≥3 for soybean within the corresponding window size. (B) Distribution of the number of putative uTISs per non-AUG-initiated transcript in rice (pink) and soybean (blue). The x-axis represents the number of putative uTISs per transcript, and circles with numbers represent the number of non-AUG-initiated transcripts. (C) Heatmap showing codon composition of the putative uTISs in non-AUG translation initiation events for rice (left) and soybean (right). (D) Length distribution of putative NTEs for non-AUG-initiated transcripts in rice (pink) and soybean (blue). (E-F) GC-content of different regions for non-AUG-initiated candidates in rice (E) and soybean (F). The regions include the canonical CDS, 5′ UTR, and NTE regions. Lines within the boxes represent the median values, hollow squares represent the mean values, and whiskers represent 1.5 × IQR. P values were calculated using one-way ANOVA with Tukey’s multiple comparisons test, ***P < 0.001. (G-H) Weblogo showing the nucleotide frequencies surrounding the putative uTISs and annotated AUGs (positions −4 to +4) in rice (G) and soybean (H). The first nucleotide of the TIS is assigned as +1. (I) AG content flanking the predicted uTISs and annotated AUGs from positions −3 to +4 in rice (top) and soybean (bottom). P values were calculated by the two-sided chi-squared test, ***P < 0.001, *P < 0.05. (J-K) Distribution of Kozak context classes (strong, medium, weak) surrounding uTISs and annotated AUGs in rice (J) and soybean (K). Classification is based on −3R and +4G (R = A/G): strong (both −3R and +4G), medium (either −3R or +4G), weak (neither −3R nor +4G).

We next examined the sequence context surrounding the putative uTISs and annotated AUGs in these two plant species. In rice, purines (A/G) dominated the flanking positions from −4 to +4 (Fig. 5G, top), with G being most enriched at position −3 of both uTISs and annotated AUGs (Fig. 5G, bottom). A was the most abundant nucleotide at position −2 around the uTIS, again paralleling the maize pattern. In soybean, A was the most enriched at positions −3, −2, and +4 surrounding the putative uTISs, whereas uracil (U) predominated at positions −4 and −1 (Fig. 5H, top). In contrast, annotated AUGs displayed a different pattern, with A most frequent at position −3, C at positions −2 and −1, and G at position +4 (Fig. 5H, bottom). In addition, A accounted for the highest proportion (48% in rice and 51% in soybean) at position +1 for uTISs in both species. Analysis of AG content from positions −3 to +4 showed that 72% of the putative uTISs and 81% of annotated AUGs had high AG content (Fig. 5I) in rice, though the AG content at uTISs remained significantly lower than at annotated AUGs (P < 0.001, Fig. 5I, top). Similarly, 73% of putative uTISs and 71% of annotated AUGs exhibited high AG content (Fig. 5I) in soybean, with the latter being significantly higher (P < 0.05, Fig. 5I, bottom).
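Comparisons of the proportion of high-AG sites between uTISs and annotated AUGs are reported with a two-sided chi-squared test. The sketch below shows one way such a test can be computed for a 2 × 2 table using only the standard library. It applies the Pearson statistic without continuity correction, which is an assumption since the source does not say whether a correction was used, and the function name and argument layout are illustrative.

```python
import math


def chi2_2x2(a_high, a_total, b_high, b_total):
    """Pearson chi-squared test (1 degree of freedom) for two proportions.

    a_high / a_total: sites with high AG content and all sites in group A;
    b_high / b_total: the same for group B.
    Returns (chi2, p_value). No continuity correction is applied.
    """
    table = [[a_high, a_total - a_high], [b_high, b_total - b_high]]
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_sums)
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("chi-squared test undefined for an empty row or column")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

The inputs are counts rather than percentages, so the same proportions can yield different P values depending on the number of sites in each group.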

Kozak context analysis revealed that more than half of putative uTISs in rice possessed a medium-strength Kozak sequence, with 15% classified as strong (Fig. 5J). We further examined the nucleotide compositions at the critical positions −3 and +4 surrounding the two classes of the putative uTISs, including upstream AUG and near-cognate start codons. The results showed that AUG uTISs had the highest enrichment of G at both positions in rice (Supplementary Fig. S3A, B), while near-cognate start codons favored G at position −3 (30.3%) and cytosine (C) at position +4 (29.3%) (Supplementary Fig. S3C, D). These findings suggest that AUG uTISs are more frequently embedded in favorable Kozak contexts than near-cognate start codons. The annotated AUGs in rice showed a similar distribution: 52% with medium-strength, 34% with strong, and 14% with weak Kozak sequences (Fig. 5J). In soybean, 48% of putative uTISs had a medium-strength Kozak sequence, followed by weak (32%) and strong (20%) Kozak sequences (Fig. 5K). In contrast, annotated AUGs showed the highest proportion of strong Kozak sequences (43%) (Fig. 5K). Furthermore, the distribution of the minimum free energy for regions surrounding the putative uTISs was generally lower than that of annotated AUGs in both species (Supplementary Fig. S4), with the average value showing statistical significance in soybean (P < 0.001, Supplementary Fig. S4B).

In addition, we analyzed transcript features associated with these non-AUG translation initiation events in rice and soybean. The results showed that non-AUG-initiated candidates exhibited significantly longer 5′ UTRs compared to background transcripts in rice (P < 0.001, Supplementary Fig. S5A), similar to what we observed in maize. These candidates also showed a trend towards shorter primary transcripts, CDSs, and introns, as well as fewer exons on average than the background, although the differences were not statistically significant (Supplementary Fig. S5A, B). In soybean, non-AUG-initiated candidates exhibited longer primary transcripts, introns, and 5′ UTRs compared to the background, with the increase in 5′ UTR length showing statistical significance (P < 0.01, Supplementary Fig. S6A). No significant difference was observed in the average number of exons between non-AUG-initiated transcripts and background transcripts (Supplementary Fig. S6B). Together, these results indicate that non-AUG initiation in the two monocots shares both conserved and species-specific features with soybean, further illustrating the regulatory complexity of this mechanism in different plant species.

As in maize, we predicted the potential functional signals within NTE regions of those proteoforms in rice and soybean. The results showed that in rice, one N-terminally extended proteoform was predicted to contain a secretory signal peptide, 12 contained mitochondrial transit peptides, seven contained chloroplast transit peptides, and two contained thylakoid luminal transit peptides (Supplementary Table S10). In addition, 16 proteoforms were predicted to harbor transmembrane domains (Supplementary Table S11). In soybean, 21 proteoforms were predicted to contain secretory signal peptides, six contained chloroplast transit peptides, and 20 contained transmembrane domains (Supplementary Tables S10, S11).

Based on the non-AUG translation initiation events identified in the three plant species, we further explored the evolutionary conservation of the non-AUG NTEs across different species. First, we analyzed the orthogroups containing genes with non-AUG NTEs across maize, rice, and soybean. Three orthogroups shared between two species were identified, including LOC_Os03g52150.1 and SoyZH13_14G070000.m1 (Fig. 6A, red), LOC_Os08g34190.1 and SoyZH13_04G058800.m1 (Fig. 6B, red), as well as Zm00001d026490_T015 and SoyZH13_13G138600.m2 (Fig. 6C, red). Next, we extended the analysis by performing sequence alignments for each orthogroup across 15 plant species to further assess the conservation of NTE regions. The results revealed that, although the equivalent regions of putative NTEs were generally not conserved across species, the nucleotide sequences immediately upstream of the annotated AUG showed high similarity among the most closely related species (Fig. 6A, B, C), which may be partly due to the Kozak context of the annotated AUG. Collectively, these findings suggest that the identified non-AUG NTEs show limited evolutionary conservation across these plant species.

Figure 6.

Figure 6 showing the conservation of non-AUG-initiated genes across different plant species, including phylogenetic relationships and sequence alignments.

Conservation analysis of non-AUG translation initiation events across different plant species. (A–C) Phylogenetic trees (left) and sequence alignments (right) of putative N-terminal extension (NTE) regions for three orthogroups containing genes with non-AUG translation initiation events identified across maize, rice, and soybean and orthologs in multiple plant species. (A) Orthogroup 1: LOC_Os03g52150.1 and SoyZH13_14G070000.m1. Putative uTISs for LOC_Os03g52150.1 are TTG288 and ATC384; and for SoyZH13_14G070000.m1 are GTG72 and ATC93. (B) Orthogroup 2: LOC_Os08g34190.1 and SoyZH13_04G058800.m1. The putative uTIS for LOC_Os08g34190.1 is ACG156; and those for SoyZH13_04G058800.m1 are GTG180, CTG210, TTG282, ATT288, TTG312, and GTG318. (C) Orthogroup 3: Zm00001d026490_T015 and SoyZH13_13G138600.m2. Putative uTISs for Zm00001d026490_T015 are ATC63, CTG96, CTG117, GTG129, ATC138, ATC144, and CTG156; and for SoyZH13_13G138600.m2 are ATC12 and ACG18. uTISs within the alignments are marked in a black box, and those located further upstream are not shown due to alignment length constraints. The phylogenetic trees represent the evolutionary relationships among the orthologs, and the sequence alignments display the conservation of the equivalent regions of putative NTEs across different species.

Cross-kingdom comparative analysis of mechanistic features of non-AUG translation initiation between plants and humans

To gain deeper insight into the non-AUG translation initiation events across kingdoms, we systematically re-analyzed the non-AUG-initiated genes with NTEs reported in three previous human studies [29, 61, 62] and compared their features with those identified in plants in our study. In humans, codon composition analysis of uTISs for those non-AUG-initiated candidates revealed that CUG was the most prevalent uTIS, followed by GUG, whereas AUG was less frequently utilized relative to these near-cognate start codons (Fig. 7A). Length distribution analysis of putative NTEs showed that over 90% of the putative NTEs in all human samples were shorter than 300 nts (Fig. 7B).

Figure 7.

Figure 7 showing the comparative features of non-AUG translation initiation between plants and humans, including uTIS composition, NTE length distribution, minimum free energy distribution, and transcript features.

Cross-kingdom comparative analysis of mechanistic features of non-AUG translation initiation between plants and humans. (A) Codon composition of putative uTISs in non-AUG translation initiation events across three human datasets: Mix_2024 [62], Cell_2022 [29], and Mix_2018 [61], named based on sample types and publication year. (B) Length distribution of putative NTEs for non-AUG-initiated candidates of Mix_2024, Cell_2022, and Mix_2018. (C) Weblogo showing nucleotide frequencies flanking putative uTISs and annotated AUGs (positions −4 to +4) in Mix_2024, Cell_2022, and Mix_2018. The first nucleotide of the TIS is assigned as +1. (D) Gibbs free energy (dG) values indicating RNA secondary structure stability surrounding the putative uTISs and annotated AUGs in Mix_2024, Cell_2022, and Mix_2018. The region spans from 10-nt upstream to 80-nt downstream of the TIS codons. (E-F) Boxplots describing the lengths of primary transcripts and introns (E), and CDSs and 5′ UTRs (F) between non-AUG-initiated transcripts in Mix_2024 (green), Cell_2022 (orange), and Mix_2018 (blue) and background transcripts (grey). (G) Number of exons for non-AUG-initiated transcripts in the three datasets and background transcripts. For boxplots, box edges represent the 0.25 and 0.75 quantiles. Lines within the boxes represent the median values, hollow squares represent the mean values, and whiskers represent 1.5 × IQR. P values were calculated using a two-sided Student’s t-test, *P < 0.05, ***P < 0.001.

We next examined the nucleotide sequence context surrounding the putative uTISs in humans, and the results showed that G was the most enriched nucleotide at both positions −3 and +4, while C was most frequently observed at position −2 in all human samples (Fig. 7C). Analysis of local RNA secondary structure revealed that the average minimum free energy surrounding putative uTISs was significantly lower than that around annotated AUGs (Fig. 7D; Supplementary Fig. S7A, B, C), indicating increased local RNA stability at non-AUG start sites. Next, we analyzed the transcript features of these non-AUG-initiated candidates. The results showed that the primary transcript and intron lengths for non-AUG-initiated candidates tended to be shorter than the background levels, with statistical significance observed in Mix_2024 (Fig. 7E). Conversely, the average length of 5′ UTRs for non-AUG-initiated candidates was significantly longer than the background across all human samples (Fig. 7F). Additionally, these non-AUG candidates exhibited a greater number of exons compared to the background transcripts (Fig. 7G), again with statistical significance shown in Mix_2024.

Cross-kingdom comparison of these features revealed both conserved and lineage-specific characteristics of non-AUG translation initiation with NTEs. In both plants and humans, near-cognate start codons such as CUG and GUG were favored over AUG at uTIS positions. A shared feature across kingdoms was that regions flanking uTISs were associated with lower minimum free energy compared to those around annotated AUGs. In addition, non-AUG-initiated candidates with NTEs have significantly longer 5′ UTRs than the background transcripts in both plants and humans. Notably, we also observed key differences in nucleotide preference surrounding uTISs. For example, position −2 was most enriched for adenine (A) in plants but for cytosine (C) in humans, which is in line with the previously reported differences in the nucleotide features around the annotated AUGs in these species [49]. These comparative analyses highlight both conserved and divergent features of non-AUG translation initiation, offering new perspectives on its regulatory complexity and evolutionary conservation across eukaryotic systems.

Discussion

The growing recognition of non-canonical translation events, particularly the discovery of start codon plurality within a single mRNA, has fundamentally expanded our understanding of proteome complexity and translational regulation. Among these, non-AUG translation initiation represents a major source of proteome diversification. While its biological significance has been extensively studied in mammals, corresponding studies in plants remain scarce. To date, non-AUG translation initiation events with NTEs have only been reported in two dicot species, Arabidopsis and tomato [66–68, 69], leaving their prevalence and broader biological relevance across diverse plant lineages largely unexplored. In this study, we employed a computational prediction-combined proteogenomic strategy and systematically identified non-AUG translation initiation events with NTEs in the monocots maize and rice, as well as the dicot soybean. We detected 879 transcripts across the three plant species undergoing non-AUG translation initiation events, potentially producing an additional 3938 N-terminally extended proteoforms. Nevertheless, there are a few limitations to our approach. First, the MS-based approach is inherently biased toward detecting high-abundance proteins [70, 71], and stringent filtering criteria were applied during MS identification in our study to ensure data reliability, both of which may lead to the underrepresentation of non-AUG transcripts. Other complementary approaches could be applied to further improve detection sensitivity and expand the coverage. Second, we recognize that accurate determination of uTISs remains challenging. Although methods such as computational prediction and Ribo-seq analysis have been widely used to identify potential uTISs, they may still fail to detect uTISs with poor context or low initiation efficiency [68, 72]. In addition, genes with multiple active uTISs are frequently reported, with each resulting proteoform contributing to diverse biological processes [19, 22, 24, 26, 73]. Therefore, our study retained all potential uTISs associated with the identified non-AUG translation initiation events, including both proximal and distal sites whose NTE regions are supported by through or before peptides. In doing so, we aimed to capture the full diversity of non-AUG translation initiation events and provide a more comprehensive perspective on their general features. Further refinement will be needed to address these technical challenges.
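To make the notion of retaining both proximal and distal candidate uTISs concrete, the following Python sketch enumerates in-frame near-cognate codons upstream of an annotated AUG. It is an illustration under stated assumptions, not the prediction procedure used in this study: the function name `candidate_utis` is hypothetical, near-cognate codons are assumed to be those differing from AUG at exactly one position, and the upstream walk is assumed to end at the first in-frame stop codon, since an N-terminal extension cannot read through it.

```python
# Assumed candidate set: codons that differ from AUG at exactly one position.
NEAR_COGNATE = {
    a + b + c
    for a in "ACGU"
    for b in "ACGU"
    for c in "ACGU"
    if sum(x != y for x, y in zip(a + b + c, "AUG")) == 1
}
STOPS = {"UAA", "UAG", "UGA"}


def candidate_utis(transcript: str, aug_index: int):
    """List in-frame near-cognate codons upstream of the annotated AUG.

    aug_index is the 0-based position of the annotated start codon.
    Each hit is (index, codon, nte_length_in_codons), ordered from the
    most proximal site to the most distal one. The walk stops at the
    first in-frame stop codon encountered upstream.
    """
    rna = transcript.upper().replace("T", "U")
    hits = []
    for i in range(aug_index - 3, -1, -3):
        codon = rna[i:i + 3]
        if codon in STOPS:
            break
        if codon in NEAR_COGNATE:
            hits.append((i, codon, (aug_index - i) // 3))
    return hits


# Hypothetical toy transcript: stop, CUG, GCC, GUG, AAA, annotated AUG, GCU, stop.
toy = "UAACUGGCCGUGAAAAUGGCUUAA"
hits = candidate_utis(toy, 15)
```

In a proteogenomic setting, each candidate returned here would still need peptide evidence mapping to the predicted extension before being retained.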

Previous research has demonstrated that non-AUG translation can generate longer proteoforms with novel N-terminal signals, thereby influencing their subcellular localization and biological functions. For instance, non-AUG translation of tRNA synthetases in yeast produces proteoforms with mitochondrial targeting signals, enabling dual localization [11–13, 14, 41], and similar phenomena have been described for well-characterized mammalian genes such as PTEN [22, 23, 24], bFGF [74], and BAG-1 [19]. In our study, we identified over 800 transcripts undergoing non-AUG translation in three plant species. Among them, a portion of the corresponding NTE-containing proteoforms are predicted to harbor novel functional signals, including signal peptides, mitochondrial, chloroplast, and thylakoid luminal transit peptides, as well as transmembrane domains. It is also possible that some non-canonical peptides could be proteolytically processed from the non-AUG-derived proteoforms. These findings suggest that non-AUG-derived proteoforms may contribute to functional diversification by modulating subcellular localization and interaction networks in plants. Notably, we also observed that 18 transcripts in maize and two transcripts in soybean exhibited both non-AUG translation initiation and stop codon readthrough, as previously reported [4] (Supplementary Table S12). These genes are involved in fundamental processes, such as energy metabolism, photosynthesis, glycometabolism, and glutathione biosynthesis. Collectively, these examples highlight the intricate interplay between alternative translation mechanisms, such as non-AUG initiation and stop codon readthrough, in shaping proteome diversity. Elucidating how these layers of regulation are coordinated in plants represents an exciting direction for future research.

Our comparative analysis revealed both conserved and unique mechanistic features of non-AUG translation initiation events between plants and humans. First, near-cognate start codons, such as CUG and GUG, were favored over AUG as the potential uTISs in both plants and humans. This preference might be associated with the annotation pipelines, as the most upstream in-frame AUG codon is typically selected for annotation as the canonical initiation site, leading to the underrepresentation of AUGs in the 5′ UTRs [75]. We identified widespread upstream non-AUG translation across the examined plants, in line with prior omics-based reports in diverse organisms [5, 34, 62, 76]. These findings highlight an annotation gap: non-AUG uTISs are often excluded from current reference annotations by default and are typically incorporated only when supported by gene-specific experimental evidence, suggesting that upstream non-AUG translation may be systematically under-annotated. Beyond the codon usage, regions flanking uTISs consistently exhibited lower minimum free energy than those surrounding annotated AUGs across all studied species, suggesting a conserved tendency for uTIS flanking regions to form more stable secondary structures compared to annotated AUGs. These findings align with previous reports indicating that the stability of local RNA secondary structures may facilitate translation initiation from weaker initiation contexts [30, 77]. Furthermore, non-AUG-initiated candidates with NTEs displayed significantly longer 5′ UTRs than the background transcripts in all species examined, highlighting a potential regulatory role of extended 5′ UTRs in facilitating non-canonical initiation. Through evolutionary analyses, we found that non-AUG initiation events with NTEs show limited evolutionary conservation across the examined plant species. 
This pattern may be influenced by the limited completeness of transcript/5′ UTR annotation and does not exclude the possibility that a fraction of events may reflect biological noise [78]. Importantly, lack of conservation does not necessarily imply lack of function: although conserved non-AUG translation initiation events have been documented in closely related mammals [23, 28, 29], non-conserved or lineage-specific cases have also been reported across diverse organisms, including human, mouse, yeast, and Escherichia coli [5, 29, 34, 62, 76], many of which have documented biological significance [8, 29, 34]. In addition, it should be noted that the apparent degree of conservation can be influenced by the identification strategy, as comparative genomics is more likely to recover conserved events shared across species, whereas proteogenomic and ribosome profiling approaches can additionally capture lineage-specific or poorly conserved cases. Our observations in plants are consistent with these findings.

Interestingly, the comparative analysis revealed that several mechanistic features of these non-AUG translation initiation events in monocots more closely resemble those in humans than those in the dicot soybean. This observation suggests a possible evolutionary convergence in the regulatory architecture of non-AUG initiation between monocots and humans. For instance, in monocots and humans, G is the predominant nucleotide at positions −3 and +4 surrounding uTISs, a feature associated with enhanced initiation efficiency, whereas A is more frequent in the dicot soybean. Likewise, G or C is more prevalent at positions −4 and −1 surrounding the uTISs in the two monocots and humans, whereas U is more common at these positions in soybean. Similar patterns of uracil preference near uTISs have also been reported in dicots like cotton and tomato, but not in monocot rice [51, 68, 79], suggesting that such a preference may be specific to dicot species. Furthermore, a higher GC-content in NTEs than in the CDSs was observed in monocots, which parallels patterns reported in humans [62]. In contrast, this trend was not observed in the dicot soybean, indicating lineage-specific differences in sequence features. The difference in GC-richness may be partly associated with differences in upstream-region conservation between monocot and dicot species. Moreover, the lengths of primary transcripts and introns for non-AUG-initiated candidates tended to be shorter compared to background genes in both monocots and humans, whereas the opposite pattern was observed in the dicot soybean. Greater similarity of the sequence context between monocots and humans than between dicots and humans has also been noted in previous studies [80, 81]. Nevertheless, cross-kingdom comparisons could be influenced by technical or pipeline-related differences between studies, so they should be interpreted with caution. Collectively, these patterns point to the diversity and plasticity of sequence and structural features associated with non-AUG translation initiation across kingdoms.
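The GC-content comparison between an NTE and its downstream CDS reduces to a simple per-transcript calculation, sketched below in Python. This is a minimal illustration rather than the analysis code of this study: the function names (`gc_content`, `nte_vs_cds_gc`), the coordinate convention (0-based, end-exclusive), and the toy transcript are all assumptions made for the example.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a non-empty sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(nt in "GC" for nt in seq) / len(seq)


def nte_vs_cds_gc(transcript: str, utis_index: int, aug_index: int, cds_end: int):
    """Return (GC of the NTE, GC of the annotated CDS) for one transcript.

    The NTE is taken as the region from the putative uTIS up to, but not
    including, the annotated AUG; the CDS runs from the annotated AUG to
    cds_end. All coordinates are 0-based and end-exclusive.
    """
    nte = transcript[utis_index:aug_index]
    cds = transcript[aug_index:cds_end]
    return gc_content(nte), gc_content(cds)


# Hypothetical toy transcript: a GC-rich extension followed by an AU-rich CDS.
toy = "CUGGCCGCG" + "AUGAAAUUUUAA"
nte_gc, cds_gc = nte_vs_cds_gc(toy, 0, 9, len(toy))
```

Applied across all candidates of a species, the paired values could be summarized to check whether NTEs are more GC-rich than their CDSs, as described above for monocots.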

In summary, we systematically identified hundreds of non-AUG translation initiation events in three plant species using a computational prediction-combined proteogenomic strategy. These events potentially generated 3938 N-terminally extended proteoforms, greatly expanding the known landscape of translational diversity in plants. The extensive repertoire of non-AUG-derived proteoforms presented here serves as a valuable resource for future functional studies. Future efforts should aim to experimentally validate the biological functions and regulatory mechanisms of these proteoforms, particularly their roles in plant development, stress responses, and adaptive traits. Our study reveals both the biological relevance of non-AUG translation initiation events in plants and their comparative features with those in humans, thus providing a foundational resource and conceptual framework for advancing the study of non-canonical translation and its functional impacts across eukaryotic systems.

Acknowledgements

We thank Dr Anguo Sun and Dr Yanwen Xiang for their technical assistance and fruitful discussions on the manuscript.

Author contributions: L.W. (Conceptualization [lead], Funding acquisition [lead], Methodology [lead], Project administration [lead], Supervision [lead], Writing – review & editing [lead]), Y.Z. (Data curation [lead], Formal Analysis [equal], Visualization [lead], Writing – original draft [lead], Writing – review & editing [equal]), B.C. (Data curation [supporting], Formal Analysis [equal], Visualization [supporting], Writing – original draft [supporting], Writing – review & editing [equal]), S.W. (Formal Analysis [supporting], Resources [supporting], Writing – review & editing [supporting]), L.T. (Data curation [supporting], Formal Analysis [supporting], Writing – review & editing [supporting]), H.Z. (Data curation [supporting], Formal Analysis [supporting], Visualization [supporting]), W.L. (Formal Analysis [supporting], Writing – review & editing [supporting]), H.Z. (Formal Analysis [supporting], Writing – review & editing [supporting]), S.Z. (Formal Analysis [supporting], Writing – review & editing [supporting]).

Notes

Present address: College of Horticulture, Henan Agricultural University, Zhengzhou, Henan 450046, China

Contributor Information

Yuqian Zhang, State Key Laboratory of High-Efficiency Production of Wheat-Maize Double Cropping, Henan Agricultural University, Zhengzhou, Henan 450046, China; School of Environmental and Rural Science, University of New England, Armidale, NSW 2351, Australia.

Baiyang Chang, State Key Laboratory of High-Efficiency Production of Wheat-Maize Double Cropping, Henan Agricultural University, Zhengzhou, Henan 450046, China.

Shunxi Wang, State Key Laboratory of High-Efficiency Production of Wheat-Maize Double Cropping, Henan Agricultural University, Zhengzhou, Henan 450046, China.

Lei Tian, State Key Laboratory of High-Efficiency Production of Wheat-Maize Double Cropping, Henan Agricultural University, Zhengzhou, Henan 450046, China.

Huina Zhao, State Key Laboratory of High-Efficiency Production of Wheat-Maize Double Cropping, Henan Agricultural University, Zhengzhou, Henan 450046, China.

Weiwei Luo, State Key Laboratory of High-Efficiency Production of Wheat-Maize Double Cropping, Henan Agricultural University, Zhengzhou, Henan 450046, China.

Hanxue Zhang, School of Agriculture, Yunnan University, Kunming, Yunnan 650500, China.

Shoudong Zhang, School of Agriculture, Yunnan University, Kunming, Yunnan 650500, China.

Shubiao Wu, School of Environmental and Rural Science, University of New England, Armidale, NSW 2351, Australia.

Liuji Wu, State Key Laboratory of High-Efficiency Production of Wheat-Maize Double Cropping, Henan Agricultural University, Zhengzhou, Henan 450046, China; School of Environmental and Rural Science, University of New England, Armidale, NSW 2351, Australia.

Supplementary data

Supplementary data is available at NAR online.

Conflict of interest

None declared.

Funding

This work was supported by the National Natural Science Foundation of China (grant number U22A20474, 32172073) and the National Key Research and Development Program of China (grant number 2022YFD1201802).

Data availability

The data underlying this article are available in the article and in its online supplementary material. The raw datasets of proteomics have been deposited at the ProteomeXchange under the accession ID PXD058528.

References

  • 1. Brar GA. Beyond the triplet code: context cues transform translation. Cell. 2016;167:1681–92. 10.1016/j.cell.2016.09.022.
  • 2. Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011;147:789–802. 10.1016/j.cell.2011.10.002.
  • 3. Atkins JF, Loughran G, Bhatt PR et al. Ribosomal frameshifting and transcriptional slippage: from genetic steganography and cryptography to adventitious use. Nucleic Acids Res. 2016;44:7007–78.
  • 4. Zhang Y, Li H, Shen Y et al. Readthrough events in plants reveal plasticity of stop codons. Cell Rep. 2024;43:113723. 10.1016/j.celrep.2024.113723.
  • 5. Sapkota D, Lake AM, Yang W et al. Cell-type-specific profiling of alternative translation identifies regulated protein isoform variation in the mouse brain. Cell Rep. 2019;26:594–607. 10.1016/j.celrep.2018.12.077.
  • 6. Kim MS, Pinto SM, Getnet D et al. A draft map of the human proteome. Nature. 2014;509:575–81. 10.1038/nature13302.
  • 7. Wright BW, Yi Z, Weissman JS et al. The dark proteome: translation from noncanonical open reading frames. Trends Cell Biol. 2022;32:243–58. 10.1016/j.tcb.2021.10.010.
  • 8. Andreev DE, Loughran G, Fedorova AD et al. Non-AUG translation initiation in mammals. Genome Biol. 2022;23:111. 10.1186/s13059-022-02674-2.
  • 9. Fiers W, Contreras R, Duerinck F et al. A-protein gene of bacteriophage MS2. Nature. 1975;256:273–8. 10.1038/256273a0.
  • 10. Steege DA. 5′-Terminal nucleotide sequence of Escherichia coli lactose repressor mRNA: features of translational initiation and reinitiation sites. Proc Natl Acad Sci USA. 1977;74:4163–7. 10.1073/pnas.74.10.4163.
  • 11. Natsoulis G, Hilger F, Fink GR. The HTS1 gene encodes both the cytoplasmic and mitochondrial histidine tRNA synthetases of S. cerevisiae. Cell. 1986;46:235–43. 10.1016/0092-8674(86)90740-3.
  • 12. Chatton B, Walter P, Ebel JP et al. The yeast VAS1 gene encodes both mitochondrial and cytoplasmic valyl-tRNA synthetases. J Biol Chem. 1988;263:52–7. 10.1016/S0021-9258(19)57354-9.
  • 13. Chang KJ, Wang CC. Translation initiation from a naturally occurring non-AUG codon in Saccharomyces cerevisiae. J Biol Chem. 2004;279:13778–85. 10.1074/jbc.M311269200.
  • 14. Tang HL, Yeh LS, Chen NK et al. Translation of a yeast mitochondrial tRNA synthetase initiated at redundant non-AUG codons. J Biol Chem. 2004;279:49656–63. 10.1074/jbc.M408081200.
  • 15. Hann SR, King MW, Bentley DL et al. A non-AUG translational initiation in c-myc exon 1 generates an N-terminally distinct protein whose synthesis is disrupted in Burkitt’s lymphomas. Cell. 1988;52:185–95. 10.1016/0092-8674(88)90507-7.
  • 16. Hann SR, Sloan-Brown K, Spotts GD. Translational activation of the non-AUG-initiated c-myc 1 protein at high cell densities due to methionine deprivation. Genes Dev. 1992;6:1229–40. 10.1101/gad.6.7.1229.
  • 17. Hann SR, Dixit M, Sears RC et al. The alternatively initiated c-myc proteins differentially regulate transcription through a noncanonical DNA-binding site. Genes Dev. 1994;8:2441–52. 10.1101/gad.8.20.2441.
  • 18. Froesch BA, Takayama S, Reed JC. BAG-1L protein enhances androgen receptor function. J Biol Chem. 1998;273:11660–6. 10.1074/jbc.273.19.11660.
  • 19. Yang X, Chernenko G, Hao Y et al. Human BAG-1/RAP46 protein is generated as four isoforms by alternative translation initiation and overexpressed in cancer cells. Oncogene. 1998;17:981–9. 10.1038/sj.onc.1202032.
  • 20. Cato L, Neeb A, Sharp A et al. Development of Bag-1L as a therapeutic target in androgen receptor-dependent prostate cancer. eLife. 2017;6:e27159. 10.7554/eLife.27159.
  • 21. Katsman M, Azriel A, Horev G et al. N-VEGF, the autoregulatory arm of VEGF-A. Cells. 2022;11:1289.
  • 22. Hopkins BD, Fine B, Steinbach N et al. A secreted PTEN phosphatase that enters cells to alter signaling and survival. Science. 2013;341:399–402. 10.1126/science.1234907.
  • 23. Liang H, He S, Yang J et al. PTENα, a PTEN isoform translated through alternative initiation, regulates mitochondrial function and energy metabolism. Cell Metab. 2014;19:836–48. 10.1016/j.cmet.2014.03.023.
  • 24. Liang H, Chen X, Yin Q et al. PTENβ is an alternatively translated isoform of PTEN that regulates rDNA transcription. Nat Commun. 2017;8:14771. 10.1038/ncomms14771.
  • 25. Zhang Q, Liang H, Zhao X et al. PTENε suppresses tumor metastasis through regulation of filopodia formation. EMBO J. 2021;40:e105806. 10.15252/embj.2020105806.
  • 26. Florkiewicz RZ, Sommer A. Human basic fibroblast growth factor gene encodes four polypeptides: three initiate translation from non-AUG codons. Proc Natl Acad Sci USA. 1989;86:3978–81. 10.1073/pnas.86.11.3978.
  • 27. Arnaud E, Touriol C, Boutonnet C et al. A new 34-kilodalton isoform of human fibroblast growth factor 2 is cap dependently synthesized by using a non-AUG start codon and behaves as a survival factor. Mol Cell Biol. 1999;19:505–14. 10.1128/MCB.19.1.505.
  • 28. Ivanov IP, Firth AE, Michel AM et al. Identification of evolutionarily conserved non-AUG-initiated N-terminal extensions in human coding sequences. Nucleic Acids Res. 2011;39:4220–34. 10.1093/nar/gkr007.
  • 29. Fedorova AD, Kiniry SJ, Andreev DE et al. Thousands of human non-AUG extended proteoforms lack evidence of evolutionary selection among mammals. Nat Commun. 2022;13:7910. 10.1038/s41467-022-35595-6.
  • 30. Lee S, Liu B, Lee S et al. Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc Natl Acad Sci USA. 2012;109:E2424–32. 10.1073/pnas.1207846109.
  • 31. Van Damme P, Gawron D, Van Criekinge W et al. N-terminal proteomics and ribosome profiling provide a comprehensive view of the alternative translation initiation landscape in mice and men. Mol Cell Proteomics. 2014;13:1245–61. 10.1074/mcp.M113.036442.
  • 32. Menschaert G, Van Criekinge W, Notelaers T et al. Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events. Mol Cell Proteomics. 2013;12:1780–90. 10.1074/mcp.M113.027540.
  • 33. Acland P, Dixon M, Peters G et al. Subcellular fate of the int-2 oncoprotein is determined by choice of initiation codon. Nature. 1990;343:662–5. 10.1038/343662a0.
  • 34. Eisenberg AR, Higdon AL, Hollerer I et al. Translation initiation site profiling reveals widespread synthesis of non-AUG-initiated protein isoforms in yeast. Cell Syst. 2020;11:145–160.e5. 10.1016/j.cels.2020.06.011.
  • 35. Kritsiligkou P, Chatzi A, Charalampous G et al. Unconventional targeting of a thiol peroxidase to the mitochondrial intermembrane space facilitates oxidative protein folding. Cell Rep. 2017;18:2729–41. 10.1016/j.celrep.2017.02.053.
  • 36. Müller C, Bremer A, Schreiber S et al. Nucleolar retention of a translational C/ebpα isoform stimulates rDNA transcription and cell size. EMBO J. 2010;29:897–909.
  • 37. Vagner S, Touriol C, Galy B et al. Translation of CUG- but not AUG-initiated forms of human fibroblast growth factor 2 is activated in transformed and stressed cells. J Cell Biol. 1996;135:1391–402. 10.1083/jcb.135.5.1391.
  • 38. Gerashchenko MV, Lobanov AV, Gladyshev VN. Genome-wide ribosome profiling reveals complex translational regulation in response to oxidative stress. Proc Natl Acad Sci USA. 2012;109:17394–9. 10.1073/pnas.1120799109.
  • 39. Wang S, Tian L, Liu H et al. Large-scale discovery of non-conventional peptides in maize and arabidopsis through an integrated peptidogenomic pipeline. Mol Plant. 2020;13:1078–93. 10.1016/j.molp.2020.05.012.
  • 40. Zielinska DF, Gnad F, Wiśniewski JR et al. Precision mapping of an in vivo N-glycoproteome reveals rigid topological and sequence constraints. Cell. 2010;141:897–907. 10.1016/j.cell.2010.04.012.
  • 41. Monteuuis G, Miścicka A, Świrski M et al. Non-canonical translation initiation in yeast generates a cryptic pool of mitochondrial proteins. Nucleic Acids Res. 2019;47:5777–91. 10.1093/nar/gkz301.
  • 42. Kearse MG, Wilusz JE. Non-AUG translation: a new start for protein synthesis in eukaryotes. Genes Dev. 2017;31:1717–31. 10.1101/gad.305250.117.
  • 43. Diaz de Arce AJ, Noderer WL, Wang CL. Complete motif analysis of sequence requirements for translation initiation at non-AUG start codons. Nucleic Acids Res. 2018;46:985–94. 10.1093/nar/gkx1114.
  • 44. Tyanova S, Temu T, Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc. 2016;11:2301–19. 10.1038/nprot.2016.136.
  • 45. Walley JW, Sartor RC, Shen Z et al. Integration of omic networks in a developmental atlas of maize. Science. 2016;353:814–8. 10.1126/science.aag1125.
  • 46. Li ST, Ke Y, Zhu Y et al. Mass spectrometry-based proteomic landscape of rice reveals a post-transcriptional regulatory role of N(6)-methyladenosine. Nat Plants. 2024;10:1201–14. 10.1038/s41477-024-01745-5.
  • 47. Lei L, Shi J, Chen J et al. Ribosome profiling reveals dynamic translational landscape in maize seedlings under drought stress. Plant J. 2015;84:1206–18. 10.1111/tpj.13073.
  • 48. Chen S, Zhou Y, Chen Y et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. 10.1093/bioinformatics/bty560.
  • 49. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9. 10.1038/nmeth.1923.
  • 50. Roberts A, Pachter L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods. 2013;10:71–3. 10.1038/nmeth.2251.
  • 51. Zhu XT, Zhou R, Che J et al. Ribosome profiling reveals the translational landscape and allele-specific translational efficiency in rice. Plant Commun. 2023;4:100457. 10.1016/j.xplc.2022.100457.
  • 52. Dobin A, Davis CA, Schlesinger F et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. 10.1093/bioinformatics/bts635.
  • 53. Zhang P, He D, Xu Y et al. Genome-wide identification and differential analysis of translational initiation. Nat Commun. 2017;8:1749. 10.1038/s41467-017-01981-8.
  • 54. Chen C, Wu Y, Li J et al. TBtools-II: a “one for all, all for one” bioinformatics platform for biological big-data mining. Mol Plant. 2023;16:1733–42. 10.1016/j.molp.2023.09.010.
  • 55. Lorenz R, Bernhart SH, Höner Zu Siederdissen C et al. ViennaRNA package 2.0. Algorithms Mol Biol. 2011;6:26. 10.1186/1748-7188-6-26.
  • 56. Almagro Armenteros JJ, Salvatore M, Emanuelsson O et al. Detecting sequence signals in targeting peptides using deep learning. Life Sci Alliance. 2019;2:e201900429. 10.26508/lsa.201900429.
  • 57. Krogh A, Larsson B, von Heijne G et al. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001;305:567–80. 10.1006/jmbi.2000.4315.
  • 58. Kunze M. The type-2 peroxisomal targeting signal. Biochim Biophys Acta Mol Cell Res. 2020;1867:118609. 10.1016/j.bbamcr.2019.118609.
  • 59. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238. 10.1186/s13059-019-1832-y.
  • 60. Tamura K, Stecher G, Kumar S. MEGA11: molecular evolutionary genetics analysis version 11. Mol Biol Evol. 2021;38:3022–7. 10.1093/molbev/msab120.
  • 61. Na CH, Barbhuiya MA, Kim MS et al. Discovery of noncanonical translation initiation sites through mass spectrometric analysis of protein N termini. Genome Res. 2018;28:25–36. 10.1101/gr.226050.117.
  • 62. Rodriguez JM, Abascal F, Cerdán-Vélez D et al. Evidence for widespread translation of 5' untranslated regions. Nucleic Acids Res. 2024;52:8112–26. 10.1093/nar/gkae571.
  • 63. Liu Y, Beyer A, Aebersold R. On the dependency of cellular protein levels on mRNA abundance. Cell. 2016;165:535–50. 10.1016/j.cell.2016.03.014.
  • 64. Nakagawa S, Niimura Y, Gojobori T et al. Diversity of preferred nucleotide sequences around the translation initiation codon in eukaryote genomes. Nucleic Acids Res. 2008;36:861–71. 10.1093/nar/gkm1102.
  • 65. Šmarda P, Bureš P, Šmerda J et al. Measurements of genomic GC content in plant genomes with flow cytometry: a test for reliability. New Phytol. 2012;193:513–21.
  • 66. Simpson GG, Laurie RE, Dijkwel PP et al. Noncanonical translation initiation of the Arabidopsis flowering time and alternative polyadenylation regulator FCA. Plant Cell. 2010;22:3764–77. 10.1105/tpc.110.077990.
  • 67. van der Horst S, Snel B, Hanson J et al. Novel pipeline identifies new upstream ORFs and non-AUG initiating main ORFs with conserved amino acid sequences in the 5' leader of mRNAs in Arabidopsis thaliana. RNA. 2019;25:292–304. 10.1261/rna.067983.118.
  • 68. Li YR, Liu MJ. Prevalence of alternative AUG and non-AUG translation initiators and their regulatory effects across plants. Genome Res. 2020;30:1418–33. 10.1101/gr.261834.120.
  • 69. Willems P, Ndah E, Jonckheere V et al. N-terminal proteomics assisted profiling of the unexplored translation initiation landscape in Arabidopsis thaliana. Mol Cell Proteomics. 2017;16:1064–80. 10.1074/mcp.M116.066662.
  • 70. Nakayasu ES, Gritsenko M, Piehowski PD et al. Tutorial: best practices and considerations for mass-spectrometry-based protein biomarker discovery and validation. Nat Protoc. 2021;16:3737–60. 10.1038/s41596-021-00566-6.
  • 71. Guzman UH, Martinez-Val A, Ye Z et al. Ultra-fast label-free quantification and comprehensive proteome coverage with narrow-window data-independent acquisition. Nat Biotechnol. 2024;42:1855–66. 10.1038/s41587-023-02099-7.
  • 72. Tong G, Hah N, Martinez TF. Comparison of software packages for detecting unannotated translated small open reading frames by Ribo-seq. Brief Bioinform. 2024;25:bbae268.
  • 73. Wamboldt Y, Mohammed S, Elowsky C et al. Participation of leaky ribosome scanning in protein dual targeting by alternative translation initiation in higher plants. Plant Cell. 2009;21:157–67. 10.1105/tpc.108.063644.
  • 74. Bugler B, Amalric F, Prats H. Alternative initiation of translation determines cytoplasmic or nuclear localization of basic fibroblast growth factor. Mol Cell Biol. 1991;11:573–7.
  • 75. Fields AP, Rodriguez EH, Jovanovic M et al. A regression-based analysis of ribosome-profiling data reveals a conserved complexity to mammalian translation. Mol Cell. 2015;60:816–27. 10.1016/j.molcel.2015.11.013.
  • 76. Meydan S, Marks J, Klepacki D et al. Retapamulin-assisted ribosome profiling reveals the alternative bacterial proteome. Mol Cell. 2019;74:481–93. 10.1016/j.molcel.2019.02.017.
  • 77. Kozak M. Downstream secondary structure facilitates recognition of initiator codons by eukaryotic ribosomes. Proc Natl Acad Sci USA. 1990;87:8301–5. 10.1073/pnas.87.21.8301.
  • 78. Tress ML. The degradation of extended protein isoforms points to a misfiring translation initiation process. Mol Genet Genomics. 2025;301:3. 10.1007/s00438-025-02324-9.
  • 79. Qanmber G, You Q, Yang Z et al. Transcriptional and translational landscape fine-tune genome annotation and explores translation control in cotton. J Adv Res. 2024;58:13–30. 10.1016/j.jare.2023.05.004.
  • 80. Kozak M. An analysis of 5'-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res. 1987;15:8125–48. 10.1093/nar/15.20.8125.
  • 81. Fang  JC, Liu  MJ. Translation initiation at AUG and non-AUG triplets in plants. Plant Sci. 2023;335:111822. 10.1016/j.plantsci.2023.111822. [DOI] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

gkag303_Supplemental_Files

Data Availability Statement

The data underlying this article are available in the article and in its online supplementary material. The raw proteomics datasets have been deposited in ProteomeXchange under accession ID PXD058528.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
