Abstract
The high complexity of eukaryotic organisms enabled their evolutionary success, driven by the diversification of their proteomes. Various mechanisms contributed to this process. Alternative splicing had the largest known impact among these mechanisms. Earlier, we hypothesized that along with alternative splicing, a different but conceptually similar mechanism creates novel versions of existing proteins in all eukaryotes. However, this mechanism operates at the level of translation, where amino acid sequence novelty arises through multiple programmed ribosomal frameshifting events occurring within the same transcript. This mechanism, which is termed mosaic translation, is very difficult to demonstrate even with the most up-to-date molecular tools. Thus, it remained unnoticed so far. Using a subset of mass spectrometry proteomic data from various organs of the model plant Medicago truncatula, we took the first step toward experimental validation of this hypothesis. Our original in silico approach resulted in the discovery of two candidates for mosaic proteins (homologs of EF1α and RuBisCo) and 154 candidates for chimeric peptides. Chimeric peptides and polypeptides are produced in the course of one ribosomal frameshifting event and may correspond to parts of mosaic proteins. In addition, our analysis reveals the possibility of translation of chimeric peptides from five ribosomal RNA transcripts, ten long non-coding RNA transcripts, and one transfer RNA transcript. These findings are novel and will form the basis for future experimental validation. We also present multiple lines of indirect evidence supporting the validity of our in silico data.
Keywords: Chimeric peptide, Programmed ribosomal frameshifting, Alternative open reading frame, Elongation factor, RuBisCo, Mosaic translation
1. Introduction
Recently, it has been recognized that eukaryotic transcripts have a polycistronic nature, which revolutionized our understanding of the proteome complexity [3]. A pivotal aspect of this paradigm shift was the discovery of translated alternative open reading frames (altORFs) and their cataloging across diverse organisms [4], [5], [6]. These regions of transcripts can be defined as relatively long stop-free sequences in any reading frame. Products of their translation, alternative proteins (altProts), may resemble some annotated proteins or be entirely unique. In both cases, they are thought to be an important source of protein novelty in evolution [9], [7], [8]. AltORFs often overlap with annotated coding sequences, also called reference ORFs (refORFs), or other altORFs located on the same transcript [3], [8]. A refORF is typically defined as the longest ORF in a given mRNA transcript, which is assumed to be translated conventionally to a reference protein (refProt). Previously, the high abundance of conserved altORFs in plant transcriptomes led us to hypothesize that information from overlapping ORFs could be combined into continuous polypeptides through multiple programmed ribosomal frameshifting (PRF), potentially contributing to an organism’s adaptability to internal and external conditions [10]. A chimeric peptide or polypeptide originates from a single PRF event per transcript, with only a few known examples in prokaryotes [15], [11], [12], [13], [14] and many more in eukaryotes [16], [17], [18], [19], [20]. By contrast, a peptide or polypeptide produced via two or more PRF events is mosaic because it can stitch together translational products of more than two reading frames [21]. Thus, we refer to this mode of translation as mosaic translation. Mosaic Gag-Pro-Pol polypeptides produced by this mechanism have been described in some viruses [24], [22], [23]. 
Initially, it was thought that viruses must overlap their genes at high density to make more complete use of their limited genomic space. However, a persuasive argument against this so-called compression theory was later offered by Brandes & Linial [25]. They proposed that small genome sizes and overlapping genes are adaptations that boost gene novelty and evolutionary exploration in viruses. In contrast, this way of using the coding potential of altORFs has not been anticipated to play an important role in eukaryotes, probably because of their large genomes. To the best of our knowledge, only one study has suggested the existence of mosaic proteins in eukaryotes [21]. We went a step further and proposed that mosaic translation may be a ubiquitous phenomenon of fundamental importance in all domains of life [10], similar to its newly proposed role in viruses [25]. Our chief argument was the enormous biological advantage that should result from the manifold expansion of protein-coding capacity. Direct validation of this hypothesis is nearly impossible without a breakthrough in the achievable read length of protein sequencing technologies. The plan to develop a long-read method by which mosaic proteins can be discovered at a large scale has been mentioned by Timp and Timp [26]. It is based on the nanopore sequencing principle, which proved to be very useful in long-read sequencing of nucleic acids [27]. Despite very recent revolutionary developments in nanopore sequencing of proteins, the long-read version of the method is still not available due to major challenges. The first challenge is slowing down and stretching the long polypeptide molecules during their transit through the nanopore. The second challenge is discriminating among amino acids with different posttranslational modifications [29], [28].
In the absence of any adequate alternative to protein nanopore sequencing, we earlier proposed a strategy that can detect candidates for mosaic translation based on available mass spectrometry (MS) proteomic data. If conserved and/or translated altORFs overlap with refORFs, other altORFs, or both, possible PRF events that change the translation from one frame to another can be modeled. Once the frameshifted sequences are modeled, they can be used as a query database in the searches for corresponding MS peptides in biological samples. In this way, we proposed to detect individual chimeric peptides first. If two or more chimeric peptides validated by MS originate from the same transcript, depending on the frameshift type and the distance between PRF sites, they can be parts of a mosaic protein. Despite its substantial computational demands, the principal advantage of our approach lies in its reliance on MS proteomic data. For many organisms, such data are already available in very large amounts [31], [30]. Up to 60 % of high-quality peptides derived from eukaryotes that have been identified by MS cannot be assigned to any genomic location [32], [33]. A similar problem has been found in assigning the minimal proteome of the smallest culturable bacterium with only 729 ORFs [34]. Earlier, it was suggested that this unmapped “dark” proteome could be composed of translational products of altORFs [3], [8]. A recent study in humans identified many thousands of previously unknown peptides encoded by so-called non-canonical ORFs (ncORFs), which further emphasizes the true complexity of the proteome [35]. We extend this idea to mosaic peptides and proteins, which are made of altORF/ncORF translation blocks. Novel PRF sites can be detected with ribosome profiling, or Ribo-Seq [39], [36], [37], [38].
While Ribo-Seq-based information about positions of the frameshifts is expensive and technically challenging to generate, it is also indirect compared to the information deduced from MS proteomic reads. Many organisms have no Ribo-Seq datasets deposited publicly, including the model plant Medicago truncatula.
Here we demonstrate the efficient application of our MS-based approach on three selected proteomic datasets from M. truncatula. To our knowledge, this pilot study is the first attempt to demonstrate the existence of mosaic peptides and proteins in a non-viral biological system. Although only two strong candidates for mosaic translation were identified in the three selected datasets, the approach can be applied to additional M. truncatula datasets and to MS data from other organisms, potentially leading to the discovery of many mosaic proteins. Importantly, in the course of this work, we also detected a number of chimeric peptides that was previously unanticipated for a eukaryotic genome. We hope that the novelty of our observations will ultimately bring chimeric and mosaic proteins into the spotlight and will motivate the scientific community to apply our approach to other organisms, including humans.
It is important to explain why we chose to address a fundamental question, such as the abundance of PRF in a eukaryotic organism, by studying the proteome of M. truncatula. This phenomenon could be studied in humans or the well-established model plant Arabidopsis thaliana, both of which have ample MS proteomic and Ribo-Seq resources. Our motivation is driven by one major global challenge: long-term food security. Modern intensive agriculture depends heavily on non-renewable resources (oil, natural gas, coal, and uranium) for production of nitrogen fertilizers. It also depends on the steady supply of rock phosphate as a source of phosphorus. These resources will be depleted relatively soon: in 30–150 years and 70–140 years from now, respectively [40], [41], [42]. In contrast to humans and A. thaliana, M. truncatula can undergo mutualistic symbioses with nitrogen-fixing bacteria (rhizobia) and arbuscular mycorrhizal (AM) fungi [43], [44], [45], [46]. These symbioses are natural ways that legume plants use to obtain their nitrogen and phosphorus, respectively. Deciphering the genetic programs underlying these symbioses could enable the transfer of symbiotic traits to major crop plants such as wheat, rice, and maize, all of which currently depend on synthetic nitrogen fertilizers and rock phosphate [47], [48], [49]. Thus, we chose to study chimeric and possibly mosaic proteins in M. truncatula because of their potential broad relevance to symbiotic nitrogen fixation and AM symbiosis, among many other biological processes. Our study is part of the ongoing global effort to achieve sustainable long-term food security via uncoupling agricultural production from non-renewable energy sources.
2. Material and methods
2.1. In silico extraction and translation of altProts
The M. truncatula genome assembly and annotated features (v. 5.1.7) were downloaded from the M. truncatula genome portal MtrunA17r5.0-ANR ([50]; https://medicago.toulouse.inra.fr/MtrunA17r5.0-ANR/). In each transcript, we identified regions that are at least 60 nucleotides long and do not contain any in-frame stop codons in any of the three forward reading frames. These regions were then translated into sequences of amino acids using the standard genetic code. This way, we generated a query database of peptide and polypeptide sequences (altProts), from which reference proteins (refProts) were eliminated based on their identity to sequences from the annotated proteome. RefProts are usually hypothetical (but sometimes MS-validated) products of annotated refORFs, which are assumed to be translated conventionally (without PRF), one refORF per mRNA transcript. The presence of refProt-derived MS peptides in a proteomic sample is expected. Thus, altProt-specific MS peptides must be distinguished from refProt-derived ones. Because many altProts are identical to refProts, their removal from the query database is essential. Input and output sequences of the corresponding script are FASTA-formatted. The script supplies output sequences with unique identifiers that contain the following information: locus identifier, reading frame (three forward frames denoted as 1F to 3F; frame 1F starts with the first nucleotide of a transcript, etc.), nucleotide coordinates of the extracted region on the transcript (first base and last base), and the length of the extracted region on the transcript in bases. The code for generating this query database was published elsewhere [1].
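The extraction step described above can be illustrated with a minimal Python sketch. It is not the published script [1]; the function name `stop_free_regions`, the use of "X" for ambiguous codons, and the choice to measure region length without the terminating stop codon are assumptions made here for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def stop_free_regions(transcript, min_nt=60):
    """Return (frame, first_base, last_base, peptide) for every stop-free
    stretch of at least min_nt nucleotides in the three forward frames.
    Coordinates are 1-based and inclusive (an assumption of this sketch)."""
    seq = transcript.upper().replace("U", "T")
    found = []
    for frame in range(3):
        region_start, peptide = frame, []
        for pos in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[pos:pos + 3], "X")  # X = ambiguous codon
            if aa == "*":
                # A stop codon closes the current region.
                if pos - region_start >= min_nt:
                    found.append((f"{frame + 1}F", region_start + 1, pos,
                                  "".join(peptide)))
                region_start, peptide = pos + 3, []
            else:
                peptide.append(aa)
        # Region that runs to the end of the transcript without a stop.
        region_end = region_start + 3 * len(peptide)
        if region_end - region_start >= min_nt:
            found.append((f"{frame + 1}F", region_start + 1, region_end,
                          "".join(peptide)))
    return found
```

With a threshold of 60 nt, the shortest region retained corresponds to a 20-aa translation product.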
2.2. Sequence similarity searches using a protein reference database
Each altProt was compared with entries from the reference protein database UniProt v. 2020_02 [51] using DIAMOND v. 0.9.14 [52]. To build the DIAMOND search database using the “makedb” command, a single FASTA file was generated by concatenating the downloaded UniProt sequences. The following parameters were used for the similarity searches: “max-target-seqs”, which is the maximum number of hits, was set to zero; this setting enabled the report of all hits per query; the maximal expect value was set to 10⁻³; to enable higher sensitivity of the searches, an option “more-sensitive” was used. For efficient downstream analysis of significant hits, the following information was recorded for each sequence with at least one hit: “qseqid”, “stitle”, “length”, “pident”, “ppos”, “qcovhsp”, “evalue”, and the number of hits. The information on taxonomy of all hits was recorded for in-depth phylogenetic analysis.
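The settings listed above map onto DIAMOND command-line options roughly as sketched below. The file names and the helper `diamond_commands` are hypothetical, and the exact invocation used in the study is not given in the text, so this is only one plausible reading of the stated parameters.

```python
def diamond_commands(uniprot_fasta, query_fasta, db_prefix, out_tsv):
    """Build argument lists for `diamond makedb` and `diamond blastp`.
    All paths are placeholders supplied by the caller."""
    makedb = ["diamond", "makedb", "--in", uniprot_fasta, "--db", db_prefix]
    blastp = [
        "diamond", "blastp",
        "--db", db_prefix,
        "--query", query_fasta,
        "--out", out_tsv,
        "--max-target-seqs", "0",   # zero means: report all hits per query
        "--evalue", "1e-3",         # maximal expect value
        "--more-sensitive",
        # Tabular output restricted to the columns named in the text.
        "--outfmt", "6", "qseqid", "stitle", "length", "pident",
        "ppos", "qcovhsp", "evalue",
    ]
    return makedb, blastp
```

Each list can be passed to `subprocess.run(cmd, check=True)`. The number of hits is not an output column; it would have to be obtained by counting rows per `qseqid`.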
2.3. Validation of altProts by MS proteomics
We searched for mass spectra-derived peptides that match each in silico generated altProt with the aid of SearchGUI v. 4.0.41 [53] and its partner tool PeptideShaker v. 2.0.33 [54]. Three proteomic datasets publicly deposited at the ProteomeXchange database [30] were used for this purpose: PXD002692 [55], PXD013606 [56], and PXD022278 [57], which contain nine, one, and six samples, respectively (Supplementary Dataset S1). ThermoRawFileParser v. 1.1.2 [58] was employed for the conversion of raw data from these three datasets to Mascot Generic Format (MGF). Search algorithms X!Tandem [59], MS-GF+ [60], OMSSA [61], and Comet [62] were used in all searches. The search database consisted of the following three components: (1) in silico translated altProts; (2) all annotated refProts; and (3) the contaminant database known as cRAP (common Repository of Adventitious Proteins), which was downloaded from the GPM resource ([63] https://www.thegpm.org/crap/). Validated altProts were excluded from the analysis if they grouped with a refProt or a contaminant sequence (Related Proteins). Carbamidomethylation of C was set as a fixed modification, and acetylation of protein N-termini and oxidation of M were set as variable modifications. Precursor and fragment tolerances were set to 4.5 ppm and 20.0 ppm, respectively. At most, two missed cleavages were allowed. A false discovery rate (FDR) of 1 % was used to validate peptide-spectrum matches (PSMs), peptides, and proteins based on the target/decoy hit distribution. The decoy sequence database was generated by in silico reversing the collection of refProt, altProt, and contaminant sequences. Software version-specific default settings were used for parameters that are not mentioned here. The results of PeptideShaker confidence classification of the chimeric peptides are available in Supplementary Dataset S1.
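The target/decoy principle behind the 1 % FDR threshold can be sketched as follows. This is a deliberately simplified illustration and not the procedure implemented in SearchGUI or PeptideShaker; the function names, the `_REVERSED` suffix, and the assumption that a higher score is better are all choices made for this example.

```python
def make_decoys(proteins):
    """Reverse every sequence to obtain a decoy entry.
    proteins: dict mapping identifier -> amino acid sequence."""
    return {f"{name}_REVERSED": seq[::-1] for name, seq in proteins.items()}


def score_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff at which decoys / targets <= max_fdr.
    psms: iterable of (score, is_decoy); higher score = better match.
    Returns None if no cutoff satisfies the FDR limit."""
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all matches scoring at least `score`.
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff
```

Matches to reversed sequences are assumed to be false by construction, so their frequency above a score cutoff estimates the proportion of false target matches above the same cutoff.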
Including all mRNA-derived altProts in the analysis inflates the search database (to more than 840,000 entries after adding refProts; Supplementary Table S1), which substantially reduces the number of confident protein identifications [64], [65]. To avoid database inflation, a two-step MS search approach was used [66]. In the first step, the list of altProts was shuffled and split into ten chunks. Each chunk was used as a search database, and proteins validated in this first step were subsequently included in the second search. Given that the three datasets (PXD002692, PXD013606, and PXD022278) together contained 16 organs/conditions and that the mRNA-derived altProt sequences were split into ten groups, 160 MS searches were conducted in the first step of the two-step approach, followed by 16 searches in the second step. The parameters of individual searches were specified above. The two-step approach was not used for non-mRNA-derived altProts because their relatively small number does not inflate the search database. Thus, a single-step MS search protocol was used for those sequences. MS searches for ncRNA, rRNA, and tRNA-derived altProts were conducted separately. Because there were 16 organs/conditions in total in the three MS datasets combined, 16 MS searches were conducted for each group of non-mRNA-derived altProts. All validated altProts were recorded for further analysis. The MS datasets were also searched independently; for example, validated proteins from dataset PXD002692 were not combined with those from PXD013606 or PXD022278 when constructing the second-step search database.
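The logic of the two-step search for a single sample can be outlined as below. The callable `run_search` stands in for a complete SearchGUI/PeptideShaker run and is hypothetical, as are the function names and the assumption that refProts and contaminants (`background`) accompany every chunk.

```python
import random


def split_into_chunks(entries, n_chunks=10, seed=0):
    """Shuffle the entries and deal them into n_chunks groups."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def two_step_search(altprots, background, run_search, n_chunks=10, seed=0):
    """Two-step search for one sample.
    run_search(database) must return the altProts it validates."""
    first_pass = []
    for chunk in split_into_chunks(altprots, n_chunks, seed):
        # Step 1: search each small chunk together with the background.
        first_pass.extend(run_search(chunk + background))
    # Step 2: search only the first-step survivors, again with background.
    return run_search(first_pass + background)


def count_searches(n_samples, n_chunks):
    """Number of searches in step 1 and step 2."""
    return n_samples * n_chunks, n_samples
```

For 16 organs/conditions and ten chunks, `count_searches(16, 10)` gives 160 first-step and 16 second-step searches, matching the numbers stated above.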
2.4. Modeling of chimeric proteins for MS searches
Throughout the text, we discriminate between terms “chimeric peptide” and “chimeric protein”. Specifically, we use the former term for short amino acid sequences identified through MS proteomics. In contrast, in silico generated models for matching corresponding MS peptides are referred to as chimeric protein models even though their actual length generally does not exceed 40 aa in our study. This is meant to emphasize that these models may represent longer chimeric sequences or fragments of mosaic proteins. We also use the term “chimeric protein” in contexts where the length is not relevant.
Although our in-house script can use the list of any overlapping or non-overlapping ORFs on the same transcript as a starting point, regardless of their conservation and translation status, the number of chimeric protein models thus generated would be astronomical. For this reason, we modeled chimeric proteins based on two lists of altProts. The first list contained altProts validated by MS searches, and the second list contained so-called conserved altProts. The definition of “conserved” in this case refers to the presence of at least one hit with at least 70 % identity (e-value below 0.001) in the global sequence similarity search using UniProt as a reference protein database. Thus, our conditional definition of “conserved” encompasses both intra- and interspecies conservation.
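The conservation filter can be expressed as a short pass over DIAMOND tabular output. The sketch assumes tab-separated rows with the seven columns named in Section 2.2, in that order; the function name and the inclusive handling of the e-value boundary are assumptions of this example.

```python
import csv
import io

FIELDS = ["qseqid", "stitle", "length", "pident", "ppos", "qcovhsp", "evalue"]


def conserved_altprots(tsv_text, min_identity=70.0, max_evalue=1e-3):
    """Return {query id: number of hits passing the filter}.
    A query is 'conserved' if it has at least one passing hit."""
    counts = {}
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=FIELDS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        passes = (float(row["pident"]) >= min_identity
                  and float(row["evalue"]) <= max_evalue)
        if passes:
            counts[row["qseqid"]] = counts.get(row["qseqid"], 0) + 1
    return counts
```

Queries absent from the returned dictionary either had no hits or had only hits that failed one of the two thresholds.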
Each MS-validated and/or conserved altProt has a corresponding altORF with coordinates on its transcript and locus information. Our chimeric protein modeling algorithm determines and in silico translates products of many possible PRF events that may occur if an altORF overlaps with its refORF or other altORFs on the same transcript. We use the word combination “many possible PRF events” instead of “all possible PRF events” to emphasize that some situations were deliberately left beyond the scope of our analysis. For example, we considered only PRF values −2, −1, +1, and +2, which correspond to the backward and forward slippage of ribosomes, respectively, by one or two nucleotides, although “longer” events that cause the frameshifting by up to six bases are known [67], [68]. The reason for focusing on the “shortest” PRF events and the simplest scenarios was the phenomenon of the search database inflation [64], [65]. While our algorithm can technically generate models for all theoretically possible chimeric proteins, including all of them in the analysis would be counterproductive. Thus, we included only situations that would serve as the most convincing illustration of chimeric translation without inflating the MS search database. This explains why the settings for the generation of chimeric models were not uniform for all cases but were tailored to specific scenarios described in the corresponding software article [1]. Furthermore, those settings depended on the location of the frameshift relative to the involved ORFs (the 5’-end vs the 3’-end).
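A single PRF event can be modeled in a few lines, as sketched below. This is an illustrative simplification and not the published algorithm [1]: the function name, the 0-based coordinates, and the default of 20 flanking residues on each side (chosen so that a model does not exceed 40 aa) are assumptions of this example.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def chimeric_model(transcript, start, site, shift, flank_aa=20):
    """Model one frameshift event.
    start: 0-based index of the first codon in the initial frame.
    site:  0-based index where the next in-frame codon would begin.
    shift: ribosome slippage in nucleotides (-2, -1, +1 or +2)."""
    if shift not in (-2, -1, 1, 2):
        raise ValueError("only PRF values -2, -1, +1, +2 are modeled")
    if (site - start) % 3 != 0:
        raise ValueError("site must lie on a codon boundary of the initial frame")
    seq = transcript.upper().replace("U", "T")
    upstream = "".join(CODON_TABLE.get(seq[p:p + 3], "X")
                       for p in range(start, site, 3))
    if "*" in upstream:
        raise ValueError("stop codon upstream of the frameshift site")
    downstream = []
    # After slippage, decoding resumes at site + shift in the new frame.
    for p in range(site + shift, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[p:p + 3], "X")
        if aa == "*":
            break
        downstream.append(aa)
    return upstream[-flank_aa:] + "".join(downstream)[:flank_aa]
```

For the toy transcript `ATGGCTGCTCAAATTTGGGTAA` with `start=0` and `site=9`, a +1 shift yields `MAAKFG` and a −1 shift yields `MAASNLG`, so the two events produce distinct junction sequences that could be told apart by matching MS peptides.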
2.5. Validation of chimeric protein models by MS proteomics
The number of chimeric protein models derived from MS-validated altProts is moderate (36,536) and does not cause inflation of the search database. In contrast, the models derived from conserved altProts are numerous (533,569), which causes inflation of the search database (see Section 3.3). Thus, similarly to the MS-validation of mRNA-derived altProts, the models from the latter group were searched by the two-step MS approach, while the regular MS searches were conducted for the validation of the former group. The validation of chimeric models involved the same three MS datasets and the same pipeline as the validation of altProts. The list of chimeric protein models derived from the conserved altProts was split into ten chunks. Each chunk was searched separately in 16 organs/conditions in the first step. Then, validated chimeric protein models were searched one more time (phase two of the two-step approach). Validated models were recorded for further analysis. Because there were 16 organs/conditions in those three datasets, 160 and 16 MS searches were conducted in the first and second steps of the two-step approach, respectively, for the models derived from conserved altProts. Unlike in the altProt validation, altProts used for the modeling of chimeric proteins were also included in the search database in addition to the refProt and cRAP databases, which permitted the elimination of chimeric peptides identical to sequences of altProts. This step was necessary because some chimeric peptides differed from their altProts or refProts by only one or two amino acids. If these different amino acids are indistinguishable by MS, true altProts or true refProts may be categorized as chimeric proteins. Validated chimeric protein models were excluded from the analysis if they grouped with an altProt, a refProt, or a contaminant sequence. Thus, our final dataset (Supplementary Dataset S1) is free from this ambiguity. 
As with the validation of altProts, the three MS datasets were searched independently.
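The ambiguity addressed in this section can be illustrated with a simple sequence check. In the study, the exclusion relied on protein grouping by PeptideShaker; the sketch below is a cruder stand-in that treats leucine and isoleucine as identical because they are isobaric. The function names and the restriction to the I/L pair are assumptions of this example.

```python
def normalise(seq):
    """Collapse isobaric residues: I and L have identical mass."""
    return seq.upper().replace("I", "L")


def is_chimera_specific(peptide, known_proteins):
    """True if the peptide cannot be explained by any known sequence
    (altProt, refProt or contaminant) once I/L are treated as equal."""
    query = normalise(peptide)
    return not any(query in normalise(known) for known in known_proteins)
```

A peptide that differs from a refProt only by an I/L substitution is therefore not counted as evidence for a chimeric protein.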
2.6. Visualization of chimeric MS peptides, chimeric protein models, mosaic proteins, and their respective transcripts
Sequences of MS-supported chimeric peptides were identified and manually mapped to their transcripts in Geneious® v. 7.1 (Dotmatics Ltd., MA, USA, https://www.geneious.com). The location of each PRF event, its value (−2, −1, +1, or +2), type, and subtype (Supplementary Dataset S1) were deduced manually. To create images of transcripts associated with each PRF event (Supplementary Datasets S2, S4, S5, and S28), we calculated sizes and coordinates of each feature in Geneious® and then converted them to distances in centimeters to be plotted. Transcript lengths were scaled to either 10 cm or 20 cm in the images, depending on the purpose of each figure, so that transcripts of different nucleotide lengths appeared equal in the visualizations. This way, we unified visualization of transcripts that ranged from 84 bp to 10,214 bp.
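The conversion from nucleotide coordinates to plotted distances is a linear rescaling, which can be written as follows. The function name and the 1-based inclusive coordinate convention are assumptions of this sketch; the conversion in the study was carried out with values exported from Geneious®.

```python
def scale_features(features, transcript_len_nt, width_cm=10.0):
    """Convert (first_base, last_base) pairs to (offset_cm, length_cm)
    on a drawing in which the whole transcript spans width_cm."""
    per_nt = width_cm / transcript_len_nt
    return [((first - 1) * per_nt, (last - first + 1) * per_nt)
            for first, last in features]
```

Because every transcript is mapped onto the same width, an 84-nt transcript and a 10,214-nt transcript occupy the same horizontal space, and only the relative positions of features remain comparable between images.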
2.7. Protein folding predictions
To estimate the proportion of potentially unstructured protein sequences in our collection of putative chimeric peptides, and also to discriminate between sequences that fold into alpha-helices and beta-sheets, we predicted their folding with ColabFold v1.5.2 [69] and visualized it with ChimeraX v. 1.6.1 [70]. The predictions were run in the unrelaxed mode. Because MS peptides that support chimeric models are too short for folding predictions (8–30 aa, Supplementary Dataset S1), we based the predictions on chimeric models, which are longer than MS peptides. Five predictions per sequence were made, out of which the top-score predictions, and in some cases strongly differing additional predictions with a lower score, were depicted. Although predictions on short sequences such as those in our study have intrinsic limitations [71], the suitability of using AlphaFold2 (the foundation of ColabFold) has been demonstrated even for peptides as short as 10–40 aa [72].
2.8. Homology searches
To identify homology groups within the dataset, transcript and protein sequences that correspond to chimeric peptides were aligned with a Geneious® alignment tool, which is a built-in option in the Geneious® software, and also with Clustal Omega [73]. To identify small-scale local similarities among sequences, we employed the MEME (Multiple EM for Motif Elicitation) software v. 5.4.1 [74], [75]. In addition, BLASTN and BLASTP tools [76], [77] and the M. truncatula genome portal [50] were used for the identification of homology groups based on shared subjects. The same tools were used for the annotation of transcripts and translational products of refORFs and altORFs involved in the production of chimeric peptides (Supplementary Dataset S1).
2.9. Identification of alternative sources
In searches for non-chimeric alternative sources of chimeric MS peptides, we used genomic, repeat element, and transcript data from the M. truncatula genome portal [50]. We also used 50 selected RNA-Seq datasets of M. truncatula deposited at the Sequence Read Archive (SRA) database [78], [79] and the RNA-Seq-based gene expression atlas of this organism (MtExpress v. 3, [80]; https://medicago.toulouse.inrae.fr/GEA). In-house scripts based on TBLASTN were used for the similarity searches in this part of the study. MtExpress was also used for transcription profiles of primary sources and the selection of the most likely alternative chimeric sources listed in Supplementary Dataset S1. In searches for chimeric alternative sources, we used a script that was published separately as the area of its application is broad [2]. Transcript and repeat-element data from the newer version 5.1.9 of the M. truncatula genome browser [50] were used in this analysis. Each primary-source transcript was manually analyzed for differences between v. 5.1.7 and v. 5.1.9. There were no sequence or length differences found between these two genome releases for any of the primary-source transcripts. It should be noted that we did not search for putative PRF sites in intergenic regions, promoters, and introns. This way, we focused on regions known or likely to be transcribed (transcriptome and repeatome, respectively). However, for specific purposes like the identification of novel transcribed loci, genomic DNA can serve as a search space for our software [2].
2.10. Statistical analysis
To estimate the deviation from the randomness assumption in the relationships between various features of chimeric peptides and their transcripts, we mostly employed the chi-square tests for the association and the goodness-of-fit as conservative non-parametric tools. In some cases, we applied Pearson and Spearman correlation analyses, using the corresponding t-tests to assess the significance of the correlation coefficients. For the assessment of differences between BLASTP and BLASTN-related characteristics of four RNA types (mRNA, ncRNA, rRNA, and tRNA), the Kolmogorov-Smirnov test was utilized in R [81]. This test is an adequate non-parametric procedure suitable for the comparisons of samples that have very different variances and sizes. In cases where the application of the chi-square test was inappropriate (at least one cell with an expected value below five), three alternative tests were used in R: (1) Fisher’s exact test for more than one level and more than two proportions; (2) the Exact Multinomial Test for one level and more than two proportions; (3) the Exact Binomial Test for one level and two proportions. In some cases, indicated in supplementary figures, the exact tests could not be run because of a large number of categories and/or very low values in specific groups. An alpha-level of 5 % was used as a significance threshold in all tests. The default setting of all tests mentioned above is to calculate a two-sided p-value. Graphs for supplementary figures were generated in Microsoft Excel and in some cases in R or SPSS® Statistics v. 27.0 (IBM® Corporation, NY, USA, https://www.ibm.com/products/spss-statistics).
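The rule for choosing between the chi-square test and an exact alternative can be summarised as a small decision function. The analyses in the study were run in R; the Python sketch below only restates the selection logic, and the function names as well as the treatment of every multi-level table as a Fisher's exact case are our reading of the description above.

```python
def expected_counts(table):
    """Expected cell counts of a contingency table under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def choose_test(observed, expected=None, min_expected=5.0):
    """Pick a test name.
    observed: list of rows (association test) or, if `expected` is given,
    a flat list of counts for a one-level goodness-of-fit test."""
    one_level = expected is not None
    if one_level:
        cells = list(expected)
    else:
        cells = [e for row in expected_counts(observed) for e in row]
    if min(cells) >= min_expected:
        return "chi-square"
    if not one_level:
        return "Fisher's exact"
    return "exact binomial" if len(cells) == 2 else "exact multinomial"
```

For example, a 2 × 3 table with counts `[[1, 2, 3], [2, 1, 9]]` has expected values below five and would be routed to Fisher's exact test.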
3. Results
3.1. The M. truncatula transcriptome contains thousands of altORFs with a conservation signature
To enable a comprehensive study with potential for novel observations, we have not limited our analysis to mRNA transcripts. We included three other RNA types that are classically defined as non-coding: ncRNA, rRNA, and tRNA, which are among the longest of all known RNA types [82], [83]. With the same intention, we have not limited our definition of an open reading frame (ORF) to transcript regions that start with or contain the standard translation initiation codon AUG [84]. Here, we define an ORF as a stop-free region exceeding a certain length in any forward reading frame, with the conceptual potential to be translated into a short peptide or a longer amino acid chain. Eventually, this ORF definition proved to be useful as it led to the discovery of many translated ORFs that do not start with an ATG or do not contain an ATG at any position (results not shown). On the other hand, the inclusion of non-coding RNA types allowed us to detect chimeric peptides potentially translated from all three groups of transcripts, including tRNA, and non-chimeric altProts translated from ncRNA.
Because we planned to use the DIAMOND BLASTP analysis as a part of our detection pipeline, we chose 60 nt as a minimal ORF length. This decision was conditioned by the intrinsic limitations of the BLASTP algorithm in handling peptide sequences shorter than 20 aa [85]. The annotated transcriptome v. 5.1.7 of M. truncatula [50] contains 875,356 altORFs longer than 59 nt, which correspond to putative altProts of 20 aa or longer (Supplementary Table S1). We extracted these sequences by in silico translation and subjected them to the BLASTP search against the global UniProt reference database v. 2020_02 (all species) [51]. AltProts with at least one hit were retained for further analysis if they met the following criteria: e-value ≤ 0.001 and amino acid identity ≥ 70 %. There were 13,078 such sequences (Table 1), which we conditionally call conserved altProts, although their conservation is often limited to the M. truncatula proteome. Table 1, Supplementary Figures S1-S5, and Supplementary Tables S1-S4 illustrate details and statistics of the BLASTP results.
Table 1.
Statistics on altProts with at least one hit in the global BLASTP analysis after the application of the filterᵃ.
| Statistic | mRNA | ncRNA | rRNA | tRNA |
|---|---|---|---|---|
| Number of transcripts | 44,624 | 5657 | 62 | 974 |
| Median length of transcripts, nt | 1280 | 413 | 120 | 75 |
| Total number of altORFsᵇ | 8718 | 3549 | 385 | 426 |
| altORFs per transcript | 0.20 | 0.63 | 6.21 | 0.44 |
| Transcripts per altORF | 5.12 | 1.59 | 0.16 | 2.29 |
| Median length of altProts, aa | 53.0 | 47.0 | 45.0 | 25.0 |
| Median length of alignment, aa | 38.0 | 36.0 | 40.0 | 24.0 |
| Median percent identity | 87.5 | 91.4 | 93.3 | 100.0 |
| Median percent query coverage | 82.5 | 84.5 | 94.8 | 100.0 |
| altORFs longer than 300 nt | 795 | 215 | 15 | 0 |
| altORFs longer than 600 nt | 33 | 27 | 3 | 0 |
ᵃ BLASTP filter settings: e-value ≤ 0.001; percent identity ≥ 70.
ᵇ There are 13,078 such altORFs in total across the RNA types listed here.
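The two ratio rows of Table 1 follow directly from the transcript and altORF counts, which the short sketch below recomputes. The dictionary layout and function name are chosen here for illustration; the input numbers are copied from the table.

```python
# Counts copied from Table 1: (number of transcripts, number of altORFs).
TABLE1 = {
    "mRNA": (44624, 8718),
    "ncRNA": (5657, 3549),
    "rRNA": (62, 385),
    "tRNA": (974, 426),
}


def table1_ratios(counts):
    """Return {RNA type: (altORFs per transcript, transcripts per altORF)},
    both rounded to two decimals as in the table."""
    return {rna: (round(orfs / transcripts, 2), round(transcripts / orfs, 2))
            for rna, (transcripts, orfs) in counts.items()}


def total_altorfs(counts):
    """Sum of altORFs over all RNA types."""
    return sum(orfs for _, orfs in counts.values())
```

The rRNA row stands out: it is the only RNA type with more than one conserved altORF per transcript.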
3.2. Three public MS proteomic datasets of M. truncatula provide evidence for translation of 805 putative altProts
In parallel with the BLASTP analysis, all 875,356 putative altProts were analyzed for the presence of corresponding MS spectra in three selected proteomic datasets of M. truncatula [56], [57], [55]. This analysis revealed 805 unique putative altProts with evidence for translation, 122 of which are also present in the list of conserved altProts. Supplementary Figures S4 and S5 show distributions of the top-hit percent identity and the number of hits, respectively, for the 122 conserved MS-supported altProts. Most of the altProts detected by MS (720 unique sequences) correspond to mRNA transcripts; however, 85 unique ncRNA-derived altProts were also supported by MS data. No validated non-chimeric MS peptides were found for rRNA-altProts and tRNA-altProts (Supplementary Figure S6). In total, 16 biological samples contain MS peptides corresponding to altProts. More than half of these peptides (573) were detected in just four samples: seeds (155), 10-dpi nodules (147), flowers (136), and buds (135), which may point to the importance of altProts in the reproduction and symbiotic nitrogen fixation (Supplementary Figure S6).
In the BLASTP analysis, the taxonomic status of the top-hit subject sequence may reflect the degree of conservation of the query sequence among different organisms, especially if the number of significant hits is small. In our analysis, 70, 65, and 39 % of altProts have between 1 and 5 hits before the application of the 70 % identity filter, after the filter, and in the set of conserved MS-supported altProts, respectively. The corresponding median numbers of hits in these three sets are 2, 2, and 16.5, which is rather low (Supplementary Figures S2, S3, and S5). This indicates that, overall, the altProts detected in our study share similarity with proteins from only a few species each. A very large portion of these altProts have the top similarity with proteins from M. truncatula itself or from other legume species (Supplementary Figure S7). At the same time, the majority of MS-supported altProts (683 out of 805, ca. 85 %) have no similarity with any annotated protein at all. Percent amino acid identity with the top hit is another parameter that may reflect the degree of conservation. In our dataset (Supplementary Figure S1), median values of percent identity are very high, which may point to the origin of many altProts from duplication events and frameshifting mutations in the genome of M. truncatula and of its common ancestors with other legume species. Percent identity is the highest in tRNA-altProts (100 %), followed by rRNA-altProts (93.3 %), the latter also showing the highest median number of hits (Supplementary Figure S1). Together with other parameters described in Table 1, these observations point to a special role of tRNA and rRNA in genome evolution, which was proposed earlier [88], [89], [90], [86], [87].
3.3. Modeling of chimeric proteins based on the combined evidence for conservation and translation helped to validate MS peptides that match 156 chimeric models
Our next goal was to generate chimeric models and match them to peptides present in MS proteomic samples. We searched for altORFs that overlap with annotated coding sequences (refORFs) and/or each other among the altORFs of 13,780 conserved and/or translated altProts (13,078 conserved + 805 translated − 103 both conserved and translated; see Table 1 and Supplementary Figures S4 and S6). Such overlapping ORFs were then used to model the many possible PRF positions at which the translational switch can occur. Four PRF categories are considered in this study: −2, −1, +1, and +2. These symbols indicate ribosomal movement backward (−) or forward (+) by the corresponding number of nucleotides. To explore the possibility of unconventional PRF events bridging closely spaced non-overlapping (adjacent) ORFs, we also included ORFs separated by 1–10 nt. For this category of ORF pairs, we modeled chimeric proteins using only forward frameshifts, from +1 to +10. In some cases, the software generated models that incorporated stop codons. Such models were eliminated from the analysis or shortened by a few amino acids. In this process, 570,105 models of chimeric proteins were generated, 36,536 of which were modeled with MS-supported altProts, whereas the remaining 533,569 used conserved putative altProts as the basis (redundant numbers of models are shown in Supplementary Table S5). These models were searched for matching MS peptides using the same three datasets as for the identification of translated altORFs. This search delivered translation evidence for 156 putative chimeric proteins, 20 of which were modeled with MS-supported altProts and 135 with conserved putative altProts. The remaining one, chimeric protein 78 (CP78), was modeled with an altProt that was both conserved and MS-supported (Supplementary Dataset S1).
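The idea behind a single chimeric model can be illustrated with a minimal Python sketch. It is not the modeling script used in this study [1]; it only shows, under simplifying assumptions, how one PRF event of a given value joins the products of two reading frames. The function names, the 0-based coordinate convention (the PRF position is the index of the first nucleotide not yet read in the initial frame), and the toy transcript are all hypothetical.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, start=0):
    """Translate codon by codon from `start` until a stop codon or the end."""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def chimeric_model(seq, start, prf_pos, shift):
    """Model one PRF event: read the initial frame up to `prf_pos`,
    move the ribosome by `shift` nucleotides, and continue translating.

    Returns None if a stop codon precedes the PRF site, which mirrors
    the elimination of stop-containing models.
    """
    if (prf_pos - start) % 3 != 0:
        raise ValueError("prf_pos must lie on a codon boundary of the initial frame")
    if prf_pos + shift < 0:
        raise ValueError("the frameshift would move the ribosome before the transcript start")
    upstream = translate(seq[:prf_pos], start)
    if len(upstream) * 3 != prf_pos - start:
        return None  # a stop codon interrupts the upstream part
    return upstream + translate(seq, prf_pos + shift)


# Hypothetical toy transcript (DNA alphabet), not taken from M. truncatula.
toy = "ATGGCTAAAGGTTTTGACTAA"
print(translate(toy))                  # conventional translation
print(chimeric_model(toy, 0, 9, +1))   # +1 PRF after the third codon
print(chimeric_model(toy, 0, 9, -1))   # -1 PRF after the third codon
```

In a full pipeline, such a function would be called for every admissible PRF position within the overlap of two ORFs and for every PRF value under consideration, which explains why the number of models grows so quickly.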
None of the chimeric protein models generated with non-overlapping adjacent ORFs (Scenario 3 in [1]) matched validated MS peptides in our study. It should be noted that refORFs were assumed to be translated in this analysis. However, if a chimeric protein was modeled with a conserved altORF that overlaps with a refORF, it was scored as modeled with a conserved altORF. This definition was appropriate because our primary goal was to study translation of altORFs, whereas refORF translation is an expected condition. Supplementary Dataset S2 illustrates the details of each chimeric model, its matching MS peptide, position, type, and value of PRF, and positions of involved ORFs relative to their transcripts. Supplementary Dataset S3 contains sequences of 156 chimeric models, their matching MS peptides, and corresponding 145 transcripts with detailed annotation of features. Supplementary Dataset S1 provides a very comprehensive summary of many aspects of the identified chimeric proteins and permits easy cross-comparison of various categories.
Based on the manual annotation of transcripts and refProts of the 145 identified PRF loci, we attempted to deduce the biological processes in which these genes may be involved (Supplementary Dataset S1, column AO). This analysis revealed 20 groups of processes, four of which were represented with a frequency of at least 10 % each (Unclassified 16 %, Signal transduction 13 %, Translation 12 %, and Transcription regulation 10 %). Because our dataset was artificially enriched with conserved sequences (see the previous section), it was natural to find Translation and Transcription regulation in this top-four list. In the terminology of Bowman et al. [91], “translation is the hub of life”, and transcription regulation is “within two degrees” from the ribosome as the central element in the origin of life. In contrast, the top two groups, namely Unclassified and Signal transduction, are usually not associated with a high degree of evolutionary conservation, which makes their prominence surprising. The next striking observation was that only half of the 20 groups were among processes “within two degrees” from the ribosome, according to Bowman et al. [91] (Translation, Transcription regulation, Protein degradation, ATP biosynthesis, Amino acid biosynthesis, Nucleotide biosynthesis, Chaperoning, Chromatin remodeling, Vesicle transport, and RNA processing). Other processes, which are not “within two degrees” from the ribosome, included Unclassified, Signal transduction, Metabolism, Photosynthesis, Cytoskeleton-based movement, Transposition, DNA replication, Transmembrane transport, Oxidative stress, and Storage. Various metabolic processes combined accounted for 22 % of PRF loci (Metabolism, ATP biosynthesis, Photosynthesis, Amino acid biosynthesis, and Nucleotide biosynthesis). These results suggest that PRF occurs in both housekeeping and highly specialized transcripts.
After the analysis was completed, we realized that there were two sources of redundancy among the 570,105 models of chimeric proteins. Firstly, chimeric models that corresponded to 103 conserved MS-supported altORFs were counted twice because we conducted the modeling separately for conserved and translated altORFs. Secondly, chimeric models in which PRF events involved only altORFs were counted twice because our pipeline generated two identical sets of models for each such pair of altORFs. In the first set, altORF1 (the upstream one) was considered as a refORF, and in the other set, altORF2 (the downstream one) was treated as a refORF. These two sources of redundancy were subsequently eliminated from the pipeline [1]. Accordingly, corresponding non-redundant counts of chimeric models were recorded for different altORF types, PRF values, RNA types, MS proteomic studies (Supplementary Tables S6-S9), and also for different chromosomal locations (see Section 3.5.9 in Supplementary Results). In supplementary figures, expected counts based on the distribution of chimeric models were calculated using non-redundant numbers of chimeric models per category without models that come from non-overlapping ORFs (Supplementary Table S9). The latter models, which correspond to the group “other”, were excluded from the calculations of expected counts because none of them received MS support. It should be noted that the inclusion of 99,355 duplicated chimeric models in our MS search process (ca. 17 % of 570,105) is unlikely to have affected the efficiency of detection because models with identical sequences were combined by the analysis software into related protein groups.
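The effect of this redundancy can be sketched in a few lines of Python. The grouping function below only illustrates the general principle that models with identical sequences collapse into one group; it is not the grouping algorithm of the MS analysis software, and the model identifiers and sequences are hypothetical. The two counts are taken from the text.

```python
from collections import defaultdict


def group_identical_models(models):
    """Collapse (model_id, sequence) pairs that share the same sequence."""
    groups = defaultdict(list)
    for model_id, sequence in models:
        groups[sequence].append(model_id)
    return dict(groups)


# Hypothetical models: m1 and m3 are duplicates produced by two modeling runs.
example_models = [("m1", "MAKVLT"), ("m2", "MAKRF"), ("m3", "MAKVLT")]
example_groups = group_identical_models(example_models)
print(example_groups)

# Share of duplicated models reported in the text.
total_models = 570_105
duplicated_models = 99_355
duplicate_share = duplicated_models / total_models
print(f"{duplicate_share:.1%} of all models are duplicates")
```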
Lastly, we present the details of our two-step MS search procedure conducted according to Jagtap et al. [66]. It was used only for mRNA-derived altProts and chimeric proteins modeled with conserved altORFs because the large sizes of these two search databases were expected to cause the inflation effect. The structure, inputs, outputs, and reduction rates of these searches are described in Supplementary Tables S10 and S11. As expected for large search spaces, the proportion of sequences validated in the second search was less than 100 % of those validated in the first search. Depending on the MS proteomic dataset, it varied between 76 % and 80 % for altProts and between 54 % and 75 % for chimeric proteins.
3.4. The detection of eight transcripts associated with multiple MS-supported chimeric peptides enabled the discovery of first candidates for mosaic translation
Our primary analysis revealed the presence of eight transcripts each associated with more than one frameshifting event (Supplementary Dataset S4). We also conducted a nearly exhaustive search for genomic and transcript regions that can potentially produce any of the 156 detected chimeric MS peptides, with or without PRF involved. We call them alternative sources throughout the manuscript. That additional analysis indicated conservation of some chimeric peptides within the M. truncatula genome and pointed to other transcripts with multiple frameshifting potential. Transcripts mentioned in this section are referred to as primary sources (145 transcripts). They will be the focus of our main discussion.
Among eight transcripts with multiple frameshifting potential, six were annotated as mRNA. They have two or three MS-deduced putative PRF sites per transcript, most of which are located within the boundaries of their refORFs (except for MtrunA17_Chr5g0430341, CP90 and CP91). Two non-mRNA transcripts are MtrunA17_MTg0490971 (ncRNA, CP150 and CP151) and MtrunA17_Chr5g0422291 (rRNA, CP88 and CP89). Each of them has two putative PRF sites per transcript (Supplementary Dataset S4).
Next, we analyzed the sequence context of each chimeric peptide in this group, with the focus on the PRF value, frames involved, the distance between the PRF sites, and the presence of in-frame stop codons between the PRF sites. This analysis suggested that six transcripts out of the eight are unlikely candidates for mosaic translation unless additional MS-supported PRF sites are found for those transcripts. For example, in MtrunA17_Chr1g0185811 (mRNA, CP16, CP17, and CP18), all three PRF sites have the same PRF value (+1). In addition, they belong to the same PRF type (1→2) and PRF subtype (a→r). Supplementary Dataset S1 explains the definitions of these categories. Such an arrangement of PRF sites does not support the possibility of mosaic translation from MtrunA17_Chr1g0185811. In MtrunA17_Chr5g0430341 (mRNA, CP90 and CP91), two PRF sites are separated by 1373 nt. Although the first PRF site has the value −2 and the second one +2, both PRF events in this transcript start from frame 3, which is incompatible with the production of one continuous mosaic protein of the category that we called short round trip [10]. Furthermore, there are 25 stop codons in frame 1 between the first and the second PRF site, so many additional chimeric peptides would have to be discovered to form a continuous mosaic “bridge” between the two PRF sites. Unlike the other transcripts in this group, MtrunA17_Chr1g0200071 (mRNA, CP23 and CP24) and MtrunA17_Chr6g0457461 (mRNA, CP98, CP99, and CP100) are good candidates for mosaic translation (Fig. 1 and Supplementary Dataset S4). The annotated product of MtrunA17_Chr1g0200071 is a protein-synthesizing GTPase. The transcript of MtrunA17_Chr1g0200071 has 73.5 % nucleotide identity with an alternative-source transcript of CP24 and CP101, MtrunA17_Chr6g0458111 (elongation factor 1-alpha named MtEF1A1 in [92]) and the primary-source transcript of CP101, MtrunA17_Chr6g0458091.
MtEF1A1 is a key regulator of translation involved in abiotic stress responses and adaptation to the environment. MtrunA17_Chr6g0458091 is occasionally used as a housekeeping gene for RT-PCR analysis [93], [94] despite the suboptimal stability of its expression in various tissues [95]. The second candidate for mosaic translation, MtrunA17_Chr6g0457461, encodes the small chain of ribulose-bisphosphate carboxylase (RuBisCo), which is the major enzyme in photosynthesis [96]. Despite completely different putative cellular functions, our study indicates that these two transcripts, MtrunA17_Chr1g0200071 and MtrunA17_Chr6g0457461, have much in common. The first PRF site in both transcripts has the same value (−2), the same PRF type (2→3), and the same PRF subtype (r→a). Likewise, the second PRF site shares the same characteristics in those two transcripts: the PRF value +2, the PRF type (3→2), and the PRF subtype (a→r). The distance between the two PRF sites is only 81 nt in the transcript of MtrunA17_Chr1g0200071 and 60 nt in MtrunA17_Chr6g0457461. Neither transcript has in-frame stop codons between the two PRF sites, which makes them strong candidates for producing mosaic proteins of the short round trip category [10].
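The three criteria used above (the second PRF event must start from the frame reached by the first, the two PRF values must bring the ribosome back to the original frame, and the intermediate stretch must be free of in-frame stop codons) can be expressed as a minimal Python sketch. This is an illustration under stated assumptions, not the procedure applied in this study: coordinates are 0-based, each site is the index of the first nucleotide not yet read before the shift, and the toy sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_short_round_trip(seq, site1, shift1, site2, shift2):
    """Check whether two PRF events could yield one continuous mosaic product."""
    alt_start = site1 + shift1  # first nucleotide read in the alternative frame
    if alt_start < 0 or site2 <= alt_start:
        return False
    # The second event must start from the frame reached by the first one.
    if (site2 - alt_start) % 3 != 0:
        return False
    # The combined shift must restore the original reading frame.
    if (shift1 + shift2) % 3 != 0:
        return False
    # No stop codon may occur in the alternative frame between the two sites.
    codons = (seq[i:i + 3] for i in range(alt_start, site2, 3))
    return not any(codon in STOP_CODONS for codon in codons)


# Hypothetical toy transcripts (DNA alphabet).
open_path = "ATGGCTAAAGGTTTTGACTAA"     # alternative frame is stop-free
blocked_path = "ATGGCTATAAGTTTTGACTAA"  # TAA in the alternative frame

print(is_short_round_trip(open_path, 9, -2, 13, +2))     # -2 then +2, stop-free
print(is_short_round_trip(open_path, 9, +1, 13, +1))     # repeated +1 values
print(is_short_round_trip(blocked_path, 9, -2, 13, +2))  # stop between the sites
```

Under these assumptions, a pair of −2 and +2 events passes the check only when both the frame continuity and the stop-free condition hold, whereas repeated +1 events never return to the original frame after two shifts.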
Fig. 1.
Three candidate mosaic proteins deduced from chimeric MS peptides. A, Mosaic protein 1 translated from transcript MtrunA17_Chr1g0200071 (the locus annotated as a putative protein-synthesizing GTPase). B, Mosaic protein 2 translated from transcript MtrunA17_Chr6g0457461 (the locus annotated as a putative ribulose-bisphosphate carboxylase, RuBisCo). C, Mosaic protein 3 translated from the same transcript MtrunA17_Chr6g0457461. Mosaic proteins 2 and 3 differ by one amino acid around the −2 PRF site (blue-boxed R and K, respectively). The upper portion of each figure depicts a corresponding transcript with ORFs mapped and scaled relative to the whole transcript length. Asterisks represent positions of PRF events. The text in blue describes the PRF value, type, and subtype of each frameshifting event. Interpretation example: −2 (2r→3a) refers to a frameshift with value minus 2 from a refORF in frame 2 (yellow) to an altORF in frame 3 (pink). The green triangle indicates the position of the first in-frame translational start codon (AUG).
According to our study, MtrunA17_Chr1g0200071 is expected to produce one 448 aa long mosaic protein, which is of the same length as the annotated product. However, this mosaic protein contains 28 aa in the middle of the sequence that are entirely different from the annotated protein (Fig. 1A, Supplementary Dataset S5). This segment introduced by putative mosaic translation does not disrupt the overall three-dimensional structure of the protein, as predicted by ColabFold, but rather modifies it (Supplementary Figure S8). This is not very surprising because, in the NCBI BLASTP analysis, the conserved altProt that donates its sequence to the mosaic protein has the top similarity to elongation factor 1-alpha from M. truncatula.
In contrast to MtrunA17_Chr1g0200071, the second likely candidate for mosaic translation, MtrunA17_Chr6g0457461, is expected to produce two 177 aa long mosaic proteins, which are of the same length as the annotated RuBisCo. Because the second −2 PRF site is only one nucleotide away from the first, the two mosaic proteins of MtrunA17_Chr6g0457461 differ by just one amino acid. One mosaic isoform contains 22 aa in the middle of the sequence that are entirely different from the annotated protein (Fig. 1B, Supplementary Dataset S5). The other mosaic isoform contains the same segment but one amino acid shorter at the left side (21 aa different from the annotated product, Fig. 1C, Supplementary Dataset S5). As in the case of MtrunA17_Chr1g0200071, these segments introduced by putative mosaic translation somewhat modify the three-dimensional structure of the protein, as predicted by ColabFold (Supplementary Figures S9 and S10). The translational product of the conserved altORF involved in the putative PRF events of both mosaic isoforms is similar to RuBisCo from another legume species, Arachis duranensis (a wild relative of peanut). Thus, it is possible that this altORF emerged via a frameshifting mutation that is older than the one in the case of MtrunA17_Chr1g0200071. The complete ColabFold prediction files corresponding to the annotated and mosaic proteins of MtrunA17_Chr1g0200071 and MtrunA17_Chr6g0457461 can be found in Supplementary Dataset S6.
Interestingly, the putative PRF sites in the transcripts mentioned in this section fall into distinct groups. Three transcripts, MtrunA17_Chr1g0200071, MtrunA17_Chr6g0457461, and MtrunA17_Chr5g0430341, have a first PRF site with the value −2 followed by a +2 PRF site. A second group, MtrunA17_Chr1g0185811, MtrunA17_MTg0490471, and MtrunA17_MTg0490971, has repeated +1 sites. The only rRNA transcript in this set, MtrunA17_Chr5g0422291, has repeated −1 sites. MtrunA17_Chr3g0144151 has no particular pattern of PRF sites, which is an exception in this subset of transcripts (Supplementary Dataset S4). This recurring grouping may reflect biological relevance.
3.5. Multiple significant associations between various parameters of the dataset reject the null hypothesis of its artifactual origin
Our hypothesis about the mosaic nature of proteins produced by MtrunA17_Chr1g0200071 and MtrunA17_Chr6g0457461 relies on the validity of the entire dataset. So far, MS proteomics is the most powerful and accurate large-scale method for the discovery of new proteins [26], [97], [98]. Despite its unique status among currently available methods, MS-based detection of peptides and proteins is inherently prone to false discoveries, even when standard control procedures are implemented, particularly in proteogenomics [99], [100]. In our study, we used a standard FDR of 1 % as a threshold. However, because of the very large search space used for the detection of chimeric peptides (more than half a million chimeric protein models, Supplementary Table S5), we had to apply a two-step search procedure. Since the primary purpose of this approach is to lower the rate of false negatives [66], it is likely to make FDR control more difficult. To address this known weakness of the large-scale search for matching MS peptides, we conducted a very comprehensive analysis of chimeric peptide features summarized in Supplementary Dataset S1. Our intention was to detect significant relationships between features that are not compatible with the null hypothesis about the randomness of the dataset. If the various features occurred randomly, it would suggest that the 156 validated chimeric peptides might be false discoveries. Much of the data described in Figs. 2–8 and Supplementary Figures S11-S62 (see Supplementary Results) and discussed below passed statistical tests. Collectively, our statistical analysis revealed a clear deviation from the randomness assumption, which argues for the biological relevance of the detected sequences.
Fig. 2.
Distribution of 156 chimeric MS peptides in different groups according to PRF values. A, Four separate groups of PRF. B, Two combined groups of PRF. Expected values were calculated based on the numbers of corresponding chimeric protein models (non-redundant counts in Supplementary Table S9, 469,600 models in total). There is no significant difference between the observed and expected proportions for the separate groups (A) and a significant difference (p = 0.016; the chi-square test for goodness-of-fit) for the combined groups (B). The distribution of PRF values within individual groups in A and B is not significantly different from the uniform distribution. However, the distribution of observed values in B is significantly different from the uniform distribution (p = 0.025; the chi-square test for homogeneity): PRF values with the negative sign are significantly overrepresented in the dataset.
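A goodness-of-fit comparison of this kind can be sketched in plain Python for the two-group case. With two groups there is one degree of freedom, so the p-value of the chi-square statistic equals erfc(sqrt(x/2)) and no statistics package is needed. The observed counts and expected proportions below are hypothetical placeholders chosen only so that they sum to 156; they are not the values behind Fig. 2, and the function name is illustrative.

```python
import math


def chi_square_gof_two_groups(observed, expected_proportions):
    """Chi-square goodness-of-fit test for exactly two groups (1 d.f.)."""
    if len(observed) != 2 or len(expected_proportions) != 2:
        raise ValueError("this sketch handles exactly two groups")
    total = sum(observed)
    statistic = sum(
        (obs - total * prop) ** 2 / (total * prop)
        for obs, prop in zip(observed, expected_proportions)
    )
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts of peptides with negative vs. positive PRF values.
hypothetical_observed = [90, 66]
# Hypothetical expected proportions; in the study these would come from
# the non-redundant numbers of chimeric models per group.
hypothetical_expected = [0.5, 0.5]

statistic, p_value = chi_square_gof_two_groups(
    hypothetical_observed, hypothetical_expected
)
print(f"chi-square = {statistic:.3f}, p = {p_value:.4f}")
```

For the four separate PRF groups, the same statistic has three degrees of freedom, and a library routine for the chi-square distribution would be the more practical choice.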
Fig. 3.
Distribution of 156 chimeric MS peptides in different groups based on the RNA type and the PRF value. The Fisher’s exact test shows significant association between the RNA type and the PRF value (p = 0.007 with tRNA included, p = 7.024E-07 with tRNA excluded). Within individual RNA groups, the distribution of PRF values is significantly different from the uniform distribution in one case indicated with an asterisk (the Exact Multinomial Test, p = 0.019). With PRF values −1 and +2 only, the difference is not significant. Note the absence of certain PRF values from ncRNA and rRNA. Remarkably, the only PRF value in tRNA is the same as the most frequent PRF value in rRNA.
Fig. 4.
Distribution of 156 chimeric MS peptides based on the genomic location of their primary-source genes and the PRF value. “Nuclear”: Chromosome 0 (unmapped loci) to Chromosome 8. “Non-nuclear”: chloroplasts and mitochondria. There is no significant association between the genomic location and the PRF value. Within individual genomic locations, the distribution of PRF values is significantly different from the uniform distribution in one case indicated with an asterisk (the Exact Multinomial Test, p = 0.018). PRF events with value +1 are significantly underrepresented in nuclear transcripts and non-significantly overrepresented in chloroplasts and mitochondria.
Fig. 5.
Distribution of 156 chimeric MS peptides based on the PRF type and value. Each PRF type has only two possible values: one positive and one negative. There is significant association between the PRF type and the PRF value based on the chi-square test (p = 0.007, a two-category test with categories of the same absolute value combined). Within individual groups, the distribution is significantly different from the uniform distribution in two cases indicated with asterisks: 1→3 and 2→3 (the chi-square test for homogeneity, p-values are 0.011 and 0.005, respectively). Backward PRF is more frequent for the shift from any frame to frame 3.
Fig. 6.
Distribution of 156 MS-supported models in three folding categories based on two aspects of the PRF value. A, grouping by the direction of PRF (minus, backward frameshifts; plus, forward frameshifts). B, grouping by the length of PRF (±1, “short” frameshifts; ±2, “long” frameshifts). There is no significant association in A and B. Within individual folding categories, the distribution is significantly different from the uniform distribution in two cases indicated with asterisks (the chi-square test for homogeneity, p-values are 0.035 and 0.014 for A and B, respectively). Backward PRF values are significantly overrepresented in chimeric peptides with predicted alpha-helices. Beta-sheet structures are three times more abundant in the models with “long” PRF values.
Fig. 7.
Distribution of 156 chimeric MS peptides in 16 biological samples depending on the PRF value. The counts are not additive because some chimeric peptides were identified in multiple samples. No chimeric peptides were detected in samples N28 and PhD. The PRF value is significantly associated with the sample (the Fisher’s exact test with a simulated p-value, p = 0.010). Within individual samples, the distribution of PRF values is significantly different from the uniform distribution in three cases indicated with asterisks. P-values of the chi-square test for homogeneity are as follows: 0.036 (B) and 0.040 (Se). The significance in sample RoC was assessed with the Exact Multinomial Test (p = 0.019). No chimeric peptides with PRF values −1 and +1 were found in samples RoC and RoD. Chimeric peptides with PRF values +1 and +2 are absent from sample PhC. Sample B lacks chimeric peptides with PRF value +1.
Fig. 8.
Distribution of 156 chimeric MS peptides in different PRF categories depending on the PRF type and subtype. r→a, PRF from refORF to altORF; a→r, PRF from altORF to refORF; a1→a2, PRF from altORF1 to altORF2, where altORF1 starts upstream of altORF2; a2→a1, PRF from altORF2 to altORF1, where altORF1 starts upstream of altORF2. There is no significant association between the PRF type and the PRF subtype. Proportion pairs in which the distribution is significantly different from the uniform distribution are marked with asterisks. P-values of the chi-square test for homogeneity of two proportions: *0.034; **0.005. In PRF type 1→2, PRF subtype r→a is significantly more frequent than PRF subtype a→r. In PRF type 2→1, it is the other way around, which means refORFs involved in PRF are associated with frame 1 and altORFs with frame 2 in those two categories. There is a non-significant trend of the same kind in PRF types 2→3 and 3→2, where refORFs are associated with frame 2.
4. Discussion
4.1. Old is gold: MS proteomics can help reveal the existence of mosaic proteins in the absence of methodology for long-read protein sequencing
Earlier, we proposed that multiple PRF may play an important role in the adaptability of organisms. The versatility of their proteomes may have been greatly expanded through the production of polypeptides incorporating translation products of different reading frames. We refer to this hypothetical mechanism as mosaic translation [10]. Experimental demonstration of this mechanism is a major challenge for a number of reasons. They are associated with technical limitations of the protein detection methods. The chief group of methods is based on MS proteomics. These methods infer the presence of long continuous polypeptide sequences via the detection of their short fragments (7–35 aa, [101]). Unfortunately, this methodology permits the identification of only a small fraction of non-canonical peptides actually present in a sample. Many biologically relevant peptides that are expressed at a low level and/or have small size remain undetected by MS proteomics, which does not reflect their instability or the lack of biological importance [103], [102]. An unequivocal demonstration of the mosaic nature of a protein requires long sequencing reads. Recently, a breakthrough in the development of nanopore-based sequencing of proteins has been reported [29], [28]. Still, the current state of this methodology is far from providing significant help in detecting long, continuous mosaic polypeptides in a complex mixture of amino acid sequences. In the absence of any alternative to MS-based proteomics, we decided to mine the existing MS proteomic datasets for fragments of putative mosaic proteins that we modeled using an original script [1]. Two types of data were used as inputs for the script: (1) altProts that have a conservation signature based on the global sequence similarity searches and (2) altProts that have matching MS-validated peptides regardless of their conservation signature. 
Due to the nature of MS proteomics as a detection strategy [103], the 156 chimeric peptides reported here likely represent only “the tip of an iceberg”, with the hidden part awaiting discovery.
4.2. Main findings and arguments for their validity
Upon the detection of 156 MS peptides that match chimeric models, we found that some of them could be mapped to the same transcripts. This way, we discovered eight transcripts associated with multiple PRF events, two or three per transcript. Based on the PRF value and type, we showed that two of these transcripts are good candidates for the production of mosaic proteins (Fig. 1). One of them corresponds to a putative protein-synthesizing GTPase, which is a close homolog of elongation factor 1-alpha (MtEF1α). The other one encodes a putative ribulose-bisphosphate carboxylase (RuBisCo), which is an enzyme central to photosynthesis. Intriguingly, both putative mosaic proteins involve the same PRF events: the first −2 frameshift changes translation from frame 2 (refORF) to frame 3 (altORF), and the second +2 frameshift brings translation back to the refORF located in frame 2. Such exact correspondence between these two PRF events in completely unrelated transcripts suggests a biological basis for this conservation. The discovery of two strikingly convergent transcripts that are likely to produce mosaic proteins in our non-viral system is highly novel and unexpected. Until now, mosaic proteins produced via PRF have been documented only in viruses [24], [22], [23], [21]. This is the main finding of our study. The remaining six transcripts with multiple PRF sites may also turn out to be candidates for mosaic translation if additional PRF events associated with these transcripts are found in the future.
Ideally, MS peptides matching a mosaic protein must be long enough to include both PRF sites. Due to the short read length of the current MS technology (7–35 aa, [101]), putative mosaic proteins produced by MtrunA17_Chr1g0200071 and MtrunA17_Chr6g0457461 have no matching MS peptides that would join the products of individual PRF sites. However, the gap in MS peptide coverage is fairly small: 16 aa for MtrunA17_Chr1g0200071 and only 6 aa for MtrunA17_Chr6g0457461. This may indicate that the peptides are indeed part of a continuous mosaic sequence (Fig. 1, Supplementary Dataset S5). These short distances between MS peptides permit testing the effects of corresponding synthetic mosaic proteins in vivo. In addition, due to these short distances, generation of MS proteomic data from the same samples using different digestive enzymes could potentially yield peptide “bridges” that span the gaps in the peptide coverage. This would serve as indirect evidence for continuity. Nevertheless, it is possible that MS peptides corresponding to the PRF sites of these transcripts are translated individually as chimeric non-mosaic sequences. On one hand, this possibility is supported by the existence of alternative sources of these peptides. On the other hand, while three MS peptides that correspond to putative PRF sites in MtrunA17_Chr6g0457461 were all detected in the same organ (buds), two MS peptides of MtrunA17_Chr1g0200071 were found in two different organs (one in seeds and the other one in leaves). Given the ambiguity associated with these multi-PRF sequences, the following three scenarios are not mutually exclusive. First, chimeric peptides are produced individually from the same transcript. Second, chimeric peptides are produced individually from two or more different transcripts. Third, chimeric peptides are produced in the course of mosaic translation as parts of a continuous amino acid sequence from one transcript or multiple transcripts. 
Dedicated wet lab studies are required to discriminate between these scenarios. Despite the lack of solid proof, MS peptides corresponding to MtrunA17_Chr1g0200071 and MtrunA17_Chr6g0457461 provide the first experimental hint of the possibility of mosaic translation in a non-viral system. Thus, they deserve close investigation in M. truncatula, which will pave the way toward the discovery of mosaic proteins in other organisms. Our methodology, in combination with parallel multi-enzyme digestion of protein samples for MS, is likely to reveal many more candidates for mosaic proteins.
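The rationale for multi-enzyme digestion can be illustrated with a simple in silico digestion sketch in Python. It assumes complete cleavage after the listed residues, applies the rule that cleavage is blocked before proline to every enzyme (a simplification), and keeps only peptides within the 7–35 aa detection window mentioned above. The protein sequence, the positions of the two PRF-derived junctions, and the function names are hypothetical and serve only to show how a second enzyme can yield a peptide that bridges a gap left by trypsin.

```python
def digest(protein, cleave_after, min_len=7, max_len=35):
    """Yield (start, end, peptide) for fully cleaved peptides.

    Cleavage occurs after any residue in `cleave_after` unless the next
    residue is proline; peptides outside the length window are skipped.
    """
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (residue in cleave_after and protein[i + 1] != "P"):
            if min_len <= i + 1 - start <= max_len:
                yield start, i + 1, protein[start:i + 1]
            start = i + 1


def bridging_peptides(protein, site1, site2, cleave_after):
    """Return peptides that cover both junction positions (0-based)."""
    return [
        peptide
        for start, end, peptide in digest(protein, cleave_after)
        if start <= site1 and site2 < end
    ]


# Hypothetical protein with two junctions at positions 9 and 12.
hypothetical_protein = "MSLEAGKTWFRDLNPAGKSTYVEHQLLK"

# Trypsin-like specificity: cleavage after K or R.
trypsin_bridges = bridging_peptides(hypothetical_protein, 9, 12, "KR")
# Glu-C-like specificity: cleavage after E (simplified).
gluc_bridges = bridging_peptides(hypothetical_protein, 9, 12, "E")

print("trypsin:", trypsin_bridges)
print("Glu-C:", gluc_bridges)
```

In this toy case, the trypsin-like digest cuts between the two junctions, whereas the Glu-C-like digest produces one peptide that spans both, which is the kind of “bridge” that would support continuity.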
Our study provides unique information of fundamental importance in several other respects. Firstly, the discovery of a large number of chimeric peptides in a non-viral system is another novel outcome of our work because relatively few chimeric peptides have so far been reported outside the viral kingdom: at least three in bacteria [15], [11], [12], [13], [14] and at least 400 in eukaryotes [16], [17], [18], [19], [20]. In this respect, our results agree with the prediction according to which as much as 10 % of genes in a eukaryotic genome can be associated with PRF [21]. A pioneering study in ciliates [20] and a more recent human study [19] reached similar conclusions (see Section 4.7). Secondly, the inclusion of non-mRNA transcript types in our search pipeline revealed an entirely novel possibility. It demonstrated that MS-supported PRF events can be detected in ncRNA, rRNA, and even tRNA transcripts, all of which are traditionally considered non-coding. Although still very exotic, the ability to serve as a template for translation has already been demonstrated for ncRNA [104], [105], [36] and rRNA [106], [107]. It has also been shown for pre-miRNA [109], [108] and pre-siRNA [110]. However, no information of this type has ever been reported for tRNA. At the same time, to the best of our knowledge, PRF events have never been detected in any non-mRNA transcript so far. Thus, our work sends several important messages that require close attention.
Naturally, the presence of several unusual findings in one computational MS proteomics-based study raises doubts about the validity of the entire dataset. Thus, let us set our null hypothesis as follows: the dataset is a collection of false positives arising from the difficulty of controlling the false discovery rate when the procedure involves a very large search database. The detailed manual analysis of various parameters (Supplementary Dataset S1) in conjunction with stringent statistical procedures clearly indicates that most observations are non-random and most associations are significant. This concerns not one or two parameters but several dozen distinct aspects. We summarized 37 of the most obvious significant observations in Supplementary Dataset S29. Supplementary Dataset S30 lists 14 of the most relevant significant associations. Taken together, these findings speak strongly against the null hypothesis, which indicates that our data have biological relevance despite the high false discovery rate expected for an MS proteomics-based study.
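To make the false-discovery concern concrete, the following minimal Python sketch shows one common way a target-decoy estimate can be turned into a score cutoff. It is an illustration only and not the search engine's actual procedure: the function name `target_decoy_threshold`, the `(score, is_decoy)` input format, and the simple decoys/targets estimator are assumptions, and scores are assumed to be distinct.

```python
from typing import List, Tuple


def target_decoy_threshold(psms: List[Tuple[float, bool]],
                           max_fdr: float = 0.01) -> float:
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: hypothetical (score, is_decoy) pairs, higher score = better match.
    The FDR at a cutoff is estimated as
        (number of decoy hits >= cutoff) / (number of target hits >= cutoff).
    Returns float('inf') if no cutoff satisfies the requirement.
    """
    # Walk through the matches from best to worst score.
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best_cutoff = float("inf")
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the most permissive cutoff that still meets the FDR limit.
        if targets > 0 and decoys / targets <= max_fdr:
            best_cutoff = score
    return best_cutoff


if __name__ == "__main__":
    # Toy input: three target hits and one decoy hit (made-up scores).
    toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(target_decoy_threshold(toy, max_fdr=0.01))
```

The sketch also shows why a very large search database is problematic: more candidate sequences produce more high-scoring random matches, so the cutoff that satisfies a fixed FDR limit moves upward and fewer genuine peptides pass it.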
4.3. RNA-Seq data as a support for chimeric nature of MS-validated peptides
Another major concern associated with the discovery of highly novel peptide sequences is their true origin. Can our chimeric peptides have alternative sources that do not involve PRF? Using genomic DNA as a basis, we found that one chimeric peptide in our dataset, CP80, can potentially be produced from two genomic loci with unknown translation status. However, these loci are not transcribed, which makes them unlikely non-chimeric sources of the peptide identical to CP80 (see Sections 3.7.1 and 3.7.3 in Supplementary Results). Alternative splicing events, such as exon skipping or intron retention, can potentially mimic chimeric peptides. Products of RNA editing and natural polymorphisms may also have this effect. Using the collection of individual reads from 50 relevant RNA-Seq runs, we demonstrate that only one to four MS-validated chimeric peptides (CP54, CP93, CP140, and CP148, or CP54 alone) could originate from unusual versions of native transcripts without PRF. Contrary to our predictions summarized in Supplementary Dataset S1, none of these peptides is likely to be produced via alternative splicing (Supplementary Datasets S18 and S19).
4.4. The existence of potential alternative sources is a challenge for the functional analysis of chimeric proteins
Because of the chimeric nature of MS-validated peptides in our dataset and their short length typical for an MS study, many of them can have multiple origins (Supplementary Datasets S1 and S22 to S24). This genetic redundancy will make it very difficult to study their functions by conventional loss-of-function methods such as insertional mutagenesis. Conceptually, RNA interference (RNAi) can be used to downregulate a group of related transcripts [111]. Alternative sources of chimeric peptides in our study share a limited similarity, which may permit their simultaneous silencing (as can be judged from their annotation, Supplementary Dataset S24). Nevertheless, construction of multiple mutants may be required to learn the loss-of-function phenotypes of these loci. Moreover, demonstrating individual phenotypes for a refProt, altProts, and a chimeric protein(s) derived from a given transcript will require complementation of null mutants with constructs in which these proteins are disabled differentially. Such an approach would allow separate assessment of each protein’s role. Using an in-house script [2], we conducted a comprehensive inventory of genetic loci that can potentially produce chimeric peptides identified in our study. This inventory is crucial for understanding the biological roles of chimeric proteins. At the same time, ca. 58 % of chimeric peptides reported here have unique sources, which means they can be functionally characterized without generation of multiple mutants. Based on the analysis of expression profiles, only 31 chimeric peptides out of 156 (ca. 20 %) can have alternative sources at least as likely as their primary sources (Supplementary Dataset S1). Thus, it is possible that most of the 145 transcripts that are denoted as primary sources in our study are indeed the main or even the only contributors to the translation of these peptides. 
This possibility is supported by numerous significant associations observed at the level of transcripts and genomic DNA of the primary-source loci (e.g., Fig. 3, Fig. 4, Fig. 5, Fig. 8, Supplementary Figures S11-S15 and S50-S62). These observations cannot be expected under the assumption that translation of chimeric peptides detected in this study occurs mainly or exclusively from other transcripts (alternative sources). A recent discovery of the promotive effect of synthetic complementary peptides on the expression of matching transcripts offers a unique opportunity for studying mosaic and chimeric proteins [112]. Using synthetic versions of these proteins, one can study the gain-of-function phenotypes linked not to a single locus but to all loci that contribute to producing a mosaic or chimeric protein.
4.5. Four genetic loci associated with the production of chimeric peptides in M. truncatula have been in the focus of functional studies
During our study, we recognized the importance of determining whether any genes listed as primary or alternative sources of chimeric peptides have previously been analyzed using loss-of-function approaches. As in the case of altProts, chimeric proteins are expected to introduce ambiguity to the interpretation of any genetic study. In the absence of knowledge about an altProt or a chimeric protein translated along with a refProt, any mutant phenotype linked to the locus is automatically attributed to the loss of function of the refProt. Misattributed phenotype-genotype relationships can significantly hinder efforts to treat genetic diseases or to improve crop resistance and productivity. To address this important task, we have conducted a nearly comprehensive analysis of more than 500 loss-of-function studies in M. truncatula that were published since the time of its advent as a model organism [44]. By June 2025, these studies involved 673 genetic loci. To our surprise, only two genes from our primary-source list have been the focus of such studies so far. One of them is a sucrose synthase gene MtSUCS1 (CP80, MtrunA17_Chr4g0070011). Antisense-mediated transcript knockdown of MtSUCS1 leads to defects in symbiotic nitrogen fixation and arbuscular mycorrhizal symbiosis [113], [114]. The other gene is a cytokinin oxidase/dehydrogenase MtCKX6, which was targeted by insertional mutagenesis using tobacco retrotransposon Tnt1. The phenotype of the Mtckx6 mutants was analyzed in the context of root development. It did not differ from the phenotype of wild-type plants [115]. One gene listed among two alternative sources of CP104 was targeted by RNAi. It encodes a pathogenesis-related protein MtPR10–5 (MtrunA17_Chr4g0067951). Simultaneous downregulation of this gene and four likely off-target loci led to reduced colonization and suppressed infection by the oomycete pathogen Aphanomyces euteiches [116], [117].
Our meta-analysis of publications on 673 loci targeted by at least one loss-of-function study does not include genes analyzed exclusively via gain-of-function approaches. However, one such study is worth mentioning here because it was conducted on an alternative source of two chimeric peptides, CP24 and CP101 (MtrunA17_Chr6g0458111). This locus is among six possible sources of these CPs, all of which are annotated as protein-synthesizing GTPases, including a qRT-PCR housekeeping gene MtEF1 [93], [94]. Overexpression of MtrunA17_Chr6g0458111 named MtEF1A1 in the original study renders transgenic plants of M. truncatula and A. thaliana more salt-tolerant [92]. It may be interesting to clarify whether the reported phenotypes of these genes can be attributed to their refProts, chimeric peptides, or both.
The scarcity of previously studied genes in our dataset can probably be explained by the fact that most of the 145 transcripts have transcription profiles not specific to a particular biological process (Supplementary Datasets S1 and S27). In contrast, functional genomics studies naturally focus on genes with process-specific or organ-specific transcription profiles.
4.6. Chimeric sequences may be important for symbiotic nitrogen fixation and seed development
Nearly one quarter of chimeric peptides in our study were identified in nodule samples. Together with chimeric peptides found in seeds, they make up 41 % of the dataset (64 sequences, Supplementary Dataset S1, Supplementary Figure S36). This overrepresentation of two biological processes may have technical reasons: we cannot meaningfully compare abundances of validated proteins in different samples, even if they are prepared by the same group (Supplementary Figures S6 and S36). However, it may also point to important biological roles of altORFs and chimeric peptides in symbiotic nitrogen fixation and seed development, along with other biological processes. Although it is unclear whether the detected MS peptides represent short sequences or fragments of longer proteins, it is worth noting that short peptides play a regulatory role in symbiotic nitrogen fixation [121], [122], [123], [118], [119], [120]. Among chimeric peptides detected in multiple samples, the largest number of sequences was shared by nodule and seed samples (eight chimeric peptides, Supplementary Figure S38). This moderate association is unexpected, as seed development and biological nitrogen fixation are generally thought to involve distinct gene expression profiles [124], [125]. This is further supported by the fact that only two genes out of 673 targeted by at least one loss-of-function approach in M. truncatula (as of June 2025) are likely to have roles in both processes: MtCYP15a [126] and MtNOOT2 [129], [127], [128]. On the other hand, root and nodule samples could be expected to have much overlap in the abundance of peptides based on similarities in gene expression [124], [125]. Contrary to this expectation, roots share only three chimeric peptides with nodules (Supplementary Dataset S1). Observations summarized in this section warrant dedicated functional studies on our candidate loci. 
Differential mutagenesis-based analyses of refProts and corresponding chimeric peptides from our dataset may provide useful insights into molecular mechanisms shared by nodules and seeds.
4.7. Our data complement two MS-based PRF studies conducted on ciliate and human samples
In this section, we highlight two notable studies on PRF and compare them with our findings. The pioneering work of [20] showed that PRF is not only common but also obligatory in ciliates of genus Euplotes. In these organisms with two types of nuclei, translation of micronucleate genes uses stop codons as signals for PRF rather than translation termination signals. This information contrasts with our findings in M. truncatula, where no such association between stop codons and PRF was detected. However, this unique feature supports the idea of mosaic translation in eukaryotes, where shifting between frames can extend the length of a protein product encoded by a transcript with no long ORF. The study identified only 13 PRF-derived chimeric MS peptides, none of which cluster on the same transcript, as they do in the case of the elongation factor and RuBisCo. However, many more PRF sites were predicted in these organisms based on Ribo-Seq. Some predicted PRF sites are located on the same transcripts as MS-validated PRF sites. This co-occurrence suggests the possibility of mosaic translation. There are three sequences of that type reported in Supplementary Note 1 of [20]. One of them may contain two additional PRF sites along with the MS-validated PRF site (sequence denoted as comp7073_c0_seq1). In contrast to our results, PRF events in Euplotes were restricted to +1 and +2. To the best of our knowledge, [20] provided the first MS-based demonstration of multiple chimeric peptides in a eukaryotic organism.
A more recent study by Ren and associates (2024) discovered at least 405 unique chimeric peptides in 32 diverse human samples not associated with any pathological condition. These peptides correspond to 454 loci, all of which have naturally occurring repeat codon sequences at the putative PRF sites. This group not only provided evidence for the presence of such peptides in multiple samples but also functionally characterized one of the loci, which encodes a histone deacetylase HsHDAC1. This study is very important because it shows how widespread PRF can be in eukaryotic organisms. Although our work shares a similar goal and methodology, it does not duplicate the findings of Ren et al. [19]. The studies differ in several key aspects and are therefore complementary. Firstly, none of 156 chimeric peptides reported here originate from loci with repeated codons at the site of putative PRF. Secondly, our dataset contains a homolog of HsHDAC1, which is a gene encoding a histone deacetylase MtHDA9 (MtrunA17_Chr5g0393401, CP81). It also contains a few other loci associated with histone deacetylase activity and histones in general (Supplementary Dataset S1). Transcripts of HsHDAC1 and MtHDA9 share only ca. 57 % nucleotide identity, and their PRF sites do not coincide in the alignment. Thirdly, while the availability of extensive MS proteomic datasets for humans enabled the identification of chimeric peptides across numerous samples, the M. truncatula datasets are still very limited. Thus, only 27 chimeric peptides out of 156 (ca. 17 %) were identified in more than one sample. This presents a major technical limitation of our study along with the absence of wet lab validation of our candidate peptides. However, in a few other aspects, our work has a broader scope and offers additional points of novelty. For example, our methodology is conceptually open to the detection of PRF events de novo, with no prior knowledge about similarity of PRF sites to already characterized cases. 
Along with this principal difference, our dataset is not limited to chimeric peptides produced by “short” PRF events (-1 and +1). It includes many sequences that require “long” frameshifts (-2 and +2) for their generation, namely ca. 53 %. Next, our dataset includes three non-mRNA transcript types, which constitute ca. 12 % of the sequences. Importantly, our study aimed to detect multiple PRF sites per transcript, which resulted in the discovery of 15 non-RE and eight RE loci potentially associated with multiple frameshifting. Lastly, there is a major difference in the way we considered alternative sources of chimeric sequences. Namely, along with the exclusion of sequences that correspond to refProts and non-chimeric products of short ORFs, we conducted a very comprehensive search for additional alternative sources. This search permitted the discrimination between truly chimeric sequences and those that can be produced from unusual forms of transcripts as follows from the analysis of RNA-Seq data. With an in-house script [2], we discovered alternative chimeric sources for many peptides and highlighted the role of repeat elements as a potential origin of some chimeric peptides. In summary, the insights from these three studies are likely to have a synergistic effect, encouraging further functional characterization of promising candidate peptides by other research groups.
4.8. Our data support the modern view on pre-biotic evolution and suggest that the ability to undergo “short” frameshifting may be an ancestral feature of all genomes
Besides the identification of candidate loci for the production of mosaic and chimeric proteins, our results also indirectly support the modern view on the origin of genomes, and the central role of rRNA in genome evolution. According to current consensus, the earliest genomes and enzymes were composed of single-stranded RNA. These proto-genomes were similar to the modern rRNA in several aspects. They are thought to have been formed by ligation of tRNA-like molecules that acted as monomers, or proto-genes [88], [89], [90], [86], [87]. This scenario is not purely hypothetical. It is based on experimental observations. Modern rRNA and tRNA molecules have striking features that point to their common evolution. rRNA molecules contain sequences very similar to tRNAs of all 20 usual proteinogenic amino acids [88]. Furthermore, rRNA-like sequences were found in samples of the poly-adenylated mRNAs, which makes them likely templates for translation. So far, at least 29 such transcripts have been detected in various organisms (e.g., [130], [132], [133], [131], [134]; extensively reviewed in [135] and [90]). Many of these transcripts share similarity with rRNA sequences identified in our study, although typically in regions distant from putative PRF sites. However, four transcripts exhibit significant same-strand nucleotide alignments near PRF regions: Mus musculus testin-2 [132]; Homo sapiens mitochondria-encoded Humanin [106]; Candida albicans mitochondrial protein Tar1p (closely related to Saccharomyces cerevisiae Tar1p, [131]); and Reticulitermes flavipes rRNA-like mRNA [134]. Alignments of these transcripts are shown in Supplementary Dataset S31. Interestingly, Humanin, which is a 24 aa long peptide, is thought to be translated directly from a mitochondrial rRNA transcript [107]. 
At the amino acid level, PRF-altORFs located on one tRNA and five rRNA transcripts detected in our study have similarity with annotated proteins of 31 different types (Supplementary Datasets S1 and S31), out of which only two have previously been associated with rRNA-like mRNA transcripts: testin-2 [132] and receptor-like serine/threonine-protein kinase [135]. Thus, our study expands the knowledge about rRNA-like mRNA transcripts by detecting 29 additional protein types encoded by rRNA-like sequences. To date, only three rRNA-located ORFs have been associated with rRNA-specific functions. This suggests that most rRNA-ORFs with the protein coding capacity may be involved in other biological processes. According to the ribosome-first hypothesis, in rRNA-like proto-genomes, relatively short ORFs overlapped with each other in all reading frames and directions. Thus, the density of protein-coding regions was very high in contrast to modern genomes where only a portion of genes overlap [89]. Our data support this hypothesis. For example, the analysis of all major transcript types in M. truncatula revealed that rRNA has the highest number of ORFs per transcript, when we consider ORFs that have high similarity to known proteins after the conversion of ORFs to amino acid sequences, i.e. conserved altProts. With a median length of 120 nt, rRNA transcripts are on average ten times shorter than mRNAs and nearly four times shorter than ncRNAs. Despite the short length, rRNA transcripts contain ca. 31 and ten times more conserved altORFs compared to mRNA and ncRNA, respectively (Table 1). Another remarkable observation concerns the percent identity of ORF translations aligned with annotated proteins from the global collection. The median value of percent identity is the highest for tRNA (100 %) followed by rRNA (93.3 %), as can be seen in Table 1 and Supplementary Figure S1. 
The same concerns the median value of query coverage (100 % and 94.8 %, respectively; the global BLASTP analysis, Table 1). This means that ORF translations from tRNA and rRNA have the highest similarity to annotated proteins, even though these RNA types are traditionally classified as non-coding. Likewise, ORF translations from rRNA have the highest median number of hits per query in the global BLASTP analysis (Supplementary Figures S2 and S3). It is ten times higher compared to mRNA. This supports observations of other researchers according to which rRNA-like sequences are overrepresented in genomes [90]. However, our data confirm this overrepresentation at the protein level, which once again brings rRNA into the spotlight as a potential template for translation. This overrepresentation is also evident at the transcript level, as follows from our analysis of shared subjects in BLASTN (Supplementary Figures S63-S65). In view of these special characteristics of rRNA, it is perhaps not surprising that rRNA transcripts are significantly overrepresented (25-fold) in our dataset (Supplementary Figure S50A). Many of our chimeric models were built using conserved altProts, which were most abundant in rRNA (Table 1). The surprising point is the absence of MS-validated altProts from either rRNA or tRNA in our dataset and the presence of only MS-validated chimeric peptides associated with these transcript types (Supplementary Figure S6). This paradox suggests that some characteristics of rRNA and tRNA distinguish them from other RNA types with regard to PRF. The nearly complete prevalence of PRF value −1 in rRNA and tRNA (Fig. 3) further points to the tight evolutionary link between these transcript types. This prevalence is remarkable; the only rRNA transcript associated with multiple frameshifting in our study (MtrunA17_Chr5g0422291) has both PRF sites with value −1 (Supplementary Dataset S4). PRF in tRNA and rRNA has been unknown so far. 
Likewise, the ability of ncRNA to produce chimeric peptides has never been described before. In our study, 11 such peptides originate from ncRNA. One of eight multi-PRF transcripts illustrated in Supplementary Dataset S4 is ncRNA from mitochondria (MtrunA17_MTg0490971).
Since many peptides in our dataset have more than one potential source in the transcriptome, there is a valid question about the true origin of tRNA- and rRNA-derived chimeric MS peptides. To address this concern, we examined the annotations of loci serving as alternative sources of chimeric peptides derived from non-mRNA transcripts (Supplementary Dataset S1). This analysis revealed that both alternative sources of CP72 (tRNA) are annotated as tRNA. However, CP72 can also be produced by the reverse complement of repeat elements that overlap those tRNA loci. All five non-repeat alternative sources of CP87 (rRNA) are annotated as rRNA. However, this peptide can also be produced by repeat elements. CP1, CP88, CP89, and CP141 have only repeat-element alternative sources. In contrast, CP114 (rRNA) has no detected alternative sources. This indicates that our study is the first one to demonstrate the presence of a unique chimeric rRNA-encoded MS peptide in a proteomic dataset. This aligns with the suggestion that the non-chimeric peptide Humanin (HN1) may be translated directly from rRNA [106], [107] because the transcript of CP114 shares 67.4 % nucleotide identity with the transcript of Humanin (Supplementary Dataset S31). Interestingly, CP114 is the only rRNA-chimeric peptide with PRF value +2. All other rRNA- and tRNA-peptides have PRF value −1 (Fig. 3). It should be noted that no evidence for the association of repeat elements with PRF value −1 or any other PRF value can be found in our data. Furthermore, among the repeat elements mentioned above, only those that correspond to CP88 and CP89 have sufficiently high expression in corresponding organs to be considered as true sources of those rRNA-derived peptides. At the same time, regions around PRF sites in the transcripts of CP87, CP88, and CP89 share nucleotide identity with two different rRNA-like mRNA transcripts described earlier (Supplementary Datasets S28 and S31).
Together, these observations suggest that CP114 is not the only chimeric peptide translated directly from rRNA (Supplementary Dataset S1). To the best of our knowledge, the detection of rRNA and ncRNA loci with more than one PRF site per transcript is also unique to our study (Supplementary Datasets S4 and S28).
Do non-nuclear transcripts exhibit any distinctive features in the context of PRF? We noticed a trend toward overrepresentation of PRF value +1 in mitochondrial transcripts regardless of the RNA type (Supplementary Figure S11) and also in non-nuclear transcripts combined (Fig. 4). Remarkably, the trend is opposite for nuclear transcripts, where PRF value +1 is underrepresented. Two out of eight multi-PRF transcripts shown in Supplementary Dataset S4 have mitochondrial origin, one mRNA and one ncRNA. Both transcripts contain exclusively +1 PRF sites. Chloroplast and mitochondrial transcripts are significantly overrepresented in the dataset (19-fold and 12-fold, respectively; Supplementary Figure S50). These observations reflect the origin of these organelles and highlight the PRF-related difference between nuclear (eukaryotic) and non-nuclear (prokaryotic) transcripts. The biological significance of this difference deserves close attention. It is possible that “short” frameshifts that appear to be associated with transcripts of ancient origin in our study (-1 frameshifts in rRNA and +1 frameshifts in non-nuclear transcripts) represent the ancestral feature whereas “long” frameshifts evolved later. Earlier, we emphasized that prokaryotes are likely to have tremendously benefited from mosaic translation and from PRF in general given the absence of alternative splicing and relatively small genome sizes. In this respect, they are close to viruses where PRF is essential for survival.
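The fold values quoted for overrepresented transcript categories can be illustrated with a short Python sketch. It computes a fold enrichment and an upper-tail hypergeometric probability from four counts. The choice of the hypergeometric test is an assumption made for illustration and is not necessarily the statistical procedure used in this study; the function names and the counts in the usage example are hypothetical.

```python
from math import comb


def fold_enrichment(k: int, n: int, K: int, N: int) -> float:
    """Ratio of the observed fraction (k/n) to the background fraction (K/N).

    k: transcripts of a category in the dataset (e.g., mitochondrial)
    n: all transcripts in the dataset
    K: transcripts of that category in the whole transcriptome
    N: all transcripts in the whole transcriptome
    """
    # (k/n) / (K/N) rewritten with integer products to limit rounding error.
    return (k * N) / (n * K)


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n transcripts are drawn without replacement from N,
    of which K belong to the category of interest."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(n, K) + 1))
    return favourable / total


if __name__ == "__main__":
    # Hypothetical counts chosen only to demonstrate the calculation.
    k, n, K, N = 5, 100, 10, 1000
    print(fold_enrichment(k, n, K, N))
    print(hypergeom_upper_tail(k, n, K, N))
```

A fold value alone does not indicate significance, because a large ratio can arise from very small counts; this is why a tail probability is reported alongside it in the sketch.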
Several studies emphasized the central role of the ribosome in the origin of life [88], [89], [90], [136], [91]. It is likely that the ability of the ribosome to maintain the reading frame was acquired gradually during evolution. Thus, for a long time, (spontaneous) ribosomal frameshifting must have been the default behavior resulting from the lack of efficient frame control. We hypothesize that uncontrolled ribosomal frameshifting evolved into the ability to undergo PRF, which became an intrinsic feature of all genomes. Abundant chimeric peptides and proteins, most of which must have been by-products of pervasive transcription and translation [137], constituted the primeval biochemical variation, which was the substrate for natural selection. Thus, chimeric sequences probably contributed to the competitiveness and adaptation of early cells. Conceivably, at some early point in evolution, PRF was harnessed for mosaic translation. This process must have diversified the proteomes of primitive organisms because it utilizes the genomic space more efficiently and inventively. This “new dimension” of the proteomes probably facilitated the gradual increase in complexity and adaptability of organisms, which ensured their evolutionary success [10].
4.9. Limitations of the present study
This work re-analyses public MS datasets using a large in silico search space of altORFs and chimeric models. Although we controlled the FDR by setting the threshold at 1 % using a target-decoy strategy and applied a two-step search to mitigate database inflation, residual false positives remain possible. In the absence of Ribo-Seq data for M. truncatula, PRF sites were inferred from the chimeric MS peptide sequences. This approach allows the discovery of novel sites but is limited by its one-sided nature. The ability to fold is often considered an indicator of functionality. However, for short peptides, AlphaFold2-based models provide only indicative insights, such as order/disorder tendencies or secondary-structure propensities rather than definitive atomic structures, and should be interpreted with caution. Our pilot analyses are restricted to a single species and three datasets; future studies should broaden this scope. Although MS proteomics is the most direct method for detecting non-canonical peptides, it cannot alone distinguish PRF events from genetic frameshifts or atypical splicing. We therefore screened public RNA-Seq data for potential non-PRF origins of chimeric MS peptides. The requirement for such post-hoc analyses, and particularly for ensuring their depth, represents a generic limitation of this and other studies on novel amino acid sequences. Multiple alternative chimeric sources identified for many peptides in our dataset present a challenge for the functional analysis of the corresponding loci. They also make it more difficult to unequivocally identify transcripts that have more than one PRF site, which is crucial for studies on mosaic translation. Finally, parameter choices, such as fragment-tolerance settings and limiting modeled PRF events to four types (−2, −1, +1, and +2), reflect a balance between sensitivity and search-space inflation. At the same time, they artificially restrict the detectable range of PRF events. 
For example, the current pipeline cannot identify events involving overlapping ORFs with PRF values larger than ±2 (specifically, −3, −4, −5, etc.; +3, +4, +5, etc.).
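The restriction to four PRF values can be illustrated with a minimal Python sketch of how a single-frameshift chimeric model may be derived from a transcript. This is not the MosaicProt implementation: the function names, the argument layout, and the convention that a PRF value of −1 moves the ribosome one nucleotide upstream (and +1 one nucleotide downstream) before decoding resumes are assumptions made for illustration. The standard genetic code is used.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

ALLOWED_SHIFTS = (-2, -1, 1, 2)  # the four modeled PRF values


def translate(seq: str) -> str:
    """Translate complete codons of seq, stopping at the first stop codon."""
    residues = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)


def chimeric_model(transcript: str, start: int,
                   codons_before_shift: int, shift: int) -> str:
    """Decode `codons_before_shift` codons from `start`, move the reading
    position by `shift` nucleotides, and continue to the next stop codon."""
    if shift not in ALLOWED_SHIFTS:
        raise ValueError("only PRF values -2, -1, +1 and +2 are modeled")
    if codons_before_shift < 1:
        raise ValueError("at least one codon must precede the shift site")
    shift_site = start + 3 * codons_before_shift
    upstream = translate(transcript[start:shift_site])
    if len(upstream) < codons_before_shift:
        raise ValueError("stop codon or transcript end before the shift site")
    downstream = translate(transcript[shift_site + shift:])
    return upstream + downstream


if __name__ == "__main__":
    toy = "ATGGCCATTTGGGTAA"  # made-up sequence
    for s in ALLOWED_SHIFTS:
        print(s, chimeric_model(toy, 0, 2, s))
```

Extending `ALLOWED_SHIFTS` would enlarge the set of candidate sequences per shift site and, with it, the MS search database, which is the trade-off between sensitivity and search-space inflation described above.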
5. Conclusions
We hope that our study will constitute a rich resource for many discoveries associated with PRF and translation from non-mRNA transcript types. In a broader sense, we anticipate that this work will seed a new field of genetic studies that will consider the nearly infinite protein-coding potential of transcripts due to multiple PRF events. We also hope that many genetic diseases, so far not explained by traditional views on proteome complexity, will ultimately find their cures when chimeric and mosaic proteins in eukaryotes are discovered on a large scale. Likewise, the progress in the development of more resilient and more productive crops will be sped up by the discovery of chimeric and mosaic proteins involved in efficient stress responses and other agriculturally important traits. We believe that major advances in nanopore sequencing of proteins will enable direct detection of chimeric and mosaic proteins, leaving no doubt about their composite origin. As more evidence supporting the conclusions of our study becomes available, it is possible that our view on the protein-coding potential in higher eukaryotes will undergo a transition from two-dimensional (refProt plus altProt) to multi-dimensional (refProt plus numerous multi-frame chimeric products). We are looking forward to further development of this concept, which is currently in its infancy.
Author statement
All authors have read and approved the submission of this revised manuscript to Computational and Structural Biotechnology Journal. The manuscript is not under consideration by any other journal.
CRediT authorship contribution statement
Umut Çakır: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Igor S. Kryvoruchko: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Senjuti Sinharoy: Writing – review & editing, Conceptualization. Selen Kaya: Investigation, Data curation. Yunus Emre Köroğlu: Software, Methodology. Noujoud Gabed: Writing – review & editing, Validation, Resources, Investigation, Formal analysis, Conceptualization. Vagner A. Benedito: Writing – review & editing, Conceptualization. Xavier Roucou: Writing – review & editing, Methodology. Marie Brunet: Methodology.
Declaration of Competing Interest
The authors declare no competing interest relevant to this study.
Acknowledgements
This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) grants and Boğaziçi University standard research grant (BAP-P) to UÇ and IK (TÜBİTAK 1001 120Z514, TÜBİTAK 1002 120Z247, and BAP-P 18841), a Canada Research Chair in Functional Proteomics and Discovery of Novel Proteins to XR, and a Canadian Institutes of Health Research Project Grant PJT-175322 to XR and MB. MB was supported by a Fonds de Recherche du Québec en Santé (FRQS) Junior 1 award (307936). Computational analysis was conducted using the server of the Turkish National e-Science e-Infrastructure (TRUBA) center. The completion of this study was made possible by support for IK from United Arab Emirates University and for UÇ from the IMPRS-Genome Science PhD program.
Footnotes
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2025.09.019.
Appendix A. Supplementary material
Supplementary material
Data availability
Sequences of MS-supported chimeric models, their corresponding MS peptides and transcripts, as well as genomic DNA, alignments, and other relevant features with detailed annotation are available in Geneious® format in the online version of this article. Chimeric protein model sequences used as MS search databases, certificates of analysis, and peptide/protein report files from MS searches are available as separate supplementary datasets in the online version of this article (Supplementary Datasets S32-S36). Corresponding files for the conservation evidence and MS validation of the altProts that served as the basis for modeling chimeric proteins will be published later in a paper focused on altProts. The pipelines developed for this study are available through two software articles: one for the modeling of chimeric peptides (MosaicProt, [1]) and one for the detection of alternative chimeric sources (ChiMSource, [2]). On 8 January 2025, our data were integrated into the M. truncatula genome portal MtrunA17r5.0-ANR as a separate track. In total, 156 MS-validated chimeric peptides were mapped to 246 primary and alternative source loci in the genome browser (non-RE alternative sources only). In addition, 805 MS-validated non-chimeric altProts were mapped to 833 genomic loci. The complete set of data related to this article is also available from the Zenodo digital repository (https://zenodo.org/records/17095391; DOI: https://doi.org/10.5281/zenodo.17095391).
References
- 1. Çakır U., Gabed N., Yurtseven A., Kryvoruchko I.S. A universal pipeline MosaicProt enables large-scale modeling and detection of chimeric protein sequences for studies on programmed ribosomal frameshifting. bioRxiv. 2025 doi: 10.1101/2025.05.29.656767.
- 2. Çakır U., Gabed N., Yurtseven A., Kryvoruchko I.S. ChiMSource improves the accuracy of studies on novel amino acid sequences by predicting alternative sources of mass spectrometry-derived peptides. Comput Struct Biotechnol J. 2025;27:3704–3709. doi: 10.1016/j.csbj.2025.08.023.
- 3. Mouilleron H., Delcourt V., Roucou X. Death of a dogma: eukaryotic mRNAs can code for more than one protein. Nucleic Acids Res. 2016;44:14–23. doi: 10.1093/nar/gkv1218.
- 4. Brunet M.A., Brunelle M., Lucier J.F., Delcourt V., Levesque M., et al. OpenProt: a more comprehensive guide to explore eukaryotic coding potential and proteomes. Nucleic Acids Res. 2019 doi: 10.1093/nar/gky936.
- 5. Brunet M.A., Lucier J.F., Levesque M., Leblanc S., Jacques J.F., et al. OpenProt 2021: deeper functional annotation of the coding potential of eukaryotic genomes. Nucleic Acids Res. 2021;49:D380–D388. doi: 10.1093/nar/gkaa1036.
- 6. Leblanc S., Yala F., Provencher N., Lucier J.F., Levesque M., et al. OpenProt 2.0 builds a path to the functional characterization of alternative proteins. Nucleic Acids Res. 2024;52:D522–D528. doi: 10.1093/nar/gkad1050.
- 7. Ardern Z. Alternative reading frames are an underappreciated source of protein sequence novelty. J Mol Evol. 2023;91:570–580. doi: 10.1007/s00239-023-10122-3.
- 8. Orr M.W., Mao Y., Storz G., Qian S.B. Alternative ORFs and small ORFs: shedding light on the dark proteome. Nucleic Acids Res. 2020;48:1029–1042. doi: 10.1093/nar/gkz734.
- 9. Riegger R.J., Caliskan N. Thinking outside the frame: impacting genomes capacity by programmed ribosomal frameshifting. Front Mol Biosci. 2022;9 doi: 10.3389/fmolb.2022.842261.
- 10. Çakır U., Gabed N., Brunet M., Roucou X., Kryvoruchko I.S. Mosaic translation hypothesis: chimeric polypeptides produced via multiple ribosomal frameshifting as a basis for adaptability. FEBS J. 2023;290:370–378. doi: 10.1111/febs.16269.
- 11. Blinkowa A.L., Walker J.R. Programmed ribosomal frameshifting generates the Escherichia coli DNA polymerase III gamma subunit from within the tau subunit reading frame. Nucleic Acids Res. 1990;18:1725–1729. doi: 10.1093/nar/18.7.1725.
- 12. Chaijarasphong T., Nichols R.J., Kortright K.E., Nixon C.F., Teng P.K., et al. Programmed ribosomal frameshifting mediates expression of the α-carboxysome. J Mol Biol. 2016;428:153–164. doi: 10.1016/j.jmb.2015.11.017.
- 13. Flower A.M., McHenry C.S. The gamma subunit of DNA polymerase III holoenzyme of Escherichia coli is produced by ribosomal frameshifting. Proc Natl Acad Sci USA. 1990;87:3713–3717. doi: 10.1073/pnas.87.10.3713.
- 14. Meydan S., Klepacki D., Karthikeyan S., Margus T., Thomas P., et al. Programmed ribosomal frameshifting generates a copper transporter and a copper chaperone from the same gene. Mol Cell. 2017;65:207–219. doi: 10.1016/j.molcel.2016.
- 15. Tsuchihashi Z., Kornberg A. Translational frameshifting generates the gamma subunit of DNA polymerase III holoenzyme. Proc Natl Acad Sci USA. 1990;87:2516–2520. doi: 10.1073/pnas.87.7.2516.
- 16. Clark M.B., Jänicke M., Gottesbühren U., Kleffmann T., Legge M., et al. Mammalian gene PEG10 expresses two reading frames by high efficiency -1 frameshifting in embryonic-associated tissues. J Biol Chem. 2007;282:37359–37369. doi: 10.1074/jbc.M705676200.
- 17. Ivanov I.P., Atkins J.F. Ribosomal frameshifting in decoding antizyme mRNAs from yeast and protists to humans: close to 300 cases reveal remarkable diversity despite underlying conservation. Nucleic Acids Res. 2007;35:1842–1858. doi: 10.1093/nar/gkm035.
- 18. Matsufuji S., Matsufuji T., Miyazaki Y., Murakami Y., Atkins J.F., et al. Autoregulatory frameshifting in decoding mammalian ornithine decarboxylase antizyme. Cell. 1995;80:51–60. doi: 10.1016/0092-8674(95)90450-6.
- 19. Ren G., Gu X., Zhang L., Gong S., Song S., et al. Ribosomal frameshifting at normal codon repeats recodes functional chimeric proteins in human. Nucleic Acids Res. 2024 doi: 10.1093/nar/gkae035.
- 20. Lobanov A.V., Heaphy S.M., Turanov A.A., Gerashchenko M.V., Pucciarelli S., et al. Position-dependent termination and widespread obligatory frameshifting in Euplotes translation. Nat Struct Mol Biol. 2017;24:61–68. doi: 10.1038/nsmb.3330.
- 21. Ketteler R. On programmed ribosomal frameshifting: the alternative proteomes. Front Genet. 2012;3:242. doi: 10.3389/fgene.2012.00242.
- 22. Hatfield D.L., Levin J.G., Rein A., Oroszlan S. Translational suppression in retroviral gene expression. Adv Virus Res. 1992;41:193–239. doi: 10.1016/s0065-3527(08)60037-8.
- 23. Jacks T. Translational suppression in gene expression in retroviruses and retrotransposons. Curr Top Microbiol Immunol. 1990;157:93–124. doi: 10.1007/978-3-642-75218-6_4.
- 24. Rex C., Nadeau M.J., Douville R., Schellenberg K. Expression of human endogenous retrovirus-K in spinal and bulbar muscular atrophy. Front Neurol. 2019;10:968. doi: 10.3389/fneur.2019.00968.
- 25. Brandes N., Linial M. Gene overlapping and size constraints in the viral world. Biol Direct. 2016;11:26. doi: 10.1186/s13062-016-0128-3.
- 26. Timp W., Timp G. Beyond mass spectrometry, the next step in proteomics. Sci Adv. 2020;6 doi: 10.1126/sciadv.aax8978.
- 27. Pugh J. The current state of nanopore sequencing. Methods Mol Biol. 2023;2632:3–14. doi: 10.1007/978-1-0716-2996-3_1.
- 28. Martin-Baniandres P., Lan W.H., Board S., Romero-Ruiz M., Garcia-Manyes S., et al. Enzyme-less nanopore detection of post-translational modifications within long polypeptides. Nat Nanotechnol. 2023;18:1335–1340. doi: 10.1038/s41565-023-01462-8.
- 29. Wang K., Zhang S., Zhou X., Yang X., Li X., et al. Unambiguous discrimination of all 20 proteinogenic amino acids and their modifications by nanopore. Nat Methods. 2024;21:92–101. doi: 10.1038/s41592-023-02021-8.
- 30. Deutsch E.W., Bandeira N., Perez-Riverol Y., Sharma V., Carver J.J., et al. The ProteomeXchange consortium at 10 years: 2023 update. Nucleic Acids Res. 2023;51:D1539–D1548. doi: 10.1093/nar/gkac1040.
- 31. Perez-Riverol Y., Bai J., Bandla C., García-Seisdedos D., Hewapathirana S., et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 2022;50:D543–D552. doi: 10.1093/nar/gkab1038.
- 32. Ning K., Fermin D., Nesvizhskii A.I. Computational analysis of unassigned high-quality MS/MS spectra in proteomic data sets. Proteomics. 2010;10:2712–2718. doi: 10.1002/pmic.200900473.
- 33. Pathan M., Samuel M., Keerthikumar S., Mathivanan S. Unassigned MS/MS spectra: who am I? Methods Mol Biol. 2017;1549:67–74. doi: 10.1007/978-1-4939-6740-7_6.
- 34. Lluch-Senar M., Mancuso F.M., Climente-González H., Peña-Paz M.I., Sabido E., et al. Rescuing discarded spectra: full comprehensive analysis of a minimal proteome. Proteomics. 2016;16:554–563. doi: 10.1002/pmic.201500187.
- 35. Cao X., Sun S., Xing J. A massive proteogenomic screen identifies thousands of novel peptides from the human "dark" proteome. Mol Cell Proteom. 2024;23 doi: 10.1016/j.mcpro.2024.100719.
- 36. Ingolia N.T., Lareau L.F., Weissman J.S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011;147:789–802. doi: 10.1016/j.cell.2011.10.002.
- 37. Kiniry S.J., O'Connor P.B.F., Michel A.M., Baranov P.V. Trips-Viz: a transcriptome browser for exploring Ribo-Seq data. Nucleic Acids Res. 2019;47:D847–D852. doi: 10.1093/nar/gky842.
- 38. Kiniry S.J., Judge C.E., Michel A.M., Baranov P.V. Trips-Viz: an environment for the analysis of public and user-generated ribosome profiling data. Nucleic Acids Res. 2021;49:W662–W670. doi: 10.1093/nar/gkab323.
- 39. Richardson M.O., Eddy S.R. ORFeus: a computational method to detect programmed ribosomal frameshifts and other non-canonical translation events. BMC Bioinforma. 2023;24:471. doi: 10.1186/s12859-023-05602-8.
- 40. Kuo G. When fossil fuels run out, what then? The Millennium Alliance for Humanity and the Biosphere (MAHB). 2019. Available: https://mahb.stanford.edu/library-item/fossil-fuels-run/ Accessed: 2025 Aug 01.
- 41. Li B., Boiarkina I., Young B., Yu W., Singhal N. Prediction of future phosphate rock: a demand based model. J Environ Inf. 2017;31:41–53. doi: 10.3808/jei.201700364.
- 42. Nussey B. Five trends that will lead to the end of fossil fuels. The Freeing Energy Project. 2019. Available: https://www.freeingenergy.com/five-trends-that-will-lead-to-the-end-of-fossil-fuels/ Accessed: 2025 Aug 01.
- 43. Bandyopadhyay K., Verdier J., Kang Y. The model legume Medicago truncatula: past, present, and future. In: Khurana S., Gaur R., editors. Plant biotechnology: progress in genomic era. Singapore: Springer; 2019. pp. 109–130.
- 44. Barker D.G., Bianchi S., Blondon F., Dattée Y., Duc G., et al. Medicago truncatula, a model plant for studying the molecular genetics of the Rhizobium-legume symbiosis. Plant Mol Biol Rep. 1990;8:40–49. doi: 10.1007/BF02668879.
- 45. Kang Y., Li M., Sinharoy S., Verdier J. A snapshot of functional genetic studies in Medicago truncatula. Front Plant Sci. 2016;7:1175. doi: 10.3389/fpls.2016.01175.
- 46. Nandety R.S., Wen J., Mysore K.S. Medicago truncatula resources to study legume biology and symbiotic nitrogen fixation. Fundam Res. 2022;3:219–224. doi: 10.1016/j.fmre.2022.06.018.
- 47. Bonfante P. The future has roots in the past: the ideas and scientists that shaped mycorrhizal research. N Phytol. 2018;220:982–995. doi: 10.1111/nph.15397.
- 48. Jhu M.Y., Oldroyd G.E.D. Dancing to a different tune, can we switch from chemical to biological nitrogen fixation for sustainable food security? PLoS Biol. 2023;21 doi: 10.1371/journal.pbio.3001982.
- 49. Pankievicz V.C.S., Irving T.B., Maia L.G.S., Ané J.M. Are we there yet? The long walk towards the development of efficient symbiotic associations between nitrogen-fixing bacteria and non-leguminous crops. BMC Biol. 2019;17:99. doi: 10.1186/s12915-019-0710-0.
- 50. Pecrix Y., Staton S.E., Sallet E., Lelandais-Brière C., Moreau S., et al. Whole-genome landscape of Medicago truncatula symbiotic genes. Nat Plants. 2018;4:1017–1025. doi: 10.1038/s41477-018-0286-7.
- 51. UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506–D515. doi: 10.1093/nar/gky1049.
- 52. Buchfink B., Reuter K., Drost H.G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18:366–368. doi: 10.1038/s41592-021-01101-x.
- 53. Barsnes H., Vaudel M. SearchGUI: a highly adaptable common interface for proteomics search and de novo engines. J Proteome Res. 2018;17:2552–2555. doi: 10.1021/acs.jproteome.8b00175.
- 54. Vaudel M., Burkhart J.M., Zahedi R.P., Oveland E., Berven F.S., et al. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat Biotechnol. 2015;33:22–24. doi: 10.1038/nbt.3109.
- 55. Marx H., Minogue C.E., Jayaraman D., Richards A.L., Kwiecien N.W., et al. A proteomic atlas of the legume Medicago truncatula and its nitrogen-fixing endosymbiont Sinorhizobium meliloti. Nat Biotechnol. 2016;34:1198–1205. doi: 10.1038/nbt.3681.
- 56. Shin J., Marx H., Richards A., Vaneechoutte D., Jayaraman D., et al. A network-based comparative framework to study conservation and divergence of proteomes in plant phylogenies. Nucleic Acids Res. 2021;49 doi: 10.1093/nar/gkaa1041.
- 57. Castañeda V., González E.M., Wienkoop S. Phloem sap proteins are part of a core stress responsive proteome involved in drought stress adjustment. Front Plant Sci. 2021;12 doi: 10.3389/fpls.2021.625224.
- 58. Hulstaert N., Shofstahl J., Sachsenberg T., Walzer M., Barsnes H., et al. ThermoRawFileParser: modular, scalable, and cross-platform RAW file conversion. J Proteome Res. 2020;19:537–542. doi: 10.1021/acs.jproteome.9b00328.
- 59. Bjornson R.D., Carriero N.J., Colangelo C., Shifman M., Cheung K.H., et al. X!!Tandem, an improved method for running X!Tandem in parallel on collections of commodity computers. J Proteome Res. 2008;7:293–299. doi: 10.1021/pr0701198.
- 60. Kim S., Pevzner P.A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun. 2014;5:5277. doi: 10.1038/ncomms6277.
- 61. Geer L.Y., Markey S.P., Kowalak J.A., Wagner L., Xu M., et al. Open mass spectrometry search algorithm. J Proteome Res. 2004;3:958–964. doi: 10.1021/pr0499491.
- 62. Eng J.K., Jahan T.A., Hoopmann M.R. Comet: an open-source MS/MS sequence database search tool. Proteomics. 2013;13:22–24. doi: 10.1002/pmic.201200439.
- 63. GPM: Generalized Proteomics data Meta-analysis. Accessed on 15 February 2024. https://www.thegpm.org/crap/.
- 64. Kumar D., Yadav A.K., Dash D. Choosing an optimal database for protein identification from tandem mass spectrometry data. Methods Mol Biol. 2017;1549:17–29. doi: 10.1007/978-1-4939-6740-7_3.
- 65. Li H., Joh Y.S., Kim H., Paek E., Lee S.W., et al. Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification. BMC Genom. 2016;17:1031. doi: 10.1186/s12864-016-3327-5.
- 66. Jagtap P., Goslinga J., Kooren J.A., McGowan T., Wroblewski M.S., et al. A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies. Proteomics. 2013;13:1352–1357. doi: 10.1002/pmic.201200352.
- 67. Weiss R.B., Dunn D.M., Atkins J.F., Gesteland R.F. Slippery runs, shifty stops, backward steps, and forward hops: -2, -1, +1, +2, +5, and +6 ribosomal frameshifting. Cold Spring Harb Symp Quant Biol. 1987;52:687–693. doi: 10.1101/sqb.1987.052.01.078.
- 68. Yan S., Wen J.D., Bustamante C., Tinoco I. Jr. Ribosome excursions during mRNA translocation mediate broad branching of frameshift pathways. Cell. 2015;160:870–881. doi: 10.1016/j.cell.2015.02.003.
- 69. Mirdita M., Schütze K., Moriwaki Y., Heo L., Ovchinnikov S., et al. ColabFold: making protein folding accessible to all. Nat Methods. 2022;19:679–682. doi: 10.1038/s41592-022-01488-1.
- 70. Pettersen E.F., Goddard T.D., Huang C.C., Couch G.S., Greenblatt D.M., et al. UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem. 2004;25:1605–1612. doi: 10.1002/jcc.20084.
- 71. Bertoline L.M.F., Lima A.N., Krieger J.E., Teixeira S.K. Before and after AlphaFold2: an overview of protein structure prediction. Front Bioinform. 2023;3 doi: 10.3389/fbinf.2023.1120370.
- 72. McDonald E.F., Jones T., Plate L., Meiler J., Gulsevin A. Benchmarking AlphaFold2 on peptide structure prediction. Structure. 2023;31:111–119. doi: 10.1016/j.str.2022.11.012.
- 73. Sievers F., Wilm A., Dineen D., Gibson T.J., Karplus K., et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011;7:539. doi: 10.1038/msb.2011.75.
- 74. Bailey T.L., Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994;2:28–36.
- 75. Bailey T.L., Johnson J., Grant C.E., Noble W.S. The MEME suite. Nucleic Acids Res. 2015;43:W39–W49. doi: 10.1093/nar/gkv416.
- 76. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 77. Boratyn G.M., Camacho C., Cooper P.S., Coulouris G., Fong A., et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res. 2013;41:W29–W33. doi: 10.1093/nar/gkt282.
- 78. Katz K., Shutov O., Lapoint R., Kimelman M., Brister J.R., et al. The sequence read archive: a decade more of explosive growth. Nucleic Acids Res. 2022;50:D387–D390. doi: 10.1093/nar/gkab1053.
- 79. Leinonen R., Sugawara H., Shumway M., International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res. 2011;39:D19–D21. doi: 10.1093/nar/gkq1019.
- 80. Carrere S., Verdier J., Gamas P. MtExpress, a comprehensive and curated RNAseq-based gene expression atlas for the model legume Medicago truncatula. Plant Cell Physiol. 2021;62:1494–1500. doi: 10.1093/pcp/pcab110.
- 81. R Core Team (2021) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Accessed on 17 February 2024. https://www.R-project.org/.
- 82. Boivin V., Faucher-Giguère L., Scott M., Abou-Elela S. The cellular landscape of mid-size noncoding RNA. Wiley Inter Rev RNA. 2019;10 doi: 10.1002/wrna.1530.
- 83. Palazzo A.F., Lee E.S. Non-coding RNA: what is functional and what is junk? Front Genet. 2015;6:2. doi: 10.3389/fgene.2015.00002.
- 84. Sieber P., Platzer M., Schuster S. The definition of open reading frame revisited. Trends Genet. 2018;34:167–170. doi: 10.1016/j.tig.2017.12.009.
- 85. Altschul S.F. Amino acid substitution matrices from an information theoretic perspective. J Mol Biol. 1991;219:555–565. doi: 10.1016/0022-2836(91)90193-a.
- 86. Caetano-Anollés D., Caetano-Anollés G. Piecemeal buildup of the genetic code, ribosomes, and genomes from primordial tRNA building blocks. Life. 2016;6:43. doi: 10.3390/life6040043.
- 87. de Farias S.T., Rêgo T.G., José M.V. tRNA core hypothesis for the transition from the RNA world to the ribonucleoprotein world. Life. 2016;6:15. doi: 10.3390/life6020015.
- 88. Root-Bernstein M., Root-Bernstein R. The ribosome as a missing link in the evolution of life. J Theor Biol. 2015;367:130–158. doi: 10.1016/j.jtbi.2014.11.025.
- 89. Root-Bernstein R., Root-Bernstein M. The ribosome as a missing link in prebiotic evolution II: ribosomes encode ribosomal proteins that bind to common regions of their own mRNAs and rRNAs. J Theor Biol. 2016;397:115–127. doi: 10.1016/j.jtbi.2016.02.030.
- 90. Root-Bernstein R., Root-Bernstein M. The ribosome as a missing link in prebiotic evolution III: Over-Representation of tRNA- and rRNA-Like sequences and plieofunctionality of Ribosome-Related molecules argues for the evolution of primitive genomes from ribosomal RNA modules. Int J Mol Sci. 2019;20:140. doi: 10.3390/ijms20010140.
- 91. Bowman J.C., Petrov A.S., Frenkel-Pinter M., Penev P.I., Williams L.D. Root of the tree: the significance, evolution, and origins of the ribosome. Chem Rev. 2020;120:4848–4878. doi: 10.1021/acs.chemrev.9b00742.
- 92. Xu L., Zhang L., Liu Y., Sod B., Li M., et al. Overexpression of the elongation factor MtEF1A1 promotes salt stress tolerance in Arabidopsis thaliana and Medicago truncatula. BMC Plant Biol. 2023;23:138. doi: 10.1186/s12870-023-04139-5.
- 93. Gomez S.K., Javot H., Deewatthanawong P., Torres-Jerez I., Tang Y., et al. Medicago truncatula and Glomus intraradices gene expression in cortical cells harboring arbuscules in the arbuscular mycorrhizal symbiosis. BMC Plant Biol. 2009;9:10. doi: 10.1186/1471-2229-9-10.
- 94. Jiang Y., Xie Q., Wang W., Yang J., Zhang X., et al. Medicago AP2-domain transcription factor WRI5a is a master regulator of lipid biosynthesis and transfer during mycorrhizal symbiosis. Mol Plant. 2018;11:1344–1359. doi: 10.1016/j.molp.2018.09.006.
- 95. Kakar K., Wandrey M., Czechowski T., Gaertner T., Scheible W.R., et al. A community resource for high-throughput quantitative RT-PCR analysis of transcription factor gene expression in Medicago truncatula. Plant Methods. 2008;4:18. doi: 10.1186/1746-4811-4-18.
- 96. Chen T., Riaz S., Davey P., Zhao Z., Sun Y., et al. Producing fast and active Rubisco in tobacco to enhance photosynthesis. Plant Cell. 2023;35:795–807. doi: 10.1093/plcell/koac348.
- 97. Bennett H.M., Stephenson W., Rose C.M., Darmanis S. Single-cell proteomics enabled by next-generation sequencing or mass spectrometry. Nat Methods. 2023;20:363–374. doi: 10.1038/s41592-023-01791-5.
- 98. Messner C.B., Demichev V., Wang Z., Hartl J., Kustatscher G., et al. Mass spectrometry-based high-throughput proteomics and its role in biomedical studies and systems biology. Proteomics. 2023;23 doi: 10.1002/pmic.202200013.
- 99. Aggarwal S., Raj A., Kumar D., Dash D., Yadav A.K. False discovery rate: the Achilles' heel of proteogenomics. Brief Bioinform. 2022;23 doi: 10.1093/bib/bbac163.
- 100. Ebadi A., Freestone J., Noble W.S., Keich U. Bridging the false discovery gap. J Proteome Res. 2023;22:2172–2178. doi: 10.1021/acs.jproteome.3c00176.
- 101. Swaney D.L., Wenger C.D., Coon J.J. Value of using multiple proteases for large-scale mass spectrometry-based proteomics. J Proteome Res. 2010;9:1323–1329. doi: 10.1021/pr900863u.
- 102. Álvarez-Urdiola R., Riechmann J.L. The 'non-conventional' peptidome: a new layer on plant regulatory mechanisms. Plant Commun. 2025 doi: 10.1016/j.xplc.2025.101437.
- 103. Wacholder A., Carvunis A.R. Biological factors and statistical limitations prevent detection of most noncanonical proteins by mass spectrometry. PLoS Biol. 2023;21 doi: 10.1371/journal.pbio.3002409.
- 104. Ruiz-Orera J., Albà M.M. Conserved regions in long non-coding RNAs contain abundant translation and protein-RNA interaction signatures. NAR Genom Bioinform. 2019;1 doi: 10.1093/nargab/lqz002.
- 105.Zaheed O., Kiniry S.J., Baranov P.V., Dean K. Exploring evidence of non-coding RNA translation with Trips-Viz and GWIPS-Viz browsers. Front Cell Dev Biol. 2021;9 doi: 10.3389/fcell.2021.703374. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Hashimoto Y., Niikura T., Tajima H., Yasukawa T., Sudo H., et al. A rescue factor abolishing neuronal cell death by a wide spectrum of familial alzheimer's disease genes and abeta. Proc Natl Acad Sci USA. 2001;98:6336–6341. doi: 10.1073/pnas.101133498. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 107.Maximov V., Martynenko A., Hunsmann G., Tarantul V. Mitochondrial 16S rRNA gene encodes a functional peptide, a potential drug for alzheimer's disease and target for cancer therapy. Med Hypotheses. 2002;59:670–673. doi: 10.1016/s0306-9877(02)00223-2. [DOI] [PubMed] [Google Scholar]
- 108.Couzigou J.M., André O., Guillotin B., Alexandre M., Combier J.P. Use of microRNA-encoded peptide miPEP172c to stimulate nodulation in soybean. N Phytol. 2016;211:379–381. doi: 10.1111/nph.13991. [DOI] [PubMed] [Google Scholar]
- 109.Wang Y., Wang L., Zou Y., Chen L., Cai Z., et al. Soybean miR172c targets the repressive AP2 transcription factor NNC1 to activate ENOD40 expression and regulate nodule initiation. Plant Cell. 2014;26:4782–4801. doi: 10.1105/tpc.114.131607. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110.Yoshikawa M., Iki T., Numa H., Miyashita K., Meshi T., et al. A short open Reading frame encompassing the microRNA173 target site plays a role in trans-acting small interfering RNA biogenesis. Plant Physiol. 2016;171:359–368. doi: 10.1104/pp.16.00148. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111.Mello C.C., Conte D., Jr Revealing the world of RNA interference. Nature. 2004;431:338–342. doi: 10.1038/nature02872. [DOI] [PubMed] [Google Scholar]
- 112.Ormancey M., Guillotin B., Merret R., Camborde L., Duboé C., et al. Complementary peptides represent a credible alternative to agrochemicals by activating translation of targeted proteins. Nat Commun. 2023;14:254. doi: 10.1038/s41467-023-35951-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Baier M.C., Barsch A., Küster H., Hohnjec N. Antisense repression of the medicago truncatula nodule-enhanced sucrose synthase leads to a handicapped nitrogen fixation mirrored by specific alterations in the symbiotic transcriptome and metabolome. Plant Physiol. 2007;145:1600–1618. doi: 10.1104/pp.107.106955. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114.Baier M.C., Keck M., Gödde V., Niehaus K., Küster H., et al. Knockdown of the symbiotic sucrose synthase MtSucS1 affects arbuscule maturation and maintenance in mycorrhizal roots of medicago truncatula. Plant Physiol. 2010;152:1000–1014. doi: 10.1104/pp.109.149898. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115.Wang C., Wang H., Zhu H., Ji W., Hou Y., et al. Genome-wide identification and characterization of cytokinin oxidase/dehydrogenase family genes in medicago truncatula. J Plant Physiol. 2021;256 doi: 10.1016/j.jplph.2020.153308. [DOI] [PubMed] [Google Scholar]
- 116.Colditz F., Niehaus K., Krajinski F. Silencing of PR-10-like proteins in medicago truncatula results in an antagonistic induction of other PR proteins and in an increased tolerance upon infection with the oomycete Aphanomyces euteiches. Planta. 2007;226:57–71. doi: 10.1007/s00425-006-0466-y. [DOI] [PubMed] [Google Scholar]
- 117.Samac D.A., Peñuela S., Schnurr J.A., Hunt E.N., Foster-Hartnett D., et al. Expression of coordinately regulated defence response genes and analysis of their role in disease resistance in Medicago truncatula. Mol Plant Pathol. 2011;12:786–798. doi: 10.1111/j.1364-3703.2011.00712.x.
- 118.Djordjevic M.A., Mohd-Radzman N.A., Imin N. Small-peptide signals that control root nodule number, development, and symbiosis. J Exp Bot. 2015;66:5171–5181. doi: 10.1093/jxb/erv357.
- 119.Ito M., Tajima Y., Ogawa-Ohnishi M., Nishida H., Nosaki S., et al. IMA peptides regulate root nodulation and nitrogen homeostasis by providing iron according to internal nitrogen status. Nat Commun. 2024;15:733. doi: 10.1038/s41467-024-44865-4.
- 120.Kereszt A., Mergaert P., Montiel J., Endre G., Kondorosi É. Impact of plant peptides on symbiotic nodule development and functioning. Front Plant Sci. 2018;9:1026. doi: 10.3389/fpls.2018.01026.
- 121.Roy S., Müller L.M. A rulebook for peptide control of legume-microbe endosymbioses. Trends Plant Sci. 2022;27:870–889. doi: 10.1016/j.tplants.2022.02.002.
- 122.Roy S., Torres-Jerez I., Zhang S., Liu W., Schiessl K., et al. The peptide GOLVEN10 alters root development and noduletaxis in Medicago truncatula. Plant J. 2024. doi: 10.1111/tpj.16626.
- 123.Van de Velde W., Zehirov G., Szatmari A., Debreczeny M., Ishihara H., et al. Plant peptides govern terminal differentiation of bacteria in symbiosis. Science. 2010;327:1122–1126. doi: 10.1126/science.1184057.
- 124.Benedito V.A., Torres-Jerez I., Murray J.D., Andriankaja A., Allen S., et al. A gene expression atlas of the model legume Medicago truncatula. Plant J. 2008;55:504–513. doi: 10.1111/j.1365-313X.2008.03519.x.
- 125.Mergaert P., Kereszt A., Kondorosi E. Gene expression in nitrogen-fixing symbiotic nodule cells in Medicago truncatula and other nodulating plants. Plant Cell. 2020;32:42–68. doi: 10.1105/tpc.19.00494.
- 126.Sheokand S., Dahiya P., Vincent J., Brewin N. Modified expression of cysteine protease affects seed germination, vegetative growth and nodule development in transgenic lines of Medicago truncatula. Plant Sci. 2005;169:966–975. doi: 10.1016/j.plantsci.2005.07.003.
- 127.Liu S., Magne K., Zhou J., Laude J., Dalmais M., et al. The transcriptional co-regulators NBCL1 and NBCL2 redundantly coordinate aerial organ development and root nodule identity in legumes. J Exp Bot. 2023;74:194–213. doi: 10.1093/jxb/erac389.
- 128.Magne K., Couzigou J.M., Schiessl K., Liu S., George J., et al. MtNODULE ROOT1 and MtNODULE ROOT2 are essential for indeterminate nodule identity. Plant Physiol. 2018;178:295–316. doi: 10.1104/pp.18.00610.
- 129.Zhang J., Wang X., Han L., Zhang J., Xie Y., et al. The formation of stipule requires the coordinated actions of the legume orthologs of Arabidopsis BLADE-ON-PETIOLE and LEAFY. New Phytol. 2022;236:1512–1528. doi: 10.1111/nph.18445.
- 130.Chooi W.Y., Leiby K.R. The in vivo expression of pseudo ribosomal RNA genes in Drosophila melanogaster. Mol Gen Genet. 1981;182:245–251. doi: 10.1007/BF00269665.
- 131.Coelho P.S., Bryan A.C., Kumar A., Shadel G.S., Snyder M. A novel mitochondrial protein, Tar1p, is encoded on the antisense strand of the nuclear 25S rDNA. Genes Dev. 2002;16:2755–2760. doi: 10.1101/gad.1035002.
- 132.Divecha N., Charleston B. Cloning and characterisation of two new cDNAs encoding murine triple LIM domains. Gene. 1995;156:283–286. doi: 10.1016/0378-1119(95)00088-n.
- 133.Kermekchiev M., Ivanova L. Ribin, a protein encoded by a message complementary to rRNA, modulates ribosomal transcription and cell proliferation. Mol Cell Biol. 2001;21:8255–8263. doi: 10.1128/MCB.21.24.8255-8263.2001.
- 134.Scharf M.E., Wu-Scharf D., Zhou X., Pittendrigh B.R., Bennett G.W. Gene expression profiles among immature and adult reproductive castes of the termite Reticulitermes flavipes. Insect Mol Biol. 2005;14:31–44. doi: 10.1111/j.1365-2583.2004.00527.x.
- 135.Kong Q., Stockinger M.P., Chang Y., Tashiro H., Lin C.L. The presence of rRNA sequences in polyadenylated RNA and its potential functions. Biotechnol J. 2008;3:1041–1046. doi: 10.1002/biot.200800122.
- 136.Bose T., Fridkin G., Davidovich C., Krupkin M., Dinger N., et al. Origin of life: protoribosome forms peptide bonds and links RNA and protein dominated worlds. Nucleic Acids Res. 2022;50:1815–1828. doi: 10.1093/nar/gkac052.
- 137.Rödelsperger C., Prabh N., Sommer R.J. New gene origin and deep taxon phylogenomics: opportunities and challenges. Trends Genet. 2019;35:914–922. doi: 10.1016/j.tig.2019.08.007.
Data Availability Statement
Sequences of MS-supported chimeric models, their corresponding MS peptides and transcripts, genomic DNA, alignments, and other relevant features with detailed annotation are available in Geneious® format in the online version of this article. Chimeric protein model sequences used as MS search databases, certificates of analysis, and peptide/protein report files from MS searches are available as separate supplementary datasets in the online version of this article (Supplementary Datasets S32–S36). Corresponding files for the conservation evidence and MS validation of the altProts on which the chimeric protein models were based will be published separately in a paper focused on altProts. The pipelines developed for this study are available through two software articles: one for the modeling of chimeric peptides (MosaicProt, [1]) and one for the detection of alternative chimeric sources (ChiMSource, [2]). On 8 January 2025, our data were integrated into the M. truncatula genome portal MtrunA17r5.0-ANR as a separate track. In total, 156 MS-validated chimeric peptides were mapped to 246 primary and alternative source loci in the genome browser (non-RE alternative sources only). In addition, 805 MS-validated non-chimeric altProts were mapped to 833 genomic loci. The complete set of data related to this article is also available from the Zenodo digital repository (https://zenodo.org/records/17095391; DOI: https://doi.org/10.5281/zenodo.17095391).