eLife. 2022 Mar 28;11:e73980. doi: 10.7554/eLife.73980

Pervasive translation in Mycobacterium tuberculosis

Carol Smith 1, Jill G Canestrari 1, Archer J Wang 1, Matthew M Champion 2, Keith M Derbyshire 1,3, Todd A Gray 1,3, Joseph T Wade 1,3
Editors: Bavesh D Kana4, Bavesh D Kana5
PMCID: PMC9094748  PMID: 35343439

Abstract

Most bacterial ORFs are identified by automated prediction algorithms. However, these algorithms often fail to identify ORFs lacking canonical features such as a length of >50 codons or the presence of an upstream Shine-Dalgarno sequence. Here, we use ribosome profiling approaches to identify actively translated ORFs in Mycobacterium tuberculosis. Most of the ORFs we identify have not been previously described, indicating that the M. tuberculosis transcriptome is pervasively translated. The newly described ORFs are predominantly short, with many encoding proteins of ≤50 amino acids. Codon usage of the newly discovered ORFs suggests that most have not been subject to purifying selection, and hence are unlikely to contribute to cell fitness. Nevertheless, we identify 90 new ORFs (median length of 52 codons) that bear the hallmarks of purifying selection. Thus, our data suggest that pervasive translation of short ORFs in Mycobacterium tuberculosis serves as a rich source for the evolution of new functional proteins.

Research organism: None

eLife digest

How can you predict which proteins an organism can make? To answer this question, scientists often use computer programs that can scan the genetic information of a species for open reading frames – a type of DNA sequence that codes for a protein. However, very short genes and overlapping genes are often missed through these searches.

Mycobacteria are a group of bacteria that includes the species Mycobacterium tuberculosis, which causes tuberculosis. Previous work has predicted several thousand open reading frames for M. tuberculosis, but Smith et al. decided to use a different approach to determine whether there could be more. They focused on ribosomes, the cellular structures that assemble a specific protein by reading the instructions provided by the corresponding gene.

Examining the sections of genetic code that ribosomes were processing in M. tuberculosis uncovered hundreds of new open reading frames, most of which carried the instructions to make very short proteins. A closer look suggested that only 90 of these proteins were likely to have a useful role in the life of the bacteria, which could open new doors in tuberculosis research. The rest of the sequences showed no evidence of having evolved a useful job, yet they were still manufactured by the mycobacteria. This pervasive production could play a role in helping the bacteria adapt to quickly changing environments by evolving new, functional proteins.

Introduction

The canonical mode of bacterial translation initiation begins with the association of a 30S ribosomal subunit, initiator tRNA, and initiation factors, with the ribosome binding site of an mRNA (Laursen et al., 2005). Binding of the 30S initiation complex to the mRNA involves base-pairing interactions between the mRNA Shine-Dalgarno (S-D) sequence, located a short distance upstream of the start codon, and the anti-S-D sequence in the 16S ribosomal RNA (rRNA). Local mRNA secondary structure around the ribosome binding site can reduce interaction with the 30S initiation complex. Translation initiates at a start codon, typically an AUG; less frequently, translation initiation occurs at GUG or UUG, and in rare instances at AUC, AUU, and AUA start codons (Gvozdjak and Samanta, 2020; Hecht et al., 2017). Hence, the likelihood of translation initiation at a given sequence will depend on the sequence upstream of the start codon, the degree of secondary structure in the region surrounding the start codon, and start codon identity.

Due to the requirement for a 5’ untranslated region (UTR) that includes the S-D sequence, mRNAs translated using the canonical mechanism are referred to as ‘leadered’. By contrast, ‘leaderless’ translation initiation occurs on mRNAs that lack a 5’ UTR, such that the transcription start site (TSS) and translation start codon coincide. The mechanism of leaderless translation initiation is poorly understood. Until recently, there were few known examples of leaderless mRNAs; leaderless translation in the model bacterium Escherichia coli was shown to be rare and inefficient (Moll et al., 2002; Romero et al., 2014; Shell et al., 2015). However, recent studies indicate that leaderless translation initiation is a prevalent and robust mechanism in many bacterial and archaeal species (Beck and Moll, 2018). We and others showed that ~25% of all mRNAs in Mycobacterium smegmatis and Mycobacterium tuberculosis (Mtb) are leaderless (Cortes et al., 2013; Shell et al., 2015). Moreover, our data suggested that any RNA with a 5’ AUG or GUG will be efficiently translated using the leaderless mechanism in M. smegmatis (Shell et al., 2015).

Bacterial open reading frames (ORFs) are typically identified from genome sequences using automated prediction algorithms (Besemer and Borodovsky, 2005; Delcher et al., 2007; Hyatt et al., 2010). Among the criteria used by these algorithms are ORF length, and the presence of a S-D sequence. Hence, they often fail to identify non-canonical ORFs, including overlapping ORFs (Burge and Karlin, 1998), leaderless ORFs (Beck and Moll, 2018; Lomsadze et al., 2018), and short ORFs (sORFs; encoding small proteins of 50 or fewer amino acids; most algorithms have a lower size limit of 50 codons). Recent studies have revealed hundreds of sORFs in diverse bacterial species (Orr et al., 2020; Sberro et al., 2019; Storz et al., 2014; Stringer et al., 2021; VanOrsdel et al., 2018; Weaver et al., 2019). Some sORFs encode functional small proteins that contribute to cell fitness, whereas other sORFs function as cis-acting regulators. In eukaryotes, there have been reports of ‘pervasive translation’ of thousands of unannotated sORFs, likely due to the imperfect specificity of the translation machinery (Ingolia et al., 2014; Ruiz-Orera et al., 2018; Wacholder et al., 2021). The function, if any, of most of these sORFs and/or their encoded proteins is unclear, although they are rarely subject to purifying selection (Ruiz-Orera et al., 2018; Wacholder et al., 2021). Nonetheless, a high-throughput mutagenesis study of unannotated sORFs in human cells suggested that some contribute to cell fitness (Chen et al., 2020). Moreover, pervasively translated eukaryotic sORFs may function as ‘proto-genes’, that, over the course of evolution, can acquire a function promoting cell fitness, a process referred to as ‘de novo gene birth’ (Blevins et al., 2021; Carvunis et al., 2012; Ruiz-Orera et al., 2018; Vakirlis et al., 2018; Vakirlis et al., 2020).

Ribosome profiling (Ribo-seq) is a powerful experimental approach to identify the translated regions of mRNAs by mapping ribosome-protected RNA fragments (Ingolia et al., 2009). Ribo-RET is a modified form of Ribo-seq in which bacterial cells are treated with the antibiotic retapamulin before lysis; retapamulin traps bacterial ribosomes at sites of translation initiation, whereas elongating ribosomes are free to complete translation (Meydan et al., 2019). Thus, Ribo-RET facilitates the identification of overlapping ORFs by limiting the signal to the start codons (Meydan et al., 2018; Meydan et al., 2019). Ribo-RET was recently applied to E. coli, revealing start codons for many previously undescribed ORFs (Meydan et al., 2019; Stringer et al., 2021; Weaver et al., 2019), including sORFs, and ORFs positioned in frame with annotated ORFs, such that the translated protein is an isoform of the previously described protein. Here, we use a combination of Ribo-seq and Ribo-RET to map translated ORFs in Mtb. We detect thousands of robustly translated, previously undescribed sORFs from leaderless and leadered mRNAs. We also identify hundreds of ORFs that have start codons upstream or downstream of those for annotated genes, in the same reading frame. We conclude that the Mtb transcriptome is pervasively translated, with spurious translation initiation occurring at many sites. We also identify a subset of novel sORFs that appear to be under purifying selection, suggesting these ORFs, or the proteins they encode, contribute to cell fitness. Thus, our data suggest that pervasive translation of sORFs in Mtb serves as a rich source for the evolution of functional genes.

Results

Hundreds of actively translated sORFs from leaderless mRNAs

Mtb has a genome of 4,411,532 bp, with 3989 annotated protein-coding genes (RefSeq annotation). Two previous studies of Mtb identified 1285 transcription start sites (TSSs) for which the associated transcript begins with the sequence ‘RUG’ (R = A or G; Supplementary file 1A; Cortes et al., 2013; Shell et al., 2015), suggesting that these transcripts correspond to leaderless mRNAs (Shell et al., 2015). Of the 1285 TSSs associated with a 5’ RUG, 577 match the start codons of protein-coding genes included in the current genome annotation, as previously noted (Cortes et al., 2013; Shell et al., 2015). A further 338 of the RUG-associated TSSs correspond to putative ORFs whose start codons are unannotated, but whose stop codons match those of annotated genes; we refer to this architecture as ‘isoform’, since translation of these putative ORFs would generate N-terminally extended or truncated isoforms of annotated proteins. We note that some isoform ORFs likely reflect mis-annotations, as has been suggested previously (Cortes et al., 2013; Shell et al., 2015). Lastly, 370 of the 1285 RUG-associated TSSs correspond to putative ORFs whose start and stop codons do not match those of any annotated gene; we refer to these as putative ‘novel’ ORFs.
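The three-way classification of RUG-associated TSSs can be illustrated with a minimal sketch. It assumes a forward-strand DNA sequence, 0-based coordinates, and annotated ORFs supplied as (start, end) pairs; the function names and the toy inputs are hypothetical and do not come from the study's analysis pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_stop(seq, start):
    """Return the index just past the first in-frame stop codon, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i + 3
    return None

def classify_tss_orf(seq, tss, annotated):
    """Classify the putative leaderless ORF that begins at a TSS.

    annotated: set of (start, end) tuples for annotated ORFs (assumed input).
    """
    if seq[tss:tss + 3] not in ("ATG", "GTG"):  # 'RUG' written as DNA
        return "not RUG"
    end = find_stop(seq, tss)
    if end is None:
        return "no stop"
    if (tss, end) in annotated:
        return "annotated"   # start and stop both match an annotated gene
    if any(e == end for _, e in annotated):
        return "isoform"     # shared stop codon, unannotated start codon
    return "novel"           # neither start nor stop is annotated

# The three classes reported in the text account for all 1285 TSSs.
counts = {"annotated": 577, "isoform": 338, "novel": 370}
assert sum(counts.values()) == 1285
```

A real implementation would also need to handle the reverse strand and the circular genome, which this sketch omits.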

To determine whether the putative isoform and novel leaderless ORFs are actively translated, we performed Ribo-seq in Mtb. Note that all genome-scale data described in this manuscript can be viewed in our interactive genome browser (https://mtb.wadsworth.org/). We first assessed ribosome occupancy profiles for leadered ORFs that are present in the current genome annotation. Consistent with previous studies (Oh et al., 2011; Woolstenhulme et al., 2015), we observed enrichment of ribosome occupancy at start and stop codons of annotated, leadered ORFs; the 3’ ends of ribosome-protected RNA fragments are enriched 15 nt downstream of the start codons, and 12 nt downstream of stop codons (Figure 1A). We note that there are also smaller peaks and troughs of Ribo-seq signal precisely at start and stop codons, likely attributable to sequence biases associated with library preparation that are highlighted when groups of similar sequences (e.g. start/stop codons) are aligned (see Methods). We next assessed ribosome occupancy profiles for the 577 leaderless ORFs that are present in the current genome annotation. As expected, we observed an enrichment of ribosome-protected RNA fragments, with 3’ ends positioned 12 nt downstream of stop codons (Figure 1B), consistent with the profile observed for leadered ORFs. However, 3’ ends of ribosome-protected RNA fragments were not enriched 15 nt downstream of the start codons of the 577 annotated leaderless ORFs; rather, we observed enrichment spread across the region ~25–35 nt downstream of leaderless start codons (Figure 1B), suggesting either that ribosomes at leaderless ORF start codons behave differently to those at leadered ORF start codons, or that ribosome-protected fragments are too small to be represented in the RNA library; this observation is consistent with a previous study (Sawyer et al., 2021). Further confounding analysis of leaderless start codons, which are, by definition, aligned with TSSs, we consistently observed non-random Ribo-seq signals at TSSs of non-leaderless transcripts (Figure 1—figure supplement 1), albeit to a lesser extent than that observed for leaderless gene starts.
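A metagene profile of the kind used in this analysis can be sketched as follows. The sketch assumes per-nucleotide coverage of read 3’ ends stored in a list, forward-strand anchor positions (for example start codons), and a per-region normalization in which each window is scaled to sum to 1; the normalization choice and the function name are assumptions, not the procedure documented in the Methods.

```python
def metagene(coverage, anchors, upstream=50, downstream=100):
    """Average normalized 3'-end coverage around a set of anchor positions.

    coverage: list of per-nucleotide read 3'-end counts (assumed input).
    anchors:  0-based positions of start (or stop) codons.
    Returns a list of length upstream + downstream + 1.
    """
    width = upstream + downstream + 1
    profile = [0.0] * width
    used = 0
    for a in anchors:
        lo, hi = a - upstream, a + downstream
        if lo < 0 or hi >= len(coverage):
            continue  # skip windows that run off the ends of the sequence
        window = coverage[lo:hi + 1]
        total = sum(window)
        if total == 0:
            continue  # skip regions with no signal
        for i, value in enumerate(window):
            profile[i] += value / total
        used += 1
    return [p / used for p in profile] if used else profile

# Toy example: two "genes" whose footprints end 15 nt past the start codon.
cov = [0] * 400
cov[115] = 4
cov[265] = 2
prof = metagene(cov, [100, 250])
peak_offset = prof.index(max(prof)) - 50
```

In this toy example `peak_offset` is 15, mirroring the position at which leadered start codons show enrichment.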

Figure 1. Ribo-seq data support the translation of hundreds of isoform and novel ORFs from leaderless mRNAs.

(A) Metagene plot showing normalized Ribo-seq sequence read coverage for untreated cells in the regions around start (left graph) and stop codons (right graph) of previously annotated, leadered ORFs. Note that sequence read coverage is plotted only for the 3’ ends of reads, since these are consistently positioned relative to the ribosome P-site (Woolstenhulme et al., 2015). Data are shown for two biological replicate experiments. The schematics show the position of initiating/terminating ribosomes, highlighting the expected site of ribosome occupancy enrichment at the downstream edge of the ribosome. (B) Equivalent data to (A) but for putative annotated, leaderless ORFs. (C) Equivalent data to (A) but for putative novel, leaderless ORFs. (D) Equivalent data to (A) but for putative isoform, leaderless ORFs. Only data for start codons are shown because the same stop codon is used by both an annotated and isoform ORF.

Figure 1—figure supplement 1. Modest enrichment of Ribo-seq signal downstream of the transcription start sites (TSSs) of non-leaderless RNAs.

Metagene plot showing normalized Ribo-seq sequence read coverage (data indicate the position of ribosome footprint 3’ ends) in the region from –50 to +100 nt relative to the TSSs of RNAs that are not leaderless mRNAs.

We reasoned that if the putative leaderless isoform and novel ORFs are actively translated, they would exhibit similar ribosome occupancy profiles to the leaderless annotated ORFs. Indeed, this was the case, with similar relative occupancy of ribosomes undergoing translation initiation and termination at start/stop codons (Figure 1C–D; we did not analyze isoform ORF stop codons because they are shared with those of annotated ORFs). Thus, our data are consistent with active translation of the majority of the 370 putative novel ORFs as leaderless mRNAs. Strikingly, 268 of the leaderless novel ORFs are sORFs. We conclude that Mtb has hundreds of actively translated sORFs on leaderless mRNAs.

Ribo-RET identifies sites of translation initiation in Mtb

While there are likely >1000 leaderless mRNAs in Mtb, most mRNAs are leadered (Cortes et al., 2013; Sawyer et al., 2021; Shell et al., 2015). Given that our data support the existence of >300 novel ORFs translated from the 5’ ends of leaderless mRNAs, we speculated that there are many more unannotated ORFs translated from leadered initiation codons. While sites of leaderless translation initiation can be readily identified from TSS maps, identification of novel leadered ORFs is more challenging. Translated leadered ORFs generate signal in Ribo-seq datasets, but identification of novel ORFs from Ribo-seq data is confounded by (i) the potential for artifactual signal in 5’ UTRs due to the binding of RNA-binding proteins (Ji et al., 2016), and (ii) masking of signal by overlapping ORFs on the same strand. To circumvent these problems, we performed Ribo-RET with Mtb to specifically map sites of translation initiation. We aligned the ribosome-protected RNA fragment sequences to the Mtb genome to identify ‘Initiation-Enriched Ribosome Footprints’ (IERFs), sites of ribosome occupancy that exceed the local background (Supplementary file 1B). Specifically, IERFs correspond to genomic coordinates that have ribosome occupancy coverage that exceeds an arbitrarily defined threshold value (5.5 reads per million) and is at least 10-fold higher than the mean ribosome occupancy coverage in the region 50 nt upstream to 50 nt downstream. We hypothesized that most IERFs correspond to sites of translation initiation. In support of this idea, there is a strong enrichment of IERF 3’ ends 15 nt downstream of the start codons of annotated, leadered genes; this enrichment is substantially greater than that observed for Ribo-seq data from cells grown without retapamulin treatment (Figure 2A; Figure 2—figure supplement 1).
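The two IERF criteria stated above (coverage above 5.5 reads per million, and at least 10-fold above the mean coverage from 50 nt upstream to 50 nt downstream) can be written as a short sketch. It assumes coverage is already normalized to reads per million and stored as a per-nucleotide list for one strand, that the local mean includes the position itself, and that windows are simply truncated at the sequence ends; those details and the function name are assumptions made for illustration.

```python
def call_ierfs(coverage_rpm, min_rpm=5.5, fold=10.0, flank=50):
    """Return positions that satisfy both IERF criteria described in the text.

    coverage_rpm: per-nucleotide 3'-end coverage in reads per million.
    """
    n = len(coverage_rpm)
    ierfs = []
    for pos, value in enumerate(coverage_rpm):
        if value <= min_rpm:
            continue  # must exceed the absolute threshold
        lo = max(0, pos - flank)
        hi = min(n, pos + flank + 1)
        window = coverage_rpm[lo:hi]
        local_mean = sum(window) / len(window)
        if value >= fold * local_mean:  # at least 10-fold over local background
            ierfs.append(pos)
    return ierfs

# Toy coverage: one sharp peak, one sub-threshold peak, one broad plateau.
cov = [0.1] * 300
cov[100] = 50.0            # sharp, isolated peak
cov[200] = 5.0             # below the 5.5 RPM threshold
for i in range(250, 270):  # broad plateau raises its own local background
    cov[i] = 20.0
peaks = call_ierfs(cov)
```

Only the sharp peak at position 100 is returned: the plateau fails the fold-change test because its neighbors inflate the local mean, which is how the criterion favors focused initiation signal over broadly elevated coverage.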

Figure 2. Ribo-RET of M. tuberculosis identifies sites of translation initiation.

(A) Metagene plot showing normalized Ribo-seq and Ribo-RET sequence read coverage (single replicate for each; data indicate the position of ribosome footprint 3’ ends) in the region from –50 to +100 nt relative to the start codons of annotated, leadered ORFs. (B) Heatmap showing the enrichment of eight selected trinucleotide sequences, for regions upstream of IERFs, relative to control regions. Expected positions of start codons and S-D sequences are indicated below the heatmap.

Figure 2—figure supplement 1. Retapamulin treatment traps initiating ribosomes.

Metagene plot showing normalized Ribo-seq and Ribo-RET sequence read coverage (single replicate for each; data indicate the position of ribosome footprint 3’ ends) in the region from –50 to +100 nt relative to the start codons of annotated, leadered ORFs. Figure 2A shows data for the other replicate datasets.

Figure 2—figure supplement 2. Sequence bias associated with the 3’ ends of ribosome-protected RNAs at IERFs.

Logo showing sequence bias around the 3’ ends of Ribo-RET RNA fragments associated with IERFs. The cleavage site at the 3’ end of the aligned RNA fragments is indicated by a vertical dashed line.

We determined the abundance of all trinucleotide sequences in the 40 nt regions upstream of IERF 3’ ends; there is a >2-fold enrichment of ATG, GTG and TTG (likely start codons), but not CTG, ATT or ATC, 15 nt upstream of IERF 3’ ends, and an enrichment of AGG and GGA (components of a consensus AGGAGGU Shine-Dalgarno sequence) 22–31 nt upstream of IERF 3’ ends (Figure 2B). We also observed >1.5-fold enrichment of ATG and GTG 14, 16, 17, and 18 nt upstream of IERF 3’ ends. The enrichment and position of start codon and Shine-Dalgarno-like sequence features upstream of IERFs are consistent with IERFs marking sites of translation initiation. We observed a strong enrichment of A/T immediately 3’ of the IERFs, i.e. on the other side of the site cleaved by micrococcal nuclease (MNase) during the Ribo-RET procedure; ‘A’ was found most frequently (53% of IERFs), and ‘G’ found the least frequently (2% of IERFs; Figure 2—figure supplement 2). This sequence bias is likely not due to a biological phenomenon, but rather to the sequence preference of MNase, which is known to display sequence bias when cutting DNA (Dingwall et al., 1981) and RNA (Woolstenhulme et al., 2015). The sequence bias is apparent in the complete Ribo-RET libraries, with 74% of sequenced ribosome-protected fragments having an ‘A’ or ‘U’ 3’ of the upstream MNase site. Given that the genomic A/T content in Mtb is only 34%, it is likely that inefficient RNA processing by MNase led to an underrepresentation of some G/C-rich translation initiation sites in the Ribo-RET data, and may explain the extended footprints (>15 nt) in G/C-rich contexts (see Discussion). This sequence bias also likely favors cleavage precisely at exposed start codons, which are strongly enriched for A/T bases, creating more RNA library fragments that end in these sequences (e.g. enriched Ribo-seq signal precisely at start codons in Figure 2A).
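The positional trinucleotide enrichment described here can be sketched as a ratio of frequencies at IERFs and at control sites. The sketch assumes a forward-strand genome string, 0-based 3’-end coordinates, and that "N nt upstream" means the trinucleotide's first base lies N positions before the 3’ end; the coordinate convention, the choice of control sites, and the function names are all assumptions for illustration.

```python
def trinucleotide_frequency(genome, three_prime_ends, trinuc, offset):
    """Fraction of sites with `trinuc` starting `offset` nt upstream of the 3' end."""
    hits = 0
    total = 0
    for end in three_prime_ends:
        start = end - offset
        if start < 0 or start + 3 > len(genome):
            continue  # window falls outside the sequence
        total += 1
        if genome[start:start + 3] == trinuc:
            hits += 1
    return hits / total if total else 0.0

def enrichment(genome, ierf_ends, control_ends, trinuc, offset):
    """Frequency at IERFs divided by frequency at control sites."""
    observed = trinucleotide_frequency(genome, ierf_ends, trinuc, offset)
    expected = trinucleotide_frequency(genome, control_ends, trinuc, offset)
    return observed / expected if expected else float("inf")

# Toy genome with ATG at positions 5 and 38.
genome = "C" * 5 + "ATG" + "C" * 30 + "ATG" + "C" * 30
ratio = enrichment(genome, [20, 53], [20, 30], "ATG", 15)
```

Here both toy IERFs have an ATG 15 nt upstream but only one of the two control sites does, so `ratio` is 2.0. Repeating the calculation for every trinucleotide and every offset from 1 to 40 would give a heatmap-style table.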

Identification of putative ORFs from Ribo-RET data

A total of 1994 IERFs were found in both replicate experiments (Supplementary file 1B). 71% (1406) of these IERFs were associated with a potential ATG or GTG start codon 14–18 nt upstream of their 3’ ends, or a potential TTG start codon 15 nt upstream of their 3’ ends (Supplementary file 1C), a far higher proportion than that expected by chance (17%). Thus, these 1406 IERFs correspond to the start codons of putative ORFs, with an overall estimated false discovery rate (FDR) of 9% (see Materials and methods for details). 34% (478; FDR of 0.3%) of the putative ORFs precisely match previously annotated ORFs; 27% (373; FDR of 9%) overlap, and are in frame with, previously annotated ORFs (i.e. isoform ORFs); 39% (555; FDR of 15%) are novel ORFs, with no match to a previously annotated stop codon. A total of 112 novel ORFs were found entirely in regions presently designated as intergenic; the remaining novel ORFs overlap partly or completely with annotated genes in sense and/or antisense orientations (Figure 3A; Supplementary file 1C). Strikingly, 77% (430) of the novel ORFs we identified are sORFs, with 48 novel ORFs consisting of only a start and stop codon (Supplementary file 1C), an architecture recently described in E. coli (Meydan et al., 2019).
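The rule for associating an IERF with a start codon (ATG or GTG 14–18 nt upstream of the 3’ end, or TTG exactly 15 nt upstream) can be sketched as below. The sketch assumes a forward-strand genome string and 0-based coordinates, and it checks the 15 nt offset first when several candidates are present; that tie-breaking order and the function name are assumptions, since the text does not state how multiple candidates are resolved.

```python
def assign_start(genome, end3):
    """Return (start_index, codon) for the start codon linked to an IERF, or None.

    end3: 0-based position of the IERF 3' end on the forward strand.
    """
    for offset in (15, 14, 16, 17, 18):  # preference order is an assumption
        start = end3 - offset
        if start < 0:
            continue
        codon = genome[start:start + 3]
        if codon in ("ATG", "GTG"):
            return start, codon
        if codon == "TTG" and offset == 15:  # TTG is accepted only at 15 nt
            return start, codon
    return None

atg_genome = "C" * 10 + "ATG" + "C" * 30
ttg_genome = "C" * 10 + "TTG" + "C" * 30
```

With these toy sequences, `assign_start(atg_genome, 27)` returns `(10, "ATG")` because ATG is accepted at a 17 nt offset, whereas `assign_start(ttg_genome, 26)` returns `None` because TTG at 16 nt does not qualify.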

Figure 3. Features of higher-confidence ORFs identified by Ribo-RET.

(A) Distribution of different classes of ORFs identified by Ribo-RET. The pie-chart shows the proportion of identified ORFs in each class. Isoform ORFs are further classified based on whether they are longer (‘N-terminal extension’) or shorter (‘N-terminal truncation’) than the corresponding annotated ORF. Novel ORFs are further classified based on their overlap with annotated genes. ‘Sense’, ‘antisense’, and ‘mixed’ refer to whether the overlapping gene(s) is/are in the sense, antisense, or both (multiple overlapping genes) orientations with respect to the novel ORF. ‘Fully’ and ‘Partially’ indicate whether all or only some of the novel ORF overlaps annotated genes. (B) Strip plot showing the ΔG for the predicted minimum free energy structures for the regions from –40 to +20 nt relative to putative start codons for the different classes of ORF, and for a set of 500 random sequences. Median values are indicated by horizontal lines.

Figure 3—figure supplement 1. Enrichment of SD-like sequences upstream of higher-confidence ORFs identified by Ribo-RET.

Heatmap showing the enrichment of AGG and GGA trinucleotide sequences relative to control regions, for positions upstream of the start codons of annotated, isoform, and novel ORFs identified by Ribo-RET (higher-confidence set of ORFs).

We reasoned that if the isoform ORFs and novel ORFs are genuine, they should have S-D sequences upstream, and their start codons should each be associated with a region of reduced RNA secondary structure, as has been described for ORFs in other bacterial species (Baez et al., 2019; Del Campo et al., 2015). As we had observed for the set of all IERFs, regions upstream of isoform ORFs and novel ORFs are associated with an enrichment of AGG and GGA sequences in the expected location of a S-D sequence (Figure 3—figure supplement 1). This enrichment is lower than for annotated genes, but it is important to note that a S-D sequence was likely a contributing criterion in computationally predicting the initiation codons of annotated genes. We also assessed the level of RNA secondary structure upstream of all the putative ORFs identified by Ribo-RET. The predicted secondary structure for a set of random genomic sequences was significantly higher than the predicted secondary structure around the start of the identified annotated, novel, or isoform ORFs (Mann-Whitney U Test P < 2.2e–16 in all cases; Figure 3B). Moreover, the predicted secondary structure around the start of the annotated ORFs was only modestly, albeit significantly, higher than that of novel ORFs (Mann-Whitney U Test P = 1e–3). Collectively, the ORFs identified from Ribo-RET data exhibit the expected features of genuine translation initiation sites.

ORFs identified by Ribo-RET are actively translated in untreated cells

To determine if isoform ORFs and novel ORFs are actively and fully translated in cells not treated with retapamulin, we analyzed Ribo-seq data generated from cells grown without drug treatment. We assessed ribosome occupancy for annotated, novel, and isoform ORFs identified by Ribo-RET. As for the predicted leaderless ORFs, we reasoned that expressed leadered ORFs would be associated with increased ribosome occupancy at start and stop codons, as exemplified by previously annotated, leadered ORFs (Figure 1A; Oh et al., 2011; Woolstenhulme et al., 2015). Accordingly, annotated ORFs identified by Ribo-RET were strongly enriched for Ribo-seq signal 15 nt downstream of their start codons and 12 nt downstream of their stop codons (Figure 4A–B). We observed similar Ribo-seq enrichment profiles at the start and stop codons of novel ORFs, and downstream of the start codons of isoform ORFs (Figure 4A and C–D), but we did not observe these enrichment profiles for a set of mock ORFs (Figure 4—figure supplement 1A). Moreover, we did not observe enrichment of RNA-seq signal at start/stop codons, ruling out biases associated with library construction (Figure 4—figure supplement 1B-D). Overall, our data are consistent with most Ribo-RET-predicted isoform and novel ORFs being actively translated from start to stop codon, independent of retapamulin treatment.

Figure 4. Ribo-seq data support the translation of hundreds of isoform and novel ORFs identified by Ribo-RET.

(A) Ribo-seq and Ribo-RET sequence read coverage (read 3’ ends) across two genomic regions, showing examples of putative ORFs in the annotated (blue arrow), novel (orange arrow), and isoform (green arrow) categories. ORFs identified by Ribo-RET shown with a black outline. (B) Metagene plot showing normalized Ribo-seq sequence read coverage (data indicate the position of ribosome footprint 3’ ends) for untreated cells in the regions around start (left graph) and stop codons (right graph) of ORFs predicted from Ribo-RET profiles, that correspond to previously annotated genes. (C) Equivalent data to (B) but for putative novel ORFs identified from Ribo-RET data. (D) Equivalent data to (B) but for putative isoform ORFs identified from Ribo-RET data. Only data for start codons are shown because the same stop codon is used by both an annotated and isoform ORF.

Figure 4—figure supplement 1. Control analyses using mock ORFs or RNA-seq data.

(A) Metagene plot showing normalized Ribo-seq sequence read coverage (data indicate the position of RNA fragment 3’ ends) for untreated cells in the regions around start (left graph) and stop codons (right graph) of mock ORFs. (B) Metagene plot showing normalized RNA-seq sequence read coverage (read 3’ ends) for untreated cells in the regions around start (left graph) and stop codons (right graph) of annotated ORFs identified from Ribo-RET data. (C) Equivalent data to (B) but for putative novel ORFs identified from Ribo-RET data. (D) Equivalent data to (B) but for putative isoform ORFs identified from Ribo-RET data. Only data for start codons are shown because the same stop codon is used by both an annotated and isoform ORF.

Figure 4—figure supplement 2. Features of lower-confidence ORFs identified by Ribo-RET.

(A) Distribution of different classes of lower-confidence ORFs identified by Ribo-RET. (B) Heatmap showing the enrichment of AGG and GGA trinucleotide sequences relative to control regions, for positions upstream of the start codons of lower-confidence annotated, isoform, and novel ORFs identified by Ribo-RET. (C) Strip plot showing the ΔG for the predicted minimum free energy structures for the regions from –40 to +20 nt relative to start codons for the different classes of lower-confidence ORF, and for a set of 500 random sequences. Median values are indicated by horizontal lines. (D) Metagene plot showing normalized Ribo-seq sequence read coverage (data indicate the position of ribosome footprint 3’ ends) for untreated cells in the regions around start (left graph) and stop codons (right graph) of lower-confidence annotated ORFs identified from Ribo-RET data. (E) Equivalent data to (D) but for lower-confidence novel ORFs identified from Ribo-RET data. (F) Equivalent data to (D) but for lower-confidence isoform ORFs identified from Ribo-RET data. Only data for start codons are shown because the same stop codon is used by both an annotated and isoform ORF.

Identification of lower-confidence ORFs from Ribo-RET data

In addition to the 1994 IERFs present in both replicates of Ribo-RET data, 4216 IERFs were found in only the first replicate dataset, which was associated with a stronger enrichment of ribosome occupancy at start codons (compare Figure 2A and Figure 2—figure supplement 1). Strikingly, 2791 (66%) of IERFs found in only the first Ribo-RET dataset were associated with a potential start codon 14–18 nt upstream of their 3’ ends (Supplementary file 1C; see Materials and methods for details), a far higher proportion than that expected by chance (17%), and a similar proportion to that observed for IERFs found in both replicate Ribo-RET datasets (71%). We refer to ORFs identified from only the first Ribo-RET dataset as ‘lower-confidence’ ORFs, reflecting the marginally higher FDRs; we refer to ORFs identified from both Ribo-RET datasets as ‘higher-confidence’ ORFs. 22% (614; FDR of 0.6%) of the lower-confidence ORFs are annotated, 29% (801; FDR of 10%) are isoform, and 49% (1372; FDR of 16%) are novel. 77% (1061) of the novel lower-confidence ORFs are sORFs, with 120 consisting of only a start and stop codon (Figure 4—figure supplement 2A), mirroring the proportions observed in the higher-confidence dataset.

Regions upstream of lower-confidence annotated, novel, and isoform ORFs are associated with an enrichment of AGG and GGA sequences in the expected location of a Shine-Dalgarno sequence (Figure 4—figure supplement 2B). The predicted secondary structure for a set of random genomic sequences was significantly higher than the predicted secondary structure around the start of the lower-confidence annotated ORFs, novel ORFs, and isoform ORFs (Mann-Whitney U Test P < 2.2e–16 in all cases; Figure 4—figure supplement 2C). Moreover, the predicted secondary structure around the start of the lower-confidence annotated ORFs was not significantly higher than that of the lower-confidence novel ORFs (Mann-Whitney U Test P = 0.22). Lastly, we examined ribosome occupancy at the start and stop codons of the lower-confidence ORFs from our Ribo-seq data generated from cells grown without drug treatment. Lower-confidence annotated, novel, and isoform ORFs were strongly enriched for Ribo-seq signal 15 nt downstream of their start codons and 12 nt downstream of their stop codons (Figure 4—figure supplement 2D-F). Collectively, the lower-confidence ORFs exhibit the characteristics of actively translated regions.

Novel ORFs tend to be weakly transcribed but efficiently translated

To investigate how efficiently novel ORFs are expressed, we determined RNA levels from RNA-seq data, and ribosome occupancy levels from Ribo-seq data, for all annotated and novel ORFs detected in this study (leaderless and leadered ORFs). We also determined RNA and ribosome occupancy levels for putatively untranslated regions of 1854 control transcripts (see Materials and methods for details). For novel ORFs, we analyzed only the 871 ORFs for which ≥50 nt of the ORF is ≥30 nt from an annotated gene on the same strand, to avoid overlapping signal from other ORFs. As a group, novel ORFs have lower RNA levels and lower ribosome occupancy levels than the 1670 annotated ORFs (Figure 5A top panel; Figure 5—figure supplement 1A top panel; Figure 5—figure supplement 1B-C). By contrast, the non-coding control transcripts as a group have similar RNA levels to novel ORFs, but lower ribosome occupancy levels (Figure 5A, lower panels; Figure 5—figure supplement 1A lower panels; Figure 5—figure supplement 1B-C). To estimate the ribosome occupancy per transcript, we determined the ratio of Ribo-seq reads to RNA-seq reads for each region analyzed (Figure 5B; Supplementary file 1, tabs A + C). As a group, novel ORFs have only slightly lower ribosome occupancy per transcript than annotated ORFs, while both novel and annotated ORFs have markedly higher ribosome occupancy per transcript than the control non-coding transcripts. We conclude that the RNA level for novel ORFs tends to be lower than that for annotated ORFs, but novel ORFs are translated with similar efficiency to annotated ORFs, and are thus clearly distinct from non-coding transcripts. The overall lower expression of novel ORFs relative to annotated ORFs is also reflected by lower Ribo-RET occupancy at their start codons (Figure 5—figure supplement 2).
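The per-transcript ribosome occupancy used in this comparison is a simple ratio of Ribo-seq to RNA-seq coverage, which can be sketched as follows. The sketch assumes both coverages are expressed as RPM per nucleotide for the same region and that regions with no RNA-seq signal are excluded; the handling of zero coverage, the function names, and the toy numbers are assumptions and are not data from the study.

```python
from statistics import median

def ribosome_density(ribo_cov, rna_cov):
    """Ratio of Ribo-seq to RNA-seq coverage; None when there is no RNA signal."""
    if rna_cov <= 0:
        return None
    return ribo_cov / rna_cov

def median_density(regions):
    """Median density over (ribo_cov, rna_cov) pairs, ignoring undefined ratios."""
    values = [
        d for d in (ribosome_density(ribo, rna) for ribo, rna in regions)
        if d is not None
    ]
    return median(values) if values else None

# Hypothetical toy values (RPM per nt), chosen only to show the calculation.
annotated = [(2.0, 4.0), (1.0, 2.0), (3.0, 4.0)]
novel = [(1.0, 4.0), (0.5, 1.0), (1.0, 2.0)]
control = [(0.1, 1.0), (0.0, 2.0), (0.5, 0.0)]
summary = {
    "annotated": median_density(annotated),
    "novel": median_density(novel),
    "control": median_density(control),
}
```

Because the ratio divides out transcript abundance, a weakly transcribed ORF can still show a density comparable to that of an annotated ORF, which is the distinction the paragraph draws between RNA level and translation efficiency.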

Figure 5. Novel ORFs are efficiently translated.

(A) Pairwise comparison of normalized RNA-seq and Ribo-seq coverage for annotated, novel and non-coding control transcripts. Reads are plotted as RPM per nucleotide using a single replicate of each dataset for reads aligned to the reference genome at their 3’ ends. The categories compared are: (i) annotated ORFs (higher-confidence and lower-confidence ORFs detected by Ribo-RET, and leaderless ORFs; blue datapoints), (ii) novel ORFs (higher-confidence and lower-confidence ORFs detected by Ribo-RET and leaderless ORFs, for regions at least 30 nt from an annotated gene; orange datapoints), and (iii) a set of 1854 control transcript regions that are expected to be non-coding (see Materials and methods; purple datapoints). ORF/transcript sets are plotted in pairs to aid visualization. (B) Normalized ribosome density per transcript (ratio of Ribo-seq coverage to RNA-seq coverage) for the same sets of ORFs/transcripts. The graph shows the frequency (%) of ORFs/transcripts within each group for bins of 0.05 density units.

Figure 5—figure supplement 1. Novel and isoform ORFs are expressed at lower levels than annotated ORFs.

(A) Pairwise comparison of normalized RNA-seq and Ribo-seq coverage for annotated, novel, and non-coding control transcripts. Each data-point represents one transcript. Values are plotted as RPM per nucleotide using a single replicate of each dataset for reads aligned to the reference genome at their 3’ ends (c.f. Figure 5, which shows data for the other replicate for each dataset). The categories compared are: (i) annotated ORFs (higher-confidence and lower-confidence ORFs detected by Ribo-RET, and leaderless ORFs; blue datapoints), (ii) novel ORFs (higher-confidence and lower-confidence ORFs detected by Ribo-RET and leaderless ORFs, for regions at least 30 nt from an annotated gene; orange datapoints), and (iii) a set of 1854 control transcript regions that are expected to be non-coding (see Materials and methods; purple datapoints). ORF/transcript sets are plotted in pairs to aid visualization. (B) Cumulative frequency distributions of normalized RNA-seq coverage for annotated, novel, and non-coding control transcripts. Coverage values (x-axis) are the average of two replicate datasets. Values on the y-axis indicate the percentage of transcripts with coverage less than or equal to a given value on the x-axis. (C) Cumulative frequency distributions of normalized Ribo-seq coverage for annotated, novel, and non-coding control transcripts.
Figure 5—figure supplement 2. Novel and isoform ORF start codons have lower ribosome occupancy than annotated ORF start codons in Ribo-RET data.

(A) Strip plot showing the normalized sequencing read depth for a single Ribo-RET replicate dataset, at start codons of higher-confidence annotated, isoform, and novel ORFs identified by Ribo-RET. Median values are indicated by horizontal lines. (B) Equivalent to (A) but for a second replicate Ribo-RET dataset.

Validation of novel ORFs using mass spectrometry

Mass spectrometry (MS) provides a rigorous methodology to define the Mtb proteome. However, we predict that many of the small proteins we describe here are likely to be missed by MS because (i) there are biases against retaining small proteins in standard sample preparation methods, and (ii) small proteins generate few tryptic peptides. We hypothesized that we could enrich for small proteins by processing the normally discarded fractions from each of two standard preparations (Wisniewski et al., 2009). In total, we analyzed five samples prepared in different ways designed to enrich for small proteins (see Materials and methods). We also analyzed a sample made by in-solution digestion, which does not discard small proteins during final preparative stages (see Materials and methods). Nano-UHPLC-MS/MS on these samples identified proteins encoded by 44 of the putative leaderless and leadered novel ORFs identified in this study, at an estimated overall FDR of 1% (Tang et al., 2008). Novel proteins detected by MS are indicated in Supplementary file 1A, C. Eight proteins were detected in more than one preparation, or with independent peptide matches. Direct analysis from the mixed-organic extraction (with and without demethylation), and analysis of a minimally treated in-solution digestion, yielded the majority of the protein identifications. Ten of the proteins we detected are <50 amino acids in length, with the shortest being 23 amino acids long. The methods aimed at enriching for small proteins detected proteins of a smaller average size: the mean predicted length of novel proteins identified with small protein enrichment strategies was 60 amino acids, versus 86 amino acids for proteins identified from in-solution digestion. We anticipate that additional modifications in the enrichment protocols for small proteins will further improve the sensitivity of detection of small proteins.

Since many small proteins were only identified as single peptides by MS, we sought a direct approach to validate their detection. Three MS-detected novel small proteins were commercially synthesized, and their MS/MS spectra determined for empirical comparison to the native small protein. The three proteins were selected from high- (local FDR < 1%), and medium- (local FDR < 5%) search scores. Two of these proteins are translated from leaderless ORFs and one from a leadered ORF. For all three proteins, the numerical ions from the synthetic peptide matched those from the proteomic datasets, with conservation of the mass intensity (Figure 6). We conclude that all three proteins are translated as stable products that match the sequence expected based on Ribo-RET data.

Figure 6. Mass spectrometry validation of selected ORFs.

MS/MS spectra from novel ORFs measured with a synthetic peptide compared to spectra measured from the Mtb proteome. The genome coordinate and strand of each selected novel ORF start codon is indicated. (A) Leaderless ORF 1272167 (-) was identified from amino-acids 2–24. The y4 and parent m/z ions are off-scale. (B) Leaderless ORF 1242703 (+) was observed from amino acids 46–61. (C) Leadered ORF 4071711 (+) was observed from amino acids 4–26. The b3 ion is off-scale. Measured b-ions are in blue, and y-ions are in red. The nearly complete spectrum obtained for each peptide and the fragment-mass balance clearly indicate that these sORFs are identical to their synthetic cognates.

Figure 6—figure supplement 1. Validation of selected novel and isoform ORFs using luciferase reporter fusions.

Luciferase reporter assays for constructs consisting of the region from position –25 up to the Ribo-RET-predicted start codon fused translationally to a luciferase reporter gene, as illustrated in the schematic. Fusions were tested for 18 putative novel ORFs identified from Ribo-RET data, and three previously annotated ORFs that serve as positive controls. Wild-type and mutant start codon reporter construct pairs were separately integrated into the M. smegmatis chromosome to quantify the net contribution of translation from the predicted start codon. The genome coordinate and strand of each selected novel ORF start codon is indicated. Underlined coordinates indicate novel ORFs identified from a single Ribo-RET replicate dataset.
Figure 6—figure supplement 2. Validation of selected novel and isoform ORFs by western blot.

(A) Western blot with anti-FLAG antibody to detect FLAG-tagged novel ORFs integrated into the M. smegmatis chromosome with either an intact (wild-type) or mutated start codon. The integrated constructs included the entire 5’ UTR and open-reading frame (indicated by a dashed box), but not the native promoter. Bands corresponding to the tagged novel ORF are indicated with an orange arrow. Asterisks indicate the positions of common cross-reacting proteins. Novel ORF 4329885 (-) was identified from a single Ribo-RET replicate dataset. The positions of molecular weight marker bands are indicated. (B) Western blots with anti-FLAG antibody to detect FLAG-tagged isoform ORFs integrated into the M. smegmatis genome with either an intact (wild-type) or mutated start codon. The integrated constructs included the overlapping full-length, annotated ORF and its entire 5’ UTR. Bands corresponding to the tagged full-length and isoform ORFs are indicated with blue and green arrows, respectively. The western blot for the isoform ORF overlapping fadB5 was developed with a short (left panel) and a long (right panel) exposure due to the large difference in steady-state levels of the full-length and isoform proteins. Isoform ORF 4152736 (-) was identified from a single Ribo-RET replicate dataset.
Figure 6—figure supplement 2—source data 1. Images of full western blots are provided.
The zipped folder includes (i) individual files for each blot, and (ii) a summary file showing all blots, with boxes to show the regions used in Figure 6—figure supplement 2.

Validation of novel and isoform start codons using reporter gene fusions

We sought to validate selected novel and isoform ORFs. We hypothesized that the start codons identified by Ribo-RET would direct translation initiation in a reporter system that controls for extraneous contextual variables. We selected 18 novel predicted start codons that scored in the top quartile for ribosome occupancy in Ribo-RET datasets; in Ribo-seq profiles, the associated ribosome densities per transcript cover a broad range of values (median percentile rank of 37 for the eight ORFs that could be assessed). We tested these start codons by fusing them to a luciferase reporter, including 25 bp of upstream sequence for each ORF tested. We constructed equivalent reporter fusions with a single base substitution in the predicted start codon (RTG to RCG). For comparison, we included wild-type and start codon mutant luciferase reporter fusions for three annotated ORFs (icl1, sucC, and mmsA). The reporter plasmids were integrated into the chromosome of M. smegmatis. Luciferase expression from each of the 18 luciferase fusions, including those for five novel ORFs from our lower-confidence list, was significantly reduced by mutation of the start codon (Figure 6—figure supplement 1; p < 0.05 or 0.01, as indicated, one-way Student’s T-test). Mutation of the start codons reduced, but did not abolish, luciferase expression; this was true even for the three annotated ORFs. We speculate that translation can initiate at low levels from non-canonical start codons, as has been described for E. coli (Hecht et al., 2017). We note that our plasmid reporter system was designed to minimize extraneous variables between constructs that could confound initiation codon evaluation, which necessarily removed the candidate start codons from their larger native context. Overall, the luciferase reporter fusion data are consistent with active translation from the start codons identified by Ribo-RET.

Validation of novel and isoform ORFs using western blotting

To directly assess translation of selected putative ORFs, we generated constructs for two complete novel ORFs with 3 x FLAG tags fused at the encoded C-terminus. We generated equivalent constructs with a single base substitution in the putative start codon. The tagged constructs were integrated into the chromosome of M. smegmatis. The two proteins were detected by western blot, and they were not detected from cells with mutant start codons (Figure 6—figure supplement 2A). We generated equivalent 3 x FLAG-tagged strains for two isoform ORFs. We detected the overlapping, full-length protein by western blot, and expression of these full-length proteins was unaffected by mutation of the isoform ORF start codon (Figure 6—figure supplement 2B). We also detected a protein of smaller size, corresponding to the expected size of the isoform protein; expression of these small isoform proteins was not detected in the start codon mutant constructs (Figure 6—figure supplement 2B). Notably, for the pairs of novel and isoform proteins we detected by western blot, the two more highly expressed proteins were from the lower-confidence set of ORFs. Overall, these data support the ORF predictions from the Ribo-RET data, and the existence of novel and isoform ORFs identified from only a single replicate of Ribo-RET data.

Limited G/C-Skew in the codons of non-overlapping novel ORFs

The Mtb genome has a high G/C content (65.6%). There is a G/C bias within codons of annotated genes: the second position of codons is particularly constrained to encode specific amino acids, which supersedes the G/C bias of the genome, whereas the third (wobble) position has few such constraints. Hence, functional ORFs under purifying selection exhibit G/C content below the genome average at the second codon position and above the genome average at the third codon position (Bibb et al., 1984). We refer to the difference in G/C content at third positions and second positions of codons as ‘G/C-skew’, with positive G/C-skew expected for ORFs subject to purifying selection. We reasoned that we could exploit G/C-skew to assess the likelihood that novel ORFs identified by Ribo-RET have experienced purifying selection at the codon level. We assessed G/C skew for all 2299 novel ORFs identified in this study (leadered and leaderless). We limited the analysis to regions that do not overlap previously annotated genes, since G/C-skew could be impacted by selective pressure on an overlapping gene; 62% of ORFs were discarded because they completely overlap an annotated gene, and 17% of ORFs had some portion excluded. The set of all tested novel ORFs has modest, but significant, positive G/C-skew (Fisher’s exact test P < 2.2e–16; n = 19,750 codons; Figure 7; Supplementary file 1A, C ), consistent with a subset of codons in this class having been under purifying selection. However, the degree of positive G/C-skew for the novel ORFs is much smaller than that for the annotated ORFs we identified in our datasets (Figure 7), suggesting that the proportion of novel ORFs experiencing purifying selection, and/or the intensity of that selection, is much lower than that for the annotated ORF group. 
To identify specific novel ORFs that have likely experienced purifying selection of their codons, and hence are likely to contribute to cell fitness, we determined G/C-skew for the non-overlapping regions of each novel ORF individually. We then ranked the ORFs by the significance of their G/C-skew (Fisher’s exact test; see Materials and methods for details). Of the 103 ORFs with the most significant G/C-skew, there is a strong enrichment for positive G/C-skew: 90 of the ORFs have positive G/C skew and 13 have negative G/C skew. This suggests that ~80 of the 90 ORFs with positive G/C skew have been subject to purifying selection on their codons. It is important to note that the size of the ORF is a major consideration when determining the significance of G/C-skew; the small size of novel ORFs therefore limits this analysis. Moreover, the G/C-skew analysis provides no information on regions of novel ORFs that overlap annotated genes. Hence, the number of novel ORFs that we predict to be functional based on their G/C-skew is almost certainly a substantial underestimate. Nonetheless, the overall G/C-skew of novel ORFs relative to that of annotated ORFs provides strong evidence that the majority of novel ORFs are not functional.
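
To make the G/C-skew calculation concrete, the sketch below counts G/C at each codon position of an in-frame sequence, takes the difference between third and second positions, and applies a two-sided Fisher's exact test. It is an illustration only: the 2 × 2 table layout (codon position by G/C versus A/T) is an assumption about how the test could be set up rather than the authors' stated procedure, the function names are hypothetical, and the toy sequence is invented.

```python
from math import comb

def gc_counts_by_codon_position(orf_seq):
    """Return (gc_count, total) for codon positions 1, 2 and 3."""
    counts = [[0, 0], [0, 0], [0, 0]]
    usable = len(orf_seq) - len(orf_seq) % 3  # ignore a trailing partial codon
    for i, base in enumerate(orf_seq[:usable].upper()):
        counts[i % 3][1] += 1
        if base in "GC":
            counts[i % 3][0] += 1
    return [tuple(c) for c in counts]

def gc_skew(orf_seq):
    """G/C fraction at third positions minus G/C fraction at second positions."""
    _, (gc2, n2), (gc3, n3) = gc_counts_by_codon_position(orf_seq)
    return gc3 / n3 - gc2 / n2

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Sum all tables that are no more likely than the observed one.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))

# Toy in-frame sequence without its stop codon (hypothetical).
toy_orf = "ATGGCCGACGTCACCGGC"
_, (gc2, n2), (gc3, n3) = gc_counts_by_codon_position(toy_orf)
skew = gc_skew(toy_orf)
p_value = fisher_exact_two_sided(gc3, n3 - gc3, gc2, n2 - gc2)
print(skew, p_value)
```

The toy sequence has only six codons, so even a large skew gives a weak p-value, which illustrates why ORF length limits the power of this analysis for short ORFs.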

Figure 7. G/C skew within codons of novel and annotated ORFs.

Histogram showing the frequency of G/C nucleotides at each of the three codon positions for annotated ORFs or novel ORFs. Note that only regions of novel ORFs that do not overlap a previously annotated ORF were analyzed.

Discussion

Ribo-Seq identifies thousands of isoform and novel ORFs

We have identified thousands of actively translated novel and isoform ORFs with high confidence. This conclusion is strongly supported by the clear association of initiating and terminating ribosomes with the start and stop codons, respectively, in untreated cells. We note that the enrichment of terminating ribosomes at the stop codons of novel ORFs in Ribo-seq data (i.e. no retapamulin treatment) is independent of the methods used to identify the novel ORF start codons. The novel and isoform ORFs are also supported by validation of selected ORFs using multiple independent genetic and biochemical approaches. Overall, our data reveal a far greater number of ORFs than previously appreciated, with annotated ORFs outnumbered by isoform and novel ORFs. Many genomic regions encode overlapping ORFs on opposite strands or on the same strand in different frames, contrary to the textbook view of genome organization.

There are 3898 annotated Mtb ORFs, but the ORF discovery approaches applied here under-sampled these, identifying 1669. Failure to identify more annotated ORFs is likely due to the following biological and technical reasons: (i) Many genes are likely to be expressed at levels too low to be detected. In support of this idea, the median Ribo-seq read coverage for leadered, annotated ORFs identified by Ribo-RET was significantly higher than that for equivalent ORFs not identified by Ribo-RET (3.8-fold; Mann-Whitney U test P < 2.2e–16); (ii) Many ORF start codons are likely to be misannotated, so they would be classified as isoforms. (iii) The A/T sequence preference of MNase (Figure 2—figure supplement 2) likely led to exclusion of some ORFs from the Ribo-RET libraries. In support of this idea, the base at position +17 relative to the start codon (i.e. immediately downstream of the preferred MNase cleavage site) is 1.7-fold more likely to be ‘A’, and 1.6-fold less likely to be ‘G’, for annotated ORFs we identified, than for those we did not. Given the clear underrepresentation of annotated ORFs in our datasets, we conclude that there are many more isoform and novel ORFs to be discovered.

The abundance of novel start codons likely reflects pervasive translation

Evidence from other bacterial species suggests that the primary determinants of leadered translation initiation in Mtb are likely to be (i) a suitable start codon, (ii) an upstream sequence that can act as a S-D, and (iii) low local secondary structure around the ribosome-binding site. We detected enrichment of three different start codons in Ribo-RET data (Figure 2B), while S-D sequences can be located at a range of distances upstream of the start codon (Vellanoweth and Rabinowitz, 1992). Hence, there is limited sequence specificity associated with translation initiation. Moreover, a recent report showed that in E. coli, an S-D sequence is not an essential requirement for translation initiation (Saito et al., 2020). Leaderless translation initiation has even fewer sequence requirements; our data suggest that a 5’ AUG or GUG is sufficient for robust leaderless translation (Shell et al., 2015). While AUG and GUG represent only ~3% of all possible trinucleotide sequences, there is likely to be a strong bias towards 5’ AUG or GUG from the process of transcription initiation; the majority of TSSs in Mtb are purines, and the majority of +2 nucleotides are pyrimidines (Cortes et al., 2013; Shell et al., 2015). We propose that many Mtb transcripts are subject to spurious translation either by the leaderless or leadered mechanism, simply because the nominal sequence requirements for these processes commonly occur by chance. Thus, there is pervasive translation of the Mtb transcriptome, similar to the pervasive translation described in eukaryotes (Ingolia et al., 2014; Ruiz-Orera et al., 2018; Wacholder et al., 2021). Pervasive translation has been proposed as an explanation for some of the novel ORFs detected in E. coli by Ribo-RET (Meydan et al., 2019).
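
The ~3% figure for AUG and GUG follows from simple enumeration (2 of 64 trinucleotides). The sketch below reproduces that number and, as a purely illustrative extension, shows how the fraction changes if only trinucleotides with a purine at +1 and a pyrimidine at +2 are considered. It assumes all trinucleotides are equally likely, which is not true of real Mtb transcript 5' ends, so the second number indicates the direction of the bias rather than its size.

```python
from itertools import product

# All 64 possible RNA trinucleotides.
trinucleotides = ["".join(t) for t in product("ACGU", repeat=3)]
start_like = [t for t in trinucleotides if t in ("AUG", "GUG")]

# Fraction of all trinucleotides that are AUG or GUG (2 of 64).
fraction_all = len(start_like) / len(trinucleotides)

# Restrict to a purine at +1 and a pyrimidine at +2 (16 trinucleotides).
purine_pyrimidine = [t for t in trinucleotides if t[0] in "AG" and t[1] in "CU"]
fraction_restricted = len(start_like) / len(purine_pyrimidine)

print(f"{fraction_all:.3f} {fraction_restricted:.3f}")
```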

The process of pervasive translation is analogous to pervasive transcription, whereby many DNA sequences function as promoters, often from within genes, to drive transcription of spurious RNAs (Lybecker et al., 2014; Wade and Grainger, 2014). Indeed, there are many intragenic promoters in Mtb (Cortes et al., 2013; Shell et al., 2015), providing an additional source of potential spurious translation. We speculate that like spurious transcripts, which are rapidly degraded by RNases, the protein products of pervasive translation are rapidly degraded, as has been proposed for pervasively translated ORFs in E. coli (Stringer et al., 2021). Since Ribo-seq and Ribo-RET detect translation, not the protein product, the stability of the encoded proteins would not impact our ability to detect the corresponding ORFs.

Pervasive translation, by definition, means that ribosomes will spend some fraction of the time translating spurious ORFs. Although we detected many more novel ORFs than annotated ORFs, the total number of codons in all detected novel ORFs is ~20% that of annotated ORFs because of the smaller size of novel proteins. Moreover, novel ORFs as a group are expressed at substantially lower levels than annotated ORFs (Figure 5; Figure 5—figure supplements 1 and 2). Thus, it is likely that <10% of translation in Mtb at any given time is of spurious ORFs, so pervasive translation is unlikely to be overly detrimental to the cell.

Proto-genes and the evolution of new functional genes

Studies of eukaryotes indicate the existence of proto-genes, targets of pervasive translation of either intergenic sequences or sequences antisense to annotated genes (Ingolia et al., 2014; Ruiz-Orera et al., 2018; Wacholder et al., 2021). Proto-genes have the potential to evolve into functional ORFs that contribute to cell fitness (Blevins et al., 2021; Carvunis et al., 2012; Lu et al., 2017; Ruiz-Orera et al., 2018; Vakirlis et al., 2018; Vakirlis et al., 2020; Van Oss and Carvunis, 2019). There is also evidence that some bacterial protein-coding genes evolved from intergenic sequence (Yomtovian et al., 2010). Our data suggest that Mtb has a rich source of proto-genes. As described for proto-genes in yeast, the novel ORFs we identified in Mtb tend to be less well expressed, have less adapted codon usage, and are shorter than annotated genes (Blevins et al., 2021; Carvunis et al., 2012). Pervasive translation likely facilitates the evolution of new gene function in Mtb. Since pervasive translation represents a low proportion of all translation, the fitness cost of pervasive translation may be balanced by the benefits of having a large pool of proto-genes.

New functional ORFs/proteins in Mtb

The question of whether an ORF is functional first requires a definition of function (Keeling et al., 2019). Here, we define function as the ability to improve cell fitness. While functional ORFs need not be under purifying selection, ORFs undergoing purifying selection are presumably functional. One metric of purifying selection available in the G/C-rich genomes of mycobacteria is G/C-skew. Analysis of G/C-skew in the codons of novel ORFs identified 90 ORFs that are likely to be functional (positive G/C, p < 0.1 in Supplementary file 1A, C). 54 of these 90 novel ORFs are leadered, and the Ribo-RET signal associated with these 54 ORFs was significantly higher than that for the set of all other novel ORFs (Mann-Whitney U test P = 1.8e–5), consistent with the idea that functional ORFs are likely to be more highly expressed than non-functional ORFs (Carvunis et al., 2012; Vakirlis et al., 2020). Of the 90 ORFs that are likely functional based on their G/C-skew, 44 are ≤51 codons long. Thus, this single indicator of purifying selection has greatly expanded the set of likely functional small ORFs/proteins described for Mtb. There may be other constraints that additionally limit codon selection, especially for regulatory sORFs, such that functional sORFs lack positive G/C skew. Indeed, this is the case for a phylogenetically conserved set of cysteine-rich regulatory sORFs; cysteine codons that are likely to be essential for sORF regulatory function (Canestrari et al., 2020) also reduce the G/C-skew (Supplementary file 1D).

Analysis of codon usage for isoform ORFs is not informative due to their overlap with annotated ORFs. Some isoform ORFs are likely to represent mis-annotations of annotated ORFs. Multiple lines of evidence support this idea: (i) 19% (288) of isoform start codons are ≤10 codons from the corresponding annotated start codon (Supplementary file 1E); this was 3.4-fold more likely for leaderless isoform ORFs, presumably because they lack a S-D, which likely reduces the accuracy of start codon prediction by annotation pipelines. (ii) Leadered isoform ORFs that initiate within 10 codons of an annotated ORF have significantly higher Ribo-RET occupancy than other leadered isoform ORFs (Mann Whitney U Test P = 6.3e–13; Ribo-RET occupancy from a single replicate), and are significantly less likely to overlap an annotated gene whose start codon was identified by Ribo-RET (Fisher’s Exact Test P = 3e–4). Nonetheless, since most isoform ORFs start far from an annotated ORF start, we presume that most do not represent mis-annotations; indeed, for 43% (644) of the isoform ORFs, we also detected the start codon of the overlapping annotated ORF by Ribo-RET. While we expect many isoform ORFs to be a manifestation of pervasive translation, we speculate that some encode proteins with functions related to the function of the protein encoded by the overlapping, annotated gene, as has been proposed for isoform ORFs in E. coli (Meydan et al., 2019).

Conclusions

Our data suggest that the Mtb transcriptome is pervasively translated. The unprecedented extent of translation we observe suggests that much of the translation is biological ‘noise’, and that most of the translated ORFs are unlikely to be functional. As ribosome-profiling studies are extended to more diverse species, we anticipate a massive increase in the discovery of bacterial sORFs/small proteins. Future studies aimed at functional characterization of sORFs/small proteins will require prioritizing candidates with clear supporting evidence for function from codon usage patterns, phylogenetic conservation (Sberro et al., 2019), or genetic data.

Materials and methods

Key resources table.

Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information
Strain, strain background (Mycobacterium tuberculosis) | mc27000 | DOI: 10.1016/j.vaccine.2006.05.097 | | ΔpanCD ΔRD1
Strain, strain background (Mycobacterium smegmatis) | mc2155 | DOI: 10.1111/j.1365–2958.1990.tb02040.x | |
Antibody | Monoclonal anti-FLAG M2 antibody (Mouse monoclonal) | SIGMA | Catalog # F1804 | Used at (1:1,000) dilution for western blot
Recombinant DNA reagent | pRV1133C (plasmid) | This study | pRV1133C | Integrates at attP site; includes the metE promoter region
Recombinant DNA reagent | pGE450 (plasmid) | This study | pGE450 | Derivative of pRV1133C containing 3 x FLAG
Recombinant DNA reagent | pGE190 (plasmid) | This study | pGE190 | Derivative of pRV1133C containing the nLuc gene from pNL1.1 (Promega, cat no 1001)
Other | Micrococcal nuclease (S7) | SIGMA | Catalog # 10107921001 |
Other | Nano-Glo Luciferase Assay Reagent | Promega | Catalog # N1110 |
Chemical compound, drug | Retapamulin | SIGMA | Catalog # CDS023386 |
Software, algorithm | CLC Genomics Workbench | Qiagen | v8.5.1 | Alignment of sequence reads from .fastq files
Software, algorithm | RNAfold | DOI:10.1186/1748-7188-6-26 | v2.4.14 | ViennaRNA Package https://www.tbi.univie.ac.at/RNA/

Strains and plasmids

All oligonucleotides used in this study are listed in Supplementary file 1F. Ribo-seq and Ribo-RET experiments were performed using the M. tuberculosis strain mc27000 (Sambandamurthy et al., 2006). M. tuberculosis mc27000 cells were grown in 7H9 medium supplemented with 10% OADC (Oleic acid, Albumin, Dextrose, Catalase), 0.2% glycerol, 100 µg/ml pantothenic acid and 0.05% Tween80 at 37 °C, without shaking, to an OD600 of ~1.

We constructed a shuttle vector, pRV1133C, to allow integration of luciferase or FLAG-tag fusion constructs into the M. smegmatis mc2155 (Snapper et al., 1990) chromosome, with a constitutive promoter driving transcription. pRV1133C was derived from pMP399, retaining its oriE for episomal maintenance in E. coli, its integrase and attP site for integration at the L5 attB site in mycobacteria, and apramycin resistance (Consaul and Pavelka, 2004). The hsp60 promoter of pMP399 was replaced by the promoter of the M. tuberculosis Rv1133c (metE) gene (genome coordinates 1,261,811–1,261,712 from the minus strand of the M. tuberculosis genome, stopping one base pair upstream of the transcription start site; GenBank accession: AL123456.3). The criterion for selecting Rv1133c was its strong constitutive expression assessed by transcription start site metrics (Shell et al., 2015).

A luciferase (NanoLuc) gene amplified from pNL1.1 (Promega, cat. no 1001) was cloned downstream of the Rv1133c promoter to generate pGE190. To construct individual reporter fusion plasmids, the entire pGE190 plasmid was amplified by inverse PCR using Q5 High Fidelity DNA polymerase (NEB) with oligonucleotides TGD4006 and TGD5162. Sequences corresponding to the 25 bp upstream of, and including the start codons for selected ORFs were PCR-amplified using oligonucleotide pair TGD5163 and TGD5164, to amplify template oligonucleotides TGD5165-5173, TGD5175, TGD5178-5186, and TGD5795-5797. PCR products were cloned into the linearized pGE190 using the In-Fusion cloning system (Takara). The oligonucleotide templates had a ‘Y’ (mixed base ‘C’ or ‘T’) at the position corresponding to the central position of the start codon. Clones were sequenced to identify wild-type and mutant constructs, where the central position of the start codon was a ‘T’ or a ‘C’, respectively. Plasmid DNA was electroporated into M. smegmatis mc2155 for chromosomal integration before assaying luciferase activity.
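
The mixed-base design described above means that each template oligonucleotide yields both the wild-type (RTG) and mutant (RCG) start codon, and sequenced clones are sorted by the central base. The sketch below illustrates that logic only; the template sequence is invented, the function names are hypothetical, and it does not correspond to any of the TGD oligonucleotides.

```python
from itertools import product

# IUPAC codes needed here: Y is a C/T mix, R is an A/G mix.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "R": "AG"}

def expand_degenerate(oligo):
    """List every concrete sequence encoded by an oligo with mixed bases."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in oligo.upper()))]

def classify_clone(start_codon):
    """Call a sequenced clone from the central base of its start codon."""
    central = start_codon.upper()[1]
    if central == "T":
        return "wild-type"
    if central == "C":
        return "mutant"
    return "unexpected"

# Hypothetical template ending in a start codon with 'Y' at the central position.
template = "GGAGGACACCAYG"
variants = expand_degenerate(template)
calls = {v: classify_clone(v[-3:]) for v in variants}
print(calls)
```

A single template with one 'Y' expands to exactly two sequences, so each cloning reaction can yield both members of a wild-type/mutant pair.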

A 3 x FLAG-epitope-tag sequence was integrated into pRV1133C to generate pGE450. To construct individual FLAG-tagged constructs, the entire pGE450 plasmid was amplified by inverse PCR using Q5 High Fidelity DNA polymerase (NEB) with oligonucleotides TGD4981 and TGD4982. Sequences from the predicted transcription start site up to the stop codon for selected ORFs were PCR-amplified using oligonucleotide pairs TGD5208 and TGD5209, TGD5216 and TGD5217, TGD5241 and TGD5242, or TGD5247 and TGD5248. PCR products were cloned into pGE450 using the In-Fusion cloning system (Takara). Start codon mutant constructs were made by inverse PCR-amplification of the wild-type constructs using primers that introduce a start codon mutation (‘T’ to ‘C’ change at the central position of the start codon; oligonucleotides TGD5210, TGD5211, TGD5218, TGD5219, TGD5256, TGD5257, TGD5258, and TGD5259). PCR products were treated with DpnI and cloned using the In-Fusion cloning system (Takara). Following sequence confirmation, plasmid DNA was electroporated into M. smegmatis mc2155 for chromosomal integration before performing expression analysis by western blot.

Ribo-seq without drug treatment

Ten ml of M. tuberculosis (OD600 of 0.4) was used to inoculate 400 ml of medium and grown to an OD600 of 1 (2–3 weeks). Cells were collected by filtration through a 0.22 μm filter. Libraries were prepared for sequencing, and sequencing data were processed as described previously for M. smegmatis (Shell et al., 2015).

RNA-Seq

Cell extracts were prepared in parallel to those used for Ribo-seq. RNA was extracted using acid phenol and chloroform followed by isopropanol precipitation. Ribosomal RNA was removed using the Ribo-Zero Magnetic Kit (Epicentre). RNA fragmentation, library preparation, sequencing, and data processing were performed as described previously for M. smegmatis (Shell et al., 2015).

Ribosome profiling with retapamulin treatment (Ribo-RET)

Ten ml of M. tuberculosis (OD600 of 0.4) was used to inoculate 400 ml of medium and grown to an OD600 of 1 (2–3 weeks). Cells were treated with retapamulin (Sigma CDS023386) at a final concentration of 0.125 mg/ml for 15 min at room temperature, with occasional manual shaking, and collected by filtration through a 0.22 μm filter. Cells were flash frozen in liquid nitrogen with 0.7 ml lysis buffer (20 mM Tris pH 8.0, 10 mM MgCl2, 100 mM NH4Cl, 5 mM CaCl2, 0.4% Triton X-100, 0.1 % NP-40, 1 mM chloramphenicol, 100 U/mL DNase I). Frozen cells were milled using a Retsch MM400 mixer mill for 8 cycles of 3 min each at 15 Hz. Milling cups were re-frozen in liquid nitrogen in between each milling cycle. Cell extracts were thawed and incubated on ice for 30 min. Samples were clarified by centrifugation. Supernatants were passed twice through 0.22 μm filters. 1 mg aliquots of cell extracts were flash-frozen in liquid nitrogen. Monosomes were isolated by digesting 1 mg of cell extract with 1,500 units of micrococcal nuclease for 1 hr at room temperature on a rotisserie rotator. The reaction was quenched by adding 2 μl 0.5 M EGTA, after which the digest was fractionated through a 10–50% sucrose gradient. Fractions from the sucrose gradients were electrophoresed on a 1% agarose gel with 1% bleach to identify ribosomal RNA peaks. Those fractions were selected, pooled, and monosomes isolated by acid phenol:chloroform extraction and isopropanol precipitation.

Libraries for sequencing were prepared using a previously described method (Ingolia, 2010). RNA from monosomes was run on a 15% denaturing gel alongside a 31 nt RNA oligonucleotide to size-select 31 ± 5 nt fragments. Samples were gel-extracted in 500 μl RNA gel extraction buffer (300 mM NaOAc (pH 5.5), 1 mM EDTA, 0.1 U/mL SUPERase-In RNase inhibitor) followed by isopropanol precipitation. The samples were dephosphorylated by incubating with 10 U of T4 Polynucleotide Kinase (NEB) for 1 hr at 37 °C, before extraction with phenol:chloroform:isoamyl alcohol, and ethanol precipitation. The dephosphorylated RNAs were ligated to the 3’ linker oligonucleotide JW9371 using T4 RNA Ligase 2 (truncated, K227Q) at a 1:4 RNA:linker ratio. The ligation reactions were incubated for 3 hr at 37 °C, followed by 20 min at 65 °C. The reactions were separated on a 15% polyacrylamide denaturing gel alongside a control RNA oligonucleotide (JW9370) of the expected size of the ligated product. The RNA-ligation products were excised and extracted in 500 μL RNA extraction buffer and concentrated by ethanol precipitation. Reverse transcription was performed on the RNA samples using Superscript III (Life Technologies) and oligo JW8875, as described previously (Ingolia, 2010). The reactions were separated through a 10% polyacrylamide denaturing gel and cDNAs excised and extracted in 500 μL DNA extraction buffer (300 mM NaCl, 10 mM Tris-Cl (pH 8), 1 mM EDTA). Reverse-transcribed cDNA was circularized using CircLigase, and PCR-amplified as described previously (Ingolia, 2010). Between 4 and 9 cycles of PCR were performed using Phusion High Fidelity DNA Polymerase, JW8835 as the standard forward primer, and JW3249, 3250, 8876 or 8877, corresponding to Illumina index numbers 1, 2, 34, or 39 respectively, as the reverse primer. Samples were separated through an 8% polyacrylamide gel. 
DNAs of the appropriate length (longer than the control adapter band) were excised from the gel and extracted in 500 μL of DNA extraction buffer. DNAs were concentrated by isopropanol precipitation. Samples were quantified and subject to DNA sequence analysis on a NextSeq instrument.

Inferring ORF positions from Ribo-RET data

Sequencing reads from Ribo-RET datasets were trimmed to remove adapter sequences using a custom Python script that trimmed reads up to the first instance of CTGTAGGCACC, keeping trimmed reads in the length range 20–44 nt. Trimmed sequence reads were aligned to the reference genome and separately to a reverse-complemented copy of the reference genome, using Rockhopper (McClure et al., 2013). The positions of read 3’ ends were determined from the resultant .sam files, and used to determine coverage on each strand at each genome position. Coverage values were set to 0 for regions encompassing all annotated non-coding genes, and the 50 nt regions downstream of the 1285 TSSs associated with an RUG (i.e. the first 50 nt of all predicted leaderless ORFs) (Cortes et al., 2013; Shell et al., 2015). Read counts were then normalized to total read count as reads per million (RPM).
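The adapter-trimming step described above can be illustrated with a minimal Python sketch. This is not the authors' script: the function name `trim_read` is hypothetical, and the handling of reads that contain no adapter match (discarded here) is an assumption, since the text does not specify it.

```python
ADAPTER = "CTGTAGGCACC"      # adapter sequence given in the text
MIN_LEN, MAX_LEN = 20, 44    # retained read-length range (nt), from the text


def trim_read(read, adapter=ADAPTER):
    """Return the read trimmed at the first adapter instance, or None.

    Assumption: reads with no adapter match, or whose trimmed length
    falls outside 20-44 nt, are discarded (returned as None).
    """
    idx = read.find(adapter)          # first instance of the adapter
    if idx == -1:
        return None
    trimmed = read[:idx]              # keep only the sequence before it
    if MIN_LEN <= len(trimmed) <= MAX_LEN:
        return trimmed
    return None
```

For example, a 28 nt insert followed by the adapter is returned unchanged, whereas a 4 nt insert is rejected by the length filter.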

Every genome coordinate on each strand was considered as a possible IERF. To be selected as an IERF, a position required a minimum of 5.5 RPM coverage (equivalent to 20 sequence reads in the first replicate dataset), with at least 10-fold higher coverage than the average coverage in the 101 nt region centered on the coordinate being considered, and equal or higher coverage than every position in the 21 nt region centered on the coordinate being considered. For high-confidence ORF calls, all criteria had to be met in both replicate datasets. IERFs were inferred to represent an ORF if the IERF position was 15 nt downstream of a TTG, 14–18 nt downstream of an ATG, or 14–18 nt downstream of a GTG. These trinucleotide sequences and distances were selected based on a >1.4-fold enrichment upstream of IERFs (Figure 2B). In a small number of cases, two IERFs were associated with the same start codon; this only occurred in cases where the two IERFs had identical Ribo-RET sequence coverage. This double-matching means that the number of IERFs is slightly higher than the number of identified ORFs.
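The three coverage criteria for selecting an IERF can be sketched as follows for a single strand and a single replicate. The function name `call_ierfs` is hypothetical, and two details are assumptions not stated in the text: windows are truncated at the ends of the coverage array, and the 101 nt average includes the central position. The subsequent matching of IERFs to start codons is not shown.

```python
def call_ierfs(cov, min_rpm=5.5, fold=10.0, wide=101, narrow=21):
    """Return 0-based indices of candidate IERFs on one strand.

    cov: list of RPM coverage values (read 3' ends), one per position.
    Assumptions: windows are truncated at the array ends, and the
    101 nt average includes the centre position itself.
    """
    hits = []
    n = len(cov)
    half_wide, half_narrow = wide // 2, narrow // 2
    for i, c in enumerate(cov):
        if c < min_rpm:                       # criterion 1: >= 5.5 RPM
            continue
        window = cov[max(0, i - half_wide): min(n, i + half_wide + 1)]
        mean = sum(window) / len(window)
        if c < fold * mean:                   # criterion 2: >= 10x local mean
            continue
        local = cov[max(0, i - half_narrow): min(n, i + half_narrow + 1)]
        if c >= max(local):                   # criterion 3: local maximum
            hits.append(i)
    return hits
```

For high-confidence calls, the same function would be applied to each replicate and only positions returned for both would be retained.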

Calculating false discovery rates for ORF prediction from Ribo-RET data

The likelihood of randomly selecting a genome coordinate with an associated start codon sequence (as defined above for IERFs) was estimated by selecting 100,000 random genome coordinates and determining the fraction, ‘R’, that would be associated with a start codon. The set of IERFs contains a number of true positives (i.e. corresponding to a genuine start codon), and a number of false positives. We assume that true positive IERFs are all associated with a start codon using the parameters described above for calling ORFs. We assume that false positive IERFs are associated with a start codon at the same frequency as random genome coordinates, that is R. Since we know how many IERFs were not associated with a start codon, we can use this number to estimate how many false positive IERFs were associated with a start codon by chance. With the total number of IERFs as ‘I’ and the total number of identified ORFs as ‘O’, the FDR for ORF calls is estimated by:

$$\mathrm{FDR}\,(\%) = \frac{100 \times (I - O) \times \dfrac{R}{1 - R}}{O}$$
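The reasoning behind this estimate can be made explicit. Under the stated assumptions, the I − O IERFs that lack a start codon are all false positives and represent a fraction 1 − R of all false positives; the total number of false positives is therefore (I − O)/(1 − R), of which a fraction R, that is (I − O) × R/(1 − R), match a start codon by chance. A minimal Python sketch of the calculation follows; the function name and the numbers in the usage note are hypothetical and are not values from this study.

```python
def estimate_fdr_percent(n_ierfs, n_orfs, r):
    """Estimate the FDR (%) for ORF calls.

    n_ierfs: total number of IERFs (I)
    n_orfs:  total number of identified ORFs (O)
    r:       fraction of random coordinates matching a start codon (R)
    """
    if not 0 <= r < 1:
        raise ValueError("r must be in the range [0, 1)")
    if n_orfs <= 0:
        raise ValueError("n_orfs must be positive")
    unmatched = n_ierfs - n_orfs                  # IERFs with no start codon
    chance_matches = unmatched * r / (1 - r)      # false positives that matched
    return 100 * chance_matches / n_orfs
```

With illustrative values I = 1000, O = 800, and R = 0.2, the 200 unmatched IERFs imply 50 chance matches, giving an estimated FDR of 6.25%.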

To estimate the distribution of false positive IERFs between annotated, isoform, and novel ORFs, we determined the relative proportion of each class of ORF from the set of randomly selected genome coordinates that were associated with a start codon by chance.

Selection of mock ORFs

As a control for potential artifacts of DNA sequence on Ribo-seq coverage we selected 1,000 mock ORFs: sequences that begin at an ATG or GTG and extend to the first in-frame stop codon. Mock ORF stop codons do not match those of previously annotated genes or novel genes identified from Ribo-RET data. To ensure that mock ORFs are in transcribed regions, we required non-zero RNA-seq coverage at the first position of each mock ORF. For simplicity, mock ORFs were only selected on the forward strand of the genome.
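The core definition of a mock ORF (an ATG or GTG extended to the first in-frame stop codon) can be sketched as below. The function name `orf_from` is hypothetical, the sequence is assumed to be uppercase forward-strand DNA, and the additional filters described above (stop codons not shared with annotated or novel ORFs, non-zero RNA-seq coverage at the first position) are not implemented.

```python
STARTS = {"ATG", "GTG"}          # start codons used for mock ORFs in the text
STOPS = {"TAA", "TAG", "TGA"}    # standard stop codons


def orf_from(seq, start):
    """Return (start, end) of the ORF beginning at `start`, or None.

    Coordinates are 0-based and half-open; `end` is one past the stop
    codon. None is returned if `start` is not an ATG/GTG or if no
    in-frame stop codon is found.
    """
    if seq[start:start + 3] not in STARTS:
        return None
    # Walk codon by codon in the frame set by the start codon.
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return (start, i + 3)
    return None
```

In the sequence CCATGAAACCCTAGGG, the ATG at index 2 gives the ORF ATG AAA CCC TAG; the out-of-frame TGA that overlaps the start codon is ignored.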

RNA folding prediction

Sequences from –40 to +20 relative to each start codon, and 500 randomly selected 60 nt sequences from the M. tuberculosis genome, were folded using a local installation of RNAfold from the ViennaRNA Package, version 2.4.14, with default settings (Lorenz et al., 2011). The free energy of the predicted minimum free energy structure was recorded for each sequence.

Determining normalized sequence read coverage from Ribo-Seq and RNA-Seq data

Library construction for Ribo-seq and RNA-seq included polyadenylation of RNA fragments, and sequence reads were trimmed at their 3’ ends, immediately upstream of the first instance of ‘AAA’, before aligning to the reference genome; hence, it is impossible for a trimmed sequence read to end with an ‘A’. This likely explains why we observed apparent differences in ribosome occupancy in Ribo-seq data precisely at start and stop codons for all classes of ORF (e.g. Figure 1), since these codons are strongly enriched for specific bases. We note that the same patterns were observed for RNA-seq data (Figure 4—figure supplement 1B-D; RNA-seq library construction included a polyadenylation step, and reads were trimmed and mapped identically to those from Ribo-seq datasets), and for mock ORFs in Ribo-seq data (Figure 4—figure supplement 1A).
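A short sketch shows why a trimmed read cannot end with an ‘A’. The function name `trim_polya` is hypothetical, and leaving reads without any ‘AAA’ unchanged is an assumption; the text does not say how such reads were handled.

```python
def trim_polya(read):
    """Trim a read immediately upstream of the first 'AAA'.

    Assumption: reads containing no 'AAA' are returned unchanged.
    """
    idx = read.find("AAA")
    return read if idx == -1 else read[:idx]
```

For a fragment ending in a genuine A, such as ACGTA followed by a poly(A) tail, the first ‘AAA’ begins at that genuine A, so the trimmed read is ACGT and the true 3’ end is shifted by one position. More generally, if the base before the first ‘AAA’ were an A, an earlier ‘AAA’ would exist, which is a contradiction.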

Sequence reads were aligned to the reference genome (NCBI Reference Sequence: NC_000962.3) and separately to a reverse-complemented copy of the reference genome, using Rockhopper, version 2.0.3 (McClure et al., 2013). The positions of read 3’ ends were determined from the resultant .sam files, and used to determine coverage on each strand at each genome position, normalized to total read count as reads per million (RPM).
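The conversion of read 3’ end positions into strand-specific RPM coverage can be sketched as follows. The function name `three_prime_rpm` and its inputs are hypothetical; parsing of the .sam files is not shown, and taking the total read count over both strands is an assumption.

```python
def three_prime_rpm(plus_ends, minus_ends, genome_length):
    """Return per-position RPM coverage of read 3' ends for each strand.

    plus_ends, minus_ends: lists of 0-based 3'-end positions of aligned
    reads on the forward and reverse strands.
    Assumption: the total read count is taken over both strands.
    """
    total = len(plus_ends) + len(minus_ends)
    if total == 0:
        raise ValueError("no aligned reads")
    scale = 1e6 / total                     # one read, in reads per million
    cov = {"+": [0.0] * genome_length, "-": [0.0] * genome_length}
    for strand, ends in (("+", plus_ends), ("-", minus_ends)):
        for pos in ends:
            cov[strand][pos] += scale
    return cov
```

By construction, the coverage values summed over both strands equal one million.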

Generating metagene plots

Metagene plots (i.e. Figure 1A–D, Figure 1—figure supplement 1, Figure 2A, Figure 2—figure supplement 1, Figure 4B–D, Figure 4—figure supplement 1A-D; Figure 4—figure supplement 2D-F) used normalized coverage values (RPM) for Ribo-seq, RNA-seq, or Ribo-RET data, calculated as described above. Coverage scores were selected for regions from –50 to +100 relative to start codons or TSSs, or –100 to +50 relative to stop codons. Coverage RPM values were further normalized to the highest value in the selected range. For metagene plots of leadered, previously annotated ORFs (Figures 1A and 2A, Figure 2—figure supplement 1), previously annotated genes were excluded if they were pseudogenes, non-coding, or had a TSS within 5 nt upstream of the start codon. For the metagene plot of TSSs not associated with leaderless ORFs, TSSs were selected from published reports (Cortes et al., 2013; Shell et al., 2015) if they were located at least 6 nt upstream of a previously annotated start codon.
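A metagene profile of this kind can be sketched for forward-strand features as below. The function name `metagene` is hypothetical. Two choices are assumptions rather than details taken from the text: coverage is summed across features before being normalized to the highest value in the window, and windows that run off the ends of the coverage array are skipped.

```python
def metagene(cov, anchors, upstream=50, downstream=100):
    """Aggregate coverage around anchor positions (e.g. start codons).

    cov: per-position RPM coverage for the forward strand.
    anchors: 0-based positions of start codons or TSSs.
    Assumptions: coverage is summed across features and then scaled to
    the highest value; windows extending past the array are skipped.
    """
    width = upstream + downstream + 1
    profile = [0.0] * width
    for a in anchors:
        lo, hi = a - upstream, a + downstream + 1
        if lo < 0 or hi > len(cov):
            continue
        for k in range(width):
            profile[k] += cov[lo + k]
    peak = max(profile)
    return [v / peak for v in profile] if peak > 0 else profile
```

For stop codons, the same function would be called with `upstream=100` and `downstream=50`.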

Calculating relative ribosome density per transcript for ORFs and transcript regions

We selected three sets of genomic regions: (i) all annotated ORFs identified either by Ribo-RET (higher confidence and lower confidence) or from leaderless analysis, (ii) all novel ORFs identified either by Ribo-RET (higher confidence and lower confidence) or from leaderless analysis, and (iii) a set of 1854 control transcript regions, described below. For (ii), we removed regions of ORFs that are not at least 30 nt from an annotated gene on the same strand; in many cases this led to exclusion of the ORF or trimming one or both ends of the region to be analyzed. We also excluded any remaining ORF or ORF region <50 nt in length. A set of control transcript regions, intended to comprise mostly non-coding RNA, was selected by identifying transcription start sites (Cortes et al., 2013; Shell et al., 2015) > 5 nt upstream of an RTG trinucleotide sequence. We then selected the first 50 nt of the associated transcribed regions. These control regions were excluded if they are not at least 30 nt from an annotated gene on the same strand, or if they overlap partially or completely a novel or isoform ORF identified in this study.

For each category of region, (i), (ii), and (iii), described above, we calculated the normalized sequence read coverage (RPM) from two replicates each of RNA-seq and Ribo-seq data, aligning only the sequence read 3’ ends (see section titled ‘Determining normalized sequence read coverage from Ribo-Seq and RNA-Seq data’). We excluded 7 of the regions in category (iii) that had zero RNA-seq coverage in both replicates. Data in Figure 5A and Figure 5—figure supplement 1 show the sequence read coverage normalized to the length of each region analyzed. To calculate the relative ribosome occupancy per transcript (Figure 5B), we first averaged the RNA-seq and Ribo-seq normalized coverage values from each of the two replicate datasets for each region analyzed. We then calculated the ratio of the Ribo-seq value to the RNA-seq value.
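The occupancy calculation reduces to a ratio of replicate means, as in this minimal sketch. The function name is hypothetical, and returning None for regions with zero mean RNA-seq coverage is an assumption that mirrors the exclusion of such regions described above.

```python
def relative_ribosome_occupancy(ribo_reps, rna_reps):
    """Ratio of mean Ribo-seq coverage to mean RNA-seq coverage.

    ribo_reps, rna_reps: normalized coverage values for one region,
    one value per replicate.
    Assumption: regions with zero mean RNA-seq coverage return None.
    """
    ribo = sum(ribo_reps) / len(ribo_reps)
    rna = sum(rna_reps) / len(rna_reps)
    if rna == 0:
        return None
    return ribo / rna
```

A region with Ribo-seq values of 4 and 6 and RNA-seq values of 10 and 10 would have a relative occupancy of 0.5.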

Analysis of G/C usage within codons

For novel ORFs identified using the first replicate of Ribo-RET data, including ORFs identified in both replicates, we first trimmed the start and stop codons. We then trimmed any region of the remaining ORF that overlaps a previously annotated gene, leaving only complete codons; in many cases this removed the entire ORF from the analysis. For the remaining sequences, we scored the first, second or third position of all codons for the presence of a G or C. The G/C-skew was calculated as the ratio of the sum of G/C bases at the third codon position to that at the second codon position. Statistical comparisons were performed using a Fisher’s exact test comparing G/C base count at the second and third positions; tests were one-tailed or two-tailed as indicated, with the null hypothesis for one-tailed tests being that the G/C base count at the third codon position was not higher than that at the second codon position. Values plotted in Figure 7 represent the sum of values for each individual ORF, or the equivalent number for the annotated ORFs we identified by Ribo-RET (we did not trim these except for start and stop codons).
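The G/C-skew calculation can be sketched as follows. The function names are hypothetical, ORF sequences are assumed to be uppercase DNA with a length that is a multiple of 3, and neither the trimming of regions that overlap annotated genes nor the Fisher's exact test is implemented here.

```python
def gc_counts(orf):
    """Count G/C bases at codon positions 1, 2 and 3 of an ORF.

    orf: uppercase DNA sequence including start and stop codons,
    which are trimmed before counting.
    """
    body = orf[3:-3]                      # drop start and stop codons
    counts = [0, 0, 0]
    for i in range(0, len(body) - 2, 3):
        for p in range(3):
            if body[i + p] in "GC":
                counts[p] += 1
    return counts


def gc_skew(orfs):
    """Ratio of summed third-position to second-position G/C counts."""
    second = third = 0
    for orf in orfs:
        c = gc_counts(orf)
        second += c[1]
        third += c[2]
    return third / second if second else None
```

For the toy ORF ATG GCC TTG CAC TAA, the internal codons have one G/C at the second position and three at the third position, giving a G/C-skew of 3.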

Analysis of G/C usage within codons for predicted regulatory cysteine-rich sORFs

We examined the G/C-skew of 6 ORFs that we predict regulate expression of the downstream gene in response to cysteine availability, based on their conservation with regulatory sORFs in M. smegmatis (Canestrari et al., 2020). Strikingly, only one of these ORFs individually has significantly positive G/C-skew (Supplementary file 1D; Fisher’s exact test p < 0.05). Moreover, as a group, the six sORFs do not have significantly positive G/C-skew (Supplementary file 1D; Fisher’s exact test p = 0.13; n = 145 codons). We repeated this analysis after removing the cysteine codons from the sORFs, reasoning that cysteine codons have a neutral or negative effect on G/C-skew, and that the presence of cysteine codons is likely essential for the regulatory activity of the sORFs. Removing the cysteine codons increased G/C-skew for all ORFs; in two cases, the G/C-skew of the ORFs with cysteine codons removed is significantly positive (Supplementary file 1D; Fisher’s exact test p < 0.05). Moreover, as a group, the six ORFs with cysteine codons removed have significantly positive G/C-skew (Supplementary file 1D; Fisher’s exact test p = 1.5e–4; n = 120 codons).

Analysis of trinucleotide sequence content upstream of IERFs

The frequency of each trinucleotide was determined for the 50 nt upstream of all IERFs. For each trinucleotide sequence, the frequencies at positions –50 to –41 were averaged (mean), and frequencies at all other positions were normalized to this averaged number. The frequency of AGG and GGA trinucleotide sequences upstream of putative start codons was determined similarly, with the control region used for normalization located at positions –35 to –26 relative to the start codons.
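The position-wise normalization can be sketched as below for IERF-upstream sequences. The function name is hypothetical, and one detail is an assumption: each trinucleotide is assigned to the position of its first base, so a 50 nt upstream sequence yields 48 positions, with index 0 corresponding to position –50.

```python
def trinucleotide_enrichment(upstream_seqs, tri, n_baseline=10):
    """Normalized frequency of one trinucleotide at each upstream position.

    upstream_seqs: equal-length sequences; index 0 is position -50.
    tri: trinucleotide of interest, e.g. "ATG".
    Assumption: a trinucleotide is assigned to the position of its
    first base. Returns None if the baseline frequency is zero.
    """
    n_pos = len(upstream_seqs[0]) - 2
    n_seq = len(upstream_seqs)
    freq = [sum(s[i:i + 3] == tri for s in upstream_seqs) / n_seq
            for i in range(n_pos)]
    # Baseline: mean frequency over the first ten positions (-50 to -41).
    baseline = sum(freq[:n_baseline]) / n_baseline
    if baseline == 0:
        return None
    return [f / baseline for f in freq]
```

Values well above 1 at a given position indicate enrichment of the trinucleotide relative to the distal –50 to –41 region.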

Luciferase reporter assays

M. smegmatis mc2155 strains with integrated luciferase reporter constructs were grown in TSB with Tween80 overnight at 37 °C to an OD600 of ~1.0. Ten µl of Nano-Glo Luciferase Assay Reagent was mixed with 10 µl of cell culture. Luminescence readings (relative light units; RLUs) were taken using a Turner Biosystems Veritas microplate luminometer. Relative luminescence values were reported as RLU/OD600. Assays were performed in triplicate (biological replicates).

Western blots

M. smegmatis mc2155 strains with integrated FLAG-tagged constructs were grown in TSB with Tween overnight at 37 °C to an OD600 of ~1.0. Cells were harvested by centrifugation and resuspended in 1× NuPAGE LDS sample buffer (Invitrogen) + 5 mM sodium metabisulfite. Samples were heated at 95 °C for 10 min before loading onto a 4–12% gradient Bis-Tris mini-gel (Invitrogen). After separation, proteins were transferred to a nitrocellulose membrane (Life Technologies) or a PVDF membrane (Thermo Scientific). Membranes were probed with a monoclonal mouse anti-FLAG antibody (M2; Sigma). Secondary antibody and detection reagents were obtained from Lumigen (ECL plus kit) and used according to the manufacturer’s instructions.

Integrated genome browser

All ribosome profiling and Ribo-RET data, and identified ORFs are available for visualization on our interactive genome browser (Shell et al., 2015): https://mtb.wadsworth.org/.

Mass spectrometry

Five ml of Mtb (OD600 of 0.4) was used to inoculate 200 ml of medium and grown to an OD600 of 1.175. Cells were collected by filtration through a 0.22 μm filter. Cells were flash frozen in liquid nitrogen with 0.6 ml lysis buffer (20 mM Tris pH 8.0, 10 mM MgCl2, 100 mM NH4Cl, 5 mM CaCl2, 0.4% Triton X-100, 0.1% NP-40, 1 mM chloramphenicol, 100 U/mL DNase I). Frozen cells were milled using a Retsch MM400 mixer mill for 8 cycles of 3 min each at 15 Hz. Milling cups were re-frozen in liquid nitrogen between each milling cycle. Cell extracts were thawed and incubated on ice for 30 min. Samples were clarified by centrifugation. Supernatants were passed twice through 0.22 μm filters. Samples were prepared for MS analysis from 100 µg aliquots of Mtb cytosolic lysate in each of six different ways, with the numbers listed below matching those in Supplementary file 1A, C:

  1. Protein was precipitated by addition of acetonitrile at a ratio of 2:1, placed on ice for 20 min, then clarified by centrifugation for 10 min at 12,000 x g. The supernatant (enriched for small proteins) was decanted into two aliquots and dried using a speedvac (Thermo). One aliquot was resuspended and desalted using a 10 mg HLB solid-phase extraction cartridge (Waters) according to the manufacturer’s instructions, and dried. The second aliquot was used in method (2), as described below.

  2. The remaining aliquot from (1) was resuspended in 100 µl triethylammonium bicarbonate (TEAB; Sigma) and subjected to dimethyl labeling of Lys and N-termini to increase the mass and reduce the charge, and thereby increase detectability of small proteins (Boersema et al., 2009; Yan et al., 2020). The sample was desalted as above, and dried.

  3. Protein was denatured by addition of powdered urea (Alfa Aesar) to 8 M final concentration. The sample was subjected to centrifugal filtration through a 10 K Amicon filter similar to that employed in Filter Aided Sample Prep (FASP) proteomics (Wisniewski et al., 2009), except the sample flow-through containing small molecular weight proteins, not the retentate, was retained and split into two aliquots. One aliquot was desalted and dried. The second aliquot was used in method (4), as described below.

  4. The remaining aliquot from (3) was diluted until the urea concentration was <2 M, before being chemically reduced and alkylated (Yan et al., 2020). To reduce the size of large and hydrophobic proteins, the sample was digested with 1 µg of sequencing-grade trypsin (Promega) for 6 hr at 37 °C. Following digestion, the sample was quenched by addition of formic acid, desalted and dried. Samples were resuspended in 20 µl of 0.2% formic acid in water.

  5. Protein was denatured by addition of powdered urea (Alfa Aesar) to 8 M final concentration. The sample was subjected to centrifugal filtration through a 3 K Amicon filter, retaining the small molecular weight protein flow-through. The sample was desalted and dried.

  6. A total-protein digest was performed using an in-solution trypsin digestion procedure, as a potential source for small proteins not enriched using the approaches described above (Champion et al., 2003).

All samples were analyzed by nano-UHPLC-MS/MS on a Q-Exactive instrument (Thermo) (Bosserman et al., 2017; Canestrari et al., 2020). RAW files were converted to mgf (mascot generic format) using MS-Convert (Adusumilli and Mallick, 2017). Spectrum mass matching was performed using the Paragon Algorithm with feature sets as appropriate for each sample (e.g. dimethylation, trypsin, no-digest) in thorough mode (Champion et al., 2012; Shilov et al., 2007). A custom FASTA database of small proteins and leaderless ORF products, constructed from the Ribo-seq data, was used for the database search. FDRs were determined using the target-decoy strategy, as in Elias and Gygi, 2007. Proteins identified using this method were subjected to manual spectral interpretation to validate peptide spectral matches, in particular for b,y-ion consistency, and y fragments to Pro with high intensity. The presence of His and Phe immonium ions (110.0718, 120.0813 [M + H]+ m/z), when present in the target sequence, was used for additional validation. Selected small proteins and small protein-derived peptides from high- and medium-observed abundance proteins were chemically synthesized (Genscript) and subjected to LC-MS/MS as above. Synthetic small protein spectra were compared to the empirical-matched small proteins using the peptide spectral annotator (Brademan et al., 2019).

Acknowledgements

We thank Mike Palumbo and Dan Muller for assistance setting up the interactive genome browser (https://mtb.wadsworth.org/), Gabriele Baniulyte, Yunlong Li, and Yong Yang for technical support, David Grainger for comments on the manuscript, and Anne-Ruxandra Carvunis for helpful discussions. We thank the Wadsworth Center Applied Genomic Technologies, Bioinformatics, and Media Core Facilities, and Dr. Boggess in the Notre Dame MS and Proteomics Facility. This work was supported by National Institutes of Health grants R21AI117158 and R21AI119427 (JTW, KMD, TAG) and R01GM139277 (JTW, KMD, TAG, MMC).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Keith M Derbyshire, Email: keith.derbyshire@health.ny.gov.

Todd A Gray, Email: todd.gray@health.ny.gov.

Joseph T Wade, Email: joseph.wade@health.ny.gov.

Bavesh D Kana, University of the Witwatersrand, South Africa.

Funding Information

This paper was supported by the following grants:

  • National Institute of Allergy and Infectious Diseases R21AI117158 to Keith M Derbyshire, Todd A Gray, Joseph T Wade.

  • National Institute of Allergy and Infectious Diseases R21AI119427 to Keith M Derbyshire, Todd A Gray, Joseph T Wade.

  • National Institute of General Medical Sciences R01GM139277 to Matthew M Champion, Keith M Derbyshire, Todd A Gray, Joseph T Wade.

Additional information

Competing interests

No competing interests declared.

Reviewing editor, eLife.

Author contributions

Investigation, Writing – review and editing.

Investigation.

Investigation, Methodology.

Funding acquisition, Investigation, Project administration, Writing – review and editing.

Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review and editing.

Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review and editing.

Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Software, Supervision, Visualization, Writing – original draft, Writing – review and editing.

Additional files

Supplementary file 1. Supplementary tables.

(A) List of putative leaderless ORFs. (B) List of IERFs. (C) List of ORFs identified by Ribo-RET. (D) Analysis of G/C skew for cys-rich regulatory ORFs. (E) Analysis of isoform ORFs and their position relative to overlapping annotated ORFs. (F) List of oligonucleotides used in this study.

Transparent reporting form

Data availability

Raw Illumina sequencing data are available from the ArrayExpress and European Nucleotide Archive repositories with accession numbers E-MTAB-8039 and E-MTAB-10695. Raw mass spectrometry data are available through MassIVE, with exchange #MSV000087541. Python code is available at https://github.com/wade-lab/Mtb_Ribo-RET (copy archived at swh:1:rev:c6a41047e001550aab663588a13fe935547b9431).

The following datasets were generated:

Smith C, Wang AJ, Wade J. 2019. Pervasive Translation in Mycobacterium tuberculosis. ArrayExpress. E-MTAB-8039

Wang AJ, Wade J. 2021. Pervasive Translation in Mycobacterium tuberculosis. ArrayExpress. E-MTAB-10695

References

  1. Adusumilli R, Mallick P. Data Conversion with ProteoWizard msConvert. Methods in Molecular Biology (Clifton, N.J.) 2017;1550:339–368. doi: 10.1007/978-1-4939-6747-6_23.
  2. Baez WD, Roy B, McNutt ZA, Shatoff EA, Chen S, Bundschuh R, Fredrick K. Global analysis of protein synthesis in Flavobacterium johnsoniae reveals the use of Kozak-like sequences in diverse bacteria. Nucleic Acids Research. 2019;47:10477–10488. doi: 10.1093/nar/gkz855.
  3. Beck HJ, Moll I. Leaderless mRNAs in the Spotlight: Ancient but Not Outdated! Microbiology Spectrum. 2018;6:RWR-0016-2017. doi: 10.1128/microbiolspec.RWR-0016-2017.
  4. Besemer J, Borodovsky M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Research. 2005;33:451–454. doi: 10.1093/nar/gki487.
  5. Bibb MJ, Findlay PR, Johnson MW. The relationship between base composition and codon usage in bacterial genes and its use for the simple and reliable identification of protein-coding sequences. Gene. 1984;30:157–166. doi: 10.1016/0378-1119(84)90116-1.
  6. Blevins WR, Ruiz-Orera J, Messeguer X, Blasco-Moreno B, Villanueva-Cañas JL, Espinar L, Díez J, Carey LB, Albà MM. Uncovering de novo gene birth in yeast using deep transcriptomics. Nature Communications. 2021;12:604. doi: 10.1038/s41467-021-20911-3.
  7. Boersema PJ, Raijmakers R, Lemeer S, Mohammed S, Heck AJR. Multiplex peptide stable isotope dimethyl labeling for quantitative proteomics. Nature Protocols. 2009;4:484–494. doi: 10.1038/nprot.2009.21.
  8. Bosserman RE, Nguyen TT, Sanchez KG, Chirakos AE, Ferrell MJ, Thompson CR, Champion MM, Abramovitch RB, Champion PA. WhiB6 regulation of ESX-1 gene expression is controlled by a negative feedback loop in Mycobacterium marinum. PNAS. 2017;114:E10772–E10781. doi: 10.1073/pnas.1710167114.
  9. Brademan DR, Riley NM, Kwiecien NW, Coon JJ. Interactive Peptide Spectral Annotator: A Versatile Web-based Tool for Proteomic Applications. Molecular & Cellular Proteomics. 2019;18:S193–S201. doi: 10.1074/mcp.TIR118.001209.
  10. Burge CB, Karlin S. Finding the genes in genomic DNA. Current Opinion in Structural Biology. 1998;8:346–354. doi: 10.1016/s0959-440x(98)80069-9.
  11. Canestrari JG, Lasek-Nesselquist E, Upadhyay A, Rofaeil M, Champion MM, Wade JT, Derbyshire KM, Gray TA. Polycysteine-encoding leaderless short ORFs function as cysteine-responsive attenuators of operonic gene expression in mycobacteria. Molecular Microbiology. 2020;114:93–108. doi: 10.1111/mmi.14498.
  12. Carvunis A-R, Rolland T, Wapinski I, Calderwood MA, Yildirim MA, Simonis N, Charloteaux B, Hidalgo CA, Barbette J, Santhanam B, Brar GA, Weissman JS, Regev A, Thierry-Mieg N, Cusick ME, Vidal M. Proto-genes and de novo gene birth. Nature. 2012;487:370–374. doi: 10.1038/nature11184.
  13. Champion MM, Campbell CS, Siegele DA, Russell DH, Hu JC. Proteome analysis of Escherichia coli K-12 by two-dimensional native-state chromatography and MALDI-MS. Molecular Microbiology. 2003;47:383–396. doi: 10.1046/j.1365-2958.2003.03294.x.
  14. Champion MM, Williams EA, Kennedy GM, Champion PAD. Direct detection of bacterial protein secretion using whole colony proteomics. Molecular & Cellular Proteomics. 2012;11:596–604. doi: 10.1074/mcp.M112.017533.
  15. Chen J, Brunner AD, Cogan JZ, Nuñez JK, Fields AP, Adamson B, Itzhak DN, Li JY, Mann M, Leonetti MD, Weissman JS. Pervasive functional translation of noncanonical human open reading frames. Science (New York, N.Y.) 2020;367:1140–1146. doi: 10.1126/science.aay0262.
  16. Consaul SA, Pavelka MS. Use of a novel allele of the Escherichia coli aacC4 aminoglycoside resistance gene as a genetic marker in mycobacteria. FEMS Microbiology Letters. 2004;234:297–301. doi: 10.1016/j.femsle.2004.03.041.
  17. Cortes T, Schubert OT, Rose G, Arnvig KB, Comas I, Aebersold R, Young DB. Genome-wide mapping of transcriptional start sites defines an extensive leaderless transcriptome in Mycobacterium tuberculosis. Cell Reports. 2013;5:1121–1131. doi: 10.1016/j.celrep.2013.10.031.
  18. Del Campo C, Bartholomäus A, Fedyunin I, Ignatova Z. Secondary Structure across the Bacterial Transcriptome Reveals Versatile Roles in mRNA Regulation and Function. PLOS Genetics. 2015;11:e1005613. doi: 10.1371/journal.pgen.1005613.
  19. Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics (Oxford, England) 2007;23:673–679. doi: 10.1093/bioinformatics/btm009.
  20. Dingwall C, Lomonossoff GP, Laskey RA. High sequence specificity of micrococcal nuclease. Nucleic Acids Research. 1981;9:2659–2673. doi: 10.1093/nar/9.12.2659.
  21. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods. 2007;4:207–214. doi: 10.1038/nmeth1019.
  22. Gvozdjak A, Samanta MP. Genes Preferring Non-AUG Start Codons in Bacteria. arXiv. 2020 https://arxiv.org/abs/2008.10758
  23. Hecht A, Glasgow J, Jaschke PR, Bawazer LA, Munson MS, Cochran JR, Endy D, Salit M. Measurements of translation initiation from all 64 codons in E. coli. Nucleic Acids Research. 2017;45:3615–3626. doi: 10.1093/nar/gkx070.
  24. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119. doi: 10.1186/1471-2105-11-119.
  25. Ingolia NT, Ghaemmaghami S, Newman JRS, Weissman JS. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science (New York, N.Y.) 2009;324:218–223. doi: 10.1126/science.1168978.
  26. Ingolia NT. Genome-wide translational profiling by ribosome footprinting. Methods in Enzymology. 2010;470:119–142. doi: 10.1016/S0076-6879(10)70006-9.
  27. Ingolia NT, Brar GA, Stern-Ginossar N, Harris MS, Talhouarne GJS, Jackson SE, Wills MR, Weissman JS. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Reports. 2014;8:1365–1379. doi: 10.1016/j.celrep.2014.07.045.
  28. Ji Z, Song R, Huang H, Regev A, Struhl K. Transcriptome-scale RNase-footprinting of RNA-protein complexes. Nature Biotechnology. 2016;34:410–413. doi: 10.1038/nbt.3441.
  29. Keeling DM, Garza P, Nartey CM, Carvunis AR. The meanings of “function” in biology and the problematic case of de novo gene emergence. eLife. 2019;8:e47014. doi: 10.7554/eLife.47014.
  30. Laursen BS, Sørensen HP, Mortensen KK, Sperling-Petersen HU. Initiation of protein synthesis in bacteria. Microbiology and Molecular Biology Reviews. 2005;69:101–123. doi: 10.1128/MMBR.69.1.101-123.2005.
  31. Lomsadze A, Gemayel K, Tang S, Borodovsky M. Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes. Genome Research. 2018;28:1079–1089. doi: 10.1101/gr.230615.117.
  32. Lorenz R, Bernhart SH, Höner zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL. ViennaRNA Package 2.0. Algorithms for Molecular Biology. 2011;6:26. doi: 10.1186/1748-7188-6-26.
  33. Lu TC, Leu JY, Lin WC. A Comprehensive Analysis of Transcript-Supported De Novo Genes in Saccharomyces sensu stricto Yeasts. Molecular Biology and Evolution. 2017;34:2823–2838. doi: 10.1093/molbev/msx210.
  34. Lybecker M, Bilusic I, Raghavan R. Pervasive transcription: detecting functional RNAs in bacteria. Transcription. 2014;5:e944039. doi: 10.4161/21541272.2014.944039.
  35. McClure R, Balasubramanian D, Sun Y, Bobrovskyy M, Sumby P, Genco CA, Vanderpool CK, Tjaden B. Computational analysis of bacterial RNA-Seq data. Nucleic Acids Research. 2013;41:e140. doi: 10.1093/nar/gkt444.
  36. Meydan S, Vázquez-Laslop N, Mankin AS, Storz G, Papenfort K. Genes within Genes in Bacterial Genomes. Microbiology Spectrum. 2018;6:RWR-0020-2018. doi: 10.1128/microbiolspec.RWR-0020-2018.
  37. Meydan S, Marks J, Klepacki D, Sharma V, Baranov PV, Firth AE, Margus T, Kefi A, Vázquez-Laslop N, Mankin AS. Retapamulin-Assisted Ribosome Profiling Reveals the Alternative Bacterial Proteome. Molecular Cell. 2019;74:481–493. doi: 10.1016/j.molcel.2019.02.017.
  38. Moll I, Grill S, Gualerzi CO, Bläsi U. Leaderless mRNAs in bacteria: surprises in ribosomal recruitment and translational control. Molecular Microbiology. 2002;43:239–246. doi: 10.1046/j.1365-2958.2002.02739.x.
  39. Oh E, Becker AH, Sandikci A, Huber D, Chaba R, Gloge F, Nichols RJ, Typas A, Gross CA, Kramer G, Weissman JS, Bukau B. Selective ribosome profiling reveals the cotranslational chaperone action of trigger factor in vivo. Cell. 2011;147:1295–1308. doi: 10.1016/j.cell.2011.10.044.
  40. Orr MW, Mao Y, Storz G, Qian SB. Alternative ORFs and small ORFs: shedding light on the dark proteome. Nucleic Acids Research. 2020;48:1029–1042. doi: 10.1093/nar/gkz734.
  41. Romero DA, Hasan AH, Lin YF, Kime L, Ruiz-Larrabeiti O, Urem M, Bucca G, Mamanova L, Laing EE, van Wezel GP, Smith CP, Kaberdin VR, McDowall KJ. A comparison of key aspects of gene regulation in Streptomyces coelicolor and Escherichia coli using nucleotide-resolution transcription maps produced in parallel by global and differential RNA sequencing. Molecular Microbiology. 2014;1:12810. doi: 10.1111/mmi.12810.
  42. Ruiz-Orera J, Verdaguer-Grau P, Villanueva-Cañas JL, Messeguer X, Albà MM. Translation of neutrally evolving peptides provides a basis for de novo gene evolution. Nature Ecology & Evolution. 2018;2:890–896. doi: 10.1038/s41559-018-0506-6. [DOI] [PubMed] [Google Scholar]
  43. Saito K, Green R, Buskirk AR. Translational initiation in E. coli occurs at the correct sites genome-wide in the absence of mRNA-rRNA base-pairing. eLife. 2020;9:e55002. doi: 10.7554/eLife.55002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Sambandamurthy VK, Derrick SC, Hsu T, Chen B, Larsen MH, Jalapathy KV, Chen M, Kim J, Porcelli SA, Chan J, Morris SL, Jacobs WR., Jr Mycobacterium tuberculosis DeltaRD1 DeltapanCD: a safe and limited replicating mutant strain that protects immunocompetent and immunocompromised mice against experimental tuberculosis. Vaccine. 2006;24:6309–6320. doi: 10.1016/j.vaccine.2006.05.097. [DOI] [PubMed] [Google Scholar]
  45. Sawyer EB, Phelan JE, Clark TG, Cortes T. A snapshot of translation in Mycobacterium tuberculosis during exponential growth and nutrient starvation revealed by ribosome profiling. Cell Reports. 2021;34:108695. doi: 10.1016/j.celrep.2021.108695. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Sberro H, Fremin BJ, Zlitni S, Edfors F, Greenfield N, Snyder MP, Pavlopoulos GA, Kyrpides NC, Bhatt AS. Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes. Cell. 2019;178:1245–1259. doi: 10.1016/j.cell.2019.07.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Shell SS, Wang J, Lapierre P, Mir M, Chase MR, Pyle MM, Gawande R, Ahmad R, Sarracino DA, Ioerger TR, Fortune SM, Derbyshire KM, Wade JT, Gray TA. Leaderless Transcripts and Small Proteins Are Common Features of the Mycobacterial Translational Landscape. PLOS Genetics. 2015;11:e1005641. doi: 10.1371/journal.pgen.1005641. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Shilov IV, Seymour SL, Patel AA, Loboda A, Tang WH, Keating SP, Hunter CL, Nuwaysir LM, Schaeffer DA. The Paragon Algorithm, a next generation search engine that uses sequence temperature values and feature probabilities to identify peptides from tandem mass spectra. Molecular & Cellular Proteomics. 2007;6:1638–1655. doi: 10.1074/mcp.T600050-MCP200. [DOI] [PubMed] [Google Scholar]
  49. Snapper SB, Melton RE, Mustafa S, Kieser T, Jacobs WR. Isolation and characterization of efficient plasmid transformation mutants of Mycobacterium smegmatis. Molecular Microbiology. 1990;4:1911–1919. doi: 10.1111/j.1365-2958.1990.tb02040.x. [DOI] [PubMed] [Google Scholar]
  50. Storz G, Wolf YI, Ramamurthi KS. Small proteins can no longer be ignored. Annual Review of Biochemistry. 2014;83:753–777. doi: 10.1146/annurev-biochem-070611-102400. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Stringer A, Smith C, Mangano K, Wade JT. Identification of novel translated small ORFs in Escherichia coli using complementary ribosome profiling approaches. bioRxiv. 2021 doi: 10.1101/2021.07.02.450978. [DOI] [PMC free article] [PubMed]
  52. Tang WH, Shilov IV, Seymour SL. Nonlinear fitting method for determining local false discovery rates from decoy database searches. Journal of Proteome Research. 2008;7:3661–3667. doi: 10.1021/pr070492f. [DOI] [PubMed] [Google Scholar]
  53. Vakirlis N., Hebert AS, Opulente DA, Achaz G, Hittinger CT, Fischer G, Coon JJ, Lafontaine I. A Molecular Portrait of De Novo Genes in Yeasts. Molecular Biology and Evolution. 2018;35:631–645. doi: 10.1093/molbev/msx315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Vakirlis N, Acar O, Hsu B, Castilho Coelho N, Van Oss SB, Wacholder A, Medetgul-Ernar K, Bowman RW, Hines CP, Iannotta J, Parikh SB, McLysaght A, Camacho CJ, O’Donnell AF, Ideker T, Carvunis A-R. De novo emergence of adaptive membrane proteins from thymine-rich genomic sequences. Nature Communications. 2020;11:781. doi: 10.1038/s41467-020-14500-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Van Oss SB, Carvunis AR. De novo gene birth. PLOS Genetics. 2019;15:e1008160. doi: 10.1371/journal.pgen.1008160. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. VanOrsdel CE, Kelly JP, Burke BN, Lein CD, Oufiero CE, Sanchez JF, Wimmers LE, Hearn DJ, Abuikhdair FJ, Barnhart KR, Duley ML, Ernst SEG, Kenerson BA, Serafin AJ, Hemm MR. Identifying New Small Proteins in Escherichia coli. Proteomics. 2018;18:e1700064. doi: 10.1002/pmic.201700064. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Vellanoweth RL, Rabinowitz JC. The influence of ribosome-binding-site elements on translational efficiency in Bacillus subtilis and Escherichia coli in vivo. Molecular Microbiology. 1992;6:1105–1114. doi: 10.1111/j.1365-2958.1992.tb01548.x. [DOI] [PubMed] [Google Scholar]
  58. Wacholder A, Acar O, Carvunis AR. A Reference Translatome Map Reveals Two Modes of Protein Evolution. Genomics. 2021;1:452746. doi: 10.1101/2021.07.17.452746. [DOI] [Google Scholar]
  59. Wade JT, Grainger DC. Pervasive transcription: illuminating the dark matter of bacterial transcriptomes. Nature Reviews. Microbiology. 2014;12:647–653. doi: 10.1038/nrmicro3316. [DOI] [PubMed] [Google Scholar]
  60. Weaver J, Mohammad F, Buskirk AR, Storz G. Identifying Small Proteins by Ribosome Profiling with Stalled Initiation Complexes. MBio. 2019;10:e02819-18. doi: 10.1128/mBio.02819-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Wisniewski JR, Zougman A, Nagaraj N, Mann M. Universal sample preparation method for proteome analysis. Nature Methods. 2009;6:359–362. doi: 10.1038/nmeth.1322. [DOI] [PubMed] [Google Scholar]
  62. Woolstenhulme CJ, Guydosh NR, Green R, Buskirk AR. High-precision analysis of translational pausing by ribosome profiling in bacteria lacking EFP. Cell Reports. 2015;11:13–21. doi: 10.1016/j.celrep.2015.03.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Yan X, Sun L, Dovichi NJ, Champion MM. Minimal deuterium isotope effects in quantitation of dimethyl-labeled complex proteomes analyzed with capillary zone electrophoresis/mass spectrometry. Electrophoresis. 2020;41:1374–1378. doi: 10.1002/elps.202000051. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Yomtovian I, Teerakulkittipong N, Lee B, Moult J, Unger R. Composition bias and the origin of ORFan genes. Bioinformatics (Oxford, England) 2010;26:996–999. doi: 10.1093/bioinformatics/btq093. [DOI] [PMC free article] [PubMed] [Google Scholar]

Editor's evaluation

Bavesh D Kana

The use of ribosome profiling in this study allowed for the identification of translated regions of the Mycobacterium tuberculosis genome, identifying new genomic regions that undergo active translation. A select set of these appears to have been the subject of purifying evolutionary selection, suggesting that this pervasive translation of short genetic regions serves as the basis for the evolution of new proteins/protein functions.

Decision letter

Editor: Bavesh D Kana
Reviewed by: Scarlet Shell

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting your work entitled "Pervasive translation in Mycobacterium tuberculosis" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by me serving as both Reviewing and Senior Editor. The reviewers have opted to remain anonymous.

As you can see from their comments below, the reviewers were somewhat discrepant in their perspectives on this paper, necessitating more extensive consultation. The major concern of two of the reviewers was that ~1000 novel ORFs seems extremely high. Both of these reviewers thought further computational analysis (such as re-evaluation of the ribosome profiling data to better sort out real candidates from noise) and further validation (such as tagging of representative candidates in the native genome) were required to support this conclusion. Given the centrality of "pervasive translation" to your paper, I agree that further analysis and validation are warranted. Since it is anticipated that this work will take longer than two months, I need to reject this submission. However, if you are able to thoroughly address the reviewers' comments, I would encourage you to resubmit.

Reviewer #1:

Ribosome profiling can be used to experimentally improve the annotation of genomes, identifying open reading frames that were missed by computational approaches. Recently Mankin (Mol. Cell 2019) and Storz (mBio 2019) published studies in E. coli in which antibiotics trap ribosomes exclusively at start sites, allowing them to identify novel ORFs inside and outside of annotated genes. Wade and co-workers employ this strategy in M. tuberculosis. While this study is certainly valuable in defining the proteome of this important pathogen, it does not make use of novel methods nor arrive at substantially different conclusions from these earlier studies.

Wade and co-workers claim that they have identified >1000 novel ORFs but I have some concerns about such a large number of new genes. The ribosomal profiling data (from untreated and retapamulin-treated cells) appear to be quite good. The authors show that novel ORFs on average have enriched Shine-Dalgarno sequences, lower RNA structure, and higher density at stop codons. But I want to see evidence for each novel ORF that it is translated (not the average). For ORFs outside of annotated genes, they can establish a threshold of ribosome reads that establishes confidence in translation (this data is given in table 1). Some of these values are quite low suggesting that the ORFs are not translated (at least at appreciable levels). This could also be done for the 300 or so leaderless mRNAs. For ORFs inside annotated genes, the signal at the novel start site should be compared to the signal at the annotated start site (or perhaps the ribosome density across the gene in untreated cells) and only if the novel start site signal is above some ratio should it be considered.
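
The per-ORF filter requested in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the function name, the absolute cutoff of 20 reads, and the ratio cutoff of 0.1 are hypothetical placeholders chosen for the example.

```python
def passes_start_site_filter(novel_reads, annotated_reads=None,
                             min_reads=20, min_ratio=0.1):
    """Return True if a candidate start site clears both cutoffs.

    novel_reads: ribo-RET read count at the candidate start site.
    annotated_reads: ribo-RET read count at the annotated start site of the
        overlapping gene, or None for candidates outside annotated genes.
    min_reads, min_ratio: hypothetical thresholds, not values from the paper.
    """
    if novel_reads < min_reads:
        return False
    if annotated_reads is None or annotated_reads == 0:
        # Nothing to compare against, so rely on the absolute cutoff alone.
        return True
    return novel_reads / annotated_reads >= min_ratio
```

A candidate inside an annotated gene is kept only when its signal is both above the absolute floor and a sufficient fraction of the signal at the annotated start.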

I also have concerns about the validation studies. A very small amount of sequence upstream of the start sites of candidate genes was integrated into a different organism with a constant promoter. This means that other things that are essential for translation (transcription of that region, mRNA structure at the start site) are not taken into account. It is not surprising that start codons will initiate translation of a reporter gene if driven by a decent promoter. The question is whether or not this happens in the native genome. Validation should be done by adding an epitope tag to the 3'-end of the ORF in the native genome. I recognize that this is a lot to ask in an organism that is BSL3 with such a slow growth rate. But the strategy shown in Figure 4 doesn't work.

Reviewer #2:

In this manuscript, Smith and colleagues perform ribosome profiling on M. tuberculosis in the presence and absence of an antibiotic that blocks ribosomes at initiation. Collectively, the data support the idea that there is widespread low-level translation of small unannotated ORFs throughout the genome, most of which are unlikely to produce functional proteins. The study was rigorously and thoughtfully performed from both the experimental and analytical perspectives. The manuscript is well-written and most of the figures are clear and straightforward to interpret. The findings are significant, with a number of potentially important implications for mycobacterial physiology and evolution.

Reviewer #3:

The manuscript by Smith et al. describes a ribosome profiling (Ribo-seq) study of the translatome of the human pathogen Mycobacterium tuberculosis (Mtb). The authors map elongating and initiating (via Retapamulin treatment, which stalls initiation complexes, "ribo-RET") ribosomes on mRNAs by deep sequencing. This provides empirical evidence for translated ORFs (annotated or novel), including those within coding regions via Retapamulin-based identification of internal initiation sites. Based on genome-wide analysis of this data, the authors suggest ~1000 novel translated ORFs with high confidence and even more with lower confidence. Some of the novel ORFs, including isoforms of annotated genes or novel ORFs, are validated based on luciferase reporters or Western blots analysis of FLAG-tagged variants of the WT sequences or start-codon mutants. Finally, codon analysis suggests that most of the novel ORFs are likely non-functional as they do not show signatures of purifying selection. The study also provides a resource of novel ORFs that could be studied for functions in TB physiology. This includes small proteins, which are underrepresented in genome annotations, as well as functional studies. This is the first Ribo-seq study of Mtb, and ribo-RET has only been used in E. coli so far. The authors provide their data in a public web-based genome browser.

The Ribo-seq data and analysis appear solid. The major finding of the study is that there appears to be pervasive, non-productive translation in Mycobacteria, where the protein itself has no function. This mirrors the extensive antisense transcription that has also been reported, which might likewise provide substrate for evolution. Such pervasive translation has been reported previously (Meydan et al. 2019 Mol Cell, Weaver et al. MBio 2019, Impens et al. Nat Microbiol. 2017), but to a much lower extent. The suggested number of >1000 novel ORFs seems rather high. While the topic is generally interesting, some additional analyses and experiments (see detailed comments) should be performed to validate the novel ORFs and strengthen the case for pervasive translation. More discussion of how broadly this applies to bacteria (and/or higher organisms), placing the study in context with previous reports of pervasive translation, is also needed in order to highlight novelty and strengthen the manuscript.

1) The introduction/discussion is rather short and appears too focused on previous work on Mycobacteria/Ribo-seq. It would be helpful to discuss also if proteomics approaches indicated pervasive translation and place the findings of this study more into context of other studies.

2) According to Figure 4C, 425 annotated ORFs were detected by ribo-RET. How many ORFs are annotated in total in Mtb and what percentage is detected? For the ones for which no TIS is found in the ribo-RET, are these not expressed under the examined conditions or why are they not captured?

3) The authors suggest that most of the novel ORFs are probably non-functional. However, they might act as regulatory elements based on translation, e.g. like attenuators or upstream ORFs. Such ORFs might also be under little selection at the codon level. This should be discussed.

4) The suggested number of novel ORFs appears very high and only a small number is further validated in this study. I think it is important to provide some more evidence for them. For example, is there any evidence from Mtb N-terminal proteomics data (Shell et al. PLOS Genet 2015) for any of these novel ORFs? This would nicely support the Ribo-seq data, even if the sensitivity allows detection of only a few.

5) Along the same lines: How many of these novel ORFs were detected in previous Ribo-seq data in M. smegmatis? Do these show hallmarks of selection at the codon level? How many are conserved in other strains? This might identify candidates for future study and strengthen the catalogue.

6) Figure 4A: What kind of start codon mutation was introduced? A stop codon? Why do the start codon mutations of the fluorescence reporters not fully abolish expression, or at least abolish expression to similar background levels for all constructs? Some of the mutations have a surprisingly small effect. Are these canonical ATG start codons? It is also unclear what the background level is (i.e. bacteria without the luciferase reporter). A positive control, i.e. a known annotated ORF and its start codon mutant, should also be included. How actively translated are these novel examples compared to annotated ORFs?

7) Figures 4B and C and S9: Some controls are missing from the Western blot validation experiments: samples from an untagged WT strain should be loaded in parallel as a control to ensure the bands are specific. Moreover, for 4C, it would be helpful to add also a start-codon mutant of Rv3709c full-length alone (to see if the smaller isoform band remains) and in combination with the start codon mutant of the isoform.

8) Figures 1 and 3: The authors repeatedly show sequence coverage from ribosome profiling data. What would the sequence coverage look like for a fragmented total RNA sample as a control?

9) Lines 48-50: The authors discuss library artefacts towards the 5'end. Have they tried alternative library preparation protocols to avoid that bias?

10) Can the authors show that micrococcal nuclease digestion was complete by providing gradient profiles (+/- MNase and +/- Ret)? Were non-coding RNAs under-represented in the ribosome footprint libraries?

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Pervasive Translation in Mycobacterium tuberculosis" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Bavesh Kana as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Scarlet Shell (Reviewer #1).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Reviewers felt that the current submission has been substantively improved. Some concerns remain; these primarily relate to clarifying the analysis approach. In addition, the E. coli comparisons emerged repeatedly as problematic amongst reviewer comments. Please reconsider how these data are presented or interpreted. Your revision should carefully address the comments by revising the text; no wet lab experiments are necessary.

1. Reviewers felt it was difficult to evaluate whether the ribosome profiling data really support so many new translated regions in Mtb, given that so much depends on what threshold is set for including or excluding candidate sites. More discussion of the threshold used and an explanation / validation in the Results section (what exactly qualifies as an IERF?) is required.

2. Are the novel genes conserved in other mycobacterial species? As the ribo-RET data do not yet exist for these species, the authors could perform bioinformatic analyses to determine whether the ORFs in question are conserved, and if so, whether there is translation in the M. smegmatis ribo-seq data already available. Conservation or lack thereof would speak to their being functional and subject to selection.

3. Reviewers concur that the E. coli experiments are somewhat problematic. The authors could repeat the analyses in E. coli and Mtb without using prior annotations. In other words, find ORFs in each genome de novo and determine how many possible start sites have associated ribo-RET density. If indeed Mtb has pervasive translation and E. coli does not, there should be a higher fraction in Mtb with no reference to prior annotations. This analysis would make the comparison much more compelling.

4. The authors have selected 18 novel predicted ORFs for further validation using luciferase reporters. However, it remains unclear how these ORFs were selected and if they reflect high/low confidence candidates. This is important to assess how many of the >1000 candidates (and pervasive translation as a take home message of this paper) are indeed bona fide ORFs or just artifacts. Or was the validation focused on examples that have strong signals in the sequencing data or those with strong conservation? Please clarify. In case the validated candidates just reflect the top candidates, it might be useful to shorten the list of pervasive ORFs to a higher confidence set, e.g. integrating 3-nt periodicity using smORFer (PMID: 34125903).

5. While start/stop codon patterns were revealed by the metagene analysis of the Ribo-seq data, it remains unclear how well they are reflected on a single gene basis. How many of the >1000 ORFs have this pattern? What is the "median" pattern? Or is it just approximately 100 abundant bona fide sORFs that skew the plots? Single genes with abundant read coverages in Ribo-Ret can introduce artifact peaks into metagene plots.
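
The concern in point 5, that a few abundant genes can dominate a metagene plot, can be illustrated with a small sketch. It assumes per-gene read profiles are already extracted as equal-length windows around start codons; the function name and the toy counts are hypothetical and do not come from the study.

```python
from statistics import mean, median

def metagene(profiles, normalize=True, summary=median):
    """Summarize per-gene read profiles position by position.

    profiles: list of equal-length lists of read counts around start codons.
    normalize: scale each gene to its own total so no single gene dominates.
    summary: function applied to each column (e.g. mean or median).
    """
    rows = []
    for p in profiles:
        total = sum(p)
        if total == 0:
            continue  # skip genes with no coverage in the window
        rows.append([x / total for x in p] if normalize else list(p))
    return [summary(col) for col in zip(*rows)]

# Toy data: three flat genes and one abundant gene with a sharp peak.
profiles = [[2, 2, 2, 2], [1, 1, 1, 1], [3, 3, 3, 3], [0, 900, 50, 50]]
raw_mean = metagene(profiles, normalize=False, summary=mean)
norm_median = metagene(profiles)
```

In this toy case the unnormalized mean shows a large peak at the second position that comes entirely from one gene, whereas the per-gene-normalized median is flat. A peak that survives this kind of normalization is more likely to reflect the typical ORF.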

Other points that must be addressed:

1. Line 107-109: as a possible explanation of the displacement of the start codon peak further downstream, the authors suggest that retapamulin (RET) may not trap initiating ribosomes at leaderless mRNAs. However, it appears that the samples shown in Figure 1 were not treated with RET.

2. A leaderless mRNA should give a 15 nt ribosome footprint that would likely be missed in the study. This footprint size was confirmed by Sawyer 2021. Please show the footprint length distribution of the libraries in the supplemental information. It is puzzling why there are peaks of density 25-35 downstream of leaderless start codons. Please explain.

3. Can the Mtb ribo-seq data from Sawyer 2021 be used to validate expression of the novel ORFs described here?

4. Line 183: 1,994 IERFS. What is the threshold for the authors to call these as significant? Add one or two lines about this in the Results section. Why is this a reasonable threshold? Much depends on this seemingly arbitrary cutoff.

5. The RNA secondary structural analyses in Figure 3B – does this include all IERFs? Perhaps it should be separated in leadered vs leaderless. The region included in the RNA folding would be different, given that leaderless mRNAs have no upstream mRNA. Leaderless initiation is very sensitive to RNA structure (Bharmal, NAR Genomics and Bioinformatics, September 2021).

6. Figure 5: Perhaps there would be value in showing violin plots with the distribution of ribosome density (rpkm) and perhaps translational efficiency (ribo-seq / RNA-seq) for each gene.

7. 44 proteins expressed from novel ORFs were confirmed by MS analyses. Is there anything that these proteins have in common (biochemical similarities, high expression levels, etc) that would explain why these were detectable, but the vast majority of the new proteins were not detectable? More information about these 44 new proteins would be useful. The primary data with the y and b ion series for the three peptides confirmed by MS/MS (with synthetic standards) represent a quality control issue that could be moved to the supplementary materials.

8. Lines 76-79: Please be more specific here. It is unclear what "large numbers" and "many sites" mean. 10, 100, 1000?

9. Line 85: It would be helpful to indicate the genome size and total number of annotated genes at the start of this section.

10. Can the authors comment on whether they have detected any dual function sRNAs (sRNA encoding a sORF) in their data?

11. Lines 463-465: It feels that up to 10% spurious translation could be quite significant and detrimental to the cell. Can the authors comment on/clarify, what would be the cutoff for the definition of "pervasive"? 1%? 10%? 50%?

12. How are leaderless transcripts defined in this manuscript? Could leaderless not also include mRNAs with very short 5'UTRs (1-3 nt) and no Shine-Dalgarno? Do the data suggest that a 5'-end AUG/GTG etc is a requirement for leaderless initiation? Or could it also initiate 1-2 nt downstream of the TSS/5'-end?

13. Please clarify what an IERF is vs an ORF. Sometimes it is not clear if the authors are requiring that an IERF also be associated with an in-frame stop codon. Perhaps mention also that some of the detected IERFs could be due to resistance to MNase digestion?

14. Line 259: "Novel ORFs tend to be weakly expressed but efficiently translated" – this seems a bit confusing or conflicting. Would it be better to say weakly "transcribed"?

15. Lines 102-103 – "attributable to sequence biases associated with library preparation". Can the authors be more specific here? Why would library preparation biases be more prevalent at start and stop codons?

16. Lines 85-86: Simply having an RUG at the 5' position does not usually result in annotation of a leaderless gene as it also requires an in-frame stop codon. Was an RUG the only requirement in these studies? Or an RUG associated with an ORF? Please clarify.

17. Figure 3a: It would be interesting to see the proportions of different types of novel ORFs split up in the orange part of the pie chart (currently all novel ORFs are merged together).

18. Figure 7B and 7C: Please add size information on the western blots.
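
The quantities named in point 6 above (ribosome density in rpkm, and translational efficiency as ribo-seq / RNA-seq) follow standard definitions and can be written out as a short sketch. The function names are illustrative, and returning `None` when there is no RNA-seq signal is an assumption made for this example rather than a convention taken from the manuscript.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count / (length_nt / 1_000) / (total_mapped_reads / 1_000_000)

def translational_efficiency(ribo_rpkm, rna_rpkm):
    """Ribo-seq density divided by RNA-seq density; None if no RNA signal."""
    if rna_rpkm == 0:
        return None
    return ribo_rpkm / rna_rpkm
```

Per-gene values computed this way for annotated and novel ORFs are what a violin plot of the kind suggested in point 6 would display.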

Reviewer #1:

To my reading, all of the reviewers' concerns have been thoroughly addressed. There is a substantial amount of new data validating the pervasive translation of short unannotated ORFs, and the figures have been revised for improved clarity. The quality of the data and presentation are high, and the findings are of high significance.

Reviewer #2:

This manuscript by Smith and co-workers uses ribosome profiling to discover new sites of translation in Mycobacterium tuberculosis. The data appear to be high quality, and importantly, they make use of the antibiotic retapamulin to trap newly initiated 70S ribosomes at start codons (while allowing elongating ribosomes to run-off transcripts), defining many new sites of translational initiation genome-wide. There are a stunning number of novel translated regions, more than the number of annotated genes whose translation can be detected in their experiments. The authors argue that pervasive translation occurs throughout the Mtb transcriptome.

It is difficult to evaluate whether the ribosome profiling data really support so many new translated regions in Mtb, given that so much depends on what threshold they set for including or excluding candidate sites. I would like to see more discussion of the threshold used and an explanation / validation in the Results section (what exactly qualifies as an IERF?). But I recognize that ribo-seq and retapamulin have been used in previous studies in E. coli, confirming the validity of the approach. And the authors argue that their novel initiation sites show enrichment of SD sequences upstream of the start codon, lower local mRNA structure, and levels of ribosome density and translational efficiency similar to known annotated genes. Furthermore, some fraction of these novel proteins can be detected by MS experiments. Together, I find these arguments compelling.

1) Are the novel genes conserved in other mycobacterial species? I realize that ribo-RET data do not yet exist for these species, but the authors could perform bioinformatic analyses to ask whether the ORFs in question are conserved, and if so, whether there is translation in the M. smegmatis ribo-seq data already available. Conservation or lack thereof would speak to their being functional and subject to selection.

2) The authors argue for pervasive translation in Mtb but not in E. coli on the grounds that previous ribo-RET studies in E. coli observed translation of a larger fraction of annotated ORFs and found fewer novel genes. This is problematic because the E. coli genome is arguably the best annotated genome after decades of studies. And it may be that a smaller fraction of annotated genes are expressed in Mtb under standard lab conditions than in E. coli under different lab conditions. Finally, it is not clear that the methods used to decide what counts as a novel start site are the same in this study and the E. coli study. I propose that with the E. coli data from Meydan 2019, the authors use exactly the same metrics for calling start sites to repeat the analyses in E. coli and Mtb without using prior annotations. In other words, find ORFs in each genome de novo and determine how many possible start sites have associated ribo-RET density. If indeed Mtb has pervasive translation and E. coli does not, there should be a higher fraction in Mtb with no reference to prior annotations. This analysis would make the comparison much more compelling.
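
The annotation-free comparison proposed here could look roughly like the sketch below. It is a simplified illustration under several assumptions: only the forward strand is scanned, the start codon set (ATG, GTG, TTG) is assumed, read counts are keyed directly by start coordinate (ignoring any offset between the ribo-RET peak and the start codon), and the cutoff of 10 reads is arbitrary. None of these choices is taken from the manuscript.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed start codon set
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Yield (start, end) for every start codon with an in-frame stop.

    Forward strand only. Coordinates are 0-based, end is exclusive and
    includes the stop codon. min_codons counts sense codons.
    """
    seq = seq.upper()
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in START_CODONS:
            continue
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    yield (start, pos + 3)
                break  # stop at the first in-frame stop codon

def fraction_with_density(orfs, ret_counts, min_reads=10):
    """Fraction of candidate start sites with at least min_reads reads.

    ret_counts: dict mapping start coordinate to ribo-RET read count
    (a hypothetical input format).
    """
    orfs = list(orfs)
    if not orfs:
        return 0.0
    hits = sum(1 for start, _ in orfs if ret_counts.get(start, 0) >= min_reads)
    return hits / len(orfs)
```

Running the same two functions, with the same cutoffs, on both genomes would give fractions that can be compared without reference to either annotation.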

Reviewer #3:

Smith et al. describe the application of ribosome profiling (Ribo-seq) and Ribo-Ret (Ribo-seq with retapamulin treatment) to Mycobacterium tuberculosis (Mtb) with the aim of identifying novel small ORFs (sORFs) and start codons. Ribo-seq and Ribo-Ret in Mtb surprisingly provided evidence for the translation of >1000 novel ORFs, many of which were short. In their resubmitted manuscript, the authors have carefully addressed the previous comments of the three reviewers and have added further data (e.g. additional reporter fusions and mass spectrometry-based detection of small proteins) to further support their conclusions about pervasive translation in Mycobacteria and validation of novel ORFs. Global analysis of Ribo-seq patterns at the start and stop codons, translation efficiency, as well as the newly included small protein-targeted proteomics and GC skew, supported that many of these novel ORFs are bona fide, translated ORFs. Validation of selected translation initiation regions by luciferase translational fusions and ORFs by C-terminal FLAG-tagging/western blotting support the global data. A focus is placed on leaderless ORFs, which are especially prevalent in Mtb. The authors' analysis of novel ORF conservation suggests that most of the novel ORFs are not under purifying selection, making it unclear if they have a function in Mtb. Based on this, the authors propose that Mtb experiences pervasive, apparently non-productive, translation, as has been described previously for bacterial transcriptomes.

Ribo-seq is a powerful method for monitoring translation and detection of novel ORFs. Its derivative, Ribo-Ret, has not yet been applied to many prokaryotes and is not trivial to establish or analyze. The analysis methods established and presented in this study are of interest to others applying the technique to diverse prokaryotes to overall increase confidence in Ribo-seq-predicted ORFs. Furthermore, the detected, conserved sORFs serve as a resource for the Mtb and small protein communities, and it is highly appreciated that the data has been made readily available in a browser. Compared to E. coli, Mtb is a high-GC organism with a unique genomic structure. The idea of pervasive translation is fairly new in prokaryotes, and the study has implications for exploring how genes arise.

Overall, this is an important study with careful analysis of the data, considerable validation, and careful response to reviewers' comments. However, I have the following suggestions to strengthen the conclusions and to further clarify certain aspects of the manuscript:

– p. 20: The authors have selected 18 novel predicted ORFs for further validation using luciferase reporters. However, it remains unclear how these ORFs were selected and if they reflect high/low confidence candidates. This is important to assess how many of the >1000 candidates (and pervasive translation as a take home message of this paper) are indeed bona fide ORFs or just artifacts? Or was the validation focused on examples that have strong signals in the sequencing data or those with strong conservation?

– In case the validated candidates just reflect the top candidates, it might be useful to shorten the list of pervasive ORFs to a higher confidence set, e.g. integrating 3-nt periodicity using smORFer (PMID: 34125903).

– While start/stop codon patterns were revealed by the metagene analysis of the Ribo-seq data, it remains unclear how well they are reflected on a single gene basis. How many of the >1000 ORFs have this pattern? What is the "median" pattern? Or is it just approximately 100 abundant bona fide sORFs that skew the plots? Single genes with abundant read coverages in Ribo-Ret can introduce artifact peaks into metagene plots.

eLife. 2022 Mar 28;11:e73980. doi: 10.7554/eLife.73980.sa2

Author response


[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

Reviewer #1:

Ribosome profiling can be used to experimentally improve the annotation of genomes, identifying open reading frames that were missed by computational approaches. Recently Mankin (Mol. Cell 2019) and Storz (mBio 2019) published studies in E. coli in which antibiotics trap ribosomes exclusively at start sites, allowing them to identify novel ORFs inside and outside of annotated genes. Wade and co-workers employ this strategy in M. tuberculosis. While this study is certainly valuable in defining the proteome of this important pathogen, it does not make use of novel methods nor arrive at substantially different conclusions from these earlier studies.

We agree that we have not developed any novel techniques; Ribo-RET has been applied before in E. coli. However, we disagree that the conclusions of our work are not substantially different from those of earlier studies. Ribo-RET in E. coli identified new ORFs that we would classify as “novel” and “isoform”. However, these represented a relatively small minority of the total number of ORFs identified, as we now discuss in the manuscript. In other words, the large majority of ORFs identified by Ribo-RET in E. coli are annotated genes. By contrast, we detected more unannotated ORFs than annotated ORFs in M. tuberculosis. It is possible that our data allowed for more sensitive detection of ORFs, but this is unlikely given the enrichment of ribosome occupancy at start codons afforded by addition of retapamulin in M. tuberculosis versus E. coli (compare Figure 1B of Meydan et al., to Figure 2A in our manuscript). Hence, we conclude that pervasive translation is much more prevalent in M. tuberculosis than in E. coli.

Wade and co-workers claim that they have identified >1000 novel ORFs but I have some concerns about such a large number of new genes. The ribosomal profiling data (from untreated and retapamulin-treated cells) appear to be quite good. The authors show that novel ORFs on average have enriched Shine-Dalgarno sequences, lower RNA structure, and higher density at stop codons. But I want to see evidence for each novel ORF that it is translated (not the average). For ORFs outside of annotated genes, they can establish a threshold of ribosome reads that establishes confidence in translation (this data is given in table 1). Some of these values are quite low suggesting that the ORFs are not translated (at least at appreciable levels). This could also be done for the 300 or so leaderless mRNAs.

Like the reviewer, we were surprised to find so many novel ORFs, and we were initially skeptical of this result. Hence, we previously performed rigorous analytical and experimental follow-up work showing that (i) putative sites of translation initiation identified by Ribo-RET are far more frequently associated with likely start codons than expected by chance, (ii) novel start codons are associated with Shine-Dalgarno sequences, (iii) novel start codons are associated with reduced RNA secondary structure, (iv) novel start and stop codons have enriched ribosome density in Ribo-seq data, similar to that observed for annotated ORFs, (v) regions around novel and isoform start codons drive translation in a luciferase reporter construct, and (vi) two selected novel ORFs and two selected isoform ORFs are supported by western blot data. As the reviewer notes, these data do not assign confidence scores to individual ORFs, but rather demonstrate a pattern of pervasive translation. We now provide additional support for individual novel ORFs in the form of mass spectrometry data.

To further address the reviewer’s concern, we have determined the ribosome density across novel ORF transcripts by normalizing Ribo-seq coverage to RNA-seq coverage, a common measure of relative ribosome occupancy. Normalizing to RNA-seq data accounts for the large variability in RNA levels for the set of novel ORFs; thus, low Ribo-seq signal could reflect efficient translation of an RNA that is of low abundance. We limited our analysis to the regions of novel ORFs that do not overlap an annotated ORF on the same strand, including an additional 30 nt at either end of annotated ORFs to account for the footprints of initiating/terminating ribosomes that extend beyond the ORF boundaries. It is important to note that the novel ORFs analyzed are likely a biased set, enriched for ORFs located antisense to annotated genes; however, avoiding overlap with annotated genes is essential to interpret these data. For the remaining 871 novel ORFs, some represented in full and others represented only partially, RNA-seq and Ribo-seq coverage tends to be lower than for annotated ORFs, but ribosome density per transcript is only slightly reduced. Moreover, ribosome density per transcript tends to be much higher for novel ORFs than for a set of control non-coding transcripts selected based on transcription start site data for regions not close to ORFs. Thus, our data clearly show that most of the novel ORF transcripts are robustly translated, even if many are weakly expressed. These analyses are included as a new figure and supplementary figure (Figure 5, and Figure 5, Figure Supplement 1). We have also added normalized ribosome density measurements for individual ORFs to Supplementary Tables 1 and 3; we anticipate that readers will use these numbers to assign confidence to individual ORF calls.
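The normalization described above can be illustrated with a short sketch. It is not the pipeline used in the study: the function name `ribosome_density` and the toy per-nucleotide coverage lists are hypothetical, and the coverage values are assumed to have already been scaled to library size and restricted to the non-overlapping part of each ORF.

```python
def ribosome_density(ribo_cov, rna_cov):
    """Ratio of mean Ribo-seq coverage to mean RNA-seq coverage for one region.

    ribo_cov, rna_cov: per-nucleotide coverage over the same region.
    Returns None when there is no RNA-seq signal to normalize against.
    """
    if len(ribo_cov) != len(rna_cov):
        raise ValueError("coverage vectors must span the same region")
    if not ribo_cov:
        return None
    ribo_mean = sum(ribo_cov) / len(ribo_cov)
    rna_mean = sum(rna_cov) / len(rna_cov)
    if rna_mean == 0:
        return None
    return ribo_mean / rna_mean


# Hypothetical values: a weakly expressed but well-translated ORF,
# and an abundant transcript with very little ribosome occupancy.
weak_orf = ribosome_density([2, 4, 2, 4], [1, 2, 1, 2])
noncoding = ribosome_density([1, 1, 1, 1], [50, 50, 50, 50])
```

In this toy case the weakly expressed ORF has the higher density even though its raw Ribo-seq coverage is similar to that of the abundant transcript, which is the point of normalizing to RNA-seq.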

For ORFs inside annotated genes, the signal at the novel start site should be compared to the signal at the annotated start site (or perhaps the ribosome density across the gene in untreated cells) and only if the novel start site signal is above some ratio should it be considered.

It is impossible to assign Ribo-seq or RNA-seq signal to a single ORF if it overlaps another ORF. Hence, we have chosen to limit the analysis described above to novel ORFs that do not overlap annotated genes.

I also have concerns about the validation studies. A very small amount of sequence upstream of the start sites candidate genes was integrated into a different organism with a constant promoter. This means that other things that are essential for translation (transcription of that region, mRNA structure at the start site) are not taken into account. It is not surprising that start codons will initiate translation of a reporter gene if driven by a decent promoter. The question is whether or not this happens in the native genome. Validation should be done by adding an epitope tag to the 3'-end of the ORF in the native genome. I recognize that this is a lot to ask in an organism that is BSL3 with such a slow growth rate. But the strategy shown in Figure 4 doesn't work.

Our interpretation of the nLuc reporter fusion data is that the selected sequences can function as ribosome binding sites; we have added text to note that the sequences are not in their native context.

We agree that the optimal version of the western blot experiment would be to introduce epitope tags at the native locus in M. tuberculosis. However, this is technically challenging and would take ~6 months. Hence, we opted to introduce the tagged ORFs into M. smegmatis, including the complete predicted 5’ UTR in each case. We used a strong promoter to rule out the possibility that an M. tuberculosis promoter might be transcriptionally inactive in M. smegmatis. We disagree that any potential start codon would be efficiently translated if it were located within a well-transcribed RNA; our Ribo-RET and Ribo-seq data clearly show that this is not the case. We also note that for the isoform ORFs, the overlapping ORF serves as an internal control for the western blot that accounts for transcript levels. Nonetheless, we agree with the reviewer’s assertion that many sequences could function as start codons if present in an RNA, and indeed we believe this is the basis for the pervasive translation we observe.

To address the reviewer’s concern, we have used mass spectrometry to identify proteins translated from the novel ORFs. While there are major technical challenges for identification of small proteins by mass spectrometry, as we now discuss in the manuscript, we detected 44 proteins translated from novel ORFs.

Reviewer #3:

The manuscript by Smith et al. describes a ribosome profiling (Ribo-seq) study of the translatome of the human pathogen Mycobacterium tuberculosis (Mtb). The authors map elongating and initiating (via Retapamulin treatment, which stalls initiation complexes, "ribo-RET") ribosomes on mRNAs by deep sequencing. This provides empirical evidence for translated ORFs (annotated or novel), including those within coding regions via Retapamulin-based identification of internal initiation sites. Based on genome-wide analysis of this data, the authors suggest ~1000 novel translated ORFs with high confidence and even more with lower confidence. Some of the novel ORFs, including isoforms of annotated genes or novel ORFs, are validated based on luciferase reporters or Western blots analysis of FLAG-tagged variants of the WT sequences or start-codon mutants. Finally, codon analysis suggests that most of the novel ORFs are likely non-functional as they do not show signatures of purifying selection. The study also provides a resource of novel ORFs that could be studied for functions in TB physiology. This includes small proteins, which are underrepresented in genome annotations, as well as functional studies. This is the first Ribo-seq study of Mtb, and ribo-RET has only been used in E. coli so far. The authors provide their data in a public web-based genome browser.

The Ribo-seq data and analysis appear solid. The major finding of the study is that there appears to be pervasive, non-productive translation in Mycobacteria, where the protein itself has no function. This mirrors the extensive antisense transcription that has also been reported, which might likewise provide substrate for evolution. Such pervasive translation has been reported previously (Meydan et al. 2019 Mol Cell, Weaver et al. mBio 2019, Impens et al. Nat Microbiol. 2017), but to a much lower extent. The suggested number of >1000 novel ORFs seems rather high. While the topic is generally interesting, some additional analyses and experiments (see detailed comments) should be performed to validate the novel ORFs and strengthen the case for pervasive translation. More discussion of how broadly this applies to bacteria (and/or higher organisms), and of how the study relates to previous reports of pervasive translation, is also needed in order to highlight novelty and strengthen the manuscript.

We have added new controls and a new mass spectrometry experiment that strengthen the case for pervasive translation in M. tuberculosis.

1) The introduction/discussion is rather short and appears too focused on previous work on Mycobacteria/Ribo-seq. It would be helpful to discuss also if proteomics approaches indicated pervasive translation and place the findings of this study more into context of other studies.

We have greatly expanded the Discussion, and we have added text comparing our work to other related studies. In the new section of the Results that describes proteomic data, we discuss the limitations of standard techniques that greatly limit the number of small proteins that can be detected. For these reasons, we have chosen not to compare our work to proteomic studies.

2) According to Figure 4C, 425 annotated ORFs were detected by ribo-RET. How many ORFs are annotated in total in Mtb and what percentage is detected? For the ones for which no TIS is found in the ribo-RET, are these not expressed under the examined conditions or why are they not captured?

See response to Reviewer #1.

3) The authors suggest that most of the novel ORFs are probably non-functional. However, they might act as regulatory elements based on translation, e.g. like attenuators or upstream ORFs. Such ORFs might also be under little selection at the codon level. This should be discussed.

We have added a section on this topic in the Discussion. uORFs might have atypical codon usage because they are embedded within an RNA structure that is conserved because they are involved in regulation, and/or because rare codons participate in regulation. To investigate this question, we have now looked at codon usage in a set of six previously described regulatory uORFs that are enriched in cysteine codons. Only one of the individual uORFs has significant G/C skew (Fisher’s exact test p < 0.05), and the group of ORFs as a whole does not have significant G/C skew; this is despite strong evidence that these are conserved, regulatory uORFs. We repeated the analysis after removing the cysteine codons, reasoning that they are likely essential for the uORF regulatory functions, and they are expected to reduce G/C skew. Indeed, removing the cysteine codons increased the G/C skew for all ORFs, and the G/C skew for the set of uORFs as a group was much higher, and statistically significant (Fisher’s exact test p < 0.001). We now discuss the implications of this analysis for the full set of novel ORFs.

4) The suggested number of novel ORFs appears very high and only a small number is further validated in this study. I think it is important to provide some more evidence for them. For example, is there any evidence from Mtb N-terminal proteomics data (Shell et al. PLOS Genet 2015) for any of these novel ORFs? This would nicely support the Ribo-seq data, even if the sensitivity allows detection of only a few.

See comments in response to Reviewer 1. In short, we believe the Ribo-seq data provide very strong evidence that the majority of novel ORFs are robustly translated. Nonetheless, we have added mass spectrometry data as suggested, providing independent experimental support for 44 novel ORFs. We note that the small size of most of the novel ORFs makes them difficult to detect using standard proteomic approaches, which are biased strongly against small proteins. We attempted to enrich for small proteins, but this clearly requires further optimization. We have also determined the ribosome density per transcript for many of the novel ORFs; these data further support the idea that the novel ORFs are efficiently translated.

5) Along the same lines: How many of these novel ORFs were detected in previous Ribo-seq data in M. smegmatis? Do these show hallmarks of selection at the codon level? How many are conserved in other strains? This might identify candidates for future study and strengthen the catalogue.

To address the question of whether novel ORFs are functional, we have ranked non-overlapping novel ORFs by their G/C-skew, highlighting ORFs that are likely under purifying selection. Thus, we identify ~90 likely functional novel ORFs. Comparison to data from M. smegmatis is confounded by the difficulty in identifying novel ORFs from standard Ribo-seq data. We plan to address this in a future manuscript that includes Ribo-RET data for M. smegmatis.

6) Figure 4A: What kind of start codon mutation was introduced? A stop codon? Why do the start codon mutations of the fluorescence reporters not fully abolish expression, or at least abolish expression to similar background levels for all constructs? Some of the mutations have a surprisingly small effect. Are these canonical ATG start codons? It is also unclear what is the background level (i.e. bacteria without the luciferase reporter). A positive control, i.e. a known annotated ORF and its start codon mutant, should also be included. How actively translated are these novel examples compared to annotated ORFs?

The start codons were mutated to RCG. The luciferase assay data are plotted on a log scale, so the effect of mutating the start codon in each case is large. Nonetheless, most of the mutant constructs retain some activity. Interestingly, we observed the same phenomenon when investigating novel ORFs in E. coli, where we integrated the luciferase reporter at the native chromosomal locus (https://www.biorxiv.org/content/10.1101/2021.07.02.450978v2). We speculate that there are alternative internal start codons, or that non-canonical start codon sequences can be used, as described for E. coli (PMID 28334756). As a control, we have included sequences for three annotated ORFs; these behave similarly to the other reporters, giving us confidence in our conclusion that the novel/isoform RBSs are functional.

7) Figures 4B and C and S9: Some controls are missing from the Western blot validation experiments: samples from an untagged WT strain should be loaded in parallel as a control to ensure the bands are specific. Moreover, for 4C, it would be helpful to add also a start-codon mutant of Rv3709c full-length alone (to see if the smaller isoform band remains) and in combination with the start codon mutant of the isoform.

The start codon mutants serve as a control, and the blots looking at different proteins, all of which use the same epitope tag, control for each other when detecting non-specific bands. For isoform ORFs, we did not mutate the start codon of the full-length gene because this would likely lead to premature Rho-dependent transcription termination upstream of the isoform start codon.

8) Figure 1 and 3: The authors show multiple times sequence coverages from ribosome profiling data. How would the sequence coverage look like for a fragmented total RNA sample as control?

We have added this control dataset, which does not show enrichment 15/12 nt downstream of start/stop codons (Figure 4, Figure Supplement 1B-D). We do observe the artifactual signal precisely at start/stop codons that we attribute to imprecise sequence read alignment due to these libraries being made by polyadenylation of RNA fragments. We have also included a control analysis of mock ORFs (Figure 4, Figure Supplement 1A). Specifically, we identified mock ORFs whose stop codons do not match any annotated ORF or any ORF we identified by Ribo-RET. We limited the set of mock ORFs to those found in regions that are detectably transcribed. We do not observe Ribo-seq signal enrichment 15/12 nt downstream of the start/stop codons of the mock ORFs.

9) Lines 48-50: The authors discuss library artefacts towards the 5'end. Have they tried alternative library preparation protocols to avoid that bias?

We have not, but our data suggest that this would be a good idea for any groups applying Ribo-seq methods to bacteria.

10) Can the authors show that micrococcal nuclease digestion was complete by providing gradient profiles (+/- MNase and +/- Ret). Were non-coding RNAs under-represented in the ribosome footprint libraries?

We do not have gradient profiles for the samples used for Ribo-RET and Ribo-seq. However, the clear enrichment of ribosome occupancy at start and stop codons shows that MNase processing was sufficient. Non-coding RNAs are strongly under-represented in the Ribo-seq data relative to RNA-seq, as now indicated in Figure 5 and Figure 5, Figure Supplement 1 (new figures).

[Editors’ note: what follows is the authors’ response to the second round of review.]

Essential revisions:

Reviewers felt that the current submission has been substantively improved. Some concerns remain; these primarily relate to clarifying the analysis approach. In addition, the E. coli comparisons emerged repeatedly as problematic amongst reviewer comments. Please reconsider how these data are presented or interpreted. Your revision should carefully address the comments by revising the text; no wet lab experiments are necessary.

1. Reviewers felt it was difficult to evaluate whether the ribosome profiling data really support so many new translated regions in Mtb, given that so much depends on what threshold is set for including or excluding candidate sites. More discussion of the threshold used and an explanation / validation in the Results section (what exactly qualifies as an IERF?) is required.

We have added the details for IERF identification to the relevant part of the Results. The most important parameter in identifying IERFs is the minimum read coverage in the Ribo-RET data. The value we chose (5.5 reads per million) is arbitrary; a higher value would identify fewer IERFs and hence fewer ORFs, but with a lower FDR. We tried to strike a balance between detecting more translated ORFs and keeping the FDR low. We do not believe that there is a defined set of translated ORFs; rather, there is a continuum of expression levels for all possible ORFs, such that a lower threshold of discovery or a more sensitive approach will identify more translated ORFs.
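The trade-off described above can be illustrated with a minimal sketch of the minimum-coverage filter alone. This is not the authors' IERF-calling procedure, whose other criteria are given in the Results: the function names, the toy counts, and the choice of an inclusive (`>=`) comparison are all assumptions made for illustration. Only the 5.5 reads-per-million value comes from the text.

```python
def reads_per_million(raw_count, total_mapped_reads):
    """Scale a raw read count to reads per million mapped reads."""
    return raw_count * 1_000_000 / total_mapped_reads


def call_candidate_sites(position_counts, total_mapped_reads, min_rpm=5.5):
    """Return the positions whose normalized coverage meets the threshold.

    position_counts: dict mapping genome position -> raw Ribo-RET read count.
    """
    return sorted(
        pos
        for pos, count in position_counts.items()
        if reads_per_million(count, total_mapped_reads) >= min_rpm
    )


# Hypothetical library of 10 million mapped reads:
# the three positions correspond to 12, 4 and 6 reads per million.
counts = {100: 120, 250: 40, 900: 60}
total = 10_000_000

sites_default = call_candidate_sites(counts, total)             # threshold 5.5
sites_strict = call_candidate_sites(counts, total, min_rpm=10)  # stricter
```

Raising the threshold from 5.5 to 10 drops the site at position 900, showing how a stricter cut-off calls fewer candidate sites.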

2. Are the novel genes conserved in other mycobacterial species? As the ribo-RET data do not yet exist for these species, the authors could perform bioinformatic analyses to determine whether the ORFs in question are conserved, and if so, whether there is translation in the M. smegmatis ribo-seq data already available. Conservation or lack thereof would speak to their being functional and subject to selection.

This is an interesting and important question, but we believe it is beyond the scope of the current study. tBLASTn analysis reveals potential homologues for >100 non-overlapping, novel ORFs, with only about a third of these having significant G/C-skew. However, a more sophisticated analysis is required to determine whether conservation is due to selective pressure on the ORF itself or overlapping sequence features. This is something we are working on for a future publication. Similarly, we have generated Ribo-RET data for M. smegmatis to identify novel ORFs. Comparison of the Mtb and M. smegmatis datasets will be the subject of a future paper.

3. Reviewers concur that the E. coli experiments are somewhat problematic. The authors could repeat the analyses in E. coli and Mtb without using prior annotations. In other words, find ORFs in each genome de novo and determine how many possible start sites have associated ribo-RET density. If indeed Mtb has pervasive translation and E. coli does not, there should be a higher fraction in Mtb with no reference to prior annotations. This analysis would make the comparison much more compelling.

We have chosen to remove the text suggesting that Mtb has a larger proportion of translation dedicated to novel ORFs than E. coli. While our data suggest that this is the case, there are enough technical differences between the studies of the two species that we do not feel we can make this claim with sufficient confidence, even if we were to analyze the data using the same pipeline.

4a. The authors have selected 18 novel predicted ORFs for further validation using luciferase reporters. However, it remains unclear how these ORFs were selected and if they reflect high/low confidence candidates. This is important to assess how many of the >1000 candidates (and pervasive translation as a take home message of this paper) are indeed bona fide ORFs or just artifacts. Or was the validation focused on examples that have strong signals in the sequencing data or those with strong conservation? Please clarify.

They are all from the top 25th percentile of Ribo-RET scores, but they cover a broad range of values for ribosome density per transcript (median percentile rank of 37 for the 8 ORFs that could be assessed in non-overlapping regions). We have added these details to the text in the Results section. We believe the MS data provide a much better assessment of the novel ORFs identified by Ribo-RET. Hence, we have moved the luciferase assay data and the western blot data to the supplement.

4b. In case the validated candidates just reflect the top candidates, it might be useful to shorten the list of pervasive ORFs to a higher confidence set, e.g. integrating 3-nt periodicity using smORFer (PMID: 34125903).

There is strong evidence from our data and from published E. coli Ribo-seq data (PMID 27924019) that the 3 nt periodicity in bacterial Ribo-seq data is due to sequence biases within codons (e.g. G/C-skew), and does not reflect the codon-by-codon movement of ribosomes across ORFs.

5. While start/stop codon patterns were revealed by the metagene analysis of the Ribo-seq data, it remains unclear how well they are reflected on a single gene basis. How many of the >1000 ORFs have this pattern? What is the "median" pattern? Or is it just approximately 100 abundant bona fide sORFs that skew the plots? Single genes with abundant read coverages in Ribo-Ret can introduce artifact peaks into metagene plots.

For each ORF that contributes data to the metagene plots, values are normalized to the maximum value in the analyzed region. Thus, all ORFs contribute equally; no single ORF can dominate.
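A minimal sketch of this per-ORF normalization is shown below. It is an illustration rather than the code used for the figures: the function names and toy profiles are hypothetical, and averaging the normalized profiles position by position is an assumption about how they are combined.

```python
def normalize_to_max(profile):
    """Scale one ORF's coverage profile so that its highest value is 1."""
    peak = max(profile)
    if peak == 0:
        return [0.0] * len(profile)
    return [value / peak for value in profile]


def metagene(profiles):
    """Average the max-normalized profiles position by position.

    profiles: list of equal-length coverage lists, one per ORF, aligned on
    the same reference point (e.g. the start codon).
    """
    normalized = [normalize_to_max(p) for p in profiles]
    n = len(normalized)
    return [sum(column) / n for column in zip(*normalized)]


# Hypothetical profiles: one highly covered ORF and one sparsely covered ORF.
abundant = [0, 1000, 0, 0]
rare = [0, 0, 0, 2]
meta = metagene([abundant, rare])
```

After normalization, the ORF with 1000 reads and the ORF with 2 reads each contribute a peak of the same height to the combined profile.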

Other points that must be addressed:

1. Line 107-109: as a possible explanation of the displacement of the start codon peak further downstream, the authors suggest that retapamulin (RET) may not trap initiating ribosomes at leaderless mRNAs. However, it appears that the samples shown in Figure 1 were not treated with RET.

This sentence refers to Ribo-seq data without drug treatment. We observe an enrichment of signal 15 nt downstream of leadered ORF start codons even without RET treatment, albeit to a lesser degree (e.g. Figure 1A).

2. A leaderless mRNA should give a 15 nt ribosome footprint that would likely be missed in the study. This footprint size was confirmed by Sawyer 2021. Please show the footprint length distribution of the libraries in the supplemental information. It is puzzling why there are peaks of density 25-35 downstream of leaderless start codons. Please explain.

We have chosen not to make a plot of the distribution of sequence read lengths because it is the mappable sequence reads that count, and the standard read-mapping tools ignore very short reads. We note that Sawyer et al. showed similar sequence read coverage around leaderless start codons as our data. We agree that the peak of ribosome footprint 3’ density 25-30 nt downstream of leaderless start codons is an interesting observation, but it is one that we cannot explain. We presume it reflects the fundamentally different mechanism of leaderless translation initiation.

3. Can the Mtb ribo-seq data from Sawyer 2021 be used to validate expression of the novel ORFs described here?

Ribo-seq data from the Sawyer et al. study have substantially lower enrichment of signal downstream of start and stop codons, likely because cells were treated with chloramphenicol before harvesting. Hence, these data are not as useful for validating the novel ORF calls.

4. Line 183: 1,994 IERFS. What is the threshold for the authors to call these as significant? Add one or two lines about this in the Results section. Why is this a reasonable threshold? Much depends on this seemingly arbitrary cutoff.

See response to major comment #1.

5. The RNA secondary structural analyses in Figure 3B – does this include all IERFs? Perhaps it should be separated in leadered vs leaderless. The region included in the RNA folding would be different, given that leaderless mRNAs have no upstream mRNA. Leaderless initiation is very sensitive to RNA structure (Bharmal, NAR Genomics and Bioinformatics, September 2021).

This analysis only considers leadered ORFs. Since the leaderless ORFs were inferred based on transcription start site data, there is no expectation that the secondary structure at the start of these RNAs will be lower than that of random sequence.

6. Figure 5: Perhaps there would be value in showing violin plots with the distribution of ribosome density (rpkm) and perhaps translational efficiency (ribo-seq / RNA-seq) for each gene.

We have added supplementary figure panels showing cumulative frequency plots for relative RNA coverage and relative ribosome density (Figure 5 —figure supplement 1B-C). We think Figure 5B is a good way to show the ribosome density-per-transcript numbers.

7. 44 proteins expressed from novel ORFs were confirmed by MS analyses. Is there anything that these proteins have in common (biochemical similarities, high expression levels, etc) that would explain why these were detectable, but the vast majority of the new proteins were not detectable? More information about these 44 new proteins would be useful. The primary data with the y and b ion series for the three peptides confirmed by MS/MS (with synthetic standards) represent a quality control issue that could be moved to the supplementary materials.

The y and b ion series data are important to show that the peptides we detected from cell lysates were correctly assigned to their corresponding proteins. We did not identify any features strongly enriched in the proteins we validated by MS, although the proteins we detected by MS tend to be associated with ORFs with higher ribosome occupancy per transcript than ORFs associated with MS-undetected proteins (Mann-Whitney U Test p = 0.03). Note that the MS-validated proteins are indicated in Supplementary Tables 1 and 3, so readers can see the various features associated with these proteins.

8. Lines 76-79: Please be more specific here. It is unclear what means "large numbers" and "many sites". 10, 100, 1000?

We have replaced references to “large numbers” with specific ranges. We have kept “many sites” because we do not believe the number of pervasively translated ORFs can be quantified. In other words, we don’t believe there is a discrete set of pervasively translated ORFs.

9. Line 85: It would be helpful to indicate the genome size and total number of annotated genes at the start of this section.

Done.

10. Can the authors comment on whether they have detected any dual function sRNAs (sRNA encoding a sORF) in their data?

There are annotated sRNAs for which we observe Ribo-seq coverage consistent with translation of an sORF, but these will be the subject of a future paper.

11. Lines 463-465: It feels that up to 10% spurious translation could be quite significant and detrimental to the cell. Can the authors comment on/clarify, what would be the cutoff for the definition of "pervasive"? 1%? 10%? 50%?

We are making the point here that our data suggest that pervasive translation contributes less than 10% of all translation, although it could be considerably less than that. By our definition, there is no cut-off for the amount of translation required to be defined as “pervasive”, although we acknowledge that the physiological relevance of pervasive translation will depend largely on how much of it occurs.

12. How are leaderless transcripts defined in this manuscript? Could "leaderless" not also include mRNAs with very short 5'UTRs (1-3 nt) and no Shine-Dalgarno? Do the data suggest that a 5'-end AUG/GTG etc is a requirement for leaderless initiation? Or could it also initiate 1-2nt downstream of the TSS/5'-end?

We define leaderless transcripts as having no 5’ UTR (stated on line 43). Recent data from the Schrader lab indicate that very short 5’ UTRs can be tolerated in Caulobacter crescentus. That is likely to also be the case in mycobacteria, but we have not investigated this in detail. The effect of very short 5’ UTRs on leaderless translation in mycobacteria will be the subject of a future paper.

13. Please clarify what is an IERF vs an ORF? Sometimes it is not clear if the authors are requiring that an IERF also be associated with an in-frame stop codon? Perhaps mention also that some of the detected IERFs could be due to resistance to MNase digestion?

As discussed in more detail in our response to major comment #1, we have added text to the Results to clarify how IERFs were identified. In brief, an IERF is a site of enrichment in the Ribo-RET data. Most IERFs correspond to a site of translation initiation, but ~30% were not assigned to an ORF, likely for a variety of reasons, e.g. the ORF uses a non-canonical start codon; the spacing between the start codon and the enriched Ribo-RET signal does not fall within the range we required; the Ribo-RET signal is not due to ribosome occupancy but rather to association with another large complex; or the Ribo-RET signal is due to ribosomes stalled for a reason other than being trapped at a start codon. Without more data, we are reluctant to comment in the manuscript on specific reasons why some IERFs are not associated with expected start codons at expected positions.

14. Line 259: "Novel ORFs tend to be weakly expressed but efficiently translated" – this seems a bit confusing or conflicting. Would it be better to say weakly "transcribed"?

We have changed “expressed” to “transcribed”, as suggested. We previously used “expressed” because we wanted to include the possibility that RNA stability contributes to the RNA level, but we are comfortable saying transcription.

15. Lines 102-103 – "attributable to sequence biases associated with library preparation". Can the authors be more specific here? Why would library preparation biases be more prevalent at start and stop codons?

The MNase bias is the same everywhere, but start codons always have a “T” at the second position, and stop codons always have a “T” at the first position, so the bias lines up across all ORFs. The same effect would be observed if metagene plots were made for any specific nucleotide sequence, e.g. all AAT codons. We have added some text to clarify this point: “We note that there are also smaller peaks and troughs of Ribo-seq signal precisely at start and stop codons, likely attributable to sequence biases associated with library preparation that are highlighted when similar sequences (e.g. start/stop codons) are aligned”.

16. Lines 85-86: Simply having an RUG at the 5' position does not usually result in annotation of a leaderless gene as it also requires an in-frame stop codon. Was an RUG the only requirement in these studies? Or an RUG associated with an ORF? Please clarify.

We are not sure we understand the question. Every RUG will have an in-frame stop somewhere downstream. The 1,285 transcripts referred to are those that begin with RUG. If a 5’ RUG is sufficient for leaderless translation, the locations of leaderless ORFs can be inferred from the positions of the transcription start sites.
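The inference described above can be illustrated with a short sketch: given sequence starting at a transcription start site, check for a 5’ RUG (ATG or GTG in DNA letters) and read in frame to the first stop codon. The function name and example sequences are hypothetical, and this is not the code used to generate the list of 1,285 transcripts; in practice the sequence would be taken from the genome downstream of each mapped TSS, so an in-frame stop is always eventually reached.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
LEADERLESS_STARTS = {"ATG", "GTG"}  # RUG, written in DNA letters


def leaderless_orf(transcript):
    """Return the ORF that begins at the transcript 5' end, or None.

    The transcript must begin with ATG or GTG (no 5' UTR). The ORF runs
    from that start codon through the first in-frame stop codon.
    """
    seq = transcript.upper().replace("U", "T")
    if seq[:3] not in LEADERLESS_STARTS:
        return None
    # Step through codons in frame with the 5' start codon.
    for i in range(3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[:i + 3]
    return None  # no in-frame stop within the supplied sequence


if __name__ == "__main__":
    # Hypothetical sequences, each beginning at a transcription start site.
    print(leaderless_orf("GTGAAACCCTAGGG"))  # begins with GTG
    print(leaderless_orf("CATGAAATAA"))      # start codon is not at the 5' end
```

A transcript whose RUG sits even one nucleotide downstream of the 5’ end returns `None` here, which matches the strict no-5’-UTR definition used in the response to comment 12.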

17. Figure 3a: It would be interesting to see the proportions of different types of novel ORFs split up in the orange part of the pie chart (currently all novel ORFs are merged together).

We have expanded the figure to indicate the different subclasses of novel and isoform ORF. We also now indicate the different subclasses of novel ORF in Supplementary Tables 1 and 3.

18. Figure 7B and 7C: Please add size information on the western blots.

Done.

Associated Data


    Data Citations

    1. Smith C, Wang AJ, Wade J. 2019. Pervasive Translation in Mycobacterium tuberculosis. ArrayExpress. E-MTAB-8039
    2. Wang AJ, Wade J. 2021. Pervasive Translation in Mycobacterium tuberculosis. ArrayExpress. E-MTAB-10695

    Supplementary Materials

    Figure 6—figure supplement 2—source data 1. Images of full western blots are provided.

    The zipped folder includes (i) individual files for each blot, and (ii) a summary file showing all blots, with boxes to show the regions used in Figure 6—figure supplement 2.

    Supplementary file 1. Supplementary tables.

    (A) List of putative leaderless ORFs. (B) List of IERFs. (C) List of ORFs identified by Ribo-RET. (D) Analysis of G/C skew for cys-rich regulatory ORFs. (E) Analysis of isoform ORFs and their position relative to overlapping annotated ORFs. (F) List of oligonucleotides used in this study.

    elife-73980-supp1.xlsx (2.2MB, xlsx)
    Transparent reporting form

    Data Availability Statement

    Raw Illumina sequencing data are available from the ArrayExpress and European Nucleotide Archive repositories with accession numbers E-MTAB-8039 and E-MTAB-10695. Raw mass spectrometry data are available through MassIVE, with exchange #MSV000087541. Python code is available at https://github.com/wade-lab/Mtb_Ribo-RET (copy archived at swh:1:rev:c6a41047e001550aab663588a13fe935547b9431).

    The following datasets were generated:

    Smith C, Wang AJ, Wade J. 2019. Pervasive Translation in Mycobacterium tuberculosis. ArrayExpress. E-MTAB-8039

    Wang AJ, Wade J. 2021. Pervasive Translation in Mycobacterium tuberculosis. ArrayExpress. E-MTAB-10695


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
