Abstract
It is generally assumed that new genes arise through duplication and/or recombination of existing genes. The probability that a new functional gene could arise out of random non-coding DNA is so far considered to be negligible, since it seems unlikely that such an RNA or protein sequence could have an initial function that influences the fitness of an organism. We have here tested this question systematically, by expressing clones with random sequences in E. coli and subjecting them to competitive growth. Contrary to expectations, we find that random sequences with bioactivity are not rare. In our experiments we find that up to 25% of the evaluated clones enhance the growth rate of their cells and up to 52% inhibit growth. Testing of individual clones in competition assays confirms their activity and provides an indication that their activity could be exerted either by the transcribed RNA or the translated peptide. This suggests that transcribed and translated random parts of the genome could indeed have a high potential to become functional. The results also suggest that random sequences may become an effective new source of molecules for studying cellular functions, as well as for pharmacological activity screening.
Introduction
De novo evolution of genes and proteins from random sequences has long been considered to be highly unlikely1,2. Given that the combinatorial possibilities of even short protein sequences are almost infinite, while the number of actually found folds in solved structures is not more than a few thousand, it seemed for a long time that only a very minute fraction of the sequence space could possibly become bioactive3–5. On the other hand, comparative genome and transcriptome analyses have shown that new transcripts and proteins can easily arise de novo from random parts of the genome6–9. Intriguingly, the highest rates of de novo emergence are always found in the evolutionarily youngest lineages10 suggesting that the constraints for de novo evolution cannot be very high. Further, a re-analysis of presumed non-coding transcripts has shown that many of them associate with ribosomes and develop potentially functional ORFs11,12.
We have set out here to ask whether a random translatable sequence expressed in a cell could cause an effect within the cell that gives it an advantage or disadvantage in growth compared to other cells. We have chosen to use Escherichia coli as a test system for its ease of manipulation in laboratory conditions while maintaining large population sizes. E. coli is known to respond to even small selection differences under competitive growth conditions13.
Assuming that de novo gene birth occurs via acquisition of specific elements to produce transcribed and translated protogenes, we decided to mimic the process and provide a population of bacterial cells with artificial protogenes in which all basic elements for transcription and translation are already present. We used an expression vector to express random sequences with the potential to code for peptides under the control of an inducible promoter. Each RNA with its open reading frame acts both as a novelty in the cell and as a marker in a screening-by-sequencing approach. Because all other parts of the bacterial machinery are virtually identical in the clonal population, we can expect that the differences in growth are due to competition dynamics13 and that these should be causally related to the expression of the plasmids in the cell.
In our approach, we do not use restrictive conditions as was done previously in a similar system14, but allow optimal growth conditions and monitor only frequency changes of clones of a transformed library, rather than expecting fixation of selected variants. We find that a very high fraction of the random sequences that could be statistically assessed in the experiments show changes of frequency over time, thus affecting the growth rate of the cells either positively or negatively. Repetitions of the experiment repeatedly yield the same subsets of clones with the same response modes. We conclude that the majority of randomly generated sequences have reproducible biochemical activity, possibly through interactions with components of the cellular machinery, and are thus relevant for fitness differences in the bacteria.
Results
To generate a library of random sequences with coding potential, we synthesized oligo-nucleotides including 150nt with equal representations of all 4 nucleotides at each position during synthesis, and cloned them into the pFLAG-CTC™ expression vector containing an IPTG-inducible promoter and a C-terminal FLAG sequence (see Methods). This results in transcribed RNAs with a length of about 700nt, of which 150nt are randomly generated sequence. When translated, the corresponding peptides have a length of 65aa with a central part of 50aa of random sequence, unless the sequence includes a premature stop codon. All our further analysis focuses only on clones coding for such full length peptides to make results best comparable.
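As an illustration of why only a subset of clones codes for full-length peptides, the expected fraction of stop-free inserts can be estimated with a short calculation. This is a minimal sketch and not part of the original analysis: it assumes that the 50 random codons are independent and drawn uniformly from all 64 codons, and that only the three standard stop codons terminate translation. The function name is hypothetical.

```python
# Assumption: each of the 50 random codons is drawn uniformly and
# independently from the 64 possible codons (equimolar synthesis).
STOP_CODONS = 3        # TAA, TAG, TGA in the standard genetic code
TOTAL_CODONS = 64
RANDOM_CODONS = 50     # 150 nt of random sequence


def fraction_full_length(n_codons: int = RANDOM_CODONS) -> float:
    """Probability that n_codons random codons contain no in-frame stop."""
    return ((TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS) ** n_codons


if __name__ == "__main__":
    print(f"Expected stop-free fraction: {fraction_full_length():.3f}")
```

Under these assumptions about 9% of the random inserts would be free of premature stop codons. Synthesis biases or PCR errors would shift this value.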
The initially transformed library was amplified without induction through IPTG. This pool of amplified clones served as a source for all experiments. Two major sets of experiments were performed, one with passage to a new flask after 3 hours, the other after 24h of growth. Inoculation at each cycle was done with a large pool of cells to avoid drift effects through bottlenecks. This meant that only about 3-4 cell divisions occurred before the stationary phase was reached (see Methods). Each experiment was done in 10 replicates and was run for four growth cycles in the presence of IPTG. The frequency of each clone was determined through parallel sequencing at the end of each growth cycle. Counts of sequence reads for each clone were used for statistical analysis using DESeq215.
Across all experiments we found a large number of different clones in the library (see Methods). However, most of these occurred only at low frequencies, and the majority were detected only once across all experiments. Since these rare clones preclude statistical analysis, we focused only on those for which at least five counts were observed in at least one of the parallel replicates. This reduced the number to 1,082 clones that could be evaluated in at least one experiment, and all further statistical analyses were based on these.
In a first test, we asked whether any major frequency changes could be detected when the expression was not induced by IPTG. Figure 1 shows the effect of induction with IPTG compared to replicate cultures of the same experiment without induction. The induced experiment showed major shifts in clone frequency over time, while the non-induced experiment showed only minor non-significant variation. This shows that expression of the random sequences from the clones is required to cause frequency changes.
Figure 1. Induction of expression through IPTG drives changes in clone frequency over time.
A) with IPTG, B) without IPTG. Plots of fold-change (compared to the first cycle) versus mean counts across pairwise comparisons. Negative fold changes are indicative of depletion compared to the first cycle, and positive fold changes are indicative of enrichment compared to the first cycle. Left, center and right panels indicate comparisons with the 2nd, 3rd and 4th cycle (24 hours per cycle). Green and red dots indicate clones with significant fold changes (5% FDR), positive and negative, respectively. Black crosses indicate clones with non-significant fold changes. The number in the lower-right corner of each plot indicates the total number of clones with significant changes. Both experiments were derived from the same stock culture, and performed simultaneously. We observe that the induction has a clear effect on the dynamics of competition and differentiation along the experiment.
A large number of clones showing an increase or decrease in frequency were observed in the experiments under induction conditions. Most of them showed a consistent increase or decrease across the four experimental cycles. Figure 2 shows such examples.
Figure 2. Examples of four clones with significant changes in frequency over time.
Each time point represents a 24 h cycle. A) and B) show clones increasing in frequency (as normalized counts, see Methods). C) and D) show clones decreasing in frequency. Boxplots show the median and the interquartile ranges (outliers as dots) across the 10 replicates of each cycle.
The overall results are summarized in Table 1 and all clones with up or down responses are listed in suppl. Table 1. In the three experiments with 3h cycles (E1-E3), we find a significant change in frequency for 61-64% of the analyzed clones, with about 4 to 5-fold more clones going down in frequency than going up. The 24h cycle experiments are more heterogeneous with respect to these overall percentages, but maintain the same overall trend.
Table 1. Summary across seven different experiments.
| | E1 | E2 | E3 | E4 | E5 | E6 | E7² |
|---|---|---|---|---|---|---|---|
| cycle length | 3h | 3h | 3h | 24h | 24h | 24h | 24h |
| total number of reads | 457,212 | 324,090 | 678,823 | 293,638 | 300,807 | 812,315 | 4,142,836 |
| analyzed clones¹ | 623 | 529 | 607 | 616 | 499 | 618 | 1061 |
| reads for analyzed clones | 75,927 | 80,745 | 265,413 | 68,285 | 97,877 | 326,375 | 715,675 |
| significant clones at FDR 0.05 | 389 | 322 | 389 | 287 | 148 | 369 | 713 |
| enriched | 68 | 67 | 68 | 64 | 58 | 67 | 277 |
| depleted | 321 | 255 | 321 | 223 | 90 | 302 | 436 |
| % bioactive clones | 62% | 61% | 64% | 47% | 30% | 60% | 67% |

¹ number of clones with at least 5 reads in at least one replicate in any experiment
² experiment with only five replicates but deeper sequencing, resulting in ten-fold higher read coverage per replicate
The observation of a higher proportion of depleted clones depends on sequencing depth, combined with the statistical requirement of a certain depth of coverage for each clone to detect a significant change. Identifying statistically significant depletion is easier when a clone already has a high starting frequency - and such clones will also be found at lower sequencing depth. Significant increases in frequency, on the other hand, can start at very low initial frequencies and these will become more visible at higher sequencing depth. To test this, we conducted an additional experiment (E7) with four 24h cycles, including five replicates (instead of 10) and a greater sequencing depth. At this higher sequencing depth we find an overall fraction of significant clones similar to most of the other experiments (67%), but indeed also a larger number of enriched clones (Table 1).
Using the data from experiment E7 we could also estimate the power to detect depleted versus enriched clones. We performed analyses for subsets at 10% intervals of the total reads in the experiment. Figure 3 shows the fold-change plots for four depths of sampling. We find that from 50% onwards more and more initially low frequency clones become significantly enriched. Suppl. Figure 1 shows the corresponding rarefaction analysis where the detection of depleted clones is more or less complete at 60% coverage, while the detection of enriched clones keeps rising.
Figure 3. Assessment of read depth on detection power.
Progression of significant fold changes with sampling depth in experiment E7 (from 10% of the reads through 100% of the reads). Red circles (left of the dotted line) indicate clones significantly decreasing over time; green circles (right of the dotted line) indicate clones significantly increasing over time, and black crosses indicate clones with non-significant fold changes. Significance was set at 5% FDR.
Among the clones with significant changes in frequencies, we found a fraction that differs only by a single nucleotide/amino acid change from another clone in the set. These variants were likely created through PCR induced mutations during the oligo-nucleotide amplification step. 95% of them showed the same direction of change, confirming the high repeatability of the individual effects and suggesting that single nucleotide differences do not have much influence on the effect.
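One way to find such near-identical clones is an all-against-all comparison by Hamming distance. The sketch below is illustrative only and is not the procedure used in the study. It assumes that variants differ by a substitution (equal sequence length) and that each clone carries a direction call of +1 (enriched) or -1 (depleted). All function names are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def single_mismatch_pairs(sequences):
    """Unordered pairs of equal-length sequences differing at exactly one position."""
    return [
        (a, b)
        for a, b in combinations(sequences, 2)
        if len(a) == len(b) and hamming(a, b) == 1
    ]


def fraction_same_direction(pairs, direction):
    """Share of pairs whose two members changed frequency in the same direction.

    direction maps each sequence to +1 (enriched) or -1 (depleted).
    Returns None when there are no pairs to evaluate.
    """
    if not pairs:
        return None
    agreeing = sum(direction[a] == direction[b] for a, b in pairs)
    return agreeing / len(pairs)
```

The pairwise loop is quadratic in the number of clones, which is unproblematic for a set of about a thousand sequences.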
Compositional analysis
To assess whether there are systematic compositional patterns in sequences that have an effect on the cells, we focused on the translated peptides, since they provide potentially more information than nucleotide composition of the underlying RNAs. We have thus compared the amino acid composition and biochemical properties of the amino acids of all analyzed peptides with a set of random peptides and with biological controls derived from E. coli annotated proteins (suppl. Figure 2). We find that for almost every case where the biological control deviates from random expectation, the peptides in our study are closer to the random control. Similarly, there are no major differences between up and down regulated clones with respect to amino acid composition in most comparisons. Still, minor differences are found for some comparisons in the relative frequency of certain amino acids such as less R, D, C, G, S, V and more N, E, Q, I and T in the real E. coli protein sequences. However, these are biochemically rather diverse amino acids in each set, which do not allow much speculation about possible structural differences at present. Moreover, given that a subset of the clones may act via their RNA rather than the coding peptide (see below), such inferences would still be highly speculative.
Validation of individual clones with growth advantage
All of the above experiments were conducted in the context of a large mixture of clones. We were interested in testing some clones individually or in less complex mixtures to see whether the activity patterns could be confirmed. Further, while there could be many ways in which random sequences, especially peptides, could inhibit growth processes, it is of particular interest to study clones that provide cells with a growth advantage. We chose three such clones and isolated the respective plasmids from the library. Western blotting shows that all three express a peptide in dependence of IPTG, albeit at different steady state levels, which could reflect different overall stability (Figure 4).
Figure 4. Expression of peptides.
Western blot with antiFLAG antibody for the three individual clones and the whole library. Left side: after induction of the promoter with IPTG, right side control without induction.
To test whether the clones have an advantage with respect to clones harboring only an empty plasmid, we ran a standard 4-cycle experiment as above, amplified the inserts and quantified the DNA on an Agilent gel chip. We find that all three clones show an increase in frequency over time (Figure 5A). We tested further whether they would also show this in competition with each other in different combinations. This is indeed the case; all are better than the empty vector (Figure 5B).
Figure 5. Growth competition experiment with three selected clones.
A) Competition against empty vector separately with each of the clones. B) Competition in different combinations of vector and clones. C) Competition between vector and stop codon version of the respective clones. v = vector, 4 = clone 4, 32 = clone 32, 600 = clone 600. Note that in each experiment we mixed the cells of the corresponding clones in equal starting amounts, but the subsequent PCR favored different fragments to different extents. Hence, only the trajectories are comparable, not the absolute values.
Given that the bioactivity could be conveyed either by the transcribed RNA or the translated peptide, we produced versions of these clones harboring a stop codon directly at the start of the random part of the sequence, i.e. only the first four amino acids that are common among all clones would be translated. These mutated clones were also tested in pairwise competition assays with the empty vector. Only one of the clones (clone 600) showed a clear difference between the mutated and the non-mutated version (Figure 5C), which would suggest that only this clone exerts its effect via the encoded peptide, while the two other clones might act through their RNA alone. To study this in more detail, we did an experiment with a direct competition of each clone with its stop codon counterpart, but with the same qualitative results (suppl. Figure 3).
Discussion
Our experiments show that an unexpectedly large fraction of random RNA or peptide sequences are bioactive, at least in the sense of influencing relative growth rates in E. coli cells. The results imply that it could be either the RNA itself, or the corresponding translated protein that conveys the bioactivity. Although two of our three individually tested clones suggest that the RNA function could be more important than the protein function, this constitutes at present only a small sample and may not be indicative of the true ratio between RNA and peptide functions. However, this observation fits well with the notion that an active RNA may precede an active peptide during de novo evolution of genes6,10–12.
Previous studies have shown that specific biochemical activities or specific resistance against stress conditions can be recovered from random peptides using very large sample sizes and directed selection experiments14,16. However, this occurred only at low frequencies and required multiple rounds of selection. Our results show that a non-directional approach in which only proxies for biological fitness are considered, rather than specific activities, recovers a much higher fraction of bioactive RNAs and peptides and thus allows a better understanding of the functional potential of the unexplored random sequence space for molecular innovation.
Almost any random RNA could fold into a higher order structure, or interact with other RNAs via base pairing, although the free energies and interaction would be expected to be weak. For peptides, one could expect that they interact via charged or hydrophobic interactions with other molecules. They would not need to fold into a stable structure to do this. Many proteins exist that are partly or fully made up of intrinsically disordered protein regions17,18. One can assume that such disordered peptides or proteins can associate with molecular complexes and can influence their activity. This can be more or less specific, i.e. only a single complex, or multiple complexes are affected19,20. But as long as the specific effect or activity is reproducible, it becomes functionally - and thus also evolutionarily - relevant.
Negative effects of expressed peptides may not necessarily be very specific, given that a strong promoter is used in the expression vector and that some peptides might simply aggregate and thus harm the cell. However, since our first frequency measurement is taken at the end of cycle #1 where cells have already grown under induction conditions, we expect that the strongly deleterious peptides are already mostly lost. Hence, we consider even the negative effects as possibly specific, i.e. in the sense that they do not simply block the whole cell physiology.
We find a very high reproducibility in our experiments, both with respect to trends across cycles, but also between and within experimental setups. Further, we show for three clones that their effects are also measurable in isolation. Evidently, it will now be of interest to study many more isolated clones in a range of conditions, to assess whether they are broadly active, or generate a growth effect only within a limited range of conditions.
Our current results suggest that a large fraction of all possible random sequences may have some biochemical activity of biological relevance, at least in E. coli, but possibly in any other cellular organisms too. Still, it will require testing more libraries with different clone compositions to confirm this notion. Similarly, although we find more deleterious than advantageous clones, it is too early to speculate on the exact relative frequencies between them. Our results show that this ratio is heavily dependent on sequencing depth and we expect that it may also depend on growth conditions and other environmental variables. Further, different cells might show different responses. It will therefore be necessary to do similar experiments also with other bacterial and/or eukaryotic cells with a range of different conditions.
We note that our findings may also have practical implications. Assuming that one can show that a given RNA or peptide interacts with a specific cellular process, such molecules could be seen as novel probes for studying cell physiology and growth. Random expression libraries could also be used for screening approaches similar to the ones that have been developed for short hairpin RNA libraries (shRNA21), to identify specific RNAs or peptides that influence particular pathways or physiological states. This could also lead to novel procedures to identify pharmaceutically relevant molecules.
Conclusion
The results presented here suggest that an unexpectedly high fraction of randomly combined nucleotide or amino acid chains are biologically active and can influence fitness. This lends credence to the possibility that de novo evolution of genes has played a significant role in evolutionary history. But the finding raises also many new questions, including the one of whether any of these molecules would be able to form a stable fold on its own, or whether they can only act in interactions with stable molecules or complexes. We can now experimentally address these and many other questions related to de novo gene evolution and their implications for the cell and the organism.
Methods
Construction of a library of random inserts in an expression vector
We opted for a length of 150nt of random sequence to be expressed, corresponding to 50aa when fully translated. This length was chosen since this should give peptides a higher chance to interact with other components of the cell machinery. At the same time it is short enough to allow full-length sequencing by an Illumina-based approach. We used the pFLAG-CTC™ expression vector (Sigma-Aldrich, catalog no. E8408) for cloning. This includes the strong Ptac promoter (a hybrid of the trp and lac promoters from E. coli) regulated by the presence of the lacO sequences and inclusion of the lac repressor gene (lacI) on the plasmid. It drives a transcript that includes a ribosome binding site and a start codon, as well as a C-terminal FLAG sequence with a stop codon. The oligo-nucleotide including the randomized sequence was cloned between the HindIII and SalI sites of the multiple cloning site. The insert sequence was synthesized as an oligo-nucleotide of the following sequence:
5´-ACGTCCAAGCTTAGC(N150)GCATTGGTCGACGTA-3´
whereby (N150) represents 150 nucleotides that were synthesized as equimolar mixes of A, C, G and T at every position. This oligo-nucleotide pool was purified on an 8% acrylamide gel and amplified using the following primers:
Oligo fwd: 5’-ACGTCCAAGCTTAGC-3’ / Oligo rev: 5’-TACGTCGACCAATGC-3’
The double stranded product was digested with HindIII and SalI and ligated into the digested vector. This results in the predicted peptide sequence:
ATGAAGCTTAGC NNN GCATTGGTCGACTACAAGGACGATGACGACAAGTGA
MetLysLeuSer (aa50) AlaLeuValAspTyrLysAspAspAspAspLysSTOP
whereby (aa50) denotes 50 amino acids in random combinations. Positions in italics represent the parts provided by the vector, including the FLAG sequence. The ligation products were transformed into E. coli DH10B cells by electro-transformation. The initially transformed cells were grown without IPTG induction to stationary phase to generate the library stock, which was frozen after adding 20% glycerol.
Growth competition experiments
The design of the experiments was aimed to identify clones that would consistently show a frequency change across multiple cycles of growth, whereby all cycles are run under the same culture conditions. The following general setup was used: (1) generate a pre-culture by inoculating 25mL LB medium (Invitrogen; cat. 12780-052) supplemented with 50µg/mL ampicillin (AMP) (Roth; cat. K029.1) with 500µL from the library stock and grow overnight at 37°C at 250rpm shaking conditions in an Erlenmeyer flask. (2) inoculate up to 10 replicates with 500µL each of the pre-culture in 5mL fresh LB medium + AMP + IPTG (Sigma; cat. I1284; 1mM final concentration) in 14mL tubes with snap lid (Falcon, 17 x 100mm, cat. 352057); grow overnight at 37°C at 250rpm shaking conditions; this is cycle #1. (3) repeat the last step, but use 500µL each of the culture from the previous step until cycle #4.
This setup means that the cells from cycle #1 have already grown under induction conditions, i.e. the frequency of their clones may already have been changed in comparison to the starting frequency in the library or the pre-culture. We chose this setup to ensure that all frequency estimates from the sequencing of the clones (see below) come from samples grown under the same conditions. Further, given the inoculation with a high number of cells (approx. 10⁸) at each cycle, one assures that there is no dilution effect with respect to the number of clones in the library (approx. 10⁶). The inoculation with the high number of cells implies also that there are only about 3-4 generations until the new stationary phase is reached.
A total of seven experiments were performed according to this scheme, all with four cycles. In three experiments (E1-E3) each cycle lasted 3 hours, and in three others (E4-E6) each cycle lasted 24h (i.e. a prolonged stationary phase), with 10 replicates each. A further experiment (E7) was done with 24h cycles and 5 replicates.
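The estimate of about 3-4 generations per cycle follows from the dilution factor at each passage. The sketch below is a worked illustration under simplifying assumptions that are ours: the culture regrows to the same stationary density as the culture used for inoculation, and cell death is ignored. The function name is hypothetical.

```python
import math


def generations_to_regrow(inoculum_ml: float, fresh_medium_ml: float) -> float:
    """Doublings needed to return to the density of the inoculating culture.

    Assumes the new culture reaches the same final density as the old one.
    """
    dilution_factor = (inoculum_ml + fresh_medium_ml) / inoculum_ml
    return math.log2(dilution_factor)


if __name__ == "__main__":
    # 500 µL of the previous culture added to 5 mL of fresh medium
    print(round(generations_to_regrow(0.5, 5.0), 2))
```

With 500µL added to 5mL the dilution is 11-fold, which corresponds to roughly 3.5 doublings per cycle.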
Sampling and sequencing
For all experiments and after each cycle, 2mL were used for plasmid extraction (QIAGEN Plasmid Mini Kit; cat. 12125). Library preparation was done using PCR amplicon sequencing, targeted to the periphery of the insert on the plasmid, and providing the primers for subsequent sequencing, using the following primers:
Fwd primer: 5’-ATGATACGGCGACCACCGAGATCTACACNNNNNNNNTATGGTAATTGTCATCATAACGGTTCTGGCAAATATTC-3’
Rev primer: 5’-CAAGCAGAAGACGGCATACGAGATNNNNNNNNAGTCAGTCAGCCCTGTATCAGGCTGAAAATCTTCT-3’
The stretch of eight Ns indicates a region where sequencing barcodes were placed to distinguish the different replicates. After PCR, the samples were pooled together. Sequencing was carried out with an Illumina MiSeq sequencer, following the standard amplicon sequencing protocol, and using the MiSeq Reagent Kit v3 for 600 cycles (cat. MS-102-3003) to produce 300 bp paired-end reads from each sequenced fragment.
Single clone recovery
To obtain single clones from the library, we used PCR primers based on the determined sequences of the clones, facing away from each other. Stop codons at the desired positions were engineered by modifying one of the primers at its 5`-end to code for a stop codon. Amplification then yields the full vector that needs only to be religated. However, to ensure that the vector had not suffered a mutation, we re-cloned the inserts of the recovered clones into a new common vector. All inserts obtained in this way were re-sequenced to confirm the original sequence.
We used Western blots to check for the expression of the respective peptides. 1.4 mL overnight culture were spun down and resuspended in 100μL Laemmli buffer with 5% β-mercaptoethanol, then samples were incubated at 99°C for 5 minutes and debris was centrifuged down. 30μL were loaded onto a 4-20% tris-glycine gel (Bio-Rad) and run for 1 hour 40 minutes at 70 volts. The proteins were then transferred to PVDF membrane for 15 minutes at 13 volts using a Bio-Rad semi-dry electroblot unit. The membrane was washed 2 x 10 minutes with gentle shaking in PBS with 0.1% tween 20 (PBST) and then blocked in 5% powdered milk (1% fat) dissolved in PBST with shaking at room temperature for one hour. The monoclonal mouse anti-FLAG M2 antibody (F1804 Sigma) was added, diluted 1 in 2000 in 2.5% milk PBST. The membrane was incubated overnight with shaking in a cold room (approx. 6 °C). The membrane was washed 3 x 10 minutes in PBST with shaking. Goat-anti mouse HRP (A16072 Thermo-Fisher) diluted 1 in 2500 in 2.5% milk PBST was added and incubated with shaking at room temperature for one hour. The membrane was washed 3 x 10 minutes in PBST with shaking. ECL (Clarity Western ECL from Bio-Rad) was pipetted onto the blot (approximately 3 mL per blot) and incubated for 5 minutes, then blotted with thick filter paper and protected from light. The membrane was then imaged using a digital imager (Alpha Innotech) with increasing exposures until bands were well visible.
Single clone competition experiments
The competition experiments with individual clones or combinations thereof were done under the same conditions as for the 24h cycles. But instead of proceeding with a sequencing step (see above), we amplified the inserts by PCR and ran the products on an Agilent Biochip (DNA 7500). To distinguish the sizes of the clones with inserts, we digested them first with diagnostic restriction enzymes. The Agilent software for quantification of the bands was then used to obtain concentration differences of the fragments between time points.
Data processing and analysis
Fastq paired-end reads were collapsed into a single fasta sequence each using usearch22. Whenever conflicting bases were detected between pairs, those with the best quality score were retained. The translated peptide sequence was obtained from each merged sequence using getorf from the emboss suite23. Only those ORFs starting and ending with the expected sequences (see above), and having exactly 65 amino acid residues (50 from the randomized sequences and 15 from the vector, including the tag) were retained for further analyses.
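The ORF filter described above can be illustrated with a small stand-alone check. This is a sketch only: the study used getorf from the emboss suite, whereas the code below uses a hand-written translation with the standard genetic code and assumes that the input already starts at the ATG start codon. The flanking peptide sequences are taken from the construct shown in the library construction section. All function and constant names are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

N_TERMINUS = "MKLS"          # vector-encoded start of the peptide
C_TERMINUS = "ALVDYKDDDDK"   # vector-encoded end, including the FLAG tag
EXPECTED_LENGTH = 65         # 4 + 50 random + 11 residues


def translate(dna: str) -> str:
    """Translate from the first base until the first stop codon or the end."""
    peptide = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3].upper(), "X")  # X for ambiguous codons
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


def is_full_length(orf_dna: str) -> bool:
    """True if the ORF codes for a 65-residue peptide with the expected flanks."""
    peptide = translate(orf_dna)
    return (
        len(peptide) == EXPECTED_LENGTH
        and peptide.startswith(N_TERMINUS)
        and peptide.endswith(C_TERMINUS)
    )
```

A clone with a premature stop codon in the random part yields a shorter peptide and is rejected by this check.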
A non-redundant database was constructed with usearch22 for all experiments using protein sequences at 100% identity, i.e., similar non-identical sequences are treated as independent units. This implies that this database includes translated sequences with possible sequencing errors or PCR induced mutations.
It was possible to estimate the error rates per sequencing run using the first 85nt of plasmid sequences in the reads. We cropped the reads to this length using Unix shell scripts, mapped them to the reference plasmid sequence with NextGenMap24, and determined the percentage of mismatches using samtools fillmd25 to assess substitutions as a proxy for errors. We found error rates between 0.12% and 0.56%. Given these low rates, we did not try to curate the database further. The sequences of each replicate in each experiment were matched to the database using diamond26. This provided a quantitative representation of each sequence in each cycle and each replicate, as well as across experiments. These counts were used to compare the changes in frequency of each clone over time.
Statistical procedure
The number of times a clone was observed was recorded and the counts for each time point were determined. Very low frequency clones cannot be statistically analyzed. For this reason, we required at least five occurrences in any one replicate of an experiment to consider a statistical analysis of a given clone. The further statistical analysis is based on procedures designed for differential gene expression15, but is applicable to any type of count data, in particular those derived from high-throughput sequencing experiments. The analysis was done using the R package DESeq215.
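The inclusion criterion can be expressed as a simple filter on the count table. The following is a minimal sketch, assuming the counts are held as a mapping from clone identifier to a list of read counts with one entry per replicate. The data layout and function name are illustrative and are not taken from the original analysis scripts.

```python
def clones_passing_threshold(counts, min_count=5):
    """Return the clones seen at least min_count times in at least one replicate.

    counts: dict mapping clone id -> list of read counts (one per replicate).
    """
    return sorted(
        clone
        for clone, per_replicate in counts.items()
        if any(c >= min_count for c in per_replicate)
    )


if __name__ == "__main__":
    example = {
        "clone_a": [0, 1, 7],   # kept: one replicate reaches the threshold
        "clone_b": [4, 4, 4],   # dropped: never reaches five reads
        "clone_c": [5, 0, 0],   # kept: exactly five reads counts as passing
    }
    print(clones_passing_threshold(example))
```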
Clones with significantly different frequencies between the first time point and the last time point were recorded. By including comparisons at the other two time points, they were further categorized into increasing or decreasing in frequency across time points when the tendency was consistent.
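Significance throughout is reported at a 5% FDR. DESeq2 adjusts p-values with the Benjamini-Hochberg procedure by default; whether the default was used here is our assumption. The sketch below shows that adjustment in isolation and is not a substitute for the full DESeq2 model, which also estimates dispersions and fits a negative binomial model. The function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_at_fdr(pvalues, fdr=0.05):
    """Indices of tests whose adjusted p-value is at or below the FDR level."""
    return [i for i, q in enumerate(benjamini_hochberg(pvalues)) if q <= fdr]
```

A clone would then be called enriched or depleted according to the sign of its fold change if its adjusted p-value passes the threshold.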
Comparison across experiments
Results from different experiments were compared by collecting all clones with any significance across all experiments, and recording the direction of the fold change (enrichment or depletion). Clones with opposing effects in any two experiments were not further considered to avoid inflation from false positives, and these remained only a minor fraction from the overall detected clones.
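The cross-experiment comparison can be sketched as follows. This is an illustration under an assumed data layout: for each experiment, a mapping from clone identifier to +1 (enriched) or -1 (depleted) for the clones that were significant in that experiment. The names are hypothetical.

```python
def consistent_directions(calls_by_experiment):
    """Keep clones whose significant calls agree in direction across experiments.

    calls_by_experiment: dict experiment -> dict clone id -> +1 or -1.
    Returns a dict clone id -> direction, omitting clones with opposing calls.
    """
    seen = {}
    for calls in calls_by_experiment.values():
        for clone, sign in calls.items():
            seen.setdefault(clone, set()).add(sign)
    return {
        clone: next(iter(signs))
        for clone, signs in seen.items()
        if len(signs) == 1
    }


if __name__ == "__main__":
    calls = {
        "E1": {"clone_a": 1, "clone_b": -1},
        "E2": {"clone_a": 1, "clone_b": 1, "clone_c": -1},
    }
    print(consistent_directions(calls))  # clone_b is dropped
```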
Rarefaction analyses
Experiment E7, which was sequenced intensively, was used to estimate the effect of sampling depth on the discovery of enriched or depleted clones. All replicates were normalized to 50,000 clones each, giving the whole experiment a total of 1 million sequences. Random subsamples were drawn at 10% intervals and analyzed as described above.
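The subsampling step can be sketched as follows. The sketch assumes that reads are drawn without replacement and that each depth is drawn independently from the full data set; neither detail is stated above, and the function names are hypothetical.

```python
import random
from collections import Counter


def subsample_counts(counts, percent, rng):
    """Draw percent% of all reads without replacement; return clone counts."""
    # Expand the count table into one entry per read
    reads = [clone for clone, n in counts.items() for _ in range(n)]
    k = len(reads) * percent // 100
    return Counter(rng.sample(reads, k))


def rarefaction_series(counts, rng, step=10):
    """Subsample a count table at step% intervals from step% up to 100%."""
    return {
        percent: subsample_counts(counts, percent, rng)
        for percent in range(step, 101, step)
    }
```

Passing a seeded generator, for example `random.Random(1)`, makes the subsamples reproducible. Each subsampled count table would then be passed through the same filtering and testing steps as the full data.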
Sequence properties
To assess whether the enriched or depleted peptides behave like random sequences or like biological protein sequences, we simulated random 150 nt RNA sequences and added the vector sequence to obtain translations equivalent to those produced experimentally. As biological controls, we obtained all E. coli protein sequences deposited in GenBank and split them into fragments of 65 aa. We extracted simple compositional properties from the sequences using the package protr27 and ran Wilcoxon rank tests to compare the properties of the experimental peptides with those of the random and biological sequences.
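The construction of the two control sets can be illustrated with a short Python sketch. It makes several assumptions that are not specified above: the four bases are drawn with equal probability, protein fragments are non-overlapping with any trailing remainder discarded, and amino acid composition stands in for the properties computed with protr. The vector sequence and the translation step are omitted, and all function names are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_rna(length, rng):
    """Random RNA sequence with equal base probabilities (an assumption)."""
    return "".join(rng.choice("ACGU") for _ in range(length))


def fragment(protein, size=65):
    """Split a protein into non-overlapping fragments of exactly size residues."""
    return [
        protein[start:start + size]
        for start in range(0, len(protein) - size + 1, size)
    ]


def aa_composition(peptide):
    """Fraction of each of the 20 standard amino acids in a peptide."""
    if not peptide:
        raise ValueError("empty peptide")
    counts = Counter(peptide)
    return {aa: counts[aa] / len(peptide) for aa in AMINO_ACIDS}
```

Each compositional property would then be compared between the experimental peptides and each control set with a rank-based test, as described above.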
Supplementary Material
Acknowledgements
We thank S Künzel for sequencing and E Özkurt for contributions during her rotation project. The project was financed through an ERC advanced grant to DT (NewGenes - 322564). Results described in this manuscript are subject to patent application.
Footnotes
Data availability
The data tables with read counts and corresponding statistics for each experiment are provided at Dryad under the accession number xxxx.
Conflict of financial interests
The work described in this publication is subject to patent application by the Max-Planck Society.
Author contributions
RN and DT designed the experiment, CA constructed the library, CA, BY and EM conducted the experiments, RN did the bioinformatic analysis, RN and DT wrote the paper.
Cited Literature
- 1. Jacob F. Evolution and Tinkering. Science. 1977;196:1161–1166. doi: 10.1126/science.860134.
- 2. Tautz D. The discovery of de novo gene evolution. Perspectives in Biology and Medicine. 2014;57:149–161. doi: 10.1353/pbm.2014.0006.
- 3. Chothia C. Proteins - 1000 families for the molecular biologist. Nature. 1992;357:543–544. doi: 10.1038/357543a0.
- 4. Lupas AN, Ponting CP, Russell RB. On the evolution of protein folds: Are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? Journal of Structural Biology. 2001;134:191–203. doi: 10.1006/jsbi.2001.4398.
- 5. Orengo CA, Thornton JM. Protein families and their evolution - A structural perspective. Annual Review of Biochemistry. 2005;74:867–900. doi: 10.1146/annurev.biochem.74.082803.133029.
- 6. Carvunis AR, et al. Proto-genes and de novo gene birth. Nature. 2012;487:370–374. doi: 10.1038/nature11184.
- 7. Reinhardt JA, et al. De Novo ORFs in Drosophila Are Important to Organismal Fitness and Evolved Rapidly from Previously Non-coding Sequences. Plos Genetics. 2013;9. doi: 10.1371/journal.pgen.1003860.
- 8. Zhao L, Saelao P, Jones CD, Begun DJ. Origin and Spread of de Novo Genes in Drosophila melanogaster Populations. Science. 2014;343:769–772. doi: 10.1126/science.1248286.
- 9. Neme R, Tautz D. Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence. Elife. 2016;5. doi: 10.7554/eLife.09977.
- 10. Tautz D, Domazet-Loso T. The evolutionary origin of orphan genes. Nature Reviews Genetics. 2011;12:692–702. doi: 10.1038/nrg3053.
- 11. Xie C, et al. Hominoid-Specific De Novo Protein-Coding Genes Originating from Long Non-Coding RNAs. Plos Genetics. 2012;8. doi: 10.1371/journal.pgen.1002942.
- 12. Ruiz-Orera J, Messeguer X, Subirana JA, Alba MM. Long non-coding RNAs as a source of new peptides. Elife. 2014;3. doi: 10.7554/eLife.03523.
- 13. Barrick JE, Lenski RE. Genome dynamics during experimental evolution. Nature Reviews Genetics. 2013;14:827–839. doi: 10.1038/nrg3564.
- 14. Stepanov VG, Fox GE. Stress-driven in vivo selection of a functional mini-gene from a randomized DNA library expressing combinatorial peptides in Escherichia coli. Molecular Biology and Evolution. 2007;24:1480–1491. doi: 10.1093/molbev/msm067.
- 15. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15. doi: 10.1186/s13059-014-0550-8.
- 16. Keefe AD, Szostak JW. Functional proteins from a random-sequence library. Nature. 2001;410:715–718. doi: 10.1038/35070613.
- 17. Uversky VN, Dunker AK. Understanding protein non-folding. Biochimica Et Biophysica Acta-Proteins and Proteomics. 2010;1804:1231–1264. doi: 10.1016/j.bbapap.2010.01.017.
- 18. Tompa P, Schad E, Tantos A, Kalmar L. Intrinsically disordered proteins: emerging interaction specialists. Current Opinion in Structural Biology. 2015;35:49–59. doi: 10.1016/j.sbi.2015.08.009.
- 19. Cumberworth A, Lamour G, Babu MM, Gsponer J. Promiscuity as a functional trait: intrinsically disordered regions as central players of interactomes. Biochemical Journal. 2013;454:361–369. doi: 10.1042/bj20130545.
- 20. Tompa P, Davey NE, Gibson TJ, Babu MM. A Million Peptide Motifs for the Molecular Biologist. Molecular Cell. 2014;55:161–169. doi: 10.1016/j.molcel.2014.05.032.
- 21. Sims D, et al. High-throughput RNA interference screening using pooled shRNA libraries and next generation sequencing. Genome Biology. 2011;12. doi: 10.1186/gb-2011-12-10-r104.
- 22. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. doi: 10.1093/bioinformatics/btq461.
- 23. Rice P, Longden I, Bleasby A. EMBOSS: The European molecular biology open software suite. Trends in Genetics. 2000;16:276–277. doi: 10.1016/s0168-9525(00)02024-2.
- 24. Sedlazeck FJ, Rescheneder P, von Haeseler A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics. 2013;29:2790–2791. doi: 10.1093/bioinformatics/btt468.
- 25. Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 26. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 2015;12:59–60. doi: 10.1038/nmeth.3176.
- 27. Xiao N, Cao DS, Zhu MF, Xu QS. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics. 2015;31:1857–1859. doi: 10.1093/bioinformatics/btv042.