2008 Nov 28;4(11):e1000176. doi: 10.1371/journal.pcbi.1000176

Figure 1. Incidence of open reading frames (ORFs) in randomly generated transcripts of increasing length.

Twenty thousand transcripts of varying length and random nucleotide composition were computationally generated and scanned for ORFs. The maximum ORF length of each transcript was plotted against transcript length and fitted to a logarithmic curve. The shaded regions represent the incidence of randomly occurring ORFs at 1, 2, or 3 standard deviations from the mean. The red line indicates the 300 nt ORF threshold that is commonly used to distinguish protein-coding genes in transcript classification pipelines. The plot illustrates that, for transcripts longer than ∼1000 bp, such a threshold may classify as protein-coding transcripts whose ORFs would be expected to occur by chance. The function y = 91 ln(x) − 330, where x is the transcript length and y the ORF length, approximates random ORF incidence at two standard deviations above the mean (i.e., 95% confidence interval, indicated in green) and could be used to discriminate noncoding from protein-coding transcripts in a transcript-length–dependent manner.
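
The length-dependent cutoff described in the caption can be illustrated with a minimal Python sketch. Only the formula y = 91 ln(x) − 330 and the fixed 300 nt threshold come from the caption. The function names, the ORF definition (ATG to the first in-frame stop, forward strand only, stop codon counted in the length), and the uniform base composition used for random transcripts are assumptions made for illustration; the caption does not specify how ORFs were defined or how the sequences were generated.

```python
import math
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_length_threshold(transcript_length: float) -> float:
    """Length-dependent ORF cutoff from the caption: y = 91 ln(x) - 330."""
    if transcript_length <= 0:
        raise ValueError("transcript length must be positive")
    return 91.0 * math.log(transcript_length) - 330.0


def longest_orf_nt(seq: str) -> int:
    """Longest ORF in nucleotides over the three forward frames.

    Assumption: an ORF runs from an ATG to the first in-frame stop codon,
    and the stop codon is included in the reported length.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def exceeds_random_expectation(max_orf_nt: int, transcript_length: int) -> bool:
    """True if the ORF is longer than the length-dependent cutoff."""
    return max_orf_nt > orf_length_threshold(transcript_length)


def random_transcript(length: int, rng: random.Random) -> str:
    """Random sequence with uniform base composition (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    # At 1000 nt the formula gives roughly the fixed 300 nt threshold.
    print(round(orf_length_threshold(1000), 1))  # 298.6

    # A 300 nt ORF passes the cutoff in a 500 nt transcript but not in a
    # 5000 nt transcript, where longer chance ORFs are expected.
    print(exceeds_random_expectation(300, 500))   # True
    print(exceeds_random_expectation(300, 5000))  # False

    # Longest chance ORF in one random 3000 nt transcript (value varies
    # with the seed, so no expected output is given).
    rng = random.Random(0)
    print(longest_orf_nt(random_transcript(3000, rng)))
```

Under this formula the cutoff equals about 300 nt at a transcript length of about 1000, which matches the caption's observation that the fixed threshold becomes permissive for longer transcripts.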