Progressive partitioning reveals splice site consensus sequences. (A) The figure summarizes the consensus sequences that emerge from the following techniques: (i) The information profile of regions around the donor and acceptor sites of all 10,057 introns (Suppl. Fig. S3) showed >0.5 bits of information at donor positions D - 1 to D + 6 and acceptor positions A - 5 and A - 3 to A - 1. (Nucleotide positions are defined relative to the splice site junctions.) (ii) Partitioning introns by length (Figs. 2 and 3) showed that longer introns had at least 0.5 bits of information at the additional positions A - 11 to A - 9 and A - 6. Underlined nucleotides indicate positions whose information increases by at least 0.5 bits over the length range 64 to 32,767 nt (Fig. 3). The position D + 5 (square border) has the least information for intermediate lengths, and progressively higher information for shorter and longer lengths (Fig. 3). (iii) Partitioning introns by forced mismatches (Fig. 5; Suppl. Fig. S7) showed >0.5 bits of information at the additional positions D - 2 and A - 14, and revealed a preference for A at D + 3. Segments of U1, U5, and U6 snRNAs are aligned with the donor and acceptor site sequences with which they are thought to interact through base-pairing or non-Watson-Crick interactions. Pre-mRNA regions where snRNPs and associated proteins are thought to bind are indicated with brackets; these include U2AF35(38), U2AF65(58), and U5 snRNP Prp8 (vertebrate homolog p220). Also illustrated are regions where forced mismatches (Fig. 5; Suppl. Figs. S5, S6, S7) led to A enrichment and, to a lesser extent, U enrichment. This bias in composition may promote the splicing reaction by favoring reduced RNA secondary structure and enhanced binding of protein or nucleic acid components of the spliceosome. Forced mismatches gave particularly strong A enrichment at D - 2 and D - 3, where U5 snRNA binds.
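As an illustration of how a per-position information value in bits can be obtained from aligned splice site sequences, the following is a minimal sketch. It assumes the common definition of information as the maximum entropy (2 bits for four nucleotides) minus the observed Shannon entropy at each position, with no small-sample correction; the caption does not state the exact formula used, and the function name `information_profile` is hypothetical.

```python
import math
from collections import Counter


def information_profile(seqs, alphabet="ACGU"):
    """Return the information (bits) at each position of equal-length aligned sequences.

    Assumption: information = log2(alphabet size) - Shannon entropy,
    without any small-sample correction.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    max_bits = math.log2(len(alphabet))  # 2 bits for A, C, G, U
    profile = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts[b] for b in alphabet)
        entropy = 0.0
        for b in alphabet:
            if counts[b]:
                p = counts[b] / total
                entropy -= p * math.log2(p)
        profile.append(max_bits - entropy)
    return profile


if __name__ == "__main__":
    # Toy alignment of positions D + 1 and D + 2 (illustrative data only).
    toy = ["GU", "GU", "GU", "GA"]
    for pos, bits in enumerate(information_profile(toy), start=1):
        flag = "above 0.5 bits" if bits > 0.5 else "at or below 0.5 bits"
        print(f"D + {pos}: {bits:.3f} bits ({flag})")
```

A fully conserved position gives 2 bits and a position with uniform nucleotide usage gives 0 bits, so the 0.5-bit threshold used in the caption marks positions with a modest but clear compositional bias.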
(B) The graph shows the frequency distributions of the number of matches to the donor consensus sequence AGGUAAGU (D - 2 through D + 6) and the acceptor consensus sequence UNNUUUNNUUNCAG (A - 14 through A - 1) for short introns (64-80 nt) and long introns (8192-32,767 nt). Assuming that the distributions are approximately normal, an approximate t-test using the means and standard deviations of the two distributions shows that, in both the donor and acceptor cases, the distribution means are higher for the longer introns with greater than 99% confidence (p > 0.99). The distribution means and standard deviations are listed in parentheses.
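The match counting and the comparison of means described in (B) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the figure: N positions in the acceptor consensus are assumed not to count as matches, the t statistic is computed in the unequal-variance (Welch) form, and the function names `count_matches` and `welch_t` are hypothetical.

```python
import math
from statistics import mean, stdev

DONOR_CONSENSUS = "AGGUAAGU"           # D - 2 through D + 6
ACCEPTOR_CONSENSUS = "UNNUUUNNUUNCAG"  # A - 14 through A - 1


def count_matches(site, consensus):
    """Count positions where the site equals the consensus.

    Assumption: N in the consensus means "any nucleotide" and is not
    counted as a match.
    """
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(1 for s, c in zip(site, consensus) if c != "N" and s == c)


def welch_t(short_counts, long_counts):
    """Approximate t statistic for mean(long) - mean(short), unequal variances."""
    var_short = stdev(short_counts) ** 2
    var_long = stdev(long_counts) ** 2
    std_err = math.sqrt(var_short / len(short_counts) + var_long / len(long_counts))
    return (mean(long_counts) - mean(short_counts)) / std_err


if __name__ == "__main__":
    # Illustrative donor sites only; not data from the study.
    short_sites = ["AGGUAUGC", "CAGUGAGU", "UGGUAAAU"]
    long_sites = ["AGGUAAGU", "AGGUGAGU", "CGGUAAGU"]
    short_counts = [count_matches(s, DONOR_CONSENSUS) for s in short_sites]
    long_counts = [count_matches(s, DONOR_CONSENSUS) for s in long_sites]
    print("short:", short_counts, "long:", long_counts)
    print("t =", round(welch_t(short_counts, long_counts), 3))
```

A positive t statistic indicates a higher mean match count for the long introns; converting it to a probability requires the t distribution with the appropriate degrees of freedom, which is omitted here.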