Abstract
Replication error deficient (RER+) colorectal cancers are a distinct subset of colorectal cancers, characterized by inactivation of the DNA mismatch repair system. These cancers are typically pseudodiploid, accumulate mutations in repetitive sequences as a result of their mismatch repair deficiency, and have distinct pathologies. Regulatory sequences controlling all aspects of mRNA processing, especially including message stability, are found in the 3′UTR sequence of most genes. The relevant sequences are typically A/U-rich elements or U repeats. Microarray analysis of 14 RER+ (deficient) and 16 RER− (proficient) colorectal cancer cell lines confirms a striking difference in expression profiles. Analysis of the incidence of mononucleotide repeat sequences in the 3′UTRs, 5′UTRs, and coding sequences of those genes most differentially expressed in RER+ versus RER− cell lines has shown that much of this differential expression can be explained by the occurrence of a massive enrichment of genes with 3′UTR T repeats longer than 11 base pairs in the most differentially expressed genes. This enrichment was confirmed by analysis of two published consensus sets of RER differentially expressed probesets for a large number of primary colorectal cancers. Sequence analysis of the 3′UTRs of a selection of the most differentially expressed genes shows that they all contain deletions in these repeats in all RER+ cell lines studied. These data strongly imply that deregulation of mRNA stability through accumulation of mutations in repetitive regulatory 3′UTR sequences underlies the striking difference in expression profiles between RER+ and RER− colorectal cancers.
Approximately 15% of sporadic colorectal cancers (CRCs) are microsatellite unstable (MSI+) as a result of inactivation of one of the components of the DNA mismatch repair system (1). This is mostly because of a combination of epigenetic silencing or mutation, coupled with loss of heterozygosity, of MLH1 and less frequently MSH2 or MSH6 (2, 3). As a result, such tumors are replication error deficient (RER+), which leads to the accumulation of insertion/deletion mutations in mono-, di-, tri-, and tetranucleotide repeats throughout the genome, probably because of DNA polymerase slippage during replication (4, 5).
RER+ tumors have distinct clinico-pathologic and genetic profiles in comparison with microsatellite stable/replication proficient tumors (MSI−/RER−). RER+ tumors tend to be right sided (proximal), poorly differentiated with mucinous histology, and have a more favorable prognosis (6–8). In addition, they mostly have a near diploid or pseudodiploid karyotype; RER− tumors are mostly chromosomally unstable and have aneuploid karyotypes (9, 10). The underlying genetic instability in RER+ tumors results in a genetic profile that is distinct from RER− tumors and is characterized by mutation of “susceptible” genes containing repeat sequences within their protein coding regions. For example, although the TGF-β signaling pathway is commonly disrupted in CRC, RER+ tumors tend to achieve inactivation of the TGF-β pathway through frame-shift mutations of TGFBR2 (11, 12), whilst RER− tumors reach a similar end through inactivation of SMAD4 (11).
Studies on RER+ CRCs have tended to focus on identifying mutations in mononucleotide repeat sequences within coding regions of genes. These mutations are typically insertions or deletions of 1 or 2 base pairs leading to frame-shifts, which result in either nonsense mediated decay (NMD) of message (12, 13) or truncated proteins. The most highly repetitive DNA sequences, however, occur outside of coding regions, and consequently accumulation of mutations in these regions in RER+ tumors may be of little significance and represent a “bystander” effect of a defective mismatch repair mechanism. However, untranslated regions include the 3′ and 5′UTR sequences of genes, which do affect gene transcription and translation efficiency. Mutations in 3′UTR sequences can, for example, alter the binding properties of trans-acting factors specific for such sequences, leading to de-regulation of message stability and translation efficiency, as discussed below.
A/U rich elements (AREs) in UTRs of message are recognized by an extensive family of RNA-binding proteins, including ELAV (HuR and HuD) proteins and heterogeneous nuclear ribonucleoprotein (hnRNP) proteins. AREs affect mRNA levels either by mediating rapid deadenylation and degradation of message or by modulating translation efficiency (reviewed in ref. 14). AREs can be divided into several classes, depending on the number and positioning of AUUUA repeats and polyU tracts and their impact on polyA shortening in mRNA. Briefly, class I AREs have one to three AUUUA motifs separated by U-rich sequences, and class II AREs have multiple clusters of the AUUUA motif. Class III AREs are less well defined and have mainly U-rich sequences (14). These important regulatory elements interact with RNA binding proteins, which can act to either stabilize or destabilize the mRNA. The RNA binding proteins that bind these sequences play a critical role in regulating mRNA stability, and changes in expression levels of individual hnRNP proteins have been identified in cancers and, in some instances, shown to have prognostic significance [(15–20); reviewed in ref. 20].
We reasoned that mononucleotide repeat sequences within the 3′UTRs of expressed genes might be targeted in RER+ cancers and that this mechanism could contribute to global changes in expression profiles. To examine this hypothesis, we analyzed Affymetrix U133+2 microarray mRNA expression data on 16 RER− and 14 RER+ CRC-derived cell lines. We specifically looked in differentially expressed genes for the presence of mononucleotide repeats in 3′UTR sequences, and then compared the frequency of these repeat sequences with the overall frequencies found in all coding genes. This analysis showed that the enrichment of genes with 3′UTR mononucleotide T-repeat sequences between 10 and 32 base pairs in length accounts for ≈95% of the top 30 most differentially expressed genes. Sequencing of a subset of these genes confirmed that all contained significant deletions in all of the RER+ cell lines but remained relatively stable in the RER− cell lines.
Results
RER Differentially Expressed Genes Are Highly Enriched with Sequences Containing Mononucleotide Repeats Within 3′UTRs.
The microarray gene expression analysis shows a characteristic and strikingly different expression profile for RER+ vs. RER− CRCs, illustrated by the heatmap of the top 100 differentially expressed probesets (Fig. S1). A total of 5,161 probesets were differentially up-regulated (2,986) or down-regulated (2,175) in RER+ vs. RER− cell lines when implementing a Benjamini and Hochberg false-discovery rate (FDR) adjusted P-value cutoff of 0.05. The top 30 differentially expressed genes, based on P values, are listed in Table 1, together with the sizes of their 3′UTR T(n) repeats and the significance of the qRT-PCR expression-level differences between the RER+ and RER− cell lines. Similar data for the top 100 genes are listed in Dataset S1.
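The Benjamini and Hochberg adjustment mentioned above can be sketched in a few lines. This is a minimal illustration of the standard step-up procedure, not the Partek Genomics Suite implementation used in the study; the function names and the strict `<` comparison against the 0.05 cutoff are assumptions made for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that the adjusted values
    # stay monotone (each one is the minimum of p * m / rank over
    # all ranks at or above its own).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant(p_values, cutoff=0.05):
    """Indices whose adjusted P value is below the cutoff (assumed strict)."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q < cutoff]
```

Applied to the per-probeset P values of an RER+ vs. RER− comparison, `significant` would return the positions of the probesets retained at the chosen FDR.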
Table 1.
Top 30 RER differentially expressed genes
| Gene symbol | Mean (RER−) log2n | Mean (RER+) log2n | FDR P value (RER+ vs. RER−) | Fold-change (RER+ vs. RER−) | 3′UTR T(n) ≥T10 | qRT-PCR (P value) |
| PAFAH1B2 | 7.3 | 9.4 | 1.37E-12 | 4.1 | | 0.005 |
| CDC42SE1 | 6.0 | 8.5 | 3.53E-12 | 5.4 | t27 | 0.41 |
| FANCD2 | 7.8 | 9.6 | 3.53E-12 | 3.3 | t13; t19* | 0.17 |
| RAPGEF6 | 5.3 | 7.7 | 3.53E-12 | 5.3 | | |
| LETM1 | 7.8 | 8.8 | 4.80E-11 | 2.0 | t17; t26 | |
| MTA2 | 6.6 | 8.8 | 1.12E-10 | 4.7 | t27* | 3.85E-05 |
| HNRNPL | 10.5 | 12.5 | 1.57E-10 | 4.1 | t11; t26* | 0.0117 |
| RIT1 | 8.0 | 6.4 | 7.11E-10 | −3.1 | t11; t27 | 0.40 |
| STRN3 | 6.2 | 8.8 | 3.66E-09 | 5.8 | t11; t26 | |
| USP28 | 8.0 | 8.8 | 4.41E-09 | 1.8 | t11 | 2.99E-06 |
| TMTC3 | 5.3 | 6.6 | 6.42E-09 | 2.4 | t12; t13; t25 | |
| TOB1 | 10.6 | 7.0 | 8.07E-09 | −12.0 | | |
| CASP2 | 7.8 | 9.0 | 1.93E-08 | 2.2 | t25 | |
| SENP5 | 5.4 | 7.1 | 3.89E-08 | 3.3 | t11; t15; t28 | |
| THRAP3 | 7.7 | 8.7 | 3.92E-08 | 1.9 | t11; t17 | |
| RQCD1 | 7.2 | 8.5 | 4.23E-08 | 2.5 | t28 | |
| TOR1AIP2 | 5.0 | 7.2 | 7.09E-08 | 4.6 | t12; t13; t22 | |
| SET///hCG_1644608 | 11.5 | 12.7 | 8.18E-08 | 2.3 | t24* | 0.002 |
| NUTF2 | 10.6 | 12.1 | 9.93E-08 | 2.8 | t22 | |
| CDC42 | 11.0 | 8.5 | 1.46E-07 | −5.7 | t20 | |
| IQGAP3 | 5.8 | 7.2 | 1.51E-07 | 2.6 | t25 | |
| USF1 | 6.5 | 7.4 | 1.51E-07 | 1.9 | t12; t28 | |
| HNRNPH1 | 11.9 | 12.7 | 1.57E-07 | 1.8 | t13; t22* | 0.001 |
| MARCH6 | 10.8 | 8.6 | 1.97E-07 | −4.3 | | |
| EI24 | 10.1 | 11.1 | 2.16E-07 | 2.1 | | |
| LRRC16A | 9.1 | 7.0 | 4.39E-07 | −4.2 | t28 | |
| CTNNB1 | 11.7 | 12.8 | 6.38E-07 | 2.1 | t19*; t25* | 1.53E-09 |
| ITCH | 7.4 | 9.4 | 6.38E-07 | 4.1 | t27 | |
| TUG1 | 11.6 | 10.2 | 7.44E-07 | −2.6 | | |
| TNRC6B | 8.1 | 6.6 | 1.02E-06 | −2.9 | t10; t12; t14; t15; t25; t27 | |
P values are for qRT-PCR validation. See SI Materials and Methods for details and explanation of genes in non-boldface text.
*Genes where 3′UTR repeat sequences were sequenced.
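The fold-change column of Table 1 uses signed values, with decreases in RER+ lines shown as negative numbers. A minimal sketch of how such a value can be derived from the two log2 group means is given here; the function name is hypothetical, and the signed convention (reporting −12 rather than 1/12) is inferred from the table rather than stated in the text. Because the tabulated means are rounded to one decimal place, results recomputed from them differ slightly from the published column (for example, about 4.3 rather than 4.1 for PAFAH1B2).

```python
def signed_fold_change(mean_log2_ref, mean_log2_test):
    """Signed linear fold change (test vs. reference) from log2 group means."""
    diff = mean_log2_test - mean_log2_ref
    # A difference of d on the log2 scale is a ratio of 2**d.
    ratio = 2.0 ** abs(diff)
    # Decreases are reported as negative fold changes rather than fractions.
    return ratio if diff >= 0 else -ratio
```

Here `mean_log2_ref` would be the RER− mean and `mean_log2_test` the RER+ mean for a given gene.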
The frequency of each mononucleotide repeat length for the 3′UTR, 5′UTR, and coding sequence of all protein-coding genes is shown in Fig. S2A and for all genes with an Affymetrix U133+2 ID in Fig. S2B. The actual numbers of genes containing each individual 3′UTR repeat length (from A2 to A32; G2 to G32; C2 to C32; T2 to T32) in all of the datasets analyzed is given in Dataset S2. These data were then used to calculate the expected frequencies in the RER differentially expressed gene datasets for each repeat length separately, and also for repeat lengths when considered in approximate terciles representing lengths between 2 and 10, 11 and 20, and 21 and 32 base pairs, respectively. Increases in observed vs. expected frequencies (enrichment) of mononucleotide repeats divided into terciles for 3′UTR, 5′UTR, and coding sequences in each of the differentially expressed datasets (using the data for all protein coding genes as a reference dataset) are given in Tables 2 and 3 and Fig. 1 for 3′UTR analysis, and Tables S1–S4 for 5′UTR and coding sequences.
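The observed vs. expected comparison can be illustrated with a short sketch. It assumes that the expected count for a subset is the reference count scaled by the ratio of subset size to reference size, which reproduces the tabulated values in Table 2 (for example, 400 reference T21 to T32 repeats and 28 genes give an expected 0.6 and, with 21 observed, a 36.6-fold increase). The exact procedure is described in SI Materials and Methods, so the function names and this formula should be read as an inferred approximation.

```python
REFERENCE_GENES = 19_497  # protein-coding genes in the reference set (Table 2)


def expected_count(reference_count, subset_genes, reference_genes=REFERENCE_GENES):
    """Expected number of repeats in a subset if it resembled the reference set."""
    return reference_count * subset_genes / reference_genes


def fold_increase(observed, reference_count, subset_genes,
                  reference_genes=REFERENCE_GENES):
    """Observed / expected; reported as 0.0 when nothing is expected."""
    expected = expected_count(reference_count, subset_genes, reference_genes)
    return observed / expected if expected > 0 else 0.0
```

For instance, `fold_increase(21, 400, 28)` corresponds to the T21 to T32 tercile for the top 30 genes (n = 28).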
Table 2.
Enrichment of 3′UTR mononucleotide repeats sequences in RER differentially expressed genes in the 30 cell lines, using all protein coding genes as a reference set
| 3′UTR mononucleotide repeat length | Reference protein-coding genes (n = 19,497) | Top 30 genes (n = 28): Observed | Top 30 genes: Expected | Top 30 genes: Fold increase | Top 100 genes (n = 95): Observed | Top 100 genes: Expected | Top 100 genes: Fold increase | Top 800 probesets (n = 439): Observed | Top 800 probesets: Expected | Top 800 probesets: Fold increase |
| a2-10 | 81,822 | 135 | 117.5 | 1.1 | 486 | 411.3 | 1.2 | 2,056** | 1,842.3 | 1.1 |
| a11-20 | 2,288 | 6 | 3.3 | 1.8 | 32** | 11.5 | 2.9 | 91** | 51.5 | 1.8 |
| a21-32 | 282 | 1 | 0.4 | 2.5 | 4 | 1.4 | 2.9 | 20** | 6.3 | 3.1 |
| c2-10 | 64,669 | 102 | 92.9 | 1.1 | 343 | 325.1 | 1.1 | 1,503 | 1,456.1 | 1.0 |
| c11-20 | 60 | 0 | 0.1 | 0.0 | 1 | 0.3 | 3.4 | 2 | 1.4 | 1.5 |
| c21-32 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| g2-10 | 63,377 | 106 | 91.0 | 1.2 | 356 | 318.6 | 1.2 | 1,517 | 1,427.0 | 1.1 |
| g11-20 | 43 | 0 | 0.1 | 0.0 | 1 | 0.2 | 4.8 | 3 | 1.0 | 3.1 |
| g21-32 | 2 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| t2-10 | 85,003 | 159* | 122.1 | 1.3 | 541** | 427.3 | 1.3 | 2,272*** | 1,914.0 | 1.2 |
| t11-20 | 2,799 | 22*** | 4.0 | 5.5 | 64***** | 14.1 | 4.7 | 161**** | 63.0 | 2.6 |
| t21-32 | 400 | 21**** | 0.6 | 36.6 | 41***** | 2.0 | 20.5 | 70***** | 9.0 | 7.8 |
n = the actual number of unique genes for each dataset, for which 3′UTR sequences are available, and upon which calculations are based.
*P values between 1E-3 and 1E-5;
**P values between 1E-5 and 1E-10;
***P values between 1E-10 and 1E-20;
****P values between 1E-20 and 1E-40;
*****P values less than 1E-40.
Table 3.
Enrichment of 3′UTR mononucleotide repeats sequences in RER differentially expressed genes in primary tissue, using all protein coding genes as a reference set
| 3′UTR mononucleotide repeat length | Reference protein-coding genes (n = 19,497) | Watanabe top 177 probesets (n = 84): Observed | Watanabe top 177 probesets: Expected | Watanabe top 177 probesets: Fold increase | Jorissen top 192 probesets (n = 148): Observed | Jorissen top 192 probesets: Expected | Jorissen top 192 probesets: Fold increase | Jorissen top 829 probesets (n = 633): Observed | Jorissen top 829 probesets: Expected | Jorissen top 829 probesets: Fold increase |
| a2-10 | 81,822 | 382 | 352.5 | 1.1 | 754** | 621.1 | 1.2 | 3,026*** | 2,656.5 | 1.1 |
| a11-20 | 2,288 | 15 | 9.9 | 1.5 | 48*** | 17.4 | 2.8 | 154*** | 74.3 | 2.1 |
| a21-32 | 282 | 1 | 1.2 | 0.8 | 4 | 2.1 | 1.9 | 16 | 9.2 | 1.7 |
| c2-10 | 64,669 | 299 | 278.6 | 1.1 | 508 | 490.9 | 1.0 | 2,186 | 2,099.6 | 1.0 |
| c11-20 | 60 | 2* | 0.3 | 7.7 | 0 | 0.5 | 0.0 | 4 | 1.9 | 2.1 |
| c21-32 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| g2-10 | 63,377 | 299 | 273.1 | 1.1 | 503 | 481.1 | 1.0 | 2,143 | 2,057.6 | 1.0 |
| g11-20 | 43 | 1 | 0.2 | 5.4 | 0 | 0.3 | 0.0 | 3 | 1.4 | 2.1 |
| g21-32 | 2 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0 | 0.1 | 0.0 |
| t2-10 | 85,003 | 399 | 366.2 | 1.1 | 833*** | 645.3 | 1.3 | 3,202*** | 2,759.8 | 1.2 |
| t11-20 | 2,799 | 17 | 12.1 | 1.4 | 43** | 21.2 | 2.0 | 134** | 90.9 | 1.5 |
| t21-32 | 400 | 9** | 1.7 | 5.2 | 10 | 3.0 | 3.3 | 30** | 13.0 | 2.3 |
Fig. 1.
Enrichment for sequences containing 3′UTR mononucleotide T repeats [corresponding to the third tercile (T21 to T32) as analyzed in Tables 2 and 3]. Fold-changes for individual repeat lengths between T21 and T32 are shown in the three RER differentially expressed gene datasets. The actual numbers of genes containing all 3′UTR repeat lengths analyzed in all of the datasets are given in Dataset S2. (A) Fold-increase in 3′UTR mononucleotide repeats in top differentially expressed genes in CRC cell lines versus the overall incidence in all protein coding genes (n = 19,497). (B) Fold-increase in 3′UTR mononucleotide repeats in the top differentially expressed genes in primary CRC samples from published data versus the overall incidence in all protein coding genes (n = 19,497). Datasets are labeled by the first author from the corresponding publication (21, 22). Note that the scale for the fold increase (y axis) differs between the two graphs.
The results using data from all genes with an Affymetrix U133+2 ID are very similar (Tables S1–S7 for 3′UTR, 5′UTR, and coding regions). All subsequent comparisons will therefore be restricted to using the data on all protein-coding genes.
There was no convincing evidence of enrichment for sequences containing mononucleotide repeats in 5′UTR sequences (Tables S1 and S2). Although the presence of a single gene (representing a 32-fold increase over expected) with a T22 repeat in the top 30 genes reached statistical significance, this is clearly an isolated event and of questionable importance.
Analysis of 3′UTR enrichment in all three datasets of differentially expressed genes in the cell lines revealed massive enrichment for mononucleotide T-repeat sequences. This finding was most marked for the polyT repeats between 21 and 32 base pairs long (P = 5.80E-160 and a 36.6-fold increase in the top 30 genes, P = 1.9E-166 and a 20.4-fold increase in the top 100 genes, and P = 7.9E-92 and a 7.8-fold increase in the top 800 probesets) (Fig. 1A and Table 2). Thus, although there are only 400 of 19,497 protein-coding genes that contain a 3′UTR sequence longer than T20, 21 of them are found within the top 28 genes (Table 1) and 41 in the top 96 genes (Dataset S1).
The enrichment for polyA repeats only achieves significance in the larger datasets of 100 genes and 800 probesets, and there is no enrichment for 3′UTR sequences containing polyC or polyG mononucleotide repeats. Although there remained a clear and very highly significant increase in sequences with 3′UTR mononucleotide T repeats in the top 100 genes and top 800 probesets, the fold-enrichment in the larger datasets was less extreme, as might be expected. For example, the fold-increase for 3′UTR T repeats between 21 and 32 base pairs drops from 36.6 in the top 30 genes to 20.4 in the top 100 genes, and to 7.8 in the top 800 probesets. Within this upper tercile of mononucleotide repeat lengths, there was also an apparent enrichment for sequences of specific length. For example, although the average fold-increase for all 3′UTR T repeats between 21 and 32 base pairs in length in the top 30 genes is 36.3, T27 repeat lengths, taken separately, showed a 139-fold increase (Fig. 1A and Table 2).
By subtracting the expected numbers from the observed numbers for each differentially expressed dataset, it is possible to obtain a rough estimate of the degree to which the presence of mononucleotide T repeats accounts for the overall RER+ vs. RER− expression differences (see SI Materials and Methods for details of calculation). This calculation suggests that 27 genes in the top 28 (95.9%) are present because of enrichment for 3′UTR polyT sequences between 11 and 32 base pairs. Similar calculations for the top 100 genes and top 800 probesets show that this enrichment process accounts for ≈84.3% of genes in the top 100 and 37% in the top 800 probesets.
RER-Associated Expression Changes Between CRC Cell Lines Are Consistent with Those from Published Data on Primary CRCs.
Studies using cell lines rather than primary tissue are often criticized for not reflecting the situation in primary cancers. We have therefore used published lists of differentially expressed genes in RER+ vs. RER− primary tumors to explore whether the same characteristics of differential expression found in the cell lines are also found in fresh tumor tissue.
Watanabe et al. (21) published a list of 177 Affymetrix U133+2 probesets that were differentially expressed between 33 microsatellite high (MSI-H/RER+) and 51 microsatellite stable (MSS/RER−) primary CRCs. More recently, Jorissen et al. (22) published a list of 829 probesets consistently differentially expressed across three independent datasets using primary tissue samples (a total of 82 RER− cases and 93 RER+ cases), which included the samples analyzed by Watanabe et al. (21–23). Jorissen et al. also gave a list of 192 probesets consistently differentially expressed across four independent datasets that included data from 10 RER+ and 10 RER− cell lines (2). All three of the above lists of probesets were downloaded and the corresponding 3′UTR sequences interrogated for the frequencies of mononucleotide repeat lengths, as described in Materials and Methods. Fig. 1B shows the fold-increases for each mononucleotide T-repeat length for each of the three groups of published probesets and Tables 2 and 3 indicate the significance of the differences from expected. Although the degree of enrichment in the differentially expressed genes for 3′UTR mononucleotide T repeats was less significant than for our data obtained using the cell lines, this analysis confirms that a very significantly higher-than-expected subset of the most differentially expressed genes contain polyA and polyT repeat sequences in their 3′UTRs in primary CRCs, just as in the CRC cell lines. It seems most probable that even small amounts of contaminating normal and stromal tissue in fresh tissue samples would tend to dilute differences in expression profiles compared with those obtained from cell lines. Furthermore, the lists of 829 and 192 probesets published by Jorissen et al. (22) reflect a selected proportion of differentially expressed genes that were consistent across three or four independent studies, rather than the top differentially expressed genes in any one study.
Despite these differences, these results show that enrichment for genes containing mononucleotide repeats in 3′UTR sequences is consistently observed in analyses from primary tissue, and is therefore not a “cell culture” phenomenon related, for example, to the accumulation of deletions in these sequences over time in RER+ cell lines.
Frame-Shift Mutation in Coding Regions Leading to NMD Is Not a Major Mechanism Contributing to Overall RER-Associated Expression Differences.
There is evidence of a small but significant enrichment of differentially expressed genes containing mononucleotide repeat sequences in their coding regions (Tables S3–S5). Within the top 30 and top 100 differentially expressed genes, there is a small increase in observed vs. expected genes with polyA (observed in the top 100 = 419 vs. expected = 362.1; P for χ2 is 0.0028) and polyT (observed in the top 100 = 370 vs. expected = 321.5; P for χ2 is 0.0068) repeats between 2 and 10 base pairs. Using a calculation similar to that described for 3′UTR enrichment (see SI Materials and Methods for details) reveals that 4% of the top 30 and 15.8% of the top 100 genes can be accounted for by the presence of A2 to A10 coding repeats. Similarly, T2 to T10 repeats account for 5.8 and 15% of the top 30 and top 100 genes, respectively. However, there is likely to be a considerable overlap between genes with small polyA and small polyT repeats, so the combined contribution of the two repeat types to the enrichment is more likely to be close to 15% than to the sum that would be expected if the effects were additive. When the analysis of enrichment was performed separately for each repeat length between 2 and 10 base pairs, none of the individual polyA repeat lengths were significantly different from expected. However, analysis of individual polyT repeats between 2 and 10 base pairs in length (Table S5) showed statistically significant enrichment for genes with T6, T7, and T8 repeat lengths in the top 30 genes, and for T5 and T6 repeats in the top 100 genes.
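The χ2 P values quoted above for the coding-region repeats can be approximated with a one-degree-of-freedom test on a single observed/expected pair. The sketch assumes a statistic of the form (O − E)²/E, which gives values close to the reported 0.0028 and 0.0068; whether this is exactly the test applied in the study is an assumption, and the function name is illustrative only.

```python
import math


def chi_square_1df(observed, expected):
    """Chi-square statistic and P value (1 degree of freedom) for one O/E pair."""
    statistic = (observed - expected) ** 2 / expected
    # For 1 degree of freedom the upper-tail probability has a closed form:
    # P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

Calling `chi_square_1df(419, 362.1)` and `chi_square_1df(370, 321.5)` corresponds to the polyA and polyT comparisons for the top 100 genes.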
PolyT 3′UTR Sequences Are Partially Deleted in RER+ Cell Lines.
The 3′UTR mononucleotide T repeats in the length range 19 to 27 were sequenced in eight genes to look for differences in the frequency of deletions between RER+ and RER− cell lines. Six of these genes were in the top 30 differentially expressed set of genes. PPP1CB was selected based on analyses using an earlier version of the Affymetrix annotation for the probesets, where it was the 61st ranked RER differentially expressed probeset (Dataset S1). The latest version of the annotation has this probeset with no gene annotation. ZNF462 was chosen as a gene with long T repeats (T24 and T25) but with no significant difference in expression between RER+ and RER− lines (P = 0.524, FDR corrected P = 0.726). In all cases, there were consistently very significantly larger deletions in the RER+ than in the RER− cell lines (Fig. 2). The deletions in the RER− cell lines are all relatively small (average deletion size across all genes in all RER− cell lines is 1.2 base pairs) and may largely be a result of the background PCR amplification error that is to be expected when sequencing such highly repetitive sequences. The overall average deletion size for all sequenced repeats in the seven differentially expressed genes was 8.3. The average deletion size for ZNF462 was 5.6 base pairs in the RER+ cell lines and not significantly different from the average size for the differentially expressed genes (8.3 base pairs).
Fig. 2.
Mean deletion size in base pairs of 3′UTR mononucleotide repeat sequences of selected RER differentially expressed genes. Error bars indicate SEM. n = number of cell lines sequenced; see SI Materials and Methods for details. P values are based on two-tailed t test for mean deletion size differences between RER+ and RER− cell lines.
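The summary statistics behind a plot of this kind (mean, SEM, and a t statistic comparing two groups of deletion sizes) can be sketched as follows. The legend does not state whether a pooled-variance or unequal-variance t test was used, so the Welch form here is an assumption, as are the function names. Converting the statistic to a two-tailed P value requires the t distribution, which a statistics library would normally supply.

```python
import math
import statistics


def mean_and_sem(values):
    """Mean and standard error of the mean for a list of deletion sizes."""
    n = len(values)
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


def welch_t(group_a, group_b):
    """Welch t statistic and approximate degrees of freedom for two groups."""
    mean_a, sem_a = mean_and_sem(group_a)
    mean_b, sem_b = mean_and_sem(group_b)
    var_sum = sem_a ** 2 + sem_b ** 2
    t_stat = (mean_a - mean_b) / math.sqrt(var_sum)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = var_sum ** 2 / (
        sem_a ** 4 / (len(group_a) - 1) + sem_b ** 4 / (len(group_b) - 1)
    )
    return t_stat, df
```

Each group would hold one deletion size per cell line for a given repeat, for example the RER+ lines in `group_a` and the RER− lines in `group_b`.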
ZNF462 was specifically included because, although it is expressed in the cell lines and contains three separate 3′UTR polyT repeats, ranging in size between 19 and 25 base pairs, it showed no evidence of differential expression between RER+ and RER− cell lines. If there were selective pressure against destabilization of message levels for a particular gene, then deletions would be likely to be selected against, and so less commonly found, if at all. However, the results show essentially the same extent of deletion for each of the three 3′UTR repeat sequences in ZNF462 as for the differentially expressed genes. Thus, either the shortening of 3′UTR polyT sequences in this gene does not confer any change in message stability, or message-level changes are compensated for by other means, for example, by transcriptional regulation.
Discussion
There have been several previous publications describing RER-associated global gene expression changes in CRC using oligonucleotide and cDNA arrays (2, 21–25).
Several mechanisms that might account for the differences in gene expression signatures between RER+ and RER− tumors have been proposed (2, 22). Giacomini et al. (2), for example, suggested that increased frame-shift mutations in RER+ tumors would lead to reduced expression of these genes through NMD (12, 13). Similarly, gene copy number changes arising from chromosomal gains and losses could contribute to changes in gene expression levels in RER− tumors. Thus, Jorissen et al. (22) recently found in their data a statistically significant association between DNA copy number changes and RER-associated gene expression levels (P < 0.001). Giacomini et al. (2) have also suggested that expression changes could occur as a result of genes and pathways secondarily affected by the primary targets of genetic instability in RER+ tumors.
Although these remain valid mechanisms that may all undoubtedly contribute to the overall differences in expression signatures between RER+ and RER− tumors, the results described here clearly suggest that the major mechanism responsible for the most significant of the expression signature differences involves deregulation of message stability through deletion of polyT (and possibly polyA) 3′UTR sequences in CRC-derived cell lines and primary tumors that are replication error deficient. Calculations based on the excess in observed over expected numbers suggest that this mechanism accounts (on average) for almost all of the differential expression in the top 30 genes, and over 80% in the top 100 genes.
The regulatory function of polyT tracts in 3′UTR sequences has recently been established for key genes, such as TP53 (26) and CTNNB1 (27), and deletions in 3′UTR polyT or polyA sequences have been shown to be functionally significant and affect message stability for CTNNB1 (27), CEACAM1 (28), and CDK2AP1 (29).
ELAV/Hu proteins regulate the stability and translation efficiency of labile mRNAs containing AU- and U-rich sequences (30). Park-Lee et al. (31) showed that deletion of As in the 3′UTR AU rich sequence UU(AUUU)NAUU, generated a polyU tract (U13 in the article) in the mRNA, which resulted in increased binding affinity of both HuD (ELAVL4) and HuB (ELAVL2). Increasing the length of the polyU tract to 32 conferred an even greater affinity for HuD and HuB. The authors suggested that a high affinity for polyU sequences might be a general characteristic of proteins in the Hu family. Although neither HuD nor HuB is differentially expressed in RER+ cell lines in the present study (FDR-corrected P values are 0.01 and 0.8, respectively), the Park-Lee et al. (31) article demonstrates that at least some of the members of Hu proteins bind preferentially to polyU tracts. The differential expression of HuR (ELAVL1, FDR P value = 0.0018), a member of the same family, is therefore intriguing because (assuming it too bound to polyU tracts in mRNA) it would compound the effect of 3′UTR deletions in its target genes if, in addition to binding preferentially/differentially to deleted target genes, it was, itself, expressed at higher levels.
Perhaps the best-characterized RNA binding proteins are the family of more than 20 hnRNP proteins (20, 32, 33). These proteins are involved in regulating a variety of processes, including RNA stability, translation efficiency, RNA splicing, and telomere biogenesis, and on binding RNA can cause either stabilization or destabilization of message (34). It is notable that two proteins of this family, HNRNPL and HNRNPH1, are in the top 30 differentially expressed RER genes. They have polyT repeat sequences of 26 and 22 base pairs, respectively (Table 1), which are consistently deleted in the RER+ cell lines, and both are expressed at higher levels in RER+ vs. RER− cell lines. A simple interpretation is that the deletions in these 3′UTR regulatory sequences have led to stabilization of message for both genes. It is noteworthy that 26 probesets representing 11 members of the hnRNP family of proteins are differentially expressed at an FDR P-value cutoff of 0.05, and that for 24 of the 26 probesets, there is a positive fold-change in the RER+ cell lines (Table S8).
Changes in expression levels for genes involved in controlling RNA stability, such as the HuR and hnRNP proteins discussed above, may be of particular significance, because they would be expected to affect the levels of all their downstream targets, and so compound the primary effect of 3′UTR mutations in their downstream target genes.
We found a relatively low level of enrichment for mononucleotide repeats within coding regions of the differentially expressed genes. Frame-shift mutations in shorter polynucleotide repeats of certain presumed tumor-suppressor genes, such as TGFBR2(A10), BAX(G8), MSH3(A8), and MSH6(C8) are reportedly common in RER+ tumors, but rare in RER− tumors (35, 36). The enrichment for genes with coding mononucleotide repeats in the RER top 30 and top 100 genes occurs most significantly for genes containing polyT repeats of between 5 and 8 base pairs. Our findings suggest that NMD is unlikely to account for much more than ≈15% of the top 100 differentially expressed genes between RER+ and RER− CRCs. However, the twofold expression differences that may be expected from such a mechanism may not be readily detected by many of the methods commonly used for assessing mRNA expression differences, and may result in underrepresentation of this mechanism.
Using published data, we have shown that the enrichment in RER+ vs. RER− differentially expressed genes for sequences containing 3′UTR mononucleotide repeats is found in primary tissue, as well as in the cell lines. This finding shows that expression changes and associated deletions in 3′UTR sequences are stable over long periods in tissue culture, even on a mismatch repair-deficient background.
It is worth noting that although there is a huge enrichment for genes with polyT repeats in their 3′UTRs, not all genes with long T repeats in the 3′UTR are differentially expressed between RER+ and RER−, and it is not immediately clear what distinguishes those that are from those that are not. For example, when considering all probesets on the Affymetrix U133+2 array that are associated with T27 to T32 repeats, only an average of 19% of probesets are differentially expressed. Aside from those probesets representing genes that are not expressed in colorectal tissue or CRC, there clearly remain a significant number of genes that contain these 3′UTR repeat sequences, but which are not differentially expressed. Why a subset of these genes is so clearly differentially expressed and others not may be a function of surrounding sequences. For example, we have not analyzed other kinds of repeat sequences, such as di-, tri-, and tetranucleotide sequences, and other repetitive sequences such as the AREs, which may well be targeted to some extent in the RER+ cancers. Thus, the context within which the mononucleotide T-repeat sequence occurs may modulate the effect of any deletions.
Another possible consequence of a defective mismatch repair system is the accumulation of mutations within intronic repeat sequences at splice junctions, leading to expression level changes of splice variants. We did not explore this mechanism here because the Affymetrix U133+2 arrays used in this study interrogate the 3′ end of transcripts and are, therefore, not designed to detect expression-level differences of splice variants in the way that exon arrays are.
In summary, we have uncovered a mechanism, namely deletion of 3′UTR T-repeat sequences that control mRNA stability, which accounts for much of the difference in expression patterns between RER+ and RER− CRCs. We have confirmed that this mechanism operates in both cell lines and primary tissue. The next step will be to see which of the changes in expression associated with this mechanism are translated into relevant changes in protein levels that contribute functionally to the overall RER+ phenotype, and so may provide clues to its better treatment.
Materials and Methods
Cell Lines and Cell Culture.
Fourteen RER+ and 16 RER− lines were analyzed. See SI Materials and Methods for details.
Gene Expression Microarrays.
Gene expression profiling was performed using Human Genome U133+2 chips (Affymetrix) following the manufacturer's instructions, and the data were analyzed using Partek Genomics Suite software (see SI Materials and Methods for details).
The data reported in this article have been deposited in the Gene Expression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession no. GSE24795).
Quantitative RT-PCR.
Quantitative RT-PCR levels were determined using standard TaqMan Gene Expression Assays (see SI Materials and Methods for details).
Sequencing.
PCR amplicons (primer sequences and PCR conditions available from the authors on request) were sequenced directly using the appropriate PCR primers and the Big Dye Sequencing kit (Applied Biosystems) on an ABI 377 sequencer (Applied Biosystems); see SI Materials and Methods for details of the cell lines and repeats sequenced.
Mononucleotide Frequency Determination and Statistical Analysis.
Sequences were downloaded from the Ensembl database as described in SI Materials and Methods. Perl scripts were written to query the 3′UTR, 5′UTR, and coding sequences of (i) all protein-coding genes, (ii) all Affymetrix-annotated genes, or (iii) various subsets of differentially expressed annotated genes, for the incidence of each length of A, G, T, or C mononucleotide repeat from n = 2 to n = 32. Significant departures of observed from expected numbers were determined separately for each individual repeat length, as well as for repeat lengths grouped into approximate terciles: 2 to 10 base pairs, 11 to 20 base pairs, and 21 to 32 base pairs. For further details of the analysis, see SI Materials and Methods and Table S9.
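As a rough illustration of the repeat-counting step, the following Python sketch tallies mononucleotide runs by base and length and assigns each length to one of the three length groups described above. It is not the Perl code used in this study: the function names (`count_mononucleotide_repeats`, `length_group`), the decision to count only maximal runs, and the exclusion of runs longer than 32 bases are all assumptions made for the example.

```python
import re
from collections import Counter


def count_mononucleotide_repeats(seq, min_len=2, max_len=32):
    """Count maximal single-base runs, keyed by (base, run length).

    Assumption: only maximal runs are counted, and runs outside
    [min_len, max_len] are ignored rather than truncated.
    """
    counts = Counter()
    # Each match is one base followed by as many copies of itself as possible.
    for match in re.finditer(r"([ACGT])\1*", seq.upper()):
        length = len(match.group(0))
        if min_len <= length <= max_len:
            counts[(match.group(1), length)] += 1
    return counts


def length_group(length):
    """Map a repeat length to its group (2-10, 11-20, or 21-32 bases)."""
    if 2 <= length <= 10:
        return "2-10"
    if 11 <= length <= 20:
        return "11-20"
    if 21 <= length <= 32:
        return "21-32"
    return None  # outside the range considered in this sketch


# Hypothetical 3'UTR fragment containing a T12 repeat and an A2 repeat.
example_utr = "ACG" + "T" * 12 + "GCAAT"
example_counts = count_mononucleotide_repeats(example_utr)
```

Applied per gene to the 3′UTR, 5′UTR, and coding sequences in turn, counts of this kind could then be summed over each gene set to give the observed numbers for comparison.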
For analysis of previously published data, lists of RER differentially expressed probesets were downloaded from supplementary data tables of refs. 21 and 22, and then the 3′UTR sequences for the unique genes represented by the listed probesets were obtained from Ensembl as described above.
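For either the cell-line gene subsets or the published probeset lists, the comparison of observed with expected numbers of repeat-containing genes could be sketched as below. The statistical procedure actually used is described in SI Materials and Methods; this example substitutes a one-sided hypergeometric test as an assumed stand-in, and the function names and arguments (`expected_count`, `hypergeom_upper_tail`, `k`, `K`, `n`, `N`) are hypothetical.

```python
from math import comb


def expected_count(K, n, N):
    """Expected number of repeat-containing genes in a subset of size n,
    given that K of the N background genes contain the repeat."""
    return n * K / N


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: background genes; K: background genes containing the repeat;
    n: genes in the differentially expressed subset;
    k: genes in that subset observed to contain the repeat.
    Assumption: genes are treated as independent draws without replacement.
    """
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total
```

With the toy values k = 2, K = 4, n = 3, and N = 10 (chosen only for illustration, not taken from the data), the expected count is 1.2 and the upper-tail probability is 1/3.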
Acknowledgments
We thank Dr. Rachael Hancox for sequencing of 3′UTR mononucleotide T repeats in a subset of the genes. This work was supported in part by a Cancer Research United Kingdom programme grant (to W.F.B.).
Footnotes
The authors declare no conflict of interest.
Data deposition: The microarray expression data reported in this paper have been deposited in the Gene Expression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession no. GSE24795).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1015604107/-/DCSupplemental.
References
- 1. Peltomäki P. Role of DNA mismatch repair defects in the pathogenesis of human cancer. J Clin Oncol. 2003;21:1174–1179. doi: 10.1200/JCO.2003.04.060.
- 2. Giacomini CP, et al. A gene expression signature of genetic instability in colon cancer. Cancer Res. 2005;65:9200–9205. doi: 10.1158/0008-5472.CAN-04-4163.
- 3. Peltomäki P. Deficient DNA mismatch repair: A common etiologic factor for colon cancer. Hum Mol Genet. 2001;10:735–740. doi: 10.1093/hmg/10.7.735.
- 4. Hauge XY, Litt M. A study of the origin of ‘shadow bands’ seen when typing dinucleotide repeat polymorphisms by the PCR. Hum Mol Genet. 1993;2:411–415. doi: 10.1093/hmg/2.4.411.
- 5. Kolodner R. Biochemistry and genetics of eukaryotic mismatch repair. Genes Dev. 1996;10:1433–1442. doi: 10.1101/gad.10.12.1433.
- 6. Kim H, Jen J, Vogelstein B, Hamilton SR. Clinical and pathological characteristics of sporadic colorectal carcinomas with DNA replication errors in microsatellite sequences. Am J Pathol. 1994;145(1):148–156.
- 7. Gryfe R, et al. Tumor microsatellite instability and clinical outcome in young patients with colorectal cancer. N Engl J Med. 2000;342(2):69–77. doi: 10.1056/NEJM200001133420201.
- 8. Ionov Y, Peinado MA, Malkhosyan S, Shibata D, Perucho M. Ubiquitous somatic mutations in simple repeated sequences reveal a new mechanism for colonic carcinogenesis. Nature. 1993;363:558–561. doi: 10.1038/363558a0.
- 9. Grady WM. Genomic instability and colon cancer. Cancer Metastasis Rev. 2004;23(1-2):11–27. doi: 10.1023/a:1025861527711.
- 10. Lynch HT, et al. Genetics, natural history, tumor spectrum, and pathology of hereditary nonpolyposis colorectal cancer: An updated review. Gastroenterology. 1993;104:1535–1549. doi: 10.1016/0016-5085(93)90368-m.
- 11. Woodford-Richens KL, et al. SMAD4 mutations in colorectal cancer probably occur before chromosomal instability, but after divergence of the microsatellite instability pathway. Proc Natl Acad Sci USA. 2001;98:9719–9723. doi: 10.1073/pnas.171321498.
- 12. Holbrook JA, Neu-Yilik G, Hentze MW, Kulozik AE. Nonsense-mediated decay approaches the clinic. Nat Genet. 2004;36:801–808. doi: 10.1038/ng1403.
- 13. Conti E, Izaurralde E. Nonsense-mediated mRNA decay: Molecular insights and mechanistic variations across species. Curr Opin Cell Biol. 2005;17:316–325. doi: 10.1016/j.ceb.2005.04.005.
- 14. Barreau C, Paillard L, Osborne HB. AU-rich elements and associated factors: Are there unifying principles? Nucleic Acids Res. 2005;33:7138–7150. doi: 10.1093/nar/gki1012.
- 15. Guhaniyogi J, Brewer G. Regulation of mRNA stability in mammalian cells. Gene. 2001;265(1-2):11–23. doi: 10.1016/s0378-1119(01)00350-x.
- 16. Bevilacqua A, Ceriani MC, Capaccioli S, Nicolin A. Post-transcriptional regulation of gene expression by degradation of messenger RNAs. J Cell Physiol. 2003;195:356–372. doi: 10.1002/jcp.10272.
- 17. van Hoof A, Parker R. Messenger RNA degradation: Beginning at the end. Curr Biol. 2002;12:R285–R287. doi: 10.1016/s0960-9822(02)00802-3.
- 18. Wilusz CJ, Wormington M, Peltz SW. The cap-to-tail guide to mRNA turnover. Nat Rev Mol Cell Biol. 2001;2:237–246. doi: 10.1038/35067025.
- 19. Malter JS. Regulation of mRNA stability in the nervous system and beyond. J Neurosci Res. 2001;66:311–316. doi: 10.1002/jnr.10021.
- 20. Carpenter B, et al. The roles of heterogeneous nuclear ribonucleoproteins in tumour development and progression. Biochim Biophys Acta. 2006;1765(2):85–100. doi: 10.1016/j.bbcan.2005.10.002.
- 21. Watanabe T, et al. Distal colorectal cancers with microsatellite instability (MSI) display distinct gene expression profiles that are different from proximal MSI cancers. Cancer Res. 2006;66:9804–9808. doi: 10.1158/0008-5472.CAN-06-1163.
- 22. Jorissen RN, et al. DNA copy-number alterations underlie gene expression differences between microsatellite stable and unstable colorectal cancers. Clin Cancer Res. 2008;14:8061–8069. doi: 10.1158/1078-0432.CCR-08-1431.
- 23. Koinuma K, et al. Epigenetic silencing of AXIN2 in colorectal carcinoma with microsatellite instability. Oncogene. 2006;25(1):139–146. doi: 10.1038/sj.onc.1209009.
- 24. Banerjea A, et al. Colorectal cancers with microsatellite instability display mRNA expression signatures characteristic of increased immunogenicity. Mol Cancer. 2004;3:21. doi: 10.1186/1476-4598-3-21.
- 25. Kruhøffer M, et al. Gene expression signatures for colorectal cancer microsatellite status and HNPCC. Br J Cancer. 2005;92:2240–2248. doi: 10.1038/sj.bjc.6602621.
- 26. Vilborg A, et al. The p53 target Wig-1 regulates p53 mRNA stability through an AU-rich element. Proc Natl Acad Sci USA. 2009;106:15756–15761. doi: 10.1073/pnas.0900862106.
- 27. Thiele A, Nagamine Y, Hauschildt S, Clevers H. AU-rich elements and alternative splicing in the beta-catenin 3′UTR can influence the human beta-catenin mRNA stability. Exp Cell Res. 2006;312:2367–2378. doi: 10.1016/j.yexcr.2006.03.029.
- 28. Ruggiero T, et al. Deletion in a (T)8 microsatellite abrogates expression regulation by 3′-UTR. Nucleic Acids Res. 2003;31:6561–6569. doi: 10.1093/nar/gkg858.
- 29. Shin J, et al. A del T poly T (8) mutation in the 3′ untranslated region (UTR) of the CDK2-AP1 gene is functionally significant causing decreased mRNA stability resulting in decreased CDK2-AP1 expression in human microsatellite unstable (MSI) colorectal cancer (CRC). Surgery. 2007;142:222–227. doi: 10.1016/j.surg.2007.04.002.
- 30. Brennan CM, Steitz JA. HuR and mRNA stability. Cell Mol Life Sci. 2001;58:266–277. doi: 10.1007/PL00000854.
- 31. Park-Lee S, Kim S, Laird-Offringa IA. Characterization of the interaction between neuronal RNA-binding protein HuD and AU-rich RNA. J Biol Chem. 2003;278:39801–39808. doi: 10.1074/jbc.M307105200.
- 32. Dreyfuss G, Matunis MJ, Piñol-Roma S, Burd CG. hnRNP proteins and the biogenesis of mRNA. Annu Rev Biochem. 1993;62:289–321. doi: 10.1146/annurev.bi.62.070193.001445.
- 33. Chaudhury A, Chander P, Howe PH. Heterogeneous nuclear ribonucleoproteins (hnRNPs) in cellular processes: Focus on hnRNP E1’s multifunctional regulatory roles. RNA. 2010;16:1449–1462. doi: 10.1261/rna.2254110.
- 34. Krecic AM, Swanson MS. hnRNP complexes: Composition, structure, and function. Curr Opin Cell Biol. 1999;11:363–371. doi: 10.1016/S0955-0674(99)80051-9.
- 35. Parsons R, et al. Microsatellite instability and mutations of the transforming growth factor beta type II receptor gene in colorectal cancer. Cancer Res. 1995;55:5548–5550.
- 36. Schwartz S, Jr, et al. Frameshift mutations at mononucleotide repeats in caspase-5 and other target genes in endometrial and gastrointestinal cancer of the microsatellite mutator phenotype. Cancer Res. 1999;59:2995–3002.