Abstract
Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.
Subject terms: Machine learning, Genomics, Transcriptomics
Mammalian genomes are scattered with repetitive sequences, but their biology remains largely elusive. Here, the authors show that transcription can initiate from short tandem repetitive sequences, and that genetic variants linked to human diseases are preferentially found at repeats with high transcription initiation level.
Introduction
RNA polymerase II (RNAPII) transcribes many loci outside annotated protein-coding gene promoters1,2 to generate a diversity of RNAs, including for instance enhancer RNAs3 and long noncoding RNAs (lncRNAs)4. In fact, >70% of all nucleotides are thought to be transcribed at some point1,5,6. Using the Cap Analysis of Gene Expression (CAGE) technology7,8, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species2. Integrating multiple collections of transcript models with FANTOM CAGE datasets, Hon et al. built a new annotation of the human genome (FANTOM CAGE-Associated Transcriptome, FANTOM CAT), with an atlas of 27,919 human lncRNAs, among them 19,175 potentially functional RNAs4. Despite this annotation, many CAGE peaks remain unassigned to a specific gene and/or initiate at unconventional regions, outside promoters or enhancers, providing an unprecedented means to further characterize noncoding transcription within the genome “dark matter”9 and to decode part of the transcriptional “noise”.
Noncoding transcription is indeed far from being fully understood10 and some authors suggest that many of these transcripts, often faintly expressed, can simply be “noise” or “junk”11,12. On the other hand, many non annotated RNAPII transcribed regions correspond to open chromatin1 and cis-regulatory modules bound by transcription factors (TFs)13. Besides, genome-wide association studies showed that trait-associated loci, including those linked to human diseases, can be found outside canonical gene regions14–16. Together, these findings suggest that the noncoding regions of the human genome harbor a plethora of potentially transcribed functional elements, which can drastically impact genome regulations and functions9,16.
The human genome is scattered with repetitive sequences, and a large portion of noncoding RNAs derives from repetitive elements17,18, in particular DNA tandem repeats, such as satellite DNAs19 and minisatellites20. Microsatellites, also called short tandem repeats (STRs), constitute the third class of DNA tandem repeats. They correspond to repeated DNA motifs of 2–6 bp and constitute one of the most polymorphic and abundant repetitive elements21. Classes of STRs can be defined based on the repeated DNA motif (e.g., (AC)n will correspond to all STRs with repeats of the dinucleotide AC). STR polymorphism, which corresponds to variation in the number of repeated DNA motif (i.e., STR length), is presumably due to their susceptibility to slippage events during DNA replication. STRs have been shown to widely impact gene expression and to contribute to expression variation22–25. Some constitute genuine expression Quantitative Trait Loci (eQTLs)23,24, called eSTRs23. At the molecular level, STRs can for instance affect expression by inducing inhibitory DNA structures26 and/or by modulating TF binding27,28.
Given the abundance of STRs on the one hand and the widespread transcription of the genome, including at repeated elements, on the other hand, we hypothesize that transcription initiation also occurs at STRs. To test this hypothesis, we probe CAGE data collected by the FANTOM5 consortium2 using the STR catalog built by Willems et al.29. We specifically show that a significant portion of CAGE peaks (~8.6%) initiate at STRs. This transcription is confirmed by Cap Trap RNA-seq (CTR-seq), a technology that combines cap trapping and long-read MinION sequencing. Transcription of STR-containing RNAs has previously been reported in several species30–33. We report here that thousands of STRs can also initiate transcription in human and mouse: they are therefore not mere passengers in other RNAs but contain genuine TSSs. We further train sequence-based Convolutional Neural Networks (CNNs) able to predict these transcription initiation levels with high accuracy (correlation between observed and predicted CAGE signal >0.65 for 14 STR classes with >5000 elements). These models unveil the importance of STR flanking sequences in distinguishing STR classes from one another, and also in predicting transcription initiation. We finally show that genetic variants linked to human diseases are located not only within, but also around, STRs associated with high transcription initiation levels.
Results
CAGE peaks are detected at STRs
We first intersected the coordinates of 1,048,124 CAGE peak summits2 with that of 1,620,030 STRs called by HipSTR29. We found that 89,948 CAGE peaks (~8.6%) initiate at 84,555 STRs (Fig. 1a and Supplementary Fig. 1). As a comparison, only 2.3% of an equal number of randomly selected intervals with equivalent size intersected with CAGE peaks (Fisher’s exact test P value < 2.2e-16). Among CAGE peaks intersecting with STRs, 10,727 correspond to TSSs of FANTOM CAT transcripts4 and 8823 to enhancer boundaries3 (Supplementary Data 1). Note that the FANTOM CAT annotation was shown to be more accurate in 5’ end transcript definitions compared to other catalogs (GENCODE34, Human BodyMap35, and miTranscriptome36), because transcript models combine various independent sources (GENCODE release 19, Human BodyMap 2.0, miTranscriptome, ENCODE and an RNA-seq assembly from 70 FANTOM5 samples) and FANTOM CAT TSSs were validated with Roadmap Epigenome DHS and RAMPAGE datasets4. This transcription does not correspond to random noise because the fraction of STRs harboring a CAGE peak within each class differs depending on the STR class, without any link with their abundance (Fig. 1a, c). Some STR classes with low abundance are indeed more often associated with a CAGE peak than more abundant STRs (Fig. 1a, c, compare for instance (CTTTTT)n or (AAAAG)n vs. (AT)n or (ATTT)n). Likewise, the number of STRs associated with CAGE peaks cannot merely be explained by their length, as several STR classes have similar length distribution but very different fractions of CAGE-associated loci (compare for instance (AT)n and (GT)n in Fig. 1c and Supplementary Fig. 2).
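The intersection and enrichment test described above can be sketched as follows. This is a minimal illustration on toy coordinates, assuming a single chromosome and 0-based, end-exclusive intervals; the function names (`count_intervals_hit`, `fisher_exact_greater`) are hypothetical and are not taken from the authors' pipeline, and the one-sided form of Fisher's exact test is an assumption.

```python
from bisect import bisect_left
from math import comb


def count_intervals_hit(intervals, summits):
    """Count intervals (start, end), end-exclusive, that contain >= 1 summit."""
    summits = sorted(summits)
    hit = 0
    for start, end in intervals:
        # Number of summits s with start <= s < end.
        if bisect_left(summits, end) - bisect_left(summits, start) > 0:
            hit += 1
    return hit


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the table [[a, b], [c, d]].

    Rows: STRs vs. random intervals; columns: with vs. without a CAGE summit.
    Exact sum over the hypergeometric tail; only practical for small counts.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1))
    return tail / denom


# Toy data (hypothetical): three STRs, three size-matched random intervals.
strs = [(10, 20), (30, 40), (50, 60)]
random_intervals = [(0, 5), (70, 80), (200, 210)]
summits = [12, 35, 100]

str_hits = count_intervals_hit(strs, summits)
rnd_hits = count_intervals_hit(random_intervals, summits)
p_value = fisher_exact_greater(str_hits, len(strs) - str_hits,
                               rnd_hits, len(random_intervals) - rnd_hits)
```

On genome-scale counts, an interval library and a statistics package would replace these helpers; the sketch only makes the logic of the comparison explicit.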
We computed the tag count sum along each STR ± 5 bp, and averaged the signal across 988 FANTOM5 libraries. We noticed the existence of very low (tag count = 1) CAGE counts along STRs, which artificially increase the signal (see examples in Fig. 1a, Spearman correlation coefficient between sum CAGE tag count along STR and STR length ~0.26). To remove any dependence between STR length and CAGE signal, the mean tag count was normalized by the length of the window used to compute the signal (i.e., STR length + 10 bp). Looking directly at this CAGE signal (not CAGE peaks) along the genome, we observed that some STR classes are more transcribed than others (Fig. 1d, compare (CGG)n or (CCG)n vs. (AAGG)n or (AAAAT)n). No drastic difference in terms of CAGE signal was noticed between intra- and intergenic STRs (Supplementary Fig. 3). Looking at each STR class separately, we confirmed that our CAGE signal computation is not sensitive to the STR length (Supplementary Fig. 4). Supplementary Fig. 4 also shows that STRs with different lengths can be associated with the same CAGE signal while, conversely, two STRs with different CAGE signals can have the same length. Thus, considering transcription, STR polymorphism appears to not only rely on their length (number of repeated elements). Transcription initiation, therefore, appears to complexify STR polymorphism.
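The length-normalized signal defined above (tag count summed over STR ± 5 bp, averaged across libraries, divided by STR length + 10 bp) can be written in a few lines. This is a sketch under stated assumptions: each library is represented as a dictionary mapping a genomic position to a tag count on one strand of one chromosome, coordinates are 0-based and end-exclusive, and the name `str_cage_signal` is hypothetical.

```python
def str_cage_signal(libraries, str_start, str_end, flank=5):
    """Mean CAGE tag count in STR +/- flank, normalized by window length.

    libraries: list of dicts {position: tag_count}, one dict per CAGE library.
    """
    win_start, win_end = str_start - flank, str_end + flank
    window_len = (str_end - str_start) + 2 * flank  # STR length + 10 bp by default
    sums = [
        sum(count for pos, count in lib.items() if win_start <= pos < win_end)
        for lib in libraries
    ]
    mean_sum = sum(sums) / len(sums)  # average across libraries
    return mean_sum / window_len


# Toy example (hypothetical): two libraries, one 10-bp STR at [100, 110).
libs = [{100: 4, 103: 2, 200: 9}, {98: 6}]
signal = str_cage_signal(libs, 100, 110)  # (6 + 6) / 2 / 20 = 0.3
```

Dividing by the window length is what removes the mechanical dependence of the summed signal on STR length.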
CAGE tags correspond to genuine transcriptional products
CAGE read detection at STRs faces two problems. First, CAGE tags can capture not only TSSs but also the 5’ ends of post-transcriptionally processed RNAs37. To clarify this point, we used a strategy described by de Rie et al.38, which compares CAGE tags obtained by Illumina (ENCODE) vs. Heliscope (FANTOM) technologies. Briefly, the 7-methylguanosine cap at the 5’ end of CAGE tags produced by RNAPII can be recognized as a guanine nucleotide during reverse transcription. This artificially introduces mismatched Gs at Illumina tag 5’ end, not detected with Heliscope sequencing, because it skips the first nucleotide38. We then evaluated the existence of this G bias in CAGE tags corresponding to peaks detected at STRs, peaks assigned to genes (for positive control), and peaks intersecting the 3’ end of precursor microRNAs (pre-miRNAs for a negative control) (Fig. 2). While most CAGE tag 5’ ends perfectly match the sequences of pre-miRNA 3’end in all cell types tested, as previously reported38, a G bias was clearly observed when considering assigned CAGEs and CAGEs detected at STRs, confirming that the vast majority of STR-associated CAGE tags are truly capped. We also confirmed that STRs located within RNAPII-binding sites exhibit a stronger CAGE signal than STRs not associated with RNAPII-binding events (Supplementary Fig. 5).
Second, because of their repetitive nature, mapping CAGE reads to STRs is problematic and may yield ambiguous results. To circumvent this issue, we developed CTR-seq, which combines cap trapping and long-read MinION sequencing. With this technology, the median read length is >500 bp, thereby greatly limiting the chance of erroneous mapping. Two libraries were generated in A549 cells, including or not polyA tailing. This polyA tailing step before reverse transcription allows the detection of polyA-minus noncoding RNAs. Long reads initiating at STRs were readily detected in both libraries (Fig. 3). As expected given the depth of MinION sequencing in only one cell line, the number of STRs associated with long reads is lower than that obtained with CAGE sequencing collected in 988 libraries (n = 5472 and 7812, respectively, with and without polyA tailing with 2291 STRs associated with long reads in both libraries). Among these 2291 STRs, 904 (39%) are also associated with a CAGE peak. Thus, compared to the reproducibility of MinION sequencing in both libraries (only 2291 STRs in common out of 5472 (42%) or 7812 (29%)), CAGE and CTR-seq sequencing results are overall in agreement. In fact, STR classes associated with CAGE peaks correspond to those associated with CTR-seq reads (Fig. 3 compared to Fig. 1c). The Spearman correlation ρ between the fractions of STRs associated with CAGE and MinION reads with and without polyA tailing equals 0.88 and 0.89 respectively. Besides, 301 out of 904 STRs associated with both CAGE peak and CTR-seq long read correspond to TSSs of FANTOM CAT transcripts and 54 to enhancer boundaries. Overall, CTR-seq confirms CAGE data and the existence of transcription initiating at STRs. The similarity of the results obtained with and without the polyA tailing step also indicates that RNAs initiating at STRs are mostly polyadenylated.
Transcription initiation at STRs exhibits specific features
We further looked at the subcellular localization of STR-initiating transcripts and used CAGE sequencing data generated after cell fractionation (see “Methods” section). While the majority of CAGE tags, including those assigned to genes, are detected in both the nucleus and cytoplasm, CAGE tags initiating at STRs are mostly detected in the nuclear compartment (Fig. 4a). Functionally distinct RNA species were previously categorized by their transcriptional directionality39. We then sought to compute the directionality score, as defined by Hon et al. in ref. 4, for each STR associated with CAGE signal (Fig. 4b). Briefly, this score corresponds to the difference between the CAGE signal on the (+) strand and that on the (−) strand divided by their sum (in the HipSTR catalog, STRs are systematically defined on the (+) strand, i.e., (T)n on the (−) strand are defined as (A)n). A score equal to 1 or −1 indicates that transcription is strictly oriented toward the (+) or (−) strand, respectively. A score close to 0 indicates that the transcription is balanced and that it occurs equally on the (+) and (−) strands. As shown in Fig. 4b, some STR classes are associated with directional transcription either on the (+) (e.g., (ATTT)n, (T)n) or (−) (e.g., (A)n, (ATG)n) strand, while others are bidirectional and balanced ((CGG)n, (CCG)n). Furthermore, scores obtained at (A)n STRs are mostly negative, while scores obtained at (T)n STRs are mostly positive. This indicates that transcription initiation preferentially occurs on the strand where (T)n STRs are found. The fact that transcription can be either directional or bidirectional depending on the STR class suggests that transcription initiation at STRs is governed by different features, which are specific to STR classes. We looked for motifs known to be involved in transcription directionality at canonical TSSs, namely, polyadenylation sites (polyA sites) and U1-binding sites40.
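For reference, the directionality score described above reduces to a one-line formula. The sketch below is a minimal illustration; the function name `directionality` and the choice to return `None` when no signal is detected on either strand are assumptions, not part of the original definition.

```python
def directionality(plus_signal, minus_signal):
    """(plus - minus) / (plus + minus): +1 is strictly (+), -1 strictly (-), 0 balanced."""
    total = plus_signal + minus_signal
    if total == 0:
        return None  # undefined without any CAGE signal (assumed convention)
    return (plus_signal - minus_signal) / total


# Toy values (hypothetical signals on each strand).
mostly_plus = directionality(3.0, 1.0)   # 0.5
strict_minus = directionality(0.0, 2.0)  # -1.0
balanced = directionality(2.0, 2.0)      # 0.0
```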
Sequences encompassing −3/+10bp41 around FANTOM CAT 5’ donor splice sites were used to build a position weight matrix (PWM) corresponding to the U1-binding site (Supplementary Fig. 6). This PWM was further used to scan 2 kb-long sequences centered around (T)n 3’ end and FANTOM CAT TSSs (used as positive control). (T)n STRs have been chosen as a prototype of directional transcription initiation at STRs (Fig. 4b). While we confirmed enrichment of potential U1-binding sites downstream FANTOM CAT TSSs40, such enrichment was not observed downstream (T)n 3’ ends (Supplementary Fig. 6). Likewise, polyA sites are clearly enriched upstream FANTOM CAT TSSs, but this observation does not hold true for (T)n STRs (Supplementary Fig. 6). Our results extend the findings of Ibrahim et al., who reported that a single model of transcription initiation within and across eukaryotic species is not evident42.
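The PWM construction and scanning step can be sketched as below. This is an illustration only: the toy donor-site sequences are hypothetical 9-mers (the analysis above uses a −3/+10 bp window), the pseudocount and uniform background are assumptions, and the helper names `build_pwm`, `scan`, and `best_hit` do not come from the authors' code. Sequences are assumed to contain only A, C, G, and T.

```python
from math import log2

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: log2((counts[b] / total) / background) for b in BASES})
    return pwm


def scan(pwm, seq):
    """Score every window of the sequence on the given strand."""
    width = len(pwm)
    return [
        sum(pwm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]


def best_hit(pwm, seq):
    """Return (start position, score) of the highest-scoring window."""
    scores = scan(pwm, seq)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy aligned 5' donor splice sites (hypothetical) and a sequence to scan.
sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
pwm = build_pwm(sites)
position, score = best_hit(pwm, "TTTTCAGGTAAGTCCCC")
```

Counting, for each offset from the anchor (TSS or STR 3’ end), how many sequences have a window scoring above a chosen threshold would then give the kind of positional enrichment profile described above.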
A sequence-based deep learning model reveals that features governing transcription initiation depend on the STR classes
We further probed transcription initiation at STRs using a machine-learning approach. We used a deep Convolutional Neural Network (CNN), which is able to successfully predict CAGE signal in large regions of the human genome43,44. This type of machine-learning approach takes as input the DNA sequence directly, without the need to manually define predictive features before analysis. The first question that arose was then to determine the sequence to use as input.
We first sought to build a model common to all STR classes to predict the CAGE signal as computed in Fig. 1d. Note that, because we used mean signal across CAGE libraries, our model is cell-type agnostic. This choice was motivated by the observation that the CAGE signal at STRs in each library is very sparse, thereby strongly reducing the prediction accuracy of our model. As input, we used sequences spanning 50 bp around the 3’ end of each STR. Model architecture and construction of the different sets used for learning are detailed in the “Methods” section and in Supplementary Fig. 7. Source code is available at https://gite.lirmm.fr/ibc/deepSTR. The accuracy of our model was computed as Spearman correlation between the predicted and the observed CAGE signals on held-out test data (see “Methods”). The performance of this global model was overall high (ρ ~0.72), indicating that transcription initiation at STRs can indeed be predicted by sequence-level features. However, looking at the accuracy for each STR class, we noticed drastic differences with accuracies ranging from <0.6 to 0.81 depending on the STR class (Fig. 5a, blue dots). The global model is notably accurate for the most represented STR class (i.e., (T)n with 766,747 elements), but performs worse in other STR classes. Differences in accuracies are not simply linked to the number of elements available for learning in each STR class. They rather suggest that, as proposed above (Fig. 4b), transcription initiation may be governed by features specific to each STR class.
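Because model accuracy is reported throughout as the Spearman correlation between predicted and observed CAGE signals, a compact reference implementation may help. The sketch below assumes average ranks for ties and is written with the standard library only; the helper names are hypothetical, and a statistics package would normally be used instead.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy check (hypothetical values): a monotonic but non-linear relationship.
observed = [1, 2, 3, 4]
predicted = [1, 4, 9, 16]
rho = spearman(observed, predicted)
```

Spearman correlation only rewards getting the ordering of STRs right, which makes it robust to the heavy-tailed distribution typical of CAGE signal.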
STR flanking sequences can classify STR classes, independently of the DNA repeated motif
It was previously shown that 50-bp-long sequences flanking (AC)n have evolved unusually to create specific nucleotide patterns45. To determine if such specific patterns hold true for other STRs, we sought to classify STRs based only on their 50 bp surrounding sequences. We trained a CNN model to classify pairs of STR classes (Supplementary Fig. 7). To avoid any problem due to the imprecise definition of STR boundaries, we masked the seven bases located downstream of the STR 3’ ends (see “Methods”). In that case, model performance is evaluated by the Area Under the ROC (Receiver Operating Characteristic) curve (AUC, Fig. 5b). The AUCs obtained in these pairwise classifications were very high (AUC > 0.7, Fig. 5b), with the notable exception of (GTTT)n vs. (GTTTTT)n (see below). Thus, STRs can be accurately distinguished from one another using only 50-bp flanking sequences, and not the DNA repeated motif, even in the case of complementary STRs, such as (AC)n and (GT)n (Fig. 5b).
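Two steps of this pairwise classification lend themselves to a short sketch: masking the bases immediately downstream of the STR 3’ end, and scoring a classifier with the AUC. Both helpers below are hypothetical; in particular, replacing masked bases with "N" is an assumption about how the masking could be encoded, and the AUC is computed from its pairwise (Mann-Whitney) definition rather than with a library call.

```python
def mask_downstream(seq, str_end, n_masked=7):
    """Replace the n_masked bases starting at the STR 3' end with 'N'."""
    masked_end = min(len(seq), str_end + n_masked)
    return seq[:str_end] + "N" * (masked_end - str_end) + seq[masked_end:]


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5).

    Quadratic in the number of examples; adequate for an illustration.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical): label 1 = one STR class, label 0 = the other.
masked = mask_downstream("ACACACGGGTTTTCCC", str_end=6)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])  # 3 of 4 pairs ranked correctly
```

An AUC of 0.5 corresponds to random guessing, which is why values well above 0.7 for most class pairs indicate that flanking sequences alone carry class-specific information.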
Deep learning models unveil the key role of STR flanking sequences
To further probe the sequence-level features for transcription initiation at STRs, we decided to build a model for each STR class with >5000 elements (n = 47). Here, CNN is again used in a regression task to predict the CAGE signal. Sequences spanning 50 bp around the 3’ end of each STR were used as input. Longer sequences were tested without improving the accuracy of the model (Supplementary Fig. 8). These class-specific models achieved overall better performances than the global model tested on each STR class separately (Fig. 5a and Supplementary Fig. 9). The only exceptions were classes composed of repetitions of T ((GTTTTT)n, (GTTT)n, and (CTTTT)n). In these cases, global and (T)n-specific models achieved better performance than (GTTTTT)n, (GTTT)n, or (CTTTT)n-specific models. These results have two explanations: (i) compared to (T)n, these classes have fewer occurrences (18,707 for (GTTTTT)n, 55,898 for (GTTT)n and 15,433 for (CTTTT)n), making it hard to learn models for these classes and (ii) the classification AUCs to distinguish (GTTTTT)n, (GTTT)n or (CTTTT)n from (T)n were among the lowest observed (Fig. 5b), suggesting the existence of common sequence features that can be used by global and (T)n-specific models. Overall, we estimated that STR class-specific models were accurate for 14 STR classes (ρ > 0.65).
We anticipated that class-specific models should not be equivalent and could not be interchangeable. We formally tested this hypothesis by measuring the accuracy of a model learned on one STR class and tested on another one (Fig. 5c). We caution again the fact that the performance of an STR-specific model also depends on the number of sequences available for learning. As observed earlier, the best accuracy is obtained with (T)n, which are overrepresented in our catalog. Overall, the performance of one model tested on another STR class drastically decreases (Fig. 5c), revealing the existence of STR class-specific features predictive of transcription initiation. We also noticed that several models achieved non-negligible performances on other STR classes (Spearman ρ > 0.5, Fig. 5c), implying that some features governing transcription initiation at STRs are conserved between these STR classes. Thus, CNN models identified both common and specific features able to predict transcription initiation at STRs.
Our results unveil the importance of STR flanking sequences. We then evaluated the contribution of the surrounding sequences alone in transcription initiation prediction and built models considering only these sequences (50 bp upstream and downstream of the STR, masking the STR itself, Fig. 5e). These models were less accurate than the former ones, but accuracies were still high for several classes (Fig. 5d), confirming that surrounding sequences contain features for transcription initiation prediction. The observed decrease in accuracies (Fig. 5d) implies that the STR itself contains features, which are combined with others present in flanking regions to predict transcription initiation. Recall that the CAGE signal predicted by our CNN models is normalized by the length of the STR (see above), which makes them unable to assess the contribution of STR length in transcription initiation.
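One possible way to build the flank-only input described above is sketched here. How the masked STR is actually encoded is not specified in this section, so replacing it with "N" and one-hot encoding "N" as an all-zero vector are assumptions, as are the helper names `flank_only` and `one_hot`.

```python
BASES = "ACGT"


def flank_only(seq, str_start, str_end, flank=50):
    """Keep `flank` bp on each side of the STR and mask the STR itself with 'N'.

    Coordinates are 0-based, end-exclusive indices into `seq`.
    """
    left = seq[max(0, str_start - flank):str_start]
    right = seq[str_end:str_end + flank]
    return left + "N" * (str_end - str_start) + right


def one_hot(seq):
    """One row per base in A, C, G, T order; any other symbol becomes all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


# Toy example (hypothetical): a (AC)3 repeat with short flanks, flank=3 for brevity.
sequence = "GGCC" + "ACACAC" + "TTAA"
masked_input = flank_only(sequence, 4, 10, flank=3)
encoded = one_hot(masked_input)
```

With an all-zero encoding, the convolutional filters receive no signal from the repeat, so any remaining predictive power must come from the flanks.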
Several sequence-level features predicting transcription initiation at STRs are conserved between human and mouse
To test whether transcription at STRs is biologically relevant, we relied on two criteria: conservation and association with diseases. First, we studied conservation in mouse.
The number of loci within each STR class differs in mouse and human HipSTR catalogs (Figs. 1b and 6a and Supplementary Fig. 10). We applied the strategy used in human to compute the CAGE signal (as mean raw tag count in STR ± 5 bp divided by STR length + 10 bp) in mouse using 397 CAGE libraries (Fig. 6b). As observed in human, several STR classes were associated with CAGE signal. This signal appears lower than in human (compare Figs. 1d and 6b). This might be due to the fact that mouse CAGE data are smaller in scale than human CAGE data2, in terms of both the number of mapped reads and the diversity of CAGE libraries, making the mouse CAGE signal at STRs probably less accurate than the human one.
We nonetheless tested the correlation of the human and mouse CAGE signals at orthologous STRs. Orthologous STRs were identified by converting the mouse STR coordinates into human coordinates with the UCSC liftover tool (see “Methods”). We intersected the coordinates of human STRs with those of orthologous mouse STRs and computed the Pearson correlation between the CAGE signal observed in human and that observed in mouse on the same strand (n = 18,072). In that case, Pearson’s r reaches ~0.87 (Spearman ρ ~ 0.51), suggesting that transcription at STRs is indeed conserved between mouse and human. As expected, no correlation was observed (r < 0.01) when randomly shuffling one of the two vectors or when correlating the signals of 18,072 randomly chosen mouse and human STRs.
We then built a CNN model to predict the CAGE signal at mouse STR classes corresponding to the 14 classes shown in Fig. 5a (Fig. 6c, green dots). The performances of the models ranged from ~0.4 to ~0.8, demonstrating that, as observed for human STRs, transcription at several mouse STR classes can be predicted by sequence-level features. A notable exception is (CTTTT)n with Spearman ρ < 0.2 (see below). The mouse models were overall less accurate than human models (Fig. 6c, compare red and green dots), likely due to differences in the quality of the CAGE signal (i.e., predicted variable), as mentioned above.
We then tested whether the sequence features able to predict STR transcription initiation were conserved between mouse and human. We specifically tested the performances of models learned in one species and tested on another one (Fig. 6c, blue dots and Supplementary Fig. 11). For all STR classes tested, the Spearman correlation between the signal predicted by the human model and the observed mouse signal was >0.4 (Fig. 6c), implying that several features are conserved between human and mouse. For some classes (e.g., (A)n, (AC)n, (AAAT)n), the human and mouse models even appeared equally efficient in predicting transcription initiation in mouse (Fig. 6c, green and blue dots are close), indicative of strong conservation of predictive features. For other classes (e.g., (CT)n, (AGG)n), the performance of the human model was lower than that obtained with the mouse model when tested on mouse data (Fig. 6c, green and blue dots are distant). Thus, specific features also exist in mouse that were not learned in human sequences. Likewise, human-specific features also exist (Supplementary Fig. 11). In the case of (CTTTT)n, the human model performs better than the mouse one (Fig. 6c). This effect is likely due to the number of examples, which is higher in human (n = 15,433) than in mouse (n = 10,494). Overall, we conclude that several features predictive of transcription initiation at STRs are conserved between human and mouse and that the level of conservation also varies depending on STR classes.
ClinVar pathogenic variants are found at STRs with high transcription initiation level
Second, we evaluated the potential implication of transcription initiation at STRs in human diseases and used the ClinVar database, which lists medically important variants46. We found that STRs harboring ClinVar variants, located in a window encompassing STR ± 50 bp (n = 34,578), are associated with high CAGE signal compared to STRs without variants (n = 3,068,280, Fig. 7a), indicative of potential biological and clinical relevance for transcription initiation at STRs. Looking at the clinical significance of the variants, as defined in the ClinVar database, we indeed noticed that STRs associated with pathogenic variants exhibit stronger transcription initiation than STRs associated with other variants (Fig. 7b and Supplementary Fig. 12). STRs could be associated with more or fewer variants linked to a given disease than expected by chance (adjusted P value < 5e-3, Supplementary Data 2) but no clear association with a specific clinical trait was noticed.
We initially sought to identify representations of sequence motifs captured by CNN first layer filters using a strategy inspired by Maslova et al.47 and identified several influential first layers correlating with JASPAR PWM scores (see “Methods” section and Supplementary Tables provided here at https://gite.lirmm.fr/ibc/deepSTR//first_layer_interpretation). However, it is important to remember that our models were optimized to predict CAGE signal, not to learn interpretable representations from input DNA sequences. Koo and Eddy have indeed demonstrated that tackling these two questions (prediction and interpretation) requires distinct CNN architectures, in particular adapting max-pooling and convolutional filter size48. At present, our models likely learn partial motifs and do not limit the ability to learn full interpretable motifs in deeper layers. We then used a perturbation-based approach49 and randomly created in silico mutations to identify key positions of the models (see “Methods” section). Random variations were directly introduced into STR sequences, and predictions were made on these mutated sequences using the CNN model specific to the STR class considered. The impact of the variation was then assessed as the difference between the predictions obtained with mutated and reference sequences. The same analyses were performed with ClinVar variants (Fig. 7c and Supplementary Fig. 13). Key positions were defined as positions that, when mutated, have a strong impact on the prediction changes (i.e., high variance), being either positive or negative. As shown in Fig. 7c, for both random and ClinVar variants, the most important positions appeared located around STR 3’ end (−15 bp/+30 bp) and their distribution is skewed toward the sense orientation of the transcripts. Strikingly, a significant proportion of ClinVar variants are located in the immediate vicinity of the STR 3’ end (Fig. 7d).
Hence, the most important positions identified by our models correspond to positions with high occurrences of ClinVar variants (Fig. 7c, d). However, neither the distribution nor the impact of variants appears linked to their pathogenicity because similar results are observed for both benign and pathogenic variants (Supplementary Fig. 14). Note that ClinVar variants are also concentrated around assigned CAGE peak summits and all identified CAGE peak summits (Supplementary Fig. 15). Overall, we conclude that the pathogenicity of ClinVar variants appears to be linked to the transcription initiation level at the targeted STR rather than to the position of the variation or its impact on prediction.
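The perturbation-based analysis described above can be sketched as follows. The `predict` argument is a stand-in for a trained class-specific CNN (here replaced by a hypothetical toy function), positions are plain sequence indices rather than offsets from the STR 3’ end, and the number of mutations per sequence is an arbitrary choice for illustration.

```python
import random
from statistics import pvariance

BASES = "ACGT"


def mutate(seq, pos, alt):
    """Return seq with the base at `pos` replaced by `alt`."""
    return seq[:pos] + alt + seq[pos + 1:]


def position_impact(seqs, predict, n_mut_per_seq=20, seed=0):
    """Variance, per position, of prediction(mutated) - prediction(reference)."""
    rng = random.Random(seed)
    deltas = {}
    for seq in seqs:
        reference_prediction = predict(seq)
        for _ in range(n_mut_per_seq):
            pos = rng.randrange(len(seq))
            alt = rng.choice([b for b in BASES if b != seq[pos]])
            delta = predict(mutate(seq, pos, alt)) - reference_prediction
            deltas.setdefault(pos, []).append(delta)
    # High variance marks positions where mutations shift predictions strongly,
    # in either direction.
    return {pos: pvariance(d) for pos, d in deltas.items() if len(d) > 1}


def toy_predict(seq):
    """Hypothetical model: the signal depends only on the base at index 2."""
    return 1.0 if seq[2] == "G" else 0.0


toy_sequences = ["ACGTA", "ACATA"] * 10
impact = position_impact(toy_sequences, toy_predict, n_mut_per_seq=50)
key_position = max(impact, key=impact.get)
```

The same loop applies to a fixed list of known variants (for instance ClinVar entries) by replacing the random draw of `pos` and `alt` with the catalogued substitution.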
Finally, as machine-learning approaches only unveil correlation between predictive and predicted features, not direct causation, we sought to determine whether the features learned by our models correspond to sequence-level instructions for transcription initiation. We looked for gene TSSs located at STRs and harboring variants acting as eQTLs for the corresponding genes, in a scenario similar to that described by Bertuzzi et al. in the case of a minisatellite and the NPRL3 gene20. Gene expression is considered here as a proxy for the measure of transcription initiation at STRs. In that scenario, if our models capture instructions for expression, the difference of the predictions made by our models for the reference and the alternative alleles should have the same sign as the eQTL slope (i.e., gene expression increase (slope > 0) or decrease (slope < 0)) more often than expected by chance. First, to identify STRs potentially acting as TSSs, we selected STRs located in gene promoters (considering 1 kb around FANTOM CAT gene start). We only considered models with accuracy >0.7 (Fig. 5c). Second, based on our results depicted in Fig. 7c, we selected GTEx eQTLs located in a −15-bp/+30-bp window around STR 3’ end and linked to the expression of the genes associated with STRs in the first step. These selections yielded 86 cases of STR sequence variations linked to gene expression by eQTL. Of note, we first thought to use FANTOM CAT transcript TSSs directly, instead of gene TSSs, but only one case was identified with prediction error (measured as the absolute value of the difference between the predicted and the observed CAGE signals) < 0.2. The alternative alleles corresponding to the selected eQTLs were inserted into their cognate STR sequences and a prediction was made for this modified sequence. The sign of the difference between the two predictions (alternative - reference) was compared to the sign of the eQTL slope. 
We counted the number of times these signs were identical or different (Supplementary Fig. 16). The prediction errors of the models for these 86 STRs were also computed in the case of the reference genome (Supplementary Fig. 16). As shown in Supplementary Fig. 17, when predictions are accurate on the reference genome (error ≤ 0.2), the models are able to predict the impact of variants on expression, i.e., in most cases, the sign of the difference between the predictions made with the alternative and reference alleles is the same as that of the eQTL slope. Importantly, this is no longer observed when the models perform poorly (error > 0.2). Binomial tests were used to statistically assess the relevance of these findings. Thus, when accurate, our models are able to predict the effects of eQTLs, supporting a causal relationship between the predictive and the predicted variables rather than a mere correlation.
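The sign comparison and its statistical assessment can be illustrated with a short sketch. The helper names are hypothetical, the values are toy numbers, and the one-sided form of the binomial test with a null probability of 0.5 is an assumption, since the exact test configuration is not stated in this section.

```python
from math import comb


def sign(x):
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)


def concordance(pred_ref, pred_alt, slopes):
    """Count variants where sign(alternative - reference) matches the eQTL slope sign."""
    same = sum(
        1
        for ref, alt, slope in zip(pred_ref, pred_alt, slopes)
        if sign(slope) != 0 and sign(alt - ref) == sign(slope)
    )
    return same, len(slopes)


def binom_test_greater(k, n, p=0.5):
    """One-sided binomial test: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy example (hypothetical predictions and eQTL slopes for four variants).
pred_ref = [0.1, 0.5, 0.3, 0.2]
pred_alt = [0.3, 0.4, 0.6, 0.1]
slopes = [1.2, -0.4, 0.8, 0.5]

n_same, n_total = concordance(pred_ref, pred_alt, slopes)
p_value = binom_test_greater(n_same, n_total)
```

Under the null hypothesis that the models carry no information about expression, matching signs should occur about half of the time, which is what the binomial test evaluates.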
Discussion
We report here the discovery of widespread transcription initiation at STRs in human and mouse. These results extend previous findings30–33 and reveal that, in addition to being the passenger of host RNAs initiating at their own TSSs30–33, STRs can also initiate the transcription of distinct and autonomous RNAs. The next main issue is to determine the role(s) of these transcripts. RNA species can be functionally categorized according to transcriptional directionality39. In the case of STRs, transcription directionality appears to depend on the STR class (Fig. 4b). It is thus likely that RNAs initiating at STRs fulfill distinct functions and many hypotheses could be proposed at this stage. For instance, 10,727 CAGE peaks mapped at STRs correspond to TSSs of FANTOM CAT transcripts (Supplementary Data 1), extending the findings made by Bertuzzi et al. in the case of a minisatellite and the NPRL3 gene20 to STRs. Many RNAs initiating at STRs may also correspond to noncoding RNAs, as for instance enhancer RNAs (Supplementary Data 1). As could have been anticipated given the distinction of enhancers and promoters based on CpG dinucleotide50, FANTOM CAT transcripts mostly initiate at GC-rich STRs, while enhancer RNAs more often correspond to A/T-rich STRs (Supplementary Data 1). Another possible function is provided by (T)n, which are overrepresented in eukaryotic genomes51 and have been shown to act as promoter elements by depleting repressive nucleosomes52. As a consequence, (T)n can increase transcription of reporter genes in similar levels to TF-binding sites53. The findings that (A)n and (T)n represent distinct directional signals for nucleosome removal54 are very well compatible with differences observed in flanking sequences (Fig. 5b) and directional transcription (Fig. 4b), both able to create asymmetry at (A)n and (T)n. Besides, we show that most CAGE tags initiating at STRs remain nuclear (Fig. 4a). 
This observation suggests that, similar to other repeat-initiating RNAs55,56, RNAs initiating at STRs could also play roles at the nuclear/chromatin levels, for instance in DNA topology56,57. Note that we also calculated the enrichment of STR classes in FANTOM CAT biotypes (Supplementary Data 3). The strongest enrichments correspond to (A)n, (AT)n, and (AAAT)n at enhancers, which are known to be GC-poor sequences compared to promoters for instance50. It also remains to clarify whether STR-associated RNAs or the act of transcription per se is functionally important10. Dedicated experiments are now required to formally identify the biological functions linked to the transcription of each STR class. These experiments are all the more warranted as STR transcription is associated with clinically relevant genomic variations (Fig. 7).
One key finding of our study is the discovery that STR flanking sequences are not inert but rather contain important features that play critical roles in their biology, as previously suspected45. These results call for the development of novel methods able to take these sequences into account in order to revisit STR mapping/genotyping and integrate SNVs located in STR vicinity. These methods should have broad applications in various fields of research and medicine, from forensic medicine to population genetics for instance. STR length variations have notably been shown to influence gene expression and, similar to eQTLs, several eSTRs have been identified58,59. Their exact mode of action remains largely elusive, but the majority of eSTRs appear to act by global mechanisms, in a tissue-agnostic manner58. Interestingly, some eSTRs have strand-specific effects58, which is again compatible with the possible sources of asymmetry unveiled by our study (i.e., flanking sequences and directional transcription). Using transcription initiation level at STRs, as predicted by our CNN models for instance, coupled with length variations58,59, may help to take into account the impact of genetic variants located in sequences surrounding STRs60, and to refine eSTR computations. Results depicted in Supplementary Figs. S16 and S17 show that CNN models can indeed refine eSTR computations by simply re-assigning eQTLs as eSTRs.
There are still several ways to improve our CNN models. Notably, to avoid any bias linked to the CAGE noise signal observed along STRs, we decided to predict a signal normalized by the STR length. Therefore, our models do not allow a proper assessment of the contribution of STR length to transcription, although it clearly represents the most studied feature of STRs21,58,59. Note that simply increasing the quality of the reads considered (using a Q20 instead of a Q3 filter) yields sparse data and decreases the performance of our model. A new computation of the CAGE signal aimed at removing “noise” at STRs could be developed. This may also help develop tissue-specific CNN models, which will only use CAGE data44. Besides, the same architecture was used for all STR classes while achieving different accuracies (Fig. 5a, c). These results cannot be merely explained by the number of STR sequences available for training because swapping the models for training and testing demonstrated the existence of STR class-specific features predictive of transcription initiation (Fig. 5c). It is rather possible that the chosen architecture may not be optimal for all STRs, as illustrated by the design of a global model with overall good performance, but very distinct accuracies depending on the STR class (Fig. 5a). Our CNN architecture was initially optimized on the (T)n class, which represents the most abundant class (n = 766,747). Because each STR class harbors sequence specificities, including in flanking sequences, hyperparameters such as convolutional filter sizes, their number, and/or max-pooling could be adapted to each STR class. These hyperparameters have indeed already been shown to influence the results of CNN models as well as their interpretation48.
More broadly, the same rationale could be applied to other methods aimed at predicting CAGE signal along the genome44, distinguishing biological entities (genes, enhancers, …), genomic segments61,62, and/or isochores63 based on their sequence features. Building a general model increases the risk of designing a model suited for the most represented elements, not for the others. Notably, promoters and enhancers can be distinguished by different CpG content, the presence of polyA signal and of 5’ splice sites40,50, as well as different transcription factor combinations3,64. It is therefore likely that the same filters will not apply similarly to predict transcription in both cases and that one may want to develop a specific model for each of these entities to increase the accuracy of the predictions.
The prediction of transcription initiation based solely on sequence features has long been studied, especially using CAGE data65,66. The high accuracy achieved by CNN models for this task, as illustrated in this study or in refs. 43,44,47, as well as the development of methods aimed at interpreting this type of statistical models48,49,67,68, will certainly accelerate the achievement of this goal, which becomes more than ever “a realistic short-term objective rather than a distant aspiration”66.
Methods
Data and bioinformatic analyses
The bedtools window69 was used to look for CAGE peaks (coordinates available at http://fantom.gsc.riken.jp/5/datafiles/phase1.3/extra/CAGE_peaks/hg19.cage_peak_coord_permissive.bed.gz) at STRs ± 5bp (catalog available at https://github.com/HipSTR-Tool/HipSTR-references/raw/master/human/hg19.hipstr_reference.bed.gz) as follows:
windowBed -w 5 -a hg19.hipstr_reference.bed -b hg19.cage_peak_coord_permissive.bed
As a comparison, random intervals were generated using bedtools shuffle69.
shuffleBed -i hg19.hipstr_reference.bed -g hg19.chrom.sizes -excl hg19.hipstr_reference.bed -seed 927442958 > hg19.hipstr_reference.shuffled.bed
Similar analyses were performed using the mouse STR catalog (available at https://github.com/HipSTR-Tool/HipSTR-references/blob/master/mouse/mm10.hipstr_reference.bed.gz) lifted over to mm9 using the UCSC liftOver tool70:
liftOver mm10.hipstr_reference.bed mm10ToMm9.over.chain.gz mm9.hipstr_reference.bed unlifted.bed
To compute the CAGE signal, we used raw tag count along the genome with a 1-bp binning and Q3 quality mapping filter. At each position of the genome, the mean tag count across 988 libraries for human and 387 for mouse was computed. The values obtained at each position of a window encompassing the STR ± 5 bp were then summed and normalized (i.e., divided by the STR length + 10 bp) to limit the impact of the CAGE noise signal observed along STRs. CAGE signals at human and mouse STRs are available at https://gite.lirmm.fr/ibc/deepSTR, as, respectively, hg19.hipstr_reference.cage.bed and mm9.hipstr_reference.cage.bed (The CAGE signal is indicated in the 5th column). The fasta files (500 bp around STR 3’ end) used to build our models are also available at the same location as hg19.hipstr_reference.cage.500bp.around3end.fa and mm9.hipstr_reference.cage.500bp.around3end.fa. CNN models use as input 101-bp-long sequences centered around STR 3’ ends.
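As an illustration of the normalization described above, the following sketch averages per-position tag counts across libraries and then computes the length-normalized signal for one STR. It is a toy version under stated assumptions: function names are hypothetical, counts are held in plain Python lists indexed by 0-based position, and STR coordinates are treated as 0-based, half-open intervals as in BED files.

```python
def mean_tag_counts(libraries):
    """Average raw tag counts position by position across libraries.
    `libraries` is a list of equally long lists, one per CAGE library."""
    n_libraries = len(libraries)
    return [sum(counts) / n_libraries for counts in zip(*libraries)]


def str_cage_signal(mean_counts, str_start, str_end, flank=5):
    """Sum mean counts over the STR +/- `flank` bp and divide by
    (STR length + 2 * flank), i.e. STR length + 10 bp when flank = 5."""
    window_start = max(0, str_start - flank)
    window_end = min(len(mean_counts), str_end + flank)
    total = sum(mean_counts[window_start:window_end])
    return total / ((str_end - str_start) + 2 * flank)
```

With two toy libraries and a 10-bp STR, a mean count of 3 inside the window yields a signal of 3 / 20, which shows how dividing by the window length dampens the contribution of scattered tags along long repeats.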
The bedtools intersect69 was used to distinguish intra- and intergenic STRs, intersecting their coordinates with that of the FANTOM gene annotation (available at https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/assembly/lv3_robust/FANTOM_CAT.lv3_robust.bed.gz).
Coordinates of FANTOM CAT robust transcripts and FANTOM enhancers can be found, respectively, at these URLs: transcripts [http://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/assembly/lv3_robust/FANTOM_CAT.lv3_robust.gtf.gz] and enhancers [https://fantom.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/human_permissive_enhancers_phase_1_and_2.bed.gz]. ENCODE RNAPII ChIP-seq bed files can be downloaded following these links: GM12878, H1-hESC [http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsHaibH1hescPol2V0416102UniPk.narrowPeak.gz], HeLa-S3 [http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsHaibHelas3Pol2Pcr1xUniPk.narrowPeak.gz] and K562.
Expression data used to determine the nucleo-cytoplasmic distribution of CAGE peaks can be found at http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/hg19.cage_peak_phase1and2combined_tpm_ann.osc.txt.gz.
Orthologous STRs were identified using UCSC liftover tool70 and the mm9ToHg19.over.chain.gz file.
For eQTLs, we used GTEx V7 data [https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz].
All statistical tests were performed with R (wilcox.test, fisher.test) or Python (scipy.stats.f_oneway, scipy.stats.mannwhitneyu, scipy.stats.kstest), as indicated. When indicated, P values were corrected for multiple testing using R p.adjust (method="fdr").
Evaluating mismatched G bias at Illumina 5’ end CAGE reads
Comparison between Heliscope and Illumina CAGE sequencing was performed as in de Rie et al.38. Briefly, ENCODE CAGE data were downloaded as bam files from the following url [http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeRikenCage/] ('*NucleusPap*' files) and converted into bed files using samtools view71 and UNIX awk:
samtools view file.bam | awk 'BEGIN{FS="\t"; OFS="\t"}{if($2=="0") print $3,$4-1,$4,$10,$13,"+"; else if($2=="16") print $3,$4-1,$4,$10,$13,"-"}' > file.bed
The bedtools intersect69 was further used to identify all CAGE tags mapping to a given position. The UNIX awk command was used to count the number and type of mismatches:
intersectBed -a positions_of_interest.bed -b file.bed -wa -wb -s | awk '{if(substr($11,1,6)=="MD:Z:0" && $6=="+") print substr($10,1,1)}' | grep -c "N"
with N = {A, C, G or T}, positions_of_interest.bed being coordinates of CAGE peaks assigned to genes, or those located at pre-miRNA 3’ ends, or peaks associated with STRs. The file.bed corresponds to the Illumina CAGE tag coordinates.
The absence of mismatch focusing on the plus strand was counted as:
intersectBed -a positions_of_interest.bed -b file.bed -wa -wb -s | awk '{if(substr($11,1,6)!="MD:Z:0" && $6=="+") print $0}' | wc -l
As a control, we used the 3’ end of the pre-miRNAs, which were defined, as in de Rie et al.38, as the 3’ nucleotide of the mature miRNA on the 3’ arm of the pre-miRNA (miRBase V21 [ftp://mirbase.org/pub/mirbase/21/genomes/hsa.gff3]), the expected Drosha cleavage site being immediately downstream of this nucleotide (pre-miR end + 1 base).
Cap-Trapping MinION sequencing
A549 cells were grown in Dulbecco’s modified Eagle medium (DMEM) supplemented with 10% fetal bovine serum (FBS). A549 cells were washed with PBS. The RNAs were isolated by using RNeasy kit (QIAGEN). The poly-A tail addition to A549 total RNA was carried out by poly-A polymerase (PAPed RNA). The cDNA synthesis was carried out by using 5 μg of total RNA or 1 μg of PAPed RNA with RT primer (5’-TTTTTTTTUUUTTTTTVN-3’) by PrimeScript II Reverse Transcriptase (TaKaRa Bio). The full-length cDNAs were selected by the Cap Trapper method72. After the ligation of 5’ linker, cDNAs were treated with USER enzyme to shorten the poly-T derived from RT primer. After SAP treatment, a 3’ linker was ligated to the cDNAs. The linkers used in the library preparation were prepared as in ref. 72 with oligos provided in Supplementary Table 1. As for the 3’ linker, after the annealing step, the UMI complementary region (BBBBBBBB) was filled with Phusion High-Fidelity DNA polymerase (NEB) and dVTPs (dATP/dGTP/dCTP) instead of dNTPs. The second strand was synthesized using a second primer with KAPA HiFi HS mix (KAPA Biosystems). The double-stranded cDNAs were amplified using Illumina adapter-specific primers and LongAmp Taq DNA polymerase (NEB). After 16 cycles of PCR (8 min for elongation time), amplified cDNAs were purified with an equal volume of AMPure XP beads (Beckman Coulter). Purified cDNAs were subjected to Nanopore sequencing library preparation following the manufacturer’s 1D ligation sequencing protocol (version NBE_9006_v103_revO_21Dec2016).
Nanopore libraries were sequenced by MinION Mk1b with R9.4 flowcell. Sequence data were generated by MinKNOW 1.7.14. Basecalling was performed with the Albacore v2.1.0 basecaller software provided by Oxford Nanopore Technologies to generate fastq files from FAST5 files. To prepare clean reads from fastq files, adapter sequences were trimmed by Porechop v0.2.3. Data were deposited in the DNA Data Bank of Japan Sequencing Read Archive (accession number: DRA010491). The mapping computational pipeline used a prototype of primer-chop available at https://gitlab.com/mcfrith/primer-chop. The precise methods and command lines are provided as Supplementary Methods. Data were first mapped on the hg38 reference genome and lifted over to hg19 for analyses.
Directionality score
We collected CAGE signal at each STR of the HipSTR catalog (see above). When a signal was detected on both (+) and (−) strands, we computed the directionality score for each STR using the following formula:
Directionality score = (CAGE signal on the (+) strand − CAGE signal on the (−) strand) / (CAGE signal on the (+) strand + CAGE signal on the (−) strand)
The CAGE signal was computed as explained above. A score equal to 1 or −1 indicates that transcription is strictly oriented towards the (+) or (−) strand, respectively. A score close to 0 indicates that the transcription is balanced and that it occurs equally on the (+) and (−) strands.
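A minimal sketch of such a score is given below. It assumes the usual normalized difference between the two strand signals, which is consistent with the stated properties (1 or −1 for strictly stranded transcription, 0 for balanced transcription); the function name and the handling of a zero total are choices made for this illustration.

```python
def directionality_score(signal_plus, signal_minus):
    """Normalized difference between (+) and (-) strand CAGE signals.
    Returns None when no signal is detected on either strand."""
    total = signal_plus + signal_minus
    if total == 0:
        return None  # score undefined without any signal
    return (signal_plus - signal_minus) / total
```

Applied to the two strand-specific signals of a given STR, the function returns a value between −1 and 1, positive when the (+) strand dominates.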
U1 PWM was built using MEME73 and sequences encompassing −3/+10 bp around FANTOM CAT 5’ donor splice sites (exon 3’ end). We then used this PWM and FIMO74 to scan 2-kb regions centered around the 3’ ends of (T)n STRs (considering the top 50,000 sequences with the highest CAGE signal) and FANTOM CAT TSSs. For polyA sites, we used the UCSC track corresponding to the predictions made by Cheng et al.75 as a bed file and used it in bedtools intersect69 to look at polyA site distribution in regions encompassing 1 kb around (T)n 3’ ends (top 50,000 with the highest CAGE signal) and FANTOM CAT TSSs.
Convolutional neural network
CNN architecture is described in Supplementary Fig. 7. To build a CNN, we needed aligned sequences of equal length. However, as shown in Supplementary Fig. S1, CAGE peaks are scattered along STRs. We thus decided to align the sequences on STR 3’ ends, as defined by the CAGE data. HipSTR indeed provides a catalog built on the (+) strand but CAGE data are stranded data (see Fig. 1a). CAGE data thus make it possible to orient each STR of the HipSTR catalog, as exemplified here:
**HipSTR catalog (see hg19.hipstr_reference.bed):
chr1 10001 10468 6 78 Human_STR_1 AACCCT
**Same STR with CAGE data (see hg19.hipstr_reference.cage.bed made available at https://gite.lirmm.fr/ibc/deepSTR)
chr1 10001 10468 Human_STR_1; AACCCT; + 0.410901 +
chr1 10001 10468 Human_STR_1; AACCCT; − 0.354298 −
It is then possible to determine the 3’ end of each STR according to the strand considered (here 10468 on the (+) strand and 10002 on the (−) strand). This procedure almost doubles the number of elements in each class.
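The strand-dependent choice of the 3’ end can be sketched as below. The convention is inferred from the example above (3’ end at 10468 on the (+) strand and 10002 on the (−) strand for the interval 10001 to 10468), i.e. a 0-based start and a 1-based reported position; both function names are hypothetical, and the window helper simply illustrates a 101-bp span centered on the 3’ end.

```python
def str_three_prime_end(start, end, strand):
    """Return the 1-based position of the STR 3' end for a given strand,
    assuming a BED-like interval with a 0-based start."""
    if strand == "+":
        return end
    if strand == "-":
        return start + 1
    raise ValueError("strand must be '+' or '-'")


def window_around(position, flank=50):
    """1-based inclusive window of 2 * flank + 1 bp centered on `position`."""
    return position - flank, position + flank
```

Because every catalog entry can be oriented on either strand, each STR yields up to two aligned sequences, which is why the number of elements per class almost doubles.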
Sequences spanning 50 bp around the 3’ end of each STR were used as input unless otherwise stated (see Fig. 5e). Longer sequences were tested without improving the accuracy of the model (Supplementary Fig. 8). Note that only 89,189 STRs (out of 1,620,030, ~5.5%) are longer than 50 bp and, only in these few cases, the sequence located upstream STR 3’ end only corresponds to the STR itself. The parameters of the model were determined by brute force algorithms using a grid search approach. This approach makes a complete search over all hyperparameters (number of layers, number of neurons, activation functions, different learning rates, shape of convolutional kernels, number of convolutional filters, …). The grid search algorithm trains and tests all possible models with all combinations of parameters and returns the most accurate model. The model was implemented in PyTorch. The source code of the model, alongside scripts and Jupyter notebooks are available at https://gite.lirmm.fr/ibc/deepSTR.
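The exhaustive search described above can be illustrated with a generic sketch. The hyperparameter names and the scoring function are placeholders chosen for this example, not the grid actually explored; in practice `evaluate` would train a model with the given settings and return its accuracy on validation data.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination of hyperparameter values and return the
    best-scoring combination together with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. validation accuracy of a trained model
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Hypothetical stand-in for model training: peaks at 50 filters, size 5."""
    return -abs(params["n_filters"] - 50) - abs(params["kernel_size"] - 5)
```

Calling `grid_search({"n_filters": [10, 50, 100], "kernel_size": [3, 5, 7]}, toy_evaluate)` evaluates the nine combinations. The cost grows multiplicatively with each added hyperparameter, which is the main drawback of this brute-force strategy.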
In order to minimize overfitting, dropout is added to the fully connected layers (dropout probability = 0.30). The training pipeline is described in Supplementary Fig. 7: we separate training, testing, and validation datasets prior to model training, and these sets are stored on disk. This allows us to carry out analyses on held-out data that have never been seen by the models. We stop the training once the loss function calculated on the validation set has failed to improve for five consecutive epochs (early stopping). Relatively good performances on mouse datasets (Fig. 6c) show that the model generalizes well to unknown CAGE data. Our models were optimized to predict CAGE signal and cannot, as such, be applied to other types of data. However, the methodology used here is generic and could be applied to other types of data as long as one can associate a numeric signal to a specific genomic region.
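Early stopping with a patience of five epochs can be sketched as follows. This is a common patience-based formulation, assumed here for illustration: training stops when the validation loss has not improved on its best value for `patience` consecutive epochs. The function name and the `min_delta` parameter are not taken from the study.

```python
def early_stopping_epoch(validation_losses, patience=5, min_delta=0.0):
    """Return the index of the epoch at which training would stop,
    or None if the stopping criterion is never met."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

In a training loop, the model weights saved at the best validation loss would typically be the ones retained for evaluation on the test set.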
To make sure that our models do not overfit due, for instance, to homologous sequences present in both train and test sets, we used BLASTn76 to look for homology between (T)n sequences of the test and train sets. The model learned on (T)n STRs was used because it is the most accurate and therefore the most likely to overfit. We found 102,209 sequences from the test set with >60% query cover and >80% identity with at least one sequence of the train set. We separated these sequences (test set #1, homologous sequences) from the rest of the test set (test set #2, 121,808 nonhomologous sequences). We then computed Spearman correlations between the predicted and the observed CAGE signals using these two test sets: 0.73 with test set #1 and 0.78 with test set #2. In both cases, correlations decreased, as compared to the correlation computed with the whole test set (0.84). This decrease is due to differences in CAGE signal distribution between the whole test set, test set #1 and test set #2 (Supplementary Fig. 18), likely linked to mapping issues. However, model performance measured on test set #2 was greater than that obtained with test set #1. This is in contrast to what is expected in the case of model overfitting due to sequence homology. We therefore concluded that the homology observed between train and test sets is not sufficient to make the model overfit.
For comparison to the baseline model, we computed the correlation between the observed CAGE signal and randomized CAGE signal (equivalent to a predictor that returns a random value drawn from observed values). Randomization was repeated ten times and Spearman correlation was invariably close to 0 (absolute value (ρ) < 5e-4).
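The randomized baseline can be sketched with a small, dependency-free Spearman implementation. This is an illustration only: the helper names, the seed, and the toy data are assumptions, and a library routine such as scipy.stats.spearmanr would normally be preferred on real data.

```python
import random


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def random_baseline(observed, n_repeats=10, seed=0):
    """Correlate observed values with shuffled copies of themselves,
    mimicking a predictor that returns random observed values."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_repeats):
        shuffled = list(observed)
        rng.shuffle(shuffled)
        correlations.append(spearman(observed, shuffled))
    return correlations
```

On large datasets the correlations returned by `random_baseline` concentrate around 0, which gives a reference point against which model correlations can be read.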
The models are provided at https://gite.lirmm.fr/ibc/deepSTR. They can be used to predict transcription initiation level at STRs using a fasta file. Likewise, impact of genetic variations can be assessed by comparing the predictions obtained for instance with reference and mutated sequences (see Fig. 7 and Supplementary Fig. 17).
Classification
The CNN model can also be set up for a classification task (Fig. 5b and Supplementary Fig. 7). In that case, the only difference with the regression model is the last neuron in the last fully connected layer. The classifier CNN uses the same training method. The data are also prepared by separate scripts before training is done and stored on disk. All analyses resulting from the classification are performed on the test sets to avoid optimistic bias in accuracy estimation. Note that 7 bp downstream STR 3’ end were masked and replaced by Ns (Fig. 5e) because we noticed that this window can contain bases corresponding to the DNA repeat motif, a feature that can easily be learned by a CNN. The sequences used as input, for classification using flanking sequences only (Fig. 5d), are centered around STR 3’ end and consist of 50-bp-long upstream sequence + 9 Ns, which mask the STR itself +7 Ns + 43-bp-long downstream sequence (total length = 109 bp, Fig. 5e).
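The layout of the flank-only input can be sketched as below. The way the pieces are cut is an assumption made for illustration: `upstream` is taken to be the 50-bp sequence kept upstream of the masked STR portion and `downstream` the 50 bp immediately following the STR 3’ end, of which the first 7 bp are replaced by Ns. The function name is hypothetical.

```python
def build_flank_only_input(upstream, downstream, str_mask=9, downstream_mask=7):
    """Assemble a 109-character input: 50-bp upstream sequence, 9 Ns masking
    the STR, 7 Ns masking the first downstream bases, then 43 bp downstream."""
    if len(upstream) != 50 or len(downstream) != 50:
        raise ValueError("both flanks are expected to be 50 bp long")
    return (
        upstream
        + "N" * str_mask
        + "N" * downstream_mask
        + downstream[downstream_mask:]
    )
```

Masking with Ns keeps all inputs aligned on the STR 3’ end while removing the repeat motif itself, so that a classifier can only rely on the flanking sequences.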
Model swaps between human STR classes
After models are trained on all STR classes, their weights are stored in a .pt file (following the PyTorch convention). Predictions were then computed on all test sets with all models.
Model interpretation
First, for each of the 14 models presented in Fig. 5, we measured the influence of each first-layer filter by removing the filters one at a time and computing the accuracy of the model (Spearman correlation between observed and predicted CAGE signal) with the 49 remaining filters. We also computed an influence threshold by learning each CNN model ten times and computing a 95% confidence interval (CI). The threshold was calculated as log2(CI length/2). This allows us to focus our analyses on key filters, whose impact on performance is greater than what would have been obtained by chance by simply re-training the model. Influential first-layer filters are then ranked according to their influence. Second, on the one hand, we used FIMO74 to scan 101-bp-long sequences centered around STR 3’ end (considering all STR sequences if n < 10,000 or 10,000 randomly chosen sequences otherwise) with JASPAR PWMs77. For each PWM, we identified a set of STR sequences harboring PWM hits. For each sequence, we kept the maximal PWM score found. On the other hand, we scanned the 10,000 STR sequences with influential first-layer filters as defined in step #1 (using matrix multiplication as in convolution) and kept the maximal value obtained for each sequence. We then computed the correlation between JASPAR PWM scores and first-layer filter scores. We reasoned that if a filter represents a partial PWM, their scores should be correlated. The results of these analyses are provided as Supplementary Tables located on our git repository [https://gite.lirmm.fr/ibc/deepSTR//first_layer_interpretation].
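Scanning a sequence with a first-layer filter and keeping the maximal value can be sketched as follows. This is a toy version under stated assumptions: sequences are one-hot encoded in A, C, G, T order, an N is encoded as an all-zero column, no bias or activation is applied, and the example filter is invented for illustration.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA sequence as a list of 4-element rows (A, C, G, T);
    any other character, such as N, becomes an all-zero row."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence.upper()]


def max_filter_score(sequence, conv_filter):
    """Slide `conv_filter` (one row of 4 weights per position) along the
    sequence, as a convolution would, and return the maximal score."""
    encoded = one_hot(sequence)
    width = len(conv_filter)
    best = float("-inf")
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * conv_filter[i][j]
            for i in range(width)
            for j in range(4)
        )
        best = max(best, score)
    return best


# Hypothetical 3-bp filter that responds to the motif "TTA".
TOY_FILTER = [[0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0]]
```

The per-sequence maxima obtained this way can then be correlated with the maximal PWM scores of the same sequences to check whether a filter resembles a known motif.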
Predicting the impact of ClinVar variants
ClinVar vcf file was downloaded January 8th 2019 from this url [ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/] and then converted into bed file. We looked for STRs associated with ClinVar variants (Fig. 7a) using bedtools window69 as follows:
bedtools window -w 50 -a clinvar_mutation.bed -b str_coordinates.bed
Variants were directly introduced into STR sequences (±50 bp) using the Biopython78 library and the seq.tomutable() function. To keep sequences aligned, we only considered single nucleotide variants (SNVs). CNN models were then used to predict the CAGE signal of the initial and mutated sequences. The change was computed as the difference between the prediction obtained with the mutated sequence and that obtained with the reference sequence. To insert random variations (Fig. 7c, d), we created a mutation position map, which follows a uniform distribution (each position has an equal probability of receiving a mutation). Then, we took sequences in the database and mutated them one by one at a position taken from the mutation map. All possible mutations at the chosen position have an equal probability of occurrence (Fig. 7d).
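The substitution and random-mutation steps can be sketched with plain Python strings (used here instead of Biopython to keep the example self-contained). Function names are hypothetical, positions are 0-based, and `predict` stands for any callable returning a predicted CAGE signal for a sequence; the toy predictor in the usage note is not a trained model.

```python
import random


def apply_snv(sequence, position, alt_base):
    """Return a copy of `sequence` with the base at 0-based `position`
    replaced by `alt_base`; the length is unchanged, so inputs stay aligned."""
    if not 0 <= position < len(sequence):
        raise ValueError("position outside the sequence")
    return sequence[:position] + alt_base + sequence[position + 1:]


def random_snv(sequence, rng):
    """Pick a position uniformly at random, then one of the three
    alternative bases with equal probability."""
    position = rng.randrange(len(sequence))
    alternatives = [b for b in "ACGT" if b != sequence[position].upper()]
    alt_base = rng.choice(alternatives)
    return position, alt_base, apply_snv(sequence, position, alt_base)


def predicted_change(predict, reference, mutated):
    """Difference between predictions for the mutated and reference sequences."""
    return predict(mutated) - predict(reference)
```

For example, with a toy predictor such as `lambda s: s.count("T") / len(s)`, `predicted_change` applied to "ACGT" and `apply_snv("ACGT", 1, "T")` returns a positive change, mimicking a variant predicted to increase the signal.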
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Supplementary information
Acknowledgements
We thank Cédric Notredame, Anthony Mathelier, Oriol Fornes Crespo, Philip Richmond, Jean-Christophe Andrau, Diego Garrido Martin, Dimitri D. Pervouchine, Roderic Guigo, Charles Plessy, and Chung Hon for their help in analyzing the data and for insightful suggestions. We also thank Takahiro Arakawa for the preparation and provision of cell culture samples. We are indebted to the researchers around the globe who generated experimental data and made them freely available. C.-H.L. is grateful to Marc Piechaczyk and Edouard Bertrand for their continued support. The work was supported by funding from CNRS (International Associated Laboratory “miREGEN”), INSERM-ITMO Cancer project “LIONS” BIO2015-04, Plan d’Investissement d’Avenir #ANR-11-BINF-0002 Institut de Biologie Computationnelle (young investigator grant to C-H.L.) and GEM Flagship project funded from Labex NUMEV (ANR-10-LABX-0020). M.G. was supported by a Conventions Industrielles de Formation par la Recherche (CIFRE) PhD fellowship from SANOFI R&D. FANTOM5 was made possible by the following grants: Research Grant for RIKEN Omics Science Center from MEXT to Y.H.; Grant of the Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from the MEXT to Y.H.; Research Grant from MEXT to the RIKEN Center for Life Science Technologies; Research Grant to RIKEN Preventive Medicine and Diagnosis Innovation Program from MEXT to Y.H. This work was further supported by a Research Grant from MEXT to the RIKEN Center for Integrative Medical Sciences.
Author contributions
C.B., M.S., M.G., C.M., W.W.W., M.d.H., L.B., and C.-H.L. analyzed and interpreted the data. M.S. and M.G. developed CNN models and studied the impact of ClinVar variants. J.R., Y.H., A.H., H.S., S.N., and I.M. generated CAGE data used in this study. M.d.H., J.S., and C.-H.L. generated Zenbu tracks. M.d.H. and C.-H.L. studied G bias at ENCODE read 5’ ends. M.T., M.M., M.K.-I., S.N., S.N., T.K., H.N., and M.F. developed CTR-seq and generated data used in this study. Y.H., P.C., C.C., W.W.W., L.B., and C.-H.L. acquired fundings. C.-H.L. wrote the manuscript. All authors have read and approved the manuscript.
Data availability
The data that support this study are available from the corresponding author upon reasonable request. CAGE peaks coordinates [http://fantom.gsc.riken.jp/5/datafiles/phase1.3/extra/CAGE_peaks/hg19.cage_peak_coord_permissive.bed.gz]; human STR catalog [https://github.com/HipSTR-Tool/HipSTR-references/raw/master/human/hg19.hipstr_reference.bed.gz]; mouse STR catalog [https://github.com/HipSTR-Tool/HipSTR-references/blob/master/mouse/mm10.hipstr_reference.bed.gz]; CAGE signals at human and mouse STRs, alongside fasta sequence files, are available on our git repository [https://gite.lirmm.fr/ibc/deepSTR]; FANTOM gene annotation [https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/assembly/lv3_robust/FANTOM_CAT.lv3_robust.bed.gz]; Coordinates of FANTOM CAT robust transcripts [http://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/assembly/lv3_robust/FANTOM_CAT.lv3_robust.gtf.gz] and FANTOM enhancers [https://fantom.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/human_permissive_enhancers_phase_1_and_2.bed.gz]; ENCODE RNAPII ChIP-seq bed files: GM12878 [http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsHaibGm12878Pol2Pcr2xUniPk.narrowPeak.gz], H1-hESC [http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsHaibH1hescPol2V0416102UniPk.narrowPeak.gz], HeLa-S3 [http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsHaibHelas3Pol2Pcr1xUniPk.narrowPeak.gz] and K562; CAGE expression data [http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/hg19.cage_peak_phase1and2combined_tpm_ann.osc.txt.gz]; GTEx V7 data [https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz]; ClinVar vcf file [ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/]. CTR-seq data were deposited on DNA Data Bank of Japan Sequencing Read Archive (accession number: DRA010491). 
The mapping computational pipeline used a prototype of primer-chop available at https://gitlab.com/mcfrith/primer-chop. The precise methods and command lines are provided as Supplementary Methods.
Code availability
Data, alongside source code of the models, a readme.txt file and other instructions for installing and running the analyses are available on our git repository [https://gite.lirmm.fr/ibc/deepSTR]. This repository can be downloaded using the following command line:
curl https://gite.lirmm.fr/ibc/deepSTR/-/archive/master/deepSTR-master.zip --output DeepSTR.zip
or simply at https://gite.lirmm.fr/ibc/deepSTR/-/archive/master/deepSTR-master.zip.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Mathys Grapotte, Manu Saraswat, Chloé Bessière.
A list of authors and their affiliations appears at the end of the paper.
Change history
3/22/2022
In the original version of this article, the given and family names of Elena Torlai Triglia were incorrectly structured. The name was displayed correctly in all versions at the time of publication. The original article has been corrected.
Change history
3/1/2022
A Correction to this paper has been published: 10.1038/s41467-022-28758-y
Contributor Information
Laurent Bréhélin, Email: brehelin@lirmm.fr.
Charles-Henri Lecellier, Email: charles.lecellier@igmm.cnrs.fr.
FANTOM consortium:
Imad Abugessaisa, Stuart Aitken, Bronwen L. Aken, Intikhab Alam, Tanvir Alam, Rami Alasiri, Ahmad M. N. Alhendi, Hamid Alinejad-Rokny, Mariano J. Alvarez, Robin Andersson, Takahiro Arakawa, Marito Araki, Taly Arbel, John Archer, Alan L. Archibald, Erik Arner, Peter Arner, Kiyoshi Asai, Haitham Ashoor, Gaby Astrom, Magda Babina, J. Kenneth Baillie, Vladimir B. Bajic, Archana Bajpai, Sarah Baker, Richard M. Baldarelli, Adam Balic, Mukesh Bansal, Arsen O. Batagov, Serafim Batzoglou, Anthony G. Beckhouse, Antonio P. Beltrami, Carlo A. Beltrami, Nicolas Bertin, Sharmodeep Bhattacharya, Peter J. Bickel, Judith A. Blake, Mathieu Blanchette, Beatrice Bodega, Alessandro Bonetti, Hidemasa Bono, Jette Bornholdt, Michael Böttcher, Salim Bougouffa, Mette Boyd, Jeremie Breda, Frank Brombacher, James B. Brown, Carol J. Bult, A. Maxwell Burroughs, Dave W. Burt, Annika Busch, Giulia Caglio, Andrea Califano, Christopher J. Cameron, Carlo V. Cannistraci, Alessandra Carbone, Ailsa J. Carlisle, Piero Carninci, Kim W. Carter, Daniela Cesselli, Jen-Chien Chang, Julie C. Chen, Yun Chen, Marco Chierici, John Christodoulou, Yari Ciani, Emily L. Clark, Mehmet Coskun, Maria Dalby, Emiliano Dalla, Carsten O. Daub, Carrie A. Davis, Michiel J. L. de Hoon, Derek de Rie, Elena Denisenko, Bart Deplancke, Michael Detmar, Ruslan Deviatiiarov, Diego Di Bernardo, Alexander D. Diehl, Lothar C. Dieterich, Emmanuel Dimont, Sarah Djebali, Taeko Dohi, Jose Dostie, Finn Drablos, Albert S. B. Edge, Matthias Edinger, Anna Ehrlund, Karl Ekwall, Arne Elofsson, Mitsuhiro Endoh, Hideki Enomoto, Saaya Enomoto, Mohammad Faghihi, Michela Fagiolini, Mary C. Farach-Carson, Geoffrey J. Faulkner, Alexander Favorov, Ana Miguel Fernandes, Carmelo Ferrai, Alistair R. R. Forrest, Lesley M. Forrester, Mattias Forsberg, Alexandre Fort, Margherita Francescatto, Tom C. Freeman, Martin Frith, Shinji Fukuda, Manabu Funayama, Cesare Furlanello, Masaaki Furuno, Chikara Furusawa, Hui Gao, Iveta Gazova, Claudia Gebhard, Florian Geier, Teunis B. H. Geijtenbeek, Samik Ghosh, Yanal Ghosheh, Thomas R. Gingeras, Takashi Gojobori, Tatyana Goldberg, Daniel Goldowitz, Julian Gough, Dario Greco, Andreas J. Gruber, Sven Guhl, Roderic Guigo, Reto Guler, Oleg Gusev, Stefano Gustincich, Thomas J. Ha, Vanja Haberle, Paul Hale, Björn M. Hallstrom, Michiaki Hamada, Lusy Handoko, Mitsuko Hara, Matthias Harbers, Jennifer Harrow, Jayson Harshbarger, Takeshi Hase, Akira Hasegawa, Kosuke Hashimoto, Taku Hatano, Nobutaka Hattori, Ryuhei Hayashi, Yoshihide Hayashizaki, Meenhard Herlyn, Peter Heutink, Winston Hide, Kelly J. Hitchens, Shannon Ho Sui, Peter A. C. ’t Hoen, Chung Chau Hon, Fumi Hori, Masafumi Horie, Katsuhisa Horimoto, Paul Horton, Rui Hou, Edward Huang, Yi Huang, Richard Hugues, David Hume, Hans Ienasescu, Kei Iida, Tomokatsu Ikawa, Toshimichi Ikemura, Kazuho Ikeo, Norihiko Inoue, Yuri Ishizu, Yosuke Ito, Masayoshi Itoh, Anna V. Ivshina, Boris R. Jankovic, Piroon Jenjaroenpun, Rory Johnson, Mette Jorgensen, Hadi Jorjani, Anagha Joshi, Giuseppe Jurman, Bogumil Kaczkowski, Chieko Kai, Kaoru Kaida, Kazuhiro Kajiyama, Rajaram Kaliyaperumal, Eli Kaminuma, Takashi Kanaya, Hiroshi Kaneda, Philip Kapranov, Artem S. Kasianov, Takeya Kasukawa, Toshiaki Katayama, Sachi Kato, Shuji Kawaguchi, Jun Kawai, Hideya Kawaji, Hiroshi Kawamoto, Yuki I. Kawamura, Satoshi Kawasaki, Tsugumi Kawashima, Judith S. Kempfle, Tony J. Kenna, Juha Kere, Levon Khachigian, Hisanori Kiryu, Mami Kishima, Hiroyuki Kitajima, Toshio Kitamura, Hiroaki Kitano, Enio Klaric, Kjetil Klepper, S. Peter Klinken, Edda Kloppmann, Alan J. Knox, Yuichi Kodama, Yasushi Kogo, Miki Kojima, Soichi Kojima, Norio Komatsu, Hiromitsu Komiyama, Tsukasa Kono, Haruhiko Koseki, Shigeo Koyasu, Anton Kratz, Alexander Kukalev, Ivan Kulakovskiy, Anshul Kundaje, Hiroshi Kunikata, Richard Kuo, Tony Kuo, Shigehiro Kuraku, Vladimir A. Kuznetsov, Tae Jun Kwon, Matt Larouche, Timo Lassmann, Andy Law, Kim-Anh Le-Cao, Charles-Henri Lecellier, Weonju Lee, Boris Lenhard, Andreas Lennartsson, Kang Li, Ruohan Li, Berit Lilje, Leonard Lipovich, Marina Lizio, Gonzalo Lopez, Shigeyuki Magi, Gloria K. Mak, Vsevolod Makeev, Riichiro Manabe, Michiko Mandai, Jessica Mar, Kazuichi Maruyama, Taeko Maruyama, Elizabeth Mason, Anthony Mathelier, Hideo Matsuda, Yulia A. Medvedeva, Terrence F. Meehan, Niklas Mejhert, Alison Meynert, Norihisa Mikami, Akiko Minoda, Hisashi Miura, Yohei Miyagi, Atsushi Miyawaki, Yosuke Mizuno, Hiromasa Morikawa, Mitsuru Morimoto, Masaki Morioka, Soji Morishita, Kazuyo Moro, Efthymios Motakis, Hozumi Motohashi, Abdul Kadir Mukarram, Christine L. Mummery, Christopher J. Mungall, Yasuhiro Murakawa, Masami Muramatsu, Mitsuyoshi Murata, Kazunori Nagasaka, Takahide Nagase, Yutaka Nakachi, Fumio Nakahara, Kenta Nakai, Kumi Nakamura, Yasukazu Nakamura, Yukio Nakamura, Toru Nakazawa, Guy P. Nason, Chirag Nepal, Quan Hoang Nguyen, Lars K. Nielsen, Kohji Nishida, Koji M. Nishiguchi, Hiromi Nishiyori, Kazuhiro Nitta, Shuhei Noguchi, Shohei Noma, Cedric Notredame, Soichi Ogishima, Naganari Ohkura, Hiroshi Ohno, Mitsuhiro Ohshima, Takashi Ohtsu, Yukinori Okada, Mariko Okada-Hatakeyama, Yasushi Okazaki, Per Oksvold, Valerio Orlando, Ghim Sion Ow, Mumin Ozturk, Mikhail Pachkov, Triantafyllos Paparountas, Suraj P. Parihar, Sung-Joon Park, Giovanni Pascarella, Robert Passier, Helena Persson, Ingrid H. Philippens, Silvano Piazza, Charles Plessy, Ana Pombo, Fredrik Ponten, Stéphane Poulain, Thomas M. Poulsen, Swati Pradhan, Carolina Prezioso, Clare Pridans, Xiang-Yang Qin, John Quackenbush, Owen Rackham, Jordan Ramilowski, Timothy Ravasi, Michael Rehli, Sarah Rennie, Tiago Rito, Patrizia Rizzu, Christelle Robert, Marco Roos, Burkhard Rost, Filip Roudnicky, Riti Roy, Morten B. Rye, Oxana Sachenkova, Pal Saetrom, Hyonmi Sai, Shinji Saiki, Mitsue Saito, Akira Saito, Shimon Sakaguchi, Mizuho Sakai, Saori Sakaue, Asako Sakaue-Sawano, Albin Sandelin, Hiromi Sano, Yuzuru Sasamoto, Hiroki Sato, Alka Saxena, Hideyuki Saya, Andrea Schafferhans, Sebastian Schmeier, Christian Schmidl, Daniel Schmocker, Claudio Schneider, Marcus Schueler, Erik A. Schultes, Gundula Schulze-Tanzil, Colin A. Semple, Shigeto Seno, Wooseok Seo, Jun Sese, Jessica Severin, Guojun Sheng, Jiantao Shi, Yishai Shimoni, Jay W. Shin, Javier SimonSanchez, Asa Sivertsson, Evelina Sjostedt, Cilla Soderhall, Georges St Laurent, III, Marcus H. Stoiber, Daisuke Sugiyama, Kim M. Summers, Ana Maria Suzuki, Harukazu Suzuki, Kenji Suzuki, Mikiko Suzuki, Naoko Suzuki, Takahiro Suzuki, Douglas J. Swanson, Rolf K. Swoboda, Michihira Tagami, Ayumi Taguchi, Hazuki Takahashi, Masayo Takahashi, Kazuya Takamochi, Satoru Takeda, Yoichi Takenaka, Kin Tung Tam, Hiroshi Tanaka, Rica Tanaka, Yuji Tanaka, Dave Tang, Ichiro Taniuchi, Andrea Tanzer, Hiroshi Tarui, Martin S. Taylor, Aika Terada, Yasuhisa Terao, Alison C. Testa, Mark Thomas, Supat Thongjuea, Kentaro Tomii, Elena Torlai Triglia, Hiroo Toyoda, H. Gwen Tsang, Motokazu Tsujikawa, Mathias Uhlén, Eivind Valen, Marc van de Wetering, Erik van Nimwegen, Dmitry Velmeshev, Roberto Verardo, Morana Vitezic, Kristoffer Vitting-Seerup, Kalle von Feilitzen, Christian R. Voolstra, Ilya E. Vorontsov, Claes Wahlestedt, Wyeth W. Wasserman, Kazuhide Watanabe, Shoko Watanabe, Christine A. Wells, Louise N. Winteringham, Ernst Wolvetang, Haruka Yabukami, Ken Yagi, Takuji Yamada, Yoko Yamaguchi, Masayuki Yamamoto, Yasutomo Yamamoto, Yumiko Yamamoto, Yasunari Yamanaka, Kojiro Yano, Kayoko Yasuzawa, Yukiko Yatsuka, Masahiro Yo, Shunji Yokokura, Misako Yoneda, Emiko Yoshida, Yuki Yoshida, Masahito Yoshihara, Rachel Young, Robert S. Young, Nancy Y. Yu, Noriko Yumoto, Susan E. Zabierowski, Peter G. Zhang, Silvia Zucchelli, and Martin Zwahlen
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-021-23143-7.
References
- 1. Dunham I, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
- 2. Forrest AR, et al. A promoter-level mammalian expression atlas. Nature. 2014;507:462–470. doi: 10.1038/nature13182.
- 3. Andersson R, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507:455–461. doi: 10.1038/nature12787.
- 4. Hon CC, et al. An atlas of human long non-coding RNAs with accurate 5’ ends. Nature. 2017;543:199–204. doi: 10.1038/nature21374.
- 5. Birney E, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816. doi: 10.1038/nature05874.
- 6. Carninci P, et al. The transcriptional landscape of the mammalian genome. Science. 2005;309:1559–1563. doi: 10.1126/science.1112014.
- 7. Kanamori-Katayama M, et al. Unamplified cap analysis of gene expression on a single-molecule sequencer. Genome Res. 2011;21:1150–1159. doi: 10.1101/gr.115469.110.
- 8. Murata M, et al. Detecting expressed genes using CAGE. Methods Mol. Biol. 2014;1164:67–85. doi: 10.1007/978-1-4939-0805-9_7.
- 9. Clark MB, Choudhary A, Smith MA, Taft RJ, Mattick JS. The dark matter rises: the expanding world of regulatory RNAs. Essays Biochem. 2013;54:1–16. doi: 10.1042/bse0540001.
- 10. Ard R, Allshire RC, Marquardt S. Emerging properties and functional consequences of noncoding transcription. Genetics. 2017;207:357–367. doi: 10.1534/genetics.117.300095.
- 11. Palazzo AF, Lee ES. Non-coding RNA: what is functional and what is junk? Front Genet. 2015;6:2. doi: 10.3389/fgene.2015.00002.
- 12. Struhl K. Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nat. Struct. Mol. Biol. 2007;14:103–105. doi: 10.1038/nsmb0207-103.
- 13. Cheneby J, Gheorghe M, Artufel M, Mathelier A, Ballester B. ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments. Nucleic Acids Res. 2017;46:D267–D275.
- 14. Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M. Linking disease associations with regulatory information in the human genome. Genome Res. 2012;22:1748–1759. doi: 10.1101/gr.136127.111.
- 15. Maurano MT, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–1195. doi: 10.1126/science.1222794.
- 16. Kellis M, et al. Defining functional DNA elements in the human genome. Proc. Natl Acad. Sci. USA. 2014;111:6131–6138. doi: 10.1073/pnas.1318948111.
- 17. Matylla-Kulinska K, Tafer H, Weiss A, Schroeder R. Functional repeat-derived RNAs often originate from retrotransposon-propagated ncRNAs. Wiley Interdiscip Rev. RNA. 2014;5:591–600. doi: 10.1002/wrna.1243.
- 18. Fort A, et al. Deep transcriptome profiling of mammalian stem cells supports a regulatory role for retrotransposons in pluripotency maintenance. Nat. Genet. 2014;46:558–566. doi: 10.1038/ng.2965.
- 19. Ferreira D, et al. Satellite non-coding RNAs: the emerging players in cells, cellular pathways and cancer. Chromosome Res. 2015;23:479–493. doi: 10.1007/s10577-015-9482-8.
- 20. Bertuzzi M, et al. A human minisatellite hosts an alternative transcription start site for NPRL3 driving its expression in a repeat number-dependent manner. Hum. Mutat. 2020;41:807–824. doi: 10.1002/humu.23974.
- 21. Willems T, Gymrek M, Highnam G, Mittelman D, Erlich Y. The landscape of human STR variation. Genome Res. 2014;24:1894–1904. doi: 10.1101/gr.177774.114.
- 22. Bagshaw AT. Functional mechanisms of microsatellite DNA in eukaryotic genomes. Genome Biol. Evol. 2017;9:2428–2443. doi: 10.1093/gbe/evx164.
- 23. Gymrek M, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat. Genet. 2016;48:22–29. doi: 10.1038/ng.3461.
- 24. Quilez J, et al. Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res. 2016;44:3750–3762. doi: 10.1093/nar/gkw219.
- 25. Press MO, McCoy RC, Hall AN, Akey JM, Queitsch C. Massive variation of short tandem repeats with functional consequences across strains of Arabidopsis thaliana. Genome Res. 2018;28:1169–1178. doi: 10.1101/gr.231753.117.
- 26. Rothenburg S, Koch-Nolte F, Rich A, Haag F. A polymorphic dinucleotide repeat in the rat nucleolin gene forms Z-DNA and inhibits promoter activity. Proc. Natl Acad. Sci. USA. 2001;98:8985–8990. doi: 10.1073/pnas.121176998.
- 27. Contente A, Dittmer A, Koch MC, Roth J, Dobbelstein M. A polymorphic microsatellite that mediates induction of PIG3 by p53. Nat. Genet. 2002;30:315–320. doi: 10.1038/ng836.
- 28. Martin P, Makepeace K, Hill SA, Hood DW, Moxon ER. Microsatellite instability regulates transcription factor binding and gene expression. Proc. Natl Acad. Sci. USA. 2005;102:3800–3804. doi: 10.1073/pnas.0406805102.
- 29. Willems T, et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods. 2017;14:590–592. doi: 10.1038/nmeth.4267.
- 30. Yap K, et al. A short tandem repeat-enriched RNA assembles a nuclear compartment to control alternative splicing and promote cell survival. Mol. Cell. 2018;72:525–540. doi: 10.1016/j.molcel.2018.08.041.
- 31. Jain A, Vale RD. RNA phase transitions in repeat expansion disorders. Nature. 2017;546:243–247. doi: 10.1038/nature22386.
- 32. Zhu Q, et al. BRCA1 tumour suppression occurs via heterochromatin-mediated silencing. Nature. 2011;477:179–184. doi: 10.1038/nature10371.
- 33. Mills WK, Lee YCG, Kochendoerfer AM, Dunleavy EM, Karpen GH. RNA from a simple-tandem repeat is required for sperm maturation and male fertility in Drosophila melanogaster. eLife. 2019;8:e48940. doi: 10.7554/eLife.48940.
- 34. Frankish A, et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 2019;47:D766–D773. doi: 10.1093/nar/gky955.
- 35. Cabili MN, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011;25:1915–1927. doi: 10.1101/gad.17446611.
- 36. Iyer MK, et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 2015;47:199–208. doi: 10.1038/ng.3192.
- 37. Fejes-Toth K, et al. Post-transcriptional processing generates a diversity of 5’-modified long and short RNAs. Nature. 2009;457:1028–1032. doi: 10.1038/nature07759.
- 38. de Rie D, et al. An integrated expression atlas of miRNAs and their promoters in human and mouse. Nat. Biotechnol. 2017;35:872–878. doi: 10.1038/nbt.3947.
- 39. Andersson R, et al. Nuclear stability and transcriptional directionality separate functionally distinct RNA species. Nat. Commun. 2014;5:5336. doi: 10.1038/ncomms6336.
- 40. Almada AE, Wu X, Kriz AJ, Burge CB, Sharp PA. Promoter directionality is controlled by U1 snRNP and polyadenylation signals. Nature. 2013;499:360–363. doi: 10.1038/nature12349.
- 41. Sibley CR, Blazquez L, Ule J. Lessons from non-canonical splicing. Nat. Rev. Genet. 2016;17:407. doi: 10.1038/nrg.2016.46.
- 42. Ibrahim MM, et al. Determinants of promoter and enhancer transcription directionality in metazoans. Nat. Commun. 2018;9:1–15. doi: 10.1038/s41467-018-06962-z.
- 43. Kelley DR, et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 2018;28:739–750. doi: 10.1101/gr.227819.117.
- 44. Agarwal V, Shendure J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 2020;31:107663. doi: 10.1016/j.celrep.2020.107663.
- 45. Vowles EJ, Amos W. Evidence for widespread convergent evolution around human microsatellites. PLoS Biol. 2004;2:E199. doi: 10.1371/journal.pbio.0020199.
- 46. Landrum MJ, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 2016;44:D862–868. doi: 10.1093/nar/gkv1222.
- 47. Maslova A, et al. Deep learning of immune cell differentiation. Proc. Natl Acad. Sci. USA. 2020;117:25655–25666. doi: 10.1073/pnas.2011795117.
- 48. Koo PK, Eddy SR. Representation learning of genomic sequence motifs with convolutional neural networks. PLoS Comput. Biol. 2019;15:e1007560. doi: 10.1371/journal.pcbi.1007560.
- 49. Eraslan G, Avsec Z, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 2019;20:389–403. doi: 10.1038/s41576-019-0122-6.
- 50. Andersson R, Sandelin A. Determinants of enhancer and promoter activities of regulatory elements. Nat. Rev. Genet. 2020;21:71–87. doi: 10.1038/s41576-019-0173-8.
- 51. Dechering KJ, Cuelenaere K, Konings RN, Leunissen JA. Distinct frequency-distributions of homopolymeric DNA tracts in different genomes. Nucleic Acids Res. 1998;26:4056–4062. doi: 10.1093/nar/26.17.4056.
- 52. Segal E, Widom J. Poly(dA:dT) tracts: major determinants of nucleosome organization. Curr. Opin. Struct. Biol. 2009;19:65–71. doi: 10.1016/j.sbi.2009.01.004.
- 53. Weingarten-Gabbay S, et al. Systematic interrogation of human promoters. Genome Res. 2019;29:171–183. doi: 10.1101/gr.236075.118.
- 54. Krietenstein N, et al. Genomic nucleosome organization reconstituted with pure proteins. Cell. 2016;167:709–721. doi: 10.1016/j.cell.2016.09.045.
- 55. Frank L, Rippe K. Repetitive RNAs as regulators of chromatin-associated subcompartment formation by phase separation. J. Mol. Biol. 2020;432:4270–4286. doi: 10.1016/j.jmb.2020.04.015.
- 56. Nikumbh S, Pfeifer N. Genetic sequence-based prediction of long-range chromatin interactions suggests a potential role of short tandem repeat sequences in genome organization. BMC Bioinformatics. 2017;18:218. doi: 10.1186/s12859-017-1624-x.
- 57. Sun JH, et al. Disease-associated short tandem repeats co-localize with chromatin domain boundaries. Cell. 2018;175:224–238. doi: 10.1016/j.cell.2018.08.005.
- 58. Fotsing SF, et al. The impact of short tandem repeat variation on gene expression. Nat. Genet. 2019;51:1652–1659. doi: 10.1038/s41588-019-0521-9.
- 59. Jakubosky D, et al. Properties of structural variants and short tandem repeats associated with gene expression and complex traits. Nat. Commun. 2020;11:2927. doi: 10.1038/s41467-020-16482-4.
- 60. Chen HY, et al. The mechanism of transactivation regulation due to polymorphic short tandem repeats (STRs) using IGF1 promoter as a model. Sci. Rep. 2016;6:38225. doi: 10.1038/srep38225.
- 61. Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods. 2012;9:215–216. doi: 10.1038/nmeth.1906.
- 62. Hoffman MM, et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat. Methods. 2012;9:473–476. doi: 10.1038/nmeth.1937.
- 63. Jabbari K, Bernardi G. An isochore framework underlies chromatin architecture. PLoS ONE. 2017;12:1–12. doi: 10.1371/journal.pone.0168023.
- 64. Vandel J, Cassan O, Lebre S, Lecellier CH, Brehelin L. Probing transcription factor combinatorics in different promoter classes and in enhancers. BMC Genomics. 2019;20:103. doi: 10.1186/s12864-018-5408-0.
- 65. Carninci P, et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 2006;38:626–635. doi: 10.1038/ng1789.
- 66. Frith MC, et al. A code for transcription initiation in mammalian genomes. Genome Res. 2008;18:1–12. doi: 10.1101/gr.6831208.
- 67. Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. ICML’17: Proceedings of the 34th International Conference on Machine Learning. 2017;70:3145–3153.
- 68. Shrikumar A, et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5. Preprint at https://arxiv.org/abs/1811.00416 (2018).
- 69. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
- 70. Hinrichs AS, et al. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 2006;34:D590–598. doi: 10.1093/nar/gkj144.
- 71. Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 72. Morioka MS, et al. Cap Analysis of Gene Expression (CAGE): a quantitative and genome-wide assay of transcription start sites. In: Boegel S, editor. Bioinformatics for Cancer Immunotherapy. Methods in Molecular Biology, vol 2120. New York: Humana; 2020.
- 73. Bailey TL, et al. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994;2:28–36.
- 74. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27:1017–1018. doi: 10.1093/bioinformatics/btr064.
- 75. Cheng Y, Miura RM, Tian B. Prediction of mRNA polyadenylation sites by support vector machine. Bioinformatics. 2006;22:2320–2325. doi: 10.1093/bioinformatics/btl394.
- 76. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 77. Fornes O, et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020;48:D87–D92. doi: 10.1093/nar/gkz1001.
- 78. Cock PJA, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163.
- 79. Severin J, et al. Interactive visualization and analysis of large-scale sequencing datasets using ZENBU. Nat. Biotechnol. 2014;32:217–219. doi: 10.1038/nbt.2840.
Data Availability Statement
The data that support this study are available from the corresponding author upon reasonable request. The public datasets used are: CAGE peak coordinates [http://fantom.gsc.riken.jp/5/datafiles/phase1.3/extra/CAGE_peaks/hg19.cage_peak_coord_permissive.bed.gz]; human STR catalog [https://github.com/HipSTR-Tool/HipSTR-references/raw/master/human/hg19.hipstr_reference.bed.gz]; mouse STR catalog [https://github.com/HipSTR-Tool/HipSTR-references/blob/master/mouse/mm10.hipstr_reference.bed.gz]; CAGE signals at human and mouse STRs, alongside FASTA sequence files, in our git repository [https://gite.lirmm.fr/ibc/deepSTR]; FANTOM gene annotation [https://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/assembly/lv3_robust/FANTOM_CAT.lv3_robust.bed.gz]; coordinates of FANTOM CAT robust transcripts [http://fantom.gsc.riken.jp/5/suppl/Hon_et_al_2016/data/assembly/lv3_robust/FANTOM_CAT.lv3_robust.gtf.gz] and FANTOM enhancers [https://fantom.gsc.riken.jp/5/datafiles/latest/extra/Enhancers/human_permissive_enhancers_phase_1_and_2.bed.gz]; ENCODE RNAPII ChIP-seq BED files for GM12878 [http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/wgEncodeAwgTfbsHaibGm12878Pol2Pcr2xUniPk.narrowPeak.gz], H1-hESC [http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsHaibH1hescPol2V0416102UniPk.narrowPeak.gz], HeLa-S3 [http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsHaibHelas3Pol2Pcr1xUniPk.narrowPeak.gz] and K562; CAGE expression data [http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/hg19.cage_peak_phase1and2combined_tpm_ann.osc.txt.gz]; GTEx V7 data [https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz]; and the ClinVar VCF file [ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/]. CTR-seq data were deposited in the DNA Data Bank of Japan Sequence Read Archive (accession number: DRA010491).
The mapping computational pipeline used a prototype of primer-chop available at https://gitlab.com/mcfrith/primer-chop. The precise methods and command lines are provided as Supplementary Methods.
Data, alongside source code of the models, a readme.txt file and other instructions for installing and running the analyses are available on our git repository [https://gite.lirmm.fr/ibc/deepSTR]. This repository can be downloaded using the following command line:
curl https://gite.lirmm.fr/ibc/deepSTR/-/archive/master/deepSTR-master.zip --output DeepSTR.zip
or retrieved directly from https://gite.lirmm.fr/ibc/deepSTR/-/archive/master/deepSTR-master.zip.
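For readers who prefer to script the download, the curl command above could be mirrored in Python roughly as follows. This is only a minimal sketch using the standard library; the helper names `fetch_archive` and `extract_archive` and the output directory `DeepSTR` are illustrative assumptions and are not part of the deepSTR repository.

```python
import urllib.request
import zipfile
from pathlib import Path

# Archive URL copied from the Data Availability Statement.
ARCHIVE_URL = "https://gite.lirmm.fr/ibc/deepSTR/-/archive/master/deepSTR-master.zip"


def fetch_archive(url: str = ARCHIVE_URL, dest: str = "DeepSTR.zip") -> Path:
    """Download the archive to `dest` unless a file is already there.

    Hypothetical helper: it mirrors `curl <url> --output DeepSTR.zip`.
    """
    path = Path(dest)
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return path


def extract_archive(archive: Path, out_dir: str = "DeepSTR") -> list:
    """Unpack the zip file into `out_dir` and return its member names.

    Hypothetical helper; the output directory name is an assumption.
    """
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
        return zf.namelist()
```

Calling `extract_archive(fetch_archive())` would download the archive once and unpack it next to the script; the repository's own readme.txt remains the reference for installing and running the analyses.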