Abstract
Background: The distribution and composition of cis-regulatory modules composed of transcription factor (TF) binding site (TFBS) clusters in promoters substantially determine gene expression patterns and TF targets. TF knockdown experiments have revealed that TF binding profiles and gene expression levels are correlated. We use TFBS features within accessible promoter intervals to predict genes with similar tissue-wide expression patterns and TF targets.
Methods: Genes with correlated expression patterns across 53 tissues and TF targets were respectively identified from Bray-Curtis Similarity and TF knockdown experiments. Corresponding promoter sequences were reduced to DNase I-accessible intervals; TFBSs were then identified within these intervals using information theory-based position weight matrices for each TF (iPWMs) and clustered. Features from information-dense TFBS clusters predicted these genes with machine learning classifiers, which were evaluated for accuracy, specificity and sensitivity. Mutations in TFBSs were analyzed to in silico examine their impact on cluster densities and the regulatory states of target genes.
Results: We initially chose the glucocorticoid receptor gene ( NR3C1), whose regulation has been extensively studied, to test this approach. SLC25A32 and TANK were found to exhibit the most similar expression patterns to NR3C1. A Decision Tree classifier exhibited the largest area under the Receiver Operating Characteristic (ROC) curve in detecting such genes. Target gene prediction was confirmed using siRNA knockdown of TFs, which was found to be more accurate than those predicted after CRISPR/CAS9 inactivation. In-silico mutation analyses of TFBSs also revealed that one or more information-dense TFBS clusters in promoters are required for accurate target gene prediction.
Conclusions: Machine learning based on TFBS information density, organization, and chromatin accessibility accurately identifies gene targets with comparable tissue-wide expression patterns. Multiple information-dense TFBS clusters in promoters appear to protect promoters from effects of deleterious binding site mutations in a single TFBS that would otherwise alter regulation of these genes.
Keywords: Transcription factors, position-specific scoring matrices, chromatin, binding sites, gene expression profiles, Bray-Curtis similarity, mutation, machine learning, information theory
Introduction
The distinctive organization and combination of transcription factor binding sites (TFBSs) and regulatory modules in promoters dictates specific expression patterns within a set of genes 1. Clustering of multiple adjacent binding sites for the same TF (homotypic clusters) and for different TFs (heterotypic clusters) defines cis-regulatory modules (CRMs) in human gene promoters. Experimental studies have shown that these clusters can reinforce (and in some instances, amplify) the impact of individual TFBSs on gene expression through increasing binding affinities, facilitated diffusion mechanisms and funnel effects 2. Because tissue-specific TF-TF interactions in TFBS clusters are prevalent, these features can assist in identifying correct target genes by altering binding specificities of individual TFs 3. Previously, we derived iPWMs from ChIP-seq data that can accurately detect TFBSs and quantify their strengths by computing associated R i values (Rate of Shannon information transmission for an individual sequence 4), with R sequence being the average of R i values of all binding site sequences and representing the average binding strength of the TF 3. Furthermore, information density-based clustering (IDBC) can effectively identify functional TF clusters by taking into account both the spatial organization (i.e. intersite distances) and information density (i.e. R i values) of TFBSs 5.
TF binding profiles, either derived from in vivo ChIP-seq peaks 6– 8 or computationally detected binding sites and CRMs 9, have been shown to be predictive of absolute gene expression levels using a variety of tissue-specific machine learning classifiers and regression models. Because signal strengths of ChIP-seq peaks are not strictly proportional to TFBS strengths 3, representing TF binding strengths by ChIP-seq signals may not be appropriate; nevertheless, both achieved similar accuracy 10. CRMs have been formed by combining two or three adjacent TFBSs 9, which is inflexible, as it arbitrarily limits the number of binding sites contained in a module, and does not consider differences between information densities of different CRMs. Chromatin structure (e.g. histone modification (HM) and DNase I hypersensitive sites (DHSs)) were also found to be statistically redundant with TF binding in explaining tissue-specific mRNA transcript abundance at a genome-wide level 7, 8, 11, 12, which was attributed to the heterogeneous distribution of HMs across chromatin domains 8. Combining these two types of data explained the largest fraction of variance in gene expression levels in multiple cell lines 7, 8, suggesting that either contributes unique information to gene expression that cannot be compensated for by the other.
The number of genes directly bound by a TF significantly exceeds the number of differentially expressed (DE) genes whose expression levels significantly change upon knockdown of the TF. Only a small subset of direct target genes whose promoters overlap ChIP-seq peaks were DE after individually knocking 59 TFs down using small interfering RNAs (siRNAs) in the GM19238 cell line 13. Using these knockdown data on 8,872 genes as the gold standard, correlation between TFBS counts and gene expression levels across 10 different cell lines was more predictive of DE targets than setting a minimum threshold on TFBS counts 14. Their TFBS counts were defined as the number of ChIP-seq peaks overlapping the promoter, though it was unknown how many binding sites were present in these peaks; positives might not be direct targets in the TF regulatory cascade, as the promoters of these targets were not intersected with ChIP-seq peaks. By perturbing gene expression with CAS9-directed clustered regularly interspaced short palindromic repeats (CRISPR) of 10 different TF genes in K562 cells, the regulatory effects of each TF on 22,046 genes were dissected by single cell RNA sequencing with a regularized linear computational model 15; this accurately revealed DE targets and new functions of individual TFs, some of which were likely regulated through direct interactions at TFBSs in their corresponding promoters. Machine learning classifiers have also been applied in a small number of gene instances to predict targets of a single TF using features extracted from n-grams derived from consensus binding sequences 16, or from TFBSs and homotypic binding site clusters 5.
To investigate whether the distribution and composition of CRMs in promoters substantially determines gene expression profiles of direct TF targets, we developed a general machine learning framework that predicts which genes have similar tissue-wide expression profiles to a given gene and predicts DE direct TF targets by combining information theory-based TF binding profiles with DHSs. Upon filtering of accessible promoter intervals based on the locations of DHSs, features designed to capture the spatial distribution and information composition of CRMs were extracted from clusters of iPWM-detected TFBSs identified by IDBC. Though not all direct targets regulated by multiple TFs share a common tissue-wide expression profile, this framework provides insight into the transcriptional program of genes with similar profiles by dissecting their cis-regulatory element organization and strengths. We identify genes with comparable tissue-wide expression profiles by application of Bray-Curtis similarity 17. Using transcriptome data generated by CRISPR- 15 and siRNA-based 13 TF knockdowns, we predicted DE TF target genes that are simultaneously direct targets whose promoters overlap tissue-specific ChIP-seq peaks.
Methods
To identify genes with similar tissue-wide expression patterns, we formally define tissue-wide gene expression profiles and pairwise similarity measures between profiles of different genes. A general machine learning framework relates features extracted from the organization of TFBSs in these genes to their tissue-wide expression patterns. Since protein-coding (PC) sequences represent the most widely studied and best understood component of the human genome 18, positives and negatives for deriving machine learning classifiers for predicting DE direct TF target genes that encode proteins (TF targets for short below) were obtained from CRISPR- and siRNA-generated knockdown data (see below).
Similarity between GTEx tissue-wide expression profiles of genes
The Genotype-Tissue Expression (GTEx, version 6p) Project measured expression levels in 53 tissues, each of which is represented by different numbers of individuals (5 to 564). For each tissue population, the median expression value is given in RPKM (Reads Per Kilobase of transcript per Million mapped reads) for each gene 19. Data are available on Zenodo 20. To capture the tissue-wide overall expression pattern of a gene instead of within a single tissue, the tissue-wide expression profile of a gene was defined as its median RPKM across GTEx tissues, which is described by a 53 element vector ( Equation 1). Note that different isoforms whose expression patterns may significantly differ from each other cannot be distinguished by this approach.
where EP A is the tissue-wide expression profile of Gene A, is the median expression value of Gene A in Tissue 1, is the median expression value of Gene A in Tissue 2, etc.
To discover other genes whose tissue-wide expression profiles are similar to a given gene, we computed the Bray-Curtis Similarity ( Equation 2) between the tissue-wide expression profiles of gene pairs. Compared to other similarity metrics ( Table 1, Example 1, Additional file 1 21), the application of this function is justified by desirable properties, including: 1) maintaining bounds between 0 and 1, 2) achieving the maximal similarity value 1 if and only if two vectors are identical, and 3) larger values having a larger impact on the resultant similarity value.
Table 1. Comparison between metrics in measurement of similarity between GTEx tissue-wide expression profiles of genes.
Similarity metric | Property 1 †, ‡ | Property 2 † | Property 3 † |
---|---|---|---|
Bray-Curtis | √; [0,1] | √ | √ |
Euclidean | √; (0,1] | √ | × |
Cosine | √; [0,1] | × | √ |
Pearson correlation 24 | ×; [-1,1] | × | × |
Spearman correlation 25 | ×; [-1,1] | × | × |
† The symbols √ and ×, respectively, indicate that the similarity metric satisfies and does not satisfy the property. ‡The interval in each cell indicates the range in which the result computed by the similarity metric lies.
Example 1. Assume that Genes A, B, C, D, E, F respectively have the following GTEx expression profiles across two tissues: EP A = [1,1], EP B =[2,2], EP C = [3,3], EP D = [1,2], EP E = [1,99], EP F = [1,100]. The ground-truth similarity relationships that we can intuitively infer include sim( EP C, EP A) < sim( EP C, EP B) < 1, and sim( EP A, EP D) < sim( EP E, EP F) < 1. Only the results computed by the Bray-Curtis Similarity are completely concordant with these ground-truth relationships ( Table 2).
Table 2. Similarity values computed by different metrics.
Similarity metric | sim( EP C, EP B) | sim( EP C, EP A) | sim( EP E, EP F) | sim( EP A, EP D) |
---|---|---|---|---|
Bray-Curtis | 0.8 | 0.5 | ≈ 0.995 | 0.8 |
Euclidean | ≈ 0.41 | ≈ 0.26 | 0.5 | 0.5 |
Cosine | 1 | 1 | ≈ 0.999999995 | ≈ 0.949 |
Pearson correlation | Undefined | Undefined | 1 | 1 |
Spearman correlation | 1 | 1 | 1 | 1 |
Prediction of genes with similar tissue-wide expression profiles
The framework for identifying genes whose tissue-wide expression profiles most resemble a particular gene is shown in Figure 1A, B. The genomic locations of all DHSs in 95 cell types generated by the ENCODE project[ 22; hg38 assembly] were selected for known promoters 23, then 94 iPWMs exhibiting primary binding motifs for 82 TFs 3 were used to detect TFBSs within the overlapping intervals. Data are available on Zenodo 20. When detecting heterotypic TFBS clusters with the IDBC algorithm, a minimum threshold 0.1 * R sequence was specified for the R i values of TFBSs in order to remove weak binding sites that were more likely to be false positive TFBSs 3.
The information density-related features (Additional file 1 21) derived from each TFBS cluster included: 1) The distance between this cluster and the transcription start site (TSS); 2) The length of this cluster; 3) The information content of this cluster (i.e. the sum of R i values of all TFBSs in this cluster); 4) The number of binding sites of each TF within this cluster; 5) The number of strong binding sites ( R i > R sequence) of each TF within this cluster; 6) The sum of R i values of binding sites of each TF within this cluster; 7) The sum of R i values of strong binding sites ( R i > R sequence) of each TF within this cluster.
For a gene, each of Features 1-3 was defined as a vector whose size equals the number of clusters in the promoter; thus, the entire vector could be input into a classifier. If two genes contained different numbers of clusters, the maximum number of clusters among all instances was determined, and null clusters were added at the 5’ end of promoters with fewer clusters, enabling all instances to have the same cluster count. Using all instances as training data, machine learning classifiers with default parameter values in MATLAB were used to generate ROC curves (Additional file 1 21).
Prediction of TF targets
Using gene expression in the CRISPR-based perturbations. Dixit et al. performed CRISPR-based perturbation experiments using multiple guide RNAs for each of ten TFs in K562 cells, resulting in a regulatory matrix of coefficients that indicate the effect of each guide RNA on each of 22,046 genes 15. Data are available on Zenodo 20. The coefficient of a guide RNA on a gene is defined as the log 10(fold change in gene expression level) 15. Among these ten TFs, we have previously derived iPWMs exhibiting primary binding motifs for seven (EGR1, ELF1, ELK1, ETS1, GABPA, IRF1, YY1) 3. Therefore, the framework for predicting TF targets in the K562 cell line ( Figure 1A and 1C) was applied to these TFs. The criteria for defining a positive (i.e. a target gene), of a TF were:
-
1)
The fold change in the expression level of this gene for each guide RNA of the TF exceeds (or is less than) 1, eliminating those genes exhibiting both increased and decreased expression levels for different guide RNAs, and maximizing the possibility that the gene was downregulated (or upregulated) by the TF (Additional file 1 21). and
-
2)
The average fold change in the expression level of this gene for all guide RNAs of the TF exceeds the threshold (or is less than 1/), and
-
3)
The promoter interval (10 kb) upstream of a TSS of this gene overlaps a ChIP-seq peak of the TF in the K562 cell line.
If the coefficients of all guide RNAs of the TF for a gene are zero, the gene was defined as a negative. As the threshold ε increases, the number of positives strictly decreases; as ε decreases, we have increasingly lower confidence in the fact that the positives were indeed differentially expressed because of the TF perturbation. To achieve a balance, we evaluated three different values (i.e. 1.01, 1.05 and 1.1) for ε (Additional file 1 21). For each TF, all ENCODE ChIP-seq peak datasets from the K562 cell line were merged to determine positives. Data are available on Zenodo 20. To make the numbers of negatives and positives equal to avoid imbalanced datasets that significantly compromise the classifier performance 26, the Bray-Curtis function was applied to compute the similarity values in the tissue-wide expression profile between all negatives and the positive with the largest average coefficient, then the negatives with the smallest values were selected ( Figure 1C).
The DHSs in the K562 cell line were intersected with known promoters. Data are available on Zenodo 20. Because TFs may exhibit tissue-specific sequence preferences due to different sets of target genes and binding sites in different tissues 3, the iPWMs of EGR1, ELK1, ELF1, GABPA, IRF1, YY1 from the K562 cell line were used to most accurately detect binding sites; for ETS1, we used the only available iPWM from the GM12878 cell line 3. Six features (Features 1-5 and 7) were derived from each homotypic cluster (i.e. Feature 6 became identical to Feature 3, because only binding sites from a single TF were used) ( Figure 1A). The results of 10 rounds of 10-fold cross validation were averaged to more accurately evaluate the predictive power of the classifier.
Using gene expression in the siRNA-based knockdown. In the GM19238 cell line, 59 TFs were individually knocked down using siRNAs, and significant changes in the expression levels of 8,872 genes were indicated according to their corresponding P-values 13. Data are available on Zenodo 20. In these cases, the P-value of a gene for a TF is the probability of observing the change in the expression level of this gene under the null hypothesis of no differential expression after TF knockdown; thus, the larger the change in the expression level, the smaller the P-value and the more likely this gene is differentially expressed. They also indicated whether the promoters of these genes display evidence of binding to TFs by intersecting with ChIP-seq peaks in the GM12838 cell line. Among these 59 TFs, we have previously derived accurate iPWMs exhibiting primary binding motifs for 11 (BATF, JUND, NFE2L1, PAX5, POU2F2, RELA, RXRA, SP1, TCF12, USF1, YY1) 3. Therefore, the framework for predicting TF targets in the GM19238 cell line ( Figure 1A, D) was applied to these 11 TFs.
We defined a positive (i.e. a target gene) for a TF, if the P-value of this gene for the TF was ≤ 0.01, and the promoter interval (10 kb) upstream of a TSS of this gene overlapped a ChIP-seq peak of the TF in the GM12878 cell line. All other genes with P-values > 0.01 were considered to exhibit insufficient evidence of being TF targets, i.e. these were considered negatives or non-targets.
The DHSs in the GM19238 cell line mapped from the hg19 genome assembly were first remapped to the hg38 assembly using liftOver prior to being intersected with known promoters 27. Data are available on Zenodo 20. Aside from RELA, RXRA and NFE2L1, the iPWMs of TFs from the GM12878 cell line were used to detect binding sites. For RELA, the iPWM from the GM19099 cell line was used; for RXRA and NFE2L1, the only available iPWMs were respectively derived from HepG2 and K562 cells and were applied. Although the knockdown was performed in GM19238, GM12878 and GM19099 are also lymphoblastic cell lines, with GM19099 and GM19238 both being derived from Yorubans. For this analysis, the iPWMs derived in GM12878 and GM19099 were more appropriate sources of accessible TFBSs than those from HepG2 and K562, since GM12878 and GM19099 are of the same tissue type and are thus more likely comparable to GM19238 than HepG2 and K562. Similarly, the results of 10 rounds of 10-fold cross validation were averaged to more accurately evaluate the predictive power of the classifier.
Mutation analyses on promoters of TF targets
To better understand the significance of individual binding sites for information-dense clusters and the regulatory state of direct targets, we evaluated the effects of sequence changes that altered the R i values of these sites on cluster formation and whether a gene was predicted to be a TF target. Mutations were sequentially introduced into the strongest binding sites in TFBS clusters of the EGR1 target gene, MCM7, to determine the threshold for cluster formation after disappearing clusters disabled induction of MCM7 expression. For one target gene of each TF from the CRISPR-generated perturbation data, effects of naturally occurring TFBS variants present in dbSNP 28 were also evaluated to explore aspects of TFBS organization that enabled both clusters and promoter activity to be resilient to binding site mutations. This was done by analyzing whether the occurrence of individual or multiple single nucleotide polymorphisms (SNPs) lead to the loss of binding sites and the corresponding clusters, and resulted in changes in the predictions for these targets.
Results
Similarity between GTEx tissue-wide expression profiles of genes
To confirm that the Bray-Curtis Similarity can indeed effectively measure how akin the tissue-wide expression profiles of two genes are to one another, Equation 2 was applied to compute the similarity values between the tissue-wide expression profiles of the glucocorticoid receptor ( GR or NR3C1) gene and all other 18,812 PC genes. NR3C1 is an extensively characterized TF with many known direct target genes 29. As a constitutively expressed TF activated by glucocorticoid ligands, the protein can mediate the up-regulation of anti-inflammatory genes by binding of homodimers to glucocorticoid response elements and down-regulation of proinflammatory genes by complexing with other activating TFs (e.g. NFKB and AP1) and eliminating their ability to bind targets 29. NR3C1 can bind its own promoter forming an auto-regulatory loop, which also contains functional binding sites of 11 other TFs (e.g. SP1, YY1, IRF1, NFKB) whose iPWMs have been developed and/or mutual interactions have been described previously 3, 29. However, since the tissue-wide expression profile of NR3C1 comprises all different splicing and translational isoforms (e.g. GRα-A to GRα-D, GRβ, GRγ, GRδ), the tissue-specific expression patterns of these isoforms are indistinguishable (e.g. levels of the GRα-C isoforms are significantly higher in the pancreas and colon, whereas levels of GRα-D are highest in spleen and lungs) 29. SLC25A32 and TANK have the greatest similarity in expression to NR3C1 (0.880 and 0.877 respectively), which is evident based on their overall similar expression patterns across the 53 tissues ( Figure 2).
Prediction of genes with similar GTEx tissue-wide expression profiles
In prediction of genes with similar tissue-wide expression profiles to NR3C1, we generated ROC curves to compare the performance of different classifiers (Naïve Bayes, Decision Tree (DT), Random Forest and Support Vector Machines (SVM) with four different kernels), under two scenarios depending on whether promoter sequences were first intersected with DHSs ( Figure 3). DT exhibited the largest AUC (area under the curve) under both scenarios, and was one of two most stable classifiers (i.e. ΔAUC < 0.01), with the other being the SVM with RBF kernel. Inclusion of DHS information significantly improved AUC values of the other classifiers with the exception of Naïve Bayes, and in many instances, all TFBSs in a contiguous DHS interval formed a single binding site cluster.
Prediction of TF targets
Since the DT classifier performed the best in distinguishing genes with NR3C1-like tissue-wide expression profiles from others, we further used this classifier type to predict TF targets respectively based on the CRISPR- 15 and siRNA-generated 13 perturbation data, and assessed its performance with 10 rounds of 10-fold cross validation. To validate that using all six machine learning features more comprehensively capture the distribution and composition of CRMs in the promoter, all of the features, except for TFBS counts, were removed. The classifier performance decreased, except for CRISPR-perturbed GABPA, IRF1 and YY1 after inclusion of DHS information (Additional file 5 21).
On the CRISPR-generated knockdown data, after eliminating TFBSs in inaccessible promoter intervals, i.e. those excluded from tissue-specific DHSs, the DT classifier predicted TF targets with greater sensitivity and specificity ( Table 3). Specifically, predictions for TFs: EGR1, ELK1, ELF1, ETS1, GABPA, and IRF1 were more accurate than for YY1, which itself represses or activates a wide range of promoters by binding to sites overlapping the TSS ( Table 3). Accordingly, the perturbation data indicated that YY1 has ~4-22 times more PC targets in the K562 cell line than the other TFs ( ε = 1.05), and its binding has a more significant impact on the expression levels of target genes (for YY1, the ratio of the PC target counts at ε = 1.1 vs ε = 1.01 was 0.334, which significantly exceeded those of the other TFs (0.017-0.082); Additional file 3 21). This is concordant with our previous finding that YY1 extensively interacts with 11 cofactors (e.g. DNA-binding IRF9 and TEAD2; non-DNA-binding DDX20 and PYGO2) in K562 cells, consistent with a central role in specifying erythroid-specific lineage development 3.
Table 3. The Decision Tree classifier performance for predicting TF targets using the CRISPR-generated knockdown data.
TF | Excluding DHS information † | Including DHS information † | ||||
---|---|---|---|---|---|---|
Sensitivity | Specificity | Accuracy | Sensitivity | Specificity | Accuracy | |
EGR1 | 0.58 | 0.62 | 0.60 | 0.78 | 0.81 | 0.80 |
ELF1 | 0.59 | 0.65 | 0.62 | 0.83 | 0.87 | 0.85 |
ELK1 | 0.59 | 0.59 | 0.59 | 0.80 | 0.81 | 0.81 |
ETS1 | 0.59 | 0.6 | 0.59 | 0.81 | 0.81 | 0.81 |
GABPA | 0.55 | 0.57 | 0.56 | 0.72 | 0.75 | 0.74 |
IRF1 | 0.54 | 0.55 | 0.54 | 0.76 | 0.64 | 0.70 |
YY1 | 0.50 | 0.51 | 0.51 | 0.45 | 0.69 | 0.57 |
†The average performance of 10 rounds of 10-fold cross validation when setting ε to 1.05 is indicated. The CRISPR-generated knockdown data were obtained from Dixit et al. 15.
Despite a high accuracy of target recognition, sensitivity did not exceed specificity except for IRF1 ( Table 3), due to a relatively large number of false negative genes. We find that promoters of most TF targets contain accessible, likely functional binding sites that significantly are correlated with changes in gene expression levels. By contrast, promoters of non-targets contain either no accessible binding sites at all, or accessible, but non-functional sites. The fact that these false negatives were erroneously predicted to non-targets was attributable to the inability of the classifier to distinguish between likely functional binding sites in their promoters and non-functional ones in non-targets in some cases. In vivo co-regulation mediated by interacting cofactors, which was excluded by the classifier, assisted in distinguishing these non-functional sites that do not significantly affect gene expression 13.
As the threshold ε increased, the accuracy of the classifier for all the TFs monotonically increased as expected ( Figure 4). For a gene to be defined as a DE target of a TF, the average fold change in its expression level for all guide RNAs that downregulated the TF were required to reach the minimum threshold ε. Upon TF knockdown, ε is inversely correlated with the number of target genes, but positively correlated with fold changes in their corresponding expression levels. In general, more significantly DE genes have been associated with a higher number of TFBSs in their promoters 13. Thus, at greater ε, there are larger differences in the values of machine learning features derived from TFBS clusters between direct targets and non-targets.
With the siRNA-generated knockdown data, the performance of the DT classifier was compared to the approach inferring DE targets by correlating TF binding with gene expression levels across ten cell types 14. In this correlation-based approach, three measures (i.e. the absolute Pearson correlation coefficient (PCC), the absolute Spearman correlation coefficient (SCC), and the absolute combined angle ratio statistic (CARS)), whose performance was evaluated with precision-recall curves, were alternatively used to compute a correlation score between the number of ChIP-seq peaks overlapping the promoter and gene expression values. Genes predicted to be DE targets had scores above the threshold resulting in a 1.5-fold increase compared to the background precision (i.e. the DE target count / 8,872). For example, in the case of YY1, which was the only TF analyzed by both approaches, the performance of the DT classifier was 0.98 (precision) and 0.55 (recall) after including DHS information ( Table 4). This classifier outperformed all three correlation measures (PCC: 0.467 and 0.003; SCC: 0.467 and 0.006; CARS: 0.467 and 0.003 directly obtained from 14), even though the correlation-based approach used a less stringent P-value threshold (0.05) for defining differential expression of likely non-direct targets, and intersected ChIP-seq peaks over shorter 5kb promoter intervals upstream of the TSS. Three reasons explain why the correlation-based approach exhibited lower recall, including: 1) it did not use machine learning classifiers, 2) its larger P-value threshold (0.05) generated a larger number of positives and, 3) positives also include DE targets that cannot be directly bound.
Table 4. The Decision Tree classifier performance for predicting TF targets using the siRNA-generated knockdown data.
TF | Excluding DHS information † | Including DHS information † | ||||
---|---|---|---|---|---|---|
Sensitivity | Specificity | Accuracy | Sensitivity | Specificity | Accuracy | |
BATF | 0.96 | 0.97 | 0.96 | 0.85 | 1 | 0.93 |
JUND | 0.86 | 0.90 | 0.88 | 0.80 | 1 | 0.90 |
NFE2L1 | 0.92 | 0.95 | 0.94 | 0.71 | 0.93 | 0.82 |
PAX5 | 0.96 | 0.97 | 0.96 | 0.88 | 0.98 | 0.93 |
POU2F2 | 0.97 | 0.97 | 0.97 | 0.89 | 0.99 | 0.94 |
RELA | 0.95 | 0.96 | 0.96 | 0.83 | 0.97 | 0.90 |
RXRA | 0.93 | 0.91 | 0.92 | 0.84 | 0.95 | 0.89 |
SP1 | 0.98 | 0.98 | 0.98 | 0.89 | 0.99 | 0.94 |
TCF12 | 0.98 | 0.98 | 0.98 | 0.86 | 0.99 | 0.93 |
USF1 | 0.97 | 0.98 | 0.97 | 0.83 | 0.98 | 0.90 |
YY1 | 1 | 1 | 1 | 0.55 | 0.99 | 0.77 |
†The average performance of 10 rounds of 10-fold cross validation is indicated. The siRNA-generated knockdown data were obtained from Cusanovich et al. 13.
Intersection of genes with similar tissue-wide expression profiles and TF targets
To determine how many TF targets have similar tissue-wide expression profiles, we intersected the set of targets with the set of 500 PC genes with the most similar tissue-wide expression profiles for each TF ( Table 5, Additional file 6 21). The TFs PAX5 and POU2F2 are primarily expressed in B cells, and their respective targets IL21R and CD86 are also B cell-specific, which accounts for the high similarity in the tissue-wide expression profile between them. There are respectively 21 and 7 nuclear mitochondrial genes (e.g. MRPL9 and MRPS10, which are subunits of mitochondrial ribosomes) in the intersections for YY1 in the K562 and GM19238 cell lines 30. Previous studies reported that YY1 upregulates a large number of mitochondrial genes by complexing with PGC-1α in C2C12 cells 31, and genes involved in the mitochondrial respiratory chain in K562 cells 15, which is consistent with the idea that YY1 may broadly regulate mitochondrial function (within all 53 tissues in addition to the erythrocyte, lymphocyte and skeletal muscle cell lines).
Table 5. Intersection of TF targets and 500 protein-coding genes with the most similar tissue-wide expression profiles.
TF | Cell line | Number of
targets |
Size of
intersection |
Targets among the most similar 10 genes § |
---|---|---|---|---|
EGR1 | K562 | 169 | 12 | None |
ELF1 | 78 | 5 | None | |
ELK1 | 112 | 4 | GNL1(8 th) | |
ETS1 | 267 | 15 | None | |
GABPA | 513 | 25 | TAF1(1 st) | |
IRF1 | 457 | 10 | None | |
YY1 | 1752 | 127 |
MRPL9(2
nd),
BAZ1B(6
th),
ENY2(7
th),
NUB1(8
th),
USP1(9 th), HNRNPR(10 th) |
|
GM19238 | 1040 | 61 | MED4(1 st), SURF6(3 rd), BAZ1B(6 th) | |
BATF | 186 | 21 | None | |
JUND | 44 | 2 | None | |
NFE2L1 | 58 | 4 | None | |
RELA | 247 | 13 | HMG20B(9 th) | |
RXRA | 181 | 3 | None | |
SP1 | 1595 | 81 | None | |
TCF12 | 655 | 20 | None | |
USF1 | 301 | 21 | None | |
PAX5 | 918 | 86 | IL21R(8 th) | |
POU2F2 | 532 | 26 | CD86(3 rd) |
§The rank of each target in the list of similar genes in the descending order of Bray-Curtis similarity values is shown in the brackets immediately following the target.
Between 0.4%–25% of genes with similar tissue-wide expression profiles to the TFs are actually their targets ( Table 5); the majority are non-targets whose promoters contain non-functional binding sites that are distinguished from targets by their lack of co-regulation by corresponding cofactors. For YY1 and EGR1, we validated this hypothesis by contrasting the flanking cofactor binding site distributions and strengths in the promoters of the most similarly expressed target genes (YY1: MRPL9, BAZ1B; EGR1: CANX, NPM1) and non-target genes (YY1: ADNP, RNF25; EGR1: GUCY2F, AWAT1). In the promoters of these target genes, strong and intermediate recognition sites for TFs: SP1, KLF1, CEBPB formed heterotypic clusters with adjacent YY1 sites; as well TFBSs of SP1, KLF1, and NFY were frequently present adjacent to EGR1 binding sites (Additional file 7 21). These patterns contrasted with the enrichment of CTCF and ETV6 binding sites in gene promoters of YY1 and EGR1 non-targets (Additional file 7 21). Previous studies have reported that KLF1 is essential for terminal erythroid differentiation and maturation 32, direct physical interactions between YY1 and the constitutive activator SP1 synergistically induce transcription 33, the activating CEBPB promotes differentiation and suppresses proliferation of K562 cells by binding the promoter of the G-CSFR gene encoding a hematopoietin receptor 34, EGR1 and SP1 synergistically cooperate at adjacent non-overlapping sites on EGR1 promoter but compete binding at overlapping sites 35; whereas occupied CTCF binding sites often function as an insulator blocking the effects of cis-acting elements and preventing gene activation by mediating long-range DNA loops to alter topological chromatin structure 36, 37, and ETV6, a member of the ETS family, is a transcriptional repressor required for bone marrow hematopoiesis and associated with leukemia development 38.
Mutation analyses on promoters of direct targets
Because the promoters of direct target genes contain multiple binding site clusters, we hypothesized that this organization could stabilize gene expression against the effect of mutations in individual binding sites; in other words, the other clusters might be able to compensate for the loss of a cluster destroyed by mutations, so that the mutated promoters would still be capable of effectively regulating gene transcription upon TF binding. First, we examined whether introducing artificial variants into binding sites in silico in the promoter of the target gene MCM7 of EGR1 changed the classifier output ( Figure 5). Specifically, in the K562 cell line, MCM7 is upregulated by EGR1. Knockdown of MCM7 has an anti-proliferative and pro-apoptotic effect on K562 cells 39 and the loss of EGR1 increases leukemia initiating cells 40, which suggests that EGR1 may act as a tumor suppressor in K562 cells through the MCM7 pathway.
First, the strongest binding site (at position chr7:100103347[hg38], - strand, R i = 12.0 bits) in the promoter was eliminated by a G->A mutation, resulting in the loss of Cluster 1, which consists of two sites (the other site at chr7:100103339, -, 4.3 bits). The other two clusters comprising weaker binding sites of intermediate strength (chr7:100102252, +, 7.0 bits; chr7:100102244, +, 1.3 bits; chr7:100101980, +, 8.9 bits; chr7:100101977, +, 2.2 bits; chr7:100101984, +, 1.9 bits) were still predicted to compensate for this mutation, enabling the promoter to maintain capability of inducing MCM7 expression ( Figure 5). It is well known that adjacent clustered sites, which themselves may not be strong enough to individually bind TFs and activate transcription, can stabilize each other’s binding 2. The weaker sites flanking a strong binding site in a cluster can direct the TF molecule to the strong site and extend the period of the molecule physically associating with the strong site, which is termed, the funnel effect 2. In this example, Clusters 2 and Cluster 3 were also respectively removed by G->T and C->G mutations abolishing the strongest site in either cluster, which altered the prediction, that is, EGR1 should lose the capability to induce MCM7 transcription ( Figure 5). The remaining four sparse weak sites do not form a cluster and cannot completely supplant the disrupted strong sites.
Further, we examined the predicted impacts of known natural SNPs on binding site strengths, clusters and the regulatory state of the promoter for a direct target of each of the seven TFs from the CRISPR-generated perturbation data ( Table 6). Often a single SNP (e.g. rs996639427 of EGR1) can affect the strengths of multiple binding sites ( Table 6). Apart from SNPs that are predicted to abolish binding ( Figure 5), leaky variants that merely weaken TF binding are common ( Table 6). Binding stabilization between adjacent sites and the funnel effect enable CRMs comprised of information-dense clusters to be robust to mutations in individual binding sites 2, 41. In this way, neither mutations that abolish TFBSs nor leaky SNPs in flanking weak sites would be expected to destroy clusters (e.g. rs1030185383 and rs5874306 of IRF1), whereas SNPs with large reductions in R i values of central strong sites are more likely to abolish clusters (e.g. rs865922947, rs946037930, rs917218063 and rs928017336 of YY1) ( Table 6). More generally, the presence of multiple clusters enables promoters to be effectively resilient to the effects of binding site mutations; only the complete abolishment of all clusters resulting from the simultaneous occurrence of multiple SNPs should be able to transform the promoter to be unresponsive to TF binding to residual weak sites (e.g. rs997328042, rs1020720126 and rs185306857 of GABPA) ( Table 6). Furthermore, a relatively small number of SNPs that strengthen TF binding and eventually reinforce the regulatory effect of the TF are also present in these cases (e.g. rs887888062 of EGR1 and rs751263172 of ELF1) ( Table 6), suggesting that, in addition to deleterious mutations, potentially benign variants may also be found in promoters, consistent with the expectations of neutral theory 42.
Table 6. Mutation analyses on promoters of TF targets.
TF | Target | Normal cluster | Normal binding site § | SNP ID § | Variant binding site § | Variant
cluster ‡ |
Classifier
output |
||
---|---|---|---|---|---|---|---|---|---|
Variant † | Wild-
type |
||||||||
EGR1
( R sequence = 12.2899 bits) |
EID2B | Cluster 1
of 2 |
GA
GGGGGC
ATC
(chr19:39540296, -, 7.22 bits) |
rs538610162
(chr19:39540296C>G) |
CAGGGGGCATC
(chr19:39540286, -, 4.84 bits) |
Abolished | √ | × | √ |
rs759233998
(chr19:39540294C>T) |
GA
AGGGGCATC
(chr19:39540286, -, 0.06 bit) |
Abolished | √ | ||||||
rs974735901
(chr19:39540288T>A) |
GAGGGGGC
TTC
(chr19:39540286, -, 6.90 bits) |
Cluster 1
of 2 |
√ | ||||||
rs978230260
(chr19:39540287A>T) |
GAGGGGGCA
AC
(chr19:39540286, -, 5.31 bits) |
Abolished | √ | ||||||
Cluster 2
of 2 |
GCGTGCGT
GGG
(chr19:39540162, +, 1.59 bits) |
rs764734511
(chr19:39540162G>A) (chr19:39540162G>C) |
ACGTGCGTGGG
(chr19:39540162, +, -0.72 bit) |
Cluster 2
of 2 |
√ | ||||
CCGTGCGTGGG
(chr19:39540162, +, -0.79 bit) |
Cluster 2
of 2 |
√ | |||||||
rs996639427
(chr19:39540170G>C) |
GCGTGCGT
CGG
(chr19:39540162, +, -5.21 bits) |
Abolished | √ | ||||||
GCGT
GGGC
GC
T
(chr19:39540166, +, 9.72 bits) |
GCGT
CGGCGCT
(chr19:39540165, +, -0.85 bit) |
||||||||
rs1027751538
(chr19:39540174G>A) |
GCGTGGGC
ACT
(chr19:39540166, +, 5.16 bits) |
Abolished | √ | ||||||
rs887888062
(chr19:39540176T>A) |
GCGTGGGCGC
A
(chr19:39540166, +, 10.94 bits) |
Cluster 2
of 2 |
√ | ||||||
ELF1
( R sequence = 11.2057 bits) |
HIST1H4H | Cluster 1
of 2 |
GC
GGAAG
CGTG
(chr6:26286540, +, 9.92 bits) |
rs760968937
(chr6:26286547C>T) (chr6:26286547C>A) |
GCGGAAG
TGTG
(chr6:26286540, +, 10.71 bits) |
Cluster 1
of 2 |
√ | √ | √ |
GCGGAAG
AGTG
(chr6:26286540, +, 8.84 bits) |
Cluster 1 of 2 | √ | × | ||||||
rs1000196206
(chr6:26286542G>C) |
GC
CGAAGCGTG
(chr6:26286540, +, -6.26 bits) |
Abolished | √ | ||||||
rs144759258
(chr6:26286543G>A) |
GCG
AAAGCGTG
(chr6:26286540, +, -3.60 bits) |
Abolished | √ | ||||||
rs966435996
(chr6:26286544A>G) |
GCGG
GAGCGTG
(chr6:26286540, +, 5.28 bits) |
Abolished | √ | ||||||
rs950986427
(chr6:26286548G>A) |
GCGGAAGC
ATG
(chr6:26286540, +, 8.28 bits) |
Cluster 1 of 2 | √ | ||||||
Cluster 2
of 2 |
CAG
GAGATGC
G
(chr6:26286483, -, 6.98 bits) |
rs373649904
(chr6:26286483G>A) |
TAGGAGATGCG
(chr6:26286473, -, 0.61 bit) |
Abolished | √ | ||||
rs926919149
(chr6:26286480C>T) |
CAG
AAGATGCG
(chr6:26286473, -, -6.53 bits) |
Abolished | √ | ||||||
rs751263172
(chr6:26286479T>G) |
CAGG
CGATGCG
(chr6:26286473, -, 1.24 bits) |
Abolished | √ | ||||||
rs369076253
(chr6:26286473C>G) |
CAGGAGATGC
C
(chr6:26286473, -, 6.92 bits) |
Cluster 2
of 2 |
√ | ||||||
rs751263172
(chr6:1044474314C>T) |
CAGGA
A
ATGCG
(chr6:26286473, -, 11.43 bits) |
Cluster 2
of 2 |
√ | √ | |||||
ELK1
( R sequence = 11.9041 bits) |
G0S2 | Cluster 1
of 2 |
C
AGGGAAG
ACC
(chr1:209667969, -, 1.92 bits) |
rs146048477
(chr1:209667961T>A) |
CAGGGAAG
TCC
(chr1:209667959, -, 2.24 bits) |
Cluster 1 of 2 | √ | √ | √ |
rs887606802
(chr1:209667968T>C) |
C
GGGGAAGACC
(chr1:209667959, -, -3.35 bits) |
Cluster 1 of 2 | √ | × | |||||
rs1021034916
(chr1:209667967C>T) |
CA
AGGAAGACC
(chr1:209667959, -, -3.57 bits) |
Cluster 1 of 2 | √ | ||||||
GAGGA
AATGAG
(chr1:209667969, +, 8.14 bits) |
rs941962117
(chr1:209667974A>G) |
GAGGA
GATGAG
(chr1:209667969, +, 4.11 bits) |
Abolished | √ | |||||
Cluster 2 of 2 |
CTGGAAGA
GCA
(chr1:209673554, -, 5.91 bits) |
rs896117033
(chr1:209673545G>A) |
CTGGAAGAG
TA
(chr1:209673544, -, 3.95 bits) |
Cluster 2 of 2 | √ | ||||
rs971962577
(chr1:209673546C>T) |
CTGGAAGA
ACA
(chr1:209673544, -, 3.49 bits) |
Cluster 2 of 2 | √ | ||||||
rs1011969709
(chr1:209673554G>C) |
GTGGAAGAGCA
(chr1:209673544, -, 3.92 bits) |
Abolished | √ | ||||||
CCA
GAAGTCA
A
(chr1:209673551, +, 7.44 bits) |
CCA
CAAGTCAA
(chr1:209673551, +, -5.50 bits) |
||||||||
rs1023312090
(chr1:209673561A>G) |
CCAGAAGTCA
G
(chr1:209673551, +, 8.40 bits) |
Cluster 2 of 2 | √ | √ | |||||
ETS1
( R sequence = 10.0788 bits) |
TTC19 | Cluster 1 of 1 | GCA
GGGAA
AGG
(chr17:16022293, +, 7.92 bits) |
rs1022234223
(chr17:16022296G>C) |
GCA
CGGAAAGG
(chr17:16022293, +, -4.98 bits) |
Abolished | × | × | √ |
rs968299415
(chr17:16022301A>T) |
GCAGGGAA
TGG
(chr17:16022293, +, 10.01 bits) |
Cluster 1 of 1 | √ | √ | |||||
GABPA
( R sequence = 10.8567 bits) |
PLEKHB2 | Cluster 1 of 1 | A
CA
GGAAAGGG
(chr2:131112770, +, 10.36 bits) |
rs997328042
(chr2:131112771C>T) |
A
TAGGAAAGGG
(chr2:131112770, +, -3.68 bits) |
Abolished | × | × | √ |
rs1020720126
(chr2:131112773G>C) |
ACA
CGAAAGGG
(chr2:131112770, +, -4.16 bits) |
Abolished | × | ||||||
T
CTGGAAAC
TA
(chr2:131112760, +, 1.53 bits) |
rs185306857
(chr2:131112761C>A) |
T
ATGGAAACTA
(chr2:131112760, +, -2.86 bits) |
Cluster 1 of 1 | √ | |||||
rs772728699
(chr2:131112762T>A) |
TC
AGGAAACTA
(chr2:131112760, +, 5.23 bits) |
Cluster 1 of 1 | √ | ||||||
rs965753671
(chr2:131112769T>C) |
TCTGGAAAC
CA
(chr2:131112760, +, 2.13 bits) |
Cluster 1 of 1 | √ | ||||||
IRF1
( R sequence = 13.5544 bits) |
SMIM13 | Cluster 1 of 1 |
GAG
AA
TGAA
AGC
A
(chr6:11093663, +, 12.56 bits) |
rs950528541
(chr6:11093663G>C) |
CAGAATGAAAGCA
(chr6:11093663, +, 8.97 bits) |
Cluster 1 of 1 | √ | × | √ |
rs886259573
(chr6:11093664A>G) |
G
GGAATGAAAGCA
(chr6:11093663, +, 9.65 bits) |
Cluster 1 of 1 | √ | ||||||
rs982931728
(chr6:11093666A>G) |
GAG
GATGAAAGCA
(chr6:11093663, +, 8.09 bits) |
Cluster 1 of 1 | √ | ||||||
rs1020218811
(chr6:11093668T>G) |
GAGAA
GGAAAGCA
(chr6:11093663, +, 9.36 bits) |
Cluster 1 of 1 | √ | ||||||
rs570723026
(chr6:11093672A>G) |
GAGAATGAA
GGCA
(chr6:11093663, +, 8.01 bits) |
Cluster 1 of 1 | √ | ||||||
rs1004825794
(chr6:11093675A>C) (chr6:11093675A>T) |
GAGAATGAAAGC
C
(chr6:11093663, +, 10.47 bits) |
Cluster 1 of 1 | √ | ||||||
GAGAATGAAAGC
A
(chr6:11093663, +, 10.42 bits) |
Cluster 1 of 1 | √ | |||||||
AA
GACCAA
AGGCA
(chr6:11093641, +, 2.43 bits) |
rs1030185383
(chr6:11093649A>C) |
AAGACCAA
CGGCA
(chr6:11093641, +, -3.39 bits) |
Cluster 1 of 1 | √ | |||||
rs5874306
(chr6:11093650delG) |
AAGACCAAAGCAG
(chr6:11093641, +, 0.90 bit) |
Cluster 1 of 1 | √ | ||||||
rs558896490
(chr6:11093643G>A) |
AA
AACCAAAGGCA
(chr6:11093641, +, 7.06 bits) |
Cluster 1 of 1 | √ | √ | |||||
YY1
( R sequence = 12.8554 bits) |
CKLF | Cluster 1 of 1 |
GC
GGC
CATCGGC
(chr16:66549797, -, 10.06 bits) |
rs865922947
(chr16:66549791G>A) |
CCGGCCATCGGC
(chr16:66549785, -, 6.80 bits) |
Cluster 1 | √ | × | √ |
rs946037930
(chr16:66549794C>A) |
GC
TGCCATCGGC
(chr16:66549785, -, 8.02 bits) |
Cluster 1 | √ | ||||||
rs917218063
(chr16:66549793C>T) |
GCG
ACCATCGGC
(chr16:66549785, -, 5.41 bits) |
Abolished | × | ||||||
rs928017336
(chr16:66549791G>A) |
GCGGC
TATCGGC
(chr16:66549785, -, -3.62 bits) |
Abolished | × | ||||||
GCCGCCCCCGTC
(chr16:66549792, +, 1.34 bits) |
§All coordinates are based on the hg38 genome assembly. A bold italic letter in a binding site sequence indicates the base where a SNP occurs. For each normal and variant binding site sequence, the genome coordinate of its most 5’-end base and its R i value are indicated. The negative R i value of a variant binding site sequence implies this site is abolished. The SNPs strengthening binding sites and corresponding variant binding site sequences are underlined.
‡The impact on whether the occurrence of a single SNP resulted in the disappearance of the cluster containing it is shown; ‘Abolished’ indicates that the cluster is eliminated by the existence of the variant allele.
†After a single SNP occurred or multiple SNPs simultaneously occurred, the classifier produced a new prediction on whether the TF is still capable of significantly affecting gene expression via the variant promoter.
Discussion
In this study, the Bray-Curtis similarity function was initially shown (for the NR3C1 gene) to measure the relatedness of overall expression patterns between genes across a diverse set of tissues. A machine learning framework distinguished Bray-Curtis function-defined similar from dissimilar genes based on the distribution, strengths and compositions of TFBS clusters in accessible promoters, which can substantially account for the corresponding gene expression patterns. Using knockdown data as the gold standard, the combinatorial use of TF binding profiles and chromatin accessibility was also demonstrated to be predictive of TF targets. A binding site comparison confirmed that coregulatory cofactors can be used to distinguish between functional sites in targets and non-functional ones in non-targets. Furthermore, mutation analyses on binding sites of targets demonstrated that the existence of both multiple TFBSs in a cluster and multiple information-dense clusters in a promoter enables both the cluster and the promoter to be resilient to binding site mutations.
The DT classifier improved after intersecting promoters with DHSs in prediction of TF targets with the CRISPR-generated knockdown data ( Table 3). This intersection eliminated noisy binding sites that are inaccessible to TF proteins in promoters; specifically, it widened discrepancies in feature vectors between positives and negatives. If the 10 kb promoter of a gene instance does not overlap DHSs, its feature vector will only consist of 0; the percentages of negatives whose promoters do not overlap DHSs considerably exceeded those of positives (Additional file 8 21), which led to an excess of negatives with feature vectors containing only 0 after intersection. This explains why these negatives are not DE targets of the TFs in the K562 and GM19238 cell lines, because their entire promoters are not open to TF molecules; other regulatory regions besides the proximal promoters (e.g. intronic enhancers 43) still enable the TFs to effectively control the expression of the positives with inaccessible promoters. The relatively poor performance of the classifier on YY1 ( Table 3) is attributable to its smaller percentage of negatives with inaccessible promoters (Additional file 8 21) and the larger number of functional binding sites in the K562 cell line.
Our in-silico mutation analyses revealed that some deleterious TFBS mutations could be compensated for by other information-dense clusters in the same promoter 2; thus, predicting the effects of mutations in individual binding sites might not be sufficient to interpret downstream effects without considering their context. Though compensatory clusters may maintain gene expression, the promoter will provide lower levels of activity than the wild-type promoter could, which is a recipe for achieving natural phenotypic diversity 41. Few published studies in molecular diagnostics have specifically examined the effects of naturally occurring variants within clustered TFBSs; thus, IDBC-based machine learning provides an alternative computational approach to predict deleterious mutations that actually impact (i.e. repress or abolish) transcription of target genes and result in abnormal phenotypes, and to simultaneously minimize false positive calls of TFBS mutations that individually have little or no impact.
Apart from these TFs, the Bray-Curtis Similarity can be directly applied to identify the ground-truth genes with overall similar tissue-wide expression patterns to any other gene whose expression profile is known. Further studies could investigate the biological significance underlying the phenomenon that all these genes share a common expression pattern, including the similarity between other regulatory regions besides proximal promoters in terms of TFBSs and epigenetic markers. This machine learning framework can also be applied to predict target genes for other TFs and in other cell lines, depending on the availability of corresponding knockdown data.
There are a number of limitations of our approach. The Bray-Curtis function seems unable to accurately measure the similarity between the tissue-wide expression profiles of a gene (e.g. MIR23A) without any detectable mRNA in any of the 53 tissues analyzed and genes (e.g. the ubiquitously expressed NR3C1 and stomach-specific PGA3) that are expressed in at least one tissue. Intuitively, in terms of expression patterns PGA3 is more similar to MIR23A than NR3C1; however, the Bray-Curtis similarity values indicate that both PGA3 and NR3C1 bear no similarity to MIR23A (i.e. sim Bray-Curtis ( NR3 C1, MIR23 A) = sim Bray-Curtis ( PGA3, MIR23 A) = 0). Another possible limitation in classifier performance in the prediction of genes with similar tissue-wide expression profiles is that only binding sites of 82 TFs were analyzed due to a lack of available iPWMs for other TFs, given that 2000-3000 sequence-specific DNA-binding TFs are estimated to be encoded in the human genome 44. For example, four TFs (CREB, MYB, NF1, GRF1) were previously reported to bind the promoter of the NR3C1 gene to activate or repress its expression; however, their iPWMs exhibiting known primary motifs could not be successfully derived from ChIP-seq data 3, 29. Regarding the CRISPR-generated knockdown data used here, positives were inferred to be direct targets by intersecting their promoters with corresponding ChIP-seq peaks, which may not be completely accurate, due to the presence of noise peaks that do not contain true TFBSs 3, 45. Small fold changes in the expression levels of DE targets could arise from compromised efficiency of knockdowns as a result of suboptimal guide RNAs or the limitations of perturbing only a single allele of the TF 46. Finally, the framework developed here only takes into account the 10 kb interval proximal to the TSS, and would not therefore capture long range enhancer effects beyond this distance; by contrast, correlation based approaches have successfully incorporated multiple definitions of promoter length 14.
Conclusions
The Bray-Curtis function is able to effectively quantify the similarity between tissue-wide gene expression profiles. By analysis of information theory-based TF binding profiles that captured the spatial distribution and information contents of TFBS clusters, ChIP-seq and chromatin accessibility data, we described a machine learning framework that distinguished tissue-wide expression profiles of similar vs dissimilar genes and identified TF target genes. Functional binding sites in target genes that significantly alter expression levels upon direct binding are at least partially distinguished by TF-cofactor coregulation from non-functional sites in non-targets. Finally, in-silico mutation analyses demonstrated that the presence of multiple information-dense clusters in the promoter, as a protective mechanism, reduces deleterious mutations that can significantly alter the regulatory state and expression level of the gene.
An earlier version this article is available from bioRxiv: https://doi.org/10.1101/283267 47.
Data availability
Underlying data
The median RPKM, TSS coordinate, DNase I hypersensitivity and ChIP-seq data were respectively obtained from the GTEx Analysis V6p release ( www.gtexportal.org), Ensembl Biomart ( www.ensembl.org) and ENCODE ( www.encodeproject.org). The CRISPR- and siRNA-generated knockdown data were obtained from the supplementary information files of Dixit et al. 15 and Cusanovich et al. 13. The source datasets generated and/or analysed by this framework, along with sample results and compiled software are available from Zenodo. DOI: https://doi.org/10.5281/zenodo.1707423 20.
Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
Extended data
Additional files are available from Zenodo. DOI: https://doi.org/10.5281/zenodo.1698281 21.
Additional file 1. The mathematical definitions of the four other similarity metrics, the workflow of the IDBC algorithm, an example feature vector, the mathematical definitions of five statistical variables to measure classifier performance, the default parameter values of classifiers in MATLAB, and histograms visualizing the first two criteria for selecting positives from the CRISPR-generated knockdown data.
Additional file 2: The lists of positives and negatives in the machine learning classifiers to predict genes with similar tissue-wide expression profiles.
Additional file 3: The lists of positives and negatives in the DT classifier to predict TF targets based on the CRISPR-generated knockdown data.
Additional file 4: The lists of positives and negatives in the DT classifier to predict DE direct targets based on the siRNA-generated knockdown data.
Additional file 5: The performance of the DT classifier using only TFBS counts.
Additional file 6: The list of the most similar 500 PC genes to each TF in terms of tissue-wide expression profiles, and the intersection of these 500 genes and target genes of the TF.
Additional file 7: Cofactor binding sites adjacent to YY1 and EGR1 sites in the promoters of their targets and non-targets.
Additional file 8: The percentages of positives and negatives whose promoters do not overlap DHSs for the CRISPR-perturbed TFs.
Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
Software availability
Source code implementing the machine learning framework available at: https://bitbucket.org/cytognomix/information-dense-transcription-factor-binding-site-clusters/.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.1892051 48.
License: GNU General Public License 3.0.
Acknowledgements
We are grateful to Ben Shirley and Eliseos Mucaki for constructive comments on the paper.
Funding Statement
Natural Sciences and Engineering Research Council of Canada Discovery Grant [RGPIN-2015-06290]; Canada Foundation for Innovation; Canada Research Chairs; Cytognomix Inc. Compute Canada and Shared Hierarchical Academic Research Computing Network (SHARCNET) provided high performance computing and storage facilities. Funding for open access charge: University of Western Ontario and the Natural Sciences and Engineering Research Council.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 1; peer review: 2 approved with reservations]
References
- 1. Hosseinpour B, Bakhtiarizadeh MR, Khosravi P, et al. : Predicting distinct organization of transcription factor binding sites on the promoter regions: a new genome-based approach to expand human embryonic stem cell regulatory network. Gene. 2013;531(2):212–9. 10.1016/j.gene.2013.09.011 [DOI] [PubMed] [Google Scholar]
- 2. Ezer D, Zabet NR, Adryan B: Homotypic clusters of transcription factor binding sites: A model system for understanding the physical mechanics of gene expression. Comput Struct Biotechnol J. 2014;10(17):63–9. 10.1016/j.csbj.2014.07.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Lu R, Mucaki EJ, Rogan PK: Discovery and validation of information theory-based transcription factor and cofactor binding site motifs. Nucleic Acids Res. 2017;45(5):e27. 10.1093/nar/gkw1036 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Schneider TD: Information content of individual genetic sequences. J Theor Biol. 1997;189(4):427–41. 10.1006/jtbi.1997.0540 [DOI] [PubMed] [Google Scholar]
- 5. Dinakarpandian D, Raheja V, Mehta S, et al. : Tandem machine learning for the identification of genes regulated by transcription factors. BMC Bioinformatics. 2005;6:204. 10.1186/1471-2105-6-204 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Ouyang Z, Zhou Q, Wong WH: ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc Natl Acad Sci U S A. 2009;106(51):21521–6. 10.1073/pnas.0904863106 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Cheng C, Alexander R, Min R, et al. : Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome Res. 2012;22(9):1658–67. 10.1101/gr.136838.111 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Budden DM, Hurley DG, Cursons J, et al. : Predicting expression: the complementary power of histone modification and transcription factor binding data. Epigenetics Chromatin. 2014;7(1):36. 10.1186/1756-8935-7-36 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Smith AD, Sumazin P, Xuan Z, et al. : DNA motifs in human and mouse proximal promoters predict tissue-specific expression. Proc Natl Acad Sci U S A. 2006;103(16):6275–80. 10.1073/pnas.0508169103 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. McLeay RC, Lesluyes T, Cuellar Partida G, et al. : Genome-wide in silico prediction of gene expression. Bioinformatics. 2012;28(21):2789–96. 10.1093/bioinformatics/bts529 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Karlić R, Chung HR, Lasserre J, et al. : Histone modification levels are predictive for gene expression. Proc Natl Acad Sci U S A. 2010;107(7):2926–31. 10.1073/pnas.0909344107 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Dong X, Greven MC, Kundaje A, et al. : Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 2012;13(9):R53. 10.1186/gb-2012-13-9-r53 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Cusanovich DA, Pavlovic B, Pritchard JK, et al. : The functional consequences of variation in transcription factor binding. PLoS Genet. 2014;10(3):e1004226. 10.1371/journal.pgen.1004226 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Banks CJ, Joshi A, Michoel T: Functional transcription factor target discovery via compendia of binding and expression profiles. Sci Rep. 2016;6:20649. 10.1038/srep20649 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Dixit A, Parnas O, Li B, et al. : Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens. Cell. 2016;167(7):1853–1866.e17. 10.1016/j.cell.2016.11.038 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Cui S, Youn E, Lee J, et al. : An improved systematic approach to predicting transcription factor target genes using support vector machine. PLoS One. 2014;9(4):e94519. 10.1371/journal.pone.0094519 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Bray JR, Curtis JT: An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecol Monogr. 1957;27(4):325–349. 10.2307/1942268 [DOI] [Google Scholar]
- 18. International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004;431(7011):931–45. 10.1038/nature03001 [DOI] [PubMed] [Google Scholar]
- 19. GTEx Consortium: The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45(6):580–5. 10.1038/ng.2653 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Lu R, Rogan PK: Information-dense transcription factor binding site clusters identify target genes with similar tissue-wide expression profiles and serve as a buffer against mutations - Source datasets, sample results and compiled software.2018. 10.5281/zenodo.1707423 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Lu R, Rogan PK: Information-dense transcription factor binding site clusters identify target genes with similar tissue-wide expression profiles and serve as a buffer against mutations - Additional files.2018. 10.5281/zenodo.1698281 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. 10.1038/nature11247 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Thurman RE, Rynes E, Humbert R, et al. : The accessible chromatin landscape of the human genome. Nature. 2012;489(7414):75–82. 10.1038/nature11232 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Pearson K: Note on Regression and Inheritance in the Case of Two Parents. Proc R Soc Lond. 1895;58:240–2. 10.1098/rspl.1895.0041 [DOI] [Google Scholar]
- 25. Spearman C: The Proof and Measurement of Association between Two Things. Am J Psychol. 1904;15(1):72–101. 10.2307/1412159 [DOI] [PubMed] [Google Scholar]
- 26. He H, Garcia EA: Learning from Imbalanced Data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–1284. 10.1109/TKDE.2008.239 [DOI] [Google Scholar]
- 27. Kent WJ, Sugnet CW, Furey TS, et al. : The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006. 10.1101/gr.229102 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Sherry ST, Ward MH, Kholodov M, et al. : dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29(1):308–11. 10.1093/nar/29.1.308 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Vandevyver S, Dejager L, Libert C: Comprehensive overview of the structure and regulation of the glucocorticoid receptor. Endocr Rev. 2014;35(4):671–93. 10.1210/er.2014-1010 [DOI] [PubMed] [Google Scholar]
- 30. Calvo SE, Clauser KR, Mootha VK: MitoCarta2.0: an updated inventory of mammalian mitochondrial proteins. Nucleic Acids Res. 2016;44(D1):D1251–1257. 10.1093/nar/gkv1003 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Cunningham JT, Rodgers JT, Arlow DH, et al. : mTOR controls mitochondrial oxidative function through a YY1-PGC-1alpha transcriptional complex. Nature. 2007;450(7170):736–40. 10.1038/nature06322 [DOI] [PubMed] [Google Scholar]
- 32. Tallack MR, Perkins AC: KLF1 directly coordinates almost all aspects of terminal erythroid differentiation. IUBMB Life. 2010;62(12):886–90. 10.1002/iub.404 [DOI] [PubMed] [Google Scholar]
- 33. Seto E, Lewis B, Shenk T: Interaction between transcription factors Sp1 and YY1. Nature. 1993;365(6445):462–4. 10.1038/365462a0 [DOI] [PubMed] [Google Scholar]
- 34. Ferrari-Amorotti G, Mariani SA, Novi C, et al. : The biological effects of C/EBPalpha in K562 cells depend on the potency of the N-terminal regulatory region, not on specificity of the DNA binding domain. J Biol Chem. 2010;285(40):30837–50. 10.1074/jbc.M110.128272 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Huang RP, Fan Y, Ni Z, et al. : Reciprocal modulation between Sp1 and Egr-1. J Cell Biochem. 1997;66(4):489–99. [DOI] [PubMed] [Google Scholar]
- 36. Bell AC, West AG, Felsenfeld G: The protein CTCF is required for the enhancer blocking activity of vertebrate insulators. Cell. 1999;98(3):387–96. 10.1016/S0092-8674(00)81967-4 [DOI] [PubMed] [Google Scholar]
- 37. Hou C, Zhao H, Tanimoto K, et al. : CTCF-dependent enhancer-blocking by alternative chromatin loop formation. Proc Natl Acad Sci U S A. 2008;105(51):20398–403. 10.1073/pnas.0808506106 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Wang LC, Swat W, Fujiwara Y, et al. : The TEL/ETV6 gene is required specifically for hematopoiesis in the bone marrow. Genes Dev. 1998;12(15):2392–402. 10.1101/gad.12.15.2392 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Tian L, Liu J, Xia GH, et al. : RNAi-mediated knockdown of MCM7 gene on CML cells and its therapeutic potential for leukemia. Med Oncol. 2017;34(2):21. 10.1007/s12032-016-0878-x [DOI] [PubMed] [Google Scholar]
- 40. Maifrede S, Liebermann D, Hoffman B: Egr-1, a Stress Response Transcription Factor and Myeloid Differentiation Primary Response Gene, Behaves As Tumor Suppressor in CML. Blood. 2014;124:2211 Reference Source [Google Scholar]
- 41. Smith T, Husbands P, Layzell P, et al. : Fitness landscapes and evolvability. Evol Comput. 2002;10(1):1–34. 10.1162/106365602317301754 [DOI] [PubMed] [Google Scholar]
- 42. Kimura M: The neutral theory of molecular evolution. Sci Am. 1979;241(5):98–100, 102, 108 passim. [DOI] [PubMed] [Google Scholar]
- 43. Hural JA, Kwan M, Henkel G, et al. : An intron transcriptional enhancer element regulates IL-4 gene locus accessibility in mast cells. J Immunol. 2000;165(6):3239–49. 10.4049/jimmunol.165.6.3239 [DOI] [PubMed] [Google Scholar]
- 44. Vaquerizas JM, Kummerfeld SK, Teichmann SA, et al. : A census of human transcription factors: function, expression and evolution. Nat Rev Genet. 2009;10(4):252–63. 10.1038/nrg2538 [DOI] [PubMed] [Google Scholar]
- 45. Kidder BL, Hu G, Zhao K: ChIP-Seq: technical considerations for obtaining high-quality data. Nat Immunol. 2011;12(10):918–22. 10.1038/ni.2117 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Shao Y, Chan CY, Maliyekkel A, et al. : Effect of target secondary structure on RNAi efficiency. RNA. 2007;13(10):1631–40. 10.1261/rna.546207 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Lu R, Rogan PK: Information-dense transcription factor binding site clusters identify target genes with similar tissue-wide expression profiles and serve as a buffer against mutations. bioRxiv. 2018;283267 10.1101/283267 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Lu R, Rogan PK: Information dense transcription factor binding site clusters identify target genes with similar tissue-wide expression profiles and buffer against mutations - source code. Zenodo. 2018. 10.5281/zenodo.1892051 [DOI] [PMC free article] [PubMed] [Google Scholar]