Unsupervised tensor decomposition-based method to extract candidate transcription factors as histone modification bookmarks in post-mitotic transcriptional reactivation

Y-h Taguchi; Turki Turki

doi:10.1371/journal.pone.0251032

PLoS One. 2021 May 25;16(5):e0251032. doi: 10.1371/journal.pone.0251032

Editor: Andrei Chernov

PMCID: PMC8148352 PMID: 34032804

Abstract

The histone group added to a gene sequence must be removed during mitosis to halt transcription during the DNA replication stage of the cell cycle. However, the detailed mechanism of this transcription regulation remains unclear. In particular, it is not realistic to reconstruct all appropriate histone modifications throughout the genome from scratch after mitosis. Thus, it is reasonable to assume that there might be a type of “bookmark” that retains the positions of histone modifications, which can be readily restored after mitosis. We developed a novel computational approach comprising tensor decomposition (TD)-based unsupervised feature extraction (FE) to identify transcription factors (TFs) that bind to genes associated with reactivated histone modifications as candidate histone bookmarks. To the best of our knowledge, this is the first application of TD-based unsupervised FE to the cell division context and phases pertaining to the cell cycle in general. The candidate TFs identified with this approach were functionally related to cell division, suggesting the suitability of this method and the potential of the identified TFs as bookmarks for histone modification during mitosis.

1 Introduction

During the cell division process, gene transcription must be initially terminated and then reactivated once cell division is complete. However, the specific mechanism and factors controlling this process of transcription regulation remain unclear. Since it would be highly time- and energy-consuming to mark all genes that need to be transcribed from scratch after each cycle of cell division, it has been proposed that genes that need to be transcribed are “bookmarked” to easily recover these positions for reactivation [1–4]. Despite several proposals, the actual mechanism and nature of these “bookmarks” have not yet been identified: condensed mitotic chromosomes have been suggested to act as bookmarks [5], some histone modifications have been suggested to serve as bookmarks [6–8], and some transcription factors (TFs) have also been identified as potential bookmarks [9–13].

Recently, [14] suggested that histone 3 monomethylation or trimethylation at lysine 4 (H3K4me1 and H3K4me3, respectively) can act as a “bookmark” to identify genes to be transcribed, and that a limited number of TFs might act as bookmarks. However, there has been no comprehensive search for candidate “bookmark” TFs based on large-scale datasets.

We here propose a novel computational approach to search for TFs that might act as “bookmarks” during mitosis, which involves tensor decomposition (TD)-based unsupervised feature extraction (FE) (Fig 1). In brief, after fragmenting the whole genome into DNA regions of 25,000 nucleotides, the histone modifications within each region were summed. The resulting values are arranged as a tensor indexed by DNA region and experimental conditions (e.g., histone modification, cell line, and cell division phase), and singular-value vectors associated with either the DNA regions or the experimental conditions are derived. After investigating the singular-value vectors attributed to the experimental conditions, the DNA regions significantly associated with the selected singular-value vectors were chosen as potentially biologically relevant regions. The genes included in the selected DNA regions were then identified and uploaded to the enrichment server Enrichr to identify TFs that target the genes. To our knowledge, this is the first method utilizing a TD-based unsupervised FE approach in a fully unsupervised fashion to comprehensively search for possible candidate bookmark TFs.

2 Materials and methods

Sample R code is available in S1 Text.

2.1 Histone modification

The whole-genome histone modification profile was downloaded from the Gene Expression Omnibus (GEO) GSE141081 dataset. Sixty individual files (with extension .bw) were extracted from the raw GEO file. After excluding six CCCTC-binding factor (CTCF) chromatin immunoprecipitation-sequencing files and six third-replicate histone modification files, a total of 48 histone modification profiles were retained for analysis. The DNA sequences of each chromosome were divided into 25,000-bp regions. Note that the last DNA region of each chromosome may be shorter, since the chromosome length is not always a multiple of 25,000. Histone modifications were then summed in each DNA region, and the sum was used as the input value for the analysis. In total, N = 123,817 DNA regions were available for analysis. Thus, with approximately 120,000 regions of 25,000 bp each, we covered the approximate human genome length of 3 × 10⁹ bp.
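
The binning step can be illustrated with a minimal Python sketch. It is not the authors' R code (see S1 Text); the function names `bin_edges` and `sum_signal` are hypothetical, the input is assumed to be already parsed into (start, end, value) intervals, and weighting each value by the number of bases it covers is an assumption about how the signal is summed.

```python
BIN_SIZE = 25_000  # region length used in the paper

def bin_edges(chrom_length, bin_size=BIN_SIZE):
    """Half-open (start, end) regions tiling one chromosome; the last may be shorter."""
    return [(s, min(s + bin_size, chrom_length))
            for s in range(0, chrom_length, bin_size)]

def sum_signal(intervals, chrom_length, bin_size=BIN_SIZE):
    """Sum a piecewise-constant signal per region.

    `intervals` holds (start, end, value) triples; each value is weighted by
    the number of bases it covers inside the region (an assumption).
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    totals = [0.0] * n_bins
    for start, end, value in intervals:
        pos = start
        while pos < end:
            b = pos // bin_size
            stop = min(end, (b + 1) * bin_size)  # do not cross the region boundary
            totals[b] += value * (stop - pos)
            pos = stop
    return totals

# Toy chromosome of 60,000 bp (hypothetical length, for illustration only)
edges = bin_edges(60_000)
totals = sum_signal([(24_000, 26_000, 2.0), (55_000, 56_000, 1.0)], 60_000)
```

For the toy chromosome, the third region spans only 10,000 bp, which mirrors the shorter last region mentioned above. The genome-length check is plain arithmetic: 123,817 × 25,000 = 3,095,425,000 bp.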

2.2 Tensor data representation

Histone modification profiles were formatted as a tensor, $x_{i j k m s} \in R^{N \times 2 \times 4 \times 3 \times 2}$, which corresponds to the kth histone modification (k = 1: acetylation, H3K27ac; k = 2: H3K4me1; k = 3: H3K4me3; and k = 4: Input) at the ith DNA region of the jth cell line (j = 1: RPE1 and j = 2: U2OS) at the mth phase of the cell cycle (m = 1: interphase, m = 2: prometaphase, and m = 3: anaphase/telophase) of the sth replicate (s = 1, 2). x_ijkms was normalized such that $\sum_{i} x_{i j k m s} = 0$ and $\sum_{i} x_{i j k m s}^{2} = N$ (Table 1). There are two biological replicates for each combination of cell line (RPE1 or U2OS), ChIP-seq target (H3K27ac, H3K4me1, H3K4me3, or Input), and cell cycle phase.

Table 1. Combinations of experimental conditions.

Individual conditions are associated with two replicates.

Phase	H3K27ac, RPE1	H3K27ac, U2OS	H3K4me1, RPE1	H3K4me1, U2OS	H3K4me3, RPE1	H3K4me3, U2OS	Input, RPE1	Input, U2OS
interphase	○	○	○	○	○	○	○	○
prometaphase	○	○	○	○	○	○	○	○
anaphase/telophase	○	○	○	○	○	○	○	○
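
As an illustration of the tensor layout and normalization described in this section, the following NumPy sketch arranges toy values into the five-way array and rescales each of the 48 profiles over the region index i. The random input and the function name `normalize_over_regions` are hypothetical; only the axis order (i, j, k, m, s) and the two normalization constraints come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # toy number of regions; the paper uses N = 123,817

# Hypothetical raw sums, axes = (region i, cell line j, modification k, phase m, replicate s)
raw = rng.gamma(shape=2.0, scale=10.0, size=(N, 2, 4, 3, 2))

def normalize_over_regions(x):
    """Scale each profile so that sum_i x = 0 and sum_i x**2 = N."""
    centered = x - x.mean(axis=0, keepdims=True)
    rms = np.sqrt((centered ** 2).mean(axis=0, keepdims=True))
    return centered / rms

x = normalize_over_regions(raw)
```

Each of the 2 × 4 × 3 × 2 = 48 profiles is normalized independently, so differences in overall signal scale between experiments do not dominate the decomposition.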

2.3 Tensor decomposition

Higher-order singular value decomposition (HOSVD) [15] was applied to x_ijkms to obtain the decomposition

$$x_{i j k m s} = \sum_{\ell_{1} = 1}^{2} \sum_{\ell_{2} = 1}^{4} \sum_{\ell_{3} = 1}^{3} \sum_{\ell_{4} = 1}^{2} \sum_{\ell_{5} = 1}^{N} G (\ell_{1} \ell_{2} \ell_{3} \ell_{4} \ell_{5}) u_{\ell_{1} j} u_{\ell_{2} k} u_{\ell_{3} m} u_{\ell_{4} s} u_{\ell_{5} i}, \tag{1}$$

where $G \in R^{2 \times 4 \times 3 \times 2 \times N}$ is the core tensor, and $u_{ℓ_{1} j} \in R^{2 \times 2}, u_{ℓ_{2} k} \in R^{4 \times 4}, u_{ℓ_{3} m} \in R^{3 \times 3}, u_{ℓ_{4} s} \in R^{2 \times 2}$, and $u_{ℓ_{5} i} \in R^{N \times N}$ are singular-value vector matrices, all of which are orthogonal. We use the complete rather than the truncated representation of TD because TD was computed with HOSVD, for which the singular-value vectors are the same in both representations; i.e., u_ℓ₁j, u_ℓ₂k, u_ℓ₃m, and u_ℓ₄s are not altered between the truncated and the full representation. For more details, see [15].

We briefly summarize how to compute Eq (1) using the HOSVD algorithm, which has been described in detail previously [15]. First, x_ijkms is unfolded into a matrix, $x_{i (j k m s)} \in R^{N \times 48}$. Then, SVD is applied to obtain

$$x_{i (j k m s)} = \sum_{\ell_{5} = 1}^{N} u_{\ell_{5} i} \lambda_{\ell_{5}} v_{\ell_{5}, j k m s}. \tag{2}$$

Then, only u_ℓ₅i is retained, and v_ℓ₅,jkms is discarded. Similar procedures are applied to x_ijkms by replacing i with one of j, k, m, and s in order to obtain u_ℓ₁j, u_ℓ₂k, u_ℓ₃m, and u_ℓ₄s. Finally, G can be computed as

$$G (\ell_{1} \ell_{2} \ell_{3} \ell_{4} \ell_{5}) = \sum_{i = 1}^{N} \sum_{j = 1}^{2} \sum_{k = 1}^{4} \sum_{m = 1}^{3} \sum_{s = 1}^{2} x_{i j k m s} u_{\ell_{5} i} u_{\ell_{1} j} u_{\ell_{2} k} u_{\ell_{3} m} u_{\ell_{4} s}. \tag{3}$$
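
The unfold, SVD, and projection steps of Eqs (1) to (3) can be sketched in NumPy as follows. This is an illustrative sketch on a small random tensor, not the authors' implementation; the helper names `unfold`, `hosvd`, and `reconstruct` are hypothetical. Two assumptions differ from the notation above: the region index i is the first axis of the array (so the core is ordered ℓ₅, ℓ₁, ℓ₂, ℓ₃, ℓ₄), and economy-size SVDs are used, so the region mode keeps at most 48 vectors instead of N.

```python
import numpy as np

def unfold(x, mode):
    """Unfold tensor x into a matrix whose rows are indexed by `mode`."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return (core, factors); factors[mode][:, l] is the l-th singular-value vector."""
    factors = []
    for mode in range(x.ndim):
        # Keep the left singular vectors only; the right ones are discarded (Eq 2)
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    core = x
    for mode, u in enumerate(factors):
        # Project each mode onto its singular-value vectors (Eq 3)
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Multiply the core by every factor matrix (Eq 1)."""
    x = core
    for mode, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, x, axes=(1, mode)), 0, mode)
    return x

rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 2, 4, 3, 2))  # toy tensor with 50 regions
core, factors = hosvd(toy)
rebuilt = reconstruct(core, factors)
```

Because the unfolded region matrix has only 48 columns, at most 48 singular-value vectors attributed to regions carry any signal, so the economy-size decomposition still reproduces the toy tensor up to floating-point error.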

2.4 TD-based unsupervised FE

Although the method was fully described in a recently published book [15], we summarize the process of selecting genes starting from the TD.

First, identify which singular-value vectors attributed to samples (e.g., cell lines, type of histone modification, cell cycle phase, and replicates) are associated with the desired properties (e.g., “not dependent upon replicates or cell lines,” “represents re-activation,” and “distinct between input and histone modifications”). The number of singular-value vectors to select is not decided in advance because, given the unsupervised nature of TD, there is no way to know beforehand how the singular-value vectors will behave.
Next, identify which singular-value vectors attributed to genomic regions are associated with the desired properties described above by investigating the core tensor, G. We select the singular-value vectors attributed to genomic regions that share G with larger absolute values with the singular-value vectors selected in the previous step, because these singular-value vectors attributed to genomic regions are likely associated with the desired properties.
Then, using the selected singular-value vectors attributed to genomic regions, select the genomic regions whose components in these vectors have larger absolute values, because such genomic regions are likely associated with the desired properties. Usually, the components of singular-value vectors attributed to genomic regions are assumed to obey a Gaussian distribution (null hypothesis), and P-values are attributed to individual genomic regions. P-values are corrected for multiple comparisons, and the genomic regions associated with adjusted P-values less than the threshold value are selected.
There is no definite way to select singular-value vectors. The selection can only be evaluated using the selected genes. If the selected genes are not reasonable, an alternative selection of singular-value vectors should be attempted. When we cannot obtain any reasonable genes, we abort the procedure.

To select the DNA regions of interest (i.e., those associated with transcription reactivation), we first needed to specify the singular-value vectors attributed to the cell line, histone modification, cell cycle phase, and replicate that reflect the biological feature of interest, transcription reactivation. Once a specific index set ℓ₁, ℓ₂, ℓ₃, ℓ₄ has been selected as being associated with the biological feature of interest, we select the ℓ₅ for which G has larger absolute values, since the singular-value vector u_ℓ₅i with this ℓ₅ represents the degree of association between individual DNA regions and reactivation. Using this ℓ₅, we attribute P-values to the ith DNA region, assuming that u_ℓ₅i obeys a Gaussian distribution (null hypothesis), using the χ² distribution

$$P_{i} = P_{\chi^{2}} \left[ > \left( \frac{u_{\ell_{5} i}}{\sigma_{\ell_{5}}} \right)^{2} \right], \tag{4}$$

where P_χ²[> x] is the cumulative probability of the χ² distribution that the argument is larger than x, and $σ_{ℓ_{5}}$ is the standard deviation. P-values were then corrected by the Benjamini-Hochberg (BH) criterion [15], and the DNA regions associated with adjusted P-values less than 0.01 were selected as those significantly associated with transcription reactivation.

The algorithm, displayed with mathematical formulas, is shown in Fig 2.
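
The region-selection step of Eq (4) followed by BH correction can be sketched in plain Python. The sketch assumes one degree of freedom for the χ² distribution (the text does not state it), for which the upper-tail probability reduces to a complementary error function; the function names and the toy vector `u` are hypothetical.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability P[chi2 > x], assuming 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def region_p_values(u):
    """Eq (4): P-value for each component of a singular-value vector."""
    mean = sum(u) / len(u)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in u) / len(u))
    return [chi2_sf_1df((v / sigma) ** 2) for v in u]

def benjamini_hochberg(p):
    """BH-adjusted P-values, returned in the original order."""
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Toy vector: 200 small components plus one outlying region at index 17
u = [0.01 * ((i % 7) - 3) for i in range(200)]
u[17] = 1.0
adj = benjamini_hochberg(region_p_values(u))
selected = [i for i, q in enumerate(adj) if q < 0.01]
```

In the toy example only the outlying component survives the 0.01 threshold, which is the behaviour the procedure relies on: regions far in the tails of the assumed Gaussian are the ones selected.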

2.5 Enrichment analysis

Gene symbols included in the selected DNA regions were retrieved using the biomaRt package [16] of R [17] based on the hg19 reference genome. The selected gene symbols were then uploaded to Enrichr [18] for functional annotation to identify the TFs that target them.

2.6 DESeq2

When DESeq2 [19] was applied to the present data set, the six samples within each cell line (three cell cycle phases, each with two replicates) were considered. The three cell cycle phases were treated as categorical classes with no rank order, since we aimed to detect not a monotonic change across phases but re-activation among them. All other parameters were left at their defaults. Non-integer values were truncated to integers (e.g., 1400.53 was converted to 1400).

2.7 csaw

Since csaw [20] requires BAM files, which are not available in GEO, we first downloaded the 60 FASTQ files of GEO ID GSE141081 from SRA and mapped them to the hg38 human genome using bowtie2 [21]. The SAM files generated by bowtie2 were converted, sorted, and indexed with samtools [22] to generate sorted BAM files. The BAM files corresponding to individual combinations of cell line and ChIP-seq target were loaded into csaw to identify differential binding among the three cell cycle phases.

2.8 Identification of overlapping regions between peak calls

We retrieved 36 peak call data sets (with extension peaks.txt.gz), which correspond to the 48 ChIP-seq files excluding the 12 Input files. Starting from these 36 peak call files, we used the findOverlapsOfPeaks function in the ChIPpeakAnno package in R to select overlapping regions step by step as follows.

Identify overlapping regions between the two biological replicates; this results in 9 peak calls each for the U2OS and RPE1 cell lines, 18 peak calls in total.
Identify overlapping regions among the three cell cycle phases: retrieve regions present in all three cell cycle phases for H3K4me1 and H3K4me3, and regions present only in interphase and anaphase/telophase for H3K27ac; this results in three peak calls (one each for H3K4me1, H3K27ac, and H3K4me3) for each of the U2OS and RPE1 cell lines, 6 peak calls in total.
Identify the overlap among the 6 peak calls.

This process is illustrated in Fig 3.
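
The idea of narrowing peak calls step by step can be illustrated with a simplified interval intersection in plain Python. This is only a sketch of the concept: it is not ChIPpeakAnno's findOverlapsOfPeaks, it treats peaks on a single chromosome as half-open (start, end) intervals, and it does not implement the exclusion of prometaphase peaks needed for H3K27ac. The function names and coordinates are hypothetical.

```python
from functools import reduce

def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def intersect_all(peak_sets):
    """Regions shared by every peak set, applied pairwise from left to right."""
    return reduce(intersect, peak_sets)

# Toy peak calls (hypothetical coordinates)
replicate_1 = [(0, 10), (20, 30)]
replicate_2 = [(5, 25)]
shared = intersect(replicate_1, replicate_2)
shared_across_phases = intersect_all([shared, [(8, 22)], [(0, 100)]])
```

Applying the same pairwise operation first to replicates, then to phases, and finally across the six remaining peak calls mirrors the order of the steps listed above.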

3 Results and discussion

We first attempted to identify which singular-value vectors are most strongly associated with transcription reactivation among the vectors for cell line (u_ℓ₁j), histone modification (u_ℓ₂k), cell cycle phase (u_ℓ₃m), and replicate (u_ℓ₄s) (Fig 4). First, we considered phase dependency. Fig 5 shows the singular-value vectors u_ℓ₃m attributed to cell cycle phases. When a set of genes shares some dependence, the singular-value vectors reflect their mean behaviour; that is, singular-value vectors act as a kind of pseudo-representative gene. Thus, by investigating singular-value vectors, we can find what kind of cell cycle dependence appears in a group of genes. Since reactivation means being active in interphase and anaphase/telophase but not in prometaphase, singular-value vectors related to reactivation should take opposite signs between interphase/anaphase/telophase and prometaphase. Although both u_2m and u_3m were associated with reactivation in this sense, we further considered only u_3m since it showed a more pronounced reactivation profile. Next, we investigated the singular-value vectors u_ℓ₂k attributed to histone modification (Fig 6). There was no clearly interpretable dependence on histone modification other than for u_1k, which represents the lack of histone modification, since the values for H3K27ac, H3K4me1, and H3K4me3 were equivalent to the Input value that corresponds to the control condition; thus, u_2k, u_3k, and u_4k were considered to have equal contributions for subsequent analyses. By contrast, since u_1j and u_1s showed no dependence on cell line and replicates, respectively, we selected these vectors for further downstream analyses (Fig 7).

Fig 5. Left: u_1m, middle: u_2m, right: u_3m.

Fig 6. Upper left: u_1k, upper right: u_2k, lower left: u_3k, lower right: u_4k.

Fig 7. Top left: u_1j, top right: u_2j, bottom left: u_1s, bottom right: u_2s.

Finally, we evaluated which vector u_ℓ₅i had a larger $\sum_{ℓ_{2} = 2}^{4} {| G (1, ℓ_{2}, 3, 1, ℓ_{5}) |}^{α}, α = 1, 2, 3$ (Fig 8); here, the sum is taken over 2 ≤ ℓ₂ ≤ 4 so that u_2k, u_3k, and u_4k are considered equally. Although we do not have any definite criterion to decide α uniquely, ℓ₅ = 4 always takes the largest value for α ≥ 1, so ℓ₅ = 4 was employed for further analysis. The P-values attributed to the ith DNA regions were calculated using Eq (4), resulting in the selection of 507 DNA regions associated with adjusted P-values less than 0.01.
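
The choice of ℓ₅ from the core tensor can be sketched as follows. The core used here is synthetic (a dominant entry is planted at ℓ₅ = 4 purely for illustration), the function name `reactivation_score` is hypothetical, and the array is assumed to be ordered (ℓ₁, ℓ₂, ℓ₃, ℓ₄, ℓ₅) as in Eq (1), so the paper's 1-based indices become 0-based slices.

```python
import numpy as np

def reactivation_score(core, alpha):
    """Sum |G(1, l2, 3, 1, l5)|**alpha over l2 = 2..4, for every l5."""
    selected = core[0, 1:4, 2, 0, :]  # l1 = 1, l2 = 2..4, l3 = 3, l4 = 1
    return (np.abs(selected) ** alpha).sum(axis=0)

rng = np.random.default_rng(1)
toy_core = rng.uniform(-0.1, 0.1, size=(2, 4, 3, 2, 10))  # synthetic, not the paper's G
toy_core[0, 1:4, 2, 0, 3] = 5.0                           # plant a dominant l5 = 4

# 1-based l5 with the largest score for each alpha
best = {alpha: int(np.argmax(reactivation_score(toy_core, alpha))) + 1
        for alpha in (1, 2, 3)}
```

Checking several values of α, as in Fig 8, guards against a choice of ℓ₅ that depends on how strongly large core entries are weighted.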

We next checked whether histone modification in the selected DNA regions was associated with the following transcription reactivation properties:

H3K27ac should have larger values in interphase and anaphase/telophase than in prometaphase, by the definition of reactivation.
H3K4me1 and H3K4me3 should have constant values during all phases of the cell cycle, by the definition of a “bookmark” histone modification.
H3K4me1 and H3K4me3 should have larger values than the Input; otherwise, they cannot be regarded as “bookmarks,” since these histones must be significantly modified throughout these phases.

To check whether the above criteria are fulfilled, we applied six t tests to histone modifications in the 507 selected DNA regions (Table 2). The results clearly showed that histone modifications in the 507 selected DNA regions satisfied the requirements for transcription reactivation; thus, our strategy could successfully select DNA regions that demonstrate reactivation/bookmark functions of histone modification.

Table 2. Hypotheses for t tests applied to histone modification in the selected 507 DNA regions.

The null hypothesis was that the inequality relationship of the alternative hypothesis is replaced with an equality relationship. int: interphase, ana: anaphase, tel: telophase, pro: prometaphase.

Test	Alternative hypothesis	P-value	Description of desired relationships
1	{x_ij1ms | m = 1, 3} > {x_ij12s}	3.30 × 10⁻³	H3K27ac reactivation (int & ana/tel > pro)
2	{x_ij2ms | m = 1, 3} ≠ {x_ij22s}	0.60	H3K4me1 bookmark (int & ana/tel = pro)
3	{x_ij3ms | m = 1, 3} ≠ {x_ij32s}	0.72	H3K4me3 bookmark (int & ana/tel = pro)
4	{x_ij4ms | m = 1, 3} ≠ {x_ij42s}	0.86	Input as control (int & ana/tel = pro)
5	{x_ij2ms} > {x_ij4ms}	8.98 × 10⁻⁶	H3K4me1 > Input
6	{x_ij3ms} > {x_ij4ms}	3.79 × 10⁻³	H3K4me3 > Input

After confirming that selected DNA regions are associated with targeted reactivation/bookmark features, we queried all gene symbols contained within these 507 regions to the Enrichr server to identify TFs that significantly target these genes. These TFs were considered candidate bookmarks that remain bound to these DNA regions throughout the cell cycle and trigger reactivation in anaphase/telophase (i.e., after cell division is complete). Table 3 lists the TFs associated with the selected regions at adjusted P-values less than 0.05 in each of the seven categories of Enrichr.

Table 3. Number of transcription factors (TFs) associated with adjusted P-values less than 0.05 in various TF-related Enrichr categories.

See S2 Table for the full list.

	Category	Adjusted P > 0.05	Adjusted P < 0.05
(I)	ChEA 2016	537	97
(II)	ENCODE and ChEA Consensus TFs from ChIP-X	91	12
(III)	ARCHS4 TFs Coexp	1533	54
(IV)	TF Perturbations Followed by Expression	1577	346
(V)	Enrichr Submissions TF-Gene Coocurrence	587	1135
(VI)	ENCODE TF ChIP-seq 2015	788	28
(VII)	TF-LOF Expression from GEO	239	11

Among the many TFs found to significantly target genes included in the 507 DNA regions selected by TD-based unsupervised FE, we here focus on the biological functions of TFs that were also detected in the original study suggesting that TFs might function as histone modification bookmarks for transcription reactivation [14]. RUNX was identified as an essential TF for osteogenic cell fate, and has been associated with mitotic chromosomes in multiple cell lines, including Saos-2 osteosarcoma cells and HeLa cells (Young et al. 2007). Table 4 shows the detection of RUNX family TFs in the seven TF-related categories of Enrichr; three RUNX TFs were detected in at least one of the seven TF-related categories. In addition, TEADs (Kegelman et al. 2018), JUNs [23], FOXOs [24], and FOSLs (Kang et al. 2020) were reported to regulate osteoblast differentiation. Tables 5–8 show that two TEAD TFs, three JUN TFs, four FOXO TFs, and two FOSL TFs, respectively, were detected in at least one of the seven TF-related categories in Enrichr.