Summary
Here, we present a protocol for calculating the spatial density of regulatory chromatin interactions (SD-RCI) using Hi-C, ATAC-seq, and ChIP-seq datasets from the same cell line. We describe steps for selecting and preprocessing datasets, training a model and using it to predict regulatory chromatin interactions, and evaluating model performance. We then detail the calculation of SD-RCI and the visualization of the correlation between SD-RCI and gene expression. This protocol is applicable to Hi-C, ATAC-seq, and ChIP-seq data from human cell lines.
For complete details on the use and execution of this protocol, please refer to Gong et al. (2023).¹
Subject areas: Bioinformatics, Sequence Analysis, Sequencing, ChIPseq, Gene Expression, Biotechnology and Bioengineering
Graphical abstract
Highlights
• A pipeline for analyzing the spatial density of regulatory chromatin interactions
• Steps for using the MINE tool, including MINE-Loop, MINE-Density, and MINE-Viewer
• Steps for training a model and predicting regulatory chromatin interactions
• Quantitative calculation steps for analyzing regulatory chromatin interactions
Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
Before you begin
As Figure 1 shows, this protocol details the use of MINE, a toolkit comprising MINE-Loop, MINE-Density, and MINE-Viewer, to calculate and visualize the spatial density of regulatory chromatin interactions (RCI, i.e., chromatin interactions that anchor regulatory elements to chromatin). A key advantage of the protocol is its multimodal neural network for identifying RCI. By changing the type of target histone modification ChIP-seq data (e.g., H3K4me3 and H3K27ac are active marks, whereas H3K27me3 and H3K9me3 are repressive marks), RCI can be divided into active and repressive RCI. The protocol also provides a method for calculating and visualizing the spatial density of RCI (SD-RCI), which may facilitate quantitative studies of different aspects of chromatin conformation and regulatory activity. The software was developed in a Linux environment but is also usable under Windows and macOS. The method is implemented in Python with the deep learning library PyTorch. The protocol below describes the specific steps using GM12878 and HepG2 cells; we have also used it in IMR90, K562, H1-hESC, and HeLa cells.
Figure 1.
An overview of the protocol
Install required software and libraries
Timing: ∼3 h
1. Install the required software and libraries, including Python 3 (version 3.6 tested and recommended) and the required packages. We suggest creating a conda virtual environment:
a. Install Anaconda from https://www.anaconda.com/products/distribution.
b. Install the CUDA 10.1 environment from https://docs.nvidia.com/cuda/.
c. Download or clone the original GitHub repository from https://github.com/MICL-biolab/MINE:
>git clone https://github.com/MICL-biolab/MINE.git
d. Create the virtual environment using Anaconda:
>conda create -n MINE python=3.6
e. Install PyTorch 1.7.1:
>pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
f. Install the required packages using pip:
>pip install -r requirements.txt
g. Install Jupyter Notebook using conda:
>conda install jupyter notebook
2. Install juicer_tools, FitHiC2, MUSTACHE, HiCDB, and SDOC:
a. Install juicer_tools² for dumping the Hi-C contact matrix from a .hic file:
i. Download and install Java 1.8.0_291, which is required to run juicer_tools, from https://www.oracle.com/java/technologies/downloads/ (installation guide: https://docs.oracle.com/en/java/javase/19/install/overview-jdk-installation.html).
ii. Download juicer_tools.jar from https://github.com/aidenlab/juicer.
b. Create a new conda virtual environment and install FitHiC2³ from https://github.com/ay-lab/fithic for calling loops from the Hi-C contact matrix.
c. Create a new conda virtual environment and install MUSTACHE⁴ from https://github.com/ay-lab/MUSTACHE for calling loops from the Hi-C contact matrix or a .hic file.
d. Create a new conda virtual environment and install HiCDB⁵ from https://github.com/ChenFengling/HiCDB for identifying the boundaries of topologically associated domains (TADs) from the Hi-C contact matrix.
>git clone https://github.com/ChenFengling/HiCDB.git
e. To run HiCDB, install MATLAB R2020a by following https://www.mathworks.com/help/install/ug/install-using-a-file-installation-key.html.
f. Create a new conda virtual environment and install SDOC⁶ from https://github.com/birmjiangs/SDOC for calculating the volume of TADs and SD-RCI.
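Before moving on, it can help to confirm that the tools installed above are visible from the active environment. The following is a minimal sketch using only the Python standard library; the function name `check_environment` and the default package and executable lists are illustrative assumptions, not part of MINE.

```python
import importlib.util
import shutil
import sys


def check_environment(packages=("torch", "torchvision", "torchaudio", "notebook"),
                      executables=("java", "git", "conda")):
    """Report which Python packages and command-line tools are visible.

    The default lists are illustrative; adjust them to your own setup.
    """
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found
        report[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        report[exe] = shutil.which(exe) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it once inside each conda environment (MINE, FitHiC2, MUSTACHE, HiCDB, SDOC); a `False` entry only means that the item was not found in that particular environment.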
Select and preprocess dataset
Timing: ∼5 h (for step 3)
Timing: ∼6 h (for step 4)
Timing: ∼7 h (for step 5)
3. Select the Hi-C, ATAC-seq, ChIP-seq, and annotation file data used to train an active or repressive model, as shown in Figure 3.
a. Use the Hi-C, ATAC-seq, H3K27ac and H3K4me3 ChIP-seq, and cis-regulatory element (annotation) file datasets to train a transcriptionally active model and obtain RCI related to gene transcription.
b. Use the Hi-C, H3K27me3 ChIP-seq, and H3K9me3 ChIP-seq datasets to train a transcriptionally repressive model and obtain interactions related to transcriptional inhibition.
Note: In this protocol, we use the GM12878 datasets to train the network and the HepG2 datasets for prediction. The datasets for both cell lines are listed in the key resources table.
4. Preprocess the Hi-C data, which must be in .hic format.
a. Convert the .hic data into .txt files, in which each line gives the start chromatin location, the end chromatin location, and the interaction frequency. The inputs are the path of juicer_tools.jar and the path of the Hi-C data. The output is the folder for storing the .txt files.
>bash hic2txt.sh <juicer_tools_path> <hic_path> /folder/to/txt
b. Convert the .txt files into .npz files. The input is the folder of .txt files generated in step 4a. The output is the folder of .npz files.
>python txt2npy.py -i /folder/to/txt -o /folder/to/npz -r 1000
c. Generate the Hi-C training data. The input is the folder of .npz files generated in step 4b. The output is the folder of training data. -s is the size of the submatrix (default: 400). -f is the size of the matrix to follow (default: 2000).
>python generate_train_data.py -i /folder/to/npz -o /folder/to/train -s 400 -f 2000
5. Preprocess the ATAC-seq, ChIP-seq, and cis-regulatory element file data.
a. Convert the ATAC-seq and ChIP-seq data into .npz files. The input is the path of the ATAC-seq or ChIP-seq data in .bigWig format. The output is the folder of .npz files. -r is the number of bases used to generate the ATAC-seq and ChIP-seq signal data.
>python analysis_epi.py -i /path/to/bigWig -o /folder/to/epi -r 1000
b. Convert the .npz files generated in step 5a into correlation matrices. The output is the folder of training data for the epigenomic data (i.e., ATAC-seq and ChIP-seq data). -s is the size of the submatrix (default: 400). -f is the size of the matrix to follow (default: 2000).
>python epi_concat.py -i /folder/to/epis -o /folder/to/train/epi -r 1000 -s 400 -f 2000
c. Generate Masked-hic.
i. For the active model, convert the cis-regulatory element file (.bigBed format) into Masked-hic. The input is the path of the cis-regulatory element file. The output is the folder of Masked-hic, which is used for training the network.
>python generate_train_annotation_data.py -i /path/to/bigBed -o /folder/to/train/annotation
ii. For the repressive model, convert the ChIP-seq peak data (.bed format) into Masked-hic. The input is the path of the ChIP-seq peak data. The output is the folder of Masked-hic, which is used for training the network.
>python generate_train_annotation_data_by_peaks.py -i /path/to/bigBed -o /folder/to/train/annotation
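To make the data layout in step 4 concrete, the sketch below bins "start, end, interaction frequency" lines into matrix cells at a given resolution and lists submatrix windows of a given size along the diagonal. It is an illustration only and does not reproduce txt2npy.py or generate_train_data.py: the function names, the upper-triangle storage, and the non-overlapping tiling are assumptions made for this example.

```python
from collections import defaultdict


def bin_contacts(lines, resolution=1000):
    """Sum interaction frequencies into (row_bin, col_bin) cells.

    Each input line is expected to hold: start, end, interaction frequency.
    """
    counts = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        start, end, freq = int(fields[0]), int(fields[1]), float(fields[2])
        i, j = start // resolution, end // resolution
        # store each pair once, in the upper triangle (row <= column)
        counts[(min(i, j), max(i, j))] += freq
    return dict(counts)


def diagonal_windows(n_bins, size=400):
    """Yield (first_bin, last_bin_exclusive) windows tiling the diagonal.

    Non-overlapping tiling is an assumption of this sketch.
    """
    for first in range(0, n_bins, size):
        yield first, min(first + size, n_bins)


# Toy input, not real Hi-C data
example = ["0\t1000\t5.0", "1000\t3000\t2.0", "3000\t1000\t1.0"]
contacts = bin_contacts(example, resolution=1000)
print(contacts)                                # {(0, 1): 5.0, (1, 3): 3.0}
print(list(diagonal_windows(1000, size=400)))  # [(0, 400), (400, 800), (800, 1000)]
```

At 1 kb resolution a human chromosome spans hundreds of thousands of bins, which is one reason the preprocessing scripts need a large amount of memory.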
Figure 3.
The pipeline of training active model using data in GM12878 cell line and doing prediction in HepG2 cell line
(A) The pipeline of training active model.
(B) The pipeline of doing prediction.
(C) The pipeline of calculating SD-RCI.
(D) The pipeline of calculating the 3D structures.
Key resources table
Materials and equipment
All models and analyses were written in Python or shell. To confirm the reproducibility of our published work, we used a machine running Ubuntu 18.04 LTS with 48 cores, 503 GB of RAM, and 1.8 TB of disk storage. The GPU is an NVIDIA Tesla P100 SXM2 and the CPU is an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz. Although this work used a high-specification system for neural network training, users can install the software, such as CUDA and PyTorch, on Linux, Windows, or macOS. At least 250 GB of RAM is required for the calculation steps. Running all of the datasets, including the raw datasets, requires about 2 TB of disk storage.
The datasets used in the pipeline include Hi-C, ATAC-seq, ChIP-seq, and the cis-regulatory element annotation file. These datasets need to meet the following requirements: (1) all datasets come from the same cell line; (2) histone modification ChIP-seq data are used for training the network or for prediction, and the type of histone modification depends on whether the active or the repressive model is used. TF (e.g., CTCF, POLR2A, SMC3, EZH2) ChIP-seq data are used to verify the performance of the trained model.
Step-by-step method details
Here, we describe how to select and preprocess Hi-C, ATAC-seq, and ChIP-seq data, how to predict a Hi-C contact matrix and regulatory chromatin interactions (RCI), and how to perform analyses using RCI. To illustrate the data processing steps, we show the analysis process and results for two datasets used by Gong et al.¹ as an example. Figure 1 shows the pipeline of the protocol. Figure 2 shows the structure of the scripts used in this protocol.
Figure 2.
The structure of scripts in the protocol
Train and predict to get regulatory chromatin interactions
Training model
Timing: ∼3 h (for step 1)
This section mainly describes the details about training a model.
1. Train the multimodal neural network. This step can be skipped if you use the trained models provided in MINE/data/. After validation, the provided models can be adapted to multiple human cell lines, so users can choose to use them instead of training a new model.
a. Open a terminal in the repository root folder.
b. Open the train folder and run train_model.py. -i: the folder of training data. -o: the folder for storing the model files generated at the different checkpoints.
>CUDA_VISIBLE_DEVICES=1,2,3,4,5,6 python -m torch.distributed.launch --nproc_per_node=6 train_model.py -i /folder/to/train -o /folder/to/checkpoint
Note: We have stored four trained models in the MINE/data folder, which users can use directly. GM12878_ATAC_H3K27ac_H3K4me3_epoch27.pth was trained using ATAC-seq, H3K27ac ChIP-seq, H3K4me3 ChIP-seq, and cis-regulatory element data in the GM12878 cell line. GM12878_ATAC_H3K27ac_epoch34.pth was trained using ATAC-seq, H3K27ac ChIP-seq, and cis-regulatory element data in the GM12878 cell line. GM12878_ATAC_H3K4me3_epoch27.pth was trained using ATAC-seq, H3K4me3 ChIP-seq, and cis-regulatory element data in the GM12878 cell line. GM12878_H3K9me3_H3K27me3_epoch10.pth was trained using H3K9me3 ChIP-seq and H3K27me3 ChIP-seq data in the GM12878 cell line. In this protocol, GM12878_ATAC_H3K27ac_epoch34.pth refers to the active model we want to train, and GM12878_H3K9me3_H3K27me3_epoch10.pth refers to the repressive model we want to train.
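The file names of the provided models encode the cell line, the input data types, and the training epoch. As a small illustration, the sketch below splits such a name into its parts and labels the model as active or repressive from its histone marks. The helper `describe_model` is hypothetical (it is not part of MINE), and the assumed naming pattern `<cell>_<inputs>_epoch<N>.pth` is inferred only from the four file names listed in the note.

```python
import re

# Histone marks named in this protocol for each model type
ACTIVE_MARKS = {"H3K27ac", "H3K4me3"}
REPRESSIVE_MARKS = {"H3K27me3", "H3K9me3"}


def describe_model(filename):
    """Split '<cell>_<input>_..._epoch<N>.pth' into its parts."""
    match = re.fullmatch(
        r"(?P<cell>[^_]+)_(?P<inputs>.+)_epoch(?P<epoch>\d+)\.pth", filename
    )
    if match is None:
        raise ValueError(f"unexpected model file name: {filename}")
    inputs = match.group("inputs").split("_")
    if REPRESSIVE_MARKS & set(inputs):
        kind = "repressive"
    elif ACTIVE_MARKS & set(inputs):
        kind = "active"
    else:
        kind = "unknown"
    return {
        "cell_line": match.group("cell"),
        "inputs": inputs,
        "epoch": int(match.group("epoch")),
        "kind": kind,
    }


print(describe_model("GM12878_ATAC_H3K27ac_epoch34.pth"))
print(describe_model("GM12878_H3K9me3_H3K27me3_epoch10.pth"))
```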
Predicting model
Timing: ∼1 h (for step 2)
This section mainly describes the details about predicting chromatin interactions using a trained model.
2. Perform prediction using the trained model from step 1.
a. --train_folder: the folder of the training (or prediction) dataset.
b. --model: the folder of the trained model.
c. --results: the folder of prediction results, which are 22 n × n Hi-C contact matrices (chrX_1000b.npz, X = 1, 2, …, 22).
d. In this protocol, we perform prediction in the HepG2 cell line as an example.
>python validate.py --train_folder /folder/to/train --model /path/to/model --results /folder/to/result
Calling loops
Timing: ∼6 h (for step 3)
This section describes how to call loops using FitHiC2³ and MUSTACHE.⁴
3. Call loops from the raw Hi-C contact matrices and from the predicted Hi-C contact matrices generated in step 2, using FitHiC2³ and MUSTACHE.⁴ The loops called from the different experiments are defined in Table 1.
a. Open a terminal in the repository folder MINE/loop_caller_script/fithic&MUSTACHE.
b. Generate the FitHiC2 (or MUSTACHE) input files by executing the script below. -r: the resolution of the Hi-C contact matrix, default = 1000. -p: whether the matrix is a predicted matrix, default = False.
>python generate_input_file.py -i input_folder -o output_folder -r 1000 -p False
Note: Executing generate_input_file.py produces the files “interactions.gz” and “fragments.gz”. “interactions.gz” contains the interaction pairs, and “fragments.gz” is the fragments file.
c. Call loops from the raw (or predicted) Hi-C contact matrices using FitHiC2, with the input files generated in step 3 (b).
>bash run_fithic_example.sh
In the run_fithic_example.sh file, the parameters below are editable:
i. folder_path: the output folder.
ii. -U: --upperbound, default is 100000.
iii. -L: --lowerbound, default is 2000.
iv. -o: the folder of output files.
v. -d: the compressed output file, default is outputs_2_100/FitHiC.spline_pass1.res1000.significances.txt.gz.
Note: Executing run_fithic_example.sh produces the files “FitHiC.fithic.log”, “FitHiC.fithic_pass1.res1000.txt”, and “FitHiC.spline_pass1.res1000.significances.txt.gz”. “FitHiC.spline_pass1.res1000.significances.txt.gz” contains the significant interaction pairs, i.e., raw-Loop-fithic, active-Loop-fithic, or repressive-Loop-fithic. Setting the -U parameter to 100000, 300000, or 500000 yields loops within a genomic distance of 2–100, 2–300, or 2–500 kb, respectively. The following analysis uses the raw-Loop-fithic, active-Loop-fithic, or repressive-Loop-fithic within a genomic distance of 2–100 kb.
d. Call loops from the raw Hi-C contact matrices using MUSTACHE, where the input is a .hic file.
>MUSTACHE -f folder-of-hic-file -ch 1 2 X -r 1kb -pt 0.01 -o hic_out.tsv
e. Call loops from the predicted Hi-C contact matrices using MUSTACHE, where the input files are those generated in step 3 (b).
>bash run_MUSTACHE_example.sh
In the run_MUSTACHE_example.sh file, the parameters below are editable:
i. folder_path: the output folder.
ii. -f: the location of the contact map.
iii. -r: the resolution of the provided contact map.
iv. -o: the output file, default is ./MUSTACHE_output.tsv.
v. -pt: the P value threshold for an interaction to be reported in the final output file. Default is 0.1.
vi. -p: the number of parallel processes to run. Default is 4. Increasing this value also increases memory usage.
Note: Executing run_MUSTACHE_example.sh produces the files “hic_out.tsv” and “enhanced_out_pt05.tsv”, which contain the significant interaction pairs, i.e., raw-Loop-MUSTACHE, active-Loop-MUSTACHE, or repressive-Loop-MUSTACHE. Step 4 describes how to assess whether the model performs well.
Table 1.
The definition of loops called from different experiments
Loop name | Definition |
---|---|
raw-Loop-fithic | loops called from raw Hi-C contact matrices using FitHiC2 |
raw-Loop-MUSTACHE | loops called from raw Hi-C contact matrices using MUSTACHE |
active-Loop-fithic | loops called from predicted Hi-C contact matrices of active model using FitHiC2 |
active-Loop-MUSTACHE | loops called from predicted Hi-C contact matrices of active model using MUSTACHE |
repressive-Loop-fithic | loops called from predicted Hi-C contact matrices of repressive model using FitHiC2 |
repressive-Loop-MUSTACHE | loops called from predicted Hi-C contact matrices of repressive model using MUSTACHE |
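The genomic-distance window used for the loop sets above (for example 2–100 kb) can also be applied after loop calling. The sketch below filters a tab-separated significance table by distance and q value. The column names `fragmentMid1`, `fragmentMid2`, and `q-value` are assumptions about the FitHiC2 output header and should be checked against your own file; the 0.05 q-value cutoff is a placeholder, not a value taken from this protocol; the rows are assumed to be intra-chromosomal pairs.

```python
import csv
import io


def filter_loops(handle, min_dist=2000, max_dist=100000, max_q=0.05,
                 mid1="fragmentMid1", mid2="fragmentMid2", q_col="q-value"):
    """Keep rows whose anchor distance and q value fall inside the limits.

    `handle` is any text file object with a tab-separated header line.
    """
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        distance = abs(int(row[mid2]) - int(row[mid1]))
        if min_dist <= distance <= max_dist and float(row[q_col]) <= max_q:
            kept.append(row)
    return kept


# Toy table, not real FitHiC2 output
toy = io.StringIO(
    "chr1\tfragmentMid1\tchr2\tfragmentMid2\tcontactCount\tp-value\tq-value\n"
    "chr1\t500\tchr1\t5500\t10\t1e-5\t0.01\n"     # 5 kb apart, significant
    "chr1\t500\tchr1\t200500\t3\t0.2\t0.9\n"      # 200 kb apart, too far
    "chr1\t500\tchr1\t1500\t50\t1e-9\t1e-6\n"     # 1 kb apart, too close
)
loops = filter_loops(toy)
print(len(loops))  # 1
```

For a compressed file, open it in text mode first, for example `gzip.open(path, "rt")`, and pass the resulting handle to `filter_loops`.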
Do analysis using RCI
Timing: ∼3 h (for steps 4 to 8)
In this section, we describe analyses using RCI, including a comparison of the number of anchored factors (e.g., CTCF, POLR2A, EZH2, and other transcription factors) between raw-Loop-fithic and active-Loop-fithic (or repressive-Loop-fithic), and the calculation and visualization of SD-RCI.
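The idea behind the anchored-factor comparison can be sketched in a few lines: for each loop set, count the loops that have at least one anchor overlapping a ChIP-seq peak. The code below is a simplified stand-in for the notebook analysis, not the notebook code itself; the function names, the half-open interval convention, and the toy coordinates are all assumptions made for this example.

```python
import bisect
from collections import defaultdict


def index_peaks(peaks):
    """Group (chrom, start, end) peaks by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    return by_chrom


def overlaps_peak(index, chrom, start, end):
    """True if the half-open anchor [start, end) overlaps any peak."""
    intervals = index.get(chrom, [])
    # peaks that start at or after `end` cannot overlap the anchor
    hi = bisect.bisect_left(intervals, (end,))
    return any(peak_end > start for _, peak_end in intervals[:hi])


def anchored_fraction(loops, peaks):
    """Fraction of loops with at least one anchor overlapping a peak."""
    index = index_peaks(peaks)
    hits = sum(
        1 for (c1, s1, e1), (c2, s2, e2) in loops
        if overlaps_peak(index, c1, s1, e1) or overlaps_peak(index, c2, s2, e2)
    )
    return hits / len(loops) if loops else 0.0


# Toy coordinates, not real data
peaks = [("chr1", 1500, 1800), ("chr1", 90000, 90500)]
raw_loops = [
    (("chr1", 1000, 2000), ("chr1", 50000, 51000)),
    (("chr1", 10000, 11000), ("chr1", 60000, 61000)),
]
predicted_loops = [
    (("chr1", 1000, 2000), ("chr1", 90000, 91000)),
    (("chr1", 30000, 31000), ("chr1", 90000, 91000)),
]
print(anchored_fraction(raw_loops, peaks))        # 0.5
print(anchored_fraction(predicted_loops, peaks))  # 1.0
```

A higher fraction for the predicted loops than for the raw loops corresponds to the pattern the protocol aims to show for active-Loop-fithic.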
4. To evaluate whether active-Loop-fithic anchors more TF signal, visualize the comparison of the number of anchored factors between raw-Loop-fithic and active-Loop-fithic (or repressive-Loop-fithic).
a. Open the folder “MINE/analyse/”.
i. For analysis in the GM12878 cell line, enter the folder “MINE/analyse/fig2/”.
ii. For analysis in other cell lines, enter the folder “MINE/analyse/fig3/”.
Note: In this protocol, we perform the analysis in the HepG2 cell line. For active-Loop-fithic, we use the notebook “HepG2 ATAC_H3K27ac_H3K4me3 2_100.ipynb”. For repressive-Loop-fithic, we use the notebook “../represses/HepG2 H3K9me3_H3K27me3 2_100.ipynb”. Because the analysis process is the same, we describe only the protocol for “HepG2 ATAC_H3K27ac_H3K4me3 2_100.ipynb”.
b. Open the file “HepG2 ATAC_H3K27ac_H3K4me3 2_100.ipynb” using Jupyter Notebook.
c. Run cells In [1] and In [2] in the file “HepG2 ATAC_H3K27ac_H3K4me3 2_100.ipynb”.
d. Edit the parameters below to set the paths:
i. analyse_path: the path for storing the analyzed data.
ii. auxiliary_files_path = '/data1/lmh_data/MINE/source/HepG2': the path of the ChIP-seq data, including .bigWig and .bed files.
e. Run cells In [3] and In [4] to obtain the Venn diagram shown in Figure 4A.
f. Run cells In [5] to In [8] to plot the comparison of the number of anchored factors between raw-Loop-fithic and active-Loop-fithic. Run the script below to visualize the POLR2A transcription factor (Figure 4B), where ‘POLR2A_peaks.bed’ is the name of the POLR2A ChIP-seq peak file.
>chip_seq_path = os.path.join(auxiliary_files_path, 'POLR2A_peaks.bed')
>figure(chip_seq_path)
g. Run cell In [9] to obtain the ‘*_all_enhanced_sig.npy’ file, which stores the significant chromatin interaction data.
5. Calculate the spatial density of RCI (SD-RCI) by running the script below.
>bash main.sh
Note: As described by Gong et al.,¹ the calculation of SD-RCI is divided into four steps: (1) identify the boundaries of TADs; (2) reconstruct the 3D structures using the Pastis-PM2⁷ algorithm to obtain the 3D coordinates of all loci; (3) calculate the volume of a given TAD, where the 3D coordinates of all loci in the TAD are used to calculate a convex hull and the volume of the convex hull is defined as the TAD volume; (4) calculate the SD-RCI as described by Gong et al.¹ in the section “Spatial density of regulatory chromatin interactions”. We organized these steps into the shell script “MINE/MINE_Density/TAD/main.sh”.
The parameters below are editable:
a. hic_path: the path of the Hi-C data.
b. analyse_anchor_path: the path of the active-Loop-fithic file (*_enhanced_anchor.bed).
i. Execute the script below to obtain the ‘*_enhanced_anchor.bed’ file.
>python MINE/MINE_Density/loop2anchor.py
Note: The input file is the ‘*_all_enhanced_sig.npy’ file generated in step 4 (g); the output file is the ‘*_enhanced_anchor.bed’ file.
c. out_path: the SD-RCI analysis data folder.
d. resolution: the resolution used to identify TADs with HiCDB.
e. cell_line: the cell line to analyze, for example HepG2 or HeLa.
f. juicer_tools_path: the path of juicer_tools.jar.
g. HiCDB_path: the path of the HiCDB tool.
h. SDOC_path: the path of the SDOC tool.
Note: This step produces the SD-RCI file “out_path/SDOC/out_with_active_loop_anchor/HepG2_SDOC.tsv” and the 3D structure files “out_path/SDOC/out_with_active_loop_anchor/HepG2/PASTIS_out/*.pdb”.
6. Define the SD-RCI level (low, middle, high, or ultra_high) using the SD-RCI file generated in step 5.
a. Open the file “MINE/analyse/fig4/SDOC-active.ipynb” using Jupyter Notebook.
b. Run cell In [1].
c. Copy “out_path/SDOC/out_with_active_loop_anchor/HepG2_SDOC.tsv” into the folder “MINE/analyse/fig4/” and rename it “HepG2_SDOC_active_result.tsv”.
d. Set the parameter “input_file” in In [2] to “HepG2_SDOC_active_result.tsv”.
e. Run cells In [2] to In [4] to obtain the distribution graph showing how the number of active-Loop-fithic changes with the SD-RCI value (Figure 4C).
f. Edit the parameters below and run cells In [5] to In [8].
i. Homo_sapiens_GRCh38_file: your_path/Homo_sapiens.GRCh38.84.gtf.
ii. RNA_seq_file: your_path/RNA_seq_rpkms.xls.
Note: This produces the box plot of RPKM versus SD-RCI level (Figure 4D), where P values were calculated by a two-sided Mann-Whitney-Wilcoxon test with Bonferroni correction. The SD-RCI is divided into four levels according to the value of δ in the Gaussian distribution of the graph generated in step 6 (e).
g. Run cell In [9] to obtain the dot plot of the number of RCIs versus the volume of TADs.
h. Edit the parameter below and run cells In [10] to In [17] to obtain the list of TAD genomic locations at the different SD-RCI levels.
i. Gene_table: the path of the gene file, which can be downloaded from https://uswest.ensembl.org/info/data/biomart/index.html.
>Gene_table = pd.read_csv("/data1/lmh_data/MINE/source/Gene_table_20211231.txt", sep="\t")
Note: Run ‘MINE/analyse/fig4/SDOC-represses.ipynb’ to perform the same analysis using repressive-Loop-fithic.
7. Visualize the 3D genome TAD structure together with gene expression strength at the four SD-RCI levels.
a. Open the file “MINE/analyse/fig4/3D structure analysis of HEPG2(active).ipynb” using Jupyter Notebook.
b. Run cell In [1] in the file “MINE/analyse/fig4/3D structure analysis of HEPG2(active).ipynb”.
c. Edit the parameters below and run cells In [2] to In [3].
i. HepG2_ATAC_H3K27ac_H3K4me3_2_100_all_enhanced_sig: load the predicted loop file ‘*_all_enhanced_sig.npy’ generated in step 4 (g).
ii. Homo_sapiens_GRCh38_file: your_path/Homo_sapiens.GRCh38.84.gtf.
iii. RNA_seq_file: your_path/RNA_seq_rpkms.xls.
iv. auxiliary_files_path: the path of the ChIP-seq data, including .bigWig and .bed files.
v. CTCF_path: set to the name of the CTCF peaks file.
vi. POLR2A_path: set to the name of the POLR2A peaks file.
vii. EP300_path: set to the name of the EP300 peaks file.
viii. H3K27ac_path: set to the name of the H3K27ac peaks file.
ix. H3K4me3_path: set to the name of the H3K4me3 peaks file.
d. Edit the parameters below and run cell In [4]:
i. PASTIS_out_path: the path of the 3D structures generated in step 5, i.e., ‘out_path/SDOC/out_with_active_loop_anchor/HepG2/PASTIS_out’.
ii. _ranges: the genomic locations of the TADs to visualize in 3D. TADs at different SD-RCI levels can be chosen according to the result of step 6 (h). Each entry is [chrom_num, start_location, end_location] and corresponds to a file name such as PM2.chr1_143690_147250.pdb. For example:
>_ranges = {
    'active_147': [1, 143690, 147250],   # TAD at the ultra_high SD-RCI level
    'active_1379': [2, 88740, 98390],    # TAD at the high SD-RCI level
    'active_1727': [3, 43530, 50620],    # TAD at the middle SD-RCI level
    'active_1372': [2, 74890, 85060]     # TAD at the low SD-RCI level
}
e. Run cells In [5] to In [9] to obtain the 3D genome TAD structure visualization for the target TADs at the different SD-RCI levels (Figure 4E).
Note: Run ‘3D structure analysis of HEPG2(represses).ipynb’ to perform the same analysis using repressive-Loop-fithic.
8. Visualize the loops, CTCF, and histone mark ChIP-seq tracks at different SD-RCI levels.
a. Open the file “MINE/analyse/fig4/Landmark position.ipynb” using Jupyter Notebook.
b. Run cell In [1] in the file “MINE/analyse/fig4/Landmark position.ipynb”.
c. Edit the parameters below in In [2]:
>_CTCF=pyBigWig.open('your_path/HepG2/CTCF.bigWig')
>CEBPB=pyBigWig.open('your_path/HepG2/CEBPB.bigWig')
>POLR2A=pyBigWig.open('your_path/HepG2/POLR2A.bigWig')
>EZH2=pyBigWig.open('your_path/HepG2/EZH2.bigWig')
>KDM4B=pyBigWig.open('your_path/HepG2/KDM4B.bigWig')
>H3K27ac=pyBigWig.open('your_path/HepG2/H3K27ac.bigWig')
>H3K27me3=pyBigWig.open('your_path/HepG2/H3K27me3.bigWig')
>H3K9me3=pyBigWig.open('your_path/HepG2/H3K9me3.bigWig')
>GENES=pd.read_table("your_path/hg38_gc_cov_100kb.tsv")
d. Run cells In [2] to In [6].
e. Edit and run cell In [8] to visualize the genomic location chr2:88740000-98390000 (Figure 4F).
>_path='your_path/MINE/GM12878_ATAC_H3K27ac_H3K4me3/analyse/HepG2_ATAC_H3K27ac_H3K4me3/experiment/loop'
>show("chr1", 143690000, 147250000, 1000, _path, FDR=0.3)
Figure 4.
An example of using MINE toolkit
The MINE-Loop model is trained using Hi-C, ATAC-seq, H3K27ac ChIP-seq, and H3K4me3 ChIP-seq data in the GM12878 cell line and used for prediction in the HepG2 cell line. Figures are reused from Gong et al.¹
(A) The Venn diagram of raw-Loop-fithic and active-Loop-fithic.
(B) Comparison of the density distribution of POLR2A transcription factor anchors in 2–100 kb loops called from MINE-enhanced Hi-C and raw Hi-C in the HepG2 cell line.
(C) The distribution of the number of loops versus SD-RCI value in the HepG2 cell line.
(D) The box plot of RPKM versus SD-RCI level in the HepG2 cell line.
(E) The 3D genome TAD structure visualization with gene expression strength for a TAD at the high SD-RCI level.
(F) The visualization of loops, CTCF, and histone mark ChIP-seq tracks. Loops are identified from MINE-enhanced Hi-C using MUSTACHE.⁴
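The TAD ranges chosen for the 3D visualization in step 7 and the track view in step 8 refer to the same regions in different units. The sketch below maps each `_ranges` entry to its PASTIS .pdb file name and to base-pair coordinates. The helper name is hypothetical, and the assumption that the range values are in kb is inferred only from the example values in this protocol (143690 and 147250 in `_ranges` versus 143690000 and 147250000 in the `show` call).

```python
import os


def tad_files_and_regions(ranges, pastis_out_path):
    """Map each TAD entry to its .pdb path and a (chrom, start_bp, end_bp) region.

    Assumes range values are in kb, as the protocol's examples suggest.
    """
    result = {}
    for name, (chrom, start_kb, end_kb) in ranges.items():
        pdb = os.path.join(pastis_out_path,
                           f"PM2.chr{chrom}_{start_kb}_{end_kb}.pdb")
        region = (f"chr{chrom}", start_kb * 1000, end_kb * 1000)
        result[name] = {"pdb": pdb, "region": region}
    return result


ranges = {
    "active_147": [1, 143690, 147250],
    "active_1379": [2, 88740, 98390],
}
info = tad_files_and_regions(ranges, "out_path/SDOC/out_with_active_loop_anchor/HepG2/PASTIS_out")
print(info["active_147"]["region"])   # ('chr1', 143690000, 147250000)
print(info["active_1379"]["region"])  # ('chr2', 88740000, 98390000)
```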
Expected outcomes
This protocol yields the trained model, the enhanced Hi-C contact matrix, and the loops called from the enhanced Hi-C contact matrix, which anchor a greater number of transcription factors than loops called from the raw Hi-C contact matrix. We refer to the loops that anchor transcription factors as regulatory chromatin interactions (RCI). Figure 3A covers the data required for steps 3 to 5 of “before you begin” and for step 1 of the step-by-step method details; the outcome of step 1 is the trained model used for prediction. Figure 3B covers the data required for steps 2 to 4. Figure 3C covers the data required for steps 5 and 6 and the resulting spatial density of RCI (SD-RCI). Figure 3D covers the data required for step 7. Based on the loops called from the enhanced Hi-C contact matrix and the SD-RCI, the protocol produces the panels of Figure 4. Figure 4C shows the relationship between the number of chromatin loops and SD-RCI. Based on the value of δ in the Gaussian distribution of SD-RCI, the genome structure is divided into four levels (i.e., ultra_high, high, middle, and low). Figure 4D shows the box plot of RPKM versus SD-RCI level in the HepG2 cell line, indicating that a higher SD-RCI level in the active model corresponds to higher gene expression. Figure 4E shows the 3D visualization of TAD structure with the gene expression signal. Figure 4F visualizes the loops together with the CTCF and histone mark ChIP-seq tracks.
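The four-level split of SD-RCI can be illustrated with a short sketch. This protocol does not state the exact cut points, so the code below makes two loudly labeled assumptions: δ is interpreted as the standard deviation of the SD-RCI values, and the cut points (-1, 0, and +1 standard deviations from the mean) are placeholders chosen only for illustration. Consult Gong et al.¹ and the SDOC-active.ipynb notebook for the thresholds actually used.

```python
import statistics

LEVELS = ("low", "middle", "high", "ultra_high")


def sd_rci_levels(values, cuts=(-1.0, 0.0, 1.0)):
    """Assign each value to one of four levels by its z-score.

    `cuts` are placeholder z-score thresholds, not the protocol's values.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    levels = []
    for value in values:
        z = (value - mean) / std if std > 0 else 0.0
        # the number of thresholds exceeded selects the level (0 to 3)
        levels.append(LEVELS[sum(z > cut for cut in cuts)])
    return levels


# Toy SD-RCI values, not real results
toy_values = [1, 2, 3, 4, 10]
print(sd_rci_levels(toy_values))
```

Passing a different `cuts` tuple changes the level boundaries without touching the rest of the function.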
This protocol focuses on human cells. To apply it to other organisms, prepare Hi-C, ATAC-seq, and ChIP-seq data from the same cell line and train a new model. The provided trained models support only human cells.
Quantification and statistical analysis
The “before you begin” and “step-by-step method details” sections explain the operating system, software, and packages used in the present protocol.
Limitations
We suggest using Hi-C datasets with more than 400 million filtered reads as input to the MINE-Loop model to obtain better prediction performance.
The MINE protocol currently supports two loop callers (FitHiC2 and MUSTACHE). In the future, we may extend it to accommodate more loop callers.
The MINE protocol requires many types of data, including Hi-C, ATAC-seq, ChIP-seq, and a cis-regulatory element file.
Troubleshooting
Problem 1
Related to “install required software and libraries” (step 1). The command “conda create -n MINE python=3.6” may produce the error “Preparing transaction: done Verifying transaction: / WARNING conda.core.path_actions:verify(963): Unable to create environments file. Path not writable. environment location: /home/codedancing/.conda/environments.txt”.
Potential solution
The user may have previously installed the conda environment as an administrator. Delete the ~/.conda and ~/.condarc files in the user’s home directory, run “conda remove -n MINE --all” to remove the faulty virtual environment, and then reinstall it.
Problem 2
Related to “install required software and libraries” (step 1). The steps for installing MINE, FitHiC2, MUSTACHE, HiCDB, or SDOC may produce the error “No module named 'XX'”.
Potential solution
We suggest installing and running MINE, FitHiC2, MUSTACHE, HiCDB, and SDOC in separate conda environments, because they may require different Python environments.
Problem 3
Related to “select and preprocess dataset” (step 4). A “Memory Error” may occur when running the operation in step 4 (b).
Potential solution
txt2npy.py requires a large amount of memory. In our experiments, we ran this script on a machine with 503 GB of RAM. At least 250 GB is required to run this step.
Problem 4
Related to “training model” (step 1). In the model training step, the error “AssertionError: Invalid device id” may occur.
Potential solution
The CUDA_VISIBLE_DEVICES and nproc_per_node settings depend on the available GPU devices and on multi-GPU training in PyTorch. For details, see https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html. First check the number of GPUs with the command “nvidia-smi”, and then adjust the settings accordingly.
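The relationship between the two settings can be checked before launching training. The following is a minimal sketch of such a check; the function `check_launch` and its rules are illustrative assumptions and not part of MINE or PyTorch. It flags duplicate GPU ids, more processes than visible GPUs, and, optionally, a missing main GPU.

```python
def check_launch(cuda_visible_devices, nproc_per_node, main_gpu=None):
    """Return a list of problems found in the launch settings (empty if none)."""
    devices = [d.strip() for d in cuda_visible_devices.split(",") if d.strip()]
    problems = []
    if len(set(devices)) != len(devices):
        problems.append("duplicate GPU ids in CUDA_VISIBLE_DEVICES")
    if nproc_per_node > len(devices):
        problems.append(
            f"nproc_per_node={nproc_per_node} exceeds the "
            f"{len(devices)} visible GPU(s)"
        )
    if main_gpu is not None and str(main_gpu) not in devices:
        problems.append(f"main GPU {main_gpu} is not in CUDA_VISIBLE_DEVICES")
    return problems


print(check_launch("1,2,3,4,5,6", 6))        # []
print(check_launch("0,1,2,3", 6))            # one problem: too many processes
print(check_launch("1,2,3", 3, main_gpu=0))  # one problem: main GPU missing
```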
Problem 5
Related to “training model” (step 1). In the model training step, the error “RuntimeError: module must have its parameters and buffers on device cuda:1 (device_ids[0]) but found” may occur.
Potential solution
In multi-GPU parallel computing, the main GPU must be included; otherwise, this problem may occur. For example, if the main GPU is numbered 0, include GPU 0 in CUDA_VISIBLE_DEVICES and set --nproc_per_node to the number of listed GPUs:
>CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train_model.py -i /folder/to/train -o /folder/to/checkpoint
Problem 6
Related to “calling loops” (step 3). During execution, many scripts (generate_input_file.py and the .sh scripts) iterate over the chromosomes one by one, which is relatively slow.
Potential solution
If memory and CPU permit, modify the range of the ‘for’ statement and run several instances of the scripts, each on a different subset of chromosomes, to speed up processing.
Problem 7
Related to “do analysis using RCI” (step 4). During execution, the error “[Errno 2] No such file or directory” may occur.
Potential solution
In this step, define the directories exactly as in the previous steps and execute the code exactly as in the given example.
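One way to catch a wrong directory early is to confirm that the prediction folder contains the 22 matrices named in step 2 (chrX_1000b.npz, X = 1, 2, …, 22) before opening the analysis notebooks. The sketch below does this with the standard library; the helper name `missing_predictions` is an assumption for illustration.

```python
from pathlib import Path


def missing_predictions(result_folder, resolution_tag="1000b"):
    """List the expected chr1..chr22 prediction files that are absent."""
    folder = Path(result_folder)
    expected = [f"chr{i}_{resolution_tag}.npz" for i in range(1, 23)]
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    # '/folder/to/result' is the placeholder path used in step 2
    missing = missing_predictions("/folder/to/result")
    if missing:
        print("Missing prediction files:", ", ".join(missing))
    else:
        print("All 22 prediction files are present.")
```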
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Yang Chen (yc@ibms.pumc.edu.cn).
Materials availability
This study did not generate new unique reagents.
Acknowledgments
This work was funded by grants from the National Natural Science Foundation of China (81890994, 61971031), the Foshan Higher Education Foundation (BKBS202203), the Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB (BK20BF009), the National Key R&D Program of China (2018YFA0801402), and the CAMS Innovation Fund for Medical Sciences (2021-RC310-007, 2021-I2M-1-020, 2022-I2M-JB-003, 2022-I2M-1-020).
Author contributions
Y.C., X.Z., H.G., and M.L. conceived and designed the project. H.G. and M.L. performed the experiments. M.J., Z.Y., S.Z., and Y.Y. contributed to the implementation of the research. C.L. contributed to the design of figures. H.G. and M.L. completed the figures and writing of the paper with the guidance of Y.C. and X.Z.
Declaration of interests
The authors declare no competing interests.
Contributor Information
Xiaotong Zhang, Email: zxt@ies.ustb.edu.cn.
Yang Chen, Email: yc@ibms.pumc.edu.cn.
Data and code availability
This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table. The full processed data (i.e., the model predictions when trained on different datasets) generated during this study are available at a general use repository NMDMS (http://nmdms.ustb.edu.cn/),8 and the accession number of datasets is 10.12110/mater10.121. NKRDP.20221209.ds.63930883e571e2448aaed532. In addition, these data will be shared directly by the lead contact upon request.
The analysis code is available in the GitHub repository (https://doi.org/10.5281/zenodo.7388730).
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
References
- 1. Gong H., Li M., Ji M., Zhang X., Yuan Z., Zhang S., Yang Y., Li C., Chen Y. MINE is a method for detecting spatial density of regulatory chromatin interactions based on a multi-modal network. Cell Rep. Methods. 2023;3 doi: 10.1016/j.crmeth.2022.100386.
- 2. Robinson J.T., Turner D., Durand N.C., Thorvaldsdóttir H., Mesirov J.P., Aiden E.L. Juicebox.js provides a cloud-based visualization system for Hi-C data. Cell Syst. 2018;6:256–258.e1. doi: 10.1016/j.cels.2018.01.001.
- 3. Kaul A., Bhattacharyya S., Ay F. Identifying statistically significant chromatin contacts from Hi-C data with FitHiC2. Nat. Protoc. 2020;15:991–1012. doi: 10.1038/s41596-019-0273-0.
- 4. Roayaei Ardakany A., Gezer H.T., Lonardi S., Ay F. Mustache: multi-scale detection of chromatin loops from Hi-C and micro-C maps using scale-space representation. Genome Biol. 2020;21:256. doi: 10.1186/s13059-020-02167-0.
- 5. Chen F., Li G., Zhang M.Q., Chen Y. HiCDB: a sensitive and robust method for detecting contact domain boundaries. Nucleic Acids Res. 2018;46:11239–11250. doi: 10.1093/nar/gky789.
- 6. Jiang S., Li H., Hong H., Du G., Huang X., Sun Y., Wang J., Tao H., Xu K., Li C., et al. Spatial density of open chromatin: an effective metric for the functional characterization of topologically associated domains. Brief. Bioinform. 2021;22:bbaa210. doi: 10.1093/bib/bbaa210.
- 7. Varoquaux N., Ay F., Noble W.S., Vert J.-P. A statistical approach for inferring the 3D structure of the genome. Bioinformatics. 2014;30:i26–i33. doi: 10.1093/bioinformatics/btu268.
- 8. Liu S., Su Y., Yin H., Zhang D., He J., Huang H., Jiang X., Wang X., Gong H., Li Z., et al. An infrastructure with user-centered presentation data model for integrated management of materials data and services. NPJ Comput. Mater. 2021;7:88.