#Enrichment analysis of chromatin accessibility in transcription start sites
#and its relationship to G4s. Accessible chromatin regions were
#retrieved from ChIP-ATLAS (https://chip-atlas.org) using the following
#parameters: antigen class: DNase-Seq; cell type class: blood; threshold for
#significance: 500; antigen: DNase-Seq; cell type: all.
sortBed -i DNS.Bld.50.DNase-Seq.AllCell.bed | mergeBed \
    > DNS.Bld.50.DNase-Seq.AllCell.sort.merged.only.bed

#An hg19 transcription start sites (TSS) file was generated as previously
#described (Hänsel-Hertsch et al. 2016).
# + Strand: Get start of first exons on each transcript
awk -v OFS="\t" '$7 == "+" {print $1, $4, $5, $7, $10, $12}' genes.gtf \
    | sed 's/"//g' | sed 's/ //g' | sed 's/;//g' \
    | sort -k 6,6 -k2,2n \
    | groupBy -g 6 -c 1,2,5 -o first,first,first \
    | awk -v OFS="\t" '{print $2, $3-1000, $3+1000, $4, $1, "+"}' > tss.plus.bed

# - strand: Get end of last exons on each transcript
awk -v OFS="\t" '$7 == "-" {print $1, $4, $5, $7, $10, $12}' genes.gtf \
    | sed 's/"//g' | sed 's/ //g' | sed 's/;//g' \
    | sort -k 6,6 -k3,3nr \
    | groupBy -g 6 -c 1,3,5 -o first,first,first \
    | awk -v OFS="\t" '{print $2, $3-1000, $3+1000, $4, $1, "-"}' > tss.minus.bed

# Merge promoters within genes
cat tss.minus.bed tss.plus.bed \
    | cut -f1,2,3,4 \
    | sort \
    | uniq \
    | awk -v OFS="\t" '{print $1 "_" $4, $2, $3, $4}' \
    | sortBed \
    | mergeBed \
    | sed 's/_/\t/' \
    | awk -v OFS="\t" '{print $1, $3, $4, $2}' \
    | sortBed > hg19.gene_name.promoters.bed

#A maximum loop length of 7 nucleotides, 4 G-repeats
#and a minimum of 3 Gs per G-repeat were considered to predict PQS in the human
#genome (hg19). The fastaRegexFinder tool (Python) was used with the following
#parameters to predict PQS: “--quiet -r '([gG]{3,}\w{1,7}){3,}[gG]{3,}'”.
#install fastaRegexFinder --> Python tool
fastaRegexFinder.py --quiet -r '([gG]{3,}\w{1,7}){3,}[gG]{3,}' -f \
    genome.fa >> PQS_N1-7_hg19.bed

#The 4 available OQS files were retrieved (Chambers et al. 2015) and unified
#to a single OQS file using cat and the bedtools functions sortBed and
#mergeBed (Quinlan and Hall 2010). mergeBed expects sorted input, so sortBed
#runs first.
cat Na_K_minus_hits_intersect.bed Na_K_plus_hits_intersect.bed \
    Na_PDS_minus_hits_intersect.bed Na_PDS_plus_hits_intersect.bed \
    | sortBed -i stdin | mergeBed -i stdin > SortBed_on_G4_seq_hits.bed

#Bedtools intersect was used to retrieve ~300,000 PQS that overlap with OQS
#(PQS/OQS).
bedtools intersect -u -a PQS_N1-7_hg19.bed -b SortBed_on_G4_seq_hits.bed \
    > PQS_N1-7_hg19.OQS.bed
#Note: the commands below read PQS_N1-7_hg19.in.OQS.sort.bed, which is assumed
#to be a sorted copy of PQS_N1-7_hg19.OQS.bed; that step is not shown here.

#Bedtools intersect was used to retrieve accessible
#chromatin regions in TSS that are either positive or negative for PQS/OQS.
intersectBed -u -a DNS.Bld.50.DNase-Seq.AllCell.sort.merged.only.bed \
    -b hg19.gene_name.promoters.bed \
    > DNS.Bld.50.DNase-Seq.AllCell.sort.merged.only.int.TSS.bed
intersectBed -v -a DNS.Bld.50.DNase-Seq.AllCell.sort.merged.only.int.TSS.bed \
    -b PQS_N1-7_hg19.in.OQS.sort.bed \
    > DNS.Bld.50.DNase-Seq.AllCell.sort.merged.only.int.TSS.negative.for.PQS.OQS.bed
intersectBed -u -a DNS.Bld.50.DNase-Seq.AllCell.sort.merged.only.int.TSS.bed \
    -b PQS_N1-7_hg19.in.OQS.sort.bed \
    > DNS.Bld.50.DNase-Seq.AllCell.sort.merged.only.int.TSS.int.PQS.OQS.bed

#Bedtools shuffle was employed to generate 10 random distributions of
#accessible chromatin regions in TSS that are either positive or negative for
#PQS/OQS regions across the hg19 genome.
n=(1 2 3 4 5 6 7 8 9 10)
for f in *bed; do
    for i in "${!n[@]}"; do
        bedtools shuffle \
            -incl hg19.wgEncodeDukeMapabilityRegionsExcludable.whitelist.sort.bed \
            -chromFirst -noOverlapping -seed "${n[i]}" -i "$f" -g hg19.genome \
            | sort -k1,1 -k2,2n > "${f%%.bed}.shuffle_${n[i]}.bed"
    done
done

#The plotEnrichment function of deepTools (Ramírez et al. 2016) was used to
#measure the coverage of the individual bam files in 11 bed files per
#condition (1 bed file + 10 randomised bed files). For each bam file,
#enrichments in accessible chromatin regions in TSS that contain or lack
#PQS/OQS were calculated as the average of the ratios between the real
#coverage and each of the 10 coverages in the permuted accessible
#chromatin regions in TSS that contain or lack PQS/OQS.
for f in *.bam; do samtools index -b "$f"; done
plotEnrichment -b *.bam --BED *bed -o plotEnrichment.png --variableScales \
    --outRawCounts plotEnrichment.txt

#Similarity of predicted (PQS/OQS) G4-forming chromatin accessibility in
#transcription start sites with G4 regions identified in chromatin. To assess
#whether accessible chromatin regions in TSS contain PQS/OQS that also adopt
#G4 secondary structure in chromatin, we first assembled (cat, sortBed,
#mergeBed) various annotated chromatin G4 regions (chromatin G4Rs), which were
#mapped by G4-ChIP-seq using BG4 (GSE145090, GSE99205, GSE152216) or G4P
#(GSE133379).
#download the following files:
#GSE133379_293T-G4P-hg19-rep1.narrowPeak
#GSE133379_H1975-G4P-hg19-rep1.narrowPeak
#GSE145090_20180108_K562_async_rep1-3.mult.5of8.bed
#GSE133379_293T-G4P-hg19-rep2.narrowPeak
#GSE133379_H1975-G4P-hg19-rep2.narrowPeak
#GSE145090_HepG2_async_rep1-3.mult.2of9.bed
#GSE133379_A549-G4P-hg19-rep1.narrowPeak
#GSE133379_HeLa-S3-G4P-hg19-rep1.narrowPeak
#GSE145090_HepG2_async_rep1-3.mult.6of9.bed
#GSE133379_A549-G4P-hg19-rep2.narrowPeak
#GSE145090_20180108_K562_async_rep1-3.mult.2of8.bed
#GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated.bed

#Retrieve all G4 regions identified in breast cancer tissues.
#download https://github.com/sblab-bioinformatics/qG4-ChIP-seq-of-breast-cancer-PDTX/blob/master/input_files/temp_concat_hg19_libraries.txt
cut -f 1,2,3 temp_concat_hg19_libraries.txt > temp_concat_hg19_libraries.bed

cat GSE133379_293T-G4P-hg19-rep1.narrowPeak \
    GSE133379_H1975-G4P-hg19-rep1.narrowPeak \
    GSE145090_20180108_K562_async_rep1-3.mult.5of8.bed \
    GSE133379_293T-G4P-hg19-rep2.narrowPeak \
    GSE133379_H1975-G4P-hg19-rep2.narrowPeak \
    GSE145090_HepG2_async_rep1-3.mult.2of9.bed \
    GSE133379_A549-G4P-hg19-rep1.narrowPeak \
    GSE133379_HeLa-S3-G4P-hg19-rep1.narrowPeak \
    GSE145090_HepG2_async_rep1-3.mult.6of9.bed \
    GSE133379_A549-G4P-hg19-rep2.narrowPeak \
    GSE145090_20180108_K562_async_rep1-3.mult.2of8.bed \
    GSE99205_common_HaCaT_G4_ChIP_peaks_RNase_treated.bed \
    temp_concat_hg19_libraries.bed | sortBed | mergeBed > insitu.G4Rs.bed

#Second, bedtools intersect was used to retrieve ~81,000 PQS that overlap with
#chromatin G4Rs (PQS/chromatin.G4Rs).
intersectBed -u -a PQS_N1-7_hg19.bed -b insitu.G4Rs.bed > PQS.int.insituG4Rs.bed

#Third, bedtools intersect was used to retrieve accessible chromatin regions
#in TSS that are positive for PQS/chromatin.G4Rs.
intersectBed -u -a DNS.Bld.50.DNase-Seq.AllCell.sort.merged.only.int.TSS.bed \
    -b PQS.int.insituG4Rs.bed > OC.int.TSS.int.PQS.int.insituG4Rs.bed

#Fourth, we used intervene venn to assess the similarity between accessible
#regions in TSS that are either a) positive or b) negative for PQS/OQS, and
#c) positive for PQS/chromatin.G4Rs.
#Note: OC.int.TSS.int.PQS.OQS.bed and OC.int.TSS.negative.for.PQS.OQS.bed are
#assumed to be renamed copies of the corresponding
#DNS.Bld.50.DNase-Seq.AllCell.sort.merged.only.int.TSS.* files; the renaming
#step is not shown here.
intervene venn -i OC.int.TSS.int.PQS.int.insituG4Rs.bed \
    OC.int.TSS.int.PQS.OQS.bed OC.int.TSS.negative.for.PQS.OQS.bed \
    --names=OC.TSS.PQS.insituG4Rs,OC.TSS.PQS.OQS,OC.TSS.Neg.PQS.OQS --save-overlaps

#Enrichment analysis of transcription start sites containing either cancer or
#normal blood-related accessible chromatin regions. Accessible chromatin regions
#were retrieved from ChIP-ATLAS (https://chip-atlas.org) using the following
#parameters: antigen class: DNase-Seq; cell type class: blood; threshold for
#significance: 500; antigen: All; cell type: erythrocyte progenitors (SRX062346,
#SRX027080, SRX2187113, SRX201280, SRX089247, SRX027082, SRX027081, SRX015226,
#SRX026926, SRX026923, SRX026927, SRX015228, SRX027077, SRX015227, SRX026920,
#SRX026922, SRX062365, SRX026915, SRX214040, SRX026921, SRX062345, SRX040374,
#SRX062347, SRX4457261, SRX4457260, SRX4457262, SRX4457263, SRX4457266,
#SRX4457267, SRX4457265, SRX4457264), Erythroid cells (SRX2105479, SRX2105477,
#SRX2105478, SRX838173, SRX2370592, SRX2370809, SRX2370805, SRX2370710,
#SRX2370651, SRX2370789, SRX2370806, SRX2370593, SRX2370676, SRX2370711,
#SRX2370602, SRX2370601, SRX2370642, SRX2370810, SRX2370641, SRX2370788,
#SRX2370652), lymphocytes (SRX201251, SRX201254, SRX193595, SRX20130,
#SRX069095, SRX201250, SRX193593, SRX201249, SRX193594, SRX069185, SRX201270,
#SRX201259, SRX193612, SRX201263, SRX037118, SRX201262, SRX193613, SRX201300,
#SRX5718539, SRX5718538, SRX040389, SRX100956, SRX055204, SRX040414, SRX204403,
#SRX055163, SRX213515, SRX040396, SRX055172, SRX204402, SRX040412, SRX189431,
#SRX040388, SRX055153, SRX100962, SRX201299, SRX040415, SRX193609, SRX055203,
#SRX055155, SRX204404, SRX214041, SRX055180, SRX055190, SRX055152, SRX040413,
#SRX089240, SRX055189, SRX204362, SRX252603, SRX201292, SRX189402, SRX055164,
#SRX201276, SRX201275, SRX193597, SRX189409, SRX189387, SRX055157, SRX055171,
#SRX055156, SRX342324) and monocytes (SRX2370645, SRX2370646, SRX581585,
#SRX581580, SRX581582, SRX581593, SRX581578, SRX581579, SRX581576, SRX581577,
#SRX581589, SRX581581, SRX581584, SRX581595, SRX581594, SRX581590, SRX581588,
#SRX201301, SRX055167, SRX040416, SRX252602, SRX055205, SRX069106, SRX581591,
#SRX581587, SRX581586). DNase-seq peaks of each experiment (SRX…) within either
#the erythrocyte progenitors or lymphocytes categories were sorted (sortBed),
#merged (mergeBed) and filtered for unique peaks that only appear in the
#respective categories (intersectBed). Lymphocyte and erythrocyte progenitor
#TSS were selected from overlaps with their specific unique accessible
#chromatin peaks (intersectBed). We applied the same filtering to identify
#monocyte-related TSS but, because far fewer TSS were recovered (259) than for
#lymphocytes (1971 TSS) and erythrocyte progenitors (878 TSS), excluded them
#from further analysis.
cat \
    DNS.Bld.50.AllAg.CD34PULUS.bed \
    DNS.Bld.50.AllAg.CD34PULUS_cells.bed \
    DNS.Bld.50.AllAg.Erythroid_Cells.bed \
    DNS.Bld.50.AllAg.Hematopoietic_Stem_Cells.bed \
    DNS.Bld.50.AllAg.Peripheral_blood_stem_cells.bed | sortBed | mergeBed \
    > eythrocyte.progenitors_DHS_hg19.bed

cat \
    DNS.Bld.50.AllAg.B-cell_BRACKETLCD19PULUSBRACKETR.bed \
    DNS.Bld.50.AllAg.B-Lymphocytes.bed \
    DNS.Bld.50.AllAg.CD20PULUS_B_cells.bed \
    DNS.Bld.50.AllAg.CD20PULUS.bed \
    DNS.Bld.50.AllAg.CD3PULUS.bed \
    DNS.Bld.50.AllAg.CD4-Positive_T-Lymphocytes.bed \
    DNS.Bld.50.AllAg.CD4PULUS.bed \
    DNS.Bld.50.AllAg.CD4PULUS_Th1.bed \
    DNS.Bld.50.AllAg.CD56PULUS.bed \
    DNS.Bld.50.AllAg.CD8PULUS.bed \
    DNS.Bld.50.AllAg.PBMC.bed \
    DNS.Bld.50.AllAg.Th17_Cells.bed \
    DNS.Bld.50.AllAg.Th1_Cells.bed \
    DNS.Bld.50.AllAg.Th2_Cells.bed \
    DNS.Bld.50.AllAg.Treg.bed \
    DNS.Bld.50.AllAg.Treg_Wb83319432.bed | sortBed | mergeBed \
    > lymphocytes_DHS_hg19.bed

cat \
    DNS.Bld.50.AllAg.Dendritic_Cells.bed \
    DNS.Bld.50.AllAg.Macrophages.bed \
    DNS.Bld.50.AllAg.Monocytes.bed \
    DNS.Bld.50.AllAg.Monocytes-CD14PULUS.bed | sortBed | mergeBed \
    > monocytes_DHS_hg19.bed

cat monocytes_DHS_hg19.bed lymphocytes_DHS_hg19.bed | sortBed | mergeBed > tmp \
    && intersectBed -v -a eythrocyte.progenitors_DHS_hg19.bed -b tmp \
    > uniqe.eythrocyte.progenitors_DHS_hg19.bed \
    && rm tmp \
    && intersectBed -u -a hg19.gene_name.promoters.bed \
    -b uniqe.eythrocyte.progenitors_DHS_hg19.bed \
    > TSS.int.unique.eythrocyte.progenitors_DHS_hg19.bed

cat monocytes_DHS_hg19.bed eythrocyte.progenitors_DHS_hg19.bed | sortBed \
    | mergeBed > tmp \
    && intersectBed -v -a lymphocytes_DHS_hg19.bed -b tmp \
    > uniqe.lymphocytes_DHS_hg19.bed \
    && rm tmp \
    && intersectBed -u -a hg19.gene_name.promoters.bed \
    -b uniqe.lymphocytes_DHS_hg19.bed \
    > TSS.int.unique.lymphocytes_DHS_hg19.bed

#To reveal TSS that contain pan cancer accessible chromatin, a genomic
#coordinate (bed, genome alignment to hg38) pan cancer ATAC-seq file was
#retrieved from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG,
#sorted (sortBed), lifted to hg19 (https://genome.ucsc.edu/cgi-bin/hgLiftOver),
#sorted again and intersected with TSS (intersectBed) to reveal cancer-specific
#TSS. To compare a similar number of pan cancer related TSS with blood related
#TSS (lymphocyte and erythrocyte progenitors), we selected the top 5,000
#ATAC-seq peaks by their enrichment score (the normalized peak score of the
#given peak, i.e. "score-per-million"; see the methods of Corces & Granja et
#al., Science 2018). Bedtools shuffle was employed to generate 10 random
#distributions for each of the 3 TSS categories (top 5K pan cancer,
#lymphocytes, erythrocyte progenitors). The plotEnrichment function of
#deepTools (Ramírez et al. 2016) was used to measure the coverage of the
#individual bam files in 11 bed files per condition (1 bed file + 10
#randomised bed files). For each bam file, enrichments in the 3 TSS
#categories were calculated as the average of the ratios between the real
#coverage and each of the 10 coverages in the permuted TSS datasets.
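#The top 5,000 selection by enrichment score is described above without a
#command. A minimal sketch of one way to do it follows. It assumes the peak
#table is tab-separated with one header line and columns chrom, start, end,
#name, score (columns 1-5); check this against the downloaded file. The
#function name top_peaks_by_score is hypothetical.

```shell
# Hypothetical helper: print the N highest-scoring peaks (columns 1-5).
# Assumed layout: one header line, tab-separated, score in column 5.
# The body runs in a subshell so its variables do not leak into the script.
top_peaks_by_score() (
    peak_table=$1          # path to the peak table
    top_n=${2:-5000}       # number of peaks to keep (default 5000)
    tab=$(printf '\t')
    tail -n +2 "$peak_table" \
        | LC_ALL=C sort -t "$tab" -k5,5gr \
        | head -n "$top_n" \
        | cut -f1-5
)
```

#Possible usage (the output name is a placeholder):
#top_peaks_by_score TCGA-ATAC_PanCancer_PeakSet.txt 5000 > top5k.hg38.bed
#The result is still in hg38 coordinates and would need the liftOver and
#sorting steps described above before intersecting with the hg19 TSS file.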
#download https://api.gdc.cancer.gov/data/116ebba2-d284-485b-9121-faf73ce0a4ec
#-->TCGA-ATAC_PanCancer_PeakSet.txt
intersectBed -u -a hg19.gene_name.promoters.bed \
    -b TCGA-ATAC_PanCancer_PeakSet.top5k.sort.clean.hg19.sort.bed \
    > TSS.plus.minus.1kbp.int.TCGA.ATAC.PanCancer.top5kpeaks.bed

n=(1 2 3 4 5 6 7 8 9 10)
for f in \
    TSS.plus.minus.1kbp.int.TCGA.ATAC.PanCancer.top5kpeaks.bed \
    TSS.int.unique.lymphocytes_DHS_hg19.bed \
    TSS.int.unique.eythrocyte.progenitors_DHS_hg19.bed; do
    for i in "${!n[@]}"; do
        bedtools shuffle \
            -incl hg19.wgEncodeDukeMapabilityRegionsExcludable.whitelist.sort.bed \
            -chromFirst -noOverlapping -seed "${n[i]}" -i "$f" -g hg19.genome \
            | sort -k1,1 -k2,2n > "${f%%.bed}.shuffle_${n[i]}.bed"
    done
done

plotEnrichment -b *.bam --BED *bed -o plotEnrichment.png --variableScales \
    --outRawCounts plotEnrichment.txt
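
#The averaging of real-over-permuted coverage ratios described in the comments
#can be sketched as follows. This assumes the --outRawCounts table is
#tab-separated with columns file, featureType, percent, featureReadCount,
#totalReadCount, and that featureType carries the BED file name, with the
#permuted files named <real>.shuffle_<seed>.bed as in the loop above.
#mean_enrichment_ratio is a hypothetical helper, not part of deepTools.

```shell
# Hypothetical helper: mean of (real percent / shuffled percent) for one BAM.
# Arguments: raw counts table, BAM label (column 1), real BED label (column 2).
# Assumed columns: file, featureType, percent, featureReadCount, totalReadCount.
mean_enrichment_ratio() (
    raw_counts=$1
    bam=$2
    real_bed=$3
    prefix=${real_bed%.bed}.shuffle_   # shuffled labels start with this
    awk -F '\t' -v bam="$bam" -v real="$real_bed" -v prefix="$prefix" '
        # percent of reads in the real regions
        $1 == bam && $2 == real { real_pct = $3 }
        # percent of reads in each permuted copy of those regions
        $1 == bam && index($2, prefix) == 1 { shuf[++k] = $3 }
        END {
            if (k == 0 || real_pct == "") exit 1
            for (i = 1; i <= k; i++) {
                if (shuf[i] == 0) exit 1   # avoid division by zero
                sum += real_pct / shuf[i]
            }
            printf "%.4f\n", sum / k
        }' "$raw_counts"
)
```

#Possible usage (sample.bam is a placeholder label):
#mean_enrichment_ratio plotEnrichment.txt sample.bam \
#    TSS.int.unique.lymphocytes_DHS_hg19.bed
#Because every row of one BAM shares the same total read count, the ratio of
#percentages equals the ratio of read counts in the real and permuted regions.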