PLOS Computational Biology. 2020 Nov 30;16(11):e1008422. doi: 10.1371/journal.pcbi.1008422

Systematic clustering algorithm for chromatin accessibility data and its application to hematopoietic cells

Azusa Tanaka 1,2,*,#, Yasuhiro Ishitsuka 3,4,*,#, Hiroki Ohta 3,5,*,#, Akihiro Fujimoto 1, Jun-ichirou Yasunaga 2,6, Masao Matsuoka 2,6
Editor: Avner Schlessinger7
PMCID: PMC7728210  PMID: 33253153

Abstract

The huge amount of data acquired by high-throughput sequencing requires data reduction for effective analysis. Here we give a clustering algorithm for genome-wide open chromatin data using a new data reduction method. This method regards the genome as a string of 1s and 0s based on a set of peaks and calculates the Hamming distances between the strings. This algorithm with the systematically optimized set of peaks enables us to quantitatively evaluate differences between samples of hematopoietic cells and classify cell types, potentially leading to a better understanding of leukemia pathogenesis.

Author summary

High-throughput sequencing provides us huge amounts of data about gene regulation. In order to extract useful information from the data, data reduction is needed. Although RNA-seq data analysis has been extensively studied, where the focus is mainly on genetic loci, tools for epigenetic sequencing data, such as ATAC-seq data which represent chromatin accessibility, are comparatively lacking. Since the binding of transcription factors mainly occurs in open chromatin regions, it is presumably important to understand how chromatin accessibility landscape affects cell phenotype. In this context, we developed a systematic algorithm to select a set of peaks representing the open state of chromatin for a given sample of ATAC-seq data. This algorithm quantifies the difference between samples by regarding the genome as a string of 1s and 0s with Hamming distances and then performs hierarchical clustering. This algorithm has less computational cost and gives a reasonable cell type classification compared to a previous method. In this work, as an application of this algorithm, we present a comparative analysis of leukemia samples with healthy hematopoietic cells and provide new insights about the relationship between chromatin structures, cell surface proteins, and symptoms in leukemia.

Introduction

Cellular phenotypes are governed by epigenetic mechanisms. For example, information about how human DNA is packed and chemically modified in the nucleus plays an important role in understanding the differentiation and regulation of cells [1–4]. Methods such as chromatin immunoprecipitation sequencing (ChIP-seq) and assay for transposase-accessible chromatin using sequencing (ATAC-seq) have proven useful for profiling chromatin modifications and detecting open chromatin on a genome-wide scale [5–9]. These epigenetic data analysis methods usually start by identifying regions of read enrichment along the whole genome, a step known as “peak calling” [10, 11].

In RNA-seq data analysis, the target regions are mainly fixed loci or genes shared across samples; in epigenetic sequencing data, by contrast, the target regions are not determined in advance. To determine the target regions, peak calling with an appropriate tool is often performed for the entire genome of every sample, and the target regions are defined as merged peaks among all samples. Then the total number of reads or fragments present in each region is counted for each sample, leading to a matrix, $X = (x_{i,j})$, where $x_{i,j}$ represents the number of reads/fragments from sample $i$ in region $j$. The matrix elements are normalized by quantile normalization to reduce the biases arising from variations in the data size over samples, followed by downstream processing [7–9].

However, this process raises two concerns. First, we do not fully understand the effect of merging all the peaks from different samples. For example, if two peaks from different samples slightly overlap, those two peaks are considered as one peak after the peak merging step. Therefore, the difference of the two peak positions, which may reflect cell identity, may be unintentionally ignored. The second concern is that we have no justification for applying quantile normalization over samples that are phenotypically different [12, 13].

Thus, the aim of the present study is to avoid these concerns by constructing an algorithm that systematically classifies epigenetic data obtained from high-throughput sequencing. Toward cell type classification, we provide a systematic algorithm to select the set of peaks used for the downstream analysis, in which the differences between samples are quantified by the Hamming distance from information theory [14]. This algorithm has a lower computational cost than a previous method [7] while still producing a reasonable classification.

As an application of the developed algorithm, we use it to obtain new insights on samples of leukemia cells from chronic lymphocytic leukemia (CLL), acute myeloid leukemia (AML), and adult T-cell leukemia (ATL) at the chromatin level. In particular, using this algorithm, we infer the phenotype of a given leukemia sample as output by using only ATAC-seq data of that sample as input.

Results

ATAC-seq samples

In this paper, we mainly focused on 77 ATAC-seq datasets from 13 human primary blood cell types [7] as test data. The 13 cell types are comprised of hematopoietic stem cells (HSC), multipotent progenitor cells (MPP), lymphoid-primed multipotent progenitor cells (LMPP), common myeloid progenitor cells (CMP), megakaryocyte-erythroid progenitor cells (MEP), granulocyte-macrophage progenitor cells (GMP), common lymphoid progenitor cells (CLP), natural killer cells (NK), B cells, CD4+T cells (CD4+T), CD8+T cells (CD8+T), monocytes (Mono) and erythroids (Ery). These cell types are experimentally categorized by immunophenotypes described by the combination of cell surface markers shown in Table 1.

Table 1. Immunophenotypes of samples.

Types of hematopoietic cells and their corresponding cell surface markers in [7]. For example, CD34+ and CD38- for cell type ν means that a cell of type ν expresses CD34 but not CD38 at its surface.

Cell type (ν) | Number of replicates | Immunophenotype
HSC | 7 | Lin-, CD34+, CD38-, CD10-, CD90+
MPP | 6 | Lin-, CD34+, CD38-, CD10-, CD90-
LMPP | 3 | Lin-, CD34+, CD38-, CD10-, CD45RA+
CMP | 8 | Lin-, CD34+, CD38+, CD10-, CD45RA-, CD123+
MEP | 7 | Lin-, CD34+, CD38+, CD10-, CD45RA-, CD123-
GMP | 7 | Lin-, CD34+, CD38+, CD10-, CD45RA+, CD123+
CLP | 5 | Lin-, CD34+, CD38+, CD10+, CD45RA+
NK | 6 | CD56+
B | 4 | CD19+, CD20+
CD4+T | 5 | CD3+, CD4+
CD8+T | 5 | CD3+, CD8+
Mono | 6 | CD14+
Ery | 8 | CD71+, GPA+, CD45-low

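For convenience, $T$ denotes the set of the thirteen cell types:

$$T = \{\text{B}, \text{CD4+T}, \text{CD8+T}, \text{CLP}, \text{CMP}, \text{Ery}, \text{GMP}, \text{HSC}, \text{LMPP}, \text{MEP}, \text{Mono}, \text{MPP}, \text{NK}\}.$$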

For all 77 samples, we aligned ATAC-seq reads to the reference genome hg19 (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/), and only reads with high mapping quality (MQ ≥ 30) were used for peak calling by MACS2 (see S1 Appendix for details of the preprocessing) [15]. Each peak calling result consists of a location, which includes the peak width, and an associated p-value. Concretely, the location of the $k$-th peak is expressed by $g_k = (\gamma_k, \alpha_k, \beta_k)$, where $\gamma_k$ is the chromosome number, $\alpha_k$ is the start position, and $\beta_k$ is the end position. Note that we used MACS2 to call all ATAC-seq peaks with the parameters (--nomodel --nolambda --keep-dup all -p $p_G$), where the number of peaks is affected by the peak calling parameter ‘-p $p_G$’. The parameter $p_G$ is larger than the p-value of every peak in the peak calling results. (See Materials and methods for details of the peak calling.)

Note that the peak position depends on parameter pG of the MACS2 algorithm as shown in Fig 1. For example, the start and end positions of a peak could change and one peak could split into two peaks depending on pG. Thus, we need to take into account the dependence of a set of peaks on different values of pG for careful analysis.

Fig 1. The number of reads vs genomic positions.

The plots show representative data of Mono obtained from SRA with accession number SRR2920475. (A) The number of reads $Y_x$ at each position $x$ along chr 1 ($\gamma = 1$) and the peak region $(\alpha_k, \beta_k)$ as determined by the MACS2 algorithm with peak calling parameter $p_G = 10^{-2}$ (pink shaded regions) are shown. The peak region and its associated p-value $((\alpha_k, \beta_k), p_k)$ are $((1092756, 1094068), 10^{-20.36428})$. (B) The obtained peak regions are $((1092817, 1093330), 10^{-20.36428})$ and $((1093480, 1094025), 10^{-8.19447})$ for $p_G = 10^{-4}$.

Parameterized binarization

First we ranked the peaks in ascending order of p-value and then investigated the relationship between peak width and rank. We found that peaks with larger p-values tended to be narrower, which suggested that selecting peaks with smaller p-values makes the data reduction robust against small noise in the data (Fig 2).

Fig 2. The statistics of peak width.

Distribution of peak width $(\beta_k - \alpha_k)$ and its corresponding ranking $k$ obtained from the peak calling result of CD4+T cells with peak calling parameter $p_G = 10^{-2}$. The bin size is 400 × 400. The color code indicates the number of data points in each bin.

Thus, we define $M_{\mathrm{cut}}$ as the threshold such that only peaks with rankings not greater than $M_{\mathrm{cut}}$ are used for the analysis hereafter. Then, for a given pair $(M_{\mathrm{cut}}, p_G)$, we introduce $B = \{h_{\gamma,x}\}$, where $h_{\gamma,x} = 1$ when position $x$ in chromosome $\gamma$ is inside a peak and 0 otherwise (Fig 3). The process to obtain the binary sequence from the reads data is illustrated in Fig 4. Note that we do not coarse-grain the genome position $x$ but keep 1 bp resolution. (See Materials and methods for details of the binarization.)

Fig 3. How to calculate Hamming distance.

Schema of the Hamming distance calculation from the peak locations of two samples $c_1, c_2 \in S$. Each locus is converted to 1 or 0 depending on whether it lies inside a peak.

Fig 4. Binarizing the number of reads.

(A) The number of reads $Y_x$ at each position $x$ along chr 3 ($\gamma = 3$) and the peak regions $(\alpha_k, \beta_k)$ as determined by the MACS2 algorithm with peak calling parameter $p_G = 10^{-2}$ (pink shaded regions). This figure shows representative data of NK cells obtained from SRA with accession number SRR2920495. The peak regions and the associated p-values $((\alpha_k, \beta_k), p_k)$ of the left and right peaks are $((188271079, 188271985), 10^{-422.5872})$ and $((188286401, 188287077), 10^{-329.52139})$, respectively. Thus, the widths $(\beta_k - \alpha_k)$ of the left and right peaks are 906 and 676, respectively. (B) Binary sequence $(h_x)$ as determined by the peak regions seen in (A) when we chose $M_{\mathrm{cut}}$ satisfying $p_{M_{\mathrm{cut}}} \ge 10^{-329.52139}$.
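The selection and binarization steps can be illustrated with a minimal sketch that uses the two NK peaks quoted in this caption. The `Peak` type, the function names, and the half-open interval convention [α, β) are assumptions made for illustration (the convention is chosen so that a peak covers β − α positions, matching the quoted widths); p-values are stored as −log10(p) because values such as 10^−422 underflow floating point.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    chrom: str          # chromosome label (gamma_k)
    start: int          # start position (alpha_k)
    end: int            # end position (beta_k)
    neg_log10_p: float  # -log10 of the p-value; larger means more significant


def top_peaks(peaks: List[Peak], m_cut: int) -> List[Peak]:
    """Rank peaks by ascending p-value and keep those with rank <= m_cut."""
    ranked = sorted(peaks, key=lambda p: p.neg_log10_p, reverse=True)
    return ranked[:m_cut]


def h(selected: List[Peak], chrom: str, x: int) -> int:
    """Binary value at position x: 1 if x lies inside a selected peak, else 0.

    Assumes half-open intervals [start, end).
    """
    return int(any(p.chrom == chrom and p.start <= x < p.end for p in selected))


# The two peaks of Fig 4A (NK cells, chr 3), listed in arbitrary order.
peaks = [
    Peak("chr3", 188286401, 188287077, 329.52139),
    Peak("chr3", 188271079, 188271985, 422.5872),
]

only_best = top_peaks(peaks, 1)   # keeps only the left, more significant peak
both = top_peaks(peaks, 2)        # keeps both peaks
```

With `m_cut = 1` a position inside the right-hand peak maps to 0, while with `m_cut = 2` it maps to 1, which is how the choice of the cutoff changes the binary sequence.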

Quantifying differences between two binary sequences by Hamming distance

Let us move onto the situation when one considers a set of samples to evaluate the difference between two binary sequences B. Here our strategy is to find the proper distance that can be measured from the normalized ATAC-seq data of two samples. Using that distance, we try to obtain hierarchical clustering of a set of hematopoietic cell samples to quantitatively characterize the relationship among those samples.

Let Ns be the number of samples. We then write the set of samples as

$$S := \{1, 2, \ldots, N_s\},$$

where $N_s = 77$ in this study. For sample $c \in S$, we add index $c$ to related objects as a superscript. For example, we write a binary sequence $B$ associated with sample $c$ as $B^c := \{h^c_{\gamma,x}\}$.

There are many methods to evaluate the difference between a binary sequence $B^c$ from sample $c \in S$ and $B^{c'}$ from sample $c' \in S$. In this paper, we evaluated the difference between two samples $(c, c')$ by using the Hamming distance $H(B^c, B^{c'})$ between the two binary sequences. $H(B^c, B^{c'})$ is calculated as the number of positions $x$ at which $B^c$ and $B^{c'}$ take different values (Fig 5). We used the distance as an initial condition for the hierarchical clustering and then used Ward’s method to complete the hierarchical clustering [16]. Examples of hierarchical clustering with $(M_{\mathrm{cut}}, p_G) = (2000, 10^{-2})$ and $(80000, 10^{-2})$ are shown in Fig 6. (See Materials and methods for details of the Hamming distance and hierarchical clustering.)
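Because each binary sequence is determined entirely by its peak intervals, the Hamming distance equals the total length of the symmetric difference of the two interval sets, so the genome-length strings never have to be built. The following is a minimal sketch of that idea, not the authors' implementation; the function names, the `(chrom, start, end)` tuple layout, and the half-open interval convention are assumptions.

```python
from collections import defaultdict


def merge(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered(intervals):
    """Number of positions covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))


def hamming(peaks_a, peaks_b):
    """Hamming distance between the binary sequences of two peak sets.

    Uses |A xor B| = 2|A or B| - |A| - |B| on each chromosome.
    """
    by_chrom_a, by_chrom_b = defaultdict(list), defaultdict(list)
    for chrom, start, end in peaks_a:
        by_chrom_a[chrom].append((start, end))
    for chrom, start, end in peaks_b:
        by_chrom_b[chrom].append((start, end))
    total = 0
    for chrom in set(by_chrom_a) | set(by_chrom_b):
        a, b = by_chrom_a[chrom], by_chrom_b[chrom]
        total += 2 * covered(a + b) - covered(a) - covered(b)
    return total


# Toy peak sets (invented values, not from the data).
sample_1 = [("chr1", 0, 10)]
sample_2 = [("chr1", 5, 15), ("chr2", 0, 3)]
distance = hamming(sample_1, sample_2)
```

In the toy example, chr1 contributes the 10 positions covered by exactly one of the two samples and chr2 contributes 3, so the cost of each pairwise distance scales with the number of peaks rather than with the genome length.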

Fig 5. Matrix of Hamming distances.

Matrix of Hamming distances dij between samples i and j. This matrix is used for the downstream analysis.

Fig 6. Examples of clustering dendrograms.

Hierarchical clustering obtained by Ward’s method with parameters $(M_{\mathrm{cut}}, p_G) = (2000, 10^{-2})$ (A) and $(80000, 10^{-2})$ (B).

Optimization of hierarchical clustering toward cell-type classification

By using the methods explained above, we can obtain a clustering dendrogram that depends on $(M_{\mathrm{cut}}, p_G)$. We then need to systematically determine the best clustering, which is the clustering closest to the “perfectly classified dendrogram” where each set $S_\nu$ of all samples with type $\nu \in T$ coincides with an offspring set. This condition can be restated as an optimization problem by introducing a cost function, the “penalty”, that measures the performance of the clustering as follows.

Concretely, to quantitatively evaluate the obtained dendrogram for each combination of $(M_{\mathrm{cut}}, p_G)$, we define the type penalty $\lambda_\nu$ for a given cell type $\nu \in T$. Tracing the dendrogram up from the bottom, consider the first cluster in which all samples of cell type $\nu$ meet; the type penalty $\lambda_\nu$ is the number of samples of other cell types contained in that cluster (Fig 7). Additionally, we define the global penalty $\lambda := \sum_{\nu \in T} \lambda_\nu$ as the “cost function” of the optimization. Note that $\lambda \ge 0$, and a “perfectly classified dendrogram” gives $\lambda = 0$. (See Materials and methods for details of the penalty.)

Fig 7. Schema of penalty score calculation.

Note that this dendrogram is constructed from artificial data to explain how to calculate the penalty, though we use the same labels such as HSC1. This dendrogram has six leaves, and three of them are classified as type HSC. To explain details of this dendrogram, we freely use the symbols and definitions in Materials and methods in this caption. We can see that $\tau(\mathrm{HSC}) = 10$. The corresponding node is $n_{10}$ (displayed by the blue dot), and the corresponding cluster $C_{10}$ is the set {HSC1, HSC2, HSC3, MPP} (surrounded by the blue dashed line). Among the elements of $C_{10}$, one leaf, MPP, is not of type HSC, but the three others are. Hence, the type penalty of HSC in this figure is computed as $\lambda_{\mathrm{HSC}} = 4 - 3 = 1$.
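The penalty calculation described in this caption can be sketched in a few lines. The nested-tuple tree below is an invented stand-in for the artificial dendrogram of Fig 7: only the cluster {HSC1, HSC2, HSC3, MPP} comes from the caption, while the two CMP leaves, the exact branching, and all function names are assumptions.

```python
def leaves(node):
    """Return the leaf labels below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def smallest_cluster(node, targets):
    """Leaves of the lowest node whose cluster contains every target sample."""
    if not isinstance(node, str):
        for child in node:
            if targets <= set(leaves(child)):
                return smallest_cluster(child, targets)
    return leaves(node)


def type_penalties(tree, cell_type):
    """Type penalty per cell type: cluster size minus number of own samples."""
    result = {}
    for nu in sorted(set(cell_type.values())):
        members = {s for s, t in cell_type.items() if t == nu}
        cluster = smallest_cluster(tree, members)
        result[nu] = len(cluster) - len(members)
    return result


# Hypothetical six-leaf dendrogram written as nested tuples.
tree = ((("HSC1", "HSC2"), ("HSC3", "MPP")), ("CMP1", "CMP2"))
cell_type = {
    "HSC1": "HSC", "HSC2": "HSC", "HSC3": "HSC",
    "MPP": "MPP", "CMP1": "CMP", "CMP2": "CMP",
}

penalties = type_penalties(tree, cell_type)
global_penalty = sum(penalties.values())
```

For this toy tree the HSC samples first meet in a four-leaf cluster that also contains MPP, reproducing the type penalty 4 − 3 = 1 from the caption, and the other two types contribute nothing to the global penalty.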

Determination of the best parameters for the optimization

As mentioned above, the optimization problem we have to solve is to find $(M_{\mathrm{cut}}^*, p_G^*)$ that minimizes the cost function $\lambda(M_{\mathrm{cut}}, p_G)$. The schematic workflow of our algorithm is shown in Fig 8.

Fig 8. Schematic workflow of our algorithm.

See Materials and methods for details.

First we took into account all the peaks by setting $M_{\mathrm{cut}} = \infty$ and checked how the dendrograms and $\lambda(\infty, p_G)$ depended on $p_G$, as shown in Fig 9. From the trend observed in this parameter search, we concluded that $1.5 \le -\log_{10} p_G^* \le 4$.

Fig 9. Global penalty without cutoff of reads.

Global penalty λ(Mcut = ∞, pG) obtained by Ward’s method.

We then sought the best parameters to optimize the dendrograms and found that $(M_{\mathrm{cut}}^*, p_G^*)$ was close to $(64000, 10^{-2})$, which gave the smallest penalty $\lambda$ at our search resolution, as shown in Figs 10 and 11. Note that 64000 is the midpoint of the values (60000, 62000, 64000, 66000, 68000), all of which give the same minimum penalty at our search resolution. Hereafter, to investigate the properties of the best clustering, we set $(M_{\mathrm{cut}}^*, p_G^*) = (64000, 10^{-2})$. At our search resolution, the increment of $M_{\mathrm{cut}}$ was 2000 near $M_{\mathrm{cut}} = 64000$. A finer resolution might give a better estimate of the optimized value $(M_{\mathrm{cut}}^*, p_G^*)$, but at a higher computational cost. Even then, the following procedures are operationally unchanged.

Fig 10. Penalty with cutoff of reads.

The distribution of global penalty λ (A) and type penalty λν for each cell type ν (B) along with Mcut with parameter pG = 10−2 by Ward’s method.

Fig 11. Our best clustering dendrogram.

Hierarchical clustering obtained by Ward’s method with (Mcut, pG) = (64000, 10−2).

The minimum penalty achieved at $(M_{\mathrm{cut}}^*, p_G^*)$ was 18. This minimum was smaller than the penalty value of 27 for the clustering of the data from GSE74912_ATACseq_All_Counts.txt in [7]. The procedure for the latter clustering was as follows. First we performed a quantile normalization of the read counts in the distal elements (> 1000 bp away from a transcription start site (TSS)). Then we calculated the Pearson correlation coefficients between all pairs of samples, leading to a distance matrix where each entry is 1 − (Pearson coefficient). By using Ward’s method, we finally obtained the clustering dendrogram. Note that for this case, Ward’s method gives penalty $\lambda = 27$ and UPGMA gives $\lambda = 29$.

Computational cost of the algorithm

As explained above, after obtaining data of the reads positions, we perform the MACS2 algorithm to get peak regions, and then finally we produce a hierarchical clustering. Here we consider the computational cost of our algorithm after acquiring the data of the reads positions and until acquiring a distance matrix to produce the hierarchical clustering. Note that the computational cost of the MACS2 algorithm is not more than O(Ns), where O() is the Landau notation and Ns is the total number of samples. We consider two situations. (i) One is the case where new samples to analyze are given. (ii) The other is the case where one new sample to analyze is added to the already analyzed samples, for which peak regions and the distance matrix are already calculated. For case (ii), we use the symbol Ns to write the total number of already analyzed samples. We claim that the computational cost of our algorithm is significantly lower than that of a previous method using target regions merged over samples [7] for large values of Ns for case (ii) and, in our case with Ns = 77, that the computational cost of our algorithm is practically lower for case (i).

Specifically, in case (i) for our algorithm, the corresponding computational cost is $K_1 M_{\mathrm{cut}} N_s^2$, which comes solely from the calculation of the Hamming distance. In case (ii), the corresponding computational cost is $K_2 M_{\mathrm{cut}} N_s$, which also comes solely from the calculation of the Hamming distance. Note that $K_1$ and $K_2$ are constants that do not depend on $M_{\mathrm{cut}}$ or $N_s$.

In the context of estimating the best optimization parameter $M_{\mathrm{cut}}^*$ by using $M_m$ different values for $M_{\mathrm{cut}}$, the computational cost becomes $K_1 M_{\mathrm{cut}} M_m N_s^2$ for case (i) and $K_2 M_{\mathrm{cut}} M_m N_s$ for case (ii), where $M_m$ does not depend on $N_s$ or genome size $L$ and can be adjusted according to the search resolution of the optimization. Note that $K_1$ and $K_2$ do not depend on $M_m$. In addition, we optimize $p_G$ over $M_p$ different values. Since this optimization can be done for any algorithm, we do not take this cost into account when comparing different algorithms. Typically, we set $(M_m, M_p) \simeq (30, 10)$ in our optimization corresponding to case (i). Note that in the section “Application to leukemic cells” discussed later, corresponding to case (ii), we use the optimized parameters $(M_{\mathrm{cut}}, p_G) = (M_{\mathrm{cut}}^*, p_G^*)$, leading to $(M_m, M_p) = (1, 1)$.

The previous method using target regions merged over samples in [7] includes (a) the merging of reads before peak calling and (b) calculating the distance matrix by the Pearson coefficients, which automatically depend on $N_s$. Thus, for a given number $N_{\mathrm{new}}$ of unanalyzed samples, the computational cost corresponding to processes (a) and (b) is at least $K_r N_r N_{\mathrm{new}} + K_L L_1 N_s^2$, where $N_r$ is the minimum number of reads over all samples, and $L_1$ is the number of target regions merged over all samples. The first term comes from counting the reads and the second term comes from calculating the distance matrix. Note that $K_r$ is a constant that does not depend on $N_r$ or $N_{\mathrm{new}}$, and $K_L$ is a constant that does not depend on $L_1$ or $N_s$. This form of the computational cost, $K_r N_r N_{\mathrm{new}} + K_L L_1 N_s^2$, is the same for case (i) with $N_{\mathrm{new}} = N_s$ and case (ii) with $N_{\mathrm{new}} = 1$, leading to the conclusion that the computational cost of our algorithm is significantly lower than that of the previous method, especially for case (ii) with sufficiently large $N_s$. We do not have exact estimates of the coefficients $K_1$, $K_2$, $K_r$, $K_L$, but because $N_r = 3265006 \gg M_{\mathrm{cut}}^*$ and $L_1 = 590650 \gg M_{\mathrm{cut}}^*$ in our case, $K_r N_r N_{\mathrm{new}} + K_L L_1 N_s^2$ could be costly compared to $K_1 M_{\mathrm{cut}} N_s^2$. In practice, even in case (i) with $N_s = 77$, we found numerically that the computational cost of our algorithm is lower because, unlike [7], it does not merge reads.
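To make the scaling argument concrete, the short sketch below evaluates only the constant-free factors of the two cost expressions for case (i), using the values quoted in the text. It says nothing about actual run time, since the coefficients $K_1$, $K_r$, and $K_L$ are unknown; the function and variable names are assumptions.

```python
# Values quoted in the text.
N_S = 77            # number of samples
M_CUT = 64000       # optimized peak cutoff
N_R = 3265006       # minimum number of reads over all samples
L_1 = 590650        # number of target regions merged over all samples


def hamming_factor(m_cut, n_s):
    """Constant-free factor of K1 * Mcut * Ns^2 (case (i), this algorithm)."""
    return m_cut * n_s ** 2


def merged_region_factors(n_r, n_new, l_1, n_s):
    """Constant-free factors of Kr * Nr * Nnew and KL * L1 * Ns^2."""
    return n_r * n_new, l_1 * n_s ** 2


ours = hamming_factor(M_CUT, N_S)
read_term, distance_term = merged_region_factors(N_R, N_S, L_1, N_S)

# Both distance-matrix terms share the factor Ns^2, so their ratio is L1 / Mcut.
ratio = distance_term / ours
```

The ratio of the two distance-matrix factors is simply $L_1 / M_{\mathrm{cut}}^*$, roughly 9 here, and the merged-region approach additionally carries the read-counting term.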

How to relate the best parameters to genomic context

In order to understand why ATAC-seq data under the condition of (Mcut, pG) = (64000, 10−2) was well classified, we analyzed the properties of the peaks with higher rankings.

The result of the previous section suggested that the peaks $\{g_k\}_{k=1}^{M_{\mathrm{cut}}^*}$ with $M_{\mathrm{cut}}^* = 64000$ included key regions for characterizing cell types. Therefore, we investigated which functional genomic regions, such as promoters and enhancers, are predominantly associated with these top 64000 peaks.

Functional annotation of peaks depending on rank

To investigate which functional annotations on the genome overlap with the ATAC-seq peaks, we compared the top 80000 peaks in three cell types (HSC, B cells, and Mono) with the 15-state ChromHMM model data. Data on the biological functions of genomic regions for HSC, B cells, and Mono are available from an integrative analysis of 111 reference human epigenome datasets, where we used the data of E032 for B cells, E035 for HSC, and E029 for Mono (https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/) [17].

ATAC-seq peaks were ranked according to p-values and divided into groups consisting of 1000 peaks. Then we calculated the average ratio and the standard deviation for each of the 15 states over all samples in each cell type. For an explicit description, let us introduce a set of functional annotations, $W := \{W_y\}_{y=1}^{15}$, where $W_y$ is the set of regions on the genome, each of which corresponds to functional annotation $y$. We want to know how many of the peaks $k$ in every group of 1000 peaks belong to each functional annotation $y$. For this purpose, we define

$$E_x^y := \{\, x \le k < x + 1000 \mid \exists\, (\gamma_k, [\sigma, \epsilon]) \in W_y \ \text{such that}\ \sigma \le (\alpha_k + \beta_k)/2 \le \epsilon \,\},$$

where $g_k = (\gamma_k, \alpha_k, \beta_k)$ is the peak position. We computed $|E_x^y|/1000$ for $x \in \{1 + (j-1) \times 1000\}_{j=1}^{80}$, as shown in Fig 12. Note that we used the position of the peak center, $(\alpha_k + \beta_k)/2$, to annotate biological function.
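The quantity $|E_x^y|/1000$ can be computed by a direct scan over each block of ranked peaks. The sketch below assumes peaks are already sorted by ascending p-value and that annotation regions are given as `(chrom, sigma, epsilon)` tuples with inclusive bounds; the function name, the tuple layouts, and the toy values are illustrative assumptions, and a real analysis would likely use an interval index instead of a linear scan.

```python
def annotation_fraction(ranked_peaks, regions, x, block=1000):
    """Fraction of peaks with rank in [x, x + block) whose center lies in a region.

    ranked_peaks: list of (chrom, alpha, beta), ordered by rank (rank 1 first).
    regions: list of (chrom, sigma, epsilon) for one functional annotation y.
    """
    chunk = ranked_peaks[x - 1 : x - 1 + block]  # ranks are 1-based
    hits = 0
    for chrom, alpha, beta in chunk:
        center = (alpha + beta) / 2
        if any(c == chrom and sigma <= center <= eps for c, sigma, eps in regions):
            hits += 1
    return hits / block


# Toy data (invented values) with a block size of 4 instead of 1000.
ranked_peaks = [
    ("chr1", 100, 200),   # center 150, inside the first region
    ("chr1", 900, 1100),  # center 1000, outside
    ("chr2", 40, 60),     # center 50, inside the second region
    ("chr2", 500, 700),   # center 600, outside
    ("chr1", 120, 140),   # rank 5, belongs to the next block
]
active_tss = [("chr1", 0, 300), ("chr2", 0, 100)]

first_block = annotation_fraction(ranked_peaks, active_tss, x=1, block=4)
```

Only the peak center is tested, as in the text, so a wide peak that merely overlaps an annotated region at its edge is not counted.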

Fig 12. Functional annotations of peaks.

Percentage ($100 \times |E_x^y|/1000$) of functional annotations in every 1000 peaks for B cells (A), Mono (B), and HSC (C). Only the functional annotations that have maximum percentages ≥ 12%, $y \in \{\text{FlankingActiveTSS}, \text{ActiveTSS}, \text{Enhancers}, \text{Quiescent\_Low}\}$, are shown.

As shown in Fig 12, most of the peaks with higher rankings belonged to “Active TSS”, which was related to the promoters of active genes, but as the rank went down, the ratio of peaks from enhancer regions started to increase. As the rank went down further, the ratio of peaks from “quiescent-low” regions started to increase. The ratio of peaks from promoters and enhancers crossed at around peak rank 10000 and the ratio of peaks from enhancers and “quiescent-low” regions crossed at around peak rank 60000. Therefore, we concluded that the number around the 64000th peak is strongly related to the point that the contribution of “quiescent-low” regions to the Hamming distances exceeds the contribution of enhancer regions to the Hamming distances.

Note that the type penalty of HSC under the condition $(M_{\mathrm{cut}}^*, p_G^*)$ was not as good as that of B cells or Mono, and the functional annotation result of HSC did not show trends as clear as those of B cells and Mono (Fig 12C), which may partially explain the worse type penalty of HSC (Fig 10B).

Variations of hierarchical clustering methods

In general, when one performs data clustering, the effect of variations of the clustering algorithms and the effect of loss of data on the clustering output should be considered.

First we considered the dependence of the clustering results on the choice of clustering algorithm. Besides Ward’s method, which we have used so far, there are several hierarchical clustering methods including UPGMA (Unweighted Pair Group Method with Arithmetic Mean), WPGMA (Weighted Pair Group Method with Arithmetic Mean), UPGMC (Centroid Clustering or Unweighted Pair Group Method with Centroid Averaging), and WPGMC (Median Clustering or Weighted Pair Group Method with Centroid Averaging). We also performed the optimization with UPGMA, as shown in Fig 13, and found that the minimum value of the penalty is 36 with $M_{\mathrm{cut}} = 12000$. The other methods give worse results in general. Specifically, the minimum values of the penalty we found were 59 for WPGMA with $M_{\mathrm{cut}} = 20000$, 127 for UPGMC with $M_{\mathrm{cut}} = 30000$, and 149 for WPGMC with $M_{\mathrm{cut}} = 35000$. These results suggested that Ward’s method, which gave 18 as the minimum value of the penalty, was a better choice than the other methods for our purpose.

Fig 13. Penalty by UPGMA method.

The distribution of global penalty λ (A) and type penalty λν for each cell type ν (B) along with Mcut with parameter pG = 10−2 by using UPGMA.

Robustness of our best clustering against the loss of data

Regarding the loss of data, let us consider making new reads data $\hat{R}$ from original data $R$. Specifically, for a removal ratio $r$ with $0 \le r \le 1$, we removed $\lceil r N_r \rceil$ reads from $R$ uniformly at random, where $\lceil \chi \rceil$ means the minimum integer larger than or equal to $\chi$. Thus we can obtain $\hat{R} = \{R_i\}_{i=1}^{N_r - \lceil r N_r \rceil}$, where $R_i$ is one read in $R$. Using this procedure, we computed $\lambda$ for $(M_{\mathrm{cut}}^*, p_G^*) = (64000, 10^{-2})$. As shown in Fig 14B, when ratio $r$ was increased, the value of $\lambda$ was constant until $r = 0.007$ and gradually increased thereafter. In the region $r \ge 0.7$, $\lambda$ increased dramatically. Note that $r = 0$ gave $\lambda = 18$ and the highest possible value of $\lambda$ for 77 samples is 924. Thus, we concluded that for small $r$, the average penalty tended to be stably close to that of $r = 0$.
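The read-removal step and the quoted upper bound of 924 can both be checked with a short sketch. The function names and the `seed` parameter are assumptions. The bound is derived here under the assumption that the worst case for each type is a cluster containing all samples, so that each type penalty is at most the total number of samples minus the number of replicates of that type; the replicate counts are taken from Table 1.

```python
import math
import random


def remove_reads(reads, r, seed=None):
    """Remove ceil(r * N) reads uniformly at random, keeping the original order."""
    n_remove = math.ceil(r * len(reads))
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(reads)), len(reads) - n_remove))
    return [reads[i] for i in keep]


def max_global_penalty(replicates):
    """Upper bound on the global penalty: sum over types of (N_s - |S_nu|)."""
    n_s = sum(replicates.values())
    return sum(n_s - n for n in replicates.values())


# Replicate counts per cell type, from Table 1.
replicates = {
    "HSC": 7, "MPP": 6, "LMPP": 3, "CMP": 8, "MEP": 7, "GMP": 7, "CLP": 5,
    "NK": 6, "B": 4, "CD4+T": 5, "CD8+T": 5, "Mono": 6, "Ery": 8,
}

bound = max_global_penalty(replicates)          # 13 * 77 - 77
subsample = remove_reads(list(range(100)), 0.25, seed=0)
```

With 13 cell types and 77 samples in total, the bound evaluates to 13 × 77 − 77 = 924, in agreement with the value stated in the text.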

Fig 14. Robustness of penalty against the loss of reads data.

The effect of the loss of reads on the global penalty λ. Reads were removed randomly from the uniform distribution with probability r. Then global penalty λ was calculated with parameter Mcut = 30000 (A), Mcut = 64000 (B) or Mcut = 80000 (C). Each circle indicates one sample and each square indicates the average over samples at the same r value.

Further, we investigated $\lambda$ for values of $M_{\mathrm{cut}}$ other than 64000 to check the robustness of $M_{\mathrm{cut}}^*$ against random removal of reads. Specifically, we investigated the behavior of $\lambda$ by varying $r$ for $M_{\mathrm{cut}} = 30000$ and 80000 with $p_G^* = 10^{-2}$. The minimum value of $\lambda$ as a function of $r$ was 27 for $M_{\mathrm{cut}} = 30000$, located at $r = 0$ (Fig 14A), and 38 for $M_{\mathrm{cut}} = 80000$, again located at $r = 0$ (Fig 14C). Note that in the region $r \ge 0.08$, $\lambda$ for $M_{\mathrm{cut}} = 30000$ was smaller than $\lambda$ for $M_{\mathrm{cut}} = 64000$, which suggested that $M_{\mathrm{cut}}^*$ becomes less than 64000 when the data size is decreased.

Thus, for the present data size, we concluded that our algorithm was stable against small losses of the data and it could also work well by adjusting Mcut for losses of data up to 50 percent. The obtained results imply that when the given data size is increased, our algorithm becomes more stable or potentially achieves better clustering with a smaller penalty than our current best clustering.

Application to leukemic cells

To evaluate the practicality of our algorithm with the optimized parameters $(M_{\mathrm{cut}}^*, p_G^*)$ in cancer research, we analyzed three types of leukemia, CLL, AML, and ATL, by calculating Ward’s distance function, $H_{\mathrm{Ward}}(\zeta, S_\nu)$, between a given leukemia sample $\zeta$ and all samples $c \in S_\nu$ of cell type $\nu$. (See Materials and methods for details of $H_{\mathrm{Ward}}$.)

To separate normal and leukemic cells effectively, information about the cell surface markers was used. CLL is a disease that is characterized by the clonal proliferation of malignant B lymphocytes. Leukemic cells from CLL patients were purified by using the cell surface markers CD5 and CD19, which are commonly used as markers for CLL (Table 2) [18].

Table 2. Immunophenotypes of leukemic samples.

Immunophenotype of CLL [8]: Note that B cells are CD19+, as shown in Table 1. Immunophenotype of AML [7]: SSC-high means that the intensity of side scatter in the flow cytometry is high. Note that HSC, MPP, and LMPP are Lin-, CD34+, CD38- as shown in Table 1. Immunophenotype of ATL [24, 25]: Note that CD4+T cells are CD4+, as shown in Table 1.

Type of sample | Marker expression
CLL | CD19+, CD5+
AML pHSC | Lin-, CD34+, CD38-, TIM3-, CD99-
AML LSC | Lin-, CD34+, CD38-, TIM3+, CD99+
AML Blast | Non-LSC; CD45-Intermediate, SSC-High
ATL | CD4+, CADM1+

The AML samples analyzed in this study were divided into three stages, preleukemic HSC (pHSC), leukemia stem cells (LSC), and AML blasts by cell surface markers according to [7] (Table 2). Briefly summarizing these three types, HSC that acquired founder mutations become pHSC, which expand to generate preleukemic clones. The subsequent acquisition of progressor mutations creates LSC, which can self-renew and produce AML blasts [19]. It has been reported that mature LSC populations more closely resemble normal GMP, and immature LSC populations are functionally similar to LMPP [20]. A recent study has revealed that CD99-positive cells are almost entirely composed of LMPP-like cells in the sense of Ref. [21]. Thus, the LSC used in our study, which are CD99-positive, can be presumed to be LMPP-like LSC.

Human T-cell leukemia virus type 1 (HTLV-1) is a causative agent of ATL and HTLV-1-associated myelopathy/tropical spastic paraparesis (HAM/TSP) [22]. ATL has been subclassified into four clinical subtypes: acute, lymphoma, chronic, and smoldering. The chronic and smoldering subtypes are considered indolent, while patients with the acute or lymphoma subtype generally have a poor prognosis. HTLV-1 can infect a variety of cell types, but more than 90% of infected cells are CD4+ memory T cells in vivo [23]. In order to specifically separate HTLV-1-infected cells from other normal T-cells, Cell adhesion molecule 1 (CADM1/TSLC1) is used because of its sensitivity and specificity [24, 25]. Thus, in this study, to purify leukemic cells (HTLV-1 infected cells) from the peripheral blood mononuclear cells (PBMC) of ATL patients, we used the cell surface markers shown in Table 2.

The objective of our analysis using leukemic samples was to evaluate which type of hematopoietic cell is closest to a given leukemic sample at the chromatin level. Specifically, we added the ATAC-seq data of a leukemic sample to healthy hematopoietic ATAC-seq data and calculated the Hamming distances with $(M_{\mathrm{cut}}^*, p_G^*) = (64000, 10^{-2})$. We computed $H_{\mathrm{Ward}}(\zeta, S_\nu)$ as the distance between cell type $\nu \in T$ and leukemic sample $\zeta$; in this case, sample $\zeta$ was extracted from one patient.

We define the $q$-th closest cell type of sample $\zeta$ as the type $\nu_\zeta(q) \in T$ that gives the $q$-th smallest value of $H_{\mathrm{Ward}}(\zeta, S_\nu)$ over $\nu$. Using this quantity, we define the rank gap between a given reference type $T_0 \in T$ and sample $\zeta$ as

$$G_{T_0,\zeta} = q - 1,$$

such that $T_0 = \nu_\zeta(q)$. In particular, we call $\nu_\zeta(1)$ the closest type of sample $\zeta$. Note that rank gap $G_{T_0,\zeta} = 0$ holds when $T_0 = \nu_\zeta(1)$. Thus, we not only identified the closest cell type, but also the second closest, third closest, and so on, and quantified the difference between the characterization results of our algorithm and a given type as the “rank gap”.
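The rank gap reduces to a sort over cell types. In the minimal sketch below, the distance values are invented placeholders (the actual $H_{\mathrm{Ward}}$ values are defined in Materials and methods and are not reproduced here), and the function names are assumptions; the example mirrors a case in which the reference type MPP is the second closest type.

```python
def ranked_types(distance_by_type):
    """Cell types ordered from closest to farthest (smallest distance first)."""
    return sorted(distance_by_type, key=distance_by_type.get)


def rank_gap(distance_by_type, reference_type):
    """Rank gap G = q - 1, where the reference type is the q-th closest type."""
    return ranked_types(distance_by_type).index(reference_type)


# Hypothetical distances between one leukemic sample and a few cell types.
distances = {"HSC": 5200.0, "MPP": 4100.0, "LMPP": 3900.0, "GMP": 6000.0}

closest = ranked_types(distances)[0]
gap_to_mpp = rank_gap(distances, "MPP")
```

A rank gap of 0 means the reference type is itself the closest type, so the gap directly measures how far down the ordering the reference assignment falls.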

As shown in Table 3, by calculating the Hamming distance between each CLL sample and a set of hematopoietic cells, we found that the closest cell type for all CLL samples was B cells, which coincides well with the characteristics of CLL cell surface markers. This result led us to conjecture that our method could infer the cell type of a given leukemic cell characterized by immunophenotypes using only its ATAC-seq data.

Table 3. Classification of ATAC-seq data of CLL samples.

“Closest cell type” computed by our algorithm.

Sample name ($\zeta$) | SRR number | Type consistent with surface markers | “Closest cell type” calculated by our algorithm ($\nu_\zeta(1)$)
CLL1 | SRR6762820 | B | B
CLL2 | SRR6762844 | B | B
CLL3 | SRR6762861 | B | B
CLL4 | SRR6762895 | B | B
CLL5 | SRR6762925 | B | B
CLL6 | SRR6762952 | B | B
CLL7 | SRR6762968 | B | B

To assess the applicability of our method to leukemia whose cell of origin is not uniform and which shows high heterogeneity between cases, we analyzed AML samples [7]. The results of our analysis for pHSC and Blast agreed substantially with those of a previous study [7]: the closest cell types coincided for 12 of 16 pHSC samples and 13 of 18 Blast samples, as shown in Table 4. For LSC, however, our results differed from those in [7]: most LSC samples were closest to LMPP using our algorithm, but to GMP in [7]. As mentioned above, the LSC used in the present study were CD99-positive and are presumed to be composed of LMPP-like cells, which suggests that our characterization based on the Hamming distance infers the cell type accurately, though further investigation is required.

Table 4. Classification of ATAC-seq data of AML samples.

Comparison between the “closest normal cell” in Fig 6i of [7] and the “closest cell type” computed by our algorithm. Our algorithm also identified the second, third, and subsequent “closest cell types”. The “rank gap” quantifies the difference between the results of the two analytical methods. For example, the “closest normal cell” of sample SU351-pHSC is MPP in [7] but LMPP by our algorithm, and MPP was the second “closest cell type”; thus, the rank gap is 2 − 1 = 1. If the results of the two methods coincide, the rank gap is 0.

Sample name ($\zeta$) | SRR number | “Closest normal cell” ($T_0$) in Fig 6i of Ref. [7] | “Closest cell type” calculated by our algorithm ($\nu_\zeta(1)$) | Rank gap ($G_{T_0,\zeta}$)
SU654-pHSC | SRR2920595 | MPP | MPP | 0
SU353-pHSC | SRR2920571 | MPP | MPP | 0
SU351-pHSC | SRR2920568 | MPP | LMPP | 1
SU209-pHSC1 | SRR2920564 | GMP | MPP | 4
SU209-pHSC2 | SRR2920562 | GMP | GMP | 0
SU209-pHSC3 | SRR2920561 | GMP | GMP | 0
SU070-pHSC1 | SRR2920557 | HSC | MPP | 1
SU070-pHSC2 | SRR2920556 | HSC | HSC | 0
SU048-pHSC | SRR2920552 | MPP | MPP | 0
SU583-pHSC1 | SRR2920588 | GMP | LMPP | 2
SU583-pHSC2 | SRR2920587 | GMP | GMP | 0
SU575-pHSC | SRR2920584 | MPP | MPP | 0
SU501-pHSC | SRR2920581 | MPP | MPP | 0
SU496-pHSC | SRR2920579 | MPP | MPP | 0
SU484-pHSC | SRR2920576 | MPP | MPP | 0
SU444-pHSC | SRR2920574 | MPP | MPP | 0
SU654-LSC | SRR2920594 | LMPP | LMPP | 0
SU583-LSC | SRR2920586 | GMP | LMPP | 1
SU575-LSC | SRR2920583 | GMP | LMPP | 2
SU496-LSC | SRR2920578 | GMP | GMP | 0
SU444-LSC | SRR2920573 | GMP | LMPP | 1
SU353-LSC | SRR2920570 | GMP | LMPP | 1
SU209-LSC | SRR2920559 | GMP | LMPP | 1
SU070-LSC | SRR2920555 | GMP | LMPP | 1
SU654-Blast | SRR2920593 | GMP | LMPP | 1
SU444-Blast | SRR2920572 | Mono | Mono | 0
SU353-Blast | SRR2920569 | GMP | GMP | 0
SU351-Blast | SRR2920567 | Mono | GMP | 1
SU209-Blast | SRR2920558 | GMP | GMP | 0
SU070-Blast1 | SRR2920554 | Mono | Mono | 0
SU070-Blast2 | SRR2920553 | Mono | Mono | 0
SU048-Blast1 | SRR2920551 | GMP | GMP | 0
SU048-Blast2 | SRR2920550 | GMP | Mono | 1
SU048-Blast3 | SRR2920549 | GMP | GMP | 0
SU048-Blast4 | SRR2920548 | GMP | Mono | 1
SU048-Blast5 | SRR2920547 | GMP | GMP | 0
SU048-Blast6 | SRR2920546 | GMP | GMP | 0
SU583-Blast | SRR2920585 | GMP | GMP | 0
SU575-Blast | SRR2920582 | GMP | LMPP | 1
SU501-Blast | SRR2920580 | Mono | Mono | 0
SU496-Blast | SRR2920577 | GMP | GMP | 0
SU484-Blast | SRR2920575 | Mono | Mono | 0

Finally, we analyzed ATL samples (see Materials and methods for details of sample preparation). When we calculated the Hamming distance between each ATL sample and a set of hematopoietic cells, we found that the closest cell type for two ATL samples was Mono (hereafter we term these samples “Mono-like ATL”), while that of the other samples was CD4+T, as shown in Table 5. Surprisingly, the two Mono-like ATL samples were both categorized as chronic-type ATL. Since CD14 is the marker of Mono (Table 1), we investigated the CD14 gene expression pattern in CD4+T, Mono, and ATL samples. Specifically, we calculated the ratio of the CD14 read count to the CD4 read count from RNA-seq data and found that the two Mono-like ATL samples exhibited the highest values among all ATL samples (Fig 15). These results led us to conjecture that our algorithm could infer the cell phenotype, potentially including clinical subtypes, using only ATAC-seq data. However, we need to analyze more samples to validate this conjecture.

Table 5. Classification of ATAC-seq data of ATL samples.

Clinical subtypes of ATL samples and “closest cell type” computed by our algorithm.

Sample name ($\zeta$) | DRR number | Clinical subtype | “Closest cell type” calculated by our algorithm ($\nu_\zeta(1)$)
ATL1 | DRR250710 | Acute | CD4+T
ATL2 | DRR250711 | Acute | CD4+T
ATL3 | DRR250712 | Acute | CD4+T
ATL4 | DRR250713 | Acute | CD4+T
ATL5 | DRR250714 | Acute | CD4+T
ATL6 | DRR250715 | Chronic | Mono
ATL7 | DRR250716 | Chronic | Mono

Fig 15. Comparison of RNA-seq data among CD4+T, Mono and ATL samples.

The ratio of the CD14 read count to the CD4 read count from RNA-seq data of CD4+T, Mono, and ATL samples.

Discussion

In this paper, we presented a new algorithm that systematically clusters epigenomic data using the Hamming distance, which enabled us to find optimal parameters of the data reduction for cell-type classification. This algorithm has a clear advantage in computational cost over a previous method that uses targeted regions merged over samples [7]. In particular, when new samples are added to the analysis, we only have to calculate the distances for the newly appearing pairs of samples, not those between preexisting samples. In this situation, the computational cost of the presented algorithm is significantly lower than that of the previous method, which merges targeted regions. Furthermore, this algorithm was found to effectively detect the closest cell type of a leukemic sample, with the results being broadly consistent with the characterization of leukemic samples by cell surface markers or RNA-seq. Thus, the developed algorithm could serve as a screening tool for the phenotype of a leukemia sample, taking the ATAC-seq data of the sample as input.
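The point about adding new samples can be illustrated with a small cache of pairwise distances. The sketch below is only an illustration under stated assumptions: `extend_distances` and `toy_distance` are hypothetical names, and the toy distance stands in for the Hamming distance between binarized samples.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, List


def extend_distances(
    cache: Dict[FrozenSet[Hashable], float],
    known: Iterable[Hashable],
    new: Iterable[Hashable],
    dist: Callable[[Hashable, Hashable], float],
) -> List[Hashable]:
    """Add to `cache` only the distances for pairs that involve a new sample."""
    known = list(known)
    for sample in new:
        for other in known:
            # Pairs among preexisting samples are never recomputed.
            cache[frozenset((sample, other))] = dist(sample, other)
        known.append(sample)
    return known


calls = []  # records every distance evaluation, to make the cost visible


def toy_distance(a, b):
    calls.append((a, b))
    return abs(a - b)


cache = {}
samples = extend_distances(cache, [], [0, 3, 7], toy_distance)   # 3 pairs
calls_before = len(calls)
samples = extend_distances(cache, samples, [12], toy_distance)   # 3 new pairs only
```

Adding one sample to three existing ones triggers three distance evaluations rather than six, because the three preexisting pairs stay in the cache.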

As a next step, we need to investigate whether our algorithm is robust across other existing methods and data. For example, for the same hematopoietic cell data, we replaced the Hamming distance with the Dice coefficient, which has been used in the CODEX project [26] to quantify the differences between two samples, but found that the results with $p_G = 10^{-2}$ were not improved in terms of the penalty. We also compared our algorithm with DiffBind [27], which is commonly used as a ChIP-seq differential analysis tool, but again found that DiffBind with its default settings did not give a better clustering result. Other existing methods and data remain to be checked in the future.
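For readers unfamiliar with the Dice coefficient, the following sketch contrasts it with the Hamming distance on two toy binary sequences. How the Dice coefficient was turned into a dissimilarity for clustering is not stated in the text, so the code only computes the standard coefficient; the function names and the convention for two all-closed sequences are assumptions.

```python
from typing import Sequence


def hamming(b1: Sequence[int], b2: Sequence[int]) -> int:
    """Number of positions where the two binary sequences differ."""
    return sum(h1 != h2 for h1, h2 in zip(b1, b2))


def dice_coefficient(b1: Sequence[int], b2: Sequence[int]) -> float:
    """Standard Dice coefficient 2|A∩B| / (|A| + |B|) on open positions."""
    open1, open2 = sum(b1), sum(b2)
    both = sum(1 for h1, h2 in zip(b1, b2) if h1 and h2)
    if open1 + open2 == 0:
        return 1.0  # two all-closed sequences: treated as identical (assumption)
    return 2.0 * both / (open1 + open2)


a = [1, 1, 0, 0, 1]
b = [1, 0, 1, 0, 1]
d_hamming = hamming(a, b)          # positions 1 and 2 differ
d_dice = dice_coefficient(a, b)    # 2 * 2 / (3 + 3)
```

For binary sequences the two quantities are related by Dice = 1 − H / (|A| + |B|), so the Dice coefficient can be read as a Hamming distance normalized by the total number of open positions in the two samples.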

A unique point of our constructed algorithm is that we only used ATAC-seq data without gene expression data. Our analysis suggests that ATAC-seq data itself contains enough information to determine cell types even in the absence of regional annotation data such as promoters or enhancers. This feature implies that our algorithm reveals elusive epigenomic properties that significantly affect the phenotype of cell types. Another advantage of our algorithm is that we do not assume a strong property for the statistics of the reads data, which is otherwise implicitly assumed when quantile normalization is performed. Instead of using the strong assumption, we took a data-driven approach for the normalization of the reads data, where we pre-analyzed the statistics of the reads data before performing any normalization.

Finally, our algorithm could extend its application to leukemic samples whose properties are uncertain. We also expect that our whole approach with slight modifications will be applicable to other epigenetic sequencing data such as ChIP-seq and bisulfite sequencing available, for example, from The International Human Epigenome Consortium (https://epigenomesportal.ca/ihec/), ROADMAP Epigenomics (http://www.roadmapepigenomics.org/) and many other resources, whose target regions for the analysis are not uniform between samples.

Materials and methods

Ethics statement

Experiments using clinical samples were conducted according to the principles expressed in the Declaration of Helsinki and approved by the Institutional Review Board of Kyoto University (permit numbers G310 and G204). ATL patients provided written informed consent for the collection of samples and subsequent analysis.

Sequencing sample preparation

ATL patient PBMCs were thawed and washed with PBS containing 0.1% BSA. To discriminate dead cells, we used the LIVE/DEAD Fixable Dead Cell Stain Kit (Invitrogen). For cell surface staining, cells were stained with APC anti-human CD4 (clone: RPA-T4) (BioLegend) and anti-SynCAM (TSLC1/CADM1) mAb-FITC (MBL) antibodies for 30 minutes at 4°C followed by a wash with PBS. HTLV-1-infected cells (CADM1+ and CD4+) were sort-purified with FACS Canto (Beckman Coulter) to reach 98–99% purity. Data were analyzed with FlowJo software (Treestar). Soon after sorting, 10,000–50,000 HTLV-1-infected cells were centrifuged and used for ATAC-seq as previously described [5]. Total RNA was isolated from the remaining cells using the RNeasy Mini Kit (Qiagen). Library preparation and high-throughput sequencing were performed by Macrogen Inc. (Seoul, Korea). The diagnosis and classification of clinical subtypes of ATL followed previously described criteria [28]. 77 ATAC-seq datasets from 13 human primary blood cell types and datasets from 42 AML patients were obtained from the Gene Expression Omnibus (GEO) with accession number GSE74912 [7]. ATAC-seq datasets from 7 CLL patients were obtained from GSE111015 [18], and RNA-seq datasets of CD4+T and Mono cells were obtained from GSE74246 [7].

Sequencing data analysis

ATAC-seq reads were aligned using BWA version 0.7.16a [29] with default parameters. SAMtools [30] was used to convert SAM files to compressed BAM files and to sort the BAM files by chromosome coordinates. PICARD software (v1.119) (http://broadinstitute.github.io/picard/) was then used to remove PCR duplicates with the MarkDuplicates tool. Reads with mapping quality scores less than 30 were removed from the BAM files. For peak calling, MACS2 (v2.1.2) software was used [15]. RNA-seq data were aligned to the human reference genome hg19 using STAR 2.6.0c [31] with the --quantMode GeneCounts option. Normalization was not performed, and only the raw read counts of CD14 and CD4 were used in this study.

Principles of data reduction

When we analyze preprocessed ATAC-seq data $\hat{P}$, we have to account for biases caused by the fact that the number of reads, $N_r$, depends on the sample preparation settings and on the sequencers used (see S1 Appendix for the explicit construction of $\hat{P}$). Normalization is done to remove such biases.

A conventional way to perform normalization is to use quantile normalization, where the distribution of the reads number on certain regions in the DNA is assumed to be the same for all samples [12, 13]. However, there is no strong reason to support this assumption, particularly for sample sets of different cell types. Furthermore, under this assumption, there is a risk that we overlook important differences between different cell types. Therefore, in this paper, we do not assume this property.
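To make the assumption behind quantile normalization concrete, here is a minimal textbook-style sketch (not taken from the paper or from any specific package). The function name is hypothetical, and ties within a sample are not handled specially.

```python
from typing import List


def quantile_normalize(samples: List[List[float]]) -> List[List[float]]:
    """Force every sample to share the same distribution of values.

    `samples[s][i]` is the read count of sample s on region i.
    """
    n_regions = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Mean of the r-th smallest value across samples.
    rank_means = [
        sum(col[r] for col in sorted_samples) / len(samples)
        for r in range(n_regions)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n_regions), key=lambda i: s[i])
        out = [0.0] * n_regions
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]  # replace each value by its rank mean
        normalized.append(out)
    return normalized


toy_counts = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
toy_normalized = quantile_normalize(toy_counts)
```

After normalization the sorted values of every sample are identical by construction, which is exactly the property that the text argues may hide genuine differences between cell types.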

An alternative way to perform normalization is to reduce the data into a simple binary value $h_{\gamma,x} \in \{0, 1\}$ on each genomic position $(\gamma, x)$, where $h_{\gamma,x}$ depends on the data size $N_r$ as little as possible. For example, one could interpret the states $h_{\gamma,x} = 1$ and $h_{\gamma,x} = 0$ as an “open” and a “closed” chromatin status, respectively, on genomic position $(\gamma, x)$.

In this direction, our ultimate purpose is to look for the “best” principle that determines two states for hγ,x, by which a set of samples including different cell types are completely classified into groups of the same cell type. We use no information about cell types when determining the value of hγ,x, because we would like to have an algorithm that can be applied without knowing the cell types.

Peak-calling with ranking

Currently we do not have the best solution to properly determine two effective states for hγ,x. As a candidate to approach the best solution, we use the MACS2 algorithm, which was originally invented to analyze ChIP-seq data [15] but is now widely used to estimate the location of open chromatin regions from ATAC-seq data [32, 33].

We would like to find the set of positions $(\gamma, x)$ where the number of reads overlapping with position $(\gamma, x)$, $Y_{\gamma,x}(\hat{P})$, is relatively high in the neighborhood of $(\gamma, x)$. The MACS2 algorithm is likely to detect those positions from the data of the reads described by $\hat{P}$. In our calculation, we use the MACS2 (v2.1.2) callpeak command with option “--nomodel --nolambda --keep-dup all -p $p_G$”, where $p_G$ is the p-value cutoff used as a parameter of peak inference (for details, see [15]).
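As a sketch of how the command line above might be assembled from a script, the helper below builds the argument list for one sample. The options `--nomodel`, `--nolambda`, `--keep-dup all`, and `-p` come from the text; the helper name, the file paths, and the use of `-t`, `-n`, and `--outdir` to name the input and output are assumptions for illustration.

```python
from typing import List


def macs2_callpeak_command(
    bam_path: str, sample_name: str, p_g: float, out_dir: str = "peaks"
) -> List[str]:
    """Build a MACS2 callpeak argument list with the options quoted in the text."""
    return [
        "macs2", "callpeak",
        "-t", bam_path,          # treatment file (input path is hypothetical)
        "-n", sample_name,       # prefix of the output files
        "--outdir", out_dir,
        "--nomodel", "--nolambda",
        "--keep-dup", "all",
        "-p", str(p_g),          # p-value cutoff p_G
    ]


cmd = macs2_callpeak_command("sample1.filtered.bam", "sample1", 0.01)
```

The list can then be passed to `subprocess.run(cmd, check=True)` on a machine where MACS2 is installed; it is not executed here.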

By applying MACS2 to the input ATAC-seq data, we obtain the following output data structure:

  • The label $\gamma_k \in X$ of the chromosome on which the $k$-th peak lies, with start position $1 \le \alpha_k \le L_{\gamma_k}$ and end position $1 \le \beta_k \le L_{\gamma_k}$, for $1 \le k \le M$ (here $M$ is the number of peaks). We call $g_k = (\gamma_k, \alpha_k, \beta_k)$ the $k$-th peak region.

  • For each $g_k$, a p-value $p_k$ with $p_k \le p_G$ is associated with the $k$-th peak. Note that MACS2 outputs $\log_{10}(1/p_k) = -\log_{10} p_k$ instead of $p_k$.

$X$ and $L_\gamma$ are the set of all chromosomes and the length of chromosome $\gamma$, respectively (see S1 Appendix for details of the notation). We define $A$ as

$$A := \{(g_k, p_k)\}_{k=1}^{M}, \qquad g_k := (\gamma_k, \alpha_k, \beta_k).$$

By reordering the index $k$, we can set $p_k \le p_{k'}$ for any $k < k'$ without loss of information.
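The construction of the ranked list $A$ can be sketched by reading a MACS2 narrowPeak file, whose eighth column stores $-\log_{10} p_k$. This is an illustrative reader rather than the authors' code: the `Peak` type and function name are hypothetical, and coordinates are kept exactly as written in the file, without converting between coordinate conventions.

```python
import io
from typing import List, NamedTuple, TextIO


class Peak(NamedTuple):
    chrom: str      # gamma_k
    start: int      # alpha_k, as written in the file
    end: int        # beta_k, as written in the file
    p_value: float  # p_k recovered from -log10(p_k)


def read_ranked_peaks(handle: TextIO) -> List[Peak]:
    """Read narrowPeak lines and return peaks ordered by increasing p-value."""
    peaks = []
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        neg_log10_p = float(fields[7])  # narrowPeak column 8: -log10(p-value)
        peaks.append(
            Peak(fields[0], int(fields[1]), int(fields[2]), 10.0 ** (-neg_log10_p))
        )
    peaks.sort(key=lambda g: g.p_value)  # rank k = 1 is the smallest p-value
    return peaks


# Two made-up narrowPeak records.
toy_narrowpeak = (
    "chr1\t100\t300\tpeak_a\t50\t.\t5.0\t3.0\t1.0\t80\n"
    "chr2\t500\t900\tpeak_b\t90\t.\t9.0\t12.0\t9.0\t150\n"
)
ranked = read_ranked_peaks(io.StringIO(toy_narrowpeak))
```

In this toy input the second record has the larger $-\log_{10} p$ value, so it is ranked first.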

In Fig 2, we show the distribution of the peak width $|\beta_k - \alpha_k|$ versus ranking $k$. Note that $g_k$ with high $p_k$ could be affected significantly by the conditions of the experiments including sequencing, because the data above rank value 40000 unnaturally touches the lower limit of width 200, which is predetermined by the MACS2 algorithm. Thus, there is a possibility that peaks with higher p-values could strongly depend on both the inference algorithm and the number of reads $N_r$. Those peaks would presumably not contribute to the detection of cell phenotypes. This observation suggests we should remove peaks with higher p-values, as mentioned in Results.

Parameterized binarization by cutting off low-ranked peaks

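Next we reconsidered how to alleviate biases in the data by introducing a threshold number $M_{\mathrm{cut}}$, such that

$$\bar{A}(M_{\mathrm{cut}}) := \{g_k\}_{k=1}^{M_{\mathrm{cut}}},$$

which leads to the removal of $\{g_k\}_{k=M_{\mathrm{cut}}+1}^{M}$ as a candidate for the normalization of the ATAC-seq data. Note that $\bar{A}(M_{\mathrm{cut}} = \infty) = \{g_k \mid (g_k, p_k) \in A\}$. Then, by using $\bar{A}$, we may introduce a binary sequence

$$B := \{h_{\gamma,x}\}_{\gamma \in X,\ 1 \le x \le L_\gamma},$$

such that $h_{\gamma,x} = 1$ if there is $k$ satisfying $\gamma_k = \gamma$ and $\alpha_k \le x \le \beta_k$ with $g_k \in \bar{A}$; otherwise $h_{\gamma,x} = 0$, as shown in Fig 4.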

pG and Mcut can be regarded as parameters for determining the value of hγ,x within the MACS2 algorithm and what part of the data is taken into account, respectively. Thus, our task under the principle above turns out to be how to determine a proper set of (Mcut,pG) for the cell-type classification.
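The binarization with a cut-off can be sketched on a toy genome as follows. The sketch assumes the peaks are already ranked by increasing p-value and use 1-based inclusive coordinates that fit within the chromosome; the function name and data layout are hypothetical, and a per-position array is only practical for small examples.

```python
from typing import Dict, List, Tuple

PeakRegion = Tuple[str, int, int]  # (chromosome, start, end), 1-based and inclusive


def binarize(
    ranked_peaks: List[PeakRegion],
    chrom_lengths: Dict[str, int],
    m_cut: int,
) -> Dict[str, bytearray]:
    """Set h = 1 on positions covered by the top `m_cut` peaks, 0 elsewhere."""
    binary = {chrom: bytearray(length) for chrom, length in chrom_lengths.items()}
    for chrom, start, end in ranked_peaks[:m_cut]:  # drop peaks ranked below m_cut
        binary[chrom][start - 1:end] = b"\x01" * (end - start + 1)
    return binary


toy_lengths = {"chrA": 10}
toy_peaks = [("chrA", 2, 4), ("chrA", 8, 9)]  # already ordered by p-value
top_one = binarize(toy_peaks, toy_lengths, m_cut=1)
top_two = binarize(toy_peaks, toy_lengths, m_cut=2)
```

With `m_cut=1` only positions 2 to 4 are open; with `m_cut=2` positions 8 and 9 are opened as well.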

Hamming distance

The Hamming distance is often used to compare two binary sequences in information theory (see Section 13 in [14]) and is equal to the number of positions on which two symbols have different values. See Fig 3 for an illustrative explanation.

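The Hamming distance between two binary sequences $B^{c_1}$ and $B^{c_2}$ with $c_1, c_2 \in S$ is defined as

$$H(B^{c_1}, B^{c_2}) := \sum_{\gamma \in X} \sum_{1 \le x \le L_\gamma} \delta\bigl(h^{c_1}_{\gamma,x}, h^{c_2}_{\gamma,x}\bigr),$$

where we define

$$\delta\bigl(h^{c_1}_{\gamma,x}, h^{c_2}_{\gamma,x}\bigr) := \begin{cases} 1 & (h^{c_1}_{\gamma,x} \ne h^{c_2}_{\gamma,x}) \\ 0 & (h^{c_1}_{\gamma,x} = h^{c_2}_{\gamma,x}). \end{cases}$$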
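The definition above translates directly into a position-by-position count. The sketch below assumes each sample is stored as a mapping from chromosome label to a binary sequence of that chromosome's length; this layout and the function name are illustrative choices, and for whole genomes an interval-based computation would be far more economical.

```python
from typing import Dict, Sequence


def hamming_distance(
    b1: Dict[str, Sequence[int]], b2: Dict[str, Sequence[int]]
) -> int:
    """Sum over chromosomes and positions of delta(h1, h2)."""
    if b1.keys() != b2.keys():
        raise ValueError("samples must cover the same chromosomes")
    total = 0
    for chrom in b1:
        if len(b1[chrom]) != len(b2[chrom]):
            raise ValueError(f"length mismatch on {chrom}")
        # delta is 1 where the two states differ and 0 where they agree.
        total += sum(h1 != h2 for h1, h2 in zip(b1[chrom], b2[chrom]))
    return total


sample_a = {"chrA": [0, 1, 1, 0], "chrB": [1, 0]}
sample_b = {"chrA": [0, 1, 0, 1], "chrB": [1, 1]}
distance_ab = hamming_distance(sample_a, sample_b)
```

Here two positions differ on the toy chromosome `chrA` and one on `chrB`, so the distance is 3.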

Algorithm of hierarchical clustering

In this and the next subsection, we recall algorithms for agglomerative hierarchical clusterings and drawing dendrograms. We use two methods, UPGMA and Ward’s. Though they are described in many textbooks (for example, see Chapter 4 in [34]), we need the description in order to define the global penalty and the type penalty. Our description of the algorithms follows [16].

To describe the algorithms, we define two distance functions between two subsets $C_1, C_2 \subset S$ as follows (for inductive definitions and other distance functions, see Section 4.2 in [34]). One distance function, $H_{\mathrm{UPGMA}}$, comes from the UPGMA method and is defined as the average of all the distances between samples in $C_1$ and $C_2$. Equivalently, we define

$$H_{\mathrm{UPGMA}}(C_1, C_2) := \frac{1}{|C_1||C_2|} \sum_{c_1 \in C_1} \sum_{c_2 \in C_2} H(B^{c_1}, B^{c_2}).$$

If $C_1$ or $C_2$ is empty, we set $H_{\mathrm{UPGMA}}(C_1, C_2) = 0$.

Another choice of the distance function, HWard, comes from Ward’s method and is defined as

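$$H_{\mathrm{Ward}}(C_1, C_2) := \sqrt{\frac{D_{1,2}}{|C_1| + |C_2|} - \frac{|C_2|\, D_1}{|C_1|\,(|C_1| + |C_2|)} - \frac{|C_1|\, D_2}{|C_2|\,(|C_1| + |C_2|)}}$$

where we define

$$D_1 := \frac{1}{2} \sum_{c_1 \in C_1} \sum_{c_2 \in C_1} H(B^{c_1}, B^{c_2})^2, \quad D_2 := \frac{1}{2} \sum_{c_1 \in C_2} \sum_{c_2 \in C_2} H(B^{c_1}, B^{c_2})^2, \quad D_{1,2} := \sum_{c_1 \in C_1} \sum_{c_2 \in C_2} H(B^{c_1}, B^{c_2})^2.$$

Again, if $C_1$ or $C_2$ is empty, we set $H_{\mathrm{Ward}}(C_1, C_2) = 0$.

In the following, we fix $H(C_1, C_2)$ as $H_{\mathrm{UPGMA}}$ or $H_{\mathrm{Ward}}$. We sometimes identify a sample $c \in S$ with the subset $\{c\}$ consisting of the single element $c$. For example, we write $H(C_1, c_2)$ for $H(C_1, \{c_2\})$. Note that $H(\{c_1\}, \{c_2\}) = H(c_1, c_2) = K\, H(B^{c_1}, B^{c_2})$, where $K = 1$ for $H = H_{\mathrm{UPGMA}}$ and $K = 2^{-1/2}$ for $H = H_{\mathrm{Ward}}$ by definition.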
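The two cluster distances can be sketched as plain functions of a pairwise sample distance. The code is an illustration of the definitions above, not the authors' implementation; the names are hypothetical, the toy distance replaces the Hamming distance, and the `max(..., 0.0)` guard against small negative rounding errors is an added assumption.

```python
import math
from typing import Callable, Sequence

Distance = Callable[[int, int], float]


def h_upgma(c1: Sequence[int], c2: Sequence[int], dist: Distance) -> float:
    """Average of all pairwise distances between the two clusters."""
    if not c1 or not c2:
        return 0.0
    total = sum(dist(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def h_ward(c1: Sequence[int], c2: Sequence[int], dist: Distance) -> float:
    """Ward distance built from squared pairwise distances (D1, D2, D12)."""
    if not c1 or not c2:
        return 0.0
    n1, n2 = len(c1), len(c2)
    d1 = 0.5 * sum(dist(a, b) ** 2 for a in c1 for b in c1)
    d2 = 0.5 * sum(dist(a, b) ** 2 for a in c2 for b in c2)
    d12 = sum(dist(a, b) ** 2 for a in c1 for b in c2)
    inside = (
        d12 / (n1 + n2)
        - n2 * d1 / (n1 * (n1 + n2))
        - n1 * d2 / (n2 * (n1 + n2))
    )
    return math.sqrt(max(inside, 0.0))  # guard against rounding (assumption)


def toy_distance(a: int, b: int) -> float:
    """Toy stand-in for H: samples are integers on a line."""
    return float(abs(a - b))


single_pair = h_ward([0], [5], toy_distance)   # 5 / sqrt(2), i.e. K = 2**-0.5
average = h_upgma([0, 2], [5], toy_distance)   # (5 + 3) / 2
```

For two single-sample clusters the Ward value equals $2^{-1/2}$ times the sample distance, which matches the constant $K$ stated in the text.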

We define a cluster as a subset $C$ of $S$ with a specified order of elements. Hierarchical clustering is an algorithm that constructs a set $M_{N_s}$ of clusters and orders the elements in $S$ to draw dendrograms.

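  1. We set $C_\tau := \{\tau\}$ for $1 \le \tau \le N_s$. We do not consider the order of the elements in $C_\tau$ because they are sets of a single element.

  2. We define the list of uncombined clusters as $L_1 := \{C_1, C_2, \ldots, C_{N_s}\}$ and set the historical list of clusters as $M_1 = L_1$.

  3. At the $t$-th step ($1 \le t \le N_s - 1$), we define $C_{t+N_s}$, $L_{t+1}$, and $M_{t+1}$ inductively.
    • (a)
      We look up the pair $C_{\tau'}$ and $C_{\tau''}$ with $\tau' < \tau''$ in $L_t$ such that their distance is a minimum; that is,
      $$H(C_{\tau'}, C_{\tau''}) = \min_{C, C' \in L_t,\ C \ne C'} H(C, C').$$
      Note that $1 \le \tau' < \tau'' < t + N_s$ by construction. We consider only the case when the pair is uniquely determined.
    • (b)
      We define a new cluster $C_{t+N_s} = C_{\tau'} \cup C_{\tau''}$. If the elements of $C_{\tau'}$ are ordered as $c'_1, c'_2, \ldots, c'_{z'}$ and the elements of $C_{\tau''}$ are ordered as $c''_1, c''_2, \ldots, c''_{z''}$, then the elements of $C_{t+N_s}$ are ordered as
      $$c'_1, c'_2, \ldots, c'_{z'}, c''_1, c''_2, \ldots, c''_{z''}.$$
    • (c)
      We define
      $$L_{t+1} := (L_t \setminus \{C_{\tau'}, C_{\tau''}\}) \cup \{C_{t+N_s}\}, \qquad M_{t+1} := M_t \cup \{C_{t+N_s}\}.$$
      If $t < N_s - 1$, go to the $(t + 1)$-th step.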

We can easily see that, if we do not consider the ordering, we have $C_{2N_s-1} = S$ as a set. Thus we finally obtain a list of $2N_s - 1$ clusters $M_{N_s} = \{C_1, C_2, \ldots, C_{2N_s-1}\}$ and an ordering of all elements of $S$ from $C_{2N_s-1}$.
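The steps above can be sketched as a naive agglomerative loop. This is a direct, unoptimized illustration (efficient variants are described in [16]); the function names are hypothetical, UPGMA is used as the default cluster distance, and ties, which the text excludes, are broken here by the smallest cluster indices.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Distance = Callable[[int, int], float]


def upgma(c1: Sequence[int], c2: Sequence[int], dist: Distance) -> float:
    """Average pairwise distance between two non-empty clusters."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def agglomerate(
    n_samples: int, dist: Distance, linkage=upgma
) -> Tuple[Dict[int, List[int]], List[Tuple[int, int, float]]]:
    """Return all 2*Ns - 1 ordered clusters and the list of merges."""
    clusters = {tau: [tau] for tau in range(1, n_samples + 1)}  # step 1
    active = sorted(clusters)                                   # L_1 (step 2)
    merges = []
    for t in range(1, n_samples):                               # step 3
        # (a) closest pair (tau', tau'') with tau' < tau'' among active clusters
        height, lo, hi = min(
            (linkage(clusters[a], clusters[b], dist), a, b)
            for a, b in combinations(sorted(active), 2)
        )
        # (b) new cluster keeps the order: elements of C_tau' then C_tau''
        new = t + n_samples
        clusters[new] = clusters[lo] + clusters[hi]
        # (c) update the list of uncombined clusters
        active = [tau for tau in active if tau not in (lo, hi)] + [new]
        merges.append((lo, hi, height))
    return clusters, merges


# Four toy samples placed on a line (made-up positions).
positions = {1: 0.0, 2: 1.0, 3: 10.0, 4: 12.0}
toy_clusters, toy_merges = agglomerate(
    4, lambda i, j: abs(positions[i] - positions[j])
)
```

In this toy run samples 1 and 2 merge first, then 3 and 4, and the last cluster $C_7$ contains all four samples in the order 1, 2, 3, 4.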

How to draw dendrograms

The (rooted) dendrogram displays how our clustering combines pairs of clusters and the distance of the pairs. In the following, we explain an algorithm that introduces new symbols. For details, see [16].

  1. If sample $\tau \in S$ appears in the ordering of $C_{2N_s-1}$ as the $a_\tau$-th element, then we associate the point $n_\tau = (a_\tau, 0)$ in two-dimensional coordinate space with cluster $C_\tau$. We call the point $n_\tau$ the leaf corresponding to $C_\tau$.

  2. For $1 \le t \le N_s - 1$, we inductively associate a point $n_{t+N_s}$ with cluster $C_{t+N_s}$. If $C_{t+N_s}$ is constructed as the union of $C_{\tau'}$ and $C_{\tau''}$ with $1 \le \tau' < \tau'' < t + N_s$, we associate with $C_{t+N_s}$ the node
    $$n_{t+N_s} = \left(a_{t+N_s} = \frac{a_{\tau'} + a_{\tau''}}{2},\ H(C_{\tau'}, C_{\tau''})\right).$$
    Note that $C_{\tau'}$ and $C_{\tau''}$ are uniquely determined. We call $n_{t+N_s}$ the node associated with the $(t + N_s)$-th cluster $C_{t+N_s}$.
  3. We connect $n_{t+N_s}$ with $n_{\tau'}$ and $n_{\tau''}$.

Since each node or leaf $n$ corresponds to a cluster $C$, we can define the offspring set $B_n$ of $n$ as the set $C$ without ordering. Graphically, the offspring set of node $n$ is the set of samples corresponding to leaves branched from node $n$, as displayed in Fig 7. This intuitive explanation is justified, since the y-coordinate of the “mother node” $n_{t+N_s}$ is larger than or equal to those of the “child nodes” $n_{\tau'}$, $n_{\tau''}$ if we use Ward’s method or UPGMA. Note that there are many choices to draw dendrograms; for example, at any branching node, we can exchange two branches without any essential change in the data structure.
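The coordinates of leaves and nodes can be computed from the list of merges alone. The sketch below assumes the merges are given as triples (τ′, τ″, height) in the order they were performed, with samples labeled 1 to $N_s$; this input format and the function name are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def dendrogram_nodes(
    n_samples: int, merges: List[Tuple[int, int, float]]
) -> Dict[int, Tuple[float, float]]:
    """Return the point (a, height) of every leaf and node."""
    # Rebuild the ordered clusters to recover the final leaf ordering.
    members = {tau: [tau] for tau in range(1, n_samples + 1)}
    for t, (lo, hi, _) in enumerate(merges, start=1):
        members[t + n_samples] = members[lo] + members[hi]
    order = members[2 * n_samples - 1]

    # Leaves sit at height 0, at their position in the final ordering.
    nodes = {tau: (float(order.index(tau) + 1), 0.0) for tau in order}
    # Each node is centered between its two children, at the merge distance.
    for t, (lo, hi, height) in enumerate(merges, start=1):
        nodes[t + n_samples] = ((nodes[lo][0] + nodes[hi][0]) / 2.0, height)
    return nodes


# Merges of a toy clustering with four samples (made-up heights).
toy_merge_list = [(1, 2, 1.0), (3, 4, 2.0), (5, 6, 10.5)]
toy_nodes = dendrogram_nodes(4, toy_merge_list)
```

Drawing the dendrogram then amounts to connecting each node to its two children, for instance with line segments in any plotting library.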

Global penalty as a cost function

In this section, we discuss the global penalty, a quantity that measures how the obtained hierarchical clustering differs from our knowledge of cell type classifications. We also give examples displaying the computation of the penalties and extreme situations that represent the theoretical bounds of the penalties. Note that these examples are just for explanation and not obtained from actual data.

In our settings, each sample is classified by type in advance. Explicitly, the set $T$ consists of thirteen types:

$$T = \{\text{B}, \text{CD4+T}, \text{CD8+T}, \text{CLP}, \text{CMP}, \text{Ery}, \text{GMP}, \text{HSC}, \text{LMPP}, \text{MEP}, \text{Mono}, \text{MPP}, \text{NK}\}.$$

For each type $\nu \in T$, we denote the set of samples classified as type $\nu$ by $S_\nu$. This set could be empty, though it is not in our case. For every pair $\nu, \nu'$ of distinct types, $S_\nu$ and $S_{\nu'}$ have no common elements, and the union of $S_\nu$ over all types $\nu \in T$ coincides with $S$. Equivalently,

$$S = \bigsqcup_{\nu \in T} S_\nu.$$

For a given hierarchical clustering constructed in the manner of the previous section, the type penalty for type $\nu$ is the quantity $\lambda_\nu$ defined as follows. If $S_\nu$ is empty, we set $\lambda_\nu = 0$. Otherwise, since the cluster grows step by step, there is the minimum $\tau$ with $1 \le \tau \le 2N_s - 1$ such that $S_\nu \subset C_\tau$. We denote this minimum $\tau$ by $\tau(\nu)$. Then we define $\lambda_\nu$ as the number of elements in $C_{\tau(\nu)}$ that are not of type $\nu$. In other words, we set

$$\lambda_\nu := |C_{\tau(\nu)}| - |S_\nu|.$$

Since $C_{\tau(\nu)}$ includes all elements of type $\nu$, we find $\lambda_\nu \ge 0$. Also, since $C_{\tau(\nu)}$ is a subset of $S$, we find $\lambda_\nu \le |S| - |S_\nu|$. Thus we have

$$0 \le \lambda_\nu \le |S| - |S_\nu|.$$

(See Fig 7 for an explanation of type penalties).

For a given hierarchical clustering, the global penalty $\lambda$ is defined to be the total sum of type penalties,

$$\lambda := \sum_{\nu \in T} \lambda_\nu.$$

$\lambda$ is bounded as

$$0 \le \lambda \le \sum_{\nu \in T} \bigl(|S| - |S_\nu|\bigr) = (|T| - 1) \cdot |S|. \tag{1}$$

In our case, since $|T| = 13$ and $|S| = 77$, we have $0 \le \lambda \le (13 - 1) \cdot 77 = 924$. Note that for a certain class of trees, these upper and lower bounds are not achieved. Fig 16 displays examples that attain the lower and upper bounds.
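For completeness, the equality on the right-hand side of Eq (1) can be spelled out as follows; it uses only the fact, stated above, that the sets $S_\nu$ are pairwise disjoint and that their union is $S$.

```latex
\begin{align*}
\sum_{\nu \in T} \bigl(|S| - |S_\nu|\bigr)
  &= |T| \cdot |S| - \sum_{\nu \in T} |S_\nu| \\
  &= |T| \cdot |S| - |S|
   && \text{(the $S_\nu$ partition $S$)} \\
  &= (|T| - 1) \cdot |S| .
\end{align*}
% With |T| = 13 and |S| = 77: 13 * 77 - 77 = 1001 - 77 = 924.
```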

Fig 16. Examples of dendrograms with extreme penalties.

Note that these dendrograms are constructed using artificial data to explain how to calculate the penalty, though we use the same labels such as Mono1. Both dendrograms have six leaves ($|S| = 6$) that are classified into three types ($|T| = 3$ in these examples). (A) This example gives the lowest global penalty, 0. (B) In this example, we have $\tau(\text{CD4+T}) = \tau(\text{CD8+T}) = \tau(\text{NK}) = 11$. Since the corresponding cluster $C_{11}$ is the whole set $S$, the type penalty is $6 - 2 = 4$ for each type, and the global penalty is $4 \times 3 = 12$. This result attains the upper bound $(|T| - 1) \cdot |S| = (3 - 1) \cdot 6 = 12$ in Eq (1).

Further, we write $\lambda(M_{\mathrm{cut}}, p_G)$ for $\lambda$ to emphasize that $\lambda$ depends on $(M_{\mathrm{cut}}, p_G)$. Note that $C_{\tau(\nu)}$ is equal to $B_{n_{\tau(\nu)}}$, which was defined in the previous section.
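The type penalty and global penalty can be computed from the list of clusters in index order. The sketch below is illustrative: the function names and the two six-sample examples are made up in the spirit of Fig 16, not taken from the figure itself, and only types that actually occur among the samples are evaluated (an empty type contributes 0 by definition).

```python
from typing import Dict, List, Sequence


def type_penalties(
    clusters: Sequence[Sequence[int]], sample_types: Dict[int, str]
) -> Dict[str, int]:
    """Type penalty of each type; `clusters` lists C_1, ..., C_{2Ns-1} in order."""
    penalties = {}
    for nu in sorted(set(sample_types.values())):
        s_nu = {c for c, t in sample_types.items() if t == nu}
        for cluster in clusters:
            if s_nu <= set(cluster):  # first cluster that contains all of S_nu
                penalties[nu] = len(cluster) - len(s_nu)
                break
    return penalties


def global_penalty(
    clusters: Sequence[Sequence[int]], sample_types: Dict[int, str]
) -> int:
    """Sum of the type penalties."""
    return sum(type_penalties(clusters, sample_types).values())


toy_types = {1: "CD4+T", 2: "CD4+T", 3: "CD8+T", 4: "CD8+T", 5: "NK", 6: "NK"}
leaves: List[List[int]] = [[i] for i in range(1, 7)]
# Every type forms its own cluster before being mixed with other types.
best_case = leaves + [[1, 2], [3, 4], [5, 6], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6]]
# No type is complete until the final cluster C_11.
worst_case = leaves + [[1, 3], [1, 3, 5], [2, 4], [2, 4, 6], [1, 3, 5, 2, 4, 6]]
penalty_best = global_penalty(best_case, toy_types)
penalty_worst = global_penalty(worst_case, toy_types)
```

The first example attains the lower bound 0 and the second attains the upper bound $(3 - 1) \cdot 6 = 12$ of Eq (1).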

Supporting information

S1 Appendix. Additional details of sequencing analysis.

(PDF)

Acknowledgments

The authors thank P. Karagiannis for valuable comments and proofreading of this manuscript. They also thank MACS Program at Graduate School of Science Kyoto University which allowed this collaboration to be carried out.

Data Availability

All ATAC-seq and RNA-seq data needed to reproduce this study have been deposited at the DNA Data Bank of Japan (DDBJ) under accession number DRA010939. The source code is available from https://github.com/tanakanishi/findclosest.

Funding Statement

This research was supported by JSPS KAKENHI Grant Numbers JP19K16740 (AT), JP18J40119 (AT), JP19H03689 (MM), JP20H03514 (JiY), and by Japan Agency for Medical Research and Development (AMED) Grant Numbers JP20fk0108088h0002 (MM), JP17km0405207h0002 (AF), JP18km0405207S0103 (AF), and by a grant from the Naito Foundation (AT). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Klemm SL, Shipony Z, Greenleaf WJ. Chromatin accessibility and the regulatory epigenome. Nature Reviews Genetics. 2019;20:207–220. doi: 10.1038/s41576-018-0089-8
  • 2. Gaspar-Maia A, Alajem A, Meshorer E, Ramalho-Santos M. Open chromatin in pluripotency and reprogramming. Nature Reviews Molecular Cell Biology. 2011;12:36–47. doi: 10.1038/nrm3077
  • 3. John S, Sabo PJ, Thurman RE, Sung MH, Biddie SC, Johnson TA, et al. Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nature Genetics. 2011;43:264–268. doi: 10.1038/ng.759
  • 4. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. The accessible chromatin landscape of the human genome. Nature. 2012;489:75–82. doi: 10.1038/nature11232
  • 5. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods. 2013;10:1213–1218. doi: 10.1038/nmeth.2688
  • 6. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature. 2015;523:486–490. doi: 10.1038/nature14590
  • 7. Corces MR, Buenrostro JD, Wu B, Greenside PG, Chan SM, Koenig JL, et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nature Genetics. 2016;48:1193–1203. doi: 10.1038/ng.3646
  • 8. Rendeiro AF, Schmidl C, Strefford JC, Walewska R, Davis Z, Farlik M, et al. Chromatin accessibility maps of chronic lymphocytic leukaemia identify subtype-specific epigenome signatures and transcription regulatory networks. Nature Communications. 2016;7:11938. doi: 10.1038/ncomms11938
  • 9. Qu K, Zaba LC, Satpathy AT, Giresi PG, Li R, Jin Y, et al. Chromatin Accessibility Landscape of Cutaneous T Cell Lymphoma and Dynamic Response to HDAC Inhibitors. Cancer Cell. 2017;32:27–41.e4. doi: 10.1016/j.ccell.2017.05.008
  • 10. Tu S, Shao Z. An introduction to computational tools for differential binding analysis with ChIP-seq data. Quantitative Biology. 2017;5(3):226–235. doi: 10.1007/s40484-017-0111-8
  • 11. Yan F, Powell DR, Curtis DJ, Wong NC. From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis. Genome Biology. 2020;21(1):22. doi: 10.1186/s13059-020-1929-3
  • 12. Meyer SU, Pfaffl MW, Ulbrich SE. Normalization strategies for microRNA profiling experiments: A ‘normal’ way to a hidden layer of complexity? Biotechnology Letters. 2010;32:1777–1788. doi: 10.1007/s10529-010-0380-z
  • 13. Hicks SC, Irizarry RA. quantro: A data-driven approach to guide the choice of an appropriate normalization method. Genome Biology. 2015;16:1–8. doi: 10.1186/s13059-015-0679-0
  • 14. MacKay DJC. Information theory, inference and learning algorithms. Cambridge University Press; 2003.
  • 15. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biology. 2008;9:R137. doi: 10.1186/gb-2008-9-9-r137
  • 16. Müllner D. Modern hierarchical, agglomerative clustering algorithms. arXiv:1109.2378 [Preprint]. 2011. Available from: https://arxiv.org/abs/1109.2378
  • 17. Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nature Methods. 2012;9:215–216. doi: 10.1038/nmeth.1906
  • 18. Rendeiro AF, Krausgruber T, Fortelny N, Zhao F, Penz T, Farlik M, et al. Chromatin mapping and single-cell immune profiling define the temporal dynamics of ibrutinib response in CLL. Nature Communications. 2020;11:1–14. doi: 10.1038/s41467-019-14081-6
  • 19. Döhner H, Weisdorf DJ, Bloomfield CD. Acute myeloid leukemia. The New England Journal of Medicine. 2015;373:1136–1152.
  • 20. Goardon N, Marchi E, Atzberger A, Quek L, Schuh A, Soneji S, et al. Coexistence of LMPP-like and GMP-like leukemia stem cells in acute myeloid leukemia. Cancer Cell. 2011;19:138–152. doi: 10.1016/j.ccr.2010.12.012
  • 21. Chung SS, Eng WS, Hu W, Khalaj M, Garrett-Bakelman FE, Tavakkoli M, et al. CD99 is a therapeutic target on disease stem cells in myeloid malignancies. Science Translational Medicine. 2017;9(374). doi: 10.1126/scitranslmed.aaj2025
  • 22. Matsuoka M, Jeang KT. Human T-cell leukaemia virus type 1 (HTLV-1) infectivity and cellular transformation. Nature Reviews Cancer. 2007;7:270–280. doi: 10.1038/nrc2111
  • 23. Richardson JH, Edwards AJ, Cruickshank JK, Rudge P, Dalgleish AG. In vivo cellular tropism of human T-cell leukemia virus type 1. Journal of Virology. 1990;64:5682–5687. doi: 10.1128/JVI.64.11.5682-5687.1990
  • 24. Manivannan K, Rowan AG, Tanaka Y, Taylor GP, Bangham CR. CADM1/TSLC1 Identifies HTLV-1-Infected Cells and Determines Their Susceptibility to CTL-Mediated Lysis. PLoS Pathogens. 2016;12:1–18. doi: 10.1371/journal.ppat.1005560
  • 25. Nakahata S, Saito Y, Marutsuka K, Hidaka T, Maeda K, Hatakeyama K, et al. Clinical significance of CADM1/TSLC1/IgSF4 expression in adult T-cell leukemia/lymphoma. Leukemia. 2012;26:1238–1246. doi: 10.1038/leu.2011.379
  • 26. Sánchez-Castillo M, Ruau D, Wilkinson AC, Ng FS, Hannah R, Diamanti E, et al. CODEX: A next-generation sequencing experiment database for the haematopoietic and embryonic stem cell communities. Nucleic Acids Research. 2015;43(D1):D1117–D1123.
  • 27. Ross-Innes CS, Stark R, Teschendorff AE, Holmes KA, Ali HR, Dunning MJ, et al. Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature. 2012;481(7381):389–393. doi: 10.1038/nature10730
  • 28. Shimoyama M. Diagnostic criteria and classification of clinical subtypes of adult T-cell leukaemia-lymphoma. A report from the Lymphoma Study Group (1984-87). British Journal of Haematology. 1991; 3:428–437.
  • 29. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [Preprint]. 2013. Available from: https://arxiv.org/abs/1303.3997?upload=1
  • 30. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009; 25:2078–207.
  • 31. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.
  • 32. Corces MR, Granja JM, Shams S, Louie BH, Seoane JA, Zhou W, et al. The chromatin accessibility landscape of primary human cancers. Science. 2018;362:1–58. doi: 10.1126/science.aav1898
  • 33. Denny SK, Yang D, Chuang CH, Brady JJ, Lim JS, Grüner BM, et al. Nfib Promotes Metastasis through a Widespread Increase in Chromatin Accessibility. Cell. 2016;166:328–342. doi: 10.1016/j.cell.2016.05.052
  • 34. Everitt BS, Landau S, Leese M, Stahl D. Cluster Analysis, 5th Edition. Wiley Series in Probability and Statistics. John Wiley & Sons, Ltd.; 2011.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008422.r002

Decision Letter 0

Jason A Papin, Avner Schlessinger

29 Jun 2020

Dear Dr. Tanaka,

Thank you very much for submitting your manuscript "Systematic clustering algorithm for epigenetic data and its application to hematopoietic cells" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Avner Schlessinger

Associate Editor

PLOS Computational Biology

Jason Papin

Editor-in-Chief

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Dear Colleagues,

I read with interest your manuscript on identifying cell or tissue type using ATAC-Seq. In it you present a simple algorithm to compute a distance between samples. The text itself is very well written.

Following on the comments in the previous round of reviews, I have two major questions:

- What exactly is the advance over prior art? While it is true that there are few ATAC-Seq specific differential analysis pipelines, a number of ChIP-Seq differential analysis tools such as HOMER, DBChIP and DiffBind are evidently commonly used (1). In fact, a quick search reveals that there are roughly a dozen such methods (2) that can be used to cluster samples based on ChIP-Seq measurements. Have you considered benchmarking your approach against these tools?

- Why is the speed of calculation an issue? Are existing methods too slow? Even assuming a large number of datasets to be compared, computing a matrix of pairwise comparisons is embarrassingly parallel; hence the term N_s is not necessarily a concern in practice.

Minor points:

- l85. What does the length of the peaks have to do with data reduction? If you store each peak as a triple (\gamma, \alpha, \beta) (as in l.73), then the difference between \beta and \alpha has no bearing on information content.

- l131. Do you have evidence that your algorithm is significantly faster than the approach described in Corces et al.? A difference in computational complexity (as derived below) is not a proof of speed, just of asymptotic behaviour.

- l151. I don't understand what the comparison of N_r and L_1 to M*_cut brings to the complexity analysis. The Landau notation is about the trend of a function, and the constants within the O(...) have no bearing on the actual run time, nor on the asymptotic behaviour.

- l173. Are you applying the same M_cut to all samples? Would this not affect results on cell types that have significantly more/fewer accessible regions than others?

Sincerely,

Daniel Zerbino

(1) https://link.springer.com/article/10.1186/s13059-020-1929-3#Fig4

(2) https://hbctraining.github.io/Intro-to-ChIPseq/lessons/08_diffbind_differential_peaks.html
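
Reviewer #1's second major question rests on the observation that each entry of a pairwise distance matrix can be computed independently of all the others. A minimal Python sketch of that structure is given here. The function names, the toy 0/1 vectors, and the use of a thread pool are assumptions made for illustration; none of them is taken from the manuscript or its source code.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length 0/1 sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(samples, max_workers=4):
    """Return {(i, j): distance} for all i < j.

    Every pair is an independent task, so the work can be split across
    workers without any communication between them.
    """
    pairs = list(combinations(range(len(samples)), 2))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        dists = list(
            pool.map(lambda p: hamming(samples[p[0]], samples[p[1]]), pairs)
        )
    return dict(zip(pairs, dists))


# Hypothetical binarized samples (1 = open region, 0 = closed region).
samples = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
distances = pairwise_distances(samples)
```

For CPU-bound work in Python, a process pool or a cluster scheduler would be the more usual choice; the thread pool is used here only to keep the sketch short.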

Reviewer #2: Major comments:

1) Some inconsistent concepts are used throughout the manuscript. In the title, the authors claim it is a “clustering algorithm”. However, in the abstract and the main text, the authors describe it as a ‘data reduction method’, and elsewhere it is explained as a ‘classification method’. These concepts are all essentially different, and mixing them up makes the manuscript confusing to read. In particular, ‘clustering algorithm’ and ‘classification’ are contradictory. Apparently, the authors used known labels to optimize the loss function or classify samples, so it should not be considered an unsupervised clustering method. The authors need to be careful, consistent, and clear about the method description.

2) For the parameter optimization, instead of using grid search, the authors first narrowed down the range of PG by setting Mcut to infinity and then determined the best parameter pair within that range. But this can be problematic since both parameters contribute to the loss function simultaneously. In other words, the optimum might fall outside of the PG range [1.5, 4], so the best parameters determined this way might not be optimal, especially since no clear correlation between PG and Mcut was observed in Figure 9 and Figure 10. The authors need to further justify the parameter optimization.

3) Can the determined ‘best parameters’ be generalized to other cases? It seems different linkage criteria may already result in different parameter solutions, which means the parameter optimization is somewhat sensitive. How should the parameters be decided when no labels are given? Also, in the application to leukemic cells, how were the parameters determined?

4) The authors claim that the proposed algorithm works better than quantile normalization; however, this manuscript lacks a stringent comparison between these two strategies. In lines 166–167, although the authors made a basic comparison, the details of the comparison are not described. Are both methods using the same regions? Do the clustering solutions have the same number of clusters? Does the linkage criterion ‘Ward’ give the best performance for quantile normalization? Instead of hierarchical clustering, would another clustering method suit quantile normalization better?

5) How to determine the best parameters for cell-type classification is a critical part for the proposed algorithm. That being so, it makes more sense to put the section of ‘Computational cost of the algorithm’ after “Determination of the best parameters for the best cell-type classification”.

6) Instead of overstating the method as a systematic clustering algorithm for ‘epigenetic data’, it would be better to be more specific since in this manuscript, the algorithm is mainly designed for ATAC-seq and has only been tested on ATAC-seq data.

7) The authors should consider re-organizing the figures and tables (e.g., there is no need to separate Tables 2–4; they can all be merged into one table) to highlight the primary results and improve readability. Many of them can go to supplementary figures. It would also be very helpful if the authors could add a main figure to illustrate the workflow of the proposed algorithm.
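
The concern in major comment 2 is that optimizing one parameter with the other held fixed, and only then tuning the second, can miss the joint optimum that an exhaustive grid search would find. The Python sketch below illustrates that general point on a toy problem. The loss values are invented purely for illustration and are not results from the manuscript; the names `PG_VALUES`, `M_CUT_VALUES`, and `TOY_LOSS` are hypothetical, and the sequential procedure is a simplification of the range-narrowing strategy the comment describes.

```python
import math
from itertools import product

# Hypothetical candidate values for the two parameters.
PG_VALUES = [1.0, 2.0, 3.0]
M_CUT_VALUES = [10, 20, math.inf]

# Invented toy loss table with an interaction between the two parameters.
TOY_LOSS = {
    (1.0, 10): 0.40, (1.0, 20): 0.05, (1.0, math.inf): 0.50,
    (2.0, 10): 0.30, (2.0, 20): 0.25, (2.0, math.inf): 0.20,
    (3.0, 10): 0.70, (3.0, 20): 0.50, (3.0, math.inf): 0.60,
}


def sequential_search():
    """Pick PG with M_cut fixed at infinity, then pick M_cut for that PG."""
    best_pg = min(PG_VALUES, key=lambda pg: TOY_LOSS[(pg, math.inf)])
    best_m = min(M_CUT_VALUES, key=lambda m: TOY_LOSS[(best_pg, m)])
    return best_pg, best_m


def grid_search():
    """Evaluate every (PG, M_cut) pair and return the one with lowest loss."""
    return min(product(PG_VALUES, M_CUT_VALUES), key=lambda k: TOY_LOSS[k])


seq_choice = sequential_search()
grid_choice = grid_search()
```

On this toy table the sequential procedure settles on a pair with a higher loss than the pair found by the grid search. Whether the same happens on real data depends on how strongly the two parameters interact in the actual loss function.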

Minor comments:

1) Please define the superscript ‘c’ in Figure 3 when first introducing it.

2) Figure 7 legend. The correct term for ‘broken line’ is ‘dashed line’.

3) Line 164 “Our searching resolution in terms of increasing Mcut was 1000 near Mcut = 64000.” What is ‘1000’ here?

4) Line 287 ‘substantial overlap’. Please give the specific overlap ratio.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Daniel Zerbino

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008422.r004

Decision Letter 1

Jason A Papin, Avner Schlessinger

27 Aug 2020

Dear Dr. Tanaka,

Thank you very much for submitting your manuscript "Systematic clustering algorithm for chromatin accessibility data and its application to hematopoietic cells" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. Please address the comments by Reviewer 1.

Sincerely,

Avner Schlessinger

Associate Editor

PLOS Computational Biology

Jason Papin

Editor-in-Chief

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Dear Colleagues,

Thank you for answering most of my questions.

I still do not understand, however, why your comparison to prior art consists of only one method. Further, the selected competitor was not particularly advanced: the Corces 2016 paper was essentially focused on a new and exciting dataset, not a refined analysis method. I take on board that DiffBind was too slow, but as pointed out in my previous review, there are at least a dozen other methods out there.

People have long been computing epigenomic similarity matrices on ChIP-Seq and DNase-Seq data. For example, the website of the International Human Epigenome Consortium (IHEC) allows you to compute correlation matrices on all their datasets. Also, the CODEX project (1) has been using the Dice coefficient, which is very similar to the Hamming distance used in your method. Why can't these existing methods simply be used on ATAC-Seq data?

Sincerely,

Daniel Zerbino

(1) https://academic.oup.com/nar/article/43/D1/D1117/2439489
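
The similarity between the Dice coefficient and the Hamming distance that Reviewer #1 points to can be made concrete for 0/1 vectors. The Python sketch below uses invented toy vectors and hypothetical function names; it is not the implementation used by CODEX or by the authors. For binary vectors with at least one 1 between them, the two measures are linked by Dice = 1 − Hamming / (|A| + |B|), where |A| and |B| are the numbers of 1s in each vector.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length 0/1 sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def dice_coefficient(a, b):
    """Dice similarity 2|A and B| / (|A| + |B|) for two 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    total = sum(a) + sum(b)
    # Two all-zero vectors are treated as identical by convention here.
    return 2 * shared / total if total else 1.0


# Hypothetical binarized samples (1 = open region, 0 = closed region).
sample_a = [1, 1, 0, 0, 1, 0, 0, 0]
sample_b = [1, 0, 0, 0, 1, 1, 0, 0]

h = hamming_distance(sample_a, sample_b)
d = dice_coefficient(sample_a, sample_b)
```

The Hamming distance counts mismatches in absolute terms, whereas the Dice coefficient normalizes by the total number of open positions, so the two can rank sample pairs differently when samples differ strongly in how many regions are open.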

Reviewer #2: The authors have sufficiently addressed my concerns. The manuscript has been improved by further clarifying the method and re-organizing the main text structure. I would recommend it for publication.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Daniel Zerbino

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008422.r006

Decision Letter 2

Jason A Papin, Avner Schlessinger

6 Oct 2020

Dear Dr. Tanaka,

We are pleased to inform you that your manuscript 'Systematic clustering algorithm for chromatin accessibility data and its application to hematopoietic cells' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Avner Schlessinger

Associate Editor

PLOS Computational Biology

Jason Papin

Editor-in-Chief

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Dear Colleagues,

Thank you for answering my questions.

Best regards,

Daniel

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Daniel Zerbino

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008422.r007

Acceptance letter

Jason A Papin, Avner Schlessinger

24 Nov 2020

PCOMPBIOL-D-20-00774R2

Systematic clustering algorithm for chromatin accessibility data and its application to hematopoietic cells

Dear Dr Tanaka,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Nicola Davies

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Additional details of sequencing analysis.

    (PDF)

    Attachment

    Submitted filename: 103118_1_rebuttal_1719150_q89rkr.pdf

    Attachment

    Submitted filename: ReplytoReviewer20200807.pdf

    Attachment

    Submitted filename: response_to_reviewer_200905.pdf

    Data Availability Statement

    All ATAC-seq and RNA-seq data needed to reproduce this study have been deposited at the DNA Data Bank of Japan (DDBJ) under accession number DRA010939. The source code is available from https://github.com/tanakanishi/findclosest.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS