Author manuscript; available in PMC: 2021 Mar 24.
Published in final edited form as: Interdiscip Sci. 2019 Dec 16;12(1):69–81. doi: 10.1007/s12539-019-00354-7

Q-Nuc: A Bioinformatics Pipeline for the Quantitative Analysis of Nucleosomal Profiles

Yuan Wang 1, Qiu Sun 2, Jie Liang 3, Hua Li 1,*, Daniel M Czajkowsky 1,*, Zhifeng Shao 1
PMCID: PMC7990035  NIHMSID: NIHMS1652090  PMID: 31845186

Abstract

Nucleosomal profiling is an effective method to determine the positioning and occupancy of nucleosomes, which is essential to understand their roles in genomic processes. However, the positional randomness across the genome and its relationship with nucleosome occupancy remain poorly understood. Here we present a computational method that segments the profile into nucleosomal domains and quantifies their randomness and relative occupancy level. Applying this method to published data, we find on average ~3-fold differences in the degree of positional randomness between regions typically considered “well-ordered”, as well as an unexpected predominance of only two types of domains of positional randomness in yeast cells. Further, we find that occupancy levels between domains differ maximally by ~2–3 fold in both yeast and human cells, which has not been described before. We also developed a procedure by which one can estimate the sequencing depth that is required to identify nucleosomal positions even when regional positional randomness is high. Overall, we have developed a pipeline to quantitatively characterize domain-level features of nucleosome randomness and occupancy genome-wide, enabling the identification of otherwise unknown features in nucleosomal organization.

Keywords: Nucleosome organization, Genome structure, Positional randomness, Nucleosome occupancy, Hidden Markov Models

Introduction

Nucleosome binding to the eukaryotic genome plays important roles in essentially all genomic functions, from influencing the flexibility (and thus the folding) of chromatin to regulating the access of other DNA-binding proteins to the genome [1–4]. Our understanding of these roles has been significantly advanced owing to the development of highly efficient biochemical methods to profile nucleosome distributions on genomes in situ [5–8]. However, while these biochemical techniques are now routine, our ability to quantitatively rationalize such data has lagged behind [9].

Generally, there are two properties that characterize the nucleosomal distribution in a local region of a nucleosomal profile: positioning and occupancy [10,4]. Positioning reflects the localization of the nucleosome at a given genomic locus in a fraction of cells in the population, whereas occupancy reflects the fraction of cells within the population in which a given locus is bound by a nucleosome [4]. Most often in nucleosomal profiling studies, the focus is on “well-positioned” nucleosomes in which the nucleosomes localize to certain loci in the majority of cells [11,4], or on regular, equally spaced arrays of well-positioned nucleosomes called phased nucleosome arrays (PNAs) [12–14]. Well-known examples of PNAs include the pericentromeric region of human chromosome 12 [15] or those generally associated with the transcription-start sites (TSSs) of actively transcribed genes [1,2]. In terms of occupancy, given that most of the genome is thought to be occupied by nucleosomes [16,4], most research has focused on genomic regions of “low occupancy” that are depleted in nucleosomes, usually within enhancer or promoter regions, owing at least partly to the presence of DNA-binding proteins [5,17].

However, these descriptions of “well-positioned” or “low occupancy” have not been, to our knowledge, actually quantified systematically. That is, while the peaks in the pericentromeric region of human chromosome 12 are generally found to be more narrowly defined with less background than those within most TSS-associated PNAs, which clearly reflects a greater degree of randomness in the positioning within the latter, there is presently no understanding of the magnitude of this degree: both of these regions are simply called “well-positioned”. The same holds for extended domains that exhibit low occupancy: domains are simply recognized as having a lower level of bound nucleosomes than most locations in the genome, without careful inspection of the magnitude of this difference. Although previous studies have done much important work related to nucleosome positioning [18–21], they do not directly quantify the nucleosome randomness and relative occupancy on a genome-wide scale. This lack of explicit quantitative information on positioning or occupancy translates into a lack of any understanding of the actual fraction of cells in a population with a particular positioned nucleosome or occupied region. Such information is ultimately essential for a proper understanding of the role of nucleosomes in genomic processes, whose cell-to-cell variability is now often well-known [22–24], as well as the functioning of nucleosomes that are well-positioned in only a few cells in the population, which are largely ignored in present analyses owing to an inability to properly identify and characterize these regions.

Here, we have developed an integrative bioinformatics pipeline for Quantitative Nucleosomal analysis (Q-Nuc) (Fig. S1) in which the core method segments the entire genome into domains in terms of nucleosomal positioning and then quantifies the level of randomness and relative occupancy of each domain. Applying this method to published datasets, we find dramatic differences (~3-fold) in the positional randomness between domains that would otherwise be considered equivalent (well-ordered PNAs), as well as an unexpected predominance of just two different nucleosomal organizations, genome-wide, in yeast and human cells. Further, we find that there is generally a 2 to 3-fold difference between the most highly occupied and least occupied regions in yeast and human cells, which is higher than presently appreciated. We further integrated a procedure by which one can estimate the sequencing depth that is required for the identification of all preferred nucleosomal positions, even if they are identically positioned in only a small fraction of cells of a population, if one wishes to maximize the quantifiable information in a specific project. Overall, this pipeline provides the first means to segment chromatin by nucleosomal randomness within a profile, providing essential information about the heterogeneous organization of nucleosomes within the cell population, as well as to rationally determine the sequencing depth that is needed for nucleosome position determination. Such a pipeline should enable a more detailed understanding of the complex relationship between nucleosome distribution and various genomic processes, particularly with notable heterogeneity.

Methods

Introduction of conceptual definitions

As mentioned in the introduction, the two properties that characterize the nucleosomal distribution in a local region of a nucleosomal profile are the positioning and occupancy (Fig. 1). To avoid confusion with existing nomenclature, we introduce three definitions to characterize possible nucleosomal distributions: “specifically-positioned”, “partially-positioned” and “randomly-positioned” nucleosomes (Fig. 1A).

Figure 1.

Schematic diagram illustrating the cell-level view associated with nucleosomal positioning and occupancy for regions of high occupancy (A) and low occupancy (B).

“Specifically-positioned” nucleosomes reflect the extreme case in which nucleosomes localize at precisely the same genomic locus in all of the cells in the population (Fig. 1A). Operationally, we identify this case in a nucleosomal profile as a sharp peak in the number of read midpoints across a given ~200 bp locus, whose width is determined only by experimental biases (Fig. 1A). “Randomly-positioned” nucleosomes are the opposite extreme, in which the nucleosomes are found at different locations in different cells in the population without any sequence preference. Operationally, we identify randomly-positioned nucleosomes by a uniform number of read midpoints across a given ~200 bp locus (Fig. 1A).

The third case, “partially-positioned” nucleosomes, describes the vast majority of nucleosomes in a cell population. Such a locus consists of a certain percentage of specifically-positioned nucleosomes, reflecting the fraction of cells that have this position occupied by a nucleosome, and a percentage of randomly-positioned nucleosomes, reflecting the random distribution of nucleosomes in the remaining fraction of cells.

For occupancy, we associate the number of read midpoints within a given ~200 bp locus with the occupancy within this region, which is the fraction of cells in the population with a nucleosome within this region, regardless of the percentage of randomness (Fig. 1). In practice, though, we use the average read density across a nucleosomal domain as the measure of occupancy level. Domains with the highest average read density are considered as fully occupied, which is then used to quantify the relative occupancy level of the remaining domains.

Simulated profiling data using domains of defined positional randomness

The first step in our method is to generate simulated profiling data for an “ideal” extended nucleosomal distribution that consists of domains with various degrees of randomness (Fig. 2A).

Figure 2. Nucleosomal distribution model and generation of simulated profiling data.

(A) Model overview. The specifically-positioned nucleosome (Gaussian distribution) and the randomly-positioned nucleosome (uniform distribution) are combined in different proportions. Several nucleosomes are arranged together into a region and the simulated reads are produced according to the theoretical distribution of nucleosome position. (B) Model simulation. Rep1 and Rep2 are the results of two separate simulations, representing two replicates. Simulations are performed at a read density of 0.1 reads/bp. The first track is the theoretical distribution of nucleosome position. The level of randomness is indicated at the top of each distribution.

To make a single domain, we first combine the distribution of a single, specifically-positioned nucleosome, which we consider as a Gaussian (as suggested in previous studies [1,25] and also shown in Fig. S2) whose breadth is only owing to experimental noise, with a uniform distribution that we associate with the randomly distributed nucleosomes at a proportion that reflects the degree of randomness at this locus (Fig. 2A). Mathematically, using S(x) to denote the specifically-positioned distribution (i.e., the probability density function) and R(x) to denote the randomly-positioned distribution, we introduce the nucleosome position distribution (NPD(x)) as

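$$S(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x - d/2)^{2}}{2\sigma^{2}}}, \quad x \in [0, d] \tag{1}$$
$$R(x) = \frac{1}{d}, \quad x \in [0, d] \tag{2}$$
$$\mathrm{NPD}(x) = (1 - npr) \cdot S(x) + npr \cdot R(x), \quad x \in [0, d] \tag{3}$$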

where npr is the fraction of cells with uniformly distributed nucleosomes within the given region. The size of the nucleosome, d, is taken as the typical distance between two adjacent nucleosomes. This is obtained from the experimental MNase-seq data, as is the standard deviation of the Gaussian distribution, σ (Fig. S2).
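The mixture in Eqs. (1)–(3) can be written down directly. The following is a minimal sketch, assuming that the Gaussian is centred at the middle of the interval (d/2); the values d = 165 bp and σ = 20 bp are illustrative placeholders rather than the values fitted from the MNase-seq data, and the function names are hypothetical.

```python
import math

def specific_density(x, d, sigma):
    """Gaussian S(x), assumed to be centred at d/2 (Eq. 1)."""
    norm = math.sqrt(2 * math.pi * sigma ** 2)
    return math.exp(-((x - d / 2) ** 2) / (2 * sigma ** 2)) / norm

def random_density(x, d):
    """Uniform R(x) = 1/d on [0, d] (Eq. 2)."""
    return 1.0 / d

def npd(x, d, sigma, npr):
    """Mixture NPD(x) = (1 - npr) * S(x) + npr * R(x) on [0, d] (Eq. 3)."""
    if not 0 <= x <= d:
        return 0.0
    return (1 - npr) * specific_density(x, d, sigma) + npr * random_density(x, d)

# Illustrative values only (assumptions, not the fitted parameters).
d, sigma = 165, 20
for npr in (0.0, 0.5, 1.0):
    print(npr, npd(d / 2, d, sigma, npr))
```

With npr = 1 the density is flat at 1/d, and with npr = 0 it reduces to the Gaussian peak, which are the two extreme cases defined above.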

The domain is then generated by concatenating these distributions together to a size of several kb. Likewise, concatenation of these domains together yields the “ideal” distribution (~100 kb) for the simulated profiling experiments. For the analysis here, we examined domains whose nucleosomal separation, d, and occupancy level were the same everywhere, but with various values of positional randomness (npr). However, we emphasize that this method can also be applied to regions of different nucleosomal separations, d, and so is not limited to the identification of only periodically arranged nucleosomes.

We then use Monte Carlo calculations to generate simulated profiling data of the nucleosomal midpoints according to the theoretical probability distribution of the nucleosomes, NPD. Shown in Fig. 2B are simulated profiling “reads” determined from regions with various percentages of randomness with a read density of 0.1 reads/bp. This read density was chosen for reasons that are described below. We obtained two simulated datasets (or replicates), akin to replicates in the experimental data, to enable comparisons between independent samplings of the same distribution (see below). As expected, simple inspection of these simulated data shows that the similarity between the replicates (or correlation) decreases as the nucleosome position randomness increases (Fig. 2B).
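A minimal sketch of such a Monte Carlo sampling is given below. It assumes equal occupancy for every nucleosome, an integer nucleosomal separation d, and re-drawing of the rare Gaussian samples that fall outside [0, d]; the function name, the parameter values and the random seeds are hypothetical and serve only as an illustration.

```python
import random

def simulate_midpoints(npr_per_domain, nuc_per_domain, d, sigma, read_density, rng):
    """Draw read midpoints from concatenated NPDs and return per-bp counts."""
    n_nuc = len(npr_per_domain) * nuc_per_domain
    length = n_nuc * d
    counts = [0] * length
    n_reads = int(read_density * length)
    for _ in range(n_reads):
        k = rng.randrange(n_nuc)              # equal occupancy: each nucleosome is equally likely
        npr = npr_per_domain[k // nuc_per_domain]
        if rng.random() < npr:
            offset = rng.uniform(0, d)        # randomly-positioned component, R(x)
        else:
            offset = rng.gauss(d / 2, sigma)  # specifically-positioned component, S(x)
            while not 0 <= offset <= d:       # re-draw samples outside [0, d]
                offset = rng.gauss(d / 2, sigma)
        pos = min(int(k * d + offset), length - 1)
        counts[pos] += 1
    return counts

# Two independent "replicates" of a ~6 kb region: one domain with 20% and one
# with 100% randomness, 18 nucleosomes each, sampled at 0.1 reads/bp.
rep1 = simulate_midpoints([0.2, 1.0], 18, 165, 20, 0.1, random.Random(1))
rep2 = simulate_midpoints([0.2, 1.0], 18, 165, 20, 0.1, random.Random(2))
```

Using a separate random generator for each replicate keeps the two samplings independent while drawing from the same underlying distribution.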

Segmentation of the genome into specifically-positioned or randomly-positioned regions

We next segment the experimental nucleosome profiling data into one of two types of domains: mainly specifically-positioned (SP) nucleosomes and mainly randomly-positioned (RP) nucleosomes. After this segmentation, we quantify the percentage of randomness within each domain, which results in a further sub-categorization of these two domains.

For this segmentation, we employ Hidden Markov Models (HMM) [26,27] (Fig. 3). For the “observations” in the HMM models, we required a parameter that could be used as a surrogate for the percentage of randomness within a region. Examining simulated profiling data, we found that the Pearson correlation between replicates changes in a well-defined way with the percentage of randomness within the domain, particularly at read densities lower than 0.2 reads/bp (Fig. S3). Thus, we used the Pearson correlation between replicates of the simulated profiling data as “observations” to train the HMM model, and then used this model to classify the experimental data based on the Pearson correlation between replicates of the experimental data, performing all calculations at 0.1 reads/bp.

Figure 3. Transition and emission probabilities of Hidden Markov Model (HMM).

S1 and S2 are the two states; O1, O2 and O3 are the observations; and Th1 and Th2 are the two thresholds for observation determination.

In the HMM, we employ three types of observations (O1, O2, O3) according to threshold values of the Pearson correlation (th1, th2) (Fig. 3). For the analysis here, we use values of 0.23 and 0.70 for th1 and th2, respectively, which correspond to domains with randomness of 90% and 50%, respectively. These were chosen after a manual inspection of the obtained classification using different levels of randomness.
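As an illustration of how the correlation is converted into observations, the sketch below computes the Pearson correlation of two replicates within 200 bp segments and maps it onto three symbols using the thresholds 0.23 and 0.70. Which symbol corresponds to which correlation range is an assumption here (O1 is taken as the lowest range), smoothing is omitted, and treating flat segments as uncorrelated is also an assumption.

```python
import math

TH1, TH2 = 0.23, 0.70  # thresholds quoted in the text

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a flat segment is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def observation(r, th1=TH1, th2=TH2):
    """Map a correlation onto an observation symbol (label order is assumed)."""
    if r < th1:
        return "O1"
    if r < th2:
        return "O2"
    return "O3"

def segment_observations(rep1, rep2, seg=200):
    """Observation symbol for every complete, non-overlapping segment."""
    usable = min(len(rep1), len(rep2))
    return [observation(pearson(rep1[i:i + seg], rep2[i:i + seg]))
            for i in range(0, usable - seg + 1, seg)]
```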

The steps to segment the profile are as follows:

  1. Two simulated nucleosomal profiles are generated from a nucleosomal distribution of alternating 3 kb domains with either 20% or 100% randomly positioned nucleosomes at a density of 0.1 reads/bp, following the method in the previous section. These profiles are then divided into 200 bp segments and the Pearson correlation between the two replicates within each segment is calculated (after smoothing [28]), from which the observation type of each 200 bp segment is determined.

  2. The HMM transition probabilities and emission probabilities are trained using the simulated profiling data.

  3. The experimental profiling data is randomly split into two and divided into 200 bp segments. The Pearson correlation between two replicates is then calculated and the observation type of each segment is determined.

  4. The type of nucleosomal distribution (SP or RP) is then predicted using the Viterbi algorithm with the trained parameters of the HMM.

The HMM training and the Viterbi calculations were performed using the R package HMM.
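The training and decoding steps can be sketched as follows. This is not the R implementation used here: it is a self-contained Python illustration that assumes the HMM parameters are estimated by simply counting transitions and emissions in the labelled simulated segments (with a pseudocount), followed by a standard log-space Viterbi decoding. All function names are hypothetical.

```python
import math

STATES = ("SP", "RP")
SYMBOLS = ("O1", "O2", "O3")

def train_counts(labels, observations, pseudo=1.0):
    """Estimate transition and emission probabilities by counting labelled segments."""
    trans = {s: {t: pseudo for t in STATES} for s in STATES}
    emit = {s: {o: pseudo for o in SYMBOLS} for s in STATES}
    for prev, cur in zip(labels, labels[1:]):
        trans[prev][cur] += 1
    for state, obs in zip(labels, observations):
        emit[state][obs] += 1
    for table in (trans, emit):
        for s in STATES:
            total = sum(table[s].values())
            for key in table[s]:
                table[s][key] /= total
    return trans, emit

def viterbi(observations, trans, emit, start=None):
    """Most likely state path for a sequence of observation symbols."""
    start = start or {s: 1.0 / len(STATES) for s in STATES}
    score = {s: math.log(start[s]) + math.log(emit[s][observations[0]]) for s in STATES}
    back = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: score[s] + math.log(trans[s][t]))
            pointers[t] = best
            new_score[t] = score[best] + math.log(trans[best][t]) + math.log(emit[t][obs])
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy training data: alternating blocks of 15 segments (3 kb at 200 bp per segment).
labels = (["SP"] * 15 + ["RP"] * 15) * 4
symbols = (["O3"] * 15 + ["O1"] * 15) * 4
trans, emit = train_counts(labels, symbols)
path = viterbi(["O3", "O3", "O2", "O3", "O1", "O1", "O1"], trans, emit)
```

Because the transition probabilities favour staying in the same state, an isolated ambiguous segment (here the single O2) is absorbed into the surrounding domain rather than creating a new one.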

Following this categorization, the randomness of each domain is determined by computing the Pearson correlation between the experimental replicates of the entire domain and translating this into a percentage of randomness, according to the simulated results in the previous section. Fig. S4 shows an example of this analysis for the pericentromeric region of the human chromosome 12.

Classification of the relative nucleosome occupancy level of each domain

We take the average read density across each domain as the measure of its occupancy and consider the domain with the highest read density as “fully occupied”. The relative occupancy level of each domain is then the ratio of its average read density to this highest read density.

To enable further analysis, we categorized the domains as low, medium, or high occupancy by rank ordering the domains according to their occupancy level and then defined low and high occupancy as those domains within the first and fourth quartile of this distribution, respectively, while those in the second and third quartiles are defined as medium occupancy (Fig. S5).
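A minimal sketch of the relative occupancy and quartile-based categorization is shown below. The handling of ties and of domain counts that are not divisible by four is an assumption, and the read densities in the example are made-up illustrative numbers.

```python
def relative_occupancy(read_densities):
    """Occupancy of each domain relative to the most densely covered domain."""
    top = max(read_densities)
    return [x / top for x in read_densities]

def occupancy_categories(read_densities):
    """Label the lowest quartile 'low', the highest 'high' and the rest 'medium'."""
    n = len(read_densities)
    order = sorted(range(n), key=lambda i: read_densities[i])
    labels = [None] * n
    for rank, i in enumerate(order):
        if rank < n / 4:
            labels[i] = "low"
        elif rank >= 3 * n / 4:
            labels[i] = "high"
        else:
            labels[i] = "medium"
    return labels

# Illustrative average read densities (reads/bp) for eight hypothetical domains.
densities = [0.2, 0.4, 0.1, 0.8, 0.5, 0.3, 0.6, 0.7]
rel = relative_occupancy(densities)
cats = occupancy_categories(densities)
```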

However, we note that profiles with low read density are expected to exhibit large regions with few or no reads simply because the read density is low. Thus, we propose that samples with a density lower than 0.5 reads/bp, the density that yields an average of 100 reads per 200 bp locus (and thus an expected error, assuming a Poisson process, of 10 reads), may not generally be sufficient for such quantification purposes. In this case, quantitative analysis could be restricted to those domains with densities higher than this threshold.
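The arithmetic behind this threshold can be made explicit. For a read density ρ and a locus of width w, the expected count is λ = ρw, and a Poisson process has a standard deviation of √λ. The comparison with 0.1 reads/bp is added here only as an illustration of how the relative error grows at lower density.

$$\lambda = \rho\, w = 0.5 \times 200 = 100, \qquad \sqrt{\lambda} = 10, \qquad \frac{\sqrt{\lambda}}{\lambda} = 10\%$$
$$\rho = 0.1:\quad \lambda = 0.1 \times 200 = 20, \qquad \sqrt{\lambda} \approx 4.5, \qquad \frac{\sqrt{\lambda}}{\lambda} \approx 22\%$$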

Protocol to predict the true positive rate (TPR) of calling locations of the specifically positioned nucleosomes and estimate the required sequencing depth

Given that the majority of the genome is partially positioned, one consequence of knowing the percentage of randomness in a specific domain is that we could then predict the sequencing depth that is needed to precisely identify the locations of the specifically-positioned nucleosomes within that domain, which are only present in a fraction of cells. Obviously, a greater sequencing depth is needed for smaller fractions. Generally, extended regions that contain both specifically-positioned and randomly-positioned nucleosomes yield poor TPR predictions due to biased randomness estimation (Fig. S6). Hence, our segmentation of the profile into domains that are enriched for specifically-positioned or randomly-positioned nucleosomes thereby enables a more effective calculation of the TPR.

Moreover, this analysis was aided by an unexpected observation: there is a simple, mathematical relationship between the Pearson correlation of replicates (of simulated data) and the TPR for calling the locations of the specifically-positioned nucleosomes (Logistic function, Fig. S6A,B). Thus, just as in the aforementioned analysis where the Pearson correlation could be used as a surrogate for the percentage of positional randomness, here we can use the Pearson correlation at a given sequencing depth to predict the TPR.

We constructed a protocol to predict the TPR as a function of sequencing depth (Fig. 4). In particular, (1) As described above, we segment the profile into domains using the HMM-based model, and determine the percentage of positional randomness of each domain using the Pearson correlation of the experimental replicates at 0.1 reads/bp; (2) Given this randomness for a specific domain, we determine the Pearson correlation of the simulated profiling data at greater sequencing depths; (3) The TPR of this domain is calculated by the logistic function (Fig. S6B) and the overall TPR is based on weighted average TPR of all the domains:

$$\mathrm{TPR}_i = \frac{1}{1 + e^{-13.61\,(r_i - 0.64)}} \tag{4}$$
$$\mathrm{TPR} = \sum_i \frac{s_i}{s}\, \mathrm{TPR}_i \tag{5}$$

where ri is the Pearson correlation coefficient of the i-th sub-region, TPRi is the predicted nucleosome position calling TPR of the i-th sub-region, si is the size of the i-th sub-region and s is the size of whole region analyzed. (4) Finally, we identify the read density at which the TPR for the whole dataset reaches 90% as the required sequencing depth for efficient identification of the specifically-positioned nucleosomal positions.
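Eqs. (4) and (5) and the final depth selection can be sketched as follows, assuming that the exponent of the logistic function is negative so that the TPR increases with the Pearson correlation. The per-depth correlations in the example are hypothetical placeholders for the values that would be obtained from simulations, and the function names are illustrative.

```python
import math

def domain_tpr(r, k=13.61, r0=0.64):
    """Logistic relationship between correlation and TPR (Eq. 4)."""
    return 1.0 / (1.0 + math.exp(-k * (r - r0)))

def overall_tpr(correlations, sizes):
    """Size-weighted average of the per-domain TPRs (Eq. 5)."""
    total = sum(sizes)
    return sum(s / total * domain_tpr(r) for r, s in zip(correlations, sizes))

def required_depth(depth_to_correlations, sizes, target=0.90):
    """Smallest depth whose overall TPR reaches the target, or None if none does."""
    for depth in sorted(depth_to_correlations):
        if overall_tpr(depth_to_correlations[depth], sizes) >= target:
            return depth
    return None

# Hypothetical correlations of two 2 kb domains at two read densities (reads/bp).
simulated = {0.1: [0.30, 0.50], 1.0: [0.90, 0.95]}
depth = required_depth(simulated, [2000, 2000])
```

Weighting by domain size means that large, highly random domains dominate the overall TPR and therefore the estimated depth.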

Figure 4.

Workflow for the prediction of the nucleosome position calling true positive rate (TPR)

Applications of the model to simulated and experimental data

In the examination of simulated profiles, the 1,000 kb ideal nucleosomal profile consisted of 2 kb domains that differed in percentage of positional randomness and degree of occupancy. The randomness in each domain is sampled from a uniform distribution ranging from 0% to 100%, while the occupancy level was chosen from a Gaussian distribution ranging from 0% to 100% (mean 50%, standard deviation 20%), which we found is similar to the experimental case.

For the analysis of the yeast data, we mapped the raw MNase-seq data of budding yeast (GEO: #GSE47023) [29] to the sacCer3 genome, removed duplicates and obtained the read midpoints with a fragment length between 120 bp and 180 bp. To examine our method on two replicates independently, we down-sampled the data to a read density of 0.2 reads/bp for each replicate and split each replicate into two. As we only used these split replicates to compute the Pearson correlation, there is no experimental bias introduced into the correlation calculations. We then smoothed the midpoints [28] and examined these data with our HMM-classification procedure. Further confidence in the designation of a domain as specifically-positioned or randomly-positioned is gained by examination of the corresponding smoothed signals of the fragment midpoints. Only the domains that overlapped in both replicates were analyzed further, except for the data shown in Fig. 6A–C and Fig. 8B, which included all of the regions. The TSS-associated region (TSS-1k) includes all regions 500 bp upstream and downstream of the TSS, while the gene-body associated region includes the entire gene minus the TSS-associated region. The gene expression level was obtained from [29] and the gene expression fold change of young and old yeast was calculated as log2(Old/Young).

Figure 6. Changes in nucleosomal positional randomness during aging in budding yeast.

(A) HMM segmentation results of two replicates in young and old yeast. Red blocks represent the specifically-positioned domains and the white blocks represent the randomly-positioned domains. (B) Genome-wide distribution of the positional randomness in the young yeast. (C) Genome-wide distribution of the positional randomness in the old yeast. (D) Comparison of the positional randomness of the overlapped domains in the young and old yeast. (E) Distribution of the positional randomness in the TSS- and gene body-associated regions of young yeast. (F) Distribution of the positional randomness in the TSS- and gene body-associated regions of old yeast. “size/200” is used to reflect the fact that the genome is divided into 200 bp bins. (G) Relationship between the change in randomness and gene expression (Pearson correlation coefficient r =0.036).

Figure 8. Nucleosomal positional randomness in human lymphoblast cells.

(A) HMM segmentation results of the two experimental replicates. The red blocks represent the specifically-positioned nucleosomal domains and the white blocks represent the randomly-positioned domains. (B) The distribution of randomness and occupancy levels in chromosome 1 of these cells. (C) Nucleosome positioning randomness distribution in the TSS- and gene body-associated regions of human chromosome 1 of these cells.

For the human cell data, we mapped the raw MNase-seq data of human lymphoblast cell line GM19238 (GEO: #GSM907790) [15] to the human genome (genome build version hg38), removed duplicates and obtained the read midpoints with a fragment length between 120 bp and 180 bp. Since there are no experimental replicates of this data set, we used data from different sequencing runs as replicates (Table S1). When analyzing the PNA in the pericentromeric region of chromosome 12, reads located within the region from 34,332 to 34,407 kb are used. For the gene expression analysis, we used the RNA-seq data of the same cell line (GEO: #GSM1234095) [30] and obtained the gene expression level (measured by FPKM) using Cufflinks after mapping.

For the prediction of the TPR, we mapped the yeast data (GEO: #GSE117881) [31] to the sacCer3 genome, and the midpoints of fragments with lengths between 120 and 180 bp were smoothed [28] before position calling. For position calling, we first detected all of the local maxima in the smoothed data, then counted the number of reads within 30 bp on each side of each local maximum and calculated a p-value based on the Poisson distribution, similar to previous work [32,33]. We then adjusted the p-values using the Benjamini-Hochberg method and selected the peaks with an adjusted p-value smaller than 0.01 as called positions.
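The position-calling procedure can be sketched as below. The Poisson background rate is assumed here to be the mean read density multiplied by the window width, which is one possible choice and not necessarily the one used in [32,33]; the smoothed signal is taken as an input, and all function names are hypothetical.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam); adequate for thresholding, not for tiny tails."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value downwards
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def call_positions(counts, smoothed, half_window=30, alpha=0.01):
    """Local maxima of the smoothed signal whose adjusted p-value is below alpha."""
    # Assumed background: genome-wide mean density times the window width.
    lam = sum(counts) / len(counts) * (2 * half_window + 1)
    maxima = [i for i in range(1, len(smoothed) - 1)
              if smoothed[i - 1] < smoothed[i] >= smoothed[i + 1]]
    pvals = [poisson_sf(sum(counts[max(0, i - half_window):i + half_window + 1]), lam)
             for i in maxima]
    adjusted = benjamini_hochberg(pvals)
    return [i for i, q in zip(maxima, adjusted) if q < alpha]
```

For example, a profile with sparse single reads and one locus carrying 20 read midpoints would yield only that locus as a called position.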

Results

Performance evaluation of the HMM-based classification method

The goal of this method is to characterize domains within an experimental nucleosomal profile in terms of the percentage of randomness in nucleosomal positions and the relative level of nucleosomal occupancy. To this end, we first segment the experimental profiling data into one of two types of domains (mainly “specifically-positioned” (SP) or mainly “randomly-positioned” (RP) nucleosomes) using HMM-based classification, and then quantify the percentage of positional randomness and occupancy of the nucleosomes within each domain (Methods).

To verify the effectiveness of this approach, we first applied this method to simulated nucleosomal profiles and compared the results from our method to the original nucleosomal distribution from which the simulated profiles were generated (Fig. 5). As shown in Fig. 5A, there is indeed good correspondence between the segmented domains and their boundaries and those of the original distribution. Further, we find very good agreement between the measured and expected percentages of positional randomness of the domains (Pearson correlation, R = 0.80), as well as between the measured and expected occupancy levels (R = 0.83) (Fig. 5B).

Figure 5. Application of the HMM-based classification method to simulated and experimental data.

(A) Example of the simulation and classification results. In the bottom panel, the red blocks depict specifically-positioned nucleosomal regions and the white regions are randomly-positioned domains. (B) Performance assessment of the positional randomness (upper panel) and occupancy level (lower panel) prediction using the HMM-based method. (C) An application of the HMM-based classification method to the pericentromeric region of human chromosome 12. The red blocks in the bottom panel denote the specifically-positioned nucleosome regions and the orange lines delimit randomly-positioned nucleosome regions.

To validate our method with experimental data, we examined the well-known PNA of the pericentromeric region of human chromosome 12 from a profile of lymphoblast cells [15]. The SP domains determined by the HMM method indeed align well with the known PNAs in this region (Fig. 5C). Interestingly, the HMM method also identified two small RP domains within this domain that are also evident in the data by simple visual inspection of the smoothed signal (Fig. 5C). Thus, our method indeed appears to be highly accurate in segmenting the profile and quantifying the domains.

Quantitative analysis of nucleosome positioning and occupancy in young and old budding yeast

To demonstrate the utility of this method with experimental data, we examined the nucleosomal profiles of young and old budding yeast [29]. The HMM-based classification method identified 7,712 domains (4,083 SP and 3,629 RP) within the young yeast profile and 7,492 domains (3,140 SP and 4,352 RP) in the old (Fig. 6A). More details about the genomic localization of these domains are shown in the supplementary file (Table S2). Interestingly, examination of the percentage of randomness in these domains revealed that, genome-wide, rather than exhibiting a continuous range of values, there were essentially only two major groups of domains in the young yeast: low randomness domains (type 1) that peak at ~35% randomness and high randomness domains (type 2) with a randomness of 75% (Fig. 6B). The domains in the old yeast also appear to be dominated by these two types of domains, although with far fewer type 1 domains and a slight shift in their peak to ~42% randomness (Fig. 6C). We note that the existence of two main distributions is not a necessary consequence of the HMM classification method (see below). Overall, the randomness in the domains is greater in the old yeast (average increase of ~11%), although, rather than owing to a general increase in randomness in all domains, this increase is primarily a result of the conversion of type 1 domains in the young yeast to type 2 domains in the old (Fig. 6D).

In terms of occupancy, there was a similar, relative distribution of the low, medium, and high occupancy levels, regardless of the percentage of randomness in the domains in the young (Fig. 7A) or old yeast (Fig. 7B). Overall, we found that the median occupancy level in the high occupancy domains in the young yeast is 1.8-fold that of the low occupancy domains while for the old yeast, this is only 1.5-fold.

Figure 7. Nucleosome occupancy distribution and its relationship with gene expression.

(A) Proportion of different occupancy levels at different levels of randomness in young yeast. (B) Proportion of different occupancy levels at different levels of randomness in old yeast. (C) Relationship between expression and occupancy types in TSS- and gene body-associated regions in young yeast. (D) Relationship between expression and occupancy types in TSS- and gene body-associated regions in old yeast. Gene expression levels are measured by FPKM. Significant differences of Wilcoxon test are marked with asterisks. (p-values < 0.0001: “****”; p-values < 0.001: “***”)

Owing to the prevalence of TSS-associated PNAs in yeast [12,34–36], we expected that the type 1 domain might generally overlap with a TSS. Thus, we examined the positional randomness associated with the TSS region (Methods) in young yeast and indeed found that the majority are type 1, with a minority of type 2 (Fig. 6E). By contrast, analyzing the randomness associated with the gene body (Methods), we find more type 2 domains in the gene body region of young yeast (Fig. 6E). Hence, the nucleosome position is generally more random in the gene body. In old yeast, there is roughly an equal number of type 1 and 2 domains overlapping the TSS region and a majority of type 2 domains in the gene body (Fig. 6F). Among the TSS-associated domains, 29.3% switched from type 1 in the young to type 2 in the old, 70.2% remained unchanged and only 0.5% switched from type 2 to type 1. We also examined whether the change in randomness at the TSS between young and old yeast is related to expression changes in the corresponding genes and, somewhat surprisingly, found no overall correlation (Fig. 6G). There is also no significant difference in expression with increased or decreased randomness (p-value > 0.05). Thus, this result suggests that the change from type 1 to type 2 domains in the TSS region is not directly consequential to transcriptional activity.

By contrast though, analysis of the occupancy levels reveals clearly a correlation to expression level. In particular, we found that, regardless of the occupancy level of the TSS-associated region, there was higher expression in those genes with a lower occupied gene body region, especially in the young yeast (Fig. 7C,D).

Quantitative analysis of nucleosome positioning in human lymphoblast cells

As an additional example, we also applied our method to the categorization of the nucleosomal distribution in human lymphoblast cells [15]. We identified 99,377 domains (26,981 SP and 72,396 RP) in chromosome 1 of these cells (Fig. 8A). Different from what was observed in the yeast cells, there is only one dominant domain associated with ~65% randomness in human (Fig. 8B).

We note that this more “well-ordered” region (type 1) is nonetheless much more random than the “well-ordered” pericentromeric region of chromosome 12 in these cells (Fig. 5C), which we measured to have a 2.9-fold lower level of randomness (14%).

Also similar to what was observed in yeast, there is a roughly equal distribution of low, medium, and high occupancy domains across all domains of positional randomness (Fig. 8B), and the median occupancy level in highly occupied domains is 2.9-fold that of the lower occupied domains in these lymphoblast cells.

We also examined the positional randomness within the TSS-associated regions and gene-body regions (defined similarly as above in yeast) and, similar to what was observed in yeast, the gene-body region was associated with the more random, type 2 domain. However, unlike what was observed in the yeast, the TSS-associated region was primarily associated with type 2 domains, although there is a slight enrichment of TSS-associated domains with a lower degree of randomness than those observed within the gene-body (Fig. 8C). Also, there was a much more complicated relationship of occupancy levels in the TSS-associated and gene-body regions to expression than in yeast (Fig. S8). Thus, while many of the quantitative features observed in yeast are found in these human cells, there are nonetheless significant differences both qualitatively and quantitatively between these cells.

Estimating the required sequencing depth to improve the true positive rate (TPR) of calling specifically-positioned nucleosomes

The ability to quantitatively identify partially positioned domains raised the possibility of determining all preferred nucleosome positions in the genome, even if they are occupied in only a small fraction of the cells. Therefore, we further integrated a method to estimate the sequencing depth required to accurately identify all nucleosome positions (see Methods).

To test this method, we used an ultra-deep MNase-seq data set from Saccharomyces cerevisiae [31]. The density of uniquely mapped reads is 13 reads/bp, which is among the highest density in the literature. We used nucleosome position calling results from total mapped reads as the “gold standard” reference of specifically-positioned nucleosomes to evaluate the nucleosome position TPR (Fig. S7).

We compared the predicted TPR by our method (averaged over 10 calculations) to the position calling TPR from down-sampled data as a function of read density. Indeed, there is excellent agreement between the predicted and measured TPR (Fig. 9). With this, we suggest that a desired TPR value of 90% would require a read density of 2 reads/bp, which is ~1/6 of the number in this data set.
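The measured side of this comparison, the position calling TPR from down-sampled data, can be illustrated with a minimal sketch. It is not the Q-Nuc implementation: the synthetic reads, the greedy peak caller `call_positions`, the matching tolerance and all parameter values are assumptions made only for illustration, and the predictive method described in Methods is not reproduced here. As in the text, calls from the full read set serve as the reference.

```python
import numpy as np

def downsample(read_centers, fraction, rng):
    """Keep a random subset of reads, drawn without replacement."""
    n_keep = int(round(fraction * len(read_centers)))
    return rng.choice(read_centers, size=n_keep, replace=False)

def call_positions(read_centers, length, half_window=40, min_reads=5):
    """Hypothetical toy caller: greedily pick the window with the most reads,
    then suppress its neighborhood, until no window holds min_reads reads."""
    counts = np.bincount(read_centers, minlength=length).astype(float)
    smooth = np.convolve(counts, np.ones(2 * half_window + 1), mode="same")
    calls = []
    while True:
        i = int(np.argmax(smooth))
        if smooth[i] < min_reads:
            break
        calls.append(i)
        smooth[max(0, i - 2 * half_window): i + 2 * half_window + 1] = 0.0
    return np.array(sorted(calls), dtype=int)

def true_positive_rate(calls, gold, tolerance=20):
    """Fraction of reference positions with a call within `tolerance` bp."""
    if len(calls) == 0:
        return 0.0
    nearest = np.abs(gold[:, None] - calls[None, :]).min(axis=1)
    return float((nearest <= tolerance).mean())

# Synthetic reads (assumed): 30 positions whose read support ranges from weak to strong.
rng = np.random.default_rng(1)
length = 5000
centers = 100 + 165 * np.arange(30)
reads_per_site = np.linspace(10, 400, 30).astype(int)
reads = np.concatenate([
    np.clip(np.rint(rng.normal(c, 10.0, n)), 0, length - 1).astype(int)
    for c, n in zip(centers, reads_per_site)
])

# Reference calls come from all reads; TPR is then measured at lower densities.
gold = call_positions(reads, length)
tpr_by_fraction = {}
for fraction in (0.02, 0.1, 0.5, 1.0):
    sub = downsample(reads, fraction, rng)
    tpr_by_fraction[fraction] = true_positive_rate(call_positions(sub, length), gold)
    density = fraction * len(reads) / length
    print(f"{density:.3f} reads/bp -> TPR {tpr_by_fraction[fraction]:.2f}")
```

In this toy setting the weakly supported positions drop out first as the read density falls, which is the behavior that a depth estimate needs to capture.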

Figure 9.

Comparison of the nucleosome position calling TPR and the predicted TPR using experimental data as a function of the read density for S. cerevisiae ChrII (A) and ChrIV (B).

Discussion

We have developed a pipeline, Q-Nuc (Fig. S1), that can effectively quantify nucleosomal profiling data. We segment the profiling data into domains according to the positioning randomness, without regard to the annotation status of the genome, and then quantify the positional randomness and relative occupancy level of the domains genome-wide. As specifically positioned nucleosomes remain a major interest in many projects, we also integrated a method for predicting the sequencing depth required to call all of their locations, even when they are occupied in only a minority of cells in the population. Preliminary applications of this method have already revealed interesting features of the nucleosomal organization in both yeast and human cells.
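The segmentation step can be pictured with a generic two-state hidden Markov model decoded by the Viterbi algorithm. The sketch below is only an illustration under stated assumptions: the per-window score, the Gaussian emission parameters, the transition probabilities and the helper names `viterbi` and `to_domains` are hypothetical and are not taken from the Q-Nuc code. State 0 stands for a specifically positioned (SP) domain and state 1 for a randomly positioned (RP) domain.

```python
import numpy as np

def viterbi(obs, means, sds, trans, start):
    """Most likely state path for a Gaussian-emission HMM, computed in log space."""
    log_emit = (-0.5 * ((obs[:, None] - means[None, :]) / sds[None, :]) ** 2
                - np.log(sds[None, :]))
    log_trans = np.log(trans)
    score = np.log(start) + log_emit[0]
    back = np.zeros((len(obs), len(means)), dtype=int)
    for t in range(1, len(obs)):
        cand = score[:, None] + log_trans      # cand[i, j]: move from state i to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = np.empty(len(obs), dtype=int)
    path[-1] = score.argmax()
    for t in range(len(obs) - 1, 0, -1):       # trace the best path backwards
        path[t - 1] = back[t, path[t]]
    return path

def to_domains(path):
    """Collapse a state path into (start, end, state) domains, end exclusive."""
    edges = np.flatnonzero(np.diff(path)) + 1
    starts = np.concatenate(([0], edges))
    ends = np.concatenate((edges, [len(path)]))
    return [(int(s), int(e), int(path[s])) for s, e in zip(starts, ends)]

# Assumed per-window scores: high in an ordered stretch, low in a random stretch.
rng = np.random.default_rng(0)
obs = np.concatenate([rng.normal(0.8, 0.05, 20), rng.normal(0.2, 0.05, 20)])

path = viterbi(
    obs,
    means=np.array([0.8, 0.2]),                # state 0: SP-like, state 1: RP-like
    sds=np.array([0.1, 0.1]),
    trans=np.array([[0.95, 0.05], [0.05, 0.95]]),
    start=np.array([0.5, 0.5]),
)
domains = to_domains(path)
print(domains)
```

Adding further states only requires longer `means`, `sds` and `start` arrays and a larger transition matrix.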

Our method is adaptable to a wide range of possible nucleosomal distributions, since it depends neither on a particular pattern nor on a particular spacing between the nucleosomes. Instead, we treat a given distribution as a probability distribution from which Monte Carlo calculations generate a possible nucleosomal profiling pattern that can be directly related to experimental data through the Pearson correlation of the replicates. Indeed, we find that this Pearson correlation is an effective and robust measure of randomness within a domain for the segmentation method as well as for the nucleosomal position calling procedure. We find that the most significant discrepancy between the correct domain boundaries and the HMM-predicted domain boundaries occurs at the borders between domains that differ by only a small degree of randomness (Fig. 5A). However, it should be possible to minimize this discrepancy by including more than just two domains in the HMM model, which is straightforward to implement.
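The idea of sampling profiles from a probability distribution and comparing replicates by Pearson correlation can be sketched as follows. This is a toy model under explicit assumptions, not the procedure in Methods: the mixture of periodic Gaussian peaks with a uniform background, the function names and every parameter value (domain length, spacing, peak width, read number) are chosen only for illustration.

```python
import numpy as np

def mixed_distribution(length, spacing, sigma, randomness):
    """Assumed toy model: a fraction (1 - randomness) of reads falls in periodic
    Gaussian peaks and a fraction `randomness` is uniform over the domain."""
    x = np.arange(length)
    centers = np.arange(spacing // 2, length, spacing)
    peaks = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigma) ** 2).sum(axis=1)
    peaks /= peaks.sum()
    uniform = np.full(length, 1.0 / length)
    return (1.0 - randomness) * peaks + randomness * uniform

def simulate_profile(position_prob, n_reads, rng):
    """Monte Carlo draw of n_reads nucleosome centers; returns counts per bp."""
    positions = rng.choice(len(position_prob), size=n_reads, p=position_prob)
    return np.bincount(positions, minlength=len(position_prob)).astype(float)

def replicate_correlation(position_prob, n_reads, rng):
    """Pearson correlation between two independently simulated replicates."""
    a = simulate_profile(position_prob, n_reads, rng)
    b = simulate_profile(position_prob, n_reads, rng)
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
ordered = mixed_distribution(2000, 165, 10.0, randomness=0.15)
disordered = mixed_distribution(2000, 165, 10.0, randomness=0.75)

r_ordered = replicate_correlation(ordered, 20000, rng)
r_disordered = replicate_correlation(disordered, 20000, rng)
print(f"low randomness:  r = {r_ordered:.2f}")
print(f"high randomness: r = {r_disordered:.2f}")
```

In this sketch the replicate correlation falls as the uniform fraction grows, so a measured correlation can in principle be mapped back to a degree of randomness by simulating over a grid of `randomness` values at the observed read density.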

It is worth mentioning that there are also several methods that characterize chromatin state with epigenetic markers, such as ChromHMM [37], EpiCSeg [38] and Segway [39]. For example, EpiCSeg combines data from multiple histone modifications to learn complex chromatin features and identify epigenetic patterns. Since our method provides additional information on nucleosomal positioning and occupancy, all these methods could be further integrated in the future for a more comprehensive understanding of chromatin structure and function.

Perhaps the most intriguing result from this analysis is the measure of the positional randomness within each nucleosomal domain. The well-known highly ordered PNA in the pericentromeric region of human chromosome 12 actually exhibits ~14% positional randomness (excluding the small RP domains), thus indicating that this distribution is essentially “universal” in this cell population. By contrast, the highly organized yeast genome exhibits primarily domains with two different degrees of randomness, ~40% and ~75%, in both young and old cells. The former appears to be enriched in the TSS region (typically associated with well-ordered PNAs), while the latter is found equally in the TSS and gene body regions (Fig. 6). Thus, regions usually considered equally “well-ordered” (such as the pericentromeric region in human chromosome 12 and the TSS-associated regions) can nonetheless exhibit dramatic differences (~3-fold) in positional randomness, reflecting significant differences in the fraction of the cell population in which the specifically-positioned nucleosomes are present. While the difference between these ordered regions may be owing to the extent of DNA-binding proteins bound at these locations [4,17], we speculate that the domains of 65% to 75% randomness are owing to the underlying DNA sequence, perhaps reflecting the static disorder in the genome and the sequence preferences of nucleosome binding [40,41]. However, since these values also indicate that 25% to 35% of cells have specifically-positioned nucleosomes in these regions, they might also reflect the general “background” activity of chromatin remodelers throughout the genome in these cells.

The examination of the relative occupancy levels within the different domains also uncovered unexpected results in both yeast and human cells. Overall, even though most studies consider the occupancy to be relatively uniform across the genome, except for locally (~200 bp) depleted regions (near the TSS, for example), we find a broad range of occupancy levels genome-wide, with a general maximal difference between the most and least occupied domains of 2 to 3-fold. We suggest that, apart from specific effects on local genomic processes, these differences will also yield large differences in the flexibility of the chromatin, and thus could influence genomic activities indirectly through effects on the genome structure [42].

In addition, in terms of specific genomic processes, we found a strong relationship between a low occupancy status in the gene body and the level of expression in yeast, although more complicated processes appear to be at work in human cells. We speculate that, in yeast, this relationship may be owing to the functioning of the RNA polymerase, which might be expected to displace nucleosomes more frequently in highly transcribed gene bodies than in less transcribed ones. However, the absence of a similar relationship with the TSS-associated domain points to a different mechanism of replacing displaced nucleosomes in the two regions, which can be further investigated.

Supplementary Material

1

Acknowledgments

This work was supported by grants from the National Basic Research Program of China (2018YFC1003501), the National Natural Science Foundation of China (nos. 11374207, 31501054, 31670722, 31971151, 81627801 and 81972909) and the U.S. National Institutes of Health (R01CA204962 and R21AI126308 to J.L.).

Footnotes

Competing interests

The authors have declared no competing interests.

Note: The source code and associated documents are freely available upon request.

References

  • 1. Jiang C, Pugh BF (2009) Nucleosome positioning and gene regulation: advances through genomics. Nature reviews Genetics 10 (3):161–172. doi: 10.1038/nrg2522
  • 2. Radman-Livaja M, Rando OJ (2010) Nucleosome positioning: How is it established, and why does it matter? Dev Biol 339 (2):258–266. doi: 10.1016/j.ydbio.2009.06.012
  • 3. Luger K, Dechassa ML, Tremethick DJ (2012) New insights into nucleosome and chromatin structure: an ordered state or a disordered affair? Nat Rev Mol Cell Bio 13 (7):436–447. doi: 10.1038/nrm3382
  • 4. Struhl K, Segal E (2013) Determinants of nucleosome positioning. Nature structural & molecular biology 20 (3):267–273. doi: 10.1038/nsmb.2506
  • 5. Sekinger EA, Moqtaderi Z, Struhl K (2005) Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast. Mol Cell 18 (6):735–748. doi: 10.1016/j.molcel.2005.05.003
  • 6. Schones DE, Cui KR, Cuddapah S, Roh TY, Barski A, Wang ZB, Wei G, Zhao KJ (2008) Dynamic regulation of nucleosome positioning in the human genome. Cell 132 (5):887–898. doi: 10.1016/j.cell.2008.02.022
  • 7. Cole HA, Howard BH, Clark DJ (2012) Genome-Wide Mapping of Nucleosomes in Yeast Using Paired-End Sequencing. Method Enzymol 513:145–168. doi: 10.1016/B978-0-12-391938-0.00006-9
  • 8. Voong LN, Xi L, Sebeson AC, Xiong B, Wang JP, Wang X (2016) Insights into Nucleosome Organization in Mouse Embryonic Stem Cells through Chemical Mapping. Cell 167 (6):1555–1570 e1515. doi: 10.1016/j.cell.2016.10.049
  • 9. Teif VB (2016) Nucleosome positioning: resources and tools online. Brief Bioinform 17 (5):745–757. doi: 10.1093/bib/bbv086
  • 10. van der Heijden T, van Vugt JJFA, Logie C, van Noort J (2012) Sequence-based prediction of single nucleosome positioning and genome-wide nucleosome occupancy. P Natl Acad Sci USA 109 (38):E2514–E2522. doi: 10.1073/pnas.1205659109
  • 11. Bai L, Morozov AV (2010) Gene regulation by nucleosome positioning. Trends Genet 26 (11):476–483. doi: 10.1016/j.tig.2010.08.003
  • 12. Lantermann AB, Straub T, Stralfors A, Yuan GC, Ekwall K, Korber P (2010) Schizosaccharomyces pombe genome-wide nucleosome mapping reveals positioning mechanisms distinct from those of Saccharomyces cerevisiae. Nature structural & molecular biology 17 (2):251–U215. doi: 10.1038/nsmb.1741
  • 13. Wu Y, Zhang W, Jiang J (2014) Genome-wide nucleosome positioning is orchestrated by genomic regions associated with DNase I hypersensitivity in rice. Plos Genet 10 (5):e1004378. doi: 10.1371/journal.pgen.1004378
  • 14. Baldi S, Jain DS, Harpprecht L, Zabel A, Scheibe M, Butter F, Straub T, Becker PB (2018) Genome-wide Rules of Nucleosome Phasing in Drosophila. Mol Cell 72 (4):661–672 e664. doi: 10.1016/j.molcel.2018.09.032
  • 15. Gaffney DJ, McVicker G, Pai AA, Fondufe-Mittendorf YN, Lewellen N, Michelini K, Widom J, Gilad Y, Pritchard JK (2012) Controls of nucleosome positioning in the human genome. Plos Genet 8 (11):e1003036. doi: 10.1371/journal.pgen.1003036
  • 16. Jiang C, Pugh BF (2009) A compiled and systematic reference map of nucleosome positions across the Saccharomyces cerevisiae genome. Genome Biol 10 (10):R109. doi: 10.1186/gb-2009-10-10-r109
  • 17. Yan C, Chen H, Bai L (2018) Systematic Study of Nucleosome-Displacing Factors in Budding Yeast. Mol Cell 71 (2):294–305 e294. doi: 10.1016/j.molcel.2018.06.017
  • 18. Chen K, Xi Y, Pan X, Li Z, Kaestner K, Tyler J, Dent S, He X, Li W (2013) DANPOS: dynamic analysis of nucleosome position and occupancy by sequencing. Genome Res 23 (2):341–351. doi: 10.1101/gr.142067.112
  • 19. Fu K, Tang Q, Feng J, Liu XS, Zhang Y (2012) DiNuP: a systematic approach to identify regions of differential nucleosome positioning. Bioinformatics 28 (15):1965–1971. doi: 10.1093/bioinformatics/bts329
  • 20. Flores O, Orozco M (2011) nucleR: a package for non-parametric nucleosome positioning. Bioinformatics 27 (15):2149–2150. doi: 10.1093/bioinformatics/btr345
  • 21. Xi L, Brogaard K, Zhang Q, Lindsay B, Widom J, Wang JP (2014) A locally convoluted cluster model for nucleosome positioning signals in chemical map. Journal of the American Statistical Association 109 (505):48–62. doi: 10.1080/01621459.2013.862169
  • 22. Elowitz MB, Levine AJ, Siggia ED, Swain PS (2002) Stochastic gene expression in a single cell. Science 297 (5584):1183–1186. doi: 10.1126/science.1070919
  • 23. Golding I, Paulsson J, Zawilski SM, Cox EC (2005) Real-time kinetics of gene activity in individual bacteria. Cell 123 (6):1025–1036. doi: 10.1016/j.cell.2005.09.031
  • 24. Miura H, Takahashi S, Poonperm R, Tanigawa A, Takebayashi SI, Hiratani I (2019) Single-cell DNA replication profiling identifies spatiotemporal developmental dynamics of chromosome organization. Nature genetics. doi: 10.1038/s41588-019-0474-z
  • 25. Feng JH, Dai XH, Dai ZM, Xiang QA, Wang JA, Deng YY, He CS (2010) A simulation model for nucleosome distribution in the yeast genome based on integrated cross-platform positioning datasets. Math Comput Model 52 (11–12):1932–1939. doi: 10.1016/j.mcm.2010.03.043
  • 26. Birney E (2001) Hidden Markov models in biological sequence analysis. Ibm J Res Dev 45 (3–4):449–454. doi: 10.1147/Rd.453.0449
  • 27. Yoon BJ (2009) Hidden Markov Models and their Applications in Biological Sequence Analysis. Current genomics 10 (6):402–415. doi: 10.2174/138920209789177575
  • 28. Valouev A, Johnson SM, Boyd SD, Smith CL, Fire AZ, Sidow A (2011) Determinants of nucleosome organization in primary human cells. Nature 474 (7352):516–U148. doi: 10.1038/nature10002
  • 29. Hu Z, Chen K, Xia Z, Chavez M, Pal S, Seol JH, Chen CC, Li W, Tyler JK (2014) Nucleosome loss leads to global transcriptional up-regulation and genomic instability during yeast aging. Genes & development 28 (4):396–408. doi: 10.1101/gad.233221.113
  • 30. Kasowski M, Kyriazopoulou-Panagiotopoulou S, Grubert F, Zaugg JB, Kundaje A, Liu YL, Boyle AP, Zhang QC, Zakharia F, Spacek DV, Li JJ, Xie D, Olarerin-George A, Steinmetz LM, Hogenesch JB, Kellis M, Batzoglou S, Snyder M (2013) Extensive Variation in Chromatin States Across Humans. Science 342 (6159):750–752. doi: 10.1126/science.1242510
  • 31. Munoz S, Minamino M, Casas-Delucchi CS, Patel H, Uhlmann F (2019) A Role for Chromatin Remodeling in Cohesin Loading onto Chromosomes. Mol Cell 74 (4):664–673 e665. doi: 10.1016/j.molcel.2019.02.027
  • 32. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y (2008) RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18 (9):1509–1517. doi: 10.1101/gr.079558.108
  • 33. Ghaffari N, Yousefi MR, Johnson CD, Ivanov I, Dougherty ER (2013) Modeling the next generation sequencing sample processing pipeline for the purposes of classification. Bmc Bioinformatics 14:307. doi: 10.1186/1471-2105-14-307
  • 34. Jansen A, Verstrepen KJ (2011) Nucleosome positioning in Saccharomyces cerevisiae. Microbiology and molecular biology reviews : MMBR 75 (2):301–320. doi: 10.1128/MMBR.00046-10
  • 35. Ganguli D, Chereji RV, Iben JR, Cole HA, Clark DJ (2014) RSC-dependent constructive and destructive interference between opposing arrays of phased nucleosomes in yeast. Genome Res 24 (10):1637–1649. doi: 10.1101/gr.177014.114
  • 36. Chereji RV, Ramachandran S, Bryson TD, Henikoff S (2018) Precise genome-wide mapping of single nucleosomes and linkers in vivo. Genome Biol 19 (1):19. doi: 10.1186/s13059-018-1398-0
  • 37. Ernst J, Kellis M (2012) ChromHMM: automating chromatin-state discovery and characterization. Nature methods 9 (3):215–216. doi: 10.1038/nmeth.1906
  • 38. Mammana A, Chung HR (2015) Chromatin segmentation based on a probabilistic model for read counts explains a large portion of the epigenome. Genome Biol 16:151. doi: 10.1186/s13059-015-0708-z
  • 39. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods 9 (5):473–476. doi: 10.1038/nmeth.1937
  • 40. Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP, Widom J (2006) A genomic code for nucleosome positioning. Nature 442 (7104):772–778. doi: 10.1038/nature04979
  • 41. Kaplan N, Moore IK, Fondufe-Mittendorf Y, Gossett AJ, Tillo D, Field Y, LeProust EM, Hughes TR, Lieb JD, Widom J, Segal E (2009) The DNA-encoded nucleosome organization of a eukaryotic genome. Nature 458 (7236):362–U129. doi: 10.1038/nature07667
  • 42. Dekker J (2008) Mapping in Vivo Chromatin Interactions in Yeast Suggests an Extended Chromatin Fiber with Regional Variation in Compaction. J Biol Chem 283 (50):34532–34540. doi: 10.1074/jbc.M806479200
