Journal of Computational Biology. 2022 Feb 16;29(2):106–120. doi: 10.1089/cmb.2021.0439

Normalizing Metagenomic Hi-C Data and Detecting Spurious Contacts Using Zero-Inflated Negative Binomial Regression

Yuxuan Du 1, Sarah M Laperriere 2, Jed Fuhrman 2, Fengzhu Sun 1
PMCID: PMC8892984  PMID: 35020412

Abstract

High-throughput chromosome conformation capture (Hi-C) has recently been applied to natural microbial communities and has shown great potential for studying multiple genomes simultaneously. Several extraneous factors may influence chromosomal contacts, rendering the normalization of Hi-C contact maps essential for downstream analyses. However, the current paucity of metagenomic Hi-C normalization methods and the neglect of spurious interspecies contacts weaken the interpretability of the data. Here, we report on two types of biases in metagenomic Hi-C experiments, explicit biases and implicit biases, and introduce HiCzin, a parametric model to correct both types of biases and remove spurious interspecies contacts. We demonstrate that metagenomic Hi-C contact maps normalized by HiCzin show lower biases, a higher capability to detect spurious contacts, and better performance in metagenomic contig clustering.

Keywords: metagenomic Hi-C, normalization, spurious contact detection

1. INTRODUCTION

High-throughput chromosome conformation capture (Hi-C) is a DNA proximity ligation approach with many applications in the investigation of genomic structures, DNA interactions, and even the characterization of virus–host interactions from metagenomes (Lieberman-Aiden et al., 2009; Bickhart et al., 2019; Dudchenko et al., 2017). In Hi-C experiments, chimeric junctions are formed between pieces of DNA in close proximity within cells and then subjected to paired-end sequencing, generating millions of paired-end reads linking DNA fragments (Lieberman-Aiden et al., 2009).

The number of reads connecting two DNA fragments is significantly related to the probability of contact between genomic loci in the three-dimensional structure at a fixed time point. The Hi-C technique reveals the compartment property of mammalian genomes (Lieberman-Aiden et al., 2009), identifies topologically associated domains (Dixon et al., 2012), and reconstructs haplotypes (Selvaraj et al., 2013).

Most recently, the Hi-C technique has been applied to the metagenomic domain (metagenomic Hi-C), and a series of Hi-C experiments have been conducted for microbial communities rather than a single species (Press et al., 2017; DeMaere and Darling, 2019). Combined with the traditional shotgun sequencing, the metagenomic Hi-C technique has displayed a powerful ability to probe virus–host interactions (Bickhart et al., 2019), simultaneously retrieve multiple genomes (Burton et al., 2014), deconvolute assembled contigs from whole-genome shotgun (WGS) sequencing data into genome bins in both simulated and real microbial communities (Baudry et al., 2019), and track horizontal gene transfer (Yaffe and Relman, 2020).

However, Hi-C interaction counts are subject to strong experimental biases (Yaffe and Tanay, 2011); therefore, normalizing Hi-C data is essential to remove these biases. Although multiple strategies have been put forward (Hu et al., 2012; Imakaev et al., 2012), most of these normalization methods aim to normalize Hi-C data derived from a single species, mainly human cells, and are not suitable for metagenomic Hi-C data from complex communities. This is mainly because the potential sources of bias for metagenomic Hi-C data differ from those for Hi-C data within individual species.

In addition, it is not valid to assume that all contigs have equal visibility in metagenomic Hi-C data, because the relative abundance of the different species can vary. Several relatively simple metagenomic Hi-C normalization methods have been developed. ProxiMeta (Press et al., 2017) normalized the raw Hi-C counts by accounting for the estimated abundance of the contigs, and further took the number of restriction sites on the contigs into consideration (Stalder et al., 2019). As a proprietary metagenomic genome binning platform without an open-source pipeline, ProxiMeta did not clarify the normalization algorithms in detail.

Beitel et al. (2014) divided raw interaction counts by the product of the length of two contigs. MetaTOR (Baudry et al., 2019) normalized raw counts by the geometric mean of the contigs’ coverage. Metaphase (Burton et al., 2014) and Bin3C (DeMaere and Darling, 2019) divided raw Hi-C counts by the product of the number of restriction sites and Bin3C used the Knight–Ruiz algorithm (Knight and Ruiz, 2013) to construct a general doubly stochastic matrix after the first step correction. We show that these normalization methods are not effective in removing all biases. In addition, the biases of spurious interspecies contacts are ignored for metagenomic Hi-C data by all these normalization methods, considerably weakening the interpretability of the Hi-C data (Stalder et al., 2019).

Here we first comprehensively discuss potential experimental biases for metagenomic Hi-C data, and then propose HiCzin, a method to normalize metagenomic Hi-C data based on the zero-inflated negative binomial regression frameworks (Yau et al., 2003). We also develop a hybrid statistical method to detect spurious interspecies contacts. We show that the normalized metagenomic Hi-C contact maps by HiCzin lead to lower biases, higher ability to detect spurious contacts, and better performance in metagenomic contig clustering on the published metagenomic Hi-C data set.

2. METHODS

2.1. Framework of applying HiCzin to metagenomic Hi-C experiments

The workflow of HiCzin utilized in the metagenomic Hi-C analysis is shown in Figure 1. In metagenomic Hi-C experiments, short reads are obtained by shotgun sequencing from microbial communities. At the same time, metagenomic Hi-C sequencing reads are generated from the same sample. Contigs are assembled from the shotgun short reads and Hi-C reads are mapped to the assembled contigs to construct raw contact maps consisting of the number of Hi-C reads mapped to contig pairs. Then, HiCzin is used to normalize raw contact maps and discard spurious contacts. Finally, downstream analysis can be conducted on the basis of normalized contact maps by HiCzin.

FIG. 1.


Workflow of HiCzin utilized in metagenomic Hi-C analysis.

2.2. Calculating the coverage of assembled contigs

The coverage of contigs was computed using the MetaBAT (Kang et al., 2015) v2.12.5 script “jgi_summarize_bam_contig_depths.”

2.3. Applying TAXAassign to generate sample data of the intraspecies contacts

The taxonomic assignment of contigs was resolved using NCBI's Taxonomy and its nt database by TAXAassign(v0.4) with parameters “-p -c 20 -r 10 -m 98 -q 98 -t 95 -a ‘60,70,80,95,95,98’ -f.” Assignment results with “unclassified” at the species level were discarded, and only deterministic results of taxonomic assignment at the species level were kept. Intraspecies pairs were subsequently generated by pairwise combining contigs assigned to the same species, and corresponding contacts were treated as samples to fit the HiCzin model.
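The pairing step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `labels` and `raw_contacts` dictionaries, and the convention that pair keys are stored with contig names in sorted order are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def intraspecies_pairs(labels):
    """Return all unordered contig pairs whose two contigs share a species label.

    `labels` maps contig name -> species name. Contigs without a
    species-level assignment are assumed to have been dropped already.
    """
    by_species = defaultdict(list)
    for contig, species in labels.items():
        by_species[species].append(contig)
    pairs = []
    for contigs in by_species.values():
        # Sorting keeps each pair in a canonical (name-sorted) order.
        pairs.extend(combinations(sorted(contigs), 2))
    return pairs


def sample_contacts(pairs, raw_contacts):
    """Attach the raw Hi-C count to each pair; unlinked pairs count as zero."""
    return {pair: raw_contacts.get(pair, 0) for pair in pairs}


# Hypothetical toy input: five labeled contigs and three observed links.
labels = {"c1": "sp_A", "c2": "sp_A", "c3": "sp_A", "c4": "sp_B", "c5": "sp_B"}
raw_contacts = {("c1", "c2"): 7, ("c4", "c5"): 3, ("c1", "c4"): 1}

pairs = intraspecies_pairs(labels)
samples = sample_contacts(pairs, raw_contacts)
```

Note that the interspecies link `("c1", "c4")` never enters the sample, while intraspecies pairs with no observed link enter as zeros, which is what makes the zero-inflated part of the model estimable.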

2.4. Normalization via the HiCzin model

Based on the zero-inflated generalized linear mixed framework (Lambert, 1992), the HiCzin is a two-component mixture model combining a mass point at zero with a count distribution. Specifically, within the intraspecies contacts, zero contacts may come from two sources: the count distribution, showing that these zeros are observations of the population of the intraspecies contacts and no interactions happened, or the zero mass points, indicating that Hi-C interactions happened, but the observations of the interactions were lost due to certain kinds of experimental noise.

Formally, denote the population of the intraspecies contacts as a random variable Y. The basic assumption of the HiCzin model is that Y follows the negative binomial distribution. Let πij denote the probability of unobserved contacts and Zij denote a zero-inflated random variable of the intraspecies contacts between the ith contig and the jth contig. Then the random variable Zij is given by the following:

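$$
Z_{ij} \sim
\begin{cases}
0, & \text{with probability } \pi_{ij}, \\
\mathrm{NB}(\mu_{ij}, \theta_{ij}), & \text{with probability } 1 - \pi_{ij},
\end{cases}
\tag{1}
$$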

where $\mathrm{NB}(\mu_{ij},\theta_{ij})$ is the negative binomial distribution with mean $\mu_{ij}$ and shape parameter $\theta_{ij}$.

Therefore, the zero-inflated density of Zij is the result of mixing a negative binomial distribution and a degenerate distribution at zero as follows:

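$$
P(Z_{ij} = z) =
\begin{cases}
\pi_{ij} + (1-\pi_{ij}) \left(\dfrac{\theta_{ij}}{\mu_{ij}+\theta_{ij}}\right)^{\theta_{ij}}, & z = 0, \\[2ex]
(1-\pi_{ij}) \dfrac{\Gamma(z+\theta_{ij})}{\Gamma(\theta_{ij})\, z!} \left(\dfrac{\theta_{ij}}{\mu_{ij}+\theta_{ij}}\right)^{\theta_{ij}} \left(\dfrac{\mu_{ij}}{\mu_{ij}+\theta_{ij}}\right)^{z}, & z = 1, 2, \ldots,
\end{cases}
\tag{2}
$$

where $\Gamma(\cdot)$ is the gamma function. The random variable $Z_{ij}$ reduces to a negative binomial random variable when $\pi_{ij}=0$.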
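The two-component mixture of Equation (1) can be evaluated numerically as in the sketch below. It assumes the mean/shape parameterization of the negative binomial (variance $\mu + \mu^2/\theta$); the function names are illustrative and the parameter values in the usage line are arbitrary.

```python
import math


def nb_pmf(z, mu, theta):
    """Negative binomial pmf with mean `mu` and shape `theta`.

    Computed on the log scale with lgamma to avoid overflow for large counts.
    """
    log_p = (
        math.lgamma(z + theta) - math.lgamma(theta) - math.lgamma(z + 1)
        + theta * math.log(theta / (theta + mu))
        + z * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_pmf(z, mu, theta, pi):
    """Zero-inflated NB pmf: a point mass `pi` at zero mixed with NB(mu, theta)."""
    point_mass = pi if z == 0 else 0.0
    return point_mass + (1.0 - pi) * nb_pmf(z, mu, theta)


# Arbitrary illustrative parameters: probability of observing a zero count.
p_zero = zinb_pmf(0, mu=3.0, theta=1.5, pi=0.2)
```

With `pi = 0` the function returns exactly the negative binomial pmf, and the mean of the mixture is `(1 - pi) * mu`.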

We assume that the parameters $\mu_{ij}$ and $\pi_{ij}$ depend on the three factors of explicit biases, while $\theta_{ij}$ is treated as a constant $\theta$ that does not depend on these factors. Define $s_k$, $l_k$, and $c_k$ as the number of restriction sites, the length, and the coverage of the $k$th contig, respectively. As Lord et al. (2008) suggested, link functions in generalized linear models are used to model the dependence of $\mu_{ij}$ and $\pi_{ij}$ on the three factors of explicit biases. Specifically, we propose that $\mu_{ij}$ is related to the three factors by the logarithmic link, that is,

$$
\log(\mu_{ij}) = \beta_0 + \beta_s \log(s_i s_j) + \beta_l \log(l_i l_j) + \beta_c \log(c_i c_j). \tag{3}
$$

We also propose that $\pi_{ij}$ is related to the three factors by the logistic link, that is,

$$
\log\frac{\pi_{ij}}{1-\pi_{ij}} = \gamma_0 + \gamma_s \log(s_i s_j) + \gamma_l \log(l_i l_j) + \gamma_c \log(c_i c_j). \tag{4}
$$

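Let $\mu_{Z_{ij}}$ denote the mean of the random variable $Z_{ij}$. Then, the corresponding regression equation for $\mu_{Z_{ij}}$ is

$$
\log(\mu_{Z_{ij}}) = \log(1-\pi_{ij}) + \beta_0 + \beta_s \log(s_i s_j) + \beta_l \log(l_i l_j) + \beta_c \log(c_i c_j). \tag{5}
$$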
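The step from Equations (1) and (3) to Equation (5) is short enough to spell out. The derivation below only uses the mixture definition and the logarithmic link already stated; no additional assumptions are introduced.

```latex
\begin{aligned}
\mu_{Z_{ij}} = E(Z_{ij})
  &= \pi_{ij} \cdot 0 + (1 - \pi_{ij}) \cdot \mu_{ij}
   = (1 - \pi_{ij})\,\mu_{ij}, \\
\log(\mu_{Z_{ij}})
  &= \log(1 - \pi_{ij}) + \log(\mu_{ij}) \\
  &= \log(1 - \pi_{ij}) + \beta_0 + \beta_s \log(s_i s_j)
     + \beta_l \log(l_i l_j) + \beta_c \log(c_i c_j).
\end{aligned}
```

The term $\log(1-\pi_{ij})$ is always nonpositive, so the zero-inflation component can only lower the expected observed count relative to the negative binomial mean.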

The overall model parameters $\beta=(\beta_0,\beta_s,\beta_l,\beta_c)$, $\gamma=(\gamma_0,\gamma_s,\gamma_l,\gamma_c)$, and the additional dispersion parameter $\theta$ can be estimated by maximum likelihood (ML) using the R package “glmmTMB” (Brooks et al., 2017).

Finally, the residuals of the counting part are the normalized metagenomic Hi-C contacts, that is,

$$
e_{ij} = \frac{z_{ij}}{\hat{\mu}_{ij}}. \tag{6}
$$

Hence, given sample data of the intraspecies contacts, our HiCzin model can integrate all three factors of explicit biases. The influence of unobserved interactions is taken into account at the same time, so that it does not bias the estimation and prediction.
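Once the coefficients have been estimated, applying the model to a contig pair is a matter of evaluating the two link functions. The sketch below assumes that the normalized contact is the observed count divided by the fitted negative binomial mean, and that all factor values are strictly positive so the logarithms are defined. The function names and the coefficient values in the usage lines are hypothetical placeholders, not fitted estimates.

```python
import math


def fitted_mean(beta, sites, lengths, covs):
    """Fitted NB mean for one contig pair via the log link of Equation (3).

    `beta` is (beta_0, beta_s, beta_l, beta_c); `sites`, `lengths`, and `covs`
    are (value_i, value_j) tuples for the two contigs.
    """
    b0, bs, bl, bc = beta
    log_mu = (
        b0
        + bs * math.log(sites[0] * sites[1])
        + bl * math.log(lengths[0] * lengths[1])
        + bc * math.log(covs[0] * covs[1])
    )
    return math.exp(log_mu)


def zero_inflation_prob(gamma, sites, lengths, covs):
    """Probability of an unobserved contact via the logistic link of Equation (4)."""
    g0, gs, gl, gc = gamma
    eta = (
        g0
        + gs * math.log(sites[0] * sites[1])
        + gl * math.log(lengths[0] * lengths[1])
        + gc * math.log(covs[0] * covs[1])
    )
    return 1.0 / (1.0 + math.exp(-eta))


def normalized_contact(raw_count, mu_hat):
    """Normalized contact, assumed here to be the ratio raw / fitted mean."""
    return raw_count / mu_hat


# Hypothetical coefficients chosen only to make the arithmetic easy to follow.
mu_hat = fitted_mean((0.0, 1.0, 0.0, 0.0), sites=(2, 3), lengths=(1000, 2000), covs=(5.0, 8.0))
normalized = normalized_contact(12, mu_hat)
```

With these placeholder coefficients only the restriction-site term contributes, so the fitted mean is 2 × 3 = 6 and a raw count of 12 normalizes to 2.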

2.5. Spurious contact detection by a hybrid statistical method based on HiCzin

From the HiCzin model, the intraspecies contact $Y_{ij}$ follows the negative binomial distribution with mean $\hat{\mu}_{ij}$ and shape $\hat{\theta}$. Given any contig pair with nonzero contacts, we denote the value of the observed raw contacts as $O_{ij}$ and the expected contacts under the condition that the two contigs come from the same species as $E_{ij}$, where $E_{ij}=E(Y_{ij})=\hat{\mu}_{ij}$. We define the enrichment score as $S_{ij}=\log(O_{ij}/E_{ij})$.

Under our statistical framework, we also design a hypothesis test to detect spurious contacts. Since observations of Hi-C interactions need to be protected, the null hypothesis of the test is that $O_{ij}$ belongs to the intraspecies contacts, while the alternative hypothesis is that $O_{ij}$ belongs to the spurious interspecies contacts. We directly regard $O_{ij}$ as the test statistic, with $O_{ij} \sim Y_{ij}$ under the null hypothesis. We choose the one-tailed test and calculate the p-value of $O_{ij}$ as follows:

$$
p_{ij} = \Pr(Y_{ij} \le O_{ij}). \tag{7}
$$

Then, we develop a hybrid statistical method to detect spurious contacts. We choose a threshold t for the enrichment score and a significance level α for the hypothesis test. Contacts of contig pairs whose enrichment score is less than t or the p-value is less than α will be regarded as spurious contacts and then discarded.

To determine the threshold and the significance level, we assume that the percentiles of the enrichment score and the p-value of the valid contacts in our sample data are similar to those in the whole data and preselect a percentage (default 10%) reflecting the acceptable fraction of losses of the valid contacts. Taking advantage of our generated sample data of the intraspecies contact, we can determine the threshold t and significance level α such that less than the preselected percentage of valid contacts in sample data is incorrectly identified as spurious contacts for both methods, respectively.

Based on our assumption, we suppose that around the same percentage of valid contacts in the whole data might be mistakenly discarded. Therefore, thresholds can be chosen strictly enough to detect most spurious contacts while avoiding the incorrect removal of a large proportion of valid contacts in the whole data.
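The hybrid rule can be sketched in a few functions. The sketch assumes that the one-tailed test uses the lower tail (small observed counts relative to the fitted mean are suspicious) and that each threshold is taken as an empirical lower percentile of the labeled intraspecies sample; `lower_cutoff` and the other names are illustrative, not part of the HiCzin software.

```python
import math


def nb_pmf(k, mu, theta):
    """Negative binomial pmf with mean `mu` and shape `theta`."""
    return math.exp(
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def p_value(observed, mu_hat, theta_hat):
    """Lower-tail probability Pr(Y <= observed) under the fitted NB (assumed tail)."""
    return min(1.0, sum(nb_pmf(k, mu_hat, theta_hat) for k in range(observed + 1)))


def enrichment_score(observed, mu_hat):
    """Log ratio of observed to expected contacts."""
    return math.log(observed / mu_hat)


def lower_cutoff(values, fraction):
    """Cutoff such that no more than `fraction` of `values` lie strictly below it."""
    ordered = sorted(values)
    k = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[k]


def is_spurious(observed, mu_hat, theta_hat, t, alpha):
    """Flag a contact when either the score or the p-value falls below its threshold."""
    return enrichment_score(observed, mu_hat) < t or p_value(observed, mu_hat, theta_hat) < alpha
```

In use, `t` would come from `lower_cutoff` applied to the enrichment scores of the labeled valid contacts and `alpha` from `lower_cutoff` applied to their p-values, both with the preselected fraction (default 0.10). Because the two criteria are combined with a logical OR, the fraction of valid contacts removed by the hybrid rule can exceed the preselected fraction somewhat.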

2.6. Generalizing the HiCzin by selecting different independent variables

Let $\{x_k\}_{k=1}^{n}$ denote the set of factors. Then, we modify the regression equation in Equation (5) as follows:

$$
\log(\mu_{Z_{ij}}) = \log(1-\pi_{ij}) + \beta_0 + \sum_{k=1}^{n} \beta_k \log(x_{ik} x_{jk}), \tag{8}
$$

where $\pi_{ij}$ in Equation (4) is modified as follows:

$$
\log\frac{\pi_{ij}}{1-\pi_{ij}} = \gamma_0 + \sum_{k=1}^{n} \gamma_k \log(x_{ik} x_{jk}). \tag{9}
$$

Then, the “glmmTMB” (Brooks et al., 2017) package is used to estimate $\{\beta_k\}_{k=0}^{n}$, $\{\gamma_k\}_{k=0}^{n}$, and $\theta$, and the residuals of the counting part are considered normalized contacts.

2.7. HiCzin normalization without labeled contigs

As samples of the intraspecies contacts cannot be obtained in some scenarios, we regard all nonzero raw contacts as our sample data. Although these samples contain both valid contacts and spurious contacts, the number of valid contacts is supposed to be much larger than that of spurious contacts, and thus we suppose that spurious contacts will not result in significant biases in parameter estimation. Moreover, as we do not have zero contacts to fit the zero-inflated part, one option is to set $\pi_{ij}$ as a constant parameter, that is,

$$
\mathrm{logit}(\pi_{ij}) = \log\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) = \gamma. \tag{10}
$$

Unknown parameters are estimated by ML using the “glmmTMB” (Brooks et al., 2017) package, and the residuals are considered normalized contacts.

3. RESULTS

3.1. Source of biases in metagenomic Hi-C experiments

In addition to chromosomal contacts of interest, several other factors unrelated to chromosomal contacts can also influence the number of Hi-C interactions between contigs (Yaffe and Tanay, 2011). We refer to such factors as biases. We report on two kinds of biases with substantial influences on metagenomic Hi-C contact maps: explicit biases and implicit biases. Explicit biases include three potential factors: (1) the number of enzymatic restriction sites on contigs, (2) contig length, and (3) contig coverage (Beitel et al., 2014; Burton et al., 2014; Press et al., 2017), all of which can be observed.

Implicit biases include unobserved interactions and spurious interspecies contacts. Unobserved interactions are chimerical DNA fragments that are missed due to factors such as the mappability of contigs and in vivo constraints on accessibility. Spurious interspecies contacts arise from the ligation of DNA fragments between closely related species (Stalder et al., 2019). As implicit biases are unobservable, it is challenging to detect and correct implicit biases.

3.2. Analyses of experimental biases in synthetic metagenomic yeast samples

We analyzed metagenomic yeast (M-Y) samples, consisting of 16 yeast strains (BioProject: PRJNA245328) (Burton et al., 2014). After processing the raw WGS and Hi-C reads, we generated raw Hi-C contact maps for 6196 assembled contigs (Supplementary Data). Reference genomes of these 16 yeast strains were downloaded (Supplementary Table S1). To determine the true species identity of the assembled contigs, contigs were aligned to reference genomes at the species level by BLASTn (Altschul et al., 1990) (Supplementary Data). Thirty-seven contigs (0.6%) could not be aligned to the reference genomes and were not considered in the following analyses (Supplementary Fig. S1).

According to the alignment results to the reference genomes, we refer to contig pairs from the same species and different species as intraspecies pairs and interspecies pairs, respectively. Interaction counts of intraspecies pairs and interspecies pairs are defined as valid contacts and spurious contacts, respectively. In particular, we denote zero contacts if no interaction was observed between intraspecies pairs; hence, the intraspecies contacts, corresponding to intraspecies pairs, are composed of valid contacts and zero contacts. Valid contacts imply a high probability of contig pairs belonging to the same genome, while spurious contacts confound the interpretation of the Hi-C data.

Raw interaction counts were enriched between pairs of contigs with a high number of restriction sites, long contigs, and/or contigs with high coverage (Fig. 2), which can be explained as follows. Longer contigs may have higher ligation efficiencies with other contigs than shorter contigs, more restriction sites are likely to increase the probability of enzymatic cuts within DNA fragments, and higher coverages, representing higher concentrations of contigs, can result in more Hi-C interactions between contigs. The Pearson correlation coefficients between raw valid contacts and the product of the number of restriction sites, the length, and the coverage for each pair of contigs were 0.429, 0.400, and 0.184, respectively, showing that all three factors were positively correlated with valid contacts, with coverage correlated more weakly than the other two.
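The bias diagnostic used here is a plain Pearson correlation between the contact values and the pairwise product of one factor. A minimal sketch follows; the helper names and the three-contig toy data are invented for illustration and do not come from the M-Y samples.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def factor_products(pairs, factor):
    """Product of a per-contig factor (sites, length, or coverage) for each pair."""
    return [factor[i] * factor[j] for i, j in pairs]


# Hypothetical toy data: contacts exactly proportional to the length product.
pairs = [("a", "b"), ("a", "c"), ("b", "c")]
length = {"a": 1.0, "b": 2.0, "c": 3.0}
contacts = [4.0, 6.0, 12.0]

r = pearson(contacts, factor_products(pairs, length))
```

Running the same computation on normalized rather than raw contacts gives the kind of residual-bias measure reported for each normalization method: a value near zero means the factor no longer explains the contact values.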

FIG. 2.


Relationship between raw interaction counts and the product of the number of restriction sites, length, and coverage between contig pairs.

As for implicit biases, one remarkable phenomenon for intraspecies contacts was the presence of excess zeros, meaning that zero contacts account for a large proportion of intraspecies contacts. Valid contacts (i.e., nonzero intraspecies contacts) made up only 14.9% of all intraspecies contacts, suggesting that unobserved interactions caused by experimental noise are likely to exist. Spurious contacts made up 25.5% of all nonzero contacts, which could not be neglected for the M-Y samples.

3.3. Normalization methods in the publicly available metagenomic Hi-C analysis pipelines

Because of the aforementioned experimental biases, it is necessary to normalize the raw Hi-C contacts before downstream analyses such as clustering and tracking virus–host interactions. Most of the currently available pipelines divide the raw Hi-C interactions by the product of one factor of explicit biases to normalize raw Hi-C contacts, which we refer to as naive normalization methods (Beitel et al., 2014; Press et al., 2017; Baudry et al., 2019). These naive normalization methods correct only part of the explicit biases, and the unnormalized factors of explicit biases might still be highly correlated with Hi-C contact maps.
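A naive normalization of this kind amounts to a single division per contig pair, as the following sketch shows. The function name and toy values are illustrative; individual pipelines differ in detail (for example, in using a geometric mean rather than a product).

```python
def naive_normalize(raw_contacts, factor):
    """Divide each raw count by the product of one per-contig factor.

    `raw_contacts` maps (contig_i, contig_j) -> raw Hi-C count, and `factor`
    maps contig -> restriction sites, length, or coverage.
    """
    return {
        (i, j): count / (factor[i] * factor[j])
        for (i, j), count in raw_contacts.items()
    }


# Hypothetical toy input: normalize by the number of restriction sites only.
raw_contacts = {("a", "b"): 12, ("a", "c"): 30}
sites = {"a": 2, "b": 3, "c": 5}

by_site = naive_normalize(raw_contacts, sites)
```

Because only one factor enters the denominator, any dependence of the counts on the other two factors passes through untouched, which is the limitation the paragraph above points out.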

As for the two-stage normalization method in Bin3C (DeMaere and Darling, 2019), equal visibility for all regions is a basic theoretical assumption for utilizing the matrix balancing algorithm they use to recover normalized Hi-C matrices (Imakaev et al., 2012), yet this assumption is not satisfied for metagenomic assembled contigs with huge differences in length and abundance. Moreover, all these normalization methods ignored the influence of implicit biases and did not attempt to detect and remove the spurious interspecies contacts. Therefore, it is imperative to develop new normalization methods to overcome these shortcomings.

3.4. Removing explicit biases and spurious contacts using zero-inflated negative binomial regression

The Poisson and negative binomial regression models are widely used in fitting count data and have been successfully used in fitting Hi-C interactions of human cells (Hu et al., 2012). Therefore, there is potential to apply frameworks based on Poisson or negative binomial regression to normalize metagenomic Hi-C data. Here we model the population of the intraspecies contacts using the negative binomial distribution rather than the Poisson distribution because Hi-C data are always overdispersed (Hu et al., 2012). In the classical negative binomial regression model, we can fit the model given sample data of the intraspecies contacts by regarding the factors of biases as predictor variables and the intraspecies contacts as the response variable. Then, the residuals of this conventional model serve as normalized contacts.

However, some underlying interactions may not be observed in Hi-C experiments due to the limited quantity of Hi-C reads and problems in mapping Hi-C reads to the contigs. Ignoring such influences may lead to seriously biased estimation and prediction. In addition, although classical negative binomial models can capture the property of overdispersion, they are not sufficient for modeling the excess zeros observed in the Hi-C contact maps. To solve these problems, we developed HiCzin, a novel metagenomic Hi-C normalization method based on zero-inflated negative binomial regression frameworks (Yau et al., 2003), combining the counting distribution of the intraspecies contacts with a mass distribution of unobserved contacts. The residuals of the counting part serve as normalized contacts (see Section 2).

Compared with raw valid contacts, the average value of raw spurious contacts was smaller (Fig. 3a), while the average number of restriction sites, length, and coverage of the contigs involved were significantly larger (Fig. 3b–d). This evidence indicated that spurious interspecies contacts were more likely to be generated for longer contigs with more restriction sites and higher abundances. Therefore, we expect the magnitude of spurious contacts normalized by the factors of explicit biases to be significantly smaller than that of the normalized valid contacts.

FIG. 3.


Comparison of (a) the raw counts of spurious contacts and valid contacts, (b) the number of restriction sites of spurious contacts and valid contacts, (c) the length of spurious contacts and valid contacts, (d) the coverage of spurious contacts and valid contacts.

Thus, a basic idea is to discard the normalized contacts whose values are less than a selected threshold as spurious contacts (Stalder et al., 2019). However, determining the threshold is extremely challenging. Based on our HiCzin normalization model, we develop a hybrid statistical method to detect spurious contacts and determine thresholds (see Section 2).

3.5. Applying the HiCzin model to the M-Y samples

To fit the HiCzin model, samples of the intraspecies contacts were generated using TAXAassign (https://github.com/umerijaz/TAXAassign), which assigned 3441 (55.5%) contigs to the known reference genomes in the NCBI nt database (see Section 2). These 3441 contigs were assigned to 10 species by TAXAassign (Supplementary Fig. S2). We compared the taxonomy assignment results by TAXAassign with the corresponding true species identities obtained by BLASTn. Only 21 labels were different, indicating the high precision of taxonomy assignments at the species level by TAXAassign.

Then, taking advantage of these labeled contigs, we generated a relationship of intraspecies pairs by pairwise combining contigs from the same species, and corresponding contacts were obtained as sample data to fit the HiCzin model. A total of 1,492,856 samples of the intraspecies contacts were generated.

All sample data were then utilized to fit the HiCzin model. We compared our model with naive normalization methods, the two-stage normalization method in Bin3C, and the classical negative binomial regression model. To simplify the notation, we denote naive normalization methods by site, length, and coverage as Naive Site, Naive Length, and Naive Coverage, and denote the two-stage normalization method in Bin3C and the classical negative binomial regression model as Bin3C_Norm and Naive NB.

We first calculated the Pearson correlation coefficients between normalized valid contacts and the product of each of the three factors of explicit biases to gauge the bias effects (Table 1). The Naive Site and Naive Length approaches increased the Pearson correlation coefficients between valid contacts and the product of the coverage from 0.184 to 0.559 and 0.694; the Naive Coverage approach increased the correlation coefficient between valid contacts and the product of the site from 0.429 to 0.515 and increased the correlation coefficient between valid contacts and the product of the length from 0.400 to 0.481.

Table 1.

Pearson Correlation Coefficients (Absolute Value) Between Normalized Valid Contacts and the Product of Each of the Three Factors of Explicit Biases

| Method | Site | Length | Coverage |
| --- | --- | --- | --- |
| Raw contacts | 0.429 | 0.400 | 0.184 |
| Naive site | 0.004 | 0.004 | 0.559 |
| Naive length | 0.004 | 0.004 | 0.694 |
| Naive coverage | 0.515 | 0.481 | 0.006 |
| Bin3C_Norm | 0.024 | 0.025 | 0.011 |
| Naive NB | 0.023 | 0.024 | 0.154 |
| HiCzin | 2 × 10⁻⁴ | 0.002 | 0.069 |

These results showed that the naive normalization methods only corrected part of the explicit biases, and the unnormalized factors of explicit biases showed an even higher correlation with Hi-C contact maps. In contrast, the two-stage normalization method in Bin3C decreased all three correlation coefficients to 0.024, 0.025, and 0.011, indicating that the matrix balancing algorithm can assist in correcting explicit biases to some extent. These three correlation coefficients were decreased to 0.023, 0.024, and 0.154 using Naive NB, and further decreased to 2 × 10⁻⁴, 0.002, and 0.069 using HiCzin. Therefore, HiCzin removed the site and length biases more thoroughly than any other normalization method and reduced the coverage bias well below that of the raw contacts and Naive NB.

The other objective of metagenomic Hi-C normalization is to identify valid contacts among all observed contacts. Although raw values of spurious contacts were significantly smaller than those of valid contacts, the distribution of spurious contacts overlapped with the distribution of valid contacts (Fig. 4a), making it challenging to separate spurious contacts from valid contacts. After normalization, the distribution of normalized spurious contacts shifted considerably to the left of the distribution of normalized valid contacts (Fig. 4b), making spurious contacts easier to distinguish from valid contacts.

FIG. 4.


(a) Comparison of the distribution of raw valid contacts and raw spurious contacts. (b) Comparison of the distribution of normalized valid contacts and normalized spurious contacts by HiCzin. (c) The proportions of discarded valid contacts and discarded spurious contacts. (d) The DR curve using all sample data. DR, discard–retain.

Therefore, we adopted our hybrid statistical approach based on the HiCzin model to detect and then discard spurious contacts (see Section 2). The main procedure of our approach is to select thresholds for the enrichment score and the p-value, respectively; any contact whose enrichment score or p-value is below the corresponding threshold is identified as a spurious contact. A percentage reflecting the acceptable fraction of losses of the valid contacts was preselected, and thresholds were determined such that less than the preselected percentage of valid contacts in the sample data was incorrectly identified as spurious contacts.

Noticeably, both thresholds increased with the preselected percentage, and larger thresholds could detect more spurious contacts while incorrectly identifying a higher number of valid contacts. Although there existed a “trade-off,” the proportion of discarded spurious contacts increased much faster than that of discarded valid contacts (Fig. 4c), indicating that we could remove a large fraction of spurious contacts while keeping most of the valid contacts.

For instance, if we set the preselected percentage to the default of 10%, meaning that we could tolerate the loss of around 10% of valid contacts, about 60% of spurious contacts were detected, while only 13% of valid contacts were incorrectly removed. These results supported the feasibility of our spurious contact detection method.

According to our basic idea of spurious contact detection, naive normalization methods and the Naive NB method could also be used to detect spurious contacts by regarding normalized contacts less than certain thresholds as spurious contacts. We used the same technique proposed in our hybrid statistical spurious contact detection method to determine thresholds for other normalization methods. For the two-stage normalization method in Bin3C, as the matrix balancing algorithm in the second step may amplify the influence of certain spurious contacts, it is better to remove the noise of spurious contacts after the first step correction.

Since the first stage of Bin3C_Norm is equivalent to the Naive Site approach, the spurious contact detection result of Bin3C_Norm is the same as that of the Naive Site approach. To evaluate the capability of normalization methods to detect spurious contacts while retaining the valid contacts, we design the discard–retain (DR) curve. In the graph of a DR curve, the x-axis is the proportion of discarded spurious contacts among all spurious contacts in the whole data, and the y-axis represents the proportion of retained valid contacts within all valid contacts in the whole data.

We denote the area under the discard–retain curve as AUDRC. Larger AUDRC indicates that the normalization method can retain more valid contacts while discarding more spurious contacts. Therefore, we plotted the DR curve to evaluate the performance of different normalization methods (Fig. 4d). AUDRC was subsequently calculated for each of the normalization methods (Table 2), and our HiCzin model achieved the best result with respect to AUDRC.
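One way to compute a DR curve and its area is sketched below. It assumes that a contact is discarded when its normalized value is strictly below the threshold, that the thresholds swept are the observed values themselves, and that the area is obtained with the trapezoidal rule; these choices and the function names are assumptions of the example rather than a description of the authors' implementation.

```python
def dr_curve(valid, spurious):
    """Points (discarded spurious fraction, retained valid fraction) per threshold.

    `valid` and `spurious` are lists of normalized contact values with known
    ground truth. A contact is discarded when its value is below the threshold.
    """
    thresholds = sorted(set(valid) | set(spurious)) + [float("inf")]
    points = []
    for t in thresholds:
        discarded_spurious = sum(v < t for v in spurious) / len(spurious)
        retained_valid = sum(v >= t for v in valid) / len(valid)
        points.append((discarded_spurious, retained_valid))
    return points


def audrc(points):
    """Area under the DR curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy values: perfectly separated versus indistinguishable contacts.
separated = audrc(dr_curve(valid=[3.0, 4.0], spurious=[1.0, 2.0]))
overlapping = audrc(dr_curve(valid=[1.0, 2.0], spurious=[1.0, 2.0]))
```

Under these conventions a normalization that separates the two groups perfectly reaches an area of 1, while one that cannot tell them apart stays near 0.5, which gives a scale for reading AUDRC values.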

Table 2.

Area Under the Discard–Retain Curve for Different Normalization Methods

| Normalization method | AUDRC |
| --- | --- |
| Naive site and Bin3C_Norm | 0.682 |
| Naive length | 0.712 |
| Naive coverage | 0.757 |
| Naive NB | 0.792 |
| HiCzin | 0.804 |

Higher AUDRC score indicates better performance in spurious contact detection. The optimal values of the results are in bold.

AUDRC, area under the discard–retain curve.

3.6. Evaluation of the impact of the number of labeled contigs and detected species on the HiCzin model

The above results showed that the HiCzin model achieved outstanding performance in Hi-C normalization and spurious contact detection by utilizing all sample data to fit the normalization model. Around half of the contigs corresponding to 10 out of 13 species were labeled by TAXAassign, providing us with enough samples to fit the model. However, in real situations, we sometimes can only label a small number of contigs at the species level and detect a low quantity of species.

To explore how the HiCzin model performs with only a small number of labeled contigs and detected species, we successively removed labeled contigs belonging to certain species. Specifically, we first removed three species (Saccharomyces mikatae, Pichia pastoris, and Lachancea kluyveri) to which only a small number of contigs (<20) were assigned. The remaining seven species were sorted in descending order by the number of contigs assigned to them. Then, we removed the labeled contigs of one species at a time in this order until fewer than 10% of all contigs remained. In this way, we could simulate both the situation where some species are unknown and the situation where only a small number of contigs can be labeled (Table 3).
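The removal schedule can be written down compactly. In this sketch, `labels` (contig to species), `total_contigs` (all assembled contigs, labeled or not), and the function name are assumptions for illustration; the cutoffs of 20 contigs and 10% follow the description above.

```python
from collections import Counter


def removal_steps(labels, total_contigs, min_contigs=20, stop_fraction=0.10):
    """Simulate stepwise removal of labeled species.

    Species with fewer than `min_contigs` labeled contigs are dropped first.
    The rest are removed one at a time, largest first, until the remaining
    labeled contigs fall below `stop_fraction` of `total_contigs`.
    Returns (species left, contigs left, fraction of all contigs) per step.
    """
    counts = Counter(labels.values())
    kept = {species: n for species, n in counts.items() if n >= min_contigs}
    order = sorted(kept, key=kept.get, reverse=True)
    steps = []
    for species in order:
        del kept[species]
        remaining = sum(kept.values())
        fraction = remaining / total_contigs
        steps.append((len(kept), remaining, fraction))
        if fraction < stop_fraction:
            break
    return steps
```

Each returned tuple corresponds to one evaluation setting: the model is refitted on the intraspecies pairs that can still be formed from the remaining labeled contigs.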

Table 3.

Number of Remaining Species, Number of Remaining Contigs, and the Proportion of Assigned Contigs in Each Step

| No. of remaining species | No. of remaining contigs | Proportion of assigned contigs |
| --- | --- | --- |
| 6 | 2156 | 34% |
| 5 | 1122 | 18% |
| 4 | 594 | 10% |
| 3 | 369 | 6% |
3 369 6%

Each row gives the number of species and labeled contigs remaining after the corresponding removal step.

Although the performance of normalization and spurious contact detection became slightly worse as the number of labeled contigs and detected species decreased, the results were still better than those of the naive normalization methods and the two-stage normalization method in Bin3C for the different sample sizes (Table 4). The HiCzin model also obtained better performance in spurious contact detection (Fig. 5). Therefore, the HiCzin model can achieve good results even when the number of labeled contigs and detected species is relatively low.

Table 4.

Pearson Correlation Coefficients (Absolute Value) Between Normalized Valid Contacts and the Product of Each of the Three Factors of Explicit Biases, and Area Under the Discard–Retain Curve for Different Proportions of Labeled Contigs

| Proportion of labeled contigs | Site | Length | Coverage | AUDRC |
| --- | --- | --- | --- | --- |
| 34% | 0.004 | 0.006 | 0.059 | 0.801 |
| 18% | 0.003 | 0.005 | 0.099 | 0.793 |
| 10% | 0.011 | 0.010 | 0.065 | 0.771 |
| 6% | 0.014 | 0.014 | 0.037 | 0.794 |

FIG. 5.

The discard–retain (DR) curves using (a) 6 species and 2156 contigs, (b) 5 species and 1122 contigs, (c) 4 species and 594 contigs, and (d) 3 species and 369 contigs.
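As a rough illustration of how an area under a discard–retain curve can be computed, the sketch below sweeps a cutoff over the normalized contact values. It assumes that the curve plots the fraction of spurious contacts discarded (x) against the fraction of valid contacts retained (y) and that contacts below the cutoff are discarded; this reading, the function names, and the toy values are assumptions, not the authors' implementation.

```python
def discard_retain_curve(valid, spurious):
    """Return (fraction of spurious discarded, fraction of valid retained) per cutoff."""
    cutoffs = sorted(set(valid) | set(spurious))
    cutoffs.append(cutoffs[-1] + 1.0)  # one cutoff above every value: discard all
    curve = []
    for t in cutoffs:
        discarded = sum(s < t for s in spurious) / len(spurious)
        retained = sum(v >= t for v in valid) / len(valid)
        curve.append((discarded, retained))
    return curve


def audrc(valid, spurious):
    """Trapezoidal area under the discard-retain curve."""
    curve = discard_retain_curve(valid, spurious)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical normalized contact values.
separated = audrc(valid=[5.0, 6.0, 7.0], spurious=[1.0, 2.0, 3.0])
inverted = audrc(valid=[1.0, 2.0, 3.0], spurious=[5.0, 6.0, 7.0])
print(separated, inverted)
```

Under this definition, an area near 1 means that almost all spurious contacts can be discarded while almost all valid contacts are retained, and an area near 0 means the opposite.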

3.7. Generalizing the HiCzin model

Our HiCzin model can be generalized to use different independent variables and to perform normalization without labeled contigs (see Section 2). Here, we explore three such scenarios.

3.7.1. HiCzin_LC

In some real situations, the specific enzymes utilized in Hi-C experiments are unknown; thus, only the length and the coverage of contigs can serve as independent variables.

3.7.2. HiCzin_GC

GC-content, which is defined as the fraction of guanine (G) and cytosine (C) bases in a molecule, is sometimes considered one source of biases in Hi-C experiments (Yaffe and Tanay, 2011; Hu et al., 2012). Therefore, we explored the influence of adding GC-content as a new predictor variable to our HiCzin model, although we did not observe a strong correlation between raw valid contacts and GC-content (Pearson correlation coefficient: 0.032) for the Hi-C contact maps of the synthetic M-Y samples.
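For concreteness, the GC-content of a contig follows directly from the definition above. The helper below is a minimal sketch; the name `gc_content` and the choice to count every character of the sequence (including ambiguous bases such as N) in the denominator are assumptions.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a nucleotide sequence (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


print(gc_content("ATGC"))    # half of the bases are G or C
print(gc_content("ggccat"))  # lower-case input is handled as well
```

A per-contig value computed this way could then be supplied to the regression as an additional predictor variable, as in HiCzin_GC.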

3.7.3. Unlabeled HiCzin

In real applications of HiCzin, serious difficulties may arise. For example, there may not be enough computational resources to run TAXAassign, or only an extremely small number of contigs can be labeled. To address these problems, we designed a HiCzin normalization mode that does not require labeled contigs (Unlabeled HiCzin).

3.7.4. Applying the generalized HiCzin models on the M-Y samples

We applied these three generalized HiCzin models to the M-Y samples (Table 5). For HiCzin_LC and HiCzin_GC, the Pearson correlation coefficients between normalized contact counts and the three factors increased compared with those of HiCzin in Table 1, although the AUDRC was slightly higher than that of HiCzin in Table 2. For the unlabeled HiCzin, detecting spurious contacts was difficult because thresholds are hard to determine without samples of intraspecies contacts. Although its normalization results were worse than those of the HiCzin model using labeled contigs, the unlabeled mode still performed better than the naive normalization methods, and it is more widely applicable and requires fewer computational resources than the HiCzin model using labeled contigs.

Table 5.

Pearson Correlation Coefficients (Absolute Value) Between Normalized Valid Contacts and the Product of Each of the Three Factors of Explicit Biases, and Area Under the Discard–Retain Curve for Different Generalized HiCzin Models

| Model | Site | Length | Coverage | AUDRC |
| --- | --- | --- | --- | --- |
| HiCzin_LC | 0.006 | 0.002 | 0.097 | 0.812 |
| HiCzin_GC | 0.008 | 0.003 | 0.131 | 0.816 |
| Unlabeled HiCzin | 0.114 | 0.105 | 0.079 | |

3.8. Clustering of contigs by the Louvain algorithm

The Louvain algorithm has been widely used to cluster contigs based on metagenomic Hi-C data (Marbouty et al., 2017; Baudry et al., 2019). We applied this algorithm to the Hi-C data normalized by different methods. We set the preselected maximum percentage of incorrectly identified valid contacts in the sample data to 10% for all HiCzin models and regarded groups above 500 kbp as effective bins when evaluating the clustering performance.
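The size filter for effective bins is simple to express in code. The sketch below keeps only those clusters whose contigs sum to more than 500 kbp; the function name, the data layout (a list of contig-name lists plus a length dictionary), and the toy values are hypothetical.

```python
def effective_bins(clusters, contig_length, min_size_bp=500_000):
    """Keep clusters whose total contig length exceeds `min_size_bp`."""
    return [
        cluster
        for cluster in clusters
        if sum(contig_length[name] for name in cluster) > min_size_bp
    ]


# Hypothetical clustering output and contig lengths in base pairs.
toy_lengths = {"c1": 400_000, "c2": 250_000, "c3": 90_000, "c4": 30_000}
toy_clusters = [["c1", "c2"], ["c3", "c4"]]
kept = effective_bins(toy_clusters, toy_lengths)
print(kept)
```

Only the contigs in the retained groups would then count toward the "No. of contigs" column and the clustering scores.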

As shown in Table 6, the original HiCzin model achieved the best clustering performance with the Louvain algorithm. Although the matrix balancing algorithm in the second stage could improve the clustering quality, Bin3C_Norm grouped far fewer contigs than Naive Site. The Naive NB approach also grouped a relatively small number of contigs.

Table 6.

Comparison of the Clustering Results of Contigs Using the Louvain Algorithm

| Method | No. of contigs | Fowlkes–Mallows score | Adjusted Rand index | Normalized mutual information |
| --- | --- | --- | --- | --- |
| Naive site | 5997 | 0.763 | 0.724 | 0.783 |
| Naive length | 6092 | 0.791 | 0.758 | 0.788 |
| Naive coverage | 6131 | 0.752 | 0.706 | 0.791 |
| Bin3C_Norm | 5266 | 0.791 | 0.761 | 0.793 |
| Naive NB | 4783 | 0.783 | 0.748 | 0.764 |
| Unlabeled HiCzin | 6105 | 0.794 | 0.761 | 0.806 |
| HiCzin_LC | 6044 | 0.802 | 0.771 | 0.807 |
| HiCzin_GC | 6039 | 0.799 | 0.767 | 0.803 |
| HiCzin | 6065 | 0.807 | 0.776 | 0.810 |

Number of contigs represents the number of contigs in groups. The optimal values of the results are in bold.
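To make the first score in Table 6 concrete, the sketch below computes the Fowlkes–Mallows score from pair counts: the number of contig pairs grouped together in both the reference and the predicted clustering, divided by the geometric mean of the numbers of pairs grouped together in each. The function name and the toy labels are hypothetical; library implementations such as scikit-learn's `fowlkes_mallows_score` serve the same purpose.

```python
from collections import Counter
from math import comb, sqrt


def fowlkes_mallows(true_labels, pred_labels):
    """Fowlkes-Mallows score computed from pair counts."""
    # Pairs of items that share a group in both labelings.
    together_both = sum(
        comb(n, 2) for n in Counter(zip(true_labels, pred_labels)).values()
    )
    # Pairs that share a group in the reference / in the prediction.
    together_true = sum(comb(n, 2) for n in Counter(true_labels).values())
    together_pred = sum(comb(n, 2) for n in Counter(pred_labels).values())
    if together_true == 0 or together_pred == 0:
        return 0.0
    return together_both / sqrt(together_true * together_pred)


# Hypothetical species labels and cluster assignments for five contigs.
toy_true = ["A", "A", "A", "B", "B"]
toy_pred = [0, 0, 1, 1, 1]
toy_score = fowlkes_mallows(toy_true, toy_pred)
print(toy_score)
```

The score is 1 when the predicted groups match the reference exactly and decreases as contigs from the same species are split apart or contigs from different species are merged.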

The performance of HiCzin_GC in both bias correction and clustering was worse than that of the original HiCzin model; hence, it is not necessary to include GC-content in the regression. One potential explanation for the poorer performance of HiCzin_GC is that the genomes in the community have similar GC-content, so the Hi-C contact maps do not depend on GC-content. The clustering results of HiCzin_LC and the unlabeled HiCzin were better than those of all naive normalization methods and the two-stage normalization method in Bin3C, indicating that our normalization model still performed well when the restriction enzymes of the Hi-C experiments or the labels of the contigs were unknown. These results indicate that the HiCzin models are widely applicable and normalize well under different circumstances.

4. DISCUSSION AND CONCLUSIONS

We identified two types of experimental biases in metagenomic Hi-C data. Explicit biases include the number of restriction sites, contig length, and contig coverage; implicit biases include unobserved interactions and spurious interspecies contacts. Both types of biases were clearly observable in the M-Y samples. Naive normalization methods could correct only part of the explicit biases, and the unnormalized factors of explicit biases showed an even higher correlation with the Hi-C contact maps. Based on the assumption that the population of intraspecies contacts follows a negative binomial distribution, we have presented HiCzin, a parametric model that applies a zero-inflated negative binomial regression framework to normalize metagenomic Hi-C data, and have introduced a hybrid statistical method to detect and remove spurious interspecies contacts. The HiCzin model takes the impact of unobserved interactions into account. We have shown that metagenomic Hi-C contact maps normalized by HiCzin lead to lower biases, a higher ability to detect spurious contacts, and better metagenomic contig clustering performance than all naive methods and the two-stage normalization method in Bin3C.

For cases in which the specific enzymes used in Hi-C experiments are unknown or there are not enough computational resources to run TAXAassign, we developed a generalized HiCzin that uses only the length and the coverage of contigs as predictor variables, and a HiCzin mode without labeled contigs. We have shown that these two models also performed well in normalization, spurious contact detection, and metagenomic contig clustering. Although our hybrid statistical approach removes a large fraction of spurious contacts, losing a small number of useful valid contacts is inevitable. Directly modeling the spurious contacts may separate them from valid contacts even better.

As the Hi-C technique is increasingly applied in metagenomics, we expect that the normalization model proposed here will facilitate downstream analysis and improve results in retrieving metagenome-assembled genomes, identifying virus–host interactions, tracking horizontal gene transfer, and all other areas that make use of metagenomic Hi-C data.

Supplementary Material

Supplemental data
Supp_Data.pdf (239.7KB, pdf)

AUTHORS’ CONTRIBUTIONS

Y.D. and F.S. conceived the ideas and designed the study. Y.D. implemented the methods, carried out the computational analyses, and drafted the article. S.L. and J.F. contributed to the HiCzin_LC mode. All authors modified and finalized the article.

AVAILABILITY OF DATA AND MATERIALS

The synthetic M-Y sample was downloaded under the following accession numbers: shotgun library SRR1263009 and Hi-C library SRR1262938 (Burton et al., 2014). The HiCzin software is freely available at https://github.com/dyxstat/HiCzin.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

The research is partially funded by the National Institutes of Health grants (R01GM120624 and R01GM131407), and the National Science Foundation grant EF-2125142. Y.D. is supported by the Viterbi Fellowship.

Supplementary Material

Supplementary Data

Supplementary Figure S1

Supplementary Figure S2

Supplementary Table S1

REFERENCES

  1. Altschul, S. F., Gish, W., Miller, W., et al. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
  2. Baudry, L., Foutel-Rodier, T., Thierry, A., et al. 2019. MetaTOR: A computational pipeline to recover high-quality metagenomic bins from mammalian gut proximity-ligation (meta3C) libraries. Front. Genet. 10, 753.
  3. Beitel, C. W., Froenicke, L., Lang, J. M., et al. 2014. Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2, e415.
  4. Bickhart, D. M., Watson, M., Koren, S., et al. 2019. Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation. Genome Biol. 20, 153.
  5. Brooks, M. E., Kristensen, K., Van Benthem, K. J., et al. 2017. glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. R J. 9, 378–400.
  6. Burton, J. N., Liachko, I., Dunham, M. J., et al. 2014. Species-level deconvolution of metagenome assemblies with Hi-C–based contact probability maps. G3 (Bethesda) 4, 1339–1346.
  7. DeMaere, M. Z. and Darling, A. E. 2019. bin3C: Exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biol. 20, 46.
  8. Dixon, J. R., Selvaraj, S., Yue, F., et al. 2012. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380.
  9. Dudchenko, O., Batra, S. S., Omer, A. D., et al. 2017. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95.
  10. Hu, M., Deng, K., Selvaraj, S., et al. 2012. HiCNorm: Removing biases in Hi-C data via Poisson regression. Bioinformatics 28, 3131–3133.
  11. Imakaev, M., Fudenberg, G., McCord, R. P., et al. 2012. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999–1003.
  12. Kang, D. D., Froula, J., Egan, R., et al. 2015. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165.
  13. Knight, P. A. and Ruiz, D. 2013. A fast algorithm for matrix balancing. IMA J. Numer. Anal. 33, 1029–1047.
  14. Lambert, D. 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14.
  15. Lieberman-Aiden, E., Van Berkum, N. L., Williams, L., et al. 2009. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293.
  16. Lord, D., Guikema, S. D., and Geedipally, S. R. 2008. Application of the Conway–Maxwell–Poisson generalized linear model for analyzing motor vehicle crashes. Accid. Anal. Prev. 40, 1123–1134.
  17. Marbouty, M., Baudry, L., Cournac, A., et al. 2017. Scaffolding bacterial genomes and probing host-virus interactions in gut microbiome by proximity ligation (chromosome capture) assay. Sci. Adv. 3, e1602105.
  18. Press, M. O., Wiser, A. H., Kronenberg, Z. N., et al. 2017. Hi-C deconvolution of a human gut microbiome yields high-quality draft genomes and reveals plasmid-genome interactions. bioRxiv 198713.
  19. Selvaraj, S., Dixon, J. R., Bansal, V., et al. 2013. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nat. Biotechnol. 31, 1111–1118.
  20. Stalder, T., Press, M. O., Sullivan, S., et al. 2019. Linking the resistome and plasmidome to the microbiome. ISME J. 13, 2437–2446.
  21. Yaffe, E. and Relman, D. A. 2020. Tracking microbial evolution in the human gut using Hi-C reveals extensive horizontal gene transfer, persistence and adaptation. Nat. Microbiol. 5, 343–353.
  22. Yaffe, E. and Tanay, A. 2011. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat. Genet. 43, 1059–1065.
  23. Yau, K. K., Wang, K., and Lee, A. H. 2003. Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biom. J. 45, 437–452.

Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.
