Abstract
Motivation
Bacterial haplotype reconstruction is critical for selecting proper treatments for diseases caused by unknown haplotypes. Existing methods and tools do not work well on this task, because they are usually developed for viral rather than bacterial populations.
Results
In this study, we developed BHap, a novel algorithm based on fuzzy flow networks, for reconstructing bacterial haplotypes from next generation sequencing data. Tested on simulated and experimental datasets, we showed that BHap was capable of reconstructing haplotypes of bacterial populations with an average F1 score of 0.87, an average precision of 0.87 and an average recall of 0.88. We also demonstrated that BHap had a low susceptibility to sequencing errors, was capable of reconstructing haplotypes from low coverage data and could handle a wide range of mutation rates. BHap outperformed existing approaches, with higher F1 scores, better precision, better recall and more accurate estimation of the number of haplotypes.
Availability and implementation
The BHap tool is available at http://www.cs.ucf.edu/∼xiaoman/BHap/.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
It is important to reconstruct haplotypes from bacterial clonal populations. Haplotypes are variant copies of a genome in a population that are created gradually with accumulated mutations in DNA (Lang et al., 2011). Reconstructing haplotypes in a bacterial population reveals the population structure and its evolutionary features (Pulido-Tamayo et al., 2015). In addition, reconstructing bacterial haplotypes is required to choose the right treatments for diseases caused by specific haplotypes in a population, which may vary in only a few base pairs (bps) compared with other haplotypes in the population (Schirmer, 2014).
Next generation sequencing (NGS) technologies provide a unique opportunity to reconstruct haplotypes in bacterial clonal populations. NGS technologies can sequence DNA from a bacterial population (Barrick and Lenski, 2009). The sequenced short reads are a mixture of DNA segments from different haplotypes in the population. Researchers can then regroup reads for individual haplotypes, reconstruct the haplotypes and discover the diversity in the population from these reads.
Several approaches have been developed for viral haplotype reconstruction. ShoRAH performs a local analysis to estimate the haplotype diversity at the local level and then applies a global analysis using the path cover algorithm to reconstruct genome-wide haplotypes (Zagordi et al., 2011). QColors constructs a read conflict graph and models the population reconstruction as a vertex coloring problem (Huang et al., 2011). ViSpA creates a weighted overlap graph for reads and iteratively finds maximum-weight paths, which it considers as viral haplotypes (Astrovskaya et al., 2011). QuRe partitions the reference genome with mapped reads into sliding windows, scores the partitions, constructs an overlap graph and finally finds genome-wide paths with a heuristic algorithm (Prosperi and Salemi, 2012).
Although the aforementioned methods are suitable for viral populations, they have difficulty in distinguishing bacterial haplotypes, which are more similar to each other due to much lower mutation rates compared with those in viral populations. In viral populations, the genomic distance between polymorphic sites is often shorter than the read length. Every read may thus contain polymorphic sites. Moreover, overlapping reads from the same haplotype likely share common polymorphic sites. Viral haplotype reconstruction methods typically use such overlapping information to infer haplotypes. However, using this piece of information is not enough for bacterial population reconstruction due to the much lower number of mutations. The distance between polymorphic sites in bacterial genomes is often longer than several thousand bps (Pulido-Tamayo et al., 2015). In other words, many reads contain no polymorphic site. Read overlapping thus cannot facilitate the grouping of reads with adjacent polymorphic sites in haplotypes.
To our knowledge, EVORhA is the first and the only existing haplotype reconstruction tool capable of identifying haplotypes in bacterial populations (Pulido-Tamayo et al., 2015). It defines windows on aligned short reads and infers template haplotypes per window to construct haplotypes locally. It then extends windows by concatenating template haplotypes based on their shared polymorphic sites. EVORhA reconstructs the final genome-wide haplotypes using the relative coverage of the extended haplotypes. Such a local-extension based strategy may be affected by ‘errors’ at the local level and generate many false positive haplotypes.
In this study, we propose a haplotype reconstruction method for bacterial populations called BHap (an abbreviation for Bacterial Haplotype Reconstruction). Different from previous studies (Prosperi and Salemi, 2012; Pulido-Tamayo et al., 2015; Zagordi et al., 2011), which often start from locally constructed haplotype segments and then extend these segments to obtain final haplotypes, BHap always focuses on all polymorphic sites in a haplotype instead of local genomic regions, using an Expectation-Maximization (EM) algorithm and a fuzzy flow approach. Such a global approach, guided by the estimated ‘global’ picture of the haplotype coverage, may be more robust to ‘errors’ and biases in local genomic regions. Tested on simulated and experimental datasets, BHap is capable of reliably reconstructing haplotypes with an average F1 score of 0.87, an average precision of 0.86 and an average recall of 0.88. Compared with existing approaches, BHap constructs more accurate haplotypes and generates fewer false positive haplotypes. The BHap tool is available at http://www.cs.ucf.edu/∼xiaoman/BHap/.
2 Materials and methods
2.1 Simulated datasets
To investigate the performance of BHap, we simulated 339 datasets with different configurations, such as different coverage, read lengths, mutation rates, haplotype proportions and sequencing error rates (Supplementary Table S1). Coverage refers to the sequencing depth of a dataset and is defined as the ratio of the total length of all reads in a dataset to the length of the corresponding reference genome. The main reason to test BHap on simulated instead of experimental datasets is that the polymorphic sites, which are essential for an accurate evaluation of the methods, are known in simulated datasets but unavailable in experimental ones.
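As an illustration of this definition with hypothetical numbers, a dataset of 2,000,000 reads of 100 bps each sequenced from a 4 Mb reference genome has a coverage of (2,000,000 × 100)/4,000,000 = 50×.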
To simulate data, we randomly selected the genomes of three bacterial species, Bartonella clarridgeiae, Enterococcus casseliflavus and Methanobrevibacter smithii (GenBank NC_014932, NC_020995 and NC_009515, respectively), as reference genomes. For each of the three reference genomes, we generated a default population composed of two haplotypes with the default parameters (Supplementary Table S1). Since bacterial populations often contain mutations several thousand bps apart from each other, the default mutation rate was set to 0.01% (Pulido-Tamayo et al., 2015). Here the mutation rate is the percentage of variations in a haplotype when it is compared with its reference genome. For every haplotype in a population, we simulated short paired-end Illumina reads using the dwgsim tool (https://github.com/nh13/DWGSIM) (Supplementary Table S1). All simulated reads for all haplotypes in the same population were mixed together as a simulated dataset, from which the original haplotypes in this population were to be inferred.
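As a minimal sketch of the haplotype generation step (not the authors' exact pipeline; read simulation itself is left to dwgsim), the following Python snippet introduces independent point substitutions at a given mutation rate. All names are illustrative.

```python
import random

def simulate_haplotype(reference, mutation_rate=0.0001, seed=None):
    """Introduce random point substitutions into a reference genome.

    Each position mutates independently with probability `mutation_rate`
    (0.01% by default, mimicking the low divergence of bacterial haplotypes).
    Returns the mutated sequence and the list of polymorphic site positions.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    haplotype = list(reference)
    polymorphic_sites = []
    for pos, ref_base in enumerate(haplotype):
        if rng.random() < mutation_rate:
            # substitute with a base different from the reference base
            haplotype[pos] = rng.choice([b for b in bases if b != ref_base])
            polymorphic_sites.append(pos)
    return "".join(haplotype), polymorphic_sites

# Example: two haplotypes derived from one (random) reference; reads for each
# haplotype would then be simulated (e.g. with dwgsim) and mixed at the chosen
# haplotype proportions to form one simulated dataset.
rng0 = random.Random(0)
reference = "".join(rng0.choice("ACGT") for _ in range(100000))
hap1, sites1 = simulate_haplotype(reference, seed=1)
hap2, sites2 = simulate_haplotype(reference, seed=2)
print(len(sites1), len(sites2))  # roughly 10 polymorphic sites each at 0.01%
```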
To study how different parameters affect the performance of BHap, we simulated eleven groups of datasets (Supplementary Table S1). The first group consisted of the above three default populations together with twelve populations generated similarly with the default parameters. In each of the ten remaining groups, the value of one or more parameters was changed. The second group was designed to study the effect of the read length on haplotype reconstruction, in which we varied the read length from 60 to 150 bps, excluding the default 100 bps, for each of the above three reference genomes. The third group contained 42 datasets, where sequencing error rates varied from 0.2 to 1.5% for each of the three default populations. There were twelve datasets in the fourth group, with four individual haplotype proportions for each of the three default populations. A haplotype proportion gives the percentage of reads from every haplotype. For instance, the haplotype proportion 10/30/60 means that 10, 30 and 60% of reads are from three haplotypes, respectively. This fourth group of datasets was used to assess the performance of BHap in reconstructing individual haplotypes with different haplotype proportions. To study the BHap performance with a 10/90 proportion at higher coverage, we generated an additional fifteen datasets with different coverage as the fifth group. Since the mutation rate may affect the haplotype reconstruction, we simulated twelve datasets in the sixth group with mutation rates ranging from 0.02 to 0.05%. We generated three additional groups with 99 datasets to compare BHap with the only existing method for bacterial haplotype reconstruction, EVORhA (Pulido-Tamayo et al., 2015). In the seventh group, for each of the three bacterial genomes, with coverage of 50×, 100×, 150× and 200× and with each of the following three haplotype proportions: 30/70, 10/30/60 and 10/20/30/40, we generated twelve datasets. Since the best performance of EVORhA occurred at higher coverage, we also generated the eighth group with an additional six datasets for each bacterial genome, with 500× coverage, two different mutation rates and three haplotype proportions. Since the mutation rate in the EVORhA study was higher, we generated the ninth group with mutation rates of 0.07, 0.1 and 0.15% and coverage of 50×, 100×, 150×, 200× and 500× for each of the three genomes. Since haplotypes in a population evolved from the same reference genome through different trajectories, we simulated two additional groups (the tenth and the eleventh) with 117 datasets, in which an evolutionary relationship was enforced among the three or four haplotypes in each dataset. Note that an enforced evolutionary relation on a population with only two haplotypes was not meaningful, since such populations were equivalent to those already studied in the first nine groups. We thus studied three evolution trajectories for populations with three and four haplotypes. In brief, we simulated populations with three or four haplotypes, different haplotype proportions and different mutation rates. For populations with three haplotypes, two haplotypes were set to share a given fraction of polymorphic sites, and the remaining one shared no polymorphic site with the first two (Type 0 evolution trajectory).
For populations with four haplotypes, we considered two different evolution scenarios. Type 1: two haplotypes share a fraction of their polymorphic sites, the third shares fewer polymorphic sites with the first two and the fourth shares no polymorphic site with the first three. Type 2: the first two haplotypes share a fraction of polymorphic sites and the remaining two share a fraction of polymorphic sites, while the two pairs share no polymorphic site (Supplementary Table S1).
2.2 Experimental datasets
We tested BHap on two experimental datasets: the mixed infection dataset of Clostridium difficile and the evolved population dataset of Escherichia coli strain SX4. The mixed infection dataset was generated with the Illumina technology at 150× coverage (Eyre et al., 2013). There were 54 mixed samples, each constructed from two of the 36 unmixed samples (https://www.ebi.ac.uk/ena/data/view/PRJEB1729). The two unmixed samples used to construct a mixed sample and their proportions are provided in Supplementary Table S2. For the evolved population dataset, 100 bps long paired-end Illumina reads at a coverage of ∼200× were available at three time points (http://www.ncbi.nlm.nih.gov/bioproject/262000). At every time point, a population as well as a corresponding clone were sequenced. Here the number of haplotypes and their proportions were unknown, while the haplotype(s) in the clone was likely present in the population at the corresponding time point.
For the mixed infection dataset, we ran BHap and EVORhA on each mixed sample and each of its two corresponding unmixed samples to predict haplotypes. We then calculated the similarity of every pair of predicted haplotypes, with one haplotype from a mixed sample and the other haplotype from its unmixed samples. The similarity of a pair of haplotypes u and v was calculated in exactly the same way as in EVORhA, using Equation (1), where S_u and S_v are the sets of polymorphic sites in u and v, respectively. This similarity was called reliability in the EVORhA study. In this way, we identified one pair of most similar haplotypes for a mixed sample and each of its unmixed samples. Finally, we averaged the reliability of these two pairs of haplotypes for a mixed sample and its two unmixed samples to measure the performance of BHap and EVORhA. Similarly, for the evolved population dataset, at each time point, we predicted haplotypes in the population sample and its corresponding clone sample using BHap and EVORhA. We then identified the most similar pair of predicted haplotypes with one from the population sample and the other from its corresponding clone sample. Finally, we output the similarity of the most similar pair of haplotypes to measure the reliability of BHap and EVORhA at each of the three time points.

$\mathrm{similarity}(u, v) = \frac{|S_u \cap S_v|}{|S_u| + |S_v|}$    (1)
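As a small illustration of this similarity (using the form of Equation (1) reconstructed above, i.e. shared sites over the sum of the two set sizes, whose maximum of 0.5 is reached for identical sets):

```python
def reliability(sites_u, sites_v):
    """Similarity of two haplotypes as in Equation (1): the number of shared
    polymorphic sites divided by the total number of sites in both sets.
    Identical non-empty sets give the theoretical maximum of 0.5."""
    s_u, s_v = set(sites_u), set(sites_v)
    if not s_u and not s_v:
        return 0.0
    return len(s_u & s_v) / (len(s_u) + len(s_v))

# Hypothetical polymorphic site positions
print(reliability({120, 5530, 90112}, {120, 5530, 77001}))  # 2 / 6 ~ 0.33
print(reliability({120, 5530}, {120, 5530}))                # 0.5 (maximum)
```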
Alternatively, we defined polymorphic sites in the clone or unmixed samples by SAMtools, which is commonly used to infer polymorphic sites from NGS reads (Li 2011; Li et al., 2009). We then applied BHap and EVORhA to the corresponding population (mixed) samples to predict haplotypes. Finally, we compared the polymorphic sites in the predicted haplotypes with those inferred from SAMtools to calculate the reliability of BHap and EVORhA. This was to make sure that BHap or EVORhA identified those polymorphic sites in populations that were also discovered independently in the corresponding clones or unmixed samples.
2.3 BHap, a novel approach for haplotype reconstruction in bacterial populations
BHap is composed of the following four major steps (Fig. 1): it determines a proper k-mer length for constructing a de Bruijn graph; it then creates a flow network from the de Bruijn graph and identifies sequencing errors and polymorphic sites; next, it decomposes the flow network to infer feasible flows with an EM algorithm. These flows are considered as potential haplotypes; finally, it repeats the above three steps with different k values and combines the results to infer the final haplotypes. See Sections 2.3.1–2.3.5 for details.
Fig. 1.
Flowchart of the BHap algorithm. Polymorphic nodes from different haplotypes and the haplotypes themselves are drawn with different patterns
2.3.1 Choosing a proper k-mer length
The k-mer length affects the construction of the de Bruijn graph and the inference from this graph (Zerbino and Birney, 2008). We observed that for the same k-mer length, when the read length, coverage or reference genome size differed, our earlier BHap versions had different specificity and sensitivity, and thus different F1 scores, in terms of correctly grouping polymorphic sites for individual haplotypes. To automatically choose a proper k-mer length, we trained a polynomial regression model on 270 simulated datasets, in which the response k_i is the k-mer length that resulted in the best F1 score for the i-th dataset, the predictors are the average read length r_i, the coverage c_i and the genome size g_i of the i-th dataset, and the regression coefficients are unknown parameters estimated from the regression. These 270 simulated datasets were generated similarly as the simulated datasets in Supplementary Table S1 with the dwgsim tool, the three reference genomes and the following parameters: seven different haplotype proportions (10/90, 20/80, 30/70, 40/60, 50/50, 10/30/60, 10/20/30/40), ten different read lengths from 60 to 150 bps, and five different coverage values (50×, 100×, 150×, 200×, 500×). Given a new dataset, r, c and g are known, and the BHap tool obtains the best k-mer length from the above model with the estimated parameters.
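The exact polynomial form of the trained regression is not specified above; the sketch below illustrates the idea with an assumed quadratic design in read length, coverage and genome size, fitted by ordinary least squares (the functional form and all names are assumptions, not BHap's actual model).

```python
import numpy as np

def fit_kmer_model(read_len, coverage, genome_size, best_k):
    """Fit a least-squares regression predicting the best k-mer length.

    An assumed quadratic design in (read length, coverage, genome size);
    BHap's actual polynomial model may differ.
    """
    r, c, g = (np.asarray(v, dtype=float) for v in (read_len, coverage, genome_size))
    X = np.column_stack([np.ones_like(r), r, c, g, r**2, c**2, g**2])
    beta, *_ = np.linalg.lstsq(X, np.asarray(best_k, dtype=float), rcond=None)
    return beta

def predict_k(beta, r, c, g):
    """Predict a k-mer length for a new dataset and round it to an odd
    integer (Velvet uses odd hash lengths)."""
    x = np.array([1.0, r, c, g, r**2, c**2, g**2])
    k = int(round(float(x @ beta)))
    return k if k % 2 == 1 else k + 1

# beta would be estimated once from the training datasets, i.e. the per-dataset
# (read length, coverage, genome size, best k) values, and reused thereafter.
```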
2.3.2 Construction of the flow network
To identify polymorphic sites, BHap applies Velvet (Zerbino and Birney, 2008) to construct a de Bruijn graph and then converts this graph into a flow network. Velvet is a popular tool for assembling NGS reads based on de Bruijn graphs. The de Bruijn graph is a time and memory efficient data structure commonly used to represent short reads for sequence assembly.
For a given k-mer length, each node in the de Bruijn graph represents a k-mer in input reads and each directed edge represents a (k + 1)-mer in input reads. In other words, each edge connects two nodes representing the two k-mers contained in the corresponding (k + 1)-mer for this edge. Edges are weighted with the corresponding number of reads containing the corresponding (k + 1)-mer.
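As a minimal illustration of this definition (not Velvet's actual implementation), the following builds the k-mer nodes and weighted (k + 1)-mer edges directly from reads:

```python
from collections import Counter

def de_bruijn_graph(reads, k):
    """Build a simple de Bruijn graph from reads.

    Nodes are k-mers; a directed edge (u, v) corresponds to a (k+1)-mer whose
    prefix is u and suffix is v, weighted by the number of reads containing it.
    """
    nodes = set()
    edge_weights = Counter()
    for read in reads:
        kp1mers = {read[i:i + k + 1] for i in range(len(read) - k)}
        for kp1mer in kp1mers:
            u, v = kp1mer[:k], kp1mer[1:]
            nodes.update((u, v))
            edge_weights[(u, v)] += 1  # one count per read containing the (k+1)-mer
    return nodes, edge_weights

nodes, edges = de_bruijn_graph(["ACGTACGT", "CGTACGTT"], k=4)
print(len(nodes), edges[("ACGT", "CGTA")])  # 5 nodes; edge ACGT->CGTA seen in 1 read
```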
With the de Bruijn graph, BHap applies Velvet to generate uncorrected contigs (Zerbino and Birney, 2008). These contigs are constructed without sequencing error correction, so some contigs may have low coverage. This differs from normal assembly, in which sequencing error correction is carried out before assembly and corrected contigs are produced as the final product. BHap considers only the uncorrected contigs produced by Velvet, since the goal is to identify polymorphic sites.
BHap then constructs a flow network from the uncorrected contigs. BHap creates one node in the network for every contig with coverage larger than a specified threshold (default 3). The coverage of a contig is calculated by Velvet and represents its estimated sequencing depth. Contigs with coverage smaller than the threshold likely contain sequencing errors, while the remaining contigs likely contain all polymorphic sites. For each node, BHap maps the corresponding contig sequence to the reference genome with the BLAT tool (Kent, 2002). BLAT is used because the contigs are relatively long and far fewer in number than the input reads. If two contigs are mapped to overlapping regions, BHap connects the two corresponding nodes with a directed edge, in the same order as their occurrence in the reference genome. To reduce the storage cost, BHap merges consecutive nodes in a path that are not shared by any other path into one node. The edge weights are updated with the coverage of the corresponding sequences. In this way, BHap constructs a flow network, with the coverage of nodes as the flow capacities. In this network, nodes immediately following branching points are likely polymorphic sites, and are called polymorphic nodes in the following (Fig. 1).
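A simplified view of this construction, assuming the contig-to-reference alignments (e.g. from BLAT) are already summarized as (start, end, coverage) tuples; node merging and edge-weight updates are omitted, and all names are illustrative:

```python
def build_flow_network(alignments, min_coverage=3):
    """Build a directed graph over contigs aligned to the reference.

    `alignments` maps a contig id to its (start, end, coverage) on the
    reference. Contigs below the coverage threshold are dropped as likely
    sequencing errors; contigs mapped to overlapping regions are connected
    in reference order, and node capacities are the contig coverages.
    """
    kept = {cid: aln for cid, aln in alignments.items() if aln[2] >= min_coverage}
    ordered = sorted(kept, key=lambda cid: kept[cid][0])  # by mapped start position
    edges = []
    for i, u in enumerate(ordered):
        u_start, u_end, _ = kept[u]
        for v in ordered[i + 1:]:
            v_start, v_end, _ = kept[v]
            if v_start > u_end:   # no overlap; later contigs start even further right
                break
            edges.append((u, v))  # overlapping regions: connect u -> v in reference order
    capacities = {cid: kept[cid][2] for cid in kept}
    return edges, capacities
```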
2.3.3 An EM algorithm for finding the capacities of an initial flow set
The capacity of nodes from each haplotype can be approximated as a Poisson distribution (Wang et al., 2015). Therefore, the coverage of the nodes in the population is a mixture of different Poisson distributions. Since polymorphic nodes distinguish different haplotypes, BHap applies an EM algorithm on the capacity of polymorphic nodes to find an initial set of Poisson distributions and flows.
In brief, assume that X = {x_1, x_2, …, x_n} is the list of the polymorphic node capacities, which follows a mixture of Poisson distributions with unknown parameters M = {(λ_1, π_1), …, (λ_m, π_m)}, where the number of components m is also unknown. In the Expectation step, BHap calculates the expectation of the missing indicator variables using Equations (2) and (3), where, for r from 1 to m, z_ir is an indicator variable, z_ij = 1 means that x_i is from the j-th Poisson distribution, and π_j is the unknown probability that the capacity of a random node is from the j-th distribution. The Maximization step estimates the parameters λ_j and π_j using Equation (4). The unknown parameter m is inferred similarly as in a previous study, by starting from a large m and decreasing m by one at a time until no two of the obtained m groups of parameters are highly similar (Li and Waterman, 2003). The polymorphic nodes are then grouped based on their probabilities of belonging to the m groups, calculated with the final inferred parameters.

$P(x_i \mid \lambda_r) = \frac{\lambda_r^{x_i} e^{-\lambda_r}}{x_i!}$    (2)

$E[z_{ij} \mid x_i, M] = \frac{\pi_j P(x_i \mid \lambda_j)}{\sum_{r=1}^{m} \pi_r P(x_i \mid \lambda_r)}$    (3)

$\lambda_j = \frac{\sum_{i=1}^{n} E[z_{ij}] \, x_i}{\sum_{i=1}^{n} E[z_{ij}]}, \qquad \pi_j = \frac{1}{n} \sum_{i=1}^{n} E[z_{ij}]$    (4)
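A compact EM sketch for such a Poisson mixture, with responsibilities as in (2)-(3) and updates as in (4); this is a minimal illustration with a fixed number of components m, whereas BHap also infers m as described above.

```python
import numpy as np
from scipy.stats import poisson

def poisson_mixture_em(x, m, n_iter=200):
    """Fit an m-component Poisson mixture to node capacities x with EM.

    Returns the rates (lambdas), mixing weights (pis) and responsibilities.
    """
    x = np.asarray(x, dtype=float)
    lambdas = np.quantile(x, np.linspace(0.1, 0.9, m))  # spread initial rates over the data
    pis = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # E-step: responsibility of component j for capacity x_i, Eqs (2)-(3)
        pmf = poisson.pmf(x[:, None], lambdas[None, :])  # shape (n, m)
        resp = pis * pmf
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update rates and mixing weights, Eq. (4)
        weights = resp.sum(axis=0)
        lambdas = (resp * x[:, None]).sum(axis=0) / weights
        pis = weights / len(x)
    return lambdas, pis, resp

# Example: polymorphic node capacities from two haplotypes at ~30x and ~70x
rng = np.random.default_rng(1)
x = np.concatenate([rng.poisson(30, 40), rng.poisson(70, 60)])
lambdas, pis, _ = poisson_mixture_em(x, m=2)
print(np.round(lambdas, 1), np.round(pis, 2))  # rates near 30 and 70, weights near 0.4/0.6
```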
2.3.4 Fuzzy flow decomposition for finding haplotypes
BHap now tries to decompose the network into a set of flows F = {f_1, f_2, …, f_k}, where k is the number of flows (haplotypes) and the capacity of f_i is the coverage of the i-th haplotype, for i from 1 to k. Under the assumption that different haplotypes have different coverage, the smallest rate parameter in M, denoted λ_min, is likely the coverage of a single haplotype, while other parameters in M may be the sum of the coverage of several haplotypes. BHap thus first identifies the haplotype with coverage λ_min.
To obtain this haplotype, BHap calculates the cost of passing a flow with capacity λ_min through each polymorphic node. There is no need to calculate the cost of going through other nodes, since they are shared by all haplotypes. The cost of passing a polymorphic node is calculated with Equations (5) and (6), where x is the coverage of this polymorphic node. BHap then identifies the lowest-cost path that covers the reference genome and outputs the first haplotype.
(5)
(6)
For the nodes in the latest extracted path, BHap subtracts λ_min from their capacities. BHap then repeats the procedure: applying the EM algorithm to the polymorphic nodes with the remaining capacities, identifying the smallest rate parameter in the updated M, and outputting the flow and path with the capacity of this updated λ_min. The algorithm stops when the network becomes disconnected or when the cost of the current path exceeds a specified threshold.
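As an illustration of this iterative extraction, the sketch below uses a Poisson negative log-likelihood as a stand-in for the node cost of Equations (5)-(6) (an assumption, not BHap's actual cost function) and a simplified per-branching-point choice in place of the full lowest-cost path search.

```python
import math

def node_cost(x, lam):
    """Cost of pushing a flow of capacity `lam` through a polymorphic node
    with remaining coverage x; the Poisson negative log-likelihood is used
    here as a stand-in for Equations (5)-(6)."""
    return lam - x * math.log(lam) + math.lgamma(x + 1)

def extract_haplotype(branch_options, lam):
    """At each branching point, choose the polymorphic node whose remaining
    coverage best matches `lam` (lowest cost) and reduce its coverage,
    mimicking the capacity subtraction for the extracted path.

    `branch_options` is a list of branching points, each a dict mapping a
    polymorphic node id to its remaining coverage (a simplified stand-in for
    the flow network)."""
    path = []
    for options in branch_options:
        best = min(options, key=lambda node: node_cost(options[node], lam))
        path.append(best)
        options[best] = max(options[best] - lam, 0.0)  # subtract the extracted flow
    return path

# Example: two branching points, haplotypes with coverage around 30x and 70x
branches = [{"A1": 31.0, "A2": 69.0}, {"B1": 29.0, "B2": 71.0}]
print(extract_haplotype(branches, lam=30.0))  # ['A1', 'B1']
```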
2.3.5 Combining results of different k-mer values
Haplotypes in a population may have different coverage. Previous studies show that haplotypes with different coverage can be assembled better with different k-mer lengths (Surget-Groba and Montoya-Burgos, 2010). BHap thus uses different k-mer lengths to reconstruct haplotypes and clusters haplotypes from different k values to find the final set of haplotypes. BHap first selects a proper k-mer length by the above polynomial regression. BHap then considers two additional k-mer lengths that are larger or smaller than this k-mer length by 2. Such a combination showed the best F1 scores on the above 270 simulated datasets used to determine k.
With three k-mer lengths, BHap obtains three sets of haplotypes. Since the same haplotype in different sets should have similar coverage, BHap assigns haplotypes of similar coverage from different sets to the same cluster. BHap considers every haplotype in the set resulting from the best k-mer length to be a separate initial cluster. For each remaining haplotype set, BHap compares the coverage of each of its haplotypes with the coverage of the existing clusters. The coverage of a cluster is the average coverage of its haplotypes. If, for a haplotype, the difference between its coverage and that of the cluster with the most similar coverage is larger than half of the average coverage difference of the existing clusters, the algorithm creates a new cluster for this haplotype. If one haplotype has the same coverage difference when compared with two clusters, the algorithm assigns the haplotype to the cluster that shares more polymorphisms with this haplotype. After assigning the haplotypes in a haplotype set, the algorithm updates the coverage of the clusters and continues with the next set. With the final clusters of haplotypes, BHap finds the consensus haplotype in each cluster, whose polymorphisms are those shared by the majority of haplotypes in the cluster.
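A rough sketch of this coverage-based clustering follows (a simplified illustration: the tie-breaking by shared polymorphic sites and the final consensus step are omitted, and all names are illustrative):

```python
def cluster_by_coverage(best_set, other_sets):
    """Cluster haplotypes reconstructed with different k-mer lengths by coverage.

    `best_set` and each set in `other_sets` are lists of (coverage, sites)
    pairs. Haplotypes from the best k-mer length seed the clusters; each
    remaining haplotype joins the cluster with the closest average coverage,
    unless the gap exceeds half of the average pairwise coverage difference
    between existing clusters, in which case it starts a new cluster.
    """
    clusters = [[hap] for hap in best_set]

    def cluster_cov(cluster):
        return sum(cov for cov, _ in cluster) / len(cluster)

    for hap_set in other_sets:
        for coverage, sites in hap_set:
            covs = [cluster_cov(c) for c in clusters]
            pairs = [(a, b) for i, a in enumerate(covs) for b in covs[i + 1:]]
            avg_diff = sum(abs(a - b) for a, b in pairs) / len(pairs) if pairs else 0.0
            best = min(range(len(clusters)), key=lambda i: abs(coverage - covs[i]))
            if pairs and abs(coverage - covs[best]) > avg_diff / 2:
                clusters.append([(coverage, sites)])      # too far from every cluster
            else:
                clusters[best].append((coverage, sites))  # join the closest cluster
    return clusters
```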
2.4 Evaluation of BHap and other tools
We used precision, recall and F1 score to assess the performance of BHap and other tools on simulated data. On experimental data, where the haplotypes were unknown and these measurements could not be calculated, we used the reliability defined by the EVORhA study instead (Pulido-Tamayo et al., 2015).
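For concreteness, these measurements can be computed from the predicted and known polymorphic sites as sketched below (site positions are hypothetical):

```python
def evaluate(predicted_sites, known_sites):
    """Precision, recall and F1 of predicted polymorphic sites against known ones."""
    predicted, known = set(predicted_sites), set(known_sites)
    tp = len(predicted & known)                     # correctly predicted sites
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(known) if known else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(evaluate({100, 2000, 35000, 90000}, {100, 2000, 35000, 120000}))  # (0.75, 0.75, 0.75)
```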
3 Results
3.1 BHap has a robust performance with varied parameter values
To evaluate BHap, we compared the BHap predicted haplotypes with known haplotypes on simulated datasets (Material and Methods). We found the corresponding known haplotype for each predicted haplotype. We then compared the polymorphic sites in known haplotypes with those in the corresponding predicted haplotypes. In each dataset, to measure the performance of BHap, we calculated the precision, recall and F1 score based on the predicted polymorphic sites compared with the corresponding known ones. BHap had a good and robust performance in almost all cases.
Under the default parameters, BHap had a recall of 0.88, a precision of 0.86 and an F1 score of 0.87 (Fig. 2A and Supplementary Table S3). Such an average performance was based on 15 simulated datasets with two haplotypes in each dataset. In these datasets, the default average read length was 100 bps and the default sequencing error rate was 0.1%, which mimicked the parameters from the Illumina sequencers (Glenn, 2011). The default coverage was 100×, which was realistic in current practice with significantly decreased sequencing cost.
Fig. 2.
BHap performance under different parameters. (A) BHap performance under the default parameters. The three bars are for three different reference genomes; (B) BHap performance under different read lengths; (C) BHap performance under different error rates; (D) BHap performance under different haplotype proportions; (E) BHap performance under different mutation rates; (F) Predicted polymorphic sites by BHap compared with known polymorphic sites
We studied how the performance of BHap varied with different read lengths (Fig. 2B and Supplementary Table S4). The three measurements, especially the F1 score, were close to each other with varied read lengths. The largest F1 score appeared at the read length of 90 bps, and slightly decreased if we increased or decreased the read length. We thus concluded that the read length has a limited effect on the performance of BHap.
Since sequencing errors may affect the polymorphism identification by BHap, we studied the BHap performance under different sequencing error rates (Fig. 2C and Supplementary Table S5). Compared with the BHap performance under the default parameters, the performance fluctuated only slightly as the sequencing error rate increased. For error rates from 0.1 to 1.5%, BHap had a minimum F1 score of 0.85, a minimum recall of 0.84 and a minimum precision of 0.84. On average, BHap had an F1 score of 0.87, a precision of 0.87 and a recall of 0.87. These numbers indicate that the BHap performance is quite robust to a variety of sequencing error rates.
We also studied how the haplotype proportion affected the BHap performance (Fig. 2D and Supplementary Table S6). For instance, in a population with two haplotypes, 10/90 or 50/50, with which proportion will BHap perform better? We observed that BHap performed the best at 20/80, where it had an F1 score of 0.88. We hypothesized that the proportion 10/90 might result in the best BHap performance given a larger population coverage. We repeated the above experiments with larger coverage, and BHap indeed had better F1 scores with increased coverage (Supplementary Table S6). With higher coverage, the performance of BHap at the 10/90 proportion was similar to or better than that at the 20/80 proportion, suggesting that BHap performs differently on datasets with different haplotype proportions and that a higher coverage helps to reconstruct haplotypes better.
We also investigated how mutation rates affected the performance of BHap (Fig. 2E). We applied BHap to simulated datasets with different mutation rates (Supplementary Table S7). BHap performed better with higher mutation rates. The F1 score increased by 0.065 when mutation rate increased from 0.01 to 0.05%. The F1 score with lower mutation rates was slightly decreased. This suggests that BHap can reconstruct haplotypes with low mutation rates and is robust under different mutation rates.
It is worth mentioning that BHap accurately predicted haplotype proportions in almost all simulated datasets, including all datasets tested above (Supplementary Table S8). In the first group of fifteen simulated datasets, with the known haplotype proportion of 30/70, the estimated haplotype proportion was 29.93/70.07. Even when changing the read length, sequencing error rate and mutation rate, BHap robustly predicted an average haplotype proportion of 29.76/70.24, 29.37/70.63 and 29.85/70.15, respectively, for the default haplotype proportion 30/70. When the haplotype proportion was changed to 10/90, 20/80, 30/70, 40/60 and 50/50, the estimated proportion was 11.67/88.83, 19/81, 29.93/70.07, 34/66 and 33.33/66.67, respectively. When the haplotype proportion was changed to 10/30/60 and 10/20/30/40, the estimated proportion was 13.69/33.31/57.42 and 10.91/21.27/32.9/46.33, respectively. In summary, with the exception of the haplotype proportion 50/50, BHap reliably identified the haplotype proportions.
We also want to point out that BHap predicted the number of polymorphic sites reasonably well in simulated datasets (Fig. 2F), especially in datasets with two haplotypes. In the above simulated datasets, with two haplotypes in a population, BHap had a recall of 0.86, with a standard deviation of 0.11. With three haplotypes in a population, the recall was 0.78, with a standard deviation of 0.22. With four haplotypes in a population, the recall became 0.54, with a standard deviation of 0.32. Correspondingly, the reliability scores in these three scenarios were 0.43, 0.36 and 0.17, respectively. Note that the largest reliability in theory is 0.50.
The lower recall and reliability above for three and four haplotypes may be due to the relatively small coverage in these datasets, most of which have a coverage of 100×. We hypothesized that the recall and the reliability would also be reasonably good for bacterial populations with more than two haplotypes, given a higher sequencing depth. We thus examined the recall and the reliability when the coverage was high (Supplementary Table S8). We found that when coverage increased, the recall and the reliability increased as well. For the coverage of 500×, the recall was 0.92, 0.85 and 0.77, and the reliability was 0.47, 0.42 and 0.28 for two, three and four haplotypes, respectively. This implies that BHap is able to reliably predict polymorphic sites in bacterial populations, given a high sequencing depth.
3.2 BHap reconstructs haplotypes better than EVORhA on simulated datasets
Since EVORhA is the first and the only existing haplotype reconstruction tool for bacterial populations, we compared BHap with EVORhA on 216 simulated datasets (the 7th–9th and 10th–11th groups in Supplementary Table S1), with three bacterial species, different numbers of haplotypes (2–4), different sequencing depth (50×, 100×, 150×, 200× and 500×), different mutation rates (0.01, 0.02, 0.05, 0.07, 0.1 and 0.15%) and different evolution trajectories (no evolution relation, T0, T1 and T2).
On the 36 datasets in the seventh group, on average, BHap had an F1 score of 0.64 while EVORhA had an F1 score of 0.18 (Table 1 and Supplementary Table S8). For populations with two haplotypes, the average F1 score of BHap was 0.86 and the average F1 score of EVORhA was 0.23. For populations with three haplotypes, BHap had an average F1 score of 0.72 while EVORhA had 0.17. For populations with four haplotypes, BHap performed better than EVORhA as well (F1 score of 0.33 versus 0.14). In terms of different population coverage, given a haplotype proportion, the higher the coverage, the higher the F1 scores for both BHap and EVORhA (Table 1).
Table 1.
Performance comparison of BHap with EVORhA on the seventh group of simulated datasets
| Proportion (Coverage) | # of reconstructed haplotypes | Average F1 | Average precision | Average recall |
|---|---|---|---|---|
| 30/70 (50×) | 2.33 (4.33) | 0.86 (0.18) | 0.84 (0.50) | 0.87 (0.11) |
| 30/70 (100×) | 3 (4.67) | 0.85 (0.23) | 0.83 (0.51) | 0.87 (0.15) |
| 30/70 (150×) | 3.67 (5.67) | 0.85 (0.25) | 0.81 (0.51) | 0.90 (0.17) |
| 30/70 (200×) | 3.33 (5.33) | 0.89 (0.26) | 0.88 (0.49) | 0.91 (0.19) |
| 10/30/60 (50×) | 3.0 (5.33) | 0.59 (0.12) | 0.59 (0.42) | 0.59 (0.07) |
| 10/30/60 (100×) | 4.0 (5.67) | 0.68 (0.17) | 0.63 (0.44) | 0.74 (0.11) |
| 10/30/60 (150×) | 4.0 (6.7) | 0.79 (0.19) | 0.72 (0.37) | 0.87 (0.13) |
| 10/30/60 (200×) | 4.0 (5.33) | 0.84 (0.20) | 0.79 (0.36) | 0.91 (0.14) |
| 10/20/30/40 (50×) | 2.67 (5.67) | 0.32 (0.10) | 0.27 (0.33) | 0.45 (0.06) |
| 10/20/30/40 (100×) | 4.0 (7.33) | 0.36 (0.14) | 0.29 (0.34) | 0.5 (0.09) |
| 10/20/30/40 (150×) | 5.0 (6.0) | 0.28 (0.15) | 0.22 (0.29) | 0.42 (0.11) |
| 10/20/30/40 (200×) | 5.33 (6.0) | 0.37 (0.17) | 0.36 (0.29) | 0.54 (0.12) |
Note: In the last four columns, the first number is for BHap and the number in the parenthesis is for EVORhA.
We also noticed that EVORhA produced many false positive haplotypes per dataset, especially when there were more haplotypes in the populations (Table 1). For the 12 datasets with two haplotypes, on average, BHap predicted 3.08 haplotypes per sample while EVORhA predicted five haplotypes per sample. For another two groups of twelve datasets, with three and four haplotypes respectively, BHap predicted 3.75 and 4.25 haplotypes while EVORhA predicted 5.75 and 6.25 haplotypes, respectively. EVORhA thus predicted many more haplotypes than actually existed, but it did not provide a way to filter out the false positive ones.
Since EVORhA had its best performance at 500× coverage and higher mutation rates (Pulido-Tamayo et al., 2015), we further compared BHap with EVORhA on the eighth group of eighteen datasets with 500× coverage (Supplementary Table S9). BHap had an average F1 score of 0.78, a precision of 0.75 and a recall of 0.84. Correspondingly, EVORhA only had an average F1 score of 0.21, a precision of 0.38 and a recall of 0.15 (Supplementary Table S9). The low recall from EVORhA suggested that it may not be good at predicting actual polymorphisms in the reconstructed haplotypes. We also compared EVORhA with BHap on the ninth group of 45 datasets with very high mutation rates (0.07, 0.1 and 0.15%) (Supplementary Table S10). On these datasets, BHap had an average F1 score of 0.94, a precision of 0.96 and a recall of 0.92, while EVORhA had 0.12, 0.48 and 0.07, respectively. The performance of BHap was significantly improved at higher mutation rates, while EVORhA still had a low recall and F1 score.
We also compared BHap with EVORhA on the 117 datasets where haplotypes had a specified evolutionary relationship (the 10th and 11th groups, Supplementary Tables S11 and S12). In every case, BHap had a larger F1 score, precision and recall than EVORhA. On the 10th group of 72 datasets, we studied how BHap and EVORhA performed under different coverage and haplotype proportions. BHap had an average F1 score of 0.71 (0.49), a precision of 0.78 (0.50) and a recall of 0.66 (0.52) on datasets with three (four) haplotypes. Correspondingly, EVORhA had an F1 score of 0.36 (0.28), a precision of 0.45 (0.29) and a recall of 0.33 (0.28) on datasets with three (four) haplotypes. Across datasets, the performance of BHap changed consistently with that of EVORhA, in the sense that when BHap performed better, EVORhA also performed better, and vice versa. Both BHap and EVORhA estimated the haplotype proportions relatively well, especially when the coverage was high. However, higher coverage did not always result in better performance for EVORhA and BHap, although the F1 score was often the highest at a coverage of 400× or 500×. Higher coverage may not result in better F1 scores because the reads were not necessarily evenly distributed and because haplotypes shared polymorphic sites to different degrees, which may result in different numbers of predicted haplotypes and thus different accuracy. Since EVORhA performed better with higher mutation rates, we further compared EVORhA with BHap on the 11th group of 45 datasets, in which we studied how EVORhA and BHap performed with different mutation rates. Consistent with the above study, both tools performed better with higher mutation rates (Supplementary Table S12).
3.3 BHap reconstructs haplotypes better than EVORhA on experimental datasets
We compared BHap and EVORhA on two experimental datasets (Section 2). With the haplotypes unknown in these datasets, we could not calculate the F1 score, precision and recall. We thus focused on comparing the reliability of the two tools in two ways. One was haplotype based, where we identified the most similar pairs of haplotypes predicted by a tool, with one haplotype from a population or mixed sample and the other from its corresponding clone or unmixed sample, and then calculated the reliability based on these pairs of haplotypes. The other was SAMtools based, where we compared the polymorphic sites in the haplotypes predicted by the tool in the population or mixed sample with the polymorphic sites inferred directly by SAMtools from the raw reads of the corresponding clone or unmixed sample. We found that BHap had a higher reliability than EVORhA based on both approaches.
By the haplotype based approach, BHap had an average reliability of 0.09 and 0.10 on the mixed infection dataset and the evolved population dataset, respectively, while EVORhA had an average reliability of 0.01 and 0.03, correspondingly (Fig. 3, Supplementary Tables S2 and S13). The reliability of BHap was significantly larger than that of EVORhA on the mixed infection dataset (Mann-Whitney test p-value 3.35 × 10^-6). We did not consider the significance of the reliability difference on the evolved dataset, as there were only three time points involved. By the SAMtools based approach, BHap had an average reliability of 0.09 and 0.09 on the mixed infection dataset and the evolved population dataset, respectively, while EVORhA correspondingly had 0.01 and 0.01 (Fig. 3, Supplementary Tables S2 and S13). The reliability of BHap was significantly larger than that of EVORhA on the mixed infection dataset (Mann-Whitney test p-value 3.347 × 10^-6).
Fig. 3.
Reliability comparison on experimental datasets. The box plot for BHap is in front of that for EVORhA in the four comparisons
One should focus on the relative reliabilities above, as the absolute reliability of both tools was not large. In theory, the largest reliability is 0.50, achieved when the polymorphic sites from the clone or unmixed sample are exactly the same as those from the corresponding population. Polymorphic sites can be added or removed in practice, making the reliability lower. In fact, when we applied SAMtools to the 54 mixed infection datasets and their corresponding unmixed datasets to define polymorphic sites directly and calculated the reliability, the reliability ranged from 0.09 to 0.47, with a mean of 0.22 and a median of 0.20. Such a reliability was based on the assumption that there was only one haplotype in the population and the same one in the clones or unmixed samples. Since there may be different haplotypes in the clones, unmixed samples and populations, different haplotypes and their pairing most likely result in a much smaller reliability. It is thus likely that we achieved close to the best attainable reliabilities on these experimental datasets. More importantly, BHap had a much higher reliability than EVORhA.
4 Discussion
We developed a novel haplotype reconstruction method for bacterial populations, called BHap. With an estimated global view of the coverage of haplotypes, BHap decomposes flows of polymorphic sites in the network and finds a set of feasible flows, each representing a haplotype. BHap repeats this process with different k-mer lengths and combines the results from different k-mer lengths to generate robust predictions. Such a global-view based approach may prevent the propagation of errors made at local polymorphic sites and avoid the difficulty of extending these local sites based on adjacent local sites. Tested on simulated datasets, BHap shows a high F1 score, precision and recall. Compared with EVORhA, BHap shows much better accuracy in terms of F1 score, precision, recall and reliability.
In addition to EVORhA, we also attempted to compare BHap with ShoRAH (Zagordi et al., 2011), one of the most highly cited haplotype reconstruction tools for viral populations. We were unable to run it on our simulated datasets, nor were we able to run it on the two experimental datasets, likely because of the much lower mutation rates in these datasets. In fact, the EVORhA study also mentioned that ShoRAH cannot be run on bacterial genomes (Pulido-Tamayo et al., 2015).
For the mixed infection dataset, the proportion of two unmixed samples in the corresponding mixed sample was provided (Supplementary Table S2). We tried to compare the predicted proportions with the known ones in these datasets and found that they often did not agree well. By further applying SAMtools to every mixed sample and every unmixed sample and then comparing the identified polymorphic sites from two samples, we noticed that the correspondence between the unmixed samples and the mixed samples provided in the above link could be wrong. For instance, we often found another pair of unmixed samples had more polymorphic sites shared with a mixed sample than its assigned pair of unmixed samples in Supplementary Table S2. Therefore, we believed that the correspondence of samples and their proportions provided in this link may be inaccurate.
We compared the polymorphic sites predicted by BHap with the ‘known’ polymorphic sites. BHap is able to predict the known polymorphic sites in simulated datasets, as shown in Figure 2F. However, it predicts much fewer of the ‘known’ polymorphic sites in experimental datasets, although it still predicts many more ‘known’ sites than EVORhA. This is likely because the ‘known’ polymorphic sites in the experimental datasets cannot be well defined. It may also suggest that there is still room for improvement of bacterial haplotype reconstruction methods and tools.
BHap depends on the coverage difference of haplotypes in a population to distinguish these haplotypes. Our study shows that it works well on datasets with different coverage of haplotypes, such as a 30/70 haplotype proportion. However, based on our study on simulated datasets, it has a much lower F1 score, around 0.50, when the haplotypes have the same coverage. In this regard, BHap is not applicable to every bacterial population. Moreover, although BHap performs well in terms of identifying known polymorphic sites, it does not precisely identify all known polymorphic sites. This may be caused by the de Bruijn graph data structures and related Velvet libraries we used. In addition, we also noticed that BHap did not perform as well in populations with a specified evolution trajectory as in populations with no shared polymorphic sites among haplotypes, implying that it is important to take evolutionary information into account for the inference and prediction. In the future, we hope to further improve BHap by taking these aspects into account.
Supplementary Material
Acknowledgements
We thank Amlan Talukder and Clayton Barham for the proofreading of the manuscript.
Funding
This work was supported by the National Science Foundation (Grant Nos. 1661414, 1356524 and 1149955) and the National Institutes of Health (Grant No. R15 GM123407). Funding for open access charge: the National Institutes of Health grant R15 GM123407.
Conflict of Interest: none declared.
References
- Astrovskaya I. et al. (2011) Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC Bioinformatics, 12, S1.
- Barrick J.E., Lenski R.E. (2009) Genome-wide mutational diversity in an evolving population of Escherichia coli. In: Cold Spring Harbor Symposia on Quantitative Biology. Cold Spring Harbor Laboratory Press.
- Eyre D.W. et al. (2013) Detection of mixed infection from bacterial whole genome sequence data allows assessment of its role in Clostridium difficile transmission. PLoS Comput. Biol., 9, e1003059.
- Glenn T.C. (2011) Field guide to next-generation DNA sequencers. Mol. Ecol. Resour., 11, 759–769.
- Huang A. et al. (2011) QColors: an algorithm for conservative viral quasispecies reconstruction from short and non-contiguous next generation sequencing reads. In: IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), pp. 130–136. IEEE.
- Kent W.J. (2002) BLAT–the BLAST-like alignment tool. Genome Res., 12, 656–664.
- Lang G.I. et al. (2011) Genetic variation and the fate of beneficial mutations in asexual populations. Genetics, doi: 10.1534/genetics.111.128942.
- Li H. (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987–2993.
- Li H. et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079.
- Li X., Waterman M.S. (2003) Estimating the repeat structure and length of DNA sequences using ℓ-tuples. Genome Res., 13, 1916–1922.
- Prosperi M.C., Salemi M. (2012) QuRe: software for viral quasispecies reconstruction from next-generation sequencing data. Bioinformatics, 28, 132–133.
- Pulido-Tamayo S. et al. (2015) Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations. Nucleic Acids Res., 43, e105.
- Schirmer M. (2014) Algorithms for viral haplotype reconstruction and bacterial metagenomics: resolving fine-scale variation in next generation sequencing data. PhD thesis, University of Glasgow.
- Surget-Groba Y., Montoya-Burgos J.I. (2010) Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res., 20, 1432–1440.
- Wang Y. et al. (2015) MBBC: an efficient approach for metagenomic binning based on clustering. BMC Bioinformatics, 16, 36.
- Zagordi O. et al. (2011) ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinformatics, 12, 119.
- Zerbino D.R., Birney E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829.