Abstract
Motivation
Metagenomics is the study of genetic materials directly sampled from natural habitats. It has the potential to reveal previously hidden diversity of microscopic life largely due to the existence of highly parallel and low-cost next-generation sequencing technology. Conventional approaches align metagenomic reads onto known reference genomes to identify microbes in the sample. Since such a collection of reference genomes is very large, the approach often needs high-end computing machines with large memory which is not often available to researchers. Alternative approaches follow an alignment-free methodology where the presence of a microbe is predicted using the information about the unique k-mers present in the microbial genomes. However, such approaches suffer from high false positives due to trading off the value of k with the computational resources. In this article, we propose a highly efficient metagenomic sequence classification (MSC) algorithm that is a hybrid of both approaches. Instead of aligning reads to the full genomes, MSC aligns reads onto a set of carefully chosen, shorter and highly discriminating model sequences built from the unique k-mers of each of the reference sequences.
Results
Microbiome researchers are generally interested in two objectives of a taxonomic classifier: (i) to detect prevalence, i.e. the taxa present in a sample, and (ii) to estimate their relative abundances. MSC is primarily designed to detect prevalence and experimental results show that MSC is indeed a more effective and efficient algorithm compared to the other state-of-the-art algorithms in terms of accuracy, memory and runtime. Moreover, MSC outputs an approximate estimate of the abundances.
Availability and implementation
The implementations are freely available for non-commercial purposes. They can be downloaded from https://drive.google.com/open?id=1XirkAamkQ3ltWvI1W1igYQFusp9DHtVl.
1 Introduction
A large fraction of the microbes living in diverse habitats cannot be cultured in traditional laboratory settings, because conventional study requires each microbe to be cloned and cultured in isolation. As a result, culture-based methods cannot capture entire microbial communities interacting in a given environment. In this context, metagenomics is a powerful tool for obtaining an approximate view of microbial communities by unbiased sampling of genetic material from their natural habitats. It is also widely known as environmental genomics, ecogenomics or community genomics, and has been applied to practical challenges in areas including, but not limited to, biofuels, biotechnology, food safety, agriculture and medications. For example, suppose we take a sample directly from food and perform a metagenomic study. If the study reveals microbes responsible for food poisoning, the production and/or distribution of that particular food can be stopped. In short, metagenomics has the potential for substantial practical impact.
Widespread availability of next-generation sequencing technologies has prompted a recent surge in interest in the microbiome and as a consequence metagenomics is a fast growing area in bioinformatics (Garrido-Cardenas and Manzano-Agugliaro, 2017). One common application of next-generation sequencing to microbiome research is metagenomic whole-genome shotgun sequencing (mWGS), which involves the random sequencing of genomic DNA fragments extracted from complex microbial communities. Following sequencing, taxonomic characterization of mWGS data is often performed by comparing sequenced reads to databases that consist of complete genomes from individually isolated and taxonomically classified strains of microbes. Genome databases such as RefSeq (O’Leary et al., 2015) and GenBank (Benson et al., 2008) provide a growing resource against which to characterize mWGS sequenced datasets. However, both the size of these databases and the high degree of sequence homology that can exist between related genomes mean that accurate analysis of mWGS reads is computationally challenging.
The traditional approach to metagenomic sequence classification is to align each input read to a large collection of reference genomes using alignment software such as BLAST (Camacho et al., 2009) or MegaBlast (Morgulis et al., 2008). However, aligning the reads onto the reference sequences requires a very long computing time. One way of reducing execution time is to align the reads to selected marker genes in the reference genomes instead of the whole genomes. This approach is followed in Metaphlan (Truong et al., 2015), Metaphyler (Liu et al., 2010), Motu (Sunagawa et al., 2013) and Megan (Huson et al., 2007). However, this approach also becomes infeasible as the number of reference genomes and the total number of reads keep growing.
Researchers have also tried several alignment-free methods. The most popular is based on k-mer spectrum analysis, which uses a database of distinct subsequences of length k (k-mers, in short) from the reference sequences to classify the reads. If a read has distinct k-mers from multiple reference genomes then it is assumed to be from their lowest common taxonomic ancestor (see e.g. Wood and Salzberg, 2014). The algorithms using this broad approach differ in the way the database is built and queried. LMAT (Ames et al., 2013) is one of the first such algorithms. Subsequent improvements include Kraken (Wood and Salzberg, 2014), which improves speed and memory usage by employing a classification tree, and CLARK (Ounit et al., 2015), which improves memory usage further by storing only a reduced set of target-specific k-mers. CLARK-S (Ounit and Lonardi, 2016) improves the specificity of CLARK at the cost of a little speed and memory. Metacache (Müller et al., 2017) reduces memory usage further through a novel application of the minhashing technique to a subset of the context-aware k-mers. Kaiju (Menzel et al., 2016) uses the same k-mer based approach but exploits the fact that microbial and viral genomes are typically densely packed with protein-coding genes, which are more conserved and more tolerant to sequencing errors because of the degeneracy of the genetic code. Some applications require an estimate of the relative abundance of constituent species. Using a probabilistic method based on Bayesian likelihood, Bracken (Lu et al., 2016) augments the output of Kraken with estimated abundance. There are also a few alignment-free methods other than k-mer spectrum analysis. MetaKallisto (Schaeffer et al., 2017) uses pseudo-alignments. WGSQuikr (Koslicki et al., 2014) and Metapallette (Koslicki and Falush, 2016) use a compressed sensing approach in which the abundance is estimated by solving a linear system of equations on the k-mer spectrum profile of the input reads and that of the reference sequences.
As noted above, several computational algorithms exist in this domain to determine the identity of microbes in a heterogeneous microbial sample. These algorithms suffer, to varying degrees, from low accuracy, high execution time and high memory usage when identifying the taxa present in a sample. For some algorithms the user also needs to post-process the output. To address these issues we offer a very fast and highly accurate metagenomic sequence classification (MSC) algorithm.
The algorithmic approach of MSC is a hybrid of the alignment based and the alignment-free k-mer based approaches. Like the other k-mer based algorithms, it first identifies the unique k-mers across all of the target sequences. However, all of these unique k-mers are not equally effective for classification. Thus we rank the unique k-mers based on the information of the target sequences they contain and discard all but a small set of the most informative k-mers. All other k-mer based approaches search these discriminating k-mers in the sequence reads of a metagenomic sample to save the computational resources required for aligning the reads to all genomes. However, they trade the resources with the length of k-mers. A smaller value of k makes them error prone. Suppose we find a discriminating k-mer in a read and classify the read based on it. This finding could lead us erroneously to count it as a true positive since we are not considering the surrounding regions of the discriminating k-mer. Our approach reduces the error by also considering the genomic regions of the target sequences flanking the informative k-mers. We concatenate all of these super k-mers of a particular target to build a model sequence. We then align the metagenomic sequencing reads onto the model sequences and estimate the genus and species information of the microbes residing in the given sample. Our novel approach drastically reduces the intensive computational burden of alignment and greatly enhances the accuracy by harnessing the positive aspect of both approaches.
We demonstrate its utility by identifying the genome-specific most informative k-mers in 16 780 bacterial genomes downloaded from NCBI’s GenBank repository. We further use these unique k-mers generated for different bacterial species to build model sequences as a means for rapid taxonomic profiling of complex microbial communities. Experimental evaluations show that the proposed algorithm is indeed one of the most powerful metagenomic sequence classifiers currently available in this domain.
2 Materials and methods
2.1 Reference genomes
All the bacterial genomes available in GenBank were listed using ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt, which was parsed to select a single genome for each unique species_taxid. Where possible the representative genome was selected for each unique taxid; however, if no representative genome was available, a single genome was randomly selected from all the accessions annotated as full. The full taxonomy of each genome was inferred by querying its assembly_accession number in the BioSQL database (biosql.org). At the time of downloading 16 780 genomes were available that met the above criteria.
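The selection rule above can be sketched as follows. This is only an illustration under stated assumptions: the rows are taken to be already parsed into dictionaries, the field names (`species_taxid`, `refseq_category`, `genome_rep`, `assembly_accession`) and the values `"representative genome"` and `"Full"` are assumed to follow the column conventions of NCBI's assembly_summary.txt, and the function name `select_genomes` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List


def select_genomes(rows: Iterable[Dict[str, str]], seed: int = 0) -> Dict[str, str]:
    """Pick one assembly accession per species_taxid.

    Prefer a representative genome; otherwise pick at random among the
    assemblies annotated as full. Field names are assumptions.
    """
    rng = random.Random(seed)
    by_species: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        by_species[row["species_taxid"]].append(row)

    chosen: Dict[str, str] = {}
    for taxid, group in by_species.items():
        representative = [r for r in group
                          if r["refseq_category"] == "representative genome"]
        if representative:
            chosen[taxid] = representative[0]["assembly_accession"]
            continue
        full = [r for r in group if r["genome_rep"] == "Full"]
        if full:  # species with no full assembly are skipped
            chosen[taxid] = rng.choice(full)["assembly_accession"]
    return chosen
```

Fixing the seed makes the random fallback reproducible, which may matter when the reference set has to be rebuilt later.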
2.2 Microbial metagenomes
A mock community of 36 bacterial species prevalent in the human microbiome (Peterson et al., 2009) was created as described by Diaz et al. (2012). Sequencing libraries were created using the Illumina TruSeq Nano DNA HT kit and sequenced on the Illumina HiSeq 2000 platform to generate 2 × 150 bp paired-end reads. Prior to analysis reads were processed with Trimmomatic to remove sequencing adapters. In addition to mWGS sequencing, the composition of the mock community was estimated using 16S rRNA gene abundance. Briefly, the 16S rRNA gene was polymerase chain reaction amplified using primers 27F and 1492R and amplicons were sequenced on the PacBio RS II sequencing platform (P6V4v2 chemistry) using circular consensus sequencing followed by post-processing with CCS2 v3.0.1. Relative abundance of each taxon was estimated by aligning reads to a database containing a single representative 16S gene copy for each species using the cross_match alignment tool in Phrap (http://www.phrap.org/phredphrapconsed.html). We have also created another mock community of 11 species using the same procedure stated above. In subsequent sections we refer to these datasets as MOCK-1 and MOCK-2 containing 11 and 36 species, respectively.
2.3 In silico microbial metagenomes
We simulated two Illumina-style metagenomic paired-end datasets containing 61 and 282 species, respectively, using the shotgun sequence simulator Grinder (Angly et al., 2012) on a collection of 16 780 reference genomes from the NCBI GenBank repository. We varied the relative abundance of the species significantly: from 0.0013 to 0.89 in the first dataset and from 0.000021 to 0.18 in the second dataset. For each of these datasets, we computed the number of reads for a species in proportion to its relative abundance, rounded up to the next even number so that the reads form exact pairs, and scaled so that the least abundant species has 1000 reads. All the reads are of length 150 and the insert lengths are drawn from a normal distribution with a mean of 500 and a standard deviation of 50. The good and bad quality scores were set to 30 and 10, respectively. In short, we used the Grinder command-line options -rd 150 -id 500 normal 50 -ql 30 10 -fq 1 and left the rest at their defaults. In the subsequent sections we refer to these datasets as SIM-1 and SIM-2, respectively. An in silico community of 192 bacterial species was generated by downloading the simBA-5 dataset from Ounit et al. (2015) and filtering it to remove reads annotated as archaea. We refer to this dataset as SIM-3.
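The read-count rule can be illustrated with a short sketch. It reflects one reading of the description (scale so that the least abundant species gets 1000 reads, then round each count up to an even number); the function name `read_counts` and the rounding guard are illustrative assumptions, not part of the original pipeline.

```python
import math
from typing import Dict


def read_counts(abundance: Dict[str, float], min_reads: int = 1000) -> Dict[str, int]:
    """Reads per species, proportional to relative abundance.

    The least abundant species receives `min_reads` reads and every
    count is rounded up to the next even number (whole read pairs).
    """
    smallest = min(abundance.values())
    counts: Dict[str, int] = {}
    for species, a in abundance.items():
        # round() guards against floating-point noise before taking the ceiling
        n = math.ceil(round(min_reads * a / smallest, 9))
        counts[species] = n + (n % 2)  # bump odd counts to the next even number
    return counts
```

For instance, with hypothetical abundances 0.2, 0.3001 and 0.4999, the scaled values are 1000, 1500.5 and 2499.5, which become 1000, 1502 and 2500 reads.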
2.4 Metagenomic sequence classifier
There are four basic steps in the algorithm MSC that we propose in this paper. At first we mine k-mers that are unique to each of the input genomic sequences. The second step involves finding the most informative k-mers from the set of unique k-mers associated with each genome. We then build a model sequence for each genome using the topmost informative k-mers mined from that genome. Instead of using the full genome sequences, these shorter pre-built model sequences are then employed to accurately profile two taxonomic levels (i.e. species and genus). We also estimate approximate abundance of the two taxonomic levels residing in a given metagenomic sample. Details of each of these four steps are given below.
2.4.1 Identification of the unique k-mers
In this step we select a set of unique k-mers from the given set of target genomes $G = \{G_1, G_2, \ldots, G_n\}$. A k-mer is said to be unique if it occurs in only one of the genomes. Let the total number of base-pairs in all the genomes be B. The number of k-mers will be no more than $B$. Since each nucleotide (i.e. a, c, g or t) can be encoded in 2 bits, we need at most $2kB$ bits to store them. We are using 16 780 bacterial genomes and each genome contains on the order of $10^6$ nucleotides on average. If we want to search for unique k-mers from the set of all such k-mers in one single pass, it will be a very memory-intensive and time-consuming procedure. To reduce the memory footprint we use an out-of-core approach. We partition G into P non-overlapping parts. In each part we search for unique k-mers across all the genomes residing in that particular part. We do not employ all the unique k-mers in the part; we only use a randomly picked subset of them. After collecting unique k-mers for each genome in one part, we save the unique k-mers with the associated genome ids on disk. Finally, we collect the set of unique k-mers for each genome from the reduced set of k-mers found in the previous step by checking whether each of them occurs in the other genomes. The details of the above steps are described below.
We maintain two data structures to find the unique k-mers in each genome. Hash table H is used to record unique k-mers. Each key of the hash table H represents a k-mer and its corresponding value refers to the associated genome id. Hash set S contains the non-unique k-mers found in more than one genome. In both data structures the expected time complexity of the search, update and delete operations is O(1).
For each k-mer x in each genome we proceed as follows. We first check if x or its reverse complement $\bar{x}$ is already declared as non-unique by searching for it in the hash set S. If S already contains either x or $\bar{x}$, it is non-unique. Otherwise we search for x in the hash table H. There are two possibilities: there is an entry for x in H or there is no entry for x in H. If there is an entry for x in H, there are again two possibilities: (i) the k-mer in the entry corresponds to another genome or (ii) it corresponds to the same genome (as that of x). In the former case, we declare x as non-unique by recording both x and its reverse complement in S and delete the entry from H. In the latter case, we do not do anything. If we do not find any entry for x in H, we create an entry for x in H associated with its genome id.
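The bookkeeping with H and S can be sketched as below. This is a minimal in-memory illustration that assumes lowercase sequences over a, c, g and t; the function names are hypothetical and the sketch is not the released implementation.

```python
from typing import Dict, Iterator, Set

_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a lowercase DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every k-mer of the sequence, left to right."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def unique_kmers(genomes: Dict[str, str], k: int) -> Dict[str, str]:
    """Map each k-mer left in H to the id of the only genome it was seen in."""
    H: Dict[str, str] = {}  # candidate unique k-mer -> genome id
    S: Set[str] = set()     # k-mers already declared non-unique
    for genome_id, sequence in genomes.items():
        for x in kmers(sequence, k):
            x_rc = reverse_complement(x)
            if x in S or x_rc in S:
                continue                 # already known to be non-unique
            owner = H.get(x)
            if owner is None:
                H[x] = genome_id         # first sighting of x
            elif owner != genome_id:
                S.add(x)                 # x occurs in a second genome
                S.add(x_rc)
                del H[x]
            # owner == genome_id: repeated within the same genome, nothing to do
    return H
```

As in the description, H is probed with x only. Storing a canonical form of each k-mer (for example the lexicographically smaller of x and its reverse complement) would be one way to also catch a k-mer whose reverse complement is already in H; that variant is a suggestion rather than something stated in the text.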
As noted earlier, we partition G into P parts (for some suitable value of P) to reduce the memory footprint. For each of the P parts we follow the same procedure as described above and get H containing all the unique k-mers in that partition. Instead of using all of them, we only save a randomly chosen subset of these unique k-mers associated with their genome ids in the disk for each part. After exhausting all the parts we merge the unique k-mers for all parts by iteratively retrieving the unique k-mers along with their corresponding genome ids from the disk. Here we follow the same procedure by maintaining two data structures similar to H and S. Finally we collect the unique k-mers. Now these k-mers are guaranteed to be unique across all the genomes. In the experimentation we set k = 25.
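A compact sketch of the partition, sample and verify scheme follows. It keeps everything in memory, so the dictionary `candidates` only stands in for the on-disk store, and it replaces the H/S-style merge by a direct verification pass over all genomes (treating an occurrence of the reverse complement as an occurrence). The round-robin partitioning, the fixed per-part `sample_size` and all function names are assumptions made for illustration.

```python
import random
from typing import Dict, Set

_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a lowercase DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def unique_in_part(part: Dict[str, str], k: int) -> Dict[str, str]:
    """Unique k-mers within one part, using the H / S bookkeeping."""
    H: Dict[str, str] = {}
    S: Set[str] = set()
    for gid, seq in part.items():
        for i in range(len(seq) - k + 1):
            x = seq[i:i + k]
            x_rc = reverse_complement(x)
            if x in S or x_rc in S:
                continue
            owner = H.setdefault(x, gid)  # insert x if it is new
            if owner != gid:              # x was seen in another genome
                S.update((x, x_rc))
                del H[x]
    return H


def sampled_unique_kmers(genomes: Dict[str, str], k: int, parts: int,
                         sample_size: int, seed: int = 0) -> Dict[str, str]:
    """Sample unique k-mers per part, then verify them against all genomes."""
    rng = random.Random(seed)
    ids = sorted(genomes)
    candidates: Dict[str, str] = {}  # stand-in for the k-mers saved on disk
    for p in range(parts):
        part = {gid: genomes[gid] for gid in ids[p::parts]}
        found = sorted(unique_in_part(part, k).items())
        candidates.update(rng.sample(found, min(sample_size, len(found))))
    # Verification pass: drop any candidate that occurs in a genome other
    # than its owner, either directly or as a reverse complement.
    for gid, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            x = seq[i:i + k]
            for y in (x, reverse_complement(x)):
                if candidates.get(y, gid) != gid:
                    del candidates[y]
    return candidates
```

Only one part is processed at a time in the first loop, which is what bounds the memory footprint; the second loop corresponds to the extra pass over the genome dataset.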
2.4.2 Detecting informative k-mers
After finding the unique k-mers, we extract the most informative ones among them. A unique k-mer is defined to be informative if it ‘matches’ the genome it comes from ‘closely’. The notion of ‘matching closely’ is defined as follows. Let x be any unique k-mer that is in the genome Gi (for some value of i). We choose a suitable integer $\ell$ and collect all the $\ell$-mers from the genome Gi. We form a vector $V$ of length $4\ell$, where $V[u, j]$ is the fraction of times that the character u occurs in the $\ell$-mers of Gi in position j, for $u \in \{a, c, g, t\}$ and $1 \le j \le \ell$. We compute a similar vector Vx using the $\ell$-mers of x. We say that x matches Gi closely if the Euclidean distance between V and Vx is ‘small’. Let the Euclidean distance between V and Vx be referred to as the distance between x and Gi. For each of the unique k-mers chosen for Gi we compute its distance to Gi. The unique k-mers are sorted by this distance in non-decreasing order. The top X (for some appropriate value of X) unique k-mers from this sorted list are used to build the model sequence for the genome Gi. In the experimentation we set $\ell = 4$ and X = 500.
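The ranking step can be sketched as follows, under one reading of the definition: the profile of a sequence has one entry per nucleotide and position within the short windows (4 entries per position), and k-mers are ranked by Euclidean distance between their profile and the genome's profile. The window length is named `ell` here, and the function names and the entry layout are illustrative assumptions.

```python
import math
from typing import List, Sequence

ALPHABET = "acgt"


def profile(sequence: str, ell: int) -> List[float]:
    """Positional nucleotide frequencies over all windows of length ell.

    Entry 4 * j + u holds the fraction of windows whose j-th character
    is ALPHABET[u]; the sequence must have at least ell characters.
    """
    n_windows = len(sequence) - ell + 1
    counts = [0] * (4 * ell)
    for start in range(n_windows):
        for j in range(ell):
            u = ALPHABET.index(sequence[start + j])
            counts[4 * j + u] += 1
    return [c / n_windows for c in counts]


def most_informative(genome: str, unique: Sequence[str],
                     ell: int, top_x: int) -> List[str]:
    """Return the top_x unique k-mers closest to the genome profile."""
    genome_profile = profile(genome, ell)
    ranked = sorted(unique,
                    key=lambda x: math.dist(genome_profile, profile(x, ell)))
    return ranked[:top_x]
```

With the settings reported in the text one would call `most_informative(genome, unique, ell=4, top_x=500)`; `math.dist` requires Python 3.8 or later.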
2.4.3 Building model sequences
In this step we build a model sequence for each genome $G_i \in G$. Let the set of most informative k-mers in Gi be Ii. Each k-mer in Ii is transformed to a superstring. We know the position of occurrence of each unique k-mer in Gi. Let x be any unique k-mer and let p be the starting position of its occurrence in its genome Gi. We extract c nucleotides (for some suitable value of c) to the left of position p in Gi and c nucleotides starting from position p + k. We concatenate the c nucleotides to the left of position p, x, and the c nucleotides starting from position p + k to get a superstring for x. In other words, we transform a unique k-mer x to a unique $(k + 2c)$-mer y. Clearly, each superstring y will be unique across all the genomes in G, since it contains x. We refer to these unique $(k + 2c)$-mers as the most informative regions of a particular genome Gi. We concatenate all such regions of a genome Gi to build a model sequence. Let these model sequences be $M = \{M_1, M_2, \ldots, M_n\}$, where Mi is the model sequence of Gi.
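A minimal sketch of the construction is given below. It assumes 0-based start positions, clips the flanks at the genome boundaries (a detail the text does not specify), and also records where each informative k-mer ends up inside the model sequence, since those coordinates are needed later. The function name is hypothetical.

```python
from typing import List, Tuple


def build_model_sequence(genome: str, positions: List[int], k: int,
                         c: int) -> Tuple[str, List[Tuple[int, int]]]:
    """Concatenate the flanked regions around each informative k-mer.

    Returns the model sequence and, for every k-mer, its (start, end)
    span inside the model sequence (end exclusive).
    """
    pieces: List[str] = []
    kmer_spans: List[Tuple[int, int]] = []
    offset = 0  # length of the model sequence built so far
    for p in positions:
        left = max(0, p - c)                   # clip the left flank
        right = min(len(genome), p + k + c)    # clip the right flank
        piece = genome[left:right]
        kmer_start = offset + (p - left)
        kmer_spans.append((kmer_start, kmer_start + k))
        pieces.append(piece)
        offset += len(piece)
    return "".join(pieces), kmer_spans
```

For example, `build_model_sequence("acgtacgtacgt", [0, 8], 3, 2)` yields the pieces `acgta` and `gtacgt`, and the k-mer spans (0, 3) and (7, 10) in the concatenated model.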
2.4.4 Inferring genus, species and abundance
In this step we infer genus, species and their corresponding abundances from a given metagenomic sample E. Metagenomic sequence reads from the sample E are aligned onto each of the model sequences Mi built in the previous step. Suppose a read r is aligned onto a model sequence Mi within a certain number d of errors, i.e. substitutions, insertions or deletions (d is fixed in our experiments). If the read r contains one of the most informative k-mers belonging to Mi, we say that r belongs to the genome Gi. This is referred to as a hit. If the metagenomic sample E does contain reads originating from Gi, this hit is considered a true hit. This inference step can be done for all the reads of E together by maintaining two databases. The first database D1 contains the model sequences M and the second database D2 contains the start and end positions of the most informative k-mers for each $M_i \in M$. When a read r is aligned onto a model sequence Mi contained in D1, we know its alignment position and we also know its length a priori. By looking into D2 we can therefore identify whether r contains informative k-mers. If r does not contain any k-mer stored in D2, we discard the read from further consideration.
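The D2 lookup can be sketched as an interval query. The sketch assumes that the k-mer spans of one model sequence are stored sorted by start, do not overlap, and all have length k (so their ends are sorted too), and that the read covers exactly `read_length` positions of the model sequence, i.e. alignment gaps are ignored. The function name is hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def contained_kmers(spans: List[Tuple[int, int]], read_start: int,
                    read_length: int) -> List[Tuple[int, int]]:
    """Informative k-mer spans lying entirely inside an aligned read.

    `spans` holds (start, end) pairs with end exclusive, sorted by start.
    An empty result means the read would be discarded.
    """
    read_end = read_start + read_length
    starts = [start for start, _ in spans]  # could be precomputed per model
    first = bisect_left(starts, read_start)
    hits: List[Tuple[int, int]] = []
    for start, end in spans[first:]:
        if end > read_end:   # this span and all later ones stick out
            break
        hits.append((start, end))
    return hits
```

A read that yields a non-empty list counts as a hit for the genome of that model sequence.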
If a read r contains informative k-mers from multiple genomes, then r is assigned to all of those genomes. Since the taxonomic profiling of MSC is based on the model sequences of the genomes, not all the reads from the metagenomic sample E will be classified to taxonomic levels, because the model sequences cover only tiny stretches of the original genomes. As stated earlier, MSC is designed to detect species residing in E with a high level of accuracy. Thus instead of estimating the true abundance we offer an approximate abundance estimate for each taxonomic level, i.e. genus and species. The abundance of a taxon is computed as the ratio of the true hits for that taxon to the total number of hits. Suppose that a total of N reads are aligned onto the model sequences (i.e. hits). Let ni be the number of reads aligned onto a particular model sequence Mi of Gi (i.e. true hits) and let the corresponding taxon be Ti. The approximate abundance of Ti in E will be $n_i / N$. The abundance of the genera is computed in two steps. First we group all the taxa identified in the metagenomic sample by genus. Then, for each genus, we compute the ratio of its true hits (i.e. the total number of reads classified to any taxon in that genus) to the total number of hits.
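The abundance estimate reduces to counting hits, as the following sketch shows. The input is assumed to be one species label per hit plus a species-to-genus mapping; both the function name and the input format are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def estimate_abundances(hits: Iterable[str], genus_of: Dict[str, str]
                        ) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Approximate species and genus abundances from per-read hits.

    Each species gets n_i / N, where n_i is its hit count and N the
    total number of hits; genus counts are sums over member species.
    """
    species_counts = Counter(hits)
    total = sum(species_counts.values())
    genus_counts: Counter = Counter()
    for species, n in species_counts.items():
        genus_counts[genus_of[species]] += n
    species_abundance = {s: n / total for s, n in species_counts.items()}
    genus_abundance = {g: n / total for g, n in genus_counts.items()}
    return species_abundance, genus_abundance
```

A read assigned to several genomes contributes one hit per genome under this convention, so N counts hits rather than distinct reads.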
2.5 Analyses
2.5.1 Number of possible unique k-mers
Consider a database consisting of n genomes each of length m. Assume a random model where each character of each genome is uniformly randomly picked from $\{a, c, g, t\}$. Let the genomes be $G_1, G_2, \ldots, G_n$. Let x be any k-mer of one of the genomes, say, Gi. What is the probability that x is unique, i.e. it does not occur in any other genome? If y is any k-mer of Gj (with $j \neq i$), the probability that x = y is $(1/4)^k$. The probability that x does not equal any of the k-mers of Gj is $\left(1 - (1/4)^k\right)^{m-k+1}$. Here we assume that the k-mers are independent, which is incorrect. However, a similar analysis has been employed in many prior works (see e.g. Buhler and Tompa, 2002) with the rationale that such analyses give a good guideline on the practical performance.
As a result, the probability that x in Gi is unique is $\left(1 - (1/4)^k\right)^{(m-k+1)(n-1)}$. This probability could be very close to 1. As an example, let and n = 100. In this case, the above probability is around . This in turn means that when k is moderately large, each of the k-mers is likely to be unique.
We can do a slightly different and correct analysis as follows. We know that the probability of x being the same as y is $(1/4)^k$. This means that the expected number of k-mers of Gj that equal x is $(m-k+1)(1/4)^k$. Also, the expected number of k-mers from the genomes other than Gi that equal x is $(n-1)(m-k+1)(1/4)^k$. When and n = 100, this number is ! Along the same lines we can realize that the expected number of k-mers that match is no more than $n(n-1)(m-k+1)^2 (1/4)^k$. For the above example, this number is no more than .
This is the reason why we choose a random sample of all the unique k-mers in the step of identifying unique k-mers.
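To get a feel for the magnitudes, the quantities of this analysis can be evaluated numerically. Under the random model, a k-mer is unique with probability $(1 - 4^{-k})^{(m-k+1)(n-1)}$ and the expected number of k-mers in the other genomes that equal it is $(n-1)(m-k+1)4^{-k}$. The values plugged in below are assumptions chosen for illustration: k = 25 (the value used in the experiments), n = 100 and a hypothetical genome length m = 10^6; the function names are likewise hypothetical.

```python
import math


def prob_unique(k: int, m: int, n: int) -> float:
    """Probability that a given k-mer occurs in no other genome,
    treating k-mers as independent (random model)."""
    per_position = 4.0 ** -k
    # log1p keeps precision when per_position is tiny
    return math.exp((m - k + 1) * (n - 1) * math.log1p(-per_position))


def expected_matches(k: int, m: int, n: int) -> float:
    """Expected number of k-mers in the other n - 1 genomes equal to x."""
    return (n - 1) * (m - k + 1) * 4.0 ** -k


if __name__ == "__main__":
    k, m, n = 25, 10**6, 100  # m is an assumed genome length
    print(f"P(unique)         = {prob_unique(k, m, n):.10f}")
    print(f"E[matches with x] = {expected_matches(k, m, n):.3e}")
```

With these assumed values the expected number of matches is below $10^{-7}$, so essentially every k-mer is unique and the pool of unique k-mers is about as large as the database itself.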
2.5.2 Time complexity—in-core model
First consider the in-core model, i.e. assume that all the genomes can be stored in the core memory. Let the number of genomes be n and the length of each genome be m. Assume for simplicity that . The total number of k-mers in the database is O(mn).
Identification of unique k-mers: the expected time spent on the hash tables S and H is O(1) per operation. For each k-mer x from each genome, in the worst case we perform a constant number of operations in S and H. Thus the total expected time spent for each k-mer is O(1). This means that the expected time spent in identifying unique k-mers is O(mn). Let s be the sampling rate used to pick unique k-mers, i.e. we pick a fraction s of all possible unique k-mers. Total expected time spent in this step is O(mn).
Identification of informative unique k-mers: for each genome, we form a vector of length $4\ell$, where $\ell$ is the length of the short substrings used to define informativeness. This takes O(m) time per genome. Across all the genomes, the time spent is O(mn). For each of the unique k-mers we have to form a vector. This will take O(1) time per k-mer. Thus the time taken to form vectors for the unique k-mers is O(smn). Finding the top X unique k-mers can be done in O(smn) time. In total this step takes O(mn) time.
Building model sequences: from each informative unique k-mer we form a superstring which takes O(1) time. Thus the total time spent for this step is O(nX) (assuming that ).
Aligning the reads: let N be the number of reads and the length of each read be r. We use Bowtie to do the alignment. Bowtie will first construct a database using all the model sequences. The total length of all the model sequences is O(Xn). Thus the time to construct the database is O(Xn). Followed by this, each read will be aligned. The time spent for each read will be where h is the number of hits for the read. If we assume that , then the total alignment time is .
In summary, the runtime of the entire algorithm is $O(mn + Nr)$, i.e. linear in the total size of the genomes and the reads, which is, clearly, asymptotically optimal.
2.5.3 Time complexity—out-of-core model
Now consider the out-of-core model. Assume that the core memory is not large enough to hold all the genomes. In fact this is the model that holds in practice and the one that we have implemented. Assume that the core memory can only hold $n/P$ of the genomes. Since the time taken for the input/output (I/O) operations is typically much more than the time spent on local computations, it is customary in the literature to report only the I/O complexity for out-of-core algorithms. We follow this custom for our MSC algorithm also.
Identification of unique k-mers: we bring $n/P$ genomes at a time into the core memory and identify the unique k-mers in each part. Over all the parts, we do one pass through the entire genome dataset. Following this we bring all the saved unique k-mers from the disk into the core memory. Now we have to check which among these are indeed unique across all the genomes. This will take one more pass through the genome dataset. We assume that the randomly picked set of unique k-mers is small enough to fit in the core memory.
Identification of informative unique k-mers: we can form the vectors for all the genomes in one pass. Since there is only one vector per genome, the genome vectors need only a small amount of memory. Followed by this we have to compute the vectors for the unique k-mers. This does not involve any I/O operations.
Building model sequences: from each informative unique k-mer we have to form a superstring. This involves accessing the corresponding genome. This step can be done by bringing $n/P$ genomes at a time into the core memory. We will form superstrings for all the informative unique k-mers that are from the genomes currently in the core memory. Thus this entire step can be completed in one pass through the genome dataset.
Aligning the reads: the database on the model sequences can be constructed in memory. We can align the reads in one pass through the reads.
Putting together, the algorithm takes four passes through the genome dataset and one pass through the read dataset. Clearly, the algorithm is asymptotically optimal in its I/O complexity.
3 Results
3.1 Taxonomic profiling of microbial metagenomes
To provide a benchmark of taxonomic abundance estimation in in silico and mock bacterial communities, two k-mer based tools were selected based on previous reports of their performance (Lindgreen et al., 2016). CLARK v1.2.3.2 was downloaded and installed following online instructions (http://clark.cs.ucr.edu) and a discriminatory k-mer database for bacteria/archaea was created on November 21, 2017 using the default script set_targets.sh with the database option bacteria. Following Ounit and Lonardi (2016) a database of discriminative spaced k-mers for use with CLARK-S was subsequently built using the script buildSpacedDB.sh. Kraken v1.0 was downloaded from http://ccb.jhu.edu/software/kraken and a k-mer database for bacteria was built on November 9, 2017 using the command kraken-build --download-library bacteria followed by kraken-build --build.
Taxonomic profiling with Kraken was performed with the command-line options --preload --threads 4 --fastq-input followed by kraken-mpa-report. Taxonomic profiling with CLARK was run using the script classify_metagenome.sh with the options --threads 4 --gzipped, followed by estimate_abundance.sh. CLARK-S was run by adding the option --spaced when running classify_metagenome.sh. For both Kraken and CLARK, taxa were quantified at species level and genus relative abundance estimates were calculated by collapsing values for all the species belonging to the same genus. MSC was also executed using four threads.
All profiling of mock community and in silico datasets were performed on HP Proliant XL Series servers with 28 cores at 2.6 GHz and 512 GB RAM. Servers were accessed via a PBS Torque cluster, and approximate runtime and memory usage statistics were retrieved via the Moab job scheduler.
3.2 Performance metrics
To demonstrate the utility of discriminatory k-mers for taxonomic profiling we compared the performance of MSC with two widely used, k-mer-based tools, CLARK(-S) and Kraken, selected for their high speed and accuracy. We computed the following performance metrics related to accuracy:
• Recall: as in information retrieval, recall is the fraction of the taxa (at a given level, i.e. genus or species) in the metagenomic sample that are successfully detected. It is defined as: $\text{Recall} = \frac{TP}{TP + FN}$. Here TP stands for the number of true positives, i.e. the number of taxa present in the sample and correctly identified by an algorithm. FN stands for the number of false negatives, i.e. the number of taxa present in the sample and not identified by an algorithm.
• Precision: it is the fraction of retrieved taxa that are actually present in the sample. The definition is: $\text{Precision} = \frac{TP}{TP + FP}$. Here FP is the number of false positives, i.e. the number of taxa identified by an algorithm that are not in the sample.
• APrecision: we have also computed precision with respect to abundance (APrecision, in short). It is loosely defined as the proportion of identifiable reads that are assigned to correct taxonomic rank.
• F-measure: it is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score: $F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. This measure is approximately the average of the two when they are close, and is more generally the harmonic mean, which, for the case of two numbers, coincides with the square of the geometric mean divided by the arithmetic mean.
• Abundance similarity score (ASS): it measures how well the estimated abundances of the taxonomic ranks correlate with the ground truth. We use cosine similarity to estimate the abundance similarity score. It is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. It is normalized by the product of the vector lengths, so that output values close to 1 indicate high similarity. Let x and y be two n-dimensional vectors. The cosine similarity is computed as follows: $\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$.
From the ground truth of all the datasets we identify the set of taxa A that belong to each of the two taxa levels. For each algorithm we also identify the set of predicted taxa B. We compute recall and precision as $|A \cap B| / |A|$ and $|A \cap B| / |B|$, respectively. It is known from the ground truth which species and genera belong to each dataset. For each algorithm we compute the performance metrics by running each taxonomic profiler on the mock and in silico communities. The results are given in Tables 1 and 2.
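The set-based metrics and the cosine similarity are straightforward to compute; a minimal sketch follows. The function names are hypothetical, and returning 0.0 for empty sets or zero vectors is a convention chosen here, not something specified in the text.

```python
import math
from typing import Sequence, Set, Tuple


def precision_recall_f(truth: Set[str], predicted: Set[str]
                       ) -> Tuple[float, float, float]:
    """Precision, recall and F-measure from true and predicted taxa."""
    tp = len(truth & predicted)  # taxa present and correctly identified
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length abundance vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0
```

For the abundance similarity score, both vectors would be indexed by the same ordered list of taxa, with 0 for a taxon that is absent from the ground truth or from the prediction.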
Table 1.
Performance of Kraken, CLARK and MSC for the taxonomic profiling of two sequenced mock communities and three in silico communities
| Dataset | Algorithm | Time | Memory | Genus precision | Genus recall | Genus F1 score | Genus ASS | Species precision | Species recall | Species F1 score | Species ASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SIM-1 | MSC | 0:14:22 | 1 GB | 0.96078 | 1.00000 | 0.98000 | 0.72775 | 0.77632 | 0.96721 | 0.86131 | 0.67736 |
| | CLARK-S | 0:25:12 | 107 GB | 0.04905 | 0.63265 | 0.09104 | 0.69608 | 0.00475 | 0.09836 | 0.00906 | 0.29166 |
| | CLARK | 0:03:10 | 59 GB | 0.05041 | 0.63265 | 0.09337 | 0.64552 | 0.00481 | 0.09836 | 0.00917 | 0.33305 |
| | Kraken | 0:22:00 | 172 GB | 0.03716 | 0.69388 | 0.07054 | 0.70629 | 0.00365 | 0.16393 | 0.00713 | 0.33861 |
| SIM-2 | MSC | 0:44:05 | 1 GB | 0.93407 | 0.98837 | 0.96045 | 0.85812 | 0.82671 | 0.81206 | 0.81932 | 0.86336 |
| | CLARK-S | 1:25:03 | 107 GB | * | * | * | * | * | * | * | * |
| | CLARK | 0:06:08 | 78 GB | 0.13699 | 0.52326 | 0.21713 | 0.32160 | 0.01753 | 0.08511 | 0.02907 | 0.02794 |
| | Kraken | 0:26:07 | 172 GB | 0.11066 | 0.62791 | 0.18815 | 0.38444 | 0.01675 | 0.18085 | 0.03067 | 0.08766 |
| SIM-3 | MSC | 0:09:20 | 1 GB | 0.92188 | 0.60825 | 0.73292 | 0.53892 | 0.68571 | 0.38710 | 0.49485 | 0.38366 |
| | CLARK-S | 0:19:18 | 107 GB | 0.19835 | 0.98969 | 0.33046 | 0.93162 | 0.12237 | 0.98387 | 0.21766 | 0.82530 |
| | CLARK | 0:02:28 | 57 GB | 0.18356 | 0.98969 | 0.30968 | 0.92415 | 0.11568 | 0.97581 | 0.20684 | 0.84403 |
| | Kraken | 0:19:07 | 172 GB | 0.13623 | 0.96907 | 0.23888 | 0.92854 | 0.06007 | 0.98387 | 0.11323 | 0.82441 |
| MOCK-1 | MSC | 0:04:32 | 1 GB | 0.75000 | 0.90000 | 0.81818 | * | 0.60000 | 0.81818 | 0.69231 | * |
| | CLARK-S | 0:13:49 | 107 GB | 0.03636 | 0.80000 | 0.06957 | * | 0.02015 | 0.72727 | 0.03922 | * |
| | CLARK | 0:01:54 | 57 GB | 0.02105 | 0.80000 | 0.04103 | * | 0.01227 | 0.72727 | 0.02413 | * |
| | Kraken | 0:19:16 | 172 GB | 0.02821 | 0.90000 | 0.05471 | * | 0.01190 | 0.72727 | 0.02343 | * |
| MOCK-2 | MSC | 0:10:53 | 1 GB | 0.85714 | 1.00000 | 0.92308 | * | 0.42373 | 0.69444 | 0.52632 | * |
| | CLARK-S | 0:19:18 | 107 GB | 0.05426 | 0.87500 | 0.10219 | * | 0.02894 | 0.55556 | 0.05502 | * |
| | CLARK | 0:02:28 | 57 GB | 0.04158 | 0.87500 | 0.07940 | * | 0.02123 | 0.55556 | 0.04090 | * |
| | Kraken | 0:19:07 | 172 GB | 0.03668 | 0.95833 | 0.07066 | * | 0.01404 | 0.61111 | 0.02745 | * |
Note: ‘*’ denotes a missing value. CLARK-S was not able to finish execution on the SIM-2 dataset. ASS could not be computed for MOCK-1 and MOCK-2 because the true classification of the reads in these two datasets is not known.
Table 2.
Performance of Kraken, CLARK and MSC for the taxonomic profiling of a sequenced mock community (MOCK-2) and an in silico community SIM-3
| Dataset | Algorithm | Genus APrecision | Genus correlation with true abundance (Pearson r) | Species APrecision | Species correlation with true abundance (Pearson r) |
|---|---|---|---|---|---|
| MOCK-2 | MSC | 99.40 | 0.82 | 59.92 | 0.75 |
| | CLARK | 94.72 | 0.80 | 92.36 | 0.81 |
| | CLARK-S | 95.70 | 0.77 | 93.29 | 0.78 |
| | Kraken | 95.59 | 0.76 | 86.35 | 0.81 |
| SIM-3 | MSC | 90.60 | 0.41 | 76.07 | 0.17 |
| | CLARK | 96.10 | 0.78 | 90.65 | 0.43 |
| | CLARK-S | 95.78 | 0.79 | 87.30 | 0.44 |
| | Kraken | 97.05 | 0.85 | 89.03 | 0.42 |
Note: Pearson’s correlation statistics are calculated based on the relative abundance of taxa present in each community and therefore do not account for the detection of false positives.
3.3 Higher precision and recall using MSC
The performance of a classification algorithm depends on both precision and recall. An algorithm can have a high recall but a low precision, which happens when it reports many false positives alongside the taxa that are truly present. For this reason, the classification performance of an algorithm is measured by the harmonic mean of recall and precision (i.e. the F1 score). We observe that the existing algorithms for classifying metagenomic reads suffer from very high numbers of false positives and from high memory usage.
As noted earlier, the in silico datasets consist of three metagenomic samples, namely SIM-1, SIM-2 and SIM-3. First consider the SIM-1 and SIM-2 datasets. For both taxonomic levels (i.e. genus and species) our algorithm outperforms all the other algorithms with respect to precision, recall, F1 score and ASS. The F1 scores for genus are 0.98 and 0.96 in the SIM-1 and SIM-2 datasets, respectively. This supports our claim that MSC produces a very small number of false positives. The same argument also holds for species detection. Although MSC has lower recall on the SIM-3 dataset, its F1 scores are higher than those of the other tools (see Figure 1 for a visual comparison). MSC also has a very low memory footprint: in Table 1 it uses 1 GB of memory, whereas CLARK uses 57 to 78 GB and Kraken uses 172 GB. The execution times of MSC are comparable with those of the state-of-the-art algorithms, as seen in Figure 2, where memory usage is shown on a logarithmic scale for proper visualization. Now consider the MOCK-1 and MOCK-2 datasets. MSC matches or outperforms all the algorithms of interest in terms of precision, recall, F1 score and memory for both taxonomic ranks.
Fig. 1.
Comparison of various performance metrics (Precision, Recall, F1 score and ASS) of the different algorithms we experimented with (MSC, CLARK-S, CLARK, Kraken) on five datasets (SIM-1, SIM-2, SIM-3, MOCK-1 and MOCK-2). The actual abundances of the taxa present in MOCK-1 and MOCK-2 are not known, hence we could not compute ASS for them
Fig. 2.
Comparison of time and memory taken by the algorithms (MSC, CLARK-S, CLARK, Kraken) on five datasets (SIM-1, SIM-2, SIM-3, MOCK-1 and MOCK-2). Note that memory usage is shown in log scale
4 Discussion
The algorithmic approach undertaken by MSC is different from that of the other algorithms. It is designed to answer the most important question with a high level of confidence: which genera and species are present in a complex and diversified metagenomic sample? To answer this question with high accuracy we search through the reference genomes, extract unique k-mers, and build a model sequence for each genome. A model sequence contains the unique and most informative regions of a genome for classifying its two most important taxonomic ranks, i.e. genus and species. The total size of all the reference genomes is 75 GB, whereas the model sequences consume only 1.83 GB of disk space, roughly a 41-fold reduction. We align metagenomic sequencing reads onto the model sequences and identify the two taxonomic ranks stated above. Since the informative regions coming from a reference genome are highly discriminating, the metagenomic reads aligned onto them are also highly discriminating in nature. This is why our experimental evaluations show that MSC outputs very few false positives compared to the other state-of-the-art algorithms. A second question also has to be answered: what are the relative abundances of the taxa at the two taxonomic ranks? As MSC deals with only a small part of the target genomic sequences, it estimates relative abundances only approximately, and the estimate may not reflect the true abundances prevalent in the habitat. It is evident from the experimental evaluations that MSC outperforms the other well-known algorithms with respect to precision, F1 score and memory, and on most datasets with respect to recall.
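The idea of a k-mer that is unique to one reference genome can be illustrated with a toy sketch. This is not the MSC implementation: the function name `unique_kmers`, the two short sequences and the choice k = 4 are hypothetical, reverse complements are ignored, and the step that assembles unique k-mers into model sequences is not shown.

```python
from collections import defaultdict

def unique_kmers(genomes, k):
    """Map each genome name to the k-mers that occur in no other genome."""
    owners = defaultdict(set)                 # k-mer -> names of genomes containing it
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)
    unique = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1:                   # seen in exactly one genome
            unique[next(iter(names))].add(kmer)
    return unique

# Hypothetical toy "genomes"; real references are millions of bases long.
genomes = {
    "genome_A": "ACGTACGGA",
    "genome_B": "ACGTTTGCA",
}
result = unique_kmers(genomes, k=4)
```

In this toy input the 4-mer ACGT occurs in both sequences and is therefore discarded, while every other 4-mer is kept for the genome it came from. A read that matches only k-mers retained for one genome is evidence for that genome alone, which is the intuition behind the low false positive rate described above.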
It is also noteworthy that both Kraken and CLARK dramatically over-predicted the number of taxa present in each community. While false positive taxa were generally detected at low relative abundance, their presence may still affect the biological interpretation of taxonomic profiling results. In addition to the biological relevance of low abundance taxa (Jousset et al., 2017), their prevalence can influence commonly used alpha diversity measures such as richness and Chao1 (Chao, 1984). They will also have a bearing on statistical approaches that attempt to account for zero-inflation in microbiome datasets (Xia and Sun, 2017). Compared to the other tools, MSC made far fewer incorrect predictions at the genus and species levels, suggesting that appropriate selection of discriminatory k-mers has the potential to address the problem of false positive taxa in mWGS datasets.
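The effect of low abundance false positives on alpha diversity can be made concrete with a small sketch. It uses the commonly cited bias-corrected form of the Chao1 estimator, $S_{obs} + \frac{F_1 (F_1 - 1)}{2 (F_2 + 1)}$, where $F_1$ and $F_2$ are the numbers of taxa observed exactly once and exactly twice. The function name `chao1` and the read counts are hypothetical and are not taken from the datasets in this paper.

```python
from collections import Counter

def chao1(counts):
    """Bias-corrected Chao1 richness estimate from per-taxon read counts."""
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                     # observed richness
    freq = Counter(observed)
    f1, f2 = freq[1], freq[2]                 # singletons and doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical profile: four true taxa, then the same profile with three
# spurious taxa that each received a single read.
true_counts = [50, 30, 10, 5]
with_false_positives = true_counts + [1, 1, 1]

richness_true = len(true_counts)
richness_fp = len(with_false_positives)
chao1_true = chao1(true_counts)
chao1_fp = chao1(with_false_positives)
```

In this toy case three single-read false positives raise the observed richness from 4 to 7 and the Chao1 estimate from 4 to 10, even though they account for only 3 of 98 reads.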
MSC performed comparably in terms of speed and had a lower memory footprint than the other k-mer-based approaches. When considering precision with respect to abundance (i.e. APrecision, herein loosely defined as the proportion of identifiable reads that were assigned to the correct taxa), MSC performed well at the genus level, but could not match the other state-of-the-art tools when quantifying taxa at the species level (see Table 2). We attribute this lack of precision at the species level to the fact that we selected only a single representative genome for each bacterial species when building our k-mer database. Variation in sequence composition between closely related genomes within the same clade is likely to substantially impact the accurate selection of clade-specific k-mers. This problem is likely to be pervasive while the number of available sequenced genomes remains small relative to the true number of distinct microbial taxa. However, it has the potential to be substantially reduced by careful curation of the reference genomes prior to the identification of discriminatory k-mers.
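A read-level sketch of APrecision is given below. It assumes that "identifiable reads" means reads that received any classification, with unclassified reads marked as `None`; that interpretation, the function name `aprecision` and the toy labels are assumptions made for illustration only. The result is expressed as a percentage to match Table 2.

```python
def aprecision(true_labels, predicted_labels):
    """Percentage of classified reads whose predicted taxon is the true taxon."""
    classified = [(t, p) for t, p in zip(true_labels, predicted_labels)
                  if p is not None]           # drop unclassified reads
    if not classified:
        return 0.0
    correct = sum(1 for t, p in classified if t == p)
    return 100.0 * correct / len(classified)

# Hypothetical sample: 90 reads from a dominant taxon A, 10 from a rare taxon B.
true_labels = ["A"] * 90 + ["B"] * 10

# Every read of the rare taxon is assigned to a wrong taxon C.
score_misassigned = aprecision(true_labels, ["A"] * 90 + ["C"] * 10)

# Every read of the rare taxon is left unclassified instead.
score_unclassified = aprecision(true_labels, ["A"] * 90 + [None] * 10)
```

The first profile scores 90% and the second 100%, although neither detects taxon B at all. The measure is dominated by how the reads of abundant taxa are handled.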
While precision with respect to abundance [as defined by Bazinet and Cummings (2012), i.e. APrecision] is a useful measure of taxonomic profiling accuracy, the information it captures is limited. In microbiome samples where the relative abundances of different taxa are highly skewed, precision scores may be biased towards approaches that assign reads more accurately to dominant taxa, which constitute the majority of genomic material in an mWGS sample. This obscures the fact that low abundance taxa may be of equal biological importance in microbiome samples (Jousset et al., 2017). We therefore considered the ability of CLARK, Kraken and MSC to reflect the true relative abundance profiles of the taxa in the MOCK-2 and in silico SIM-3 datasets.
Pearson’s correlations of predicted versus ‘true’ relative abundances for the taxa present in each community (MOCK-2 and SIM-3) indicated that MSC performed comparably to the other k-mer approaches on the MOCK-2 dataset, but was less accurate on the larger, more diverse in silico SIM-3 dataset. This is largely because MSC is not able to identify some of the taxa. Although it produces fewer false positives (high precision and F1 score), its recall on SIM-3 is quite low at both taxonomic levels (see Table 1 for detailed statistics). MSC works on the model sequences of genomes, which are built by extracting informative regions. A model sequence is a tiny part of the original genome sequence. As a consequence, MSC can align only a small subset of the reads from a metagenomic sample onto the model sequences, so it may not detect organisms with very low abundance, as in the case of the SIM-3 dataset. The goal of MSC is to detect genera and species with a high level of confidence. If a metagenomic sample contains high abundance organisms, MSC approximates the true abundances (see the ASS scores in Table 1). It is notable, however, that all the approaches underestimated the relative abundance of certain taxa, e.g. the genera Faecalibacterium and Propionibacterium in the MOCK-2 dataset, suggesting that there may be systematic biases in the ability of all the k-mer-based approaches to detect specific taxa based on relative abundance.
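The correlation described above, which is restricted to the taxa truly present in the community, can be sketched as follows. The function name `pearson_on_present_taxa`, the choice to score a missed taxon as zero predicted abundance, and the toy abundances are assumptions for illustration and are not taken from the evaluation pipeline.

```python
import math

def pearson_on_present_taxa(true_abund, pred_abund):
    """Pearson r between true and predicted abundances over truly present taxa.

    Taxa that were predicted but are absent from the truth (false positives)
    are ignored; present taxa that were missed count as abundance 0.0.
    """
    taxa = sorted(true_abund)
    if len(taxa) < 2:
        raise ValueError("need at least two taxa to compute a correlation")
    x = [true_abund[t] for t in taxa]
    y = [pred_abund.get(t, 0.0) for t in taxa]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant abundances")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical relative abundances; "D" is a false positive.
truth = {"A": 0.50, "B": 0.30, "C": 0.20}
clean = {"A": 0.45, "B": 0.30, "C": 0.15}
noisy = {"A": 0.45, "B": 0.30, "C": 0.15, "D": 0.10}

r_clean = pearson_on_present_taxa(truth, clean)
r_noisy = pearson_on_present_taxa(truth, noisy)
```

Both profiles give the same correlation because the false positive taxon D is never looked at, whereas a missed taxon would pull the correlation down. This is consistent with the lower correlations observed on SIM-3, where the recall of MSC is low.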
5 Conclusions
As the number of genomes in reference databases continues to increase, there is a need for computational approaches that can quickly and robustly identify sequence regions that are unique to single genomes. In this paper, we present a novel algorithm for the accurate detection of unique and discriminating regions across vast datasets of individual DNA sequences. One application for this algorithm is the rapid taxonomic profiling of mWGS sequence datasets generated from complex microbial communities. We demonstrate this by using our discriminatory k-mer approach to profile five microbiome samples, and by comparing the results against the state-of-the-art metagenomic profilers. Experimental evaluations indicate that our algorithm has the potential to form the basis of a fast, accurate way of identifying and quantifying microbial sequences in mWGS samples, and further that it may address the issue of false positives that is prevalent in existing approaches.
Funding
This work was supported in part by the National Science Foundation grants [1447711, 1743418 to S.R.].
Conflict of Interest: none declared.
References
- Ames S.K., et al. (2013) Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics, 29, 2253–2260.
- Angly F.E., et al. (2012) Grinder: a versatile amplicon and shotgun sequence simulator. Nucleic Acids Res., 40, e94.
- Bazinet A.L., Cummings M.P. (2012) A comparative evaluation of sequence classification programs. BMC Bioinformatics, 13, 92.
- Benson D.A., et al. (2008) Genbank. Nucleic Acids Res., 36, D25.
- Buhler J., Tompa M. (2002) Finding motifs using random projections. J. Comput. Biol., 9, 225–242.
- Camacho C., et al. (2009) BLAST: architecture and applications. BMC Bioinformatics, 10, 421.
- Chao A. (1984) Nonparametric estimation of the number of classes in a population. Scand. J. Stat., 11, 265–270.
- Diaz P.I., et al. (2012) Using high throughput sequencing to explore the biodiversity in oral bacterial communities. Mol. Oral Microbiol., 27, 182–201.
- Garrido-Cardenas J.A., Manzano-Agugliaro F. (2017) The metagenomics worldwide research. Curr. Genet., 63, 819–829.
- Huson D.H., et al. (2007) MEGAN analysis of metagenomic data. Genome Res., 17, 377–386.
- Jousset A., et al. (2017) Where less may be more: how the rare biosphere pulls ecosystems strings. ISME J., 11, 853.
- Koslicki D., Falush D. (2016) MetaPalette: a k-mer painting approach for metagenomic taxonomic profiling and quantification of novel strain variation. mSystems, 1, e00020-16.
- Koslicki D., et al. (2014) WGSQuikr: fast whole-genome shotgun metagenomic classification. PLoS One, 9, e91784.
- Lindgreen S., et al. (2016) An evaluation of the accuracy and speed of metagenome analysis tools. Sci. Rep., 6, 19233.
- Liu B., et al. (2010) MetaPhyler: taxonomic profiling for metagenomic sequences. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Hong Kong, China.
- Lu J., et al. (2016) Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci., 3, e104.
- Menzel P., et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun., 7, 11257.
- Morgulis A., et al. (2008) Database indexing for production MegaBLAST searches. Bioinformatics, 24, 1757–1764.
- Müller A., et al. (2017) MetaCache: context-aware classification of metagenomic reads using minhashing. Bioinformatics, 33, 3740–3748.
- O’Leary N.A., et al. (2015) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–D745.
- Ounit R., Lonardi S. (2016) Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics, 32, 3823–3825.
- Ounit R., et al. (2015) CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16, 236.
- Peterson J., et al. (2009) The NIH human microbiome project. Genome Res., 19, 2317–2323.
- Schaeffer L., et al. (2017) Pseudoalignment for metagenomic read assignment. Bioinformatics, 33, 2082–2088.
- Sunagawa S., et al. (2013) Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods, 10, 1196–1199.
- Truong D.T., et al. (2015) MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods, 12, 902–903.
- Wood D.E., Salzberg S.L. (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol., 15, R46.
- Xia Y., Sun J. (2017) Hypothesis testing and statistical analysis of microbiome. Genes Dis., 4, 138–148.


