Efficient alignment-free DNA barcode analytics

Pavel Kuksa; Vladimir Pavlovic

doi:10.1186/1471-2105-10-S14-S9

2009 Nov 10;10(Suppl 14):S9. doi: 10.1186/1471-2105-10-S14-S9

PMCID: PMC2775155 PMID: 19900305

Abstract

Background

In this work we consider barcode DNA analysis problems and address them using alternative, alignment-free methods and representations that model sequences as collections of short sequence fragments (features). The methods use fixed-length representations (spectra) of barcode sequences to measure similarities or dissimilarities between sequences coming from the same or different species. The spectrum-based representation not only allows for accurate and computationally efficient species classification, but also opens the possibility of accurate clustering analysis of putative species barcodes and identification of critical within-barcode loci that distinguish barcodes of different sample groups.

Results

New alignment-free methods provide highly accurate and fast DNA barcode-based identification and classification of species, with substantial improvements in accuracy and speed over state-of-the-art barcode analysis methods. We evaluate our methods on problems of species classification and identification using barcodes, important and relevant analytical tasks in many practical applications (adverse species movement monitoring, sampling surveys for unknown or pathogenic species identification, biodiversity assessment, etc.). On several benchmark barcode datasets, including ACG, Astraptes, Hesperiidae, Fish larvae, and Birds of North America, the proposed alignment-free methods considerably improve prediction accuracy compared to prior results. We also observe significant running time improvements over the state-of-the-art methods.

Conclusion

Our results show that the newly developed alignment-free methods for DNA barcoding can identify specimens efficiently and with high accuracy by examining only a few barcode features, resulting in increased scalability and interpretability of current computational approaches to barcoding.

Background

Identification of living species is one of the pressing tasks in science and technology today, prompted by our need to understand natural biodiversity and its increasing interaction with human society.

However, development of comprehensive species identification strategies is impeded by the enormous biodiversity of life on Earth. Traditional morphological identification of species is difficult, requires the expertise of highly trained taxonomists, and takes up enormous amounts of time. Species identification methods based on molecular diagnostic technologies, including PCR, are limited in the number of species they can identify, lack standardization of technologies, or are susceptible to tissue conditions. DNA barcoding has recently been introduced as a taxonomic tool for characterizing species using fragments of a DNA sequence from standard gene regions, such as the mitochondrial DNA (mtDNA) [1]. These relatively short sequences (about 650 symbols in the case of mtDNA) are used as markers for discerning taxonomical identities of specimens using the process of mtDNA extraction, fragment amplification, sequencing and database lookup [2]. A critical property of this particular region is its monophyletic association: the content of mtDNA is often preserved within a species and shows greater divergence between than within species (sometimes 10× or more when sister species are excluded) [3]. In particular, a region corresponding to the cytochrome c oxidase subunit 1 (cox1) gene is often used as a critical barcoding marker [1] that exhibits such properties.

Barcoding has shown great promise in practice. DNA barcodes can offer increased adaptability, robustness, and predictive value for rapid and accurate identification of species. For instance, barcoding analysis can result in improved placement of previously unknown species or increased resolution of specimens [4], and in highly accurate identification of fish products [5], of substitutes in fish species for human consumption [6], or of marketing of endangered specimens [7]. DNA barcoding has been applied with great initial success to identification across the spectrum of living species, from algae [8], fungi [8], and bacteria [9], to plants [10-12], spiders [13], fish [14], birds [1], and rats [15].

Most current barcoding computational methods leverage established modeling approaches from molecular phylogenetic analysis. Traditional barcoding methods, cf. [1,16], are essentially tree-based phylogenetic approaches where identification decisions are made using an a priori threshold on the tree-induced distances. Choosing an optimal threshold is a challenging task, affected by the variable relationship between the species morphology and the cox1 content similarity. More recently, sophisticated Bayesian and decision theory approaches [17,18] have been proposed that attempt to address this problem in a more systematic manner. Traditional phylogenetic methods are also sensitive to the choice of the sequence similarity metrics and the presence of exogenous variations in the sequence (such as those caused by bacterial cosegregation). Moreover, methods of molecular phylogeny are not inherently aimed at the task of sequence delineation, but rather at the study of relationships at different points in evolutionary history. As a consequence, they can also sometimes exhibit high computational complexity, justified for the complex analysis task but often unnecessary when the goal is, e.g., species identification.

More recently, methods that more directly tackle the problem of barcode-based identification have emerged. Some of these methods, such as [16], use generic but widely available and highly computationally optimized tools for biological sequence comparison (BLAST or PSI-BLAST). Approaches such as [19] focus even more directly on the prediction problem. However, a number of challenges remain to be addressed, including the accuracy of identification [16,18,20,21], as well as the efficiency and scalability of computational methods.

In this study we investigate alignment-free kernel methods for DNA barcoding. Kernel-based classification has demonstrated strong performance in many related tasks of biological sequence analysis, such as protein classification and remote homology detection [22-24]. In the process, a number of kernel types or similarity measures between sequences have been proposed, including kernels derived from probabilistic models [25], k-mer string kernels [22,23], and weighted-decomposition kernels [26]. In this work we focus on k-mer string kernels, and in particular the spectrum/mismatch kernel methods. In our approach, species identification is performed by first transforming variable-length sequences into fixed-length representations (string spectra) and then classifying the resulting spectral representations into one of many established species classes using a state-of-the-art classification algorithm (e.g. nearest neighbor or Support Vector Machine (SVM) classifiers [27,28]). As a result, the alignment-free kernel-based species identification in our study demonstrates high accuracy together with improved speed and classification performance compared to previously employed DNA barcoding identification methods.

Methods

In this section we discuss alignment-free analytics that we propose to use for accurate and efficient multi-class classification and identification of barcode sequences.

The spectrum kernel methods

Varying sequence length as well as the warping processes within sequences (insertions/deletions) typically preclude direct application of efficient computational models and algorithms designed for data in Euclidean spaces. The spectrum kernel methods [29,30] resolve this problem using fixed-length representations of arbitrarily long sequences. These representations or features describe the statistics of short substrings of length k, also known as k-mers, contained in the original sequence. Such representations are both efficient to compute and informative for the tasks of sequence analysis.

Consider a sequence X of length n represented as a string of symbols (x₁, x₂, ..., x_n) from some alphabet Σ, x_i ∈ Σ. In the case of DNA sequences this alphabet consists of the four DNA bases, {A, C, T, G}. Spectrum methods construct a fixed-length feature vector Φ(X) from this arbitrarily long sequence by counting the frequencies of occurrence of all k-mers x_i, x_{i+1}, ..., x_{i+k-1} in X. This feature, the histogram of k-mers in X, is commonly referred to as the sequence spectrum. The spectrum's domain has dimension |Σ|^k, corresponding to the total number of all possible fragments of length k, and as a result induces a fixed-length representation.

This concept is illustrated in Figure 1. A sequence from the Astraptes set is represented as the histogram of frequencies with which 5-long fragments (5-mers) occur in that sequence. In the case of 5-mers there are |Σ|^k = 4⁵ = 1024 such possible fragments, some of which are identified on the horizontal axes of the count plots in Figure 1. For instance, the fragment "CCGCG" occurs three times. Hence, the Astraptes sequence is mapped to a 1024-dimensional fixed-length representation. This representation will be subsequently used to judge similarities and dissimilarities between pairs of sequences coming from the same or different species.
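The spectrum mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `spectrum` and `dense_spectrum` and the toy sequence are assumptions made for this example, and the toy string is not a real barcode.

```python
from collections import Counter
from itertools import product


def spectrum(seq, k):
    """Sparse spectrum: count every contiguous k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def dense_spectrum(seq, k, alphabet="ACGT"):
    """Expand the sparse counts into the full |alphabet|**k feature vector."""
    counts = spectrum(seq, k)
    return [counts["".join(kmer)] for kmer in product(alphabet, repeat=k)]


toy = "ACGTACGTAC"            # hypothetical sequence, for illustration only
sparse = spectrum(toy, 5)     # only the k-mers that actually occur
dense = dense_spectrum(toy, 5)
print(len(dense), sum(dense))  # vector dimension and number of k-mers counted
```

For realistic k the sparse form is the practical one, since the dense vector has |Σ|^k entries while a sequence of length n contributes at most n − k + 1 non-zero counts.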

Figure 1. Illustration of spectrum and mismatch features.

In practice the spectrum mapping will produce sparse feature vectors of counts when either k is long or the sequences are short. On average, and assuming a random sequence generation process, for a sequence of length n each feature will appear n/|Σ|^k times. While the use of larger k is preferred to yield higher specificity of features, it can inadvertently lead to representations or feature spaces that are too high dimensional and produce low similarity even between sequences in the same class (species). As a consequence, it is often necessary to increase the "density" of these features to allow sufficient within-class sensitivity while maintaining the specificity across classes.
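To make the sparsity argument concrete, the expected count n/|Σ|^k can be evaluated for a barcode-sized sequence. The sketch below assumes n = 650 (the approximate barcode length mentioned in the Background) and a uniform random model over the four bases; the function name `expected_count` is introduced only for this illustration.

```python
def expected_count(n, k, alphabet_size=4):
    """Expected occurrences of one fixed k-mer in a uniform random sequence."""
    return n / alphabet_size ** k


# How quickly the per-feature count falls as k grows (n = 650 is an assumption)
for k in (3, 5, 8, 10):
    print(k, expected_count(650, k))
```

Already at k = 5 the expected count per feature drops below one, which is the regime in which two sequences of the same species may share few exact k-mers.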

Increasing density for a fixed k-mer length can alternatively be viewed as the process of inexact sequence matching. The mismatch kernel method [29] accomplishes this task using the following general mismatch(k, m) |Σ|^k-dimensional representation of sequence X:

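$$\Phi^{k,m}(X) = \left( \sum_{\alpha \in X} I_m(\alpha, \gamma) \right)_{\gamma \in \Sigma^k} \qquad (1)$$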

where I_m(α, γ) = 1 if α ∈ N(γ, m) and 0 otherwise, and N(γ, m) denotes the set of contiguous substrings of length k that differ from γ in at most m positions. In other words, in addition to counting all k-mers α present in sequence X, one also adds counts of k-mers that differ in at most m symbols from each α. This process is illustrated in Figure 1, where the 5-mer "GGAAT" is mapped to a set of C(k, m)·(|Σ|^m − 1) + 1 = 5 × 3 + 1 = 16 similar k-mers (C(k, m) being the binomial coefficient), each at most one symbol (m = 1) different from "GGAAT". The induced feature vector Φ^{k, m}(X) has the same dimension as the regular spectrum feature Φ(X), but is "denser". The choice of the maximum number of mismatches (m) allowed between any two particular k-mers typically depends on whether sequences are relatively similar (e.g. closely related families, where m is small) or are far apart (e.g. remote homologs, where large values of m may be needed). The exact spectrum kernel is a particular case of the mismatch kernel and can be obtained from Eq. 1 by setting the number of mismatches m to zero (this results in counting only exact matches between k-mers). Both mismatch and exact spectrum methods measure similarity of sequences by comparing the fixed-length features Φ^{k, m} of those sequences without performing any sequence alignment. As we discuss in the next section, the computational cost of evaluating this similarity is linear in the length of the sequences, compared to the quadratic complexity required by alignment-based methods (e.g. Smith-Waterman) for similarity evaluation. This leads to a potentially important advantage for these methods when applied to large DNA barcode sets, which we demonstrate empirically in our Results.
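The mismatch neighborhood and the resulting mismatch(k, m) features can be illustrated by direct enumeration. The following is a brute-force sketch meant only to make the definition tangible; the names `neighborhood` and `mismatch_features` are assumptions of this example, and the enumeration is far less efficient than the algorithms discussed in the next section.

```python
from collections import Counter
from itertools import combinations, product


def neighborhood(kmer, m, alphabet="ACGT"):
    """All strings of the same length within Hamming distance m of kmer."""
    out = {kmer}
    for positions in combinations(range(len(kmer)), m):
        for letters in product(alphabet, repeat=m):
            chars = list(kmer)
            for pos, letter in zip(positions, letters):
                chars[pos] = letter  # may re-insert the original symbol
            out.add("".join(chars))
    return out


def mismatch_features(seq, k, m, alphabet="ACGT"):
    """Sparse mismatch(k, m) features: each k-mer votes for all its neighbors."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        feats.update(neighborhood(seq[i:i + k], m, alphabet))
    return feats


print(len(neighborhood("GGAAT", 1)))  # 16, matching the count given in the text
```

Setting m = 0 reduces `mismatch_features` to the exact spectrum, since each k-mer's neighborhood then contains only itself.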

Alignment-free algorithms

Both mismatch and spectrum methods typically evaluate similarity K(X, Y) of a pair of sequences by computing the dot-product between their corresponding feature vectors (Eq. 1):

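$$K(X, Y) = \sum_{\gamma \in \Sigma^k} \left( \sum_{\alpha \in X} I_m(\alpha, \gamma) \right) \left( \sum_{\beta \in Y} I_m(\beta, \gamma) \right) \qquad (2)$$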

Direct evaluation of the dot-product above for similarity computation results in a costly O(|Σ|^k n) complexity. To efficiently evaluate the dot-product, we first note that in Eq. 2 the product I_m(α, γ) I_m(β, γ) is non-zero (i.e. contributes to the total similarity/kernel value) only if γ is a neighbor of both α and β. We then write the dot-product (Eq. 2) as follows:

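$$K(X, Y) = \sum_{\alpha \in X} \sum_{\beta \in Y} \sum_{\gamma \in \Sigma^k} I_m(\alpha, \gamma) I_m(\beta, \gamma) = \sum_{\alpha \in X} \sum_{\beta \in Y} I_{k,m}(\alpha, \beta) \qquad (3)$$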

where I_{k, m}(α, β) is the number of k-mers γ shared by the mismatch neighborhoods of α and β. We observe that the number of shared k-mers I_{k, m}(α, β) depends only on the Hamming distance d(α, β) between α and β (i.e., the number of symbols in which the two strings differ) for a fixed alphabet Σ, k-mer length k, and number of mismatches m (i.e. I_{k, m}(α, β) can only take a fixed set of values, with each value corresponding to a particular Hamming distance). Since the maximum Hamming distance that results in a non-zero I_{k, m}(α, β) is 2m, the dot-product in Eq. 3 reduces to computing the number of pairs (α, β), α ∈ X, β ∈ Y, for each of the possible Hamming distances from 0 to 2m:

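$$K(X, Y) = \sum_{i=0}^{2m} M_i I_i \qquad (4)$$

where M_i is the number of k-mer pairs (α, β), α ∈ X, β ∈ Y, at Hamming distance d(α, β) = i, and I_i is the value of I_{k, m}(α, β) for any such pair.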
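The distance-based decomposition can be checked numerically on toy data. The sketch below counts k-mer pairs by Hamming distance, weights each count by the number of neighbors shared at that distance, and compares the result with the direct dot-product of mismatch features. All function names are assumptions of this example, and the pair counting here is deliberately naive (quadratic in sequence length); it is not the efficient procedure of [31].

```python
from collections import Counter
from itertools import combinations, product


def neighborhood(kmer, m, alphabet="ACGT"):
    """All strings of the same length within Hamming distance m of kmer."""
    out = {kmer}
    for positions in combinations(range(len(kmer)), m):
        for letters in product(alphabet, repeat=m):
            chars = list(kmer)
            for pos, letter in zip(positions, letters):
                chars[pos] = letter
            out.add("".join(chars))
    return out


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def shared_neighbors(k, m, d, alphabet="ACGT"):
    """Size of the neighborhood intersection for a k-mer pair at distance d."""
    a = alphabet[0] * k
    b = alphabet[1] * d + alphabet[0] * (k - d)  # representative pair
    return len(neighborhood(a, m, alphabet) & neighborhood(b, m, alphabet))


def kernel_by_distance(x, y, k, m):
    """Sum over Hamming distances of (pair count) x (shared neighbors)."""
    pairs = Counter(hamming(a, b) for a in kmers(x, k) for b in kmers(y, k))
    return sum(pairs[d] * shared_neighbors(k, m, d)
               for d in range(min(2 * m, k) + 1))


def kernel_direct(x, y, k, m):
    """Dot-product of explicitly enumerated mismatch feature vectors."""
    fx, fy = Counter(), Counter()
    for a in kmers(x, k):
        fx.update(neighborhood(a, m))
    for b in kmers(y, k):
        fy.update(neighborhood(b, m))
    return sum(c * fy[g] for g, c in fx.items())


x, y = "ACGTTGCA", "ACGATGCC"  # hypothetical toy sequences
print(kernel_by_distance(x, y, 4, 1) == kernel_direct(x, y, 4, 1))
```

Because the shared-neighbor counts depend only on k, m, the alphabet size and the distance, they can be tabulated once and reused for every pair of sequences.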

As we show in [31], the mismatch/spectrum similarity measure in the form of Eq. 4 can be efficiently computed in O(c_{k, m} n) time, where c_{k, m} is a constant that depends only on the k-mer length k and the maximum number of allowed differences m, but not on the sequence length n. In the case of the exact spectrum method, the complexity is O(kn), i.e. it is linear in both the sequence length n and the k-mer length k. It is also important to note that we typically need to evaluate this similarity for a set of N sequences (e.g., DNA barcode samples). Instead of evaluating similarity for every pair of N sequences, a task proportional to N², in [31] we also show that this can be accomplished in time linear in N. Hence, the overall complexity of evaluating the mismatch(k, m) similarity on a set of N sequences of maximal length n is O(c_{k, m} nN). This results in significant computational savings (speedup) when it is necessary to compute similarity among a large number of sequences, as may be the case with DNA barcodes.
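For the exact spectrum case (m = 0), a simple way to stay close to the O(kn) cost is to hash k-mers and multiply counts only for k-mers present in both sequences. The sketch below uses Python's `Counter` for this; it is an assumption-labeled illustration of the idea, not the specific algorithm of [31], and the function names are hypothetical.

```python
from collections import Counter


def spectrum(seq, k):
    """Sparse k-mer counts; hashing each k-mer costs O(k), so O(kn) overall."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Exact spectrum kernel: dot-product restricted to shared k-mers."""
    cx, cy = spectrum(x, k), spectrum(y, k)
    if len(cx) > len(cy):          # iterate over the smaller table
        cx, cy = cy, cx
    return sum(count * cy[kmer] for kmer, count in cx.items())


print(spectrum_kernel("ACGTA", "CGTAC", 3))  # 2: the 3-mers CGT and GTA are shared
```

No |Σ|^k-dimensional vector is ever materialized, so the cost depends on the sequence lengths rather than on the size of the feature space.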

Prediction models

Given the similarity kernel for any pair of sequences, one can consider several predictive tasks. One such task is the classification of new sequence samples into one of the previously seen classes. In the context of DNA barcoding, this task can be interpreted as either the classification of a barcode sample into one of the known species or the verification task of resolving whether the sample belongs to a particular species or not. We first consider the latter (verification) task and then generalize it to the full classification task. A very general class of predictive models that relies on the similarity metric induced by the kernel K computes the matching score between the query sequence X and the previously seen sequences {X₁,..., X_N} whose class assignments {y₁,..., y_N} are known. The score is formed as

$$f(X) = \sum_{i=1}^{N} w_i K(X, X_i) \qquad (5)$$

The sign of this score then typically indicates whether the query X belongs to a particular class, f(X) > 0, or not. The weights w_i are set in a training procedure prior to making predictions, using a variety of available "learning" algorithms that attempt to optimize the predictive performance of this model. This verification model can also be generalized to the classification setting, where the sample is to be classified into one of M possible classes. In that case one can construct a predictive model for each class, f_m(X) = Σ_i w_{m, i} K(X, X_i), and make the final prediction by finding the class with the maximum score, y* = arg max_m f_m(X).
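The per-class scoring rule and the arg max decision can be sketched as follows. The weights in this example are set by hand purely for illustration (in practice they come from a training procedure), and the sequences, class labels and function names are hypothetical.

```python
from collections import Counter


def spectrum_kernel(x, y, k=3):
    """Exact spectrum kernel used here as the similarity K(X, X_i)."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(c * cy[g] for g, c in cx.items())


def class_scores(query, train_seqs, weights, kernel):
    """weights[m][i] plays the role of w_{m,i}; returns f_m(query) per class."""
    sims = [kernel(query, s) for s in train_seqs]
    return {m: sum(w * s for w, s in zip(ws, sims))
            for m, ws in weights.items()}


def predict(query, train_seqs, weights, kernel):
    """Return the class with the maximum score."""
    scores = class_scores(query, train_seqs, weights, kernel)
    return max(scores, key=scores.get)


train = ["AAAAAA", "CCCCCC"]                         # toy training sequences
weights = {"a": [1.0, -1.0], "c": [-1.0, 1.0]}       # hand-set, not learned
print(predict("AAAACC", train, weights, spectrum_kernel))
```

Any kernel with the same call signature can be passed in, so the mismatch kernel could replace the exact spectrum kernel without changing the decision rule.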

In this work we consider two classes of algorithms that have generally shown state-of-the-art performance on prediction tasks. One is the simple Nearest Neighbor classifier. In that setting w_{m, i} is non-zero, i.e. w_{m, i} = 1, only for the sequence X_i (of class y_i = m) which is "closest", or most similar, to the query sequence X. Nearest neighbor classifiers are simple and have appealing (asymptotic) theoretical properties.
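A nearest neighbor rule on top of the spectrum kernel might look like the sketch below. It assumes that "closest" is measured by the kernel-induced squared distance K(X, X) − 2K(X, Y) + K(Y, Y); this is one common choice and an assumption of this example, not a detail stated in the text. The training pairs and labels are made up.

```python
from collections import Counter


def spectrum_kernel(x, y, k=3):
    """Exact spectrum kernel between two sequences."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(c * cy[g] for g, c in cx.items())


def kernel_distance_sq(x, y, kernel):
    """Squared distance in the feature space induced by the kernel."""
    return kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)


def nearest_neighbor(query, labeled, kernel=spectrum_kernel):
    """Return the label of the training sequence closest to the query."""
    _, label = min(labeled,
                   key=lambda pair: kernel_distance_sq(query, pair[0], kernel))
    return label


labeled = [("AAAAAA", "species_1"), ("CCCCCC", "species_2")]  # toy data
print(nearest_neighbor("AAAACC", labeled))
```

Replacing the distance with the raw similarity K(X, X_i), or with a length-normalized variant, changes only the `key` function.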

The second class of learning algorithms used in this work is the well-known Support Vector Machine [28]. In view of the model above, the SVM selects an optimal subset of training sequences X_i (the so-called support vectors) and sets their weights to maximize the model's predictive accuracy. In our work we use the "one-vs-rest" SVM learning approach described in [32].

Results and discussion

To demonstrate the utility of the alignment-free sequence representation for DNA barcode analytics we primarily focus on the task of species identification. The identification or classification task is one of the relevant analytical problems considered so far in DNA barcoding [16,18,20,21]. In this section we show that the spectrum-based, alignment-free representation possesses several interesting properties, among them the high accuracy of the sample-to-species assignments as well as computational efficiency. Moreover, the spectral representations offer interesting insights into which sequence markers/features within the standard barcode region (e.g. cox1) serve as the most important discriminants among the sets of species. This result has further implications for computational efficiency and may also facilitate further taxonomical studies. We perform the barcode-based species classification experiments using several benchmark barcode datasets from various barcode collecting campaigns for mammals, fish, birds, lepidoptera, etc. In particular, we use seven data sets of DNA barcodes: Astraptes (12 species), Hesperiidae (364 species), Bats of Guyana (96 species), Fish of Australia (211 species), Birds of North America (656 species), ACG (573 species), and Fish larvae (7 species). Astraptes, Hesperiidae, Bats of Guyana, Birds of North America, and Fish of Australia were compiled from the BOLD [33] project. The ACG set was published as a part of [34]. The Fish larvae set appeared in [16]. Table 1 summarizes details of these datasets.

Table 1.

Barcode datasets

Dataset	# species	# barcodes
ACG	573	4267
Hesperiidae	364	2185
Astraptes	12	465
Bats of Guyana	96	840
Birds of North America	656	2589
Fish of Australia	211	754
Fish larvae	7	35

Dataset	PSI-BLAST	spectrum	mismatch
ACG	3.07 ± 0.68	2.49 ± 0.87	3.63 ± 0.65
Hesperiidae	4.62 ± 0.97	3.57 ± 1.08	4.38 ± 1.49
Astraptes	13.82 ± 4.42	1.07 ± 1.81	1.50 ± 1.99
Bats	1.63 ± 1.22	1.63 ± 1.22	1.73 ± 2.01
Birds	7.46 ± 1.90	6.22 ± 1.50	7.29 ± 0.96
Fish Australia	5.62 ± 3.31	5.5 ± 3.27	5.29 ± 3.34
Fish larvae	2.86	2.86	5.71

dataset	k = 3	k = 5	k = 8	k = 10	k = 15
1-NN (nearest neighbor)

ACG	4.24 ± 0.90	3.19 ± 0.93	2.58 ± 0.94	2.49 ± 0.87	2.35 ± 0.83
Hesperiidae	5.32 ± 1.33	4.21 ± 1.18	3.66 ± 1.07	3.57 ± 1.08	3.39 ± 0.93
Astraptes	1.91 ± 1.87	1.90 ± 2.08	1.48 ± 1.75	1.07 ± 1.81	1.07 ± 1.81
Bats of Guyana	1.87 ± 1.36	1.63 ± 1.22	1.63 ± 1.22	1.63 ± 1.22	1.63 ± 1.22
Birds	7.77 ± 1.26	6.68 ± 1.22	6.42 ± 1.34	6.22 ± 1.50	6.13 ± 1.65
Fish Australia	5.47 ± 3.26	5.35 ± 3.36	5.35 ± 3.36	5.50 ± 3.27	5.50 ± 3.27
Fish larvae	8.57	5.71	2.86	2.86	2.86

3-NN

ACG	10.20 ± 1.31	8.98 ± 1.23	8.54 ± 1.11	8.67 ± 1.33	8.63 ± 1.21
Hesperiidae	15.55 ± 1.25	14.30 ± 1.46	14.22 ± 1.50	14.40 ± 1.85	14.30 ± 2.00
Astraptes	2.78 ± 2.49	2.36 ± 2.16	2.36 ± 2.16	2.15 ± 2.29	1.70 ± 1.96
Bats of Guyana	3.77 ± 1.68	4.46 ± 2.06	4.35 ± 2.01	4.46 ± 2.06	4.46 ± 2.06
Birds	20.23 ± 2.64	19.58 ± 2.48	18.88 ± 2.29	18.99 ± 2.22	18.37 ± 2.07
Fish Australia	12.44 ± 5.67	12.32 ± 5.68	12.31 ± 5.38	12.42 ± 5.46	11.93 ± 5.06
Fish larvae	14.29	14.29	11.43	11.43	11.43

5-NN

ACG	13.41 ± 2.00	12.42 ± 1.40	11.49 ± 1.28	11.49 ± 1.25	11.28 ± 1.20
Hesperiidae	19.70 ± 1.71	19.64 ± 2.62	18.63 ± 2.21	18.54 ± 2.17	18.22 ± 2.17
Astraptes	3.43 ± 3.09	3.01 ± 2.76	2.14 ± 1.74	1.70 ± 1.96	1.06 ± 1.12
Bats of Guyana	6.09 ± 3.01	5.85 ± 3.23	5.73 ± 2.77	5.87 ± 2.94	5.61 ± 2.79
Birds	27.32 ± 2.50	26.26 ± 2.17	26.49 ± 2.11	26.42 ± 2.44	26.10 ± 2.36
Fish Australia	19.40 ± 5.91	18.85 ± 6.15	19.28 ± 5.28	18.36 ± 5.04	18.81 ± 4.64
Fish larvae	22.86	22.86	22.86	22.86	22.86

Dataset	Spectrum	Hamming	Kimura	Smith-Waterman
ACG	2.49 ± 0.87	11.44 ± 1.52	5.51 ± 0.86	3.66 ± 0.66
Hesperiidae	3.57 ± 1.08	14.49 ± 2.36	3.81 ± 1.26	5.45 ± 1.20
Astraptes	1.07 ± 1.81	3.61 ± 2.77	1.71 ± 1.96	1.64 ± 1.03
Bats Guyana	1.63 ± 1.22	2.72 ± 1.83	1.63 ± 1.22	1.63 ± 1.22
Birds of North America	6.22 ± 1.50	18.38 ± 2.05	6.02 ± 1.36	8.20 ± 1.53
Fish Australia	5.50 ± 3.27	5.87 ± 4.01	5.35 ± 3.36	5.35 ± 3.36
Fish larvae^†	2.86	11.43	8.57	5.71

Dataset	spectrum (k = 10)	mismatch
ACG	2.32	3.48
Hesperiidae	3.25	3.36
Astraptes	0.86	1.07
Bats	1.63	1.67
Birds	5.99	7.09
Fish Australia	5.35	5.35
Fish larvae	2.86	5.71

	# features selected
Dataset	Full feature set (1048576 feat.)	4096	2048	1024	512	200	100
ACG	2.49 ± 0.87	2.51 ± 0.95	2.79 ± 1.02	3.00 ± 0.96	3.17 ± 0.86	3.52 ± 0.64	4.48 ± 0.86
Hesperiidae	3.57 ± 1.08	3.53 ± 1.12	3.80 ± 1.22	4.17 ± 1.05	4.40 ± 1.15	4.81 ± 1.30	5.64 ± 1.20
Astraptes	1.07 ± 1.81	0.44 ± 0.92	0.44 ± 0.92	0.44 ± 0.92	0.44 ± 0.92	0.64 ± 1.03	1.49 ± 1.75
Bats of Guyana	1.63 ± 1.22	1.63 ± 1.22	1.63 ± 1.22	1.63 ± 1.22	1.63 ± 1.22	1.63 ± 1.22	1.63 ± 1.22
Birds	6.30 ± 1.80	6.45 ± 1.82	6.94 ± 2.08	7.13 ± 2.05	7.41 ± 1.77	9.10 ± 1.64	9.84 ± 1.99
Fish of Australia	5.50 ± 3.27	5.35 ± 3.36	5.35 ± 3.36	6.14 ± 3.50	6.80 ± 3.15	8.32 ± 2.75	9.51 ± 2.40
Fish larvae	2.86	0	0	0	0	0	0

rank	feature	rank	feature	rank	feature	rank	feature
1	TTATTATTAT	26	ATTAATATAC	51	TTTTTATAGT	76	GGAGCCCCTG
2	ATTATTATTA	27	CTTTCCCCCG	52	TTTTTTATAG	77	TATTAATTTC
3	TATTACCCCC	28	GCTTTCCCCC	53	ATTAATTTCA	78	AATATTGCTC
4	CCCCCTCTTT	29	TTAATATACG	54	TAATATACGA	79	TTTTTGATCC
5	AATTTTATTA	30	TATTATAATT	55	GTTTATCCCC	80	TTTTTTGATC
6	AGGAGCTATT	31	AGCTTTCCCC	56	TTTTTTTATA	81	ATATTGCTCA
7	ATTGCCCATC	32	GCTCCTGATA	57	CCTTCTTTAA	82	GAGCCCCTGA
8	TTGCCCATCA	33	TAATTTTATT	58	TGGAGATGAT	83	ATTGCTCATC
9	TTAGGAGCTC	34	CTCCTGATAT	59	TTATTACCCC	84	TGCTCATCAA
10	TCAAATACCT	35	TCCTGATATA	60	TTTATCCCCC	85	TTGCTCATCA
11	ACCTTTATTT	36	CTAATATTGC	61	CCCCTCTTTC	86	ATTATTAATT
12	TAGGAGCTCC	37	TAGGAGCCCC	62	ATTTAGCAAT	87	CTCATCAAGG
13	ATTTTATTAC	38	TGATCAAATA	63	GATTTAGCAA	88	GCTCATCAAG
14	GGAGCTATTA	39	TTAATTTTAT	64	TATTATTATT	89	TACCCCCCTC
15	ATTAGGAGCT	40	TTTGATCAAA	65	AGGAGCTCCT	90	TAGCTTTCCC
16	TATTGCCCAT	41	TTGATCAAAT	66	AATATTGCCC	91	ATATTAGGAG
17	CAAATACCTT	42	CTTTATTTGT	67	ATATTGCCCA	92	TATTAGGAGC
18	TGCCCATCAA	43	CCTTTATTTG	68	ATTATTAATA	93	TACTATTGTT
19	AATACCTTTA	44	TATTAATATA	69	ATTATTACCC	94	ATAGCTTTCC
20	AAATACCTTT	45	ATTAATTTTA	70	TTATTAATAT	95	CAATTATTAA
21	CCCCTGATAT	46	TACCTTTATT	71	AGATGATCAA	96	ACAATTATTA
22	GCCCCTGATA	47	AATTTTTTCT	72	GAGATGATCA	97	TTTAGCAATT
23	CCCATCAAGG	48	TATAGCTTTC	73	GGAGATGATC	98	CCCCCGAATA
24	CCCTGATATA	49	AGCCCCTGAT	74	TTTTATAGTT	99	TCCCCCGAAT
25	GCCCATCAAG	50	ATACCTTTAT	75	TTACCCCCCT	100	TTCCCCCGAA

Dataset	error	# classes with 0% error
ACG	14.12	474/573
Hesperiidae	17.53	288/364
Astraptes	7.81	10/12
Bats of Guyana	5.80	90/96
Birds of North America	16.36	524/656
Fish of Australia	16.21	174/211

Dataset	Error
ACG	10.29
Hesperiidae	10.88
Astraptes	8.47
Bats of Guyana	9.95
Birds of North America	15.54
Fish of Australia	14.92
Fish larvae	15.77

Dataset	#clusters	error, %	Rand index	Jaccard index
ACG	644	2.84	99.85	83.96
Hesperiidae	382	4.44	99.79	86.42
Astraptes	17	1.51	95.59	81.59
Bats	98	0.95	99.21	86.58
Birds	650	5.25	99.90	86.59
Fish Australia	235	2.52	99.94	93.07
Fish larvae	7	2.86	98.66	95.51

rank	feature	rank	feature	rank	feature	rank	feature
1	TATACCAACA	26	CTTATATCAA	51	AATGGAGCTG	76	GGAGGAGACC
2	TTATACCAAC	27	TATATCAACA	52	ATGGAGCTGG	77	ATCTTGCCGG
3	TGAAAATGGA	28	TATCAACACT	53	GAAAATGGAG	78	CATCTTGCCG
4	AACTTCTTTA	29	TCAACACTTA	54	AATATACGAA	79	TCATCTTGCC
5	ACTTCTTTAA	30	TCTTATATCA	55	ATATACGAAT	80	CCGGTATTTC
6	CTTCTTTAAG	31	TTATATCAAC	56	ATTAATATAC	81	CGGTATTTCA
7	CTTTAAGATT	32	ACCCCCATCT	57	TAATATACGA	82	CTTGCCGGTA
8	TCTTTAAGAT	33	ATTACCCCCA	58	TATACGAATT	83	GCCGGTATTT
9	TTCTTTAAGA	34	ATTATTACCC	59	TATTAATATA	84	TCTTGCCGGT
10	TTTAAGATTA	35	TACCCCCATC	60	TTAATATACG	85	TGCCGGTATT
11	GAACTTCTTT	36	TATTACCCCC	61	AAAATGGGGC	86	TTGCCGGTAT
12	GGAACTTCTT	37	TTACCCCCAT	62	AAATGGGGCT	87	ATACGAATTA
13	TGGAACTTCT	38	TTATTACCCC	63	AATGGGGCTG	88	TACGAATTAA
14	AATCTTATAC	39	AATAATAGGT	64	ATGGGGCTGG	89	ACGAATTAAT
15	ACCAACACTT	40	AATAGGTGCC	65	GAAAATGGGG	90	AATATGCGAA
16	ATACCAACAC	41	AGGTGCCCCA	66	GGCTGGTACA	91	ATATGCGAAT
17	ATCTTATACC	42	ATAGGTGCCC	67	GGGCTGGTAC	92	ATGCGAATTA
18	CCAACACTTA	43	GGTGCCCCAG	68	GGGGCTGGTA	93	ATTAATATGC
19	CTTATACCAA	44	GTGCCCCAGA	69	TGAAAATGGG	94	GCGAATTAAT
20	TACCAACACT	45	TAGGTGCCCC	70	TGGGGCTGGT	95	TAATATGCGA
21	TCTTATACCA	46	TTGATTATTA	71	AGGAGACCCA	96	TATGCGAATT
22	AATCTTATAT	47	TGGAGGATTT	72	AGGAGGAGAC	97	TATTAATATG
23	ATATCAACAC	48	TGCCCCAGAT	73	GAGACCCAAT	98	TGCGAATTAA
24	ATCAACACTT	49	AAAATGGAGC	74	GAGGAGACCC	99	TTAATATGCG
25	ATCTTATATC	50	AAATGGAGCT	75	GGAGACCCAA	100	AATTGGAGGA

rank	feature	rank	feature	rank	feature	rank	feature
1	ATAATTGGAG	26	AGCTTCTGAC	51	TTTCCCCGAA	76	GTAACAGCCC
2	TTGTAATAAT	27	CCTGTCCTAG	52	GTCTTATTAC	77	TAACAGCCCA
3	TTTGTAATAA	28	CTGTCCTAGC	53	TCTTATTACT	78	TGTTCTAGCA
4	ATAAGCTTCT	29	GCTTCTGACT	54	AGCCCATGCC	79	ATCACTATAC
5	TAAGCTTCTG	30	GTTCTAGCAG	55	TTATTACTAC	80	TCACTATACT
6	GTCCTAGCAG	31	TGTCCTAGCA	56	CCCATGCCTT	81	AAGCAGGAGT
7	TCCTAGCAGC	32	TTCTAGCAGC	57	GCCCATGCCT	82	CACTATACTA
8	CAACACTTAT	33	TCCTGTCCTA	58	TTATAATTGG	83	ACACTTATTC
9	CCTAGCAGCA	34	ATAGTAGGCA	59	ATTATAATTG	84	CCTAGCAGGC
10	AATATAAGCT	35	GTAACAGCTC	60	TATAATTGGA	85	AAACCTTAAT
11	ATATAAGCTT	36	TAACAGCTCA	61	AACAGCCCAT	86	AACCTTAATA
12	CTTCCTGTCC	37	AACAGCTCAT	62	TACCTATTAT	87	ACCTTAATAC
13	TTCCTGTCCT	38	ATCATAATTG	63	CCTGTTCTAG	88	TATTAGGTGA
14	AAGCTTCTGA	39	ATCAACACTT	64	CTGTTCTAGC	89	CCCCGAATAA
15	TCTTCCTGTT	40	TATCAACACT	65	TATTAATATA	90	CCCGAATAAA
16	ACAGCTCATG	41	TCAACACTTA	66	TTTATTACTA	91	TCTGACTCCT
17	CAGCTCATGC	42	TCATAATTGG	67	ATCAAACACC	92	TTCTGACTCC
18	AACACTTATT	43	ATATCAAACA	68	ATTAGGTGAT	93	TTTTATTACT
19	CTTCTGACTC	44	GAAGCAGGAG	69	CCTTTGTAAT	94	AGGTATCACT
20	TAGTAGGCAC	45	CTTCCTGTTC	70	GCCTTTGTAA	95	CTCAATATCA
21	ACAGCCCATG	46	TAATTGGAGG	71	TATCAAACAC	96	GGTATCACTA
22	CAGCCCATGC	47	TTCCTGTTCT	72	TCCTATTACT	97	GTAATAATTT
23	AATATAAAAC	48	AGTAGGCACT	73	CATAATTGGA	98	GTATCACTAT
24	ATATAAAACC	49	TCCCCGAATA	74	GAGCTATTAA	99	TAATAATTTT
25	TCTAGCAGCA	50	TTCCCCGAAT	75	GGAGCTATTA	100	TCAATATCAA

rank	feature	rank	feature	rank	feature	rank	feature
1	ATCACAATAC	26	ACTTCATCAC	51	AAACAACATA	76	GATTCTTTGG
2	ATAATCGGAG	27	AACCTAGCCC	52	ACAACATAAG	77	TATACCAACA
3	ACCAACACCT	28	AACTTCATCA	53	ATTCTTCGAC	78	TAGCATTCCC
4	TACCAACACC	29	ACCTAGCCCA	54	AACTGACTAG	79	ATAGCATTCC
5	AAGCTTCTGA	30	ATCAACATAA	55	ACTGACTAGT	80	CGGAGCCTCA
6	AACATAAGCT	31	ATCAACTTCA	56	AACAACATAA	81	AGACGACCAA
7	ACATAAGCTT	32	TCAACTTCAT	57	AGCAATCAAC	82	CATGCCTTCG
8	CATAAGCTTC	33	ATACCAAACC	58	CAACTTCATC	83	GTAGACCTAG
9	TCCTACTCCT	34	CTAATCACTG	59	GGAGGAGACC	84	TAGACCTAGC
10	CCCCTATTCG	35	CTCACAATAC	60	CTCTCACAAT	85	ACCCCCCTAT
11	TTCTTCGACC	36	ATAAGCTTCT	61	ACAATACCAA	86	CCCCCCTATT
12	CCCTATTCGT	37	TAAGCTTCTG	62	ACGCCGGAGC	87	AACCCCCCTA
13	TCTTCGACCC	38	TCGTAATAAT	63	CACGCCGGAG	88	ATGCCTTCGT
14	GCCTTCGTAA	39	CAACATAAGC	64	GTCCTAATCA	89	TCATCACAAC
15	CCTTCGTAAT	40	GCAACCTAGC	65	TCCTAATCAC	90	TTCATCACAA
16	CTTCGTAATA	41	GGCAACCTAG	66	TGATTCTTTG	91	AAACTGACTA
17	AGCTTCTGAC	42	TAATCACTGC	67	TCCTCCTCCT	92	ATCTTCTCCC
18	GCTTCTGACT	43	AATACCAAAC	68	GAGGAGACCC	93	TCTTCTCCCT
19	GAGCCTCAGT	44	CAATACCAAA	69	ACATAGCATT	94	AAACCCCCCT
20	GGAGCCTCAG	45	ATAATTGGAG	70	GACATAGCAT	95	CAAACCCCCC
21	CAACATAAAA	46	TTCGTAATAA	71	CAGTAGACCT	96	AACCTAAACA
22	TCACAATACC	47	ACCAAACCCC	72	TCAGTAGACC	97	ACCTAAACAC
23	TCCTCCTACT	48	TACCAAACCC	73	TTCTGATTCT	98	CGTAATAATC
24	CACAATACCA	49	CATAGCATTC	74	TCTGATTCTT	99	ATCGGAGGAT
25	TCAACATAAA	50	TCTCACAATA	75	CATAAAACCC	100	TAATCTTCTT

rank	feature	rank	feature	rank	feature	rank	feature
1	AACATAAAAC	26	GACTTCTTCC	51	ACAGTCTACC	76	ATCTTCTCCC
2	ACATAAAACC	27	TAATAATTGG	52	CAGTCTACCC	77	TCTTCTCCCT
3	ATTATTAACA	28	AACATAAGCT	53	TAAATAATAT	78	ACTATTATTA
4	TTATTAACAT	29	ACATAAGCTT	54	AGCTTCTGAC	79	TATTATTAAC
5	TAACATAAAA	30	AATATCAAAC	55	CATAAAACCC	80	ATAGTAATAC
6	ATTAACATAA	31	CAATATCAAA	56	CCCCGAATAA	81	AGGAGACCCA
7	TTAACATAAA	32	TTATGATTGG	57	CCCGAATAAA	82	CTATTATTAA
8	TCCTTCTCCT	33	ACCAACACCT	58	GCTTCTGACT	83	TAATATAAAA
9	ATTATTAATA	34	TACCAACACC	59	TAGTAATACC	84	TTAATATAAA
10	TTATTAATAT	35	TTATTACAAC	60	TCATGATTGG	85	TTGACCCTGC
11	TATTAACATA	36	GAGACCCAAT	61	CCTCGAATAA	86	AATAAACAAC
12	GAACAGTTTA	37	GGAGACCCAA	62	CTCGAATAAA	87	AATTTTATTA
13	TGAACAGTTT	38	TTTATTACAA	63	CTTCTTCTCC	88	ATTACAATGC
14	GAGGAGACCC	39	ATACCAATTA	64	AATACCAAAC	89	ATTTTATTAC
15	GGAGGAGACC	40	CTTTACCAAC	65	CAATACCAAA	90	TTACAATGCT
16	TGACTTCTTC	41	GGAGGAGGAG	66	GAGGAGGAGA	91	TTGGAAACTG
17	ATCAAACACC	42	TACCAATTAT	67	ATGAGCTTCT	92	TTTGACCCTG
18	TATCAAACAC	43	TTTACCAACA	68	ATGATTGGAG	93	TTTGGAAACT
19	TTCTTCTCCT	44	AACAGTCTAC	69	TGAGCTTCTG	94	AATAAATAAT
20	AATATAAAAC	45	ACAGACCGAA	70	TTATGATCGG	95	ATTAATATAA
21	ATATAAAACC	46	CAGACCGAAA	71	ATAAATAATA	96	GAGGGGACCC
22	CGAATAAATA	47	GAACAGTCTA	72	TTACCAACAC	97	GGAGGGGACC
23	GAATAAATAA	48	TGAACAGTCT	73	TTTCCTCAAT	98	AATATGAGCT
24	TCTTTGACCC	49	ATAATTGGTG	74	ATATCAAACA	99	ACCCTGCAGG
25	TTCTTTGACC	50	TCCTTCTTCT	75	ATAATTGGAG	100	AGACCGAAAC
