Measuring quality of DNA sequence data via degradation

Alan F Karr; Jason Hauzel; Adam A Porter; Marcel Schaefer

doi:10.1371/journal.pone.0271970

PLoS One. 2022 Aug 3;17(8):e0271970. doi: 10.1371/journal.pone.0271970

Editor: Alvaro Galli

PMCID: PMC9348684 PMID: 35921272

Abstract

We formulate and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes, illustrated by outlier detection. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.

Introduction

As public genome databases proliferate, their immense scientific power is tempered by skepticism about their quality. The skepticism is not merely anecdotal: there are documented instances and implications [1–3]. Although we argue in S1 Appendix that data quality should not be construed as comprising only errors in data, the principal contribution of the paper is a novel paradigm for measuring quality of genome sequences by deliberately introducing errors that reduce quality, a process we term degradation. The errors are single nucleotide polymorphisms (SNPs), insertions and deletions that both occur naturally as mutations and arise in next generation sequencing. Our reasoning is that higher quality data are more fragile: the higher the initial quality, the greater the effect of the same amount of degradation. We present evidence that supports this reasoning and demonstrates the scope and consequences of the phenomenon.

Even though the main contribution of the paper is methodological, applicability to bioinformatics problems is its raison d’être. Our exemplar problem is detection of outliers in genome databases. In a database of 26,953 coronavirus genomes downloaded from the National Center for Biotechnology Information (NCBI), we identify genomes whose degradation behavior is anomalous, and whose quality, therefore, may be suspect. We detect deliberately inserted low quality genomes, but other genomes in the original database are equally problematic. A second potential application is to thwart adversarial attacks on genome databases that, for instance, insert artificial genomes so that sequences of concern such as those generated by the methods in [4] will pass screening tests. Finally, degradation can be used to characterize the quality of synthetic DNA reads that are used to evaluate genome assemblers [5].

Materials and methods

Our method is rooted in total quality paradigms for official statistics, that is, censuses and surveys conducted by national statistics offices (S1 Appendix). In that context, data quality is a longstanding issue, and low quality data are known to be resistant to further errors, such as those introduced by editing, imputation or statistical disclosure limitation (SDL). We also draw on official statistics for techniques to quantify data quality. In experimental settings and because it is intuitive, we measure degradation by distance, appropriately defined, from the starting point. In real databases, this is not possible, so we employ measures of distance from a universal “endpoint” representing the lowest possible quality: pure randomness in the form of maximal entropy, which every genome reaches in the limit of infinite degradation.

Preliminaries

In this paper, a genome G is a character string chosen from the alphabet $B = \{A, C, G, T\}$, and represents one strand of the DNA (or, for viruses, RNA) in an organism. The constituent bases (nucleotides) are A = adenine, C = cytosine, G = guanine and T = thymine. We denote the length of a genome G by |G|; the i-th base in G is G(i); and the bases from location i to location j > i are G(i: j). Given an integer n ≥ 1, the n-gram distribution is the probability distribution P_n(⋅|G) on the set of all sequences of length n chosen from $B$ (there are 4ⁿ of them), constructed by forming a table of all length n contiguous substrings of G and normalizing it so that its entries sum to 1. There are |G| − n + 1 such substrings, starting at positions 1, 2, …, |G| − n + 1, so the normalization amounts to division by |G| − n + 1.
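The construction of the n-gram distribution can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name `ngram_distribution` and the toy sequence are assumptions made for the example, and the sketch assumes the genome contains only the four bases.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def ngram_distribution(genome: str, n: int) -> dict[str, float]:
    """Return P_n(.|G): relative frequencies of all length-n contiguous substrings."""
    if n < 1 or len(genome) < n:
        raise ValueError("need 1 <= n <= len(genome)")
    total = len(genome) - n + 1  # number of windows, |G| - n + 1
    counts = Counter(genome[i:i + n] for i in range(total))
    # List all 4**n possible n-grams, assigning probability 0 to unseen ones.
    # Windows containing other symbols (e.g. N) would be dropped here.
    return {"".join(t): counts.get("".join(t), 0) / total
            for t in product(ALPHABET, repeat=n)}

# Toy example: 8 windows, four distinct triplets, each seen twice.
p3 = ngram_distribution("ACGTACGTAC", 3)
```

For n = 3 the result is the 64-entry triplet distribution; entries for triplets that never occur are zero.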

In this paper, we focus on triplets, which are 64-dimensional summaries of genomes, and which also encode amino acids—the building blocks of proteins. The interpretation of P₃(⋅|G) is that for each choice of b₁, b₂, b₃ from $B$ ,

$$P_{3}(b_{1} b_{2} b_{3} \mid G) = \mathrm{Prob}\{G(k : [k + 2]) = b_{1} b_{2} b_{3}\}, \tag{1}$$

where k is chosen at random from {1, …, |G| − 2}. Triplets provide a generative model of a genome as a second-order Markov process, since P₃ contains the same information as the pair distribution P₂ and the 16 × 4 transition matrix

$$T_{3}(b_{1}, b_{2}, b_{3} \mid G) = \mathrm{Prob}(G(k + 2) = b_{3} \mid G(k) = b_{1}, G(k + 1) = b_{2}) \tag{2}$$

that gives the distribution of each base conditional on its two immediate predecessors.
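To make the Markov interpretation concrete, the following sketch estimates the 16 × 4 transition matrix directly from triplet counts. It is an illustration under stated assumptions, not the authors' implementation: the name `transition_matrix` is hypothetical, and rows for pairs that never occur as the first two bases of a triplet are simply omitted.

```python
from collections import Counter
from itertools import product

def transition_matrix(genome: str) -> dict[str, dict[str, float]]:
    """Estimate T_3: distribution of each base given its two predecessors."""
    counts = Counter(genome[i:i + 3] for i in range(len(genome) - 2))
    matrix = {}
    for b1, b2 in product("ACGT", repeat=2):
        row_total = sum(counts[b1 + b2 + b3] for b3 in "ACGT")
        if row_total == 0:
            continue  # pair never starts a triplet, so the row is undefined
        matrix[b1 + b2] = {b3: counts[b1 + b2 + b3] / row_total for b3 in "ACGT"}
    return matrix

# Toy example: in this sequence, AC is always followed by G.
t3 = transition_matrix("ACGTACGTAC")
```

Each row sums to 1, and multiplying a row by the probability of its leading pair recovers the corresponding triplet probabilities (up to end effects in a finite sequence).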

Distributions of bases, pairs of successive bases, triplets of successive bases and quartets of successive bases differ across genomes, in ways that support a variety of analyses, including not only outlier identification, which we address below, but also Bayesian classification of simulated next generation sequencer reads and detection of contamination [6]. How tuple distributions behave under degradation supports our hypothesis regarding data quality. Higher-level genome structure such as repeats and palindromes is addressed below.

Other cases of interest are less suited to our purposes. Base (n = 1) and pair (n = 2) distributions are too coarse to be useful on their own. Quartets (n = 4) have been studied extensively [7–9]. For the problems we address, they are more cumbersome than triplets without being significantly more informative. Finally, although we do not do so here, triplet distributions can usefully be converted to amino acid distributions [6].

The hypothesis and initial evidence

We hypothesize that because high quality data are more fragile than low quality data, the quality of elements of a DNA sequence database can be measured by degrading them. (As noted, there is precedent in official statistics for this assertion. Some components of the argument appear in S1 Appendix, while the total survey error (TSE) paradigm referred to in S1 Appendix rests in part on this premise.) Secondarily, the more complex the characteristic examined, the greater the impact of degradation. As we see below, the effect of degradation increases as we move from base distributions to pair distributions to triplet distributions to quartet distributions to repeats and palindromes.

We perform the degradation by iteratively applying the Mason_variator software [10]. Briefly, the Mason_variator simulates changes to a genome sequence: SNPs, insertions, deletions, inversions, translocations, and duplications, with specified probabilities for each. Such changes occur naturally as mutations as well as in reads produced by next generation sequencers, such as those manufactured by Illumina. Mason_variator runs from a command line interface with user-specified parameters, input files, and output files. For simplicity, in most of our analyses, only SNPs were simulated. The principal reason is to avoid burdensome computation of Levenshtein distances. Iterative use of Mason_variator means starting with a given genome, running Mason_variator on it, running Mason_variator again on the result, …, up to a specified number of iterations, which is usually 2000. Much the same effect could be achieved by increasing the error probabilities, but at a loss of interpretability, because parametrization by the number of iterations is more intuitive. Evidently this process of iteration is analogous to the real process of evolution.
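The idea of iterated degradation can be illustrated with a pure-Python stand-in. This sketch is not Mason_variator and does not reproduce its parameters or behavior; it simulates SNPs only, and the names `snp_pass` and `degrade` as well as the per-base rate of 1e-4 are assumptions chosen for illustration.

```python
import random

BASES = "ACGT"

def snp_pass(genome: str, rate: float, rng: random.Random) -> str:
    """One pass: each base is replaced by a different base with probability `rate`."""
    out = []
    for base in genome:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def degrade(genome: str, iterations: int, rate: float = 1e-4, seed: int = 0) -> str:
    """Apply `iterations` successive SNP passes, each starting from the previous result."""
    rng = random.Random(seed)
    for _ in range(iterations):
        genome = snp_pass(genome, rate, rng)
    return genome

# Toy example: a short synthetic sequence degraded for 2000 iterations.
original = "ACGT" * 250
degraded = degrade(original, iterations=2000)
```

Because only substitutions are applied, the sequence length never changes, which is what makes Hamming distance usable in the SNP-only setting.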

We have investigated several measures of quality for degraded genomes. The first two of these are employed in string matching: Hamming distance [11] is usable when only SNPs are simulated, while Levenshtein distance [11] allows insertions and deletions that alter the length of the DNA sequence. (The Hamming distance between sequences with different lengths is infinite. Levenshtein distance is significantly more burdensome than Hamming distance computationally, with respect to both time and memory requirements, especially for longer genomes.) As discussed below, the origin point for Hamming and Levenshtein distances is crucial. We also treat distances based on distributions of nucleotides, pairs, triplets and quartets, as well as entropy of triplet distributions of degraded genomes.
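The two string distances can be sketched as follows. This is a textbook illustration rather than the implementation used in the paper; the function names are assumptions, and the Levenshtein routine uses the standard two-row dynamic program, whose quadratic running time shows why it becomes burdensome for long genomes.

```python
import math

def hamming(a: str, b: str) -> float:
    """Number of mismatched positions; infinite if the lengths differ."""
    if len(a) != len(b):
        return math.inf
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a: str, b: str) -> int:
    """Edit distance via a two-row dynamic program: O(|a||b|) time, O(|b|) memory."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]

d_snp = hamming("ACGT", "ACCT")        # one substitution
d_del = levenshtein("ACGT", "AGT")     # one deletion
```

For two genomes of roughly 30,000 bases, the inner loop runs about 9 × 10⁸ times, compared with 30,000 comparisons for Hamming distance.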

Fig 1 visualizes the hypothesis for a single element of the coronavirus database employed in our primary experiments. All forms of errors were allowed. In the figure, the x-axis is the number of iterations of the Mason_variator, and the y-axis is Levenshtein distance between the degraded genome and the original genome. The most salient characteristic of the curve is its concavity: the more degradation already done, the less the effect of each additional iteration.

There are issues with the choice of the origin for Levenshtein distances. In Fig 2, there are 21 initial genomes—the one randomly selected coronavirus genome and that genome after 100, 200, …, 2000 Mason_variator iterations, representing continually decreasing initial data quality. In the top panel, Levenshtein distance is measured from the parent (0-iteration) genome, and the distance at iteration 0 has been subtracted from each curve. In a sense, however, this is “cheating,” because in real databases that potentially contain errors, there are not definitive parent genomes. In the middle panel in Fig 2, Levenshtein distance for each curve is measured from its starting point. The curves there differ little, and certainly not systematically. Fortunately, a work-around exists: the bottom panel in Fig 2 shows that (within reason), any fixed genome can be used as the origin. There, all Levenshtein distances are measured from a second randomly selected genome in the NCBI dataset. The key point is that the curves in the bottom panel vary significantly and systematically with respect to the initial degradation.

Experimental platform

Our experimental platform is a coronavirus database containing 26,953 genomes, which was downloaded from the NCBI in November of 2020. To it, we added eleven “known” problem cases: a single adenovirus genome with length 34,125 BP (downloaded as part of the Art read simulator software package [12]) and low-quality versions of 10 coronavirus genomes selected randomly from the original 26,953, each created by 2000 iterations of the Mason_variator. Our method detects not only these known outliers (the minimal criterion for credibility) but also others.

All FASTA input files and analysis datasets used in this paper are available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/Q6HVFO. The human genome file there contains only the four sequences (chromosomes 1, X, Y; mitochondrial) appearing in Fig 10. The full file is available at https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39, but has been superseded by https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40.

Measuring degradation

General effects

For the adenovirus genome, Fig 3 shows the effect of degradation on the base, pair, triplet and quartet distributions, measured by Hellinger distance [13] from the corresponding distributions for the original genome. The interpretation is that as the number of Mason_variator iterations increases, the base, pair, triplet and quartet distributions all move farther and farther away from those of the parent genome, at slower and slower rates. Moreover, confirming our secondary hypothesis, the higher the dimensionality, the more rapid the movement: quartets are more fragile than triplets, which are more fragile than pairs, which are more fragile than individual bases. The horizontal lines in Fig 3, whose colors match those of the curves, are simulation-derived 1% p-values: the probability that the distribution matches that of the original genome is less than 0.01 when the distance exceeds the line. Interestingly, the numbers of iterations at which the 1% thresholds are passed (i.e., where the curves cross the lines) are nearly the same for pairs, triplets and quartets, but lower than for bases alone.
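Hellinger distance between two discrete distributions can be computed as in the sketch below. The normalization by 1/√2, which bounds the distance by 1, is an assumption; some references omit it, and the source does not say which convention was used. The function name and the toy distributions are likewise illustrative.

```python
import math

def hellinger(p: dict[str, float], q: dict[str, float]) -> float:
    """Hellinger distance between two distributions given as {outcome: probability}."""
    keys = set(p) | set(q)  # outcomes missing from one side count as probability 0
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return math.sqrt(s / 2.0)

# Toy base distributions: a skewed one versus the uniform one.
skewed = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
d = hellinger(skewed, uniform)
```

With this convention the distance is 0 for identical distributions and 1 for distributions with disjoint support.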

Measuring degradation of triplet distributions via entropy

So far, we have confined attention to what degradation moves away from. In many ways, what it moves toward is more useful, because there is a single infinite degradation endpoint representing maximum entropy. The entropy of a probability distribution P on a finite set $S$ is

$$H(P) = -\sum_{s \in S} p(s) \log p(s), \tag{3}$$

with the convention that 0 × −∞ = 0. Entropy is minimized by distributions concentrated at a single point and maximized at the uniform distribution on $S$, with maximizing value $\log(|S|)$, where $|\cdot|$ denotes cardinality. The existence of the universal maximizing value enables us to measure degradation as movement toward maximum entropy, removing the common origin issue discussed previously.
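A short sketch of triplet entropy follows. The use of the natural logarithm is an assumption (the source does not state the base), and the function name is hypothetical. Triplets that never occur are absent from the count table, which implements the convention that zero-probability terms contribute nothing.

```python
import math
from collections import Counter

def triplet_entropy(genome: str) -> float:
    """Entropy (natural log) of the triplet distribution P_3(.|G)."""
    total = len(genome) - 2  # number of length-3 windows
    if total < 1:
        raise ValueError("genome must contain at least 3 bases")
    counts = Counter(genome[i:i + 3] for i in range(total))
    # Only observed triplets appear in `counts`, so 0 * log 0 terms are skipped.
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Upper bound reached by the uniform distribution on the 64 triplets.
MAX_TRIPLET_ENTROPY = math.log(64)

h_const = triplet_entropy("AAAAAA")       # one triplet only: minimum entropy
h_four = triplet_entropy("ACGTACGTAC")    # four equally likely triplets
```

The gap `MAX_TRIPLET_ENTROPY - triplet_entropy(g)` is then a distance to the universal endpoint that requires no reference genome.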

Fig 4 shows the effect of 500 Mason_variator iterations on entropy of triplet distributions—hereafter, just triplet entropy—of the adenovirus genome, starting from the genome itself (black curve) compared to starting from the genome degraded by 250 Mason_variator iterations (blue curve), degraded by 500 Mason_variator iterations (green curve), degraded by 1000 Mason_variator iterations (yellow curve), and degraded by 1500 Mason_variator iterations (red curve). The y-axis is the increase in entropy as a function of Mason_variator iterations, so Fig 4 shows movement toward maximal entropy.

Results

Full dataset

Fig 5 shows, albeit with massive overplotting, the triplet entropy degradation for the entire 26,964-element experimental database. The adenovirus genome, in blue, and the 10 degraded coronavirus genomes, in red, are apparent outliers. But, clearly there are also other outliers, which we pursue momentarily.

Outlier detection

One effective strategy for identifying elements of a genome database with problematic quality is to search for outliers. We concede that the implicit presumption that the bulk of the database is of high quality may be untested. The key question is, “Outlying with respect to what metric?” In this section, the metric is based on hierarchical cluster analysis of triplet entropy increase resulting from Mason_variator degradation, that is, on the shapes of the curves in Fig 5. The clustering is in three dimensions, as opposed to 64 dimensions for triplet distributions and 21 for amino acid distributions. Possibly unexpectedly, the two sets of clusters are very similar.

Outlier detection using triplet distributions

We showed in [6] that clustering of triplet distributions identifies outliers. Briefly, we performed hierarchical clustering, using Euclidean distances and “complete clustering” in R [14], on the 26,964-genome database, using as clustering variables the 64 standardized components of the triplet distributions. By means of standard heuristics that trade off model fit and model complexity, the number of clusters was determined to be 23.

Fig 6 is a plot of two-dimensional multidimensional scaling (MDS) [15, 16] of the 23 cluster centroids. The overwhelming majority of coronavirus genomes—26,433 of the original 26,953, or 98.1%—are in a single cluster. One original coronavirus genome appears by itself, in cluster 12. Cluster 13 contains the adenovirus genome alone, while each of the 10 degraded coronavirus genomes appears in a cluster by itself (clusters 14–23). Thus, the deliberate outliers are not only detected but also distinguished from one another. The dendrogram in Fig 7 shows that the coronavirus genome in cluster 12 and the deliberate outliers are separated from the remaining 26,952 coronavirus genomes at the first split in the clustering process. Clusters 1–10, which are small, are potential outliers as well. See [6] for details and a scientific interpretation.

Fig 7 — Dendrogram, in which clusters are in order from 1 at the bottom to 23 at the top. Colors are the same as in Fig 6.

Outlier detection using entropy degradation

Here we cluster the genomes on the basis of degradation behavior of triplet entropy, that is, the curves shown in Fig 5. As noted already, the results match closely with those using triplet distributions.

The clustering is now in only three dimensions, reached by a path that starts with Fig 5. Every one of the 26,964 curves plotted there is based on entropy following 0, 250, 500, 1000 and 2000 Mason_variator iterations. We fitted a quadratic function to each set of 5 values, reducing the dimension to 3. These quadratic models are uniformly good: the smallest of the coefficients of determination, R², is 0.941 and 99% of them exceed 0.976. Hierarchical clustering was then performed on standardized versions of the three quadratic coefficients, using the “ward.D” option in R, resulting in 34 clusters, with counts ranging from 11 to 1470. Statistically, the clustering is extremely good: the cluster numbers alone explain 98.96509%, 99.02426% and 98.45677% of the variation in the quadratic coefficients.
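The reduction of each five-point entropy curve to three quadratic coefficients can be sketched with an ordinary least-squares fit. This is an illustration only: the authors worked in R, whereas this sketch solves the normal equations in plain Python, and the rescaling of iterations to thousands and the synthetic curve are assumptions made for the example, not data from the paper.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x**2; returns (coeffs, r_squared)."""
    # Normal equations A c = b with A[j][k] = sum x^(j+k) and b[j] = sum y * x^j.
    a = [[sum(x ** (j + k) for x in xs) for k in range(3)] for j in range(3)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(3)]
    m = [row + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(3):  # forward elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= f * m[col][k]
    coeffs = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        s = m[r][3] - sum(m[r][k] * coeffs[k] for k in range(r + 1, 3))
        coeffs[r] = s / m[r][r]
    fitted = [coeffs[0] + coeffs[1] * x + coeffs[2] * x * x for x in xs]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return coeffs, 1.0 - ss_res / ss_tot

# Iteration counts from the text, rescaled to thousands for numerical stability.
xs = [it / 1000 for it in (0, 250, 500, 1000, 2000)]
# Synthetic, exactly quadratic curve (not real entropy values).
ys = [1.0 + 2.0 * x - 0.5 * x * x for x in xs]
coeffs, r2 = fit_quadratic(xs, ys)
```

The three coefficients per genome, once standardized across the database, are the inputs that the clustering step would consume.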

Paralleling Figs 6 and 7, Figs 8 and 9 show the result of applying two-dimensional MDS to the cluster centroids, as well as the associated dendrogram. There is nothing comparable to the massive coronavirus cluster in the triplet distribution analysis. As noted above, the largest cluster in the triplet entropy degradation clustering contains only 1470 genomes. The adenovirus outlier and the ten degraded coronavirus outliers are placed together in cluster 34, and Fig 8 shows that they clearly differ from all of the other genomes. Clusters 1–5 contain candidate outliers. Not only are they relatively small, but also each differs strongly from all of the other clusters. They are suggestively similar to clusters 1–10 for triplet distributions, which we pursue below. In Fig 8, the 11 deliberate outliers in cluster 34 are distant from the majority of the coronavirus genomes, but no more so than the 18 coronavirus genomes in cluster 2.

Fig 9 — Dendrogram, in which clusters are in order from 1 at the bottom to 34 at the top. Colors are the same as in Fig 8.

Fig 8 — Two-dimensional MDS plot of the 34 cluster centroids. Labels are cluster numbers and counts. Greater distance implies higher dissimilarity.

Relationships between the two sets of clusters

Clusters 1–10 in the triplet distribution analysis together contain 519 genomes, a number similar to the number of genomes in clusters 1–5 for the triplet entropy analysis. Moreover, both analyses separate the 11 deliberate outliers from the 26,953 legitimate coronavirus genomes, although differently. The triplet distribution analysis places these outliers in 11 distinct clusters, while the triplet entropy degradation analysis places them all in a single cluster.

Table 1 shows the complete and strongly block-diagonal relationship between the two sets of clusters. For clarity, cells in Table 1 containing values of 0 are highlighted in pink. In detail:

Table 1. Cross-tabulation of the 34 degradation clusters (rows) and the 23 triplet distribution clusters (columns).

Cells containing zeros are shaded in red.

Entropy Degradation Cluster	Triplet Distribution Cluster
	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	Sum
1	12	0	0	0	0	0	137	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	149
2	0	15	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	18
3	0	0	2	141	0	27	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	170
4	0	0	0	0	55	0	0	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	58
5	0	0	0	0	0	0	121	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	121
6	0	0	0	0	0	0	0	0	1	0	354	0	0	0	0	0	0	0	0	0	0	0	0	355
7	0	0	0	0	0	0	0	0	1	0	842	1	0	0	0	0	0	0	0	0	0	0	0	844
8	0	0	0	0	0	0	0	0	1	0	313	0	0	0	0	0	0	0	0	0	0	0	0	314
9	0	0	0	0	0	0	0	0	0	0	1470	0	0	0	0	0	0	0	0	0	0	0	0	1470
10	0	0	0	0	0	0	0	0	0	0	1466	0	0	0	0	0	0	0	0	0	0	0	0	1466
11	0	0	0	0	0	0	0	0	0	0	1005	0	0	0	0	0	0	0	0	0	0	0	0	1005
12	0	0	0	0	0	0	0	0	0	0	923	0	0	0	0	0	0	0	0	0	0	0	0	923
13	0	0	0	0	0	0	0	0	0	0	1215	0	0	0	0	0	0	0	0	0	0	0	0	1215
14	0	0	0	0	0	0	0	0	0	0	849	0	0	0	0	0	0	0	0	0	0	0	0	849
15	0	0	0	0	0	0	0	0	0	0	858	0	0	0	0	0	0	0	0	0	0	0	0	858
16	0	0	0	0	0	0	0	0	0	0	526	0	0	0	0	0	0	0	0	0	0	0	0	526
17	0	0	0	0	0	0	0	0	0	0	1138	0	0	0	0	0	0	0	0	0	0	0	0	1138
18	0	0	0	0	0	0	0	0	0	0	946	0	0	0	0	0	0	0	0	0	0	0	0	946
19	0	0	0	0	0	0	0	0	0	0	1259	0	0	0	0	0	0	0	0	0	0	0	0	1259
20	0	0	0	0	0	0	0	0	0	0	1111	0	0	0	0	0	0	0	0	0	0	0	0	1111
21	0	0	0	0	0	0	0	0	0	0	1393	0	0	0	0	0	0	0	0	0	0	0	0	1393
22	0	0	0	0	0	0	0	0	0	0	776	0	0	0	0	0	0	0	0	0	0	0	0	776
23	0	0	0	0	0	0	0	0	0	0	1027	0	0	0	0	0	0	0	0	0	0	0	0	1027
24	0	0	0	0	0	0	0	0	0	0	1176	0	0	0	0	0	0	0	0	0	0	0	0	1176
25	0	0	0	0	0	0	0	0	0	0	825	0	0	0	0	0	0	0	0	0	0	0	0	825
26	0	0	0	0	0	0	0	0	0	0	1115	0	0	0	0	0	0	0	0	0	0	0	0	1115
27	0	0	0	0	0	0	0	0	0	0	1069	0	0	0	0	0	0	0	0	0	0	0	0	1069
28	0	0	0	0	0	0	0	0	0	0	467	0	0	0	0	0	0	0	0	0	0	0	0	467
29	0	0	0	0	0	0	0	0	0	0	949	0	0	0	0	0	0	0	0	0	0	0	0	949
30	0	0	0	0	0	0	0	0	0	0	1261	0	0	0	0	0	0	0	0	0	0	0	0	1261
31	0	0	0	0	0	0	0	0	0	0	928	0	0	0	0	0	0	0	0	0	0	0	0	928
32	0	0	0	0	0	0	0	0	0	0	589	0	0	0	0	0	0	0	0	0	0	0	0	589
33	0	0	0	0	0	0	0	0	0	0	583	0	0	0	0	0	0	0	0	0	0	0	0	583
34	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	11
Sum	12	15	2	141	55	27	258	3	3	3	26433	1	1	1	1	1	1	1	1	1	1	1	1	26964

Triplet entropy degradation cluster 34 is, as noted above, an amalgamation of triplet distribution clusters 13–23; both contain the 11 deliberate outliers.
The lone coronavirus genome in triplet distribution cluster 12 is absorbed into entropy degradation cluster 7, along with 843 other coronavirus genomes. Perhaps it is not an outlier after all. There is further evidence to this effect in [6]: when clustering is done using amino acid distributions, it also ceases to be an outlier. Specifically, it is merged with cluster 11 to form a massive amino acid cluster of size 26,434.
Triplet entropy degradation clusters 6–33 disaggregate the massive, 26,433-genome triplet distribution cluster 11, modulo four additional genomes.
Triplet entropy degradation clusters 1–5 and triplet distribution clusters 1–8 and 10, each group containing 516 genomes, are identical collectively. These are in the upper-left corner in Table 1. Clearly, the two approaches are detecting the same outliers, with different nuances.
Triplet distribution cluster 9, with 3 genomes, is anomalous. Each genome it contains lies in its own large entropy degradation cluster.

Therefore, much of the scientific interpretation of outliers in [6], which is based on the text string ID variable in the NCBI database, carries over here.

Higher-order DNA structure

Degradation attenuates (relatively) low-dimensional genome characteristics such as tuple distributions (Fig 3). We see here that more complex structure such as repeats and palindromes is affected more strongly. As an exemplar, we use an E. coli genome of length 4,639,675 downloaded from NCBI; the same genome appears again in Fig 10.

Fig 10 — Included are two bacterial genomes—*P. gingivalis* and *E. coli*, three human chromosomes—1, X and Y, and human mitochondrial DNA.

Repeats

Repeats are inherent to non-virus genomes, leading, inter alia, to the discovery of clustered regularly interspaced short palindromic repeats (CRISPR) in E. coli [17]. Table 2 shows the effect of Mason_variator degradation on the numbers of repeats of various lengths in the E. coli genome. The column for length 29, rather than 30, honors the original discovery of CRISPR. By 500 Mason_variator iterations, all repeats of length 25 or longer have been obliterated. Those of length 29 are gone at 300 iterations.
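One simple way to count repeats is sketched below. The source does not define exactly how repeats were counted, so the definition used here is an assumption: a repeat of length L is a distinct length-L substring that occurs at two or more positions. The function name is hypothetical.

```python
from collections import Counter

def count_repeats(genome: str, length: int) -> int:
    """Number of distinct length-`length` substrings occurring at least twice."""
    windows = (genome[i:i + length] for i in range(len(genome) - length + 1))
    counts = Counter(windows)
    return sum(1 for c in counts.values() if c > 1)

# Toy example: ACGT occurs twice, no length-5 substring does.
r4 = count_repeats("ACGTACGT", 4)
r5 = count_repeats("ACGTACGT", 5)
```

Under this definition a single substitution inside a repeated segment destroys every repeat window that covers the changed base, which is consistent with long repeats disappearing faster than short ones. For a genome of several million bases the table of windows is large, so a production version would likely hash or sort windows instead of storing them as strings.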

Table 2. Numbers of repeats as a function of Mason_variator iterations.

The genome is E. coli. Repeats of length 20, 25, 29, 35 and 40 are columns and Mason_variator iterations are rows.

	Repeat Length
Iteration	20	25	29	35	40
0	37285	35588	34667	33429	32530
100	2865	1094	487	153	58
200	640	156	47	11	2
300	152	11	0	0	0
400	36	1	0	0	0
500	17	0	0	0	0
600	14	0	0	0	0
700	6	0	0	0	0
800	8	0	0	0	0
900	10	0	0	0	0
1000	6	0	0	0	0

Palindromes

Genomic palindromes, unlike those in ordinary language, consist of a sequence of bases followed immediately by its reverse complement. So, an example is ATTCGATT∥AATCGAAT. (The ∥ has been inserted for visual clarity.) In what follows, palindromes are parameterized by half-length; the example has half-length 8. Their behavior with respect to Mason_variator degradation differs somewhat from that of repeats.
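The definition can be checked mechanically with a short sketch. The function names are assumptions, and the counting rule (every start position is tested, so nested and overlapping palindromes are all counted) is one possible convention; the source does not specify which was used.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindrome_at(genome: str, start: int, half: int) -> bool:
    """True if genome[start:start+half] is followed immediately by its reverse complement."""
    left = genome[start:start + half]
    right = genome[start + half:start + 2 * half]
    return len(right) == half and right == reverse_complement(left)

def count_palindromes(genome: str, half: int) -> int:
    """Count start positions of palindromes with the given half-length."""
    return sum(is_palindrome_at(genome, i, half)
               for i in range(len(genome) - 2 * half + 1))

example = "ATTCGATT" + "AATCGAAT"  # the example from the text, half-length 8
n8 = count_palindromes(example, 8)
```

A palindrome of half-length 8 necessarily contains one of half-length 6 around the same center, so counts for shorter half-lengths include the cores of longer palindromes under this convention.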

Table 3 shows that long palindromes (half-lengths 12, 14 and 16) are not plentiful to begin with, and, modulo noise discussed momentarily, vanish within 100 Mason_variator iterations. Palindromes with half-length 8 and 10 do decline in number, but do not vanish, even by 2000 iterations. Moreover, their numbers can increase, although not enormously. Palindromes of half-length 6 barely diminish at all, and fluctuate substantially. The noise and the increases suggest that short palindromes differ from the other genome features discussed in this paper, and especially from repeats. They are resistant to the Mason_variator degradation, and can even be produced by it. This is not surprising because very short palindromes are low-dimensional and may, therefore, be too short to be interesting biologically.

Table 3. Numbers of palindromes as a function of Mason_variator iterations.

The genome is E. coli. Half lengths of 6, 8, 10, 12, 14 and 16 are columns and Mason_variator iterations are rows.

	Half-Length
Iterations	6	8	10	12	14	16
0	1128	113	22	11	2	1
100	1149	102	7	1	0	0
200	1163	90	4	1	0	0
300	1209	87	6	1	0	0
400	1137	81	6	1	0	0
500	1114	79	5	0	0	0
600	1141	79	8	0	0	0
700	1126	81	8	0	0	0
800	1130	75	4	1	1	0
900	1140	67	3	1	1	0
1000	1077	62	4	0	0	0
1500	1104	66	2	0	0	0
2000	1072	68	3	0	0	0

Other genomes

Fig 10 demonstrates that degradation behavior is not confined to viruses. It is the analog of Fig 5 for two bacterial genomes—P. gingivalis and E. coli, for three human chromosomes—1, X and Y, and for human mitochondrial DNA. All six genomes were downloaded from NCBI; the human genome is identified as GRCh38.p13.

Discussion

At least three paths for further research are clear. The first is that our paradigm does not yet produce quantified uncertainties about the decisions it may engender. Following this path also requires, as raised in S1 Appendix, more explicit attention to the decisions to be made using the data. To consider decision quality fully leads to the second path—better understanding of the effects of data quality on bioinformatics software pipelines. Now that we can create data of demonstrably and quantifiably lower quality, this path is feasible. Third and more speculatively, there is the relationship between data quality and adversarial attacks on genome databases or software pipelines [4, 18, 19]. Attempts to “pollute” databases with (what may turn out to be) low quality genomes are potentially detectable using the outlier identification strategies presented here. Risk-utility paradigms for statistical disclosure limitation (see S1 Appendix) are relevant, especially the need to distinguish attackers from legitimate users of the data.

Finally, there is clear potential to extend our paradigm to contexts other than genomics, provided that a credible generative model for quality degradation can be constructed. To illustrate, in the official statistics context, one simply needs a mechanism to simulate one or more forms of total survey error.

Conclusions

In this paper, we have introduced and investigated a new, degradation-based approach to data quality for genome sequence databases, and established that it is sound scientifically and statistically. Our principal application is to outlier detection, and our methods are demonstrably effective.

Supporting information

S1 Appendix. Background on data quality.

(PDF)

Data Availability

All FASTA input files and analysis datasets used in this paper are available at the Harvard Dataverse repository (https://doi.org/10.7910/DVN/Q6HVFO).

Funding Statement

All authors were supported by National Institutes of Health grant 5R01AI100947-06, "Algorithms and Software for the Assembly of Metagenomic Data," to the University of Maryland College Park (Mihai Pop, PI), via a subaward to Fraunhofer USA. The sponsor URL is www.nih.gov. The sponsor played no role in the research, decision to publish, or preparation of the manuscript.

References

1. Commichaux S, Shah N, Ghurye J, Stoppel A, Goodheart JA, Luque GG, et al. A critical assessment of gene catalogs for metagenomic analysis. Bioinformatics. 2021. doi: 10.1093/bioinformatics/btab216
2. Langdon WB. Mycoplasma contamination in the 1000 Genomes Project. BioData Mining. 2014;7:3. doi: 10.1186/1756-0381-7-3
3. Steinegger M, Salzberg SL. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biology. 2020;21(1):115. doi: 10.1186/s13059-020-02023-1
4. Farbiash D, Puzis R. Cyberbiosecurity: DNA injection attack in synthetic biology; 2020.
5. Wang Z, Wang Y, Fuhrman JA, Sun F, Zhu S. Assessment of metagenomic assemblers based on hybrid reads of real and simulated metagenomic sequences. Briefings in Bioinformatics. 2020;21(3):777–790. doi: 10.1093/bib/bbz025
6. Karr AF, Hauzel J, Porter AA, Schaefer M. Application of Markov structure of genomes to outlier identification and read classification. BMC Bioinformatics (submitted).
7. Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ. Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Research. 2003;13:145–158. doi: 10.1101/gr.335003
8. Teeling H, Meyerdierks A, Bauer M, Amann R, Glöckner FO. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environmental Microbiology. 2004;6(9):938–947. doi: 10.1111/j.1462-2920.2004.00624.x
9. Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. 2004;5(163). doi: 10.1186/1471-2105-5-163
10. Holtgrewe M. Mason: A Read Simulator for Second Generation Sequencing Data. Technical Report FU Berlin. 2010.
11. Navarro G. A guided tour to approximate string matching. ACM Computing Surveys. 2001;33(1):31–88. doi: 10.1145/375360.375365
12. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28(4):593–594. doi: 10.1093/bioinformatics/btr708
13. Nikulin MS. Hellinger distance. In: Encyclopedia of Mathematics. Berlin: EMS Press; 2001.
14. R Core Team. R: A Language and Environment for Statistical Computing; 2020. Available from: https://www.R-project.org/.
15. Kruskal JB. Nonmetric multidimensional scaling: a numerical method. Psychometrika. 1964;29:115–130. doi: 10.1007/BF02289694
16. Cox TF, Cox MAA. Multidimensional Scaling. London: Chapman and Hall; 2001.
17. Mojica FJM, Díez-Villaseñor C, Soria E, Juez G. Biological significance of a family of regularly spaced repeats in the genomes of archaea, bacteria and mitochondria. Molecular Microbiology. 2000;36:244–246. doi: 10.1046/j.1365-2958.2000.01838.x
18. Biggio B, Roli F. Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition. 2018;84:317–331. doi: 10.1016/j.patcog.2018.07.023
19. Valdivia-Granda WA. Big data and artificial intelligence for biodefense: a genomic-based approach for averting technological surprise. In: Singh SK, Kuhn JH, editors. Defense Against Biological Attacks. New York: Springer–Verlag; 2019. p. 317–327.

PLoS One. doi: 10.1371/journal.pone.0271970.r001

Decision Letter 0

Alvaro Galli

14 Apr 2022

PONE-D-22-04115

Measuring Quality of DNA Sequence Data via Degradation

PLOS ONE

Dear Dr. Karr,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by May 27 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Alvaro Galli

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In the Methods section of your manuscript, please ensure you provide all information needed for a reader to locate the same data sources used in this study. This includes information specifically identifying the genomes used in this study, and information on how to locate the coronavirus database used in this study.

3. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"This research was supported in part by NIH grant 5R01AI100947–06, “Algorithms and Software for the Assembly of Metagenomic Data,” to the University of Maryland College Park (Mihai Pop, PI)"

We note that you have provided funding information. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"All authors were supported by National Institutes of Health grant 5R01AI100947-06, "Algorithms and Software for the Assembly of Metagenomic Data," to the University of Maryland College Park (Mihai Pop, PI), via a subaward to Fraunhofer USA. The sponsor URL is www.nih.gov. The sponsor played no role in the research, decision to publish, or preparation of the manuscript."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

5. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

6. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

7. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear author

This is an interesting topic. However there are some points I would like to address:

1- On page 3, you have referred to the figures as follows:

File = Figure2Top.tif

File = Figure2Middle.tif

Figure2Middle is duplicated, and Figure 2 bottom is not mentioned.

2- is there copyright/permission to use the figures from the Mason variator? please indicate.

3- please include line numbers

4- for the equation, use MathType for display and inline equations, as it will provide the most reliable outcome. If this is not possible, Equation Editor or Microsoft's Insert→Equation function is acceptable.

5- Footnotes are not permitted. If your manuscript contains footnotes, move the information into the main text or the reference list, depending on the content.

6- write full affiliations, including ORCID when applicable, and remove them from footnotes.

7- the title needs to be more clear.

8- I have seen a lot of self referencing:

Karr, A. F. (2013). Discussion of five papers on “Systems and Architectures for High-Quality Statistics Production”. Journal of Official Statistics, 29(1):157–163.

Karr, A. F. and Cox, L. H. (2012). The World’s Simplest Survey Microsimulator (WSSM). Technical Report 181, National Institute of Statistical Sciences, Research Triangle Park, NC. Available online at http://www.niss.org/sites/default/files/tr181.pdf.

Karr, A. F., Hauzel, J., Menon, P., Porter, A. A., and Schaefer, M. (2021a). Specified Certainty Classification. Technical report, Fraunhofer Center Mid-Atlantic, Riverdale, MD. arXiv/2109.06677.

Karr, A. F., Hauzel, J., Porter, A. A., and Schaefer, M. (2021b). Application of Markov Structure of Genomes to Outlier Identification and Read Classification. Technical report, Fraunhofer Center Mid-Atlantic, Riverdale, MD.

Karr, A. F., Kohnen, C. N., Oganian, A., Reiter, J. P., and Sanil, A. P. (2006a). A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician, 60(3):224–232.

Karr, A. F., Sanil, A. P., and Banks, D. L. (2006b). Data quality: A statistical perspective. Statistical Methodology, 3(2):137–173.

Karr, A. F., Sanil, A. P., Sacks, J., and Elmagarmid, E. (2001). Workshop Report: Affiliates Workshop on Data Quality. Technical Report, National Institute of Statistical

Please state in the introduction whether this is continuing research.

9- the data quality section should be moved after the reference heading.

10- use the Vancouver style for references and in-text citations. The references should be numbered with Arabic numerals (1,2,3,4,5,6...) and not listed in alphabetical order.

11- short and long title cannot be the same.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Noora R. Al-Snan, PhD

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Aug 3;17(8):e0271970. doi: 10.1371/journal.pone.0271970.r002

Author response to Decision Letter 0

21 Jun 2022

We thank the reviewers and editors for their comments. Specific responses are as follows:

Formatting: The entire manuscript has been converted to the PLOS ONE LaTeX template, with the correct section titles and the bibliography entries included in the source file DataQualityViaDataDegradation_PLOSOne_Revised.tex. Footnotes have been eliminated. The previous appendix has been converted to supplementary material, namely S1 Appendix. Because of the scope of the changes and the use of LaTeX, no “tracked changes” file can sensibly be created.

Figures: Because figure numbering has changed, a completely new set of TIFF files has been uploaded.

Self-referencing: This has been reduced dramatically.

Funding Statement: Funding information has been removed from the manuscript. No changes are necessary to the Funding Statement.

Data availability: Data files are now publicly available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FQ6HVFO&version=DRAFT.

ORCID ID: I believe that I have entered it successfully. However, in case there are any issues, it is 0000-0002-7253-0129, and I give my permission to add it. ORCID IDs for all authors who have them appear in the manuscript.

Attachment

Submitted filename: RebuttalLetter.pdf (46.5KB, pdf)

PLoS One. doi: 10.1371/journal.pone.0271970.r003

Decision Letter 1

Alvaro Galli

12 Jul 2022

Measuring Quality of DNA Sequence Data via Degradation

PONE-D-22-04115R1

Dear Dr. Karr,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Alvaro Galli

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

**********

6. Review Comments to the Author

Reviewer #1: Dear author,

Thank you for addressing all the comments. However, I have one remaining comment:

1- please clearly state who is the corresponding author, along with the email address.

Once again, thank you for addressing all the comments.

regards

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Noora Al-Snan

**********

PLoS One. doi: 10.1371/journal.pone.0271970.r004

Acceptance letter

Alvaro Galli

25 Jul 2022

PONE-D-22-04115R1

Measuring quality of DNA sequence data via degradation

Dear Dr. Karr:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Alvaro Galli

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Appendix. Background on data quality. (PDF, 60.9KB)

Data Availability Statement

All FASTA input files and analysis datasets used in this paper are available at the Harvard Dataverse repository (https://doi.org/10.7910/DVN/Q6HVFO).
