BOLD and GenBank revisited – Do identification errors arise in the lab or in the sequence libraries?

Mikko Pentinsaari; Sujeevan Ratnasingham; Scott E Miller; Paul D N Hebert

PLoS One. 2020 Apr 16;15(4):e0231814. doi: 10.1371/journal.pone.0231814

Editor: Matjaž Kuntner³

PMCID: PMC7162515 PMID: 32298363

Abstract

Applications of biological knowledge, such as forensics, often require the determination of biological materials to a species level. As such, DNA-based approaches to identification, particularly DNA barcoding, are attracting increased interest. The capacity of DNA barcodes to assign newly encountered specimens to a species relies upon access to informatics platforms, such as BOLD and GenBank, which host libraries of reference sequences and support the comparison of new sequences to them. As parameterization of these libraries expands, DNA barcoding has the potential to make valuable contributions in diverse applied contexts. However, a recent publication called for caution after finding that both platforms performed poorly in identifying specimens of 17 common insect species. This study follows up on this concern by asking if the misidentifications reflected problems in the reference libraries or in the query sequences used to test them. Because this reanalysis revealed that missteps in acquiring and analyzing the query sequences were responsible for most misidentifications, a workflow is described to minimize such errors in future investigations. The present study also revealed the limitations imposed by the lack of a polished species-level taxonomy for many groups. In such cases, applications can be strengthened by mapping the geographic distributions of sequence-based species proxies rather than waiting for the maturation of formal taxonomic systems based on morphology.

Introduction

Species identifications play an important role in forensic analyses in contexts ranging from the interception of trade in CITES-listed species [1] to ascertaining the post mortem interval [2]. There are also expanding opportunities to track the movement of objects and organisms linked to their associated DNA. Although species identifications can play an important role in these contexts, the lack of taxonomic specialists often impedes analysis, a factor which has provoked interest in DNA-based approaches to species identification. Past studies have established that DNA barcodes can often assign specimens to their source species, but have also revealed differences in success among the kingdoms of eukaryotes. For example, the three barcode regions (rbcL, matK, ITS2) for plants deliver lower success than the single gene region (cytochrome c oxidase I, COI) used for animals [3]. Because COI generally has high accuracy in species assignment [4–9], the conclusions from a recent study by Meiklejohn et al. [10] were surprising. They assessed the capacity of reference sequences in BOLD, the Barcode of Life Data System [11], and GenBank [12] to generate species-level identifications. Their analysis revealed that both platforms performed similarly in identifying plants and macrofungi, but fared poorly in identifying insect species with BOLD showing lower success than GenBank (35% vs. 53%). They noted that their observed identification success did not conform to earlier results from DNA barcoding studies on insects, but did not carefully examine the possible causes of this unexpectedly low success. By evaluating the factors underpinning the incorrect assignments, the present study revealed that errors in sequence acquisition and interpretation accounted for most, if not all, of the misidentifications. 
To avoid similar issues in future studies, there is a need to adopt more rigorous procedures for data acquisition and analysis, and to reduce the current reliance on immature taxonomic systems.

Material and methods

Meiklejohn et al. [10] analyzed 17 insects representing 12 orders: Coleoptera (1), Dermaptera (1), Diptera (5), Ephemeroptera (1), Hymenoptera (1), Lepidoptera (2), Mecoptera (1), Neuroptera (1), Odonata (1), Orthoptera (1), Phthiraptera (1), and Siphonaptera (1). The specimens were obtained from the Smithsonian’s National Museum of Natural History (USNM); most were collected 20+ years ago (e.g. Pediculus humanus, collected in 1955). Following DNA extraction, the barcode region of COI was PCR amplified and then Sanger sequenced. Reflecting the DNA degradation typical of museum specimens, the sequences recovered were often incomplete (e.g. 254 bp for Hexagenia limbata). The resultant sequences were submitted to the ID engine on BOLD [11] and to the BLAST function on GenBank [12]. This analysis delivered correct species identifications for six specimens (35%) on BOLD and for nine (53%) on GenBank.

The present study was initiated by downloading the 17 sequences from GenBank. These sequences were compiled into a dataset on BOLD (dx.doi.org/10.5883/DS-FBI2019), and specimen metadata unreported by Meiklejohn et al. were added to the sequences based on the labels of the voucher specimens deposited at the USNM, along with images of most specimens. The sequences were resubmitted to the BOLD ID engine and to GenBank BLAST with self matches excluded. Because some of the resultant identifications deviated from those reported in [10], the factors responsible for this discordance were examined.
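The download step can be sketched in code. This is a minimal sketch, assuming the 17 query accessions form the contiguous range MK905393 to MK905409 listed in Table 1, and assuming retrieval through the NCBI E-utilities `efetch` endpoint, which is not necessarily the route used in this study. The helper names `query_accessions` and `build_efetch_url` are hypothetical, and the code only builds the request URL without contacting the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities endpoint for fetching records (assumed retrieval route).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def query_accessions(first=905393, last=905409, prefix="MK"):
    """Return the accession range MK905393..MK905409 seen in Table 1."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def build_efetch_url(accessions):
    """Build a URL requesting the given nucleotide records as FASTA text."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


accessions = query_accessions()
url = build_efetch_url(accessions)
```

The resulting URL could then be passed to any HTTP client; keeping URL construction separate from the network call makes the accession list easy to verify before anything is downloaded.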

Results and discussion

ID results from BOLD and GenBank

Table 1 compares the ID results reported in [10] for the 17 specimens with those obtained in the present study. The IDs from BLAST matched those reported by [10], as did ten of the IDs from BOLD. The other seven IDs from BOLD corresponded to those from GenBank, but not to the results in [10]. There was a simple explanation for this discordance: Meiklejohn et al. [10] had submitted the reverse complement rather than the coding sequence to the ID engine on BOLD, an approach which generated distant matches. Once this misstep was avoided, the number of “correct” identifications generated by BOLD and GenBank was similar (12/17 at the genus level, 9/17 at the species level). To prevent such errors from recurring, the BOLD user interface has been updated to instruct users to enter COI barcode sequences in the ID engine in the forward orientation, and to warn users if the resulting identifications are suspected to result from reverse or reverse complement sequences.
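The orientation problem described above can be checked before a sequence is submitted. The sketch below is a heuristic of our own for illustration, not the check implemented on BOLD: it assumes that a correctly oriented COI fragment has at least one reading frame free of stop codons under the invertebrate mitochondrial code (where TAA and TAG are stops), whereas the reverse complement usually does not. The function names and the toy sequence are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Stop codons in the invertebrate mitochondrial genetic code.
STOP_CODONS = {"TAA", "TAG"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def min_stop_codons(seq):
    """Return the lowest stop-codon count over the three reading frames."""
    seq = seq.upper()
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(codon in STOP_CODONS for codon in codons))
    return min(counts)


def guess_orientation(seq):
    """Guess whether seq is in coding orientation (heuristic only)."""
    forward = min_stop_codons(seq)
    reverse = min_stop_codons(reverse_complement(seq))
    if forward < reverse:
        return "forward"
    if reverse < forward:
        return "reverse_complement"
    return "ambiguous"


# Toy 15-nt example: open in frame 0, but its reverse complement
# contains a stop codon in every frame.
toy = "TTAGTTACCGGTTAC"
```

A result of "ambiguous" is expected for very short or low-complexity sequences, so the heuristic is best treated as a warning rather than a verdict.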

Table 1. Comparison of query results (top matches) for 17 insect species between Meiklejohn et al. [10] and the present study.

							Meiklejohn et al.			Present study
Query sequence	Order	Family	Genus	Species	Sequence length	Database	Top match genus	Top match species	Identity	Top match order	Top match family	Top match genus	Top match species	Identity
MK905407	Coleoptera	Scarabaeidae	Phanaeus	vindex	581	GenBank	Phanaeus	sp.	0.99821	Coleoptera	Scarabaeidae	Phanaeus	sp.	0.9982
MK905407	Coleoptera	Scarabaeidae	Phanaeus	vindex	581	BOLD_all	Phanaeus	sp.	0.9982	Coleoptera	Scarabaeidae	Phanaeus	sp.	0.9982
MK905402	Dermaptera	Forficulidae	Forficula	auricularia	429	GenBank	Forficula	auricularia	0.99299	Dermaptera	Forficulidae	Forficula	aff. auricularia A	0.9930
MK905402	Dermaptera	Forficulidae	Forficula	auricularia	429	BOLD_all	Dyscheralcis	retroflexa	0.5	Dermaptera	Forficulidae	Forficula	auricularia-A	0.9976
MK905402 / RC	Dermaptera	Forficulidae	Forficula	auricularia	429	BOLD_all				Lepidoptera	Geometridae	Dyscheralcis	retroflexa	0.5
MK905396	Diptera	Calliphoridae	Chrysomya	rufifacies	592	GenBank	Chrysomya	rufifacies	1	Diptera	Calliphoridae	Chrysomya	rufifacies	1
MK905396	Diptera	Calliphoridae	Chrysomya	rufifacies	592	BOLD_all	Chrysomya	rufifacies	1	Diptera	Calliphoridae	Chrysomya	rufifacies	1
MK905397	Diptera	Calliphoridae	Calliphora	vicina	553	GenBank	Calliphora	vicina	0.98373	Diptera	Calliphoridae	Calliphora	vicina	0.9837
MK905397	Diptera	Calliphoridae	Calliphora	vicina	553	BOLD_all	Calliphora	vicina	1	Diptera	Calliphoridae	Calliphora	vicina	1
MK905393	Diptera	Culicidae	Aedes	aegypti	658	GenBank	Aedes	aegypti	0.99848	Diptera	Culicidae	Aedes	aegypti	0.9985
MK905393	Diptera	Culicidae	Aedes	aegypti	658	BOLD_all	Aedes	aegypti	1	Diptera	Culicidae	Aedes	aegypti	1
MK905403	Diptera	Glossinidae	Glossina	palpalis	611	GenBank	Glossina	brevipalpis	0.971	Diptera	Glossinidae	Glossina	brevipalpis	0.9710
MK905403	Diptera	Glossinidae	Glossina	palpalis	611	BOLD_all	Glossina	brevipalpis	0.9694	Diptera	Glossinidae	Glossina	brevipalpis	0.9694
MK905404	Diptera	Muscidae	Musca	domestica	645	GenBank	Cryptopygus	tricuspis	0.996	Entomobryomorpha	Isotomidae	Cryptopygus	tricuspis	0.9960
MK905404	Diptera	Muscidae	Musca	domestica	645	BOLD_all	Amphiura	incana	0.5571	Entomobryomorpha	Isotomidae	Folsomia	cf. diplopthalma	1
MK905404 / RC	Diptera	Muscidae	Musca	domestica	645	BOLD_all				Ophiurida	Amphiuridae	Amphiura	incana	0.5586
MK905400	Ephemeroptera	Ephemeridae	Hexagenia	limbata	254	GenBank	Glossina	brevipalpis	0.9681	Diptera	Glossinidae	Glossina	brevipalpis	0.9681
MK905400	Ephemeroptera	Ephemeridae	Hexagenia	limbata	254	BOLD_all	Glossina	brevipalpis	0.9675	Diptera	Glossinidae	Glossina	brevipalpis	0.9675
MK905409	Hymenoptera	Vespidae	Vespula	squamosa	560	GenBank	Vespula	squamosa	0.99643	Hymenoptera	Vespidae	Vespula	squamosa	0.9964
MK905409	Hymenoptera	Vespidae	Vespula	squamosa	560	BOLD_all	Vespula	squamosa	1	Hymenoptera	Vespidae	Vespula	squamosa	1
MK905395	Lepidoptera	Saturniidae	Callosamia	promethea	354	GenBank	Callosamia	promethea	0.9969	Lepidoptera	Saturniidae	Callosamia	promethea	0.9969
MK905395	Lepidoptera	Saturniidae	Callosamia	promethea	354	BOLD_all	Callosamia	promethea	0.9938	Lepidoptera	Saturniidae	Callosamia	promethea	0.9940
MK905401	Lepidoptera	Nymphalidae	Danaus	plexippus	623	GenBank	Danaus	plexippus	1	Lepidoptera	Nymphalidae	Danaus	plexippus	1
MK905401	Lepidoptera	Nymphalidae	Danaus	plexippus	623	BOLD_all	Danaus	plexippus	1	Lepidoptera	Nymphalidae	Danaus	plexippus	1
MK905405	Mecoptera	Meropeidae	Merope	tuber	655	GenBank	Merope	tuber	0.92006	Mecoptera	Meropeidae	Merope	tuber	0.9201
MK905405	Mecoptera	Meropeidae	Merope	tuber	655	BOLD_all	Craesus	alniastri	0.5	Mecoptera	Meropeidae	Merope	tuber	0.9430
MK905405 / RC	Mecoptera	Meropeidae	Merope	tuber	655	BOLD_all				Hymenoptera	Tenthredinidae	Craesus	alniastri	0.5
MK905408	Neuroptera	Ascalaphidae	Ululodes	quadripunctatus	635	GenBank	Ululodes	quadrimaculatus	1	Neuroptera	Ascalaphidae	Ululodes	quadripunctatus	1
MK905408	Neuroptera	Ascalaphidae	Ululodes	quadripunctatus	635	BOLD_all	Xanthopimpla	sp.	0.5152	Neuroptera	Ascalaphidae	Ululodes	quadripunctatus	1
MK905408 / RC	Neuroptera	Ascalaphidae	Ululodes	quadripunctatus	635	BOLD_all				Hymenoptera	Ichneumonidae	Xanthopimpla	sp.	0.5152
MK905399	Odonata	Gomphidae	Gomphus	exilis	612	GenBank	Cecidomyiidae	sp.	0.9934	Diptera	Cecidomyiidae	Cecidomyiidae	sp.	0.9935
MK905399	Odonata	Gomphidae	Gomphus	exilis	612	BOLD_all	Dolichophis	schmidti	0.5283	Diptera	Cecidomyiidae	Cecidomyiidae	sp.	0.9967
MK905399 / RC	Odonata	Gomphidae	Gomphus	exilis	612	BOLD_all				Squamata	Colubridae	Dolichophis	schmidti	0.5283
MK905398	Orthoptera	Gryllidae	Gryllus	assimilis	278	GenBank	Gryllus	pennsylvanicus	0.9964	Orthoptera	Gryllidae	Gryllus	pennsylvanicus	0.9964
MK905398	Orthoptera	Gryllidae	Gryllus	assimilis	278	BOLD_all	Gryllus	pennsylvanicus	0.9964	Orthoptera	Gryllidae	Gryllus	pennsylvanicus	0.9964
MK905394	Siphonaptera	Pulicidae	Ctenocephalides	felis	643	GenBank	Pulex	irritans	0.9642	Siphonaptera	Pulicidae	Pulex	irritans	0.9642
MK905394	Siphonaptera	Pulicidae	Ctenocephalides	felis	643	BOLD_all	Natrix	tessellata	0.6176	Siphonaptera	Pulicidae	Pulex	irritans	0.9642
MK905394 / RC	Siphonaptera	Pulicidae	Ctenocephalides	felis	643	BOLD_all				Squamata	Colubridae	Natrix	tessellata	0.6176
MK905406	Phthiraptera	Pediculidae	Pediculus	humanus capitis	384	GenBank	Stylops	sp.	1	Strepsiptera	Stylopidae	Stylops	sp.	1
MK905406	Phthiraptera	Pediculidae	Pediculus	humanus capitis	384	BOLD_all	Akapala	rudis	0.596	Strepsiptera	Stylopidae	Stylops	sp.	0.9013
MK905406 / RC	Phthiraptera	Pediculidae	Pediculus	humanus capitis	384	BOLD_all				Hymenoptera	Eucharitidae	Akapala	rudis	0.5960

RC = reverse complement. Blue and red shading indicate correct or inaccurate identification, respectively, at each taxonomic rank.

Factors responsible for four ‘errors’ in generic assignment

Both BOLD and GenBank delivered generic identifications deemed incorrect for four specimens. In each case, the query sequence showed close similarity (95–100% in three cases, 90% in one) to taxa belonging to a different order than that analyzed (Table 1; S1 File and S2 File). These discordances could reflect errors either in the reference libraries or in the query sequences. The cause of one misidentification was certain: it arose through internal cross-contamination, as the sequence for Hexagenia limbata was a truncated version of that for Glossina palpalis (identical at all 250 bp that overlapped). The other three mismatches involved taxa (springtail, gall midge, strepsipteran) unrepresented among the 17 tested species, ruling out contamination between the specimens included in the analyses. Moreover, because these taxa differ strikingly in morphology from the test taxa (house fly, dragonfly, louse), misidentification of the specimens can be excluded as a cause. This leaves two possible explanations: contamination in the reference sequence libraries or in the query sequences. Because each query sequence was embedded within many independently generated reference sequences from another order, these cases of misidentification clearly arose from contamination of the query sequences. The contamination of a house fly by collembolan DNA and a dragonfly by gall midge DNA is easily explained by small non-target specimens or their fragments being tangled in the legs of the much larger target specimens. This occurs commonly when specimens are sorted from bulk samples and mounted individually for storage in a natural history collection. In contrast, contamination of Pediculus humanus by strepsipteran DNA seems highly unlikely at first glance. However, examination of the loan records at the USNM revealed that the loan of material to Meiklejohn et al. contained multiple specimens which were not included in their analyses or mentioned in their article, and among them, two were Strepsiptera (USNM ENT 01248370 and 01248357). Cross-contamination is a well-recognized risk when working with museum specimens so it is standard practice to check for its occurrence [13,14]. While Meiklejohn et al. [10] exercised some precautions in their laboratory protocols such as the incorporation of negative controls in PCR, there was no evidence that they considered the possibility that some of their DNA sequences derived from non-target taxa. After excluding these four cases, the number of correct identifications for BOLD and GenBank (12/13 for genus, 9/13 for species) was identical.
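Internal cross-contamination of the kind seen between the Hexagenia and Glossina sequences can be screened for by comparing every pair of query sequences within a study. The following is a minimal sketch, assuming ungapped sequences in the same orientation; the function names, the 0.99 threshold, and the toy sequences are hypothetical and are not taken from the analyses reported here.

```python
def best_ungapped_identity(seq_a, seq_b):
    """Slide the shorter sequence along the longer one (no gaps) and
    return the highest fraction of matching positions."""
    short, long_ = sorted((seq_a.upper(), seq_b.upper()), key=len)
    if not short:
        return 0.0
    best = 0.0
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        matches = sum(a == b for a, b in zip(short, window))
        best = max(best, matches / len(short))
    return best


def flag_internal_duplicates(sequences, threshold=0.99):
    """Return pairs of specimen names whose sequences are near-identical
    over the full length of the shorter sequence."""
    names = sorted(sequences)
    flagged = []
    for index, name_a in enumerate(names):
        for name_b in names[index + 1:]:
            identity = best_ungapped_identity(sequences[name_a], sequences[name_b])
            if identity >= threshold:
                flagged.append((name_a, name_b))
    return flagged


# Toy data: specimen_B is a truncated copy of specimen_A.
toy_sequences = {
    "specimen_A": "ACGTACGTTTGACCAGT",
    "specimen_B": "CGTTTGACC",
    "specimen_C": "GGGGGCCCCCAAAAATTTTT",
}
flagged_pairs = flag_internal_duplicates(toy_sequences)
```

A flagged pair from different orders, as with a mayfly and a tsetse fly, is a strong signal that one sequence does not derive from its labeled specimen. Real data with indels would call for a proper alignment instead of this sliding comparison.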

Need for taxonomic validation of museum specimens

The four remaining ‘incorrect’ identifications all involved cases where BOLD and GenBank assigned the query sequence to a species closely related to the taxon analyzed by Meiklejohn et al. [10] (Table 1). As such, the evidence for misidentification rests on the presumption that their specimens were correctly identified. While the National Museum of Natural History is considered one of the better curated of North American insect collections, the quality of the identification of individual specimens depends on the expertise available and the time elapsed since they were assigned to a species [15]. As such, specimens may be misidentified, mirroring the situation reported in other studies. For example, Meier & Dikow [16] found that 12% of all species-level identifications for a genus of asilid flies from various collections were wrong. Similarly, Muona [17] found that from 1–25% of beetles belonging to two easily discriminated species pairs and one species tetrad were incorrectly identified in a major collection. Efforts to build a DNA barcode reference library for North American Lepidoptera exposed many misidentified specimens and overlooked cryptic species in major collections [18]. Importantly, all four cases of apparent misidentification reported by Meiklejohn et al. [10] involve species whose recognition is not straightforward. The sole case of generic misidentification involved a presumptive specimen of the flea, Ctenocephalides felis, whose sequence matched those for another flea, Pulex irritans, on BOLD and GenBank. BOLD holds nearly 1,200 records, contributed by 15 institutions, representing four species of Ctenocephalides and each possesses a divergent array of barcode sequences. The barcode results support the monophyly of all species in the genus while P. irritans forms a sister taxon. The specimen sequenced by Meiklejohn et al. [10] was consumed in the analysis, but we examined the vial of specimens from which it came. 
The sequence clusters within other independent records of Pulex irritans, which was also reported from dogs by the same study [19] from which the specimens in the vial originate. It is likely that a Pulex specimen was mis-sorted among the many Ctenocephalides in the tube of alcohol. The specimens were collected from Washington County, Arkansas, not from Washington state, as incorrectly noted in the GenBank record.

The three remaining cases of presumptive species-level misidentifications involved genera (Gryllus, Glossina, Phanaeus) with complex taxonomy. One of the three species, Gryllus assimilis, was formerly thought to be widely distributed in the New World, but it is now recognized to be a complex of 8+ species, several of which can only be reliably distinguished by their calls or life history [20,21]. Examination of the specimen analyzed by Meiklejohn et al. revealed that it was collected in Virginia–well outside the range of the true G. assimilis, which is a more southern species. In the United States, G. assimilis is only known from Texas and, as an introduced species, from southern Florida [21]. The identification retrieved for this specimen from both BOLD and GenBank (G. pennsylvanicus) is likely correct, although other species do occur in Virginia [21]. Similarly, the query species of tsetse fly (G. palpalis) belongs to a complex that includes G. brevipalpis [22–24], the species identified as the closest match by BOLD and GenBank. There are hundreds of Glossina COI sequences in GenBank and BOLD, but most of them are from the 3’ end of the gene and do not overlap with the DNA barcode region. The Phanaeus specimen analyzed by Meiklejohn et al. [10] may well be correctly identified as it matches closely (0.9955) to two specimens of P. vindex from an earlier study on the phylogeny of this genus [25]. The closest match (0.9982) for this specimen on both BOLD and GenBank is a sequence associated with an interim species name, which results in the apparent species-level misidentification. It should be noted that Phanaeus vindex belongs to a group of three closely related species as well as some controversial subspecies [26], but it is likely more diverse as records for it on BOLD include four distinct COI sequence clusters. 
Because of these taxonomic uncertainties, the four cases of presumptive species- or genus-level misidentifications are best viewed as unconfirmed or incorrect.

Resolving taxonomic uncertainty

As the preceding section reveals, efforts to assess the resolution of DNA barcodes are often constrained by poor taxonomy. It is certain that some records on BOLD and GenBank derive from misidentified specimens, but there is no easy path to correct them. This fact was powerfully demonstrated by Mutanen et al. [27] in a study of DNA barcode variation in 4,977 species of European Lepidoptera which revealed that 60% of the cases initially thought to indicate compromised species resolution or DNA barcode sharing actually arose as a result of misidentifications, databasing errors, or flawed taxonomy. As the taxonomic system for European Lepidoptera is very advanced, similar issues will be a greater impediment in most other groups. Databases like BOLD and GenBank record these divergences in taxonomic opinion, but they cannot resolve them, providing strong motivation for approaches that sidestep this barrier. The Barcode Index Number (BIN) system is a good candidate as it makes it possible to objectively register genetically diversified lineages [28]. One of the species in the current study, Forficula auricularia, provides a good example of the enhanced geographic and taxonomic resolution offered by BINs that could be useful in forensic and many other contexts. This taxon was, in principle, correctly identified through GenBank by Meiklejohn et al. [10] (Table 1). However, although only a single species is still formally recognized, F. auricularia has been known to include two lineages with differing distributions and life histories for >20 years [29,30]. In fact, barcode results indicate that North American populations actually include three divergent lineages with allopatric distributions (Fig 1). As such, BIN assignments provide information on the geographic distributions of the component lineages of this species complex that could be important in certain contexts, but that would be overlooked by a species-based assignment. 
Because most species of multicellular organisms await description, it is certain that there are many other cases where BIN-based analysis will enhance geographic resolution.
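The idea of mapping sequence-based lineages geographically can be illustrated with a small sketch. It assumes each barcode record carries a BIN identifier and a collection region; the record layout, the placeholder BIN labels, and the region names below are hypothetical and do not represent actual BOLD records for Forficula auricularia.

```python
from collections import defaultdict


def regions_by_bin(records):
    """Group collection regions by BIN identifier."""
    grouped = defaultdict(set)
    for record in records:
        grouped[record["bin"]].add(record["region"])
    return {bin_id: sorted(regions) for bin_id, regions in grouped.items()}


def bins_are_allopatric(bin_regions):
    """Return True if no region is shared between any two BINs."""
    seen = set()
    for regions in bin_regions.values():
        if seen & set(regions):
            return False
        seen |= set(regions)
    return True


# Placeholder records, all carrying the same species name.
toy_records = [
    {"species": "species_X", "bin": "BIN-1", "region": "region_east"},
    {"species": "species_X", "bin": "BIN-1", "region": "region_north"},
    {"species": "species_X", "bin": "BIN-2", "region": "region_west"},
    {"species": "species_X", "bin": "BIN-3", "region": "region_south"},
]
toy_map = regions_by_bin(toy_records)
```

In this toy case a species-level identification returns the same name for every record, while the BIN-level grouping separates three lineages with non-overlapping regions.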

Distinction between BOLD and GenBank

It is not surprising that BOLD and GenBank demonstrated similar performance in identification, once operational issues were resolved, as many records appear in both platforms, which are intended to be complementary. Sequences of COI submitted independently to GenBank are mined and entered into BOLD periodically while records from BOLD are submitted to GenBank when they are published. At present, 11% of all COI barcode records on BOLD originate from GenBank, while 75% of the COI barcodes on GenBank derive from BOLD. Although many records are shared, the two platforms diverge in collateral data. For example, for the 17 species of insects analyzed in [10], 65% of the records originating from BOLD possess GPS coordinates, 60% have trace electropherograms, and 40% have specimen images, while only 26% of those originating from GenBank have GPS coordinates and all lack images and electropherograms. In addition, BOLD employs BINs to integrate records that lack a genus or species designation with those that possess them. These extended data elements and functionality are a valuable, often essential, component in the evaluation of identification results.

Conclusions and path forward

Six of the 17 species examined by Meiklejohn et al. [10] escaped operational errors, but the other 11 did not (Table 2), explaining the low identification success they reported. Even after correcting for the use of reverse complements, the effectiveness of DNA barcoding could not be evaluated for eight species, those impacted by sequence contamination or taxonomic uncertainty. Importantly, DNA barcode records in BOLD and GenBank did deliver a correct species assignment for the other nine species. While the outcome for these species is reassuring, the lack of an outcome for other taxa reveals the need for improved protocols. Clearly, two conditions need to be satisfied to ensure a correct identification–the query sequences must be legitimate and the reference libraries must be well-validated. As a start, any study that aims to employ DNA barcodes for species identification should include steps to ensure the sequences recovered are valid by including positive and negative controls, by assessing sequence quality, and by checking for contaminants (Fig 2). Presuming the query sequences pass these quality checks, the generation of a reliable identification requires a comprehensive, well-validated reference library. The taxonomic reliability of GenBank has often been questioned [e.g. 31], although the actual overall data quality is much better than often assumed [32]. Because BOLD is a workbench for the DNA barcode research community, it will always contain sequences from specimens whose identifications are being refined. However, the taxonomic coverage and resolution of the COI barcode library on BOLD, and hence the accuracy of identification queries, is steadily improving [33]. The establishment of a Barcode REF library, based upon a small number of carefully validated records for each species, would represent an important step towards further improving BOLD’s capacity to generate reliable identifications. 
Under ideal circumstances, the primary reference sequence for each species would derive from its holotype. However, because 90% of all multicellular organisms await description, and the status of many described species groups is uncertain, these efforts will need to be reinforced by a BIN-based approach.
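The quality checks on query sequences can be expressed as a simple screening function. This is a sketch under stated assumptions rather than the workflow of Fig 2: the thresholds (`min_length`, `max_ambiguous_fraction`, `min_identity`), the flag names, and the `QueryRecord` structure are hypothetical. The contamination rule follows the pattern reported above, where a close match to a different order than expected pointed to a non-target sequence.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    sequence: str
    expected_order: str      # order assigned from the specimen's morphology
    top_match_order: str     # order of the closest reference sequence
    top_match_identity: float


def qc_flags(record, min_length=300, max_ambiguous_fraction=0.01, min_identity=0.95):
    """Return a list of warning flags for one query record."""
    flags = []
    seq = record.sequence.upper()
    if len(seq) < min_length:
        flags.append("short_sequence")
    ambiguous = sum(base not in "ACGT" for base in seq)
    if seq and ambiguous / len(seq) > max_ambiguous_fraction:
        flags.append("ambiguous_bases")
    if record.top_match_identity < min_identity:
        # Distant top matches can indicate a wrong orientation or a gap
        # in the reference library.
        flags.append("no_close_match")
    elif record.top_match_order != record.expected_order:
        # A close match to another order suggests a non-target sequence.
        flags.append("possible_contamination")
    return flags


# Length and identity follow the Hexagenia limbata row of Table 1
# (GenBank); the sequence itself is a placeholder of the same length.
mayfly = QueryRecord("ACGT" * 63 + "AC", "Ephemeroptera", "Diptera", 0.9681)
mayfly_flags = qc_flags(mayfly)
```

Flags are meant to prompt a manual review, not to reject a record automatically, since short sequences from museum specimens can still yield valid identifications.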

Table 2. Three categories of operational errors which compromised efforts by Meiklejohn et al. [10] to test the effectiveness of the BOLD and GenBank reference libraries in identifying 17 insect species.

Specimen #	ID	Reverse Complement	Contamination	Incorrect ID
1	Phanaeus vindex	—	—	Yes
2	Forficula auricularia	Yes	—	—
3	Chrysomya rufifacies	—	—	—
4	Calliphora vicina	—	—	—
5	Aedes aegypti	—	—	—
6	Glossina palpalis	—	—	Yes
7	Musca domestica	Yes	Yes	N.D.
8	Hexagenia limbata	—	Yes	N.D.
9	Vespula squamosa	—	—	—
10	Callosamia promethea	—	—	—
11	Danaus plexippus	—	—	—
12	Merope tuber	Yes	—	—
13	Ululodes quadripunctatus	Yes	—	—
14	Gomphus exilis	Yes	Yes	N.D.
15	Gryllus assimilis	—	—	Yes
16	Ctenocephalides felis	Yes	—	Yes
17	Pediculus humanus capitis	Yes	Yes	N.D.

N.D. = not determined.
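The counts quoted in the conclusions can be reproduced directly from Table 2. The sketch below encodes each row as three booleans (reverse complement, contamination, incorrect ID), treating "N.D." in the last column as not flagged; the variable names are ours.

```python
# (reverse complement, contamination, incorrect ID) per specimen, from Table 2.
TABLE_2 = {
    "Phanaeus vindex": (False, False, True),
    "Forficula auricularia": (True, False, False),
    "Chrysomya rufifacies": (False, False, False),
    "Calliphora vicina": (False, False, False),
    "Aedes aegypti": (False, False, False),
    "Glossina palpalis": (False, False, True),
    "Musca domestica": (True, True, False),
    "Hexagenia limbata": (False, True, False),
    "Vespula squamosa": (False, False, False),
    "Callosamia promethea": (False, False, False),
    "Danaus plexippus": (False, False, False),
    "Merope tuber": (True, False, False),
    "Ululodes quadripunctatus": (True, False, False),
    "Gomphus exilis": (True, True, False),
    "Gryllus assimilis": (False, False, True),
    "Ctenocephalides felis": (True, False, True),
    "Pediculus humanus capitis": (True, True, False),
}

total = len(TABLE_2)
error_free = sum(not any(row) for row in TABLE_2.values())
reverse_complemented = sum(row[0] for row in TABLE_2.values())
# Contaminated or taxonomically uncertain: barcoding cannot be evaluated.
not_evaluable = sum(row[1] or row[2] for row in TABLE_2.values())
evaluable = total - not_evaluable

# Species-level success rates reported by Meiklejohn et al. [10].
bold_percent = round(100 * 6 / total)
genbank_percent = round(100 * 9 / total)
```

The tallies give 6 error-free specimens, 11 with at least one operational error, 7 submitted to BOLD as reverse complements, 8 that could not be evaluated, and 9 that could, matching the figures in the text.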

Supporting information

S1 File. Top 20 matches in GenBank BLAST queries for the four specimens deemed cross-contaminations.

(XLSX)

S2 File. Top 20 matches from queries to the BOLD ID engine for four specimens whose COI sequences derive from cross-contamination.

(XLSX)

Acknowledgments

Floyd Shockley and Cailin Meyer reassembled the voucher specimens which had been returned to the USNM collection. Nicholas Silverson added images of most of the specimens to BOLD.

Data Availability

This study is a reanalysis of already published data that was submitted to GenBank. GenBank accessions are referenced in Table 1.

Funding Statement

This work was enabled by funding from the Canada First Research Excellence Fund, the Ontario Ministry of Research and Innovation, the Canada Foundation for Innovation, and the Natural Sciences and Engineering Research Council of Canada. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Chang C-H, Dai W-Y, Chen T-Y, Lee A-H, Hou H-Y, Liu S-H, et al. DNA barcoding reveals CITES-listed species among Taiwanese government-seized chelonian specimens. Genome. 2018;61: 615–624. doi:10.1139/gen-2017-0264
2. Koroiva R, de Souza MS, Roque F de O, Pepinelli M. DNA barcodes for forensically important fly species in Brazil. J Med Entomol. 2018;55: 1055–1061. doi:10.1093/jme/tjy045
3. Coissac E, Hollingsworth PM, Lavergne S, Taberlet P. From barcodes to genomes: extending the concept of DNA barcoding. Mol Ecol. 2016;25: 1423–8. doi:10.1111/mec.13549
4. Pentinsaari M, Hebert PDN, Mutanen M. Barcoding beetles: A regional survey of 1872 species reveals high identification success and unusually deep interspecific divergences. PLoS One. 2014;9: e108651. doi:10.1371/journal.pone.0108651
5. Hajibabaei M, Janzen DH, Burns JM, Hallwachs W, Hebert PDN. DNA barcodes distinguish species of tropical Lepidoptera. Proc Natl Acad Sci U S A. 2006;103: 968–971. doi:10.1073/pnas.0510466103
6. Hausmann A, Godfray HC, Huemer P, Mutanen M, Rougerie R, van Nieukerken EJ, et al. Genetic patterns in European geometrid moths revealed by the Barcode Index Number (BIN) system. PLoS One. 2013;8: e84518. doi:10.1371/journal.pone.0084518
7. Hendrich L, Morinière J, Haszprunar G, Hebert PDN, Hausmann A, Köhler F, et al. A comprehensive DNA barcode database for Central European beetles with a focus on Germany: Adding more than 3,500 identified species to BOLD. Mol Ecol Resour. 2015;15: 795–818. doi:10.1111/1755-0998.12354
8. Huemer P, Mutanen M, Sefc KM, Hebert PDN. Testing DNA barcode performance in 1000 species of European lepidoptera: large geographic distances have small genetic impacts. PLoS One. 2014;9: e115774. doi:10.1371/journal.pone.0115774
9. Kerr KCR, Stoeckle MY, Dove CJ, Weigt LA, Francis CM, Hebert PDN. Comprehensive DNA barcode coverage of North American birds. Mol Ecol Notes. 2007;7: 535–543. doi:10.1111/j.1471-8286.2007.01670.x
10. Meiklejohn KA, Damaso N, Robertson JM. Assessment of BOLD and GenBank–Their accuracy and reliability for the identification of biological materials. Fugmann SD, editor. PLoS One. 2019;14: e0217084. doi:10.1371/journal.pone.0217084
11. Ratnasingham S, Hebert PDN. BOLD: The Barcode of Life Data System (http://www.barcodinglife.org). Mol Ecol Notes. 2007;7: 355–364. doi:10.1111/j.1471-8286.2007.01678.x
12. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, et al. GenBank. Nucleic Acids Res. 2017;45: D37–D42. doi:10.1093/nar/gkw1070
13. Siddall ME, Fontanella FM, Watson SC, Kvist S, Erséus C. Barcoding bamboozled by bacteria: convergence to metazoan mitochondrial primer targets by marine microbes. Syst Biol. 2009;58: 445–451. doi:10.1093/sysbio/syp033
14. Mioduchowska M, Czyż MJ, Gołdyn B, Kur J, Sell J. Instances of erroneous DNA barcoding of metazoan invertebrates: Are universal cox1 gene primers too “universal”? Hajibabaei M, editor. PLoS One. 2018;13: e0199609. doi:10.1371/journal.pone.0199609
15. McGinley RJ. Where’s the management in collections management? Planning for improved care, greater use and growth of collections. Int Symp First World Congr Preserv Conserv Nat Hist Collect. 1993;3: 309–338.
16. Meier R, Dikow T. Significance of specimen databases from taxonomic revisions for estimating and mapping the global species diversity of invertebrates and repatriating reliable specimen data. Conserv Biol. 2004;18: 478–488. doi:10.1111/j.1523-1739.2004.00233.x
17. Muona J. Huomioita eläinmuseon kuoriaskokoelmien virhemäärityksistä [Some observations concerning incorrectly determined beetles in public collections. (Coleoptera)]. Sahlbergia. 2001;6: 34–36.
18. Levesque-Beaudin V, Rosati M, Silverson N, Warne CP, Brown A, Telfer AC, et al. Museum harvesting in major natural history collections. Genome. 2017;60: 962.
19. Schiefer BA, Lancaster JL Jr. Siphonaptera from Arkansas. Journal of the Kansas Entomological Society. 1970;43: 177–181.
20. Walker TJ. Singing Insects of North America (SINA). 2004 [cited 11 Jul 2019]. Available from: http://entnemdept.ufl.edu/walker/buzz/index.htm
21. Weissman DB, Gray DA. Crickets of the genus Gryllus in the United States (Orthoptera: Gryllidae: Gryllinae). Zootaxa. 2019;4705: 1–277. doi:10.11646/zootaxa.4705.1.1
22. Gooding RH, Krafsur ES. TSETSE GENETICS: Contributions to biology, systematics, and control of tsetse flies. Annu Rev Entomol. 2005;50: 101–123. doi:10.1146/annurev.ento.50.071803.130443
23. Dyer NA, Furtado A, Cano J, Ferreira F, Odete Afonso M, Ndong-Mabale N, et al. Evidence for a discrete evolutionary lineage within Equatorial Guinea suggests that the tsetse fly Glossina palpalis palpalis exists as a species complex. Mol Ecol. 2009;18: 3268–3282. doi:10.1111/j.1365-294X.2009.04265.x
24. De Meeûs T, Bouyer J, Ravel S, Solano P. Ecotype Evolution in Glossina palpalis subspecies, major vectors of sleeping sickness. PLoS Negl Trop Dis. 2015;9: e0003497. doi:10.1371/journal.pntd.0003497
25. Price DL. Phylogeny and Biogeography of the Dung Beetle Genus Phanaeus (Coleoptera: Scarabaeidae). Syst Entomol. 2009;34: 137–150.
26. Edmonds WD, Zídek J. Taxonomy of Phanaeus revisited: Revised keys to and comments on species of the New World dung beetle genus Phanaeus MacLeay, 1819 (Coleoptera: Scarabaeidae: Scarabaeinae: Phanaeini). Insecta Mundi. 2012;0274: 1–108.
27. Mutanen M, Kivelä SM, Vos RA, Doorenweerd C, Ratnasingham S, Hausmann A, et al. Species-level para- and polyphyly in DNA barcode gene trees: strong operational bias in European Lepidoptera. Syst Biol. 2016; syw044. doi:10.1093/sysbio/syw044
28. Ratnasingham S, Hebert PDN. A DNA-based registry for all animal species: the Barcode Index Number (BIN) system. PLoS One. 2013;8: e66213. doi:10.1371/journal.pone.0066213
29. Guillet S, Josselin N, Vancassel M. Multiple introductions of the Forficula auricularia species complex (Dermaptera: Forficulidae) in eastern North America. Can Entomol. 2000;132: 49–57. doi:10.4039/Ent13249-1
30. Guillet S, Guiller A, Deunff J, Vancassel M. Analysis of a contact zone in the Forficula auricularia L. (Dermaptera: Forficulidae) species complex in the Pyrenean Mountains. Heredity (Edinb). 2000;85: 444–449. doi:10.1046/j.1365-2540.2000.00775.x
31.Bridge PD, Roberts PJ, Spooner BM, Panchal G. On the unreliability of published DNA sequences. New Phytol. 2003;160: 43–48. 10.1046/j.1469-8137.2003.00861.x [DOI] [PubMed] [Google Scholar]
32.Leray M, Knowlton N, Ho SL, Nguyen BN, Machida RJ. GenBank is a reliable resource for 21st century biodiversity research. Proc Natl Acad Sci U S A. 2019;116: 22651–22656. 10.1073/pnas.1911714116 [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Madden MJL, Young RG, Brown JW, Miller SE, Frewin AJ, Hanner RH. Using DNA barcoding to improve invasive pest identification at U.S. ports-of-entry. PLoS One. 2019;14: e0222291 10.1371/journal.pone.0222291 [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS One. doi: 10.1371/journal.pone.0231814.r001

Decision Letter 0

Matjaž Kuntner

6 Jan 2020

PONE-D-19-20600

Forensics and DNA Barcodes – Do Identification Errors Arise in the Lab or in the Sequence Libraries?

PLOS ONE

Dear Dr. Hebert,

Thank you for submitting your manuscript to PLOS ONE. We invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Both reviewers are very positive about your work, and I agree that it is important to publish it with minimum delay. Please address in a revised version the minor recommendations from the reviewers.

We would appreciate receiving your revised manuscript by Feb 07 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Matjaž Kuntner

Academic Editor

PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is an important paper. It competently and completely addresses the errors and misunderstandings reported by Meiklejohn et al. 2019. The conclusions of the Meiklejohn et al. paper were flawed, full stop, but have the potential to greatly influence the field. Especially given the title. Therefore, a reanalysis of the results is called for and the authors do a thorough job of it. Hopefully their work will be taken into account by future readers of the Meiklejohn paper.

They take the opportunity to provide a pipeline that, if followed, will produce query sequences that are high quality – avoiding contamination, pseudogenes, and ambiguous bases. I expect that the majority of the DNA barcoding community follows a similar pipeline, but it is good to see it laid out so clearly. The pipeline would be a good place to start for a lab just beginning its barcoding efforts.
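As an illustration of the kind of automated check such a pipeline might include, the following is a minimal sketch, not the authors' workflow. It counts ambiguous bases and looks for a forward reading frame without stop codons, on the assumption that the query is an insect COI fragment translated with the invertebrate mitochondrial code (NCBI translation table 5, in which only TAA and TAG are stops). The function names and the `max_ambiguous` threshold are hypothetical.

```python
# Sketch only: function names and the default threshold are illustrative assumptions.
STOP_CODONS = {"TAA", "TAG"}  # stop codons in the invertebrate mitochondrial code (table 5)
UNAMBIGUOUS = set("ACGT")


def count_ambiguous(seq: str) -> int:
    """Count characters other than A, C, G or T (for example N, R or Y)."""
    return sum(1 for base in seq.upper() if base not in UNAMBIGUOUS)


def stop_free_frames(seq: str) -> list:
    """Return the forward reading frames (0, 1, 2) that contain no stop codon."""
    seq = seq.upper()
    frames = []
    for frame in range(3):
        # Only complete codons are examined; a trailing partial codon is ignored.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames


def qc_report(seq: str, max_ambiguous: int = 0) -> dict:
    """Summarise simple quality indicators for one query sequence."""
    ambiguous = count_ambiguous(seq)
    frames = stop_free_frames(seq)
    return {
        "length": len(seq),
        "ambiguous_bases": ambiguous,
        "stop_free_frames": frames,
        # A sequence with stops in every forward frame may be a pseudogene,
        # may contain sequencing errors, or may be in the wrong orientation.
        "passes": ambiguous <= max_ambiguous and bool(frames),
    }
```

A check of this kind cannot detect contamination by another intact COI sequence; that still requires negative controls and comparison of the result against the expected taxon.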

In the course of the paper, the difficulties of using museum specimens as references, especially the pitfalls and effects of mislabeling, are clearly laid out. In conjunction with the discussion of the impact of taxonomic uncertainty, this information lays out the limitations of using DNA barcoding for identification. In general, the authors make clear the importance of considering these limitations and how they may be project-, taxon-, or location-specific. DNA barcoding works very well for many questions, but a researcher needs to be aware of these limitations.

Reviewer #2: In essence, the submitted manuscript draft of Pentinsaari and coauthors is a rebuttal to a manuscript recently published in PLoS One: “Meiklejohn KA, Damaso N, Robertson JM (2019) Assessment of BOLD and GenBank – Their accuracy and reliability for the identification of biological materials. PLoS ONE 14(6): e0217084”. In that work, Meiklejohn et al. (2019) assessed the usability of DNA barcoding coupled with public sequence repositories (GenBank, BOLD) in plants (using several genetic markers: rbcL, matK, trnH-psbA, ITS) and insects (using a single genetic marker, COI) for, among others, the purpose of forensic studies. One of their main conclusions (Meiklejohn et al., 2019) was that COI coupled with GenBank and BOLD has very limited power in the identification of insect species (53% and 35%, respectively).

To investigate the reasons for such low success, Pentinsaari and coauthors reanalyzed the COI data set from Meiklejohn et al. (2019) and identified two grave errors in their data processing that led to such low success of COI DNA barcoding. The first error was that Meiklejohn et al. (2019) obviously submitted a reverse complement rather than the coding sequence to the BOLD engine (correcting this raised the success of species/genus level identification for both approaches to over 50/70%). Furthermore, they demonstrated that an additional four misidentifications (23%) were due to cross-contamination of the museum samples used for DNA extractions. The remaining species-level misidentifications were due to misidentification of the museum specimens or resulted from complex, not fully resolved species-level taxonomies. In addition, the authors give suggestions on how forensic sciences can make use of DNA barcoding with the help of the Barcode Index Number (BIN) system even when analyzing species with taxonomies that are not fully resolved. Finally, the authors give a workflow with a series of recommendations to maximize the chance of recovering reliable sequence records for DNA barcoding.

I have carefully read both the Meiklejohn et al. (2019) paper and the draft Pentinsaari and coauthors have submitted. Furthermore, to leave no doubt, I have also personally rechecked and reanalyzed the COI data set in question and can confirm their results; I also agree with their identification of the errors in Meiklejohn et al. (2019). Therefore, I recommend that the editor accept this manuscript after a minor revision. I am adding a couple of suggestions that would make it a bit easier to follow what they have done.

Here are some minor recommendations

Results and Discussion

The authors might also want to mention that the data from Meiklejohn et al. (2019) is not deposited in BOLD and thus trace files are also not available (at least I could not find them). Namely, electropherograms could also be used to see if there are double peaks indicating chimeric sequences due to such contamination.

155 I would avoid using ‘species’ (I am referring to ‘’). Even if we know that this taxon name probably refers to several related species, it is still the only one described. Also, such situations are not rare and ‘’ doesn’t really clarify what the authors think is wrong with it. Maybe you can simply say “… Forficula auricularia complex, which was also used in the current study provides a good example...”

155-157: Maybe this is not the best example as Forficula auricularia was also correctly identified by Meiklejohn et al. (2019). But, yes it is a good example for BIN system. I suggest the authors recognize that Meiklejohn identified this species correctly somewhere in the text.

Supplementary Table 1

I suggest including Table S1 in the main text, as it is integral to readers' understanding of the manuscript. Furthermore, it will be easier to follow the manuscript if the authors refer to this table when discussing the different categories of mistakes throughout the text (reverse complement, likely contamination, misidentification).

Figure 2:

One negative control for each test same – what is the meaning of “same”.

No heterozygous peaks in trace – in mtDNA there are (almost) never heterozygous peaks. Such peaks are due to pseudogenes or contamination. Here they should be referred to as chimeric or by using another appropriate term.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Apr 16;15(4):e0231814. doi: 10.1371/journal.pone.0231814.r002

Author response to Decision Letter 0

12 Jan 2020

To the editor:

We thank the reviewers for scrutinizing our manuscript and providing constructive feedback. We agree with the suggestions made by Reviewer 2, and have edited the manuscript text and Figure 2 accordingly. We have also revised the formatting of the manuscript file to comply with the PLOS style requirements as requested by the editor. All changes are documented in the annotated version of the revised manuscript file.

Sincerely,

Mikko Pentinsaari

Sujeevan Ratnasingham

Scott E. Miller

Paul D. N. Hebert

Attachment

Submitted filename: Response to Reviewers.docx


PLoS One. doi: 10.1371/journal.pone.0231814.r003

Decision Letter 1

Matjaž Kuntner

10 Feb 2020

PONE-D-19-20600R1

Forensics and DNA barcodes – do identification errors arise in the lab or in the sequence libraries?

PLOS ONE

Dear Dr. Hebert,

Thank you for making the corrections to the originally submitted version of your manuscript. Before we move towards acceptance of your paper, please consider the following. As per PLoS policy, the journal editors have asked the authors of the paper you critique to provide a signed review. This review is now available to you herein. Please take some time to provide a list of rebuttals to this review, if you feel so, or changes to your manuscript, where appropriate and where you agree with these suggestions.

We would appreciate receiving your revised manuscript by Mar 26 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Matjaž Kuntner

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #3: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #3: Yes

**********

6. Review Comments to the Author

Reviewer #3: Review of: Forensics and DNA barcodes – do identification errors arise in the lab or in the sequence libraries?

PONE-D-19-20600R1

Paper summary:

This paper seeks to identify reasons why the results presented in Meiklejohn et al (PLoS One. 2019;14: e0217084. doi:10.1371/journal.pone.0217084) provide a misleading evaluation of the data contained in BOLD. In Meiklejohn et al, appropriate DNA barcode sequences from 17 insect, 61 plant and 16 macro-fungi species (curated reference material sourced from national collections) were obtained and compared to both GenBank and BOLD. Specifically for insect taxa, Meiklejohn et al reported that GenBank outperformed BOLD with respect to correct species level identifications. In the paper under review, the authors identify three main reasons for the misidentifications reported in Meiklejohn et al using BOLD: 1) reverse complement sequences were searched against BOLD, 2) sequences were cross-contaminated, and 3) taxa included were either misidentified or part of a species complex.

Major Technical Comments:

1) Line 24-26: The authors made the claim that they have accounted for all of the misidentifications in Meiklejohn et al. by identifying errors in the methods (i.e., “problems in the reference libraries or in the query sequences used to test them”). This idea is later contradicted in line 48 when they restated this as “..most, if not all, of the misidentifications” and later in line 82-83 where the authors implied that misidentifications also occurred due to “errors in the reference libraries..”, which was one of the primary conclusions of Meiklejohn et al. The authors contradicted themselves throughout their manuscript and need to update the text for consistency (based on valid data and conclusions – discussed in later points).

2) Line 41-42: Authors provided the statement, “Because COI generally has high accuracy in species assignment [4–9], the conclusions from a recent study by Meiklejohn et al. [10] were surprising.”. This statement would lead readers to believe that Meiklejohn et al did not acknowledge that their results did not conform to the majority of previous studies examining the accuracy of BOLD. This is not the case, as Meiklejohn et al. highlighted this by stating “This result was lower than previously reported for flies (Diptera [15,20], beetles (Coleoptera [23]), butterflies and moths (Lepidoptera [24]), when searching against either or both of these databases.” Thus, the text needs to be modified to reflect that Meiklejohn et al. also reported that their results were lower than previously reported.

3) Line 71-75: It is apparent that Meiklejohn et al did indeed search the reverse complement for seven COI sequences against BOLD. In doing so, the matches they returned were lower with respect to the similarity statistic and were to the incorrect species (highlighted by the authors in Table 1). While it is good scientific practice to search sequences in the forward orientation, at the time in which Meiklejohn et al searched sequences against BOLD (reported in their paper as “July 2017-January 2018”), there was no guidance or requirement that COI sequences must be searched in the forward orientation; neither Ratnasingham and Hebert (2007) nor the “BOLD Print Handbook for BOLD (v3 or v4)” specified the orientation of query sequences. Further, it is apparent the identification interface for COI on boldsystems.org was updated only ~6 months ago (after Meiklejohn et al was published) to state “Enter fasta formatted sequences in the forward orientation” and if a reverse complement sequence is submitted the result states “These results may be due to reverse or reverse complement sequences. Please try again with the forward orientation”. Conversely, the BOLD database always accommodated both orientations for plants and fungal identification engines. Thus, when researchers completed the searches against the original BOLD, they would have had no reason to immediately assume that the incorrect matches were a result of query sequence orientation. While the authors are correct in identifying a source of error in Meiklejohn et al, they have neglected to address why this mistake might have occurred. Given it was not previously explicitly stated that forward sequences must be searched (solely for insect COI sequences; not fungal or plant sequences), it is possible that results from other researchers might also have been affected by this issue and so must be addressed in this manuscript.

Meiklejohn et al. wrote that the BOLD similarity metric was an indicator of an incorrect match and recommended that it is important to use the metric to determine confidence of the results. In Meiklejohn et al., results that gave lower similarity were not excluded, presumably in order to see what the database results gave as the “most similar match”.
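The orientation issue and the use of the similarity metric as a confidence check, both raised in this comment, can be sketched together in a few lines. The example below is an assumption-laden illustration, not the BOLD identification engine or its API: `search` stands for any user-supplied function that returns a `(taxon, percent_similarity)` pair for a query, and `min_similarity` is an arbitrary placeholder threshold. The query is submitted in both orientations, the better-scoring match is kept, and a low score is flagged rather than reported as an identification.

```python
from typing import Callable, Tuple

# Complement table for unambiguous bases; other characters (e.g. N) pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_orientation_match(
    query: str,
    search: Callable[[str], Tuple[str, float]],  # hypothetical search function
    min_similarity: float = 95.0,  # placeholder value, not a recommended cutoff
) -> dict:
    """Search both orientations of a query and keep the higher-scoring match."""
    candidates = {
        "forward": query,
        "reverse_complement": reverse_complement(query),
    }
    results = {name: search(seq) for name, seq in candidates.items()}
    # On a tie, the orientation listed first ("forward") is kept.
    orientation = max(results, key=lambda name: results[name][1])
    taxon, similarity = results[orientation]
    return {
        "orientation": orientation,
        "taxon": taxon,
        "similarity": similarity,
        # A low best score suggests the match should not be treated as an identification.
        "low_confidence": similarity < min_similarity,
    }
```

Passing the search step in as a function keeps the sketch independent of any particular database interface; the same wrapper could sit in front of a local alignment tool or a web service.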

4) Line 89-91: The authors made a bold statement and suggested that “misidentification clearly arose from contamination” and made a reasonable hypothesis as to a potential cause for misidentifications. However, they failed to meet their burden of proof as well as contradicted themselves later in the article. In line 85-87, the authors admitted that “the other three mismatches involve taxa unrepresented among the 17 tested species ruling out internal contamination.” If internal contamination were the cause of one sample’s misidentification, then the authors need to provide an explanation for the lack of this contamination in the other samples. It is at this point that the authors contradicted themselves by writing, “This leaves two possible explanations-contamination in the reference sequences libraries or in the query sequences.” Again, the authors have not met their burden of proof for either claim of contamination. Furthermore, misidentification caused by inconsistencies in reference sequence libraries was a major conclusion cited by Meiklejohn et al. as a source for misidentification in the first place. Furthermore, Meiklejohn et al. used controls during sample processing, a procedure that was referenced in their material and methods (see further explanation below). Even after making these concessions, the authors in line 91 wrote, “these cases of misidentification clearly arose from contamination of query sequences,” again, contradicting themselves.

For the four specimens in which the COI sequences were identified by the authors as stemming from cross-contamination, only one of those sequences corresponded to a species processed in Meiklejohn et al (Hexagenia limbata). Typically, inadvertent cross-contamination arises when a laboratory routinely processes large numbers of similar samples. Do the authors know whether the research division of the FBI Laboratory (where Meiklejohn et al performed the research) processes large numbers of insect material? If so, they should state that as a plausible source of DNA for cross-contamination, rather than just providing a broad sweeping explanation of “cross-contamination”.

Overall for this section (subheading “Factors responsible for four ‘errors’ in generic assignment”) the authors need to make clear and well-supported arguments, rather than swapping between possible explanations for the misidentifications and contradicting themselves. Substantial modifications to the text are needed to address this.

5) Line 91-94: Authors wrote, “Cross-contamination is a well- recognized risk when working with museum specimens, so it is standard practice to check for its occurrence [13,14], but Meiklejohn et al. [10] make no mention of exercising precautions in this regard.” Actually, Meiklejohn et al. did exercise some of the precautions identified in Figure 2, but these might have been overlooked by the authors of this paper as Meiklejohn et al referenced their original paper for the majority of their methods (Meiklejohn et al 2018, International Journal of Legal Medicine). Perhaps Meiklejohn et al should have rearticulated these precautions in their 2019 paper, but they did include some such as negative controls, bleach specimen wash prior to extraction, and amplification using various primer pairs. The authors need to modify the text either to state which precautions were taken by Meiklejohn et al, or remove the text ‘make no mention of exercising precautions in this regard’.

6) Line 98-103: Authors briefly discussed how even at a well curated museum there is a potential for taxonomic misidentification, particularly with older samples (see Line 102, “As such, specimens may be misidentified, mirroring the situation reported in other studies”). This reviewer acknowledges that this can be true, however it would be unreasonable for Meiklejohn et al. to second-guess the taxonomic classification given to a specimen at a museum by a subject matter expert. To confirm the authors’ hypothesis, they could have rechecked the identity of the vouchered specimen. Since it is standard practice to return unused specimen material to the repository, it would have been straight-forward for the authors to confirm their claims of misidentification by examination of the returned specimens. Since the authors did not report doing this examination, their assumptions are overreaching.

7) Line 108: Authors wrote, “Importantly, all four cases of apparent misidentification reported by Meiklejohn et al. [10] involve species whose recognition is not straightforward.” Upon review, Meiklejohn et al reported that common insect species were selected for inclusion and discussed that possible misidentifications (i.e., sister species) might have impacted identification success. However, if the species were incorrectly identified as the sister species, correct genus level identification still would have been obtained (this was not the case for four taxa in question). Thus, the authors need to update the text to accurately reflect that the type of species included in Meiklejohn et al were common ones.

8) Line 118: Authors state “the supposed specimen..analyzed by Meiklejohn et al is almost certainly P. irritans”. This reviewer does not know how the authors could possibly make this bold claim without having access to the specimen used in Meiklejohn et al to confirm the identity morphologically. Moreover, as stated in the above comment, it would be inappropriate to second guess the identification as outlined above. In retrospect, this supports the claim made by Meiklejohn et al. that care must be taken when using barcoding as the sole source for identifying material found in the environment, due to a wide variety of sources for possible misidentifications including morphological misidentification in barcode and genetic databases.

9) Line 125-127: Authors argued that these specific instances should be viewed as “unconfirmed”. However, these authors are leading entomologists. One could ask how non-entomologists (such as Meiklejohn et al.) would differentiate between an unconfirmed identification and a misidentification. Authors themselves mention in Line 36-39 that DNA based approaches to species identification can help fill the lack of taxonomic expertise that often impedes analysis. It is apparent from reviewing the manuscript that the purpose of the Meiklejohn et al. paper was different, as they were focused on the general accuracy of GenBank and BOLD for returning a match to the expected species, not providing guidance on how matches should be reported out. Thus, the authors need to either remove the term ‘unconfirmed’ or explain how a user of BOLD with no entomology expertise could differentiate between an ‘unconfirmed’ and a ‘misidentification’.

10) Line 128-149: This reviewer agrees that identification of biological materials for forensics should be tackled in an interdisciplinary manner and with common sense, by looking at geographic information, etc. However, it is incorrect to suggest that the scope of Meiklejohn et al. was forensics and DNA barcodes, as their paper was solely on the utility of the DNA barcoding databases using a subset of curated museum species. It is unjust to assume the co-authors of Meiklejohn et al wrote in terms of forensics because of their affiliation. Presumably, Meiklejohn et al. would have been more specific regarding cautions if their objective was for forensic application. However, this was not the case. The authors need to modify the text to reflect the inaccuracy.

11) Line 160-166: Authors briefly summarized how it is possible to use GPS coordinate metadata included in BOLD to better source and identify samples. This is a reasonable suggestion for those who wish to more accurately identify a species for purely taxonomic reasons. However, doing so would require subject expertise in entomology or arthropod evolution to properly utilize this type of data (e.g., specimen images and GPS coordinates) which is not reasonable to expect for most users of BOLD whose background knowledge will vary wildly from one user to the next. Finally, this kind of analysis was outside the scope of work performed by Meiklejohn et al. and therefore an unreasonable criticism of their methodology. This section either needs to be removed, or placed within context of the goal/purpose of the Meiklejohn et al study.

12) Line 168-171: This statement overstates that 11 species did not escape operational errors. The sequences for only 3 species could be fully resolved, while the sequences for the other 8 species could not. The authors failed to meet the burden of proof and contradicted themselves throughout the paper by writing that these species were “not as straightforward” and should be viewed as “unconfirmed”.

Minor Technical Comments:

1) Line 55-59: The impression the authors made here is that Meiklejohn et al used old specimens and thus this contributed to their misidentification results. Meiklejohn et al. included the age of the specimens as metadata and highlighted that amplicon/sequence length recovery was not correlated with specimen age. Thus, the authors need to modify the text to include this fact.

2) Line 80: Awkward wording of sentence. “Both BOLD and GenBank delivered generic identifications deemed incorrect for four specimens.” Not sure what this sentence means.

3) Line 155-158: Authors explain how BOLD and GenBank are similar. This is also mentioned in Meiklejohn et al and could have been acknowledged.

4) Line 175-178: Authors suggest that Meiklejohn et al. did not use proper controls and quality checks; however, they did this according to their protocols, which were cited in Methods. Meiklejohn et al. referenced these steps from their earlier paper (Meiklejohn et al 2018 International Journal of Legal Medicine). The authors need to update the text to reflect this.

Comments on Tables and Figures:

Table 1: When checking the matches returned from the sequences listed in Table 1, it is apparent the authors mislabeled the query sequences. For example, upon downloading MK905402 from GenBank and searching against BOLD, one obtains a match for Forficula auricularia. Thus, the authors need to place the red results from Meiklejohn et al under the MK905402/RC row for BOLD_all, not the MK905402 row alone.

Lastly, this table shows that 7 of 11 needed to be RC. However, only 3 species were able to be resolved. All 7 sequences that were not RC (as Meiklejohn et al. stated in their Discussion) could be confidently recognized as incorrect from their similarity statistic (<60%). Furthermore, the sequence orientation for BOLD was not explicitly stated at the time Meiklejohn et al. queried the sequences, and this seems to affect only insect sequences (the BOLD fungal and plant algorithm was not sensitive to sequence orientation).

Table 2: What is N.D? the table description states “three categories of operational errors which compromised efforts by Meiklejohn et al…” Given this reviewer has concerns with the claims (and contradictions) the authors made about contamination, the ‘contamination’ column in this table needs to be removed. For instance, in line 85-87, the authors admitted that “the other three mismatches.. along the 17 tested species ruling out internal contamination.” Therefore, these species should not be included in the list as “contaminated” in the table.

Moreover, it is unclear what comprises the third operational error. Is it incorrect identification? Incorrect identification is not due to Meiklejohn et al. as these authors also had the same results. Neither Meiklejohn et al nor the authors performed taxonomic identification of the specimens. As presented, it appears to this reviewer that the authors wanted to find any reason for misidentification, aside from possible issues with BOLD. The column title needs to be modified accordingly.

Figure 2: This is a great figure to show proper workflow. Meiklejohn et al. followed many of these features (e.g., negative controls, quality assessment, etc.). Also given their sequences were submitted to GenBank and are from a coding region, the sequences would also have been checked for premature stop codons, indels, etc. (this is part of GenBank’s QC process before acceptance).

Additional Comments:

Appropriateness of the Methods

There has been a shift in the guidelines for using BOLD for COI identifications – the interface now states that sequences must be submitted in forward orientation. The authors need to highlight that they searched BOLD when this requirement had been articulated.

To support the claims made in this paper specifically on misidentification and cross contamination, the authors should have gone back to the pinned vouchered material at USNM to a) taxonomically confirm the species (addressing misidentification errors), and b) independently resequence the questionable specimens (addressing cross-contamination issues). The addition of these steps would provide the authors with evidence to classify the errors reported by Meiklejohn et al. A description of this work was not given.

Support of the Conclusions by the Results

The authors identified three reasons for the misidentification reported in Meiklejohn et al: 1) reverse complement sequences were searched against BOLD, 2) sequences were cross-contaminated, and 3) taxa included were either misidentified or part of a species complex. The authors’ results supported conclusion 1); however, only circumstantial evidence/results were provided to support conclusions 2 and 3.

Over reaching Conclusions

Several conclusions appear overreaching, especially those on misidentifications and cross-contamination. To validate these conclusions as previously stated, the authors should have gone back to the pinned vouchered material at the USNM repository to a) taxonomically confirm the species (validating whether the specimens initially tested were indeed misidentified), and b) independently resequence the questionable specimens (addressing cross-contamination issues).

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: Yes: James M Robertson

Attachment

Submitted filename: PONE-D-19-20600R1_Comments to Authors.docx


Attachment

Submitted filename: PONE-D-19-20600R1_Comments to Editor.docx


PLoS One. 2020 Apr 16;15(4):e0231814. doi: 10.1371/journal.pone.0231814.r004

Author response to Decision Letter 1

26 Mar 2020

1) Line 24-26: The authors made the claim that they have accounted for all of the misidentifications in Meiklejohn et al. by identifying errors in the methods (i.e., “problems in the reference libraries or in the query sequences used to test them”). This idea is later contradicted in line 48 when they restated this as “…most, if not all, of the misidentifications” and later in line 82-83 where the authors implied that misidentifications also occurred due to “errors in the reference libraries...”, which was one of the primary conclusions of Meiklejohn et al. The authors contradicted themselves throughout their manuscript and need to update the text for consistency (based on valid data and conclusions – discussed in later points).

Pentinsaari et al: Public databases will always contain misidentified or otherwise problematic sequence records. Despite this fact, all incorrect identifications noted by Meiklejohn et al. can still be confidently attributed to analytical errors or questionable identifications of their voucher specimens, as detailed in the manuscript and in our replies to the reviewer’s comments below. The sentence in the abstract has been edited to “…responsible for most misidentifications…” since the original wording could be interpreted as not covering misidentifications of the voucher specimens themselves.

Pentinsaari et al: A note acknowledging the statement by Meiklejohn et al. has been added near to the end of the introduction.

Pentinsaari et al: It is common practice that sequence data are analyzed in the forward orientation unless there is a specific reason to do otherwise. The BOLD user interface was updated during the preparation of our manuscript specifically to reduce the likelihood of further occurrences of similar analytical errors. A note on this update has been added in the revised version of our manuscript. We emphasize that BOLD has been employed as a basis for more than 2000 publications since 2007 and we have not encountered any other case where results based on reverse complement sequences were published.

Although BOLD did not specifically include a warning about the possibility of introducing errors if sequences were injected in the wrong orientation at the time of the original analysis, we still find it surprising that Meiklejohn et al. did not question their results and notice the errors in sequence orientation. Their specimens were expected to match closely to existing data on BOLD, but their closest matches obtained for these incorrect queries were very distant in both taxonomy and sequence similarity. For example, the closest match for the reverse complement sequence of the cat flea (Ctenocephalides felis) specimen retrieved through the BOLD ID engine was the snake Natrix tessellata (similarity 0.6176). The GenBank BLAST query of the same sequence by Meiklejohn et al. identified it as Pulex irritans (0.9642), which at least represents the same family as C. felis. The reasons for this particular discordance in genus/species identification are addressed elsewhere in our manuscript and in this document. GenBank automatically accounts for incorrect orientation of the query sequences which is why the observed results of the BLAST queries for these specimens conformed to expectations better than the BOLD ID engine results (although three of the reverse complement sequences still represented non-target organisms).
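The orientation issue described above can be illustrated with a minimal Python sketch. The sequences and function names are hypothetical and are not taken from BOLD, GenBank, or the Meiklejohn et al. dataset, and the naive position-by-position identity score only stands in for whatever similarity measure an identification engine actually uses. The sketch shows why a query submitted as a reverse complement scores poorly unless both orientations are tested.

```python
# Illustrative only: toy sequences, not real COI barcodes.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_orientation(query: str, reference: str) -> tuple[str, float]:
    """Score the query in both orientations and keep the better-scoring one."""
    forward = identity(query, reference)
    reverse = identity(reverse_complement(query), reference)
    if forward >= reverse:
        return ("forward", forward)
    return ("reverse_complement", reverse)


# Hypothetical reference in forward orientation.
reference = "ATGGCATTCCGTAAACTGGA"
# The same fragment, but submitted in the wrong orientation.
query = reverse_complement(reference)

print(identity(query, reference))          # poor score when orientation is ignored
print(best_orientation(query, reference))  # testing both orientations recovers the match
```

A check of this kind, run before submitting queries, is one simple way to notice that a poor match reflects sequence orientation rather than a gap in the reference library.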

Pentinsaari et al: We state in the beginning of this section that there are two possible explanations for the incorrect genus-level identification for these four sequences. We continue by evaluating which of these two potential error sources is more likely responsible for the observed misidentifications. As we conclude that cross-contamination is more likely in these cases than gross misidentification of the analyzed specimens, we see no signs of the self-contradiction the reviewer suggests.

On line 90, we erroneously stated that the flea sequence was one of the cases affected by cross-contamination. We were, in fact, referring to the Pediculus humanus specimen and should have written “louse” instead of “flea”. This error has been corrected in the revised version of the manuscript.

In bulk samples of arthropods, small and slender specimens (or fragments of such specimens) often get tangled in the legs or other structures of larger specimens, and will remain attached to the larger specimens when they are mounted for permanent storage in a natural history collection. This is the most likely source of the non-target DNA for Musca domestica and Gomphus exilis. The query sequence for Musca domestica matches the family Isotomidae (Collembola), and the query sequence for Gomphus exilis matches Cecidomyiidae (Diptera). The size difference between target and non-target is an order of magnitude in both cases. Examination of the voucher specimens at USNM revealed that the Gomphus exilis specimen was collected with a Malaise trap in which Cecidomyiidae are often very abundant (see e.g. Hebert et al. 2016, doi: 10.1098/rstb.2015.0333). The collecting method for the Musca domestica specimen was not reported on its label. Fragments of the smaller contaminant specimens were probably attached to a leg of the larger specimens used for DNA extraction, and the contaminant DNA was amplified in PCR instead of that from the target specimens due to differences in primer binding or DNA quality.

By examining loan records at the National Museum of Natural History, we found a likely explanation for otherwise unexplainable contamination of Pediculus humanus by strepsipteran DNA. Specifically, we found that two specimens of Strepsiptera were included in the original loan to Meiklejohn et al. Although these specimens were not reported as analyzed in the study and are not mentioned in the article by Meiklejohn et al., it seems likely that a data entry error or lab mishap explains the attribution of a strepsipteran sequence to a louse.

Although the exact source of contamination cannot be proven, it is certain these sequences did not derive from their supposed source organism as they cluster within large numbers of independently generated sequences from a completely different taxon, and are distant from other representatives of the target taxon (again, multiple independently sequenced specimens). Further notes on the possible sources of non-target DNA have been added in our revised manuscript.
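The reasoning above, that a non-target sequence sits among independently generated sequences of a different taxon and far from other representatives of its supposed taxon, can be sketched as a simple distance comparison. This is a minimal illustration with invented toy sequences, placeholder taxon labels, and hypothetical function names. It uses uncorrected p-distance on pre-aligned sequences, which is an assumption of the sketch and not a description of how BOLD or GenBank compute matches.

```python
def p_distance(a: str, b: str) -> float:
    """Uncorrected proportion of differing sites between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_taxon(query: str, references: dict[str, list[str]]) -> tuple[str, float]:
    """Return the reference taxon closest to the query and its minimum p-distance."""
    closest = {
        taxon: min(p_distance(query, seq) for seq in seqs)
        for taxon, seqs in references.items()
    }
    taxon = min(closest, key=closest.get)
    return taxon, closest[taxon]


def flag_non_target(query: str, expected_taxon: str,
                    references: dict[str, list[str]]) -> bool:
    """True when the query is closer to another taxon than to its expected one."""
    taxon, _ = nearest_taxon(query, references)
    return taxon != expected_taxon


# Toy data: several independent reference sequences per (placeholder) taxon.
references = {
    "taxon_A": ["AAAAAAAAAACCCCCCCCCC", "AAAAAAAAAACCCCCCCCCT"],
    "taxon_B": ["GGGGGTTTTTGGGGGTTTTT", "GGGGGTTTTTGGGGGTTTTA"],
}
# A query labelled as taxon_A that actually resembles taxon_B.
query = "GGGGGTTTTTGGGGGTTTTC"

print(nearest_taxon(query, references))
print(flag_non_target(query, "taxon_A", references))
```

A flag from such a comparison does not identify the source of the non-target DNA; it only indicates that the sequence is unlikely to derive from the labelled organism, which matches the limited claim made in the reply.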

Pentinsaari et al: Our text has been modified to mention the use of desirable precautions such as negative controls. However, none of the stated precautions will achieve reliable results if the tissue sample originates from the wrong specimen, multiple specimens, or a misidentified specimen.

Pentinsaari et al: No taxonomy-related data apart from the species names are reported by Meiklejohn et al. for any of the analyzed specimens, which makes it impossible to evaluate the validity of the identifications based on the information provided in the original study. If the analyzed specimens have been authoritatively identified, the name of the identifier and date (or at least year) of identification would normally be reported on a “det label” together with the taxon name. Based on examination of the label data of the specimens at USNM, this information is available for some, but not all, of the specimens.

As for provenance data, only the year of collection was reported by Meiklejohn et al, but not the geographic origin of the specimens. The USNM specimen ID numbers of the analyzed specimens were also not listed by Meiklejohn et al. This made it difficult to confirm the source specimen for certain sequences because multiple specimens of some species were included in the loan to Meiklejohn et al., but only a single specimen per species was reported in their analyses, itself a weak approach. In some cases (e.g. louse, flea), USNM loan documentation indicates that the specimens were analyzed destructively so it was not possible to verify their identity morphologically (but see our replies to points 4 and 8). We have mined the sequence records from the original study from GenBank and compiled them on BOLD as a dataset (DS-FBI2019). When possible, specimen metadata have been added to the BOLD records based on the USNM specimen labels.

As detailed in our manuscript, some of the analyzed species are known to belong to complexes of several closely related species where morphological identification is challenging, and attempting to sample only supposedly common species does not avoid this problem.

The Phanaeus vindex specimen analyzed by Meiklejohn et al. clusters in the same BIN with specimens carrying one interim name in addition to P. vindex. The specimen included in the Meiklejohn et al. study happens to be most similar to the specimen with an interim name, hence the observed ‘misidentification’. The Phanaeus vindex species group contains three closely related species currently considered valid, but also several synonymized species names and controversial subspecies (Edmonds & Zídek 2012, http://digitalcommons.unl.edu/insectamundi/784). Other BIN clusters on BOLD also currently carry the name P. vindex, indicating that there are likely more species within this group than are currently recognized. Our revised manuscript includes further details on this species group.

In the case of Gryllus, reliable identifications cannot be based on morphology alone as male song patterns are crucial for confident species identifications without molecular analysis. Several species are now recognized within what was once thought to be a single widespread and common species called G. assimilis. However, based on the specimen label data, the analyzed Gryllus “assimilis” specimen was collected in Virginia, which is well outside the range of the true G. assimilis according to the recent revision of the genus (Weissman & Gray 2019, doi: 10.11646/zootaxa.4705.1.1). The identification retrieved for this specimen from both BOLD and GenBank (Gryllus pennsylvanicus) is likely correct, although other species also occur in the eastern United States.

8) Line 118: Authors state “the supposed specimen analyzed by Meiklejohn et al is almost certainly P. irritans”. This reviewer does not know how the authors could possibly make this bold claim without having access to the specimen used in Meiklejohn et al to confirm the identity morphologically. Moreover, as stated in the above comment, it would be inappropriate to second guess the identification as outlined above. In retrospect, this supports the claim made by Meiklejohn et al. that care must be taken when using barcoding as the sole source for identifying material found in the environment, due to a wide variety of sources for possible misidentifications including morphological misidentification in barcode and genetic databases.

Pentinsaari et al: As the flea specimen was completely consumed in the analysis by Meiklejohn et al, its identification cannot be validated morphologically. Our conclusion that the specimen was likely misidentified is based on comparison of the query sequence to multiple independently generated sequences of both Ctenocephalides and Pulex in both BOLD and GenBank. The query sequence of the supposed C. felis specimen is deeply divergent from all other representatives of this species and its congeners, and instead clusters within sequences identified as P. irritans. Misidentification of the single specimen studied by Meiklejohn et al. is a far more likely cause for this conflict than the alternative explanation that all other records in the databases originating from multiple independent studies are consistently and concordantly misidentified. Although the specimen sequenced by Meiklejohn et al. was consumed, we examined the vial of specimens at USNM from which it came. Pulex irritans was also reported from dogs in the same study (Schiefer & Lancaster 1970, Journal of the Kansas Entomological Society 43:177-181). It is likely that a Pulex specimen was mis-sorted among the many Ctenocephalides in the tube of alcohol.

9) Line 125-127: Authors argued that these specific instances should be viewed as “unconfirmed”. However, these authors are leading entomologists. One could ask how non-entomologists (such as Meiklejohn et al.) would differentiate between an unconfirmed identification and a misidentification. Authors themselves mention in Line 36-39 that DNA based approaches to species identification can help fill the lack of taxonomic expertise that often impedes analysis. It is apparent from reviewing the manuscript that the purpose of the Meiklejohn et al. paper was different, as they were focused on the general accuracy of GenBank and BOLD for returning a match to the expected species, not providing guidance on how matches should be reported out. Thus, the authors need to either remove the term ‘unconfirmed’ or explain how a user of BOLD with no entomology expertise could differentiate between an ‘unconfirmed’ and a ‘misidentification’.

Pentinsaari et al: By ‘unconfirmed’ we mean an unconfirmed case of DNA barcodes failing to distinguish between species. The goal of the original study as stated by the authors was to assess the accuracy of BOLD and GenBank, not to identify unknown specimens. The accurate identification of any specimens used in such assessment is critically important to reach correct conclusions. The authors should themselves be primarily responsible for the quality of the data they publish, and as we argue in the section ‘Need for Taxonomic Validation of Museum Specimens’ taking identifications of specimens in natural history collections at face value is unwise. The statement by Meiklejohn et al. that they used “curated reference material” is misleading, although undoubtedly not intentionally so. A specimen being placed in the unit tray for a particular species is no guarantee that the specimen is reliably identified. If the authors of the original study lacked entomological expertise, we wonder why they did not collaborate with entomologists (such as curators at the USNM). A closer collaboration with entomological experts would certainly have revealed the major flaw in the study design of Meiklejohn et al. – analyzing single specimens per species, many of them from known species complexes, without expert confirmation of the identifications – during the data acquisition and analysis phase instead of post publication.

Pentinsaari et al: Considering the affiliation of the authors of the original study, it is certain the Meiklejohn et al. publication was widely viewed by the forensic science community and provided it with an unfounded pessimistic picture of DNA barcoding. Our manuscript seeks, in part, to correct this through a follow-up study. The authors also state that they selected some of their target species based on forensic importance. However, the reviewer is correct in stating that the original study was not specifically addressing the application of DNA barcoding in forensics. We have modified the title of our manuscript and changed the wording referring to forensics in other parts of the text.

Pentinsaari et al: This section does not criticize the study by Meiklejohn et al., but instead discusses the connection between BOLD and GenBank and differences between them which might not be familiar to many readers. The set of species analyzed by Meiklejohn et al. is only used as a convenient example to present these differences. Reporting all relevant metadata for natural history specimens is good practice even if those data are not essential for the study as such metadata can improve the reproducibility/interpretability of data and can be highly useful in subsequent analyses or validation of the data. For example, in the context of this study, the geographic origin of the Gryllus specimen (not reported by Meiklejohn et al.) reveals that it is highly unlikely to represent G. assimilis (see our reply to points 6-7 above).

Pentinsaari et al: Based on our reanalysis of the sequence data and subsequent examination of those specimens at USNM which could be traced and examined, we see no reason to change our conclusions regarding operational errors in the original study by Meiklejohn et al.

Minor Technical Comments:

2) Line 80: Awkward wording of sentence. “Both BOLD and GenBank delivered generic identifications deemed incorrect for four specimens.” Not sure what this sentence means.

3) Line 155-158: Authors explain how BOLD and GenBank are similar. This is also mentioned in Meiklejohn et al and could have been acknowledged.

Comments on Tables and Figures:

Pentinsaari et al: Table 1 provides a direct comparison between the results reported by Meiklejohn et al. and our attempt to replicate these results. As a result, we include the incorrect matches obtained by Meiklejohn et al. on the BOLD_all row as originally reported instead of the reverse complement row. Any sequences uploaded to GenBank by Meiklejohn et al. as reverse complements have been converted by GenBank into the proper orientation, and therefore they now produce different query results than those reported in the original study. Our initial inability to reproduce the original results and subsequent discovery of what caused the very incorrect identifications for some insect specimens was a major motivator for compiling this manuscript. The lack of warning on BOLD about sequence orientation is addressed above in our response to comment number 3.

Pentinsaari et al: Internal cross-contamination (i.e. between specimens included in the analyses of Meiklejohn et al.) could be ruled out for those three specimens, but as detailed above (reply to point 4) and in the manuscript, contamination with non-target DNA still explains the distant matches obtained beyond any reasonable doubt. While preparing the first draft of our manuscript, we did not examine the USNM loan records related to the Meiklejohn et al. material, the analyzed specimens, or their label data, but we did so during the review process. These studies have confirmed our original conclusions regarding the ambiguity in the identification of some of the analyzed specimens.

Figure 2: This is a great figure to show proper workflow. Meiklejohn et al. followed many of these features (e.g., negative controls, quality assessment, etc.). Also given their sequences were submitted to GenBank and are from a coding region, the sequences would also have been checked for premature stop codons, indels, etc (this is part of GenBank’s QC process before acceptance).

Additional Comments:

Appropriateness of the Methods

Support of the Conclusions by the Results

Over reaching Conclusions

Pentinsaari et al: Since all the cases we have listed as cross-contaminations involve pairs of very distantly related taxa, they are easily identified as cross-contaminations through comparison of the Meiklejohn et al. data to other independently generated sequences of the target and non-target taxa. Resequencing the original specimens would not add any value to our study. The comments related to controversial identifications of some of the analyzed specimens and changes in the BOLD interface have been addressed earlier in this response.

Attachment

Submitted filename: Response to Reviewers.docx
