Proceedings of the National Academy of Sciences of the United States of America. 2009 Jul 30;106(31):12794–12797. doi: 10.1073/pnas.0905845106

A DNA barcode for land plants

CBOL Plant Working Group1, Peter M Hollingsworth a,2, Laura L Forrest a, John L Spouge b, Mehrdad Hajibabaei c, Sujeevan Ratnasingham c, Michelle van der Bank d, Mark W Chase e, Robyn S Cowan e, David L Erickson f, Aron J Fazekas g, Sean W Graham h, Karen E James i, Ki-Joong Kim j, W John Kress f, Harald Schneider i, Jonathan van Alphen-Stahl e, Spencer CH Barrett k, Cassio van den Berg l, Diego Bogarin m, Kevin S Burgess k,n, Kenneth M Cameron o, Mark Carine i, Juliana Chacón p, Alexandra Clark a, James J Clarkson e, Ferozah Conrad q, Dion S Devey e, Caroline S Ford r, Terry AJ Hedderson s, Michelle L Hollingsworth a, Brian C Husband g, Laura J Kelly a,e, Prasad R Kesanakurti g, Jung Sung Kim j, Young-Dong Kim t, Renaud Lahaye d, Hae-Lim Lee j, David G Long a, Santiago Madriñán p, Olivier Maurin d, Isabelle Meusnier c, Steven G Newmaster g, Chong-Wook Park u, Diana M Percy h, Gitte Petersen v, James E Richardson a, Gerardo A Salazar w, Vincent Savolainen e,x, Ole Seberg v, Michael J Wilkinson r, Dong-Keun Yi j, Damon P Little y
PMCID: PMC2722355  PMID: 19666622

Abstract

DNA barcoding involves sequencing a standard region of DNA as a tool for species identification. However, there has been no agreement on which region(s) should be used for barcoding land plants. To provide a community recommendation on a standard plant barcode, we have compared the performance of 7 leading candidate plastid DNA regions (atpF–atpH spacer, matK gene, rbcL gene, rpoB gene, rpoC1 gene, psbK–psbI spacer, and trnH–psbA spacer). Based on assessments of recoverability, sequence quality, and levels of species discrimination, we recommend the 2-locus combination of rbcL+matK as the plant barcode. This core 2-locus barcode will provide a universal framework for the routine use of DNA sequence data to identify specimens and contribute toward the discovery of overlooked species of land plants.

Keywords: matK, rbcL, species identification


Large-scale standardized sequencing of the mitochondrial gene CO1 has made DNA barcoding an efficient species identification tool in many animal groups (1). In plants, however, low substitution rates of mitochondrial DNA have led to the search for alternative barcoding regions. From initial investigations of plastid regions (2–4), 7 leading candidates have emerged (5, 6). Four are portions of coding genes (matK, rbcL, rpoB, and rpoC1), and 3 are noncoding spacers (atpF–atpH, trnH–psbA, and psbK–psbI). Different research groups have proposed various combinations of these loci as their preferred plant barcodes, but no consensus has emerged (5–12). This lack of an agreed standard has impeded progress in plant barcoding.

Our aim here is to identify a standard DNA barcode for land plants. To achieve this goal, we have pooled data across laboratories including sequence data from 907 samples, representing 445 angiosperm, 38 gymnosperm, and 67 cryptogam species. Using various subsets of these data, we evaluated the 7 candidate loci using criteria in the Consortium for the Barcode of Life's (CBOL) data standards and guidelines for locus selection (http://www.barcoding.si.edu/protocols.html). Universality: Which loci can be routinely sequenced across the land plants? Sequence quality and coverage: Which loci are most amenable to the production of bidirectional sequences with few or no ambiguous base calls? Discrimination: Which loci enable most species to be distinguished?

Results

Universality.

Direct universality assessments using a single primer pair for each locus in angiosperms resulted in 90%–98% PCR and sequencing success for 6/7 regions. Success for the seventh region, psbK–psbI, was 77% (Fig. 1A). Greater problems were encountered in other land plant groups, with rpoB, matK, atpF–atpH, and psbK–psbI all showing <50% success in gymnosperms and/or cryptogams based on data compiled from several laboratories (Fig. 1A).

Fig. 1.

Comparison of the performance of 7 candidate barcoding loci (see locus codes at head of Fig. 1A). (A) Universality success based on 170 angiosperm samples compared under similar conditions, and community-wide data for up to 81 gymnosperm and 156 cryptogam samples. (B) Assessment of sequence quality calculated as the percentage of 190 seed plant samples from which high quality bidirectional sequences (contigs) could be assembled (see Materials and Methods for trace-quality criteria), plotted against the percentage species discrimination for single-locus barcodes. 95% confidence intervals are indicated. Colors reflect sequence quality (red, worse; green, better). (C) Discrimination success for 1–3 and 7 locus barcodes for species for which multiple individuals from multiple congeneric species were sampled, and all 7 loci were recovered. Outer error bars (thin lines) demarcate 95% confidence intervals. Inner error bars (thick lines) indicate the relative magnitude of discrimination failure as measured by the interquartile range (IQR) for the number of species that are indistinguishable from a given query sequence. Discrimination success from all 7 loci is shown with a white line, with the associated 95% confidence interval in light gray, and the magnitude of discrimination failure in dark gray. Colors indicate the average percentage of finished bidirectional sequences expected for each locus combination. The arrow indicates the recommended standard 2-locus barcode.

Sequence Quality.

Evaluation of sequence quality and coverage from the candidate loci demonstrated that high quality bidirectional sequences were routinely obtained from rbcL, rpoC1, and rpoB (Fig. 1B, x axis). The remaining 4 loci required more manual editing and produced fewer bidirectional reads. matK performed best of this group, although it showed discordance between forward and reverse reads more frequently than other coding regions. The greatest problems in obtaining bidirectional sequences with few ambiguous bases were encountered with the intergenic spacers trnH–psbA and psbK–psbI, in part attributable to a high frequency of mononucleotide repeats disrupting individual sequencing reads.

Species Discrimination.

Among 397 samples successfully sequenced for all 7 loci, species discrimination for single-locus barcodes ranged from 43% (rpoC1) to 68%–69% (psbK–psbI and trnH–psbA), with rbcL and matK providing 61% and 66% discrimination respectively (rank order: rpoC1<rpoB<atpF–atpH<rbcL<matK<psbK–psbI<trnH–psbA; Fig. 1B, y axis). Two-locus combinations gave 59%–75% resolution, and 3-locus combinations 65%–76% (Fig. 1C). Ten of the 2-locus combinations gave 70%–75% discrimination. The top 5 of these involved various combinations of rbcL, psbK–psbI, matK, and trnH–psbA. Using all 7 loci, 73% of species were discriminated. When the species discrimination analyses are extended to the full sample, which includes those that failed to sequence for 1 or more loci, the rank order among single-locus comparisons is rpoC1 (38%), rpoB (40%), atpF–atpH (50%), matK (57%), rbcL (58%), trnH–psbA (58%), and psbK–psbI (64%). The rise in relative performance of rbcL is associated with its strong (87%) discriminatory power in the cryptogam samples. These were excluded from the preceding analyses as all had missing data from 1 or more loci.
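The 95% confidence intervals attached to these discrimination rates are described in Materials and Methods only as being calculated from the binomial distribution. As a minimal sketch of one way such an interval can be obtained, the code below computes an exact (Clopper-Pearson) interval by bisection. The choice of the Clopper-Pearson method, the function names, and the counts in the usage lines are assumptions made for illustration and are not taken from the study.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_decreasing(f, tol=1e-10):
    """Root of a function that decreases from positive to negative on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    if successes == 0:
        lower = 0.0
    else:
        # Lower bound: p at which P(X >= successes) equals alpha / 2.
        lower = _bisect_decreasing(
            lambda p: alpha / 2 - (1 - binom_cdf(successes - 1, n, p))
        )
    if successes == n:
        upper = 1.0
    else:
        # Upper bound: p at which P(X <= successes) equals alpha / 2.
        upper = _bisect_decreasing(
            lambda p: binom_cdf(successes, n, p) - alpha / 2
        )
    return lower, upper


# Hypothetical counts, for illustration only (not data from this study).
low, high = clopper_pearson(successes=70, n=100)
print(f"70/100 discriminated: 95% CI {low:.3f} to {high:.3f}")
```

Two loci or locus combinations whose intervals overlap substantially would not be regarded as differing significantly under this kind of comparison.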

Discussion

An ideal DNA barcode should be routinely retrievable with a single primer pair, be amenable to bidirectional sequencing with little requirement for manual editing of sequence traces, and provide maximal discrimination among species. Based on these criteria, 4 of the candidate loci can be excluded (Fig. 1 A and B). Both rpoC1 and rpoB performed well in terms of universality and/or sequence quality, but had low discriminatory power; atpF–atpH fell below the median for species resolution in single and multilocus barcodes and for recovery of high-quality bidirectional sequences; whereas psbK–psbI showed good discriminatory power, but had the lowest sequencing success in these trials, and substantial problems generating bidirectional reads.

Choosing a plant barcode from the 3 remaining candidate loci was more difficult. Individually, trnH–psbA, rbcL, and matK possess attributes that are highly desirable in a plant DNA barcoding system, although none of the 3 loci fits all 3 criteria perfectly. As reported elsewhere (7), trnH–psbA demonstrated good amplification across land plants with a single pair of primers (93% for angiosperms; Fig. 1A) and high levels of species discrimination. However, problems obtaining high quality bidirectional sequences are the primary limitation for this locus. In addition, trnH–psbA has a median length of 418 bp (IQR = 296–500 bp) in the dataset examined here, which is well-suited for DNA barcoding, but its upper length of >1,000 bp in some monocot (3) and conifer (11) species can lead to problems obtaining bidirectional sequences without using taxon-specific internal sequencing primers.

Among plastid regions, rbcL is the best characterized gene. Improvements in primer design make it easily retrievable across land plants (8) and it is well suited for recovery of high-quality bidirectional sequences. Although not the most variable region (Fig. 1B), it is a frequent component of the best performing multi-locus combinations for species discrimination (Fig. 1C).

matK is one of the most rapidly evolving plastid coding regions and it consistently showed high levels of discrimination among angiosperm species (Fig. 1C) (8, 9). Mixed reports have been published regarding the universality of matK primers, ranging from routine success (9) to more patchy recovery (7, 8), which has led to reservations about this locus by some researchers. In the current study, 90% of the angiosperm samples tested were successfully amplified and sequenced using a single primer pair (Fig. 1A). Success in gymnosperms (83%) and particularly cryptogams (10%) was more limited, even when multiple primer sets were used.

In summary, rbcL offers high universality and good, but not outstanding, discriminating power, whereas matK and trnH–psbA offer higher resolution, but each requires further development work. Primer universality needs improvement for matK in some clades, and trnH–psbA does not consistently provide bidirectional unambiguous sequences, often requiring manual editing of sequence traces. Thus, no single locus meets CBOL's data standards and guidelines for locus selection, and as a result a synergistic combination of loci is required.

One option preferred by some researchers in the CBOL Plant Working Group was a 3-locus barcode of matK+rbcL+trnH–psbA, to allow further testing of these loci. Based on the relative performance of the 3 loci, the best 2-locus barcode could be selected at a later date. The majority preference, however, was to select a 2-locus barcode to (a) avoid the increased costs of sequencing 3 loci rather than 2 in very large sample sets, and (b) prevent further delays in implementing a standard barcode for land plants. In the datasets examined here, sequencing 3 loci did not improve discrimination beyond the best performing 2-locus barcodes.

Among the 2-locus barcode combinations, rbcL+matK was the majority choice for several reasons. High-quality sequences of rbcL are easily retrievable across phylogenetically divergent lineages, and it performs well in discrimination tests in combination with other loci. Developing amplification strategies for matK was considered an investment with better prospects for return than solving the problem of sequence quality in trnH–psbA caused by mononucleotide repeats (13). Recent primer development for matK has improved its recovery from angiosperms, and so prospects for further improvement in angiosperms and other land plant groups seem reasonable, analogous to the extensive improvements made to primer sets for CO1 for animal DNA barcoding (14).

We therefore propose rbcL+matK as the standard barcode for land plants. This combination represents a pragmatic solution to a complex trade-off between universality, sequence quality, discrimination, and cost. Using rbcL+matK in the sample set examined here, species discrimination was successful in 72% of cases, with the remaining species being matched to groups of congeneric species with 100% success. Given the logistical difficulties of undertaking identifications among the ≈400,000 species of land plants, this 2-locus barcode offers the opportunity to harness high-throughput automated sequencing technologies to establish a powerful universal framework for DNA-based identification of plants.

The unique identification to species level in 72% of cases and to ‘species groups’ in the remainder will be useful for many applications of DNA barcoding such as studies of plant-animal interactions (15), establishing whether plant products in international trade belong to protected species (9, 16, 17), discriminating among seedlings to establish forest regeneration dynamics, or undertaking large-scale biodiversity surveys with limited access to taxonomic expertise. A particular strength of the barcoding approach is that these identifications can be made with small amounts of tissue from sterile, juvenile or fragmentary materials from which morphological identifications are difficult or impossible (18). In addition, it is important to emphasize that the discriminatory power of this standard barcode will be higher in situations that involve geographically restricted sample sets, such as studies focusing on the plant biodiversity of a given region or local area (19, 20).

A future challenge for DNA barcoding in plants is to increase the proportion of cases in which unique species identifications are achieved. In the short term, where further resolution and universality are required, we envisage that the core rbcL+matK barcode will be augmented in individual projects from a flexible short-list of supplementary loci including the noncoding plastid regions examined here (trnH–psbA, atpF–atpH, and psbK–psbI), and the trnL intron which has been advocated for situations involving highly degraded tissue (19). The rapidly evolving internal transcribed spacers of nuclear ribosomal DNA also represent a useful supplementary barcode in taxonomic groups in which direct sequencing of this locus is possible (21). Moving beyond these currently available supplementary barcodes, ongoing advances in sequencing technologies and the concomitant accumulation of genomic and transcriptomic sequence data from plants will greatly increase opportunities for targeting the nuclear genome as a source of informative characters.

There is little doubt that the approaches used in plant DNA barcoding will be refined in future (22). However, the key foundation step for plant barcoding is in reaching agreement on a standard set of loci to enable large-scale sequencing and the development of a global plant barcoding infrastructure. The broad community agreement presented here, to sequence rbcL and matK as a standard 2-locus barcode, is thus an important step in establishing a centralized plant barcode database as a tool for taxonomy, conservation, and the multitude of other applications (23) that require identification of plant material.

Materials and Methods

Plant Materials.

We used a total of 907 samples from 550 species representing the major lineages of land plants (including 670/445 angiosperm, 81/38 gymnosperm, and 156/67 cryptogam samples/species) to evaluate the candidate barcoding loci (Fig. S1, Fig. S2, and Table S1; cryptogams are defined here as all non–seed bearing embryophytes).

Universality.

To provide directly comparable information on universality and trace quality (see below), we generated de novo sequence data from 190 samples (including 170 angiosperms) at the Canadian Centre for DNA Barcoding (CCDB), University of Guelph, using a single primer pair per locus (Table S1). We used this dataset to quantify universality in angiosperms. As amplification and sequencing success is typically lower in nonangiosperm land plants, which often require different primer sets, we compiled existing data on amplification and sequencing success from different laboratories as an indicator of success for these groups (n = 81 for gymnosperms; n = 156 for cryptogams; Table S1). Our assessments of universality simply record whether sequence data were obtained, regardless of the amount of manual trace editing required or the extent of read bidirectionality. Full details of molecular methods are available from the corresponding author on request.

Sequence Quality and Coverage.

To assess suitability for bidirectional sequencing with minimal requirement for manual editing of sequences, we examined the quality of the de novo generated sequence traces via the CCDB automated informatics pipeline. Using a window size of 20 bp, segments with >2 bp showing <20 QV were trimmed. The amount of high-quality sequence data recovered was defined such that both the forward and reverse reads should have a minimum length of 100 bp, a minimum average QV of 30, and the post-trim lengths should be >50% of the original read length; the assembled contig should have >50% overlap in the alignment of the forward and reverse reads with <1% low-quality bases (<20QV) and <1% internal gaps and substitutions when aligning the forward and reverse reads. These quality control criteria were selected as a pragmatic set of thresholds to discriminate higher quality sequences from lower quality sequences. Various permutations of the parameters resulted in the same general conclusions (rbcL, rpoC1, and rpoB performed well, matK was intermediate, and fewer high-quality bidirectional sequences were obtained from trnH–psbA, psbK–psbI, and atpF–atpH).
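The thresholds above can be sketched as a simple filter. This is an illustrative reading of the stated criteria rather than the CCDB pipeline itself: the end-trimming strategy, the function names (`trim_read`, `read_passes`, `contig_passes`), and the fields of the `Contig` record are all assumptions, and the contig statistics are taken as precomputed inputs.

```python
from dataclasses import dataclass
from typing import List

WINDOW = 20              # window size in bp
LOW_QV = 20              # bases below this quality value count as low quality
MAX_LOW_PER_WINDOW = 2   # windows with more low-quality bases are trimmed


def trim_read(qv: List[int]) -> List[int]:
    """Trim both ends until a window with at most 2 low-quality bases is reached.

    Assumed strategy: the text does not say how trimmed segments are located.
    """
    def window_ok(start: int) -> bool:
        window = qv[start:start + WINDOW]
        return sum(q < LOW_QV for q in window) <= MAX_LOW_PER_WINDOW

    n = len(qv)
    start = 0
    while start + WINDOW <= n and not window_ok(start):
        start += 1
    if start + WINDOW > n:
        return []  # no acceptable window anywhere in the read
    end = n
    while end - WINDOW > start and not window_ok(end - WINDOW):
        end -= 1
    return qv[start:end]


def read_passes(original: List[int], trimmed: List[int]) -> bool:
    """Per-read criteria: length >= 100 bp, mean QV >= 30, > 50% of read retained."""
    if len(trimmed) < 100:
        return False
    mean_qv = sum(trimmed) / len(trimmed)
    return mean_qv >= 30 and len(trimmed) > 0.5 * len(original)


@dataclass
class Contig:
    overlap_fraction: float      # forward/reverse overlap in the alignment
    low_quality_fraction: float  # fraction of bases with QV < 20
    mismatch_fraction: float     # internal gaps and substitutions between reads


def contig_passes(c: Contig) -> bool:
    """Contig criteria: > 50% overlap, < 1% low-quality bases, < 1% disagreements."""
    return (c.overlap_fraction > 0.5
            and c.low_quality_fraction < 0.01
            and c.mismatch_fraction < 0.01)
```

Keeping the thresholds as named constants makes it straightforward to rerun the filter under other parameter settings, in the spirit of the permutations mentioned above.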

Discrimination.

To evaluate species discrimination we focused on samples from which all 7 loci were successfully sequenced (397 samples, all seed plants). We restricted assessment of discrimination success to species where multiple individuals were sampled from multiple congeneric species (259 samples of 95 species from 34 genera). Although not counted in the discrimination success statistics, a further 104 singleton-sampled species congeneric with the above, and 34 singleton-sampled species from 21 other genera were included to serve as potential sources of discrimination failure. Using the same samples for all 7 loci allowed us to directly compare the relative discriminatory power of the different loci. We considered discrimination as successful if the minimum uncorrected interspecific p-distance involving a species was larger than its maximum intraspecific distance [all distances were calculated from pairwise global alignments counting unambiguous base substitutions only (24)]. We evaluated species discrimination for multiple loci by summing the components of the distance measure for all possible 2–7 locus combinations and recording the success of each multi-locus combination. We used the binomial distribution to calculate 95% confidence intervals to establish whether performance differences between loci and locus combinations were statistically significant. Species discrimination assessments were then repeated on a dataset of 907 individuals/550 species that included samples successfully sequenced for some, but not all loci. Multi-locus combinations were not evaluated in this dataset because of large numbers of zero-distances introduced by individuals being represented by mutually exclusive loci.
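The discrimination criterion can be illustrated with a short sketch. It assumes that sequences have already been aligned to equal length, that only sites where both sequences carry an unambiguous base (A, C, G, or T) are compared, and that "summing the components of the distance measure" means adding the substitution counts and the numbers of compared sites across loci before dividing. These are interpretations, not details confirmed by the text, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

BASES = set("ACGT")
Sample = Tuple[str, Dict[str, str]]  # (species name, {locus: aligned sequence})


def distance_components(a: str, b: str) -> Tuple[int, int]:
    """Return (substitutions, sites compared) for two aligned sequences."""
    diffs = sites = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in BASES and y in BASES:  # skip gaps and ambiguity codes
            sites += 1
            diffs += int(x != y)
    return diffs, sites


def p_distance(s1: Dict[str, str], s2: Dict[str, str], loci: List[str]) -> float:
    """Uncorrected p-distance with components summed over the chosen loci."""
    diffs = sites = 0
    for locus in loci:
        d, n = distance_components(s1[locus], s2[locus])
        diffs += d
        sites += n
    return diffs / sites if sites else 0.0


def discriminated_species(samples: List[Sample], loci: List[str]) -> Dict[str, bool]:
    """Success if minimum interspecific distance exceeds maximum intraspecific distance."""
    result = {}
    for sp in sorted({name for name, _ in samples}):
        own = [seqs for name, seqs in samples if name == sp]
        others = [seqs for name, seqs in samples if name != sp]
        if len(own) < 2 or not others:
            continue  # singletons are not scored; they only act as potential confounders
        max_intra = max(p_distance(a, b, loci) for a, b in combinations(own, 2))
        min_inter = min(p_distance(a, b, loci) for a in own for b in others)
        result[sp] = min_inter > max_intra
    return result


# Toy alignment with hypothetical species and sequences, for illustration only.
toy = [
    ("sp_A", {"rbcL": "ACGTACGT", "matK": "AAAACCCC"}),
    ("sp_A", {"rbcL": "ACGTACGT", "matK": "AAAACCCC"}),
    ("sp_B", {"rbcL": "ACGTACGT", "matK": "AAAACCTT"}),
    ("sp_B", {"rbcL": "ACGTACGT", "matK": "AAAACCTT"}),
]
print(discriminated_species(toy, ["rbcL"]))
print(discriminated_species(toy, ["rbcL", "matK"]))
```

In this toy case the two species share an identical rbcL sequence and are separated only once matK is added, which mirrors how a second locus can raise discrimination in a multi-locus barcode.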

Acknowledgments.

We thank David Schindel for comments, Sergey Sheetlin for data formatting, and George Weiblen for plant material. This work was supported by the Alfred P. Sloan Foundation, Gordon and Betty Moore Foundation, Genome Canada, Scottish Government's Rural and Environment Research and Analysis Directorate, Royal Society, South African National Research Foundation, Intramural Research Program of the National Library of Medicine, National Institutes of Health, and Consortium for the Barcode of Life.

Footnotes

Conflict of interest statement: Following the publication of Lahaye et al. (PNAS 105:2923, 2008), the process of filing a patent on DNA barcoding of land plants using matK was initiated by V.S., M.v.d.B., R.L., and D.B., but because of the lack of commercial interest the patent application was subsequently dropped.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database. For a list of accession numbers, see SI Table 1. FASTA files of sequences are available on request.

See Commentary on page 12569.

This article contains supporting information online at www.pnas.org/cgi/content/full/0905845106/DCSupplemental.

References

  • 1. Hebert PDN, Cywinska A, Ball SL, deWaard JR. Biological identifications through DNA barcodes. Proc R Soc Biol Sci SerB. 2003;270:313–321. doi: 10.1098/rspb.2002.2218.
  • 2. Kress WJ, Wurdack KJ, Zimmer EA, Weigt LA, Janzen DH. Use of DNA barcodes to identify flowering plants. Proc Natl Acad Sci USA. 2005;102:8369–8374. doi: 10.1073/pnas.0503123102.
  • 3. Chase MW, et al. A proposal for a standardised protocol to barcode all land plants. Taxon. 2007;56:295–299.
  • 4. Ford CS, et al. Selection of candidate DNA barcoding regions for use on land plants. Bot J Linn Soc. 2009;159:1–11.
  • 5. Pennisi E. Taxonomy. Wanted: A barcode for plants. Science. 2007;318:190–191. doi: 10.1126/science.318.5848.190.
  • 6. Ledford H. Botanical identities: DNA barcoding for plants comes a step closer. Nature. 2008;451:616. doi: 10.1038/451616b.
  • 7. Kress WJ, Erickson DL. A two-locus global DNA barcode for land plants: The coding rbcL gene complements the non-coding trnH-psbA spacer region. PLoS ONE. 2007;2:e508. doi: 10.1371/journal.pone.0000508.
  • 8. Fazekas AJ, et al. Multiple multilocus DNA barcodes from the plastid genome discriminate plant species equally well. PLoS ONE. 2008;3:e2802. doi: 10.1371/journal.pone.0002802.
  • 9. Lahaye R, et al. DNA barcoding the floras of biodiversity hotspots. Proc Natl Acad Sci USA. 2008;105:2923–2928. doi: 10.1073/pnas.0709936105.
  • 10. Erickson DL, Spouge J, Resch A, Weigt LA, Kress WJ. DNA barcoding in land plants: Developing standards to quantify and maximize success. Taxon. 2008;57:1304–1316.
  • 11. Hollingsworth ML, et al. Selecting barcoding loci for plants: Evaluation of seven candidate loci with species-level sampling in three divergent groups of land plants. Mol Ecol Res. 2009;9:439–457. doi: 10.1111/j.1755-0998.2008.02439.x.
  • 12. Seberg O, Petersen G. How many loci does it take to DNA barcode a crocus? PLoS ONE. 2009;4:e4598. doi: 10.1371/journal.pone.0004598.
  • 13. Devey DS, Chase MW, Clarkson JJ. A stuttering start to plant DNA barcoding: Microsatellites present a previously overlooked problem in non-coding plastid regions. Taxon. 2009;58:7–15.
  • 14. Ivanova NV, Zemlak TS, Hanner RH, Hebert PDN. Universal primer cocktails for fish DNA barcoding. Mol Ecol Notes. 2007;7:544–548.
  • 15. Jurado-Rivera JA, Vogler AP, Reid CAM, Petitpierre E, Gómez-Zurita J. DNA barcoding insect-host plant associations. Proc R Soc Biol Sci SerB. 2009;276:639–648. doi: 10.1098/rspb.2008.1264.
  • 16. Ogden R, et al. SNP-based method for the genetic identification of ramin Gonystylus spp. timber and products: Applied research meeting CITES enforcement needs. Endang Species Res. 2008. doi: 10.3354/esr00141.
  • 17. Little DP, Stevenson DW. A comparison of algorithms for the identification of specimens using DNA barcodes: Examples from gymnosperms. Cladistics. 2007;23:1–21. doi: 10.1111/j.1096-0031.2006.00126.x.
  • 18. Valentini A, Pompanon F, Taberlet P. DNA barcoding for ecologists. Trends Ecol Evol. 2008;24:110–117. doi: 10.1016/j.tree.2008.09.011.
  • 19. Taberlet P, et al. Power and limitations of the chloroplast trnL (UAA) intron for plant DNA barcoding. Nucleic Acids Res. 2006;35:e14. doi: 10.1093/nar/gkl938.
  • 20. Janzen DH. In: Plant conservation: A natural history approach. Krupnick G, Kress WJ, editors. Chicago: University of Chicago Press; 2005. pp. ix–xiii.
  • 21. Feliner GN, Rosselló JA. Better the devil you know? Guidelines for insightful utilization of nrDNA ITS in species-level evolutionary studies in plants. Mol Phylogenet Evol. 2007;44:911–919. doi: 10.1016/j.ympev.2007.01.013.
  • 22. Fazekas AJ, et al. Are plant species inherently harder to discriminate than animal species using DNA barcoding markers? Mol Ecol Res. 2009;9(Suppl 1):130–139. doi: 10.1111/j.1755-0998.2009.02652.x.
  • 23. Newmaster SG, Ragupathy S, Janovec J. A botanical renaissance: State-of-the-art DNA bar coding facilitates an Automated Identification Technology system for plants. Int J Comp Appl Tech. 2009;35:50–60.
  • 24. Edgar RC. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
