Data in Brief. 2017 Apr 23;12:423–426. doi: 10.1016/j.dib.2017.04.024

A curated dataset of complete Enterobacteriaceae plasmids compiled from the NCBI nucleotide database

Alex Orlek a,b, Hang Phan a,b, Anna E Sheppard a,b, Michel Doumith c, Matthew Ellington b,c, Tim Peto a,b, Derrick Crook a,b, A Sarah Walker a,b, Neil Woodford b,c,1, Muna F Anjum b,d,1, Nicole Stoesser a,1
PMCID: PMC5426034  PMID: 28516137

Abstract

Thousands of plasmid sequences are now publicly available in the NCBI nucleotide database, but they are not reliably annotated to distinguish complete plasmids from plasmid fragments, such as gene or contig sequences; retrieving complete plasmids for downstream analyses is therefore challenging. Here we present a curated dataset of complete bacterial plasmids from the clinically relevant Enterobacteriaceae family. The dataset was compiled from the NCBI nucleotide database using curation steps designed to exclude incomplete plasmid sequences and chromosomal sequences misannotated as plasmids. Over 2000 complete plasmid sequences are included in the curated plasmid dataset. Protein sequences produced by translating each complete plasmid nucleotide sequence in all 6 frames are also provided. Further analysis and discussion of the dataset are presented in an accompanying research article: “Ordering the mob: insights into replicon and MOB typing…” (Orlek et al., 2017) [1]. The curated plasmid sequences are publicly available in the Figshare repository.

Keywords: Plasmids, Sequence data curation, Complete genomes, Enterobacteriaceae family


Specifications Table

Subject area: Microbiology, Bioinformatics
More specific subject area: Plasmids
Type of data: Sequence data
How data was acquired: Plasmid nucleotide sequences were compiled from GenBank and RefSeq accessions contained within the NCBI nucleotide database. Corresponding protein sequences were generated by translating each plasmid nucleotide sequence in all 6 frames.
Data format: FASTA files, GenBank files (zipped)
Experimental factors: N/A
Experimental features: N/A
Data source location: Sequences were retrieved from the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/); geographic location metadata was not retrieved.
Data accessibility: Data is publicly available in the Figshare repository (https://figshare.com/s/18de8bdcbba47dbaba41). DOI: 10.6084/m9.figshare.4609303

Value of the data

  • To our knowledge, this is currently the only large curated dataset of complete plasmids, compiled according to well-defined, and transparently validated, inclusion and exclusion criteria.

  • The data could be used to benchmark the performance of plasmid typing schemes [1].

  • The data could be used for reference-based plasmid analyses [2]; for example, contigs could be queried against the curated plasmid sequences with the aim of distinguishing plasmid from chromosomal contigs [3] or assessing plasmid genetic content [4].

  • The protein dataset is a useful resource for MOB typing [5]. Information about sequence conservation from aligned protein database sequences can be harnessed using more powerful profile-based homology searching [6], enabling improved MOB typing compared with standard protein BLAST. A bioinformatic protocol and code for MOB typing using the protein dataset are provided on GitHub (https://github.com/AlexOrlek/MOBtyping).

  • Those interested in the epidemiology of plasmid-mediated antibiotic resistance in the Enterobacteriaceae family could use the data to extend previous analyses [1].
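To illustrate the reference-based use case mentioned in the list above (querying contigs against the curated plasmid sequences), the following is a toy sketch only. It swaps in a simple k-mer containment score as a stand-in for an alignment-based query such as BLAST; it is not the method used in the cited studies. The function names, the value of k, and the example sequences are all hypothetical, and reverse-complement matches are not handled.

```python
def kmer_set(sequence, k):
    """Return the set of all k-mers in a nucleotide sequence (upper-cased)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_containment(contig, reference_kmers, k):
    """Fraction of the contig's distinct k-mers that occur in the reference set.

    A high value suggests the contig resembles a sequence in the reference
    collection; this is a crude proxy, not a substitute for alignment.
    """
    contig_kmers = kmer_set(contig, k)
    if not contig_kmers:
        return 0.0
    return len(contig_kmers & reference_kmers) / len(contig_kmers)


# Hypothetical toy data: one short "reference plasmid" and two contigs.
K = 4
reference = kmer_set("ACGTACGTTAGC", K)
plasmid_like_score = kmer_containment("CGTACG", reference, K)
unrelated_score = kmer_containment("GGGGGG", reference, K)
```

In practice the reference k-mer set would be built from all sequences in 'nucleotideseq.fa', with a much larger k; the threshold for calling a contig plasmid-derived would need to be chosen and validated by the user.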

1. Data

The data consists of nucleotide sequences of 2097 complete Enterobacteriaceae plasmids, compiled from the NCBI nucleotide database (‘nucleotideseq.fa’). In addition, we provide a corresponding dataset of 12,582 protein sequences (‘translatedproteinseq.fa’), derived by translating each plasmid nucleotide sequence in all 6 frames. Nucleotide and protein sequence datasets are formatted as FASTA files. Headers in the protein FASTA file have the following format: >accession id|strand|frame|protein sequence length. NCBI GenBank files, with detailed information on accessions, are also provided. One GenBank file contains the 2097 complete curated plasmid accessions (‘filtered_2097plasmids.gb.gz’). Another GenBank file contains 6952 accessions (‘6952plasmids.gb.gz’), obtained using an initial query, prior to removing duplicate sequences or applying inclusion/exclusion criteria.
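To show how six-frame translation and the protein header format fit together, the following is a minimal sketch in plain Python, not the authors' code. It assumes whole-frame translation using the standard codon assignments, with stop codons written as '*' and unrecognised codons as 'X'. The strand labels '+' and '-' and the frame numbers 1 to 3 are hypothetical encodings chosen for the example; the actual field values in 'translatedproteinseq.fa' may differ.

```python
# Standard codon assignments, built in TCAG order (assumption: no special
# handling of alternative start codons or ambiguous bases).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of an upper-case nucleotide sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def translate(sequence):
    """Translate complete codons only; trailing 1 or 2 bases are ignored."""
    return "".join(
        CODON_TABLE.get(sequence[i:i + 3], "X")
        for i in range(0, len(sequence) - 2, 3)
    )


def six_frame_translate(accession, sequence):
    """Return six (header, protein) pairs, one per strand and frame."""
    sequence = sequence.upper()
    records = []
    for strand, template in (("+", sequence), ("-", reverse_complement(sequence))):
        for frame in (1, 2, 3):
            protein = translate(template[frame - 1:])
            header = f">{accession}|{strand}|{frame}|{len(protein)}"
            records.append((header, protein))
    return records


def parse_header(header):
    """Split a protein header into (accession, strand, frame, length)."""
    accession, strand, frame, length = header.lstrip(">").rsplit("|", 3)
    return accession, strand, int(frame), int(length)
```

Under these assumptions each plasmid yields six protein records, which is consistent with the dataset sizes reported above: 2097 × 6 = 12,582.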

2. Experimental design, materials and methods

Putative complete plasmid accessions were retrieved from the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/) on 26th August 2016, using an Entrez query with filters to exclude some incomplete or non-plasmid accessions at this stage. Following this initial query, duplicate sequences (those sharing 100% nucleotide sequence identity with another retrieved sequence) were removed. Biopython scripts [7] were used to filter out non-coding sequences. Regular expression searches of accession title descriptions were used to apply exclusion and inclusion criteria. Subsequent filtering involved conducting multi-locus sequence typing (MLST) to exclude chromosomal accessions misannotated as plasmids. In addition, the ‘completeness’ annotation (included as accession metadata in NCBI) was used to further exclude partial plasmid sequences. Finally, putative plasmids at the tails of the sequence length distribution were inspected manually to remove remaining accessions that represented chromosomal sequences or partial plasmid sequences. A more detailed description of these methods can be found in the accompanying research article [1].
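Two of the curation steps described above, duplicate removal and title-based filtering, can be sketched as follows. This is a hedged illustration in plain Python rather than the authors' Biopython scripts. It treats only exact, case-insensitive string matches as duplicates, which is a simplifying assumption (rotated or reverse-complemented copies of a circular plasmid would not be detected). The regular expressions, accession names, and titles are hypothetical; the actual inclusion and exclusion criteria are given in the accompanying research article [1].

```python
import hashlib
import re

# Hypothetical patterns, chosen only to illustrate the idea of title-based
# inclusion and exclusion; they are not the criteria used for the dataset.
INCLUDE_PATTERN = re.compile(r"\bplasmid\b", re.IGNORECASE)
EXCLUDE_PATTERN = re.compile(r"\b(partial|contig|gene|cds)\b", re.IGNORECASE)


def deduplicate(records):
    """Keep the first record for each distinct nucleotide sequence.

    Each record is an (accession, title, sequence) tuple. Sequences are
    compared via a SHA-256 digest of the upper-cased string.
    """
    seen = set()
    kept = []
    for accession, title, sequence in records:
        digest = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((accession, title, sequence))
    return kept


def passes_title_filter(title):
    """True if the title matches the inclusion and not the exclusion pattern."""
    return bool(INCLUDE_PATTERN.search(title)) and not EXCLUDE_PATTERN.search(title)


# Hypothetical example records.
example_records = [
    ("ACC_A", "Escherichia coli plasmid pA, complete sequence", "ACGT"),
    ("ACC_B", "Escherichia coli plasmid pB, complete sequence", "acgt"),
    ("ACC_C", "Klebsiella pneumoniae plasmid pC gene, partial cds", "GGCC"),
]
unique_records = deduplicate(example_records)
curated = [r for r in unique_records if passes_title_filter(r[1])]
```

Here ACC_B is dropped as an exact duplicate of ACC_A, and ACC_C is dropped because its title suggests a gene fragment. The later MLST, completeness-annotation, and manual inspection steps are not covered by this sketch.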

Acknowledgements

The research was funded by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Healthcare Associated Infections and Antimicrobial Resistance at Oxford University in partnership with Public Health England (PHE) [HPRU-2012-10041]. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, the Department of Health or Public Health England. NS is currently funded through an NIHR/University of Oxford Academic Clinical Lectureship. TP is an NIHR Senior Investigator.

Footnotes

Transparency document

Transparency data associated with this article can be found in the online version at doi:10.1016/j.dib.2017.04.024.

Transparency document. Supplementary material

mmc1.pdf (107.1KB, pdf)

References

  • 1. Orlek A., Phan H., Sheppard A.E., Doumith M., Ellington M., Peto T., Crook D., Walker A.S., Woodford N., Anjum M.F., Stoesser N. Ordering the mob: insights into replicon and MOB typing schemes from analysis of a curated dataset of publicly available plasmids. Plasmid. 2017;91:42–52. doi: 10.1016/j.plasmid.2017.03.002.
  • 2. Orlek A., Stoesser N., Anjum M.F., Doumith M., Ellington M. Plasmid classification in an era of whole-genome sequencing: application in studies of antibiotic resistance epidemiology. Front. Microbiol. 2017. doi: 10.3389/fmicb.2017.00182.
  • 3. Edwards D.J., Holt K.E. Beginner's guide to comparative bacterial genome analysis using next-generation sequence data. Microb. Inform. Exp. 2013;3:2. doi: 10.1186/2042-5783-3-2.
  • 4. Zetner A., Cabral J., Mataseje L., Knox N., Mabon P., Mulvey M., Van Domselaar G. Plasmid profiler: comparative analysis of plasmid content in WGS data. bioRxiv. 2017.
  • 5. Garcillán-Barcia M.P., Francia M.V., De La Cruz F. The diversity of conjugative relaxases and its application in plasmid classification. FEMS Microbiol. Rev. 2009;33:657–687. doi: 10.1111/j.1574-6976.2009.00168.x.
  • 6. Chen J., Guo M., Wang X., Liu B. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief. Bioinform. 2016:1–14. doi: 10.1093/bib/bbw108.
  • 7. Cock P.J.A., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B., De Hoon M.J.L. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163.

Articles from Data in Brief are provided here courtesy of Elsevier