Abstract
Thousands of plasmid sequences are now publicly available in the NCBI nucleotide database, but they are not reliably annotated to distinguish complete plasmids from plasmid fragments, such as gene or contig sequences; therefore, retrieving complete plasmids for downstream analyses is challenging. Here we present a curated dataset of complete bacterial plasmids from the clinically relevant Enterobacteriaceae family. The dataset was compiled from the NCBI nucleotide database using curation steps designed to exclude incomplete plasmid sequences and chromosomal sequences misannotated as plasmids. The curated dataset includes over 2000 complete plasmid sequences. Protein sequences produced by translating each complete plasmid nucleotide sequence in all six reading frames are also provided. Further analysis and discussion of the dataset are presented in an accompanying research article: “Ordering the mob: insights into replicon and MOB typing…” (Orlek et al., 2017) [1]. The curated plasmid sequences are publicly available in the Figshare repository.
Keywords: Plasmids, Sequence data curation, Complete genomes, Enterobacteriaceae family
Specifications Table
Subject area | Microbiology, Bioinformatics |
More specific subject area | Plasmids |
Type of data | Sequence data |
How data was acquired | Plasmid nucleotide sequences were compiled from GenBank and RefSeq accessions contained within the NCBI nucleotide database. Corresponding protein sequences were generated by translating each plasmid nucleotide sequence in all six reading frames. |
Data format | FASTA files, GenBank files (zipped) |
Experimental factors | N/A |
Experimental features | N/A |
Data source location | Sequences were retrieved from the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/); geographic location metadata was not retrieved. |
Data accessibility | Data is publicly available in the Figshare repository: https://figshare.com/s/18de8bdcbba47dbaba41 (DOI: 10.6084/m9.figshare.4609303). |
Value of the data
• To our knowledge, this is currently the only large curated dataset of complete plasmids, compiled according to well-defined, and transparently validated, inclusion and exclusion criteria.
• The data could be used to benchmark the performance of plasmid typing schemes [1].
• The data could be used for reference-based plasmid analyses [2]; for example, contigs could be queried against the curated plasmid sequences with the aim of distinguishing plasmid from chromosomal contigs [3] or assessing plasmid genetic content [4].
• The protein dataset is a useful resource for MOB typing [5]. Information about sequence conservation from aligned protein database sequences can be harnessed using more powerful profile-based homology searching [6], enabling improved MOB typing compared with standard protein BLAST. A bioinformatic protocol and code for MOB typing using the protein dataset are provided on GitHub (https://github.com/AlexOrlek/MOBtyping).
• Those interested in the epidemiology of plasmid-mediated antibiotic resistance in the Enterobacteriaceae family could use the data to extend previous analyses [1].
1. Data
The data consists of nucleotide sequences of 2097 complete Enterobacteriaceae plasmids, compiled from the NCBI nucleotide database (‘nucleotideseq.fa’). In addition, we provide a corresponding dataset of 12,582 protein sequences (‘translatedproteinseq.fa’), derived by translating each of the 2097 plasmid nucleotide sequences in all six reading frames (2097 × 6 = 12,582). Both datasets are formatted as FASTA files. Headers in the protein FASTA file have the following format: >accession id|strand|frame|protein sequence length. NCBI GenBank files, with detailed information on accessions, are also provided. One GenBank file contains the 2097 complete curated plasmid accessions (‘filtered_2097plasmids.gb.gz’). Another GenBank file contains the 6952 accessions (‘6952plasmids.gb.gz’) returned by the initial query, before duplicate sequences were removed and inclusion/exclusion criteria were applied.
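The six-frame translation and header layout described above can be illustrated with a minimal pure-Python sketch. It is not the authors' script (the methods cite Biopython). It assumes the standard genetic code, treats each sequence as linear (no wrap-around at the origin of a circular plasmid), keeps stop codons as `*`, and writes ambiguous codons as `X`. The strand labels (`+`/`-`), the frame numbering (1 to 3), and the function names `translate`, `six_frame_records` and `parse_header` are illustrative assumptions; the values actually used in ‘translatedproteinseq.fa’ may differ.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# amino acid assignments match those used for the published dataset).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def six_frame_records(accession, dna):
    """Return six (header, protein) pairs: three frames on each strand."""
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}  # hypothetical labels
    records = []
    for strand, seq in strands.items():
        for frame in (1, 2, 3):  # hypothetical frame numbering
            protein = translate(seq[frame - 1:])
            records.append((f">{accession}|{strand}|{frame}|{len(protein)}", protein))
    return records


def parse_header(header):
    """Split a header of the form >accession|strand|frame|length into fields."""
    accession, strand, frame, length = header.lstrip(">").split("|")
    return accession, strand, int(frame), int(length)
```

Calling `six_frame_records` once per plasmid yields six records per accession, which is consistent with 2097 plasmids giving 12,582 protein sequences.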
2. Experimental design, materials and methods
Putative complete plasmid accessions were retrieved from the NCBI nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/) on 26th August 2016, using an Entrez query with filters to exclude some incomplete or non-plasmid accessions at this stage. Following this initial query, duplicate sequences (those sharing 100% nucleotide sequence identity with another retrieved sequence) were removed. Biopython scripts [7] were used to filter out non-Enterobacteriaceae sequences. Regular expression searches of accession title descriptions were used to apply exclusion and inclusion criteria. Subsequent filtering involved conducting multi-locus sequence typing (MLST) to exclude chromosomal accessions misannotated as plasmids. In addition, the ‘completeness’ annotation (included as accession metadata in NCBI) was used to further exclude partial plasmid sequences. Finally, putative plasmids at the tails of the sequence length distribution were inspected manually to remove remaining accessions that represented chromosomal sequences or partial plasmid sequences. A more detailed description of these methods can be found in the accompanying research article [1].
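Three of the curation steps above (duplicate removal, title-based filtering, and selection of length outliers for manual inspection) can be sketched in plain Python. This is an illustration under stated assumptions, not the published pipeline: the regular expressions `EXCLUDE` and `INCLUDE` are hypothetical stand-ins for the actual criteria given in [1], duplicates are detected only as identical strings (rotated or reverse-complemented copies of a circular plasmid would be missed), the 5% tail fraction is arbitrary, and the accessions in the usage note are made-up toy records.

```python
import re

# Hypothetical patterns; the real inclusion/exclusion criteria are described in [1].
EXCLUDE = re.compile(r"\b(partial|gene|genes|cds|contig|scaffold)\b", re.IGNORECASE)
INCLUDE = re.compile(r"\bplasmid\b", re.IGNORECASE)


def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for accession, title, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((accession, title, seq))
    return unique


def passes_title_filter(title):
    """True if the title matches the inclusion pattern and no exclusion pattern."""
    return bool(INCLUDE.search(title)) and not EXCLUDE.search(title)


def flag_length_tails(records, fraction=0.05):
    """Return accessions of the shortest and longest records for manual review."""
    ranked = sorted(records, key=lambda r: len(r[2]))
    k = max(1, int(len(ranked) * fraction))
    tails = ranked[:k] + ranked[-k:]
    return list(dict.fromkeys(r[0] for r in tails))  # de-duplicate, keep order
```

For example, with toy records `[("A1", "Escherichia coli plasmid pX, complete sequence", "ACGTACGT"), ("A2", "Escherichia coli plasmid pX, complete sequence", "acgtacgt"), ("A3", "Klebsiella pneumoniae plasmid pY blaKPC gene, partial cds", "ACG")]`, `drop_exact_duplicates` discards A2, and `passes_title_filter` then rejects A3 because its title describes a gene fragment. Flagged tail accessions are only candidates; the decision to exclude them was made by manual inspection.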
Acknowledgements
The research was funded by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Healthcare Associated Infections and Antimicrobial Resistance at Oxford University in Partnership with Public Health England (PHE) [HPRU-2012-10041]. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, the Department of Health or Public Health England. NS is currently funded through an NIHR/University of Oxford Academic Clinical Lectureship. TP is an NIHR Senior Investigator.
Footnotes
Transparency data associated with this article can be found in the online version at doi:10.1016/j.dib.2017.04.024.
References
- 1. Orlek A., Phan H., Sheppard A.E., Doumith M., Ellington M., Peto T., Crook D., Walker A.S., Woodford N., Anjum M.F., Stoesser N. Ordering the mob: insights into replicon and MOB typing schemes from analysis of a curated dataset of publicly available plasmids. Plasmid. 2017;91:42–52. doi: 10.1016/j.plasmid.2017.03.002.
- 2. Orlek A., Stoesser N., Anjum M.F., Doumith M., Ellington M. Plasmid classification in an era of whole-genome sequencing: application in studies of antibiotic resistance epidemiology. Front. Microbiol. 2017. doi: 10.3389/fmicb.2017.00182.
- 3. Edwards D.J., Holt K.E. Beginner's guide to comparative bacterial genome analysis using next-generation sequence data. Microb. Inform. Exp. 2013;3:2. doi: 10.1186/2042-5783-3-2.
- 4. Zetner A., Cabral J., Mataseje L., Knox N., Mabon P., Mulvey M., Van Domselaar G. Plasmid profiler: comparative analysis of plasmid content in WGS data. bioRxiv. 2017.
- 5. Garcillán-Barcia M.P., Francia M.V., De La Cruz F. The diversity of conjugative relaxases and its application in plasmid classification. FEMS Microbiol. Rev. 2009;33:657–687. doi: 10.1111/j.1574-6976.2009.00168.x.
- 6. Chen J., Guo M., Wang X., Liu B. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief. Bioinform. 2016:1–14. doi: 10.1093/bib/bbw108.
- 7. Cock P.J.A., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B., De Hoon M.J.L. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163.