Data in Brief. 2016 Jun 18;9:345–348. doi: 10.1016/j.dib.2016.05.025

Data set for phylogenetic tree and RAMPAGE Ramachandran plot analysis of SODs in Gossypium raimondii and G. arboreum

Wei Wang, Minxuan Xia, Jie Chen, Fenni Deng, Rui Yuan, Xiaopei Zhang, Fafu Shen
PMCID: PMC5030311  PMID: 27672674

Abstract

The data presented in this paper support the research article “Genome-Wide Analysis of Superoxide Dismutase Gene Family in Gossypium raimondii and G. arboreum” [1]. In this data article, we present a phylogenetic tree showing a dichotomy with two different clusters of SODs inferred by the Bayesian method of MrBayes (version 3.2.4), “Bayesian phylogenetic inference under mixed models” [2]; Ramachandran plots of G. raimondii and G. arboreum SODs; the protein sequences used to generate the 3D structures of the proteins and the template accessions obtained via the SWISS-MODEL server, “SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information” [3]; and motif sequences of SODs identified by InterProScan (version 4.8) with the Pfam database, “Pfam: the protein families database” [4].

Keywords: Phylogenetic tree, RAMPAGE Ramachandran plot analysis, SOD, Cotton


Specifications Table

Subject area: Biology
More specific subject area: Genetics and Molecular Biology
Type of data: Figure
How data was acquired: Database analysis
Data format: Analyzed
Experimental factors: Amino acid sequences were retrieved from NCBI, TAIR10, Joint Genome Institute (JGI) and/or CottonGen.
Experimental features: Sequences were aligned using BLAST for Proteins (BLASTP); structural evaluation and stereochemical analyses were performed using RAMPAGE Ramachandran plot analysis.
Data accessibility: With this article

Value of the data

  • Data on phylogenies separately estimated using the Bayesian method of MrBayes enable researchers to examine how the topologies differ from each other.

  • Data on phylogenies of Gossypium SOD proteins enable researchers to infer the possible time frames of the divergence events of Gossypium SOD genes and their molecular evolution in general.

  • Data on RAMPAGE Ramachandran plot analysis of Gossypium SOD proteins enable researchers to evaluate the accuracy of the predicted models.

1. Data

The phylogenetic tree obtained using the Maximum-Likelihood (ML) method of PhyML (version 20120412) [5] and the 3D structures of SODs generated using the SWISS-MODEL server (http://swissmodel.expasy.org/) [3] and the online COACH server (http://zhanglab.ccmb.med.umich.edu/COACH/) [6] were presented in [1]. The data shown here comprise a phylogenetic tree showing a dichotomy with two different clusters of SODs (I: Cu/Zn-SODs; II: Mn/Fe-SODs) inferred by the Bayesian method of MrBayes (version 3.2.4) [2]. The Cu/Zn-SOD cluster had three subgroups (Ia–Ic), whereas the Mn/Fe-SOD cluster had two subgroups (IId and IIe) (Fig. 1). We evaluated the accuracy of the predicted models by Ramachandran plot analysis using the RAMPAGE server (http://mordred.bioc.cam.ac.uk/~rapper/rampage.php) [10]. The refined SOD models showed good proportions of residues in favoured, allowed and outlier regions (Appendix A). In-depth analyses of the data are presented in the associated research article [1].

2. Experimental design, materials and methods

2.1. Information access

The latest versions of the G. raimondii (V1.0) and G. arboreum (V2.0) genomes and annotation files were downloaded from CottonGen (https://www.cottongen.org/data/genome). The latest version of the Arabidopsis (TAIR10) genome and annotation files were downloaded from the Joint Genome Institute (JGI) (http://www.phytozome.net).

2.2. Data filtering

We then filtered the gene annotation results based on the following criteria [7]: (1) the longest transcript at each gene locus was chosen to represent that locus; (2) coding sequences (CDS) shorter than 150 base pairs (bp) were filtered out; (3) CDS in which ambiguous nucleotides (‘N’) made up >50% of the sequence were filtered out; (4) CDS with an internal termination codon were filtered out; and (5) CDS with hits (Basic Local Alignment Search Tool (BLAST) identity ≥80%) to RepBase sequences (http://www.girinst.org/repbase/index.html) were filtered out.
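The sequence-level criteria (2) to (4) can be sketched in Python as follows. This is a minimal illustration and not the authors' pipeline: the function name `passes_cds_filters`, its parameter names, and the use of the standard stop codons (TAA, TAG, TGA) are assumptions. Criteria (1) and (5) depend on annotation files and BLAST results and are not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def passes_cds_filters(cds, min_len=150, max_n_frac=0.5):
    """Return True if a CDS survives criteria (2)-(4) of the filtering step."""
    seq = cds.upper()
    # (2) discard CDS shorter than 150 bp
    if len(seq) < min_len:
        return False
    # (3) discard CDS in which more than 50% of bases are ambiguous ('N')
    if seq.count("N") / len(seq) > max_n_frac:
        return False
    # (4) discard CDS with an in-frame stop codon before the final codon
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return False
    return True
```

The final codon is excluded from check (4) so that a normal terminal stop codon does not cause a sequence to be rejected.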

2.3. Identification of SOD protein

To identify members of the SOD protein family in G. raimondii and G. arboreum, we retrieved SOD protein sequences from the NCBI protein database (http://www.ncbi.nlm.nih.gov/protein/). These protein sequences, from six species including Arabidopsis (accession nos. NP_172360.1, NP_565666.1, NP_197311.1, NP_199923.1, NP_197722.1 and NP_187703.1), Theobroma cacao (XP_007030135.1 and XP_007038205.1), G. hirsutum (ABA00453.1, ACC93639.1, ABA00454.1, ABA00456.1 and ABA00455.1), Po. trichocarpa (XP_002319589.1 and XP_002325843.1), Z. mays (NP_001105704.1, BAI50563.1, ACG41865.1, ACG32380.1 and NP_001105742.1) and O. sativa (AAA33917.1, BAD09607.1, BAA37131.1 and NP_001055195.1), were used as query sequences to perform multiple database searches using BLAST for Proteins (BLASTP) [8]. After removing alignments with identity <50%, the resultant candidate SOD proteins were aligned to each other to ensure that no gene was represented multiple times. InterProScan (version 4.8) [9] was then used with the Pfam database to confirm the presence of the SOD domain in each candidate sequence. Finally, we gathered the SOD protein sequences, the template accessions and the motif sequences.
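The identity cut-off can be illustrated with a short Python sketch. It assumes that the BLASTP hits are available in BLAST+ tabular form (`-outfmt 6`), where the first three columns are query ID, subject ID and percent identity; the article does not state which output format was used. The function names are hypothetical, and collapsing hits by subject ID is a simplification of the all-against-all alignment described above.

```python
def filter_hits_by_identity(lines, min_identity=50.0):
    """Keep tabular BLASTP hits whose percent identity is at least min_identity."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id, pident = fields[0], fields[1], float(fields[2])
        if pident >= min_identity:  # alignments with identity <50% are removed
            kept.append((query_id, subject_id, pident))
    return kept


def candidate_ids(hits):
    """Collapse hits to unique subject IDs so that no gene is counted twice."""
    return sorted({subject_id for _, subject_id, _ in hits})
```

For example, if two query sequences both hit the same made-up subject `gene_A` with identities of 82.5% and 61.0%, and a third hit to `gene_B` has 43.2% identity, only `gene_A` remains as a candidate.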

2.4. Construction of phylogenetic trees

Phylogenetic trees were constructed by Bayesian analysis using MrBayes (version 3.2.4) [2] with the GTR+I+gamma substitution model. The Markov chain Monte Carlo process ran for 5,000,000 iterations, sampling every 500 iterations, which yielded 10,000 samples, of which the first 25% were discarded as burn-in. All other parameters were left at their default settings.
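The sampling arithmetic behind these settings can be checked with a few lines of Python. The helper name `mcmc_sample_counts` is hypothetical, and the calculation follows the article's own count of 10,000 samples; it is a sanity check on the numbers and does not reproduce the MrBayes run itself.

```python
def mcmc_sample_counts(n_generations, sample_every, burnin_fraction):
    """Return (total, discarded, retained) sample counts for an MCMC run."""
    total = n_generations // sample_every
    discarded = int(total * burnin_fraction)
    retained = total - discarded
    return total, discarded, retained


total, discarded, retained = mcmc_sample_counts(5_000_000, 500, 0.25)
print(total, discarded, retained)  # 10000 2500 7500
```

With a 25% burn-in, 2,500 of the 10,000 samples are discarded and 7,500 remain for summarising the tree.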

2.5. Structural evaluation and stereochemical analysis

Structural evaluation and stereochemical analyses were performed using RAMPAGE Ramachandran plot analysis (http://mordred.bioc.cam.ac.uk/~rapper/rampage.php) [10].
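A Ramachandran plot places each residue according to its backbone dihedral angles φ and ψ. The following Python sketch shows how a single signed dihedral angle can be computed from four atomic coordinates. It is an illustration only: the function names are hypothetical, and the region boundaries that RAMPAGE uses to classify residues as favoured, allowed or outlier are not reproduced here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    b1_len = math.sqrt(_dot(b1, b1))
    b1_unit = tuple(c / b1_len for c in b1)
    x = _dot(n1, n2)                    # proportional to cos(angle)
    y = _dot(b1_unit, _cross(n1, n2))   # proportional to sin(angle)
    return math.degrees(math.atan2(y, x))
```

For residue i, φ is the dihedral of C(i−1), N(i), Cα(i), C(i), and ψ is the dihedral of N(i), Cα(i), C(i), N(i+1), so calling `dihedral` on those coordinate tuples from a model gives the point to plot.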

Acknowledgements

This research was mainly supported by the China Major Projects for Transgenic Breeding (Grant Nos. 2011ZX08005-004 and 2011ZX08005-002) and the China Key Development Project for Basic Research (973) (Grant No. 2010CB12606).

Footnotes

Transparency document

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.dib.2016.05.025.

Appendix A

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.dib.2016.05.025.

Transparency document. Supplementary material

mmc1.doc (25.5KB, doc)

Appendix A. Supplementary material

mmc2.zip (398.1KB, zip)

mmc3.zip (136.1KB, zip)

mmc4.zip (139.3KB, zip)

References

  • 1. Wang W., Xia M., Chen J., Deng F., Yuan R., Zhang X., Shen F. Genome-wide analysis of superoxide dismutase gene family in Gossypium raimondii and G. arboreum. Plant Gene. 2016;6:18–29.
  • 2. Ronquist F., Huelsenbeck J.P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574. doi: 10.1093/bioinformatics/btg180.
  • 3. Biasini M., Bienert S., Waterhouse A., Arnold K., Studer G., Schmidt T., Kiefer F., Cassarino T.G., Bertoni M., Bordoli L., Schwede T. SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res. 2014;42:W252–W258. doi: 10.1093/nar/gku340.
  • 4. Finn R.D., Bateman A., Clements J., Coggill P., Eberhardt R.Y., Eddy S.R., Heger A., Hetherington K., Holm L., Mistry J., Sonnhammer E.L., Tate J., Punta M. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–D230. doi: 10.1093/nar/gkt1223.
  • 5. Guindon S., Dufayard J.-F., Lefort V., Anisimova M., Hordijk W., Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 2010;59:307–321. doi: 10.1093/sysbio/syq010.
  • 6. Yang J., Roy A., Zhang Y. BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions. Nucleic Acids Res. 2013;41:D1096–D1103. doi: 10.1093/nar/gks966.
  • 7. Ma T., Wang J., Zhou G., Yue Z., Hu Q., Chen Y., Liu B., Qiu Q., Wang Z., Zhang J., Wang K., Jiang D., Gou C., Yu L., Zhan D., Zhou R., Luo W., Ma H., Yang Y., Pan S., Fang D., Luo Y., Wang X., Wang G., Wang J., Wang Q., Lu X., Chen Z., Liu J., Lu Y., Yin Y., Yang H., Abbott R.J., Wu Y., Wan D., Li J., Yin T., Lascoux M., DiFazio S.P., Tuskan G.A., Wang J., Liu J. Genomic insights into salt adaptation in a desert poplar. Nat. Commun. 2013;4. doi: 10.1038/ncomms3797.
  • 8. Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., Madden T.L. BLAST+: architecture and applications. BMC Bioinforma. 2009;10. doi: 10.1186/1471-2105-10-421.
  • 9. Quevillon E., Silventoinen V., Pillai S., Harte N., Mulder N., Apweiler R., Lopez R. InterProScan: protein domains identifier. Nucleic Acids Res. 2005;33:W116–W120. doi: 10.1093/nar/gki442.
  • 10. Lovell S.C., Davis I.W., Arendall W.B., de Bakker P.I.W., Word J.M., Prisant M.G., Richardson J.S., Richardson D.C. Structure validation by Cα geometry: ϕ, ψ and Cβ deviation. Proteins: Struct., Funct., Bioinforma. 2003;50:437–450. doi: 10.1002/prot.10286.


Articles from Data in Brief are provided here courtesy of Elsevier
