Abstract
RepeatsDB 2.0 (URL: http://repeatsdb.bio.unipd.it/) is an update of the database of annotated tandem repeat protein structures. Repeat proteins are a widespread class of non-globular proteins carrying heterogeneous functions involved in several diseases. Here we provide a new version of RepeatsDB with an improved classification schema including high quality annotations for ∼5400 protein structures. RepeatsDB 2.0 features information on start and end positions for the repeat regions and units for all entries. The extensive growth of repeat unit characterization was made possible by applying the novel ReUPred annotation method over the entire Protein Data Bank, with data quality guaranteed by an extensive manual validation for >60% of the entries. The updated web interface includes a new search engine for complex queries and a fully re-designed entry page for a better overview of structural data. It is now possible to compare unit positions, together with secondary structure, fold information and Pfam domains. Moreover, a new classification level has been introduced on top of the existing scheme as an independent layer for sequence similarity relationships at 40%, 60% and 90% identity.
INTRODUCTION
Tandem repeat regions in proteins are characterized by a repeated sequence coding for a modular architecture, where structural modules are called ‘units’. Proteins with tandem repeats play important functional roles (1), are abundant in nature and related to major health threats (2–6). Detecting and annotating them appropriately may increase our understanding of mechanisms of pathogenicity (e.g. virulence factors (7)), allow the design of scaffold proteins for engineered ligand binding with multiple applications (e.g. cancer therapy (8)) and generally expand our knowledge of the function and structure of many proteins (e.g. the mineralocorticoid receptor (9)). It is widely accepted that structural and functional complexity in domains evolved through fusion, recombination, accretion and repetition of a very small set of elementary functions (10,11). Therefore, units in tandem repeat proteins represent a fundamental source of information to explain contemporary structural diversity and the physico-chemical properties of highly designable folds (12). However, identification of sequence periodicity is an extremely hard task, since repetitive proteins evolve quickly for two main reasons: the process of duplication originating new repeats is error-prone, and identical flanking repeat units have an intrinsic tendency to diverge (13). A number of structure-based methods for the identification of repeats have been developed to fill this gap (14–17). RepeatsDB (18) was proposed in 2014 as a database of repeat protein structures and a resource for high-quality repeat structure annotation. The data were collected with RAPHAEL (19), a state-of-the-art method for the detection of Protein Data Bank (PDB) (20) structures containing repeat regions. The entries were classified into repeat structural classes (21) and further divided into subclasses.
The five repeat classes are mainly distinguished by repeat unit length and general structural arrangement, and subclasses by the secondary structure assignment of the repeat unit. The shortest repeats, of one or two residues, form crystallites and are typically harmful or non-functional in natural organisms. No example of their structure is present in the PDB and consequently in RepeatsDB. Class II structures are fibrous proteins with very short units stabilized by interchain interactions, typically collagens and α-helical coiled coils, with various arrangements described in (22). Class III contains the most typical repeat examples, elongated structures where repetitive units require one another to maintain folding, e.g. β-, α/β- and α-solenoids. Class IV includes all closed repeat structures. Widespread across all types of organisms, this class includes the TIM-barrel and β-propeller subclasses. Both class III and IV contain units of length between 10 and 50 residues. Class V has unit lengths >40 residues and groups ‘beads on a string’ repeats, whose repeat units are large enough to fold independently.
All subclasses are characterized by a strong structural conservation in repeat units that is frequently not clearly reflected in sequence. This is the main reason why domain sequence databases such as Pfam (23) and SMART (24) fail to detect a large number of repeats (25). Indeed, most of the largest clusters of human sequence regions not covered by Pfam were found to be repeated (25,26). RepeatsDB was developed to fill this gap and provides the community with a high-quality resource of reliable datasets of repeat structures for various purposes. The first and most obvious goal achieved was to compare the structural classification of repeats with the sequence-based one (26). Other uses of RepeatsDB are the extraction of repeat datasets to discuss specific features (27,28), benchmarking both sequence- and structure-based repeat detection methods and discussing the role of proteins with repeats (17,29–33). The high-quality manually annotated set of RepeatsDB units (Structural Repeat Unit Library, SRUL) is exploited by ReUPred (34) to predict repeat unit position and classification. RepeatsDB 2.0 includes new annotations, an improved classification and a completely redesigned web server and interface, to guarantee availability of data and a better user experience in terms of database usability and look-and-feel.
DATABASE DESCRIPTION
RepeatsDB 2.0 data have been completely regenerated taking advantage of the new ReUPred predictor (34) for automatic detection of tandem repeat units. In the new database version all entries are annotated at the unit level, i.e. providing start and end position for each repeated segment, and classified at the subclass level. Compared to the old version, unit annotations have grown by more than an order of magnitude. A detailed description of the RepeatsDB annotation pipeline follows.
Data curation
The initial dataset for RepeatsDB is the entire PDB (20). Repeat candidates are extracted with RAPHAEL (19) and processed with ReUPred (34) to confirm the presence of repeat regions and provide detailed unit information. ReUPred is a predictor able to identify the position of repeated fragments by performing iterative structural alignments against a manually refined library of representative units. ReUPred is also able to assign the class and subclass by transferring this information from the unit library. The final dataset available in RepeatsDB 2.0 is the result of an iterative process where the ReUPred library has been refined manually multiple times to resolve conflicts, improve its ability to generalize and include newly discovered subclasses. At the end of the process, an extensive validation and refinement of the predictions has been carried out by expert visual inspection. More than 60% of the entries have been reviewed and five new subclasses created, three for class IV (closed structures) and two for class V (beads on a string).
Implementation
RepeatsDB was designed as a multi-tier architecture, with three modules managing data storage, processing and presentation, respectively. Data are stored in a MongoDB database, and processed with Node.js. The server is accessible through a web interface or programmatically exploiting a RESTful architecture. The web interface is designed using the Angular.js and Bootstrap frameworks. Dynamic and interactive elements of the entry page are developed using PV for structure visualization and Bio.js as sequence feature viewer, respectively. Both the database structure and Node.js server have been completely rewritten to improve efficiency and data reliability. Moreover, all data derived from third party resources have been processed and stored locally to prevent broken dependencies.
Innovations
Apart from the new annotation pipeline, several bugs have been fixed and many improvements have been introduced since the last RepeatsDB release. All positional annotations are now based on SIFTS (35), making them consistent with both PDB (20) and UniProt (36) references. The search engine has been completely redesigned. An intuitive search interface allows users to perform complex queries using logical operators and guides them through all possible search fields. A new classification level has been added to include evolutionary relationships among different repeat regions. An all-vs.-all alignment of the repeat regions allowed us to group them according to sequence similarity and to identify different repeat families. The new classification has been implemented as an independent layer on top of the existing structural features, and is available at three different identity thresholds (40%, 60% and 90%). The web interface allows users to navigate entry clusters, providing an overview of the representative sequences inside each structural subclass.
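The idea of grouping repeat regions at several identity thresholds can be illustrated with a minimal sketch. The text does not state which clustering algorithm RepeatsDB uses, so single-linkage clustering is an assumption here, and the region identifiers and identity values are hypothetical. Pairwise identities are taken as pre-computed input rather than derived from real alignments.

```python
from itertools import combinations


def cluster_by_identity(ids, identity, threshold):
    """Single-linkage clustering (an assumption, not the documented method):
    two regions share a cluster when they are connected by a chain of pairs
    whose sequence identity is >= threshold (percent)."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        # Missing pairs are treated as having no detectable similarity.
        pct = identity.get((a, b), identity.get((b, a), 0.0))
        if pct >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical regions and pairwise identities (percent), for illustration only.
regions = ["r1", "r2", "r3", "r4"]
pairwise = {
    ("r1", "r2"): 95.0,
    ("r1", "r3"): 62.0,
    ("r2", "r3"): 65.0,
    ("r3", "r4"): 45.0,
}

# One independent clustering layer per identity threshold.
layers = {t: cluster_by_identity(regions, pairwise, t) for t in (40, 60, 90)}
```

In this toy input the 40% layer merges all four regions, while the 90% layer keeps only the near-identical pair together, which mirrors how stricter thresholds yield finer families.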
DATABASE USAGE
The home page presents an intuitive summary table providing direct access to all entries by structural class. For a finer search, the user can visit either the ‘Browse’ page providing subclass access or the ‘Search’ page for generating complex queries (Figure 1A and B). All entry points redirect to the same result page listing the retrieved proteins in a table (Figure 1C). The table can be further filtered by providing additional matching strings in the column headers. The ‘Browse’ page also provides direct access to sequence clusters, where entries are grouped by sequence similarity. The redesigned entry page (Figure 2) is much more informative compared to the previous RepeatsDB version, including several cross-links to third party resources. It also integrates several structural features useful for comparing CATH, SCOP, Pfam and DSSP annotations with RepeatsDB data. Regions, units and insertions are provided for all entries and correctly mapped both to UniProt and PDB reference (SEQRES field in the PDB file) sequences thanks to the SIFTS service. The correct mapping can strongly improve the impact of RepeatsDB, since it is now very easy to link repeat data with other sequence features like mutations or post-translational modifications. Thanks to a RESTful architecture, all RepeatsDB data are accessible from external APIs and third party resources through HTTP URLs. Please refer to the ‘Help’ page of the website for details on using the RepeatsDB web services. Customized datasets can be downloaded in JSON or text format using the browse function or RESTful web services.
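As an illustration of linking repeat data with other sequence features, the following minimal sketch assigns sequence positions (e.g. mutations) to repeat units. It assumes that unit boundaries have already been downloaded and converted to sorted, non-overlapping, inclusive (start, end) pairs in UniProt numbering. The function name, unit boundaries and variant positions are hypothetical and do not reflect the actual RepeatsDB JSON schema.

```python
from bisect import bisect_right


def unit_for_position(units, position):
    """Return the index of the unit containing `position`, or None.

    `units` is assumed to be a sorted list of non-overlapping, inclusive
    (start, end) pairs in the same numbering as `position`."""
    starts = [start for start, _ in units]
    # Index of the last unit starting at or before the position.
    i = bisect_right(starts, position) - 1
    if i >= 0 and units[i][0] <= position <= units[i][1]:
        return i
    return None


# Hypothetical unit boundaries and variant positions, for illustration only.
units = [(10, 42), (43, 75), (80, 112)]
variants = {"V1": 12, "V2": 75, "V3": 78, "V4": 200}

hits = {name: unit_for_position(units, pos) for name, pos in variants.items()}
```

Positions that fall between units (such as `V3`, which here would lie in an insertion) or outside the repeat region (`V4`) map to `None`.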
Figure 1.
Retrieving RepeatsDB data. RepeatsDB data can be retrieved in three different ways. (A) The ‘Browse’ page provides the entry point for both the structural hierarchy and sequence clusters. (B) The ‘Search’ page allows the user to perform advanced queries against a range of RepeatsDB-specific and third-party search fields. The input can be simple text or numeric (single value or range) according to the field type and multiple queries can be combined by boolean operators (AND, OR, NOT). Both the ‘Browse’ and ‘Search’ pages redirect to the results page (C). This page provides a table with the list of retrieved entries and can be further filtered (and sorted) through column header fields. Results can be displayed by PDB chain (default), region or UniProt.
Figure 2.
Screenshot of RepeatsDB sample entry page for PDB code 1ialA. The top part of the page (A) reports structure information from the PDB and cross-references to third-party databases including UniProt, MobiDB, SCOP, CATH and Pfam (when available). RepeatsDB annotations are available for download both in text and JSON formats on the top-right corner. (B) A table provides region details such as structural classification, start/end position, number of units, repeat period and cluster families. (C) The feature viewer summarizes available annotation for the PDB reference sequence, i.e. the SEQRES field in the PDB file. An overview of RepeatsDB information (regions, units and insertions) is shown along with secondary structure (DSSP), Pfam, SCOP and CATH tracks (when available). (D) A detailed view of RepeatsDB annotations is highlighted in the sequence and PDB viewers.
Statistics
RepeatsDB provides high quality annotation for ∼5400 entries. Figure 3 compares the current RepeatsDB content to the previous version. The chart shows the total number of entries belonging to each class. However, the new version provides unit definition and subclass classification for all entries where the old version annotated only a tiny fraction (327 entries, cyan bar). Moreover, in RepeatsDB 2.0 more than 60% of the entries have been manually reviewed by expert curators (blue segment). Further details such as the number of regions, units and genes are available from the ‘Stats’ page of the web site.
Figure 3.
RepeatsDB growth. RepeatsDB 2.0 is compared to the previous release. Entries have unit and subclass annotation, with more than 60% manually reviewed (blue). For the old version, only a tiny fraction of entries have unit definition (cyan) and the rest is mostly annotated only at the class level (yellow).
CONCLUSION AND FUTURE WORK
RepeatsDB was presented in 2014 with the goal of providing the community with a central resource for high-quality tandem repeat protein structure annotation. It has been cited in a number of different studies regarding repeat proteins, and has been used to extract datasets for repeat protein analysis and to test algorithms for repeat protein annotation. The detailed annotation of entries performed by RepeatsDB curators has allowed us to build a high-quality Structural Repeat Unit Library (SRUL). This library was exploited by the ReUPred algorithm (34) as a gold standard to define unit position in new entries in an iterative process.
The new release of RepeatsDB includes a new annotation pipeline, combining the RAPHAEL algorithm for repeat detection (19) and ReUPred for annotation (34), producing extensive annotation for all entries. The pipeline is fully automated and allows the easy regular update of the database. The iterative execution of the pipeline already demonstrated its efficacy both because it identified a large number of new entries, and because new subclasses were identified and added to the structural classification scheme. RepeatsDB will benefit from regular updates, which will steadily increase the number of available annotations. Future work will concentrate on exploiting the repeat unit definitions to create profiles for use in detecting repeats from sequence for genome-scale analysis (37).
Acknowledgments
The authors are grateful to Francesco Tabaro for help with the web interface. We also acknowledge ELIXIR-IIB (elixir-italy.org), the Italian Node of the European ELIXIR infrastructure (elixir-europe.org), for supporting the development and maintenance of RepeatsDB.
FUNDING
COST Action BM1405 NGP-net. Funding for open access charge: COST BM1405.
Conflict of interest statement. None declared.
REFERENCES
- 1. Andrade M.A., Perez-Iratxeta C., Ponting C.P. Protein repeats: structures, functions, and evolution. J. Struct. Biol. 2001;134:117–131. doi: 10.1006/jsbi.2001.4392.
- 2. Kojic S., Radojkovic D., Faulkner G. Muscle ankyrin repeat proteins: their role in striated muscle function in health and disease. Crit. Rev. Clin. Lab. Sci. 2011;48:269–294. doi: 10.3109/10408363.2011.643857.
- 3. Jha S., Ting J.P.-Y. Inflammasome-associated nucleotide-binding domain, leucine-rich repeat proteins and inflammatory diseases. J. Immunol. Baltim. Md. 2009;183:7623–7629. doi: 10.4049/jimmunol.0902425.
- 4. Madsen B.E., Villesen P., Wiuf C. Short tandem repeats in human exons: a target for disease mutations. BMC Genomics. 2008;9:410. doi: 10.1186/1471-2164-9-410.
- 5. Orr H.T., Zoghbi H.Y. Trinucleotide repeat disorders. Annu. Rev. Neurosci. 2007;30:575–621. doi: 10.1146/annurev.neuro.29.051605.113042.
- 6. Liggett W.H., Sidransky D. Role of the p16 tumor suppressor gene in cancer. J. Clin. Oncol. Off. J. Am. Soc. Clin. Oncol. 1998;16:1197–1206. doi: 10.1200/JCO.1998.16.3.1197.
- 7. Cerveny L., Straskova A., Dankova V., Hartlova A., Ceckova M., Staud F., Stulik J. Tetratricopeptide repeat motifs in the world of bacterial pathogens: role in virulence mechanisms. Infect. Immun. 2013;81:629–635. doi: 10.1128/IAI.01035-12.
- 8. Weidle U.H., Auer J., Brinkmann U., Georges G., Tiefenthaler G. The emerging role of new protein scaffold-based agents for treatment of cancer. Cancer Genomics Proteomics. 2013;10:155–168.
- 9. Vlassi M., Brauns K., Andrade-Navarro M.A. Short tandem repeats in the inhibitory domain of the mineralocorticoid receptor: prediction of a β-solenoid structure. BMC Struct. Biol. 2013;13:17. doi: 10.1186/1472-6807-13-17.
- 10. Aziz M.F., Caetano-Anollés K., Caetano-Anollés G. The early history and emergence of molecular functions and modular scale-free network behavior. Sci. Rep. 2016;6:25058. doi: 10.1038/srep25058.
- 11. Alva V., Söding J., Lupas A.N. A vocabulary of ancient peptides at the origin of folded proteins. eLife. 2015;4:e09410. doi: 10.7554/eLife.09410.
- 12. Goncearenco A., Berezovsky I.N. Protein function from its emergence to diversity in contemporary proteins. Phys. Biol. 2015;12:45002. doi: 10.1088/1478-3975/12/4/045002.
- 13. Jorda J., Xue B., Uversky V.N., Kajava A.V. Protein tandem repeats - the more perfect, the less structured. FEBS J. 2010;277:2673–2682. doi: 10.1111/j.1742-464X.2010.07684.x.
- 14. Abraham A.-L., Rocha E.P.C., Pothier J. Swelfe: a detector of internal repeats in sequences and structures. Bioinformatics. 2008;24:1536–1537. doi: 10.1093/bioinformatics/btn234.
- 15. Parra R.G., Espada R., Sánchez I.E., Sippl M.J., Ferreiro D.U. Detecting repetitions and periodicities in proteins by tiling the structural space. J. Phys. Chem. B. 2013. doi: 10.1021/jp402105j.
- 16. Hrabe T., Godzik A. ConSole: using modularity of Contact maps to locate Solenoid domains in protein structures. BMC Bioinformatics. 2014;15:119. doi: 10.1186/1471-2105-15-119.
- 17. Do Viet P., Roche D.B., Kajava A.V. TAPO: A combined method for the identification of tandem repeats in protein structures. FEBS Lett. 2015;589:2611–2619. doi: 10.1016/j.febslet.2015.08.025.
- 18. Di Domenico T., Potenza E., Walsh I., Parra R.G., Giollo M., Minervini G., Piovesan D., Ihsan A., Ferrari C., Kajava A.V., et al. RepeatsDB: a database of tandem repeat protein structures. Nucleic Acids Res. 2014;42:D352–D357. doi: 10.1093/nar/gkt1175.
- 19. Walsh I., Sirocco F.G., Minervini G., Di Domenico T., Ferrari C., Tosatto S.C.E. RAPHAEL: recognition, periodicity and insertion assignment of solenoid protein structures. Bioinformatics. 2012;28:3257–3264. doi: 10.1093/bioinformatics/bts550.
- 20. Velankar S., van Ginkel G., Alhroub Y., Battle G.M., Berrisford J.M., Conroy M.J., Dana J.M., Gore S.P., Gutmanas A., Haslam P., et al. PDBe: improved accessibility of macromolecular structure data from PDB and EMDB. Nucleic Acids Res. 2016;44:D385–D395. doi: 10.1093/nar/gkv1047.
- 21. Kajava A.V. Tandem repeats in proteins: from sequence to structure. J. Struct. Biol. 2012;179:279–288. doi: 10.1016/j.jsb.2011.08.009.
- 22. Moutevelis E., Woolfson D.N. A periodic table of coiled-coil protein structures. J. Mol. Biol. 2009;385:726–732. doi: 10.1016/j.jmb.2008.11.028.
- 23. Finn R.D., Coggill P., Eberhardt R.Y., Eddy S.R., Mistry J., Mitchell A.L., Potter S.C., Punta M., Qureshi M., Sangrador-Vegas A., et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 2016;44:D279–D285. doi: 10.1093/nar/gkv1344.
- 24. Letunic I., Doerks T., Bork P. SMART: recent updates, new developments and status in 2015. Nucleic Acids Res. 2015;43:D257–D260. doi: 10.1093/nar/gku949.
- 25. Mistry J., Coggill P., Eberhardt R.Y., Deiana A., Giansanti A., Finn R.D., Bateman A., Punta M. The challenge of increasing Pfam coverage of the human proteome. Database. 2013;2013:bat023. doi: 10.1093/database/bat023.
- 26. Paladin L., Tosatto S.C.E. Comparison of protein repeat classifications based on structure and sequence families. Biochem. Soc. Trans. 2015;43:832–837. doi: 10.1042/BST20150079.
- 27. Brunette T.J., Parmeggiani F., Huang P.-S., Bhabha G., Ekiert D.C., Tsutakawa S.E., Hura G.L., Tainer J.A., Baker D. Exploring the repeat protein universe through computational protein design. Nature. 2015;528:580–584. doi: 10.1038/nature16162.
- 28. Espada R., Parra R.G., Mora T., Walczak A.M., Ferreiro D.U. Capturing coevolutionary signals in repeat proteins. BMC Bioinformatics. 2015;16:207. doi: 10.1186/s12859-015-0648-3.
- 29. Turjanski P., Parra R.G., Espada R., Becher V., Ferreiro D.U. Protein repeats from first principles. Sci. Rep. 2016;6:23959. doi: 10.1038/srep23959.
- 30. Glantz S.T., Carpenter E.J., Melkonian M., Gardner K.H., Boyden E.S., Wong G.K.-S., Chow B.Y. Functional and topological diversity of LOV domain photoreceptors. Proc. Natl. Acad. Sci. U.S.A. 2016;113:E1442–E1451. doi: 10.1073/pnas.1509428113.
- 31. Sharma M., Pandey G.K. Expansion and function of repeat domain proteins during stress and development in plants. Front. Plant Sci. 2015;6:1218. doi: 10.3389/fpls.2015.01218.
- 32. Parra R.G., Espada R., Verstraete N., Ferreiro D.U. Structural and energetic characterization of the ankyrin repeat protein family. PLoS Comput. Biol. 2015;11:e1004659. doi: 10.1371/journal.pcbi.1004659.
- 33. Pellegrini M. Tandem repeats in proteins: prediction algorithms and biological role. Front. Bioeng. Biotechnol. 2015;3:143. doi: 10.3389/fbioe.2015.00143.
- 34. Hirsh L., Piovesan D., Paladin L., Tosatto S.C.E. Identification of repetitive units in protein structures with ReUPred. Amino Acids. 2016;48:1391–1400. doi: 10.1007/s00726-016-2187-2.
- 35. Velankar S., Dana J.M., Jacobsen J., van Ginkel G., Gane P.J., Luo J., Oldfield T.J., O'Donovan C., Martin M.-J., Kleywegt G.J. SIFTS: Structure integration with function, taxonomy and sequences resource. Nucleic Acids Res. 2013;41:D483–D489. doi: 10.1093/nar/gks1258.
- 36. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43:D204–D212. doi: 10.1093/nar/gku989.
- 37. Mitchell A., Chang H.-Y., Daugherty L., Fraser M., Hunter S., Lopez R., McAnulla C., McMenamin C., Nuka G., Pesseat S., et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 2015;43:D213–D221. doi: 10.1093/nar/gku1243.



