Abstract
We focus our attention on multiple repeats of one amino acid (homorepeats) and have created a new database (named HRaP, at http://bioinfo.protres.ru/hrap/) of the occurrence of homorepeats and disordered patterns in different proteomes. HRaP is aimed at understanding the function of amino acid tandem repeats in different proteomes. The database includes 122 proteomes, 97 eukaryotic and 25 bacterial ones, which can be divided into 9 kingdoms of eukaryotes and 5 phyla of bacteria. It contains 1 449 561 protein sequences, of which 771 786 are sequences of proteins with GO annotations. We have determined homorepeats and patterns that are associated with some function. Through our web server, the user can do the following: (i) search for proteins with the given homorepeat in 122 proteomes, including GO annotation for these proteins; (ii) search for proteins with the given disordered pattern from the library of disordered patterns constructed on the clustered Protein Data Bank in 122 proteomes, including GO annotations for these proteins; (iii) analyze lengths of homorepeats in different proteomes; (iv) investigate disordered regions in the chosen proteins in 122 proteomes; (v) study the coupling of different homorepeats in one protein; (vi) determine longest runs for each amino acid inside each proteome; and (vii) download the full list of proteins with the given length of a homorepeat.
INTRODUCTION
Low-complexity motifs occur in eukaryotic proteomes (including the human one) more frequently than other protein motifs (1–3). One such motif is a homorepeat, a region in which a single amino acid is repeated. Homorepeats play important roles in some biological processes (1,2,4,5). Homorepeats of some amino acids occur more frequently than those of others, and the prevailing type of homorepeat varies between proteomes (3). For example, EEEEEE appears to be the most frequent for Chordata, QQQQQQ for Arthropoda and SSSSSS for Nematoda (3). One can suggest that such homorepeats may be molecular recognition elements for proteins. A growing number of studies suggest that homorepeats may have a broader role in human diseases than was previously recognized (6). Expansion of homorepeats is the molecular basis for at least 18 human neurological diseases. For example, expansion of the polyalanine (poly-A) tract in polyadenylate-binding protein 2 (PABP2) is associated with oculopharyngeal muscular dystrophy (7). Long poly-A tracts cause several human developmental diseases (5,8,9). Expansion of the poly-Q tract (to more than 36 residues) in the huntingtin gene results in Huntington’s disease. Moreover, poly-Q tracts are associated with several ataxias (8,10). Therefore, understanding the functional role of these patterns, homorepeats in particular, in proteomes is a formidable challenge.
With disordered regions and their functions under active study, we focus our attention on multiple long repeats of one amino acid (homorepeats) (see Figure 1). The longest uninterrupted runs in the Dictyostelium discoideum proteome are 306 residues for serine, 79 for glutamine, 90 for asparagine and 55 for glutamic acid. The longest uninterrupted runs in the human proteome are 58 residues for serine, 74 for glutamine, 58 for aspartic acid and 53 for lysine. It is time for a more careful analysis of the occurrence, evolution and conservation of these repeats to find their functions. It is still unknown why genetically unstable homorepeats have been preserved throughout evolution, so it is important to survey the occurrence of homorepeats in different classes. The functions of some such motifs have recently been determined. For example, histidine repeats in the protein kinase DYRK1A (length of 13) and in the protein FAM76B (length of 10) mediate nuclear speckle trafficking (5,11,12). The poly-A tract in the HOXD13 protein (length of 15) is important in limb development (5). It has been predicted that most homorepeats are disordered (3,13). Homorepeats such as KKKKK, PPPPP and HHHH are included in the library of disordered patterns (14). In living organisms, homopeptides can be of non-ribosomal origin as well (2). Comparative analyses of amino acid repeats in some proteomes have been done (2,9,15,16). To gain a clear insight into the abundance of homorepeats and disordered patterns, we have created a database of occurrence of homorepeats with different lengths and disordered patterns (HRaP) in 122 eukaryotic and bacterial proteomes. Our database includes 1 449 561 protein sequences from 122 proteomes, 771 786 sequences of proteins with GO annotations (17), all homorepeats and 412 disordered patterns from three sets (14,18,19).
Figure 1.
Dependence of the number of proteins that contain homorepeats of different lengths for 20 amino acids in D. discoideum proteome.
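The longest-run statistics quoted above can be illustrated with a minimal Python sketch. It is not the HRaP implementation; the function names `homorepeats` and `longest_runs` are hypothetical, and sequences are assumed to be plain one-letter amino acid strings without line breaks.

```python
import re
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# A maximal run of one repeated character (hypothetical helper, not from HRaP).
RUN_RE = re.compile(r"(.)\1*")


def homorepeats(sequence, min_len=6):
    """Return (residue, start, length) for every maximal run of >= min_len residues."""
    runs = []
    for match in RUN_RE.finditer(sequence):
        length = match.end() - match.start()
        if length >= min_len and match.group(1) in AMINO_ACIDS:
            runs.append((match.group(1), match.start(), length))
    return runs


def longest_runs(sequences):
    """Longest uninterrupted run of each amino acid over a collection of sequences."""
    best = defaultdict(int)
    for seq in sequences:
        for residue, _, length in homorepeats(seq, min_len=1):
            best[residue] = max(best[residue], length)
    return dict(best)
```

Applied to all sequences of one proteome, `longest_runs` would give one value per amino acid, which is the kind of quantity reported above for D. discoideum and human.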
DESCRIPTION OF THE DATABASE
We considered 3617 proteomes from the European Bioinformatics Institute site (ftp://ftp.ebi.ac.uk/pub/databases/SPproteomes/uniprot/proteomes/). Because the disordered patterns that occur frequently in proteomes have low complexity (homorepeats), we performed a preliminary analysis. Figure 2 shows how the number of proteins with at least one homorepeat of 6 or more residues depends on the size of the proteome. The dependence is weak. The general result of this analysis is that homorepeats appear more often in eukaryotic proteomes than in other proteomes (bacterial, archaeal and viral ones). Figure 2 also shows that the number of proteins with at least one homorepeat of 6 residues is <100 for proteomes with an overall number of residues <2 500 000. We therefore restricted our research to proteomes with an overall number of residues exceeding 2 500 000 (19). Taking proteome size into account, we obtained 122 proteomes representing nine kingdoms of eukaryotes and five phyla of bacteria (see Table in HRaP: proteomes). These proteomes contain 1 449 561 protein sequences. The possible use of this database (named HRaP) is not restricted to tasks connected with investigations of disordered regions in proteins and proteomes. Disordered regions can be calculated by using our programs IsUnstruct (14,20) and FoldUnfold (21). Recently published methods for the prediction of disordered regions are usually meta-servers that combine multiple disorder predictors, e.g. MD (22), PONDR-FIT (23) and MFDp (24). There are separate methods for predicting short [≤15 residues, in the program PONDR VSL2 (25)] and long [≥30 residues, in the program PONDR VSL1 (26)] disordered regions.
Our method IsUnstruct demonstrates a high accuracy in predicting both short and long disordered regions.
Figure 2.
Dependence of the number of proteins with at least one occurrence of homorepeats of ≥6 residues in 3617 proteomes on the size of the proteomes.
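The preliminary analysis described above (proteome size versus number of proteins with a homorepeat, followed by the size cutoff) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a proteome is assumed to be a list of one-letter sequence strings, and all function names are hypothetical.

```python
import re


def total_residues(proteome):
    """Overall number of residues in a proteome (a list of sequence strings)."""
    return sum(len(seq) for seq in proteome)


def count_proteins_with_homorepeat(proteome, min_len=6):
    """Number of proteins containing at least one homorepeat of >= min_len residues."""
    # One residue followed by at least (min_len - 1) copies of itself.
    run = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return sum(1 for seq in proteome if run.search(seq))


def select_proteomes(proteomes, min_residues=2_500_000):
    """Names of proteomes whose overall number of residues exceeds min_residues."""
    return [name for name, seqs in proteomes.items()
            if total_residues(seqs) > min_residues]
```

Plotting `count_proteins_with_homorepeat` against `total_residues` for each proteome would give a scatter of the kind shown in Figure 2; the default cutoff of 2 500 000 residues is the value stated in the text.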
HRaP can be used to analyze evolutionary differences between proteins from different proteomes and the connections of these regions with definite functions. The database includes 771 786 proteins with GO annotations. Leucine repeats have been found to be especially abundant in the ‘Receptor and/or Membrane’ group, glutamine and alanine repeats in the ‘Transcription Factor and/or Development’ group, and lysine repeats in the ‘Metabolism’ group (2,5).
To see the occurrence of a homorepeat, the user first chooses a proteome among the 122 considered ones and then chooses the homorepeat of a given length, or the pattern, to investigate (see Figure 3). The order of amino acids and patterns is not random. Amino acids are ordered so that similar ones are grouped together. From the table of occurrence of homorepeats of different lengths, one can see that long homorepeats appear more often for polar and charged amino acids. The patterns are ordered according to their significance for the prediction of disordered regions; their numbers were assigned in the corresponding articles (14,18,19). After that, the list of proteins with the given homorepeat or pattern appears with GO annotations (if any are available). Usually, long proteins contain a homorepeat or several different homorepeats. If a protein contains several homorepeats and patterns, all these regions are marked by different colors in the sequence. In the section HomoRepeats and Patterns, the user can find the occurrence of homorepeats with different lengths and disordered patterns for all 122 proteomes. The largest fraction of proteins with homorepeats of six or more residues, 46%, belongs to the Amoebozoa proteome (D. discoideum) (see Figure 4). The longest uninterrupted runs in the D. discoideum proteome are 306 residues for serine, 79 for glutamine, 90 for asparagine and 55 for glutamic acid. The most frequent amino acid runs in the 122 proteomes occur for poly-Q (tract length from 6 to 15), poly-S, poly-A, poly-G, poly-N, poly-P and poly-E (in decreasing order). For short tracts (up to 5 residues), the acidic runs poly-E and poly-D are more frequent than poly-Q and poly-N; the relationship changes for longer tracts. Poly-K is more frequent than poly-R, and poly-S is more frequent than poly-T, for all homorepeat lengths.
Figure 3.

A screenshot of HRaP results filtered for HomoRepeats for all 122 proteomes.
Figure 4.

The percentage of proteins with at least one occurrence of homorepeats of ≥6 residues long in 122 proteomes.
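The marking of several homorepeats and patterns within one protein sequence, as described for the search results, can be illustrated by a small sketch that returns the regions to highlight. How HRaP itself locates and colors regions is not described in the text; the function `find_regions` and its label format are assumptions made for illustration only.

```python
import re


def find_regions(sequence, patterns=(), min_len=6):
    """Return sorted (start, end, label) spans for homorepeats and exact pattern hits.

    Spans are half-open [start, end); a front end could give each label its own color.
    """
    regions = []
    # Homorepeats: one residue followed by at least (min_len - 1) copies of itself.
    for match in re.finditer(r"(.)\1{%d,}" % (min_len - 1), sequence):
        regions.append((match.start(), match.end(), "poly-" + match.group(1)))
    # Disordered patterns: exact matches, overlapping occurrences included.
    for pattern in patterns:
        start = sequence.find(pattern)
        while start != -1:
            regions.append((start, start + len(pattern), pattern))
            start = sequence.find(pattern, start + 1)
    return sorted(regions)
```

For instance, `find_regions(seq, patterns=["TTTATT", "NNNNN"])` would report both the homorepeats of six or more residues and every exact occurrence of the two listed patterns in `seq`.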
Homorepeats and patterns associated with the function
We can suggest that homorepeats and patterns are responsible for common functions of nonhomologous, unrelated proteins from different organisms. To test this, we performed the following analysis. All GO annotations of proteins were collected for the set of 122 proteomes; there are 11 313 different kinds of annotations. Proteins without annotations were combined into the class «absent annotation». The number of proteins including at least one pattern from the latest version of the library [171 patterns, set 2012 (14)] was calculated («Npt»), as was the number of proteins including homorepeats of ≥6 residues («Nhm»). The number of proteins with the given annotation was also calculated and is indicated in the column «Ngo». For example, we found 60 proteins that have the functional (F) GO annotation ‘ATP binding’ and include the pattern IKSHHNVGGLP. The same pattern is also associated with the guanosine monophosphate (GMP) synthase (glutamine-hydrolyzing) activity and the GMP biosynthetic process. For each pattern or homorepeat, we can calculate the frequency of occurrence in all proteins:
where Npt is the number of proteins with the given homorepeat or with the given pattern, Nall = 1 449 561 is the full number of proteins in 122 proteomes and the total number of GO-annotated proteins is 771 786. The number of proteins with the given homorepeat (or pattern) and annotation (Npt,go) is given in the Table (section GO annotations). The probability of finding Npt,go or more such proteins among all proteins with the given annotation was calculated as:
[Equation for the probability pz; the equation image is not preserved in this text.]
Taking into account 171 patterns, 20 homorepeats and 11 313 kinds of GO annotations, we have 11 313 × (171 + 20) = 2 160 783 ≈ 2·10⁶ possible combinations. Therefore, we should not pay attention to the events in which the probability is higher than 10⁻⁷. Taking this into account, the probabilities pz were colored according to the following conditions: green corresponds to pz < 10⁻¹⁵, light green corresponds to 10⁻¹⁵ ≤ pz < 10⁻¹⁰ and yellow corresponds to 10⁻¹⁰ ≤ pz < 10⁻⁷.
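The reasoning of this subsection can be made concrete with a hedged sketch. The frequency is taken here as Npt/Nall, which follows the definitions given in the text. The exact expression used for pz is not available in this text, so the sketch assumes a binomial upper tail, which is one common way to obtain "Npt,go or more" probabilities and is not necessarily the authors' formula. All function names are hypothetical; only the thresholds and the counts come from the text.

```python
from math import exp, lgamma, log

N_ALL = 1_449_561  # proteins in the 122 proteomes (from the text)


def frequency(n_pt, n_all=N_ALL):
    """Fraction of all proteins that contain the given pattern or homorepeat."""
    return n_pt / n_all


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space.

    ASSUMPTION: used as a stand-in for pz, with n = Ngo, k = Npt,go, p = Npt / Nall.
    """
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_terms = [
        lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        + i * log(p) + (n - i) * log(1.0 - p)
        for i in range(k, n + 1)
    ]
    top = max(log_terms)
    return min(1.0, exp(top) * sum(exp(t - top) for t in log_terms))


def color(pz):
    """Color classes for pz as stated in the text; None above the 1e-7 cutoff."""
    if pz < 1e-15:
        return "green"
    if pz < 1e-10:
        return "light green"
    if pz < 1e-7:
        return "yellow"
    return None


# Number of (pattern or homorepeat, annotation) combinations, as in the text.
N_COMBINATIONS = 11_313 * (171 + 20)
```

With about 2·10⁶ combinations tested, a probability near 1/N_COMBINATIONS (roughly 5·10⁻⁷) would be expected to arise by chance alone, which is consistent with the 10⁻⁷ cutoff chosen in the text.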
We also calculated the probabilities:
The patterns and homorepeats are sorted by p1 and p2 using the following colors: green for p1 > 0.5, light green for 0.3 < p1 < 0.5 and light yellow for 0.1 < p1 < 0.3. The patterns and homorepeats associated with the functions are presented in the section GO annotations. It is interesting to note that histidine, alanine, glutamine and glutamic acid repeats are connected with the GO annotation ‘C: nucleus’. As mentioned in the Introduction, histidine repeats mediate nuclear speckle trafficking in several transcription factors (5,11,12). The methionine repeat is connected with the voltage-gated calcium channel activity. Proline homorepeats are associated with many GO annotations: dendrite self-avoidance, central nervous system morphogenesis, bacterial cell surface binding, axon guidance receptor activity, axon extension involved in axon guidance, actin polymerization or depolymerization, Rho GTPase binding, mushroom body development, actin cortical patch, axonal fasciculation, actin cytoskeleton organization, peripheral nervous system development, cell morphogenesis, tropomyosin binding and stereocilium. It should also be noted that not all amino acid repeats are associated with a function.
Among the 109 disordered patterns (set 2010), 8 occur (with precise coincidence) only in the Protein Data Bank and are absent in the 122 proteomes. There are only 6 such patterns among the 141 patterns (set 2011) and 8 among the 171 patterns (set 2012). Patterns such as TTTATT and NNNNN (from set 2012) occur >17 000 times in the considered 122 proteomes. The leader is QQQQQQQ, which occurs >20 000 times. Moreover, the pattern NNNNN is connected with the GO process ‘symbiosis, encompassing mutualism through parasitism’. This pattern occurs very seldom in the human proteome, only in 21 proteins.
We have created the list of human proteins with homorepeats that are associated with disease. The list can be found in the frequently asked questions section. Also, the list of proteins with homorepeats of 6 and more residues long from the clustered Protein Data Bank (14) can be found in the frequently asked questions section.
Correlations between number of proteins with homorepeats or patterns in any proteome
For each proteome, we calculated a set of 109 values, one per pattern from the library, each being the number of proteins containing at least one occurrence of that disordered pattern. Then, for every pair of proteomes, the correlation coefficient between their 109 values was calculated, giving a matrix of correlation coefficients. The coefficients were then averaged inside each kingdom and phylum (see Correlations section). Similar values have been calculated for the sets of 141 disordered patterns, 171 disordered patterns and 20 homorepeats. For homorepeats, each proteome is described by 20 values: the numbers of proteins containing at least one homorepeat of 6 or more residues for each of the 20 types of amino acid residues. A comparative analysis across the 122 proteomes has demonstrated that the correlation coefficients between these values are higher inside a kingdom (among the 9 kingdoms of eukaryota and 5 phyla of bacteria) than between kingdoms (3). The same result is valid for the 109 disordered selected patterns (set 2010) (18), the 141 disordered selected patterns (set 2011) (19) and the 171 disordered selected patterns (set 2012) (14).
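The pairwise correlation and averaging procedure described above can be sketched as follows. The text does not state which correlation coefficient was used; the sketch assumes the Pearson coefficient. The data layout (`counts` mapping a proteome name to its vector of per-pattern protein counts, `group_of` mapping a proteome name to its kingdom or phylum) and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_correlations(counts, group_of):
    """Average pairwise correlations within and between groups.

    Returns {(group_a, group_b): mean coefficient}, with the two group names
    sorted so that each unordered pair of groups has a single key.
    """
    collected = {}
    for p, q in combinations(sorted(counts), 2):
        key = tuple(sorted((group_of[p], group_of[q])))
        collected.setdefault(key, []).append(pearson(counts[p], counts[q]))
    return {key: mean(values) for key, values in collected.items()}
```

Keys whose two group names coincide hold the within-kingdom averages, and the remaining keys hold the between-kingdom averages, so the comparison reported in the text amounts to comparing these two sets of entries.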
CONCLUSIONS AND FUTURE DIRECTIONS
We have collected an exhaustive database of the occurrence of homorepeats and patterns in 122 proteomes, each with more than 2 500 000 residues. The patterns and homorepeats found to be associated with function point to the tremendous importance of homorepeats in a large variety of cellular processes and merit further study. In future work, we plan to include the analysis of coupling between occurrences of different homorepeats in one protein and to cluster proteins to remove the influence of homologous proteins on the determination of homorepeat functions. We will be grateful for any contribution to the database from the community.
FUNDING
Russian Foundation for Basic Research [11-04-00763]; Russian Academy of Sciences (programs ‘Molecular and Cell Biology’ [01201353567] and ‘Fundamental Sciences to Medicine’). Funding for open access charge: Russian Academy of Sciences program ‘Molecular and Cell Biology’ [01201353567].
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
The authors thank O.I. Sokolovskaya for assistance in programming.
REFERENCES
- 1. Tompa P. Intrinsically unstructured proteins evolve by repeat expansion. Bioessays. 2003;25:847–855. doi: 10.1002/bies.10324.
- 2. Jorda J, Kajava AV. Protein homorepeats sequences, structures, evolution, and functions. Adv. Protein Chem. Struct. Biol. 2010;79:59–88. doi: 10.1016/S1876-1623(10)79002-7.
- 3. Lobanov MY, Galzitskaya OV. Occurrence of disordered patterns and homorepeats in eukaryotic and bacterial proteomes. Mol. BioSyst. 2012;8:327–337. doi: 10.1039/c1mb05318c.
- 4. Karlin S, Burge C. Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc. Natl Acad. Sci. USA. 1996;93:1560–1565. doi: 10.1073/pnas.93.4.1560.
- 5. Mularoni L, Ledda A, Toll-Riera M, Albà MM. Natural selection drives the accumulation of amino acid tandem repeats in human proteins. Genome Res. 2010;20:745–754. doi: 10.1101/gr.101261.109.
- 6. Siwach P, Ganesh S. Tandem repeats in human disorders: mechanisms and evolution. Front. Biosci. 2008;13:4467–4484. doi: 10.2741/3017.
- 7. Brais B, Bouchard JP, Xie YG, Rochefort DL, Chretien N, Tome FM, Lafrenière RG, Rommens JM, Uyama E, Nohira O, et al. Short GCG expansions in the PABP2 gene cause oculopharyngeal muscular dystrophy. Nat. Genet. 1998;18:164–167. doi: 10.1038/ng0298-164.
- 8. Brown LY, Brown SA. Alanine tracts: the expanding story of human illness and trinucleotide repeats. Trends Genet. 2004;20:51–58. doi: 10.1016/j.tig.2003.11.002.
- 9. Mularoni L, Veitia RA, Albà MM. Highly constrained proteins contain an unexpectedly large number of amino acid tandem repeats. Genomics. 2007;89:316–325. doi: 10.1016/j.ygeno.2006.11.011.
- 10. Gatchel JR, Zoghbi HY. Diseases of unstable repeat expansion: mechanisms and common principles. Nat. Rev. Genet. 2005;6:743–755. doi: 10.1038/nrg1691.
- 11. Alvarez M, Estivill X, de la Luna S. DYRK1A accumulates in splicing speckles through a novel targeting signal and induces speckle disassembly. J. Cell Sci. 2003;116:3099–3107. doi: 10.1242/jcs.00618.
- 12. Salichs E, Ledda A, Mularoni L, Albà MM, de la Luna S. Genome-wide analysis of histidine repeats reveals their role in the localization of human proteins to the nuclear speckles compartment. PLoS Genet. 2009;5:e1000397. doi: 10.1371/journal.pgen.1000397.
- 13. Jorda J, Xue B, Uversky VN, Kajava AV. Protein tandem repeats - the more perfect, the less structured. FEBS J. 2010;277:2673–2682. doi: 10.1111/j.1742-464X.2010.07684.x.
- 14. Lobanov MY, Sokolovskiy IV, Galzitskaya OV. IsUnstruct: prediction of the residue status to be ordered or disordered in the protein chain by a method based on the Ising model. J. Biomol. Struct. Dyn. 2013;31:1034–1043. doi: 10.1080/07391102.2012.718529.
- 15. Albà MM, Guigo R. Comparative analysis of amino acid repeats in rodents and humans. Genome Res. 2004;14:549–554. doi: 10.1101/gr.1925704.
- 16. Dalby AR. A comparative proteomic analysis of the simple amino acid repeat distributions in Plasmodia reveals lineage specific amino acid selection. PLoS One. 2009;4:e6231. doi: 10.1371/journal.pone.0006231.
- 17. Gene Ontology Consortium. Gene Ontology annotations and resources. Nucleic Acids Res. 2013;41:D530–D535. doi: 10.1093/nar/gks1050.
- 18. Lobanov MY, Furletova EI, Bogatyreva NS, Roytberg MA, Galzitskaya OV. Library of disordered patterns in 3D protein structures. PLoS Comput. Biol. 2010;6:e1000958. doi: 10.1371/journal.pcbi.1000958.
- 19. Lobanov MY, Galzitskaya OV. Disordered patterns in clustered Protein Data Bank and in eukaryotic and bacterial proteomes. PLoS One. 2011;6:e27142. doi: 10.1371/journal.pone.0027142.
- 20. Lobanov MY, Galzitskaya OV. The Ising model for prediction of disordered residues from protein sequence alone. Phys. Biol. 2011;8:035004. doi: 10.1088/1478-3975/8/3/035004.
- 21. Galzitskaya OV, Garbuzynskiy SO, Lobanov MY. FoldUnfold: web server for the prediction of disordered regions in protein chain. Bioinformatics. 2006;22:2948–2949. doi: 10.1093/bioinformatics/btl504.
- 22. Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B. Improved disorder prediction by combination of orthogonal approaches. PLoS One. 2009;4:e4433. doi: 10.1371/journal.pone.0004433.
- 23. Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN. PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim. Biophys. Acta. 2010;1804:996–1010. doi: 10.1016/j.bbapap.2010.01.011.
- 24. Mizianty MJ, Stach W, Chen K, Kedarisetti KD, Disfani FM, Kurgan L. Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics. 2010;26:i489–i496. doi: 10.1093/bioinformatics/btq373.
- 25. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics. 2006;7:208. doi: 10.1186/1471-2105-7-208.
- 26. Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005;61(Suppl. 7):176–182. doi: 10.1002/prot.20735.



