Tally-2.0: upgraded validator of tandem repeat detection in protein sequences

Vladimir Perovic; Jeremy Y Leclercq; Neven Sumonja; Francois D Richard; Nevena Veljkovic; Andrey V Kajava

Bioinformatics. 2020 Feb 25;36(10):3260–3262. doi: 10.1093/bioinformatics/btaa121

Editor: Jinbo Xu

PMCID: PMC7214015 PMID: 32096820

Abstract

Motivation

Proteins containing tandem repeats (TRs) are abundant, frequently fold into elongated non-globular structures and perform vital functions. A number of computational tools have been developed to detect TRs in protein sequences. Because the boundary between imperfect TR motifs and non-repetitive sequences is blurred, the detected TRs need to be validated.

Results

Tally-2.0 is a scoring tool based on a machine learning (ML) approach that validates the results of TR detection. It was upgraded by using improved training datasets and additional ML features. Tally-2.0 achieves 93% sensitivity, 83% specificity and an area under the receiver operating characteristic curve of 95%.

Availability and implementation

Tally-2.0 is available as a web tool and as a standalone application, published under Apache License 2.0, at https://bioinfo.crbm.cnrs.fr/index.php?route=tools&tool=27. It is supported on Linux. Source code is available upon request.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Numerous studies demonstrate the fundamental functional importance of protein regions containing periodic sequences, that is, arrays of similar motifs that are directly adjacent to each other. The majority of proteins with these tandem repeats (TRs) in their sequences have repetitive non-globular arrangements in their 3D structures (Fraser and MacRae, 1973; Kajava, 2012). The functions of these regions also frequently differ from those of protein domains whose aperiodic sequences fold into globular structures. TR-containing proteins predominantly serve as structural blocks (e.g. collagen, silk, keratin, proteins of epithelial tissues), as large hub proteins involved in protein–protein interactions (LRR or HEAT proteins), as core elements of multi-protein machineries and as multivalent binders of ligands with periodic structures (Andrade and Bork, 1995; Fraser and MacRae, 1973; Kobe and Kajava, 2001).

The structural and functional differences between proteins with aperiodic and periodic sequences point to the importance of bioinformatics tools that can distinguish between these two types of sequences. Most of the existing methods (Biegert and Söding, 2008; Jorda and Kajava, 2009; Szklarczyk and Heringa, 2004) can detect perfect TRs; in many cases, however, TRs are imperfect, contain a number of mutations accumulated during evolution and cannot be easily identified. In this situation, the 3D structure of proteins can be used as a benchmarking criterion for TR detection in sequences. The majority of proteins with TRs are built of repetitive 3D structural blocks, and evolution cannot completely erase the repetitive patterns because some residues located in equivalent positions of the repeats are critical for maintaining a stable and functional structure. Previously, we developed a scoring tool called ‘Tally’, which is based on a machine learning (ML) approach and was trained and evaluated on curated datasets of ‘true’ TRs found both in sequence and in structure (TR-SS) and ‘false’ TRs found only in sequence but not in structure (TR-SNS) (Richard et al., 2016). Tally achieved a better separation between sequences with structural TRs and sequences of aperiodic structures than the other existing scoring procedures. In this work, we significantly improved this scoring tool by adding ML features and enlarging the curated benchmarking datasets. The dataset of ‘true’ TRs was enriched in nearly perfect TRs, allowing us to extend the application of Tally to the TRs of natively unfolded regions.

2 Materials and methods

Datasets: Previously, we built a positive set of 441 ‘true’ TRs found both in sequence and in structure and a negative set of 141 ‘false’ TRs found only in sequence but not in structure (Richard et al., 2016). Here, we improved these datasets by (i) increasing and equalizing the numbers of TRs in the positive and negative datasets (553 and 525, respectively), (ii) verifying and decreasing TR sequence redundancy in both datasets and (iii) choosing TRs that allow a more equal representation in terms of their perfection, length and number of repeats. The TR of a given region is presented as a multiple sequence alignment (MSA) of its repeats. For TR identification and generation of the MSAs, the T-REKS (Jorda and Kajava, 2009), TRUST (Szklarczyk and Heringa, 2004) and HHrepID (Biegert and Söding, 2008) programs were used.

ML algorithm and features: Previously, we generated 40 MSA-based ML features (Richard et al., 2016). In this work, we added three new features related to the number of gap openings in the MSA, and also a new family of 112 features based on the Fourier transform and physico-chemical characteristics of amino acids. These spectral features were developed on the basis of the Informational Spectrum Method (Veljkovic et al., 2007) and comprise four groups: (i) two features based on the amplitude values of the first peaks in the spectral representations of the MSA, (ii) eight features that represent sums of signal/noise values at spectral peaks, (iii) one noise-based feature and (iv) three entropy-based features, each group computed for eight amino acid characteristics from the AAIndex database (Nakai et al., 1988) (Supplementary Material). In the feature engineering process, we selected 55 of the 155 original attributes for the final model using sequential backward elimination (Saeys et al., 2007) as the feature selection algorithm (Supplementary Material). The backward feature elimination was done using the H2O.ai platform (2018) and a custom implementation in the R language. The H2O.ai platform (2018) was also used for the cross-validation process. The Tally-2.0 classifier was generated using the Random Forest (Breiman, 2001) classification algorithm, the method with the best prediction efficacy (see the comparison in Supplementary Material).
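The idea behind the spectral features can be illustrated with a minimal sketch: a sequence is encoded numerically with one amino acid property, a discrete Fourier transform is taken, and the dominant peak is compared with the average amplitude. This is not the Tally-2.0 implementation; the exact feature definitions and the eight AAIndex characteristics are given only in the Supplementary Material. The Kyte-Doolittle hydropathy scale, the function names `amplitude_spectrum` and `peak_signal_to_noise`, and the use of the mean amplitude as the noise estimate are all assumptions made for illustration.

```python
import cmath
import math

# Kyte-Doolittle hydropathy, used here only as a stand-in property scale
# (an assumption; the paper uses eight AAIndex characteristics).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amplitude_spectrum(sequence, scale=HYDROPATHY):
    """Return DFT amplitudes for k = 1..N//2 of the mean-centred encoding."""
    # Gaps and unknown symbols are skipped.
    signal = [scale[aa] for aa in sequence if aa in scale]
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two encodable residues")
    mean = sum(signal) / n
    centred = [x - mean for x in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        spectrum.append(abs(coeff))
    return spectrum


def peak_signal_to_noise(spectrum):
    """Return (k of the highest peak, peak amplitude / mean amplitude)."""
    mean_amp = sum(spectrum) / len(spectrum)
    peak = max(spectrum)
    k = spectrum.index(peak) + 1
    ratio = peak / mean_amp if mean_amp > 0 else 0.0
    return k, ratio


if __name__ == "__main__":
    seq = "AKL" * 8                      # a perfect repeat of period 3
    k, snr = peak_signal_to_noise(amplitude_spectrum(seq))
    print(k, len(seq) / k, round(snr, 2))  # frequency index, period, S/N
```

For a perfect repeat of period 3 and length 24, the only non-zero component in the half spectrum sits at k = 24/3 = 8, so the signal/noise ratio is close to its maximum; imperfect repeats spread energy over other frequencies and lower it. As a consistency check on the counts quoted above, 2 + 8 + 1 + 3 = 14 features for each of eight characteristics gives 112, and 40 + 3 + 112 = 155.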

3 Results

The Tally-2.0 classifier was implemented in Java using the ML platform H2O.ai (2018). As input, Tally-2.0 takes a list of TR regions presented as MSAs of their repeats. The calculation of the MSA-based features is implemented in Python and that of the spectral features in Java. The output lists the Tally-2.0 score and several other known TR scores (Psim, entropy, P-value-phylo and parsimony; Richard and Kajava, 2015), allowing users to validate the quality of the examined TRs. Tally-2.0, just like Tally (Richard et al., 2016), performs best with the Random Forest classifier (Supplementary Material), which indicates that the better results of the upgraded tool are due to the improved training datasets and additional ML features.
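To make the input representation concrete, the sketch below holds one TR as an MSA of its repeats and derives two simple quantities from it: the number of gap openings and a mean per-column Shannon entropy. Everything here is an assumption made for illustration: the list-of-strings layout is not the Tally-2.0 input format, and the entropy shown is a generic column entropy, not necessarily the 'entropy' score of Richard and Kajava (2015) or any of the actual Tally-2.0 features.

```python
import math
from collections import Counter


def gap_openings(msa, gap="-"):
    """Count runs of gap characters over all rows of the alignment."""
    total = 0
    for row in msa:
        previous_is_gap = False
        for symbol in row:
            is_gap = symbol == gap
            if is_gap and not previous_is_gap:
                total += 1          # a new gap run starts here
            previous_is_gap = is_gap
    return total


def mean_column_entropy(msa, gap="-"):
    """Average Shannon entropy (bits) over columns, ignoring gap symbols."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all repeats must be aligned to the same length")
    entropies = []
    for column in zip(*msa):
        residues = [s for s in column if s != gap]
        if not residues:
            continue                # skip all-gap columns
        n = len(residues)
        counts = Counter(residues)
        entropies.append(
            -sum(c / n * math.log2(c / n) for c in counts.values())
        )
    return sum(entropies) / len(entropies) if entropies else 0.0


if __name__ == "__main__":
    # Hypothetical TR with four aligned repeats.
    msa = ["LKELAG", "LKE-AG", "LRELAG", "LK--AG"]
    print(gap_openings(msa), round(mean_column_entropy(msa), 4))
```

A perfect TR gives zero entropy and no gap openings; both values grow as the repeats diverge, which is the kind of signal an MSA-based feature can pass to a classifier.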

The evaluation of Tally-2.0, carried out with 10-fold cross-validation, gave an area under the receiver operating characteristic curve (AUC) of 0.95 (Fig. 1a). At a threshold of 0.45, chosen by maximizing the F-score, Tally-2.0 achieves 0.88 accuracy, a 0.89 F-score, 83% specificity and a high sensitivity of 93%. In addition, we compared Tally-2.0 to existing scoring methods as follows: Tally-2.0 scores were obtained with 10-fold cross-validation on the positive and negative training sets, while the performance of the other scoring methods was evaluated by direct calculation of the scores on the complete training set. Our comparative analysis showed that Tally-2.0 separates sequences with and without TRs better than the other scoring procedures (Fig. 1).
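The relation between the reported figures can be checked with a short sketch. The confusion-matrix counts below are not reported in the paper; they are reconstructed by rounding 93% of the 553 positive and 83% of the 525 negative TRs, so they are an assumption. The helper names `confusion_metrics` and `best_f_threshold`, and the rule that a score at or above the threshold counts as a 'true' TR, are likewise illustrative.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Standard metrics from confusion-matrix counts."""
    denom_f = 2 * tp + fp + fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f_score": 2 * tp / denom_f if denom_f else 0.0,
    }


def best_f_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest F-score.

    A score >= threshold is predicted positive (assumed convention).
    """
    best_t, best_f = None, -1.0
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = len(scores) - tp - fp - fn
        f = confusion_metrics(tp, fn, tn, fp)["f_score"]
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


if __name__ == "__main__":
    # Counts reconstructed from rounded percentages (assumption).
    tp = round(0.93 * 553)   # 514
    tn = round(0.83 * 525)   # 436
    m = confusion_metrics(tp, 553 - tp, tn, 525 - tn)
    print({name: round(value, 2) for name, value in m.items()})

    # Toy data showing threshold selection by F-score maximization.
    print(best_f_threshold([0.1, 0.4, 0.5, 0.9], [0, 0, 1, 1],
                           [0.2, 0.45, 0.7]))
```

With these reconstructed counts the accuracy is 950/1078 ≈ 0.88 and the F-score is 1028/1156 ≈ 0.89, in agreement with the values stated above.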

Fig. 1. Comparative analysis of TR validators. For Tally-2.0, the ROC curve was obtained on the training set with 10-fold cross-validation, whereas the other existing scoring methods were evaluated directly on the Tally-2.0 training dataset. The AUC values, in decreasing order, are 0.95, 0.89, 0.83, 0.77, 0.73 and 0.67 for the Tally-2.0, Parsimony, Tally, P-value-phylo, Psim and Entropy scores, respectively.
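For readers less familiar with the AUC values compared in Fig. 1, the sketch below computes the AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. This pairwise formulation is equivalent to the area under the ROC curve; it is a generic illustration on toy data, not the evaluation code used for Tally-2.0, and the function name `roc_auc` is an assumption.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # correctly ordered pair
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy scores: three 'true' TRs (label 1) and two 'false' TRs (label 0).
    scores = [0.9, 0.8, 0.4, 0.35, 0.3]
    labels = [1, 1, 0, 1, 0]
    print(roc_auc(scores, labels))
```

In the toy data five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6. An AUC of 0.95 therefore means that a randomly chosen 'true' TR outscores a randomly chosen 'false' TR in about 95% of pairs.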

Initially, Tally was developed to distinguish between protein structures with repetitive and non-repetitive architectures and, therefore, its dataset was enriched in MSAs that were close to the boundary between these two classes of proteins. As a result, Tally did not score well the MSAs that were far from this boundary (e.g. almost perfect repeats or MSAs from aperiodic random sequences) (Richard et al., 2016). The updated dataset of ‘true’ TRs used to build Tally-2.0 was enriched, on the one hand, in perfect and almost perfect TRs and, on the other hand, in random aperiodic sequences. It is also important to note that the Tally input requires only sequence information. All this allowed us to cover the whole spectrum of MSAs and to extend the application of Tally to the TRs of natively unfolded (or intrinsically disordered) regions. Tally-2.0 can now be used in large-scale analyses as a uniform validator of TR detection. This is one of the most important applications of our tool because, at present, each TR detection program uses its own scoring measure. As a result, in previous large-scale surveys the proportion of TR-containing proteins in proteomes varied significantly (between 14% and 30%) (Marcotte et al., 1999; Pellegrini, 2015), and the question of the total number of TRs in proteomes still stands unanswered.

Thus, the standalone version of Tally-2.0 is suitable for the validation of large-scale analyses of TRs. In addition, the web-based version of Tally-2.0 allows users to validate imperfect TRs that they have identified in their proteins of interest.

Funding

This work was supported by the H2020-MSCA-RISE project REFRACT [GA No. 823886], the National Institute of Allergy and Infectious Diseases [Research Grant 1R01AI12123701] and Ministry of Education, Science and Technological Development of the Republic of Serbia [grant number 173001]. This work was done within COST Action BM1405.

Conflict of Interest: none declared.

Supplementary Material

btaa121_Supplementary_Data (537.5KB, doc)

References

Andrade M.A., Bork P. (1995) HEAT repeats in the Huntington’s disease protein. Nat. Genet., 11, 115–116.
Biegert A., Söding J. (2008) De novo identification of highly diverged protein repeats by probabilistic consistency. Bioinformatics, 24, 807–814.
Breiman L. (2001) Random forest. Machine Learn., 45, 5–32.
Fraser R.D.B., MacRae T.P. (1973) Conformation in Fibrous Proteins and Related Synthetic Polypeptides. Academic Press, New York and London.
Jorda J., Kajava A.V. (2009) T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm. Bioinformatics, 25, 2632–2638.
Kajava A.V. (2012) Tandem repeats in proteins: from sequence to structure. J. Struct. Biol., 179, 279–288.
Kobe B., Kajava A.V. (2001) The leucine-rich repeat as a protein recognition motif. Curr. Opin. Struct. Biol., 11, 725–732.
Marcotte E.M. et al. (1999) A census of protein repeats. J. Mol. Biol., 293, 151–160.
Nakai K. et al. (1988) Cluster analysis of amino acid indices for prediction of protein structure and function. Prot. Eng., 2, 93–100.
Pellegrini M. (2015) Tandem repeats in proteins: prediction algorithms and biological role. Front. Bioeng. Biotechnol., 3, 143.
Richard F.D. et al. (2016) Tally: a scoring tool for boundary determination between repetitive and non-repetitive protein sequences. Bioinformatics, 32, 1952–1958.
Richard F.D., Kajava A.V. (2015) In search of the boundary between repetitive and non-repetitive protein sequences. Biochem. Soc. Trans., 43, 807–811.
Saeys Y. et al. (2007) A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517.
Szklarczyk R., Heringa J. (2004) Tracking repeats using significance and transitivity. Bioinformatics, 20 (Suppl. 1), i311–i317.
Veljkovic V. et al. (2007) Application of the EIIP/ISM bioinformatics concept in development of new drugs. Curr. Med. Chem., 14, 441–453.
