Abstract
Summary: Intragenic duplications of genetic material have important biological roles because of their protein sequence and structural consequences. We developed Swelfe to find internal repeats at three levels. Swelfe quickly identifies statistically significant internal repeats in DNA and amino acid sequences and in 3D structures using dynamic programming. The associated web server also shows the relationships between repeats at each level and facilitates visualization of the results.
Availability: http://bioserv.rpbs.jussieu.fr/swelfe
Contact: annela@abi.snv.jussieu.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Duplications play a major role in genome evolution by creating and modifying cellular functions (Marcotte et al., 1999). Duplications can be large, up to the entire genome, or small, down to small parts of genes. While genome and gene duplications have been extensively studied, few works have aimed at identifying and studying intragenic repeats. These arise in DNA but are selected for their functional and structural consequences. Therefore, the simultaneous study of repeats at DNA, protein sequence and protein structure levels is necessary to understand their biological role.
Currently, no tool allows for the integrated analysis of internal repeats at the three levels. Several programs efficiently detect large, very similar DNA repeats [e.g. REPuter (Kurtz and Schleiermacher, 1999), Repseek (Achaz et al., 2007)] or tandem repeats [e.g. Tandem Repeats Finder (Benson, 1999)]. However, methods are lacking to identify small, closely spaced and divergent repeats using appropriate substitution matrices and statistical procedures. Some programs detect structural similarities [Vast (Gibrat et al., 1996), CE (Shindyalov and Bourne, 1998), DALI (Holm and Sander, 1993)], but they are slow and not adapted to detecting internal similarities. Our tool, Swelfe, uses conceptually the same algorithm to detect internal similarities at these three levels, which makes it possible to analyze the evolution of DNA repeats in the light of their effects on protein sequence and structure. This facilitates pinpointing sequence-structure associations and understanding the evolutionary forces acting upon these elements.
2 ALGORITHM AND STATISTICS
Swelfe identifies repeats by alignment of DNA sequences, amino acid sequences and three-dimensional (3D) structures. As a preliminary step, 3D structures are encoded as linear sequences of α angles, where an α angle is the dihedral angle defined by four consecutive Cα atoms (Usha and Murthy, 1986) (Supplementary Fig. 1). Strings of α angles have been shown to be a very compact way of representing protein backbones while conserving most of the structural features of the peptide skeleton (Carpentier et al., 2005). In Supplementary Materials we present comparisons with DALI showing that Swelfe is capable of finding very distant similarities even in the absence of classical secondary structural elements. Using this description, we find repeats by dynamic programming with the Huang and Miller algorithm (Huang and Miller, 1991; Huang et al., 1990) on sequences and protein structures (Supplementary Fig. 2). The scoring system was adapted to each level (see Supplementary Table 1 for formulae and default parameters). For protein sequences, Swelfe can use any BLOSUM or PAM matrix; for DNA, it generates a similarity matrix that explicitly accounts for the frequency of each nucleotide (Achaz et al., 2007). The structural score for two matching α angles increases as the circular difference between them decreases, and it also accounts for the relative frequencies of α angles in the PDB (Supplementary Fig. 3). Thus very frequent angles, e.g. those originating from α-helices or β-sheets, receive a lower score.
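The α-angle encoding and the circular difference described above can be illustrated with a minimal Python sketch. It is not taken from the Swelfe source code: the function names (`dihedral`, `alpha_angles`, `circular_difference`) and the use of plain coordinate tuples are assumptions made for illustration, and the actual scoring formula (Supplementary Table 1) is not reproduced here.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees, in (-180, 180]) defined by four 3D points."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)   # normals of the two planes
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def alpha_angles(ca_coords):
    """Encode a backbone as the alpha angles of successive C-alpha quadruplets.

    A chain of n C-alpha atoms yields n - 3 angles.
    """
    return [dihedral(*ca_coords[i:i + 4]) for i in range(len(ca_coords) - 3)]

def circular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)
```

For instance, `circular_difference(170, -170)` is 20 rather than 340, which is why two angles on either side of the ±180° boundary are still treated as similar.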
As post-processing steps, we check that the sequence repeats are statistically significant (see below). Since a succession of imperfectly matching α angles could theoretically lead to a poor overall superposition of the repeats, we also check that the relative root mean square deviation (RRMSD) (Betancourt and Skolnick, 2001) between the two copies of the repeat is low. The default threshold (0.5) corresponds to a probability of 10⁻³ of finding such a low RRMSD in a 20-residue substructure. The vast majority of significant repeats we find in PDB structures have much lower RRMSD values (see histograms of RRMSD and RMSD distributions in Supplementary Material). Along with Swelfe, we provide a Python script that filters and simplifies the output by removing highly overlapping successive repeats (default: >50% overlap). Most parameters of Swelfe can be tuned as described in the manual. An example of a protein exhibiting a repeat at the three levels is shown in Figure 1.
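The overlap filter mentioned above can be sketched as follows. This is a hypothetical illustration, not the script distributed with Swelfe: the repeat representation (a dict with `copy1`, `copy2` and `score`), the greedy highest-score-first strategy, and the definition of overlap as the fraction of the shorter interval are all assumptions. Only the 50% default comes from the text.

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """Fraction of the shorter interval covered by the intersection.

    Coordinates are inclusive positions along the sequence.
    """
    inter = min(a_end, b_end) - max(a_start, b_start) + 1
    if inter <= 0:
        return 0.0
    shorter = min(a_end - a_start, b_end - b_start) + 1
    return inter / shorter

def filter_overlapping(repeats, max_overlap=0.5):
    """Keep the best-scoring repeat among groups of highly overlapping ones.

    A repeat is dropped when BOTH of its copies overlap the corresponding
    copies of an already kept repeat by more than max_overlap.
    """
    kept = []
    for rep in sorted(repeats, key=lambda r: r["score"], reverse=True):
        redundant = any(
            interval_overlap(*rep["copy1"], *k["copy1"]) > max_overlap
            and interval_overlap(*rep["copy2"], *k["copy2"]) > max_overlap
            for k in kept
        )
        if not redundant:
            kept.append(rep)
    return kept

# Hypothetical repeats: the second is a slightly shifted version of the first.
example = [
    {"copy1": (10, 40), "copy2": (60, 90), "score": 120},
    {"copy1": (12, 42), "copy2": (62, 92), "score": 100},
    {"copy1": (100, 130), "copy2": (150, 180), "score": 90},
]
filtered = filter_overlapping(example)
```

Requiring both copies to overlap is a deliberately conservative choice in this sketch: two repeats that share one copy but pair it with different partners are kept as distinct.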
To assign a statistical significance to repeats in sequences, we implemented the Waterman and Vingron method (Waterman and Vingron, 1994). The P-value is computed using the distribution of scores in a large number of random sequences generated by shuffling the codons or amino acids of the original sequence. A full description can be found in Supplementary Material. We observed that drawing 100 random sequences is enough in most cases to obtain the most significant repeats (see Supplementary Fig. 4). The same authors also proposed a faster ‘declumping estimation’ method that uses fewer (e.g. 20) random sequences. We implemented it in Swelfe (see Supplementary Fig. 5). We find it to be 6 (DNA) to 10 (amino acids) times faster when calculating the same number of scores on random sequences, and we recommend it as a preliminary filter when scanning large databases.
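The shuffling step can be illustrated with a short Python sketch. It makes several simplifying assumptions and is not the procedure implemented in Swelfe: it reports a plain empirical P-value (the fraction of shuffled sequences scoring at least as high as the original) rather than the Waterman and Vingron estimator or the declumping variant, and the toy scoring function `best_internal_score` (a self-alignment restricted to the upper triangle, with fixed match, mismatch and gap values) merely stands in for the Huang and Miller algorithm with proper substitution matrices. All function names and parameters are hypothetical.

```python
import random

def best_internal_score(seq, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of seq against itself, using only cells j > i.

    Restricting to the upper triangle excludes the trivial self-match on the
    main diagonal. Toy scores; copies are not prevented from overlapping.
    """
    n = len(seq)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def shuffled(seq, unit, rng):
    """Shuffle seq in blocks of `unit` symbols (3 for codons, 1 for residues)."""
    chunks = [seq[i:i + unit] for i in range(0, len(seq), unit)]
    rng.shuffle(chunks)
    return "".join(chunks)

def empirical_p_value(seq, n_random=100, unit=1, seed=0):
    """Return (observed score, empirical P-value) from n_random shuffles."""
    rng = random.Random(seed)
    observed = best_internal_score(seq)
    hits = sum(
        best_internal_score(shuffled(seq, unit, rng)) >= observed
        for _ in range(n_random)
    )
    # Add-one correction: the estimate can never be exactly zero.
    return observed, (hits + 1) / (n_random + 1)
```

With this simple estimator, 100 shuffles cannot yield a P-value below 1/101, which is one reason a fitted score distribution, as in the Waterman and Vingron approach, is preferable for assessing highly significant repeats.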
For structural alignments, there is currently no well-accepted method to assign statistical significance to alignment scores. We thus chose a conservative default score threshold (250°, followed by the RRMSD filter described earlier) based on the analysis of the resulting structural alignments. This default value leads to finding approximately the same number of repeats at the amino acid and structure levels for the PDB proteins.
3 IMPLEMENTATION
Swelfe is written in C, and we offer pre-compiled binaries (Linux and Mac OS X) as well as the source code. Swelfe is rather fast. Using a Xeon MacPro, we analyzed the 9537 proteins from the ‘clusters50’ subset of the PDB (i.e. structures having <50% sequence identity with each other) for which we found DNA and amino acid sequences. The program took less than a minute to find the 3D repeats or the amino acid repeats, and 5 min for the DNA repeats. Statistical evaluation slows the program because it requires generating and analyzing random DNA and protein sequences: when we repeated the same analysis including statistical evaluation of repeats with default parameters, the program took about 20 h to find and classify all DNA repeats and 30 min for the amino acid repeats. It uses ∼16 MB of RAM for the DNA bank. The web server interface allows drawing relationships between the results at the three levels and visualizing the 3D structural results using Jmol (www.jmol.org). We also built a databank explicitly linking PDB structures with their genes and amino acid sequences through extensive similarity searches. This databank contains 85 845 entries, thus allowing extensive analyses at the three levels, and is available from the authors upon request.
ACKNOWLEDGEMENTS
Swelfe is hosted by Ressource Parisienne en BioInformatique Structurale (RPBS).
Funding: Grants from Region Ile-de-France to ALA, ACI IMPBIO to EvolRep and ANR-06-CIS to project PROTEUS.
Conflict of Interest: none declared.
REFERENCES
- Achaz G, et al. Repseek, a tool to retrieve approximate repeats from large DNA sequences. Bioinformatics. 2007;23:119–121. doi: 10.1093/bioinformatics/btl519.
- Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580. doi: 10.1093/nar/27.2.573.
- Betancourt MR, Skolnick J. Universal similarity measure for comparing protein structures. Biopolymers. 2001;59:305–309. doi: 10.1002/1097-0282(20011015)59:5<305::AID-BIP1027>3.0.CO;2-6.
- Carpentier M, et al. YAKUSA: a fast structural database scanning method. Proteins. 2005;61:137–151. doi: 10.1002/prot.20517.
- Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986;5:823–826. doi: 10.1002/j.1460-2075.1986.tb04288.x.
- Gibrat JF, et al. Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 1996;6:377–385. doi: 10.1016/s0959-440x(96)80058-3.
- Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 1993;233:123–138. doi: 10.1006/jmbi.1993.1489.
- Huang X, Miller W. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. 1991;12:337–357.
- Huang XQ, et al. A space-efficient algorithm for local similarities. Comput. Appl. Biosci. 1990;6:373–381. doi: 10.1093/bioinformatics/6.4.373.
- Kurtz S, Schleiermacher C. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics. 1999;15:426–427. doi: 10.1093/bioinformatics/15.5.426.
- Marcotte EM, et al. A census of protein repeats. J. Mol. Biol. 1999;293:151–160. doi: 10.1006/jmbi.1999.3136.
- Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998;11:739–747. doi: 10.1093/protein/11.9.739.
- Usha R, Murthy MR. Protein structural homology: a metric approach. Int. J. Pept. Protein Res. 1986;28:364–369. doi: 10.1111/j.1399-3011.1986.tb03267.x.
- Waterman MS, Vingron M. Rapid and accurate estimates of statistical significance for sequence data base searches. Proc. Natl Acad. Sci. USA. 1994;91:4625–4628. doi: 10.1073/pnas.91.11.4625.