Nucleic Acids Research. 2014 Jun 27;42(Web Server issue):W356–W360. doi: 10.1093/nar/gku459

SARA-Coffee web server, a tool for the computation of RNA sequence and structure multiple alignments

Paolo Di Tommaso 1,2, Giovanni Bussotti 3, Carsten Kemena 4, Emidio Capriotti 5, Maria Chatzou 1,2, Pablo Prieto 1,2, Cedric Notredame 1,2,*
PMCID: PMC4086076  PMID: 24972831

Abstract

This article introduces the SARA-Coffee web server, a service for the online computation of 3D structure-based multiple RNA sequence alignments. The server makes it possible to combine sequences with and without known 3D structures. Given a set of sequences, SARA-Coffee outputs a multiple sequence alignment along with a reliability index for every sequence, column and aligned residue. SARA-Coffee combines SARA, a pairwise structural RNA aligner, with the R-Coffee multiple RNA aligner in a way that has been shown to improve alignment accuracy over most sequence aligners when enough structural data is available. The server can be accessed from http://tcoffee.crg.cat/apps/tcoffee/do:saracoffee.

INTRODUCTION

Phylogeny reconstruction and homology-based annotation are two of the most common modeling procedures in biology. Both require the assembly of accurate multiple sequence alignments (MSAs). In this work, we introduce a web server dedicated to 3D structure-based multiple RNA sequence alignment using the SARA-Coffee package (1). SARA-Coffee allows the combination of sequences with and without known tertiary structure. It is a suitable companion tool for any modeling technique that can benefit from a structurally accurate ribonucleic acid (RNA) MSA. This includes the construction of profile stochastic context-free grammars (SCFGs) with packages such as Infernal (2,3).

Accurately aligning RNA sequences is especially important in a context where recent developments in genomics have fueled the detection of thousands of expressed loci and challenged the long-held view that non-coding RNA (ncRNA) functions were supported by a much smaller number of genes than their protein-coding counterparts. The ENSEMBL ncRNA gene count has grown from 5732 in October 2008 to 22 643 in January 2014, thus overtaking protein-coding genes. It remains unclear, however, what proportion of these genes simply results from spurious transcription. One should also bear in mind that the term ncRNA encompasses a very heterogeneous gene population, including about 14 000 long ncRNAs (of mostly unknown function) and a bit less than 9000 short ncRNAs that are often better characterized biologically. This last group includes snoRNAs, microRNA precursors, transfer RNAs (tRNAs) and in general most of the highly structured ncRNAs catalogued in the RNA families database (RFAM) (4). SARA-Coffee has been specifically designed for this category of ncRNA genes.

SARA-Coffee's main characteristic is to combine sequence and structure pairwise alignment methods into a unified multiple sequence alignment, using SARA (5) as a 3D structure pairwise alignment engine. When used as a pure 3D structural aligner, SARA-Coffee is limited by the small number of available RNA PDB structures. These constitute merely ∼3% of all PDB entries (2941 out of 99 775 in March 2014) and their pace of determination remains significantly slower than that of proteins, with a doubling time of ∼7 years, as compared to 5 years for proteins. Yet, in its sequence/structure hybrid mode, SARA-Coffee can be used to combine available 3D structural information with the very large number of novel uncharacterized ncRNA sequences reported by genome sequencing projects.

The problem of aligning RNA sequences is rather complex. Even though RNA molecules often contain evolutionarily conserved secondary structures, their primary structure signal is rarely as strong as the one resulting from the protein three-letter meta-alphabet. As a consequence, higher levels of sequence identity are required to infer homology: 60% as compared to 25% in proteins (6–9). Integrating additional information, e.g. secondary structure signal, into the alignment process is possible but can be computationally expensive. The first attempt to do so was described by Sankoff and later improved using profile stochastic context-free grammars (SCFGs) (2). Over the years, many methods have been reported for this purpose; a non-exhaustive list includes Foldalign (10), Stemloc (11), Consan (11), LocARNA (12), R-Coffee (13,14) and MAFFT (15). All of them constitute an attempt to capture the secondary structure signal contained in RNA molecules in order to build more informative sequence alignment models. These models can then be passed to SCFG profile building tools in order to build co-variation models like the ones used by RFAM.

These multiple alignment heuristics only address the issue of aligning sequences using experimental or predicted secondary structure information. In some recent work, we have shown that one can also combine sequence and experimental 3D structural information (1) into an MSA. This process requires the ability to superpose RNA 3D structures, a problem similar in nature to protein structure comparison and for which several pairwise comparison tools have been developed, including SARA (5), DIAL (16), iPARTS (17), ARTS (18) and R3D Align (19). Our main motivation when developing SARA-Coffee has been to turn these pairwise 3D structure aligners into a method able to deliver 3D-based multiple sequence alignments. We did so by integrating SARA within the T-Coffee consistency framework. It must be stressed that this approach is generic enough to be applied to any of the above-mentioned RNA 3D pairwise aligners.

In a previous study, SARA-Coffee was extensively validated on a purpose-built reference dataset named BraliDART (1), made of 41 RNA families containing between 4 and 71 members with known 3D structures. This dataset was assembled so as to include only high quality X-ray structures and to exclude any discontinuous structure, on which the algorithm is expected to perform poorly. Our benchmarks indicate that SARA-Coffee is significantly more accurate than alternative sequence-based methods, even those using predicted secondary structures. Overall, SARA-Coffee was reported to be over 10 percentage points more accurate than primary structure-based aligners, judging from its capacity to properly align pairs of interacting residues. On dense secondary structures (in which 70% or more of the nucleotides form Watson and Crick base pairs) SARA-Coffee outperforms all tested methods, including the ones using predicted secondary structure information. It is about 3 percentage points more accurate than MXSCARNA (20), the second best method in this benchmark. Detailed structural analysis using a distance-Root Mean Square Deviation (dRMSD) measurement also indicates that SARA-Coffee produces alignments significantly superior to all methods tested in the study. Overall, SARA-Coffee alignments have a dRMSD 10 to 20% lower than alternative models (e.g. 4.53 Å against 5.11 Å for LocARNA, the next best method in terms of 3D modeling).

ALGORITHM

SARA-Coffee (1) produces 3D structure-based multiple RNA sequence alignments. Its algorithm can be described as a combination of SARA (5), a pairwise RNA structural aligner, and R-Coffee (14), a T-Coffee (22) based multiple RNA sequence aligner. The algorithm is described extensively in (1) and its general flow can be summarized as follows:

By default, SARA-Coffee takes as input a set of RNA sequences with known 3D structures. The first step of the algorithm involves aligning all possible pairs of the provided sequences using SARA. The result is an alignment library populated with SARA's 3D structure-based pairwise sequence alignments. In practice, the alignment library is a list of all the residue pairs found aligned in any of the compiled pairwise alignments. The second step involves extracting contact information from each input structure using the -p mode of x3dna/find_pairs (21), which reports all base pairs, including the non-canonical and higher-order (3+) ones. This contact information is then used to extend the alignment library so that its alignments become compatible with 2D and 3D contacts. This is achieved by adding pairs of aligned residues that are implied by the contacts but missing from the library. For instance, let us consider a contact between residues X and Y in sequence A and another contact between residues W and Z in sequence B. If X and W are found aligned in the library, then the contact-based extension will involve adding the pair YZ to the library, thus ensuring full compatibility between the library and the contacts. If Y (or Z) is already aligned to other residues in the library, this process will introduce incompatibilities that will be resolved through consistency analysis when incorporating the sequences into the final MSA. Once extended using contact information, the library can be fed to the default T-Coffee algorithm. This contact-based extension is the main feature of the R-Coffee algorithm.
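
The contact-based extension described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SARA-Coffee implementation: the function name `extend_library`, the representation of a residue as a `(sequence_name, position)` tuple and the single-pass strategy are all hypothetical choices made for the example.

```python
from itertools import product


def extend_library(library, contacts):
    """Add the residue pairs implied by base-pair contacts (illustrative sketch).

    library  : set of ((seq_a, i), (seq_b, j)) aligned residue pairs
    contacts : dict mapping a sequence name to a set of (i, j) contacts
    """
    # Map every residue to its contact partners within its own sequence.
    partners = {}
    for seq, pairs in contacts.items():
        for i, j in pairs:
            partners.setdefault((seq, i), set()).add((seq, j))
            partners.setdefault((seq, j), set()).add((seq, i))

    extended = set(library)
    for x, w in library:
        # X contacts Y in sequence A and W contacts Z in sequence B:
        # if X and W are aligned, the pair YZ is added to the library.
        for y, z in product(partners.get(x, ()), partners.get(w, ())):
            extended.add((y, z))
    return extended


# Hypothetical toy data: residue 1 of A is aligned with residue 1 of B.
library = {(("A", 1), ("B", 1))}
contacts = {"A": {(1, 10)}, "B": {(1, 12)}}
extended = extend_library(library, contacts)
```

In this toy case the extension adds the pair formed by residue 10 of A and residue 12 of B. Conflicts created by such additions are deliberately left in the library, since the text states that they are resolved later by the consistency analysis.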

Figure 1.

SARA-Coffee Server output. Sequences are colored according to their alignment reliability. The top block indicates the reliability of each individual sequence alignment on a 0–100 scale. Sequences are named after their PDB identifier and chain. In the main alignment, dark red regions are the most reliable; the bottom line (cons) indicates the most reliable columns.

T-Coffee uses a tree-based progressive alignment algorithm. It starts by estimating the similarity between every pair of sequences by counting words of size four and then uses neighbor joining to turn the resulting distance matrix into a binary guide tree. Sequences are then incorporated into the MSA following the guide tree. The guide tree being binary, its resolution involves aligning at each node either two sequences, a sequence and a profile or two profiles, until reaching the root where the full MSA is resolved. At each node the pairwise alignments are computed using the Gotoh dynamic programming algorithm. The main characteristic of T-Coffee is the ability to use the library described above instead of a standard substitution matrix. When using the library, the score for aligning two symbols is set to be equal to the number of times these symbols are found aligned in the library, either directly or by combining any two pairs of aligned residues (i.e. the pair XW of aligned residues may be supported by a combination of the two library pairs XK and KW, K being a residue from a third sequence). This consistency analysis helps decide between conflicting library pairs that may result from the secondary structure-based extension. No gap penalty scheme is needed at that stage.
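
As an illustration of the library-based scoring described above, the following Python sketch counts the direct and indirect support for a residue pair. It is a simplified reading of the text rather than the T-Coffee code: the names `build_index` and `consistency_score` are hypothetical, residues are represented as `(sequence_name, position)` tuples, and every library entry is given the same weight of one.

```python
def build_index(library):
    """Index a list of aligned residue pairs by residue (illustrative sketch)."""
    neighbours = {}
    for a, b in library:
        # Count the pair in both directions so lookups are symmetric.
        neighbours.setdefault(a, {}).setdefault(b, 0)
        neighbours[a][b] += 1
        neighbours.setdefault(b, {}).setdefault(a, 0)
        neighbours[b][a] += 1
    return neighbours


def consistency_score(x, w, neighbours):
    """Support for aligning x with w: direct pairs plus X-K / K-W bridges."""
    nx = neighbours.get(x, {})
    nw = neighbours.get(w, {})
    direct = nx.get(w, 0)
    # A third residue K supports the pair when both XK and KW are in the library.
    indirect = sum(1 for k in nx if k != w and k in nw)
    return direct + indirect


# Hypothetical toy library: XW found directly, plus XK and KW through sequence C.
X, W, K = ("A", 1), ("B", 1), ("C", 2)
index = build_index([(X, W), (X, K), (K, W)])
score_xw = consistency_score(X, W, index)
```

Here the pair XW receives one unit of direct support and one unit of support through K, which is how a residue from a third sequence can reinforce, or fail to reinforce, a candidate pair when library entries conflict.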

This same algorithm also makes it possible to combine structures and sequences. When doing so, the library is built in a similar way, but in that case SARA is only applied to sequence pairs in which both sequences have an experimental 3D structure. In all other cases, a pair-HMM (proba_pair) adapted from ProbCons is used to produce the pairwise comparisons and to populate the library with residue pairs having a high alignment posterior probability (22). In the next step, the contact list for sequences with no experimental 3D structure is replaced with a secondary structure prediction, as provided by RNAplfold from the Vienna package (23). The rest of the algorithm (extension and alignment computation) is identical.
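
The dispatch between methods in the hybrid mode can be summarized with a short Python sketch. The function names and the string labels below are hypothetical placeholders used only to show which tool the text assigns to each case; they are not actual command names or SARA-Coffee options.

```python
from itertools import combinations


def choose_pairwise_methods(names, has_structure):
    """Pick the pairwise aligner for every sequence pair (illustrative sketch)."""
    plan = {}
    for a, b in combinations(names, 2):
        if has_structure[a] and has_structure[b]:
            # Both members have an experimental 3D structure: use SARA.
            plan[(a, b)] = "sara"
        else:
            # At least one member lacks a structure: use the pair-HMM.
            plan[(a, b)] = "proba_pair"
    return plan


def contact_source(name, has_structure):
    """Pick where the contact list of one sequence comes from."""
    return "x3dna" if has_structure[name] else "RNAplfold"


# Hypothetical input: two sequences with a structure and one without.
names = ["1abc_A", "2xyz_B", "seq3"]
has_structure = {"1abc_A": True, "2xyz_B": True, "seq3": False}
plan = choose_pairwise_methods(names, has_structure)
```

With three sequences of which two have structures, only one of the three pairs is handled by SARA, which illustrates why the hybrid mode remains usable when structural data is sparse.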

SARA-COFFEE WEB SERVER

The SARA-Coffee web server is part of the T-Coffee web platform; its access is free and unrestricted, with no login procedure. The server is accessible from http://tcoffee.crg.cat/apps/tcoffee/do:saracoffee with any standard Internet browser (Mozilla Firefox 5+, Google Chrome, Internet Explorer 8+, Safari 6+ and Opera 11+). Results can be retrieved from the web server or received by email if requested. Anonymous jobs nonetheless remain available from the submission terminal thanks to a cookie-based cache. Results are kept on the web server for a month after computation and are assigned a permanent URL during this time.

Input

Sequences must be provided in FASTA format. Sequences with a known PDB structure must be named after their PDB identifier, including the chain index separated by an underscore character (i.e. >PdbID_chain); other sequences can be given arbitrary names. White spaces are forbidden, all sequences are required to have a different name and the provided primary sequences must match the PDB SEQRES field. The input interface also gives access to an advanced mode that makes it possible to control several output options, including format, case, residue numbering, output order and interleaved format block length. Once submitted, the server runs the default SARA-Coffee on the provided dataset. Results are returned via the same interface but can also be accessed via the history.
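
Before submitting a dataset, the naming rules above can be checked locally. The Python sketch below is a convenience example and not part of the server: the function name `check_fasta` is hypothetical, and the regular expression assumes a classic four-character PDB identifier followed by a one-character chain, which may be stricter or looser than what the server accepts. It does not verify that a sequence matches the PDB SEQRES field.

```python
import re

# Assumption: four-character PDB ID starting with a digit, then "_" and one chain character.
PDB_NAME = re.compile(r"^[0-9][A-Za-z0-9]{3}_[A-Za-z0-9]$")


def check_fasta(text):
    """Return (names, structure-backed names, errors) for a FASTA string."""
    names, errors = [], []
    for line in text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines are not inspected in this sketch
        name = line[1:]
        if not name or any(ch.isspace() for ch in name):
            errors.append(f"empty name or white space in name: {name!r}")
        if name in names:
            errors.append(f"duplicate name: {name!r}")
        names.append(name)
    with_structure = [n for n in names if PDB_NAME.match(n)]
    return names, with_structure, errors


# Hypothetical input with one PDB-style name, one duplicate and one invalid name.
example = ">1EHZ_A\nGCGGAUUU\n>my_seq\nACGU\n>my_seq\nACGU\n>bad name\nAC\n"
names, with_structure, errors = check_fasta(example)
```

On this input the check reports two problems, the duplicated `my_seq` and the white space in `bad name`, and recognizes `1EHZ_A` as the only entry named in the PdbID_chain style.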

Computation

The main limiting step of SARA-Coffee is the computation of the SARA pairwise 3D structure-based sequence alignments. For two tRNA structures, SARA requires about 5 s. On a dataset of 71 tRNAs with known 3D structures, the server takes about 4 h; on a dataset containing only 5 tRNAs, like the one provided as a test, it requires about 2 min.
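
These timings are consistent with a simple back-of-the-envelope estimate, since all pairs of structures are aligned with SARA and n structures give n(n-1)/2 pairs. The Python sketch below only reproduces this arithmetic; the function name is hypothetical, the 5 s per pair is the tRNA figure quoted above, and the remaining steps of the pipeline are ignored.

```python
def sara_pairwise_estimate(n_structures, seconds_per_pair=5.0):
    """Number of SARA pairwise alignments and their rough total cost in seconds."""
    n_pairs = n_structures * (n_structures - 1) // 2
    return n_pairs, n_pairs * seconds_per_pair


for n in (5, 71):
    pairs, seconds = sara_pairwise_estimate(n)
    print(f"{n} structures: {pairs} pairs, about {seconds / 3600:.2f} h of SARA time")
```

For 71 tRNAs this gives 2485 pairs and roughly 12 425 s (about 3.5 h) of SARA computation, which accounts for most of the reported 4 h. For 5 tRNAs it gives 10 pairs and about 50 s, so the reported 2 min presumably includes the other steps and server overhead.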

Output

The result page displays the following items in order:

  1. MSA: Shows the resulting interleaved MSA. This graphic is the HTML rendering of the file *.score_html, available for download from the next section. The MSA is colored according to the T-Coffee reliability scheme (24) where red and orange bits correspond to alignment portions for which there is a high consistency within the SARA-Coffee pairwise library, while lighter bits (green, yellow and blue) correspond to the less well supported portions, expected to be less accurately aligned.

  2. Citation: link to the relevant publications when using this server.

  3. The result files produced by the submitted SARA-Coffee alignment. All files can be downloaded as a single zip file by clicking the ‘download all’ link. They can also be automatically imported into the user's Dropbox account, if available.

  4. Info: some information about the executed job.

  5. Replay: this feature allows the users to re-run the job while modifying some input options or data.

  6. Feedback: this feature encourages users to provide some feedback via social media.

CONCLUSION

We describe the SARA-Coffee web server, a web-based tool able to incorporate 3D structure information within RNA multiple sequence alignments by combining sequence and structural information. SARA-Coffee has been shown to produce accurate alignments as judged from structural analysis. This server makes it possible to combine users' data with publicly available RNA 3D structures, so as to obtain the required models. Future improvements will involve the possibility of uploading user-defined 3D models as well as new output formats providing a local visualization of structural similarity.

ACKNOWLEDGMENT

We thank the reviewers, especially Sean Eddy, for very constructive suggestions, observations and criticisms.

FUNDING

Plan Nacional [BFU2011-28575 to C.N. and P.D.]; Quantomics [KBBE-2A 222664]; Center for Genomics Regulation (CRG); La Caixa Fellowship program (to M.C.); Spanish Ministry of Economy and Competitiveness [BES-2012–051918 to P.B.]. EMBL Interdisciplinary Postdoc (EIPOD) under Marie Curie Actions (COFUND) (to G.B.); Department of Pathology at the University of Alabama at Birmingham (to E.C.). Funding for open access charge: Plan Nacional [BFU2011-28575 to C.N. and P.D.]; Quantomics [KBBE-2A 222664]; Center for Genomics Regulation (CRG); La Caixa Fellowship program (to M.C.); Spanish Ministry of Economy and Competitiveness [BES-2012–051918 to P.B.]. EMBL Interdisciplinary Postdoc (EIPOD) under Marie Curie Actions (COFUND) (to G.B.); Department of Pathology at the University of Alabama at Birmingham (to E.C.); Computational resources are provided by the Center for Genomic Regulation (CRG).

Conflict of interest statement. None declared.

REFERENCES

  • 1. Kemena C., Bussotti G., Capriotti E., Marti-Renom M.A., Notredame C. Using tertiary structure for the computation of highly accurate multiple RNA alignments with the SARA-Coffee package. Bioinformatics. 2013;29:1112–1119. doi: 10.1093/bioinformatics/btt096.
  • 2. Eddy S.R., Durbin R. RNA sequence analysis using covariance models. Nucleic Acids Res. 1994;22:2079–2088. doi: 10.1093/nar/22.11.2079.
  • 3. Eddy S.R. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23:205–211.
  • 4. Burge S.W., Daub J., Eberhardt R., Tate J., Barquist L., Nawrocki E.P., Eddy S.R., Gardner P.P., Bateman A. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 2013;41:D226–D232. doi: 10.1093/nar/gks1005.
  • 5. Capriotti E., Marti-Renom M.A. RNA structure alignment by a unit-vector approach. Bioinformatics. 2008;24:i112–i118. doi: 10.1093/bioinformatics/btn288.
  • 6. Capriotti E., Marti-Renom M.A. Quantifying the relationship between sequence and three-dimensional structure conservation in RNA. BMC Bioinformatics. 2010;11:322. doi: 10.1186/1471-2105-11-322.
  • 7. Abraham M., Dror O., Nussinov R., Wolfson H.J. Analysis and classification of RNA tertiary structures. RNA. 2008;14:2274–2289. doi: 10.1261/rna.853208.
  • 8. Gardner P.P., Wilm A., Washietl S. A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 2005;33:2433–2439. doi: 10.1093/nar/gki541.
  • 9. Rost B. Twilight zone of protein sequence alignments. Protein Eng. 1999;12:85–94. doi: 10.1093/protein/12.2.85.
  • 10. Havgaard J.H., Torarinsson E., Gorodkin J. Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix. PLoS Comput. Biol. 2007;3:1896–1908. doi: 10.1371/journal.pcbi.0030193.
  • 11. Holmes I. Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics. 2005;6:73. doi: 10.1186/1471-2105-6-73.
  • 12. Will S., Reiche K., Hofacker I.L., Stadler P.F., Backofen R. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput. Biol. 2007;3:e65. doi: 10.1371/journal.pcbi.0030065.
  • 13. Taly J.F., Magis C., Bussotti G., Chang J.M., Di Tommaso P., Erb I., Espinosa-Carrasco J., Kemena C., Notredame C. Using the T-Coffee package to build multiple sequence alignments of protein, RNA, DNA sequences and 3D structures. Nat. Protoc. 2011;6:1669–1682. doi: 10.1038/nprot.2011.393.
  • 14. Wilm A., Higgins D.G., Notredame C. R-Coffee: a method for multiple alignment of non-coding RNA. Nucleic Acids Res. 2008;36:e52. doi: 10.1093/nar/gkn174.
  • 15. Katoh K., Toh H. Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework. BMC Bioinformatics. 2008;9:212. doi: 10.1186/1471-2105-9-212.
  • 16. Ferre F., Ponty Y., Lorenz W.A., Clote P. DIAL: a web server for the pairwise alignment of two RNA three-dimensional structures using nucleotide, dihedral angle and base-pairing similarities. Nucleic Acids Res. 2007;35:W659–W668. doi: 10.1093/nar/gkm334.
  • 17. Wang C.W., Chen K.T., Lu C.L. iPARTS: an improved tool of pairwise alignment of RNA tertiary structures. Nucleic Acids Res. 2010;38:W340–W347. doi: 10.1093/nar/gkq483.
  • 18. Dror O., Nussinov R., Wolfson H. ARTS: alignment of RNA tertiary structures. Bioinformatics. 2005;21(Suppl.2):ii47–ii53. doi: 10.1093/bioinformatics/bti1108.
  • 19. Rahrig R.R., Leontis N.B., Zirbel C.L. R3D align: global pairwise alignment of RNA 3D structures using local superpositions. Bioinformatics. 2010;26:2689–2697. doi: 10.1093/bioinformatics/btq506.
  • 20. Kiryu H., Tabei Y., Kin T., Asai K. Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics. 2007;23:1588–1598. doi: 10.1093/bioinformatics/btm146.
  • 21. Lu X.J., Olson W.K. 3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nature Protoc. 2008;3:1213–1227. doi: 10.1038/nprot.2008.104.
  • 22. Do C.B., Mahabhashyam M.S., Brudno M., Batzoglou S. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 2005;15:330–340. doi: 10.1101/gr.2821705.
  • 23. Lorenz R., Bernhart S.H., Honer Zu Siederdissen C., Tafer H., Flamm C., Stadler P.F., Hofacker I.L. ViennaRNA Package 2.0. Algorithms Mol. Biol.: AMB. 2011;6:26. doi: 10.1186/1748-7188-6-26.
  • 24. Chang J.M., DiTommaso P., Notredame C. TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol. Biol. Evol. 2014;31 doi: 10.1093/molbev/msu117.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
