Rfam 15: RNA families database in 2025

Nancy Ontiveros; Emma Cooke; Eric P Nawrocki; Sandra Triebel; Manja Marz; Elena Rivas; Sam Griffiths-Jones; Anton I Petrov; Alex Bateman; Blake Sweeney

doi:10.1101/2024.09.23.614430

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Sep 24:2024.09.23.614430. [Version 1] doi: 10.1101/2024.09.23.614430

Rfam 15: RNA families database in 2025

Nancy Ontiveros ¹, Emma Cooke ², Eric P Nawrocki ³, Sandra Triebel ^4,⁵, Manja Marz ^4,⁵, Elena Rivas ⁶, Sam Griffiths-Jones ⁷, Anton I Petrov ⁸, Alex Bateman ¹, Blake Sweeney ^1,^*

PMCID: PMC11451735 PMID: 39372780

Abstract

The Rfam database, a widely-used repository of non-coding RNA (ncRNA) families, has undergone significant updates in release 15.0. This paper introduces major improvements, including the expansion of Rfamseq to 26,106 genomes, a 76% increase, incorporating the latest UniProt reference proteomes and additional viral genomes. Sixty-five RNA families were enhanced using experimentally determined 3D structures, improving the accuracy of consensus secondary structures and annotations. R-scape covariation analysis was used to refine structural predictions in 26 families. Gene Ontology and Sequence Ontology annotations were comprehensively updated, increasing GO term coverage to 75% of families. The release adds 14 new Hepatitis C Virus RNA families and completes microRNA family synchronisation with miRBase, resulting in 1,603 microRNA families. New data types, including FULL alignments, have been implemented. Integration with APICURON for improved curator attribution and multiple website enhancements further improve user experience. These updates significantly expand Rfam’s coverage and improve annotation quality, reinforcing its critical role in RNA research, genome annotation, and the development of machine learning models. Rfam is freely available at https://rfam.org.

Keywords: Non-coding RNA, Viruses, microRNA, RNA 3D Structure, Bioinformatics, Database

Graphical Abstract

graphic file with name nihpp-2024.09.23.614430v1-f0001.jpg

Rfam has undergone a major update with the release of 15.0. We have increased the number of genomes in our sequence database Rfamseq by 75%, completed the synchronisation with miRBase and improved 65 families using 3D structures.

Introduction

The Rfam database was established in 2002 (1) in order to provide a central repository of non-coding RNA (ncRNA) families for genomic annotations. Each family is represented by three key components: (i) a multiple sequence alignment called the SEED alignment that contains aligned sequences of homologous RNA sequences that share a consensus secondary structure, (ii) a covariance model (CM) built using the Infernal software (2) that was trained on the SEED alignment, and (ii) a set of matches that were found using the covariance model, called the set of FULL hits. The families are built manually by Rfam curators using scientific literature or alignments submitted by the community.

For each model the curators search for homologs in a sequence database, Rfamseq, and select a bit score threshold, called the gathering threshold, that separates homologous sequences from unrelated or more distantly related sequences, taking into account phylogenetic distribution of hits. The cutoff can be used by Infernal to report only sequences that score higher than this threshold when annotating genomes or searching in sequence databases.

Each major release of the Rfam database, such as 14.0 or 15.0, corresponds to a new version of Rfamseq. Prior to Rfam 13.0, Rfamseq was based on a set of sequences from the ENA database (3). However, due to the explosive growth in ENA, in Rfam 13.0 Rfamseq transitioned to a representative and reduced redundancy set of genomes produced by the UniProt team (4) for the Reference Proteomes dataset (5). The representative proteomes from UniProt are mapped to their corresponding genomes and the genomic sequence is analysed by Rfam (5). This provides users with a representative collection of genomes which have comprehensive ncRNA and protein annotations.

Rfam has been widely adopted by various scientific communities, with uses ranging from the annotation of small non-coding RNAs (ncRNAs) in genomic resources like Ensembl (6), to serving as a dataset for training machine learning models such as AlphaFold 3 (7). Additionally, Rfam is frequently used as a key reference for known non-coding RNAs. To ensure it remains current and valuable to the scientific community, we continuously review, update, and enhance the Rfam families.

In release 15.0 we have updated and expanded Rfamseq, improved families using 3D structures and R-scape (8), improved Gene Ontology (9) and Sequence Ontology (10) annotations and completed the synchronisation of miRBase (11) microRNAs into Rfam.

New Rfamseq database

Rfam 15 is based on the 2024_03 release of UniProt reference proteomes that includes 23,158 genomes (see Figure 1 showing the taxonomic distribution). We attempted to download all components of the genomes used to build the proteome set. In some cases, the specified genome was replaced by a newer version, which was used instead. In a very few cases, less than 50, all versions of the genome were either outdated or removed. In those cases we did not fetch any genome.

In addition to the genomes in the UniProt set, we also fetched all viral genomes from the protein information resource (12) at 75% percent identity for an additional 2,985 viral genomes. This set was added because during our viral work, described below, we found that the UniProt reference proteome work under reports the diversity of viral genomes. This has led to some viral Rfam families having very few or no FULL hits. Rfam would prefer that all families have at least one match outside of the SEED sequences. While using an additional set of viral genomes leads to a large increase in the number of genomes, the total size of Rfamseq is dominated by cellular organisms as <1% of the nucleotides come from a viral genome. Table 1 shows the comparison of Rfamseq in the last major release Rfam 14.0 vs Rfam 15.0.

Table 1.

The number of genomes in Rfamseq corresponding to release 14.0 and 15.0. The size on disk is the uncompressed size of all sequences in Rfamseq.

Kingdom	Rfamseq v14	Rfamseq v15
Total sequence length	3.68e11	1.46e12
Number of species	14,774	26,106
Eukaryotes	1,057	2,613
Archaea	418	370
Bacteria	7,808	9,601
Viruses	5,491	13,552
Number of Rfam hits	2,897,296	10,736,534
Size on disk	170 GB	772 GB

Open in a new tab

Once Rfamseq was updated all families were searched against the updated database and the matches were collected. Many older or larger families have not been rescanned in years due to technical pipeline limitations, for example, the 5S rRNA (RF00001) has not had its matches updated since 2014.

Once all families were updated, we compared the matches of families in 14.3, which was the last published version, and 15.0 to determine how Rfam has changed. Of the 3,431 families in common between 14.3 and 15.0, 98 had no FULL hits in 14.3 and when moving to 15.0, 23 of these families gained at least one hit while 21 lost all hits, leading to 96 families in 15.0 without FULL hits. Families without FULL hits occur because either the curator-selected gathering threshold of the family is too high, or if Rfamseq does not contain any similar sequences. Future work will look to see if these families require updating the threshold or if additional genomes need to be added to Rfamseq.

Of the 3,302 families with hits in both releases, the average family grew by 166%, with 2,335 gaining at least one hit, 547 losing hits and 420 with no change in the number of hits. Examining the families with an increase shows that there are 26 families with a greater than 10x growth. These families are primarily microRNA families matching to plant genomes. While, as described below, these families have been synchronised with miRBase, we are revisiting these families to see if they need additional future updates in the light of these additional hits, or if the genomes are poorly assembled and should be excluded from Rfamseq.

Overall the update to Rfamseq has led to a considerable change in all Rfam families. This will provide the community with a consistent and up-to-date dataset to reuse. Future updates to Rfamseq will seek to maximise the coverage of known organisms.

Improving existing Rfam families

Using RNA 3D structures to revise Rfam secondary structures

Rfam families are based on multiple sequence alignments annotated with consensus secondary structures that indicate base pairing. The accuracy and completeness of the consensus secondary structure is of critical importance as it guides the SEED alignment, informs the Infernal CM, and is used for training and benchmarking of software for RNA 2D and 3D structure prediction, including AlphaFold 3 (7) and R-scape (8).

Most of the Rfam families with known 3D structure were created prior to the 3D structure determination using less accurate, predicted secondary structures. For example, while the FMN riboswitch family (RF00050) correctly captured 5 helices of the FMN riboswitch, it did not include 1 helix, 2 pseudoknots, and several key base pairings (Figure 2B). As such, one major goal of this project was to include the pseudoknots which are observed in 3D structures into the Rfam families.

Figure 2. — A) Pipeline for reviewing Rfam families using experimentally determined RNA 3D structures from PDB. B) Example of the FMN riboswitch (RF00050) before and after improvements with a 3D structure. Improvements included adding 2 pseudoknots and creating a missing helicees. C) A summary of how the IS605-orfB-I RNA (RF03065) family was reviewed and improved using R-scape.

To aid the Rfam curators with updating Rfam alignments an automated pipeline was developed to map RNA sequences and secondary structures of experimentally determined RNA 3D structures to Rfam SEED alignments. Every week Rfam families are mapped to the latest set of RNA chains from PDB (13) using the Infernal cmscan program. For each Rfam family with one or more matching structures, the sequences and secondary structures are iteratively added to the alignment in the Stockholm format using Infernal’s cmalign program. The base pairing information, as determined by FR3D (14), is also included in the alignment as an additional GR annotation line (one per 3D structure). The pipeline is automatically executed weekly and the Rfam curators receive notifications of newly mapped structures (Figure 2A).

The resulting alignments undergo manual review to determine whether the family consensus secondary structure is consistent with the base pair information from PDB structures. The structure and annotated based pairs are always kept in the alignment, even in cases where the structure is not used to update the consensus secondary structure, e.g. low resolution structure or non-canonical base pairs. Additional annotations are also included in the Stockholm files, such as RNA structural elements. For example P1, P2, and P3 domains are included in the Pistol ribozyme RF02679, while the SAM riboswitch (RF00162) now includes a kink-turn annotation.

Our pipeline detects 143 Rfam families which match at least one 3D structure. We began updating families with 3D structures in release 14.5 and since then Rfam has updated 65 families, with 298 chains from 3D structures. This includes well known families such as the SAM riboswitch (RF00162), the FMN riboswitch (RF00050) and microRNA 16 precursor (RF00254). The updates included the addition of pseudoknots (PK) in 42 of the 65 families, of which 6 had 2 PK, and 1 had 3. These features include ligand binding sites for riboswitches (28 annotations), locations of well known motifs such kink-turns (8 annotations), and structural regions such as helices and domains (60 annotations). A detailed table showing which families were updated with which 3D structures and can also be found online at https://rfam.org/3d. Rfam curators work to keep families as up-to-date with the latest structures as possible, which has led to 11 families being updated more than once since release 14.5.

The review of structures is part of the ongoing curation process, and additional families will be improved in future releases. The pipeline is implemented in Python and is available at https://github.com/Rfam/rfam-3d-seed-alignments and the weekly updates are found in the file pdb_full_region.txt.gz located in the preview section of the FTP archive (http://ftp.ebi.ac.uk/pub/databases/Rfam/.preview/).

Using R-scape to improve Rfam consensus secondary structures

Since version 13.0 (5) the Rfam website included the results of the R-scape analysis (8) to help users evaluate consensus secondary structures for each Rfam family, as well as alternative secondary structures generated using the R-scape CaCoFold algorithm. CaCoFold evaluates all possible consensus secondary structures that are consistent with multiple sequence alignment and can propose a structure maximising the number of base pairs with statistically significant covariation (15).

In release 14.9, the R-scape software was used to systematically review all Rfam families and identify those that could be enhanced using R-scape CaCoFold structures. We examined all families with an increased number of covarying base pairs and selected 26 families for updates. These families are listed in Table 2. Shown in Figure 2C is an example of a family, IS605-orfB-I, before and after improvement.

Table 2:

Summary of the R-scape based improvements. This shows the families which were improved by using R-scape CaCoFold structures.

Family	Additional Covarying Basepairs
RF02033 (HEARO)	24
RF03065 (IS605-orfB-I)	14
RF03068 (RT-3)	8
RF03072 (raiA)	5
RF02969 (DUF3800-I)	4
RF01688 (Actino-pnp)	3
RF02004 (group-II-D1D4–5)	3
RF02005 (group-II-D1D4–6)	3
RF02913 (pemK)	3
RF03077 (RT-2)	3
RF03135 (L4-Archaeoglobi)	3
RF03144 (eL15-Euryarchaeota)	3
RF00062 (HgcC)	2
RF01731 (TwoAYGGAY)	2
RF01794 (sok)	2
RF02221 (sRNA-Xcc1)	2
RF02947 (cow-rumen-2)	2
RF03000 (LOOT)	2
RF03158 (L31-Actinobacteria)	2
RF01864 (plasmodium_snoR21)	1
RF01867 (CC2171)	1
RF02944 (c4–2)	1
RF02968 (DUF3800-IX)	1
RF02987 (GA-cis)	1
RF03019 (RT-16)	1
RF03046 (Pseudomonadales- 1)	1

Open in a new tab

Gene and Sequence Ontology annotations

Rfam is commonly used as a source of functional information for non-coding RNAs. This takes several forms, from users reading our curated descriptions and Wikipedia articles to understand the role of an ncRNA, to using the Gene Ontology (GO) and Sequence Ontology (SO) annotations that Rfam curators provide to understand the role of an ncRNA. Rfam is the largest source of GO annotations with over 10 million sequences having an Rfam based GO annotation. In addition to providing information to human users, Rfam is also used in training Large Language Models (LLM). LLMs such as ChatGPT and Claude have clearly been trained on Rfam data and are able to output entries in the formats, Stockholm and DESC, that Rfam uses. Thus providing a completely new way for scientists to access the information found in Rfam.

The Gene Ontology is a resource to classify and provide functional information for biomolecules using structured annotations and ontologies (9). Similarly the Sequence Ontology provides an ontology and structured annotations for the types of biomolecules (10). Both resources are continuously updated to better reflect scientific knowledge. For example, since the last publication of Rfam the SO gained several terms specific to the location of rRNA, e.g. cytosolic_rRNA (SO:0002343) and obsoleted the more generic rRNA terms.

Rfam provides GO and SO annotations for families, however, these annotations are not regularly reviewed to stay up-to-date with the latest changes in the ontologies. As an example, prior to 15.0 the rRNA families used obsoleted SO terms. For 15.0 we manually reviewed all annotations and updated them to better reflect changes in the GO and SO, as well as improved the specificity of annotations where possible.

We ensured that all families had at least one up-to-date SO term and the annotation was as specific as possible. For GO terms, we ensured the terms were up-to-date and added related terms where possible. A few examples of the changes we made were to 1) ensure all snoRNAs had a snoRNA specific SO term and included an RNA processing GO term; 2) add the sodium ion binding term to the NA+ riboswitch; and 3) updated families with the generic mature_transcript SO term to the more specific ncRNA SO term. Of the 3,431 common families between 14.3 and 15.0 2,084 had a GO term in 14.3 while 2,426 did in 15.0. For all families in 15.0 we now have at least one GO term for 3,157, which covers 75% of all families, an increase from 60% in 14.3. These changes increased the total number of GO terms from 3,752 to 4,446. As these updates propagate through the community, we expect these changes to lead to millions more sequences annotated with functional information.

Viral RNAs

Starting with Rfam 14.3 we collaborated with the Marz group to develop a new workflow for viral RNA families (16). This workflow is based upon whole genome alignments of viruses and led to the creation of Flaviviridae and Coronaviridae families as described previously (16). Since the last publication we have continued to use this workflow and have added Hepatitis C Virus (HCV) (17) families. We included 14 new families and removed 4 outdated families. A schematic of the new families can be seen in Figure 3. These families include structures found in both the non-coding and coding regions of the viral alignments. All viral families organised by viral clade can be seen at https://rfam.org/viruses. We plan to continue to use the viral pipeline and improve the representation of viral families in Rfam and expect to focus on HDV and pestivirus families in the near future.

Figure 3: — A schematic of the HCV viral families Rfam now contains. These families were built from whole genome alignments, as described in (17). The figure shows their relative locations in the HCV genomic alignment and shown in the inset are two of the new HCV families.

MicroRNAs

MicroRNAs are an important class of ncRNA that regulate gene expression in animals and plants, with many microRNAs being implicated in disease. For example, the mir-17~92 cluster (18) and mir-155 (19) are amongst a number of microRNAs which act as oncogenes. Understanding the evolutionary and family relationships of microRNAs across species allows the transfer of annotation, for example from model organisms to humans and vice versa.

Since early releases, Rfam has included microRNA families, e.g let-7 (20) was deposited in 2002 for Rfam 1.1. However, they were not subject to regular review and were not coordinated with miRBase (11), an authoritative resource for microRNA annotation that assigns identifiers for microRNA genes and sequences. As of Rfam 13.0, out of 1,983 microRNA families found in miRBase v21, only 28% matched one or more of the 529 Rfam microRNA families.

Starting in Rfam 14, we began systematically reviewing microRNA families with the goal to synchronise Rfam microRNA families with miRBase. To this end, we worked to ensure that the SEED alignments of all microRNA families in Rfam were built from microRNA sequences that are tracked in the miRBase database. Multiple sequence alignments of microRNA sequences were extracted from miRBase and mapped to Rfam accessions. Each sequence was assigned a RNAcentral Unique RNA Sequence (URS) (21) identifier to remove sequence redundancy and represent only distinct sequences for each species. For each alignment a covariance model was built using Infernal and used to search the Rfamseq database. Bit score thresholds for each model (known as gathering thresholds (16)) were manually curated. These thresholds enable automatic and accurate genome annotation using Rfam microRNA families with Infernal.

Pre-existing Rfam microRNA families were reviewed and replaced with miRBase-derived SEED alignments where possible. New families were created when Rfam did not already represent the microRNA sequences. MicroRNA families found in Rfam that did not match miRBase were reviewed and updated or deleted. In cases where the miRBase alignment matched several Rfam families we manually inspected the alignment and families and split and merged families and built clans of families as appropriate.

Rfam release 15.0 marks the completion of the synchronisation process, as we have processed all microRNA alignments from miRBase. For roughly 200 miRBase-derived alignments, we determined that the family was not suitable for inclusion into Rfam. In most cases this was because there was a single unique sequence in the miRBase alignment. A small fraction of the alignments which have not been added may be removed from miRBase in the future. Table 3 shows the summary of changes since release 14.3, and a list of all microRNA families is found in Supplementary information in Table S1. All miRBase microRNA sequences that are not represented in updated Rfam families will be periodically and systematically reviewed in a cycle of improvements and curation of both Rfam and miRBase. Rfam and miRBase will continue to synchronise our resources as new microRNAs and new microRNA alignments are available, with miRBase acting as the repository of microRNA sequences, and Rfam as the authority on the grouping of those sequences into families. Rfam microRNA family classifications will be available through both resources.

Table 3:

A summary of the changes in Rfam microRNA families since last publication (16) with release 14.3.

	Changes for Rfam 15.0
Total number of microRNA families	1,603
New families	722
Updated families	881
Deleted families	8

Open in a new tab

New families

While the primary focus of Rfam has been to complete the microRNA project, and improve existing families with 3D structures, we have continued to create new families. Since Rfam 14.3 (16), Rfam has created 16 new non-microRNA or viral RNA families. The families cover a range of phylogenetic and functional types. A few examples include Bacteroidales small SRP (RF04183), the signal recognition particle RNA of Bacteroidetes (22), Hairpin-meta1, a virus-like ribozyme reported in RNA satellites of plant viruses (RF04190) (23), Hovlinc ribozyme (RF04188) (24), a newly discovered type of self-cleaving ribozymes found in human and other hominids and RF04222 PLRV xrRNA, a exoribonuclease-resistant RNA detected in Flavivirus Potato virus (25), RF04247 bZIP family which is a non canonical Hac1/Xbp1 intron found in Metazoa (26). Shown in Figure 4 are the secondary structures of selected families.

Figure 4: — A selection of the new families for Rfam 15.0. The five families shown Hovlinc ribozyme (RF04188), the PLRV xrRNA (RF04222), the hairpin meta ribozyme (RF04190), nqrA-II ncRNA motif (RF04310) and Bacteriodales small SRP (RF04183) are several of the new families created for Rfam 15.0.

These families are curated through a combination of literature reviews and community submissions. We actively encourage groups with candidate RNAs to contribute their alignments and secondary structure data to Rfam. However, it is important to note that Rfam requires all sequences in SEED alignments to be traceable to public databases such as GenBank (27) or RNAcentral (21). With the increasing ease of sequencing, many laboratories maintain private sequence databases for constructing alignments. As a result, many of these new alignments cannot be incorporated into Rfam due to the lack of traceable public records. We strongly recommend that the scientific community submit their sequences to publicly accessible databases, ensuring they are available for reuse and analysis by the broader research community. Furthermore, when publishing alignments, using public accessions will facilitate faster and more efficient integration into resources like Rfam.

Other improvements

Updating FULL alignments

Rfam provides several data types for interested users to download from our FTP site. These include the SEED alignments, CMs and FULL sequences. Older versions of Rfam included FULL alignments, which are an alignment of all matching sequences to the CM. Previously, these were no longer produced because of technical limitations, however, with improvements to infernal and growth in available compute power it is now possible to create these alignments again. We now produce full alignments for all Rfam families. These are available for each model in the ‘full_alignments’ section on the FTP site (https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/full_alignments).

APICURON integration

APICURON (https://apicuron.org/) is a database to track and credit biocurators for the work they do (28). The database allows for databases to credit curators for each activity they perform and provides overall statistics for each database including Rfam. In Rfam, it is common for a curator to perform many small updates and fixes to families to produce a release. Currently, this work goes unacknowledged except during publication, which are generally separated by several years. By integrating with APICURON we are able to display all the changes curators perform as part of their duties on a more regular schedule. Additionally, APICURON updates ORCID records (https://orcid.org/) to help credit curators for their activities. We update our APICURON records with each release and our APICURON page is available at: https://apicuron.org/databases/rfam. APICURON requires that each resource track the activities of each curator, unfortunately some historical Rfam activity does not have the required specificity, leading to some older curators lacking attributions. This is being worked on and we hope to properly credit all curators for their contributions to Rfam.

Website improvements

The Rfam website (http://rfam.org) has undergone continuous development since our last publication. We have added several new features, such as dedicated landing pages for each project, e.g. viral families can easily be browsed at http://rfam.org/viruses. These pages make it simpler for users interested in a particular project, e.g. the 3D structure alignments, to see our progress and fetch all data related to the project.

We have recently updated some terminology used for riboswitch families in Rfam. We have found that some users are confused that Rfam riboswitch families generally include just the aptamer domains. This is because aptamers align well, while it is difficult to build an alignment of a whole riboswitch. We have added a short note to the header of each Rfam riboswitch page to indicate that this family is only the aptamer domain and a short explanation of what that means. Additionally, the descriptions of families now include the term ‘aptamer’ to clarify this point. We have not updated the identifiers or accessions of any families.

Finally, we have integrated the RNAcentral LitScan widget from RNAcentral. RNAcentral has developed a pipeline, LitScan, which identifies and extracts mentions of any non-coding RNA in all open access literature (29). The result of this pipeline can be reused by embedding a simple HTML widget. The widget allows users to browse and search the related literature in a convenient way. We have embedded this widget on all Rfam family pages under the new ‘Publications’ tab. The results are updated with each RNAcentral release, roughly every 4 months. An example of the widget for the Glutamine riboswitch family page is shown in Figure 5.

Figure 5: — The LitScan widget embedded in the Glutamine riboswitch family page. This widget shows users which open access articles mention the family. Families are searched using their Rfam ID and Rfam accession in all open access literature. It allows users to search the matched articles by several types of metadata such as the article type (research, review, report, etc), the section of the paper which mentions the family and organisms mentioned in the paper. Additionally, users can perform a text search within the sentences which mention the Rfam family.

Conclusions

After more than 20 years of work (1), Rfam has grown to over 4,000 families and continues to be a key resource in RNA science. We continue to provide a large centralised collection of ncRNA families, which are used in many ways. Rfam was originally created to annotate genomes, but we have grown far beyond that use case. We would like to take this opportunity to thank everyone who has worked on or collaborated with Rfam over the past 20 years. Anyone interested in learning about the history and changes in Rfam since its inception are encouraged to visit https://rfam.org/rfam20 to find interviews with many former and current staff and users.

Since our last publication we have focused on completing our synchronisation with miRBase and improving existing families using R-scape and 3D structures. This has led to a large increase in the number and quality of families. Rfam plans to stay synchronised with miRBase and complete the 3D structure improvements by connecting each family to at least one structure where possible. Outside of those projects, we will continue to import viral RNAs and return to capturing more novel families discovered by the community. Finally, we will explore ways to expand Rfamseq to better improve our coverage of the phylogenetic space.

It is essential to note that there has been a major shift in bioinformatics with the publication of AlphaFold and its effect on protein structure prediction. We expect that as the field of RNA structure prediction matures, Rfam will continue to play a key role as a source of ground truth. In order to best serve this new use case, we will begin exploring ways to grow faster than before, while maintaining the high standards the community has come to expect of Rfam alignments. This will be essential to providing the test and training sets required for deep learning based approaches.

Finally, we encourage any interested community members to reach out for collaboration or to provide data. As discussed here, much of our major work is carried out with interested community members and can lead to new directions for the resource. We invite new data submissions and feedback at https://docs.rfam.org/page/contact-us.html.

Supplementary Material

Supplement 1

NIHPP2024.09.23.614430v1-supplement-1.pdf^{(190KB, pdf)}

Acknowledgements

The authors would like to thank the community for its excellent feedback and the organisers of the Benasque 2022 and 2024 meetings where several collaborations described in this paper were established. We would also like to thank the broader RNA community community for their contributions, including providing alignments and feedback for families. In particular we thank Lars Barquist, Fei Qi, Christina Weinberg, Zasha Weinberg and Quentin Vincens who kindly shared their data with us and worked with us in the creation of new families.

Funding

This work was supported by the Wellcome Trust [218302/Z/19/Z]; the Biotechnology and Biological Sciences Research Council [BB/S020462/1]; the Intramural Research Program of the U.S. National Library of Medicine (NLM), National Institutes of Health; ELIXIR and EMBL core funds.

Footnotes

Conflict of interest

AB is a member of the Nucleic Acids Research Editorial Board.

Data availability

All Rfam data are released under the Creative Commons Zero (CC0) licence at https://rfam.org. The data can be accessed via an API, a public MySQL database, and the FTP archive. The Rfam documentation (https://docs.rfam.org) and (30) contain detailed instructions. All code is available on GitHub under the Apache 2.0 licence at https://github.com/Rfam.

References

1.Griffiths-Jones S., Bateman A., Marshall M., Khanna A. and Eddy S.R. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Nawrocki E.P. and Eddy S.R. (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 29, 2933–2935. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Nawrocki E.P., Burge S.W., Bateman A., Daub J., Eberhardt R.Y., Eddy S.R., Floden E.W., Gardner P.P., Jones T.A., Tate J., et al. (2015) Rfam 12.0: updates to the RNA families database. Nucleic Acids Res., 43, D130–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Consortium UniProt (2023) UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res., 51, D523–D531. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Kalvari I., Argasinska J., Quinones-Olvera N., Nawrocki E.P., Rivas E., Eddy S.R., Bateman A., Finn R.D. and Petrov A.I. (2018) Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res., 46, D335–D342. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Harrison P.W., Amode M.R., Austine-Orimoloye O., Azov A.G., Barba M., Barnes I., Becker A., Bennett R., Berry A., Bhai J., et al. (2024) Ensembl 2024. Nucleic Acids Res., 52, D891–D899. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Abramson J., Adler J., Dunger J., Evans R., Green T., Pritzel A., Ronneberger O., Willmore L., Ballard A.J., Bambrick J., et al. (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Rivas E., Clements J. and Eddy S.R. (2020) Estimating the power of sequence covariation for detecting conserved RNA structure. Bioinformatics, 36, 3072–3076. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Gene Ontology Consortium, Aleksander S.A., Balhoff J., Carbon S., Cherry J.M., Drabkin H.J., Ebert D., Feuermann M., Gaudet P., Harris N.L., et al. (2023) The Gene Ontology knowledgebase in 2023. Genetics, 224. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Sant D.W., Sinclair M., Mungall C.J., Schulz S., Zerbino D., Lovering R.C., Logie C. and Eilbeck K. (2021) Sequence Ontology terminology for gene regulation. Biochim. Biophys. Acta Gene Regul. Mech., 1864, 194745. [DOI] [PubMed] [Google Scholar]
11.Kozomara A., Birgaoanu M. and Griffiths-Jones S. (2019) miRBase: from microRNA sequences to function. Nucleic Acids Res., 47, D155–D162. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Wu C.H., Yeh L.-S.L., Huang H., Arminski L., Castro-Alvear J., Chen Y., Hu Z., Kourtesis P., Ledley R.S., Suzek B.E., et al. (2003) The Protein Information Resource. Nucleic Acids Res., 31, 345–347. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Armstrong D.R., Berrisford J.M., Conroy M.J., Gutmanas A., Anyango S., Choudhary P., Clark A.R., Dana J.M., Deshpande M., Dunlop R., et al. (2020) PDBe: improved findability of macromolecular structure data in the PDB. Nucleic Acids Res., 48, D335–D343. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Sarver M., Zirbel C.L., Stombaugh J., Mokdad A. and Leontis N.B. (2008) FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J. Math. Biol., 56, 215–252. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Rivas E. (2020) RNA structure prediction using positive and negative evolutionary information. PLoS Comput. Biol., 16, e1008387. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Kalvari I., Nawrocki E.P., Ontiveros-Palacios N., Argasinska J., Lamkiewicz K., Marz M., Griffiths-Jones S., Toffano-Nioche C., Gautheret D., Weinberg Z., et al. (2021) Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res., 49, D192–D200. [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Triebel S., Lamkiewicz K., Ontiveros N., Sweeney B., Stadler P.F., Petrov A.I., Niepmann M. and Marz M. (2024) Comprehensive survey of conserved RNA secondary structures in full-genome alignment of Hepatitis C virus. Sci. Rep., 14, 15145. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Zhao W., Gupta A., Krawczyk J. and Gupta S. (2022) The miR-17–92 cluster: Yin and Yang in human cancers. Cancer Treat. Res. Commun., 33, 100647. [DOI] [PubMed] [Google Scholar]
19.Wang F., Wang J., Zhang H., Fu B., Zhang Y., Jia Q. and Wang Y. (2024) Diagnostic value of circulating miR-155 for breast cancer: a meta-analysis. Front. Oncol., 14, 1374674. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Rougvie A.E. (2001) Control of developmental timing in animals. Nat. Rev. Genet., 2, 690–701. [DOI] [PubMed] [Google Scholar]
21.RNAcentral Consortium (2021) RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res., 49, D212–D220. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Ryan D., Jenniches L., Reichardt S., Barquist L. and Westermann A.J. (2020) A high-resolution transcriptome map identifies small RNA regulation of metabolism in the gut microbe Bacteroides thetaiotaomicron. Nat. Commun., 11, 3557. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Weinberg C.E., Olzog V.J., Eckert I. and Weinberg Z. (2021) Identification of over 200-fold more hairpin ribozymes than previously known in diverse circular RNAs. Nucleic Acids Res., 49, 6375–6388. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Chen Y., Qi F., Gao F., Cao H., Xu D., Salehi-Ashtiani K. and Kapranov P. (2021) Hovlinc is a recently evolved class of ribozyme found in human lncRNA. Nat. Chem. Biol., 17, 601–607. [DOI] [PubMed] [Google Scholar]
25.Steckelberg A.-L., Vicens Q., Costantino D.A., Nix J.C. and Kieft J.S. (2020) The crystal structure of a Polerovirus exoribonuclease-resistant RNA shows how diverse sequences are integrated into a conserved fold. RNA, 26, 1767–1776. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Hooks K.B. and Griffiths-Jones S. (2011) Conserved RNA structures in the non-canonical Hac1/Xbp1 intron. RNA Biol., 8, 552–556. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Benson D.A., Cavanaugh M., Clark K., Karsch-Mizrachi I., Lipman D.J., Ostell J. and Sayers E.W. (2013) GenBank. Nucleic Acids Res., 41, D36–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Hatos A., Quaglia F., Piovesan D. and Tosatto S.C.E. (2021) APICURON: a database to credit and acknowledge the work of biocurators. Database, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Green A., Ribas C., Ontiveros-Palacios N., Griffiths-Jones S., Petrov A.I., Bateman A. and Sweeney B. (2023) LitSumm: Large language models for literature summarisation of non-coding RNAs. arXiv [q-bio.GN]. [Google Scholar]
30.Kalvari I., Nawrocki E.P., Argasinska J., Quinones-Olvera N., Finn R.D., Bateman A. and Petrov A.I. (2018) Non-Coding RNA Analysis Using the Rfam Database. Curr. Protoc. Bioinformatics, 62, e51. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement 1

NIHPP2024.09.23.614430v1-supplement-1.pdf^{(190KB, pdf)}

Data Availability Statement

[R1] 1.Griffiths-Jones S., Bateman A., Marshall M., Khanna A. and Eddy S.R. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Nawrocki E.P. and Eddy S.R. (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 29, 2933–2935. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Nawrocki E.P., Burge S.W., Bateman A., Daub J., Eberhardt R.Y., Eddy S.R., Floden E.W., Gardner P.P., Jones T.A., Tate J., et al. (2015) Rfam 12.0: updates to the RNA families database. Nucleic Acids Res., 43, D130–7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Consortium UniProt (2023) UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res., 51, D523–D531. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Kalvari I., Argasinska J., Quinones-Olvera N., Nawrocki E.P., Rivas E., Eddy S.R., Bateman A., Finn R.D. and Petrov A.I. (2018) Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res., 46, D335–D342. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Harrison P.W., Amode M.R., Austine-Orimoloye O., Azov A.G., Barba M., Barnes I., Becker A., Bennett R., Berry A., Bhai J., et al. (2024) Ensembl 2024. Nucleic Acids Res., 52, D891–D899. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Abramson J., Adler J., Dunger J., Evans R., Green T., Pritzel A., Ronneberger O., Willmore L., Ballard A.J., Bambrick J., et al. (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Rivas E., Clements J. and Eddy S.R. (2020) Estimating the power of sequence covariation for detecting conserved RNA structure. Bioinformatics, 36, 3072–3076. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Gene Ontology Consortium, Aleksander S.A., Balhoff J., Carbon S., Cherry J.M., Drabkin H.J., Ebert D., Feuermann M., Gaudet P., Harris N.L., et al. (2023) The Gene Ontology knowledgebase in 2023. Genetics, 224. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Sant D.W., Sinclair M., Mungall C.J., Schulz S., Zerbino D., Lovering R.C., Logie C. and Eilbeck K. (2021) Sequence Ontology terminology for gene regulation. Biochim. Biophys. Acta Gene Regul. Mech., 1864, 194745. [DOI] [PubMed] [Google Scholar]

[R11] 11.Kozomara A., Birgaoanu M. and Griffiths-Jones S. (2019) miRBase: from microRNA sequences to function. Nucleic Acids Res., 47, D155–D162. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Wu C.H., Yeh L.-S.L., Huang H., Arminski L., Castro-Alvear J., Chen Y., Hu Z., Kourtesis P., Ledley R.S., Suzek B.E., et al. (2003) The Protein Information Resource. Nucleic Acids Res., 31, 345–347. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Armstrong D.R., Berrisford J.M., Conroy M.J., Gutmanas A., Anyango S., Choudhary P., Clark A.R., Dana J.M., Deshpande M., Dunlop R., et al. (2020) PDBe: improved findability of macromolecular structure data in the PDB. Nucleic Acids Res., 48, D335–D343. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Sarver M., Zirbel C.L., Stombaugh J., Mokdad A. and Leontis N.B. (2008) FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J. Math. Biol., 56, 215–252. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Rivas E. (2020) RNA structure prediction using positive and negative evolutionary information. PLoS Comput. Biol., 16, e1008387. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Kalvari I., Nawrocki E.P., Ontiveros-Palacios N., Argasinska J., Lamkiewicz K., Marz M., Griffiths-Jones S., Toffano-Nioche C., Gautheret D., Weinberg Z., et al. (2021) Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res., 49, D192–D200. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] 17.Triebel S., Lamkiewicz K., Ontiveros N., Sweeney B., Stadler P.F., Petrov A.I., Niepmann M. and Marz M. (2024) Comprehensive survey of conserved RNA secondary structures in full-genome alignment of Hepatitis C virus. Sci. Rep., 14, 15145. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Zhao W., Gupta A., Krawczyk J. and Gupta S. (2022) The miR-17–92 cluster: Yin and Yang in human cancers. Cancer Treat. Res. Commun., 33, 100647. [DOI] [PubMed] [Google Scholar]

[R19] 19.Wang F., Wang J., Zhang H., Fu B., Zhang Y., Jia Q. and Wang Y. (2024) Diagnostic value of circulating miR-155 for breast cancer: a meta-analysis. Front. Oncol., 14, 1374674. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] 20.Rougvie A.E. (2001) Control of developmental timing in animals. Nat. Rev. Genet., 2, 690–701. [DOI] [PubMed] [Google Scholar]

[R21] 21.RNAcentral Consortium (2021) RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res., 49, D212–D220. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Ryan D., Jenniches L., Reichardt S., Barquist L. and Westermann A.J. (2020) A high-resolution transcriptome map identifies small RNA regulation of metabolism in the gut microbe Bacteroides thetaiotaomicron. Nat. Commun., 11, 3557. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Weinberg C.E., Olzog V.J., Eckert I. and Weinberg Z. (2021) Identification of over 200-fold more hairpin ribozymes than previously known in diverse circular RNAs. Nucleic Acids Res., 49, 6375–6388. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Chen Y., Qi F., Gao F., Cao H., Xu D., Salehi-Ashtiani K. and Kapranov P. (2021) Hovlinc is a recently evolved class of ribozyme found in human lncRNA. Nat. Chem. Biol., 17, 601–607. [DOI] [PubMed] [Google Scholar]

[R25] 25.Steckelberg A.-L., Vicens Q., Costantino D.A., Nix J.C. and Kieft J.S. (2020) The crystal structure of a Polerovirus exoribonuclease-resistant RNA shows how diverse sequences are integrated into a conserved fold. RNA, 26, 1767–1776. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26.Hooks K.B. and Griffiths-Jones S. (2011) Conserved RNA structures in the non-canonical Hac1/Xbp1 intron. RNA Biol., 8, 552–556. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] 27.Benson D.A., Cavanaugh M., Clark K., Karsch-Mizrachi I., Lipman D.J., Ostell J. and Sayers E.W. (2013) GenBank. Nucleic Acids Res., 41, D36–42. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Hatos A., Quaglia F., Piovesan D. and Tosatto S.C.E. (2021) APICURON: a database to credit and acknowledge the work of biocurators. Database, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] 29.Green A., Ribas C., Ontiveros-Palacios N., Griffiths-Jones S., Petrov A.I., Bateman A. and Sweeney B. (2023) LitSumm: Large language models for literature summarisation of non-coding RNAs. arXiv [q-bio.GN]. [Google Scholar]

[R30] 30.Kalvari I., Nawrocki E.P., Argasinska J., Quinones-Olvera N., Finn R.D., Bateman A. and Petrov A.I. (2018) Non-Coding RNA Analysis Using the Rfam Database. Curr. Protoc. Bioinformatics, 62, e51. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

This is a preprint.

Rfam 15: RNA families database in 2025

Nancy Ontiveros

Emma Cooke

Eric P Nawrocki

Sandra Triebel

Manja Marz

Elena Rivas

Sam Griffiths-Jones

Anton I Petrov

Alex Bateman

Blake Sweeney

Abstract

Graphical Abstract

Introduction

New Rfamseq database

Figure 1.

Table 1.

Improving existing Rfam families

Using RNA 3D structures to revise Rfam secondary structures

Figure 2.

Using R-scape to improve Rfam consensus secondary structures

Table 2:

Gene and Sequence Ontology annotations

Viral RNAs

Figure 3:

MicroRNAs

Table 3:

New families

Figure 4:

Other improvements

Updating FULL alignments

APICURON integration

Website improvements

Figure 5:

Conclusions

Supplementary Material

Acknowledgements

Funding

Footnotes

Data availability

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases