Plant Biotechnology. 2024 Sep 25;41(3):303–308. doi: 10.5511/plantbiotechnology.24.0312c

Triterpene RDF: Developing a database of plant enzymes and transcription factors involved in triterpene biosynthesis using the Resource Description Framework

Keita Tamura 1,*, Hirokazu Chiba 2, Hidemasa Bono 1,2,3,**
PMCID: PMC11921141  PMID: 40115771

Abstract

Plants produce structurally diverse triterpenes (triterpenoids and steroids). Their biosynthesis starts from a common precursor, 2,3-oxidosqualene, which is cyclized by oxidosqualene cyclases (OSCs) to yield various triterpene skeletons. Steroids, which are biosynthesized from cycloartenol or lanosterol, are essential primary metabolites in most plant species, along with lineage-specific steroids, such as steroidal glycoalkaloids found in Solanum species. Other diverse triterpene skeletons are converted into triterpenoids, often classified as specialized compounds that are biosynthesized only in a limited number of plant species and accumulate in a tissue- or cell-type-specific manner. Recent studies have identified various tailoring enzymes involved in the structural diversification of triterpenes as well as transcription factors that regulate the expression of these enzymes. However, the coverage of these proteins is sparse in publicly available databases of curated proteins or enzymes, which complicates the functional annotation of newly assembled genomes or transcriptome sequences. Here, we created Triterpene RDF, a manually curated database of enzymes and transcription factors involved in plant triterpene biosynthesis. The database (https://github.com/ktamura2021/triterpene_rdf/) contains 532 proteins, with links to the UniProt Knowledgebase or NCBI protein database, and it enables direct download of a set of protein sequences filtered by protein type or taxonomy. Triterpene RDF will enhance the functional annotation of enzymes and regulatory elements for triterpene biosynthesis as genomic information becomes available for an expanding range of plant species.

Keywords: annotation, database, Resource Description Framework, SPARQL, triterpene


Triterpenes are among the most diverse groups of natural compounds found in plants. These compounds are derived from six isoprene units. The last common triterpene precursor is 2,3-oxidosqualene, from which oxidosqualene cyclases (OSCs) generate more than 100 different triterpene skeletons (Sawai and Saito 2011; Thimmappa et al. 2014). Among the triterpene skeletons, cycloartenol and lanosterol are intermediates in the biosynthesis of phytosterols, which are indispensable as components of plasma membranes or as precursors of the plant hormones brassinosteroids (Ohyama et al. 2007, 2009). In addition to these primary metabolites, many triterpenes are specialized metabolites with beneficial bioactivities, including glycyrrhizin (derived from β-amyrin) and ginsenosides (derived from β-amyrin or dammarenediol-II) (Sawai and Saito 2011; Seki et al. 2015). Although the definition is not uniform, we classified triterpenes into steroids, which are compounds derived from cycloartenol or lanosterol, and triterpenoids, which are compounds derived from other triterpene skeletons, according to the classification by Ohyama et al. (2007).

Triterpene skeletons are further modified by several classes of enzymes to produce highly diverse structures of triterpenoids and steroids (Sawai and Saito 2011; Thimmappa et al. 2014). The biosynthetic pathways of phytosterols, including brassinosteroids, are relatively well understood, with the support of mutants of Arabidopsis thaliana (Benveniste 2004; Bishop and Yokota 2001). In contrast, enzymes for the modification of triterpene skeletons for the biosynthesis of triterpenoids remained poorly understood until the first identification of cytochrome P450 monooxygenase (P450) CYP93E1 in Glycine max (Shibuya et al. 2006) and UDP-dependent glycosyltransferases (UGTs) UGT73K1 and UGT71G1 in Medicago truncatula (Achnine et al. 2005). Since then, P450s and UGTs have been known to be central players in the structural diversification of triterpenoids, as P450s introduce functional groups, such as hydroxyl and carboxyl groups, into triterpene skeletons, and UGTs add sugar moieties to triterpene aglycones (Seki et al. 2015). In addition to these two protein families, recent studies have identified new types of enzymes involved in triterpenoid biosynthesis, including cellulose synthase-like glycosyltransferases (Chung et al. 2020; Jozwiak et al. 2020) and BAHD acetyltransferases (Kumar et al. 2021). Additionally, transcription factors (TFs) that regulate triterpenoid or steroid biosynthetic genes have been reported in a few biosynthetic pathways (Dinday and Ghosh 2023). A simplified overview of the target pathways is shown in Figure 1.

Figure 1. A simplified overview of the target pathways in Triterpene RDF using representative compounds. Solid arrows represent reactions catalyzed by oxidosqualene cyclases, and dashed arrows represent reactions catalyzed by the combination of tailoring enzymes including P450s, UGTs, and other types of enzymes. Sugar moieties are indicated as follows: Gal, galactose; Glc, glucose; GlcA, glucuronic acid; Rha, rhamnose.

With the rapid progress in sequencing technologies, including long-read sequencing platforms, it is now much more feasible to assemble large and complex plant genomes (Sahu and Liu 2023). This has accelerated the elucidation of genome sequences of non-model plant species, including rare medicinal plants that produce valuable triterpenoids. In general, once gene structures have been annotated in an assembled genome, the functional annotation of the predicted genes requires a set of well-curated reference protein sequences, such as the reviewed (Swiss-Prot) part of the UniProt Knowledgebase (UniProtKB) (Bateman et al. 2023) and The Arabidopsis Information Resource (TAIR) (Berardini et al. 2015). These databases mainly cover OSCs and well-studied Arabidopsis biosynthetic pathways, whereas most proteins of specialized triterpenoid biosynthetic pathways have not been fully curated, which complicates the functional annotation of triterpenoid biosynthetic genes in plant genomes. The TriForC database (http://bioinformatics.psb.ugent.be/triforc/ (Accessed Dec 21, 2023)) is a manually curated resource for enzymes involved in triterpene biosynthesis (Miettinen et al. 2018). Although this is a valuable resource for studying plant triterpene biosynthesis, it does not directly provide a downloadable set of protein sequences that would be useful for the functional annotation of predicted genes. The purpose of this study is to develop a database of functionally characterized proteins involved in triterpenoid and steroid biosynthesis in plants from public repositories, including UniProtKB, and to make it easy to obtain a set of protein sequences for the functional annotation of potential genes involved in triterpenoid or steroid biosynthesis.

The overall scheme of database construction is shown in Figure 2. Through manual curation of the literature, starting from comprehensive review articles (Dinday and Ghosh 2023; Malhotra and Franke 2022; Rahimi et al. 2019; Thimmappa et al. 2014) and extending to additional relevant articles, we first created tables containing 532 proteins (*_db02.tsv), indexed as TP0001–TP0532. These proteins were classified into five types: four enzyme types (OSC, P450, UGT, and “other enzyme”) and TF. Additionally, nine squalene cyclases (SCs) identified in ferns were added for reference (indicated as “*SC”). We also classified these proteins according to their biosynthetic pathways. All proteins (except for SCs) were classified as either “triterpenoid” or “steroid” (forming the “pathway” column). Proteins classified as “triterpenoid” were further classified based on the triterpene skeleton produced by the OSCs of the pathway (forming the “skeleton” column). We summarized the characterized function of the protein in the “function” column. For OSCs that produce multiple triterpene skeletons, the characterization was based on the major products. Each protein entry was linked to a UniProtKB accession (forming the “uniprot” column), using the accession or sequence indicated in the literature. Since UniProtKB distributes its databases in the Resource Description Framework (RDF) format, which allows the linking of various life science data using a standard query language (SPARQL) (Jupp et al. 2014; The UniProt Consortium 2017), we chose UniProtKB as the primary accession for each protein entry. For entries that could not be mapped to UniProtKB accessions, NCBI protein database accessions were indicated (forming the “ncbiprotein” column). A primary citation for each entry was indicated as the PubMed accession (forming the “pubmed” column). Additional citations, Digital Object Identifiers for publications not available in the PubMed database, or notes for the entries were indicated in the “note” column.
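
The column layout described above can be pictured as a small tab-separated table. The following is a minimal Python sketch, not the authors' curation script, that reads such a table and checks the constraints stated in the text. The header names "id" and "type" are assumptions (the text names only the other columns), and both example rows, including their accessions and PubMed numbers, are invented placeholders rather than real database entries.

```python
import csv
import io
import re

# Column names follow the text; "id" and "type" are assumed header names.
COLUMNS = ["id", "type", "pathway", "skeleton", "function",
           "uniprot", "ncbiprotein", "pubmed", "note"]
TYPES = {"OSC", "P450", "UGT", "other_enzyme", "TF", "*SC"}
PATHWAYS = {"triterpenoid", "steroid"}

# Two made-up rows; accessions and PubMed numbers are placeholders.
SAMPLE_TSV = "\t".join(COLUMNS) + "\n" + "\n".join([
    "\t".join(["TP9001", "OSC", "triterpenoid", "beta-amyrin",
               "beta-amyrin synthase", "X00001", "", "00000001", ""]),
    "\t".join(["TP9002", "P450", "steroid", "",
               "sterol C-22 desaturase", "", "XX000002.1", "00000002", ""]),
]) + "\n"


def check_row(row):
    """Return a list of problems found in one table row (empty if none)."""
    problems = []
    if not re.fullmatch(r"TP\d{4}", row["id"]):
        problems.append("id must look like TP0001")
    if row["type"] not in TYPES:
        problems.append("unknown type: " + row["type"])
    # Squalene cyclases are reference entries without a pathway label.
    if row["type"] != "*SC" and row["pathway"] not in PATHWAYS:
        problems.append("pathway must be triterpenoid or steroid")
    # Each entry links to UniProtKB or, failing that, to NCBI protein.
    if not (row["uniprot"] or row["ncbiprotein"]):
        problems.append("no protein accession")
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
for row in rows:
    print(row["id"], check_row(row) or "ok")
```

A check of this kind is one way to catch entries that lack both a UniProtKB and an NCBI protein accession before sequences are retrieved.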

Figure 2. Overall scheme of the database construction.

Taxonomic and sequence information was retrieved using the UniProtKB or NCBI protein database accessions. For entries linked to UniProtKB, taxonomic information (scientific names of species, family, and order) and amino acid sequences (canonical isoforms) were retrieved using federated queries at the SPARQL endpoint of the UniProt database available at the RDF portal (https://rdfportal.org/sib/sparql; version 2023_02) (Kawashima et al. 2018). For entries linked to the NCBI protein database, taxonomy IDs and amino acid sequences were retrieved by programmatic access using TogoWS (Katayama et al. 2010), and taxonomic information (scientific names of species, family, and order) was retrieved in the same way as for entries linked to UniProtKB. The obtained taxonomic and sequence information was merged with the primary tables (*_db02.tsv) to produce a final database table (v20240207_dbfinal.tsv) and a corresponding RDF dataset (v20240207_dbfinal.ttl), which we named Triterpene RDF. Triterpene RDF is accessible via a website (https://ktamura2021.github.io/triterpene_rdf/), which internally runs SPARQL queries against the RDF dataset. The sources of the database, custom scripts, SPARQL queries, and intermediate files for the construction of the database are available at the GitHub repository (https://github.com/ktamura2021/triterpene_rdf/).
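
As an illustration of the retrieval step, the sketch below builds a SPARQL query that asks a UniProt RDF endpoint for the species name and sequence of a list of accessions. It is a simplified stand-in, not the queries stored in the repository: it is not federated, it does not restrict results to canonical isoforms, and it omits the family and order lookup. The predicates used (up:organism, up:scientificName, up:sequence, rdf:value) belong to the UniProt core vocabulary; the helper names, the loose accession check, and the example accession P12345 are assumptions chosen for illustration.

```python
import json
import re
import urllib.parse
import urllib.request

ENDPOINT = "https://rdfportal.org/sib/sparql"  # endpoint named in the text
UNIPROT_IRI = "http://purl.uniprot.org/uniprot/"

QUERY_TEMPLATE = """\
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?protein ?species ?sequence
WHERE {
  VALUES ?protein { %s }
  ?protein up:organism ?taxon .
  ?taxon up:scientificName ?species .
  ?protein up:sequence ?isoform .
  ?isoform rdf:value ?sequence .
}
"""


def build_query(accessions):
    """Return a SPARQL query string for the given UniProtKB accessions."""
    for acc in accessions:
        # Loose sanity check (an assumption, not the official accession pattern).
        if not re.fullmatch(r"[A-Z0-9]{6,10}", acc):
            raise ValueError("unexpected accession: %r" % acc)
    iris = " ".join("<%s%s>" % (UNIPROT_IRI, acc) for acc in accessions)
    return QUERY_TEMPLATE % iris


def run_query(query, endpoint=ENDPOINT):
    """Send the query over HTTP GET and return the JSON result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


# Building the query needs no network access; run_query() does.
print(build_query(["P12345"]))
```

Keeping query construction separate from the HTTP call makes it possible to inspect or store the query text without contacting the endpoint.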

A screenshot of the Triterpene RDF website is shown in Figure 3. When users access the website, all entries are displayed first. Drop-down menus at the top of the page are provided to filter entries. The “Download FASTA” button enables users to download amino acid sequences for the displayed entries in FASTA format. Table 1 shows the number of protein entries in the database classified by “type” and “pathway”. We collected 400 and 123 entries for triterpenoid and steroid biosynthesis, respectively. P450s and OSCs were the most commonly identified types in both triterpenoid and steroid biosynthesis. The proportion of entries labeled as “other_enzyme” is relatively high in steroid biosynthesis because many methyltransferases, reductases, and isomerases are involved in phytosterol biosynthesis. We also analyzed the number of entries for triterpenoid and steroid biosynthesis, classified according to plant order (Tables 2 and 3). The most studied plant order for triterpenoid biosynthesis was Fabales (100 entries), followed by Brassicales (49 entries). All five protein types were identified in both orders (Table 2). The Fabales order includes Fabaceae species that are well studied for triterpenoid saponin biosynthesis, such as Medicago truncatula (25 entries) and Glycine max (18 entries). The most studied plant orders for steroid biosynthesis were Brassicales and Solanales (30 entries each) (Table 3). This reflects the elucidation of phytosterol biosynthetic pathways using A. thaliana and the recent accumulation of knowledge on steroidal glycoalkaloid biosynthesis in Solanum species (Akiyama et al. 2023).
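
The filter-then-download behavior described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the website's implementation (which, according to the text, queries the RDF dataset with SPARQL). The dictionary keys, the helper names, and the three example entries with their truncated placeholder sequences are all hypothetical.

```python
# Hypothetical entries; IDs, functions, and sequences are placeholders.
ENTRIES = [
    {"id": "TP9001", "type": "P450", "order": "Fabales",
     "species": "Glycine max", "function": "example P450",
     "sequence": "MLDIKGYLVL"},
    {"id": "TP9002", "type": "UGT", "order": "Fabales",
     "species": "Medicago truncatula", "function": "example UGT",
     "sequence": "MSMSDINKNS"},
    {"id": "TP9003", "type": "OSC", "order": "Brassicales",
     "species": "Arabidopsis thaliana", "function": "example OSC",
     "sequence": "MWKLKIAEGG"},
]


def filter_entries(entries, protein_type=None, order=None):
    """Keep entries matching every filter that is not None."""
    return [e for e in entries
            if (protein_type is None or e["type"] == protein_type)
            and (order is None or e["order"] == order)]


def to_fasta(entries, width=60):
    """Format entries as FASTA text, wrapping sequences at `width` residues."""
    lines = []
    for e in entries:
        lines.append(">%s %s [%s]" % (e["id"], e["function"], e["species"]))
        seq = e["sequence"]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


# Sequences of all Fabales entries, as a reference set for annotation.
print(to_fasta(filter_entries(ENTRIES, order="Fabales")), end="")
```

A FASTA file produced this way could serve as the reference set in a sequence similarity search against predicted proteins from a newly assembled genome.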

Figure 3. A screenshot of the database site. Drop-down menus at the top of the page work as filters for entries. Users can download amino acid sequences of the displayed data using the “Download FASTA” button.

Table 1. Number of protein entries in the database based on the classification of “type” and “pathway”.

Type          Triterpenoid  Steroid  Total
OSC                    137       38    175
P450                   143       43    186
UGT                     60        5     65
other_enzyme            44       35     79
TF                      16        2     18
*SC                      -        -      9
Total                  400      123    532
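
The totals in Table 1 can be cross-checked with a short Python calculation. The numbers below are copied from the table; the variable names are ours. The nine squalene cyclases (*SC) have no pathway label, so they contribute only to the grand total.

```python
# (triterpenoid, steroid, row total) per protein type, copied from Table 1.
TABLE1 = {
    "OSC": (137, 38, 175),
    "P450": (143, 43, 186),
    "UGT": (60, 5, 65),
    "other_enzyme": (44, 35, 79),
    "TF": (16, 2, 18),
}
SC_REFERENCE = 9  # squalene cyclases, counted only in the grand total

# Each row total must equal the sum of its two pathway counts.
for name, (triterpenoid, steroid, total) in TABLE1.items():
    assert triterpenoid + steroid == total, name

triterpenoid_total = sum(row[0] for row in TABLE1.values())
steroid_total = sum(row[1] for row in TABLE1.values())
grand_total = sum(row[2] for row in TABLE1.values()) + SC_REFERENCE

print(triterpenoid_total, steroid_total, grand_total)  # 400 123 532
```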

Table 2. Number of protein entries involved in triterpenoid biosynthesis classified by taxonomic order and protein type.

Order Type Total
OSC P450 UGT other_enzyme TF
Apiales 8 12 16 36
Aquifoliales 4 1 5
Asterales 21 6 1 28
Brassicales 16 11 10 5 7 49
Caryophyllales 7 6 5 4 22
Celastrales 6 7 13
Cucurbitales 10 9 1 2 2 24
Dioscoreales
Ericales 1 2 3
Fabales 15 44 25 12 4 100
Fagales 5 1 6
Gentianales 2 2 4
Lamiales 4 6 10
Liliales
Malpighiales 8 3 11
Myrtales 7 3 10
Poales 6 7 2 4 19
Polypodiales
Ranunculales 1 2 3
Rosales 5 3 8
Sapindales 6 13 16 35
Saxifragales 4 4
Solanales 4 3 7
Vitales 2 2
Zingiberales 1 1

Table 3. Number of protein entries involved in steroid biosynthesis classified by taxonomic order and protein type.

Order Type Total
OSC P450 UGT other_enzyme TF
Apiales 2 2
Aquifoliales
Asterales 1 1
Brassicales 2 11 17 30
Caryophyllales
Celastrales 3 3
Cucurbitales 5 5
Dioscoreales 4 4
Ericales
Fabales 7 4 3 14
Fagales 2 2
Gentianales
Lamiales 1 1
Liliales 1 9 1 11
Malpighiales 5 5
Myrtales 2 2
Poales 1 5 1 7
Polypodiales 3 3
Ranunculales 1 1
Rosales
Sapindales
Saxifragales 1 1
Solanales 2 8 5 13 2 30
Vitales
Zingiberales 1 1

The future direction for Triterpene RDF is to annotate proteins using either pathway or reaction databases. For example, UniProtKB uses Rhea (http://www.rhea-db.org) (Lombardot et al. 2019) for enzyme annotation, which eases the process of integrating metabolites with protein information (Morgat et al. 2020). Such linked data would make it possible to obtain the set of annotated protein sequences necessary for the biosynthesis of a specified triterpene compound; however, the coverage of reactions for plant triterpene biosynthesis in the Rhea database is limited. One possible solution is to deposit the pathways needed to map the entries of this triterpene database in the community-driven pathway database WikiPathways (https://www.wikipathways.org/) (Agrawal et al. 2024). Although the current scope of WikiPathways mainly covers model organisms, efforts to expand the database to include non-model organisms are ongoing (Oec et al. 2023; Pico et al. 2023). Future integration with this type of pathway or reaction database will strengthen the annotation and characterization of triterpene biosynthetic genes.
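
To make the idea of reaction-level linking concrete, the following Python sketch shows one way such a lookup might be written. It is not part of Triterpene RDF. The query follows the pattern UniProt uses for Rhea-based catalytic activity annotations (up:annotation, up:Catalytic_Activity_Annotation, up:catalyticActivity, up:catalyzedReaction); the helper names are hypothetical, and the example bindings, including the accessions and Rhea numbers, are invented to show the expected shape of SPARQL JSON results.

```python
from collections import defaultdict

RHEA_QUERY_TEMPLATE = """\
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?reaction
WHERE {
  VALUES ?protein { %s }
  ?protein up:annotation ?annotation .
  ?annotation a up:Catalytic_Activity_Annotation ;
              up:catalyticActivity ?activity .
  ?activity up:catalyzedReaction ?reaction .
}
"""


def build_rhea_query(accessions):
    """Return a query listing the Rhea reactions annotated on each protein."""
    iris = " ".join(
        "<http://purl.uniprot.org/uniprot/%s>" % acc for acc in accessions)
    return RHEA_QUERY_TEMPLATE % iris


def proteins_by_reaction(bindings):
    """Group protein IRIs by reaction IRI from SPARQL JSON result bindings."""
    grouped = defaultdict(set)
    for binding in bindings:
        grouped[binding["reaction"]["value"]].add(binding["protein"]["value"])
    return {reaction: sorted(proteins) for reaction, proteins in grouped.items()}


# Invented bindings in the SPARQL JSON results layout (placeholders only).
EXAMPLE_BINDINGS = [
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/X00001"},
     "reaction": {"type": "uri", "value": "http://rdf.rhea-db.org/00001"}},
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/X00002"},
     "reaction": {"type": "uri", "value": "http://rdf.rhea-db.org/00001"}},
]

print(proteins_by_reaction(EXAMPLE_BINDINGS))
```

Grouping by reaction rather than by protein mirrors the use case in the text: starting from the reactions that lead to a compound and collecting the proteins annotated with them.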

Acknowledgments

We thank the developers and organizers who attended the domestic BioHackathons in Japan (BH22.9 and BH23.9, organized by Database Center for Life Science (DBCLS)) and the Togothon meetings (organized by DBCLS) for their helpful discussions and technical support.

Abbreviations

OSC, oxidosqualene cyclase
P450, cytochrome P450 monooxygenase
RDF, Resource Description Framework
SC, squalene cyclase
TF, transcription factor
UGT, UDP-dependent glycosyltransferase

Conflict of interest

The authors declare no conflict of interest.

Author contribution

Conceptualization: K.T.; Methodology: K.T. and H.C.; Software: K.T. and H.C.; Validation: K.T. and H.C.; Formal analysis: K.T. and H.C.; Investigation: K.T. and H.C.; Resources: K.T., H.C., and H.B.; Data curation: K.T. and H.C.; Visualization: K.T. and H.C.; Writing – original draft: K.T.; Writing – review and editing: K.T., H.C., and H.B.; Supervision: H.B.; Project administration: K.T.; Funding acquisition: K.T. and H.B.

Funding

This work was supported by JSPS KAKENHI Grant Number 23K13886 to K.T., ROIS-DS-JOINT (003RP2022 and 008RP2023) to K.T., and the Center of Innovation for Bio-Digital Transformation (BioDX), an open innovation platform for industry-academia co-creation of JST (COI-NEXT, JPMJPF2010).

References

  • Achnine L, Huhman DV, Farag MA, Sumner LW, Blount JW, Dixon RA (2005) Genomics-based selection and functional characterization of triterpene glycosyltransferases from the model legume Medicago truncatula. Plant J 41: 875–887
  • Agrawal A, Balcı H, Hanspers K, Coort SL, Martens M, Slenter DN, Ehrhart F, Digles D, Waagmeester A, Wassink I, et al. (2024) WikiPathways 2024: Next generation pathway database. Nucleic Acids Res 52(D1): D679–D689
  • Akiyama R, Umemoto N, Mizutani M (2023) Recent advances in steroidal glycoalkaloid biosynthesis in the genus Solanum. Plant Biotechnol (Tokyo) 40: 185–191
  • Bateman A, Martin M-J, Orchard S, Magrane M, Ahmad S, Alpi E, Bowler-Barnett EH, Britto R, Bye-A-Jee H, Cukura A, et al.; The UniProt Consortium (2023) UniProt: The universal protein knowledgebase in 2023. Nucleic Acids Res 51(D1): D523–D531
  • Benveniste P (2004) Biosynthesis and accumulation of sterols. Annu Rev Plant Biol 55: 429–457
  • Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, Huala E (2015) The Arabidopsis Information Resource: Making and mining the “gold standard” annotated reference plant genome. Genesis 53: 474–485
  • Bishop GJ, Yokota T (2001) Plants steroid hormones, brassinosteroids: Current highlights of molecular aspects on their synthesis/metabolism, transport, perception and response. Plant Cell Physiol 42: 114–120
  • Chung SY, Seki H, Fujisawa Y, Shimoda Y, Hiraga S, Nomura Y, Saito K, Ishimoto M, Muranaka T (2020) A cellulose synthase-derived enzyme catalyses 3-O-glucuronosylation in saponin biosynthesis. Nat Commun 11: 5664
  • Dinday S, Ghosh S (2023) Recent advances in triterpenoid pathway elucidation and engineering. Biotechnol Adv 68: 108214
  • Jozwiak A, Sonawane PD, Panda S, Garagounis C, Papadopoulou KK, Abebie B, Massalha H, Almekias-Siegl E, Scherf T, Aharoni A (2020) Plant terpenoid metabolism co-opts a component of the cell wall biosynthesis machinery. Nat Chem Biol 16: 740–748
  • Jupp S, Malone J, Bolleman J, Brandizi M, Davies M, Garcia L, Gaulton A, Gehant S, Laibe C, Redaschi N, et al. (2014) The EBI RDF platform: Linked open data for the life sciences. Bioinformatics 30: 1338–1339
  • Katayama T, Nakao M, Takagi T (2010) TogoWS: Integrated SOAP and REST APIs for interoperable bioinformatics Web services. Nucleic Acids Res 38(Web Server): W706–W711
  • Kawashima S, Katayama T, Hatanaka H, Kushida T, Takagi T (2018) NBDC RDF portal: A comprehensive repository for semantic data in life sciences. Database (Oxford) 2018: bay123
  • Kumar A, Srivastava P, Srivastava G, Sandeep, Kumar N, Chanotiya CS, Ghosh S (2021) BAHD acetyltransferase contributes to wound-induced biosynthesis of oleo-gum resin triterpenes in Boswellia. Plant J 107: 1403–1419
  • Lombardot T, Morgat A, Axelsen KB, Aimo L, Hyka-Nouspikel N, Niknejad A, Ignatchenko A, Xenarios I, Coudert E, Redaschi N, et al. (2019) Updates in Rhea: SPARQLing biochemical reaction data. Nucleic Acids Res 47(D1): D596–D600
  • Malhotra K, Franke J (2022) Cytochrome P450 monooxygenase-mediated tailoring of triterpenoids and steroids in plants. Beilstein J Org Chem 18: 1289–1310
  • Miettinen K, Iñigo S, Kreft L, Pollier J, De Bo C, Botzki A, Coppens F, Bak S, Goossens A (2018) The TriForC database: A comprehensive up-to-date resource of plant triterpene biosynthesis. Nucleic Acids Res 46(D1): D586–D594
  • Morgat A, Lombardot T, Coudert E, Axelsen K, Neto TB, Gehant S, Bansal P, Bolleman J, Gasteiger E, de Castro E, et al.; UniProt Consortium (2020) Enzyme annotation in UniProtKB using Rhea. Bioinformatics 36: 1896–1901
  • Oec N, Hirota T, Nozu R, Bono H (2023) Efforts to analyze pathways in non-model organisms. Preprint: https://doi.org/10.37044/osf.io/spf3q
  • Ohyama K, Suzuki M, Kikuchi J, Saito K, Muranaka T (2009) Dual biosynthetic pathways to phytosterol via cycloartenol and lanosterol in Arabidopsis. Proc Natl Acad Sci USA 106: 725–730
  • Ohyama K, Suzuki M, Masuda K, Yoshida S, Muranaka T (2007) Chemical phenotypes of the hmg1 and hmg2 mutants of Arabidopsis demonstrate the in-planta role of HMG-CoA reductase in triterpene biosynthesis. Chem Pharm Bull 55: 1518–1521
  • Pico A, Ono H, Nozu R, Oec N, Bono H (2023) BioHackJP 2023 Report R3: Expand the pathway analysis environment to non-model organisms. Preprint: https://doi.org/10.37044/osf.io/4uskb
  • Rahimi S, Kim J, Mijakovic I, Jung K-H, Choi G, Kim S-C, Kim Y-J (2019) Triterpenoid-biosynthetic UDP-glycosyltransferases from plants. Biotechnol Adv 37: 107394
  • Sahu SK, Liu H (2023) Long-read sequencing (method of the year 2022): The way forward for plant omics research. Mol Plant 16: 791–793
  • Sawai S, Saito K (2011) Triterpenoid biosynthesis and engineering in plants. Front Plant Sci 2: 25
  • Seki H, Tamura K, Muranaka T (2015) P450s and UGTs: Key players in the structural diversity of triterpenoid saponins. Plant Cell Physiol 56: 1463–1471
  • Shibuya M, Hoshino M, Katsube Y, Hayashi H, Kushiro T, Ebizuka Y (2006) Identification of beta-amyrin and sophoradiol 24-hydroxylase by expressed sequence tag mining and functional expression assay. FEBS J 273: 948–959
  • The UniProt Consortium (2017) UniProt: The universal protein knowledgebase. Nucleic Acids Res 45(D1): D158–D169
  • Thimmappa R, Geisler K, Louveau T, O’Maille P, Osbourn A (2014) Triterpene biosynthesis in plants. Annu Rev Plant Biol 65: 225–257

Articles from Plant Biotechnology are provided here courtesy of Japanese Society for Plant Biotechnology
