dbCAN3: automated carbohydrate-active enzyme and substrate annotation

Jinfang Zheng; Qiwei Ge; Yuchen Yan; Xinpeng Zhang; Le Huang; Yanbin Yin

doi:10.1093/nar/gkad328

Nucleic Acids Res. 2023 May 1;51(W1):W115–W121. doi: 10.1093/nar/gkad328


PMCID: PMC10320055 PMID: 37125649

Abstract

Carbohydrate active enzymes (CAZymes) are made by various organisms for complex carbohydrate metabolism. Genome mining of CAZymes has become a routine data analysis in (meta-)genome projects, owing to the importance of CAZymes in bioenergy, microbiome, nutrition, agriculture, and global carbon recycling. In 2012, dbCAN was provided as an online web server for automated CAZyme annotation. dbCAN2 (https://bcb.unl.edu/dbCAN2) was further developed in 2018 as a meta server to combine multiple tools for improved CAZyme annotation. dbCAN2 also included CGC-Finder, a tool for identifying CAZyme gene clusters (CGCs) in (meta-)genomes. We have updated the meta server to dbCAN3 with the following new functions and components: (i) dbCAN-sub as a profile Hidden Markov Model database (HMMdb) for substrate prediction at the CAZyme subfamily level; (ii) searching against experimentally characterized polysaccharide utilization loci (PULs) with known glycan substrates of the dbCAN-PUL database for substrate prediction at the CGC level; (iii) a majority voting method to consider all CAZymes with substrate predicted from dbCAN-sub for substrate prediction at the CGC level; (iv) improved data browsing and visualization of substrate prediction results on the website. In summary, dbCAN3 not only inherits all the functions of dbCAN2, but also integrates three new methods for glycan substrate prediction.


INTRODUCTION

Carbohydrate active enzymes (CAZymes) act upon specific glycosidic linkages to degrade or synthesize polysaccharides (1). They are found in all cellular organisms and viruses, but are particularly abundant in photosynthetic plants and algae for atmospheric carbon fixation into complex carbohydrates (2), as well as in microbes living in carbohydrate-rich environments such as animal guts and agricultural soils (3). For example, because of their roles in carbohydrate utilization, CAZymes (e.g. glycoside hydrolases or GHs, polysaccharide lyases or PLs) are a vitally important part of the genomic repertoire of the human gut microbiome (4). CAZymes are thus of great interest to research not only on the gut microbiome, human health and nutrition, but also on bioenergy, plant disease, and global carbon recycling.

CAZymes often work with other proteins, such as sugar transporters (TCs), transcriptional regulators (TFs) and signal transduction proteins (STPs) to fully break down a specific complex carbohydrate substrate (5). Genes encoding these proteins often form physically linked gene clusters in bacterial genomes. Those gene clusters that have been experimentally characterized are termed polysaccharide utilization loci (PULs), with known carbohydrate substrates (e.g. starch, mannan, xylan, and glucan) (6–8). Biochemical characterization of new PULs is expensive and time consuming, but can be assisted by initial computational screening of bacterial genomes for physically linked CAZyme gene clusters (CGCs). Unlike PULs, CGCs are computer predicted gene clusters without known substrates, but contain genes encoding CAZymes, TCs, TFs and STPs (9). As their names indicate (e.g. xylan utilization loci), PULs should only contain experimentally characterized CGCs (e.g. with GHs) for degradation. However, the computer predicted CGCs could also include gene clusters with glycosyltransferases (GTs) for biosynthesis of carbohydrates (e.g. capsular polysaccharides), which should not be classified as PULs.

Using the expert-curated and routinely-updated CAZy database (1,10,11) as the foundation, in 2012 we developed dbCAN (database for automated CAZyme annotation), as a database of profile Hidden Markov Models (HMMs hereafter) representing the signature domains of over 400 CAZyme families (12). We further developed dbCAN-seq (2,3), dbCAN2 meta server (9), CGC-Finder (9), eCAMI (13) and dbCAN-PUL (14) that together form the dbCAN tool suite. Particularly, dbCAN2 replaced dbCAN in 2018 as a meta server that integrates three tools (HMMER vs dbCAN-HMMdb, DIAMOND vs CAZyDB, Hotpep vs PPR (15)) for a more accurate CAZyme annotation (9). Hotpep was later replaced by eCAMI in dbCAN2 in 2021 and then by dbCAN-sub in 2022 (see below). In addition, the CUPP web server (16) was published in 2020 to allow automated CAZyme annotation with predicted EC and subfamilies. A recent evaluation (17) conducted by an independent group found that dbCAN2 outperforms all other tools.

dbCAN2 is the most widely used web server for automated CAZyme annotation (∼200 000 jobs processed since 2017). The accompanying standalone Conda software package run_dbcan is available on GitHub (https://github.com/linnabrown/run_dbcan) to allow users to perform the CAZyme annotation on their own computers. For example, dbCAN has been used to annotate CAZymes in microbiome samples (18–20), and one study (21) found that western diets low in plant carbohydrates corresponded to lower gut microbiota diversity and CAZyme diversity, compared to the diets of Africans living a hunter-gatherer lifestyle. Additionally, dbCAN HMMdb has been incorporated into other bioinformatics software systems, e.g. KBase's App Catalog (22), METABOLIC (23), DRAM (24), CUPP (16) and SACCHARIS (25).

MOTIVATION FOR DBCAN3

Our motivation to develop an updated dbCAN3 server came from user requests. In recent years, we have received frequent emails from dbCAN users asking us to develop the capability to predict the carbohydrate substrates of CAZymes in their query (meta)genomes. The reason is that all current CAZyme annotation tools can only assign given proteins to CAZyme families (e.g. GH5) or subfamilies (e.g. GH5_1). This is not enough for people who want to learn about the glycan substrates that a CAZyme, a genome, or a microbiome sample might target. Filling this research gap will significantly enhance the basic science of characterizing new CAZymes and PULs in microbes of carbohydrate-rich environments (6–8). It will also help researchers to compare different bacterial genomes or microbiome samples in terms of their differing metabolic capacity for utilizing various carbohydrate substrates, and contribute to the emerging practice of personalized nutrition (26–29), e.g. using gut microbiome sequencing to infer whether a person is a responder to certain dietary fibers or prebiotics.

Predicting glycan substrates for CAZymes is not easy. The main reason is protein family poly-specificity (25), meaning that proteins of one CAZyme family may have multiple biochemical activities or cleave different glycosidic linkages. For example, GH5 contains proteins with 28 EC (enzyme commission) numbers corresponding to over 10 different carbohydrate substrates (e.g. cellulose, xyloglucan, beta-mannan, xylan, arabinogalactan, chitosan). In some recent papers (e.g. (21,24,30,31)), once researchers had obtained the CAZyme families in their query (meta)genomes using dbCAN or other tools, they had to independently map those families to substrates by manually curating the CAZy webpages (www.cazy.org) and/or the literature. Inevitably, this leads to inconsistent CAZyme-substrate mappings, not to mention duplicated and laborious manual work on the user's side, because dbCAN and other tools lack such a function.

To annotate proteins with a specific biochemical activity targeting a specific glycan, classification of a CAZyme family into subfamilies has proved very useful, as a subfamily is expected to contain proteins with very few or a single biochemical activity (32). The idea of ‘deeper CAZyme annotation’ is to use CAZyme subfamily models, where each subfamily is ideally associated with just one biochemical activity, to annotate query (meta)genomes. Over the past years, the CAZy database has developed subfamily classification for 27 families, which involved building phylogenies or sequence similarity networks (SSNs) and extensive manual inspections (33–35). The CAZy-created subfamilies have been included in dbCAN2 as HMMs for subfamily-level CAZyme annotation but only for the 27 families. CUPP (16) and eCAMI (13) have automatically classified all CAZyme families into subfamilies, and used distinguishing k-mer peptides for CAZyme subfamily and EC number annotation. However, none of the current tools have provided the glycan substrate prediction function. With our dbCAN-PUL (14) and eCAMI tools, we have developed novel substrate prediction approaches that can predict substrates for CAZyme gene clusters (CGCs). These approaches have been detailed in our recent dbCAN-seq database paper (3), and will be briefly introduced here as well.

APPROACHES TO SUBSTRATE PREDICTION

In dbCAN3, three substrate prediction approaches have been implemented. One is for substrate prediction at the CAZyme subfamily level, and the other two are for substrate prediction at the CGC level.

(A) Substrate prediction at CAZyme subfamily level by HMMER search against dbCAN-sub

Using eCAMI (13), we have classified 426 CAZyme families into 26041 CAZyme subfamilies (Figure 1A). The dbCAN domain sequences of each subfamily were aligned and further converted to an HMM. All the 26041 HMMs collectively form the dbCAN-sub database (https://bcb.unl.edu/dbCAN_sub/) to enable substrate annotation at CAZyme subfamily level. Each dbCAN-sub HMM corresponds to an eCAMI subfamily, containing protein sequences annotated by CAZy, some of which may have been experimentally characterized with EC numbers. As described in our recent paper (3), glycan substrates of each subfamily are manually curated from the CAZy webpages and literature according to EC numbers (Figure 1A). The result of this manual curation was a CAZyme subfamily → EC → substrate mapping table (https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-sub.substrate.mapping.xls), which connects CAZyme families with ECs and further to carbohydrate substrates. Given new protein sequences, HMMER will be used to search against dbCAN-sub HMMdb. The best hit subfamily's EC numbers and substrates from the mapping table will be assigned to the query.
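The lookup step described above can be sketched as follows. This is a minimal illustration and not the run_dbcan implementation: the `Hit` record, the `assign_substrate` helper, the query and subfamily names other than ‘PL7_e9’, and the toy mapping entry are hypothetical, and the best hit is assumed here to be the one with the lowest E-value, which the text does not specify.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str       # query protein ID
    subfamily: str   # dbCAN-sub HMM name, e.g. "PL7_e9"
    evalue: float    # E-value of the HMMER match

# Toy subfamily -> (EC numbers, substrates) table; illustrative entry only,
# standing in for the curated subfamily -> EC -> substrate mapping table.
MAPPING = {
    "PL7_e9": (["4.2.2.3"], ["alginate"]),
}

def assign_substrate(hits, mapping):
    """Return {query: (subfamily, ECs, substrates)} from each query's best hit."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        # Assumption: "best hit" means the lowest E-value.
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    result = {}
    for query, hit in best.items():
        # Subfamilies without curated ECs yield an empty prediction.
        ecs, substrates = mapping.get(hit.subfamily, ([], []))
        result[query] = (hit.subfamily, ecs, substrates)
    return result

# Hypothetical hits for two query proteins.
hits = [
    Hit("Scaffold3_1161", "PL7_e9", 1e-80),
    Hit("Scaffold3_1161", "PL7_e2", 1e-20),
    Hit("Scaffold3_0042", "GH5_e1", 1e-50),
]
predictions = assign_substrate(hits, MAPPING)
```

In this sketch a query whose best-hit subfamily has no entry in the mapping still receives a subfamily annotation, but no EC number or substrate.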

The subfamily classification partially addressed the family poly-specificity issue. Before the subfamily classification, 226 (53.1%) of the 426 CAZyme families contained experimentally characterized CAZy proteins with more than one EC number, and 22 families had >10 EC numbers. After the subfamily classification, 3003 CAZyme subfamilies contain experimentally characterized CAZy proteins with EC numbers, and among them only 655 (21.8%) subfamilies have more than one EC number (Figure 1B). The remaining 23 038 CAZyme subfamilies contain no experimentally characterized CAZy proteins and no EC numbers. Their HMMs will not help substrate prediction but can still be informative for subfamily annotation. There are 2348 subfamilies with only one EC number, and these subfamilies contain in total 735 699 CAZy proteins (covering 74.3% of CAZy proteins from the 3003 CAZyme subfamilies with EC numbers) (Figure 1B inset barplot).
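As a quick consistency check, the counts quoted in this paragraph can be recomputed. The variable names below are introduced only for illustration; all numbers are taken from the text.

```python
total_subfamilies = 26041   # all dbCAN-sub subfamilies
with_ec = 3003              # subfamilies with at least one EC number
without_ec = 23038          # subfamilies with no EC number
multi_ec = 655              # subfamilies with more than one EC number
single_ec = 2348            # subfamilies with exactly one EC number

# The two partitions should add up to their parent totals.
assert with_ec + without_ec == total_subfamilies
assert single_ec + multi_ec == with_ec

# Share of poly-specific units before and after subfamily classification.
pct_multi_ec_families = round(100 * 226 / 426, 1)
pct_multi_ec_subfamilies = round(100 * multi_ec / with_ec, 1)
```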

(B) Substrate prediction at CGC level by BLAST search against dbCAN-PUL

In 2021, we published dbCAN-PUL (14) to collect polysaccharide utilization loci (PULs), which are CGCs with experimentally characterized glycan substrates extracted from literature. Given an input genome, CGCs are first predicted by the CGC-Finder program of the dbCAN package, and then proteins encoded in each CGC are searched against the 654 PULs of dbCAN-PUL (https://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-PUL.substrate.mapping.xls) using BLAST. The best hit PUL’s substrate will be assigned to the query CGC. The algorithm has been illustrated and detailed in our recent dbCAN-seq paper (see Figure 1C of (3) for an example). Briefly, to select the best hit PUL, a score is calculated between the query CGC and each subject PUL by summing the BLAST bit scores of all protein matches. The best hit PUL is the one with the highest summed score, and the BLAST matches must involve at least one CAZyme and at least one other signature gene (TF, TC or STP).
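The selection rule summarized above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used by run_dbcan: the `best_pul` function, the gene and PUL identifiers, and the input layout are hypothetical, and each (query gene, PUL) pair is assumed to contribute a single BLAST match.

```python
from collections import defaultdict

# Signature gene classes other than CAZymes.
OTHER_SIGNATURE = {"TC", "TF", "STP"}

def best_pul(matches, gene_class, pul_substrate):
    """Pick the PUL with the highest summed bit score.

    matches: list of (cgc_gene, pul_id, bit_score) tuples
    gene_class: cgc_gene -> "CAZyme", "TC", "TF", "STP" or "other"
    pul_substrate: pul_id -> known substrate
    """
    total = defaultdict(float)
    classes = defaultdict(set)
    for gene, pul, bits in matches:
        total[pul] += bits
        classes[pul].add(gene_class[gene])
    # Keep only PULs matched by a CAZyme plus at least one other signature gene.
    eligible = [
        pul for pul in total
        if "CAZyme" in classes[pul] and classes[pul] & OTHER_SIGNATURE
    ]
    if not eligible:
        return None
    top = max(eligible, key=lambda pul: total[pul])
    return top, pul_substrate[top], total[top]

# Hypothetical query CGC with three genes and matches to two PULs.
gene_class = {"g1": "CAZyme", "g2": "TC", "g3": "CAZyme"}
matches = [
    ("g1", "PUL_a", 300.0), ("g2", "PUL_a", 150.0),
    ("g1", "PUL_b", 500.0), ("g3", "PUL_b", 200.0),
]
pul_substrate = {"PUL_a": "alginate", "PUL_b": "xylan"}
result = best_pul(matches, gene_class, pul_substrate)
```

Here `PUL_b` has the higher summed score (700) but is matched only by CAZymes, so the sketch returns `PUL_a` with its substrate.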

(C) Substrate prediction at CGC level by dbCAN-sub majority voting

With approach A, it is now possible to predict substrates for CAZymes. As one CGC can encode multiple CAZymes, a substrate assignment for a CGC can be inferred with a simple majority voting rule that considers all the CAZymes in the CGC and their substrates inferred by approach A. This is called the dbCAN-sub majority voting approach. The algorithm has been illustrated and detailed in our recent dbCAN-seq paper (see Figure 1E of (3) for an example).
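A simple majority vote of this kind might look like the sketch below. The exact voting rule is detailed in the dbCAN-seq paper (3) and may differ; the `vote_substrate` function is hypothetical, and it assumes that each CAZyme casts one vote per predicted substrate and that a tie yields no assignment.

```python
from collections import Counter

def vote_substrate(cazyme_substrates):
    """cazyme_substrates: one list of predicted substrates per CAZyme in the CGC."""
    votes = Counter()
    for substrates in cazyme_substrates:
        # Assumption: each CAZyme votes once for each of its substrates.
        votes.update(set(substrates))
    if not votes:
        return None   # no CAZyme in the CGC has a substrate prediction
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None   # assumption: a tie gives no assignment
    return ranked[0][0]

# Hypothetical CGC with four CAZymes, one of them without a prediction.
cgc_prediction = vote_substrate([["xylan"], ["xylan"], ["cellulose"], []])
```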

EVALUATION OF SUBSTRATE PREDICTION

To evaluate the performance of the three substrate prediction approaches, we have designed three experiments, one for each approach. All experiments used the experimentally characterized PUL data from dbCAN-PUL (14).

For approach A evaluation, we first collected highly reliable PULs that have been: (i) characterized by biochemical enzyme activity assay, (ii) published within the last two decades and (iii) characterized to target only one specific substrate according to the original biochemical papers. We also excluded spurious PULs with unreasonable substrates such as capsular polysaccharides and monosaccharides. This resulted in 642 CAZymes in 215 PULs with substrate labels. The substrate of a PUL was assigned to all the CAZymes in it as the ground truth data. We then took these 642 CAZyme sequences as queries to search against dbCAN-sub for substrate prediction. We found that 522/642 = 81% (coverage, Table 1) of the CAZymes were assigned at least one substrate. Furthermore, 385/522 (74%) CAZymes received predicted substrates that matched the substrate labels in the ground truth data, corresponding to an overall precision (true positives/all predictions) = 74% and overall recall (true positives/all positives) = 385/642 = 60% (Supplementary Table S1). As this is a multi-class prediction problem (multiple substrate groups with uneven data counts), we also calculated the weighted precision = 81% and weighted recall = 60% for approach A at the CAZyme level (Table 1).

Table 1. Performance evaluation results of the three approaches

Approach	Coverage (%)	Unweighted precision (%)	Unweighted recall (%)	Weighted* precision (%)	Weighted* recall (%)
A	81	74	60	81	60
B	88	74	65	91	66
C	37	95	35	84	35


* Following https://towardsdatascience.com/micro-macro-weighted-averages-of-f1-score-clearly-explained-b603420b292f.

When matching substrate names from predictions (dbCAN-sub) and ground truth data (dbCAN-PUL), we used Supplementary Table S4 as a guide to consider glycan substrates that are similar in structures but different in names. For example, starch, glycogen, malto-oligosaccharides, dextrin, dextran, maltose, pullulan, isolichenan, nigerose, trehalose, kojibiose are all alpha-glucans. Host glycans include hyaluronan, lactose, LacNAc, mucin, blood group B substance, glycosphingolipid, sialic acid, glycosaminoglycan, mucin-type O-glycan, N-glycan, keratan sulfate, human milk oligosaccharide/polysaccharide, heparan sulfate proteoglycan, sulfated glycosaminoglycan, heparin and heparan sulfate, dermatan sulfate, etc.
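This kind of name matching can be sketched with a small lookup table. The groups below contain only a subset of the names listed in the text, not the full Supplementary Table S4, and the `normalize` and `same_substrate` helpers are hypothetical.

```python
# Substrate groups built from examples in the text (not the full Table S4).
SUBSTRATE_GROUPS = {
    "alpha-glucan": {
        "starch", "glycogen", "malto-oligosaccharides", "dextrin", "dextran",
        "maltose", "pullulan", "isolichenan", "nigerose", "trehalose", "kojibiose",
    },
    "host glycan": {
        "hyaluronan", "lactose", "lacnac", "mucin", "n-glycan",
        "keratan sulfate", "heparin", "dermatan sulfate",
    },
}

# Invert the table into a name -> group lookup.
NAME_TO_GROUP = {
    name: group for group, names in SUBSTRATE_GROUPS.items() for name in names
}

def normalize(name):
    """Map a substrate name to its group; unknown names are kept (lowercased)."""
    key = name.strip().lower()
    return NAME_TO_GROUP.get(key, key)

def same_substrate(predicted, truth):
    """Treat two names as a match if they fall into the same group."""
    return normalize(predicted) == normalize(truth)
```

With this table, a prediction of ‘starch’ for a PUL labeled ‘pullulan’ counts as a match, while ‘starch’ versus ‘mucin’ does not.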

For approach B evaluation, we randomly split 603 PULs of dbCAN-PUL into two datasets (1:4 ratio): 106 PULs for testing and 497 PULs for training. The testing PULs (ground truth data) were used as input to run a BLAST search against the training PULs. We found that 93/106 = 88% (coverage, Table 1) of the testing PULs were assigned at least one substrate. Out of the 93 PULs, 69 (74%) received predicted substrates that matched the substrate labels in the ground truth data, corresponding to an overall precision = 74% and overall recall = 69/106 = 65% (Supplementary Table S2). Considering the multi-class prediction, we had the weighted precision = 91% and weighted recall = 66% for approach B substrate prediction at the CGC level (Supplementary Table S2).

For approach C evaluation, we took the CAZymes of the 106 testing PULs (ground truth data) as the input to run the dbCAN-sub search. Each PUL then went through the majority voting to receive a substrate assignment. We found that 39/106 = 37% (coverage, Table 1) of the PULs were assigned at least one substrate. Out of the 39 PULs, 37 (95%) received predicted substrates that matched the substrate labels in the ground truth data, corresponding to an overall precision = 95% and overall recall = 37/106 = 35% (Supplementary Table S3). Considering the multi-class prediction, we had the weighted precision = 84% and weighted recall = 35% for approach C substrate prediction at the CGC level (Supplementary Table S3).
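The unweighted figures in Table 1 follow directly from the counts reported above, as the sketch below shows. The weighted averages depend on per-substrate counts that are given only in the Supplementary Tables, so the `weighted_average` helper is demonstrated on made-up toy values; it assumes the usual support-weighted mean of per-class scores. All function names are hypothetical.

```python
def summarize(n_total, n_predicted, n_correct):
    """Return (coverage, precision, recall) as rounded percentages."""
    coverage = n_predicted / n_total     # queries with at least one substrate
    precision = n_correct / n_predicted  # true positives / all predictions
    recall = n_correct / n_total         # true positives / all positives
    return tuple(round(100 * x) for x in (coverage, precision, recall))

# (total queries, queries with a prediction, correct predictions) from the text.
counts = {"A": (642, 522, 385), "B": (106, 93, 69), "C": (106, 39, 37)}
table = {approach: summarize(*c) for approach, c in counts.items()}

def weighted_average(per_class):
    """Support-weighted mean of per-class scores.

    per_class: {class_name: (score, support)}, where support is the number
    of ground-truth items in that class.
    """
    total_support = sum(support for _, support in per_class.values())
    return sum(score * support for score, support in per_class.values()) / total_support

# Toy per-substrate precisions; NOT data from the paper.
toy_weighted_precision = weighted_average({"xylan": (0.9, 30), "alginate": (0.5, 10)})
```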

Therefore, approach B (dbCAN-PUL search) has the best performance (highest coverage, weighted precision, and recall), followed by approach A (dbCAN-sub search for CAZyme substrate prediction) (Table 1). Approach C (dbCAN-sub majority voting) suffers from low coverage and recall.

NEW WEB FUNCTIONS AND UPDATES

We have updated the dbCAN website and the run_dbcan standalone package to implement the three substrate prediction approaches. User-submitted genomes or metagenomes are first subjected to gene prediction (Figure 2, step 1). The resulting proteins are processed by HMMER against dbCAN HMMdb (family level) and dbCAN-sub HMMdb (subfamily level), and by DIAMOND against CAZyDB (Figure 2, step 2). The dbCAN-sub HMMdb search (approach A) leads to substrate prediction at the CAZyme subfamily level (Figure 2, step 4) if the proteins match the 3003 CAZyme subfamilies associated with substrates (Figure 1B). CGCs in the (meta)genomes are predicted by CGC-Finder (Figure 2, step 3), and are further subjected to substrate prediction at the CGC level by the dbCAN-PUL search (Figure 2, step 5, approach B) and by the dbCAN-sub majority voting (Figure 2, step 6, approach C). All these substrate prediction approaches have been coded into the run_dbcan program and the dbCAN3 website.

Figure 2. dbCAN3 workflow. The input genome or metagenome goes through six steps of data processing. Steps 1, 2 and 3 were already in dbCAN2. Step 4 is based on the dbCAN-sub search (described as approach A in the text), which looks up the CAZyme subfamily → EC → substrate mapping table to assign a substrate to a protein. Not all proteins receive a substrate prediction; those that do are indicated with ‘*’. After step 3, all CGCs in the input (meta)genome have been predicted by CGC-Finder; they contain four classes of signature genes: CAZymes (red), transporters (TCs, blue), transcription factors (TFs, green), and signal transduction proteins (STPs, cyan), as well as other non-signature genes (gray). Step 5 is based on a BLAST search against the dbCAN-PUL database (14), which currently contains 654 experimentally determined PUL-substrate pairs (described as approach B in the text; details can be found in Figure 1C of our recent paper (3)). Step 6 is based on a majority voting considering all CAZymes with predicted substrates from step 4 (described as approach C in the text; details can be found in Figure 1E of our recent paper (3)). The four boxes are what is reported to users as the outputs; the two boxes with red scribble borders are new in dbCAN3.

We have modified the job submission page of dbCAN3 to allow users to choose which analysis steps and which tools they would like to run. The default setting is for the simplest analysis: HMMER vs dbCAN HMMdb search given protein sequences as input. As in dbCAN2, users can choose to also run DIAMOND:CAZy and HMMER:dbCAN-sub searches. When users choose to run CGC-Finder, they have the option to also perform substrate prediction at the CGC level.

The result page has been significantly re-designed in dbCAN3 compared to dbCAN2 to visualize new substrate prediction results. First, the overview tab now has an EC# column resulting from the dbCAN-sub search, and the old Hotpep column is now replaced by dbCAN-sub (Figure 3A). The CAZyme sequences can now be downloaded in FASTA format. Second, the HMMER:dbCAN-sub tab allows users to access the detailed search result (Figure 3B), including the link to a new page that visualizes the component CAZy proteins and ECs in each subfamily (from eCAMI) and shows where the CAZyme subfamily → EC → substrate mapping is derived from. Third, the CGC-Finder tab now has two substrate columns (Figure 3C), one from the dbCAN-PUL search and the other from the dbCAN-sub majority voting. Clicking on each CGC leads to the CGC page, which shows the CGC gene composition diagram (Figure 3D), the gene composition table, the substrate prediction from the dbCAN-PUL search (CGC-PUL alignment diagram (Figure 3E)), and the substrate prediction table from the dbCAN-sub majority voting.

Figure 3. Screenshots of dbCAN3 result pages. (A) The result page has five tabs (https://bcb.unl.edu/dbCAN2/blastation.php?jobid=20230211113326 as an example): 1) Overview, 2) HMMER:dbCAN, 3) DIAMOND:CAZy, 4) HMMER:dbCAN_sub, and 5) CGC-Finder. Compared to dbCAN2, the overview table has a new EC# column, and the dbCAN_sub column has replaced Hotpep. A subfamily is named like ‘PL7_e9’, meaning family PL7 subfamily e9; the ‘e’ means it is from eCAMI, to distinguish it from the CAZy subfamilies, which are currently available for 27 CAZyme families. (B) The HMMER:dbCAN_sub tab is new in dbCAN3. The substrate is predicted from the EC numbers of the member proteins of the subfamily. For example, the protein ‘Scaffold3_1161’ matched the dbCAN-sub/eCAMI subfamily ‘PL7_e9’, which contains in total 16 CAZy proteins (Subfam Composition column ‘PL7:16’). Among the 16 CAZy proteins, 4 are experimentally characterized with EC 4.2.2.3. Clicking on the ‘PL7_e9’ link in the dbCAN Subfam column will open a new page to show the details of this subfamily (the inset screenshot). (C) The CGC-Finder tab has two new columns to show the substrates predicted by dbCAN-PUL search and dbCAN-sub majority voting (see main text). Clicking on each CGC in the first column will open the CGC page. (D) The CGC page is redesigned in dbCAN3 to visualize the gene composition of the CGC and substrate prediction. The menu on the top allows users to quickly navigate to different sections. (E) Visualization of the BLAST result between the query CGC and the subject PUL, which is experimentally characterized to degrade alginate. Clicking on the graph will direct to the PUL page of dbCAN-PUL for the details of the subject PUL.

We plan to update dbCAN3 annually as new sequences and families are added to the CAZy database. New CAZyme families and subfamilies will be created if the new CAZy sequences have higher similarity to eCAMI/dbCAN-sub unclassified sequences or to each other than to existing subfamilies. The dbCAN-sub HMMdb database will be a new addition to our dbCAN family tool suite (dbCAN, dbCAN-seq, dbCAN-PUL, eCAMI). The dbCAN3 web server will continue to serve the carbohydrate and microbiome research communities with state-of-the-art tools for automated CAZyme, CGC and substrate annotation.

DATA AVAILABILITY

All the data are freely available online at https://bcb.unl.edu/dbCAN2/. The source code of dbCAN3 (run_dbcan v4) is available on FigShare: https://doi.org/10.6084/m9.figshare.22640326.v1 and GitHub: https://github.com/linnabrown/run_dbcan.


ACKNOWLEDGEMENTS

We would like to acknowledge our lab members (Jerry Akresi, Ved Piyush) for helpful discussions. This work was partially completed utilizing the Holland Computing Center of the University of Nebraska, which receives support from the Nebraska Research Initiative. We also thank Roland Madadjim and Dr Juan Cui of UNL for the Python script to reformat CGC-Finder output.

Contributor Information

Jinfang Zheng, Nebraska Food for Health Center, Department of Food Science and Technology, University of Nebraska, Lincoln, NE 68588, USA.

Qiwei Ge, School of Computing, University of Nebraska, Lincoln, NE 68588, USA.

Yuchen Yan, Nebraska Food for Health Center, Department of Food Science and Technology, University of Nebraska, Lincoln, NE 68588, USA.

Xinpeng Zhang, Nebraska Food for Health Center, Department of Food Science and Technology, University of Nebraska, Lincoln, NE 68588, USA.

Le Huang, Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, NC, USA.

Yanbin Yin, Nebraska Food for Health Center, Department of Food Science and Technology, University of Nebraska, Lincoln, NE 68588, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Institutes of Health (NIH) awards [R01GM140370, R21AI171952]; National Science Foundation (NSF) CAREER award [DBI-1933521]; United States Department of Agriculture (USDA) award [58-8042-9-089]. Funding for open access charge: NSF award [DBI-1933521].

Conflict of interest statement. None declared.

REFERENCES

1. Cantarel B.L., Coutinho P.M., Rancurel C., Bernard T., Lombard V., Henrissat B. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for glycogenomics. Nucleic Acids Res. 2009; 37:D233–D238.
2. Huang L., Zhang H., Wu P., Entwistle S., Li X., Yohe T., Yi H., Yang Z., Yin Y. dbCAN-seq: a database of carbohydrate-active enzyme (CAZyme) sequence and annotation. Nucleic Acids Res. 2018; 46:D516–D521.
3. Zheng J., Hu B., Zhang X., Ge Q., Yan Y., Akresi J., Piyush V., Huang L., Yin Y. dbCAN-seq update: CAZyme gene clusters and substrates in microbiomes. Nucleic Acids Res. 2023; 51:D557–D563.
4. El Kaoutari A., Armougom F., Gordon J.I., Raoult D., Henrissat B. The abundance and variety of carbohydrate-active enzymes in the human gut microbiota. Nat. Rev. Microbiol. 2013; 11:497–504.
5. Terrapon N., Lombard V., Drula E., Lapebie P., Al-Masaudi S., Gilbert H.J., Henrissat B. PULDB: the expanded database of polysaccharide utilization loci. Nucleic Acids Res. 2018; 46:D677–D683.
6. Terrapon N., Lombard V., Gilbert H.J., Henrissat B. Automatic prediction of polysaccharide utilization loci in Bacteroidetes species. Bioinformatics. 2015; 31:647–655.
7. Grondin J.M., Tamura K., Dejean G., Abbott D.W., Brumer H. Polysaccharide utilization loci: fueling microbial communities. J. Bacteriol. 2017; 199:e00860-16.
8. Martens E.C., Koropatkin N.M., Smith T.J., Gordon J.I. Complex glycan catabolism by the human gut microbiota: the Bacteroidetes Sus-like paradigm. J. Biol. Chem. 2009; 284:24673–24677.
9. Zhang H., Yohe T., Huang L., Entwistle S., Wu P., Yang Z., Busk P.K., Xu Y., Yin Y. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2018; 46:W95–W101.
10. Lombard V., Golaconda Ramulu H., Drula E., Coutinho P.M., Henrissat B. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res. 2014; 42:D490–D495.
11. Drula E., Garron M.L., Dogan S., Lombard V., Henrissat B., Terrapon N. The carbohydrate-active enzyme database: functions and literature. Nucleic Acids Res. 2022; 50:D571–D577.
12. Yin Y., Mao X., Yang J., Chen X., Mao F., Xu Y. dbCAN: a web resource for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2012; 40:W445–W451.
13. Xu J., Zhang H., Zheng J., Dovoedo P., Yin Y. eCAMI: simultaneous classification and motif identification for enzyme annotation. Bioinformatics. 2019; 36:2068–2075.
14. Ausland C., Zheng J., Yi H., Yang B., Li T., Feng X., Zheng B., Yin Y. dbCAN-PUL: a database of experimentally characterized CAZyme gene clusters and their substrates. Nucleic Acids Res. 2021; 49:D523–D528.
15. Busk P.K., Pilgaard B., Lezyk M.J., Meyer A.S., Lange L. Homology to peptide pattern for annotation of carbohydrate-active enzymes and prediction of function. BMC Bioinf. 2017; 18:214.
16. Barrett K., Hunt C.J., Lange L., Meyer A.S. Conserved unique peptide patterns (CUPP) online platform: peptide-based functional annotation of carbohydrate active enzymes. Nucleic Acids Res. 2020; 48:W110–W115.
17. Hobbs E.E.M., Gloster T., Chapman S., Pritchard L. Microbiology Society Annual Conference 2021. 2021.
18. Vatanen T., Ang Q.Y., Siegwald L., Sarker S.A., Le Roy C.I., Duboux S., Delannoy-Bruno O., Ngom-Bru C., Boulange C.L., Strazar M. et al. A distinct clade of Bifidobacterium longum in the gut of Bangladeshi children thrives during weaning. Cell. 2022; 185:4280–4297.
19. Qin Y., Havulinna A.S., Liu Y., Jousilahti P., Ritchie S.C., Tokolyi A., Sanders J.G., Valsta L., Brozynska M., Zhu Q. et al. Combined effects of host genetics and diet on human gut microbiota and incident disease in a single population cohort. Nat. Genet. 2022; 54:134–142.
20. Olm M.R., Dahan D., Carter M.M., Merrill B.D., Yu F.B., Jain S., Meng X., Tripathi S., Wastyk H., Neff N. et al. Robust variation in infant gut microbiome assembly across a spectrum of lifestyles. Science. 2022; 376:1220–1223.
21. Smits S.A., Leach J., Sonnenburg E.D., Gonzalez C.G., Lichtman J.S., Reid G., Knight R., Manjurano A., Changalucha J., Elias J.E. et al. Seasonal cycling in the gut microbiome of the Hadza hunter-gatherers of Tanzania. Science. 2017; 357:802–806.
22. Arkin A.P., Cottingham R.W., Henry C.S., Harris N.L., Stevens R.L., Maslov S., Dehal P., Ware D., Perez F., Canon S. et al. KBase: the United States Department of Energy Systems Biology Knowledgebase. Nat. Biotechnol. 2018; 36:566–569.
23. Zhou Z., Tran P.Q., Breister A.M., Liu Y., Kieft K., Cowley E.S., Karaoz U., Anantharaman K. METABOLIC: high-throughput profiling of microbial genomes for functional traits, metabolism, biogeochemistry, and community-scale functional networks. Microbiome. 2022; 10:33.
24. Shaffer M., Borton M.A., McGivern B.B., Zayed A.A., La Rosa S.L., Solden L.M., Liu P., Narrowe A.B., Rodriguez-Ramos J., Bolduc B. et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res. 2020; 48:8883–8900.
25. Jones D.R., Thomas D., Alger N., Ghavidel A., Inglis G.D., Abbott D.W. SACCHARIS: an automated pipeline to streamline discovery of carbohydrate active enzyme activities within polyspecific families and de novo sequence datasets. Biotechnol. Biofuels. 2018; 11:27.
26. Makki K., Deehan E.C., Walter J., Backhed F. The impact of dietary fiber on gut microbiota in host health and disease. Cell Host Microbe. 2018; 23:705–715.
27. Valdes A.M., Walter J., Segal E., Spector T.D. Role of the gut microbiota in nutrition and health. BMJ. 2018; 361:k2179.
28. Zmora N., Suez J., Elinav E. You are what you eat: diet, health and the gut microbiota. Nat. Rev. Gastroenterol. Hepatol. 2019; 16:35–56.
29. Deehan E.C., Yang C., Perez-Munoz M.E., Nguyen N.K., Cheng C.C., Triador L., Zhang Z., Bakal J.A., Walter J. Precision microbiome modulation with discrete dietary fiber structures directs short-chain fatty acid production. Cell Host Microbe. 2020; 27:389–404.
30. Zhao L., Zhang F., Ding X., Wu G., Lam Y.Y., Wang X., Fu H., Xue X., Lu C., Ma J. et al. Gut bacteria selectively promoted by dietary fibers alleviate type 2 diabetes. Science. 2018; 359:1151–1156.
31. Desai M.S., Seekatz A.M., Koropatkin N.M., Kamada N., Hickey C.A., Wolter M., Pudlo N.A., Kitamoto S., Terrapon N., Muller A. et al. A dietary fiber-deprived gut microbiota degrades the colonic mucus barrier and enhances pathogen susceptibility. Cell. 2016; 167:1339–1353.
32. Stam M.R., Danchin E.G., Rancurel C., Coutinho P.M., Henrissat B. Dividing the large glycoside hydrolase family 13 into subfamilies: towards improved functional annotations of alpha-amylase-related proteins. Protein Eng. Des. Sel. 2006; 19:555–562.
33. Aspeborg H., Coutinho P.M., Wang Y., Brumer H. 3rd, Henrissat B. Evolution, substrate specificity and subfamily classification of glycoside hydrolase family 5 (GH5). BMC Evol. Biol. 2012; 12:186.
34. Mewis K., Lenfant N., Lombard V., Henrissat B. Dividing the large glycoside hydrolase family 43 into subfamilies: a motivation for detailed enzyme characterization. Appl. Environ. Microbiol. 2016; 82:1686–1692.
35. Hornung B.V.H., Terrapon N. An objective criterion to evaluate sequence-similarity networks helps in dividing the protein family sequence space. bioRxiv doi: 10.1101/2022.04.19.488343, 29 April 2022, preprint: not peer reviewed.

