Glycobiology. 2023 Feb 17;33(5):354–357. doi: 10.1093/glycob/cwad014

The Glycan Structure Dictionary—a dictionary describing commonly used glycan structure terms

Jeet Vora, Rahi Navelkar, K Vijay-Shanker, Nathan Edwards, Karina Martinez, Xiying Ding, Tianyi Wang, Peng Su, Karen Ross, Frederique Lisacek, Catherine Hayes, Robel Kahsay, Rene Ranzinger, Michael Tiemeyer, Raja Mazumder
PMCID: PMC10243773  PMID: 36799723

Abstract

Recent technological advances in glycobiology have resulted in a large influx of data and the publication of many papers describing discoveries in glycoscience. However, the terms used to describe glycan structural features are not standardized, which makes it difficult to harmonize data across biomolecular databases, hampers the harvesting of information across studies, and hinders text mining and curation efforts. To address this shortcoming, the Glycan Structure Dictionary has been developed as a reference dictionary that provides a standardized list of widely used glycan terms to help in the curation and mapping of glycan structures described in publications. Currently, the dictionary has 190 glycan structure terms with 297 synonyms linked to 3,332 publications. For a term to be included in the dictionary, it must be present in at least 2 peer-reviewed publications. Synonyms, annotations, and cross-references to GlyTouCan, GlycoMotif, and other relevant databases and resources are also provided when available. The purpose of this effort is to facilitate biocuration, assist in the development of text mining tools, improve the harmonization of search and browse capabilities in glycoinformatics resources, and help to map glycan structures to function and disease. It is also expected that, over time, authors will use these terms to describe glycan structures in their manuscripts. A mechanism is also provided for researchers to submit terms for potential incorporation. The dictionary is available at https://wiki.glygen.org/Glycan_structure_dictionary.

Keywords: Dictionary, Glycan structure term, Glycobioinformatics, GlyGen, Text mining

Introduction

Glycans mediate important biological functions, serve as biomarkers for diseases, regulate host-pathogen interactions, and contribute to ongoing efforts to develop novel biotherapeutics, bioenergy sources, and biomaterials. Glycan-related research, databases, and bioinformatics tools have advanced significantly in the last 20 years, providing ever-increasing volumes of glycoproteomics and glycomics data (Varki 2017). However, significant differences exist in the terms that authors use to describe and report glycan structural features in the literature. The lack of a reference dictionary, which can be leveraged by researchers to report glycan structures in a standardized manner, complicates efforts to extract glycan-related information from the literature through text mining and to harmonize annotations in bioinformatics databases. To bridge this gap, we have developed the Glycan Structure Dictionary (GSD) that encompasses a reference list of widely used glycan terms and their definitions.

The goal of our effort is to collect, in the form of a dictionary, the commonly used glycan structure terms from peer-reviewed publications and to provide their meaning, synonyms, and at least one sentence from a publication that uses the term. Such a resource will assist GlyGen (Kahsay et al. 2020; York et al. 2020) and other glycoinformatics resources engaged in text mining efforts in connecting glycans to function, biomarkers, and more. These terms, whenever possible, are linked to the glycan detail pages in GlyGen via GlyTouCan accessions. The intent of this manuscript is to make users aware of this resource so that they can browse the terms and also propose new terms or synonyms. Terms that are found in the abstract or the main text of 2 or more peer-reviewed publications will be incorporated into the dictionary. Terms that are present in only one peer-reviewed publication will be flagged as “provisional” until another paper that uses the term is identified through text mining or manual curation. The purpose of our effort is multipronged: (i) assist in developing text mining tools and curation efforts; (ii) encourage authors to use terms and their accessions already in the dictionary whenever possible; (iii) improve search and browse capabilities of glycoinformatics resources by hyperlinking these terms and/or allowing searches using these terms.

Workflow

GSD is designed to generate a reference list for meta-analysis and text mining to help researchers and informaticians extract, transfer, and search glycan information efficiently. To achieve this goal, a term list was initially derived through the manual curation of abstracts. The publication list was generated by searching PubMed using the following search string: [(“glycosylation site” OR “glycan structure” OR “glycan composition”)] (sorted by best match). The double quotes in the query ensure an exact match of the words in the given order. A total of 1,093 papers were identified based on a manual review of all titles, and 195 publications were finally selected to extract glycan structure terms from the abstracts. Although this approach may not have retrieved all possible terms, it did retrieve many common terms, which we then used to manually check the main body of the manuscripts for additional terms. The major limitation of this approach is that not all synonyms might be captured. We have provided a mechanism to register terms and synonyms and hope that researchers will submit terms to fill this gap (see the section below on “Term submission”).

Additional terms and synonyms were added from resources such as GlycoEpitope (https://www.glycoepitope.jp/epitopes/epitope_list), GlycoMotif (http://glycomotif.glyomics.org/), and NCBI SNFG (https://ncbi.nlm.nih.gov/glycans/snfg.html) (Varki et al. 2015). Like an entry in a language dictionary, each term collected in the dictionary is annotated with additional information such as a definition, literature evidence (PMID), a sentence from a paper where the term is found, cross-links from other open-access databases, synonyms, functions, disease associations, Wikipedia pages, and relevant chapter number(s) in Essentials of Glycobiology. All terms have a GSD accession and, whenever available, are also linked to GlyTouCan accessions (Fujita et al. 2021). Certain GSD terms, e.g. high mannose, can be represented by several GlyTouCan accessions but, for simplicity, are represented by one GlyTouCan accession in the dictionary. The GSD also includes cross-references to GlycoMotif, GlycoEpitope, PubChem (Kim et al. 2021), ChEBI (Hastings et al. 2016), and other databases. Figure 1 provides a detailed overview of the workflow.
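
The PubMed search string described above can be assembled programmatically. The following is a minimal sketch, not the authors' procedure: it assumes the search is sent to the NCBI E-utilities ESearch endpoint (the text does not say whether the web interface or an API was used), and the function names and the `retmax` default are illustrative. It only builds the request URL and performs no network access.

```python
from urllib.parse import urlencode

# Phrases quoted in the search string; double quotes request exact-phrase matching.
PHRASES = ["glycosylation site", "glycan structure", "glycan composition"]


def build_pubmed_query(phrases):
    """Join exact-match phrases with OR, mirroring the search string in the text."""
    return "(" + " OR ".join(f'"{p}"' for p in phrases) + ")"


def build_esearch_url(query, retmax=20):
    """Build an ESearch URL for PubMed (assumed access route, hypothetical helper)."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        "db": "pubmed",
        "term": query,
        "sort": "relevance",  # PubMed's "best match" ordering
        "retmode": "json",
        "retmax": retmax,     # illustrative page size, not from the paper
    }
    return base + "?" + urlencode(params)


query = build_pubmed_query(PHRASES)
url = build_esearch_url(query)
print(query)
print(url)
```

Because PubMed content changes over time, rerunning such a query today would not be expected to reproduce the counts reported above.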

Fig. 1. GSD workflow. Glycan structure terms are extracted from publications using a text mining workflow and are also derived from glycan resources such as GlycoMotif, GlycoEpitope, NCBI SNFG, and the Essentials of Glycobiology textbook. A glycan structure term is added to the dictionary when the structure type is a whole glycan, glycan motif, substructure, glycan type, or subtype, and when at least 2 publications report the term. Glycan structure terms that do not meet the criteria are added as provisional terms until new information or a new publication becomes available. Annotations, such as additional PMIDs, synonyms, diseases, functions, cross-references to other glycan resources, and links to a Wikipedia page and to the chapters of Essentials of Glycobiology, are added to the terms. The dictionary containing the glycan structure terms is published as a wiki page on GlyGen Wiki (https://wiki.glygen.org/Glycan_structure_dictionary) and also as a CSV dataset on the GlyGen data portal (https://data.glygen.org/GLY_000557) that is freely accessible to community users and glycan resources. The GSD allows the submission of terms from the community (https://wiki.glygen.org/Glycan_structure_dictionary#Submit_new_terms).

Term submission

The GSD will become increasingly useful as it is populated by terms frequently used by the community. Therefore, collaborations and contributions from a broad range of glycobiologists and other scientists are welcome. Users, researchers, bioinformaticians, and other interested colleagues can submit a single term using the online form at https://data.glygen.org/gsd/, or they can submit multiple terms through a file upload mechanism (https://data.glygen.org/uploads, with a sample template) provided on the GSD Wiki Page. A term will be incorporated if it (i) describes a motif, type, subtype, branching, or terminal structure of a glycan, (ii) has at least 2 associated publications (PMIDs/DOIs) where the term is represented exactly or in its synonym form, and (iii) is not already included in the dictionary. Users can also submit updates such as annotations (synonyms, function, disease association, etc.) to existing terms.
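
The three incorporation criteria above, together with the “provisional” rule from the Introduction, can be expressed as a small decision function. This is a sketch under stated assumptions and not the GlyGen review process: the `Submission` class, the status labels, the case-insensitive duplicate check, and the rejection of terms with no publication are all illustrative choices.

```python
from dataclasses import dataclass, field

# Structure kinds named in criterion (i); the string labels are illustrative.
ALLOWED_KINDS = {"motif", "type", "subtype", "branching", "terminal structure"}


@dataclass
class Submission:
    term: str
    kind: str
    publications: list = field(default_factory=list)  # PMIDs or DOIs


def review(sub, existing_terms):
    """Return 'duplicate', 'rejected', 'provisional', or 'accepted'."""
    # Criterion (iii): not already in the dictionary (case-insensitive: assumption).
    if sub.term.casefold() in {t.casefold() for t in existing_terms}:
        return "duplicate"
    # Criterion (i): the term must describe an allowed kind of glycan structure.
    if sub.kind not in ALLOWED_KINDS:
        return "rejected"
    # Criterion (ii): at least 2 distinct publications; exactly 1 is provisional.
    n_pubs = len(set(sub.publications))
    if n_pubs >= 2:
        return "accepted"
    if n_pubs == 1:
        return "provisional"
    return "rejected"  # no supporting publication (assumption)


existing = ["GM1", "high mannose"]
# The term name and PMIDs below are placeholders, not real records.
print(review(Submission("example motif", "motif", ["PMID:1", "PMID:2"]), existing))
```

A real submission pipeline would also need to match synonyms, which the criteria allow as evidence for a term; that step is omitted here.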

Statistics and dictionary structure

Currently, the dictionary has 190 terms, 297 synonyms, and 3,332 PMIDs and is being updated regularly as part of the GlyGen production release cycle. The current terms can be accessed via GlyGen Wiki (https://wiki.glygen.org/Glycan_structure_dictionary) or as a downloadable dataset via https://data.glygen.org/GLY_000557. Each dictionary entry consists of the term (main_entry), glycan_dictionary_accession, glytoucan_accession (whenever applicable), term_in_sentence (from the publication without any edits), publication (PMID), definition (from ChEBI), term_xref (from ChEBI, PubChem, GlycoMotif, etc.), synonyms (from at least one publication), function, disease_associations, Wikipedia (if an article exists), and essentials_of_glycobiology (if available). The data fields for the example term GM1 are presented in Table 1.

Table 1. Data fields of GSD structure terms, with GM1 as an example term.

Field                        Value
SNFG Representation          [image: cwad014fx1.jpg]
term (main_entry)            GM1
glycan_dictionary_accession  GSD000091
glytoucan_accession          G48558GR
term_in_sentence             GM1-induced fluidization of the phospholipid membranes and probable physical contact between bulky sugar head group of GM1 and spectrin, … [PMID: 29920238]
publications                 18524657|31761138|36180805|29920238|33859490|…
definition                   A branched amino pentasaccharide consisting of the linear sequence β-D-Gal-(1 → 3)-β-D-GalNAc-(1 → 4)-β-D-Gal-(1 → 4)-β-D-Glc having a Neu5Ac residue attached to the inner galactose via an α-(2 → 3) linkage. [CHEBI: 59208]
term_xref                    GlycoMotif:GGM.000098|CID:196569|GlycoEpitope:EP0050|
synonyms                     GM1 ganglioside|GM1-ganglioside
function                     The branched pentasaccharide chain of ganglioside GM1 is a prominent cell surface ligand, for example, for cholera toxin or tumor growth-regulatory homodimeric galectins. [PMID: 16267866]
disease_association          GM1 gangliosidosis (GM1) [PMID: 33859490]
wikipedia                    https://en.wikipedia.org/wiki/GM1
essentials_of_glycobiology   Chapter 11| Chapter 14
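
Several fields in Table 1 hold multiple values separated by a pipe character. The sketch below shows one way such fields might be split once a row has been read from the CSV dataset. The helper names are hypothetical, the dictionary keys are taken from Table 1 and may not match the exact column headers of the downloadable file, and the sample values are copied from the GM1 example.

```python
def split_multi(value):
    """Split a pipe-delimited field, dropping empty pieces and stray spaces."""
    return [part.strip() for part in value.split("|") if part.strip()]


def parse_xrefs(value):
    """Turn 'Source:ID|Source:ID|' into {source: [ids]}."""
    xrefs = {}
    for item in split_multi(value):
        source, _, identifier = item.partition(":")
        xrefs.setdefault(source, []).append(identifier)
    return xrefs


# Values copied from the GM1 row of Table 1 (subset of fields).
gm1 = {
    "term": "GM1",
    "glycan_dictionary_accession": "GSD000091",
    "glytoucan_accession": "G48558GR",
    "term_xref": "GlycoMotif:GGM.000098|CID:196569|GlycoEpitope:EP0050|",
    "synonyms": "GM1 ganglioside|GM1-ganglioside",
    "essentials_of_glycobiology": "Chapter 11| Chapter 14",
}

print(parse_xrefs(gm1["term_xref"]))
print(split_multi(gm1["synonyms"]))
print(split_multi(gm1["essentials_of_glycobiology"]))
```

Keeping a list per source in `parse_xrefs` avoids silently discarding a second identifier should an entry carry more than one cross-reference to the same database.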

Example search

Most GSD terms can be searched using multiple search options in GlyGen. For example, the GSD term “GM1” or its associated accession “GSD000091” can be searched directly from the Global Search located in the menu bar on the GlyGen pages (Global Search → GM1 or GSD000091). Another way to search for GM1 is to use the “Glycan Name” search box located in the Glycan Advanced Search (Explore → Glycan Search → Glycan Advanced Search → Glycan Name → GM1). The Super Search also allows searching for the term GM1 or GSD000091 from the Glycan tab using the “Glycan Name” or “Database ID” option, respectively, from the drop-down menu of glycan properties (Explore → Super Search → Glycan → Glycan Name or Database ID → GSD000091). The GlyGen Mapper (Tools → GlyGen Mapper) is a mapping tool that allows the searching and mapping of one associated accession to another (e.g. from GSD accession to GlyTouCan accession), providing another entry point for users.
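
The same kinds of lookup (by term, synonym, or accession, and from a GSD accession to a GlyTouCan accession) can be mimicked locally on the downloaded dataset. The sketch below is an offline illustration only: it does not call or reproduce any GlyGen search service, and the function names and the in-memory entry layout are assumptions.

```python
def build_index(entries):
    """Map every term, synonym, and GSD accession (case-folded) to its entry."""
    index = {}
    for entry in entries:
        keys = [entry["term"], entry["glycan_dictionary_accession"], *entry["synonyms"]]
        for key in keys:
            index[key.casefold()] = entry
    return index


def map_accession(index, gsd_accession, target="glytoucan_accession"):
    """Return the target accession for a GSD accession, or None if unknown."""
    entry = index.get(gsd_accession.casefold())
    return entry.get(target) if entry else None


# One entry built from the GM1 example in Table 1.
entries = [
    {
        "term": "GM1",
        "glycan_dictionary_accession": "GSD000091",
        "glytoucan_accession": "G48558GR",
        "synonyms": ["GM1 ganglioside", "GM1-ganglioside"],
    }
]

index = build_index(entries)
print(index["gm1-ganglioside"]["glycan_dictionary_accession"])
print(map_accession(index, "GSD000091"))
```

Case-folding the keys is a design choice for tolerant matching; a stricter index would keep the original case and treat differently cased spellings as distinct.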

The broad adoption of the Symbol Nomenclature for Glycans (SNFG) provided a blueprint for the visual representation of monosaccharides and glycan structures in a standardized way. Adoption of that nomenclature by researchers, authors, bioinformatics databases, and peer-reviewed journals led to faster, more effective communication and more accurate representation. Similarly, we believe that the glycan dictionary will help map glycan structures described in publications to databases such as GlyGen, GlyConnect, GlyCosmos, PubChem, and ChEBI, thus improving the curation process, access to knowledge in the glycoscience domain, and the flow of data.

Supplementary Material

GSD_workflow_final_cwad014

Acknowledgments

We thank Dr. Daniel Lyman for his valuable feedback, Reeya Gupta for maintaining the wiki page, the SNFG and eGlyc (https://glycanencyc.gitlab.io/mainpage/) community and other efforts in the glycoinformatics space for developing standards and working on facilitating glycoscience research.

Contributor Information

Jeet Vora, Department of Biochemistry & Molecular Medicine, The George Washington School of Medicine and Health Sciences, 2300 I Street NW, Washington, DC 20037, USA.

Rahi Navelkar, Department of Biochemistry & Molecular Medicine, The George Washington School of Medicine and Health Sciences, 2300 I Street NW, Washington, DC 20037, USA.

K Vijay-Shanker, Department of Computer and Information Science, University of Delaware, Smith Hall, 18 Amstel Ave, Newark, DE 19716, USA.

Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University, 3900 Reservoir Rd NW #337, Washington, DC 20007, USA.

Karina Martinez, Department of Biochemistry & Molecular Medicine, The George Washington School of Medicine and Health Sciences, 2300 I Street NW, Washington, DC 20037, USA.

Xiying Ding, Department of Biochemistry & Molecular Medicine, The George Washington School of Medicine and Health Sciences, 2300 I Street NW, Washington, DC 20037, USA.

Tianyi Wang, Department of Biochemistry & Molecular Medicine, The George Washington School of Medicine and Health Sciences, 2300 I Street NW, Washington, DC 20037, USA.

Peng Su, Department of Computer and Information Science, University of Delaware, Smith Hall, 18 Amstel Ave, Newark, DE 19716, USA.

Karen Ross, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University, 3900 Reservoir Rd NW #337, Washington, DC 20007, USA.

Frederique Lisacek, University of Geneva and Swiss Institute of Bioinformatics, CUI - 7, route de Drize, Geneva 1211, Switzerland.

Catherine Hayes, University of Geneva and Swiss Institute of Bioinformatics, CUI - 7, route de Drize, Geneva 1211, Switzerland.

Robel Kahsay, Department of Biochemistry & Molecular Medicine, The George Washington School of Medicine and Health Sciences, 2300 I Street NW, Washington, DC 20037, USA.

Rene Ranzinger, Complex Carbohydrate Research Center, The University of Georgia, 315 Riverbend Rd, Athens, GA 30602, USA.

Michael Tiemeyer, Complex Carbohydrate Research Center, The University of Georgia, 315 Riverbend Rd, Athens, GA 30602, USA.

Raja Mazumder, Department of Biochemistry & Molecular Medicine, The George Washington School of Medicine and Health Sciences, 2300 I Street NW, Washington, DC 20037, USA.

Funding

This work was supported by supplement funds from the National Institute of General Medical Sciences [3U01GM125267-04S1] to the National Institutes of Health Director’s Common Fund supported GlyGen [U01GM125267].

Conflict of Interest: None.

Data availability

All data are freely available from the GlyGen Portal (https://glygen.org/), the GlyGen Data Portal (https://data.glygen.org; the Glycan Structure Dictionary dataset is at https://data.glygen.org/GLY_000557), and the GlyGen wiki (https://wiki.glygen.org/Glycan_structure_dictionary) under the CC BY 4.0 license. This means users are free to copy, distribute, display, and make commercial use of the data (https://glygen.org/license/).

References

  1. Alocci D, Mariethoz J, Gastaldello A, Gasteiger E, Karlsson NG, Kolarich D, Packer NH, Lisacek F. GlyConnect: Glycoproteomics goes visual, interactive, and analytical. J Proteome Res. 2019:18:664–677.
  2. Fujita A, Aoki NP, Shinmachi D, Matsubara M, Tsuchiya S, Shiota M, Ono T, Yamada I, Aoki-Kinoshita KF. The international glycan repository GlyTouCan version 3.0. Nucleic Acids Res. 2021:49:D1529–D1533.
  3. Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Res. 2016:44:D1214–D1219.
  4. Kahsay R, Vora J, Navelkar R, Mousavi R, Fochtman BC, Holmes X, Pattabiraman N, Ranzinger R, Mahadik R, Williamson T, et al. GlyGen data model and processing workflow. Bioinformatics. 2020:36:3941–3943.
  5. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, et al. PubChem in 2021: New data content and improved web interfaces. Nucleic Acids Res. 2021:49:D1388–D1395.
  6. Varki A. Biological roles of glycans. Glycobiology. 2017:27:3–49.
  7. Varki A, Cummings RD, Aebi M, Packer NH, Seeberger PH, Esko JD, Stanley P, Hart G, Darvill A, Kinoshita T, et al. Symbol nomenclature for graphical representations of Glycans. Glycobiology. 2015:25:1323–1324.
  8. Yamada I, Shiota M, Shinmachi D, Ono T, Tsuchiya S, Hosoda M, Fujita A, Aoki NP, Watanabe Y, Fujita N, et al. The GlyCosmos portal: A unified and comprehensive web resource for the glycosciences. Nat Methods. 2020:17:649–650.
  9. York WS, Mazumder R, Ranzinger R, Edwards N, Kahsay R, Aoki-Kinoshita KF, Campbell MP, Cummings RD, Feizi T, Martin M, et al. GlyGen: Computational and informatics resources for Glycoscience. Glycobiology. 2020:30:72–73.


Articles from Glycobiology are provided here courtesy of Oxford University Press
