Bioinformatics. 2019 Dec 16;36(8):2438–2442. doi: 10.1093/bioinformatics/btz934

GlyMDB: Glycan Microarray Database and analysis toolset

Yiwei Cao 1, Sang-Jun Park 1, Akul Y Mehta 2, Richard D Cummings 2, Wonpil Im 1,3
Editor: Arne Elofsson
PMCID: PMC7178435  PMID: 31841142

Abstract

Motivation

Glycan microarrays are capable of illuminating the interactions of glycan-binding proteins (GBPs) against hundreds of defined glycan structures, and have revolutionized the investigations of protein–carbohydrate interactions underlying numerous critical biological activities. However, it is difficult to interpret microarray data and identify structural determinants promoting glycan binding to glycan-binding proteins due to the ambiguity in microarray fluorescence intensity and complexity in branched glycan structures. To facilitate analysis of glycan microarray data alongside protein structure, we have built the Glycan Microarray Database (GlyMDB), a web-based resource including a searchable database of glycan microarray samples and a toolset for data/structure analysis.

Results

The current GlyMDB provides data visualization and glycan-binding motif discovery for 5203 glycan microarray samples collected from the Consortium for Functional Glycomics. The unique feature of GlyMDB is to link microarray data to PDB structures. The GlyMDB provides different options for database query, and allows users to upload their microarray data for analysis. After search or upload is complete, users can choose the criterion for binder versus non-binder classification. They can view the signal intensity graph including the binder/non-binder threshold followed by a list of glycan-binding motifs. One can also compare the fluorescence intensity data from two different microarray samples. A protein sequence-based search is performed using BLAST to match microarray data with all available PDB structures containing glycans. The glycan ligand information is displayed, and links are provided for structural visualization and redirection to other modules in GlycanStructure.ORG for further investigation of glycan-binding sites and glycan structures.

Availability and implementation

http://www.glycanstructure.org/glymdb.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Glycans are abundant on the surface of both eukaryotic and prokaryotic cells, serving as the first cellular components encountered by approaching molecules, cells and pathogens. Therefore, glycans have critical roles in various biological processes, such as cell adhesion, signal transduction, host–pathogen interactions and immune activities (Varki, 2017). Glycosylation, the most common post-translational modification, is the enzymatic process to attach carbohydrates covalently to proteins and lipids to form glycoproteins and glycolipids (Apweiler et al., 1999). Glycans can also be specifically recognized and non-covalently bound by a set of proteins, known as glycan-binding proteins (GBPs) or lectins. Compared to the other three major classes of biomolecules (proteins, nucleic acids and lipids), glycans are the most heterogeneous by virtue of different anomeric configurations (α and β) of glycosidic linkages, multiple branched structures and various chemical modifications. Due to the complex nature of glycans, characterizing the binding specificities and identifying the binding determinants are major questions in glycobiology research.

In the past two decades, glycan microarrays have revolutionized the analysis of protein–glycan binding specificities. Glycan microarrays are composed of various saccharides, either chemically synthesized or purified from natural sources, immobilized on array surfaces (Rillahan and Paulson, 2011). They are incubated with increasing concentrations of a GBP, and fluorescence emission of either the fluorescent-tagged GBP or secondary reagent is measured, which illuminates the binding specificity and indirectly the affinity of the GBP toward various glycan structures (Heimburg-Molinaro et al., 2011). The Consortium for Functional Glycomics (CFG) has greatly expanded the availability of glycan array data by making the results of glycan array experiments publicly available. Though these data provide a rich insight into the specificities of GBPs, data interpretation remains challenging, particularly when the proteins have more than one binding motif or the glycans have subunits that block protein–glycan binding. Manual interpretation of complicated glycan array data is tedious and error-prone, and hence automated methods need to be developed to solve these problems in an efficient and robust manner.

A few web resources offer glycan microarray databases and related bioinformatics tools. Glycosciences Laboratory (https://glycosciences.med.ic.ac.uk/glycanLibraryIndex.html), the National Center for Functional Glycomics (https://ncfg.hms.harvard.edu/ncfg-data) and the CFG (http://functionalglycomics.org/) provide access to the microarray data from the experiments that they have performed. However, they do not provide analysis of protein–glycan interactions in terms of three-dimensional structures. GlycoPattern (Agravat et al., 2014) provides a set of tools supporting the analysis of glycan microarray data. While highly useful in identifying glycan determinants bound by GBPs, it only allows users to upload their own microarray data. MCAW-DB (Hosoda et al., 2018) is a glycan profile database containing the multi-sequence alignment analysis of 1081 glycan microarray samples collected from the CFG, and it focuses only on glycan sequence alignment. In addition, there are several reported tools, such as GlycoViewer (Joshi et al., 2010) and GLAD (Mehta and Cummings, 2019) for microarray data visualization and mining, and GlycanMotifMiner (Cholleti et al., 2012) and GlycoSearch (Kletter et al., 2015) for glycan pattern discovery.

In this work, we present the Glycan Microarray DataBase (GlyMDB), a database of glycan microarray samples with a toolset for data analysis. The GlyMDB provides a user-friendly web interface that enables users to search the database or upload their own microarray samples for data visualization, binder/non-binder classification, binding motif discovery and data comparison of two different samples. When protein sequence information is available, either included in the queried samples or provided by users when uploading data, a BLAST (Camacho et al., 2009) search is performed to find relevant PDB entries based on the protein sequence identity. We discuss the GlyMDB web interface and usage of each tool in the following sections. A stepwise guide to using the GlyMDB is also available at glycanstructure.org/glymdb/howto.

2 Materials and methods

2.1 Web interface

We obtained publicly available glycan array data from the CFG in the form of spreadsheet files. These files are parsed automatically by Python scripts, and the extracted information is stored and managed using sqlite3. The web interface of GlyMDB is built with Django and JavaScript. There are two options on the search interface: users can (i) upload microarray data in a spreadsheet file or (ii) search microarray data stored in the database (Fig. 1). To upload their own microarray data, users need to prepare a spreadsheet file containing three columns: glycan number, structure and signal intensity. An example is shown in Supplementary Material S1. Optionally, users can specify the protein sequence information of the uploaded microarray data (if available), and the GlyMDB provides the option to perform PDB matching in the result page. To search the microarray data stored in GlyMDB, users can make a query by protein name, sequence or PDB ID. When users select to make a query by protein name, the GlyMDB attempts to match the keywords in both the lectin sample name and the description originally provided on the CFG website. Since not all microarray samples include protein sequence information, a search by protein sequence or PDB ID is restricted to the samples for which a protein sequence is available. BLAST is used to align the queried protein sequence, or the sequence associated with the queried PDB ID, against the protein sequences of all stored microarray samples. By default, the threshold of sequence identity is 95%, but users can change the value based on their requirements. There are several filters available to narrow the query result by the species, family and glycan array version (Fig. 1).
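The storage and name-based query described above can be illustrated with a minimal sketch. The table layout, the function names (`load_sample`, `search_by_name`) and the toy upload rows are assumptions made for illustration only; the text states that sqlite3 is used but does not describe the actual GlyMDB schema or parsing scripts.

```python
import csv
import io
import sqlite3

# Hypothetical schema; the actual GlyMDB tables are not described in the text.
SCHEMA = """
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT,
    sequence TEXT          -- NULL when no protein sequence is available
);
CREATE TABLE intensity (
    sample_id INTEGER REFERENCES sample(id),
    glycan_number INTEGER,
    structure TEXT,
    signal REAL
);
"""

def load_sample(conn, name, description, rows, sequence=None):
    """Insert one sample; rows are (glycan number, structure, signal) triples."""
    cur = conn.execute(
        "INSERT INTO sample (name, description, sequence) VALUES (?, ?, ?)",
        (name, description, sequence),
    )
    sample_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO intensity VALUES (?, ?, ?, ?)",
        [(sample_id, int(n), s, float(v)) for n, s, v in rows],
    )
    return sample_id

def search_by_name(conn, keyword):
    """Match the keyword against both the sample name and its description."""
    pattern = f"%{keyword}%"
    return conn.execute(
        "SELECT id, name FROM sample "
        "WHERE name LIKE ? OR description LIKE ? ORDER BY id",
        (pattern, pattern),
    ).fetchall()

# Toy three-column upload (made-up values, not real array data).
UPLOAD = "1,Gala-Sp8,120.5\n2,Glca-Sp8,35.0\n3,Mana-Sp8,9800.0\n"

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
rows = list(csv.reader(io.StringIO(UPLOAD)))
sid = load_sample(conn, "toy lectin", "mannose-binding example", rows)
hits = search_by_name(conn, "mannose")
```

Here the keyword matches only the description, which mirrors the statement that a name query looks at both the sample name and the CFG description. Sequence and PDB ID queries would additionally need a BLAST step, which is not shown.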

Fig. 1.

The GlyMDB search interface. (A) Upload glycan microarray data. Shown above, a spreadsheet named ‘my_microarray.xls’ is selected and protein sequence information is written in the textbox. (B) Search database. Users can select to make a query by the name (e.g. ‘DC-SIGN’), protein sequence or PDB ID (e.g. ‘1SL5’). When searching by protein sequence or PDB ID, users need to set the threshold of sequence identity. Three options can be used to filter the results by GBP species, family and glycan array version

When the upload or search is complete, the GlyMDB shows the result page (Figs 2 and 3). The result page starts with a list of samples (Figs 2A and 3A), which includes the information extracted from the uploaded sample or the database query result. The same protein can have multiple entries in the sample list if there are data from experiments on different versions of glycan arrays or under different protein concentrations. By selecting one entry from the sample list and one criterion for binder/non-binder classification (Fig. 2A), users can view the bar chart of fluorescence intensity and the threshold distinguishing binders from non-binders (Fig. 2B). The GlyMDB also provides a GLAD-format input file (below the bar chart) so that users can apply the recently developed GLAD web application to the visualization and analysis of glycan microarray data (Mehta and Cummings, 2019). In addition, the GlyMDB shows a list of common motifs that make positive or negative contributions to binding interactions (Fig. 2C), which will be discussed in the following section. If users provide the protein sequence when uploading their own microarray data, or select a sample (stored in GlyMDB) with protein sequence available, the GlyMDB shows the option for matching the microarray data with PDB structures. A BLAST search is performed to generate a list of PDB IDs with sequence identity above the selected threshold, and Glycan Reader (Jo et al., 2011; Park et al., 2017) is used to process the PDB files and to extract the information of glycan ligands (Fig. 2D). The links for structural visualization and PDB file download are provided (Fig. 2E). Users are also able to compare two samples by selecting two entries from the sample list, which can be two different GBPs, or different samples of the same GBP (Fig. 3A). The GlyMDB uses a heat map to show the similarity and difference of fluorescence intensity between the two selected samples (Fig. 3B).

Fig. 2.

The GlyMDB result interface. (A) To view the results of binder/non-binder classification and motifs discovery, users should select only one sample from the list. The PDB file matching function works only if the selected sample has protein sequence information available. (B) The bar chart of fluorescence intensity and the sorted lists of binders and non-binders. (C) Top ranking motifs that make significant contributions to protein–glycan binding. (D) The list of PDB files matched to the selected microarray sample. There are links for visualizing PDB structures, downloading PDB files from RCSB, searching PDB IDs in our glycan-binding site database and searching glycan ligands in our glycan fragment database. (E) Structural visualization of the PDB files

Fig. 3.

The GlyMDB result interface for two-sample comparison. (A) To compare two different microarray samples, users should select two samples from the list. (B) The heat map with one-to-one comparison of the intensity of each glycan in two selected samples. If users click a glycan ID, the glycan sequence is shown

2.2 Glycan microarray data analysis toolset

2.2.1 Binder/non-binder classification

To split the glycan array into binders and non-binders, users need to select a threshold of fluorescence intensity. In GlyMDB, there are two options: P-value of z-score, and percentage of maximum intensity. A z-score measures how many standard deviations a data point lies above or below the population mean. It is used as a statistical test for the significance of a sample, with the null hypothesis that a randomly selected sample is a non-binder. The threshold is set as a P-value converted from a z-score. The default value is 0.15, the same as that used in GLYMMR (GlycanMotifMiner; Cholleti et al., 2012), but users can choose any number between 0 and 1. As the second option, users can use a certain percentage of the highest fluorescence intensity as the threshold. The default value is 10%, which means that a glycan is classified as a binder if its fluorescence intensity is >10% of the maximum intensity observed on the glycan array.
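The two criteria can be sketched as follows. This is a minimal illustration, not the GlyMDB implementation: it assumes a one-sided (upper-tail) P-value under a standard normal distribution and a population standard deviation, neither of which is specified in the text, and the function names and toy intensities are hypothetical.

```python
import math
import statistics

def upper_tail_p(z):
    """One-sided P-value of a z-score under a standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def classify_by_pvalue(intensities, p_threshold=0.15):
    """Binder if the P-value of the glycan's z-score is below the threshold."""
    mean = statistics.fmean(intensities)
    sd = statistics.pstdev(intensities)  # assumption: population SD
    if sd == 0:
        return [False] * len(intensities)
    return [upper_tail_p((x - mean) / sd) < p_threshold for x in intensities]

def classify_by_percent_max(intensities, fraction=0.10):
    """Binder if the intensity exceeds a fraction of the maximum intensity."""
    cutoff = fraction * max(intensities)
    return [x > cutoff for x in intensities]

# Toy fluorescence intensities (made-up values).
signals = [50, 80, 120, 60, 9000, 1500, 40, 70]
by_p = classify_by_pvalue(signals)
by_max = classify_by_percent_max(signals)
```

With these toy values the two criteria disagree on the glycan at 1500: it exceeds 10% of the maximum (900) but lies close to the mean in z-score terms, so only the percentage criterion calls it a binder. This is the kind of difference a user would weigh when choosing between the two options.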

2.2.2 Glycan-binding motif discovery

Though glycan array data can illuminate the binding affinity of GBPs toward a variety of defined glycans on the array, additional data interpretation is necessary to identify the glycan structure determinants for GBP specificities. To discover the motifs that make significant contributions to protein–glycan binding, the GlyMDB first breaks each glycan into fragments and then uses the support vector machine (SVM) algorithm to select the fragments with the highest importance. We chose to use the SVM with a linear kernel rather than other kernels because it outputs the weights assigned to input features (i.e. glycan fragments), and we use these weights as the measurement for the importance of each glycan fragment. As shown in Figure 4A, a given glycan sequence is fragmented by enumerating all connected substructures (Jo and Im, 2013). After that, we obtain a set of unique fragments, each of which is present in at least one glycan. For example, the CFG version 5.1 array has 610 glycans in total. For the same glycan attached to different spacer arms, we only keep the one with the highest fluorescence intensity. Consequently, there are 541 unique glycans and 14 973 unique fragments. For each of the 541 glycans, the GlyMDB generates a fingerprint showing whether the glycan contains each of the 14 973 fragments or not. The fingerprints are combined into a 541 × 14 973 binary matrix. Meanwhile, a binary array with a length of 541 is generated to indicate whether each glycan is a binder or not (Fig. 4B). These are the inputs for training an SVM classifier, for which we use the SVM module in the Scikit-learn package, a free machine learning library for Python. After training the SVM classifier, we record the weights assigned to each glycan fragment. The fragments with positive weights are considered to be the motifs that contribute to protein–glycan binding and those with negative weights are considered to be the motifs that prevent protein–glycan binding.
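The spacer-arm deduplication and fingerprint construction described above can be sketched in a few lines. The sketch assumes that fragment enumeration has already been done (the fragment sets below are written by hand for two linear toy glycans), and the function names, glycan strings and intensities are illustrative assumptions rather than GlyMDB code or data.

```python
def merge_spacer_duplicates(records):
    """Keep, for each glycan, the highest intensity across spacer arms.

    records: iterable of (glycan, spacer, intensity) tuples.
    """
    best = {}
    for glycan, _spacer, intensity in records:
        if glycan not in best or intensity > best[glycan]:
            best[glycan] = intensity
    return best

def build_fingerprints(fragments_by_glycan):
    """Return (glycans, fragments, matrix); matrix[i][j] is 1 if glycan i
    contains fragment j and 0 otherwise."""
    glycans = sorted(fragments_by_glycan)
    fragments = sorted(set().union(*fragments_by_glycan.values()))
    matrix = [
        [1 if frag in fragments_by_glycan[g] else 0 for frag in fragments]
        for g in glycans
    ]
    return glycans, fragments, matrix

# Toy records: the same glycan on two spacer arms, plus a shorter glycan.
records = [
    ("Fuca1-2Galb1-4GlcNAcb", "Sp0", 5200.0),
    ("Fuca1-2Galb1-4GlcNAcb", "Sp8", 7400.0),
    ("Galb1-4GlcNAcb", "Sp0", 90.0),
]
best = merge_spacer_duplicates(records)

# Hand-written connected substructures of each (linear) toy glycan.
fragments_by_glycan = {
    "Fuca1-2Galb1-4GlcNAcb": {
        "Fuca", "Galb", "GlcNAcb",
        "Fuca1-2Galb", "Galb1-4GlcNAcb", "Fuca1-2Galb1-4GlcNAcb",
    },
    "Galb1-4GlcNAcb": {"Galb", "GlcNAcb", "Galb1-4GlcNAcb"},
}
glycans, fragments, matrix = build_fingerprints(fragments_by_glycan)

# Binary label array: 1 for binders, using a made-up cutoff of 1000.
labels = [1 if best[g] > 1000 else 0 for g in glycans]
```

The rows of `matrix` and the entries of `labels` are aligned by glycan, which is the shape an SVM classifier expects: one fingerprint row and one binder/non-binder label per unique glycan.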
Since the number of features (i.e. fragments) is much greater than the number of training samples, recursive feature elimination (Guyon et al., 2002) is used to recursively reduce the number of features (e.g. from 14 973 to 256 features for the CFG version 5.1 array). The final SVM classifier is trained on the remaining 256 features, and fragments with positive and negative weights are ranked separately. More details are given in Supplementary Material S2. The GlyMDB ignores a fragment if it is a substructure of a larger fragment and the sub-fragment’s weight is less than or equal to the super-fragment’s weight. In addition, a top-ranked fragment with a positive (or negative) weight is ignored if the number of binders (or non-binders) containing it is below the smaller of 5 and one-third of the total number of binders (or non-binders). In the final output, the GlyMDB displays up to five top-ranked positive and negative weight fragments (if any).
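As a rough illustration of recursive feature elimination around a linear-kernel SVM, the sketch below uses scikit-learn's `RFE` and `SVC` classes on a tiny made-up fingerprint matrix. The text confirms that scikit-learn's SVM module is used, but the use of the `RFE` class, the value of `C`, the elimination step size, the helper name `select_and_weigh` and the placeholder fragment names are all assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

def select_and_weigh(X, y, names, n_keep):
    """Run RFE around a linear-kernel SVM and return {fragment: weight} for
    the surviving fragments, taken from the classifier refit on them."""
    selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=n_keep, step=1)
    selector.fit(X, y)
    kept = [n for n, keep in zip(names, selector.support_) if keep]
    weights = selector.estimator_.coef_.ravel()
    return dict(zip(kept, (float(w) for w in weights)))

# Placeholder fragment names and a toy 8 x 4 fingerprint matrix.
# motif_A: needed for binding; core_B: present in every glycan;
# blocker_C: abolishes binding when present; noise_D: unrelated to binding.
names = ["motif_A", "core_B", "blocker_C", "noise_D"]
X = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = binder, 0 = non-binder

weights = select_and_weigh(X, y, names, n_keep=2)
# Rank positive and negative fragments separately, strongest first.
positive = sorted((n for n, w in weights.items() if w > 0), key=lambda n: -weights[n])
negative = sorted((n for n, w in weights.items() if w < 0), key=lambda n: weights[n])
```

In this toy case the ubiquitous and unrelated fragments receive near-zero weights and are eliminated first, leaving one positive-weight fragment and one negative-weight fragment. The substructure and minimum-occurrence filters described above would be applied to the ranked lists afterwards and are not shown here.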

Fig. 4.

Glycan fragment and fingerprint. (A) Glycan fragments are generated by enumerating all connected substructures of a given glycan sequence. (B) Glycan fingerprints are binary arrays indicating whether a glycan includes each fragment or not

2.2.3 Glycan array sample comparison

To investigate whether a GBP has consistent binding specificity when an experiment is performed under different protein concentrations or using different glycan arrays, or to investigate whether two different lectins have similar binding specificity, one needs to compare the glycan microarray data of two different samples. The GlyMDB allows users to compare the microarray data of two different samples, and a heat map based on the fluorescence intensity of each glycan is made for one-to-one comparison (Fig. 3). The color scale is computed independently for each of the two selected samples. For each sample, red corresponds to the maximum intensity and white corresponds to any intensity lower than the threshold for binder/non-binder classification. Thus, only binders are highlighted in red and all non-binders are shown in white.
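The per-sample color scale can be sketched as below. The text fixes only the two endpoints (white below the threshold, red at the maximum); the linear interpolation in between, the RGB encoding and the function names are assumptions made for illustration.

```python
def heat_color(intensity, threshold, max_intensity):
    """Map an intensity to an (R, G, B) tuple: white at or below the
    threshold, pure red at the sample maximum, linear in between."""
    if intensity <= threshold or max_intensity <= threshold:
        return (255, 255, 255)
    frac = min(1.0, (intensity - threshold) / (max_intensity - threshold))
    fade = round(255 * (1.0 - frac))
    return (255, fade, fade)

def compare_samples(sample_a, sample_b, threshold_a, threshold_b):
    """Return rows of (glycan_id, color_a, color_b) for glycans present in
    both samples; each sample is scaled by its own maximum and threshold."""
    max_a, max_b = max(sample_a.values()), max(sample_b.values())
    shared = sorted(set(sample_a) & set(sample_b))
    return [
        (
            gid,
            heat_color(sample_a[gid], threshold_a, max_a),
            heat_color(sample_b[gid], threshold_b, max_b),
        )
        for gid in shared
    ]

# Toy samples keyed by glycan ID (made-up intensities); thresholds are
# 10% of each sample's maximum.
sample_a = {1: 100, 2: 5000, 3: 10000}
sample_b = {1: 50, 2: 200, 3: 400, 4: 30}
rows = compare_samples(sample_a, sample_b, threshold_a=1000, threshold_b=40)
```

Because each sample is scaled on its own, glycan 3 is shown in full red in both columns even though its raw intensity differs by a factor of 25 between the two toy samples. This is what makes samples measured at different protein concentrations visually comparable.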

2.2.4 Cross-linking glycan microarray to PDB

Though microarray data contain substantial information about the specificity of GBPs, they do not provide three-dimensional structural information, such as the binding site on a protein for a glycan. In contrast, PDB files contain such structural information, yet they are not sufficient to elucidate the specificity of proteins, since the glycan ligands in PDB structures are generally limited in length and variety. To bridge the gap between microarray data and PDB structures, the GlyMDB attempts to find all relevant PDB structures for each given microarray sample.

When users select or upload a microarray sample with protein sequence information available, we first use BLAST to query this sequence against all PDB sequences and record the PDB IDs if the sequence identities are above the user-specified threshold. Glycan Reader is used to automatically detect and annotate glycan units, glycosidic linkages and chemical modifications. By default, the list of PDB IDs is sorted by the length of the largest glycan ligand contained in each PDB structure, and users can filter the list by glycan ligand length. The PDB list can also be sorted in the order of sequence identity and PDB resolution. In addition, the links are provided for downloading PDB files from RCSB, searching PDB IDs in our glycan-binding site database and searching glycan ligands in our glycan fragment database (Jo and Im, 2013). The PDB search and visualization are illustrated with Aspergillus fumigatus lectin that shows specificity for glycans with terminal fucose, particularly the glycans with terminal Fucα1-2Galβ1-4(Fucα1-3)GlcNAc substructure. After PDB search, we found four PDB entries—4D4U, 4AH4, 4AGT and 4AHA. As shown in Figure 5, 4D4U is a dimer containing multiple glycan-binding sites, and it can bind Fucα1-2Galβ1-4(Fucα1-3)GlcNAc with two different binding poses.
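The filtering and sorting of matched PDB entries can be sketched as follows, assuming the BLAST and Glycan Reader steps have already produced one record per PDB entry. The `PdbHit` fields, the `select_hits` helper, the placeholder IDs and all numeric values are hypothetical and are not taken from GlyMDB or from real PDB entries.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PdbHit:
    pdb_id: str
    identity: float         # percent sequence identity from BLAST
    resolution: float       # in angstroms (lower is better)
    max_glycan_length: int  # residues in the largest glycan ligand

def select_hits(hits, min_identity=95.0, min_glycan_length=0, sort_by="glycan"):
    """Keep hits above the identity threshold and at or above the glycan
    length filter, then sort by the chosen key (default: largest glycan first)."""
    kept = [
        h for h in hits
        if h.identity > min_identity and h.max_glycan_length >= min_glycan_length
    ]
    keys = {
        "glycan": lambda h: -h.max_glycan_length,
        "identity": lambda h: -h.identity,
        "resolution": lambda h: h.resolution,
    }
    return sorted(kept, key=keys[sort_by])

# Placeholder IDs and made-up values for illustration only.
hits = [
    PdbHit("0AAA", 100.0, 1.9, 1),
    PdbHit("0BBB", 98.5, 2.4, 4),
    PdbHit("0CCC", 91.0, 1.5, 6),
    PdbHit("0DDD", 99.0, 2.0, 0),
]
default_order = [h.pdb_id for h in select_hits(hits)]
by_resolution = [h.pdb_id for h in select_hits(hits, sort_by="resolution")]
with_glycans = [h.pdb_id for h in select_hits(hits, min_glycan_length=1)]
```

Sorting by largest glycan first puts the structures most informative about the binding site at the top of the list, while the entry with the longest ligand in this toy set is dropped because it falls below the identity threshold.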

Fig. 5.

PDB search and visualization. PDB ID 4D4U is matched to the microarray data of Aspergillus fumigatus lectin that shows specificity to glycans with terminal fucose. The PDB structure shows that the dimer in 4D4U contains multiple binding sites and binds Fucα1-2Galβ1-4(Fucα1-3)GlcNAc with different binding poses

3 Results and discussion

As of June 2019, the GlyMDB contains 5203 glycan microarray samples collected from the CFG. Multiple experimental data of the same GBP on different glycan arrays (from version 1.0 to 5.2) or under different concentrations are counted as multiple samples. Among the 5203 microarray samples, 1849 have protein sequence information available (Fig. 6A). We performed a BLAST search against all protein sequences from PDB protein structures (as of June 2019) with sequence identity >95%, and the numbers of matched PDB entries are shown in Figure 6B. We extracted the glycan information from each PDB file and the numbers of PDB entries containing glycan ligands are also shown in Figure 6B. Since multiple microarray samples can have the same protein sequence that is matched to the same PDB file, we removed redundancy, leaving 1965 unique PDB entries. A total of 771 out of the 1965 PDB entries contain glycan ligands, and the length distribution of the largest glycan ligand in each PDB file is shown in Figure 6C.

Fig. 6.

Statistics of GlyMDB and related PDB files. (A) Numbers of microarray samples and numbers of microarray samples with protein sequence information available, grouped by CFG glycan array versions. (B) Numbers of PDB structures and numbers of PDB structures containing glycan ligands, grouped by CFG glycan array versions. The sequence identity of the PDB structure(s) to the corresponding microarray sample is >95%. (C) Length distribution of the largest glycan ligand in each PDB file. Numbers of unique proteins are calculated by removing multiple PDB entries corresponding to the same protein

Multivalency is common in protein–glycan interactions. In these cases, the glycan-binding sites occur between protein monomers instead of within one monomer. However, it is difficult, merely from the sequence, to identify whether the binding interaction is multivalent, which is one of the reasons that we wished to bridge microarray data to PDB structures. BLAST protein searches can find available multimeric protein structures even if the queried protein sequence only covers one monomer. With the multimeric structures, one can perform further investigations to locate the glycan-binding sites and deduce how the protein interacts with the glycan ligand.

In the current release of GlyMDB, we have some requirements on the format of user-uploaded microarray data. The glycan sequence should be represented by the text nomenclature recommended by CFG (http://www.functionalglycomics.org/static/consortium/Nomenclature.shtml). To make it easier for users to specify glycan structures, we will support more glycan sequence formats and glycan accession numbers, such as GlyTouCan ID (Tiemeyer et al., 2017), in a future release of GlyMDB. For structural visualization, we utilize NGL viewer (Rose et al., 2018), which is a web application for macromolecular structure visualization, and it is also the 3D structure viewer embedded in the RCSB Protein Data Bank website. NGL provides a comprehensive set of molecular representations and allows users to highlight and focus on each selected ligand. We plan to embed other 3D structure viewers, such as LiteMol (Sehnal et al., 2017), into our website, so that users can take advantage of the features in different structure viewers and choose the one that satisfies their requirements.

4 Summary

We have described the development and usage of GlyMDB, which is an integrated platform for database query, user upload and data/structure analysis. It can assist glycoscientists in searching for publicly available microarray data and processing their own data. In addition to the functional features of binder/non-binder classification, glycan-binding motif discovery and glycan array sample comparison, the GlyMDB is the first tool to cross-link microarray samples to PDB structures, and this can supplement the structural information that is not included in microarray data. These tools are expected to be useful in investigating the specificity of protein–glycan interactions at both the sequence and structural levels. The database will be updated quarterly and is freely available at http://www.glycanstructure.org.

Funding

This work was supported by the National Science Foundation [DBI-1707207 to W.I.]; and National Institutes of Health Grants [P41GM103694 to R.D.C., U01GM125267 to R.D.C., Will York].

Conflict of Interest: none declared.

Supplementary Material

btz934_Supplementary_Data

References

  1. Agravat S.B. et al. (2014) GlycoPattern: a web platform for glycan array mining. Bioinformatics, 30, 3417–3418.
  2. Apweiler R. et al. (1999) On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochim. Biophys. Acta, 1473, 4–8.
  3. Camacho C. et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
  4. Cholleti S.R. et al. (2012) Automated motif discovery from glycan array data. OMICS, 16, 497–512.
  5. Guyon I. et al. (2002) Gene selection for cancer classification using support vector machines. Mach. Learn., 46, 389–422.
  6. Heimburg-Molinaro J. et al. (2011) Preparation and analysis of glycan microarrays. Curr. Protoc. Protein Sci., 12, 10.
  7. Hosoda M. et al. (2018) MCAW-DB: a glycan profile database capturing the ambiguity of glycan recognition patterns. Carbohydr. Res., 464, 44–56.
  8. Jo S., Im W. (2013) Glycan fragment database: a database of PDB-based glycan 3D structures. Nucleic Acids Res., 41, D470–D474.
  9. Jo S. et al. (2011) Glycan Reader: automated sugar identification and simulation preparation for carbohydrates and glycoproteins. J. Comput. Chem., 32, 3135–3141.
  10. Joshi H.J. et al. (2010) GlycoViewer: a tool for visual summary and comparative analysis of the glycome. Nucleic Acids Res., 38, W667–W670.
  11. Kletter D. et al. (2015) Exploring the specificities of glycan-binding proteins using glycan array data and the GlycoSearch software. Methods Mol. Biol., 1273, 203–214.
  12. Mehta A.Y., Cummings R.D. (2019) GLAD: GLycan Array Dashboard, a visual analytics tool for glycan microarrays. Bioinformatics, 35, 3536–3537.
  13. Park S.J. et al. (2017) Glycan Reader is improved to recognize most sugar types and chemical modifications in the Protein Data Bank. Bioinformatics, 33, 3051–3057.
  14. Rillahan C.D., Paulson J.C. (2011) Glycan microarrays for decoding the glycome. Annu. Rev. Biochem., 80, 797–823.
  15. Rose A.S. et al. (2018) NGL viewer: web-based molecular graphics for large complexes. Bioinformatics, 34, 3755–3758.
  16. Sehnal D. et al. (2017) LiteMol suite: interactive web-based visualization of large-scale macromolecular structure data. Nat. Methods, 14, 1121–1122.
  17. Tiemeyer M. et al. (2017) GlyTouCan: an accessible glycan structure repository. Glycobiology, 27, 915–919.
  18. Varki A. (2017) Biological roles of glycans. Glycobiology, 27, 3–49.
