Bioinformatics. 2024 Apr 1;40(4):btae174. doi: 10.1093/bioinformatics/btae174

The IDSM mass spectrometry extension: searching mass spectra using SPARQL

Jakub Galgonek 1, Jiří Vondrášek 2
Editor: Peter Robinson
PMCID: PMC11034985  PMID: 38561173

Abstract

Summary

The Integrated Database of Small Molecules (IDSM) integrates data from small-molecule datasets, making them accessible through the SPARQL query language. Its unique feature is the ability to search for compounds through SPARQL based on their molecular structure. We extended IDSM to enable mass spectra databases to be integrated and searched for based on mass spectrum similarity. As sources of mass spectra, we employed the MassBank of North America database and the In Silico Spectral Database of natural products.

Availability and implementation

The extension is an integral part of IDSM, which is available at https://idsm.elixir-czech.cz. The manual and usage examples are available at https://idsm.elixir-czech.cz/docs/ms. The source code of all IDSM parts is available under open-source licences at https://github.com/idsm-src.

1 Introduction

In recent decades, there has been a significant increase in both the size and number of life science databases. Unsurprisingly, then, one of the essential features of a contemporary database is mutual interoperability. Modern databases should not only offer powerful search options, but also enable found data to be easily linked with data in other databases. Many of them are available as semantic databases using the Resource Description Framework (RDF) (Schreiber and Raimond 2014) and accessible through the SPARQL query language (Harris and Seaborne 2013). These databases include, e.g. the UniProt protein database and the Rhea reaction database, both of which are provided by the SIB Swiss Institute of Bioinformatics (SIB) (SIB RDF Group Members 2023). The YummyData site, which monitors semantic databases of interest to the biomedical community, currently lists approximately 60 databases (Yamamoto et al. 2018).

We contribute to this collective effort by operating the Integrated Database of Small Molecules (IDSM) semantic database, which makes small-molecule data available through SPARQL (Galgonek and Vondrášek 2021). IDSM integrates data primarily sourced from PubChemRDF (Fu et al. 2015), ChEMBL (Davies et al. 2015), and ChEBI (Hastings et al. 2016). Although these datasets are exported in RDF, their own SPARQL endpoints are not provided. The IDSM database greatly increases the usability of these datasets by providing a SPARQL endpoint through which they can be queried. In addition, it offers the unique feature of allowing users to search for compounds through SPARQL based on their molecular structure (Kratochvíl et al. 2019). This option is useful, for example, for searching within the web interface of the Rhea database (Bansal et al. 2022), and it can be used in conjunction with UniProt, e.g. to find all proteins that bind to ligands with structures similar to that of a query ligand (Coudert et al. 2023).

Building upon the LOTUS project (Rutz et al. 2022), we expanded IDSM with the option to search by structure for compounds coming from Wikidata (https://www.wikidata.org). With a focus on sharing knowledge in the research of natural products, the LOTUS project is expected to integrate predicted mass spectra in an upcoming version. With this in mind, we decided that the ideal next step for IDSM would be to incorporate mass spectra, which users would be able to search for using SPARQL.

2 Selected datasets

We selected the MassBank of North America (MoNA) database (https://mona.fiehnlab.ucdavis.edu) as our primary source of mass spectra. MoNA is a metadata-centric, auto-curating repository of metabolite mass spectra, metadata, and associated compounds. MoNA integrates data from many other datasets such as LipidBlast, MassBank, and GNPS, and it currently contains approximately 2 000 000 spectra of around 600 000 compounds. As a secondary data source, we utilized the In Silico Spectral Database (ISDB) of natural products calculated from structures aggregated by LOTUS (Allard et al. 2023). ISDB contains positive and negative in silico mass spectra of almost 300 000 compounds.

3 Selected ontologies

To maximize interoperability with other semantic databases, we decided not to design a bespoke ontology tailored to the selected datasets. Instead, we opted to leverage existing ontologies as much as possible. When looking for suitable ontologies, we used services such as the Ontology Lookup Service (OLS) (Cote et al. 2010), the Ontobee server (Ong et al. 2017), BioPortal (Whetzel et al. 2011), and the Open Biological and Biomedical Ontologies (OBO) Foundry (Jackson et al. 2021). Unfortunately, none of the ontologies associated with mass spectrometry (MS) proved fully suitable for representing the selected datasets in RDF. Although designed for use in proteomics, the ontology that covered most of the mass spectrometry terms we needed was the PSI (Proteomics Standards Initiative)–MS controlled vocabulary provided by the Human Proteome Organization–Proteomics Standards Initiative (HUPO–PSI) (Mayer et al. 2013). However, this vocabulary is designed only to specify the types of parameter elements in XML-based formats, such as the Mass Spectrometry Markup Language (mzML) format (Martens et al. 2011), which means that we were unable to use it as a stand-alone solution. Therefore, to fully represent the data, we opted for a more general upper-level ontology, namely the Semanticscience Integrated Ontology (SIO) (Dumontier et al. 2014). This ontology, which specializes in biomedical research and knowledge discovery, provides users with general descriptions of objects, processes, and their attributes. Using SIO, each entry of a mass spectrum database is represented as an experiment, which generates a mass spectrum from an input compound.

Attributes in the SIO ontology represent independent entities, enabling types from other ontologies to be assigned. Specifically, we assigned types from the PSI–MS controlled vocabulary for attributes related to mass spectrometry, and from the Chemical Information Ontology (CHEMINF) (Hastings et al. 2011) for attributes related to compound properties.

Another advantage of the SIO ontology is that it is used by the PubChemRDF and ChEMBL datasets, so representing the selected mass spectrum databases in this way seamlessly integrates with the overarching data model used in IDSM.

In addition to the SIO ontology, the following ontologies are employed to represent selected datasets: the Units of Measurement Ontology (UO) (Rijgersberg et al. 2011) for units of measured values; Dublin Core Metadata Initiative Metadata Terms (DCMI Usage Board 2020) to express basic information about mass spectrum libraries and experiments; the vCard ontology (Iannella and McKinney 2014) to express information about submitters; and the Simple Knowledge Organization System (SKOS) ontology (Miles and Bechhofer 2009) to cross-link entities from different datasets. All of the above ontologies were already used in IDSM.

4 Data model

A record from a source mass spectrometry dataset is represented as a mass spectrometry experiment (class sio:SIO_001180). The measured compound (class sio:SIO_011125) is related to the experiment as its input (property sio:SIO_000230). Similarly, the mass spectrum (class obo:MS_1000294) is related to the experiment as its output (property sio:SIO_000229). The mass spectrum entity refers (via property sio:SIO_000300) to the mass spectrum literal containing its measured data, i.e. the peak intensities at the measured mass-to-charge ratios.

The experiment, input compound, and output mass spectrum contain attributes encoded as separate entities. Each attribute has a type (property rdf:type) and a value (property sio:SIO_000300). Where necessary, the attribute also has a unit (property sio:SIO_000221) in addition to the value.
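The core of this model can be illustrated with a minimal Python sketch that emits the triples for one record as plain tuples. The class and property identifiers are those named above; the helper functions, the entity names passed to them, and the tuple representation are hypothetical and serve only as an illustration, not as the actual IDSM loading code.

```python
# Illustrative sketch only: triples are plain (subject, predicate, object)
# tuples, and all entity names are supplied by the caller (hypothetical).

def experiment_triples(experiment, compound, spectrum, spectrum_literal):
    """Link an experiment to its input compound and output mass spectrum."""
    return [
        (experiment, "rdf:type", "sio:SIO_001180"),      # mass spectrometry experiment
        (compound, "rdf:type", "sio:SIO_011125"),        # measured compound
        (spectrum, "rdf:type", "obo:MS_1000294"),        # mass spectrum
        (experiment, "sio:SIO_000230", compound),        # experiment input
        (experiment, "sio:SIO_000229", spectrum),        # experiment output
        (spectrum, "sio:SIO_000300", spectrum_literal),  # measured data literal
    ]


def attribute_triples(attribute, attribute_type, value, unit=None):
    """Encode one attribute as a separate entity with a type, a value and,
    where necessary, a unit."""
    triples = [
        (attribute, "rdf:type", attribute_type),
        (attribute, "sio:SIO_000300", value),
    ]
    if unit is not None:
        triples.append((attribute, "sio:SIO_000221", unit))
    return triples
```

Keeping attributes as separate entities is what allows a type from another ontology to be attached to each of them independently of the entity they describe.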

The experiment parameters, such as ion mode or collision energy, are encoded as attributes of appropriate types derived from the PSI–MS vocabulary and linked to the experiment (property sio:SIO_000553). Attributes representing chemical qualities and compound identifiers are categorized based on appropriate types from the CHEMINF ontology and linked to a compound (properties sio:SIO_000011 and sio:SIO_000672 respectively). In the case of the mass spectrum, its SPLASH identifier is represented in a similar way.

If the original record contains annotations of peaks, each annotated peak is represented as a separate entity (class obo:MS_1000231) and connected to the spectrum as its component part (property sio:SIO_000313) at a given mass-to-charge ratio position (property sio:SIO_000056).

Annotations of experiments (called tags in the MoNA database) and peaks are encoded as attributes of the type annotation (class sio:SIO_001166) and related to the corresponding entity (property sio:SIO_000254).

To preserve information about the origins of the records, the experiments are organized (property sio:SIO_001278) into datasets (class sio:SIO_000089) based on their original sources. Each experiment is also connected (property sio:SIO_000066) to the person (class vcard:Individual) who submitted the corresponding original record to the original database.

Figure 1 shows a usage example of the data model. Details of the data model used to represent the mass spectrum data are available at https://idsm.elixir-czech.cz/docs/ms.

Figure 1. An example of the representation of a MoNA record in the RDF data model.

5 Interlinking with other datasets

An important requirement for interoperability is ensuring that the data is not isolated but cross-linked with other datasets. The MoNA database uses ClassyFire (Djoumbou Feunang et al. 2016) to automatically classify compounds. Unfortunately, this classification is not used throughout the rest of IDSM. However, the ontology of this classification contains references to equivalent classes in Medical Subject Headings (MeSH) (Rogers 1963) and ChEBI classifications. This allowed us to supplement the MoNA dataset with the basic classification according to MeSH and ChEBI (via property rdf:type), both of which are already used in IDSM. We used the ROBOT tool to convert the ClassyFire ontology from OBO to the Web Ontology Language (OWL) (Jackson et al. 2019).

Some identifiers of the compounds from the MoNA dataset refer to compounds contained in the PubChem dataset. This enabled us to establish direct cross-links for these compounds (via property skos:closeMatch). To increase this cross-linking further, we also added links based on matching InChI identifiers (Heller et al. 2015). This type of interlinking was also applied to compounds from the ISDB dataset. ISDB compounds are also cross-linked with their origin compounds from the LOTUS project.
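The InChI-based cross-linking described above can be sketched as a simple join on identical InChI strings. This is a minimal Python illustration under the assumption that each dataset is available as a mapping from a compound identifier to its InChI; the function name and the input structure are hypothetical, and the actual IDSM procedure may differ.

```python
# Illustrative sketch only: link compounds from two datasets whose InChI
# strings are identical, emitting skos:closeMatch triples.

def close_match_links(source_inchi, target_inchi):
    """source_inchi, target_inchi: dicts mapping compound id -> InChI.
    Returns (source id, "skos:closeMatch", target id) triples."""
    # Index the target dataset by InChI; several compounds may share one.
    by_inchi = {}
    for target_id, inchi in target_inchi.items():
        by_inchi.setdefault(inchi, []).append(target_id)

    links = []
    for source_id, inchi in source_inchi.items():
        for target_id in by_inchi.get(inchi, []):
            links.append((source_id, "skos:closeMatch", target_id))
    return links
```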

6 Mass spectrum similarity support

To facilitate the mass spectrum similarity search, we utilized an in-house port of the matchms package (Huber et al. 2020). Matchms provides several frequently used similarity scores to compare mass spectra. The most important of these are several variants of cosine similarity, which is based on comparing peak positions and intensities.
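To make the idea of a cosine score over peak positions and intensities concrete, the following is a simplified Python sketch of a greedy variant: peaks from the two spectra are paired when their mass-to-charge ratios differ by no more than a tolerance, the pairs with the largest intensity products are accepted first, and the sum is normalized by the intensity norms of both spectra. It is an illustration only, neither the matchms code nor the IDSM port, and the default tolerance is an arbitrary illustrative value.

```python
import math

# Simplified sketch of a greedy cosine score; not the matchms or IDSM code.

def cosine_greedy(spec_a, spec_b, tolerance=0.1):
    """spec_a, spec_b: lists of (mz, intensity) pairs.
    Returns (score, number of matched peak pairs)."""
    # Collect every peak pair whose m/z values lie within the tolerance.
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            if abs(mz_a - mz_b) <= tolerance:
                candidates.append((int_a * int_b, i, j))

    # Greedily accept the strongest pairs, using each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    total, matched = 0.0, 0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += product
            matched += 1

    norm_a = math.sqrt(sum(x * x for _, x in spec_a))
    norm_b = math.sqrt(sum(x * x for _, x in spec_b))
    norm = norm_a * norm_b
    return (total / norm if norm > 0 else 0.0), matched
```

The greedy assignment is cheap but not guaranteed to be optimal; an assignment solved with the Hungarian algorithm yields the best possible pairing at a higher computational cost.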

Relevant parts of the matchms package were ported from Python to C as a PostgreSQL extension. This enabled us to easily integrate it into our SPARQL engine, similar to the way we previously integrated the Sachem extension to search for compounds by molecular structure (Kratochvíl et al. 2018).

In IDSM, the value of each mass spectrum is represented as a literal of the ms:spectrum type. The similarity score of two spectra can be calculated using the functions ms:cosineHungarian, ms:cosineGreedy, or ms:modifiedCosine taken from the matchms package. All of these functions can then be used in a filter statement to search for spectra similar to a given query spectrum. This approach allows similarity searches to be easily combined with searches based on other criteria.
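As a rough illustration of how such a filter might be assembled, the Python sketch below builds a query string that binds a similarity score and filters on it. Several details are assumptions rather than facts taken from the IDSM manual: the two-argument call form of ms:cosineGreedy (the real function may take further parameters, such as a tolerance), the lexical form of the ms:spectrum literal (treated here as an opaque string), and the ms: namespace IRI, which must be supplied by the caller. The manual at https://idsm.elixir-czech.cz/docs/ms remains the authoritative reference.

```python
# Illustrative sketch only: the call form of ms:cosineGreedy and the
# literal syntax are ASSUMPTIONS; consult the IDSM manual for the real ones.

def build_similarity_query(ms_namespace, query_spectrum, threshold=0.8):
    """Return a SPARQL query string selecting experiments whose output
    spectrum scores at least `threshold` against `query_spectrum`."""
    # Reject characters that would break out of the quoted literal.
    if any(ch in query_spectrum for ch in '"\\\n'):
        raise ValueError("query_spectrum contains unsupported characters")
    return f"""PREFIX sio: <http://semanticscience.org/resource/>
PREFIX ms: <{ms_namespace}>
SELECT ?experiment ?score WHERE {{
  ?experiment sio:SIO_000229 ?spectrum .
  ?spectrum sio:SIO_000300 ?value .
  BIND(ms:cosineGreedy(?value, "{query_spectrum}"^^ms:spectrum) AS ?score)
  FILTER(?score >= {threshold})
}}"""
```

Because the similarity test is an ordinary filter, further triple patterns (for example, on experiment attributes) can be added to the same WHERE clause to combine it with other criteria.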

7 Availability and sample queries

The IDSM SPARQL endpoint is available at https://idsm.elixir-czech.cz/sparql/endpoint/idsm. Although a basic user interface is available under this address, it is mainly intended as a programming interface or for use in federated queries. A more user-friendly interface is available at https://idsm.elixir-czech.cz/chemweb, which also supports advanced visualization of the results found. Sample queries are presented in the manual page available at https://idsm.elixir-czech.cz/docs/ms.
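For programmatic access, a query can be submitted to the endpoint over HTTP. The Python sketch below prepares, but does not send, a GET request in the style of the SPARQL 1.1 Protocol, using only the standard library. It assumes that the endpoint accepts the query in a `query` URL parameter and honours an `Accept` header requesting JSON results; neither behaviour is stated in this article, and the helper name is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://idsm.elixir-czech.cz/sparql/endpoint/idsm"


def prepare_sparql_request(query, endpoint=ENDPOINT):
    """Build (without sending) a GET request carrying a SPARQL query.

    Assumes SPARQL 1.1 Protocol conventions: the query text goes into the
    'query' URL parameter and JSON results are requested via 'Accept'.
    """
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the returned object to `urllib.request.urlopen` would send the request; long queries may be better suited to a POST request, which is not shown here.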

The SPARQL language provides a wide range of search options. For example, the user can search for compounds based on the similarity of their mass spectra to a specified query spectrum (see Query 1 in the manual). Similarly, it is possible to search for the mass spectra of compounds that have certain properties. Searching by structure is also supported in selected datasets, which makes it possible to select the mass spectra of only those compounds that contain a specified molecular structure (see Query 2).

Since the datasets are interlinked with data already contained in IDSM, it is possible, for instance, to obtain the spectra of all compounds positively tested against a given protein target in PubChem (Query 3). These spectra can be further used within a single query, e.g. for selecting other spectra similar to them (Query 4). Data interlinking can also be used in federated queries, which, in addition to IDSM, also involve other SPARQL endpoints in the query (Query 5).

8 Conclusion

The mass spectrometry extension of the IDSM semantic database allows users to search for data related to small-molecule mass spectra. This search can also be performed based on the similarity of mass spectra. Built on well-established ontologies already used in IDSM, the extension seamlessly integrates within the overarching IDSM data model. Its main advantages over the original datasets are the support for a full query language and smoother integration with other semantic databases. Through federated queries, this database can be queried together with other semantic databases, thus increasing the overall usability of the data spread between different sources.

Acknowledgements

We would like to thank Marek Mošna for his assistance with matchms porting.

Contributor Information

Jakub Galgonek, Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 2, Prague 160 00, Czech Republic.

Jiří Vondrášek, Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo náměstí 2, Prague 160 00, Czech Republic.

Conflict of interest

None declared.

Funding

This work was funded by the Ministry of Education, Youth and Sports of the Czech Republic under the ELIXIR CZ programme, which supports national large research infrastructure for biological data [ID LM2023055]. This work has received funding from the European Union’s Horizon Europe Programme under Grant Agreement No. 101082304 (BlueRemediomics). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Executive Agency (REA). Neither the European Union nor the granting authority can be held responsible for them.

Data availability

The source code of all IDSM parts is available at https://github.com/idsm-src. Datasets used in this manuscript are publicly available via https://mona.fiehnlab.ucdavis.edu for the MoNA dataset and at doi.org/10.5281/zenodo.8287341 for the ISDB dataset.

References

  1. Allard P-M, Bisson J, Rutz A. ISDB. In Silico Spectral Databases of Natural Products. Zenodo, 2023. 10.5281/zenodo.8287341.
  2. Bansal P, Morgat A, Axelsen KB. et al. Rhea, the reaction knowledgebase in 2022. Nucleic Acids Res 2022;50:D693–700.
  3. Cote R, Reisinger F, Martens L. et al. The ontology lookup service: bigger and better. Nucleic Acids Res 2010;38:W155–60.
  4. Coudert E, Gehant S, de Castro E. et al.; UniProt Consortium. Annotation of biologically relevant ligands in UniProtKB using ChEBI. Bioinformatics 2023;39:btac793.
  5. Davies M, Nowotka M, Papadatos G. et al. ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res 2015;43:W612–20.
  6. DCMI Usage Board. DCMI Metadata Terms. 2020. http://dublincore.org/specifications/dublin-core/dcmi-terms/2020-01-20/.
  7. Djoumbou Feunang Y, Eisner R, Knox C. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J Cheminform 2016;8:61.
  8. Dumontier M, Baker CJ, Baran J. et al. The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. J Biomed Semantics 2014;5:14.
  9. Fu G, Batchelor C, Dumontier M. et al. PubChemRDF: towards the semantic annotation of PubChem compound and substance databases. J Cheminform 2015;7:34.
  10. Galgonek J, Vondrášek J. IDSM ChemWebRDF: SPARQLing small-molecule datasets. J Cheminform 2021;13:38.
  11. Harris S, Seaborne A. SPARQL 1.1 Query Language. World Wide Web Consortium, 2013. https://www.w3.org/TR/2013/REC-sparql11-query-20130321/.
  12. Hastings J, Chepelev L, Willighagen E. et al. The chemical information ontology: provenance and disambiguation for chemical data on the biological semantic web. PLoS One 2011;6:e25513.
  13. Hastings J, Owen G, Dekker A. et al. ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res 2016;44:D1214–9.
  14. Heller SR, McNaught A, Pletnev I. et al. InChI, the IUPAC international chemical identifier. J Cheminform 2015;7:23.
  15. Huber F, Verhoeven S, Meijer C. et al. matchms – processing and similarity evaluation of mass spectrometry data. JOSS 2020;5:2411.
  16. Iannella R, McKinney J. vCard Ontology – for describing People and Organizations. World Wide Web Consortium, 2014. https://www.w3.org/TR/2014/NOTE-vcard-rdf-20140522/.
  17. Jackson R, Matentzoglu N, Overton JA. et al. OBO foundry in 2021: operationalizing open data principles to evaluate ontologies. Database (Oxford) 2021;2021:baab069.
  18. Jackson RC, Balhoff JP, Douglass E. et al. ROBOT: a tool for automating ontology workflows. BMC Bioinform 2019;20:407.
  19. Kratochvíl M, Vondrášek J, Galgonek J. Sachem: a chemical cartridge for high-performance substructure search. J Cheminform 2018;10:27.
  20. Kratochvíl M, Vondrášek J, Galgonek J. Interoperable chemical structure search service. J Cheminform 2019;11:45.
  21. Martens L, Chambers M, Sturm M. et al. mzML – a community standard for mass spectrometry data. Mol Cell Proteomics 2011;10:R110.
  22. Mayer G, Montecchi-Palazzi L, Ovelleiro D. et al.; HUPO-PSI Group. The HUPO proteomics standards initiative – mass spectrometry controlled vocabulary. Database (Oxford) 2013;2013:bat009.
  23. Miles A, Bechhofer S. SKOS Simple Knowledge Organization System Reference. World Wide Web Consortium, 2009. https://www.w3.org/TR/2009/REC-skos-reference-20090818/.
  24. Ong E, Xiang Z, Zhao B. et al. Ontobee: a linked ontology data server to support ontology term dereferencing, linkage, query and integration. Nucleic Acids Res 2017;45:D347–52.
  25. Rijgersberg H, Wigham M, Top JL. How semantics can improve engineering processes: a case of units of measure and quantities. Adv Eng Inform 2011;25:276–87.
  26. Rogers FB. Medical subject headings. Bull Med Libr Assoc 1963;51:114–6.
  27. Rutz A, Sorokina M, Galgonek J. et al. The LOTUS initiative for open knowledge management in natural products research. Elife 2022;11:e70780.
  28. Schreiber G, Raimond Y. RDF 1.1 Primer. World Wide Web Consortium, 2014. https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/.
  29. SIB Swiss Institute of Bioinformatics RDF Group Members. The SIB Swiss Institute of Bioinformatics Semantic Web of data. Nucleic Acids Res 2024;52(D1):D44–D51.
  30. Whetzel PL, Noy NF, Shah NH. et al. BioPortal: enhanced functionality via new web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res 2011;39:W541–5.
  31. Yamamoto Y, Yamaguchi A, Splendiani A. YummyData: providing high-quality open life science data. Database (Oxford) 2018;2018:bay022.



Articles from Bioinformatics are provided here courtesy of Oxford University Press