To the editor
Over the past few years, as the use of mass spectrometry (MS) has increased, multiple spectral libraries, databases and software frameworks have been created to enable sharing and searching of MS data. However, finding all the spectra that correspond to a specific compound across different databases continues to be a challenge. A spectral identifier that improves the exchange of mass spectra, as well as provenance and duplicate detection, would address these issues and enhance searchability.
MassBank1 has been the source of data for other open libraries such as the Global Natural Products Social Molecular Networking2 (GNPS) and Human Metabolome Database3 (HMDB) libraries and the MetaboLights reference layer4. In turn, HMDB and community-contributed spectra from GNPS have also been imported into MassBank of North America1c (MoNA), while GNPS searches public MS data against the above-mentioned libraries as well as the NIST spectral library5. The mzCloud6 library contains some spectra generated from the same raw data that were used to create MassBank records. As these examples show, the complexity and the cross-import of data are increasing, together with the number of mass spectra, such that these different resources can now contain identical or near-identical spectra under different accession numbers. For example, the library entries PR100026 (MassBank, MoNA), 5464 (HMDB), and CCMSLIB00000222858 (GNPS) all refer to exactly the same mass spectrum of caffeine, originally sourced from MassBank. As the different libraries focus on different compound domains7, users wishing to access mass spectra from all compounds must use several resources, some of which are not fully open access (e.g., NIST and mzCloud).
Mass spectra are highly variable, with one to potentially thousands of mass-to-charge (m/z) and intensity entries per spectrum, presenting a challenge in the design of an optimal identifier. However, other life science databases have faced a similar need. For databases with chemical structures, the InChI code and the hashed InChIKey8,9 of fixed length, which have been broadly adopted as chemical identifiers, can be easily stored in databases, compared across resources and, for InChIKeys, searched on general-purpose search engines10. A hash is a one-directional mapping between a long, potentially complex object and a typically much shorter hash string with a fixed length of characters and numbers. For chemicals, the InChIKey is much easier to search than the (generally) much longer InChI, which contains special characters. While it is not possible to obtain the original object back purely from the hash value, hash keys provide easy access to the original data within a data collection.
We designed the SPLASH (SPectraL hASH) as an unambiguous, database-independent spectrum identifier that fulfills the criteria outlined above and offers some additional functionality. The design was inspired by the broad applicability of the InChIKey across cheminformatics. Like the InChIKey (which encodes skeleton, stereochemistry, and charge), the SPLASH contains separate blocks that define different layers of information, separated by dashes. As an example, the full SPLASH of the caffeine spectrum above is “splash10-0002-0900000000-b112e4e059e1ecf98c5f”. The first block is the SPLASH identifier, the second and third are summary blocks, and the fourth is the hash block.
To calculate a SPLASH, spectra are converted into a canonical text representation: the intensities are normalized to an integer value between 0 and 100, with m/z values given in exactly 6 decimal places. To ensure consistent handling between different software and implementations, entries with zero intensities are included, but empty (“N/A”) values are eliminated prior to creating the SPLASH. The first block (“splash10”) encodes the SPLASH identifier, starting with letters for semantic web compatibility, followed by a number representing the measurement type (1 for MS, 2 and above for other data types to be included in the future) and the SPLASH version number, starting at 0, to allow for future specification updates. Thus, splash10 is a SPLASH identifier for MS, version 0.
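The block layout described so far can be illustrated with a short Python sketch that splits a SPLASH string into its parts and reads the measurement type and version from the first block. The regular expression and the names `SPLASH_PATTERN`, `SplashParts` and `parse_splash` are assumptions made for this illustration and are not taken from the reference implementations.

```python
import re
from typing import NamedTuple

# Assumed layout, following the text: "splash" + type digit + version digit,
# a 4-character base 36 block, a 10-digit block, and a 20-character hex block.
SPLASH_PATTERN = re.compile(
    r"splash(\d)(\d)-([0-9a-z]{4})-(\d{10})-([0-9a-f]{20})"
)


class SplashParts(NamedTuple):
    measurement_type: int  # 1 = MS according to the text
    version: int           # 0 = first version of the specification
    second_block: str      # summary block from the reduced spectrum
    third_block: str       # summary block from the full spectrum
    hash_block: str        # truncated SHA-256 hash


def parse_splash(splash: str) -> SplashParts:
    """Split a SPLASH string into its blocks, or raise ValueError."""
    match = SPLASH_PATTERN.fullmatch(splash)
    if match is None:
        raise ValueError(f"not a well-formed SPLASH: {splash!r}")
    mtype, version, second, third, hash_block = match.groups()
    return SplashParts(int(mtype), int(version), second, third, hash_block)


# The caffeine SPLASH quoted in the text.
parts = parse_splash("splash10-0002-0900000000-b112e4e059e1ecf98c5f")
print(parts)
```

For the caffeine example this yields measurement type 1 (MS) and version 0, in line with the reading of “splash10” given above.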
Both the second and third blocks are spectral summaries, which serve to prefilter and restrict searches. In both blocks, intensities are summed over fixed (but different) bin sizes and wrapped over 10 bins. The wrapped (zero-based) bin index for a given ion is computed as floor(m/z ÷ BinSize) modulo 10. This wrapping strategy accommodates all possible spectral mass ranges while maintaining fixed-length summary blocks. The second block (e.g., “0002” for caffeine) is formed using a reduced spectrum (the top 10 or fewer ions greater than 10% of the base peak). This reduced spectrum is summed over bins of 5 Da. Each bin is then scaled to a single-digit integral value in base 3 (0–2), and the resulting length-10 histogram is converted to a base 36 number, resulting in a 4-digit block. In the third block (e.g., “0900000000”), the intensities of the full spectrum are summed over bins of 100 Da, and each bin is then scaled to a single base 10 digit (0–9).
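The two summary blocks can be sketched in Python as follows. This is a minimal illustration of the description above, not the reference implementation: the scaling rule (truncating `(base - 1) * bin / max_bin` to an integer) and the reading of the 10-bin histogram as a base 3 number with bin 0 as the leading digit are assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)
BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def wrapped_histogram(peaks: Sequence[Peak], bin_size: float,
                      n_bins: int = 10) -> List[float]:
    """Sum intensities into bins at floor(m/z / bin_size) modulo n_bins."""
    hist = [0.0] * n_bins
    for mz, intensity in peaks:
        hist[int(math.floor(mz / bin_size)) % n_bins] += intensity
    return hist


def scale_bins(hist: Sequence[float], base: int) -> List[int]:
    """Scale each bin to one digit in 0..base-1 (assumed: truncation)."""
    top = max(hist)
    if top <= 0:
        return [0] * len(hist)
    return [int((base - 1) * value / top) for value in hist]


def to_base36(value: int, width: int) -> str:
    """Render a non-negative integer in base 36, left-padded with zeros."""
    out = ""
    while value > 0:
        value, remainder = divmod(value, 36)
        out = BASE36_DIGITS[remainder] + out
    return out.rjust(width, "0")


def second_block(peaks: Sequence[Peak]) -> str:
    """Top 10 ions above 10% of the base peak, 5 Da bins, base 3 digits."""
    base_peak = max(intensity for _, intensity in peaks)
    strong = [p for p in peaks if p[1] > 0.1 * base_peak]
    reduced = sorted(strong, key=lambda p: p[1], reverse=True)[:10]
    digits = scale_bins(wrapped_histogram(reduced, 5.0), base=3)
    return to_base36(int("".join(map(str, digits)), 3), 4)


def third_block(peaks: Sequence[Peak]) -> str:
    """Full spectrum, 100 Da bins, one base 10 digit per bin."""
    digits = scale_bins(wrapped_histogram(peaks, 100.0), base=10)
    return "".join(map(str, digits))


# Toy single-peak spectrum (illustrative only, not a library record).
toy = [(195.0877, 100.0)]
print(second_block(toy), third_block(toy))
```

For this toy spectrum the single ion falls into wrapped bin 9 for the 5 Da bins and bin 1 for the 100 Da bins, so the sketch prints `0002 0900000000`, the same summary blocks as in the caffeine example.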
The fourth block (e.g., “b112e4e059e1ecf98c5f”) is a hash of the full spectrum in Secure Hash Algorithm11 SHA256 (numbers and lowercase letters only), calculated in hexadecimal notation and truncated to 20 characters. The m/z and relative intensity pairs of the full spectrum are sorted by ascending m/z and then by descending intensity. Each m/z value is multiplied by 10⁶, cast to a long (64-bit) integer, and joined with the normalized intensity as strings separated by a colon. The resulting ion pairs are then joined, delimited by a single space. A specification document and reference implementations have been created for several programming environments (Python, Scala, C++, C#, R, Ruby, and Java) under a BSD-3 license, as well as a REST interface; additional information is available at http://splash.fiehnlab.ucdavis.edu/ and all code is available on GitHub at https://github.com/berlinguyinca/spectra-hash.
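The construction of the hash block can be sketched in Python as below. The sketch follows the wording above (integer intensities on a 0 to 100 scale, m/z multiplied by 10⁶ and truncated to an integer), but the exact rounding and intensity scaling are assumptions, so its output should not be expected to reproduce the SPLASH values issued by the reference implementations. The function names are hypothetical.

```python
import hashlib
from typing import Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def spectrum_string(peaks: Sequence[Peak]) -> str:
    """Build the canonical text form: 'mz:intensity' pairs joined by spaces."""
    base_peak = max(intensity for _, intensity in peaks)
    # Assumption: relative intensities are truncated to integers in 0..100.
    normalized = [(mz, int(100 * intensity / base_peak))
                  for mz, intensity in peaks]
    # Ascending m/z first, then descending intensity for equal m/z.
    ordered = sorted(normalized, key=lambda p: (p[0], -p[1]))
    # m/z is multiplied by 10^6 and cast (truncated) to an integer.
    return " ".join(f"{int(mz * 1_000_000)}:{intensity}"
                    for mz, intensity in ordered)


def hash_block(peaks: Sequence[Peak]) -> str:
    """SHA-256 of the canonical string, hex-encoded, first 20 characters."""
    text = spectrum_string(peaks)
    return hashlib.sha256(text.encode("ascii")).hexdigest()[:20]


# Toy spectrum (illustrative only), given in arbitrary input order.
toy = [(100.5, 200.0), (50.25, 50.0)]
print(spectrum_string(toy))
print(hash_block(toy))
```

Because the peaks are sorted before hashing, the same spectrum yields the same hash block regardless of the order in which its ions are stored.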
The SPLASH concept was developed and refined on a dataset of 563,902 mass spectra from MassBank1, GNPS2, HMDB3, ReSpect12, FiehnLib13 and NIST 14 (ref. 5); all but the NIST spectra (which cannot be released publicly) are available on MoNA (http://mona.fiehnlab.ucdavis.edu/). This dataset is a mix of many types of mass spectra, and the SPLASH was designed to account for this, to be easily searchable in general-purpose search engines, and to offer a unique identifier (through the hash) as well as basic prefilter and similarity functionality (through the second and third blocks).
Ensuring all these features are present in one short text string requires compromise; the SPLASH is not intended to replace more sophisticated database-specific functions, but does offer simple cross-database functionality. The second block was chosen from 136 different potential block formats as the best short, web search-compatible way to reduce the mass spectral search space. To determine the best-performing second block, we queried a subset of 19,435 spectra against the full dataset of 563,902 spectra. The second block that we selected reduced the search space by 93% or above (36,107 spectra or fewer remaining) in all cases, while returning 87% of all spectra within a similarity score of 700 (using the NIST cosine similarity score5,14) of the queried spectra. In contrast, other tested formats for this block returned more of the similar spectra (maximum 93.4%), but left too many spectra in the search space (up to 100,000, or roughly 1 in 5 spectra), so that the search space reduction was insufficient. The third block provides a visual summary (shown in Table 1 for selected compounds), a simple text-based summary and a basic similarity search that can be used in search engines or spreadsheets. More information on the most common second and third blocks, as well as the most common combinations and the approximate distribution of compounds (not all spectra are annotated with structures in the validation set), is given in Table 2.
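The prefiltering role of the second block can be sketched in a few lines of Python. Apart from the caffeine SPLASH quoted earlier, the SPLASH strings below are hypothetical placeholders, and the helper names are assumptions made for this illustration.

```python
from typing import List


def second_block_of(splash: str) -> str:
    """Return the second (reduced-spectrum summary) block of a SPLASH."""
    return splash.split("-")[1]


def prefilter(query_splash: str, library: List[str]) -> List[str]:
    """Keep only library entries that share the query's second block."""
    wanted = second_block_of(query_splash)
    return [s for s in library if second_block_of(s) == wanted]


def search_space_reduction(total: int, remaining: int) -> float:
    """Fraction of the library excluded by the prefilter."""
    return 1.0 - remaining / total


library = [
    "splash10-0002-0900000000-b112e4e059e1ecf98c5f",  # caffeine, from the text
    "splash10-0002-1900000000-00000000000000000001",  # hypothetical
    "splash10-0006-9000000000-00000000000000000002",  # hypothetical
    "splash10-0udi-0009000000-00000000000000000003",  # hypothetical
]
query = "splash10-0002-0900000000-ffffffffffffffffffff"  # hypothetical query

candidates = prefilter(query, library)
print(len(candidates), "of", len(library), "entries remain")

# Worst case reported in the text: 36,107 of 563,902 spectra remain.
print(f"{search_space_reduction(563_902, 36_107):.1%}")
```

A more expensive similarity score, such as the cosine score used in the evaluation, would then only need to be computed for the remaining candidates.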
Table 1.
SPLASH statistics for selected compounds. Data for alanine show how derivative spectra and suspicious database entries can be detected with the third block (see bold, italic entries); the lower two rows show the variety of different spectra per compound. The combination of second and third blocks is selective, e.g. 0a41-1940000000 and 01ea-1940000000 for alanine and codeine.
| | Alanine | Caffeine | Codeine | Clarithromycin |
|---|---|---|---|---|
| InChIKey First Block | QNAYBMKLOCPYGJ | RYYVLZVUVIJVGH | OROGSEYTTFOCAN | AGOYDEPGAOXOCK |
| PubChem CID(s) | 602, 5950, 71080 | 2519 | 2828, 5284371 | 894029 |
| ChemSpider ID(s) | 582, 5735, 64234 | 2424 | 2726, 4447447, 4642640 | 10342604 |
| Monoisotopic Mass (Da) | 89.047676 | 194.080383 | 299.15213 | 747.476868 |
| Number of Spectra | 58 (10 negative) | 80 | 19 | 21 |
| Coupling (GC/LC/neither) | 6/37/15 | 14/52/14 | 0/19/0 | 0/21/0 |
| Second/Third/Fourth Blocks | 10/7/43 | 16/13/67 | 6/9/19 | 6/13/21 |
| List of Second Blocks (number) | 0006 (32); 000i (10); 014i (6); 01b9, 00kf, 000f (2); 0f79, 0a4i, 00di, 0007 (1) | 0002 (25); 000i (21); 0006 (9); 0536 (5); 052f (3); 0a4l, 05nf, 01x9, 00di, e001l, 000b (2); 01w0, 016u, 00dr, 000l, 000j (1) | 0udi (8); 0uxr (4); 0lea, 015a, 0159 (2); 0uyi (1) | 001i (6); 00di (4); 0a4j, 0a4i, 052e, 0006 (2); 0a59, 05o0, 053r (1) |
| List of Third Blocks (number); italics = derivatised spectra, bold = suspicious entries | 9000000000 (46); 0900000000 (5); 9002000000 (2); 6900000000 (2); 1940000000 (1); 1900000000 (1); 0910000000 (1) | 0900000000 (47); 1900000000 (8); 9100000000 (5); 3900000000 (5); 4900000000 (3); 2900000000 (3); 9800000000 (2); 6900000000 (2); 9500000000 (1); 9200000000 (1); 8900000000 (1); 7900000000 (1); 5900000000 (1) | 0009000000 (6); 0973000000 (2); 0920000000 (2); 0910000000 (2); 0390000000 (2); 0139000000 (2); 1952000000 (1); 1940000000 (1); 1930000000 (1) | 0000090000 (5); 9000000000 (4); 4900000000 (2); 9800000000 (1); 9300000000 (1); 9200000000 (1); 8900000000 (1); 3900020000 (1); 1900060800 (1); 1900030300 (1); 1900020500 (1); 0800070900 (1); 0000001900 (1) |
Table 2.
The number of spectra and substances (estimated by first block of the InChIKey) with the “most common” second, third and second+third SPLASH blocks, calculated on a subset of the validation dataset containing 532,675 spectra with compound information. The number of structures is an estimate; missing structure information was filled in automatically using the Chemical Translation Service (http://cts.fiehnlab.ucdavis.edu/). The place indicates how common the combination is (1 = most common, 200 = 200th most common)
| Place | 2nd Block | #Spectra | %Spec | #Structures | 3rd Block | #Spectra | %Spec | #Structures | Second+Third Block | #Spectra | %Spec | #Structures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0006 | 36323 | 6.82 | 21990 | 9000000000 | 49553 | 9.30 | 17920 | 0006-9000000000 | 6569 | 1.23 | 2930 |
| 2 | 0a4i | 33191 | 6.23 | 17529 | 0900000000 | 36724 | 6.89 | 7375 | 0a4i-9000000000 | 4771 | 0.90 | 2023 |
| 3 | 00di | 28888 | 5.42 | 14008 | 9100000000 | 19502 | 3.66 | 13435 | 001i-0900000000 | 3438 | 0.65 | 1288 |
| 4 | 014i | 28213 | 5.30 | 15278 | 9200000000 | 14988 | 2.81 | 11693 | 000i-0900000000 | 3287 | 0.62 | 1111 |
| 5 | 000i | 26792 | 5.03 | 12965 | 0090000000 | 14724 | 2.76 | 3507 | 00di-0900000000 | 3251 | 0.61 | 1173 |
| 6 | 001i | 25438 | 4.78 | 11697 | 1900000000 | 13351 | 2.51 | 8679 | 0002-9000000000 | 3020 | 0.57 | 1161 |
| 7 | 004i | 24893 | 4.67 | 11728 | 2900000000 | 13201 | 2.48 | 10196 | 014i-0900000000 | 2791 | 0.52 | 1062 |
| 8 | 0002 | 24247 | 4.55 | 12543 | 3900000000 | 13046 | 2.45 | 10737 | 0002-0900000000 | 2744 | 0.52 | 1096 |
| 9 | 0udi | 21556 | 4.05 | 10389 | 9300000000 | 12504 | 2.35 | 10380 | 001i-9000000000 | 2683 | 0.50 | 1173 |
| 10 | 03di | 19913 | 3.74 | 9748 | 4900000000 | 11438 | 2.15 | 9701 | 004i-9000000000 | 2605 | 0.49 | 929 |
| 20 | 004l | 2444 | 0.46 | 1966 | 9800000000 | 6461 | 1.21 | 5810 | 014i-9000000000 | 1855 | 0.35 | 904 |
| 30 | 0fb9 | 1843 | 0.35 | 1385 | 0390000000 | 2289 | 0.43 | 1701 | 0002-0090000000 | 1238 | 0.23 | 537 |
| 40 | 00fr | 1700 | 0.32 | 1362 | 9510000000 | 1512 | 0.28 | 1482 | 0006-0090000000 | 1024 | 0.19 | 426 |
| 50 | 00xr | 1600 | 0.30 | 1256 | 8910000000 | 1336 | 0.25 | 1306 | 000i-0009000000 | 909 | 0.17 | 380 |
| 100 | 0abc | 949 | 0.18 | 806 | 9630000000 | 585 | 0.11 | 580 | 000i-9200000000 | 541 | 0.10 | 376 |
| 200 | 0fmi | 218 | 0.04 | 195 | 9350000000 | 250 | 0.05 | 249 | 014i-9400000000 | 255 | 0.05 | 239 |
| 500 | 0ac3 | 76 | 0.01 | 76 | 9102000000 | 60 | 0.01 | 59 | 0udi-0590000000 | 94 | 0.02 | 87 |
While the mapping from object to hash should ideally be unique, hash collisions (where two totally different objects have the same hash, or fourth block of the SPLASH) may occur, depending on the hash algorithm and length of the hash string. Testing the fourth block for hash collisions on the full dataset of 563,902 spectra revealed that identical SPLASHes only arose from mass spectra containing a single ion of the same mass, where the SPLASH is identical by definition due to intensity normalization. The theoretical probability for a collision15 with any given hash is approximately 10⁻³¹ for a database containing 10⁹ spectra and is further reduced by the presence of two preceding spectral summary blocks. Thus, the SPLASH fulfills its role as a unique identifier while offering simple summary and searching functionality.
The SPLASH has already been implemented in MassBank1, MoNA1c, GNPS2, HMDB3, MetaboLights4 and mzCloud6, as well as software tools including MZmine16, MS-DIAL17, RMassBank18, BinBase19, Bioclipse20 and the Mass Spectrometry Development Kit (MSDK)21.
The format of the SPLASH allows direct access to spectra on database websites and searching using general-purpose search engines. Spectral libraries with more restrictive licenses (e.g. mzCloud and possibly NIST) could also use the SPLASH to provide summarized information about their spectra. The SPLASH makes it easier to calculate spectral overlap between libraries, to detect and remove exact duplicate spectra and to perform provenance operations. Through the second and third blocks, the SPLASH enables quick searches for similar spectra within or between libraries, using a variety of search methods. The SPLASH algorithm has been kept independent of metadata, similar to the InChIKey, because an extension to include and distinguish metadata (such as analytical conditions or chemical information) would rapidly become complex and reduce the applicability of the identifier. Instead, the SPLASH is designed to facilitate quick queries and subsequent metadata retrieval.
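Duplicate detection across libraries then reduces to grouping records by their SPLASH string, as in the following Python sketch. The three caffeine accessions and their shared SPLASH are those quoted earlier in this letter; the fourth record, the library name “ExampleLib” and the function name `find_duplicates` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (library, accession, SPLASH)


def find_duplicates(records: List[Record]) -> Dict[str, List[Tuple[str, str]]]:
    """Group records by SPLASH and keep groups with more than one entry."""
    by_splash: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for library, accession, splash in records:
        by_splash[splash].append((library, accession))
    return {s: group for s, group in by_splash.items() if len(group) > 1}


CAFFEINE = "splash10-0002-0900000000-b112e4e059e1ecf98c5f"
records = [
    ("MassBank", "PR100026", CAFFEINE),
    ("HMDB", "5464", CAFFEINE),
    ("GNPS", "CCMSLIB00000222858", CAFFEINE),
    # Hypothetical record with a placeholder hash block.
    ("ExampleLib", "X0001", "splash10-0006-9000000000-00000000000000000002"),
]

for splash, group in find_duplicates(records).items():
    print(splash, group)
```

Because the SPLASH is independent of metadata, such a grouping identifies identical spectra only; deciding which record is the original source still requires the provenance information held by each library.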
The widespread adoption of the SPLASH as a standard spectral identifier allows automated, cross-resource spectral exchange and enables enhanced searchability and data processing across mass spectrometry platforms.
Acknowledgments
GW, SSM, DP, and OF were supported by the National Institutes of Health (U24 DK097154) and the National Science Foundation (MCB 1139644); MW, PCD and NB by the National Institutes of Health (5P41GM103484) for the Center for Computational Mass Spectrometry; ES by SOLUTIONS (European Union’s Seventh Framework Programme Grant Agreement No. 603437); PCD by the European Union’s Horizon2020 program under Grant Agreement No. 634402 (METASPACE); and RFM by the Database Integration Coordination Program of the National Bioscience Database Center, Japan. European MassBank is supported by the NORMAN Association (France) and hosted by the Helmholtz Centre for Environmental Research. Discussions with anonymous parties and T. Hofstetter, as well as the reviewer feedback, are gratefully acknowledged.
Footnotes
Conflict of Interest
Pieter C. Dorrestein is on the scientific advisory board to Sirenas Marine Biosciences. Robert Mistrik derives income from mzCloud licensing.
References
- 1. Horai H, et al. J Mass Spectrom. 2010;45:703–714. doi: 10.1002/jms.1777. (a) http://www.massbank.jp [accessed 8 June 2016]; (b) http://massbank.eu/MassBank/ [accessed 8 June 2016]; (c) http://mona.fiehnlab.ucdavis.edu/ [accessed 8 June 2016]
- 2. Wang M, et al. Nat Biotech. 2016;34:828–837. doi: 10.1038/nbt.3597.
- 3. Wishart DS, et al. Nucleic Acids Res. 2013;41:D801–807. doi: 10.1093/nar/gks1065.
- 4. Haug K, et al. Nucl Acids Res. 2012:1–6. doi: 10.1093/nar/gks1004.
- 5. Stein SE, et al. NIST Mass Spectral Search Program and NIST/EPA/NIH Mass Spectral Library version 2.2. National Institute of Standards and Technology, U.S. Secretary of Commerce; USA: Jun, 2014.
- 6. mzCloud. https://www.mzcloud.org/ [accessed 8 June 2016]
- 7. Vinaixa M, et al. TrAC-Trends Anal Chem. 2016;78:23–35.
- 8. Heller SR, et al. J Chem Inf. 2013;5:7.
- 9. Heller SR, et al. J Chem Inf. 2015;7:23.
- 10. Southan C. J Chem Inf. 2013;5:10.
- 11. National Institute of Standards and Technology. Secure Hash Standard. FIPS PUB 180–4. http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf [accessed 8 June 2016]
- 12. Sawada, et al. Phytochemistry. 2012;82:38–45. doi: 10.1016/j.phytochem.2012.07.007.
- 13. Kind T, et al. Anal Chem. 2009;81(24):10038–10048. doi: 10.1021/ac9019522.
- 14. Stein SE, Scott DR. J Am Soc Mass Spectrom. 1994;5:859–866. doi: 10.1016/1044-0305(94)87009-8.
- 15. Preshing J. http://preshing.com/20110504/hash-collision-probabilities/ [accessed 8 June 2016]
- 16. Pluskal T, et al. BMC Bioinformatics. 2010;11:395. doi: 10.1186/1471-2105-11-395.
- 17. Tsugawa H, et al. Nature Methods. 2015;12:523–526. doi: 10.1038/nmeth.3393.
- 18. Stravs MA, et al. J Mass Spectrom. 2013;48(1):89–99. doi: 10.1002/jms.3131.
- 19. Skogerson K, et al. BMC Bioinformatics. 2011;12:321. doi: 10.1186/1471-2105-12-321.
- 20. Spjuth O, et al. BMC Bioinformatics. 2007;8:59. doi: 10.1186/1471-2105-8-59.
- 21. Mass Spectrometry Development Kit (MSDK). https://msdk.github.io/ [accessed 8 June 2016]
