Author manuscript; available in PMC: 2017 Jul 18.
Published in final edited form as: Nat Biotechnol. 2016 Nov 8;34(11):1099–1101. doi: 10.1038/nbt.3689

SPLASH, a hashed identifier for mass spectra

Gert Wohlgemuth 1,*, Sajjan S Mehta 1, Ramon F Mejia 2, Steffen Neumann 3, Diego Pedrosa 1, Tomáš Pluskal 4, Emma L Schymanski 5,*, Egon L Willighagen 6, Michael Wilson 7, David S Wishart 7, Masanori Arita 2,8, Pieter C Dorrestein 9,10, Nuno Bandeira 9,11,12, Mingxun Wang 11,12, Tobias Schulze 13, Reza M Salek 14, Christoph Steinbeck 14, Venkata Chandrasekhar Nainala 14, Robert Mistrik 15, Takaaki Nishioka 16, Oliver Fiehn 1,17,*
PMCID: PMC5515539  NIHMSID: NIHMS871648  PMID: 27824832

To the editor

Over the past few years, as the use of mass spectrometry (MS) has increased, multiple spectral libraries, databases and software frameworks have been created to enable sharing and searching of MS data. However, finding all the spectra that correspond to a specific compound across different databases continues to be a challenge. A spectral identifier that improves the exchange of mass spectra, as well as provenance and duplicate detection, would address these issues and enhance searchability.

MassBank1 has been the source of data for other open libraries such as the Global Natural Products Social Molecular Networking2 (GNPS) and Human Metabolome Database3 (HMDB) libraries and the MetaboLights reference layer4. In turn, HMDB and community-contributed spectra from GNPS have also been imported into MassBank of North America1c (MoNA), while GNPS searches public MS data against the above-mentioned libraries as well as the NIST spectral library5. The mzCloud6 library contains some spectra generated from the same raw data that was used to create MassBank records. As these examples show, the complexity and the cross-import of data are increasing, together with the number of mass spectra, such that these different resources can now contain identical or near-identical spectra under different accession numbers. For example, the library entries PR100026 (MassBank, MoNA), 5464 (HMDB), and CCMSLIB00000222858 (GNPS) all refer to exactly the same mass spectrum of caffeine, originally sourced from MassBank. As the different libraries focus on different compound domains7, users wishing to access mass spectra from all compounds must use several resources, some of which are not fully open access (e.g., NIST and mzCloud).

Mass spectra are highly variable, with one to potentially thousands of mass-to-charge (m/z) and intensity entries per spectrum, presenting a challenge in the design of an optimal identifier. However, other life science databases have faced a similar need. For databases with chemical structures, the InChI code and the hashed InChIKey8,9 of fixed length, which have been broadly adopted as chemical identifiers, can be easily stored in databases, compared across resources and, for InChIKeys, searched on general-purpose search engines10. A hash is a one-directional mapping between a long, potentially complex object and a typically much shorter hash string with a fixed length of characters and numbers. For chemicals, the InChIKey is much easier to search than the (generally) much longer InChI, which contains special characters. While it is not possible to obtain the original object back purely from the hash value, hash keys provide easy access to the original data within a data collection.

We designed the SPLASH (SPectraL hASH) as an unambiguous, database-independent spectrum identifier that fulfills the criteria outlined above and offers some additional functionality. Inspired by the broad applicability of the InChIKey across cheminformatics and like the InChIKey (which encodes skeleton, stereochemistry, and charge), SPLASH contains separate blocks that define different layers of information, separated by dashes. As an example, the full SPLASH of the caffeine spectrum above is “splash10-0002-0900000000-b112e4e059e1ecf98c5f”. The first block is the SPLASH identifier, the second and third are summary blocks, while the fourth is the hash block.
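As a minimal sketch of this block layout, the following Python snippet splits a SPLASH on its dashes and checks the block lengths seen in the caffeine example. The names `SplashParts` and `parse_splash` are illustrative assumptions, not part of the specification, and the reading of the two digits in the first block (measurement type, then version) follows the description of “splash10” given in the text.

```python
from typing import NamedTuple


class SplashParts(NamedTuple):
    """Hypothetical container for the dash-separated SPLASH blocks."""
    spectrum_type: int  # 1 = MS
    version: int        # specification version, starting at 0
    second_block: str   # 4-character summary of the reduced spectrum
    third_block: str    # 10-digit histogram summary
    hash_block: str     # 20 hexadecimal characters


def parse_splash(splash: str) -> SplashParts:
    blocks = splash.split("-")
    if len(blocks) != 4:
        raise ValueError("expected four dash-separated blocks")
    first, second, third, fourth = blocks
    # Assumption: the first block is the literal "splash" plus two single digits.
    if not (first.startswith("splash") and len(first) == 8 and first[6:].isdigit()):
        raise ValueError("unrecognised first block: " + first)
    if (len(second), len(third), len(fourth)) != (4, 10, 20):
        raise ValueError("unexpected block length")
    return SplashParts(int(first[6]), int(first[7]), second, third, fourth)


parts = parse_splash("splash10-0002-0900000000-b112e4e059e1ecf98c5f")
print(parts.spectrum_type, parts.version, parts.second_block, parts.third_block)
```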

To calculate a SPLASH, spectra are converted into a canonical text representation: the intensities are normalized to an integer value between 0 and 100, with m/z values given in exactly 6 decimal places. To ensure consistent handling between different software and implementations, entries with zero intensities are included, but empty (“N/A”) values are eliminated prior to creating the SPLASH. The first block (“splash10”) encodes the SPLASH identifier, starting with letters for semantic web compatibility, followed by a number representing the measurement type (1 for MS, 2 and above for other data types to be included in the future) and the SPLASH version number, starting at 0, to allow for future specification updates. Thus, splash10 is a SPLASH identifier for MS, version 0.
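The canonicalization step can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the function name `canonicalize` is hypothetical, missing values are represented as `None`, and rounding to the nearest integer is assumed because the text does not say whether intensities are rounded or truncated.

```python
from typing import Iterable, List, Optional, Tuple

RawPeak = Tuple[Optional[float], Optional[float]]  # (m/z, intensity); None = "N/A"


def canonicalize(peaks: Iterable[RawPeak]) -> List[Tuple[str, int]]:
    """Return (m/z with 6 decimals, integer intensity 0-100) pairs."""
    # Drop empty ("N/A") entries but keep genuine zero intensities.
    kept = [(mz, inten) for mz, inten in peaks if mz is not None and inten is not None]
    if not kept:
        raise ValueError("spectrum has no usable peaks")
    base_peak = max(inten for _, inten in kept)
    if base_peak <= 0:
        raise ValueError("base peak intensity must be positive")
    canonical = []
    for mz, inten in kept:
        # Assumption: round to the nearest integer on the 0-100 scale.
        relative = round(100 * inten / base_peak)
        canonical.append((f"{mz:.6f}", relative))
    return canonical


# Hypothetical raw spectrum with one missing and one zero intensity.
print(canonicalize([(138.0662, 250.0), (195.0877, 1250.0), (110.0, None), (83.06, 0.0)]))
```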

Both the second and third blocks are spectral summaries, which serve to prefilter and restrict searches. In both blocks, intensities are summed over fixed (but different) bin sizes and wrapped over 10 bins. The wrapped (zero-based) bin index for a given ion is computed as floor(m/z ÷ BinSize) modulo 10. This wrapping strategy accommodates all possible spectral mass ranges while maintaining fixed-length summary blocks. The second block (e.g., “0002” for caffeine) is formed using a reduced spectrum (the top 10 or fewer ions greater than 10% of the base peak). This reduced spectrum is summed over bins of 5 Da. Each bin is then scaled to a single-digit integral value in base 3 (0–2), and the resulting length-10 histogram is converted to a base 36 number, resulting in a 4-digit block. In the third block (e.g., “0900000000”), the intensities are summed over bins of 100 Da, and each bin is then scaled to a single base 10 digit (0–9).
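The two summary blocks can be sketched in a few lines of Python. The wrapped bin index, the bin sizes, the reduced spectrum and the base conversion follow the description above; two details are assumptions made for this sketch: each bin is scaled relative to the fullest bin and truncated to an integer, and the first bin is treated as the most significant base-3 digit. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def wrapped_histogram(peaks: Sequence[Peak], bin_size: float, base: int, length: int = 10) -> str:
    bins = [0.0] * length
    for mz, intensity in peaks:
        # Wrapped, zero-based bin index: floor(m/z / BinSize) modulo 10.
        bins[math.floor(mz / bin_size) % length] += intensity
    largest = max(bins)
    if largest <= 0:
        raise ValueError("spectrum has no positive intensity")
    # Assumption: scale to 0..base-1 relative to the fullest bin, then truncate.
    return "".join(DIGITS[int((base - 1) * b / largest)] for b in bins)


def to_base36(value: int, width: int) -> str:
    out = ""
    while value:
        value, remainder = divmod(value, 36)
        out = DIGITS[remainder] + out
    return out.rjust(width, "0")


def second_block(peaks: Sequence[Peak]) -> str:
    base_peak = max(intensity for _, intensity in peaks)
    # Reduced spectrum: ions above 10% of the base peak, at most the 10 most intense.
    above = [p for p in peaks if p[1] > 0.1 * base_peak]
    reduced = sorted(above, key=lambda p: p[1], reverse=True)[:10]
    histogram = wrapped_histogram(reduced, bin_size=5, base=3)
    # Assumption: the first bin is the most significant base-3 digit.
    return to_base36(int(histogram, 3), width=4)


def third_block(peaks: Sequence[Peak]) -> str:
    return wrapped_histogram(peaks, bin_size=100, base=10)


# Hypothetical two-ion spectrum, not a library record.
spectrum = [(138.0662, 20.0), (195.0877, 100.0)]
print(second_block(spectrum))  # 0002
print(third_block(spectrum))   # 0900000000
```

For this made-up spectrum the dominant ion near m/z 195 falls into wrapped bin 9 of the 5 Da histogram and bin 1 of the 100 Da histogram, which gives the same summary blocks as those quoted for caffeine.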

The fourth block (e.g., “b112e4e059e1ecf98c5f”) is a hash of the full spectrum in Secure Hash Algorithm11 SHA256 (numbers and lowercase letters only), calculated in hexadecimal notation and truncated to 20 characters. The m/z and relative intensity pairs of the full spectrum are sorted by ascending m/z and then by descending intensity. Each m/z value is multiplied by 10^6, cast to a long (64-bit) integer, and joined with the normalized intensity as strings separated by a colon. The resulting ion pairs are then joined, delimited by a single space. A specification document and reference implementations have been created for several programming environments (Python, Scala, C++, C#, R, Ruby, and Java) under a BSD-3 license, together with a REST interface; additional information is available at http://splash.fiehnlab.ucdavis.edu/ and all code is available on GitHub at https://github.com/berlinguyinca/spectra-hash.
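A hedged sketch of the hash block, using Python's standard `hashlib`, is given below. It assumes the intensities have already been normalized to integers between 0 and 100, and it adds a small epsilon before the integer cast to guard against floating-point error; that epsilon and the function names are assumptions of this sketch, so its output should not be expected to match the reference implementations for real library spectra.

```python
import hashlib
from typing import Iterable, Tuple

Peak = Tuple[float, int]  # (m/z, normalized integer intensity)


def spectrum_string(peaks: Iterable[Peak]) -> str:
    """Build the text that is hashed: 'mz:intensity' pairs joined by spaces."""
    # Sort by ascending m/z, then by descending intensity.
    ordered = sorted(peaks, key=lambda p: (p[0], -p[1]))
    # m/z is multiplied by 10^6 and cast to an integer.
    # Assumption: a small epsilon protects the cast from float rounding.
    return " ".join(f"{int(mz * 1_000_000 + 1e-7)}:{inten}" for mz, inten in ordered)


def hash_block(peaks: Iterable[Peak]) -> str:
    digest = hashlib.sha256(spectrum_string(peaks).encode("ascii")).hexdigest()
    return digest[:20]  # lowercase hexadecimal, truncated to 20 characters


# Hypothetical two-ion spectrum; peak order in the input does not matter.
spectrum = [(195.0877, 100), (138.0662, 20)]
print(spectrum_string(spectrum))
print(hash_block(spectrum))
```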

The SPLASH concept was developed and refined on a dataset of 563,902 mass spectra from MassBank2, GNPS3, HMDB4, ReSpect12, FiehnLib13 and NIST 146; all but the NIST spectra (which cannot be released publicly) are available on MoNA (http://mona.fiehnlab.ucdavis.edu/). This dataset is a mix of many types of mass spectra, and the SPLASH was designed to account for this while also being easily searchable in general-purpose search engines and offering a unique identifier (through the hash) as well as basic prefilter and similarity functionality (through the second and third blocks).

Ensuring all these features are present in one short text string requires compromise; the SPLASH is not intended to replace more sophisticated database-specific functions, but does offer simple cross-database functionality. The second block was chosen from 136 different potential block formats as the best short, web search-compatible way to reduce the mass spectral search space. In order to determine the best performing second block, we queried a subset of 19,435 spectra against the full 563,902 dataset. The second block that we selected for use reduced the search space by 94% or above (36,107 spectra or less) in all cases, while returning 87% of all spectra within a similarity score of 700 (using the NIST cosine similarity score6,14) of the queried spectra. In contrast, other tested formats for this block returned more spectra (maximum 93.4%), but too many spectra (up to 100,000 or 1 in 5 spectra) remained in the search space so that the search space reduction was insufficient. The third block provides a visual summary (shown in Table 1 for selected compounds) and a simple text-based summary and basic similarity search that can be used in search engines or spreadsheets. More information on the most common second and third blocks, as well as the most common combinations and the approximate distribution of compounds (not all spectra are annotated with structures in the validation set) is given in Table 2.
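The prefiltering role of the second block can be illustrated with a small index keyed on that block. This is only a sketch: the function names are hypothetical, and apart from the caffeine SPLASH quoted in the text the entries use placeholder hash blocks invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_by_second_block(splashes: Iterable[str]) -> Dict[str, List[str]]:
    """Group SPLASH strings by their second (reduced-spectrum) block."""
    index = defaultdict(list)
    for splash in splashes:
        index[splash.split("-")[1]].append(splash)
    return dict(index)


def candidates(index: Dict[str, List[str]], query_splash: str) -> List[str]:
    """Return only library entries that share the query's second block."""
    return index.get(query_splash.split("-")[1], [])


LIBRARY = [
    "splash10-0002-0900000000-b112e4e059e1ecf98c5f",  # caffeine SPLASH from the text
    "splash10-0002-9000000000-00000000000000000001",  # placeholder entry
    "splash10-0006-9000000000-00000000000000000002",  # placeholder entry
]
QUERY = "splash10-0002-0900000000-ffffffffffffffffffff"  # placeholder query

index = index_by_second_block(LIBRARY)
print(candidates(index, QUERY))
```

A full similarity score would then be computed only for the returned candidates. As the figures above indicate, such a prefilter trades recall for speed, since spectra that are similar to the query but carry a different second block are not returned.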

Table 1.

SPLASH statistics for selected compounds. Data for alanine show how derivative spectra and suspicious database entries can be detected with the third block (see bold and italic entries); the lists of second and third blocks show the variety of different spectra per compound. The combination of second and third blocks is selective, e.g., 0a41-1940000000 and 01ea-1940000000 for alanine and codeine.

| | Alanine | Caffeine | Codeine | Clarithromycin |
| --- | --- | --- | --- | --- |
| InChIKey First Block | QNAYBMKLOCPYGJ | RYYVLZVUVIJVGH | OROGSEYTTFOCAN | AGOYDEPGAOXOCK |
| PubChem CID(s) | 602, 5950, 71080 | 2519 | 2828, 5284371 | 894029 |
| ChemSpider ID(s) | 582, 5735, 64234 | 2424 | 2726, 4447447, 4642640 | 10342604 |
| Monoisotopic Mass (Da) | 89.047676 | 194.080383 | 299.15213 | 747.476868 |
| Number of Spectra | 58 (10 negative) | 80 | 19 | 21 |
| Coupling (GC/LC/neither) | 6/37/15 | 14/52/14 | 0/19/0 | 0/21/0 |
| Second/Third/Fourth Blocks | 10/7/43 | 16/13/67 | 6/9/19 | 6/13/21 |

List of Second Blocks (number):
Alanine: 0006 (32); 000i (10); 014i (6); 01b9, 00kf, 000f (2); 0f79, 0a4i, 00di, 0007 (1)
Caffeine: 0002 (25); 000i (21); 0006 (9); 0536 (5); 052f (3); 0a4l, 05nf, 01x9, 00di, e001l, 000b (2); 01w0, 016u, 00dr, 000l, 000j (1)
Codeine: 0udi (8); 0uxr (4); 0lea, 015a, 0159 (2); 0uyi (1)
Clarithromycin: 001i (6); 00di (4); 0a4j, 0a4i, 052e, 0006 (2); 0a59, 05o0, 053r (1)

List of Third Blocks (number):
Alanine: 9000000000 (46); 0900000000 (5); 9002000000 (2); 6900000000 (2); 1940000000 (1); 1900000000 (1); 0910000000 (1)
Caffeine: 0900000000 (47); 1900000000 (8); 9100000000 (5); 3900000000 (5); 4900000000 (3); 2900000000 (3); 9800000000 (2); 6900000000 (2); 9500000000 (1); 9200000000 (1); 8900000000 (1); 7900000000 (1); 5900000000 (1)
Codeine: 0009000000 (6); 0973000000 (2); 0920000000 (2); 0910000000 (2); 0390000000 (2); 0139000000 (2); 1952000000 (1); 1940000000 (1); 1930000000 (1)
Clarithromycin: 0000090000 (5); 9000000000 (4); 4900000000 (2); 9800000000 (1); 9300000000 (1); 9200000000 (1); 8900000000 (1); 3900020000 (1); 1900060800 (1); 1900030300 (1); 1900020500 (1); 0800070900 (1); 0000001900 (1)

italics = derivatised spectra; bold = suspicious entries

Table 2.

The number of spectra and substances (estimated by first block of the InChIKey) with the “most common” second, third and second+third SPLASH blocks, calculated on a subset of the validation dataset containing 532,675 spectra with compound information. The number of structures is an estimate; missing structure information was filled in automatically using the Chemical Translation Service (http://cts.fiehnlab.ucdavis.edu/). The place indicates how common the combination is (1 = most common, 200 = 200th most common)

| Place | 2nd Block | #Spectra | %Spec | #Structures | 3rd Block | #Spectra | %Spec | #Structures | Second+Third Block | #Spectra | %Spec | #Structures |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0006 | 36323 | 6.82 | 21990 | 9000000000 | 49553 | 9.30 | 17920 | 0006-9000000000 | 6569 | 1.23 | 2930 |
| 2 | 0a4i | 33191 | 6.23 | 17529 | 0900000000 | 36724 | 6.89 | 7375 | 0a4i-9000000000 | 4771 | 0.90 | 2023 |
| 3 | 00di | 28888 | 5.42 | 14008 | 9100000000 | 19502 | 3.66 | 13435 | 001i-0900000000 | 3438 | 0.65 | 1288 |
| 4 | 014i | 28213 | 5.30 | 15278 | 9200000000 | 14988 | 2.81 | 11693 | 000i-0900000000 | 3287 | 0.62 | 1111 |
| 5 | 000i | 26792 | 5.03 | 12965 | 0090000000 | 14724 | 2.76 | 3507 | 00di-0900000000 | 3251 | 0.61 | 1173 |
| 6 | 001i | 25438 | 4.78 | 11697 | 1900000000 | 13351 | 2.51 | 8679 | 0002-9000000000 | 3020 | 0.57 | 1161 |
| 7 | 004i | 24893 | 4.67 | 11728 | 2900000000 | 13201 | 2.48 | 10196 | 014i-0900000000 | 2791 | 0.52 | 1062 |
| 8 | 0002 | 24247 | 4.55 | 12543 | 3900000000 | 13046 | 2.45 | 10737 | 0002-0900000000 | 2744 | 0.52 | 1096 |
| 9 | 0udi | 21556 | 4.05 | 10389 | 9300000000 | 12504 | 2.35 | 10380 | 001i-9000000000 | 2683 | 0.50 | 1173 |
| 10 | 03di | 19913 | 3.74 | 9748 | 4900000000 | 11438 | 2.15 | 9701 | 004i-9000000000 | 2605 | 0.49 | 929 |
| 20 | 004l | 2444 | 0.46 | 1966 | 9800000000 | 6461 | 1.21 | 5810 | 014i-9000000000 | 1855 | 0.35 | 904 |
| 30 | 0fb9 | 1843 | 0.35 | 1385 | 0390000000 | 2289 | 0.43 | 1701 | 0002-0090000000 | 1238 | 0.23 | 537 |
| 40 | 00fr | 1700 | 0.32 | 1362 | 9510000000 | 1512 | 0.28 | 1482 | 0006-0090000000 | 1024 | 0.19 | 426 |
| 50 | 00xr | 1600 | 0.30 | 1256 | 8910000000 | 1336 | 0.25 | 1306 | 000i-0009000000 | 909 | 0.17 | 380 |
| 100 | 0abc | 949 | 0.18 | 806 | 9630000000 | 585 | 0.11 | 580 | 000i-9200000000 | 541 | 0.10 | 376 |
| 200 | 0fmi | 218 | 0.04 | 195 | 9350000000 | 250 | 0.05 | 249 | 014i-9400000000 | 255 | 0.05 | 239 |
| 500 | 0ac3 | 76 | 0.01 | 76 | 9102000000 | 60 | 0.01 | 59 | 0udi-0590000000 | 94 | 0.02 | 87 |

While the mapping from object to hash should ideally be unique, hash collisions (where two totally different objects have the same hash, or fourth block of the SPLASH) may occur, depending on the hash algorithm and length of the hash string. Testing the fourth block for hash collisions on the full dataset of 53.25 million spectra revealed that identical SPLASHes only arose from mass spectra containing a single ion of the same mass, where the SPLASH is identical by definition due to intensity normalization. The theoretical probability for a collision15 with any given hash is approximately 10^−31 for a database containing 10^9 spectra and is further reduced by the presence of two preceding spectral summary blocks. Thus, the SPLASH fulfills its role as a unique identifier while offering simple summary and searching functionality.

The SPLASH has already been implemented in MassBank2, MoNA2c, GNPS3, HMDB4, MetaboLights5 and mzCloud6, as well as software tools including MZmine16, MS-DIAL17, RMassBank18, BinBase19, Bioclipse20 and the Mass Spectrometry Development Kit (MSDK)21.

The format of the SPLASH allows direct access to spectra on database websites and searching using general-purpose search engines. Spectral libraries with more restrictive licenses (e.g., mzCloud and possibly NIST) could also use the SPLASH to provide summarized information about their spectra. SPLASH makes it easier to calculate spectral overlap between libraries, to detect and remove exact duplicate spectra, and to perform provenance operations. Through the second and third blocks, SPLASH enables quick searches for similar spectra within or between libraries, using a variety of search methods. The SPLASH algorithm has been kept independent of metadata, similar to the InChIKey, because an extension to include and distinguish metadata (such as analytical conditions or chemical information) would rapidly become complex and reduce the applicability of the identifier. Instead, the SPLASH is designed to facilitate quick queries and subsequent metadata retrieval.
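Exact-duplicate detection across libraries reduces to comparing SPLASH strings, as the following sketch suggests. The caffeine accession numbers and SPLASH are those quoted earlier in the text; the remaining record, the dictionary layout and the function name `shared_spectra` are assumptions made for illustration.

```python
from typing import Dict

CAFFEINE = "splash10-0002-0900000000-b112e4e059e1ecf98c5f"

# Hypothetical layout: library name -> {accession: SPLASH}.
LIBRARIES: Dict[str, Dict[str, str]] = {
    "MassBank": {"PR100026": CAFFEINE},
    "HMDB": {"5464": CAFFEINE},
    "GNPS": {
        "CCMSLIB00000222858": CAFFEINE,
        # Placeholder record, invented for this example.
        "EXAMPLE-0001": "splash10-0006-9000000000-00000000000000000001",
    },
}


def shared_spectra(libraries: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Map each SPLASH found in more than one library to {library: accession}."""
    seen: Dict[str, Dict[str, str]] = {}
    for library, records in libraries.items():
        for accession, splash in records.items():
            seen.setdefault(splash, {})[library] = accession
    return {splash: where for splash, where in seen.items() if len(where) > 1}


for splash, where in shared_spectra(LIBRARIES).items():
    print(splash, where)
```

Because the comparison uses only the identifier, it works without access to the peak lists themselves, which is what allows libraries with restrictive licenses to take part in overlap calculations.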

The widespread adoption of the SPLASH as a standard spectral identifier allows automated, cross-resource spectral exchange and enables enhanced searchability and data processing across mass spectrometry platforms.

Acknowledgments

GW, SSM, DP, and OF were supported by National Institutes of Health U24 DK097154 and National Science Foundation MCB 1139644; MW, PCD and NB by the National Institutes of Health 5P41GM103484 for the Center for Computational Mass Spectrometry; ES by SOLUTIONS (European Union’s Seventh Framework Programme Grant Agreement No. 603437); PCD by the European Union’s Horizon2020 program under Grant Agreement No. 634402 (METASPACE); RFM was funded by the Database Integration Coordination Program of the National Bioscience Database Center, Japan. European MassBank is supported by the NORMAN Association (France) and hosted by the Helmholtz Centre for Environmental Research. Discussions with anonymous parties, T. Hofstetter and the reviewer feedback are gratefully acknowledged.

Footnotes

Conflict of Interest

Pieter C. Dorrestein is on the scientific advisory board to Sirenas Marine Biosciences. Robert Mistrik derives income from mzCloud licensing.

References
