2022 Dec 5;18(12):103. doi: 10.1007/s11306-022-01963-y

Fig. 2.

Illustration of reference library imbalances with respect to chemical classes, instrument types, and annotation rates by precursor mass. These factors may affect the quality and representativeness of machine learning training datasets. (a) ClassyFire classes of all 24,101 unique structures from the positive ionisation mode MS/MS spectra in GNPS. Chemical compound classes were assigned using ClassyFire superclasses (Djoumbou Feunang et al., 2016). For simplicity, classes are numbered from most to least frequent. (b) Instrument types for the 314,318 positive ionisation mode spectra in GNPS. Instrument type names were simplified to the ones shown in the figure. (c) Parent mass distributions of the 314,318 positive ionisation mode spectra in GNPS, the 13,908 positive ionisation mode spectra in GNPS that had no annotated SMILES, and the 9129 spectra in the dataset used by Crüsemann et al. (2015). Matchms was used to process the mgf files in the same way as in MS2DeepScore; here, only MS/MS spectra with at least one fragment peak and a parent mass were considered.
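The selection rule and the parent mass comparison described for the parent mass panel can be illustrated with a minimal sketch. It deliberately uses plain Python rather than matchms, so it is not the processing pipeline used for the figure. The record layout (dictionaries with `parent_mass`, `peaks`, and `smiles` keys), the helper names `keep_spectrum` and `parent_mass_histogram`, the 100 Da bin width, and the toy values are all assumptions made for illustration.

```python
from collections import Counter


def keep_spectrum(spectrum):
    """Caption's rule: at least one fragment peak and a known parent mass."""
    has_peak = len(spectrum.get("peaks", [])) >= 1
    has_mass = spectrum.get("parent_mass") is not None
    return has_peak and has_mass


def parent_mass_histogram(spectra, bin_width=100.0):
    """Count retained spectra per parent-mass bin (keys are lower bin edges)."""
    counts = Counter()
    for s in spectra:
        if keep_spectrum(s):
            lower_edge = int(s["parent_mass"] // bin_width) * bin_width
            counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Toy records (hypothetical values, not taken from GNPS).
spectra = [
    {"parent_mass": 180.06, "peaks": [(89.02, 1.0)], "smiles": "OCC1OC(O)C(O)C(O)C1O"},
    {"parent_mass": 455.29, "peaks": [(165.05, 0.4), (303.10, 1.0)], "smiles": None},
    {"parent_mass": None, "peaks": [(77.04, 1.0)], "smiles": None},  # no parent mass
    {"parent_mass": 612.30, "peaks": [], "smiles": None},            # no fragment peaks
]

# Distribution over all retained spectra versus those without a SMILES annotation.
all_kept = parent_mass_histogram(spectra)
unannotated = parent_mass_histogram([s for s in spectra if not s["smiles"]])
```

Comparing the two histograms bin by bin gives a rough view of how the share of unannotated spectra changes with parent mass, which is the kind of imbalance the panel is meant to show.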