Abstract
Summary
Python is the most commonly used language for deep learning (DL). Existing Python packages for mass spectrometry imaging (MSI) data are not optimized for DL tasks. We, therefore, introduce pyM2aia, a Python package for MSI data analysis with a focus on memory-efficient handling, processing and convenient data-access for DL applications. pyM2aia provides interfaces to its parent application M2aia, which offers interactive capabilities for exploring and annotating MSI data in imzML format. pyM2aia utilizes the image input and output routines, data formats, and processing functions of M2aia, ensures data interchangeability, and enables the writing of readable and easy-to-maintain DL pipelines by providing batch generators for typical MSI data access strategies. We showcase the package in several examples, including imzML metadata parsing, signal processing, ion-image generation, and, in particular, DL model training and inference for spectrum-wise approaches, ion-image-based approaches, and approaches that use spectral and spatial information simultaneously.
Availability and implementation
The Python package, code and examples are available at https://m2aia.github.io/m2aia.
1 Introduction
Feeding large amounts of data to deep neural networks is essential for training, but it can also become a bottleneck in the training process. This is particularly the case in mass spectrometry imaging (MSI), a technique for label-free imaging of the spatial distribution of hundreds of molecules in tissue sections. MSI datasets are large (up to tens of gigabytes) and contain high-dimensional spectral information (thousands of m/z bins) per pixel.
Different strategies for handling MSI data (Fig. 1) can be distinguished: (i) spectral strategies, in which the spectral information is used but the spatial relationships between spectra are lost (e.g. spectrum-wise peak picking or classification); (ii) spatial strategies, in which spatial properties of a molecular distribution are addressed but intra-spectral relationships are not taken into account (e.g. ion-image-based clustering or segmentation); and (iii) spatio-spectral strategies, which use spatial and spectral information simultaneously but are computationally highly demanding and still rare. For each strategy, references to example applications can be found in Supplementary Appendix S3.
Figure 1.
Strategies for processing MSI datasets in DL tasks. Cubes represent hyperspectral data with sample selection parameters (red) x and y for the spatial dimensions and m/z for the spectral dimension.
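As a rough illustration of the three strategies, the following NumPy sketch slices a small synthetic data cube in the three ways described above. The dense in-memory cube and all variable names are assumptions made for illustration only; they are not part of pyM2aia, and real imzML data is stored spectrum by spectrum rather than as a dense array.

```python
import numpy as np

# Synthetic stand-in for an MSI dataset: H x W pixels with C m/z bins each.
# A dense cube is assumed here purely to show the three access patterns.
rng = np.random.default_rng(0)
H, W, C = 6, 8, 50
cube = rng.random((H, W, C), dtype=np.float32)

# (i) Spectral strategy: one full spectrum, no spatial context.
spectrum = cube[2, 3, :]        # shape (C,)

# (ii) Spatial strategy: one ion image, i.e. a single m/z bin over all pixels.
ion_image = cube[:, :, 10]      # shape (H, W)

# (iii) Spatio-spectral strategy: a small spatial patch with all m/z bins.
patch = cube[1:4, 2:5, :]       # shape (3, 3, C)
```

The memory cost grows from (i) to (iii): a single sample holds C values for a spectrum, H x W values for an ion image, and patch height x patch width x C values for a spatio-spectral patch.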
The comparatively slow progress of deep learning (DL) for MSI data may be attributed to the lack of benchmark datasets (Alexandrov 2020), inconsistent data quality as a consequence of batch effects (Balluff et al. 2021), a diverse landscape of MSI signal processing strategies, issues related to the curse of dimensionality, and the lack of explainability and interpretability of DL models (black box). Additionally, most DL code is written in Python, whereas most MSI data science packages are written in R.
We hypothesize that the development of DL applications can best be supported by a tandem approach where interactive processing and visualization of MSI data is applied in combination with scripting for DL. Here, we introduce pyM2aia (biotools: pym2aia), a Python package for accessing MSI data with a focus on supporting the development of DL applications, complementing the interactive application M2aia (Cordes et al. 2021, biotools: m2aia).
2 Features
To support the development of DL applications for high-dimensional data from MSI acquisitions, pyM2aia aims to provide (i) memory- and computationally efficient loading of imzML datasets to make spectra or ion-images available as quickly as possible for (GPU-based) training and inference of DL models, and to facilitate the creation of (ii) readable and easy-to-maintain DL pipelines and (iii) solutions that rely on a common code-base for interactive exploration and scripting to enable consistent data views and processing.
pyM2aia complements the open-source desktop application M2aia (Cordes et al. 2021), which offers interactive visualization and image processing utilities (see Supplementary Table S1) for continuous profile/centroid imzML datasets. pyM2aia wraps the highly optimized input and processing methods (implemented in C++) of M2aia, enabling consistent views of MSI data in both systems. M2aia’s signal processing methods are applied directly after spectra are read from disk, so no intermediate data need to be held in memory, which substantially increases the number of datasets that can be accessed simultaneously. The application programming interface (API) of pyM2aia supports high-level data handling for the implementation of DL pipelines realizing the different processing strategies defined in the introduction (Fig. 1). pyM2aia provides data generators for the generation of batches, enables the invocation of arbitrary data augmentation functions and allows cyclic passes over single as well as multiple MSI datasets. The API is compatible with common DL libraries like TensorFlow/Keras and PyTorch.
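The idea of a batch generator that cycles over several datasets and applies an augmentation function can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions, not pyM2aia's API: the function names `cyclic_spectrum_batches` and `scale_intensity` are hypothetical, and the datasets are assumed to be in-memory arrays that share one m/z axis.

```python
import numpy as np

def cyclic_spectrum_batches(datasets, batch_size, augment=None, seed=0):
    """Yield batches of spectra indefinitely, reshuffling after each pass.

    datasets: list of arrays shaped (n_spectra_i, C) with a shared m/z axis.
    augment:  optional function applied to each batch of shape (B, C).
    """
    rng = np.random.default_rng(seed)
    # Index every spectrum as (dataset id, spectrum id) so that spectra from
    # several datasets can be mixed within a single batch.
    index = [(d, i) for d, data in enumerate(datasets) for i in range(len(data))]
    while True:
        order = rng.permutation(len(index))
        # Incomplete trailing batches are dropped to keep the batch shape fixed.
        for start in range(0, len(order) - batch_size + 1, batch_size):
            picks = [index[j] for j in order[start:start + batch_size]]
            batch = np.stack([datasets[d][i] for d, i in picks])
            yield augment(batch) if augment is not None else batch

def scale_intensity(batch):
    # Hypothetical augmentation: rescale all intensities by a fixed factor.
    return batch * 1.1

# Two synthetic "datasets" with 5 and 7 spectra of 4 m/z bins each.
a = np.ones((5, 4), dtype=np.float32)
b = np.zeros((7, 4), dtype=np.float32)

gen = cyclic_spectrum_batches([a, b], batch_size=3)
batches = [next(gen) for _ in range(10)]   # spans more than one pass

aug_gen = cyclic_spectrum_batches([a, b], batch_size=3, augment=scale_intensity)
aug_batch = next(aug_gen)
```

Because the generator never terminates, the training loop decides how many batches make up an epoch.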
3 Results
Example applications of increasing complexity were realized to showcase the capabilities of pyM2aia. They use openly available MSI datasets published by Geier et al. (2021). Code/data availability statements and further details on the examples can be found in the Supplementary material.
The first examples show how imzML data is handled with pyM2aia. In example I we demonstrate how pyM2aia can be used to retrieve imzML meta data. Example II illustrates the application of pyM2aia’s signal processing methods (Supplementary Fig. S1). Example III explains the generation of ion-images and how to overlay multiple ion-images to show co-localization of ions (Supplementary Fig. S2).
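Ion-image generation and overlay can be illustrated with a short NumPy sketch. The helper names `ion_image` and `overlay`, the summation over an m/z window, and the red/green channel assignment are assumptions chosen for illustration; they do not reproduce pyM2aia's own routines.

```python
import numpy as np

def ion_image(intensities, mz, coords, shape, center, tol):
    """Sum intensities within [center - tol, center + tol] per pixel.

    intensities: (N, C) spectra, mz: (C,) shared m/z axis,
    coords: (N, 2) integer pixel positions as (x, y), shape: (rows, cols).
    """
    window = (mz >= center - tol) & (mz <= center + tol)
    values = intensities[:, window].sum(axis=1)
    image = np.zeros(shape, dtype=np.float32)
    image[coords[:, 1], coords[:, 0]] = values
    return image

def overlay(red, green):
    """Scale two ion images to [0, 1] and merge them into one RGB image."""
    def scale(img):
        peak = img.max()
        return img / peak if peak > 0 else img
    rgb = np.zeros(red.shape + (3,), dtype=np.float32)
    rgb[..., 0] = scale(red)
    rgb[..., 1] = scale(green)
    return rgb

# Synthetic dataset: 12 pixels on a 3 x 4 grid, 101 m/z bins from 100 to 200.
mz = np.linspace(100.0, 200.0, 101)
coords = np.array([[x, y] for y in range(3) for x in range(4)])
intensities = np.zeros((12, 101), dtype=np.float32)
intensities[:, 50] = np.arange(12)   # ion at m/z 150, intensity grows per pixel
intensities[:, 80] = 1.0             # ion at m/z 180, uniform intensity

img_a = ion_image(intensities, mz, coords, (3, 4), center=150.0, tol=0.25)
img_b = ion_image(intensities, mz, coords, (3, 4), center=180.0, tol=0.25)
rgb = overlay(img_a, img_b)
```

Pixels where both ions are abundant appear yellow in such an overlay, which makes co-localization visible at a glance.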
The next examples demonstrate DL applications using pyM2aia. Example IV showcases a spectral strategy that uses pyM2aia to feed spectra to adapted versions of an autoencoder model for peak learning proposed by Abdelmoula et al. (2021). The original approach is available as a TensorFlow implementation that loads datasets in hl5 format. We replaced the hl5 input by pyM2aia’s imzML reader. To stabilize the training on the example dataset by Geier et al. (2021), we replaced the originally used categorical cross-entropy loss by a mean-squared error loss and removed the sigmoid activation function of the output layer. The model accepts as input a tensor of the form [B, C, H(=1), W(=1)] (with B: batch size; C: channel size, i.e. spectral depth; H/W: height/width of patch). The core demonstration of this example is the usage of pyM2aia’s spectrum batch generators (Supplementary Fig. S3). We show how to train models for each imzML image individually (similar to the original publication) and how pyM2aia makes it easy to process multiple images simultaneously (combined) to create a single model for all inputs.
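The input layout and loss used in this example can be made concrete with a small NumPy sketch. The tiny untrained encoder/decoder below is a placeholder assumed for illustration and is not the architecture of Abdelmoula et al. (2021); it only shows spectra arranged as [B, C, 1, 1], an output layer without sigmoid, and a mean-squared error between input and reconstruction.

```python
import numpy as np

def to_model_input(spectra):
    """Reshape spectra of shape (B, C) to the layout [B, C, H=1, W=1]."""
    return spectra[:, :, None, None]

def mse_loss(x, x_hat):
    """Mean-squared error over all elements."""
    return float(np.mean((x - x_hat) ** 2))

rng = np.random.default_rng(0)
B, C, latent = 8, 100, 5

x = to_model_input(rng.random((B, C)))
flat = x.reshape(B, C)

# Placeholder weights (untrained); a real model would learn these.
w_enc = rng.normal(scale=0.1, size=(C, latent))
w_dec = rng.normal(scale=0.1, size=(latent, C))

# Linear output layer: without a sigmoid, reconstructed intensities are not
# confined to the interval (0, 1).
x_hat = (np.tanh(flat @ w_enc) @ w_dec).reshape(x.shape)

loss = mse_loss(x, x_hat)
```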
A spatial strategy is demonstrated in example V by adapting the PyTorch implementation of an ion-image clustering approach proposed by Hu et al. (2022). The approach uses a pretrained EfficientNet model to embed ion-images in a latent space. The model is then fine-tuned using contrastive learning (SimCLR). SimCLR heavily relies on data augmentations, which are defined by adding augmentation functions to pyM2aia’s ion-image batch generator. For processing the dataset, we reduced the number of input channels of the EfficientNet from three (RGB) to one (gray-scale) and adapted the augmentation methods to accept single-channel inputs of the form [B, C(=1), H, W]. The core demonstration of this example is to train the approach described above with pyM2aia’s ion-image batch generator and show how augmentations are incorporated (Supplementary Fig. S4).
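Contrastive learning needs two differently augmented views of each ion image. The sketch below shows one way to chain augmentation functions on single-channel batches of the form [B, 1, H, W]. The two augmentations and the helper `two_views` are hypothetical examples; they are not the augmentations used by Hu et al. (2022), nor pyM2aia functions.

```python
import numpy as np

def random_flip(batch, rng):
    """Flip each image horizontally with probability 0.5."""
    out = batch.copy()
    flip = rng.random(len(batch)) < 0.5
    out[flip] = out[flip][..., ::-1]
    return out

def random_intensity_scale(batch, rng, low=0.8, high=1.2):
    """Multiply each image by its own random factor in [low, high)."""
    factors = rng.uniform(low, high, size=(len(batch), 1, 1, 1))
    return (batch * factors).astype(batch.dtype)

def two_views(batch, augmentations, rng):
    """Run the augmentation chain twice to obtain a positive pair per image."""
    def apply(x):
        for aug in augmentations:
            x = aug(x, rng)
        return x
    return apply(batch), apply(batch)

rng = np.random.default_rng(0)
ion_images = rng.random((4, 1, 16, 16), dtype=np.float32)   # [B, C=1, H, W]

view_a, view_b = two_views(ion_images, [random_flip, random_intensity_scale], rng)
```

Each augmentation takes and returns a full batch, so the chain stays independent of the number of channels and can be reused for single-channel inputs.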
Spatio-spectral strategies are demonstrated in example VI and example VII, showing how pyM2aia’s spectrum batch generator is used to generate spatio-spectral samples by providing additional neighboring spectra for a given sample location (Supplementary Fig. S5). The spectrum batch generator can retrieve specified spatial neighborhoods of size [H, W] and provides them as tensors of the form [B, C, H, W]. In example VI, we train and apply an unsupervised autoencoder and in example VII a supervised model for pixel-wise classification.
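How neighborhoods of size [H, W] become tensors of the form [B, C, H, W] can be sketched in NumPy as follows. The function `neighborhood_batch`, the dense input cube, and the zero padding at the image border are assumptions for illustration; the text above does not specify how pyM2aia treats border pixels.

```python
import numpy as np

def neighborhood_batch(cube, centers, size):
    """Collect spatial neighborhoods around the given pixel positions.

    cube:    (rows, cols, C) array of spectra.
    centers: list of (row, col) sample locations.
    size:    (H, W) neighborhood size, both odd.
    Returns an array of shape (B, C, H, W).
    """
    h, w = size
    ph, pw = h // 2, w // 2
    # Zero padding (an assumption) so that border pixels get full-size patches.
    padded = np.pad(cube, ((ph, ph), (pw, pw), (0, 0)))
    patches = []
    for r, c in centers:
        # In padded coordinates this window is centered on the original (r, c).
        patch = padded[r:r + h, c:c + w, :]
        patches.append(np.transpose(patch, (2, 0, 1)))   # (H, W, C) -> (C, H, W)
    return np.stack(patches)

# Synthetic cube: 5 x 6 pixels with 3 m/z bins each.
cube = np.arange(5 * 6 * 3, dtype=np.float32).reshape(5, 6, 3)
batch = neighborhood_batch(cube, centers=[(0, 0), (2, 3)], size=(3, 3))
```

The central element of each patch, `batch[b, :, H // 2, W // 2]`, is the spectrum at the requested sample location, so a size of (1, 1) reduces to the purely spectral case.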
4 Conclusion
pyM2aia gives the MSI and DL communities Python-based access to M2aia’s efficient implementations for MSI data handling and processing. Its high-level convenience methods allow DL workflows to be realized with less code, which improves readability and reduces the risk of mistakes.
Contributor Information
Jonas Cordes, Faculty of Computer Science, Mannheim University of Applied Sciences, Mannheim 68163, Germany; Medical Faculty Mannheim, Heidelberg University, Mannheim 68167, Germany.
Thomas Enzlein, Center for Mass Spectrometry and Optical Spectroscopy, Mannheim University of Applied Sciences, Mannheim 68163, Germany.
Carsten Hopf, Medical Faculty Mannheim, Heidelberg University, Mannheim 68167, Germany; Center for Mass Spectrometry and Optical Spectroscopy, Mannheim University of Applied Sciences, Mannheim 68163, Germany; Medical Faculty, Heidelberg University, Heidelberg 69120, Germany.
Ivo Wolf, Faculty of Computer Science, Mannheim University of Applied Sciences, Mannheim 68163, Germany.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflicts of interest
None declared.
Funding
This work was supported by the German Federal Ministry of Education and Research (BMBF) as part of the Innovation Partnership M2Aind, project M2Aind-DeepLearning [13FH8I08IA] within the framework FH-Impuls; and by the Carl-Zeiss-Foundation, project Digi-FIT.
References
- Abdelmoula WM, Lopez BG-C, Randall EC et al. Peak learning of mass spectrometry imaging data using artificial neural networks. Nat Commun 2021;12:5544.
- Alexandrov T. Spatial metabolomics and imaging mass spectrometry in the age of artificial intelligence. Annu Rev Biomed Data Sci 2020;3:61–87.
- Balluff B, Hopf C, Porta Siegel T et al. Batch effects in MALDI mass spectrometry imaging. J Am Soc Mass Spectrom 2021;32:628–35.
- Cordes J, Enzlein T, Marsching C et al. M2aia—interactive, fast, and memory-efficient analysis of 2D and 3D multi-modal mass spectrometry imaging data. GigaScience 2021;10:giab049.
- Geier B, Oetjen J, Ruthensteiner B et al. Connecting structure and function from organisms to molecules in small-animal symbioses through chemo-histo-tomography. Proc Natl Acad Sci U S A 2021;118:e2023773118.
- Hu H, Bindu JP, Laskin J et al. Self-supervised clustering of mass spectrometry imaging data using contrastive learning. Chem Sci 2022;13:90–8.