Abstract
Photoacoustic (PA) imaging has the potential to revolutionize functional medical imaging in healthcare due to the valuable information on tissue physiology contained in multispectral photoacoustic measurements. Clinical translation of the technology requires conversion of the high-dimensional acquired data into clinically relevant and interpretable information. In this work, we present a deep learning-based approach to semantic segmentation of multispectral photoacoustic images to facilitate image interpretability. Manually annotated photoacoustic and ultrasound imaging data are used as reference and enable the training of a deep learning-based segmentation algorithm in a supervised manner. Based on a validation study with experimentally acquired data from 16 healthy human volunteers, we show that automatic tissue segmentation can be used to create powerful analyses and visualizations of multispectral photoacoustic images. Due to the intuitive representation of high-dimensional information, such a preprocessing algorithm could be a valuable means to facilitate the clinical translation of photoacoustic imaging.
Keywords: Medical image segmentation, Deep learning, Multispectral imaging, Photoacoustics, Optoacoustics
1. Introduction
Photoacoustic (PA) imaging (PAI) is an emerging and rapidly developing imaging modality that enables real-time, non-invasive, and radiation-free measurement of optical tissue properties [1]. PAI has the potential to spatially resolve valuable morphological and functional tissue information, such as the blood oxygen saturation (sO2) [2], at depths of up to several centimeters [3]. While the recovery of accurate and reliable functional parameters from PA measurements is an ongoing field of research [4], [5], [6], providing accurate and interpretable visualizations of multispectral PA measurement data is a crucial step towards the clinical translation of PAI. One way to achieve this could be to classify tissue pixel-wise based on the multispectral PA signal, thus segmenting the tissue into disjoint regions, as illustrated in Fig. 1. These regions can then be annotated with relevant information, such as structure-specific sO2.
Fig. 1.
Overview of the proposed approach to automatic semantic image annotation. The nnU-Net and a fully-connected neural network (FCNN) automatically create semantic annotations of multiple tissue types, including skin, blood, and fat, based on multispectral photoacoustic (PA) and ultrasound (US) images.
Several groups have already worked on methods for automatic image segmentation in PAI, for example for the automatic identification of structures in small animal images [7], [8], [9], the segmentation of breast cancer [10], or for vessel segmentation both in simulation studies [11] and experimental settings [12], [13]. Furthermore, work has been conducted towards the annotation of different skin layers in raster-scanned images [14], [15], [16]. However, to our knowledge, no work has been published to date on the automatic multi-label semantic annotation of multispectral PA images in humans.
The purpose of this paper is, therefore, to address this gap in the literature. Specifically, we investigate the hypothesis that automatic multi-label PA image segmentation with neural networks is feasible. More specifically, we explore two different approaches to utilizing the multispectral data as the input of a neural network: (1) a single-pixel representation for a fully-connected neural network (FCNN) and (2) a full image representation for the nnU-Net [17]. As many commonly used PA devices capture PA and ultrasound (US) data simultaneously [18], [19], we additionally compare the performance of our method for different types of input images, namely PA images, US images and combined PA and US images which we refer to as PAUS images. Our methods are trained, validated, and tested on separate splits of a data set of forearm, calf, and neck measurements acquired from 16 healthy volunteers.
2. Materials and methods
In this section, we describe the data set we acquired and annotated for the training and validation of our method (cf. Section 2.1) and the deep learning-based methods for semantic annotation of PA data (cf. Section 2.2).
2.1. Data
The following section provides details of the acquisition (cf. Section 2.1.1) and annotation (cf. Sections 2.1.2 and 2.1.3) of the PA and US data as well as specifics of the data split (cf. Section 2.1.4).
2.1.1. Data acquisition
The data set consisted of multispectral PA and US images of 16 healthy human volunteers. For each of the 16 volunteers, the forearm, calf, and neck were imaged at three distinct locations on both the left and right side of the body (cf. Fig. 2 for the hierarchical representation of the data), yielding N = 288 unique multispectral PA and US image pairs in total. These body regions were chosen because they are easily accessible, minimally intrusive to image, and feature superficial blood vessels, which were of key interest for this study. Eighteen scans were acquired per volunteer as a trade-off between acquisition time (1 h) and number of scans. Ethics approval was obtained from the committee of the medical faculty of Heidelberg University under reference number S-451/2020 and the study was registered with the German Clinical Trials Register under reference number DRKS00023205. The experiments were carried out in accordance with the relevant guidelines and regulations of the ethics approval, such as laser safety guidelines, and informed consent was obtained from all subjects.
Fig. 2.
Visualization of the hierarchical nature of the data. For each volunteer, the neck, calf, and forearm were imaged at three distinct poses both on the left and right side of the body.
The images were acquired using the multispectral optoacoustic tomography (MSOT) Acuity Echo device (iThera Medical, Munich, Germany) with 26 wavelengths chosen equidistantly between 700 nm and 950 nm, similar to [12], which also allows the results of this study to be applied to data-driven oximetry [20]. Each location was imaged freehand and as statically as possible for approximately 30 s. The US images were reconstructed using a proprietary backprojection algorithm provided by the vendor, and the PA images were reconstructed using a custom implementation of the backprojection algorithm [21] within the Medical Imaging Interaction Toolkit (MITK) [22]. The sequences were post-processed to correct for variations in the laser pulse energy and to optimize the signal-to-noise ratio (SNR). To correct for the variations in laser pulse energy, the PA images were divided by the respective laser pulse energies. Because PA and US images were reconstructed with different algorithms and different fields of view, the PA images were cropped to enable co-registration with the US images; in particular, the skin visible in both PA and US images was used to manually co-register them. To optimize the SNR, each co-registered sequence was divided into four sections of approximately 8 s that were averaged, and the averaged section with the highest acutance, defined as the mean of the image gradients, was used.
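The following sketch summarizes this post-processing chain. The array shapes, the function name, and the computation of the acutance on the wavelength-averaged image are assumptions made for illustration; the actual implementation within MITK may differ.

```python
import numpy as np

wavelengths = np.linspace(700, 950, 26)  # 26 equidistant wavelengths in nm

def postprocess_sequence(pa_frames, pulse_energies, n_sections=4):
    """Sketch of the PA post-processing chain.

    pa_frames:      (n_frames, n_wavelengths, height, width) reconstructed PA images
    pulse_energies: (n_frames, n_wavelengths) measured laser pulse energies
    """
    # 1. Correct for laser pulse energy variations
    corrected = pa_frames / pulse_energies[:, :, None, None]

    # 2. Split the ~30 s sequence into four ~8 s sections and average each section
    sections = np.array_split(corrected, n_sections, axis=0)
    averaged = [section.mean(axis=0) for section in sections]

    # 3. Keep the averaged section with the highest acutance
    #    (mean gradient magnitude of the wavelength-averaged image)
    def acutance(image_stack):
        gy, gx = np.gradient(image_stack.mean(axis=0))
        return np.mean(np.hypot(gx, gy))

    return max(averaged, key=acutance)
```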
2.1.2. Data annotation
After image reconstruction and post-processing, similar to [12] and following the recommendations described in [23], the images were manually annotated by one of three available domain experts using a standardized annotation protocol, which can be found in Supplemental Material 1. Eight equally important annotation classes were distinguished during the annotation process: blood, skin, fat, US gel, transducer membrane, coupling agent in the transducer head (mostly heavy water), other tissue, and coupling artifact. The other tissue class was assigned to tissue below the fat layer that does not fall into the blood category and comprises, e.g., muscle or connective tissue. The coupling artifact class was introduced to account for a loss of signal at the edges of the image due to a lack of coupling between the transducer and the skin. Note that the distribution of annotation classes was unbalanced, meaning that the number of pixels assigned to each class differs.
2.1.3. Human annotation reliability
To approximate the effect of human annotation performance on our results, we performed a human annotation reliability study for the class blood, which is of particularly high clinical relevance [24]. To this end, a subset of ten test images (cf. Section 2.1.4) of one volunteer was chosen such that at least one image of every body region and body side was included, but otherwise at random. The ten images were annotated by five domain experts, and the performance of the five new annotators was assessed using the original annotations as reference for two different metrics (cf. Section 2.3). To account for both the hierarchical structure and the small amount of data, a linear mixed model [25] was applied to the per-image and per-annotator metric values. Here, we considered the body region as a fixed effect and the annotator and image id as random intercepts.
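As a minimal sketch of such a model, the snippet below fits body region as a fixed effect and annotator and image id as crossed random intercepts using the statsmodels formula API; the file name, the column names, and the use of statsmodels (rather than whatever software was actually used) are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per (image, annotator) pair with the metric value for that annotation.
df = pd.read_csv("blood_reliability_metrics.csv")  # hypothetical columns: dsc, region, annotator, image_id
df["all"] = 1  # single grouping variable so that crossed random intercepts
               # can be expressed as variance components

model = smf.mixedlm(
    "dsc ~ C(region)",                              # body region as fixed effect
    data=df,
    groups="all",
    re_formula="0",                                 # no random intercept for the dummy group
    vc_formula={"annotator": "0 + C(annotator)",    # random intercept per annotator
                "image": "0 + C(image_id)"},        # random intercept per image
)
print(model.fit().summary())
```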
2.1.4. Data split
The data of the 16 volunteers was divided into training/validation and test sets while respecting the underlying hierarchical structure: a training and validation set was assembled from the images of ten randomly selected volunteers and split for five-fold cross-validation, where each fold used the data of two randomly chosen volunteers as validation data (N = 36) and the remaining data as training data (N = 144). The test set comprised 108 images from the remaining six volunteers that were included in neither the training nor the validation set. This data split was chosen to ensure a large amount of training data while retaining a small validation set for hyper-parameter tuning, and to enable statistically more reliable conclusions on the test set with N > 5 volunteers.
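A sketch of such a volunteer-level split is shown below; which six volunteers form the test set is chosen arbitrarily here for illustration and does not reflect the actual random selection.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 288 image ids and the volunteer each image belongs to (18 scans per volunteer)
image_ids = np.arange(288)
volunteer_ids = np.repeat(np.arange(16), 18)

# Hold out six volunteers for the test set
test_volunteers = np.arange(10, 16)
test_mask = np.isin(volunteer_ids, test_volunteers)
trainval_ids, trainval_vols = image_ids[~test_mask], volunteer_ids[~test_mask]

# Five-fold cross-validation that keeps all images of a volunteer in the same fold,
# i.e. two volunteers (36 images) per validation fold and 144 training images
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(trainval_ids.reshape(-1, 1),
                                                      groups=trainval_vols)):
    print(f"fold {fold}: {len(train_idx)} training / {len(val_idx)} validation images")
```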
2.2. Segmentation methods
Owing to its breakthrough successes in various fields of research and practice, we based our segmentation method on deep learning. In this context, we investigated two well-known neural network architectures with complementary strengths: the U-Net and the FCNN. The main difference between these architectures is the representation of the input data. The U-Net takes the full image as input, with the wavelengths assigned to separate input channels, and thus leverages local spatial context through its convolutional kernels. In contrast, the input data of the FCNN is represented by single-pixel spectra, which allows for a pixel-wise classification that is completely independent of the spatial distribution of the different classes (cf. Fig. 3).
Fig. 3.
Overview of the data/network configuration. (Left) (top) ultrasound (US) data (blue), (center) multispectral photoacoustic (PA) data (orange), and (bottom) a combination thereof (PAUS) were used as (right) input sources for two neural network architectures: (A) the nnU-Net and (B) the fully-connected neural network (FCNN). Multispectral information encoded in PA data is indicated in pink. The input of the nnU-Net is a full image with one single channel for US data and multi-channel input features for PA (26 channels) and PAUS (27 channels) data. In contrast, the FCNN leverages single-pixel multi-dimensional input features for PA (26 dimensions) and PAUS (27 dimensions) data.
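The two input representations sketched in Fig. 3 can be derived from the same co-registered data, for example as follows; the random arrays are placeholders, and the image size of 256 × 128 follows Section 2.2.1.

```python
import numpy as np

# Placeholder data: a 26-wavelength PA stack and a co-registered US image of size 256 x 128
pa = np.random.rand(26, 256, 128).astype(np.float32)
us = np.random.rand(1, 256, 128).astype(np.float32)

# nnU-Net input: the full image with one channel per wavelength
# (1 channel for US, 26 for PA, 27 for PAUS)
paus_image = np.concatenate([pa, us], axis=0)      # shape (27, 256, 128)

# FCNN input: one spectrum per pixel, discarding the spatial arrangement
paus_pixels = paus_image.reshape(27, -1).T         # shape (256 * 128, 27)
```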
Details of the U-Net (cf. Section 2.2.1) and the FCNN (cf. Section 2.2.2) are presented in the following sections.
2.2.1. Details of the nnU-Net
In a recent literature review [26], we found the U-Net to be the most commonly used and successful network architecture for deep learning applications in PAI [20], [27]. Compared to previous architectures used for image segmentation, it requires fewer training images, yields less blurred segmentation results, and is thus particularly well-suited for medical applications [28].
We applied the nnU-Net framework, currently the best-performing framework across numerous biomedical segmentation challenges [17]. The core idea of the nnU-Net framework is that not the network architecture but the detailed design choices (e.g., batch size, patch size, augmentation, ensembling of folds) are key to performance optimization. In this study, we used the 2D nnU-Net configuration, since initial results showed that this configuration performs best on our validation data. The network architecture is based on the original U-Net design [28] with minor changes, such as strided convolutional downsampling layers instead of max-pooling layers. All details are described in [17], and a schematic of the network architecture can be found in Figure A.1 of the Supplemental Material 2.
The size of the input layer was chosen according to the full image size (256 × 128) with channels that were assigned to the acquired wavelengths (i.e., 1 for US, 26 for PA, and 27 for PAUS), as also defined in the FCNN section. The size of the output layer was defined as the full image size (256 × 128) with 9 output channels that corresponded to the one-hot encoded representation of the eight annotation classes and one background class. The nnU-Net was trained in a five-fold cross-validation and the estimations were ensembled. The loss was defined as the sum of the Cross-Entropy (CE) Loss and the Soft Dice Loss. The CE Loss of the one-hot encoded estimated data and reference data is defined as:
(1) $\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} w_c \, y_{n,c} \log\left(\hat{y}_{n,c}\right)$

where $w_c$ is a class-specific weighting factor (here set to 1), $N$ is the minibatch size, $C$ is the number of classes, and $\hat{y}_{n,c}$ and $y_{n,c}$ denote the estimated and one-hot encoded reference value of sample $n$ for class $c$. The Soft Dice Loss, here calculated per minibatch, is defined as:
(2) $\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{n=1}^{N}\sum_{c=1}^{C}\hat{y}_{n,c}\,y_{n,c} + \epsilon}{\sum_{n=1}^{N}\sum_{c=1}^{C}\hat{y}_{n,c} + \sum_{n=1}^{N}\sum_{c=1}^{C} y_{n,c} + \epsilon}$

where $\hat{y}$ and $y$ are the estimated and reference data, respectively, and $\epsilon$ is a small smoothing factor.
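A minimal PyTorch sketch of this combined loss is given below, assuming raw network logits and integer class labels as inputs and an assumed smoothing factor; this is an illustration, not the exact nnU-Net implementation.

```python
import torch
import torch.nn.functional as F

def ce_plus_soft_dice(logits, target, eps=1e-5):
    """Sketch of the combined loss: cross-entropy plus soft Dice per minibatch.

    logits: (B, C, H, W) raw network outputs; target: (B, H, W) integer class labels.
    The smoothing factor eps is an assumed value.
    """
    ce = F.cross_entropy(logits, target)  # class weights w_c = 1

    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()

    intersection = (probs * one_hot).sum(dim=(0, 2, 3))
    cardinality = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    soft_dice = 1.0 - ((2.0 * intersection + eps) / (cardinality + eps)).mean()

    return ce + soft_dice
```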
2.2.2. Details of the FCNN
Another popular network design is the FCNN. In the context of PAI, FCNNs are well-suited for working with single-pixel spectra as input, which increases the number of training samples and makes it easy to work with sparsely annotated data.
The FCNN architecture is based on a previous publication [20] and consisted of an input layer of size 1 × 1 whose dimensions, as for the nnU-Net, corresponded to the measured wavelengths (i.e., 26 for PA and 27 for PAUS). The input layer was followed by an upscaling layer with a Tanh activation function, four hidden layers, and a one-hot encoded output layer with one dimension per annotation class. Dropout layers (20%) and leaky ReLUs were used between the four hidden layers. A diagram of the network architecture can be found in Figure A.2 in the Supplemental Material 2. For training of the FCNN, we used a Soft Margin Loss [29] defined as:
(3) $\mathcal{L}_{\mathrm{SM}} = \frac{1}{N \cdot C}\sum_{n=1}^{N}\sum_{c=1}^{C}\log\left(1 + e^{-y_{n,c}\,\hat{y}_{n,c}}\right)$

where $\hat{y}$ and $y$ define the estimated data and the one-hot encoded reference data (containing −1 or 1), respectively, $N$ the number of pixels per batch, and $C$ the number of classes. Like the nnU-Net, the FCNN was implemented in PyTorch [29], trained in a five-fold cross-validation, and the estimations were ensembled. The hyper-parameters were optimized via grid search; with the resulting learning rate and batch size, we used 1000 batches per epoch and trained the network for 200 epochs.
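The following PyTorch sketch illustrates this architecture and loss function; the hidden layer width and the number of output classes are assumptions, as the exact sizes are given in [20] and Figure A.2 of the Supplemental Material 2.

```python
import torch.nn as nn

class PixelwiseFCNN(nn.Module):
    """Sketch of the pixel-wise classifier; hidden width and class count are assumed."""

    def __init__(self, n_wavelengths=26, n_classes=8, hidden=128, p_drop=0.2):
        super().__init__()
        layers = [nn.Linear(n_wavelengths, hidden), nn.Tanh()]      # upscaling layer + Tanh
        for _ in range(4):                                          # four hidden layers
            layers += [nn.Linear(hidden, hidden), nn.LeakyReLU(), nn.Dropout(p_drop)]
        layers.append(nn.Linear(hidden, n_classes))                 # one-hot style output
        self.net = nn.Sequential(*layers)

    def forward(self, x):        # x: (batch, n_wavelengths) pixel spectra
        return self.net(x)

model = PixelwiseFCNN(n_wavelengths=27)     # 27 input dimensions for PAUS data
criterion = nn.SoftMarginLoss()             # targets encoded as -1 / 1 per class
```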
2.3. Performance assessment
We systematically compared the performances of different algorithm/data combinations using distance-based and overlap-based segmentation performance metrics. Specifically, we used the following metrics as recommended in [30] and applied in [31]:
Dice Similarity Coefficient (DSC) [32], which is defined as:

(4) $\mathrm{DSC}(\hat{Y}, Y) = \frac{2\,|\hat{Y} \cap Y|}{|\hat{Y}| + |Y|}$

where $\hat{Y}$ is the estimation and $Y$ the reference of a label class.
Normalized Surface Distance (NSD) [33] defined as:
(5) $\mathrm{NSD}(\hat{Y}, Y) = \frac{|S_{\hat{Y}} \cap B_{Y}^{(\tau)}| + |S_{Y} \cap B_{\hat{Y}}^{(\tau)}|}{|S_{\hat{Y}}| + |S_{Y}|}$

with the tolerance $\tau$, the surfaces $S_{\hat{Y}}$ and $S_{Y}$, and the border regions $B_{\hat{Y}}^{(\tau)}$ and $B_{Y}^{(\tau)}$ of the estimation and the reference, respectively. Here, the tolerance was set to $\tau = 1$ pixel for all classes except blood; this is the strictest possible value and was chosen because no inter-rater reliability was available for these classes. For blood, the tolerance was calculated based on the human annotation reliability analysis (cf. Section 2.1.3), as proposed in [33]: for every test image, the average nearest-neighbor distances between the surfaces of the reference and the re-annotated blood vessel segmentations (surface distances) were calculated, and the intercept of the fitted linear mixed model (cf. Section 2.1.3) was used as the NSD tolerance, $\tau = 5$ pixels (cf. Table A.2 in the Supplemental Material 2).
For each test image, the performance metrics were calculated per annotation class; hence, the validation results are not biased by imbalances in the number of pixels of the different classes. Note that, depending on the implementation, the metrics handle edge cases differently. Here, we decided not to compute the DSC if a class is not present in the reference, and not to compute the NSD if a class is not present in the reference or in the estimation. In addition to the annotation class-specific metric values, the class-specific values were averaged per test image, which we refer to as the values of all structures. Furthermore, descriptive statistics over the test instances were computed, resulting in one average value per algorithm/data configuration, annotation class, and metric. Note that this approach accounts for the hierarchical nature of the data.
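A minimal sketch of these metrics and of the edge-case handling is given below, assuming 2D binary masks per class and Euclidean distance transforms for the surface distances; the actual evaluation implementation may differ in details such as boundary handling.

```python
import numpy as np
from scipy import ndimage

def dsc(pred, ref):
    """Dice Similarity Coefficient (Eq. (4)) for two binary masks."""
    intersection = np.logical_and(pred, ref).sum()
    return 2.0 * intersection / (pred.sum() + ref.sum())

def nsd(pred, ref, tau=1):
    """Normalized Surface Distance (Eq. (5)): fraction of surface pixels of one mask
    lying within the tolerance tau of the other mask's surface (2D binary masks)."""
    def surface(mask):
        return mask & ~ndimage.binary_erosion(mask)

    s_pred, s_ref = surface(pred), surface(ref)
    dist_to_ref = ndimage.distance_transform_edt(~s_ref)    # distance to reference surface
    dist_to_pred = ndimage.distance_transform_edt(~s_pred)  # distance to estimated surface
    overlap = (dist_to_ref[s_pred] <= tau).sum() + (dist_to_pred[s_ref] <= tau).sum()
    return overlap / (s_pred.sum() + s_ref.sum())

def per_class_metrics(pred_labels, ref_labels, classes, tau_by_class):
    """Per-image, per-class metrics with the edge-case handling described above:
    DSC is skipped if the class is absent from the reference, NSD if it is absent
    from the reference or the estimation."""
    scores = {}
    for c in classes:
        pred, ref = pred_labels == c, ref_labels == c
        if ref.any():
            scores[(c, "DSC")] = dsc(pred, ref)
        if ref.any() and pred.any():
            scores[(c, "NSD")] = nsd(pred, ref, tau=tau_by_class[c])
    return scores
```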
Additionally, for the class blood, the overall human performance was calculated using the DSCs of the reference and the re-annotated blood vessel segmentations. In analogy to the calculation of the NSD tolerance, the mean and the standard deviation of the performance of the human annotators were determined by applying the linear mixed model to the per-image and per-annotator DSC values (cf. Section 2.1.3).
A comparison between the algorithms was performed using the challengeR toolkit, which is especially suited to analyzing and visualizing benchmarking results [34]. A statistical rank-then-aggregate approach was chosen to rank the methods, similar to the ranking scheme of the Medical Segmentation Decathlon [31]. First, a ranking per test case and algorithm was calculated. Second, the mean of the rankings per algorithm was calculated, resulting in the final rank. We chose the DSC as the primary metric for the ranking and defined the segmentation of blood vessels, skin, and all structures as three separate tasks. The segmentation of skin and blood vessels can be considered two of the most crucial tasks for segmentation algorithms in the field of PAI [26]. The corresponding DSC values were aggregated across the three poses and two body sides (cf. Fig. 2), resulting in N = 18 test cases per annotation class. Note that initial results of the challengeR toolkit did not show any major differences when using the NSD values, different aggregation schemes, or a different (non-test-based) ranking method.
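The rank-then-aggregate idea can be written down in a few lines of pandas; this is a simplified sketch with placeholder values, whereas the actual analysis relied on the challengeR toolkit, which additionally provides statistical testing and visualizations.

```python
import pandas as pd

# One DSC value per algorithm and aggregated test case (placeholder values)
results = pd.DataFrame({
    "algorithm": ["nnU-Net PAUS", "FCNN PAUS", "nnU-Net US"] * 2,
    "test_case": [1, 1, 1, 2, 2, 2],
    "dsc":       [0.80, 0.62, 0.35, 0.75, 0.70, 0.30],
})

# Rank-then-aggregate: rank the algorithms within each test case (1 = best DSC) ...
results["rank"] = results.groupby("test_case")["dsc"].rank(ascending=False)

# ... then average the per-test-case ranks per algorithm to obtain the final ranking
final_ranking = results.groupby("algorithm")["rank"].mean().sort_values()
print(final_ranking)
```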
3. Experiments and results
We designed one experiment to qualitatively analyze the data annotations and two experiments to verify the following core hypotheses:
1. Automatic multi-label segmentation of PA and PAUS data is feasible (cf. Section 3.1).
2. Automatic multi-label segmentation is feasible even when being applied to morphologically different test data (cf. Section 3.2).
The following sections detail the experimental design and the results for each.
3.1. Automatic multi-label segmentation of PA and PAUS data is feasible
We performed a feasibility experiment in which we used different combinations of data and inference models to evaluate differences in performance when segmenting the labeled annotation classes. We trained the nnU-Net on US data only, PA data only, and a combination of both PA and US data (PAUS) (cf. Fig. 4). The FCNN was trained only on PA and PAUS data because it estimates annotation classes from pixel-wise spectra alone, and the single-channel US data does not provide such spectral information. Fig. 5 shows the estimated segmentations corresponding to the median DSC for the blood class (computed on images with at least 60 blood pixels). While the FCNN estimations for the PA data sets qualitatively look plausible, they contain considerably more noise than the nnU-Net estimates.
Fig. 4.
Example of the forearm data set. (a) An ultrasound (US) image, (b) a photoacoustic (PA) image at 800 nm, and (c) the reference segmentation mask. The white arrows denote the location of the same vessel structures in all three images.
Fig. 5.
Segmentation results show agreement with the reference segmentation. A representative example image was chosen according to the median blood Dice Similarity Coefficient (DSC) (calculated on images with at least 60 pixels classified as blood) for the nnU-Net using photoacoustic and ultrasound (PAUS) data as input. The first row shows the input data: (left) the log-scaled photoacoustic (PA) image at 800 nm and (right) the reference segmentation. The estimated segmentation maps are shown below for (left) the nnU-Net and (right) the fully-connected neural network (FCNN) based on (second row) PAUS input data and (third row) PA input data.
Table 1 shows the DSC and NSD results for all input data types and both segmentation architectures (nnU-Net and FCNN). Fig. 6 presents the distribution of the DSC for blood, skin, and the average over all structures. The nnU-Nets trained on multispectral data, in particular, achieve slightly higher DSCs for blood than the human annotators (mean of 0.66, standard deviation of 0.09). The results for all annotation classes can be found in Table A.1, and details of the fitted linear mixed model in Table A.2 in the Supplemental Material 2. The model performances differed the most for blood segmentation; for this annotation class in particular, the nnU-Net performed substantially better than the FCNN. Fig. 7 confirms this finding and also indicates that PA or PAUS data is more beneficial than US data alone as input for blood segmentation. Additional qualitative results can be found in Figure A.3 in the Supplemental Material 2.
Table 1.
The mean performance scores show the feasibility of the method across multiple performance metrics. The Dice Similarity Coefficients (DSCs) and Normalized Surface Distances (NSDs) for the estimation results achieved by the nnU-Net and the fully-connected neural network (FCNN) leveraging photoacoustic (PA) data, ultrasound (US) data, and a combination thereof (PAUS), were calculated over all structures and for blood and skin separately. Higher DSC and NSD values are better.
Fig. 6.
Raw data plot for all algorithm/data combinations. Dice Similarity Coefficient (DSC) scores achieved in the feasibility experiment (a) averaged over all structures, (b) for blood, and (c) for skin. A separate plot is shown for each of the algorithm/data combinations. Color/Shape coding enables distinguishing the six volunteers (six colors) as well as the different target structures (circle, plus, and star for forearm, calf, and neck respectively). Measurements from the left and right side of the volunteers’ bodies are plotted at the left and right of the vertical lines, respectively. The gray density plots show the relative score frequencies separately for each side. For blood, the mean and standard deviation of the performance of the human annotators are shown as the dotted line and the shaded area, respectively.
Fig. 7.
The nnU-Net generally outperforms the fully-connected neural network (FCNN), and when using US data only, the nnU-Net struggles to segment blood vessels. For each of the raw data plots in Fig. 6, this figure shows an accompanying podium plot generated with the challengeR toolkit [34] that displays the relative performance of the Dice Similarity Coefficient (DSC) aggregated across the three poses and two body sides (cf. Fig. 2). (Upper parts) Participating models and corresponding DSC values are color-coded and ordered according to the achieved ranks from best (1) to worst (5). DSC values corresponding to identical test cases are connected by a line (spaghetti structure). (Lower parts) The bar charts represent the relative frequency at which each model achieved the respective rank.
3.2. Automatic multi-label segmentation is feasible even when applied to morphologically different test data
To investigate the robustness of the proposed approach with respect to morphologically different target structures, we conducted experiments in which we used different target structures for training and testing of our algorithm. Specifically, we trained the networks exclusively on (A) the neck and calf, (B) the forearm and neck, and (C) the forearm and calf measurements of the training set and estimated the annotation classes on (A) the forearm, (B) the calf, and (C) the neck measurements of the test set. The different combinations of data and inference models were analogous to the feasibility experiments.
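The split logic of these robustness scenarios can be sketched as follows; the Scan data structure and the toy scan lists are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Scan:
    scan_id: int
    body_site: str  # "forearm", "calf", or "neck"

# Toy placeholders for the actual training and test scans
training_set = [Scan(i, site) for i, site in enumerate(["forearm", "calf", "neck"] * 3)]
test_set = [Scan(100 + i, site) for i, site in enumerate(["forearm", "calf", "neck"] * 2)]

# Scenarios (A)-(C): train without one body site, test only on that site
for held_out in ["forearm", "calf", "neck"]:
    train_scans = [s for s in training_set if s.body_site != held_out]
    test_scans = [s for s in test_set if s.body_site == held_out]
    print(f"held out {held_out}: {len(train_scans)} training / {len(test_scans)} test scans")
```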
The DSC and NSD scores for the baseline method as well as for the scenarios (A–C) are shown in Table 2. In general, the nnU-Net models outperform the FCNN models. However, the differences in DSC between the robustness and feasibility experiments, shown in Fig. 8, indicate that the FCNN is more robust than the nnU-Net to body sites that are not included in the training data set. Additional qualitative as well as quantitative results can be found in Figures A.4–A.6 and Tables A.3–A.5 in the Supplemental Material 2.
Table 2.
Semantic segmentation results can differ on geometrically different test images when using a combination of photoacoustic and ultrasound (PAUS) data. The baseline feasibility experiment is compared to the robustness experiment results of semantic segmentation tested on images from geometries (forearm, calf, and neck) that were not included in the training set. We use the Dice Similarity Coefficient (DSC) as well as the Normalized Surface Distance (NSD). Higher DSC and NSD values are better.
Fig. 8.
The fully-connected neural network (FCNN) tends to be more robust to body sites that were not included in the training data compared to the nnU-Net. The respective differences of the Dice Similarity Coefficient (DSC) results (a) averaged over all structures, (b) for skin, and (c) for blood between the robustness experiments and the feasibility experiment for the nnU-Net and FCNN trained on photoacoustic and ultrasound (PAUS) data are shown. Values higher than zero denote an improved performance of the robustness results compared to the feasibility results. Values smaller than zero indicate the opposite. The separate plots for each of the training/test data combinations are indicated by the respective test set: forearm models trained on calf and neck data and tested on forearm measurements, calf models trained on forearm and neck data and tested on calf measurements, and neck models trained on forearm and calf data and tested on neck measurements. Color/Shape coding enables distinguishing the six volunteers (six colors) as well as the different target structures (circle, plus, and star for forearm, calf, and neck respectively). Measurements from the left and right side of the volunteers’ bodies are plotted to the left and right of the vertical lines, respectively. The gray density plots show the relative score frequencies separately for each side. The frequency distributions of the FCNN differences are narrower compared to those of the nnU-Net.
4. Discussion
To our knowledge, this paper is the first to show that fully automatic multi-label semantic image annotation of PA images with deep learning is feasible. We designed two experiments addressing (1) the general feasibility of multi-label image segmentation based on PA images and (2) the robustness towards changes in tissue geometry. We found that the task is generally feasible across different neural network architectures and that, compared with pure US images, the multispectral nature of PA images is particularly beneficial for blood segmentation (DSC of 0.74 for the PAUS nnU-Net vs. 0.32 for the US nnU-Net). To this end, we used two distinct types of networks throughout the experiments: an FCNN that utilizes single-pixel spectral information and an nnU-Net that additionally incorporates spatial context information. Specifically, we applied the nnU-Net, the current state of the art in medical image segmentation. On PAUS data, averaged across all annotation classes, the nnU-Net achieved a DSC of 0.85 and the FCNN a DSC of 0.66. Furthermore, the method was robustly applicable even to data in which the imaged body site was not represented in the training data.
In the feasibility experiment, the nnU-Net trained on multispectral data achieved better results on average than the FCNN. We attribute this to its ability to take the spatial image context into account. However, even though the FCNN could only estimate the labels based on single-pixel spectra, the resulting segmentations were plausible as well. The overall results of the networks were very convincing, achieving high overlap-based and contour-based metric values for the majority of classes. The worst performances were obtained for the blood and coupling artifact classes, for which the most obvious explanation is the difficulty of annotating these areas. Nevertheless, the spectra of the annotation classes seem to be very characteristic of the respective class. In an additional experiment, we noticed that the nnU-Net trained on PAUS data performs only marginally worse when using five wavelengths evenly sampled from the 26 wavelengths. This leads us to assume that, provided the characteristic spectra are sufficiently sampled, fewer wavelengths can be leveraged. A systematic analysis of failure cases (a representative image is shown in Figure A.3 in the Supplemental Material 2) revealed two common sources of error: (1) over-segmentation of small superficial vessels and (2) segmentation of regions of high intensity that were not identified as vessels by the annotators.
In our robustness experiment, the performance of the networks trained on PAUS data decreased slightly compared to the feasibility experiment in most cases. We suspect the domain shift between different body regions, which was explicitly introduced in this experiment, to be a reason for this finding. This indicates that variations between body regions can affect algorithm performance even though the tissue composition of the investigated regions was similar. It can be expected that the performance would degrade even further for vastly different structures (e.g., in abdominal surgery) or when applied to a different cohort, e.g., cancer patients. It has to be noted that the test data set of the robustness experiment was smaller than that of the feasibility experiment by a factor of three, which may also have contributed to the overall decrease in performance. Moreover, the smaller test set size could be a reason why the results of the robustness experiments partly outperformed those of the feasibility experiment. In future work including more data, the significance of the performance difference should be addressed. The experimental results could be a further indicator of the high potential of FCNNs for robust interpretation, by which we mean the independence of a network from morphological differences between training and test data: the distribution of DSC differences between the robustness and feasibility results was narrower for the FCNN than for the nnU-Net. Compared to convolutional approaches, FCNNs are by design more independent of morphological variations within the training and target domains. Ensembling strategies based on the estimation uncertainty of both the nnU-Net and FCNN results, leveraging the FCNN results as a prior, or post-processing of the resulting images might be able to combine the respective advantages of both.
In the broader context of biomedical image analysis, it has been suggested that model performance can be hampered by the quality of the reference segmentation labels [35]. The main problem with manual labeling is that the process is time-consuming, requires expert knowledge, and is error-prone. To reduce the differences between annotations, we devised a structured annotation protocol to standardize the labeling process and facilitate the differentiation of various structures (cf. Supplemental Material 1). A leave-one-out cross-validation of all acquired images did not reveal any obvious differences between the segmentation masks of the individual volunteers drawn by different annotators. Still, there remains ambiguity in the images, for example when delineating the apparent size and location of blood vessels in the PA signal, as highlighted in an additional human annotation reliability analysis. In particular, the slightly higher blood DSC results of the nnU-Net trained on multispectral data compared to the performance of human annotators indicate that the networks might be able to replace manual labeling, provided the annotation quality is high enough. In some images, we found size and position mismatches between the PA and US images, which might be introduced by differences in the speed of sound across volunteers or by a blurring of the PA signals, which in turn can be attributed to the limited bandwidth of the US detection elements, their impulse response, and artifacts introduced by the reconstruction algorithm. These inherent ambiguities might be resolvable by multi-modal image registration [36] or by capturing 3D instead of 2D images to exploit spatial context information. However, obtaining high-quality 3D reconstructions with handheld linear transducers is usually very time-consuming and requires additional hardware [37].
While our study indicates that multi-label semantic annotation of PA and/or US images with deep learning is feasible, the results should be interpreted with care. The biggest limitation of our study is the very low number of test images. With only 108 images from 6 volunteers, no broad conclusions should be drawn from the results, especially with respect to the relative performance of the different architecture/input combinations. A related potential problem is the hierarchical nature of the data set (16 volunteers, three imaging sites, three images per site, left and right side of the body), which complicates a rigorous statistical analysis. Based on the data visualization in Fig. 6, we found no clear evidence that the algorithms performed differently on test images of the left and right side of the body. At the same time, in some cases there is a clustering of DSC results within the same site. Our test data therefore cannot be regarded as independent, which is why we report the mean performance without standard deviation in Tables 1 and 2. Additionally, our experimental design may have introduced a bias favoring networks trained on PAUS data: first, the manual annotations were created using both the PA and US images; second, the number of learnable parameters of the networks increases with the number of input channels (here corresponding to the number of input spectra), which may improve the performance of the corresponding networks [38]. Moreover, the presented work is limited to the given annotated classes. In this paper, we categorized pixels into eight classes. Future work with more training data could investigate semantic scene segmentation with hierarchical class structures, for example to differentiate arteries and veins within the class blood. It should be mentioned that initial experiments on distinguishing these two blood classes did not yield high performance with the limited training data per class currently available. In future work, this method could be extended to segment specific structures that are oncologically relevant. However, such structures would need to be located sufficiently close to the skin and show a characteristic PA spectrum.
Overall, our work indicates that neural network-based semantic image segmentation of multispectral PA images is feasible, producing robust estimates even with relatively small amounts of training data. We believe that algorithms for automatic analysis of photoacoustic images are an important step towards clinical translation as they can assist physicians in understanding multispectral photoacoustic images. Especially in combination with wavelength-dependent tools for functional parameter estimation, such as blood oxygen saturation, they allow for the creation of powerful and clinically impactful visualizations of the imaged tissue structures.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This project was funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme through the ERC starting grant COMBIOSCOPY (grant agreement No. ERC-2015-StG-37960) and the consolidator grant NEURAL SPICING (grant agreement No. 101002198), and by the Surgical Oncology Program of the National Center for Tumor Diseases (NCT) Heidelberg. Part of this work was funded by Helmholtz Imaging (HI), a platform of the Helmholtz Incubator on Information and Data Science.
Additional information
The healthy human volunteer experiments were approved by the ethics committee of the medical faculty of Heidelberg University under reference number S-451/2020 and the study is registered with the German Clinical Trials Register under reference number DRKS00023205.
Biographies
Melanie Schellenberg received her M.Sc. degree in Physics from the University of Heidelberg in 2019. She is currently pursuing an interdisciplinary Ph.D. in computer science at the division of Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ) and aiming for quantitative photoacoustic imaging with a learning-to-simulate approach.
Kris Kristoffer Dreher received his M.Sc. degree in Physics from the University of Heidelberg in 2020. He is currently pursuing a Ph.D. at the division of Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ) and does research in deep learning-based domain adaptation methods to tackle the inverse problems of photoacoustic imaging.
Niklas Holzwarth received his M.Sc. degree in Physics from the University of Heidelberg in 2020. He is currently pursuing an interdisciplinary Ph.D. in computer science at the division of Computer Assisted Medical Interventions (CAMI), German Cancer Research Center (DKFZ) investigating a sensorless 3D photoacoustic approach, referred to as “tattoo tomography”.
Fabian Isensee finished his Ph.D. at the division of Medical Image Computing at the German Cancer Research Center (DKFZ) and was subsequently appointed head of the Helmholtz Imaging Applied Computer Vision Lab at the DKFZ, with the goal of translating state-of-the-art AI methods to the many diverse research applications found across the Helmholtz Association and beyond. The unit works in close collaboration with researchers, providing consulting and support and working on joint scientific projects. His research focuses on deep learning techniques for semantic segmentation of (three-dimensional) datasets in the biological and medical domain. He has been particularly active in the development of methods for automated segmentation pipeline design, and his methods have won multiple international segmentation competitions.
Annika Reinke is a Ph.D. student at the German Cancer Research Center (DKFZ), leading the sub-group “Benchmarking and Validation” in the division Computer Assisted Medical Interventions (CAMI). Her research focuses on validation and benchmarking of machine learning algorithms, therefore she is member of multiple initiatives aiming for high quality validation research.
Nicholas Schreck is a postdoctoral researcher and biostatistician at the German Cancer Research Center (DKFZ). He received his Doctorate in Mathematics from the University Mannheim. His research focusses on linear mixed models in genomic and biological applications.
Alexander Seitel is a computer scientist currently working as a group lead and deputy head at the division of Computer Assisted Medical Interventions at the German Cancer Research Center (DKFZ) in Heidelberg. He received his Doctorate in Medical Informatics from the University of Heidelberg and holds a Diploma (M.Sc. equivalent) in Computer Science from the Karlsruhe Institute of Technology. His research focusses on computer-assisted interventions and novel imaging methodologies aiming to improve interventional healthcare. In this area, he conducted various international projects at the DKFZ, during his two-year postdoctoral fellowship at the University of British Columbia, Vancouver, Canada, and at the Massachusetts Institute of Technology (MIT), Cambridge, MA.
Minu Dietlinde Tizabi is a physician, scientist and writer in the division of Computer Assisted Medical Interventions (CAMI) at the German Cancer Research Center (DKFZ).
Lena Maier-Hein is a full professor at Heidelberg University (Germany) and affiliated professor to LKSK institute of St. Michael‘s Hospital (Toronto, Canada). At the German Cancer Research Center (DKFZ) she is managing director of the “Data Science and Digital Oncology” cross-topic program and head of the division Computer Assisted Medical Interventions (CAMI). Her research concentrates on machine learning-based biomedical image analysis with a specific focus on surgical data science, computational biophotonics and validation of machine learning algorithms.
Janek Gröhl received his M.Sc. degree in medical informatics from the University of Heidelberg and Heilbronn University of Applied Sciences in 2016. He received his Ph.D. from the medical faculty of the University of Heidelberg in April 2021. In 2020, he worked as a postdoctoral researcher at the German Cancer Research Center in Heidelberg, Germany and is currently working as a research associate at the Cancer Research UK Cambridge Institute in Cambridge, United Kingdom. He does research in computational biophotonics focusing on data-driven methods for data processing and signal quantification in photoacoustic imaging.
Footnotes
Supplementary material related to this article can be found online at https://doi.org/10.1016/j.pacs.2022.100341.
Contributor Information
Melanie Schellenberg, Email: melanie.schellenberg@dkfz-heidelberg.de.
Lena Maier-Hein, Email: l.maier-hein@dkfz-heidelberg.de.
Appendix A. Supplementary data
The following is the Supplementary material related to this article. It contains the annotation protocol, schematic figures of the network architectures, additional results, and more detailed plots that supplement the findings and results reported in the main paper.
References
1. Xia J., Yao J., Wang L.V. Photoacoustic tomography: Principles and advances. Electromagn. Waves (Camb.) 2014;147:1. doi: 10.2528/pier14032303.
2. Brunker J., Yao J., Laufer J., Bohndiek S.E. Photoacoustic imaging using genetically encoded reporters: A review. J. Biomed. Opt. 2017;22(7). doi: 10.1117/1.JBO.22.7.070901.
3. Beard P. Biomedical photoacoustic imaging. Interface Focus. 2011;1(4):602–631. doi: 10.1098/rsfs.2011.0028.
4. Hauptmann A., Cox B.T. Deep learning in photoacoustic tomography: Current approaches and future directions. J. Biomed. Opt. 2020;25(11).
5. Liu C., Chen J., Zhang Y., Zhu J., Wang L. Five-wavelength optical-resolution photoacoustic microscopy of blood and lymphatic vessels. Adv. Photonics. 2021;3(1).
6. Triki F., Xue Q. Hölder stability of quantitative photoacoustic tomography based on partial data. Inverse Problems. 2021;37(10). doi: 10.1088/1361-6420/ac1e7e.
7. Lafci B., Merčep E., Morscher S., Deán-Ben X.L., Razansky D. Deep learning for automatic segmentation of hybrid optoacoustic ultrasound (OPUS) images. IEEE Trans. Ultrason. Ferroelectr. Freq. Control. 2020;68(3):688–696. doi: 10.1109/TUFFC.2020.3022324.
8. Lafci B., Merćep E., Morscher S., Deán-Ben X.L., Razansky D. Efficient segmentation of multi-modal optoacoustic and ultrasound images using convolutional neural networks. In: Photons Plus Ultrasound: Imaging and Sensing 2020, vol. 11240. International Society for Optics and Photonics; 2020. p. 112402N.
9. Liang Z., Zhang S., Wu J., Li X., Zhuang Z., Feng Q., Chen W., Qi L. Automatic 3-D segmentation and volumetric light fluence correction for photoacoustic tomography based on optimal 3-D graph search. Med. Image Anal. 2021. doi: 10.1016/j.media.2021.102275.
10. Zhang J., Chen B., Zhou M., Lan H., Gao F. Photoacoustic image classification and segmentation of breast cancer: A feasibility study. IEEE Access. 2018;7:5457–5466.
11. Luke G.P., Hoffer-Hawlik K., Van Namen A.C., Shang R. O-Net: A convolutional neural network for quantitative photoacoustic image segmentation and oximetry. 2019. arXiv preprint arXiv:1911.01935.
12. Chlis N.-K., Karlas A., Fasoula N.-A., Kallmayer M., Eckstein H.-H., Theis F.J., Ntziachristos V., Marr C. A sparse deep learning approach for automatic segmentation of human vasculature in multispectral optoacoustic tomography. Photoacoustics. 2020;20. doi: 10.1016/j.pacs.2020.100203.
13. Yuan A.Y., Gao Y., Peng L., Zhou L., Liu J., Zhu S., Song W. Hybrid deep learning network for vascular segmentation in photoacoustic imaging. Biomed. Opt. Express. 2020;11(11):6445–6457. doi: 10.1364/BOE.409246.
14. Gerl S., Paetzold J.C., He H., Ezhov I., Shit S., Kofler F., Bayat A., Tetteh G., Ntziachristos V., Menze B. A distance-based loss for smooth and continuous skin layer segmentation in optoacoustic images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2020. pp. 309–319.
15. Ly C.D., Nguyen V.T., Vo T.H., Mondal S., Park S., Choi J., Vu T.T.H., Kim C.-S., Oh J. Full-view in vivo skin and blood vessels profile segmentation in photoacoustic imaging based on deep learning. Photoacoustics. 2021. doi: 10.1016/j.pacs.2021.100310.
16. Moustakidis S., Omar M., Aguirre J., Mohajerani P., Ntziachristos V. Fully automated identification of skin morphology in raster-scan optoacoustic mesoscopy using artificial intelligence. Med. Phys. 2019;46(9):4046–4056. doi: 10.1002/mp.13725.
17. Isensee F., Jaeger P.F., Kohl S.A., Petersen J., Maier-Hein K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nature Methods. 2021;18(2):203–211. doi: 10.1038/s41592-020-01008-z.
18. Becker A., Masthoff M., Claussen J., Ford S.J., Roll W., Burg M., Barth P.J., Heindel W., Schaefers M., Eisenblaetter M., et al. Multispectral optoacoustic tomography of the human breast: Characterisation of healthy tissue and malignant lesions using a hybrid ultrasound-optoacoustic approach. Eur. Radiol. 2018;28(2):602–609. doi: 10.1007/s00330-017-5002-x.
19. Wei C.-W., Nguyen T.-M., Xia J., Arnal B., Wong E.Y., Pelivanov I.M., O’Donnell M. Real-time integrated photoacoustic and ultrasound (PAUS) imaging system to guide interventional procedures: Ex vivo study. IEEE Trans. Ultrason. Ferroelectr. Freq. Control. 2015;62(2):319–328. doi: 10.1109/TUFFC.2014.006728.
20. Gröhl J., Kirchner T., Adler T.J., Hacker L., Holzwarth N., Hernández-Aguilera A., Herrera M.A., Santos E., Bohndiek S.E., Maier-Hein L. Learned spectral decoloring enables photoacoustic oximetry. Sci. Rep. 2021;11(1):1–12. doi: 10.1038/s41598-021-83405-8.
21. Kirchner T., Sattler F., Gröhl J., Maier-Hein L. Signed real-time delay multiply and sum beamforming for multispectral photoacoustic imaging. J. Imaging. 2018;4(10):121.
22. Nolden M., Zelzer S., Seitel A., Wald D., Müller M., Franz A.M., Maleike D., Fangerau M., Baumhauer M., Maier-Hein L., et al. The medical imaging interaction toolkit: Challenges and advances. Int. J. Comput. Assist. Radiol. Surg. 2013;8(4):607–620. doi: 10.1007/s11548-013-0840-8.
23. Mongan J., Moy L., Kahn C.E. Checklist for artificial intelligence in medical imaging (CLAIM): A guide for authors and reviewers. Radiol. Artif. Intell. 2020;2(2). doi: 10.1148/ryai.2020200029.
24. Attia A.B.E., Balasundaram G., Moothanchery M., Dinish U., Bi R., Ntziachristos V., Olivo M. A review of clinical photoacoustic imaging: Current and future trends. Photoacoustics. 2019;16. doi: 10.1016/j.pacs.2019.100144.
25. Roß T., Bruno P., Reinke A., Wiesenfarth M., Koeppel L., Full P.M., Pekdemir B., Godau P., Trofimova D., Isensee F., et al. How can we learn (more) from challenges? A statistical approach to driving future algorithm development. 2021. arXiv preprint arXiv:2106.09302.
26. Gröhl J., Schellenberg M., Dreher K., Maier-Hein L. Deep learning for biomedical photoacoustic imaging: A review. Photoacoustics. 2021. doi: 10.1016/j.pacs.2021.100241.
27. Zhang H., Bo W., Wang D., Di Spirito A., Huang C., Nyayapathi N., Zheng E., Vu T., Gong Y., Yao J., et al. Deep-E: A fully-dense neural network for improving the elevation resolution in linear-array-based photoacoustic tomography. IEEE Trans. Med. Imaging. 2021. doi: 10.1109/TMI.2021.3137060.
28. Ronneberger O., Fischer P., Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2015. pp. 234–241.
29. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., et al. PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc.; 2019. pp. 8024–8035.
30. Reinke A., Eisenmann M., Tizabi M.D., Sudre C.H., Rädsch T., Antonelli M., Arbel T., Bakas S., Cardoso M.J., Cheplygina V., et al. Common limitations of image processing metrics: A picture story. 2021. arXiv preprint arXiv:2104.05642.
31. Antonelli M., Reinke A., Bakas S., Farahani K., Landman B.A., Litjens G., Menze B., Ronneberger O., Summers R.M., van Ginneken B., et al. The medical segmentation Decathlon. 2021. arXiv preprint arXiv:2106.05735.
32. Dice L.R. Measures of the amount of ecologic association between species. Ecology. 1945;26(3):297–302.
33. Nikolov S., Blackwell S., Zverovitch A., Mendes R., Livne M., De Fauw J., Patel Y., Meyer C., Askham H., Romera-Paredes B., et al. Clinically applicable segmentation of head and neck anatomy for radiotherapy: Deep learning algorithm development and validation study. J. Med. Internet Res. 2021;23(7). doi: 10.2196/26151.
34. Wiesenfarth M., Reinke A., Landman B.A., Eisenmann M., Saiz L.A., Cardoso M.J., Maier-Hein L., Kopp-Schneider A. Methods and open-source toolkit for analyzing and visualizing challenge results. Sci. Rep. 2021;11(1):1–15. doi: 10.1038/s41598-021-82017-6.
35. Zlateski A., Jaroensri R., Sharma P., Durand F. On the importance of label quality for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 1479–1487.
36. Ren W., Deán-Ben X.L., Augath M.-A., Razansky D. Feasibility study on concurrent optoacoustic tomography and magnetic resonance imaging. In: Photons Plus Ultrasound: Imaging and Sensing 2021, vol. 11642. International Society for Optics and Photonics; 2021. p. 116420C.
37. Holzwarth N., Schellenberg M., Gröhl J., Dreher K., Nölke J.-H., Seitel A., Tizabi M.D., Müller-Stich B.P., Maier-Hein L. Tattoo tomography: Freehand 3D photoacoustic image reconstruction with an optical pattern. Int. J. Comput. Assist. Radiol. Surg. 2021:1–10. doi: 10.1007/s11548-021-02399-w.
38. Tan M., Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. PMLR; 2019. pp. 6105–6114.