Abstract
Photoacoustic tomography (PAT) is an emerging biomedical imaging modality that uniquely combines high spatial resolution with deep tissue penetration in a non-invasive manner, holding significant promise for diverse applications. However, image reconstruction quality in PAT severely degrades under limited-view data acquisition scenarios, such as those imposed by the physical constraints of intracavitary imaging. Conventional reconstruction methods (e.g., Delay-and-Sum, DAS) under these conditions typically yield images plagued by severe artifacts and loss of fine structural details. While deep learning (DL) approaches offer some improvement, existing post-processing methods still struggle to accurately recover intricate anatomical features from severely undersampled, limited-view data, often resulting in blurred details or persistent artifacts. To address these critical limitations, we propose DUAFF-Net, a novel dual-stream deep learning architecture. DUAFF-Net uniquely processes two complementary input representations in parallel: 1) conventional DAS reconstructions, and 2) pixel-wise interpolated raw data. The network employs a sophisticated two-stage feature fusion strategy to maximize information extraction and synergy. In the first stage, the Multi-level Inter-stream Attention Fusion Module (MIAF-Module) enables early-stage cross-modal information complementarity and feature enhancement. Subsequently, the Global Context and Dense Feature Fusion Module (GCDF-Module) focuses on holistic feature optimization and deep integration across the streams. These modules work synergistically to progressively refine the reconstruction. Extensive experiments on simulated PAT datasets of retinal vasculature and complex brain structures, as well as an in vivo mouse abdomen dataset, demonstrate that DUAFF-Net robustly generates high-quality images even under highly incomplete data conditions.
Quantitative evaluation shows that DUAFF-Net achieves substantial improvements over the standard DAS algorithm, with gains of ∼18.38 dB in Peak Signal-to-Noise Ratio (PSNR) and ∼0.69 in Structural Similarity Index (SSIM). Furthermore, DUAFF-Net consistently outperforms other state-of-the-art DL-based reconstruction models across multiple metrics, demonstrating its superior capability in preserving fine details and suppressing artifacts, thereby establishing comprehensive performance advantages for limited-view PAT reconstruction.
Keywords: Limited-view, Feature fusion, Photoacoustic imaging, Deep learning, Image reconstruction
1. Introduction
Photoacoustic tomography (PAT), an emerging hybrid imaging modality, uniquely combines the advantages of high optical contrast and high ultrasonic resolution. It enables detailed structural and functional imaging of biological tissues at penetration depths ranging from several millimeters to centimeters, demonstrating immense potential for clinical applications [1], [2], [3]. However, reconstructing the internal initial pressure distribution from photoacoustic signals detected at the tissue boundary is an inherently challenging and ill-posed inverse problem, the solution of which is highly sensitive to data incompleteness. Conventionally, this problem has been addressed using non-iterative algorithms such as Time Reversal (TR) [4], [5] and Delay-and-Sum (DAS) [6], [7].
When the sensor array is equipped with a sufficient number of transducer elements and can achieve full-angle, dense sampling of the target tissue, traditional algorithms can reconstruct high-quality images [8]. However, in practical biomedical imaging applications, physical and geometric constraints—including transducer cost, size limitations, and restricted acoustic windows for complex or deep-seated tissues—often make the construction of such ideal acquisition systems infeasible. These real-world limitations typically lead to incomplete photoacoustic signal acquisition, manifesting primarily as sparse spatial sampling and a limited data acquisition aperture (i.e., the limited-view problem) [9], [10], [11]. This inherent data deficiency is a critical bottleneck that compromises the fidelity of the subsequent image reconstruction.
To address the aforementioned shortcomings of traditional reconstruction methods, deep learning (DL) has emerged as a key paradigm for enhancing PAT image quality, owing to its powerful data-driven modeling capabilities [12], [13], [14]. Currently, DL-based PAT reconstruction strategies can be broadly categorized according to their primary role in the workflow: signal-domain pre-enhancement [15], [16], end-to-end direct reconstruction [17], [18], [19], and image-domain post-processing enhancement [20], [21], [22], [23].
In end-to-end direct reconstruction (direct mapping from sensor signals to an image), researchers have explored various network architectures to recover the initial pressure distribution from incomplete or sparse acquisitions. For instance, early work by Waibel et al. [24] proposed a modified U-Net architecture, validating the feasibility of learning reconstruction directly from acquired data and laying the groundwork for using neural networks to overcome PAT data limitations. Subsequent research, such as the work by Feng et al. [25], further improved reconstruction quality by introducing residual learning into the U-Net (ResUNet). Additionally, to learn a more robust signal-to-image mapping, Tong et al. [26] proposed the Feature Projection Network (FPNet), which innovatively integrated photoacoustic signals and their temporal derivatives as a dual-channel input. However, deep learning networks that directly reconstruct from time-domain signals often face challenges when processing the high-dimensional, asymmetric nature of PAT data. The use of fully connected layers or large-receptive-field convolutions can lead to a prohibitive number of model parameters, imposing significant computational and memory demands and creating challenges for training and generalization.
Image-domain post-processing enhancement (image-to-image) is another extensively studied strategy, which utilizes deep neural networks to optimize images initially reconstructed by traditional algorithms like DAS. The Fully Dense U-Net (FDUNet) proposed by Guan et al. [27] is a typical example, designed to remove sparse-sampling artifacts from these preliminary images. Although such methods can effectively improve the visual appearance of images, their ultimate performance is highly dependent on the quality of the input image. If the initial image already exhibits a severe loss of structural information or is plagued by strong artifacts, the network faces an inherent limitation in recovering fine features and supplementing missing details, making it difficult to achieve the desired high-quality imaging standards [28], [29].
To overcome the limitations of single-source information or single-stage processing—such as the artifact susceptibility and detail loss of post-processing and the lower robustness of direct-processing [30], [31]—researchers have begun to explore more advanced hybrid-input and feature-fusion strategies. Such dual-domain approaches, which leverage complementary information from both the image and signal (sinogram) domains, have proven highly effective in related reconstruction tasks like sparse-view and low-dose CT [32], [33], [34]. Similarly, in the field of PAT, the Y-Net architecture proposed by Lan et al. [30] demonstrated the potential of multi-source information fusion by concurrently processing raw sensor data and DAS-reconstructed images. However, Y-Net's use of large-sized convolution kernels to handle high-dimensional raw sensor data posed challenges to computational resources and training efficiency. To mitigate such issues, the AS-Net designed by Guo et al. [35] introduced a parameter-efficient folding transform to process time-series data, making progress in enhancing parameter efficiency. Yet, this operation, which reshapes the 1D time series and transfers some temporal information to the channel dimension, may face challenges in optimally preserving critical long-range temporal dependencies within the original signal. Although existing fusion methods have advanced reconstruction quality, effectively mining and fusing the complementary advantages of different data sources (e.g., structural information from the image domain and detailed information from the signal domain) remains a core challenge and a critical research direction. This is especially true under conditions of extreme data incompleteness, such as combined limited-view and sparse-sampling scenarios, where achieving a synergistic, high-fidelity recovery of both structure and detail while suppressing artifacts remains a significant challenge.
To address these challenges, this study proposes a novel deep learning framework named DUAFF-Net (Dual-stream U-Net with Attentive Feature Fusion Network). The core innovation of DUAFF-Net lies in its unique dual-input stream design and its staged, multi-level attentive feature fusion mechanism. First, it concurrently processes a traditional DAS image and a multi-channel interpolated dataset, which is pre-processed based on physical principles to preserve independent multi-sensor perspectives. Simultaneously, an MSRA-Module (Multi-scale Residual Attention Module) is designed for the DAS image stream to deeply mine and reinforce its macro-structural prior information. Second, a novel MIAF-Module is introduced to achieve effective interaction and preliminary fusion of heterogeneous features (structural information from DAS and detail information from the interpolated data) at the shallow stages of the network, promoting early-stage information complementarity. Finally, a unique GCDF-Module is constructed to efficiently integrate and enhance the deep features from both parallel paths at the deeper layers of the network, ensuring a comprehensive synergy between global structure and local detail to elevate the final reconstruction quality. To validate the efficacy of DUAFF-Net, we conducted comprehensive training and evaluation on both computer-simulated photoacoustic datasets (retinal vasculature and brain MRI) and an in vivo mouse abdomen dataset. These experiments specifically focused on the model's performance under limited-view and sparse-data conditions, as well as its robustness in real-world biological imaging scenarios.
2. Method
2.1. Imaging model and network input generation
Assuming a two-dimensional space, the relationship between the initial photoacoustic pressure distribution $\mathbf{p}_0$ and the measurement data $\mathbf{y}$ detected by the sensors can be expressed as:

$$\mathbf{y} = \mathbf{A}\,\mathbf{p}_0 \tag{1}$$

where $\mathbf{A}$ represents the forward system matrix determined by the sensor geometry and acoustic wave propagation laws. In limited-view scenarios, $\mathbf{A}$ is severely ill-conditioned, making direct inverse reconstruction extremely unstable.
To generate the network input, we employ the Delay-and-Sum (DAS) algorithm to transform the measurement data into the image domain. The reconstruction value at pixel $\mathbf{r}$ is calculated as follows:

$$I(\mathbf{r}) = \sum_{i=1}^{N} s_i\big(t_i(\mathbf{r})\big) \tag{2}$$

where $s_i(t)$ is the signal received by the $i$-th sensor, and $t_i(\mathbf{r}) = \lVert \mathbf{r} - \mathbf{r}_i \rVert / c$ is the time-of-flight for the sound wave to propagate from position $\mathbf{r}$ to the $i$-th sensor located at $\mathbf{r}_i$. This image serves as the initial structural input for the subsequent deep learning model.
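The DAS formation of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration under the paper's uniform speed-of-sound assumption, not the authors' implementation; the function name, sampling rate, and nearest-sample rounding are our assumptions:

```python
import numpy as np

def das_reconstruct(signals, sensor_pos, grid_xy, c=1500.0, fs=40e6):
    """Minimal Delay-and-Sum sketch of Eq. (2): for each pixel, sum each
    sensor's sample at the time-of-flight from that pixel to the sensor.

    signals    : (N, T) array, one time series per sensor
    sensor_pos : (N, 2) sensor coordinates in metres
    grid_xy    : (P, 2) pixel coordinates in metres
    c          : speed of sound (m/s), uniform-medium assumption
    fs         : assumed sampling rate (Hz)
    """
    n_sensors, n_samples = signals.shape
    image = np.zeros(len(grid_xy))
    for i in range(n_sensors):
        # time-of-flight t_i(r) = |r - r_i| / c, converted to a sample index
        dist = np.linalg.norm(grid_xy - sensor_pos[i], axis=1)
        idx = np.clip(np.round(dist / c * fs).astype(int), 0, n_samples - 1)
        image += signals[i, idx]
    return image
```

A point source placed at a known location produces, after this summation, a peak at the corresponding pixel; with a limited-view arc the same summation also produces the streak artifacts discussed above.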
2.2. Pixel-wise interpolation
To construct an input stream that is information-complementary to the DAS image and that preserves the fine details and multi-view characteristics of the original signals, we employed a pixel-wise interpolation technique for pre-processing, based on the work of Guan et al. [29]. The core of this technique is to take the one-dimensional time-domain signals captured by each of the N sensors (system configuration shown in Fig. 1(a), with raw data shown in Fig. 1(b)) and, based on the acoustic Time-of-Flight (TOF), map them individually onto a pre-defined H × W reconstruction grid.
Fig. 1.
Photoacoustic imaging process. (a) Schematic of the imaging system with 64 detectors (yellow dots). (b) Raw signal data from the 64 sensors. (c) Spatial mapping of interpolated data from a single sensor located at coordinates (-10, 0) mm. The plot displays the data in a two-dimensional Cartesian coordinate system, where the time-series signal amplitudes are mapped to corresponding spatial locations based on acoustic propagation distance, resulting in concentric ring textures centered at the sensor position.
Ultimately, these N generated, independent two-dimensional spatial maps (an example is shown in Fig. 1(c)) are stacked along the channel dimension to form an N × H × W multi-channel data matrix. This matrix effectively preserves the independent perspective of each sensor, providing a detail-rich input for the subsequent Pixel-level Feature Extraction Network (PFE-UNet). To simplify computation, the TOF estimation in this study adopts a uniform speed of sound assumption. Although this assumption may introduce certain geometric deviations, the powerful end-to-end learning capability of DUAFF-Net can effectively compensate for and correct such deviations.
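The mapping described above can be sketched as follows. This is a hedged NumPy illustration of the per-sensor TOF lookup, not the paper's code; the grid extent, sampling rate, and function name are assumptions:

```python
import numpy as np

def pixelwise_interpolate(signals, sensor_pos, extent=0.0104,
                          c=1500.0, fs=40e6, h=128, w=128):
    """Map each sensor's 1-D time series onto an H x W grid via
    time-of-flight, then stack the N maps along the channel axis.

    Returns an (N, H, W) array; each channel shows the concentric-ring
    texture of one sensor centred at its position (cf. Fig. 1(c)).
    """
    n_sensors, n_samples = signals.shape
    xs = np.linspace(-extent, extent, w)
    ys = np.linspace(-extent, extent, h)
    gx, gy = np.meshgrid(xs, ys)
    maps = np.empty((n_sensors, h, w))
    for i in range(n_sensors):
        # distance from every pixel to sensor i -> sample index via TOF
        dist = np.hypot(gx - sensor_pos[i, 0], gy - sensor_pos[i, 1])
        idx = np.clip(np.round(dist / c * fs).astype(int), 0, n_samples - 1)
        maps[i] = signals[i, idx]  # amplitude placed at matching radius
    return maps
```

Stacking the N maps channel-wise, rather than summing them as DAS does, is what preserves each sensor's independent perspective for PFE-UNet.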
2.3. Dual-stream U-net with attentive feature fusion network (DUAFF-Net)
2.3.1. Overall network architecture
To achieve high-quality image reconstruction from incomplete photoacoustic signals, this study proposes a novel dual-stream deep learning framework, termed DUAFF-Net. The overall architecture of DUAFF-Net is illustrated in Fig. 2. Its core design philosophy lies in an information-complementary strategy using dual input streams and a multi-stage, multi-level feature fusion mechanism.
Fig. 2.
An overview of the DUAFF-Net photoacoustic reconstruction framework proposed in this paper. The framework employs a dual-input streaming strategy: one branch inputs a DAS image into the SFE-UNet; the other inputs spatially mapped raw data into the PFE-UNet. The two modified UNet-like subnetworks interact and integrate information through a two-stage feature fusion module (FFM, which includes the MIAF-Module and GCDF-Module), ultimately outputting a high-quality reconstructed photoacoustic image. (Note: tensor dimensions in the diagram are denoted as Channels × Height × Width.)
Specifically, DUAFF-Net processes two information sources in parallel: The first stream is the image initially reconstructed by the conventional Delay-and-Sum (DAS) algorithm, which aims to provide the network with macro-structural and contour priors of the target. The second stream consists of the multi-channel spatial map data formed by pre-processing the raw sensor data with pixel-wise interpolation, intended to preserve the rich micro-structural details and multi-view features of the original signal. These two input streams are processed by two dedicated U-Net-based subnetworks, SFE-UNet (see Section 2.3.2) and PFE-UNet (see Section 2.3.3), for deep feature extraction.
To efficiently integrate these two complementary information streams, DUAFF-Net employs a feature fusion mechanism that runs throughout the network. This primarily includes the MIAF-Module (Multi-level Inter-stream Attention Fusion Module, see Section 2.3.4(1)) for early information interaction between corresponding levels of the subnetworks, and the GCDF-Module (Global Context and Dense Feature Fusion Module, see Section 2.3.4(2)) for deep integration of high-level features at the end of the network. This design enables DUAFF-Net to synergistically leverage the advantages of different information sources, ultimately generating high-quality reconstructed photoacoustic images.
2.3.2. Structural Feature Extraction UNet (SFE-UNet)
SFE-UNet is the critical branch in DUAFF-Net responsible for processing the DAS image. Its core objective is to extract robust and information-rich macro-structural features from the preliminary reconstructed images, which often contain significant artifacts and blurred structures. Although the DAS image may lack fine details, it preserves the general contour and position of the target, serving as a valuable structural prior. To effectively utilize this prior while suppressing its inherent defects, SFE-UNet incorporates the following key improvements upon the classic U-Net architecture:
First, to enhance the network's multi-scale feature representation capabilities, we have comprehensively replaced the standard convolutional layers in both the encoder and decoder with a novel Multi-scale Residual Attention Module (MSRA-Module). As the artifacts and target structures in DAS images vary in size, the MSRA-Module, by processing branches with different receptive fields in parallel, can simultaneously capture this multi-scale contextual information. This is crucial for identifying and separating artifacts from true structures.
Second, to optimize feature fusion, we integrated the lightweight ECA-Attention [36] into the skip connections, enabling adaptive channel refinement and suppression of redundant features with negligible computational cost.
Furthermore, the decoder of SFE-UNet also performs early cross-stream fusion with features from PFE-UNet at multiple levels via the MIAF-Module (detailed in Section 2.3.4). This serves to continuously guide and correct the feature learning process.
The MSRA-Module is the core building block for efficient feature extraction in SFE-UNet, and its detailed structure is shown in Fig. 3(a).
Fig. 3.
Presents the three core modules of our model: (a) MSRA-Module, (b) MIAF-Module, and (c) GCDF-Module.
Unlike generic multi-scale modules, the MSRA-Module is a physics-driven adaptation designed specifically for scale-aware artifact decoupling. As observed in preliminary DAS reconstructions, streak artifacts and vascular structures are spatially intertwined but possess distinct scale characteristics. Standard convolutional layers, with fixed receptive fields, struggle to effectively separate these aliased signals. To address this, we integrated parallel depth-wise separable convolutions (3×3 and 5×5) with the CBAM attention mechanism [37]. In this design, the depth-wise convolutions capture features at different receptive fields with minimal parameter overhead, achieving scale separation of signal components. Simultaneously, CBAM acts as a signal selector, dynamically re-weighting features to suppress artifact-dominated channels while enhancing those containing valid structural priors. Finally, residual learning is introduced to facilitate gradient flow and maintain structural integrity. This module effectively serves as a robust feature filter that disentangles artifacts from true structures, as validated in subsequent experiments.
2.3.3. Pixel-level Feature Extraction UNet (PFE-UNet)
PFE-UNet is the parallel branch in DUAFF-Net responsible for processing the multi-channel, pixel-wise interpolated data. Its core task is to extract the fine-grained features crucial for high-quality photoacoustic image reconstruction from this detail-rich data, which preserves the independent perspectives of each sensor, thereby effectively supplementing the macro-structural information provided by the SFE-UNet branch.
The architecture of PFE-UNet is also based on U-Net, but it incorporates the following key optimizations to efficiently process its unique input data:
In the encoder stage, we replace the standard convolutions with a combination of Selective Kernel Attention (SK-Attention) [38] and depthwise separable convolution. Each channel of the pixel-wise interpolated data corresponds to the perspective of a single sensor, and the feature scales and patterns they contain can vary significantly. SK-Attention dynamically and adaptively adjusts the receptive field size of the convolution kernel based on the input features, which enables PFE-UNet to more flexibly capture and fuse diverse information from different perspectives and scales.
To maximize the extraction of contextual information at the deepest layer of the network, the bottleneck layer integrates dilated convolution and a self-attention [39] mechanism. Dilated convolution effectively expands the receptive field without increasing parameters or computational cost to capture multi-scale contextual features. Subsequently, the self-attention mechanism models global relationships among these features, further enhancing their expressive power and the ability to capture long-range dependencies.
Similar to SFE-UNet, PFE-UNet also integrates ECA-Attention modules at its skip connections. This is intended to optimize the flow of low-level detail information from the encoder to the decoder by selectively enhancing information-rich feature channels.
The deep features extracted by PFE-UNet are fed, along with the features from SFE-UNet, into the subsequent feature fusion modules (MIAF-Module and GCDF-Module) to achieve synergistic integration of the two information streams.
2.3.4. Feature fusion
The success of DUAFF-Net lies in its innovative multi-stage collaborative feature fusion strategy, which is designed to intelligently integrate the heterogeneous features from the two parallel branches, SFE-UNet and PFE-UNet. Unlike conventional strategies that perform a single fusion only at the end of the network, DUAFF-Net adopts a two-stage fusion mechanism that combines "early guidance with late-stage integration." This mechanism is collaboratively implemented by two core modules: the Multi-level Inter-stream Attention Fusion Module (MIAF-Module) and the Global Context and Dense Feature Fusion Module (GCDF-Module), whose structures are shown in Fig. 3(b) and 3(c), respectively.
1) Multi-level Inter-stream Attention Fusion Module (MIAF-Module)
The core design philosophy of the MIAF-Module is to use the rich, detailed features extracted by PFE-UNet from the pixel-wise interpolated data to continuously guide and supplement the feature learning process of SFE-UNet as it handles the DAS image. This enables the generation of more comprehensive intermediate feature representations with fewer artifacts in the early and middle stages of the network, thereby avoiding the potential issues of feature loss or cumulative deviation that can occur in a single network path.
To achieve this goal, the MIAF-Module performs collaborative feature fusion at the bottleneck layer and at multiple corresponding levels of the decoders of SFE-UNet and PFE-UNet. At the bottleneck layer, the global semantic contextual features from PFE-UNet are fused with the deep features of SFE-UNet to correct structural deviations in the DAS reconstruction at a macro level. Subsequently, in the decoder path, the fusion strategy is adapted for different resolution levels: at lower-resolution fusion points, contextual information from PFE-UNet is used to optimize the fundamental structure in SFE-UNet; at higher-resolution fusion points, high-frequency spatial details, edges, and texture features from PFE-UNet are injected into and reinforce the SFE-UNet feature stream to accurately recover minute structures that are blurred or lost in the DAS image.
Distinct from standard symmetric fusion mechanisms [40], the core innovation of the MIAF-Module lies in its physics-prior-guided asymmetric retrieval strategy. As illustrated in Fig. 3(b), we designed an asymmetric assignment where the SFE-UNet features ($F_{\mathrm{SFE}}$) are designated as the Query, while the PFE-UNet features ($F_{\mathrm{PFE}}$) serve as the Key and Value. This assignment is physically motivated: the DAS features provide a reliable structural skeleton, whereas the interpolated features contain rich but cluttered textural details. By projecting $F_{\mathrm{SFE}}$ as the Query, the network uses the reliable structural skeleton as an index to 'retrieve' missing high-frequency details from the interpolated stream. Mathematically, for the $h$-th attention head, this asymmetric projection is defined as:

$$Q_h = F_{\mathrm{SFE}} W_h^{Q},\qquad K_h = F_{\mathrm{PFE}} W_h^{K},\qquad V_h = F_{\mathrm{PFE}} W_h^{V} \tag{3}$$

where $W_h^{Q}$, $W_h^{K}$, and $W_h^{V}$ denote the learnable projection matrices for the Query, Key, and Value of the $h$-th head, respectively. These matrices map the input features into subspaces of dimension $d_k$ (for Query/Key) and $d_v$ (for Value).
The attention weights are computed to filter the interpolated details based on the DAS structure:

$$A_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}}\right) \tag{4}$$

Here, the scaling factor $\sqrt{d_k}$ is applied to prevent gradient vanishing in the softmax function.
The final output, representing the structure-guided details, is projected and fused back into the DAS stream via a residual connection:

$$\mathrm{head}_h = A_h V_h \tag{5}$$

$$F_{\mathrm{out}} = F_{\mathrm{SFE}} + \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^{O} \tag{6}$$

where $\mathrm{Concat}(\cdot)$ denotes the concatenation operation across the channel dimension, and $W^{O}$ is the output linear projection matrix.
This residual addition ensures that the retrieved details specifically reinforce the structural backbone, achieving precise guidance and avoiding artifact hallucination.
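The asymmetric retrieval of Eqs. (3)-(6) can be illustrated with a single-head NumPy sketch on flattened feature maps. This is a conceptual illustration under our own naming (`miaf_cross_attention` and the weight arguments are hypothetical); the actual module operates on multi-head convolutional feature tensors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def miaf_cross_attention(f_sfe, f_pfe, wq, wk, wv, wo):
    """Single-head sketch of the MIAF asymmetric retrieval:
    Query from the DAS/structural stream, Key/Value from the
    interpolated/detail stream, with a residual back into the DAS stream.

    f_sfe : (L, d) flattened SFE-UNet features (L spatial positions)
    f_pfe : (L, d) flattened PFE-UNet features
    wq,wk : (d, dk) projections; wv : (d, dv); wo : (dv, d)
    """
    q = f_sfe @ wq                                  # Eq. (3): Q from SFE
    k = f_pfe @ wk                                  #          K from PFE
    v = f_pfe @ wv                                  #          V from PFE
    dk = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(dk))           # Eq. (4): scaled softmax
    retrieved = attn @ v                            # Eq. (5): head output
    return f_sfe + retrieved @ wo                   # Eq. (6): residual fusion
```

Note how the residual formulation behaves when the detail stream carries no information: with an all-zero $F_{\mathrm{PFE}}$, the retrieved term vanishes and the structural features pass through unchanged, which is exactly the "reinforce, don't overwrite" behaviour the text describes.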
2) Global Context and Dense Feature Fusion Module (GCDF-Module)
Located at the terminal end of the DUAFF-Net architecture, the GCDF-Module employs a tailored terminal dense fusion strategy to integrate the final high-level features from the SFE-UNet and PFE-UNet outputs. This design is intended to replace simple shallow concatenation, which often leads to information fragmentation, and is specifically used to bridge the semantic gap between heterogeneous features. Its core task is to generate a unified feature representation that possesses both a global field of view and preserves complementary information from both streams, ensuring high-quality image reconstruction. Specifically, this strategy comprises three cascaded stages: First, we utilize a GCBlock to model global context on the concatenated features [41]. By recalibrating channel weights, it enhances responsiveness to long-range structural dependencies, effectively mitigating macro-vascular discontinuities caused by limited local receptive fields. Second, cascaded DenseBlocks focus on facilitating deep interaction between the dual-stream features [42]. This design ensures that the structural skeleton from DAS and the textural details from interpolated data achieve synergistic fusion, significantly outperforming simple linear concatenation in artifact suppression. Finally, we employ a convolutional bottleneck layer as a feature distiller to compress and map the high-dimensional dense features back into the residual space. This step effectively aligns the semantic features of the two data streams, ensuring the final output features are highly compact and discriminative.
Through its multi-level feature integration strategy, the GCDF-Module synergistically enhances the fidelity of micro-structural details in the image and effectively suppresses background artifacts caused by signal incompleteness, thereby significantly improving the overall final imaging quality.
2.4. Loss function
To achieve higher-quality photoacoustic image reconstruction, we employ a composite loss function to optimize the performance of the dual-stream parallel feature fusion network. This composite loss function is composed of the Smooth L1 Loss and the Mean Squared Error (MSE) Loss.
The Smooth L1 loss, through its robustness to large deviations and its precision in penalizing subtle differences, effectively guides SFE-UNet to learn how to recover a high-fidelity image that is closer to the ground truth from a preliminary reconstructed image containing artifacts.
$$\mathcal{L}_{\mathrm{SL1}} = \frac{1}{N}\sum_{i=1}^{N} \begin{cases} 0.5\,\big(y_i - \hat{y}^{\mathrm{SFE}}_i\big)^2, & \text{if } \big|y_i - \hat{y}^{\mathrm{SFE}}_i\big| < 1 \\ \big|y_i - \hat{y}^{\mathrm{SFE}}_i\big| - 0.5, & \text{otherwise} \end{cases} \tag{7}$$

where $y$ represents the ground truth image and $\hat{y}^{\mathrm{SFE}}$ represents the output reconstructed by the SFE-UNet network.
The Mean Squared Error (MSE) loss, by comparing the squared differences pixel by pixel, guides PFE-UNet to learn how to recover a photoacoustic image that is faithful to the ground truth in terms of pixel values, details, and overall structure from interpolated data that may contain incomplete or distorted information.
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N} \big(y_i - \hat{y}^{\mathrm{PFE}}_i\big)^2 \tag{8}$$

where $y$ represents the ground truth image and $\hat{y}^{\mathrm{PFE}}$ represents the output reconstructed by the PFE-UNet network.
Finally, the composite loss function is defined as:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{final}} + 0.1\,\mathcal{L}_{\mathrm{SL1}} + 0.05\,\mathcal{L}_{\mathrm{MSE}} \tag{9}$$

In the total loss function, we assign different weights to each loss term: the weight for $\mathcal{L}_{\mathrm{final}}$ is set to 1, while the weight for $\mathcal{L}_{\mathrm{SL1}}$ is 0.1, and the weight for $\mathcal{L}_{\mathrm{MSE}}$ is 0.05. These weights were determined empirically to balance the contributions of each term, enabling the model to generate reconstructed images with the highest visual plausibility and quality. $\mathcal{L}_{\mathrm{final}}$, the loss on the final network output, is considered the primary optimization target; therefore, its weight is set to 1 to make it the main driver of the optimization. $\mathcal{L}_{\mathrm{SL1}}$ and $\mathcal{L}_{\mathrm{MSE}}$ are auxiliary losses for the specific branches. Setting their weights to the relatively small values of 0.1 and 0.05 ensures that they effectively guide the learning of their corresponding branches without overly dominating or interfering with the learning process for the primary objective.
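The weighting scheme above can be sketched directly. In this hedged NumPy illustration, the choice of Smooth L1 for the final-output term is our assumption (the text does not name its form), and the function names are our own:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss of Eq. (7): quadratic for small errors,
    linear for large ones (robust to outliers)."""
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta)))

def mse(pred, target):
    """Mean squared error of Eq. (8)."""
    return float(np.mean((pred - target) ** 2))

def composite_loss(final_out, sfe_out, pfe_out, gt):
    """Weighted total loss of Eq. (9): the final-output term (assumed
    Smooth L1 here) drives optimisation with weight 1; the branch terms
    act as lightly weighted auxiliary guidance."""
    return (1.0 * smooth_l1(final_out, gt)
            + 0.1 * smooth_l1(sfe_out, gt)
            + 0.05 * mse(pfe_out, gt))
```

With identical per-branch errors of 0.5, for example, the total is dominated by the final-output term, reflecting the weighting rationale given above.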
3. Experiments
3.1. Dataset
3.1.1. Numerical simulation datasets
To evaluate the performance and generalization capability of the proposed DUAFF-Net, we conducted comprehensive experiments using publicly available datasets, including a retinal fundus vasculature dataset [43] and a brain MRI dataset [44]. The images from these datasets were used to define the initial pressure distributions in our photoacoustic tomography (PAT) simulations. Subsequently, the required photoacoustic signal data for the experiments were generated through simulations constructed using the k-Wave toolbox [45].
Specifically, the key parameters for the k-Wave simulation environment were configured as follows: The simulation employed a two-dimensional (2D) environment, with a total grid size of 128 × 128 pixels, corresponding to a computational domain of 20.8 × 20.8 mm. The number of ultrasonic transducers in the region of interest was determined by the experimental requirements. To simulate the restricted acoustic windows commonly encountered in clinical scenarios, we employed a 64-element transducer array arranged in a 180-degree semi-circle, as shown in Fig. 1(a). This specific geometry was selected to introduce realistic limited-view and sparse-sampling reconstruction challenges. The transducers were placed at a radius of 10 mm from the center of the grid. The surrounding medium was water with a density of 1000 kg/m³ and a speed of sound set to 1500 m/s. Additionally, in this study, we simulated an ideal noise-free acoustic environment. This choice was made to explicitly isolate the geometric artifacts induced by the limited-view configuration from noise-related signal degradation, allowing us to focus specifically on structural restoration.
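The limited-view acquisition geometry described above is straightforward to reproduce. A small sketch of the sensor layout follows; whether the physical array includes both arc endpoints, and the exact angular origin, are our assumptions:

```python
import numpy as np

def semicircle_array(n_elements=64, radius=0.01):
    """Transducer positions for the simulated limited-view geometry:
    n elements spread over a 180-degree arc at 10 mm (0.01 m) from the
    grid centre, matching Fig. 1(a)."""
    angles = np.linspace(0.0, np.pi, n_elements)  # endpoint inclusion assumed
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
```

Because the array spans only half the circle, wavefronts travelling away from the arc are never recorded, which is the geometric origin of the limited-view artifacts the network must remove.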
As the number of samples in the public retinal vasculature dataset was insufficient to support our experiments, we performed data augmentation to expand the training set to 1600 samples and the test set to 400 samples.
3.1.2. In vivo mouse abdomen dataset
To further validate the clinical applicability and robustness of DUAFF-Net in real-world scenarios, we utilized the publicly available in vivo mouse abdomen dataset (MSOT-Abdomen) open-sourced by Tong et al. [26]. This dataset comprises 699 cross-sectional photoacoustic scans of the mouse abdomen acquired using an MSOT inVision 128 system. To maintain consistency with the sparse sampling configuration used in our simulation experiments, the raw data was uniformly downsampled from 128 channels to 64 channels (i.e., utilizing every other transducer element). Beyond the requisite alignment of physical specifications, the preprocessing of the in vivo dataset remained entirely consistent with the methodology established for the numerical simulations.
Following the standard data partition provided in the reference, we allocated 575 samples for training and 124 samples for testing. Given the high acquisition cost and relatively limited sample size of real biological data, we adopted a transfer learning strategy to effectively facilitate the transfer and adaptation of model features from the simulation domain to the real in vivo imaging domain.
3.2. Parameter settings
Numerical simulations and in vivo experiments were conducted in this study to validate the performance of the proposed reconstruction method. The proposed method was implemented using the PyTorch framework and run on a workstation equipped with four NVIDIA TITAN Xp GPUs. To ensure a fair comparison, all models used the exact same training and test sets, and were trained and inferred on the same hardware platform.
For the numerical simulations, the network was trained for 500 epochs using the Adam optimizer. The batch size was set to 4 to accommodate the memory constraints of the GPU, and the learning rate was initialized at 0.0005 to ensure stable and efficient convergence of the training loss. All baseline methods strictly followed the hyperparameter settings from their respective original papers.
For the in vivo validation, we applied a consistent transfer learning strategy to both DUAFF-Net and all baseline models to bridge the domain gap and ensure rigorous fairness. Specifically, all models were initialized with their respective weights pre-trained on the simulated Brain MRI dataset and subsequently fine-tuned on the sparsely sampled in vivo data. This fine-tuning process was universally conducted for 200 epochs with a learning rate of 0.0001, ensuring stable convergence and optimal cross-domain adaptation while maintaining high physical fidelity in the reconstructed images.
3.3. Performance evaluation
To objectively evaluate the performance of the image reconstruction, this study employed two widely used image quality assessment metrics: the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM). PSNR measures the degree of image distortion by calculating pixel-wise errors, while SSIM focuses more on assessing the similarity in brightness, contrast, and structural information between the images.
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{10}$$

where $x$ and $y$ represent the ground truth and reconstructed images, respectively; $\mu_x$ and $\mu_y$ are the mean values of the ground truth and reconstructed images; $\sigma_x^2$ and $\sigma_y^2$ are their variances; and $\sigma_{xy}$ is their covariance. $C_1$ and $C_2$ are two constants that stabilize the division with a weak denominator.
$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2}\right) \tag{11}$$

where $x$ and $y$ represent the ground truth and reconstructed images, respectively; $\mathrm{MAX}$ is the maximum possible pixel value of the image; and $N$ is the total number of pixels in the image.
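Both metrics have straightforward NumPy reference implementations. Note that, for clarity, the SSIM below is evaluated with global image statistics; the commonly used implementation applies the same formula over local sliding windows, so values will differ slightly from library routines:

```python
import numpy as np

def psnr(gt, rec, max_val=1.0):
    """Peak Signal-to-Noise Ratio, Eq. (11)."""
    mse = np.mean((gt.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(gt, rec, max_val=1.0, k1=0.01, k2=0.03):
    """Structural Similarity Index, Eq. (10), with global statistics."""
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = gt.mean(), rec.mean()
    var_x, var_y = gt.var(), rec.var()      # sigma_x^2, sigma_y^2
    cov_xy = ((gt - mu_x) * (rec - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

For example, a reconstruction that differs from the ground truth by a uniform 0.1 offset (with `max_val = 1.0`) yields a PSNR of 20 dB, while an image compared against itself gives an SSIM of 1.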
3.4. Results comparison
To validate the effectiveness of the proposed method, we compared it against the conventional Delay-and-Sum (DAS) algorithm as well as a series of deep learning models. Specifically, the compared deep learning methods include: U-Net, which takes the initially reconstructed DAS image as input; Y-Net, a hybrid architecture that simultaneously utilizes the DAS reconstructed image and the raw time-series data as dual-channel inputs and generates the final reconstruction through a shared decoder; and AS-Net, a network that employs a multi-feature fusion strategy, fusing features extracted from the DAS image with folded time-series data and then fusing them again at the final stage to output the reconstructed image. In addition, we also included SwinIR [46] for comparison, a recent Transformer-based model that has demonstrated powerful restoration capabilities in the field of photoacoustic reconstruction.
3.4.1. Simulated vasculature
In this section, we qualitatively and quantitatively evaluate the reconstruction performance of DUAFF-Net and various competing methods on the retinal vasculature dataset. As shown in Fig. 4, DUAFF-Net demonstrates visual results that most closely resemble the ground truth (GT) image in terms of detail reconstruction, accuracy, and artifact suppression for the vascular network. The results from the conventional DAS method are blurry, with an almost complete loss of vascular structural information. Although U-Net and Y-Net improve on this and can recover the main vascular contours, they exhibit clear and consistent deficiencies in fine structures. This issue is particularly prominent in the regions marked by red boxes in Fig. 4. In sample one, the vessel edges reconstructed by U-Net are blurred and the terminals are discontinuous, while in the Y-Net result, fine vessels are nearly invisible due to severely insufficient contrast. The same defects reappear in sample two, where U-Net fails to reconstruct the complete microvascular network, leading to structural disconnections and information loss, while the Y-Net result is even dimmer, with an almost complete loss of detail in the region. In contrast, DUAFF-Net clearly and completely reconstructs the vascular branches and sharp edges in both of these critical regions, showcasing its significant advantage in detail fidelity.

Fig. 4.

Comparison of reconstruction results from different methods on two representative retinal vascular image samples. The first row shows the results for sample one, and the second row for sample two. The columns from left to right are: Ground Truth (GT), DAS, U-Net, SwinIR, Y-Net, AS-Net, and the proposed DUAFF-Net.
Compared to advanced models like SwinIR and AS-Net, the advantage of DUAFF-Net lies in its higher fidelity and accuracy. In the blue-boxed region of sample one, SwinIR introduces faint line artifacts not present in the GT. In the yellow-boxed region of sample two, AS-Net not only fails to fully recover the microvasculature but also erroneously generates non-existent vascular connections. DUAFF-Net, however, achieves accurate reconstruction in these key areas, preserving fine structures while avoiding such erroneous generations.
The significant visual advantages are strongly supported by objective evaluation metrics. Looking at specific samples, DUAFF-Net achieves the highest SSIM and PSNR values on both sample one (SSIM: 0.877, PSNR: 21.109 dB) and sample two (SSIM: 0.912, PSNR: 21.216 dB). To evaluate the overall performance and robustness of the models, Table 1 summarizes the average metrics for all methods on the entire vasculature test set. The data clearly show that DUAFF-Net's average SSIM (0.9014 ± 0.0379) and average PSNR (20.6815 ± 2.5049 dB) are significantly superior to all competing methods. This series of qualitative and quantitative results collectively demonstrates the superior performance of DUAFF-Net in the task of limited-view photoacoustic image reconstruction of vasculature.
Table 1.
Quantitative comparison of different methods on the vessel test set (mean ± std).
| Algorithm | SSIM | PSNR |
|---|---|---|
| DAS | 0.100 ± 0.033 | 12.030 ± 1.957 |
| U-Net | 0.782 ± 0.060 | 18.154 ± 2.432 |
| Y-Net | 0.784 ± 0.056 | 17.371 ± 1.874 |
| SwinIR | 0.857 ± 0.048 | 20.397 ± 2.455 |
| AS-Net | 0.888 ± 0.037 | 19.705 ± 2.423 |
| DUAFF-Net | 0.901 ± 0.037 | 20.681 ± 2.504 |
3.4.2. Simulated brain structure
In the more challenging task of brain structure reconstruction, DUAFF-Net again demonstrates comprehensive performance advantages, with the evaluation results shown in Fig. 5 and Table 2. The visual quality assessment (Fig. 5) intuitively reveals the performance hierarchy of the different methods. The conventional DAS method is almost unusable due to severe radial artifacts. Although U-Net and Y-Net can suppress some artifacts, they do so at the cost of severe image blurring and the loss of key anatomical details. SwinIR and AS-Net perform better, reconstructing relatively clear brain contours, but there is still a gap compared to the ground truth (GT) image.
Fig. 5.
Comparison of reconstruction results from different methods on brain structures. From left to right: Ground Truth (GT), DAS, U-Net, SwinIR, Y-Net, AS-Net, and the proposed DUAFF-Net.
Table 2.
Quantitative comparison of different methods on the brain test set (mean ± std).
| Algorithm | SSIM | PSNR |
|---|---|---|
| DAS | 0.174 ± 0.030 | 8.650 ± 1.957 |
| U-Net | 0.755 ± 0.052 | 24.004 ± 1.697 |
| SwinIR | 0.788 ± 0.041 | 25.662 ± 2.020 |
| Y-Net | 0.792 ± 0.051 | 24.581 ± 1.415 |
| AS-Net | 0.834 ± 0.047 | 26.988 ± 2.348 |
| DUAFF-Net | 0.861 ± 0.044 | 27.025 ± 2.078 |
Among all methods, the reconstruction result from DUAFF-Net is visually closest to the GT. It not only exhibits the highest image clarity but also accurately restores natural fine structures. Particularly in terms of clear gray-white matter boundaries and effective artifact suppression, its visual results are significantly superior to all competing methods.
This visual superiority is strongly supported by the quantitative evaluation results (Table 2). DUAFF-Net achieves the highest average SSIM (0.861 ± 0.044) and PSNR (27.025 ± 2.078 dB). Notably, its SSIM value is significantly higher than that of the suboptimal AS-Net (0.834), which strongly demonstrates DUAFF-Net's superior ability to accurately preserve the complex anatomical structures of the brain—a finding that is perfectly consistent with the clear boundaries and structural details we observed in Fig. 5. In terms of PSNR, both DUAFF-Net and AS-Net (26.988 dB) delivered top-tier performance, far surpassing the other deep learning models and indicating their excellent performance in high signal fidelity and noise suppression.
As shown in Table 3, although our DUAFF-Net (29.81 M) has more parameters than SwinIR and AS-Net, it demonstrates a significant advantage in training efficiency. Our training time (4.45 h) is markedly faster than that of SwinIR (10.97 h) and AS-Net (5.99 h), which suggests our model converges more quickly.
Table 3.
Computational cost comparison of different methods.
| Method | U-Net | Y-Net | SwinIR | AS-Net | DUAFF-Net |
|---|---|---|---|---|---|
| Parameters (M) | 1.96 | 11.24 | 16.41 | 5.89 | 29.81 |
| Training Time (h) | 0.12 | 3.48 | 10.97 | 5.99 | 4.45 |
| Inference Time (ms) | 6.25 | 11.98 | 86.15 | 28.15 | 62.13 |
Furthermore, while the inference time of DUAFF-Net is not as fast as AS-Net's, our model is superior to AS-Net in terms of both performance metrics (as shown in Table 1 and Table 2) and training time. This demonstrates that our proposed architecture strikes an excellent balance between SOTA performance and training efficiency.
3.4.3. In vivo mouse abdomen
To validate the generalization capability of DUAFF-Net in real-world biological scenarios, we conducted a cross-domain evaluation on the in vivo mouse abdomen dataset, employing the transfer learning strategy detailed in Section 3.2. As presented in Fig. 6 and Table 4, while the conventional DAS reconstruction is degraded by sparse-sampling artifacts and baseline deep learning models generally suffer from over-smoothing, DUAFF-Net demonstrates superior performance. Visually, it effectively eliminates streak artifacts and recovers the clearest anatomical boundaries; quantitatively, it achieves the highest metrics (SSIM: 0.9077, PSNR: 29.67 dB). These results confirm that, leveraging the transfer learning strategy, DUAFF-Net successfully learns robust physical features transferable to the complex acoustic environments of real biological tissues.
Fig. 6.
Comparison of reconstruction results from different methods on the in vivo mouse abdomen dataset. From left to right: Ground Truth (GT), DAS, U-Net, SwinIR, Y-Net, AS-Net, and the proposed DUAFF-Net.
Table 4.
Quantitative comparison of different methods on the in vivo mouse abdomen dataset.
| Algorithm | DAS | U-Net | SwinIR | Y-Net | AS-Net | DUAFF-Net |
|---|---|---|---|---|---|---|
| SSIM | 0.1851 | 0.8505 | 0.8639 | 0.8704 | 0.8882 | 0.9077 |
| PSNR | 8.00 | 27.39 | 28.08 | 28.81 | 24.99 | 29.67 |
3.5. Ablation study
To evaluate the contributions of the key components in our network architecture, we conducted systematic ablation studies, focusing on the two core parts: feature extraction and feature fusion.
To evaluate the effectiveness of the Structural Feature Extraction Network (SFE-UNet) and the Pixel-level Feature Extraction Network (PFE-UNet), we established two baseline U-Nets for comparison. The first baseline (U-Net I) uses a standard U-Net architecture to directly process the raw Delay-and-Sum (DAS) images. The second baseline (U-Net II) also uses a standard U-Net architecture, but its input is the data pre-processed with pixel-wise interpolation. This baseline was specifically designed to evaluate how a standard network performs when processing only the stack of pixel-wise interpolated data. Through these comparisons, we aim to clearly demonstrate the advantages of SFE-UNet and PFE-UNet in their specific feature representation capabilities.
To dissect our proposed two-stage fusion strategy, we designed systematic comparative experiments. We established a baseline configuration called End Simple Fusion (ESF), which only performs a terminal fusion of the dual-stream features via simple concatenation and a 1 × 1 convolution, to measure the baseline performance of the dual-stream architecture. Second, we individually evaluated the independent contributions of the two core modules: one setup replaced ESF with only our designed GCDF-Module for advanced terminal fusion; another setup introduced the MIAF-Module on top of the ESF baseline to achieve early feature interaction. Finally, we combined both to form the complete DUAFF-Net, to validate the final effect of the MIAF and GCDF modules working in synergy.
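The ESF baseline above (channel concatenation followed by a 1 × 1 convolution) can be sketched framework-agnostically in NumPy, since a 1 × 1 convolution is just a per-pixel linear map over channels. The shapes and channel counts below are illustrative assumptions, not the paper's exact dimensions:

```python
import numpy as np

def esf_fuse(feat_a, feat_b, weight, bias):
    """End Simple Fusion sketch: concatenate two (C, H, W) feature maps
    along channels, then apply a 1x1 convolution with weight (C_out, 2C)
    and bias (C_out,)."""
    stacked = np.concatenate([feat_a, feat_b], axis=0)          # (2C, H, W)
    # 1x1 convolution == independent linear mixing of channels per pixel.
    fused = np.einsum('oc,chw->ohw', weight, stacked)
    return fused + bias[:, None, None]

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 32, 32))   # features from the DAS stream
b = rng.standard_normal((16, 32, 32))   # features from the interpolation stream
w = rng.standard_normal((8, 32))
out = esf_fuse(a, b, w, np.zeros(8))
```

The MIAF and GCDF configurations replace or supplement this terminal mixing with learned early interaction and global-context fusion, respectively.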
The experimental results in Table 5 clearly demonstrate the effectiveness of the improved design of SFE-UNet. Compared to U-Net I, which only uses a standard U-Net to process the DAS input, SFE-UNet, which integrates the MSRA-Module and ECA-Attention, achieves a significant improvement in the SSIM metric, reaching 0.837 ± 0.047, while the PSNR also increased to 18.952 ± 2.371. The significant increase in SSIM indicates that SFE-UNet can more accurately reconstruct the structural information of the vasculature. The visual comparison in Fig. 7 further validates this point; the vessels reconstructed by SFE-UNet exhibit superior overall structural clarity and continuity compared to U-Net I, more closely resembling the ground truth image. Similarly, in processing the pixel-wise interpolated data, PFE-UNet achieves a substantial performance leap in both SSIM and PSNR compared to U-Net II. This fully demonstrates that the specific designs within PFE-UNet, such as SK-Attention, the enhanced bottleneck layer, and ECA-Attention, are crucial for effectively learning and extracting fine-grained features from the multi-channel mapped data. The visual results in Fig. 7 show that PFE-UNet can initially recover vascular structures from the pixel-wise interpolated data, with a quality far superior to that of U-Net II, which failed to generate a coherent image, showcasing its powerful feature learning capabilities.
Table 5.
Ablation study on the performance of feature extraction networks (mean ± std).
| Model | SSIM | PSNR |
|---|---|---|
| U-NetⅠ | 0.782 ± 0.060 | 18.154 ± 2.432 |
| U-NetⅡ | 0.442 ± 0.112 | 12.776 ± 2.300 |
| SFE-UNet | 0.837 ± 0.047 | 18.952 ± 2.371 |
| PFE-UNet | 0.742 ± 0.073 | 16.317 ± 2.103 |
Fig. 7.
Impact of feature extraction networks on vascular reconstruction in ablation experiments. From left to right: Ground Truth (GT), U-NetⅠ, SFE-UNet, U-NetⅡ, PFE-UNet, and the proposed DUAFF-Net.
Next, we further evaluate the effectiveness of different fusion strategies through Table 6 and Fig. 8. First, a key observation is that even with the simplest fusion strategy, ESF (second row of Table 6), combining the features extracted by SFE-UNet and PFE-UNet yields SSIM and PSNR metrics that are significantly superior to any single-branch network shown in Table 5. This strongly validates that our proposed dual-source input itself is a highly promising strategy that provides a fundamental performance boost. Compared to ESF, using only the GCDF-Module for terminal deep fusion (first row of Table 6) increases the SSIM to 0.8770 and the PSNR to 19.7324 dB. This indicates that GCDF can more effectively integrate high-level features through its global context modeling and dense feature refinement. Subsequently, when we introduce the MIAF-Module between SFE-UNet and PFE-UNet for early multi-level feature interaction (supplemented by ESF for terminal fusion), the SSIM significantly improves to 0.8904 and the PSNR to 20.2950 dB. This highlights the critical role of the MIAF-Module in promoting early information complementarity and guidance between the parallel branches, providing higher-quality features for subsequent fusion. Finally, our proposed complete DUAFF-Net architecture, which integrates the early interaction of the MIAF-Module with the terminal deep fusion of the GCDF-Module, achieves the best performance among all configurations, as shown in the fourth row of Table 6. As depicted in Fig. 8, the visual reconstruction results of DUAFF-Net are optimal in terms of the clarity, completeness, and artifact suppression of the vascular network. This fully demonstrates the superiority of the synergistic work of the MIAF and GCDF modules, which effectively maximizes the complementary advantages of the dual-information streams.
Table 6.
Ablation study on the effectiveness of different feature fusion strategies.
| MIAF | ESF | GCDF | SSIM | PSNR |
|---|---|---|---|---|
| × | × | √ | 0.8770 | 19.7324 |
| × | √ | × | 0.8658 | 19.6524 |
| √ | √ | × | 0.8904 | 20.2950 |
| √ | × | √ | 0.9014 | 20.6815 |
Fig. 8.
Impact of feature fusion strategies on vascular reconstruction in ablation experiments. From left to right: Ground Truth (GT), End Simple Fusion (ESF), GCDF-Module, MIAF-Module + ESF, and the proposed DUAFF-Net.
To further isolate the independent contributions of our proposed modules, we conducted an ablation study utilizing a simple U-Net backbone. As detailed in Table 7, the baseline U-Net I (processing a single DAS input) serves as the standard U-Net. U-Net+MSRA, which added the MSRA-Module (a single-stream module) on top of it, yielded respective SSIM and PSNR gains of 5.9 % and 3.8 %.
Table 7.
Ablation study on the efficacy of different modules applied to a simple U-Net backbone.
| Method | U-NetⅠ | U-Net+MSRA | U-Net+MIAF | U-Net+GCDF |
|---|---|---|---|---|
| SSIM | 0.782 | 0.828 | 0.852 | 0.844 |
| PSNR | 18.154 | 18.850 | 19.501 | 19.137 |
In contrast, our proposed MIAF and GCDF are fusion modules designed to intelligently fuse dual-stream data (DAS and pixel-wise interpolation). To test them fairly in a simplified environment, we constructed a parallel dual U-Net backbone—one U-Net processing the DAS input and the other processing the pixel-wise interpolation input—and then used our proposed modules to fuse the features from these two streams.
As shown in Table 7, the SSIM and PSNR for the U-Net+MIAF fusion strategy increased substantially by 9.0 % and 7.4 % compared to the U-Net I baseline, while the U-Net+GCDF strategy also yielded gains of 7.9 % and 5.4 %. This result objectively shows that, even on a simplified U-Net backbone, the gains on key metrics from our two proposed dual-stream fusion strategies (MIAF and GCDF) are significantly higher than those from the single-stream enhancement module (MSRA).
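The percentage gains quoted here can be reproduced directly from the Table 7 values as relative improvements over the U-Net I baseline (a simple arithmetic check; the variable names are our own):

```python
def rel_gain_pct(variant, baseline):
    # Relative improvement of a variant over the baseline, in percent.
    return (variant / baseline - 1.0) * 100.0

SSIM_BASE, PSNR_BASE = 0.782, 18.154            # U-Net I baseline (Table 7)
miaf_ssim = rel_gain_pct(0.852, SSIM_BASE)      # U-Net+MIAF SSIM gain
miaf_psnr = rel_gain_pct(19.501, PSNR_BASE)     # U-Net+MIAF PSNR gain
gcdf_ssim = rel_gain_pct(0.844, SSIM_BASE)      # U-Net+GCDF SSIM gain
gcdf_psnr = rel_gain_pct(19.137, PSNR_BASE)     # U-Net+GCDF PSNR gain
```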
The visual results in Fig. 9 intuitively verify this point. Compared with the baseline U-Net I and U-Net + MSRA, our U-Net + GCDF and U-Net + MIAF can reconstruct clearer and more continuous vascular networks, and effectively suppress artifacts, which is consistent with the quantitative results in Table 7.
Fig. 9.
Visual comparison of the module ablation study on the U-Net backbone. From left to right: Ground Truth (GT); Baseline U-Net I (single DAS input); U-Net + MSRA (single-stream comparison module); U-Net + GCDF (dual-stream fusion module); U-Net + MIAF (dual-stream fusion module).
4. Discussion
The comprehensive experimental evaluation results of this study clearly indicate that the proposed DUAFF-Net demonstrates significant superiority over the conventional DAS algorithm and several advanced deep learning models (including Y-Net, SwinIR, and AS-Net). This advantage is evident in both key quantitative evaluation metrics (Table 1, Table 2, Table 4) and visual reconstruction quality (Fig. 4, Fig. 5, Fig. 6) for the tasks of retinal vasculature reconstruction, brain MRI photoacoustic reconstruction, and in vivo mouse abdomen imaging. The superior performance of DUAFF-Net is primarily attributed to several key innovative designs in its architecture, the effectiveness of which has been further validated through systematic ablation studies (see Table 5, Table 6, and Fig. 7, Fig. 8).
Furthermore, this ablation study in Table 7 and Fig. 9 successfully disentangles module contributions from the backbone design. It first validates MSRA as an effective single-stream enhancement strategy (outperforming U-Net I), and then compellingly demonstrates that our MIAF/GCDF fusion modules represent a superior dual-stream strategy. The results confirm that, even on a simple U-Net backbone, our fusion modules provide performance gains that significantly surpass both the standard baseline and the single-stream MSRA module.
First, the core MSRA-Module within SFE-UNet is crucial for enhancing the feature representation capability for DAS images. As shown in Table 5, SFE-UNet, which employs the MSRA-Module, achieves a significant improvement in the SSIM metric from 0.782 to 0.837 compared to the standard U-Net I. The core advantage of the MSRA-Module lies in its multi-scale design, which can effectively decouple artifacts and true structures of different sizes within the DAS image, while the CBAM attention mechanism guides the network to focus on information-rich regions. This provides high-quality feature maps containing reliable macro-structural priors for the subsequent feature fusion, which is fundamental to DUAFF-Net's ability to accurately reconstruct brain contours and the paths of major vessels.
Second, the most innovative component of our framework is the MIAF-Module, which establishes an information pathway between the two streams early in the network. The ablation study (Table 6) confirms that introducing the MIAF-Module brings a significant performance gain (SSIM improving from 0.8658 to 0.8904). Its success can be attributed to two points: First, at the bottleneck layer, it injects the global context-rich features from PFE-UNet into SFE-UNet, effectively correcting macro-structural deviations caused by the limited-view acquisition. Second, in the decoder, MIAF enhances reconstruction fidelity and effectively avoids artifact generation by injecting high-frequency details from PFE-UNet into SFE-UNet, which also improves image clarity. The green-boxed region in sample one of Fig. 4 clearly illustrates this, where several competing methods erroneously reconstruct artifacts not present in the GT, a problem DUAFF-Net does not exhibit. This fully demonstrates that the MIAF-guided feature fusion allows the model to more robustly learn true structures rather than hallucinating details.
At the end of the network, the GCDF-Module is responsible for the final intelligent integration of the deeply optimized features from both streams. Compared to simple feature concatenation, the application of the GCDF-Module brings a clear performance improvement (Table 6, SSIM increasing from 0.8658 to 0.8770). We believe that its built-in GCBlock ensures the overall consistency of the reconstructed structures by establishing global dependencies, while the DenseBlock performs deep refinement of the fused features through dense feature reuse. As shown in the yellow-boxed region of sample two in Fig. 4, DUAFF-Net is able to avoid the erroneous connections generated by AS-Net, which intuitively demonstrates the superiority of the GCDF-Module in final feature arbitration and refinement, thereby ensuring the high fidelity of the output image.
To address the physical domain gap between 2D simulations and real 3D tissues, and to mitigate the risk of overfitting, we conducted a rigorous cross-domain validation using in vivo mouse experiments (Section 3.4.3). By employing the specific parameter configurations and transfer learning strategy tailored for the in vivo data (Section 3.2), DUAFF-Net quantitatively outperformed baseline methods on the tested in vivo dataset. Importantly, this result does not imply that 2D models are physically equivalent to 3D modeling; rather, it empirically demonstrates that transfer learning adaptation serves as a feasible and effective pathway to bridge the simulation-to-reality gap.
However, we acknowledge that limitations remain. Although incorporating real noise distributions via transfer learning partially mitigates the inherent risk of deep learning hallucination, future clinical deployment will require integrating uncertainty quantification techniques. Furthermore, while the current work focuses on cross-sectional tomography, extending the DUAFF-Net architecture to support native 3D reconstruction remains a critical direction to fully capture 3D wave propagation characteristics (e.g., geometric attenuation). Future work will also focus on verifying the method on larger-scale clinical datasets with greater pathological diversity to ensure broad generalizability. Finally, the computational complexity of DUAFF-Net is higher than that of more lightweight models; exploring model compression or acceleration methods will be an important direction for subsequent research.
5. Conclusion
Focusing on the core challenges in the field of photoacoustic tomography (PAT) reconstruction, this study successfully proposes and validates a novel deep learning framework, DUAFF-Net, which aims to significantly enhance the reconstruction fidelity and detail clarity of photoacoustic images. The core of DUAFF-Net lies in its innovative feature fusion mechanism, which effectively combines the macro-structural information provided by the initial reconstruction from the conventional Delay-and-Sum (DAS) algorithm with fine-grained features deeply extracted from pixel-wise interpolated data, thereby achieving a synergistic capture of both macro-structures and micro-details.
Extensive experiments and systematic ablation studies based on simulated photoacoustic data of retinal vasculature and brain structures, as well as real-world in vivo mouse abdomen data, have strongly demonstrated that, especially under challenging conditions such as limited-view and sparse data acquisition, the proposed DUAFF-Net is significantly superior to the conventional DAS algorithm and several advanced deep learning models, including Y-Net and SwinIR, in terms of both reconstruction quality and various performance metrics (such as SSIM and PSNR). Notably, the in vivo results empirically verify that, empowered by transfer learning, DUAFF-Net possesses strong generalization capabilities to bridge the domain gap between simulation and real biological tissues. Furthermore, systematic ablation studies have verified the effectiveness of the core innovative designs proposed in DUAFF-Net. The results clearly indicate that all components, including the multi-scale SFE-UNet optimized for DAS, the PFE-UNet for processing specialized interpolated data, and the key fusion modules such as the MIAF for early interaction and the GCDF for deep integration, have made clear and positive contributions to improving the quality of the final reconstructed image. The synergy of these components resulted in significant performance gains over the single-stream baseline (U-Net I), with cumulative improvements of up to 14 % in PSNR and 15 % in SSIM. This fully validates the advanced nature of our proposed method's overall architecture and the soundness and efficiency of each module's design.
In summary, DUAFF-Net stands as an efficient and robust framework for photoacoustic image reconstruction, offering an effective solution to overcome common challenges in the current PAT field, such as artifacts and loss of detail. This research provides a valuable reference for the development of high-performance photoacoustic imaging technology and its related applications. In the future, we plan to further validate the generalization capability of DUAFF-Net on a broader range of diverse pathological clinical data, extend the DUAFF-Net framework to a 3D architecture, and explore corresponding model lightweighting methods to enhance its practical utility.
CRediT authorship contribution statement
Yixin Lai: Writing – review & editing, Writing – original draft, Visualization, Methodology, Investigation, Formal analysis, Conceptualization. Zhengnan Yin: Writing – review & editing, Validation, Methodology, Conceptualization. Qiong Zhang: Writing – review & editing, Supervision, Resources, Project administration, Funding acquisition, Conceptualization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by The Science and Technology Planning Project of Guangdong Province (No. 2018A0303070017).
Biographies

Yixin Lai received his bachelor's degree from Lingnan Normal University in 2023. Currently, he is pursuing a Master's degree in Electronic Information at Shantou University. His main research interests include photoacoustic imaging and image reconstruction.

Qiong Zhang received her Ph.D. degree in biomedical engineering from the University of Science and Technology of China in 2012. Currently she works as associate professor in Shantou University Medical College. Her research interests include medical image processing, ultrasound signal processing, photoacoustic imaging.

Zhengnan Yin received an M.S. degree in physics from Shantou University, Guangdong, China, in 2024. His research interests include low level computer vision, multimodal interaction and learning.
Data availability
Data will be made available on request.
References
- 1.Wang L.V. Multiscale photoacoustic microscopy and computed tomography. Nat. Photonics. 2009;3:503–509. doi: 10.1038/nphoton.2009.157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Zhou Y., Yao J., Wang L.V. Tutorial on photoacoustic tomography. J. Biomed. Opt. 2016;21 doi: 10.1117/1.JBO.21.6.061007. 061007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Beard P. Biomedical photoacoustic imaging. Interface Focus. 2011;1:602–631. doi: 10.1098/rsfs.2011.0028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Cox B.T., Kara S., Arridge S.R., Beard P.C. k-space propagation models for acoustically heterogeneous media: application to biomedical photoacoustics. J. Acoust. Soc. Am. 2007;121:3453–3464. doi: 10.1121/1.2717409. [DOI] [PubMed] [Google Scholar]
- 5.Xu Y., Wang L.V. Time reversal and its application to tomography with diffracting sources. Phys. Rev. Lett. 2004;92 doi: 10.1103/PhysRevLett.92.033902. [DOI] [PubMed] [Google Scholar]
- 6.Kalva S.K., Pramanik M. Photons Plus Ultrasound: Imaging and Sensing 2017. SPIE; 2017. Modified delay-and-sum reconstruction algorithm to improve tangential resolution in photoacoustic tomography; pp. 569–576. [Google Scholar]
- 7.Matrone G., Savoia A.S., Caliano G., Magenes G. The delay multiply and sum beamforming algorithm in ultrasound b-mode medical imaging. IEEE Trans. Med. Imaging. 2014;34:940–949. doi: 10.1109/TMI.2014.2371235. [DOI] [PubMed] [Google Scholar]
- 8.Davoudi N., Deán-Ben X.L., Razansky D. Deep learning optoacoustic tomography with sparse data. Nat. Mach. Intell. 2019;1:453–460. [Google Scholar]
- 9.Huang B., Xia J., Maslov K., Wang L.V. Improving limited-view photoacoustic tomography with an acoustic reflector. J. Biomed. Opt. 2013;18 doi: 10.1117/1.JBO.18.11.110505. 110505. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Meng J., Jiang Z., Wang L.V., Park J., Kim C., Sun M., Zhang Y., Song L. High-speed, sparse-sampling three-dimensional photoacoustic computed tomography in vivo based on principal component analysis. J. Biomed. Opt. 2016;21 doi: 10.1117/1.JBO.21.7.076007. 076007. [DOI] [PubMed] [Google Scholar]
- 11.Zhang S., Miao J., Li L.S. Challenges and advances in two-dimensional photoacoustic computed tomography: a review. J. Biomed. Opt. 2024;29 doi: 10.1117/1.JBO.29.7.070901. 070901. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Li H., Schwab J., Antholzer S., Haltmeier M. NETT: solving inverse problems with deep neural networks. Inverse Probl. 2020;36.
- 13. Yang C., Lan H., Gao F., Gao F. Review of deep learning for photoacoustic imaging. Photoacoustics. 2021;21:100215. doi: 10.1016/j.pacs.2020.100215.
- 14. Antholzer S., Haltmeier M., Schwab J. Deep learning for photoacoustic tomography from sparse data. Inverse Probl. Sci. Eng. 2019;27:987–1005. doi: 10.1080/17415977.2018.1518444.
- 15. Dehner C., Olefir I., Chowdhury K.B., Jüstel D., Ntziachristos V. Deep-learning-based electrical noise removal enables high spectral optoacoustic contrast in deep tissue. IEEE Trans. Med. Imaging. 2022;41:3182–3193. doi: 10.1109/TMI.2022.3180115.
- 16. Awasthi N., Jain G., Kalva S.K., Pramanik M., Yalavarthy P.K. Deep neural network-based sinogram super-resolution and bandwidth enhancement for limited-data photoacoustic tomography. IEEE Trans. Ultrason. Ferroelectr. Freq. Control. 2020;67:2660–2673. doi: 10.1109/TUFFC.2020.2977210.
- 17. Cai C., Deng K., Ma C., Luo J. End-to-end deep neural network for optical inversion in quantitative photoacoustic imaging. Opt. Lett. 2018;43:2752–2755. doi: 10.1364/OL.43.002752.
- 18. Lan H., Yang C., Gao F. A jointed feature fusion framework for photoacoustic image reconstruction. Photoacoustics. 2023;29:100442. doi: 10.1016/j.pacs.2022.100442.
- 19. Wang R., Zhu J., Meng Y., Wang X., Chen R., Wang K., Li C., Shi J. Adaptive machine learning method for photoacoustic computed tomography based on sparse array sensor data. Comput. Methods Programs Biomed. 2023;242:107822. doi: 10.1016/j.cmpb.2023.107822.
- 20. Huo H., Deng H., Gao J., Duan H., Ma C. Mitigating under-sampling artifacts in 3D photoacoustic imaging using Res-UNet based on digital breast phantom. Sensors. 2023;23:6970. doi: 10.3390/s23156970.
- 21. Xie Y., Wu D., Wang X., Wen Y., Zhang J., Yang Y., Chen Y., Wu Y., Chi Z., Jiang H. Image enhancement method for photoacoustic imaging of deep brain tissue. Photonics. MDPI; 2023. p. 31.
- 22. Li B., Lu M., Zhou T., Bu M., Gu W., Wang J., Zhu Q., Liu X., Ta D. Removing artifacts in transcranial photoacoustic imaging with polarized self-attention dense-UNet. Ultrasound Med. Biol. 2024;50:1530–1543. doi: 10.1016/j.ultrasmedbio.2024.06.006.
- 23. Zhao X., Hu S., Yang Q., Zhang Z., Guo Q., Niu C. Lightweight sparse optoacoustic image reconstruction via an attention-driven multi-scale wavelet network. Photoacoustics. 2025:100695. doi: 10.1016/j.pacs.2025.100695.
- 24. Waibel D., Gröhl J., Isensee F., Kirchner T., Maier-Hein K., Maier-Hein L. Reconstruction of initial pressure from limited view photoacoustic images using deep learning. In: Photons Plus Ultrasound: Imaging and Sensing 2018. SPIE; 2018. pp. 196–203.
- 25. Feng J., Deng J., Li Z., Sun Z., Dou H., Jia K. End-to-end Res-UNet based reconstruction algorithm for photoacoustic imaging. Biomed. Opt. Express. 2020;11:5321–5340. doi: 10.1364/BOE.396598.
- 26. Tong T., Huang W., Wang K., He Z., Yin L., Yang X., Zhang S., Tian J. Domain transform network for photoacoustic tomography from limited-view and sparsely sampled data. Photoacoustics. 2020;19:100190. doi: 10.1016/j.pacs.2020.100190.
- 27. Guan S., Khan A.A., Sikdar S., Chitnis P.V. Fully dense UNet for 2-D sparse photoacoustic tomography artifact removal. IEEE J. Biomed. Health Inform. 2019;24:568–576. doi: 10.1109/JBHI.2019.2912935.
- 28. Deng H., Qiao H., Dai Q., Ma C. Deep learning in photoacoustic imaging: a review. J. Biomed. Opt. 2021;26(4):040901. doi: 10.1117/1.JBO.26.4.040901.
- 29. Guan S., Khan A.A., Sikdar S., Chitnis P.V. Limited-view and sparse photoacoustic tomography for neuroimaging with deep learning. Sci. Rep. 2020;10:8510. doi: 10.1038/s41598-020-65235-2.
- 30. Lan H., Jiang D., Yang C., Gao F., Gao F. Y-Net: hybrid deep learning image reconstruction for photoacoustic tomography in vivo. Photoacoustics. 2020;20:100197. doi: 10.1016/j.pacs.2020.100197.
- 31. Tian C., Shen K., Dong W., Gao F., Wang K., Li J., Liu S., Feng T., Liu C., Li C., et al. Image reconstruction from photoacoustic projections. Photonics Insights. 2024;3(3):R06.
- 32. Sun C., Salimi Y., Angeliki N., Boudabbous S., Zaidi H. An efficient dual-domain deep learning network for sparse-view CT reconstruction. Comput. Methods Programs Biomed. 2024;256:108376. doi: 10.1016/j.cmpb.2024.108376.
- 33. Ge R., He Y., Xia C., Sun H., Zhang Y., Hu D., Chen S., Chen Y., Li S., Zhang D. DDPNet: a novel dual-domain parallel network for low-dose CT reconstruction. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2022. pp. 748–757.
- 34. Wu W., Hu D., Niu C., Yu H., Vardhanabhuti V., Wang G. DRONE: dual-domain residual-based optimization network for sparse-view CT reconstruction. IEEE Trans. Med. Imaging. 2021;40(11):3002–3014. doi: 10.1109/TMI.2021.3078067.
- 35. Guo M., Lan H., Yang C., Liu J., Gao F. AS-Net: fast photoacoustic reconstruction with multi-feature fusion from sparse data. IEEE Trans. Comput. Imaging. 2022;8:215–223.
- 36. Wang Q., Wu B., Zhu P., Li P., Zuo W., Hu Q. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 11534–11542.
- 37. Woo S., Park J., Lee J.Y., Kweon I.S. CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. pp. 3–19.
- 38. Li X., Wang W., Hu X., Yang J. Selective kernel networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. pp. 510–519.
- 39. Zhao H., Jia J., Koltun V. Exploring self-attention for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 10076–10085.
- 40. Huang Z., Wang X., Huang L., Huang C., Wei Y., Liu W. CCNet: criss-cross attention for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. pp. 603–612.
- 41. Cao Y., Xu J., Lin S., Wei F., Hu H. GCNet: non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 2019.
- 42. Huang G., Liu Z., Van Der Maaten L., Weinberger K.Q. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 4700–4708.
- 43. Staal J., Abràmoff M.D., Niemeijer M., Viergever M.A., Van Ginneken B. Ridge-based vessel segmentation in color images of the retina. IEEE Trans. Med. Imaging. 2004;23:501–509. doi: 10.1109/TMI.2004.825627.
- 44. Clark K., Vendt B., Smith K., Freymann J., Kirby J., Koppel P., Moore S., Phillips S., Maffitt D., Pringle M., et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J. Digit. Imaging. 2013;26:1045–1057. doi: 10.1007/s10278-013-9622-7.
- 45. Treeby B.E., Cox B.T. k-Wave: MATLAB toolbox for the simulation and reconstruction of photoacoustic wave fields. J. Biomed. Opt. 2010;15:021314. doi: 10.1117/1.3360308.
- 46. Shijo V., Vu T., Yao J., Xu W., Xia J. SwinIR for photoacoustic computed tomography artifact reduction. In: 2023 IEEE International Ultrasonics Symposium (IUS). IEEE; 2023. pp. 1–4.
Data Availability Statement
Data will be made available on request.