Sensors (Basel, Switzerland)
. 2026 Jan 22;26(2):738. doi: 10.3390/s26020738

CHARMS: A CNN-Transformer Hybrid with Attention Regularization for MRI Super-Resolution

Xia Li 1, Haicheng Sun 1, Tie-Qiang Li 2,3,4,*
Editor: Alexander Wong
PMCID: PMC12845672  PMID: 41600530


Highlights

What are the main findings?

  • CHARMS, a lightweight CNN-Transformer hybrid (~1.9 M parameters, ~30 GFLOPs), outperforms state-of-the-art lightweight MRI super-resolution models (EDSR, PAN, W2AMSN-S, and FMEN) by 0.1–0.6 dB PSNR and up to 1% SSIM at ×2/×4 upscaling while reducing inference time by ~40%.

  • With cross-field fine-tuning on only twenty subjects, CHARMS upgrades clinical 3T MRI to near-7T quality, yielding ~6 dB PSNR and 0.12 SSIM gains over native 3T scans across T1w/T2w contrasts.

What are the implications of the main findings?

  • Near-real-time performance (~11 ms/slice enabling ~1.6–1.9 s processing per 3D brain volume on RTX 4090) and small model size enable practical deployment in clinical workstations, online reconstruction pipelines, and resource-constrained environments including low-field and portable MRI scanners.

  • Superior fidelity–efficiency balance paves the way for shorter scan times, reduced motion artifacts, and 7T-like diagnostic quality from standard 3T systems without additional hardware.

Abstract

Magnetic resonance imaging (MRI) super-resolution (SR) enables high-resolution reconstruction from low-resolution acquisitions, reducing scan time and easing hardware demands. However, most deep learning-based SR models are large and computationally heavy, limiting deployment in clinical workstations, real-time pipelines, and resource-restricted platforms such as low-field and portable MRI. We introduce CHARMS, a lightweight convolutional–Transformer hybrid with attention regularization optimized for MRI SR. CHARMS employs a Reverse Residual Attention Fusion backbone for hierarchical local feature extraction, Pixel–Channel and Enhanced Spatial Attention for fine-grained feature calibration, and a Multi-Depthwise Dilated Transformer Attention block for efficient long-range dependency modeling. Novel attention regularization suppresses redundant activations, stabilizes training, and enhances generalization across contrasts and field strengths. Across IXI, Human Connectome Project Young Adult, and paired 3T/7T datasets, CHARMS (~1.9M parameters; ~30 GFLOPs for 256 × 256) surpasses leading lightweight and hybrid baselines (EDSR, PAN, W2AMSN-S, and FMEN) by 0.1–0.6 dB PSNR and up to 1% SSIM at ×2/×4 upscaling, while reducing inference time ~40%. Cross-field fine-tuning yields 7T-like reconstructions from 3T inputs with ~6 dB PSNR and 0.12 SSIM gains over native 3T. With near-real-time performance (~11 ms/slice, ~1.6–1.9 s per 3D volume on RTX 4090), CHARMS offers a compelling fidelity–efficiency balance for clinical workflows, accelerated protocols, and portable MRI.

Keywords: MRI super-resolution, lightweight network, CNN-Transformer hybrid, attention regularization, reverse residual attention, clinical deployment

1. Introduction

Magnetic resonance imaging (MRI) is a cornerstone modality in clinical diagnostics and biomedical research, offering superior soft tissue contrast without ionizing radiation. However, acquiring high-resolution (HR) MRI often requires long scan durations or high-field systems, which increase costs, patient discomfort, and vulnerability to motion artifacts. Super-resolution (SR) reconstruction [1,2,3] seeks to mitigate these limitations by recovering HR images from low-resolution (LR) inputs, thereby enhancing spatial detail without extending acquisition time.

Deep learning has substantially advanced SR reconstruction by enabling end-to-end mappings between LR and HR domains. Early 2D convolutional neural network (CNN) models such as SRCNN [4], VDSR [5], EDSR [6], and RCAN [7] achieved strong gains in image fidelity through deep feature extraction and residual learning. These methods demonstrated the potential of CNNs to recover fine structural details but often depended on increasingly deep or wide architectures. As a result, they introduced high computational cost, large memory demand, and limited suitability for real-time or near-real-time MRI workflows.

Subsequent efforts introduced lightweight 2D CNN variants, including MobileNet [8], ShuffleNet [9], and ESPNet [10], and explored depthwise separable convolutions and channel decomposition to reduce parameters while preserving performance. Although effective for accelerating SR networks, these designs generally lacked mechanisms for modeling long-range spatial relationships, which are essential for reconstructing complex anatomical structures.

To address these limitations, we propose CHARMS (CNN-Transformer Hybrid with Attention Regularization for MRI Super-Resolution), a compact and efficient SR framework tailored for MRI. CHARMS integrates a Reverse Residual Attention Fusion (RRAF) backbone [11] for hierarchical feature extraction with Pixel–Channel Attention (PCA) [12] and Enhanced Spatial Attention (ESA) modules [13] for fine-grained feature calibration. A Multi-Depthwise Dilated Transformer Attention (MDDTA) module [14] captures long-range dependencies with linear computational complexity, while an attention regularization mechanism reduces redundancy and stabilizes training (see Appendix A for a glossary of acronyms and brief descriptions of the main architectural modules).

Extensive experiments on multiple open-access MRI datasets, including IXI [15], Human Connectome Project Young Adult (HCP-YA) [16], and paired 3T-7T datasets [17,18], demonstrate that CHARMS achieves superior reconstruction accuracy with significantly fewer parameters and lower inference latency compared to existing CNN- and Transformer-based SR models. By balancing representational power with computational efficiency, CHARMS advances the integration of deep SR frameworks into modern MRI acquisition and reconstruction pipelines.

This study addresses the following research questions: RQ1: Can a lightweight CNN-Transformer hybrid outperform existing MRI SR models in fidelity–efficiency trade-off? RQ2: How does attention regularization improve training stability and cross-contrast/field-strength generalization? RQ3: What is the clinical potential of limited-data cross-field fine-tuning for upgrading standard 3T scans to near-7T quality?

2. Related Work

2.1. CNN-Based MRI Super-Resolution

Early deep SR models relied heavily on 2D CNNs. SRCNN pioneered end-to-end learning with a simple three-layer architecture, while VDSR and EDSR introduced residual learning and deeper networks for enhanced detail recovery. RCAN advanced this further by incorporating channel attention to adaptively recalibrate features. Lightweight variants, inspired by MobileNet, ShuffleNet, and ESPNet, employed depthwise separable convolutions and channel shuffling to reduce parameters, achieving faster inference suitable for clinical constraints. Despite these efficiency gains, pure CNN approaches often struggle with modeling long-range spatial dependencies critical for reconstructing extended anatomical structures in brain MRI.

2.2. Attention Mechanisms in SR

Attention modules have significantly boosted SR performance [19,20]. Channel [7] and spatial attention [19] modules improved intra-feature relevance by adaptively weighting informative regions, while residual attention blocks [7,12,19,20,21,22] helped capture hierarchical contextual cues. More recently, Transformer-based architectures [23] introduced self-attention to model long-range dependencies across the entire image plane [24,25,26,27]. These developments produced notable improvements in reconstruction accuracy and generalization. Representative hybrid CNN-Transformer SR models, such as SwinIR [28], Restormer [27], and TTVSR [29], demonstrated the value of combining convolutional locality with global context modeling. However, many of these hybrid approaches [30,31,32,33] rely on computationally expensive attention operations, redundant attention activations, or complex multi-stage pipelines, which hinder their scalability to high-resolution volumetric MRI and limit deployment in resource-constrained clinical environments.

2.3. Transformer and Hybrid Architectures

The introduction of Transformers revolutionized natural image SR through global self-attention, as exemplified by SwinIR (shifted-window attention), Restormer (multi-head channel attention), and TTVSR. These hybrid CNN-Transformer models combine inductive biases of convolutions with Transformer’s long-range modeling, yielding superior fidelity. In MRI SR, variants such as SuperFormer, REHRSeg, and physics-informed hybrids have emerged, often extending to multi-contrast or anisotropic data. Very recent advances include VMamba-Transformer hybrids (e.g., MRISR [34,35,36,37,38,39]) for low-field enhancement and diffusion-guided models that leverage generative sampling for texture-rich reconstruction. Although powerful, these methods frequently involve high parameter counts (>10 M), iterative sampling, or quadratic attention complexity, limiting real-time applicability.

Transformer-enhanced frameworks often incur prohibitive computational overhead [24,27,33,40]. A key challenge is how to design an efficient hybrid model that preserves local representational strength, captures long-range dependencies, and avoids redundant or unstable attention behaviors. Addressing this challenge is critical for advancing practical MRI SR, especially in hardware-limited settings where latency, memory footprint, and energy consumption are essential constraints. In addition to these CNN and Transformer-based methods, several very recent models have further extended MRI SR capabilities. MRISR [41] employs a VMamba-Transformer architecture to enhance texture reconstruction in low-field MRI, while a 2D slice-wise diffusion model [42] demonstrates the strong generative capacity of diffusion sampling for high-fidelity SR. Although effective, these approaches tend to be computationally demanding, reinforcing the need for lightweight and stable SR frameworks.

2.4. Diffusion Models and Emerging Trends

Diffusion-based SR frameworks have recently demonstrated exceptional perceptual quality by treating super-resolution as iterative denoising. Models exploiting latent diffusion or physics-informed guidance achieve impressive detail recovery, particularly in multi-parametric or low-SNR settings. However, their reliance on dozens to hundreds of sampling steps renders them computationally intensive, often unsuitable for clinical workflows demanding sub-second latency. Recent advances have extended MRI SR with diffusion models for high-fidelity generation [35,42,43,44,45,46] and Transformer hybrids for multi-contrast handling [44,47]. However, these often prioritize perceptual quality over computational efficiency, with high iteration counts or parameters (>10 M), limiting real-time clinical use [48,49,50]. CHARMS addresses this gap by integrating lightweight CNN-Transformer fusion with attention regularization, achieving SOTA efficiency–fidelity balance for portable and cross-field MRI [35,42,43,44,45,46,47,51].

CHARMS distinguishes itself by integrating lightweight convolutional processing with linear complexity-dilated Transformer attention and explicit regularization, achieving competitive or superior fidelity at a fraction of the computational cost of contemporary Transformer hybrids and diffusion models. This positions CHARMS as a practical solution for resource-constrained and cross-field MRI super-resolution.

3. Materials and Methods

3.1. CHARMS Framework

The proposed CHARMS network reconstructs HR MR images from LR inputs through an end-to-end mapping

$I_{SR} = f_{\theta}(I_{LR})$, (1)

where $f_{\theta}$ denotes the end-to-end mapping with trainable parameters $\theta$. As illustrated in Figure 1, CHARMS comprises three sequential stages as follows: (1) RRAF blocks [11] for local texture encoding and multi-scale spatial awareness; (2) contextual enhancement through an MDDTA layer [14] coupled with a Gated Depthwise Dilated Feed-Forward Network (GDDFN) [52] to capture long-range dependencies; and (3) reconstruction and refinement that combines pixel-shuffle upsampling [53] with a High-Frequency Information Refinement (HFIR) module [54] to restore fine structural details. A global residual connection adds shallow features directly to the final output, preserving low-frequency signal fidelity and improving optimization stability. Each design element within CHARMS was introduced to address the performance–efficiency trade-off that limits most existing CNN-Transformer hybrid SR models.
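The three-stage pipeline with its global residual connection can be sketched in PyTorch; the stand-in layers below replace the real RRAF/MDDTA/HFIR modules with plain convolutions, and all channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CharmsSkeleton(nn.Module):
    """Structural sketch of the three-stage CHARMS pipeline (hypothetical
    layer sizes; RRAF, MDDTA, and HFIR are replaced by plain convolutions
    for illustration)."""
    def __init__(self, channels=48, scale=2):
        super().__init__()
        self.shallow = nn.Conv2d(1, channels, 3, padding=1)   # shallow feature extraction
        self.body = nn.Sequential(                            # stand-in for RRAF + MDDTA stages
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.upsample = nn.Sequential(                        # pixel-shuffle upsampling
            nn.Conv2d(channels, channels * scale**2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.recon = nn.Conv2d(channels, 1, 3, padding=1)     # final reconstruction

    def forward(self, x):
        shallow = self.shallow(x)
        deep = self.body(shallow) + shallow   # global residual preserves low frequencies
        return self.recon(self.upsample(deep))

lr = torch.randn(1, 1, 64, 64)
sr = CharmsSkeleton(scale=2)(lr)
print(sr.shape)   # torch.Size([1, 1, 128, 128])
```

The global residual is added before upsampling here; the exact placement in CHARMS follows Figure 1.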

Figure 1.

Figure 1

Overview of the proposed CHARMS framework, consisting of the following three key components: deep feature extraction, upsampling reconstruction, and high-frequency refinement (a). Deep feature extraction involves a stack of Reverse Residual Attention Fusion (RRAF) blocks (b), each incorporating four improved residual local feature extraction (RLFE) modules (c) with Enhanced Spatial Attention (ESA). A Multi-Depthwise Dilated Transformer Attention (MDDTA) transformer with a supporting GDDFN further enhances the features. Various attention mechanisms, including ESA, pixel attention (PA), and shallow channel attention (SCA), are integrated into the architecture. Solid arrows indicate data/model flow.

Unlike deep CNNs [35,36,37,38,39,55] that rely on large parameter counts or pure Transformers [30,31,32,33] that suffer from high computational complexity, CHARMS unifies convolutional locality and Transformer globality in a lightweight yet expressive architecture. Its design incorporates the following three principal innovations relative to existing lightweight SR architectures: (1) RRAF modules, adapted and extended from the reverse-attention principle introduced in prior work [11], are reformulated for MRI SR by integrating residual feature refinement, channel shuffle, and enhanced spatial attention to improve feature reuse and gradient flow while directing focus toward anatomically salient regions; (2) an attention regularization mechanism, applied jointly across the CNN and Transformer branches, suppresses redundant activations and stabilizes optimization, reducing overfitting and improving generalization across contrasts and scanners; (3) the MDDTA layer [14] introduces depthwise-separable projections and dilated attention heads to maintain linear computational complexity while preserving long-range contextual modeling, which is essential for capturing extended anatomical structures.

Through these interdependent innovations, CHARMS achieves a balance between representational richness and computational efficiency, reducing model size and inference time while improving fidelity and structural sharpness. The following subsections describe each architectural component, dataset preparation, and experimental protocols used for model training and evaluation.

3.2. Reverse Residual Attention Fusion (RRAF) Block

At the core of CHARMS lies the RRAF module [11], which integrates residual learning and spatial attention to extract discriminative features while maintaining parameter efficiency. As shown in Figure 1b, each RRAF block contains several RLFE units [56] composed of paired convolutions with ReLU activations and an ESA operator. ESA generates a spatial weight map that highlights high-frequency anatomical regions such as tissue interfaces and structural boundaries. Dilated convolutions with rates of 1 and 2 expand the receptive field, while a channel-reduction and recovery sequence captures multi-scale dependencies. To enhance inter-channel communication without increasing parameters, channel shuffle operations are incorporated, allowing independent feature groups to exchange information efficiently. Mathematically, the output of an RRAF block can be expressed as:

$F_{out} = \mathrm{Shuffle}\left( \sum_{k=1}^{K} \mathrm{RLFE}_k(F_{in}) + F_{in} \right)$, (2)

where $F_{in}$ and $F_{out}$ represent the input and output feature maps, $K$ is the number of RLFE units, and $\mathrm{Shuffle}(\cdot)$ denotes channel shuffling. Equation (2) provides an overview, with details in (3) for the RLFE summation and (5) for the channel shuffle that enhances inter-feature communication. This design facilitates both spatial adaptivity and gradient stability, addressing the vanishing-gradient issues commonly observed in deeper CNNs. Each RRAF block aggregates K RLFE modules [56] that jointly encode spatial detail and contextual cues. For the i-th block,

$F_{RRAF} = F_{in} + \sum_{k=1}^{K} \varepsilon_k(F_{k-1}), \quad F_0 = F_{in}$, (3)

where $\varepsilon_k(\cdot)$ denotes an RLFE unit consisting of two $3 \times 3$ convolutions with ReLU activation followed by an ESA operation. ESA computes a spatial weight map

$A_s = \sigma\left( C_{\uparrow}\left( C_{d2}\left( C_{d1}\left( P\left( C_{\downarrow}(F) \right) \right) \right) \right) \right)$, (4)

with $C_{\downarrow}$ and $C_{\uparrow}$ representing the channel-reduction and recovery layers, $P$ a pooling operator, $C_{d1}$ and $C_{d2}$ dilated convolutions of rates 1 and 2, and $\sigma$ the sigmoid activation. The refined output $F' = F \odot A_s$ selectively enhances informative regions, while a channel-shuffle operator $S(\cdot)$ intermixes channel groups to strengthen cross-feature communication without increasing the parameter count:

$F_{RRAF} \leftarrow S(F_{RRAF})$. (5)

Residual skip connections maintain integrity and mitigate gradient attenuation.
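The RRAF computation of Eqs. (2)–(5) can be sketched in PyTorch. The RLFE unit below uses a simplified single-convolution spatial gate in place of the full ESA of Eq. (4), and the channel counts and group size are illustrative assumptions:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Intermix channel groups (Eq. (5)) so independent feature groups
    exchange information without adding parameters."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class RLFE(nn.Module):
    """Sketch of one residual local feature extraction unit: paired 3x3
    convolutions with ReLU, followed by a simplified spatial gate standing
    in for ESA (the full ESA also uses pooling and dilated convolutions)."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1),
        )
        self.gate = nn.Sequential(nn.Conv2d(c, 1, 1), nn.Sigmoid())  # spatial weight map

    def forward(self, x):
        f = self.conv(x)
        return f * self.gate(f)          # F' = F ⊙ A_s

class RRAF(nn.Module):
    """Eqs. (2)/(3): cascade K RLFE units, sum their outputs with the
    block input, then shuffle channels."""
    def __init__(self, c=48, K=4, groups=4):
        super().__init__()
        self.units = nn.ModuleList(RLFE(c) for _ in range(K))
        self.groups = groups

    def forward(self, x):
        f, total = x, x
        for unit in self.units:          # F_k = ε_k(F_{k-1}), accumulated into the output
            f = unit(f)
            total = total + f
        return channel_shuffle(total, self.groups)

y = RRAF(c=48)(torch.randn(2, 48, 32, 32))
print(y.shape)   # torch.Size([2, 48, 32, 32])
```

Shuffling via `view`/`transpose`/`reshape` is the standard parameter-free formulation from ShuffleNet-style designs.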

3.3. Pixel–Channel Attention (PCA) Module

To further recalibrate feature saliency, CHARMS employs a PCA mechanism that merges pixel-level and channel-level cues. Given an input tensor F, pixel attention is computed as

$A_p = \sigma\left( C_{1 \times 1}(F) \right)$, (6)

while channel attention is

$A_c = \sigma\left( C_{1 \times 1}(\mathrm{AvgPool}(F)) \right)$. (7)

The combined attention map

$A_{PCA} = \mathcal{N}\left( A_p + A_c \right)$ (8)

is normalized and applied elementwise, yielding the recalibrated feature

$F_{PCA} = F \odot A_{PCA}$. (9)

This joint weighting emphasizes anatomically relevant structures and suppresses background noise with negligible computational cost.
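A minimal PyTorch sketch of the PCA recalibration of Eqs. (6)–(9); treating the normalization $\mathcal{N}$ as an average of the two maps is an assumption for illustration:

```python
import torch
import torch.nn as nn

class PixelChannelAttention(nn.Module):
    """Sketch of PCA (Eqs. (6)-(9)): a pixel-attention map from a 1x1
    convolution and a channel-attention vector from globally pooled
    features are combined, rescaled, and applied elementwise."""
    def __init__(self, c):
        super().__init__()
        self.pixel = nn.Conv2d(c, c, 1)     # A_p = σ(C_1x1(F))
        self.channel = nn.Conv2d(c, c, 1)   # A_c = σ(C_1x1(AvgPool(F)))

    def forward(self, f):
        a_p = torch.sigmoid(self.pixel(f))
        pooled = f.mean(dim=(2, 3), keepdim=True)   # global average pooling
        a_c = torch.sigmoid(self.channel(pooled))
        a = 0.5 * (a_p + a_c)                       # assumed normalization N (Eq. (8))
        return f * a                                # F_PCA = F ⊙ A_PCA  (Eq. (9))

out = PixelChannelAttention(16)(torch.randn(1, 16, 8, 8))
print(out.shape)   # torch.Size([1, 16, 8, 8])
```

Because both branches use only 1×1 convolutions and global pooling, the added cost is negligible, matching the paper's claim.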

3.4. Transformer Module with MDDTA and GDDFN

Local convolutions cannot effectively capture long-range anatomical dependencies; thus, CHARMS introduces a lightweight Transformer composed of a Multi-Depthwise Dilated Transformer Attention (MDDTA) layer [14] followed by a Gated Depthwise Dilated Feed-Forward Network (GDDFN) [52]. The intermediate feature map obtained after the MDDTA block (plus its residual skip connection) is

$F_t = A_{MDDTA}(F_{in}) + F_{in}$. (10)

The combined output after the GDDFN block can be expressed as

$F_{out} = G_{GDDFN}(F_t) + F_t$. (11)

The MDDTA computes self-attention over multi-scale depthwise features:

$A_{MDDTA}(F_{in}) = \sum_i w_i \, \mathrm{Softmax}\left( \frac{Q_i K_i^{\top}}{\sqrt{d}} \right) V_i, \qquad Q_i, K_i, V_i = C_{r_i}(F_{in})$, (12)

where $C_{r_i}$ is a depthwise convolution with dilation $r_i$ and the $w_i$ are learned weights. This formulation expands the receptive field while maintaining linear complexity. The subsequent GDDFN enhances nonlinearity through gated depthwise convolutions:

$G_{GDDFN}(F) = C_{1 \times 1}\left[ \phi\left( C_d(F_1) \right) \odot C_d(F_2) \right]$, (13)

where $\phi$ denotes the GELU activation and $F_1$ and $F_2$ are complementary channel-wise splits of $F$. Together, these components effectively balance contextual modeling power and efficiency.
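The gated feed-forward path of Eqs. (11) and (13) can be sketched as follows; the expansion ratio, dilation rate, and channel split are illustrative assumptions in the style of gated depthwise FFNs:

```python
import torch
import torch.nn as nn

class GDDFN(nn.Module):
    """Sketch of the gated depthwise dilated feed-forward network
    (Eq. (13)): the feature map is expanded and split into two halves,
    a GELU branch gates the other half, and a 1x1 convolution fuses
    the result; the residual add mirrors Eq. (11)."""
    def __init__(self, c, dilation=2):
        super().__init__()
        self.expand = nn.Conv2d(c, 2 * c, 1)                     # produce F1 || F2
        self.dw = nn.Conv2d(2 * c, 2 * c, 3, padding=dilation,
                            dilation=dilation, groups=2 * c)     # depthwise dilated conv C_d
        self.fuse = nn.Conv2d(c, c, 1)                           # C_1x1
        self.act = nn.GELU()                                     # φ

    def forward(self, f):
        f1, f2 = self.dw(self.expand(f)).chunk(2, dim=1)         # channel-wise split
        return self.fuse(self.act(f1) * f2) + f                  # gated fusion + residual

y = GDDFN(32)(torch.randn(1, 32, 16, 16))
print(y.shape)   # torch.Size([1, 32, 16, 16])
```

Depthwise grouping keeps the parameter count linear in the channel width, which is what allows the Transformer stage to stay lightweight.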

3.5. High-Frequency Information Refinement (HFIR)

After pixel-shuffle upsampling [53], the HFIR module [54] performs residual correction to restore attenuated high-frequency information. The final reconstruction can be written as

$I_{SR} = U(F_{out}) + \sigma\left( C_{3 \times 3}^{(4)}\left( U(F_{out}) \right) \right) \odot U(F_{out})$, (14)

where $U$ denotes the upsampling operator, $C_{3 \times 3}^{(4)}$ represents four stacked depthwise separable convolutions, and $\sigma$ is a sigmoid gate. This refinement selectively boosts edge and texture components while suppressing reconstruction artifacts.
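Eq. (14) amounts to a sigmoid-gated residual branch applied after upsampling; a minimal sketch, assuming a single-channel image and illustrative depthwise-separable layers:

```python
import torch
import torch.nn as nn

class HFIR(nn.Module):
    """Sketch of Eq. (14): four stacked depthwise-separable 3x3
    convolutions produce a sigmoid gate that re-amplifies high-frequency
    content of the upsampled image (single-channel case for illustration)."""
    def __init__(self, c=1, depth=4):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [nn.Conv2d(c, c, 3, padding=1, groups=c),  # depthwise 3x3
                       nn.Conv2d(c, c, 1)]                       # pointwise mixing
        self.branch = nn.Sequential(*layers)

    def forward(self, up):
        gate = torch.sigmoid(self.branch(up))   # σ(C_3x3^(4)(U(F_out)))
        return up + gate * up                   # residual high-frequency boost

sr = HFIR()(torch.randn(1, 1, 128, 128))
print(sr.shape)   # torch.Size([1, 1, 128, 128])
```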

3.6. Datasets and Preprocessing

To evaluate the performance and generalization capability of the CHARMS framework, we employed four publicly available brain MRI datasets encompassing diverse imaging contrasts, spatial resolutions, and acquisition protocols. As summarized in Table 1, the datasets are from the Human Connectome Project Young Adult study (HCP-YA) [16], the IXI dataset [15], and two paired 3T/7T studies (PTT1 and PTT2) [17,18]. The HCP-YA dataset offers high-quality 3T scans at 0.7 mm isotropic resolution for a relatively homogeneous healthy control (HC) cohort; the IXI dataset provides T1w and T2w volumes (0.94 × 0.94 × 1.2 mm³) of a more heterogeneous cohort scanned on different platforms (1.5T to 3T from different vendors); and the two paired 3T/7T datasets supply matched cross-field T1w and T2w images, where the 7T data serve as pseudo-ground-truth for cross-field validation.

To train the CHARMS model, all 3D image volumes were bias-field corrected with N4ITK [57] and skull-stripped using SynthStrip [58]. Bias-field correction was applied consistently to all volumes across the IXI, HCP-YA, PTT1, and PTT2 datasets to ensure uniform preprocessing. The central 100 slices of each volume were downsampled by factors ×2 and ×4 using bicubic interpolation to form LR–HR pairs. Datasets were split 70:20:10 into training, validation, and testing sets at the subject level, with intensities normalized to [0, 1]. Specifically, intensity normalization was performed via min–max scaling applied independently to each volume (i.e., rescaling voxel intensities based on the minimum and maximum values within that volume), without additional techniques such as Z-score normalization, White Stripe, or histogram matching between datasets. This approach was used consistently across all datasets to maintain reproducibility and support model generalization, as no direct quantitative comparisons were made between datasets.
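The per-volume min–max normalization, central-slice selection, and subject-level 70:20:10 split described above can be sketched as follows (array shapes and the slice-axis convention are illustrative assumptions):

```python
import numpy as np

def minmax_normalize(vol):
    """Per-volume min-max scaling to [0, 1], as applied to each dataset."""
    lo, hi = vol.min(), vol.max()
    return (vol - lo) / (hi - lo)

def central_slices(vol, n=100):
    """Take the n slices around the volume centre (slice axis 0 assumed)."""
    start = max(0, vol.shape[0] // 2 - n // 2)
    return vol[start:start + n]

def subject_split(ids, seed=0):
    """70:20:10 train/validation/test split at the subject level."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(ids)
    n = len(ids)
    n_tr, n_va = int(0.7 * n), int(0.2 * n)
    return ids[:n_tr], ids[n_tr:n_tr + n_va], ids[n_tr + n_va:]

vol = np.random.rand(160, 256, 256) * 1000       # synthetic intensity volume
norm = central_slices(minmax_normalize(vol))
tr, va, te = subject_split(np.arange(100))
print(norm.shape, len(tr), len(va), len(te))     # (100, 256, 256) 70 20 10
```

Bicubic downsampling of each slice would typically be done with an image library; it is omitted here to keep the sketch dependency-free.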

Table 1.

Summary of the MRI datasets used.

| Dataset | Subjects | Demographics | Contrast | Resolution (mm³) |
|---|---|---|---|---|
| HCP-YA [16] | 1206 HC | 22–36 years, m/f = 507/699 | T1w | 0.7 × 0.7 × 0.7 |
| | | | T2w | 0.7 × 0.7 × 0.7 |
| IXI [15] | 563 HC | 20–86 years, m/f = 250/313 | T1w | 0.94 × 0.94 × 1.2 |
| | | | T2w | 0.94 × 0.94 × 1.2 |
| PTT1 [17] | 20 HC | 18–25 years, m/f = 10/10 | T1w | 1 × 1 × 1 (3T), 0.7 × 0.7 × 0.7 (7T) |
| | | | T2w | 0.9 × 0.9 × 1.9 (3T), 0.4 × 0.4 × 1 (7T) |
| PTT2 [18] | 10 HC | 25–41 years, m/f = 7/3 | T1w | 0.8 × 0.8 × 0.8 (3T), 0.7 × 0.7 × 0.7 (7T) |
| | | | T2w | 0.8 × 0.8 × 0.8 (3T), 0.7 × 0.7 × 0.7 (7T) |

3.7. Training Protocol and Comparative Models

All experiments were implemented in PyTorch 2.0 and executed on an NVIDIA RTX 4090 GPU (24 GB GDDR6X VRAM) under standard conditions (CUDA 11.7; no additional optimizations such as TensorRT). Model optimization employed the AdamW optimizer [59] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$, hyperparameter values suggested in the literature [60,61,62]. Training minimized an L1 reconstruction loss over 200 epochs [63] with a batch size of 16 and an initial learning rate of $10^{-4}$. The proposed CHARMS network comprises approximately 1.9 million parameters and requires about 30 GFLOPs for a 256 × 256 input, yielding roughly 40% faster inference than the widely used EDSR model [6] while achieving higher reconstruction fidelity across all benchmark datasets.
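The stated optimizer and loss configuration translates directly into PyTorch; the tiny model and random tensors below are placeholders for CHARMS and the LR–HR patch loader:

```python
import torch
import torch.nn as nn

# Training-loop sketch matching the stated protocol: AdamW with
# β1=0.9, β2=0.999, ε=1e-8, an L1 reconstruction loss, batch size 16,
# and an initial learning rate of 1e-4.
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 1, 3, padding=1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4,
                        betas=(0.9, 0.999), eps=1e-8)
l1 = nn.L1Loss()

for step in range(3):                       # stands in for 200 epochs of batches
    lr_batch = torch.randn(16, 1, 64, 64)   # placeholder LR patches
    hr_batch = torch.randn(16, 1, 64, 64)   # placeholder HR targets (same size here;
                                            # the real model upsamples internally)
    opt.zero_grad()
    loss = l1(model(lr_batch), hr_batch)
    loss.backward()
    opt.step()
print(float(loss))
```

A learning-rate schedule (e.g., step or cosine decay) would normally wrap `opt`; the paper specifies only the initial rate.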

To ensure a fair and reproducible evaluation, CHARMS was compared with the following seven representative 2D SR networks that together span the evolution of deep SR approaches: the non-learned bicubic interpolation baseline; early CNN architectures SRCNN [4], VDSR [5], and EDSR [6]; the attention-augmented CNN PAN [12]; and the more recent multi-scale and attention-driven hybrids W2AMSN-S [64] and FMEN [65]. All baseline models were trained from scratch on identical training, validation, and testing partitions using the same optimizer, learning-rate schedule, and number of epochs, ensuring consistent experimental conditions across architectures.

Performance was assessed using the Peak Signal-to-Noise Ratio (PSNR) [66] and Structural Similarity Index (SSIM) [67] to quantify both numerical accuracy and perceptual quality. In addition, qualitative visual analysis was conducted to evaluate the recovery of fine anatomical structures, such as cortical boundaries and tissue interfaces. Computational efficiency was measured in terms of inference time and parameter count using standardized 256 × 256 MR image inputs. Finally, a comprehensive ablation study isolated the contributions of the RRAF, PCA, MDDTA, and HFIR modules, demonstrating how each component contributes to the overall balance between efficiency and reconstruction quality in CHARMS.
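PSNR follows directly from the mean squared error of the intensity-normalized images; a minimal NumPy implementation (SSIM, by contrast, is typically taken from a library implementation such as scikit-image's `structural_similarity`):

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images normalized to [0, 1]."""
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.zeros((64, 64))
noisy = ref + 0.01          # uniform 0.01 error -> MSE = 1e-4 -> 40 dB
print(round(psnr(ref, noisy), 2))   # 40.0
```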

3.8. Cross-Field Adaptation and Evaluation Procedure

To adapt CHARMS for generating 7T-like images from 3T inputs, we performed cross-field fine-tuning using the PTT1 dataset [17] of 20 subjects (see Table 1), each providing matched 3T/7T T1-weighted and T2-weighted volumes. The overall procedure is illustrated in Figure 2.

Figure 2.

Figure 2

Schematic of the CHARMS cross-field adaptation workflow. Solid arrows indicate data/model flow; the curved blue arrow represents the fine-tuning process using paired 3T/7T data; the final inference path (3T input → 7T-like output) is shown with a thick arrow.

The pretrained model $f_{\theta_0}$ is fine-tuned on 20 paired 3T/7T subjects from the PTT1 dataset to obtain $f_{\theta_1}$, which is then evaluated on the PTT2 dataset [18] of 10 subjects to assess cross-site generalization. Let

$\mathcal{D}_{ft} = \left\{ (x_i^{3T}, y_i^{7T}) \right\}_{i=1}^{20}$ (15)

denote this fine-tuning dataset and let $f_{\theta_0}$ represent the pretrained CHARMS model obtained from the large-scale multi-contrast MRI SR training described above. Fine-tuning adapts the model parameters to produce an updated model $f_{\theta_1}$ by solving

$\theta_1 = \arg\min_{\theta} \sum_{(x_i^{3T}, y_i^{7T}) \in \mathcal{D}_{ft}} \mathcal{L}\left( f_{\theta}(x_i^{3T}), y_i^{7T} \right)$, (16)

where the total training loss for cross-field adaptation was defined as

$\mathcal{L} = \lambda_1 \left\| f_{\theta}(x_i^{3T}) - y_i^{7T} \right\|_1 + \lambda_2 \left( 1 - \mathrm{SSIM}\left( f_{\theta}(x_i^{3T}), y_i^{7T} \right) \right) + \lambda_3 \mathcal{L}_{AR}$ (17)

includes the pixel-wise L1 reconstruction term, a structural-similarity term, and the proposed attention-regularization (AR) penalty to enhance fine-scale anatomical fidelity. To mitigate overfitting on the 20-subject fine-tuning dataset, we used fixed weights $\lambda_1 = 1$, $\lambda_2 = 0.1$, and $\lambda_3 = 0.005$, fine-tuned the model with a reduced learning rate of $10^{-5}$, and restricted the trainable parameters to the last two Transformer blocks and the decoder while keeping the early CNN backbone frozen. This selective fine-tuning strategy follows established transfer-learning principles, where early layers capture stable, generalizable features while later layers adapt to domain-specific variations [68]. In medical imaging, partial fine-tuning of deeper layers has been shown to be more effective and less prone to overfitting than full retraining when data are limited [69].
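The selective fine-tuning setup, freezing the early backbone and combining the three loss terms of Eq. (17) with the stated weights, can be sketched as follows; `ssim` and `attention_penalty` are hypothetical placeholders for the real SSIM and AR terms:

```python
import torch
import torch.nn as nn

# Sketch: freeze the early backbone, leave later blocks and the decoder
# trainable, and optimize the composite loss of Eq. (17) with a reduced
# learning rate. Layer roles and sizes are illustrative.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1),   # "early CNN backbone" (frozen)
    nn.Conv2d(8, 8, 3, padding=1),   # "later Transformer block" (trainable)
    nn.Conv2d(8, 1, 3, padding=1),   # "decoder" (trainable)
)
for p in model[0].parameters():
    p.requires_grad = False          # early backbone frozen

opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)

def ssim(a, b):                      # placeholder; use a real SSIM in practice
    return 1.0 - (a - b).abs().mean()

def attention_penalty(a):            # placeholder for the AR term L_AR
    return a.abs().mean()

x3t, y7t = torch.randn(4, 1, 32, 32), torch.randn(4, 1, 32, 32)
pred = model(x3t)
loss = (1.0 * (pred - y7t).abs().mean()              # λ1 · L1 term
        + 0.1 * (1.0 - ssim(pred, y7t))              # λ2 · (1 - SSIM)
        + 0.005 * attention_penalty(pred))           # λ3 · L_AR
loss.backward()
opt.step()
frozen = sum(p.grad is None for p in model[0].parameters())
print(frozen)   # 2  (weight and bias of the frozen layer receive no gradients)
```

Passing only the trainable parameters to the optimizer mirrors the paper's restriction of updates to the last two Transformer blocks and the decoder.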

Model evaluation was conducted using the independent external dataset PTT2 [18] consisting of 10 additional subjects with paired 3T/7T volumes.

$\mathcal{D}_{test} = \left\{ (x_j^{3T}, y_j^{7T}) \right\}_{j=1}^{10}$. (18)

This strict separation of sites ensures that the evaluation reflects cross-site generalization rather than memorization of vendor- or protocol-specific features. For each subject in the external cohort, the adapted model produced a 7T-like prediction from the 3T input

$\hat{y}_j^{7T} = f_{\theta_1}(x_j^{3T})$, (19)

which was quantitatively compared against the true 7T reference yj7T using PSNR, SSIM, and additional region-specific sharpness and edge-contrast metrics. Qualitative assessment emphasized cortical ribbon delineation, deep-gray-matter contrast, and cerebellar structural clarity.

4. Results

4.1. Benchmark Performance

The proposed CHARMS network was first evaluated against seven representative SR methods—Bicubic, SRCNN [4], VDSR [5], EDSR [6], PAN [12], W2AMSN-S [64], and FMEN [65]—on the IXI (T1w/T2w) and HCP-YA (T1w) datasets. Figure 3 presents qualitative comparisons for a representative sagittal slice of the IXI T1w volume downsampled by a factor of four. CHARMS visibly restores sharper cortical and subcortical structures with reduced blurring, while competing models exhibit varying degrees of texture loss or over-smoothing.

Figure 3.

Figure 3

Qualitative ×4 super-resolution results on a representative sagittal IXI T1w slice. Top row: full-slice reconstructions showing overall image quality and contrast, with red rectangles indicating zoomed regions. Middle row: zoomed regions highlighting fine anatomical details. Bottom row: MSE maps with color bars visualizing residual errors and potential artifacts.

Quantitative results for ×2 and ×4 up-scaling are summarized in Table 2 and Table 3. At ×2 scaling, CHARMS achieves PSNR/SSIM values of 37.79 dB/0.973 on IXI T1w and 38.56 dB/0.981 on IXI T2w, outperforming FMEN and W2AMSN-S by ≈0.2–0.6 dB with fewer parameters. At ×4 scaling, CHARMS yields 33.27 dB/0.945 on IXI T1w and 32.97 dB/0.956 on IXI T2w, again slightly higher than the best competing networks. On the high-quality HCP dataset, performance differences are smaller but consistent: CHARMS matches or exceeds FMEN while maintaining the lowest model complexity (≈1.9 M parameters).

Table 2.

Summary of the model characteristics and performances (PSNR and SSIM) after 2× downsampling and SRR enhancement. All CHARMS gains over FMEN are statistically significant (p < 0.05).

| Model | Parameters (M) | Train (h) | PSNR (IXI-T1w) | SSIM (IXI-T1w) | PSNR (IXI-T2w) | SSIM (IXI-T2w) | PSNR (HCP-T1w) | SSIM (HCP-T1w) |
|---|---|---|---|---|---|---|---|---|
| Bicubic | – | – | 31.68 ± 0.55 | 0.935 ± 0.028 | 32.51 ± 0.58 | 0.946 ± 0.025 | 40.19 ± 0.48 | 0.9874 ± 0.020 |
| SRCNN | 0.82 | 0.7 | 33.86 ± 0.42 | 0.961 ± 0.018 | 35.33 ± 0.45 | 0.969 ± 0.016 | 43.80 ± 0.38 | 0.9935 ± 0.014 |
| VDSR | 0.37 | 3.0 | 35.43 ± 0.35 | 0.968 ± 0.015 | 36.74 ± 0.38 | 0.963 ± 0.013 | 44.77 ± 0.32 | 0.9936 ± 0.011 |
| EDSR | 1.36 | 4.7 | 36.13 ± 0.28 | 0.971 ± 0.012 | 38.07 ± 0.30 | 0.979 ± 0.010 | 46.11 ± 0.25 | 0.9957 ± 0.010 |
| PAN | 0.78 | 5.0 | 36.38 ± 0.27 | 0.970 ± 0.013 | 37.79 ± 0.27 | 0.978 ± 0.011 | 45.71 ± 0.21 | 0.9954 ± 0.008 |
| W2AMSN-S | 11.37 | 14 | 37.12 ± 0.22 | 0.972 ± 0.009 | 38.30 ± 0.26 | 0.980 ± 0.008 | 46.43 ± 0.20 | 0.9961 ± 0.007 |
| FMEN | 3.80 | 11 | 37.22 ± 0.16 | 0.973 ± 0.008 | 38.40 ± 0.17 | 0.981 ± 0.007 | 46.58 ± 0.14 | 0.9963 ± 0.007 |
| CHARMS | 1.74 | 10 | 37.79 ± 0.11 | 0.973 ± 0.008 | 38.56 ± 0.11 | 0.981 ± 0.006 | 46.58 ± 0.10 | 0.9963 ± 0.006 |

Table 3.

Summary of the model characteristics and performances (PSNR and SSIM) after 4× downsampling and SRR enhancement. CHARMS shows statistically significant improvements over FMEN (p < 0.05) and much larger gains over heavier models like W2AMSN-S (p < 0.01), despite ~6× fewer parameters.

| Model | Parameters (M) | Train (h) | PSNR (IXI-T1w) | SSIM (IXI-T1w) | PSNR (IXI-T2w) | SSIM (IXI-T2w) | PSNR (HCP-T1w) | SSIM (HCP-T1w) |
|---|---|---|---|---|---|---|---|---|
| Bicubic | – | – | 26.17 ± 0.58 | 0.786 ± 0.030 | 26.92 ± 0.58 | 0.826 ± 0.025 | 31.57 ± 0.48 | 0.924 ± 0.024 |
| SRCNN | 0.82 | 0.5 | 29.18 ± 0.50 | 0.888 ± 0.028 | 30.63 ± 0.50 | 0.873 ± 0.022 | 33.96 ± 0.39 | 0.948 ± 0.021 |
| VDSR | 0.37 | 1.5 | 30.49 ± 0.45 | 0.909 ± 0.020 | 32.44 ± 0.46 | 0.890 ± 0.020 | 34.60 ± 0.36 | 0.954 ± 0.018 |
| EDSR | 1.51 | 2.3 | 31.48 ± 0.36 | 0.931 ± 0.018 | 32.53 ± 0.38 | 0.948 ± 0.012 | 36.10 ± 0.33 | 0.965 ± 0.011 |
| PAN | 0.92 | 2.7 | 31.76 ± 0.28 | 0.926 ± 0.016 | 32.27 ± 0.28 | 0.944 ± 0.011 | 35.90 ± 0.28 | 0.964 ± 0.009 |
| W2AMSN-S | 11.41 | 9.0 | 32.61 ± 0.22 | 0.939 ± 0.016 | 32.76 ± 0.11 | 0.953 ± 0.009 | 36.56 ± 0.20 | 0.968 ± 0.008 |
| FMEN | 3.95 | 7.0 | 32.72 ± 0.20 | 0.944 ± 0.010 | 32.92 ± 0.20 | 0.956 ± 0.007 | 36.81 ± 0.18 | 0.969 ± 0.007 |
| CHARMS | 1.89 | 7.0 | 33.27 ± 0.14 | 0.945 ± 0.010 | 32.97 ± 0.15 | 0.956 ± 0.007 | 36.65 ± 0.11 | 0.969 ± 0.007 |

These results confirm that CHARMS attains a favorable trade-off between reconstruction fidelity and computational efficiency. Its complementary convolutional and Transformer branches allow the network to preserve fine anatomical textures while modeling extended contextual relationships. By contrast, purely convolutional baselines (SRCNN, VDSR, EDSR, and PAN) either fail to recover global continuity or require deeper architectures with higher latency.

The IXI T1w dataset has served as a standard benchmark for MRI super-resolution since 2014, enabling direct longitudinal comparison of model performance across studies [35,43,44,45,46,47,51,70,71,72,73,74,75,76]. Figure 4 summarizes the evolution of reconstruction fidelity over this period, plotting reported PSNR (Figure 4a) and SSIM (Figure 4b) values for ×2 and ×4 upscaling from the literature, alongside linear regression trends. Both metrics exhibit a consistent, approximately linear improvement over time, reflecting ongoing advances in deep learning architectures and training strategies. The models evaluated in the current study, including CHARMS, closely align with these established trends. Notably, CHARMS reaches state-of-the-art fidelity levels for both scaling factors—matching or exceeding recent heavyweight baselines—while offering superior computational efficiency (~1.9 M parameters and ~30 GFLOPs versus substantially higher values for competing SOTA methods). This demonstrates that CHARMS achieves cutting-edge performance without relying on increased model complexity, highlighting the effectiveness of its lightweight hybrid design and attention regularization.

Figure 4.

Figure 4

Evolution of super-resolution performance on the IXI T1w dataset from 2014 to 2025 [35,43,44,45,46,47,51,70,71,72,73,74,75,76]. (a) Peak Signal-to-Noise Ratio (PSNR, dB) and (b) Structural Similarity Index (SSIM) for ×2 (black) and ×4 (red) upscaling factors. Solid lines represent linear regression fits to published state-of-the-art results compiled from the literature over this period, illustrating the steady progress in reconstruction fidelity. Individual data points (filled squares for ×2, filled circles for ×4) correspond to the validated performance of models evaluated in the current study, including CHARMS. CHARMS achieves state-of-the-art fidelity for both scaling factors while maintaining substantially lower computational complexity.

To assess the contribution of each component, an ablation study was conducted on the IXI T1w and T2w datasets at ×2 and ×4 scaling. The baseline model consisted of stacked RRAF blocks without channel shuffle, PCA, Transformer, or HFIR modules. Successive additions of these components (Table 4) show a progressive improvement in both PSNR and SSIM. Introducing channel shuffle alone increased PSNR by about 1 dB, highlighting its effectiveness in promoting inter-channel interaction. Adding the PCA module yielded an additional 0.1 dB gain and improved structural similarity, confirming the advantage of combined pixel- and channel-level calibration. The inclusion of the Transformer further enhanced performance by ≈0.02–0.03 dB, reflecting improved long-range feature modeling. The full CHARMS model, integrating all modules including HFIR, achieved the highest metrics across all datasets and scaling factors with only a marginal increase in model size (<2 MB).

Table 4.

Summary of ablation studies, detailing the effects of each module on the PSNR (dB) and SSIM in SRR-enhanced image quality.

Model | Baseline | +CS | +CS +PCA | +CS +PCA +Transformer | Full Model
×2 upscaling
Model size (MB) | 1.46 | 1.49 | 1.69 | 1.72 | 1.74
PSNR/SSIM (IXI-T1w) | 35.12/0.967 | 36.19/0.973 | 36.27/0.973 | 36.28/0.973 | 36.29/0.973
PSNR/SSIM (IXI-T2w) | 37.38/0.973 | 38.43/0.979 | 38.50/0.981 | 38.51/0.981 | 38.56/0.981
PSNR/SSIM (HCP-T1w) | 45.63/0.989 | 46.44/0.995 | 46.56/0.996 | 46.56/0.996 | 46.58/0.996
×4 upscaling
Model size (MB) | 1.61 | 1.64 | 1.84 | 1.87 | 1.89
PSNR/SSIM (IXI-T1w) | 28.66/0.869 | 29.37/0.889 | 29.42/0.890 | 29.44/0.890 | 29.47/0.891
PSNR/SSIM (IXI-T2w) | 29.56/0.896 | 30.39/0.909 | 30.41/0.916 | 30.44/0.916 | 30.47/0.916
PSNR/SSIM (HCP-T1w) | 35.81/0.946 | 36.53/0.967 | 36.60/0.968 | 36.63/0.968 | 36.65/0.969

4.2. Ablation Study

To further illustrate the effect of the proposed attention regularization, Figure 5 shows attention maps extracted from the MDDTA block before and after applying the regularization loss on representative IXI T1w slices. Without regularization (Figure 5a), attention is distributed more diffusely across the entire image, with limited selectivity toward structural boundaries. In contrast, after regularization (Figure 5b), attention becomes markedly more focused and structured, preferentially highlighting anatomically salient regions including cortical folds, sulci, and white-gray matter junctions. This sharper localization demonstrates that the regularization successfully suppresses redundant activations and encourages the model to efficiently direct its focus to informative features critical for high-fidelity super-resolution, contributing to the observed improvements in both quantitative metrics and perceptual quality.

Figure 5.

Figure 5

Visualization of attention maps from the Multi-Depthwise Dilated Transformer Attention (MDDTA) block on representative axial slices from the IXI T1w dataset. (a) Attention maps before attention regularization, showing relatively diffuse and less structured activation patterns. (b) Attention maps after regularization, demonstrating sharper and more focused attention concentrated along anatomically salient regions, such as cortical boundaries, sulci, and white-gray matter interfaces. Crossing black lines indicate the positions of the orthogonal cross-sections (axial, sagittal, and coronal views) displayed in each column.

Although attention regularization (AR) yields the smallest isolated PSNR improvement (~0.05 dB), it plays a crucial role in suppressing redundant attention patterns and stabilizing optimization. This is especially valuable in the resource-limited cross-field fine-tuning scenario, where AR contributes to consistent, high-fidelity 7T-like reconstructions without overfitting on the small twenty-subject dataset.
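One simple way to penalize diffuse, redundant attention maps, consistent with the focusing effect shown in Figure 5, is an entropy penalty on post-softmax attention weights. The sketch below is illustrative only; the entropy formulation is an assumption for exposition, not necessarily the exact regularizer used in CHARMS:

```python
import numpy as np

def attention_entropy_penalty(attn: np.ndarray, eps: float = 1e-12) -> float:
    """Mean Shannon entropy of row-normalized attention maps.

    attn: shape (heads, queries, keys), with rows summing to 1 (post-softmax).
    Adding this term, scaled by a small weight, to the training loss
    discourages diffuse maps and encourages focused, selective attention.
    """
    entropy = -np.sum(attn * np.log(attn + eps), axis=-1)  # per (head, query)
    return float(entropy.mean())

# A peaked attention row incurs a much smaller penalty than a uniform one.
uniform = np.full((1, 1, 8), 1 / 8)
peaked = np.array([[[0.93] + [0.01] * 7]])
print(attention_entropy_penalty(uniform) > attention_entropy_penalty(peaked))  # True
```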

Figure 6 illustrates the influence of network depth, parameterized by the number of RRAF and RLFE blocks. Performance saturated beyond a (4, 4) configuration, while deeper variants provided negligible gains at substantially higher computational cost. This observation guided the final model design used throughout subsequent experiments.

Figure 6.

Figure 6

Performance on the IXI dataset (T1w) across different model configurations.

Consistent with backbone depth saturation, additional experiments evaluated stacking multiple MDDTA blocks or varying their position. Stacking 2–3 blocks provided marginal PSNR gains (<0.01 dB) at ~15–30% higher computational cost, while earlier placement was less effective due to insufficient local feature maturity. The selected single late-stage MDDTA thus achieves the optimal balance for lightweight clinical MRI SR.

4.3. Cross-Field Validation Using Paired 3T/7T Datasets

To verify generalization beyond synthetic downsampling, CHARMS was further tested on two datasets of paired 3T/7T (PTT) MR images [17,18]. Models pre-trained on the IXI and HCP-YA datasets were fine-tuned using the aligned 3T inputs and 7T ground-truth targets. CHARMS successfully synthesized 7T-like images from clinical 3T acquisitions (Figure 7), achieving substantial quantitative gains over the original 3T images: an average PSNR improvement of ≈6 dB across T1w and T2w volumes (Table 5) and an SSIM increase of 0.12 (Table 6).

Table 5.

PSNR (dB) for the T1w and T2w image volumes acquired at 3T and the corresponding SR reconstructions, computed against the high-resolution 7T MRI data as reference.

Subject 3T (T1w) SR (T1w) 3T (T2w) SR (T2w)
1 27.310 33.330 27.359 28.475
2 28.639 34.660 32.265 38.286
3 27.927 33.947 31.938 37.958
4 26.724 32.744 30.251 36.271
5 27.174 33.195 30.546 36.566
6 27.752 33.773 31.982 38.003
7 27.717 33.737 31.956 37.976
8 27.622 33.643 30.472 36.493
9 28.394 34.415 31.208 37.229
10 27.213 33.234 31.952 37.973
Mean 27.647 33.668 30.993 36.523
Std 0.579 0.579 1.476 2.923

Table 6.

SSIM for the T1w and T2w image volumes acquired at 3T and the generated SR images, computed against the corresponding high-resolution 7T MRI data as reference.

Subject 3T (T1w) SR (T1w) 3T (T2w) SR (T2w)
1 0.821 0.938 0.726 0.727
2 0.870 0.954 0.872 0.948
3 0.830 0.945 0.851 0.933
4 0.807 0.933 0.783 0.890
5 0.832 0.947 0.822 0.907
6 0.844 0.949 0.829 0.920
7 0.816 0.944 0.816 0.914
8 0.836 0.949 0.817 0.911
9 0.854 0.951 0.840 0.925
10 0.818 0.939 0.825 0.914
Mean 0.833 0.945 0.818 0.899
Std 0.019 0.007 0.040 0.062

Qualitatively, Figure 7 presents representative axial T1w slices before and after super-resolution. The super-resolved images display markedly sharper cortical gyri, crisper gray-white matter boundaries, and enhanced tissue contrast, closely approximating the visual appearance of true 7T acquisitions. Quantitative consistency is equally striking. The swarm plots in Figure 8 and the summary statistics in Table 6 reveal that the per-slice SSIM distribution for T1w data becomes dramatically narrower after super-resolution, with the standard deviation reduced by >50%. This collapse of variance indicates highly reliable structure and texture recovery across diverse anatomical regions.

These results demonstrate that CHARMS effectively bridges the resolution gap between clinical 3T and ultra-high field 7T MRI without requiring paired 7T data during deployment. The ability to upgrade standard-of-care 3T scans to near-7T quality highlights the framework’s strong generalization across magnetic field strengths, scanner vendors, and acquisition protocols—paving the way for its application to low-field and portable MRI systems, where hardware-imposed resolution limitations remain a major challenge.

To further quantify the enhancement in image quality, we evaluated Signal-to-Noise Ratio (SNR) and gray-white matter Contrast-to-Noise Ratio (CNR) on the PTT2 test set. As shown in Table 7, CHARMS SR images improve mean SNR over native 3T by ~0.8 for T1w (8.994 vs. 8.196) and ~0.9 for T2w (5.697 vs. 4.802), approaching native 7T levels (9.393 for T1w, 6.198 for T2w). Similarly, Table 8 reveals CNR gains of ~0.1 for T1w (2.209 vs. 2.103) and ~0.7 for T2w (2.631 vs. 1.901), closely matching 7T values (2.297 for T1w, 2.825 for T2w). These metrics objectively confirm the "near-7T quality" of the SR outputs in terms of reduced noise and improved tissue differentiation (see Figure 7 and Figure 8).
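The SNR and CNR figures in Tables 7 and 8 follow common ROI-based definitions, i.e., mean tissue signal over background noise standard deviation, and absolute inter-tissue mean difference over the same noise estimate. The sketch below illustrates these definitions; the specific ROI placement and noise-estimation choices are assumptions for illustration:

```python
import numpy as np

def snr(signal_roi: np.ndarray, background_roi: np.ndarray) -> float:
    """SNR: mean tissue signal divided by background noise standard deviation."""
    return float(signal_roi.mean() / background_roi.std())

def cnr(roi_a: np.ndarray, roi_b: np.ndarray, background_roi: np.ndarray) -> float:
    """CNR between two tissues (e.g., gray vs. white matter)."""
    return float(abs(roi_a.mean() - roi_b.mean()) / background_roi.std())

# Synthetic example: GM ~ 0.6, WM ~ 0.8, background noise std ~ 0.05.
rng = np.random.default_rng(1)
gm = 0.6 + rng.normal(0, 0.05, 1000)
wm = 0.8 + rng.normal(0, 0.05, 1000)
bg = rng.normal(0, 0.05, 1000)
print(f"SNR(WM) ~ {snr(wm, bg):.1f}, CNR(GM/WM) ~ {cnr(gm, wm, bg):.1f}")
```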

Table 7.

The SNR for the T1w and T2w image volumes acquired at 3T, 7T, and the reconstructed SR images based on the fine-tuned model using the paired PTT1 dataset.

Subject 3T (T1w) 7T (T1w) SR (T1w) 3T (T2w) 7T (T2w) SR (T2w)
1 8.007 9.013 8.557 4.301 5.983 5.481
2 8.887 10.792 10.7 5.695 7.331 6.689
3 8.478 10.368 9.628 5.353 6.501 5.956
4 7.451 7.458 7.588 4.119 5.071 4.542
5 7.995 7.576 7.963 4.133 5.497 5.105
6 8.311 10.153 9.375 5.147 6.368 5.791
7 8.159 9.771 8.879 4.961 6.285 5.779
8 8.055 9.277 8.642 4.652 6.174 5.725
9 8.616 10.729 10.301 5.466 6.888 6.522
10 8.001 8.791 8.305 4.192 5.878 5.389
Mean 8.196 9.393 8.994 4.802 6.198 5.697
Std 0.401 1.201 1.003 0.601 0.652 0.631

Table 8.

The CNR between gray and white matter brain tissues for the T1w and T2w image volumes acquired at 3T, 7T, and the reconstructed SR images based on the fine-tuned model using the paired PTT1 dataset.

Subject 3T (T1w) 7T (T1w) SR (T1w) 3T (T2w) 7T (T2w) SR (T2w)
1 1.916 1.915 2.081 1.551 2.823 2.357
2 3.193 3.537 3.46 2.755 4.676 3.646
3 2.559 2.557 2.492 2.431 3.321 2.520
4 1.234 1.451 1.205 1.042 1.623 1.578
5 1.393 1.679 1.307 1.298 1.722 1.742
6 2.476 2.305 2.42 2.015 2.185 2.603
7 1.953 2.285 2.21 1.986 3.082 3.398
8 1.923 2.011 2.173 1.842 3.001 2.715
9 2.607 3.421 2.718 2.549 4.174 3.416
10 1.775 1.81 2.025 1.539 1.642 2.332
Mean 2.103 2.297 2.209 1.901 2.825 2.631
Std 0.601 0.702 0.652 0.559 1.201 1.066

Figure 7.

Figure 7

A representative T1w axial slice of the brain, together with a 4-fold zoomed-in view of the region marked by the red box, showcasing SR reconstruction of 3T MRI data by the proposed CHARMS network using the paired PTT2 dataset.

Figure 8.

Figure 8

Swarm boxplots of the PSNR (a) and SSIM (b) metrics for the T1w and T2w volumes and their corresponding SR reconstructions from the fine-tuned framework using the PTT datasets. Colors distinguish the different datasets.

In addition to reconstruction accuracy, CHARMS delivers substantial computational advantages. With ≈1.9 million parameters and ≈30 GFLOPs per 256 × 256 input, it is significantly lighter than W2AMSN-S (~11.4 M parameters) and FMEN (~4 M) while producing higher PSNR and SSIM. Training time was ≈7 h per configuration, compared with 14 h for W2AMSN-S. Inference on a single RTX 4090 required ≈11 ms per slice. For a typical 3D brain volume (e.g., 176 slices in the PTT datasets), accounting for slices without brain tissue via foreground masking, the total inference time is approximately 1.6–1.9 s on an RTX 4090 GPU, enabling near-real-time processing in clinical reconstruction pipelines.
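The per-volume figure follows directly from the per-slice latency; as a back-of-the-envelope check (the foreground-slice fractions below are assumptions chosen to bracket the reported 1.6–1.9 s range):

```python
# Back-of-the-envelope check of the per-volume inference time.
ms_per_slice = 11     # measured per-slice latency on RTX 4090 (reported above)
total_slices = 176    # typical PTT volume (reported above)

# Assumed fractions of slices containing brain tissue after foreground masking.
for fraction in (0.82, 0.98):
    seconds = total_slices * fraction * ms_per_slice / 1000
    print(f"{fraction:.0%} foreground slices -> {seconds:.2f} s per volume")
```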

The observed efficiency stems from the combined use of depthwise separable convolutions, channel shuffle, and attention regularization. Depthwise operations reduce parameter count and memory usage, while the attention regularization strategy limits redundancy in learned attention maps, improving convergence stability and generalization. The resulting architecture thus achieves the accuracy of more complex Transformer networks with the speed of lightweight CNNs—an essential property for integration into resource-constrained imaging systems or embedded sensor pipelines.
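The parameter savings from depthwise separable convolutions can be quantified directly: a standard k × k convolution costs k·k·C_in·C_out weights, whereas a depthwise k × k convolution followed by a 1 × 1 pointwise convolution costs only k·k·C_in + C_in·C_out. A quick count for a hypothetical 3 × 3, 64-channel layer:

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (biases omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

# For a 3 x 3 layer mapping 64 -> 64 channels:
std = conv_params(3, 64, 64)                  # 36864 weights
dws = depthwise_separable_params(3, 64, 64)   # 4672 weights
print(std, dws, f"{std / dws:.1f}x fewer weights")
```

At typical channel widths this is roughly an order-of-magnitude reduction per layer, which is the main reason the full CHARMS model stays under 2 M parameters.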

5. Discussion

The proposed CHARMS framework demonstrates that a carefully engineered CNN-Transformer hybrid can achieve an effective balance between reconstruction accuracy, computational efficiency, and architectural interpretability in MRI SR reconstruction. By pairing convolutional local feature extraction with Transformer-based global context modeling, CHARMS consistently outperformed both classical CNN models and pure Transformer architectures across multiple datasets and scaling factors. The combination of RRAF [11] and MDDTA [14] proved particularly powerful in preserving fine-grained textures while maintaining global anatomical coherence in the reconstructed images.

A central contributor to the performance of CHARMS is its attention regularization mechanism [7,23,37,52,70,77,78], which suppresses redundant activation patterns and stabilizes optimization. Existing hybrid SR models such as SwinIR [28] and Restormer [27] often rely on densely parameterized attention blocks that encode overlapping contextual information, resulting in diminishing returns when scaling to high-resolution MRI. In contrast, CHARMS encourages diverse and complementary attention maps, enabling sharper delineation of tissue interfaces, improved structural consistency, and reduced over-smoothing—limitations frequently observed in CNN-based SR frameworks such as EDSR [6] and PAN [12]. By referencing 2023–2025 studies [35,42,43,44,45,46,47,51], CHARMS’ novel attention regularization and MDDTA block fill gaps in stable, low-data adaptation, enabling near-7T quality from 3T inputs with minimal overhead.

The marginal direct fidelity gain from AR in standard training is outweighed by its stabilization effects and negligible computational cost, making it a worthwhile addition for reliable performance in diverse clinical settings, including low-data adaptation regimes.

The hybrid design of CHARMS provides a principled advantage over simpler pure architectures: ablation results (Table 4) show that removing the Transformer component degrades PSNR/SSIM by ~0.02–0.06 dB, underscoring its role in enhancing global coherence without the overhead of full Transformer models. In brain MRI, where scans exhibit sparse, hierarchical patterns, this combination yields sharper, more anatomically consistent reconstructions than optimized pure CNNs (e.g., outperforming EDSR by ~0.3 dB PSNR) or pure Transformers, while prioritizing efficiency for practical deployment.

Although diffusion-based models [41,42,43] and heavily parameterized vision/state-space-model hybrids [40] have shown excellent perceptual quality in recent studies, they typically require iterative sampling (10–100× slower) or >10–50 million parameters. Because the primary goal of CHARMS is lightweight single-pass inference suitable for clinical and resource-constrained environments, direct head-to-head comparison with these fundamentally different paradigms was outside the scope of this study.

Addressing RQ1, CHARMS’ hybrid architecture (Section 3.1, Section 3.2, Section 3.3, Section 3.4 and Section 3.5) outperforms baselines like EDSR and FMEN by 0.1–0.6 dB PSNR with ~40% faster inference (Table 2 and Table 3), validated on IXI and HCP-YA datasets. For RQ2, attention regularization stabilizes optimization, yielding robust gains in ablation studies (Table 4) and cross-contrast performance. RQ3 is demonstrated through cross-field adaptation (Section 4.3), achieving ~6 dB PSNR and 0.12 SSIM improvements over native 3T, with SNR/CNR nearing 7T levels (Table 5, Table 6, Table 7 and Table 8), paving the way for reduced scan times in clinical workflows.

In the cross-field adaptation results (Table 5 and Table 6), we observed greater variability in performance metrics (e.g., larger standard deviations in PSNR and SSIM) for T2w reconstructions compared to T1w results. This increased variability in T2w contrasts is likely attributable to inherent tissue contrast characteristics: T2w imaging is more sensitive to fluid content, edema, and subtle tissue–water variations, which can introduce higher slice-to-slice heterogeneity in anatomical detail, noise patterns, and edge sharpness, particularly in downsampled inputs. In contrast, T1w images typically exhibit more consistent gray-white matter differentiation and lower sensitivity to such fluid-related variations, leading to more stable reconstruction outcomes. Additionally, the paired 3T/7T datasets (PTT1/PTT2) contain fewer T2w volumes relative to T1w in some splits, potentially contributing to slightly reduced robustness during fine-tuning. These observations align with prior studies noting greater challenges in modeling T2w contrasts due to their pronounced sensitivity to relaxation properties and potential artifacts. Future multi-contrast training strategies could further mitigate this by incorporating explicit T2w-specific regularization or additional data augmentation.

Despite its strengths, CHARMS currently operates in a 2D slice-wise manner; extending it to native 3D processing [70,79] would better exploit through-plane continuity and further improve consistency. Because the lightweight 2D design does not explicitly model through-plane continuity, minor inconsistencies can arise across slices in the reconstructed 3D volume. Figure 9 illustrates this limitation on a representative ×4 super-resolved IXI T1w scan. Although the axial view (the native processing plane) appears smooth and artifact-free, reformatted sagittal and coronal views reveal subtle horizontal banding and slight intensity variations between adjacent slices (visible as faint striped patterns). These discontinuities, while minor and not clinically disruptive in most cases, could be mitigated by extending the model to full 3D processing or by incorporating inter-slice consistency constraints in future work.

Figure 9.

Figure 9

Qualitative illustration of the limitation of the 2D slice-wise processing in CHARMS. The (left) panel shows an axial view of a representative super-resolved T1w volume from the IXI dataset. The (middle and right) panels display reconstructed sagittal and coronal views, respectively, obtained by resampling the 3D volume along the orthogonal planes indicated by the green crosshairs. Horizontal banding artifacts and subtle slice-to-slice intensity inconsistencies are visible in the reformatted views (most evident in the sagittal and coronal planes), highlighting the lack of explicit through-plane continuity modeling in the current 2D approach.

Additionally, while supervised training with paired data yielded excellent results, incorporating self-supervised [80,81,82,83,84,85,86,87] or physics-informed strategies [43,88,89,90,91,92,93] could reduce dependency on high-quality ground-truth pairs and enhance robustness across scanners and pathologies.

While VGG-based perceptual losses are common in natural image super-resolution, we deliberately omitted them to prioritize quantitative fidelity and anatomical accuracy in clinical MRI reconstructions, relying instead on L1 and structural similarity terms to achieve high PSNR/SSIM and artifact-free results.
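A combined L1 and structural-similarity objective of the kind described above can be sketched as follows; the single-window SSIM (no sliding window) and the weight `alpha` are simplifications and assumptions for illustration, not the exact loss used in CHARMS:

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """SSIM computed over the whole image as a single window (simplified)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def training_loss(pred: np.ndarray, target: np.ndarray, alpha: float = 0.84) -> float:
    """Weighted sum of (1 - SSIM) and mean absolute error; alpha is hypothetical."""
    l1 = np.abs(pred - target).mean()
    return alpha * (1 - global_ssim(pred, target)) + (1 - alpha) * l1
```

Production implementations typically use a windowed SSIM (e.g., 11 × 11 Gaussian windows) and tune the weighting empirically; the sketch only conveys the structure of the objective.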

Although primary benchmarks employ isotropic scaling, the cross-field adaptation from clinical (often anisotropic) 3T acquisitions to near-isotropic 7T quality directly demonstrates CHARMS’ applicability to real-world anisotropic MRI, a common scenario in routine scanning protocols.

While direct comparisons to other models fine-tuned on the same limited cross-field data would further isolate transfer efficiency, the substantial gains over native 3T images—consistent with evaluation protocols in prior 3T-to-7T synthesis studies—validate CHARMS’ practical value for enhancing routine clinical acquisitions.

The added SNR and CNR analyses (Table 7 and Table 8) provide rigorous quantitative support for the ‘near-7T quality’ claim, demonstrating that CHARMS not only enhances resolution but also boosts signal characteristics critical for diagnostic confidence in clinical settings.

Clinically, the combination of high fidelity, low latency (~11 ms/slice and ~1.6–1.9 s per 3D volume on an RTX 4090), and a small memory footprint positions CHARMS as an ideal candidate for integration into online reconstruction pipelines, low-field/portable systems [94,95,96,97,98], and accelerated protocols, ultimately helping reduce scan time, minimize motion artifacts, and improve diagnostic confidence without additional hardware. While CHARMS demonstrates strong performance on healthy and near-healthy brain scans across T1w/T2w contrasts, its generalization to brains with gross pathologies (e.g., large tumors or significant focal lesions causing substantial anatomical distortion) remains to be fully validated, as such cases may introduce domain shifts that challenge the learned feature representations. Future work will explore fine-tuning or domain-adaptation strategies on pathological datasets to enhance robustness in clinical neuro-oncology applications.

In summary, CHARMS represents a significant step toward practical, clinically viable deep learning SR for MRI by delivering top-tier lightweight performance today while providing a modular foundation for future 3D and unsupervised extensions. Deploying CHARMS in real-world clinical settings is feasible thanks to its lightweight design (~11 ms/slice, ~1.6–1.9 s/volume on an RTX 4090), enabling integration into PACS or online reconstruction via PyTorch/ONNX export. Remaining challenges include DICOM compatibility, validation on diverse scanners and pathologies, and regulatory approval (e.g., FDA clearance for AI tools). Required extensions include 3D volumetric processing for through-plane continuity, self-supervised fine-tuning on unlabeled data, and edge deployment on low-power GPUs for portable MRI.

While the current evaluation relies on established quantitative metrics (PSNR, SSIM, SNR, and CNR) and side-by-side qualitative visual assessment by the authors (demonstrating sharper cortical boundaries, improved gray-white contrast, and reduced blurring without obvious over-sharpening or hallucination-like artifacts), a formal blinded subjective scoring by expert radiologists (e.g., using a Likert scale for diagnostic confidence, artifact presence, and perceived resolution gain) was not performed in this study. Such an evaluation would be valuable for future clinical validation, particularly when extending CHARMS to pathological cases or real-world heterogeneous clinical scans. In the present work, the absence of gross hallucinatory artifacts in the reconstructions is supported by the high structural fidelity (SSIM gains of ~0.12) and the preservation of anatomically plausible tissue interfaces across all evaluated slices.

Recent studies in brain MRI deep learning, such as Samarasinghe et al. (2025) [99], have advanced self-supervised segmentation and edge detection on multi-modal MRI using architectures like dual-decoder 3D-UNet with SimSiam pretraining. While these approaches effectively reduce annotation dependency for tumor boundary tasks, they differ fundamentally from our focus on super-resolution reconstruction. CHARMS uniquely contributes a lightweight (~1.9 M parameters) CNN-Transformer hybrid tailored for efficient MRI SR, incorporating Reverse Residual Attention Fusion, multi-depthwise dilated Transformer blocks, and novel attention regularization to achieve superior PSNR/SSIM gains and ~40% faster inference compared to prior lightweight SR models, while enabling practical cross-field 3T-to-near-7T quality enhancement—features not addressed in segmentation-oriented works.

6. Conclusions

This study presented CHARMS, a CNN-Transformer hybrid framework with attention regularization for SR reconstruction of MR images. By combining convolutional feature extraction with the global contextual modeling of Transformers, CHARMS achieves superior reconstruction accuracy and visual fidelity while maintaining exceptional computational efficiency. Experimental evaluations across multiple MRI datasets confirmed its consistent performance advantage over existing CNN- and Transformer-based methods. The framework’s lightweight design and modular structure make it suitable for integration into routine MRI reconstruction pipelines, offering a practical solution for accelerating image acquisition and improving diagnostic quality. Future work will extend CHARMS to 3D volumetric SR and self-supervised domain adaptation to enhance generalization across scanners, field strength, and contrasts. Overall, CHARMS represents a promising step toward intelligent, efficient, and clinically adaptable SR reconstruction of MR images.

Acknowledgments

The authors gratefully acknowledge stimulating discussions with H. Zhang on MRI data curation and his valuable assistance with the software.

Appendix A. Glossary of Key Acronyms

To improve readability, this appendix provides a summary table of the main acronyms used for the architectural components in the CHARMS framework.

CHARMS Full Name Brief Description
CHARMS CNN-Transformer Hybrid with Attention Regularization for MRI Super-Resolution The proposed lightweight model for MRI super-resolution, combining CNN and Transformer elements with attention regularization.
RRAF Reverse Residual Attention Fusion Backbone module for hierarchical local feature extraction, integrating residual learning and attention.
RLFE Residual Local Feature Extraction Units within RRAF blocks, consisting of convolutions, ReLU activations, and ESA for feature encoding.
ESA Enhanced Spatial Attention Spatial attention operator that highlights high-frequency regions using dilated convolutions.
PCA Pixel–Channel Attention Mechanism merging pixel- and channel-level attention for fine-grained feature recalibration.
MDDTA Multi-Depthwise Dilated Transformer Attention Transformer block for efficient long-range dependency modeling with linear complexity.
GDDFN Gated Depthwise Dilated Feed-Forward Network Feed-forward component in the Transformer module, enhancing nonlinearity via gated convolutions.
HFIR High-Frequency Information Refinement Refinement module post-upsampling to restore high-frequency details and suppress artifacts.

Author Contributions

Conceptualization: X.L. and T.-Q.L., Methodology: X.L. and H.S., Software: X.L. and H.S., Validation: T.-Q.L. and X.L., Formal Analysis: T.-Q.L. and X.L., Visualization: X.L. and H.S., Writing—Original Draft: T.-Q.L. and X.L., Writing—Review and Editing: T.-Q.L., X.L. and H.S., Supervision: T.-Q.L. and X.L., Project Administration: T.-Q.L., Funding Acquisition: T.-Q.L. and X.L. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The study was based on openly accessible, public-domain datasets. Code for our preprocessing and the machine learning framework will be made available on GitHub.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding Statement

This work was supported in part by the Fujian Medical University Start-up Fund for Scientific Research (Grant No. XRCZX203013), the Joint China–Sweden Mobility program from STINT (Dnr: CH2019-8397), and a grant from the Zhejiang Natural Science Foundation of China (No. LY23F010005).

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1.Li Y., Sixou B., Peyrin F. A Review of the Deep Learning Methods for Medical Images Super Resolution Problems. IRBM. 2021;42:120–133. doi: 10.1016/j.irbm.2020.08.004. [DOI] [Google Scholar]
  • 2.Yang H., Wang Z., Liu X., Li C., Xin J., Wang Z. Deep learning in medical image super resolution: A review. Appl. Intell. 2023;53:20891–20916. doi: 10.1007/s10489-023-04566-9. [DOI] [Google Scholar]
  • 3.Ji Z., Zou B., Kui X., Liu J., Zhao W., Zhu C., Dai P., Dai Y. Deep learning-based magnetic resonance image super-resolution: A survey. Neural Comput. Appl. 2024;36:12725–12752. doi: 10.1007/s00521-024-09890-w. [DOI] [Google Scholar]
  • 4.Dong C., Loy C.C., He K., Tang X. Learning a Deep Convolutional Network for Image Super-Resolution. In: Fleet D., Pajdla T., Schiele B., Tuytelaars T., editors. Computer Vision—ECCV 2014. Springer International Publishing; Berlin/Heidelberg, Germany: 2014. pp. 184–199. [Google Scholar]
  • 5.Kim J., Lee J.K., Lee K.M. Accurate image super-resolution using very deep convolutional networks; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  • 6.Lim B., Son S., Kim H., Nah S., Lee K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution; Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); Honolulu, HI, USA. 21–26 July 2017; pp. 1132–1140. [Google Scholar]
  • 7.Zhang Y., Li K., Li K., Wang L., Zhong B., Fu Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In: Ferrari V., Hebert M., Sminchisescu C., Weiss Y., editors. Computer Vision—ECCV 2018. Springer International Publishing; Berlin/Heidelberg, Germany: 2018. pp. 294–310. [Google Scholar]
  • 8.Howard A.G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T., Andreetto M., Adam H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv. 2017 doi: 10.48550/arXiv.1704.04861.1704.04861 [DOI] [Google Scholar]
  • 9.Zhang X., Zhou X., Lin M., Sun J., Recognition P. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices; Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  • 10.Mehta S., Rastegari M., Caspi A., Shapiro L., Hajishirzi H. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. Springer International Publishing; Cham, Switzerland: 2018. pp. 561–580. [Google Scholar]
  • 11.Wang Z., Xie X., Yang J., Song X. RA-Net: Reverse attention for generalizing residual learning. Sci. Rep. 2024;14:12771. doi: 10.1038/s41598-024-63623-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Zhao H., Kong X., He J., Qiao Y., Dong C. Efficient Image Super-Resolution Using Pixel Attention; Proceedings of the ECCV Workshops; Glasgow, UK. 23–28 August 2020. [Google Scholar]
  • 13.Liu J., Zhang W., Tang Y., Tang J., Wu G. Residual Feature Aggregation Network for Image Super-Resolution; Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA. 13–19 June 2020; pp. 2359–2368. [Google Scholar]
  • 14.Shen L., Qin F., Zhang Y., Wang Q., Hu B., Zhao W. MATNet: MRI Super-Resolution with Multiple Attention Mechanisms; Proceedings of the 2023 International Conference on New Trends in Computational Intelligence (NTCI); Qingdao, China. 3–5 November 2023; pp. 171–180. [Google Scholar]
  • 15.Wood D.A., Kafiabadi S., Busaidi A.A., Guilhem E., Montvila A., Lynch J., Townend M., Agarwal S., Mazumder A., Barker G.J., et al. Accurate brain-age models for routine clinical MRI examinations. Neuroimage. 2022;249:118871. doi: 10.1016/j.neuroimage.2022.118871. [DOI] [PubMed] [Google Scholar]
  • 16.Glasser M.F., Smith S.M., Marcus D.S., Andersson J.L., Auerbach E.J., Behrens T.E., Coalson T.S., Harms M.P., Jenkinson M., Moeller S., et al. The Human Connectome Project’s neuroimaging approach. Nat. Neurosci. 2016;19:1175–1187. doi: 10.1038/nn.4361. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Chu L., Ma B., Dong X., He Y., Che T., Zeng D., Zhang Z., Li S. A paired dataset of multi-modal MRI at 3 Tesla and 7 Tesla with manual hippocampal subfield segmentations. Sci. Data. 2025;12:260. doi: 10.1038/s41597-025-04586-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Chen X., Qu L., Xie Y., Ahmad S., Yap P.T. A paired dataset of T1- and T2-weighted MRI at 3 Tesla and 7 Tesla. Sci. Data. 2023;10:489. doi: 10.1038/s41597-023-02400-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Chen J., Sasaki H., Lai H., Su Y., Liu J., Wu Y., Zhovmer A., Combs C.A., Rey-Suarez I., Chang H.Y., et al. Three-dimensional residual channel attention networks denoise and sharpen fluorescence microscopy image volumes. Nat. Methods. 2021;18:678–687. doi: 10.1038/s41592-021-01155-x. [DOI] [PubMed] [Google Scholar]
  • 20.Woo S., Park J., Lee J.-Y., Kweon I.-S. CBAM: Convolutional Block Attention Module. arXiv. 2018 doi: 10.48550/arXiv.1807.06521. [DOI] [Google Scholar]
  • 21.Liu J., Tang J., Wu G. Residual Feature Distillation Network for Lightweight Image Super-Resolution. Springer International Publishing; Cham, Switzerland: 2020. pp. 41–55. [Google Scholar]
  • 22.Liu J.-J., Hou Q., Cheng M.-M., Wang C., Feng J. Improving Convolutional Networks with Self-Calibrated Convolutions; Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA. 13–19 June 2020; pp. 10093–10102. [Google Scholar]
  • 23.Vaswani A., Shazeer N.M., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser L., Polosukhin I. Attention is All you Need; Proceedings of the Neural Information Processing Systems; Long Beach, CA, USA. 4–9 December 2017. [Google Scholar]
  • 24.Dosovitskiy A., Beyer L., Kolesnikov A., Weissenborn D., Zhai X., Unterthiner T., Dehghani M., Minderer M., Heigold G., Gelly S., et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv. 2020 doi: 10.48550/arXiv.2010.11929. [Google Scholar]
  • 25.Li G., Lv J., Tian Y., Dou Q., Wang C., Xu C., Qin J. Transformer-empowered Multi-scale Contextual Matching and Aggregation for Multi-contrast MRI Super-resolution; Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA. 18–24 June 2022; pp. 20604–20613. [Google Scholar]
  • 26.Mahapatra D., Ge Z. MR Image Super Resolution by Combining Feature Disentanglement CNNs and Vision Transformers. In: Konukoglu E., Menze B., Venkataraman A., Baumgartner C., Dou Q., Albarqouni S., editors. Proceedings of the 5th International Conference on Medical Imaging with Deep Learning; Zurich, Switzerland. 6–8 July 2022; Cambridge, MA, USA: PMLR: Proceedings of Machine Learning Research; 2022. pp. 858–878. [Google Scholar]
  • 27.Zamir S.W., Arora A., Khan S., Hayat M., Khan F.S., Yang M.H. Restormer: Efficient Transformer for High-Resolution Image Restoration; Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA. 18–24 June 2022; Washington, DC, USA: IEEE Computer Society; 2022. pp. 5718–5729. [Google Scholar]
  • 28.Liang J., Cao J., Sun G., Zhang K., Van Gool L., Timofte R. SwinIR: Image Restoration Using Swin Transformer; Proceedings of the ICCVW; Montreal, BC, Canada. 11–17 October 2021; New York, NY, USA: IEEE; 2021. pp. 1833–1844. [Google Scholar]
  • 29.Liu Y., Zhang M., Zhang W., Hou B., Liu D., Lian H., Jiang B. Flexible Alignment Super-Resolution Network for Multi-Contrast MRI. arXiv. 2022 doi: 10.48550/arXiv.2210.03460. [Google Scholar]
  • 30.Wang J., Zou Y., Wu H. Image super-resolution method based on attention aggregation hierarchy feature. Vis. Comput. 2023;40:2655–2666. doi: 10.1007/s00371-023-02968-x. [DOI] [Google Scholar]
  • 31.Lyu J., Li G., Wang C., Cai Q., Dou Q., Zhang D., Qin J. Multicontrast MRI Super-Resolution via Transformer-Empowered Multiscale Contextual Matching and Aggregation. IEEE Trans. Neural Netw. Learn. Syst. 2023;35:12004–12014. doi: 10.1109/TNNLS.2023.3250491. [DOI] [PubMed] [Google Scholar]
  • 32.Huang S., Liu X., Tan T., Hu M., Wei X., Chen T., Sheng B. TransMRSR: Transformer-based self-distilled generative prior for brain MRI super-resolution. Vis. Comput. 2023;39:3647–3659. doi: 10.1007/s00371-023-02938-3. [DOI] [Google Scholar]
  • 33.He K., Gan C., Li Z., Rekik I., Yin Z., Ji W., Gao Y., Wang Q., Zhang J., Shen D. Transformers in medical image analysis. Intell. Med. 2023;3:59–78. doi: 10.1016/j.imed.2022.07.002. [DOI] [Google Scholar]
  • 34.Meng F., Guo Y., He W., Xu Z. Ultra-Low Field Magnetic Resonance Image Enhancement based on Deep-Learning Method; Proceedings of the 2023 2nd International Conference on Health Big Data and Intelligent Healthcare (ICHIH); Zhuhai, China. 27–29 October 2023; pp. 39–42. [Google Scholar]
  • 35.Bischoff L.M., Peeters J.M., Weinhold L., Krausewitz P., Ellinger J., Katemann C., Isaak A., Weber O.M., Kuetting D., Attenberger U., et al. Deep Learning Super-Resolution Reconstruction for Fast and Motion-Robust T2-weighted Prostate MRI. Radiology. 2023;308:e230427. doi: 10.1148/radiol.230427. [DOI] [PubMed] [Google Scholar]
  • 36.Yang G., Zhang L., Zhou M., Liu A., Chen X., Xiong Z., Wu F. Model-Guided Multi-Contrast Deep Unfolding Network for MRI Super-resolution Reconstruction; Proceedings of the 30th ACM International Conference on Multimedia; Lisbon, Portugal. 10–14 October 2022; pp. 3974–3982. [Google Scholar]
  • 37.Wang H., Hu X., Zhao X., Zhang Y. Wide Weighted Attention Multi-Scale Network for Accurate MR Image Super-Resolution. IEEE Trans. Circuits Syst. Video Technol. 2022;32:962–975. doi: 10.1109/TCSVT.2021.3070489. [DOI] [Google Scholar]
  • 38.Zhao C., Dewey B.E., Pham D.L., Calabresi P.A., Reich D.S., Prince J.L. SMORE: A Self-Supervised Anti-Aliasing and Super-Resolution Algorithm for MRI Using Deep Learning. IEEE Trans. Med. Imaging. 2021;40:805–817. doi: 10.1109/TMI.2020.3037187. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Rakotonirina N.C., Rasoanaivo A. ESRGAN+: Further Improving Enhanced Super-Resolution Generative Adversarial Network; Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Barcelona, Spain. 4–8 May 2020; pp. 3637–3641. [Google Scholar]
  • 40.Li Y., Yang M., Bian T., Wu H. MRI Super-Resolution Analysis via MRISR: Deep Learning for Low-Field Imaging. Information. 2024;15:655. doi: 10.3390/info15100655. [DOI] [Google Scholar]
  • 41.Wang R., Cao Z., He Y., Liu J., Shi F., Shen D. Clinical Brain MRI Super-Resolution with 2D Slice-Wise Diffusion Model. Springer Nature; Cham, Switzerland: 2025. pp. 166–176. [Google Scholar]
  • 42.Zhao K., Pang K., Hung A.L.Y., Zheng H., Yan R., Sung K. MRI Super-Resolution With Partial Diffusion Models. IEEE Trans. Med. Imaging. 2025;44:1194–1205. doi: 10.1109/TMI.2024.3483109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Safari M., Wang S., Eidex Z., Li Q., Qiu R.L.J., Middlebrooks E.H., Yu D.S., Yang X. MRI super-resolution reconstruction using efficient diffusion probabilistic model with residual shifting. Phys. Med. Biol. 2025;70:125008. doi: 10.1088/1361-6560/ade049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Chen W., Wu S., Wang S., Li Z., Yang J., Yao H., Tian Q., Song X. Multi-contrast image super-resolution with deformable attention and neighborhood-based feature aggregation (DANCE): Applications in anatomic and metabolic MRI. Med. Image Anal. 2025;99:103359. doi: 10.1016/j.media.2024.103359. [DOI] [PubMed] [Google Scholar]
  • 45.Yoon D., Myong Y., Kim Y.G., Sim Y., Cho M., Oh B.M., Kim S. Latent diffusion model-based MRI superresolution enhances mild cognitive impairment prognostication and Alzheimer’s disease classification. Neuroimage. 2024;296:120663. doi: 10.1016/j.neuroimage.2024.120663. [DOI] [PubMed] [Google Scholar]
  • 46.Georgescu M.-I., Ionescu R.T., Miron A.-I., Savencu O., Ristea N.-C., Verga N., Khan F.S. Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; Waikoloa, HI, USA. 2–7 January 2023; New York, NY, USA: IEEE; 2023. [Google Scholar]
  • 47.Wang Y., Hu H., Yu S., Yang Y., Guo Y., Song X., Chen F., Liu Q. A unified hybrid transformer for joint MRI sequences super-resolution and missing data imputation. Phys. Med. Biol. 2023;68:135006. doi: 10.1088/1361-6560/acdc80. [DOI] [PubMed] [Google Scholar]
  • 48.Zhang M., Xia C., Tang J., Yao L., Hu N., Li J., Peng W., Hu S., Ye Z., Zhang X., et al. Evaluation of high-resolution pituitary dynamic contrast-enhanced MRI using deep learning-based compressed sensing and super-resolution reconstruction. Eur. Radiol. 2025;35:5922–5934. doi: 10.1007/s00330-025-11574-5. [DOI] [PubMed] [Google Scholar]
  • 49.Xiao H., Yang Z., Liu T., Liu S., Huang X., Dai J. Deep learning for medical imaging super-resolution: A comprehensive review. Neurocomputing. 2025;630:129667. doi: 10.1016/j.neucom.2025.129667. [DOI] [Google Scholar]
  • 50.Kravchenko D., Isaak A., Mesropyan N., Peeters J.M., Kuetting D., Pieper C.C., Katemann C., Attenberger U., Emrich T., Varga-Szemes A., et al. Deep learning super-resolution reconstruction for fast and high-quality cine cardiovascular magnetic resonance. Eur. Radiol. 2025;35:2877–2887. doi: 10.1007/s00330-024-11145-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Chatterjee S., Sciarra A., Dünnwald M., Ashoka A.B.T., Vasudeva M.G.C., Saravanan S., Sambandham V.T., Tummala P., Oeltze-Jafra S., Speck O., et al. Beyond Nyquist: A Comparative Analysis of 3D Deep Learning Models Enhancing MRI Resolution. J. Imaging. 2024;10:207. doi: 10.3390/jimaging10090207. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Yang Q., Zhang Y., Chandler D.M., Farias M.C.Q. SSRT: Intra- and cross-view attention for stereo image super-resolution. Multimed. Tools Appl. 2025;84:22917–22945. doi: 10.1007/s11042-024-20000-9. [DOI] [Google Scholar]
  • 53.Shi W., Caballero J., Huszár F., Totz J., Aitken A.P., Bishop R., Rueckert D., Wang Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA. 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  • 54.Singh V., Ramnath K., Mittal A. Refining high-frequencies for sharper super-resolution and deblurring. Comput. Vis. Image Underst. 2020;199:103034. doi: 10.1016/j.cviu.2020.103034. [DOI] [Google Scholar]
  • 55.Pham C.-H., Tor-Díez C., Meunier H., Bednarek N., Fablet R., Passat N., Rousseau F. Multiscale brain MRI super-resolution using deep 3D convolutional networks. Comput. Med. Imaging Graph. 2019;77:101647. doi: 10.1016/j.compmedimag.2019.101647. [DOI] [PubMed] [Google Scholar]
  • 56.Kong F., Li M., Liu S., Liu D., He J., Bai Y., Chen F., Fu L. Residual Local Feature Network for Efficient Super-Resolution; Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); New Orleans, LA, USA. 19–20 June 2022; pp. 765–775. [Google Scholar]
  • 57.Tustison N.J., Avants B.B., Cook P.A., Zheng Y., Egan A., Yushkevich P.A., Gee J.C. N4ITK: Improved N3 bias correction. IEEE Trans. Med. Imaging. 2010;29:1310–1320. doi: 10.1109/TMI.2010.2046908. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Hoopes A., Mora J.S., Dalca A.V., Fischl B., Hoffmann M. SynthStrip: Skull-stripping for any brain image. Neuroimage. 2022;260:119474. doi: 10.1016/j.neuroimage.2022.119474. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Loshchilov I., Hutter F. Fixing Weight Decay Regularization in Adam. arXiv. 2017 doi: 10.48550/arXiv.1711.05101. [Google Scholar]
  • 60.Duchi J.C., Hazan E., Singer Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011;12:2121–2159. [Google Scholar]
  • 61.Smith L.N. Cyclical Learning Rates for Training Neural Networks; Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV); Santa Rosa, CA, USA. 24–31 March 2017; pp. 464–472. [Google Scholar]
  • 62.Masters D., Luschi C. Revisiting Small Batch Training for Deep Neural Networks. arXiv. 2018 doi: 10.48550/arXiv.1804.07612. [DOI] [Google Scholar]
  • 63.Bengio Y. Practical Recommendations for Gradient-Based Training of Deep Architectures. In: Montavon G., Orr G.B., Müller K.-R., editors. Neural Networks: Tricks of the Trade. 2nd ed. Springer; Berlin/Heidelberg, Germany: 2012. [Google Scholar]
  • 64.Zou W., Ye T., Zheng W., Zhang Y., Chen L., Wu Y. Self-Calibrated Efficient Transformer for Lightweight Super-Resolution; Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); New Orleans, LA, USA. 19–20 June 2022; pp. 929–938. [Google Scholar]
  • 65.Du Z., Liu D., Liu J., Tang J., Wu G., Fu L. Fast and Memory-Efficient Network Towards Efficient Image Super-Resolution; Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); New Orleans, LA, USA. 19–20 June 2022; pp. 852–861. [Google Scholar]
  • 66.Horé A., Ziou D. Image Quality Metrics: PSNR vs. SSIM; Proceedings of the 2010 20th International Conference on Pattern Recognition; Istanbul, Turkey. 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  • 67.Wang Z., Bovik A.C., Sheikh H.R., Simoncelli E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004;13:600–612. doi: 10.1109/TIP.2003.819861. [DOI] [PubMed] [Google Scholar]
  • 68.Yosinski J., Clune J., Bengio Y., Lipson H. How Transferable Are Features in Deep Neural Networks?; Proceedings of the Advances in Neural Information Processing Systems (NeurIPS); Montreal, QC, Canada. 8–13 December 2014. [Google Scholar]
  • 69.Tajbakhsh N., Shin J.Y., Gurudu S.R., Hurst R.T., Kendall C.B., Gotway M.B., Liang J. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans. Med. Imaging. 2016;35:1299–1312. doi: 10.1109/TMI.2016.2535302. [DOI] [PubMed] [Google Scholar]
  • 70.Zhang Y., Li K., Li K., Fu Y. MR Image Super-Resolution with Squeeze and Excitation Reasoning Attention Network; Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Nashville, TN, USA. 19–25 June 2021; pp. 13420–13429. [Google Scholar]
  • 71.Feng C.-M., Yan Y., Fu H., Chen L., Xu Y. Task Transformer Network for Joint MRI Reconstruction and Super-Resolution. In: de Bruijne M., Cattin P.C., Cotin S., Padoy N., Speidel S., Zheng Y., Essert C., editors. Medical Image Computing and Computer Assisted Intervention—MICCAI 2021. Springer International Publishing; Berlin/Heidelberg, Germany: 2021. pp. 307–317. [Google Scholar]
  • 72.Georgescu M.-I., Ionescu R.T., Verga N. Convolutional Neural Networks with Intermediate Loss for 3D Super-Resolution of CT and MRI Scans. IEEE Access. 2020;8:49112–49124. doi: 10.1109/ACCESS.2020.2980266. [DOI] [Google Scholar]
  • 73.Zhao X., Zhang Y., Zhang T., Zou X. Channel Splitting Network for Single MR Image Super-Resolution. IEEE Trans. Image Process. 2019;28:5649–5662. doi: 10.1109/TIP.2019.2921882. [DOI] [PubMed] [Google Scholar]
  • 74.Zhang Y., Tian Y., Kong Y., Zhong B., Fu Y. Residual Dense Network for Image Super-Resolution; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–22 June 2018; pp. 2472–2481. [Google Scholar]
  • 75.Hui Z., Wang X., Gao X. Fast and Accurate Single Image Super-Resolution via Information Distillation Network; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA. 18–22 June 2018; pp. 723–731. [Google Scholar]
  • 76.Dong C., Loy C.C., He K., Tang X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016;38:295–307. doi: 10.1109/TPAMI.2015.2439281. [DOI] [PubMed] [Google Scholar]
  • 77.Iglesias J.E., Billot B., Balbastre Y., Magdamo C., Arnold S.E., Das S., Edlow B.L., Alexander D.C., Golland P., Fischl B. SynthSR: A public AI tool to turn heterogeneous clinical brain scans into high-resolution T1-weighted images for 3D morphometry. Sci. Adv. 2023;9:eadd3607. doi: 10.1126/sciadv.add3607. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Wang X., Zhang S., Lin Y., Lyu Y., Zhang J. Pixel attention convolutional network for image super-resolution. Neural Comput. Appl. 2022;35:8589–8599. doi: 10.1007/s00521-022-08132-1. [DOI] [Google Scholar]
  • 79.Forigua C., Escobar M., Arbelaez P. SuperFormer: Volumetric Transformer Architectures for MRI Super-Resolution. In: Simulation and Synthesis in Medical Imaging. Lecture Notes in Computer Science. Springer; Berlin/Heidelberg, Germany: 2022. [Google Scholar]
  • 80.Song Z. REHRSeg: Unleashing the Power of Self-Supervised Super-Resolution for Resource-Efficient 3D MRI Segmentation. Neurocomputing. 2025;624:129425. doi: 10.1016/j.neucom.2025.129425. [DOI] [Google Scholar]
  • 81.Schlereth M., Schillinger M., Breininger K. Faster, Self-Supervised Super-Resolution for Anisotropic Multi-View MRI Using a Sparse Coordinate Loss; Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2025; Daejeon, Republic of Korea. 23–27 September 2025. [Google Scholar]
  • 82.Zhang H., Zhang Y., Wu Q., Wu J., Zhen Z., Shi F., Yuan J., Wei H., Liu C., Zhang Y. Self-Supervised Arbitrary Scale Super-Resolution Framework for Anisotropic MRI; Proceedings of the 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI); Cartagena, Colombia. 18–21 April 2023; pp. 1–5. [Google Scholar]
  • 83.Wang X. Inter-Slice Super-Resolution of Magnetic Resonance Images by Pre-Training and Self-Supervised Fine-Tuning; Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI); Athens, Greece. 27–30 May 2024. [Google Scholar]
  • 84.Gundogdu B. Self-Supervised Multi-Contrast Super-Resolution for Diffusion-Weighted Prostate MRI. Magn. Reson. Med. 2024;92:319–331. doi: 10.1002/mrm.30047. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Chen H. Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning; Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Seattle, WA, USA. 16–22 June 2024; pp. 25857–25867. [Google Scholar]
  • 86.Benisty R. SIMPLE: Simultaneous Multi-Plane Self-Supervised Learning for Isotropic MRI Restoration from Anisotropic Data; Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Marrakesh, Morocco. 6–10 October 2024. [Google Scholar]
  • 87.Remedios S.W. Self-Supervised Super-Resolution for Anisotropic MR Images with and Without Slice Gap. In: Wang L., Dou Q., Fletcher P.T., Speidel S., Li S., editors. Springer; Berlin/Heidelberg, Germany: 2023. pp. 118–128. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.Mayo P. Physics informed guided diffusion for accelerated multi-parametric MRI reconstruction; Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Daejeon, Republic of Korea. 23–27 September 2025. [Google Scholar]
  • 89.Zhang L., Nie J., Wei W., Zhang Y. Unsupervised Test-Time Adaptation Learning for Effective Hyperspectral Image Super-Resolution With Unknown Degeneration. IEEE Trans. Pattern Anal. Mach. Intell. 2024;46:5008–5025. doi: 10.1109/TPAMI.2024.3361894. [DOI] [PubMed] [Google Scholar]
  • 90.Wang Z. One for Multiple: Physics-informed Synthetic Data Boosts Generalizable Deep Learning for Fast MRI Reconstruction. Med. Image Anal. 2025;103:103616. doi: 10.1016/j.media.2025.103616. [DOI] [PubMed] [Google Scholar]
  • 91.Ren Y., Liu W., Zhou Z., Hu P. PI-GNN: Physics-Informed Graph Neural Network for Super-Resolution of 4D Flow MRI; Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI); Athens, Greece. 27–30 May 2024. [Google Scholar]
  • 92.Jacobs L. Generalizable synthetic MRI with physics-informed convolutional networks. Med. Phys. 2023;51:3348–3359. doi: 10.1002/mp.16884. [DOI] [PubMed] [Google Scholar]
  • 93.Ferdian E., Marlevi D., Schollenberger J., Aristova M., Edelman E.R. Cerebrovascular super-resolution 4D Flow MRI: Sequential combination of resolution enhancement by deep learning and physics-informed image processing. Med. Image Anal. 2023;88:102831. doi: 10.1016/j.media.2023.102831. [DOI] [PubMed] [Google Scholar]
  • 94.Donnay C., Okar S.V., Tsagkas C., Gaitán M.I., Poorman M., Reich D.S., Nair G. Super resolution using sparse sampling at portable ultra-low field MR. Front. Neurol. 2024;15:1330203. doi: 10.3389/fneur.2024.1330203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Man C., Lau V., Su S., Zhao Y., Xiao L., Ding Y., Leung G.K.K., Leong A.T.L., Wu E.X. Deep learning enabled fast 3D brain MRI at 0.055 tesla. Sci. Adv. 2023;9:eadi9327. doi: 10.1126/sciadv.adi9327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Lau V., Xiao L., Zhao Y., Su S., Ding Y., Man C., Wang X., Tsang A., Cao P., Lau G.K.K., et al. Pushing the limits of low-cost ultra-low-field MRI by dual-acquisition deep learning 3D superresolution. Magn. Reson. Med. 2023;90:400–416. doi: 10.1002/mrm.29642. [DOI] [PubMed] [Google Scholar]
  • 97.Islam K.T., Zhong S., Zakavi P., Rawluk N., Hill M., Spreeuwers L., Mateen F., Sher I., McCreary C., Frayne R. Improving portable low-field MRI image quality through image-to-image translation using paired low- and high-field images. Sci. Rep. 2023;13:20848. doi: 10.1038/s41598-023-48438-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Iglesias J.E., Schleicher R., Laguna S., Billot B., Schaefer P., McKaig B., Goldstein J.N., Sheth K.N., Rosen M.S., Kimberly W.T. Quantitative brain morphometry of portable low-field-strength MRI using super-resolution machine learning. Radiology. 2023;306:e220522. doi: 10.1148/radiol.220522. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Samarasinghe D., Wickramasinghe D., Wijerathne T., Meedeniya D., Yogarajah P. Brain Tumour Segmentation and Edge Detection Using Self-Supervised Learning. Int. J. Online Biomed. Eng. (IJOE) 2025;21:127–141. doi: 10.3991/ijoe.v21i05.53405. [DOI] [Google Scholar]

Associated Data

Data Availability Statement

The study was based entirely on openly accessible, public-domain datasets. The code for our preprocessing pipeline and the machine learning framework will be made available on GitHub.

