Journal of Imaging Informatics in Medicine. 2025 Jan 27;38(5):3248–3262. doi: 10.1007/s10278-024-01322-4

Systematic Review of Hybrid Vision Transformer Architectures for Radiological Image Analysis

Ji Woong Kim 1, Aisha Urooj Khan 2, Imon Banerjee 1,2,3
PMCID: PMC12572492  PMID: 39871042

Abstract

Vision transformers (ViTs) and convolutional neural networks (CNNs) each possess distinct strengths in medical imaging: ViT excels in capturing long-range dependencies through self-attention, while CNNs are adept at extracting local features via spatial convolution filters. ViT may struggle to capture detailed local spatial information, which is critical for tasks like anomaly detection in medical imaging, whereas shallow CNNs often fail to effectively abstract global context. This study aims to explore and evaluate hybrid architectures that integrate ViT and CNN to leverage their complementary strengths for enhanced performance in medical vision tasks, such as segmentation, classification, reconstruction, and prediction. Following the PRISMA guideline, a systematic review was conducted on 34 articles published between 2020 and Sept. 2024. These articles proposed novel hybrid ViT-CNN architectures specifically for medical imaging tasks in radiology. The review focused on analyzing architectural variations, merging strategies between ViT and CNN, innovative applications of ViT, and efficiency metrics including number of parameters, inference time (GFLOPs), and performance benchmarks. The review identified that integrating ViT and CNN can mitigate the limitations of each architecture, offering comprehensive solutions that combine global context understanding with precise local feature extraction. We benchmarked the articles on these criteria and derived a ranked list. By synthesizing the current literature, this review defines fundamental concepts of hybrid vision transformers and highlights emerging trends in the field. It provides a clear direction for future research aimed at optimizing the integration of ViT and CNN for effective utilization in medical imaging, contributing to advancements in diagnostic accuracy and image analysis.
We performed a systematic review of hybrid vision transformer architectures using the PRISMA guideline and carried out a thorough comparative analysis to benchmark the architectures.

Keywords: Vision transformer, Hybrid architecture, Radiology, Image analysis

Introduction

Convolutional neural networks (CNNs) are capable of learning inductive bias and have been the state of the art for many medical imaging applications [1]. Recently, the vision transformer (ViT) has been adopted in many medical imaging applications and has demonstrated comparable performance [24]. Since the transformer is strong at capturing long-range dependencies through its self-attention mechanism, it has shown good performance in complex natural language processing tasks [5, 6]. To detect anomalies in anatomical imaging, the local correlation among neighboring pixels is important for identifying differences in shape and regional texture relative to neighboring tissue, in addition to the long-range dependencies needed to understand the anatomical location of the anomaly [10]. Understanding both local and global features is particularly important for smaller radiological findings: multiple mimicking entities may be present in the image, so it is difficult to confirm the actual finding based only on local spatial features. CNNs are known for their ability to capture local dependencies [79].

While CNNs can primarily learn local features with the help of spatial convolution filters, shallow CNNs with fewer layers often struggle to understand the global context of an image because of their limited abstraction. In contrast, ViT learns long-range dependencies via self-attention between image patches to understand the global context. However, the patch-based positional encoding mechanism may miss relevant local spatial information, and ViT usually cannot attain the performance of CNNs on small-scale datasets. This limitation has been highlighted in recent studies, particularly in the medical imaging domain, which report that ViT is limited in identifying small findings [13]. Furthermore, a recent qualitative evaluation with heat maps in radiology shows that ViT and CNN learn the same radiographic image findings at different scales: ViT attention maps were more precise, with smaller areas of activation, while CNN highlighted large portions of the same area [14]. Thus, the vision transformer and CNN have complementary strengths for processing medical images in various applications, e.g., segmentation, classification, and prediction. Therefore, many researchers have attempted to combine these two architectures through hybrid architectures that capture both the global and local content of the images [11].

Combining CNN and ViT in a hybrid modeling architecture can overcome the limitations of both, ultimately providing an opportunity to learn the global and local spatial contexts in an end-to-end model. Given the many attempts to embrace the strengths of both ViT and CNN in the radiology domain within a unified end-to-end framework, we conducted a systematic review to explore the trends in how the hybrid vision transformer (CNN + ViT) has been used to address image processing challenges in radiology and to create a comprehensive benchmark to guide future development. Shamshad et al. [4] recently published a comprehensive survey on transformer applications for medical images; however, that survey focused primarily on the broad transformer architectures and only briefly mentioned the pros and cons of hybrid structures, whereas we primarily focus on benchmarking the hybrid architectures for radiological images.

Based on the PRISMA guideline, we selected 34 published articles for full-text review that proposed novel hybrid architectures for medical vision tasks. We performed a five-part comparative analysis: (i) overall architectural variations in the design of the hybrid CNN and ViT; (ii) merging strategies between CNN and ViT; (iii) innovative ViT usages; (iv) design efficiency of the architectures in terms of number of parameters and inference time efficiency; and (v) the application, i.e., the task for which the hybrid vision transformer is used. To the best of our knowledge, there exists no systematic review that focuses on hybrid architectures combining ViT and CNN for use in the radiology domain. While there exist survey papers that analyze the usage of ViT or CNN separately in the medical imaging domain [4], we focused particularly on how the ViT is combined with other modules as a hybrid vision transformer architecture and applied to analyze features at varying scales in radiology images. Through the comparative analysis, we not only analyzed and summarized the contents of the published literature but also defined the fundamental concepts of the hybrid vision transformer to pave a clear pathway for future research in this area. Finally, we generated a ranking of the studies based on five unified benchmarking criteria and derived a community adoption score based on normalized citations.

Method

Study Selection

Initially, papers for review were identified from three search engines (Google Scholar, PubMed, and Science Direct), encompassing all articles published between Jan. 2020 and Sept. 2024. The first screening pass (JK) identified papers pertaining to six keywords [(“Vision transformer,” “ViT,” “Hybrid ViT”) AND (“Radiology,” “Image Analysis,” “Modeling”)]. Afterwards, papers were excluded if (i) they were published before 2020 or after Sept. 2024; (ii) they were not peer-reviewed, e.g., papers published only on arXiv or medRxiv, since their claims and models had not been validated by peers; or (iii) the proposed usage was not related to radiology, e.g., hybrid architectures applied in pathology, dermatology, or ophthalmology, which were left out to keep the discussion focused on the radiology domain. To filter papers properly, we parsed the methods and experiments sections to determine the architecture and usage. We selected only architectures that use self-attention or transformer modules for visual feature extraction and integrate them with CNN modules. Hybrid transformer architectures that used transformer modules solely for text feature extraction were excluded, as we focused on transformer usage for visual radiological features. Vanilla ViT also proposes CNN embedding for input patches [15]; thus, we defined a hybrid ViT as a vision transformer architecture that integrates CNN more extensively in an end-to-end learning manner, in contrast to that variant of vanilla ViT. Finally, papers that only concatenated the outputs of the two architectures without proposing any new design modifications were excluded. Subsequently, three independent reviewers (JK, AU, and IB) scanned the retrieved papers and selected relevant papers based on the predefined inclusion and exclusion criteria. Conflicts identified in three papers were resolved by majority voting. All the selected papers were reviewed by all three reviewers.
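
The exclusion criteria above can be read as a simple record filter. The sketch below is illustrative only: the `Paper` fields and the `passes_screening` helper are hypothetical names introduced here, the end of the search window is assumed to be 30 Sept. 2024, and the actual screening was carried out manually by the reviewers rather than by code.

```python
from dataclasses import dataclass
from datetime import date

# Search window stated in the text: Jan. 2020 to Sept. 2024
# (the exact end day is an assumption).
WINDOW_START = date(2020, 1, 1)
WINDOW_END = date(2024, 9, 30)


@dataclass
class Paper:
    """Hypothetical screening record; field names are illustrative."""
    published: date
    peer_reviewed: bool              # False for preprints (e.g., arXiv, medRxiv)
    domain: str                      # e.g., "radiology", "pathology"
    transformer_on_images: bool      # transformer/self-attention extracts visual features
    integrates_cnn: bool             # CNN modules integrated beyond patch embedding
    only_concatenates_outputs: bool  # two model outputs joined with no new design


def passes_screening(p: Paper) -> bool:
    """Return True when none of the stated exclusion criteria applies."""
    if not (WINDOW_START <= p.published <= WINDOW_END):
        return False
    if not p.peer_reviewed:
        return False
    if p.domain != "radiology":
        return False
    if not (p.transformer_on_images and p.integrates_cnn):
        return False
    return not p.only_concatenates_outputs


candidate = Paper(date(2022, 5, 1), True, "radiology", True, True, False)
preprint = Paper(date(2022, 5, 1), False, "radiology", True, True, False)
print(passes_screening(candidate), passes_screening(preprint))
```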

Comparative Analysis

We primarily focused on four comparative analysis factors, as described below.

Architecture

We categorized the existing hybrid architectures into parallel and sequential designs (Fig. 1). In a parallel design, the CNN and ViT modules are used in parallel to cooperatively learn the feature representation from the data, and intermediate fusion functions, such as cross-attention, are generally used to increase the co-understanding between the modules. Note that both modules parse the data at the same level of granularity and depend less on each other. In contrast, in a sequential design, CNN and ViT are used in sequence, and the output of one module is passed directly to the other: one module performs the initial feature extraction, and the other generates an abstract view based on the previous module’s interpretation. Therefore, the dependency between the CNN and ViT modules is much higher in the sequential design. The choice of hybrid architecture primarily depends on the targeted task. For example, a sequential architecture (CNN → transformer) reduces the input dimension fed into the transformer and therefore increases the chance of capturing global information with less memory; it is particularly useful for segmentation and image reconstruction tasks. Parallel architectures draw on the benefits of both CNN and transformer for joint decision-making in classification.

Fig. 1. Architecture variations: (a) parallel and (b) sequential
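
The memory argument for the sequential (CNN → transformer) design can be made concrete with a small calculation. The sketch below assumes that a CNN stem downsamples every spatial axis by a fixed stride and that each remaining position becomes one transformer token; the 128 × 128 × 128 volume and the stride of 8 are illustrative values, not settings taken from any reviewed model.

```python
from math import prod


def num_tokens(shape, stride):
    """Tokens left after downsampling every spatial axis by `stride`."""
    return prod(dim // stride for dim in shape)


def attention_entries(n_tokens):
    """Size of one self-attention score matrix (n_tokens x n_tokens)."""
    return n_tokens * n_tokens


volume = (128, 128, 128)  # illustrative 3D input size
for stride in (1, 8):     # 1 = one token per voxel, 8 = after a CNN stem
    n = num_tokens(volume, stride)
    print(f"stride={stride}: tokens={n}, attention entries={attention_entries(n)}")
```

Because the attention matrix grows quadratically with the number of tokens, reducing each axis by a factor of 8 shrinks the token count by 8^3 and the attention matrix by 8^6 in this example.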

Merging Strategy

Based on the architectural variations (sequential or parallel), various strategies have been employed to merge CNN and ViT outputs effectively and utilize their respective strengths. We identified three broad techniques for merging ViT and CNN (Fig. 2): (i) feature reshaping: a widely adopted technique for sequential architectures, where the output of one module is resized using a simple linear function (such as flattening) so that it can be fed into the other. For example, since ViT only parses sequential input tokens, CNN feature maps must be flattened into sequential tokens before they are passed to ViT; (ii) positional encoding: since flattened feature maps lose spatial information, architectures that use feature reshaping often also include positional encoding to capture the spatial information of the feature map; and (iii) fusing module: for the parallel design, where both modules learn together, most of the literature used a linear combination of the ViT and CNN outputs to fuse the two modules, or used a separate model to learn the fusion parameters.

Fig. 2. Merging strategy variations: (a) feature reshaping, (b) positional encoding, and (c) fusing module
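
The three merging techniques can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: feature maps are nested lists in (channels, height, width) order, the positional encoding is the fixed sinusoidal form (many hybrid models instead learn their position embeddings), and the fusing step is a fixed-weight linear combination with a hypothetical mixing weight `alpha`. The function names are introduced here for illustration.

```python
import math


def flatten_feature_map(fmap):
    """Feature reshaping: (C, H, W) nested lists -> list of H*W tokens of length C."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][h][w] for c in range(channels)]
            for h in range(height) for w in range(width)]


def sinusoidal_encoding(n_tokens, dim):
    """Fixed sinusoidal positional encoding (sin on even, cos on odd indices)."""
    table = []
    for pos in range(n_tokens):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(tokens):
    """Positional encoding: add a position vector to every flattened token."""
    table = sinusoidal_encoding(len(tokens), len(tokens[0]))
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, table)]


def fuse(cnn_feat, vit_feat, alpha=0.5):
    """Fusing module: linear combination of two same-length feature vectors."""
    return [alpha * c + (1 - alpha) * v for c, v in zip(cnn_feat, vit_feat)]


# Toy feature map with 2 channels and a 2 x 2 spatial grid.
fmap = [[[1, 2], [3, 4]],
        [[5, 6], [7, 8]]]
tokens = flatten_feature_map(fmap)   # 4 tokens, each of length 2
encoded = add_positions(tokens)
print(tokens)
print(fuse([1.0, 2.0], [3.0, 4.0], alpha=0.25))
```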

Transformer Utilization

Given that the usage of the transformer varies between the hybrid models, we also categorized the hybrid models based on the final usage of the transformer module: (i) Encoder, when the transformer is used to embed the raw or processed data, in other words, to generate a compressed representation of the input; and (ii) Decoder, when the transformer is used to generate a reconstruction or an interpretation.

Applications

We observed that hybrid transformer architectures are adopted for various computer vision problems in the radiology domain, ranging from classic tasks such as classification, segmentation, reconstruction, and registration to regression, synthesis, view combination, and text generation. We grouped the designs by architecture to better highlight the design choices and the metrics used for performance evaluation. We also analyzed how the authors apply the hybrid concept to each task and application.

Benchmarking Criteria

We employed five criteria to benchmark these models, which are centered around understanding the utility and efficiency of the hybrid models [16].

Modality

Hybrid models are primarily developed to deal with high-dimensional spatial data (e.g., 2D, 3D, and 4D) and to process them in a computationally efficient way through the transformer while retaining both local and global spatial contexts. Therefore, we consider the imaging modality (2D: X-ray; 2D + time: ultrasound (US); 3D: magnetic resonance (MR), computed tomography (CT), and positron emission tomography (PET)) as a primary benchmarking criterion, which serves as a proxy for the data dimensionality.

Model Size

We measured the model size by the number of trainable parameters. In theory, as the number of trainable parameters increases, so does the need for more training samples [12], which limits the applicability of large models to interesting clinical use cases because training data and manual annotations are scarce in the clinical domain. However, the number of trainable parameters also depends on the input data dimension. Therefore, we defined an efficient hybrid design as a model that can handle high-dimensional input data with fewer trainable parameters. We recorded the number of trainable parameters either from the published paper (if documented) or by loading the model manually, if the model was supplied by the authors under an academic open-source license.
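
When a model cannot be loaded, the number of trainable parameters can also be approximated by hand from the layer definitions. The sketch below uses the standard closed-form counts for convolution, linear, and transformer encoder layers; it assumes a conventional encoder layer (four attention projections, a two-layer MLP, and two layer norms, all with biases), which may differ from the exact blocks used in a given hybrid model. The function names are illustrative.

```python
def conv_params(c_in, c_out, kernel, dims=2, bias=True):
    """Weights (plus optional bias) of a convolution with a cubic/square kernel."""
    return c_out * (c_in * kernel ** dims + (1 if bias else 0))


def linear_params(n_in, n_out, bias=True):
    """Weights (plus optional bias) of a fully connected layer."""
    return n_out * (n_in + (1 if bias else 0))


def transformer_layer_params(d_model, d_ff):
    """One encoder layer under the assumptions stated above."""
    attention = 4 * linear_params(d_model, d_model)  # Q, K, V, and output projections
    mlp = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * 2 * d_model                    # scale and shift for two norms
    return attention + mlp + layer_norms


# A 3x3 convolution from 3 to 64 channels, and a 12-layer encoder with
# ViT-Base-like widths (d_model=768, d_ff=3072), reported in millions.
print(conv_params(3, 64, 3))
print(12 * transformer_layer_params(768, 3072) / 1e6)
```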

Computing Efficiency

To measure computing efficiency, we used FLOPs (in GFLOPs) as a benchmarking criterion to demonstrate the feasibility of applying the algorithm in real time. FLOPs denote the number of floating-point operations performed to run inference on a single image. Because hardware configurations may vary, FLOPs provide a standardized, hardware-independent measure of the algorithmic efficiency of the hybrid model.

We nevertheless calculated and tested all the models on 4 A100 GPU cores with 32 GB memory. As with model size, we relied on the published manuscript for the FLOPs if documented, or calculated the FLOPs by running inference on an input image generated with the dimensions specified by the authors.
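
For intuition about how such FLOP counts arise, the per-layer cost can be estimated analytically. The sketch below is an approximation, not the profiling procedure used in this review: it counts one multiply-accumulate as two FLOPs (some tools report multiply-accumulates directly, which halves the numbers) and ignores softmax, normalization, and bias terms. The function names are illustrative.

```python
from math import prod


def conv_flops(c_in, c_out, kernel, out_shape):
    """Approximate FLOPs of one convolution producing an output of `out_shape`."""
    macs = prod(out_shape) * c_out * c_in * kernel ** len(out_shape)
    return 2 * macs


def self_attention_flops(n_tokens, d_model):
    """Approximate FLOPs of one self-attention layer over `n_tokens` tokens."""
    projections = 4 * n_tokens * d_model * d_model  # Q, K, V, and output projections
    scores = n_tokens * n_tokens * d_model          # Q times K transposed
    weighted_sum = n_tokens * n_tokens * d_model    # attention weights times V
    return 2 * (projections + scores + weighted_sum)


def to_gflops(flops):
    """Convert a raw FLOP count to GFLOPs."""
    return flops / 1e9


# Attention cost grows quadratically with the token count, which is why
# feeding compact CNN feature maps to the transformer lowers the total.
for n in (196, 4096):
    print(n, to_gflops(self_attention_flops(n, 768)))
```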

Training Data Size

Though the training sample size is not directly related to the hybrid model design, we incorporated it as a benchmarking criterion to highlight task- and modality-specific trends in the targeted use cases and how it ultimately affects model performance. We captured the number of training samples directly from the documentation reported by the original authors. If multiple training settings were mentioned in the paper, we selected only the largest cohort setting and report the total number of exams included.

Performance

Based on the applications (“Applications”), we ultimately benchmarked the reported performance. Because shared codebases and a common dataset were not available, we documented only the performance reported by the original authors on distinct datasets. The performance benchmark nevertheless highlights the overall performance for each task based on a standard task-specific metric, e.g., Dice for segmentation accuracy; AUC for the classification trade-off between true positives and false positives at different probability thresholds (F1 if AUC is not reported); Structural Similarity Index (SSIM) for reconstruction and registration quality compared to the input data; and BLEU-1 score for evaluating the quality of generated text compared to the original reference text.
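
As a reminder of how the most frequently reported metric in this review is computed, the Dice coefficient between a predicted and a reference binary mask is 2|A ∩ B| / (|A| + |B|). The sketch below is a minimal plain-Python version for flattened 0/1 masks; returning 1.0 when both masks are empty is a convention assumed here, and the reviewed papers may handle that case differently.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return 2.0 * intersection / total


prediction = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
print(dice_coefficient(prediction, reference))
```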

Ranking

Based on the five benchmarking criteria, we generated a categorical ranking to highlight the best practices for generalizability and adoptability of the hybrid models by the research and clinical communities. A higher rank is assigned if a lighter model with lower FLOPs is applied to high-dimensional data, achieves higher accuracy (> 80) on internal test sets, and has had its performance validated to understand performance disparity (e.g., across device, population, and anatomy). However, such a ranking is only possible when the model and codebase are public or were shared with us. Manuscripts with no open-source code repository were grouped as opaque. We also calculated the rate of community adoption as citations/years, i.e., the citation count normalized by the number of years.
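
The community adoption rate is a simple ratio, illustrated below. The sketch assumes that the denominator is the number of years since publication, which the text does not state explicitly; the model names and citation counts are placeholders and are not values from Table 3.

```python
def adoption_rate(citations, years):
    """Citations divided by years (assumed here to be years since publication)."""
    if years <= 0:
        raise ValueError("years must be positive")
    return citations / years


# Placeholder (citations, years) pairs, not taken from the reviewed studies.
examples = {"model_a": (120, 3.0), "model_b": (45, 1.0)}
ranked = sorted(examples, key=lambda name: adoption_rate(*examples[name]), reverse=True)
print(ranked)
```

Normalizing by years keeps older papers from dominating purely because they have had more time to accumulate citations.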

Results

In Fig. 3, we present the PRISMA diagram with the total number of articles included and excluded at each filtering step. Within the scope of this survey, we analyzed 34 articles that satisfied all our inclusion criteria. Table 1 shows the comparative analysis results and Table 2 the benchmarking results, according to the criteria defined in “Comparative Analysis” and “Benchmarking Criteria,” respectively. Table 3 presents the ranking derived from the five criteria and the community adoption rate calculated from citations as of Oct. 2024.

Fig. 3. PRISMA study selection diagram for hybrid vision transformer in radiology

Table 1.

Comparative analysis of the existing hybrid architectures based on the criteria defined in “Comparative Analysis.” Studies are grouped by targeted application. “Transf.” and “Util.” refer to transformer and utility, respectively

Reference | Application | Architecture | Merging | Transf. Util. | Transf. Backbone | CNN Backbone
U-net Transformer [36] | Segmentation | Sequential | Fusing | Encoder | Multi-head cross-attention | U-Net
UTNet [7] | Segmentation | Sequential | Feature reshaping, Positional encoding | Encoder, Decoder | Multi-head self-attention | U-Net
CPT U-Net [20] | Segmentation | Parallel | Fusing | Encoder, Decoder | Pyramid vision transformer | U-Net
UNETR [37] | Segmentation | Sequential | Feature reshaping | Encoder | Vision transformer | U-Net
Swin UNETR [38] | Segmentation | Sequential | Feature reshaping, Fusing | Encoder | Swin transformer | U-Net
COTRNet [21] | Segmentation | Sequential | Feature reshaping, Positional encoding | Encoder | Light vision transformer | U-Net
Cotr [39] | Segmentation | Sequential | Feature reshaping, Positional encoding | Encoder | Deformable transformer-encoder | U-Net
Hybrid ViT and CNN [24] | Segmentation | Sequential | Fusing | Encoder, Decoder | Vision transformer | U-Net
TransBTS [10] | Segmentation | Sequential | Feature reshaping, Positional encoding | Encoder | Light vision transformer | U-Net
Trans U-Net [22] | Segmentation | Sequential | Feature reshaping, Positional encoding | Encoder | Vision transformer | U-Net
Bitr U-Net [40] | Segmentation | Sequential | Feature reshaping, Positional encoding | Encoder | Vision transformer | U-Net, CBAM
After U-Net [41] | Segmentation | Sequential | Feature reshaping | Encoder | Axial fusion transformer | U-Net
WAU [42] | Segmentation | Sequential | Feature reshaping | Decoder | Window attention | Group convolution, Depthwise separable CNN
HCTN [43] | Segmentation | Parallel | Feature reshaping | Encoder | Vision transformer | U-Net
Hybrid CNN-Transformer Non-Contrast [44] | Segmentation | Parallel | Fusing | Encoder | Hierarchical transformer | U-Net
D-TrAttUnet [45] | Segmentation | Parallel | Fusing | Encoder | Swin transformer | U-Net
MLABHCTM [46] | Segmentation | Parallel | Fusing | Encoder/Decoder | Transformer | U-Net
UCTNet [47] | Segmentation | Sequential | Feature reshaping | Encoder/Decoder | Transformer | U-Net
Hybrid-MT-ESTAN [48] | Classification/segmentation | Parallel | Feature reshaping, Fusing | Encoder | Swin transformer | ResNet
ChexViT [17] | Classification | Sequential | Feature reshaping, Positional encoding | Encoder | Vision transformer | CheXNet [49]
TECNN [19] | Classification | Parallel | Feature reshaping, Fusing | Encoder | Vision transformer | DenseNet
SLATER [50] | Reconstruction | Sequential | Feature reshaping, Positional encoding | Decoder | Cross-attention | Specialized CNN
3D Transformer GAN [18] | Reconstruction | Sequential | Feature reshaping, Positional encoding | Encoder, Decoder | Vision transformer | Specialized CNN
T2Net [31] | Reconstruction | Sequential | Feature reshaping | Encoder | Task-attention, Soft-attention | Specialized CNN
MIST-Net [51] | Reconstruction | Sequential | Feature reshaping, Fusing | Decoder | Swin transformer | Specialized CNN
Ultrasound ViT [30] | Regression | Sequential | Feature mapping | Encoder | BERT | ResNetAE/DenseNet
DTN [25] | Registration | Sequential | Feature reshaping, Fusing | Encoder, Decoder | Dual transformer | Specialized CNN
ViT-V-Net [52] | Registration | Sequential | Feature reshaping, Positional encoding | Encoder | Vision transformer | Specialized CNN
Transmorph [23] | Registration | Sequential | Feature reshaping | Encoder | Swin transformer | U-Net
ResViT [29] | Image synthesis | Sequential | Feature reshaping | Encoder | Vision transformer | Specialized CNN
TransCT [27] | Restoration | Parallel | Feature reshaping | Encoder, Decoder | Vision transformer | Specialized CNN
Multi-View ViT [26] | Combining views | Sequential | Feature reshaping | Encoder | Cross view-attention | ResNet
AlignTransformer [53] | Report generation | Sequential | Feature reshaping | Encoder | Align hierarchical-attention | ResNet
R2Gen [28] | Report generation | Sequential | Feature reshaping | Encoder, Decoder | Vision transformer | Pretrained (ResNet, VGG)

Table 2.

Benchmark table. See “Benchmarking Criteria” for the definition of the benchmarking criteria. Models that were inaccessible, which prevented calculation of the benchmarking criteria, are marked as “–”

Reference | Modality | Parameters (M) | Inference time (GFLOPs) | Sample size | Performance
U-net Transformer [36] | CT | 42.5 | – | TCIA (public): 82 total | Dice: 0.78
UTNet [7] | MR | 14.4 | 40.9 | M&Ms [54]; Training: 150/Test: 200 | Dice: 0.88
CPT U-Net [20] | CT | 123.8 | 150.6 | Synapse1 (public) 30 | Dice: 0.81
UNETR [37] | CT/MRI | 92.5 | 41.1 | BTCV: 20 subjects; MSD: 484 CT/MRI | Dice: 0.89
Swin UNETR [38] | MRI | 61.9 | 394.8 | BraTS21; Training: 1251/Val: 219 | Dice: 0.92
COTRNet [21] | CT | – | – | Kits21; Training: 240/Val: 60 | Dice: 0.61
Cotr [39] | CT | 41.9 | 399.2 | Kits21; Training: 240/Val: 60/Test: 100 | Dice: 0.61
Hybrid ViT and CNN [24] | CT | – | – | Synapse; Training: 18/Val: 12 | Dice: 0.87
TransBTS [10] | MRI | 32.9 | 333.0 | BraTS19; Train: 335/Val: 125 | Dice: 0.9
TransUNet [22] | CT/MRI | 105.1 | 1186.9 | Synapse; Train: 18/Val: 12 | Dice: 0.77
Bitr U-Net [40] | MRI | 43.4 | 186.2 | BraTS21; Training: 1251/Val: 219 | Dice: 0.92
After U-Net [41] | CT | 41.5 | – | BCV; Training: 18/Test: 12 | Dice: 0.81
WAU [42] | CT/MRI | 21.8 | 15.94 | Synapse: 18; MSD: 484 | Dice: 0.80
HCTN [43] | X-Ray/MRI | – | – | STS; Training: 45/Val: 5 | Dice: 0.88
Hybrid CNN-Transformer Non-Contrast [44] | CT | 38.94 | 3.75 | AISD; Training: 305/Val: 40/Test: 52 | Dice: 0.61
D-TrAttUnet [45] | CT | 70.13 | 28.47 | BM Seg Dataset; 1517: Five-fold cross-validation | Dice: 0.84
MLABHCTM [46] | MRI | 1.8 | 4.6 | ACDC; Training: 140/Val: 20/Test: 40 | Dice: 0.86
UCTNet [47] | CT/MRI/Dermoscopic | 58.8 | 21.7 | ACDC; Training: 70/Val: 10/Test: 20 | Dice: 0.92
Hybrid-MT-ESTAN [48] | Ultrasound | – | – | Private Dataset; Total: 3320 | Dice: 0.84; AUC: 0.82
ChexViT [17] | X-Ray | – | – | Chest X-ray 14 [55]; Training: 86,524/Val: 25,596 | AUC: 0.83
TECNN [19] | MRI | 22.5 | – | BraTS, Figshare; Training: 998+/Val: 285+/Test: 142+ | Recall: 0.97; Precision: 0.967; F1-Score: 0.968
SLATER [50] | MRI | 36.0 | 174.8 | IXI; Training: 25/Val: 5/Test: 10 | SSIM (T2): 97.77
3D Transformer GAN [18] | PET | 42.0 | – | Brain MRI; Training: 10935 (15 subjects × 729 patches); Val: LOOV | SSIM: 0.986
T2Net [31] | MRI | 1.4 | 140.3 | IXI Dataset, Clinical MRI Dataset; Training: 420/Val: 60/Test: 120 | SSIM: 0.87
MIST-Net [51] | CT | 11.8 | 576.0 | 2016 NIH AAPM Mayo Challenge; Training: 4274 sinograms/Test: 391 sinograms | SSIM: 0.98
Ultrasound ViT [30] | Echocardiogram | 346.8 | 521.7 | Echonet-Dynamic; Training: 7522/Val: 1504/Test: 1504 | MAE: 6.77
DTN [25] | MRI | – | – | Oasis, IXI, BraTS; Training: 256/Val: 19/Test: 150 | Dice: 0.76
ViT-V-Net [52] | MRI | 31.5 | 778.4 | In-House T1 Weight; Training: 182/Val: 26/Test: 52 | Dice: 0.72
Transmorph [23] | CT/MRI/XCAT | 46.7 | 1427.0 | Oasis; Training: 256/Val: 19/Test: 150 | Dice: 0.816
ResViT [29] | CT/MRI | 123.4 | 973.0 | IXI, BraTS, CT-MRI; Training: 59/Val: 29/Test: 42 | SSIM (T1, T2 → FLAIR): 0.886
TransCT [27] | CT | 12.6 | 598.7 | 2016 NIH AAPM Mayo Challenge; Training: 7 patients/Val: 1/Test: 2 | SSIM: 0.92
Multi-View ViT [26] | X-ray | 23.6 | 9.5 | CheXpert; Training: 23,628 (16,810 patients)/Val: 3915/Test: 3870 | AUC: 0.834
AlignTransformer [53] | X-ray | – | – | MIMIC CXR; Training: 368,960/Val: 2991/Test: 5159 | BLEU1: 0.378
R2Gen [28] | X-ray | 78.4 | 35.4 | MIMIC CXR; Training: 368,960/Val: 2991/Test: 5159 | BLEU1: 0.353

Table 3.

Ranking table. See “Benchmarking Criteria” (Ranking) for the definition of the benchmarking rating. Models that were inaccessible, which prevented calculation of the benchmarking criteria, are marked as “–” and ranked as “opaque.” Citations for the community adoption rate were counted as of 10 Oct. 2024


Comparative Analysis: Architecture

Although parallel architecture enhances cooperative learning between CNN and ViT, most hybrid architectures (26 out of 34) follow the simpler sequential structure, where the output of the CNN feature extractor block is fed into ViT to generate a compressed feature representation (Table 1). In particular, the hybrid models proposed for segmentation mostly follow the U-shaped architecture based on U-Net, with the transformer self-attention module placed between the encoder and decoder of the U-Net-shaped backbone. For the classification, restoration, and reconstruction tasks, a similar sequential idea has been adopted: CNNs are used as feature extractors, and the feature maps are flattened and fed into the transformer with positional encoding, which can be calculated in the feature space [17]. A common claim for sequential architectures is reduced computational cost, since the CNN feature maps fed to the transformer are more compact than the original input. 3D Transformer-GAN [18] is one of the architectures that reduces the transformer’s computational cost by placing CNN modules before the transformer modules.

Parallel architectures were proposed specifically to preserve same-level features from both CNN and transformer; however, fusing the feature spaces at multiple levels is challenging and requires innovative measures. Zhu et al. [20] propose two separate trainable modules in two parallel branches, which supports a smooth transition of data between the CNN and transformer branches but requires more careful handling during merging. For segmentation, CPT U-Net [20] uses a parallel architecture with a transformer pathway, where the CNN feature maps are flattened at each step and fed into the transformer. Ultimately, a merge module combines both feature spaces and computes classification scores using softmax. As a parallel architecture for high-quality image restoration, TransCT [27] decomposes the images into high-frequency (HF) and low-frequency (LF) components and uses a CNN to parse LF and a transformer to parse HF. To combine the transformer and CNN features, they again use a simple ResNet (two convolutional layers) to produce the final output.

Comparative Analysis: Merging Strategy

As highlighted by the dominance of sequential architectures (26 out of 34), irrespective of the downstream application, most existing hybrid architectures [7, 10, 17, 18, 21] leverage the long-context learning ability of transformers after calculating feature maps with a CNN, by reshaping the feature maps and adding positional encoding. TransUNet [22], an early example of a hybrid vision transformer architecture, proposed a feature reshaping approach that converts CNN feature maps into sequential tokens for ViT by transforming the channel dimensions of the CNN feature maps into hidden dimensions. This feature reshaping approach has been adopted by many hybrid ViT architectures. ResViT [29] repeatedly upsamples and downsamples feature maps between ViT and CNN, since ViT requires lower-resolution input because of its high computational complexity, whereas CNN requires high resolution to improve sensitivity to local features. This design showed that feature reshaping can be an essential component for providing feature maps that simultaneously make the most of both CNN and transformer.

Separate fusing modules are primarily leveraged by parallel architectures, where either linear merging or another model is used to fuse the features from the transformer and CNN [19, 24]. Interestingly, although it is a sequential network, DTN [25] leverages both feature reshaping, to handle the exchange between CNN and transformer, and a fusion module, to aggregate two parallel transformer blocks for temporal image frames. TransMorph [23] designed a unique strategy, which we grouped under the broad category of feature reshaping: a patch merging module reshapes and aligns the features between the 3D Swin transformer blocks in the encoder. The feature maps generated at each resolution are sent to a ConvNet decoder to produce an output. The fusion module is designed to handle feature reshaping and a smooth transition from CNN to ViT and vice versa. TECNN [19] and CPT U-Net [20] embedded two different reshaping methods within the merge module, depending on whether the output goes to the CNN branch or the transformer branch. Hybrid architectures also used cross-attention merging modules. Multi-View ViT [26] included a cross-attention module to combine two branches that each use a different view of the imaged object. For alignment, TransCT [27] also feeds mixed input to the attention module, where the key and value come directly from low-frequency embeddings and the query from high-frequency embeddings. Even though these two strategies use attention modules differently from directly merging CNN and ViT, they provide evidence that attention modules can serve as a merge module for two different types or views of input.
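
A cross-attention merge of this kind, with queries drawn from one embedding stream and keys and values from another, can be sketched as follows. This is a minimal illustration in plain Python, assuming single-head scaled dot-product attention without the learned projection matrices that real models apply; the function names and toy vectors are hypothetical and are not taken from TransCT or Multi-View ViT.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query token attends over the key/value tokens of the other stream."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs


# Toy example: one query token from stream A, two key/value tokens from stream B.
stream_a_queries = [[0.0, 0.0]]
stream_b_keys = [[1.0, 0.0], [0.0, 1.0]]
stream_b_values = [[2.0, 0.0], [0.0, 4.0]]
merged = cross_attention(stream_a_queries, stream_b_keys, stream_b_values)
print(merged)
```

The output has one row per query token, so the merged representation keeps the layout of the querying stream while its content is drawn from the other stream.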

Comparative Analysis: Transformer Utilization

In most studies (22 of 34), transformers are used solely for additional compression of the bottleneck features to capture the global context. Interestingly, 9 studies applied transformers for both encoding and decoding, particularly for target image generation tasks such as segmentation, registration, and restoration. Text generation models (R2Gen [28]) primarily use a transformer for the language generation task, with a CNN used as an image feature extractor. A compelling use of transformers is observed in the 3D Transformer GAN [18], which places a 12-layer transformer network between the encoder CNN and decoder CNN: 6 transformer layers perform encoding and 6 layers perform decoding before feeding into the CNN decoder block. The primary difference from the original transformer is that the transformer block applies parallel decoding to achieve parallel sequence prediction, which is then reshaped and processed by a 1 × 1 × 1 convolution. This differs from other hybrid architectures, which use a transformer encoder only as a bridge between the encoder and decoder of a U-shaped architecture. Beyond encoding and decoding, the transformer is also used as a merging module: it combines two different types of input data through cross-attention, as seen in Multi-View ViT [26].

Comparative Analysis: Application

The hybrid architectures are adopted for all fundamental medical image analysis tasks, e.g., segmentation (12), classification (2), reconstruction (4), and registration (3). Additionally, we observed some innovative applications of hybrid architectures. For example, ResViT [29] leveraged the contextual sensitivity of vision transformers along with the precision of convolution operators to generate missing multi-contrast MR series. For this synthesis task, the ResViT authors argue that the transformer’s strength in capturing global context improves synthesis performance; they compared their approach against state-of-the-art convolution-only GAN-based models and showed that the hybrid method outperformed them. TransCT [27] improved the final CT image quality from a low-dose CT image by using content features from transformers and latent texture features from CNNs. Multi-View ViT [26] designed a cross-view transformer to transfer information between unregistered image views at the level of spatial feature maps and validated its effectiveness for mammograms and chest X-rays. TransCT, ResViT, and Multi-View ViT performed image translation as an innovative application and demonstrated that the global content captured by the transformer can be used efficiently to derive the target images.

We observed that rare anomalies and pathological variations are not yet considered as a use case for hybrid models. This could be due to the number of trainable parameters and to the semi-supervised or self-supervised learning strategies adopted in hybrid models, which still require a large representative dataset for training. For the segmentation use case, smaller datasets have been used, but these need pixel- or voxel-level annotation for model training, and pre-trained backbone weights are often included.

Benchmarking

Based on the five benchmarking criteria described in “Benchmarking Criteria,” we compare the existing 34 architectures in Table 2. We observed that most of the hybrid architectures (28/34) are applied to high-dimensional imaging modalities, such as MR, CT, and PET. This design choice could be influenced by the capability of the transformer-CNN hybrid to digest both local and global contexts from a high-dimensional image space and generate a denser representation than a CNN-only encoder. The highest number of trainable parameters (346.8 M) is observed in Ultrasound ViT [30], which processes variable-length echocardiogram videos. It uses ResNetAE to distil the US frames into lower-dimensional embeddings (1024 D), stacks the resulting embeddings for each clip, and utilizes 16 parallel BERT encoders for spatiotemporal reasoning and two parallel regression tasks. T2Net [31] contains the lowest number of trainable parameters (1.4 M); it proposes a multitask learning framework with a shared parallel network backbone, leveraging knowledge from one task to speed up the learning of the other and increasing flexibility for sharing complementary features. For comparison, CNN-only models such as DenseNet121 and ResNet-50 have approximately 8.1 M and 25.6 M parameters, respectively, which shows that a hybrid architecture does not necessarily increase the trainable parameter count, as it helps to generate a dense representation.
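
To make parameter comparisons of this kind concrete, the sketch below counts trainable parameters for a 2D convolution layer and for one standard transformer encoder layer directly from their hyperparameters. It assumes the usual layer layout (query, key, value, and output projections with biases, a two-layer MLP, and two layer norms); the function names and the example sizes are hypothetical and are not taken from Table 2 or from any reviewed model.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Trainable parameters of a square-kernel 2D convolution."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def transformer_encoder_layer_params(d_model, d_ff):
    """Trainable parameters of one standard transformer encoder layer."""
    # Query, key, value, and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two-layer MLP: d_model -> d_ff -> d_model, with biases.
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + mlp + layer_norms


# Illustrative sizes: a 3 x 3 convolution from 64 to 128 channels,
# and an encoder layer with width 768 and MLP width 3072.
conv_count = conv2d_params(64, 128, 3)
layer_count = transformer_encoder_layer_params(768, 3072)
print(f"conv layer: {conv_count:,} parameters")
print(f"encoder layer: {layer_count:,} parameters")
```

Under these assumptions a single wide encoder layer holds far more parameters than a typical convolution layer, so the overall count of a hybrid depends strongly on how many transformer layers it uses and how wide they are.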

Theoretically, lightweight models with fewer parameters are faster during inference because they require fewer floating-point operations. However, while T2Net has the fewest parameters, Multi-View ViT [26] for chest X-ray has the fastest inference speed, which could be because it processes 2D compressed X-ray images. Inference time for segmentation (TransUNet [22]) and reconstruction (TransMorph [23]) models is often longer than for classification due to their pixel- or patch-wise processing strategies. Following a similar trend to CNNs, hybrid models are often trained and validated on open-source datasets, e.g., BraTS [32], Figshare [33], KiTS21 [34], TCIA [35], and Synapse, with limited samples of high-dimensional medical images, and have obtained state-of-the-art performance for all major medical image analysis tasks by outperforming their CNN-only or transformer-only counterparts. These trends show the benefit of designing hybrid architectures that combine CNNs and transformers.

Based on the ranking, we grouped the studies into 4 broad categories: (i) High: lighter model with lower FLOPs, applied to high-dimensional data, that achieved higher accuracy across multiple datasets; (ii) Medium: medium-to-lighter model with lower FLOPs that achieved moderate accuracy, or heavier model with higher accuracy validated across multiple datasets; (iii) Low: heavier model with moderate-to-low accuracy validated only on a single dataset; and (iv) Opaque: no public repository, and the model was not shared by the authors for the comparative analysis. We observed that community adoption of the Opaque models is limited (low adoption rate by citation); however, some exceptions are U-net Transformer, TECNN, After U-Net, and AlignTransformer, which could be due to a novel hybrid design element in the architecture (e.g., TECNN with its fusion module) or to being early work in the domain (AlignTransformer). High- and medium-ranked models are widely adopted; the exceptions are some very recent works (D-TrAttUnet and MLABHCTM).
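
The four categories can be read as a small set of rules. The sketch below encodes them as a function over coarse, hand-assigned labels; it is an illustrative reading of the prose only, not the procedure used to produce the ranking. The label vocabularies ("light", "medium", "heavy"; "high", "moderate", "low"), the function name, and the "Unassigned" fallback for combinations the text does not describe are all assumptions.

```python
def rank_category(size, accuracy, num_datasets, high_dimensional, publicly_available):
    """Map coarse model descriptors to one of the review's four categories.

    size: "light", "medium", or "heavy" (parameters and FLOPs combined)
    accuracy: "high", "moderate", or "low"
    num_datasets: number of datasets the model was validated on
    high_dimensional: True if applied to high-dimensional data (e.g., MR, CT, PET)
    publicly_available: True if code or weights could be obtained
    """
    if not publicly_available:
        return "Opaque"
    multi_dataset = num_datasets > 1
    if size == "light" and accuracy == "high" and multi_dataset and high_dimensional:
        return "High"
    if size in ("light", "medium") and accuracy == "moderate":
        return "Medium"
    if size == "heavy" and accuracy == "high" and multi_dataset:
        return "Medium"
    if size == "heavy" and accuracy in ("moderate", "low") and not multi_dataset:
        return "Low"
    # Combinations not covered by the category descriptions.
    return "Unassigned"


# Hypothetical descriptors, not taken from any reviewed model.
example = rank_category("heavy", "high", 3, True, True)
```

Writing the rules out this way also makes the gaps visible: a light, highly accurate model validated on a single dataset, for instance, matches none of the stated categories.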

Discussion

Following the PRISMA guideline, we performed a systematic review of 34 hybrid CNN and transformer architectures for radiological image analysis tasks. The diverse roles of transformers, from encoding raw data to decoding generated outputs, highlight their versatility in medical image analysis tasks. While transformers excel at capturing long-range dependencies, their integration into hybrid architectures introduces challenges in balancing computational efficiency and performance across different applications. The wide range of applications for hybrid vision transformer architectures underscores their potential to address various challenges in radiology, from segmentation and classification to reconstruction and registration. Innovative approaches, such as missing image generation and image quality enhancement, demonstrate the versatility and adaptability of these models to clinical needs.

The predominance of sequential architectures in hybrid vision transformer models suggests a preference for a structured flow of information from CNN feature extraction to transformer-based representation learning. This sequential approach aligns well with the nature of medical image analysis tasks, where hierarchical feature extraction and global context understanding are crucial. The utilization of feature reshaping and positional encoding reflects efforts to bridge the gap between CNN and transformer representations, enabling effective fusion of spatial and contextual information. While sequential architectures predominantly rely on these strategies, parallel architectures explore alternative fusion methods, emphasizing the need for further research into optimal merging techniques. Our derived ranking shows that efficient, publicly shared models validated on multiple datasets across domains have a greater chance of community adoption.

However, challenges remain in ensuring robustness, interpretability, and scalability for real-world radiological applications. Further validation on diverse datasets and clinical scenarios is essential to establish the efficacy and reliability of hybrid models. While hybrid architectures offer advanced capabilities in feature representation and context modeling, their computational demands pose practical challenges, particularly in resource-constrained clinical environments. Optimizing model size, training efficiency, and inference speed is crucial for facilitating widespread adoption and deployment in clinical settings. Moreover, the reliance of hybrid architectures on large-scale annotated datasets raises concerns about data availability and quality, especially for rare diseases and specialized imaging modalities. Addressing data scarcity through data augmentation, transfer learning, and domain adaptation techniques is critical for generalizing model performance across diverse clinical scenarios. Furthermore, the combination of transformers and CNNs introduces multiple layers of abstraction in hybrid architectures, which complicates interpretability. CNNs are usually interpreted using feature maps and saliency maps, whereas transformers are usually interpreted through their attention weights. Therefore, for parallel architectures, one interpretability solution could be to divide the gradient equally and backpropagate it through the parallel branches to derive separate visualization schemes for the CNN and the transformer. However, combining insights from both components in a sequential architecture is complex and requires novel visualization techniques.

Future research in radiological image analysis should explore novel hybrid architectures that leverage the complementary strengths of CNNs and transformers while addressing their inherent limitations. Investigating alternative fusion strategies, attention mechanisms, and network architectures could lead to more efficient and effective hybrid models tailored to specific clinical tasks and modalities. Comprehensive validation studies on diverse clinical datasets are essential to assess the generalization and robustness of hybrid vision transformer models across different imaging modalities and pathologies. Collaborations between clinicians, radiologists, and machine learning researchers are crucial for co-designing and evaluating models that meet clinical needs and standards. The use of distinct private datasets for evaluation and the reporting of performance with different metrics make comparison between models extremely challenging. Standardized benchmarking criteria and open-source datasets are needed for better model understanding.

We believe that our review will set a proper benchmarking framework for hybrid models, and that most emerging radiology image analysis research, from foundational models and binary classification to complex multimodal data integration, can obtain a significant performance boost by using hybrid models. However, efforts to optimize model inference speed, memory efficiency, and hardware compatibility are essential for enabling real-time deployment of hybrid architectures in clinical workflows. Leveraging hardware accelerators, model compression techniques, and cloud-based inference services can facilitate seamless integration of these hybrid models into existing radiology infrastructure, not only for diagnostic and prognostic tasks but also for real-time image quality restoration, noise reduction, segmentation, and quantification. As hybrid vision transformer models become increasingly integrated into clinical practice, attention must be paid to ethical and regulatory considerations, including patient privacy, data security, and algorithmic transparency. By addressing challenges related to model design, computational efficiency, data availability, and ethical considerations, these models can significantly impact patient care and healthcare delivery, paving the way for a future where AI-powered radiology transforms diagnosis, treatment, and patient outcomes.

Limitations

The first ViT paper [2] was published in 2020; thus, we started our review timeline from Jan. 2020. We expect that we did not miss significant work before 2020; however, due to the publication timeline, we stopped reviewing in Sept. 2024, so the review does not incorporate architectures that were in review or production at that time. Due to the non-standardized performance reporting style in the literature, and because models were validated on distinct datasets, we were unable to create a standard performance benchmark. If a model was not available under an open-source license, we were unable to calculate its inference speed and trainable parameters. We did make an effort to reach the corresponding authors but did not get model access for a few studies (marked as “–” in Table 2).

Essentials

  1. Hybrid vision transformer architectures preserve both local and global spatial dependencies in radiological images.

  2. Such architectures hold tremendous promise for advancing medical image analysis in radiology, offering superior performance, interpretability, and clinical relevance.

  3. Comprehensive validation studies on diverse clinical datasets are essential to assess the generalization and robustness of hybrid vision transformer models.

  4. Efforts to optimize model inference speed, memory efficiency, and hardware compatibility are essential for enabling real-time deployment.

Data Availability

All the data generated during the systematic review is available upon formal request.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.S. M. Anwar, M. Majid, A. Qayyum, M. Awais, M. Alnowami, and M. K. Khan, Medical image analysis using convolutional neural networks: a review, J. Med. Syst., vol. 42, pp. 1–13, 2018.
  • 2.A. Dosovitskiy et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
  • 3.J. Li, J. Chen, Y. Tang, C. Wang, B. A. Landman, and S. K. Zhou, Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives, Med. Image Anal., vol. 85, p. 102762, 2023.
  • 4.F. Shamshad et al., Transformers in medical imaging: A survey, Med. Image Anal., p. 102802, 2023.
  • 5.K. Han, A. Xiao, E. Wu, J. Guo, C. XU, and Y. Wang, Transformer in Transformer, in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2021, pp. 15908–15919. Accessed: Oct. 09, 2024. [Online]. Available: https://proceedings.neurips.cc/paper/2021/hash/854d9fca60b4bd07f9bb215d59ef5561-Abstract.html
  • 6.T. Wolf, D. Lysandre, V. Sanh, J. Chaumond, C. Delangue, A.Moi, P. Cistac et al. "Transformers: State-of-the-art natural language processing." In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp. 38-45. 2020.
  • 7.Y. Gao, M. Zhou, and D. N. Metaxas, UTNet: a hybrid transformer architecture for medical image segmentation, in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24, Springer, 2021, pp. 61–71.
  • 8.H.-C. Shin et al., Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning, IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
  • 9.K. Suzuki, Overview of deep learning in medical imaging, Radiol. Phys. Technol., vol. 10, no. 3, pp. 257–273, 2017.
  • 10.Wenxuan, Wang, Chen Chen, Ding Meng, Yu Hong, Zha Sen, and Li Jiangyun. "Transbts: Multimodal brain tumor segmentation using transformer." In International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 109-119. 2021.
  • 11.Khan, A., Z. Rauf, A. Sohail, A. Rehman, H. Asif, A. Asif, and U. Farooq. "A survey of the Vision Transformers and its CNN-Transformer based Variants." arXiv preprint arXiv:2305.09880 (2023).
  • 12.A. Djouadi, Oe. Snorrason, and F. D. Garber, The quality of training sample estimates of the bhattacharyya coefficient, IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 1, pp. 92–97, 1990.
  • 13.C. Li and C. Zhang, Toward a Deeper understanding: RetNet viewed through convolution, Pattern Recognition (2024), p. 110625, 2024.
  • 14.Z. R. Murphy, K. Venkatesh, J. Sulam, and P. H. Yi, Visual transformers and convolutional neural networks for disease classification on radiographs: a comparison of performance, sample efficiency, and hidden stratification, Radiol. Artif. Intell., vol. 4, no. 6, p. e220012, 2022.
  • 15.Mao, Xiaofeng, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue. "Towards robust vision transformer." In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pp. 12042-12051. 2022.
  • 16.Ambita, Ara Abigail E., Eujene Nikka V. Boquio, and Prospero C. Naval Jr. "Covit-gan: vision transformer for covid-19 detection in ct scan images with self-attention gan for data augmentation." In International Conference on Artificial Neural Networks, pp. 587-598. Cham: Springer International Publishing, 2021.
  • 17.Faisal, Muhamad, Jeremie Theddy Darmawan, Nabil Bachroin, Cries Avian, Jenq Shiou Leu, and Chia-Ti Tsai. "CheXViT: CheXNet and Vision Transformer to Multi-Label Chest X-Ray Image Classification." In 2023 IEEE International Symposium on Medical Measurements and Applications (MeMeA), pp. 1-6. IEEE, 2023.
  • 18.Luo, Yanmei, Yan Wang, Chen Zu, Bo Zhan, Xi Wu, Jiliu Zhou, Dinggang Shen, and Luping Zhou. "3D transformer-GAN for high-quality PET reconstruction." In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VI 24, pp. 276-285. Springer International Publishing, 2021.
  • 19.M. Aloraini, A. Khan, S. Aladhadh, S. Habib, M. F. Alsharekh, and M. Islam, Combining the Transformer and Convolution for Effective Brain Tumor Classification Using MRI Images, Appl. Sci., vol. 13, no. 6, p. 3680, 2023.
  • 20.J. Zhu, Y. Sheng, H. Cui, J. Ma, J. Wang, and H. Xi, Cross Pyramid Transformer makes U-net stronger in medical image segmentation, Biomed. Signal Process. Control, vol. 86, p. 105361, 2023.
  • 21.Shen, Zhiqiang, Hua Yang, Zhen Zhang, and Shaohua Zheng. "Automated kidney tumor segmentation with convolution and transformer network." In International Challenge on Kidney and Kidney Tumor Segmentation, pp. 1-12. Cham: Springer International Publishing, 2021.
  • 22.Chen, Jieneng, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, and Yuyin Zhou. "Transunet: Transformers make strong encoders for medical image segmentation." arXiv preprint arXiv:2102.04306 (2021).
  • 23.J. Chen, E. C. Frey, Y. He, W. P. Segars, Y. Li, and Y. Du, Transmorph: Transformer for unsupervised medical image registration, Med. Image Anal., vol. 82, p. 102615, 2022.
  • 24.Wang, Fan, and Bo Wang. "Hybrid transformer and convolution for medical image segmentation." In 2022 International conference on image processing, computer vision and machine learning (ICICML), pp. 156-159. IEEE, 2022.
  • 25.Zhang, Yungeng, Yuru Pei, and Hongbin Zha. "Learning dual transformer network for diffeomorphic registration." In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, pp. 129-138. Springer International Publishing, 2021.
  • 26.Van Tulder, Gijs, Yao Tong, and Elena Marchiori. "Multi-view analysis of unregistered medical images using cross-view transformers." In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24, pp. 104-113. Springer International Publishing, 2021.
  • 27.Zhang, Zhicheng, Lequan Yu, Xiaokun Liang, Wei Zhao, and Lei Xing. "TransCT: dual-path transformer for low dose computed tomography." In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VI 24, pp. 55-64. Springer International Publishing, 2021.
  • 28.Chen, Zhihong, Yan Song, Tsung-Hui Chang, and Xiang Wan. "Generating Radiology Reports via Memory-driven Transformer." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1439-1449. 2020.
  • 29.O. Dalmaz, M. Yurt, and T. Çukur, ResViT: Residual vision transformers for multimodal medical image synthesis, IEEE Trans. Med. Imaging, vol. 41, no. 10, pp. 2598–2614, 2022.
  • 30.Reynaud, Hadrien, Athanasios Vlontzos, Benjamin Hou, Arian Beqiri, Paul Leeson, and Bernhard Kainz. "Ultrasound video transformers for cardiac ejection fraction estimation." In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VI 24, pp. 495-505. Springer International Publishing, 2021.
  • 31.Feng, Chun-Mei, Yunlu Yan, Huazhu Fu, Li Chen, and Yong Xu. "Task transformer network for joint MRI reconstruction and super-resolution." In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VI 24, pp. 307-317. Springer International Publishing, 2021.
  • 32.M. Ghaffari, A. Sowmya, and R. Oliver, Automated brain tumor segmentation using multimodal brain scans: a survey based on models submitted to the BraTS 2012–2018 challenges, IEEE Rev. Biomed. Eng., vol. 13, pp. 156–168, 2019.
  • 33.J. Cheng et al., Enhanced performance of brain tumor classification via tumor region augmentation and partition, PloS One, vol. 10, no. 10, p. e0140381, 2015.
  • 34.Heller, Nicholas, Fabian Isensee, Dasha Trofimova, Resha Tejpaul, Zhongchen Zhao, Huai Chen, Lisheng Wang et al. "The kits21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase ct." arXiv preprint arXiv:2307.01984 (2023).
  • 35.K. Clark et al., The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository, J. Digit. Imaging, vol. 26, pp. 1045–1057, 2013.
  • 36.Petit, Olivier, Nicolas Thome, Clement Rambour, Loic Themyr, Toby Collins, and Luc Soler. "U-net transformer: Self and cross attention for medical image segmentation." In Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings 12, pp. 267-276. Springer International Publishing, 2021.
  • 37.Hatamizadeh, Ali, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R. Roth, and Daguang Xu. "Unetr: Transformers for 3d medical image segmentation." In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 574-584. 2022.
  • 38.Hatamizadeh, Ali, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R. Roth, and Daguang Xu. "Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images." In International MICCAI brainlesion workshop, pp. 272-284. Cham: Springer International Publishing, 2021.
  • 39.Xie, Yutong, Jianpeng Zhang, Chunhua Shen, and Yong Xia. "Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation." In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24, pp. 171-180. Springer International Publishing, 2021.
  • 40.Jia, Qiran, and Hai Shu. "Bitr-unet: a cnn-transformer combined network for mri brain tumor segmentation." In International MICCAI Brainlesion Workshop, pp. 3-14. Cham: Springer International Publishing, 2021.
  • 41.Yan, Xiangyi, Hao Tang, Shanlin Sun, Haoyu Ma, Deying Kong, and Xiaohui Xie. "After-unet: Axial fusion transformer unet for medical image segmentation." In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 3971-3981. 2022.
  • 42.Li, Yijiang, Wentian Cai, Ying Gao, Chengming Li, and Xiping Hu. "More than encoder: Introducing transformer decoder to upsample." In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1597-1602. IEEE, 2022.
  • 43.L. Bi, U. Buehner, X. Fu, T. Williamson, P. Choong, and J. Kim, Hybrid CNN-transformer network for interactive learning of challenging musculoskeletal images, Comput. Methods Programs Biomed., vol. 243, p. 107875, Jan. 2024. 10.1016/j.cmpb.2023.107875.
  • 44.H. Kuang et al., Hybrid CNN-Transformer Network With Circular Feature Interaction for Acute Ischemic Stroke Lesion Segmentation on Non-Contrast CT Scans, IEEE Trans. Med. Imaging, vol. 43, no. 6, pp. 2303–2316, Jun. 2024. 10.1109/TMI.2024.3362879.
  • 45.F. Bougourzi, F. Dornaika, C. Distante, and A. Taleb-Ahmed, D-TrAttUnet: Toward hybrid CNN-transformer architecture for generic and subtle segmentation in medical images, Comput. Biol. Med., vol. 176, p. 108590, Jun. 2024. 10.1016/j.compbiomed.2024.108590.
  • 46.R. Lin, W. Qi, and T. Wang, Multi-level Augmentation Boosts Hybrid CNN-Transformer Model for Semi-supervised Cardiac MRI Segmentation, in Neural Information Processing, B. Luo, L. Cheng, Z.-G. Wu, H. Li, and C. Li, Eds., Singapore: Springer Nature, 2024, pp. 552–563. 10.1007/978-981-99-8079-6_43.
  • 47.X. Ying and M. C. Chuah, UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation, in Computer Vision – ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds., in Lecture Notes in Computer Science, vol. 13690, Cham: Springer Nature Switzerland, 2022, pp. 20–37. 10.1007/978-3-031-20056-4_2.
  • 48.Breast Ultrasound Tumor Classification Using a Hybrid Multitask CNN-Transformer Network, MICCAI 2023 - Accepted Papers, Reviews, Author Feedback. Accessed: Oct. 02, 2024. [Online]. Available: https://conferences.miccai.org/097-Paper1347
  • 49.Rajpurkar, P. "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning." arXiv preprint arXiv:1711.05225 (2017).
  • 50.Y. Korkmaz, S. U. Dar, M. Yurt, M. Özbey, and T. Cukur, Unsupervised MRI reconstruction via zero-shot learned adversarial transformers, IEEE Trans. Med. Imaging, vol. 41, no. 7, pp. 1747–1763, 2022.
  • 51.Pan, Jiayi, Heye Zhang, Weifei Wu, Zhifan Gao, and Weiwen Wu. "Multi-domain integrative Swin transformer network for sparse-view tomographic reconstruction." Patterns 3, no. 6 (2022).
  • 52.Chen, Junyu, Yufan He, Eric Frey, Ye Li, and Yong Du. "ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration." In Medical Imaging with Deep Learning., 2021.
  • 53.You, Di, Fenglin Liu, Shen Ge, Xiaoxia Xie, Jing Zhang, and Xian Wu. "Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report generation." In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24, pp. 72-82. Springer International Publishing, 2021.
  • 54.V. M. Campello et al., Multi-centre, multi-vendor and multi-disease cardiac segmentation: the M&Ms challenge, IEEE Trans. Med. Imaging, vol. 40, no. 12, pp. 3543–3554, 2021.
  • 55.Wang, Xiaosong, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097-2106. 2017.


Articles from Journal of Imaging Informatics in Medicine are provided here courtesy of Springer
