MethodsX. 2026 Mar 28;16:103890. doi: 10.1016/j.mex.2026.103890

KeyCap3D: Keyword-Guided 3D Medical Image Captioning with Cross-Attention

Supriyanto Supriyanto a, Muhammad Ibadurrahman Arrasyid Supriyanto b, Haviluddin Haviluddin b, Chalsi Mala Sari b, Hajar Mar'atussholikah Supriyanto c, Rayner Alfred d
PMCID: PMC13092010  PMID: 42011371

Abstract

This study presents a keyword-guided cross-attention framework for automated radiological report generation from 3D FLAIR MRI brain tumor images. The architecture integrates M3D-CLIP as the image encoder. Hierarchical keyword extraction is performed using fine-tuned KeyBERT and BioBERT semantic embeddings in a 768-dimensional space. Six cross-attention layers fuse visual features with clinical keywords across four hierarchical levels: abnormality type, lesion characteristics, anatomical location, and lateralization. A four-layer transformer decoder generates captions autoregressively. The BraTS2020 dataset containing 369 glioma patients paired with TextBraTS radiological descriptions was preprocessed with center-focused slice selection of 32 from 155 slices and spatial interpolation to 256 × 256 resolution. Training on NVIDIA RTX 3050 GPU for 15 epochs using AdamW optimizer achieved loss reduction from 4.16 to 1.33. Evaluation on 20 test samples demonstrated BLEU-1 of 0.5359, BLEU-2 of 0.3969, and ROUGE-L of 0.5051, with generated captions accurately capturing clinical information for decision support applications.

  • Multi-modal fusion through keyword-guided cross-attention integrating visual MRI features with hierarchical clinical terminology

  • Transformer-based autoregressive generation conditioned on enriched image-keyword representations

  • Comprehensive evaluation using BLEU and ROUGE metrics on brain tumor caption generation task

Keywords: Automated captioning, Keyword-guided model, Brain tumor detection, Glioblastoma, 3D FLAIR MRI

Graphical abstract


Proposed Keyword-Guided 3D Medical Image Captioning Architecture.


Specifications table.

Subject area Neuroscience
More specific subject area Medical Text to Text Generation, Vision-Language Models
Name of your method KeyCap3D
Name and reference of original method A. Vaswani et al., “Attention is All you Need,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2017.
Resource availability Hardware: NVIDIA GPU with ≥16GB VRAM
Software: PyTorch 2.0+, Transformers 4.30+, Python 3.10+

Background

Brain and central nervous system cancer represents a significant global public health burden due to its high mortality and low survival rates. Based on data from the Global Burden of Disease study, a total of 347,992 new cases of brain cancer and 246,253 deaths were recorded worldwide in 2019, indicating a substantial mortality burden associated with this disease [1]. Despite advances in diagnostic and therapeutic strategies, survival outcomes for brain cancer remain poor. The diagnosis of brain cancer relies heavily on advanced imaging techniques, which are more accessible in high-income regions and contribute to regional differences in incidence and mortality rates. Accurate diagnosis and assessment are essential for effective treatment planning, as delayed diagnosis and limited access to appropriate imaging and treatment are associated with poorer outcomes.

To reduce the high burden of medical image interpretation, medical image captioning has been developed as a technology that generates automatic text descriptions of medical images using artificial intelligence [2]. Unlike classification systems that only determine the presence or absence of tumors, or segmentation systems that mark tumor boundaries, captioning technology generates complete sentences that describe the location of the tumor, its characteristic shape, and its effects on surrounding tissue [3]. Advances in deep learning have enabled the adaptation of architectures such as convolutional neural networks to process images and long short-term memory networks to generate text, demonstrated on a chest X-ray dataset [4]. However, the application of medical image captioning for brain tumors faces three main challenges. First, brain MRI data is volumetrically complex, with multi-parametric sequences spanning millions of voxels [5]. Second, describing the location of a tumor requires identifying multiple brain structures hierarchically, from the hemispheres and lobes down to individual gyri [6], following established anatomical atlases. Third, brain tumors, particularly glioblastomas, exhibit significant morphological heterogeneity, with shapes ranging from well-circumscribed to infiltrative margins with ill-defined boundaries, making automated recognition challenging for computational models [7].

Several studies have developed medical image captioning approaches to address the challenges of brain tumor image interpretation. In study [4], researchers proposed a method of compressing images into a single global vector using a CNN encoder and LSTM decoder with an attention mechanism that achieved a BLEU-4 score of 0.28 on chest X-rays. The main drawback is that the attention mechanism lacks anatomical guidance, so the model cannot accurately determine anatomical locations. In study [8], researchers addressed the problem of location information loss using ResNet-101 and a memory-driven transformer, achieving a BLEU-4 score of 0.353 on chest X-ray images. The main drawback is that compressing 8.9 million image points into 768 numbers removes anatomical position coordinate information, preventing the model from accurately distinguishing anatomical locations. In study [9], researchers improved the global encoding problem by using hierarchical classification for semantic feature extraction and then integrating it with Vision Transformer and BioMedBERT, achieving a BLEU-1 score of 0.3348 on MRI brain images. The main drawback is the use of the BLEU-1 evaluation metric, which only measures unigram precision and cannot evaluate word sequence accuracy, so that medical phrases such as “right frontal lobe” and “frontal right lobe” receive the same score even though they have different anatomical meanings.

To overcome the problem of spatial information loss, this study proposes KeyCap3D (Keyword-Guided 3D Medical Image Captioning with Cross-Attention) through three integrated innovations. First, M3D-CLIP extracts visual features of each tumor region independently. Second, keyword extraction uses KeyBERT to encode medical terms at four levels: the type of abnormality, lesion characteristics, anatomical location, and lateralization. Third, keyword-guided cross-attention uses keywords to guide the attention mechanism, assigning high weight to relevant visual regions so that the resulting textual description corresponds to the correct anatomical location. Validation was performed on the TextBraTS dataset using the BLEU and ROUGE-L evaluation metrics.

Method details

We propose a cross-attention-based medical image captioning framework that generates natural language descriptions of brain tumors from 3D FLAIR MRI sequences. The method consists of: (1) feature extraction using the M3D-CLIP encoder, (2) multi-modal fusion through cross-attention layers between image features and keyword embeddings, and (3) caption generation via a transformer decoder. The model is trained end-to-end with cross-entropy loss to produce clinically accurate descriptions. The architecture is shown in Fig. 1:

Fig. 1.


KeyCap3D model.

Dataset and data preprocessing

In this study, we used two datasets. First, 3D FLAIR MRI volumes were obtained from the BraTS2020 (Brain Tumor Segmentation) dataset, which consists of brain MRI sequences from 369 glioma patients with original dimensions of 240 × 240 × 155 voxels [10]. Second, the caption data used was sourced from TextBraTS, a specialized dataset that provides radiological descriptions for brain tumor MRI in the BraTS2020 dataset [11].

To adjust the 3D MRI data to the input requirements of the M3D-CLIP encoder (32 × 256 × 256), we applied a strategic slice selection and pre-processing pipeline. First, we applied min-max normalization to standardize the intensity values across the volume to the range [0, 1]. Next, we performed spatial interpolation to resize the axial dimension from 240 × 240 to 256 × 256 using trilinear interpolation. For the depth dimension, instead of processing all 155 slices, we used a center-focused slice selection strategy to capture the most relevant anatomical regions. We identified two central slices (slices 77 and 78 in the original volume) and then extracted 15 slices before the center (slices 62–76) and 15 slices after the center (slices 79–93), resulting in a total of 32 consecutive slices (15 + 2 + 15 = 32). This approach preserves critical tumor information typically located in the midbrain region while reducing computational complexity. The selected 32 slices are stacked along the depth dimension to form a unified 3D array with shape (32, 256, 256), which is then saved in NumPy (.npy) format. The BraTS caption text is encoded using the BioBERT model as described in the Multi-Modal Input Encoding section.
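The pipeline above can be sketched as follows. This is a minimal illustration assuming 0-indexed slice positions and PyTorch for the trilinear interpolation; the authors' exact indexing and I/O may differ:

```python
import numpy as np
import torch
import torch.nn.functional as F

def preprocess_volume(volume: np.ndarray) -> np.ndarray:
    """Min-max normalization, center-focused slice selection, and
    trilinear resizing of the axial plane from 240x240 to 256x256."""
    v = volume.astype(np.float32)
    # Min-max normalization to [0, 1]
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)

    # Center-focused selection: 15 before + 2 central + 15 after = 32 slices
    # (0-indexed here as an assumption; the paper counts slices 62-93)
    v = v[:, :, 62:94]                                   # (240, 240, 32)

    # Depth-first layout, then in-plane trilinear interpolation to 256x256
    t = torch.from_numpy(v).permute(2, 0, 1)             # (32, 240, 240)
    t = t.unsqueeze(0).unsqueeze(0)                      # (1, 1, 32, 240, 240)
    t = F.interpolate(t, size=(32, 256, 256),
                      mode="trilinear", align_corners=False)
    return t.squeeze().numpy()                           # (32, 256, 256)

vol = np.random.rand(240, 240, 155).astype(np.float32)
out = preprocess_volume(vol)
print(out.shape)  # (32, 256, 256)
```

The result would then be saved with `np.save` for the encoder stage.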

Multi-Modal input encoding

At this stage, there are three different input modalities using different encoders that are appropriate for medical needs. First is the keyword extraction and embedding process. In this process, we use a KeyBERT model that has been fine-tuned specifically for medical use so that the model can extract clinically relevant keywords from radiology reports [12,13]. The extraction follows a four-level hierarchical structure that captures various aspects of lesion descriptions: (1) type of abnormality identifies pathological conditions such as glioblastoma, meningioma, or metastasis, (2) lesion characteristics describing appearance and intensity features including hyperintensity, heterogeneity, and contrast enhancement patterns, (3) anatomical location determining the affected brain region such as the frontal lobe, temporal lobe, or parietal cortex, and (4) lateralization indicating which hemisphere is affected: left, right, or bilateral. For each report, KeyBERT extracts the 5 most relevant keywords in these four categories based on semantic similarity. These multi-level keywords are then combined to produce a unified representation in the form [B, 1, 768] that captures semantic relationships.
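The hierarchical extraction can be illustrated with a toy ranking sketch. The actual pipeline uses a fine-tuned KeyBERT model with BioBERT embeddings; the bag-of-words vectors and the candidate term lists below are hypothetical stand-ins that show only the similarity-ranking principle (the paper takes the top 5 per category; the toy lists here hold only 3 candidates each):

```python
import numpy as np

# Candidate clinical terms per hierarchy level (illustrative, not exhaustive)
CATEGORIES = {
    "abnormality": ["glioblastoma", "meningioma", "metastasis"],
    "characteristics": ["hyperintensity", "heterogeneity", "enhancement"],
    "location": ["frontal lobe", "temporal lobe", "parietal cortex"],
    "lateralization": ["left", "right", "bilateral"],
}

def bow_vector(text, vocab):
    """Toy bag-of-words embedding (stand-in for BioBERT semantic vectors)."""
    tokens = text.lower().split()
    return np.array([tokens.count(w) for w in vocab], dtype=float)

def extract_keywords(report, top_n=1):
    """Rank each category's candidates by cosine similarity to the report."""
    vocab = sorted({w for terms in CATEGORIES.values()
                      for t in terms for w in t.split()})
    doc_vec = bow_vector(report, vocab)
    ranked = {}
    for cat, terms in CATEGORIES.items():
        scores = []
        for term in terms:
            t_vec = bow_vector(term, vocab)
            denom = (np.linalg.norm(doc_vec) * np.linalg.norm(t_vec)) or 1.0
            scores.append((term, float(doc_vec @ t_vec) / denom))
        ranked[cat] = sorted(scores, key=lambda s: -s[1])[:top_n]
    return ranked

report = "Heterogeneous glioblastoma with hyperintensity in the right frontal lobe"
kws = extract_keywords(report)
print(kws["location"][0][0])        # 'frontal lobe'
print(kws["lateralization"][0][0])  # 'right'
```

In the real pipeline, the selected keywords are then embedded and pooled into the unified [B, 1, 768] representation described above.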

Second, at this stage, researchers encoded 3D MRI images using the FLAIR modality. The 3D FLAIR MRI volume (32 × 256 × 256) prepared in the data preprocessing stage is passed through the M3D-CLIP encoder [14]. This encoder uses 3D convolution layers with spatial pooling to extract volumetric features, producing a global image representation with the form [B, 768]. This feature vector represents high-level anatomical and pathological patterns present in brain MRI, including tumor characteristics and surrounding tissue context.

Third, at this stage, the researchers performed ground-truth text encoding. The previously mentioned TextBraTS ground-truth captions were tokenized and encoded using the BioBERT model [15]. We extracted the [CLS] token embedding to obtain a fixed-length representation with the form [B, 768] that captures the semantic content of the ground-truth description. The three modalities (keyword embedding [B, 1, 768], image embedding [B, 768], and text embedding [B, 768]) share the same 768-dimensional embedding space. By equalizing this dimension, the three types of data can be combined and processed together in the subsequent cross-attention layers.

Keyword-Guided with cross-attention

The cross-attention mechanism serves as the core fusion module that integrates visual features from MRI images with semantic information from clinical keywords, as illustrated in Fig. 2.

Fig. 2.


Keyword guided architecture model.

In each keyword guide layer, the keyword embeddings serve as queries (Q) with shape [B, 1, 768], while the image features provide both keys (K) and values (V) with shape [B, 768]. This configuration allows the model to search through the entire image representation guided by clinical semantic information. The attention mechanism computes similarity scores between keyword queries and image keys, then uses these scores to aggregate relevant visual features from the values.

The attention operation follows the scaled dot-product formulation from [16]:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k = 768$ represents the dimension of the key vectors. The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large in high-dimensional spaces, which could lead to extremely small gradients during backpropagation. The softmax function normalizes the attention weights to sum to one, creating a probability distribution over the image features.
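The keyword-guided attention step can be sketched directly from this formulation. Projection matrices and multi-head splitting are omitted for brevity, and expanding the global image feature [B, 768] to a length-1 key/value sequence is an assumption based on the shapes stated above:

```python
import math
import torch
import torch.nn.functional as F

def keyword_guided_attention(keyword_q, image_kv):
    """Scaled dot-product attention with keyword embeddings as queries
    and the global image feature as keys/values.
    keyword_q: [B, 1, 768]; image_kv: [B, 768]."""
    kv = image_kv.unsqueeze(1)                # [B, 1, 768] length-1 sequence
    d_k = keyword_q.size(-1)                  # 768
    scores = keyword_q @ kv.transpose(-2, -1) / math.sqrt(d_k)  # [B, 1, 1]
    weights = F.softmax(scores, dim=-1)       # rows sum to one
    return weights @ kv                       # [B, 1, 768] fused feature

q = torch.randn(4, 1, 768)    # keyword embeddings (random placeholders)
img = torch.randn(4, 768)     # global image features
fused = keyword_guided_attention(q, img)
print(fused.shape)  # torch.Size([4, 1, 768])
```

In the full model this block is stacked six times with learned Q/K/V projections and eight heads, as described in the training section.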

Transformer decoder

The transformer decoder implements the standard architecture proposed by [16] through the PyTorch library using nn.TransformerDecoderLayer. The decoder consists of four stacked identical layers, where each layer contains three main sub-layers with specific computational roles. The first sub-layer performs masked multi-head self-attention over the target sequence. This mechanism computes attention between all positions in the output sequence, with masking applied to prevent positions from attending to subsequent positions. The multi-head attention with eight heads allows the model to jointly attend to information from different representation subspaces at different positions. Each head operates on dimension 96 (768/8), computing independent attention patterns that are concatenated and linearly projected back to dimension 768.

The second sub-layer implements multi-head cross-attention between the decoder and the encoder output. Here, the decoder queries attend to the enhanced multi-modal representations from the keyword-guided cross-attention module. This encoder-decoder attention mechanism enables the decoder to focus on relevant parts of the input conditioning while generating each token. The attention weights dynamically determine which aspects of the visual and semantic features are most relevant for predicting the next word.

The third sub-layer is a position-wise fully connected feed-forward network, which consists of two linear transformations with a ReLU activation between them. The network applies the transformation $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, where the inner layer expands from 768 to 2048 dimensions and the outer layer projects back to 768 dimensions. This feed-forward network is applied identically to each position separately.
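The decoder configuration described above maps onto PyTorch as a short sketch. Hyperparameters are taken from the text; the tensor contents are random placeholders:

```python
import torch
import torch.nn as nn

# Four identical layers, 8 heads (head dim 96), 768->2048->768 FFN with ReLU
layer = nn.TransformerDecoderLayer(
    d_model=768, nhead=8, dim_feedforward=2048,
    activation="relu", batch_first=True,
)
decoder = nn.TransformerDecoder(layer, num_layers=4)

B, T = 2, 10
tgt = torch.randn(B, T, 768)      # shifted target embeddings
memory = torch.randn(B, 1, 768)   # fused image-keyword representation
mask = nn.Transformer.generate_square_subsequent_mask(T)  # causal mask
out = decoder(tgt, memory, tgt_mask=mask)
print(out.shape)  # torch.Size([2, 10, 768])
```

The `memory` argument is where the keyword-guided cross-attention output enters the decoder's encoder-decoder attention sub-layer.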

Training and parameter

The model is trained end-to-end to minimize cross-entropy loss between predicted and ground-truth caption tokens, with padding tokens ignored during computation. The loss function is formulated as:

$$\mathcal{L}_{CE} = -\frac{1}{T}\sum_{t=1}^{T} \log P\left(y_t \mid y_{<t}, x\right)$$

where $x$ represents the multi-modal input and $y_t$ is the ground-truth token at position $t$.

We employ the AdamW optimizer with learning rate $5\times10^{-5}$ and weight decay 0.01 [17]. Training runs for 15 epochs with batch size 1, limited by the memory requirements of processing 3D FLAIR volumes (32 × 256 × 256) alongside large pre-trained models. The BraTS2020-TextBraTS dataset (369 patients) is split 80:20 for training and testing with patient-level stratification, yielding approximately 295 training and 74 test patients. Random seed 42 ensures reproducibility.

During training, we implement teacher forcing where target tokens are shifted by one position [18]. Token embeddings are scaled by $\sqrt{768}$ and combined with sinusoidal positional encodings (max length 512, dropout 0.1). Maximum sequence length is set to 256 tokens. Causal masking through a square subsequent mask ensures autoregressive generation, preventing the decoder from attending to future tokens.
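These input-preparation details can be sketched as follows. The token ids are hypothetical, and the mask construction mirrors PyTorch's square subsequent mask:

```python
import math
import torch

d_model, vocab_size = 768, 30522              # BioBERT-sized vocab (assumption)
emb = torch.nn.Embedding(vocab_size, d_model)

tokens = torch.tensor([[101, 2054, 2003, 1996, 102]])  # hypothetical ids [B, T]
decoder_in = tokens[:, :-1]   # teacher forcing: decoder sees tokens shifted right
target = tokens[:, 1:]        # labels are the next tokens at each position

x = emb(decoder_in) * math.sqrt(d_model)      # embeddings scaled by sqrt(768)

# Square subsequent (causal) mask: -inf above the diagonal blocks future tokens
T = decoder_in.size(1)
causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
print(causal.shape)  # torch.Size([4, 4])
```

`x` and `causal` would then feed the decoder sketch in the previous section, with `target` used for the cross-entropy loss.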

Gradient clipping with maximum norm 1.0 stabilizes training through the deep architecture [19]. Model checkpoints are saved every 5 epochs, with the final model saved after epoch 15. A critical design choice is freezing the M3D-CLIP encoder weights, keeping it in evaluation mode throughout training [20]. Only the keyword-guided attention module (six cross-attention layers with 8 heads each), transformer decoder (four layers with feedforward dimension 2048), and output projection layer are optimized. The model is trained on an NVIDIA RTX 3050 GPU with 8GB VRAM and 32GB system RAM.
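The paragraphs above can be condensed into a minimal training-step sketch using toy stand-in modules (the real model pairs the frozen M3D-CLIP encoder with the keyword-guided attention and decoder; the padding id of 0 is an assumption):

```python
import torch
import torch.nn as nn

encoder = nn.Linear(32, 768)        # toy stand-in for frozen M3D-CLIP
decoder_head = nn.Linear(768, 100)  # toy stand-in for decoder + projection

# Critical design choice: encoder stays frozen and in eval mode
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

opt = torch.optim.AdamW(decoder_head.parameters(), lr=5e-5, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # ignore padding (id 0 assumed)

x = torch.randn(1, 32)              # batch size 1, as in the paper
labels = torch.tensor([3])
logits = decoder_head(encoder(x))
loss = loss_fn(logits, labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(decoder_head.parameters(), max_norm=1.0)
opt.step()
print(loss.item() > 0)  # True
```

Only the decoder-side parameters receive gradients; the frozen encoder accumulates none.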

Output generation and evaluation

During inference, the model generates radiological captions one token at a time. The process begins with a start token and continues by predicting the next word based on the words generated previously. Generation stops when the model produces an end token or reaches the maximum length of 256 tokens. The generated captions are evaluated using two metrics. The first metric is BLEU, which measures the precision of n-grams from the unigram level (BLEU-1) to the four-gram level (BLEU-4). BLEU-1 measures the accuracy of individual words, while BLEU-2 to BLEU-4 assess the match of longer word sequences. A higher BLEU score indicates better alignment with the reference. For medical reports, BLEU-1 above 0.4 and BLEU-2 above 0.3 are considered satisfactory given the variation in medical terminology expressions [21].

The second metric is ROUGE-L, which evaluates the longest common subsequence between the generated caption and the reference. ROUGE-L is more tolerant of word choice variations and focuses on the overall structural similarity of sentences [22]. Both metrics have a range of 0 to 1, with scores above 0.5 indicating excellent quality [23]. The evaluation was conducted by comparing each caption to the ground-truth radiological descriptions from TextBraTS, then averaging the scores across all test samples as a performance benchmark.
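Both metrics can be computed from first principles. The sketch below implements modified unigram precision (BLEU-1) and LCS-based ROUGE-L F1, omitting the brevity penalty and smoothing that full library implementations add:

```python
from collections import Counter

def bleu1(candidate, reference):
    """Modified unigram precision: clip candidate counts by reference counts."""
    c, r = candidate.split(), reference.split()
    ref_counts = Counter(r)
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(c).items())
    return clipped / len(c)

def rouge_l(candidate, reference):
    """F1 over the longest common subsequence of candidate and reference."""
    c, r = candidate.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c):
        for j, rw in enumerate(r):
            dp[i+1][j+1] = dp[i][j] + 1 if cw == rw else max(dp[i][j+1], dp[i+1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

ref = "lesion in the right frontal lobe"
gen = "lesion in the right parietal lobe"
print(round(bleu1(gen, ref), 3), round(rouge_l(gen, ref), 3))  # 0.833 0.833
```

Note how a single-word substitution ("parietal" for "frontal") lowers both scores equally here; longer n-gram BLEU variants would penalize it more heavily.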

Method validation

Training performance

The model was trained for 15 epochs on the BraTS2020-TextBraTS dataset with a total of 295 training samples. As shown in Fig. 3 (Training Loss Curve), the loss curve shows consistent convergence during the training process, with the average loss gradually decreasing from 4.16 in the first epoch to 1.33 in the final epoch. A significant decrease in loss occurred in the early epochs, where the loss dropped from 4.1556 (epoch 1) to 2.0646 (epoch 4), indicating rapid learning by the model in understanding the basic patterns between visual features and text captions. After epoch 4, the loss continued to decrease at a slower rate, reaching a value below 2.0 in epoch 5 (1.9365) and stabilizing around 1.5–1.8 for epochs 6–11. The final epoch shows further convergence with the loss reaching 1.3290, indicating that the model has learned to effectively represent the multimodal relationship between MRI images, clinical keywords, and radiological captions. All training was performed on an NVIDIA RTX 3050 GPU, with an average processing speed of 3.5–3.8 iterations per second.

Fig. 3.


Visualization of training loss curve.

Quantitative evaluation

A quantitative evaluation was conducted on 20 test samples using the BLEU and ROUGE-L metrics, with the summary results shown in Table 1. The model achieved a BLEU-1 score of 0.5359, which falls into the excellent performance category (>0.5) and far exceeds the satisfactory threshold for medical report generation (>0.4). This score indicates that the model has excellent word-level accuracy in generating captions that match the ground-truth reference. For higher n-grams, the model achieved a BLEU-2 score of 0.3969, a BLEU-3 score of 0.3009, and a BLEU-4 score of 0.2328. The BLEU-2 score almost reaches the good performance threshold (0.4), indicating that the model is capable of generating clinically accurate and relevant two-word sequences. The decrease in scores for BLEU-3 and BLEU-4 is common in text generation tasks, as these metrics are more sensitive to the exact matching of long sequences, and in the medical domain there is natural variation in how the same concepts are expressed. For the ROUGE metric, the model achieved a ROUGE-1 score of 0.6466, a ROUGE-2 score of 0.3634, and a ROUGE-L score of 0.5051. The ROUGE-L score of 0.5051 indicates strong performance (>0.5) in maintaining sentence structure and information flow consistent with the reference caption. This indicates that although there may be variations in specific word choices, the model is able to capture and reproduce the longest common subsequence that represents the correct organization of clinical information. The high ROUGE-1 (0.6466) further confirms that the model has excellent recall in identifying and including relevant keywords from the medical vocabulary.

Table 1.

Metric evaluation.

Metric BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-1 ROUGE-2 ROUGE-L
Score 0.5359 0.3969 0.3009 0.2328 0.6466 0.3634 0.5051

Qualitative analysis

Table 2 compares the ground truth and generated captions for Sample #1. The text marked in green indicates medical terminology and clinical information that was accurately identified by the model. Successfully identified elements include anatomical locations, namely the right frontal and parietal lobes; lesion signal characteristics, namely high and low signals and speckled high signals; edema distribution, namely the right parietal lobe and tissue swelling; necrosis findings, namely the right parietal and occipital lobes with low-signal intensity; and structural complications, namely ventricular compression. Although there are slight differences in specific phrasing, such as "mixed pattern" versus "mixture of heterogeneous", the generated caption preserves the essential clinical content.

Table 2.

Comparison between ground truth and generated captioning.

Name Ground truth Generated Captioning
Sample #1 The lesion area is in the right frontal and parietal lobes with a mixed pattern of high and low signals with speckled high signal regions. Edema is mainly observed in the right parietal lobe, partially extending to the frontal lobe, presenting as high signal, indicating significant tissue swelling around the lesion. Necrosis is within the lesions of the right parietal and frontal lobes, appearing as mixed, with alternating high and low signal regions. Ventricular compression is seen in the lateral ventricles with significant compressive effects on the brain tissue and ventricles. the lesion area is in the right frontal and parietal lobes with a mixture of heterogeneous high and low signals, with speckled high signal areas. edema is significant, mainly observed in the right frontal and parietal lobes, with a large extent of swelling of the surrounding tissues. necrosis is observed in the right parietal and occipital lobes, characterized by low - signal intensity and mixed signals, concentrated in the right frontal and scattered low - signal intensity, displaying mixed signals. ventricular compression is observed, with the right ventricle

Overall, the evaluation results show that the proposed keyword-guided cross-attention architecture is effective in integrating visual information from 3D MRI with semantic guidance from hierarchical clinical keywords. The model demonstrates strong performance on lexical overlap metrics (BLEU-1 and ROUGE-L). The quality of the generated captions consistently captures key clinical information such as anatomical location, lesion characteristics, and edema distribution, which are important elements in brain tumor radiology reports.

Limitations

This study has several methodological limitations. The model relies solely on the FLAIR modality, excluding T1ce and T2 sequences that reveal contrast-enhancing tumor core and peritumoral edema boundaries, limiting the clinical completeness of generated captions.

Ethics statements

This study utilized publicly available datasets that are freely accessible to the research community. The 3D FLAIR MRI volumes were obtained from the Brain Tumor Segmentation (BraTS) 2020 Challenge dataset [24], and the corresponding radiological descriptions were sourced from the TextBraTS dataset [11]. Both datasets are distributed under open-access licenses for research purposes.

CRediT author statement

Supriyanto Supriyanto: Software, Investigation, Data curation, Writing - Original draft preparation, Visualization. Muhammad Ibadurrahman Arrasyid Supriyanto: Conceptualization, Methodology, Supervision. Haviluddin Haviluddin: Methodology, Resources, Supervision. Chalsi Mala Sari: Supervision, Writing - Review & Editing. Hajar Mar'atussholikah: Investigation, Resources. Rayner Alfred: Conceptualization, Supervision. All authors have read and agreed to the published version of the manuscript.

Supplementary material and/or additional information

No.

Declaration of competing interest

None.

Acknowledgments

No Funding.

Footnotes

Related research article

None

Data availability

Data will be made available on request.

References

  • 1.Ilic I., Ilic M. International patterns and trends in the brain cancer incidence and mortality: an observational study based on the global burden of disease. Heliyon. Jul. 2023;9(7) doi: 10.1016/j.heliyon.2023.e18222. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Beddiar D.R., Oussalah M., Seppänen T. Automatic captioning for medical imaging (MIC): a rapid review of literature. Artif. Intell. Rev. May 2023;56(5):4019–4076. doi: 10.1007/s10462-022-10270-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Park H., Kim K., Park S., Choi J. Medical image captioning model to convey more details: methodological comparison of feature difference generation. IEEE Access. 2021;9:150560–150568. doi: 10.1109/ACCESS.2021.3124564. [DOI] [Google Scholar]
  • 4.Xu K., et al. Show, attend and tell: neural image caption generation with visual attention. Proceedings of the 32nd International Conference on Machine Learning, PMLR; Jun. 2015; pp. 2048–2057. [Online]. Available: https://proceedings.mlr.press/v37/xuc15.html. Accessed: Jan. 24, 2026. [Google Scholar]
  • 5.Xue Y., et al. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. Frangi A.F., Schnabel J.A., Davatzikos C., Alberola-López C., Fichtinger G., editors. Springer International Publishing; Cham: 2018. Multimodal recurrent model with attention for automated radiology report generation; pp. 457–466. [DOI] [Google Scholar]
  • 6.Klein A., Tourville J. 101 Labeled brain images and a consistent Human cortical labeling protocol. Front. Neurosci. Dec. 2012;6 doi: 10.3389/fnins.2012.00171. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Ellingson B.M., Wen P.Y., Cloughesy T.F. Modified criteria for radiographic response assessment in glioblastoma clinical trials. Neurotherapeutics. Apr. 2017;14(2):307–320. doi: 10.1007/s13311-016-0507-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Chen Z., Song Y., Chang T.H., Wan X. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) Webber B., Cohn T., He Y., Liu Y., editors. Online: Association for Computational Linguistics; Nov. 2020. Generating radiology reports via memory-driven transformer; pp. 1439–1449. [DOI] [Google Scholar]
  • 9.Mayzura W.S., Sarno R., Suroto N.S., Supriyanto M.I.A., Sihaj G. Automatic interpretation of brain medical images using hierarchical classification and image captioning model. IEEE Access. 2025;13:84675–84688. doi: 10.1109/ACCESS.2025.3560701. [DOI] [Google Scholar]
  • 10.Bakas S., et al. Advancing the Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data. Sep. 2017;4(1) doi: 10.1038/sdata.2017.117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Shi X., et al. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2025. Gee J.C., Alexander D.C., Hong J., Iglesias J.E., Sudre C.H., Venkataraman A., Golland P., Kim J.H., Park J., editors. Springer Nature Switzerland; Cham: 2026. TextBraTS: text-guided volumetric brain tumor segmentation with innovative dataset development and fusion module exploration; pp. 638–648. [DOI] [Google Scholar]
  • 12.Vasuki M., Arun Gangadharan M., Daniel J.T., Sadashiv A., Venugopal V., Vekkot S. 2024 2nd World Conference on Communication & Computing (WCONF) Jul. 2024. Multi-modal automatic video segmentation with sentence transformer embeddings and KeyBERT-based subtopic extraction; pp. 1–6. [DOI] [Google Scholar]
  • 13.Grootendorst M. KeyBERT: minimal keyword extraction with BERT. Zenodo. 2020 doi: 10.5281/zenodo.4461265. [DOI] [Google Scholar]
  • 14.Bai F., Du Y., Huang T., Meng M.Q.-H., Zhao B. M3D: advancing 3D medical image analysis with multi-modal large language models. Oct. 2024. [Online]. Available: https://openreview.net/forum?id=XQL4Pmf6m6. Accessed: Jan. 25, 2026.
  • 15.Lee J., et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. Feb. 2020;36(4):1234–1240. doi: 10.1093/bioinformatics/btz682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Vaswani A., et al. Attention is all you need. Advances in Neural Information Processing Systems, Curran Associates, Inc.; 2017. [Online]. Available: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html. Accessed: Jan. 25, 2026. [Google Scholar]
  • 17.Loshchilov I., Hutter F. 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. 2019. Decoupled weight decay regularization.https://openreview.net/forum?id=Bkg6RiCqY7 [Online]. Available: [Google Scholar]
  • 18.Williams R.J., Zipser D. A learning algorithm for continually running fully recurrent neural networks. Neural. Comput. Jun. 1989;1(2):270–280. doi: 10.1162/neco.1989.1.2.270. [DOI] [Google Scholar]
  • 19.Pascanu R., Mikolov T., Bengio Y. On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, in ICML’13; Atlanta, GA, USA; JMLR.org; Jun. 2013. III-1310-III–1318. [Google Scholar]
  • 20.Howard J., Ruder S. Universal language model fine-tuning for text classification. In: Gurevych I., Miyao Y., editors. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Melbourne, Australia; Association for Computational Linguistics; Jul. 2018. pp. 328–339. [DOI] [Google Scholar]
  • 21.Papineni K., Roukos S., Ward T., Zhu W.J. BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, in ACL ’02; USA; Association for Computational Linguistics; Jul. 2002. pp. 311–318. [DOI] [Google Scholar]
  • 22.Lin C.Y., Hovy E. Automatic evaluation of summaries using N-gram co-occurrence statistics. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, in NAACL ’03; USA; Association for Computational Linguistics; May 2003. pp. 71–78. [DOI] [Google Scholar]
  • 23.Pang T., Li P., Zhao L. A survey on automatic generation of medical imaging reports based on deep learning. Biomed. Eng. Online. May 2023;22(1):48. doi: 10.1186/s12938-023-01113-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Menze B.H., et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging. Oct. 2015;34(10):1993–2024. doi: 10.1109/TMI.2014.2377694. [DOI] [PMC free article] [PubMed] [Google Scholar]
