BJR Artificial Intelligence. 2025 Jul 17;2(1):ubaf011. doi: 10.1093/bjrai/ubaf011

M3: multimodal artificial intelligence for medical report generation and visual question answering from 3D abdominal CT scans

Abdullah Hosseini 1, Ahmed Ibrahim 2, Ahmed Serag 3
PMCID: PMC13045705  PMID: 42063998

Abstract

Objectives

Medical imaging is indispensable for diagnosis, with abdominal imaging playing a pivotal role in generating medical reports and informing clinical decision-making. Recent works in artificial intelligence (AI), particularly in multimodal approaches such as vision-language models, have demonstrated significant potential to enhance medical image analysis by seamlessly integrating visual and textual data. While 2D imaging has been the main focus of many studies, the enhanced spatial detail and volumetric consistency offered by 3D images, such as CT scans, remain relatively underexplored. This gap underscores the need for innovative approaches to unlock the potential of 3D imaging in clinical workflows.

Methods

In this study, we utilized a multimodal AI pipeline, Phi3-V, to address 2 key challenges in abdominal imaging: generating clinically coherent medical reports from 3D CT images and performing visual question answering based on these images.

Results

Our optimized model attained an average GREEN score of 0.409 for medical report generation and an accuracy of 79% for multiple-choice visual question answering on the validation cases.

Conclusions

These findings demonstrate the potential of multimodal AI in advancing the analysis of 3D medical imaging, paving the way for more robust and efficient applications in healthcare.

Advances in knowledge

This study advances the use of multimodal AI for 3D CT imaging, achieving improvements in medical report generation and visual question answering.

Keywords: artificial intelligence, multimodal analysis, vision-language models, medical report generation, 3D images

Introduction

In recent years, artificial intelligence (AI) has driven marked improvements in medical image analysis, increasing both the speed and accuracy of interpretation.1,2 These advancements have already transformed many aspects of diagnostic workflows, reducing human error and improving efficiency.3–5 By bridging the gap between imaging and textual data, a multimodal approach offers the potential for deeper insights into patient health, paving the way for more precise and comprehensive clinical decision-making. Furthermore, it can support the development of robust AI systems capable of addressing challenges in personalized medicine and advanced diagnostics.

Recent progress in vision-language models (VLMs), particularly their application to 2D medical imaging,6,7 demonstrates that overcoming longstanding challenges is no longer a pipe dream. Early approaches to medical report generation (MRG) often relied on CNN-RNN hybrid architectures. More recent transformer-based models, such as METransformer8 and KiUT,9 have introduced multi-expert token mechanisms and incorporated knowledge graphs to improve diagnostic alignment and reduce hallucinations.

With the advent of large language models (LLMs), newer methods like R2GenGPT10 and ClinicalBLIP11 have integrated vision-language alignment modules to map visual features from X-rays into LLM embedding spaces. VLMs such as XrayGPT12 and CXR-LLAVA13 have further narrowed the modality gap by fine-tuning pretrained architectures on clinical datasets, leveraging components like Q-Former to align image tokens with textual prompts and domain-specific medical knowledge.

Extending vision-language capabilities to 3D imaging holds the potential to revolutionize how medical data is processed and interpreted. However, analysing 3D data introduces critical challenges. Unlike 2D images, 3D CT scans require models that can extract spatial features and maintain consistency across volumetric slices. Moreover, the computational demands of processing 3D data significantly increase model complexity and training requirements.14

Recent advances in 3D MRG have focused on developing architectures that enhance both diagnostic accuracy and clinical relevance. CT2Rep15 introduced a 3D encoder-to-decoder transformer framework for chest CT scans, enabling direct report generation from volumetric inputs. Building on this, MvKeTR16 improved performance through a multi-view encoder that processes axial, coronal, and sagittal planes, combined with a cross-modal knowledge retrieval mechanism to enrich semantic alignment. M3D,17 in conjunction with LLaMA,18 leveraged a CLIP-inspired pretrained 3D encoder with spatial pooling and segmentation capabilities to support a comprehensive understanding of 3D medical images. Extending this line of work, 3D-CT-GPT19 refined the CT encoder by incorporating CT-ViT, 3D pooling, and projection layers, leading to further performance improvements.

In this study, we propose a framework that combines a vision encoder with Phi3-V,20 a compact language model, to address the challenges of 3D CT scan analysis. By integrating the spatial reasoning capabilities of VLMs with the contextual understanding of language models, our method aims to enhance tasks such as MRG and visual question answering (VQA). This approach offers a precise and efficient solution for multimodal AI in abdominal imaging and lays the foundation for future advancements in the automated interpretation of complex medical datasets. The code is available at https://github.com/serag-ai/M3.

Methods

Dataset

Overview

The AMOS-MM dataset21 is a comprehensive resource designed to facilitate multimodal analysis of abdominal imaging, supporting tasks such as MRG and VQA. The dataset includes medical reports, with a focus on the Findings section across 3 anatomical regions: chest, abdomen, and pelvis. Each report is provided in JSON format, with distinct entries for each region. If a region is not addressed in a specific case, it is represented by an empty string.

The VQA component assesses algorithms on their ability to combine visual and clinical knowledge. Tasks are categorized into 2 groups: Perception (eg, anatomical identification, attribute recognition, and quantitative assessment) and Reasoning (eg, disease analysis, risk evaluation, and therapeutic suggestion). Each VQA task is presented as a multiple-choice question with a single correct answer, also formatted in JSON format.

Dataset preparation and refinement

In analysing the AMOS-MM dataset, a significant imbalance was observed in the distribution of MRG samples, particularly across anatomical regions and report types. Of the 1288 total examples, the dataset includes 1285 abdominal organ reports, 1117 pelvic organ reports, and only 378 chest organ reports. This imbalance negatively affects model performance for chest organ report generation.

Additionally, several samples contain notably shorter captions, often limited to a single sentence, which introduces inconsistency in training and results in models generating shorter outputs during evaluation. To address these challenges, we created 3 refined versions of the dataset, each tailored for fine-tuning MRG models:

Version 1

To address the issue of dataset imbalance in MRG, we added approximately 2000 samples from the CtRate dataset,15,22 a dataset specifically designed for chest organs. Each added sample contains at least 2 positive abnormalities from the following categories: “Pericardial effusion,” “Lung opacity,” “Pleural effusion,” “Bronchiectasis,” and “Interlobular septal thickening.”

Version 2

To tackle the problem of generating short captions, we utilized only the AMOS-MM dataset but removed captions that were shorter (typically 1 sentence) from the training set. This approach aimed to ensure that the model consistently predicts longer sentences.

Version 3

This version combines the approaches of version 1 and version 2 to address both issues. Specifically, we removed short sentences from the AMOS-MM dataset and added 2000 samples with specific abnormalities to the dataset from the CtRate dataset.

For VQA, the dataset comprises 13 751 examples, with the distribution of correct answers being as follows: 1249 for option A, 5227 for option B, 4729 for option C, and 2546 for option D. To address the issue of data imbalance, we restructured the dataset by equally distributing the questions across each answer choice, ensuring that the total number of correct responses for each option is almost equal.
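The paper does not describe its exact rebalancing procedure; one straightforward way to equalize the distribution is to permute each item's options so that correct answers cycle evenly through the choice labels. A minimal sketch, assuming each VQA item is a dict with hypothetical `question`, `options` (label to text), and `answer` fields:

```python
import random

def rebalance_answers(questions, labels=("A", "B", "C", "D"), seed=0):
    """Shuffle each question's options so the correct answers end up
    roughly evenly distributed across the four choice labels."""
    rng = random.Random(seed)
    balanced = []
    for i, q in enumerate(questions):
        correct_text = q["options"][q["answer"]]   # text of the right option
        target = labels[i % len(labels)]           # cycle the correct label A->B->C->D
        others = [t for k, t in q["options"].items() if k != q["answer"]]
        rng.shuffle(others)
        options = {}
        for lab in labels:
            options[lab] = correct_text if lab == target else others.pop()
        balanced.append({"question": q["question"],
                         "options": options,
                         "answer": target})
    return balanced
```

Cycling the target label deterministically guarantees near-equal counts per option without discarding any questions, which matches the stated goal of keeping the totals "almost equal".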

Vision-language model

Model architecture

In this study, we fine-tuned the Phi3-V20 model, which functions as a generalist VLM for 3D medical image analysis. Figure 1 provides an overview of the architecture. The model consists of 3 primary components: (1) language, (2) projector, and (3) vision.

Figure 1.

Figure 1.

Architecture of the M3 model, illustrating the core modules and data pathways employed for MRG and VQA. The model effectively integrates vision and language streams to facilitate specialized task execution with high efficiency. Abbreviations: LLM = large language model; VLM = vision-language model; MRG = medical report generation; VQA = visual question answering.

For the vision component, we used the pretrained M3D model17 as the foundational image encoder, which was fine-tuned in subsequent stages. The pretrained vision encoder was initially trained using the CLIP approach,23 where both 3D CT scans and text were jointly trained. Additionally, to effectively integrate the pretrained CLIP-based image encoder with the language model, we employed a pretrained 3D spatial pooling mechanism from the M3D architecture.
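The CLIP objective used in this pretraining pushes matched scan/report embedding pairs to score higher than mismatched ones via a symmetric cross-entropy over cosine similarities. As a rough NumPy illustration (not the authors' training code):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.
    img_emb, txt_emb: (N, D) arrays; row i of each is a matched pair."""
    # L2-normalize so the dot product equals cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(img))              # matched pair sits on the diagonal

    def xent(l):
        # row-wise softmax cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # image-to-text and text-to-image directions are averaged
    return (xent(logits) + xent(logits.T)) / 2
```

In practice this loss is computed over large batches with learned temperature; the sketch only shows the mechanics of the symmetric alignment objective.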

In our approach to language modelling tasks, we utilized 2 different base models: the Phi3 pretrained model and the M3D model. For MRG, we selected the Phi3 pretrained model as the initial weights, while for VQA, we based our work on the M3D model. This selection allowed us to tailor the model choice to the specific requirements of each task.

Temperature adjustment in inference

As part of our ablation study, we varied the temperature parameter of the language component in our VLM to examine its influence on the MRG task. This investigation was motivated by preliminary observations suggesting that different temperature settings noticeably affect the quality and consistency of the generated reports.

Temperature regulates how a language model weights the probabilities of different possible next tokens, influencing the randomness or determinism of its outputs. To modulate the model’s creativity in report generation, the temperature parameter was varied between 0.1 and 0.9, while maintaining a constant top-P value. In contrast, for VQA, we utilized the default prediction settings with the temperature fixed at 0.9.
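Mechanically, temperature scaling divides the logits by the temperature before the softmax, so values below 1 sharpen the distribution toward the most likely token and values above 1 flatten it. A minimal sketch (hypothetical function name, not tied to any particular inference library):

```python
import numpy as np

def sample_next_token(logits, temperature=0.2, seed=None):
    """Sample a next-token index after temperature-scaling the logits.
    Low temperature sharpens the distribution toward the argmax;
    high temperature flattens it toward uniform."""
    rng = np.random.default_rng(seed)
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    scaled -= scaled.max()                    # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))
```

At very low temperatures the sampler behaves almost greedily, which is consistent with the more deterministic, conservative reports observed at temperature 0.1-0.2.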

Prompt formatting and optimization

For VQA, variations in prompt structure can affect the model’s ability to answer questions correctly. To investigate this effect on multiple-choice accuracy, we modified the prompt format for the VQA task as part of our ablation study.

To identify the optimal prompt format for the VQA task, we trained the model using the original AMOS-MM dataset and compared its accuracy across different prompt structures. The prompt formats we tested are as follows:

  1. {QUESTION}. Correct choice is {ANSWER}.

  2. {QUESTION}. {ANSWER}.

  3. {QUESTION}. {ANSWER})

In this setup, {QUESTION} represents the input medical question, while {ANSWER} is replaced by the correct choice label (A, B, C, or D) from the dataset. However, the model is expected to generate a full explanatory phrase rather than just the letter choice. See Figure 2 for more details.
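The three templates above reduce to simple format strings; this sketch uses a hypothetical helper name to show how the training targets were rendered:

```python
def build_prompt(question, answer_letter, fmt=1):
    """Render a VQA training target in one of the three tested formats.
    `answer_letter` is the correct choice label (A-D)."""
    templates = {
        1: "{q}. Correct choice is {a}.",
        2: "{q}. {a}.",
        3: "{q}. {a})",
    }
    return templates[fmt].format(q=question, a=answer_letter)
```

The fourth variant discussed below (replacing "A" with "X") would simply substitute the letter before formatting.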

Figure 2.

Figure 2.

Examples of fine-tuning prompts for Phi3-V on the medical VQA task. All prompts share identical question formats but vary in the answer structure to guide the model’s response style for multiple-choice questions. Abbreviation: VQA = visual question answering.

Additionally, to determine whether the model confuses the answer choice “A” with the article letter “a,” we introduced a fourth prompt format in which the answer “A” was replaced with “X”. This modification allowed us to evaluate whether such an adjustment improved the model’s performance.

Evaluation metrics

For MRG, we strictly adhered to the challenge’s designated evaluation metrics and utilized the GREEN score (Generative Radiology Report Evaluation and Error Notation).24 This recently developed metric is specifically designed for assessing radiology report generation, focusing on the factual correctness and clinical relevance of generated reports. Unlike traditional metrics, the GREEN score leverages a language model to provide a more robust and context-aware evaluation.

It should be noted that the current LLM-based GREEN metric struggles with lengthy reports, as it was trained on shorter ones. To minimize evaluation errors, we calculated the GREEN score separately for each of the 3 organs (pelvis, chest, abdomen), provided the corresponding ground truth was available. The final score for each organ is the average of these section-wise scores. In VQA, accuracy is used as the metric to evaluate the proportion of correctly answered questions on a question-wise basis.
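The section-wise averaging described above can be sketched as follows, skipping organs whose ground-truth section is missing (the data layout is hypothetical):

```python
def average_green(per_organ_scores):
    """Average section-wise GREEN scores per organ, skipping cases whose
    ground-truth section is missing (represented as None), then average
    the organ means into one overall score."""
    organ_means = {}
    for organ, scores in per_organ_scores.items():
        valid = [s for s in scores if s is not None]
        if valid:
            organ_means[organ] = sum(valid) / len(valid)
    overall = sum(organ_means.values()) / len(organ_means)
    return organ_means, overall
```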

Statistical analysis

To test differences between the results of both models in downstream tasks, t-tests were used for normally distributed data, while the non-parametric Mann-Whitney U test was applied to non-normal distributions (normality was assessed using the Shapiro-Wilk test). P-values < .05 were considered significant after controlling the false discovery rate.
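This test-selection and false-discovery-rate procedure can be sketched with SciPy; the score samples used below are illustrative, not the paper's data:

```python
from scipy import stats

def compare_models(scores_a, scores_b, alpha=0.05):
    """Choose the comparison test by Shapiro-Wilk normality of both
    samples, then return the test name and its P-value."""
    normal = (stats.shapiro(scores_a).pvalue > alpha
              and stats.shapiro(scores_b).pvalue > alpha)
    if normal:
        return "t-test", stats.ttest_ind(scores_a, scores_b).pvalue
    return "mann-whitney", stats.mannwhitneyu(scores_a, scores_b).pvalue

def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg FDR control: return a reject flag per P-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            max_k = rank          # largest rank passing its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject
```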

Results

Table 1 summarizes the results for MRG, highlighting the model’s performance across different dataset versions using the GREEN score metric. The comparison includes evaluations on the original AMOS-MM dataset and the refined versions 1, 2, and 3. Experiments were conducted using learning rates of 7e−4, 5e−5, and 7e−6, with training performed over 5 epochs. For clarity, we report only the best results for each combination of dataset version and learning rate.

Table 1.

Results for the report generation task, comparing performance across different datasets.

Model   | Dataset   | Epoch | LR   | Pelvis | Chest | Abdomen | Average
Vit3D25 | AMOS-MM   | —     | —    | 0.37   | 0.20  | 0.32    | 0.30
Ours    | AMOS-MM   | 5     | 7e−4 | 0.476  | 0.227 | 0.422   | 0.375
Ours    | AMOS-MM   | 5     | 5e−5 | 0.362  | 0.193 | 0.346   | 0.301
Ours    | Version 1 | 5     | 7e−4 | 0.473  | 0.245 | 0.443   | 0.387
Ours    | Version 1 | 5     | 5e−5 | 0.459  | 0.261 | 0.425   | 0.382
Ours    | Version 1 | 3     | 7e−6 | 0.516  | 0.225 | 0.456   | 0.399
Ours    | Version 2 | 5     | 7e−4 | 0.475  | 0.208 | 0.391   | 0.358
Ours    | Version 2 | 5     | 5e−5 | 0.308  | 0.151 | 0.249   | 0.236
Ours    | Version 2 | 3     | 7e−6 | 0.455  | 0.231 | 0.392   | 0.360
Ours    | Version 2 | 5     | 7e−4 | 0.465  | 0.183 | 0.390   | 0.346
Ours    | Version 3 | 3     | 5e−5 | 0.466  | 0.282 | 0.410   | 0.386
Ours    | Version 3 | 5     | 7e−6 | 0.511  | 0.276 | 0.439   | 0.409

For the sake of conciseness, we do not present the detailed results for all configurations and epochs. Instead, we report only the best-performing model, which achieved the highest average GREEN scores across various dataset versions and learning rates. The highest GREEN score is highlighted in bold, and the second highest is underlined.

Among the different versions of MRG datasets, version 3 achieved the highest average GREEN score (0.409), with notable improvements in performance for both the pelvis and chest regions. Version 1 also showed an increase in average performance (0.399) compared to the original dataset (0.375), attributed to the addition of samples focused on chest abnormalities. In contrast, Version 2, which removed shorter captions, resulted in a lower average score (0.360).

A paired t-test comparing the top-performing model (version 3) with other variants revealed no statistically significant difference between version 3 and either the original model (P = .068) or version 1 (P = .68). However, statistically significant differences were observed between version 3 and both version 2 and Vit3D25 (P < .05).

Figure 3 shows the effect of the temperature value on the model’s overall performance score across the pelvis, chest, and abdomen regions. Figure 4 illustrates the relationship between the total number of correct findings, clinically significant errors in the generated reports, the GREEN score, and the predicted token length. Additionally, Figure 4C explores the correlation between the predicted report length and the ground-truth token length with the achieved GREEN score.

Figure 3.

Figure 3.

The relationship between temperature and the measurements for the pelvis, chest, and abdomen. Notably, the measurement at temperature 0.2 shows the highest average accuracy of 0.390, indicating better overall performance compared to other temperatures.

Figure 4.

Figure 4.

(A) Average GREEN score based on clinically significant errors and matched findings. (B) Average predicted token length associated with clinically significant errors and matched findings. (C) Illustration of the correlation between ground-truth and predicted report token length.

Concerning the model’s efficacy in addressing the VQA task, Figure 5 illustrates the model’s performance across both the imbalanced and balanced versions of the dataset. Although the balanced version demonstrates improved accuracy, particularly for underrepresented answer choices, a t-test comparing the F1 scores and accuracies between the imbalanced and balanced versions of the dataset revealed no statistically significant difference (P = .101).

Figure 5.

Figure 5.

Model trained on (A) an imbalanced dataset and (B) a balanced dataset. As evident in the image, the model correctly predicts most Class A and Class D examples.

Additionally, Figure 6 presents the model’s average accuracy across epochs when trained on the curated version of the dataset. The accuracy improves steadily across epochs, with the highest performance observed around Epoch 4, reaching an average accuracy of 78.97%. Notably, choices B and C exhibit the most consistent improvements, showcasing their dominance in the dataset distribution.

Figure 6.

Figure 6.

Performance evaluation of a visual question answering model over 5 epochs across different choices (A, B, C, D), along with the model’s average accuracy.

As part of the study, we investigated various prompt formulations for addressing the VQA task. The corresponding results, detailed in Table 2, were obtained using an imbalanced version of the dataset. This choice was deliberate, as the primary objective at this stage was to discern the optimal prompt structure for the task under consideration.

Table 2.

Findings from comparing VQA accuracy against other models and input formats, conducted on an imbalanced version of the dataset.

Model / format              | A    | B    | C    | D    | AVG
Vit3D model25               | —    | —    | —    | —    | 0.61
Original answer format (a)  | 0.56 | 0.84 | 0.84 | 0.78 | 0.76
X-format (b)                | 0.55 | 0.84 | 0.83 | 0.75 | 0.74
“Correct choice is ...” (c) | 0.59 | 0.85 | 0.83 | 0.76 | 0.76
“...” (d)                   | 0.58 | 0.84 | 0.83 | 0.73 | 0.75

(a) The original format refers to the format used by the M3D model, eg, "A. answer choice".

(b) X-format denotes the scenario in which "A" was replaced with the letter "X".

(c) Modified answer format where ... refers to letter A, B, C, or D.

(d) Modified answer format where only the correct choice letter is used.

Discussion

The integration of AI into medical imaging has revolutionized diagnostic and clinical workflows, but significant challenges remain, particularly when addressing tasks such as MRG and VQA from 3D data. These tasks require models capable of interpreting complex, multimodal data, extracting meaningful insights from volumetric medical data, and aligning them with clinically accurate text.

For MRG, we initially pretrained the 3D medical vision encoder using a CLIP-like approach on the AMOS-MM dataset. However, it became evident that the high degree of similarity in report content, particularly for anatomically similar organs, resulted in model confusion. This confusion likely stemmed from the model’s difficulty in distinguishing subtle variations in text descriptions of similar anatomical structures, reducing the overall accuracy of image-text alignment. Given the significant similarity between text-image pairs, the limited dataset size, and the computational demands, we opted to use the M3D CLIP model. With necessary modifications, we achieved an average GREEN score of 0.409, with consistent performance across the 3 target organs.

For hyperparameter tuning, we systematically explored a range of learning rates, epochs, dataset variants, and temperature settings. Specifically, we tested learning rates of 7e−4, 7e−6, 5e−5, and 5e−4, training the model for 5 epochs across 4 distinct dataset versions (Table 1). Notably, the inclusion of 2000 filtered data points led to an improvement of approximately 5% in chest organ results and enabled the pelvis organ to achieve a GREEN score exceeding 0.50. These results underscore the benefits of combining strategies, as implemented in version 3, to address both dataset imbalance and caption length variability.

In addition, we investigated the effect of temperature settings, ranging from 0.1 to 0.9, to modulate the creativity of the model’s predictions. Lower temperature settings (eg, 0.1) consistently led to higher performance scores (Figure 3), likely because reduced randomness encourages more deterministic and conservative predictions. This trend indicates that moderate temperature values optimize the balance between creativity and consistency, leading to improved performance, particularly in the abdomen and pelvis regions. The sharp decline at higher temperature values suggests increased randomness, which negatively impacts overall accuracy. We hypothesize that this outcome stems from the nature of the pretraining dataset, which includes a significant proportion of reports describing normal or healthy conditions. Lower temperatures appear to align the model’s predictions with these conservative patterns, enhancing medically coherent outputs.

Interestingly, Figure 4A highlights that as the number of matched findings increases, the GREEN score improves, while clinically significant errors decrease. Moreover, Figure 4B reveals that cases with zero significant errors often correspond to shorter predicted token lengths (below 50). While shorter predictions are typically more focused, they may sacrifice comprehensiveness. Longer predictions, on the other hand, increase the number of matched findings but also raise the risk of significant errors. This trade-off demonstrates the challenge of balancing prediction length with accuracy. Furthermore, Figure 4C highlights that when the predicted token length closely aligns with the ground-truth token length, the GREEN score tends to exceed 0.4, reinforcing the importance of length consistency for improved performance. The results indicate that a closer alignment between predicted and ground-truth lengths tends to yield higher GREEN scores, suggesting that token length consistency plays a significant role in report quality.

For VQA, balancing the dataset significantly improved the model’s accuracy. By redistributing answers from choices B and C to the underrepresented Classes A and D, we equalized the dataset distribution. This restructuring increased the model’s accuracy from 75% to nearly 79%. Figure 5 compares the results of the imbalanced and balanced datasets, showing how balancing mitigated biases. Prior to balancing, the model exhibited a recall of 0.57 for Class B and a precision of 0.6 for Class A, indicative of a bias towards predicting the majority class. Post-balancing, these issues were resolved.

In addition, while the GREEN score performs well as an evaluation metric, its application to lengthy medical reports presents opportunities for further improvement in real-world deployment. Given the detailed and comprehensive nature of full-length radiology reports, mitigating the GREEN score’s inherent bias toward shorter texts could improve its capacity to accurately assess clinical quality across longer reports.

On the other hand, the ground-truth reports, by their nature, contain a preponderance of sentences describing healthy conditions. As a result, models can achieve high GREEN scores by predominantly predicting healthy outcomes, which reflects the statistical bias in the dataset rather than true generalization performance.

This imbalance in dataset composition further complicates the interpretation of the GREEN score. The metric does not explicitly differentiate between true positives (TP) and true negatives (TN), introducing a critical limitation in its interpretability. For instance, a model that predominantly classifies cases as “healthy” in a dataset with a high proportion of healthy examples may achieve a high GREEN score (a very high number of “Matched Findings”) due to an abundance of TNs, even if its ability to detect abnormalities (TPs) is poor.

TPs correspond to correctly identified cases of disease or abnormality, directly reflecting the model’s diagnostic capability, while TNs represent accurate identification of healthy cases. While both are important, the lack of distinction between these 2 types of correct predictions makes the GREEN score insufficient for evaluating nuanced performance.

Additionally, due to the unstructured nature of the radiology reports in the AMOS-MM dataset, we were unable to conduct a detailed analysis of disease-specific distributions or assess the model’s performance on individual pathologies. This limitation highlights an opportunity for future work to address dataset balance and develop complementary metrics and pathology-level annotations that can more effectively assess model performance across diverse clinical scenarios.

It is also worth noting that, due to the computationally intensive nature of these tasks, certain avenues for improvement, such as exploring alternative image encoder architectures, increasing input resolution (eg, incorporating more slices), and experimenting with larger language models, remain unexplored. These directions present promising opportunities for future work.

Conclusion

In summary, this study highlights the promise of multimodal AI systems, exemplified by the Phi3-V pipeline, in addressing the complexities of 3D medical imaging. By integrating visual and textual data, our approach not only enhances the generation of clinically coherent medical reports but also improves performance in VQA tasks. These results underscore the untapped potential of 3D imaging in clinical workflows and lay a foundation for further advancements in automated medical image analysis to support accurate and efficient healthcare decision-making.

Acknowledgements

The authors thank the administrative and technical teams at WCM-Q for their support in this study.

Contributor Information

Abdullah Hosseini, AI Innovation Lab, Weill Cornell Medicine—Qatar, Doha, 24144, Qatar.

Ahmed Ibrahim, AI Innovation Lab, Weill Cornell Medicine—Qatar, Doha, 24144, Qatar.

Ahmed Serag, AI Innovation Lab, Weill Cornell Medicine—Qatar, Doha, 24144, Qatar.

Author contributions

Abdullah Hosseini (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing—original draft), Ahmed Ibrahim (Data curation, Validation, Visualization), and Ahmed Serag (Conceptualization, Investigation, Project administration, Supervision, Writing—review & editing)

Funding

This research received no external funding.

Conflicts of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • 1. Hosseini A, Serag A.  Self-supervised learning powered by synthetic data from diffusion models: application to x-ray images. IEEE Access. 2025;13:59074-59084. [Google Scholar]
  • 2. Liu C, Tian Y, Chen W, Song Y, Zhang Y. Bootstrapping large language models for radiology report generation. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada. Vol. 38. 2024:18635-18643.
  • 3. Ben Rabah C, Petropoulos IN, Malik RA, Serag A.  Vision transformers for automated detection of diabetic peripheral neuropathy in corneal confocal microscopy images. Front Imaging. 2025;4:1542128. [Google Scholar]
  • 4. Guo P, Zhao C, Yang D, et al. Maisi: medical AI for synthetic imaging. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025:4430–4441.
  • 5. Ibrahim A, Hosseini A, Ibrahim S, Sattar A, Serag A.  D3: a small language model for drug-drug interaction prediction and comparison with large language models. Machine Learning with Applications. 2025;20:100658. [Google Scholar]
  • 6. Chen Z, Shen Y, Song Y, Wan X. Cross-modal memory networks for radiology report generation. arXiv preprint arXiv: 2204.13258, 2022, preprint: not peer reviewed.
  • 7. Li Y, Wang Z, Liu Y, Wang L, Liu L, Zhou L. Kargen: knowledge-enhanced automated radiology report generation using large language models. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2024:382–392.
  • 8. Wang Z, Liu L, Wang L, Zhou L. Metransformer: radiology report generation by transformer with multiple learnable expert tokens. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada. 2023:11558–11567.
  • 9. Huang Z, Zhang X, Zhang S. Kiut: knowledge-injected u-transformer for radiology report generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada. 2023:19809–19818.
  • 10. Wang Z, Liu L, Wang L, Zhou L.  R2gengpt: radiology report generation with frozen llms. Meta-Radiology. 2023;1:100033. [Google Scholar]
  • 11. Ji J, Hou Y, Chen X, Pan Y, Xiang Y.  Vision-language model for generating textual descriptions from clinical images: model development and validation study. JMIR Form Res. 2024;8:e32690. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Thawakar OC, Shaker AM, Shaji Mullappilly S, et al. Xraygpt: chest radiographs summarization using large medical vision-language models. In: Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, Bangkok, Thailand. 2024:440–448.
  • 13. Lee S, Youn J, Kim H, Kim M, Yoon SH.  CXR-llava: A multimodal large language model for interpreting chest X-ray images. Eur Radiol. 2025;35:4374–4386. 10.1007/s00330-024-11339-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Li J, Zhu G, Hua C, et al.  A systematic collection of medical image datasets for deep learning. ACM Comput Surv. 2023;56:1-51. [Google Scholar]
  • 15. Hamamci IE, Er S, Menze B. Ct2rep: automated radiology report generation for 3d medical imaging. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2024:476–486.
  • 16. Deng X, He X, Bao J, et al. Mvketr: chest CT report generation with multi-view perception and knowledge enhancement. arXiv preprint arXiv: 2411.18309, 2024, preprint: not peer reviewed. [DOI] [PubMed]
  • 17. Bai F, Du Y, Huang T, Meng MQ-H, Zhao B. M3d: advancing 3d medical image analysis with multi-modal large language models. arXiv preprint arXiv: 2404.00578, 2024, preprint: not peer reviewed.
  • 18. Touvron H, Martin L, Stone K, et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv: 2307.09288, 2023, preprint: not peer reviewed.
  • 19. Chen H, Zhao W, Li Y, et al. 3D-CT-GPT: generating 3d radiology reports through integration of large vision-language models. arXiv preprint arXiv: 2409.19330, 2024, preprint: not peer reviewed.
  • 20. Abdin M, Aneja J, Awadalla H, et al. Phi-3 technical report: a highly capable language model locally on your phone. arXiv preprint arXiv: 2404.14219, 2024, preprint: not peer reviewed.
  • 21. Ji Y, Bai H, Ge C, et al.  Amos: a large-scale abdominal multi-organ benchmark for versatile medical image segmentation. Adv Neu Inf Process Syst. 2022;35:36722-36732. [Google Scholar]
  • 22. Hamamci IE, Er S, Almas F, et al. Developing generalist foundation models from a multimodal dataset for 3D computed tomography. arXiv preprint arXiv: 2403.17834, 2024, preprint: not peer reviewed.
  • 23. Radford A, Kim JW, Hallacy C, et al. Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. PMLR, 2021:8748–8763.
  • 24. Ostmeier S, Xu J, Chen Z, et al. Green: generative radiology report evaluation and error notation. arXiv preprint arXiv: 2405.03595, 2024, preprint: not peer reviewed.
  • 25. Li S, Xu B, Luo Y, Nie D, Zhang L. Vit3d alignment of llama3: 3d medical image report generation. arXiv preprint arXiv: 2410.08588, 2024, preprint: not peer reviewed.
