Dear Editor:
Digital transformation (DX) refers to the integration of digital technologies across all functional areas of an organization, fundamentally altering how data are managed and how decisions are made. DX is crucial for the use and development of artificial intelligence (AI), yet in many parts of the world, patient photographs are still stored on PCs without indexing, making retrieval difficult and wasting time on manual organization.
We have developed a medical photograph management system, in development since 2003 and open-sourced in 2007 (MedicalPhoto; https://github.com/whria78/medicalphoto). MedicalPhoto includes most diagnoses found in dermatology textbooks, including rare disorders such as Majocchi’s granuloma, along with the corresponding International Classification of Diseases, 10th Revision codes. Users can also define reserved terms for diagnoses or procedures, allowing for customizable annotations (Fig. 1). It stores photographs by renaming folders and files in the format ‘/2025/2025-01/ID_Name_Dx.jpg,’ allowing direct access to images and facilitating easy migration to other software. Because it uses a client-server architecture, it can be accessed from multiple computers. It uses SQLite for data storage and ASIO for socket communication, with the core implemented to the C++03 standard to ensure future compatibility across operating systems.
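As an illustrative sketch of the storage scheme above (in Python rather than the C++03 core, and with hypothetical field values), the path for a photograph taken in January 2025 could be derived as follows:

```python
import os
from datetime import date

def storage_path(root, taken, patient_id, name, dx, ext=".jpg"):
    """Build a MedicalPhoto-style path: root/YYYY/YYYY-MM/ID_Name_Dx.jpg."""
    year = "%04d" % taken.year
    month = "%s-%02d" % (year, taken.month)
    filename = "_".join([patient_id, name, dx]) + ext
    return os.path.join(root, year, month, filename)

print(storage_path("/photos", date(2025, 1, 15), "12345", "HongGildong", "Psoriasis"))
# → /photos/2025/2025-01/12345_HongGildong_Psoriasis.jpg
```

Because the diagnosis and patient identifiers are encoded directly in the folder and file names, the images remain browsable with any file manager even without the software.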
Fig. 1. MedicalPhoto screenshot. Photographs can be annotated using built-in dermatological diagnoses or user-defined keywords. All photographs are fake images created using artificial intelligence (generative adversarial network).
Using MedicalPhoto, diagnostic labels have been annotated for around 1.1 million photographs from multiple hospitals in Korea, including three university hospitals, over the past 20 years. This structured annotation of clinical photographs, in which diagnoses were assigned to all images rather than only those of interest, has contributed significantly to the development of skin disease classifiers1,2.
Recently, MedicalPhoto integrated a vision-language (VL) model (Qwen2-VL) through llama.cpp3,4 (https://github.com/whria78/llama-qwen-vl). The Qwen2-VL 72B (72 billion parameters) model runs on hardware that even a local clinic can provide (64 GB system memory; optional GTX 1050 Ti or higher GPU). Qwen2-VL is a large language model (LLM) that can see and understand images and extract meaningful information, enabling tasks such as generating descriptive summaries or recognizing text within images. To automatically extract patient names and IDs from images, our system first uses a convolutional neural network (an efficientnet-lite0 and mobilenet-v3 ensemble), another AI model, to distinguish indexed images from clinical images. Patient information is then extracted from the indexed images using Qwen2-VL. We employed a pre-quantized model from Hugging Face, a widely used open-source platform for sharing and collaborating on machine learning models and datasets (https://huggingface.co/second-state/Qwen2-VL-72B-Instruct-GGUF), without any additional fine-tuning. It can extract patient information even when irrelevant text is present (Fig. 2). We tested whether it could accurately extract the patient's name and ID from the indexed images.
Fig. 2. Organizing photographs using artificial intelligence.
The submitted photographs are processed using a CNN to identify the index photos. Qwen2-VL is used to extract the name and ID from the identified index photos, which are then saved in JSON format. MedicalPhoto then utilizes the stored JSON file to attach metadata to the corresponding images.
CNN: convolutional neural network.
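The pipeline in Fig. 2 can be outlined as follows. This is a hypothetical Python sketch, not the shipped tool: `classify_index_photo` stands in for the CNN ensemble (here a trivial filename heuristic for illustration), `extract_patient_info` stands in for the Qwen2-VL call via llama.cpp, and the JSON field names are illustrative assumptions.

```python
import json

def classify_index_photo(image_path):
    # Dummy stand-in for the efficientnet-lite0 + mobilenet-v3 ensemble;
    # the real model inspects pixels, here we only key off the filename.
    return "index" in image_path

def extract_patient_info(image_path):
    # Dummy stand-in for the Qwen2-VL call that reads the photographed
    # index card; the real call returns the recognized ID and name.
    return {"id": "01234567", "name": "HongGildong"}

def organize(image_paths, out_json):
    """Classify photos, extract ID/name from index photos, save as JSON."""
    records = []
    for path in image_paths:
        if classify_index_photo(path):
            info = extract_patient_info(path)
            records.append({"image": path, "id": info.get("id"), "name": info.get("name")})
    with open(out_json, "w") as f:
        json.dump(records, f, indent=2)
    return records
```

MedicalPhoto then reads the saved JSON file and attaches the extracted metadata to the corresponding images.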
The Qwen2-VL automated organization was tested at Sanggye Paik Hospital after photo management had been neglected for a year (Feb 2024–Mar 2025) due to a conflict between the government and the medical association, which led to the resignation of all trainee doctors in Korea5. Qwen2-VL recognized 80.4% (1,707/2,123) of the total cases as index images containing an ID and name. Among the recognized cases, it identified tagged IDs in 97.8% (1,670/1,707) and names in 91.2% (1,563/1,707) of the index images.
Herein, we introduce the first open-source software of its kind, used for the diagnostic annotation of over a million images and for automated ID tagging in scenarios with limited staff. Open-source LLMs such as LLaMA3 and Qwen4 signal the increasing adoption of on-premises AI deployment. The systematic organization of clinical photographs is a fundamental prerequisite for advancing DX in dermatology and for creating robust, large datasets. Beyond the text recognition in this study, the VL model can directly identify anatomical regions in clinical photographs, enabling broader applications such as statistical analysis of affected body regions.
It accurately recognized numeric patient IDs in 97.8% of cases, enabling automation of approximately 80% of the ID tagging process with just a few clicks. Although these images were not originally taken with AI recognition in mind, significantly higher accuracy can be expected if image acquisition is optimized for AI processing. All photographs are organized by date using their Exchangeable Image File Format (EXIF) metadata, and the system is designed to reject ID integration if the extracted text does not match the hospital’s specific ID format. Nevertheless, in cases of misclassification by the VL model, users can manually locate and correct images based on the capture date during clinical practice.
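The format check described above can be sketched as a simple regular-expression gate on the VL output. The eight-digit pattern below is a hypothetical hospital ID format chosen for illustration, not the actual scheme used at the hospital:

```python
import re

# Hypothetical hospital ID format: exactly eight digits.
ID_PATTERN = re.compile(r"^\d{8}$")

def accept_extracted_id(text):
    """Return the extracted ID only if it matches the hospital's ID format,
    otherwise reject the VL output (returns None)."""
    candidate = text.strip()
    return candidate if ID_PATTERN.match(candidate) else None

print(accept_extracted_id("01234567"))  # valid format, accepted
print(accept_extracted_id("Dr. Kim"))   # irrelevant text, rejected
```

A gate like this is what allows misrecognized text, such as a physician's name, to be dropped automatically rather than written into the patient metadata.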
The VL model still has certain limitations. It struggled particularly when multiple diagnoses or names were present, leading to omissions, mixed information, and factual errors (hallucinations). In particular, the model frequently failed to distinguish physician or nurse names from patient names, leading to a high error rate. These limitations highlight the importance of producing AI-friendly data as part of DX efforts. Apart from patient demographics, other information such as differential diagnoses can also be extracted using our system. To achieve this, users should keep in mind which photo types the VL model recognizes well in order to extract structured information successfully.
Another limitation is that, to comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, clinical images must be stored in an encrypted space, for example using the Microsoft Encrypting File System (EFS), even though they are transmitted via SSL (Secure Sockets Layer).
In conclusion, we developed and released an open-source clinical photography management system that streamlines medical imagery handling. The system facilitates a seamless migration from traditional filing methods and integrates efficiently into clinical workflows, reducing administrative burdens. In addition, by leveraging VL models, we demonstrated a functional bridge for automating patient information integration. This work serves as a foundational framework for incorporating evolving AI technologies into clinical practice, paving the way for further DX in healthcare.
ACKNOWLEDGMENT
SS Han and MS Kim have full access to all the data used in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. We sincerely thank Tongyi-Qianwen for releasing the quantized model of Qwen2-VL on Hugging Face. We would also like to express our gratitude to https://www.codeproject.com for providing helpful code.
Footnotes
FUNDING SOURCE: None.
CONFLICTS OF INTEREST: The authors have nothing to disclose.
DATA SHARING STATEMENT: Setup and Usage Tutorial: https://youtu.be/JYpVa0g7qTI.
*Microsoft Visual C++ Redistributable is required to run the VL Model Inference tool. (https://whria78.github.io/medicalphoto/warning).
MedicalPhoto: https://github.com/whria78/medicalphoto (previous versions are available at https://sourceforge.net/projects/medicalphoto and https://sourceforge.net/projects/medieye)
VL Model Inference Tool: https://github.com/whria78/llama-qwen-vl.
Both MedicalPhoto and the VL Inference Tool are free software (MedicalPhoto: GPL license; VL Inference Tool: MIT license).
References
1. Han SS, Park GH, Lim W, Kim MS, Na JI, Park I, et al. Deep neural networks show an equivalent and often superior performance to dermatologists in onychomycosis diagnosis: automatic construction of onychomycosis datasets by region-based convolutional deep neural network. PLoS One. 2018;13:e0191493. doi: 10.1371/journal.pone.0191493.
2. Han SS, Park I, Chang SE, Lim W, Kim MS, Park GH, et al. Augmented intelligence dermatology: deep neural networks empower medical professionals in diagnosing skin cancer and predicting treatment options for 134 skin disorders. J Invest Dermatol. 2020;140:1753–1761. doi: 10.1016/j.jid.2020.01.019.
3. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al. LLaMA: open and efficient foundation language models. arXiv. 2023 Feb 27; doi: 10.48550/arXiv.2302.13971.
4. Wang P, Bai S, Tan S, Wang S, Fan Z, Bai J, et al. Qwen2-VL: enhancing vision-language model’s perception of the world at any resolution. arXiv. 2024 Sep 18; doi: 10.48550/arXiv.2409.12191.
5. Yoon JH, Kwon IH, Park HW. The South Korean health-care system in crisis. Lancet. 2024;403:2589.


