Skip to main content

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

ArXiv logoLink to ArXiv
[Preprint]. 2025 May 9:arXiv:2505.05736v1. [Version 1]

Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications

Da Wu 1,!, Zhanliang Wang 1,2,!, Quan Nguyen 1,3, Zhuoran Xu 1,4, Kai Wang 1,5,*
PMCID: PMC12083703  PMID: 40386570

Abstract

The scarcity of high-quality multimodal biomedical data limits the ability to effectively fine-tune pretrained Large Language Models (LLMs) for specialized biomedical tasks. To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from high-quality multimodal biomedical data through preference optimization. While MINT supports different optimization techniques, we primarily implement it with the Odds Ratio Preference Optimization (ORPO) framework as its backbone. This strategy enables the aligned LLMs to perform predictive tasks using text-only or image-only inputs while retaining knowledge learnt from multimodal data. MINT leverages an upstream multimodal machine learning (MML) model trained on high-quality multimodal data to transfer domain-specific insights to downstream text-only or image-only LLMs. We demonstrate MINT’s effectiveness through two key applications: (1) Rare genetic disease prediction from texts, where MINT uses a multimodal encoder model, trained on facial photos and clinical notes, to generate a preference dataset for aligning a lightweight decoder-based text-only LLM (Llama 3.2–3B-Instruct). Despite relying on text input only, the MINT-derived model outperforms models trained with Supervised Fine-Tuning (SFT), Retrieval-Augmented Generation (RAG), or direct preference optimization (DPO), and even outperforms much larger foundation model (Llama 3.1-405B-Instruct). (2) Tissue type classification using cell nucleus images, where MINT uses a vision-language foundation model as the preference generator, containing knowledge learnt from both text and histopathological images to align downstream image-only models. The resulting MINT-derived model significantly improves the performance of Llama 3.2-Vision-11B-Instruct on tissue type classification. In summary, MINT provides an effective strategy to align unimodal LLMs with high-quality multimodal expertise through preference optimization. Our study also highlights a hybrid strategy that grafts the strength of encoder models in classification tasks into large decoder models to enhance reasoning, improve predictive tasks and reduce hallucination in biomedical applications.

Keywords: Large Language Models, Human Phenotype Ontology, Rare Genetic Disorders, Supervised Fine-Tuning, Retrieval Augmented Generation, Direct Preference Optimization, Odds Ratio Preference Optimization

INTRODUCTION

Language Models (LLMs) have achieved remarkable advances in natural language understanding and reasoning, powered by the scaling properties of the Transformer architecture1,2 and pretraining on large-scale, general-purpose text corpora35. Compared to earlier natural language processing (NLP) models, such as those from the Bidirectional Encoder Representations from Transformers (BERT)6, LLMs exhibit superior linguistic capabilities, more coherent long-range dependencies, context handling69 and improved generalization across diverse tasks. Recent models such as Llama 3.17/3.28/3.39 and DeepSeek V310/R111 extend these capabilities further with context windows of up to 128K tokens, enabling more accurate processing of lengthy, detailed, and often noisy biomedical documents. This progress has expanded their potential for a broad range of biomedical applications, including summarizing noisy clinical documents1215, managing patient scheduling1618, and addressing highly specialized tasks such as rare disease prediction and gene prioritization19,20. Despite these advances, adapting LLMs to domain-specific biomedical tasks remains challenging, especially when multimodal data are scarce, supervision is sparse, or the task requires specialized clinical reasoning.

Supervised fine-tuning (SFT)21,22 is the most commonly employed approach for adapting LLMs to domain-specific tasks using task-specific datasets that are typically much smaller than those used during pretraining. SFT has proven effective for linguistically driven applications2325. For example, Yang et al. developed PhenoGPT26 to derive phenotypes from long and complex clinical notes by fine-tuning LLMs on data annotated with Human Phenotype Ontology (HPO) terms, achieving improved performance over the foundation LLMs. While such applications demonstrate benefits of SFT in biomedical NLP tasks, the approach faces notable limitations when applied to more complex tasks that require structured prediction and complex logical reasoning, such as rare disease diagnosis or gene prioritization2729. These tasks often involve generating precise medical terminologies rather than selecting from a predefined set, which is especially challenging for auto-regressive LLMs that generates responses token by token2. The challenge is amplified when labeled data are scarce, as is often the case in biomedical domains. Moreover, indiscriminate fine-tuning on medical datasets can have unintended consequences, potentially weakening the model’s inherent linguistic and logical reasoning abilities significantly3032. These limitations becomes even more pronounced in zero-shot or few-shot scenarios, where models must handle previously unseen disease classes or tissue types, which are increasingly encountered in real-world clinical practice33.

To tackle these challenges, we propose Multimodal Integrated Knowledge Transfer (MINT), a framework that facilitates aligning LLMs with domain-specific patterns from multimodal biomedical databases (e.g., images, audios, videos, clinical texts) while preserving their original linguistic and reasoning capabilities. MINT can be implemented with different preference optimization techniques including Direct Preference Optimization (DPO)53 and Odds Ratio Preference Optimization (ORPO)34. In this paper, we primarily use MINT with ORPO as our default backbone, which we refer to simply as “MINT” hereafter for brevity. The MINT framework comprises two key components: upstream preference learning dataset construction and downstream LLM model training. In the upstream pipeline, a multimodal ML model is trained to generate a preference learning dataset, including a list of preferred responses and a list of unfavored responses. In the downstream pipeline, these preference-labeled inputs are used for aligning LLM using ORPO to prioritize preferred responses over unfavored ones while preserving its general linguistic and reasoning capabilities.

To showcase its power, we demonstrate that the MINT framework significantly enhances LLM performance in two important biomedical tasks: phenotype-driven rare disease prediction and image-based tissue type classification. For the first task, we utilize GestaltMML35, a multimodal Transformer model trained on facial images, clinical texts, and demographic information, to construct a preference learning dataset. Our results show that the MINT-aligned model outperforms widely used methods such as SFT and Retrieval-Augmented Generation (RAG)3638 in the task of performing text -only rare disease prediction, and even demonstrates strong performance on previously unseen disease classes in zero-shot settings. The second application of MINT focuses on tissue type classification from cell nucleus images using the PanNuke Database39,40 and Pathology Language-Image Pretraining (PLIP)41 model as the upstream multimodal preference dataset generator. By aligning Llama 3.2-Vision-11B-Instruct with preferences generated from PLIP, MINT outperforms other widely used techniques for image-based tissue type classification, such as Supervised Fine-Tuning (SFT). Together, we believe that the MINT framework can be readily extended to other biomedical applications, enabling the efficient transfer of knowledge from domain-specific multimodal datasets to LLMs or other large decoder models. This approach serves as a similar but alternative technique to RAG, which combines the strength of encoder in classification tasks with large decoder models.

RESULTS

Overview

In MINT, each task utilizes an upstream multimodal machine learning (MML) model to generate a preference learning dataset, which consists of most-likely and least-likely labels for each sample. The downstream large language model (LLM) then serves as the target for multimodal knowledge transfer (Figure 1). Throughout the Results section, we refer to our default implementation (MINT with ORPO) simply as ‘MINT’ for brevity, while the alternative implementation with DPO is referred to as “DPO”.

Figure 1. Overview of the MINT framework for transferring multimodal knowledge to Large Language Models.

Figure 1.

The framework consists of several pipelines: (1) Upstream Pipeline: A multimodal classifier integrates test and image modality input data to generate top-k most-likely and bottom-q least-likely predictions, which are organized into chosen (preferred) and rejected (non-preferred) responses in natural language to form a preference dataset. (2) Downstream Pipeline-SFT: Standard supervised fine-tuning of base language or vision-language. (3) Downstream Pipeline-MINT with DPO: Direct Preference Optimization approach that uses a frozen reference model (initialized from SFT) and a trainable policy model, optimizing with KL-divergence and maximum likelihood objectives. (4) Downstream Pipeline-MINT with ORPO (default): Our proposed unified framework combining negative log likelihood and odds ratio loss in a single step, directly optimizing the relative probabilities between chosen and rejected responses without requiring a separate reference model.

To demonstrate the effectiveness of the MINT framework, we evaluate its performance on two essential tasks: rare genetic disease prediction and tissue type classification. A summary of datasets, upstream multimodal models and downstream LLMs used in this study is given in Table 1. For rare disease prediction, the upstream model is GestaltMML, a Vision-and-Language Transformer-based multimodal ML model, while the downstream model is the text-only Llama-3.2–3B-Instruct since our input will be clinical texts only. For tissue type classification, the upstream model is Pathology Language-Image Pretraining (PLIP), a vision-language foundation model designed for pathology AI, whereas the downstream model is Llama 3.2-Vision-11B-Instruct, whose input will be images with a trivial question such as “What is shown in Figure?”

Table 1.

Summary of datasets, upstream multimodal ML model and downstream LLM for the two biomedical tasks evaluated in this study.

Detailed Components of MINT Framework
Task Rare Disease Prediction Tissue Type Classification
Dataset GestaltMatcher Database PanNuke Database
Images Frontal Facial Image Nucleus Image
Texts Age, Sex, Ethnicity, HPO terms Tissue type
Upstream Multimodal ML Model GestaltMML Pathology Language and Image Pre-Training (PLIP)
Downstream LLM Llama 3.2–3B-Instruct Llama 3.2-Vision-11B-Instruct
# of Training Sample 6522 5436
# of Testing Sample 386 2330
# of Labels (Categories) 528 19

Phenotype-driven disease prediction4245 is a task that makes predictions of rare diseases based on patients’ clinical phenotypes, and such predictions can assist physicians in selecting diagnostic modalities or to facilitate clinical labs in finding causal genes/mutations46,47. Such clinical phenotypes can include facial photos, clinical notes, lab tests, and other imaging modalities. Several machine learning models have been developed to utilize facial photos for disease prediction, such as DeepGestalt48 and GestaltMatcher49, which are based on facial image analysis, as well as GestaltMML35, a multimodal model that integrates facial images, demographic information, and clinical phenotypes such as Human Phenotype Ontology (HPO) terms47.

GestaltMML is a Transformer-based multimodal machine learning (ML) model that combines the ViLT (Vision-and-Language Transformer50) with the GestaltMatcher Database (GMDB)49,51. Unlike earlier multimodal models50 that use distinct image and text models with simple fusion, GestaltMML employs a unified Transformer encoder2 to model interactions between facial images and structured clinical texts.

PanNuke is a cell nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields sampled from more than 20K whole slide images at different magnifications, with a total of 205,343 labeled nuclei, each with an instance segmentation mask. PLIP, built upon the widely recognized Contrastive Language-Image Pre-training (CLIP) model52, is the first vision-and-language foundation model designed for pathology AI.

Our goal here is not nuclei segmentation but predicting the tissue type (such as “bile duct” and “colon”) from cell nucleus images. We show that MINT, when applied to Llama 3.2-Vision-11B-Instruct, outperforms other widely used techniques for image-based tissue type classification, such as Supervised Fine-Tuning (SFT).

We assess MINT’s performance by comparing it against the foundation LLM and other approaches, including supervised fine-tuning (SFT), retrieval-augmented generation (RAG), and direct preference optimization (DPO) on four key evaluation metrics: Hallucination Free Accuracy (HFA), Top-N (N= 5 or 10) Accuracy, Top-1 Accuracy, and Coverage-Avoidance Ratio (CAR). Hallucination-Free Accuracy measures the rate at which a language model does not produce hallucinations, with 100% indicating no hallucinations at all. For “Hallucination”, we mean the model either does not follow the instruction at all or produces fabricated labels. The CAR measures the model’s capability to distinguish a list of preferred responses over the least-likely responses. Note that RAG is implemented and evaluated for the text-based rare disease prediction task only in our study.

MINT Enhances Rare Disease Prediction Accuracy by LLM

We first evaluate it on a rare disease prediction task using text-based clinical phenotypes. In this task, the downstream LLM is given a short clinical summary (a few hundred words) that includes demographic information such as the patient’s age, gender, and ethnicity, along with observed phenotype features. For instance, a typical input might describe a 1-year-old female patient of Caucasian descent presenting with developmental delay, overgrowth, scoliosis and cardiovascular abnormalities. Based on this input, the model generates a ranked list of ten possible rare diseases that best match the provided features. The output is structured in a consistent format, with each entry containing the full disease name and, when applicable, a commonly used abbreviation. Examples of generated predictions include conditions such as “Phelan-McDermid syndrome (PHMDS)” and “Williams-Beuren syndrome (WBS).” This structured response helps clinicians quickly interpret model outputs and assess diagnostic relevance.

To benchmark prediction performance, we compared MINT against the base model without any domain knowledge enhancement, as well as several existing model enhancement techniques, including Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and SFT followed by Direct Preference Optimization (DPO), which we refer to as DPO hereafter for simplicity. Figure 2 presents the evaluation results of our best-performing configuration, using the Llama 3.2–3B-Instruct enhanced with MINT under a balanced preference setting (AoR = 1.0), i.e., 10 accepted vs. 10 rejected responses. The AoR is defined as the ratio of accepted to rejected responses when building the preference-learning datasets (see the Methods section for details). For this evaluation, we utilize a dataset of 6522 training samples and 386 testing samples compiled from the GestaltMatcher Database (GMDB). The choice of how many responses to accept and reject was determined through considerations of the total number of disease labels (~520). The figure summarizes diagnostic accuracy, robustness under different preference ratios, and general language understanding capability.

Figure 2. Performance evaluation of rare disease prediction techniques using Llama 3.2–3B-Instruct model.

Figure 2.

(a) Comparison of model performance across four evaluation metrics: Hallucination-Free Accuracy (HFA), Top-10 accuracy, Top-1 accuracy, and Coverage-Avoidance Ratio (CAR) for five approaches: Base Model, RAG, SFT, MINT with DPO, and MINT with ORPO (color from light to dark respectively).(b) Effect of varying Acceptance-over-Rejection (AoR) ratios on MINT performance, showing optimal performance at balanced ratio (AoR=1). (c) Radar chart comparing performance on six language understanding benchmarks, demonstrating preserved general capabilities across all fine-tuning techniques.

We compare various enhancement strategies applied to base Llama 3.2–3B-Instruct model, using our four key metrics: Hallucination-Free Accuracy (HFA), Top-10 accuracy, Top-1 accuracy, and Coverage-Avoidance Rate (CAR) (Figure 2a). Our proposed method, MINT, consistently achieves the best performance across all metrics. Notably, MINT raises Top-10 accuracy from 5.19% (base model) to 52.99%, outperforming other methods such as RAG (6.52%), SFT (37.53%) and DPO (38.49%). The Coverage-Avoidance Rate (CAR) metric further highlights MINT’s strength. CAR is a harmonic mean between Top-k coverage and Bottom-q avoidance, assessing whether the model can prioritize likely diseases while avoiding implausible ones. On this metric, MINT achieves a CAR of 0.2877, significantly higher than both SFT (0.1665) and DPO (0.2808). This indicates that MINT not only improves recall of true conditions in the top ranks, but also minimizes the inclusion of clinically irrelevant diseases, thereby improving reliability for real-world diagnostic support.

We also examine the robustness of MINT under varying Acceptance-over-Rejection (AoR) ratios, which reflect the balance between accepted and rejected samples in the preference optimization phase (Figure 2b). We observe that performance improves as the AoR approaches a balanced ratio (10 accepted vs. 10 rejected). This trend suggests that a well-balanced preference dataset provides clearer and more stable gradient signals for learning disease ranking, while overly imbalanced settings such as too many rejected candidates, may dilute the optimization signal and reduce learning efficiency. These results offer practical guidance for designing preference-based supervision strategies in downstream medical applications.

To address the possibility that MINT-enhanced models may lose the general language understanding capabilities, we evaluate it on the H6-benchmark, a comprehensive evaluation suite comprising six-widely-used datasets: MMLU54, TruthfulQA55, HellaSwag56, Winogrande, ARC57 and GSM8k58. In total, the H6-benchmark contains approximately 113,000 diverse samples that test various aspects of language understanding and reasoning through multiple-choice questions. The results demonstrate that MINT does not compromise the base model’s generalization ability. Performance is maintained or modestly changed across all benchmarks, suggesting that MINT’s improvements in rare disease prediction are not at the expense of broad language or reasoning skills. This balance is especially important for models expected to operate in open-domain or multi-task settings.

To validate the generality of MINT across model sizes and architectures, we report full results in Table S2, which includes experiments on Llama 3.2–1B, 3B, 8B, Gemma 2–2B, 9B, and Llama 3.1–8B,70B, 405B. Due to computational constraints, only inference is performed on the largest models (70B and 405B). Across all scales and families, MINT consistently improves both the Top-10 and Top-1 accuracy. For example, on Llama 3.2–1B-Instruct, Top-10 accuracy increases from 4.24% to 34.28%, yet on Gemma-2–9B, Top-10 accuracy improves from 12.37% to 33.80%. Similarly, we observed a large increase of CAR values under different MINT model sizes and architectures, outperforming all the corresponding baselines. These gains are consistently accompanied by near-perfect HFA values, often exceeding 99%, indicating that MINT substantially enhances accuracy while maintaining high factual consistency and avoiding hallucinated outputs. This is to be expected since the upstream model GestaltMML is an encoder model that puts strong constraints on the prediction to ~520 disease labels.

Table 2. Evaluation of Llama 3.2–3B-Instruct variants on Phenopacket-derived clinical notes.

Results show three evaluation metrics: Hallucination Free Accuracy (%), Top-10 Accuracy, and Top-1 Accuracy for five model variants based on the Llama 3.2–3B-Instruct architecture. Each row represents a different approach: Base Model, RAG, SFT, DPO and MINT. Results are presented for three data configurations: overlapping diseases (1638 samples with 72 diseases shared with training data), disjoint diseases (4342 samples with 456 diseases unseen in training data), and the combined dataset (5980 samples). Bold values indicate best performance for each metric in each configuration. ‘MINT’ refers to our default implementation of the MINT framework using ORPO.

Performance on Phenopacket-derived Clinical Notes (N=1638) (72 overlapping diseases)
Model (Llama 3.2–3B-Instruct) HFA (%) Top-10 Accuracy (%) Top-1 Accuracy (%)
Base 99.94 5.13 3.42
RAG 100.00 30.11 26.45
SFT 99.94 46.40 14.65
DPO 99.94 60.99 26.56
MINT 99.89 66.91 47.56
Performance on Phenopacket-derived Clinical Notes (N=4342) (456 disjoint diseases)
Model (Llama 3.2–3B-Instruct) HFA (%) Top-10 Accuracy (%) Top-1 Accuracy (%)
Base 99.99 20.00 11.29
RAG 100.00 24.17 13.80
SFT 100.00 9.16 8.46
DPO 99.99 9.48 7.71
MINT 99.90 10.48 7.00
Performance on Phenopacket-derived Clinical Notes (N=5980)
Model (Llama 3.2–3B-Instruct) HFA (%) Top-10 Accuracy (%) Top-1 Accuracy (%)
Base 99.98 15.58 8.98
RAG 100.00 25.96 17.62
SFT 99.98 20.25 10.18
DPO 99.98 24.24 13.47
MINT 99.90 27.05 20.23

Additional Validation and Case Study for Rare Disease Prediction

Our results presented in the previous section is performed by splitting the original GMDB datasets into training and testing subsets. We recognize that cross-validation can be susceptible to overfitting, especially if the data used for validation is not representative of the true distribution of data the model will encounter in real-world scenarios. Indeed, the clinical phenotypes from the GMDB database often comprises of a few phenotype terms in the form of HPO, rather than natural texts that are written by human experts. To assess the robustness and generalizability of MINT in real-world clinical contexts, we conducted external validation using Phenopacket-derived clinical notes59. This dataset consists of 5980 high-quality synthesized clinical notes that simulate realistic clinical narratives by converting standardized phenotype profiles (“Phenopackets”) into natural language. This introduces the linguistic variability and contextual noise commonly observed in clinical practice. We divided this dataset into two subsets: one containing set of diseases that overlap with the GMDB training set (1638 samples with 72 overlapping diseases) and another that are non-overlapping (disjoint) from the training set (4342 samples with 456 disjoint diseases). This division allows us to assess both in-distribution performance and zero-shot generalization to unseen disease classes.

As shown in Table 2, MINT demonstrates strong performance across both subsets. For the overlapping disease subset, MINT substantially outperforms all baselines, achieving 66.91% Top-10 accuracy and 47.56% Top-1 accuracy, compared to the base model’s 5.13% and 3.42%, respectively. The performance gap is particularly notable compared to other enhancement techniques: DPO (60.99% Top-10, 26.56% Top-1), SFT (46.40% Top-10, 14.65% Top-1), and RAG (30.11% Top-10, 26.45% Top-1).

For the more challenging disjoint disease subset, which evaluates zero-shot generalization to completely unseen disease classes, performance decreases across all methods as expected. MINT achieves 10.48% Top-10 accuracy and 7.00% Top-1 accuracy. Interestingly, in this zero-shot scenario, RAG shows competitive performance with 24.17% Top-10 accuracy and 13.80% Top-1 accuracy, highlighting the complementary strengths of retrieval-based approaches when dealing with previously unseen diseases. This suggests that combining MINT with retrieval mechanisms could yield further improvements for zero-shot prediction scenarios. All methods maintain near-perfect Hallucination-Free Accuracy (HFA) rates above 99%, indicating strong factual consistency across approaches.

These extensive validation experiments demonstrate that MINT improves rare disease prediction across diverse text sources when diseases overlap with the training distribution. For zero-shot scenarios with previously unseen diseases, retrieval augmented approaches offer complementary strengths, suggesting that combining preference optimization with retrieval mechanisms could be particularly effective for comprehensive rare disease prediction systems.

MINT Substantially Enhances Tissue Type Classification Accuracy by LLM

We further assess the capability of MINT on a vision-centric biomedical classification task: tissue type prediction using the PanNuke dataset. The objective is to identify the most likely tissue of origin given a histological image, selected from a fixed set of 19 tissue types. This task is cast into a vision-language interface by providing each model with two inputs: (i) an image patch extracted from a nucleus-level H&E-stained slide, and (ii) a standardized prompt that asks, “Based on the morphological features observed in the provided nucleus image, which tissue types could this nucleus originate from?” The model is instructed to generate a ranked list of exactly five candidate tissue types. While the prompt follows the format of a typical Visual Question Answering (VQA) task, it is semantically constant across all samples and thus does not provide additional context. As a result, the task relies entirely on visual modality from decision-making, making it a faithful benchmark of visual reasoning capacity.

We summarize the results of applying MINT to the downstream Llama 3.2–11B-Vision-Instruct model and three common enhancement baselines: SFT, DPO, and the base model (Figure 3a). For this evaluation, we utilize 5436 training images and 2330 testing images from the PanNuke dataset. Full numerical results with standard deviations across three random seeds are provided in Table S3. The base vision-language model achieves moderate baseline performance, with a Top-5 accuracy of 32.21% and Top-1 accuracy of 16.96%, likely reflecting pre-existing visual understanding from its large-scale pretraining. Upon applying MINT, we observe a substantial performance gain where Top-5 accuracy rises to 57.58% and Top-1 accuracy improves to 28.41%, representing a nearly doubling of accuracy compared to the base model. In addition to accuracy gains, MINT achieves the highest Coverage-Avoidance Rate (CAR) of 0.5203, slightly surpassing the best baseline (DPO, 0.5196) and substantially exceeding the base model (0.2965). The CAR is computed using 5 accepted responses and 5 rejected responses due to the small label set (19 classes), reflects the model’s ability to both surface relevant tissue types in the top predictions and avoid implausible categories in the bottom ranks.

Figure 3. Performance on tissue type classification using different fine-tuning techniques on the Llama 3.2-Vision-11B-Instruct foundation model.

Figure 3.

(a) Bar Chart showing performance metrics across four evaluation criteria: Hallucination Free Accuracy (%), Top-5 Accuracy (%), Top-1 Accuracy (%), and CAR (Coverage-Avoidance Ratio). Four fine-tuning approaches are compared: Base Model, SFT, MINT with DPO, MINT with ORPO (color from light to dark respectively). (b) Radar chart comparing performance across multiple general vision-language capabilities for different fine-tuning techniques.

Other techniques, including SFT, DPO, show moderate gains; For example, both SFT and DPO improve Top-5 accuracy to 41.16%, yet fall short of MINT across all key metrics. These results reinforce MINT’s strength in enhancing multimodal models across both text-based and image-based biomedical tasks. Note that due to the small size of the label space, we do not explore varying Acceptance-over-Rejection (AoR) ratios in this task and instead fix the preference optimization setting to a balanced 5 accepted vs. 5 rejected labels.

To examine whether MINT and other fine-tuning strategies compromise the general capabilities of vision-language models (VLMs), we further evaluate their performance on SEED-Bench60, a large-scale benchmark consists of 19,000 multiple choice questions with high-quality human annotations SEED-Bench spans ten core evaluation dimensions, including instance interaction, instance identity, instance attributes, instance location, instance counting, scene understanding, spatial relation, text understanding, and visual reasoning. As shown in Figure 3b, the performance of all fine-tuned models—MINT, SFT, and DPO—closely aligns with that of the base model across all axes. These findings indicate that none of the fine-tuning approaches, including MINT, degrade the general-purpose visual-language reasoning ability of the foundation model, while MINT simultaneously delivers substantial performance improvements on the domain-specific task of tissue type classification.

Case Study: MINT Enhanced LLM Can Differentiate Between Colon and Bile Duct Tissue Images

We next present a detailed case study demonstrating MINT’s effectiveness in distinguishing between Colon61 and Bile Duct6264 tissue images (Figure 4), which represent a particularly challenging discrimination task due to their histological similarities. Both tissues exhibit comparable epithelia structures and are functionally related within the digestive system65,66. Colon epithelium consists of columnar cells with microvilli that enhance nutrient absorption and water reabsorption, while bile ducts are lined by cholangiocytes with similar microvilli that aid bile modification and transport67,68.

Figure 4. Comparative analysis of tissue type classification performance between Base model, SFT, and MINT for similar-looking bile duct and colon tissues.

Figure 4.

The figure demonstrates how MINT improves discrimination between histologically similar tissues by leveraging both positive and negative training examples. Top panel shows bile duct tissue classification: a representative training sample with corresponding chosen and rejected tissue types (left), and four testing samples with their respective ranks assigned by Base, SFT, and MINT (Right). Bottom panel shows the same analysis for colon tissue classification. Green values represent ranking for the ground truth tissue class, while red values indication rankings for the visually similar confused class. Lower values represent higher confidence (rank 1 is the highest). Average ranks across all test samples are shown in both panels. Average ranks across all test samples are shown at the bottom of each panel. ‘MINT’ refers to our default implementation of the MINT framework using ORPO.

To illustrate MINT’s superior discriminative capability, we select representative training and testing samples that highlight the visual similarity between these tissue types as shown in Figure 4. The testing results demonstrate MINT’s enhanced discriminative ability. For bile duct tissue samples (Top panel in Figure 4), MINT consistently assigns rank 1 to the correct tissue type across all test samples, while pushing the visually similar colon tissue to substantially lower rank with average in 5.75. In comparison, SFT shows considerable confusion, assigning an average of 1.25 to ground truth bile duct but also assigning a higher rank of 2.00 to colon which is the confused class. This indicates that while SFT can identify the correct tissue, it fails to decisively differentiate it from the visually similar tissue. Similarly, for colon tissue samples (Bottom panel in Figure 4), MINT assigns rank 1 to the correct tissue across all test samples while relegating bile duct to much lower positions with average rank 4.00. In contrast, SFT shows significant confusion with an average rank of 1.25 for ground truth class colon and 1.50 for confused class bile duct. This situation indicates that SFT frequently ranks the incorrect confused tissue even higher than the correct one for colon samples. The base model performs considerably worse than both fine-tuned models, showing minimal discrimination between these similar tissue types. For bile duct samples, the base model assigns average ranks of 4.75 to bile duct (correct) and 2.25 to colon (incorrect), frequently ranking the wrong tissue higher. For colon samples, it assigns average ranks of 4.25 to colon (correct) and 1.50 to bile duct (incorrect), consistently preferring the incorrect tissue.

This improved discriminative capability stems directly from MINT’s preference learning approach. Unlike SFT, which learns morphological features solely from positive samples without explicitly accounting for confusing tissue classes. MINT, on the other hand, is explicitly trained to both recognize the correct tissue class and distinguish it from visually similar, potentially confusing tissue classes. For instance, when learning bile duct tissue, MINT is trained to recognize morphological features of colon tissue as a rejected class, and vice versa. This targeted contrastive learning enables MINT to better differentiate between closely related tissue types.

By explicitly incorporating both chosen (positive) and rejected (negative) tissue types during training, MINT learns subtle morphological features that differentiate between visually similar tissues. The odds ratio loss function used in MINT’s training emphasizes the importance of maximizing the separation between correct and incorrect predictions, resulting in more decisive classification for challenging cases.

The case study demonstrates that MINT’s approach is particularly valuable for histopathology applications, where subtle morphological differences between tissue types can have significant diagnostic implications. While SFT improves over the base model in identifying correct tissues, MINT substantially outperforms both in discriminating between visually similar tissue types.

METHODS

Evaluation Metric

In this paper, we assess the capabilities of large language models (LLMs) on rare genetic diseases prediction using text description of clinical phenotypes across three main outcomes: (HFA: Hallucination-Free Accuracy) whether the LLM can produce a meaningful response to a query, specifically a list of real disease names; (Top-N) if (HFA) is successful, whether the LLM’s response of N choices includes the correct disease prediction (true disease name); (Top-1) if (HFA) is successful, whether the LLM can accurately select the correct disease prediction. These tasks follow a logical progression: outcomes (Top-N) and (Top-1) are only applicable if (HFA) is achieved, meaning if (HFA) fails, (Top-N) and (Top-1) will also fail. For rare disease prediction, we choose N = 10, while for tissue type classification, we chose N = 5.

Motivated by the preference optimization framework (DPO and ORPO), we also evaluate the model’s ability to prioritize likely diseases in the Top-k preferred responses while avoiding Bottom- q unlikely responses. Let Ri denote the ith response by LLM, consisting of a list of rare disease names. Let Tk,i denote the top-k disease names (k labels with highest probabilities) predicted by upstream Multimodal ML model. If the true label if not at the first position, then we will swap the true label with the one first position. Let Bq,i denote the bottom-q disease names predicted by upstream Multimodal ML model. Similarly, it consists of diseases with q lowest probabilities among all the diseases.

Next, we define the Top-𝑘 coverage rate and Bottom-𝑞 avoidance rate, denoted by Ck,i and Aq,i respectively, by

Ck,i=Tk,iRiRiandAq,i=1RiBq,iRi,

respectively. Finally, we define the overall Coverage-Avoidance Rate (CAR) to be the weighted harmonic mean between Top-k coverage rate and Bottom-q avoidance rate. Mathematically, for λ0,

CAR(k,q,λ)=1Ni=1N(1+λ)Ck,iAq,iλCk,i+Aq,i

The weighting parameter λ can be adjusted further to balance the importance between Top-k coverage and Bottom-q avoidance rate. Note that when λ=1, both Ck,i and Aq,i are equally weighted, and CAR is just the naive harmonic mean. When kq, we can set λ>1 to give Ck,i more weight. Likewise, when k>q, we can set λ<1 to give Aq,i more weight. We also denote k/q by Acceptance-over-Rejection (AoR) ratio.

Finally, we benchmark all the add-on techniques using six widely recognized datasets, including ARC (Abstract Science Reasoning)57, HellaSwag (Commonsense Inference)56, MMLU (Massive Multitask Language Understanding)54, TruthfulQA (Detecting False Information)55, Winogrande (Context-Based Inference)69, and GSM8K (Mathematical Reasoning)58. ARC (AI2 Reasoning Challenge) assesses scientific reasoning with grade-school science questions. HellaSwag tests commonsense inference through narrative continuation tasks. MMLU (Massive Multitask Language Understanding) evaluates multitask knowledge and reasoning across 57 diverse subjects. TruthfulQA measures the model’s ability to avoid reproducing falsehoods. Winogrande focuses on commonsense reasoning with ambiguous pronoun resolution tasks. GSM8K asks the model with multi-step grade school math problems. All evaluations use a 5-shot setting to ensure consistency and rigor, which verifies that the techniques retain the original language understanding and reasoning capabilities of the base model after fine-tuning. This evaluation aims to determine whether the add-on technique retains the original language and reasoning capabilities of the base model.

Automated Evaluation Program

Given the extensive number of experiments involved, we develop an automated evaluation framework, LLM-Eval, to systematically and rigorously assess the responses generated by LLMs. The evaluation process comprises multiple steps to ensure the plausibility, relevance, and ranking quality of the model’s diagnoses. It begins with HFA, which checks whether the model generates plausible disease names by filtering out extreme hallucinations, such as nonsensical phrases or invalid outputs. Once HFA is passed, Top-N evaluates whether the ground truth disease is included in the ranked list of N possible diseases generated by the model, while Top-1 checks if the top-ranked disease matches the ground truth. Additionally, we compute the CAR to measure whether the model prioritizes the top-𝑘 diseases and avoids the bottom-q diseases recommended by the upstream multimodal ML model.

To account for variations in phrasing or formatting that retain the same semantic meaning, we employ the SequenceMatcher algorithm70. This technique calculates a similarity ratio between two strings, effectively identifying matches even when textual representations differ slightly. For instance, it can align abbreviations with full names (e.g., “CdLS” and “Cornelia de Lange Syndrome”) or account for other minor variations in disease name representation (e.g., “Simpson-golabi-behmel Syndrome, type 1” and “Type 1 Simpson-golabi-behmel Syndrome”.

We set the similarity threshold to 0.6 for HFA to allow greater flexibility, as this stage focuses on detecting plausible outputs rather than assessing strict correctness. This step ensures minimizes false negatives during the plausibility check. Conversely, stricter thresholds of 0.8 are applied for Top-10, Top-1, and CAR, as these later stages require higher precision to avoid false positives and ensure accurate ranking and matching. This balance between flexibility and stringency ensures a robust evaluation framework.

Data sources

For rare disease prediction, we primarily utilize the GMDB database (v1.0.9)49,51, a multimodal medical resource containing frontal facial images of patients along with corresponding textual metadata, including demographic information and clinical HPO terms. This database is accessible to researchers in the medical field but requires an application process to obtain access to the data.

For external validation, we employ two independent data sources beyond the primary GMDB database, the Phenopacket-derived clinical notes59. This dataset comprises 5980 high quality synthesized clinical notes created by converting standardized phenotype profiles (called “Phenopackets”) into natural language narratives. These notes simulate realistic clinical documentation by transforming structured phenotypic data into flowing text with the linguistic variability commonly observed in medical records. We divide this dataset into two subsets: one containing set of diseases that overlap with our GMDB training set (1638 samples representing 72 diseases) and another with diseases that are completely disjoint from our training set. (4342 samples representing 456 diseases)

For tissue type classification, we employ the PanNuke database, a comprehensive pan-cancer dataset designed for nuclei instance segmentation and classification. This dataset encompasses 19 distinct tissue types, annotated through a semi-automated process and rigorously quality-checked by clinical pathologists. As a result, it offers a dataset whose characteristics closely mirror those encountered in clinical settings, with minimal selection bias.

Supervised Fine-tuning (SFT) of LLMs

SFT Data Construction:

Although this is a multimodality dataset, for the purpose of fine-tuning LLMs, we only use the texts component of GMDB. We work with the samples that have non-null present features, or equivalently, at least one HPO id in the “present features” column of the metadata. In this case, we will do the following two steps: (1) Transform the HPO id(s) into real text data via the standard HPO dictionary and then concatenate them with a comma “,” in between. For instance, the “HP:0000486; HP:0001263; HP:0010864” will become “Strabismus, Global developmental delay, Intellectual disability, severe.” (2) Add patients’ demographic information in the front. The image metadata of GMDB database contains patients’ sex, age, and ethnicity information, which will be combined for LLM fine-tuning. For tissue type classification, we format our fine-tuning data in the simple Visual Question Answering (VQA) format. Table S4 contains the sample SFT data used for both rare disease prediction and tissue type classification.

LoRA fine-tuning of LLMs:

For fine-tuning, we utilize a single node with 4 NVIDIA A100 GPUs (40GB) and employ Low-Rank Adaptation (LoRA)71 to reduce memory overhead. Our LoRA configuration is set with 16-bit precision, and the parameters are tailored to model size: for Llama 3.1–8B-Instruct and Gemma-2–9B-Instruct, we set the rank to 128 and scaling factor (lora alpha) to 64; for Llama 3.2–1B-Instruct, Llama 3.2–3B-Instruct and Gemma-2–2B-Instruct, we increase the rank to 256 and scaling factor to 128. A consistent LoRA dropout rate of 0.05 is set across all models to mitigate overfitting. For tissue type classification task, we set the rank to 128 and scaling factor to 64 for Llama3.2-Vision-11B-Instruct model.

Inference via Retrieval Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) combines retrieval systems with LLMs to enhance the relevance and contextual accuracy of generated text, making it particularly well-suited for domain-specific tasks such as disease prediction. In this work, we utilize LlamaIndex72 to construct our RAG database, as it offers a flexible interface for connecting both structured and unstructured medical knowledge sources to LLMs, enabling efficient and accurate retrieval. The database is built from two key biomedical resources: OMIM73 and HPO74. To improve retrieval precision, we preprocess these datasets by systematically mapping HPO and OMIM terms to their corresponding disease names, ensuring that phenotypic and disease-related terminology is accurately aligned.

To further enhance the embedding representations for diseases and phenotypes, we adopt NeuML/pubmedbert-base-embeddings75, a biomedical sentence transformer specifically trained on large-scale biomedical literature. This model captures the complex relationships and semantics of medical terms, allowing our system to understand the nuances of disease and phenotype descriptions more effectively. Together, these components enable our RAG pipeline to retrieve domain-specific, high-quality information, which is then used by the LLM to generate accurate and contextually appropriate outputs tailored to the disease prediction scenario.

Preference Optimization Frameworks

Overview:

MINT is a framework for aligning Large Language Models with multimodal biomedical expertise through preference optimization. In this study, we explore two implementations of the MINT framework: MINT with DPO (Direct Preference Optimization) and MINT with ORPO (Odds Ratio Preference Optimization). While traditional Reinforcement Learning from Human Feedback (RLHF) approaches like PPO7678 require reward model training, extensive hyperparameter tuning, and complex sampling procedures, DPO offers a more direct approach to preference alignment. The process begins with a base model, followed by SFT (Supervised Fine-tuning) to align the model’s responses with specific requirements. The final step introduces a direct optimization objective that leverages the SFT model as a reference point to optimize the preference alignment between chosen and rejected responses. This approach enhances the model’s domain-specific skills by learning from a dataset of human preferences without the complexity of reward modeling and policy optimization typically associated with RLHF methods.

ORPO, on the other hand, integrates preference alignment directly into the SFT phase in a single step. Its objective function leverages the odds ratio, offering a stable and effective approach to distinguish between preferred and non-preferred responses. This method can be applied effectively across various model sizes and demonstrates superior performance.

Preference Learning Dataset Construction:

Due to the substantial time and resources required for manual data labeling, we employ the previously developed GestaltMML and PLIP models to build a dataset based on preference learning.

In each instance within the GMDB or PanNuke database, GestaltMML or PLIP produces a ranked list of potential labels (either disease names or tissue types). From this list, the top k labels (for example, k = 5, 10) are identified as preferred (“chosen”) outputs, indicative of probable diagnoses, while the bottom q labels (for example, q = 10, 50) are classified as non-preferred (“rejected”) outputs, suggesting fewer probable diagnoses. Generally, q is set larger than k to reflect that potential diagnoses are fewer than improbable ones. If the most probable prediction (Top-1 prediction) from GestaltMML or PLIP does not align with the actual prediction, it is substituted with the true prediction to ensure its presence at the top of the “chosen” outputs.

Thus, each case is structured into a pair of prompts: the input data sample is followed by the “chosen” and “rejected” labels, with the accurate prediction consistently presented first in the list. The sample of preference learning dataset are illustrated in Table S4.

Prompting Strategies for Inference:

To optimize MINT for rare disease prediction and tissue type classification, we implement a dual-phase prompt strategy to prevent model collapse and catastrophic forgetting. This approach allows the model to strengthen diagnostic accuracy while ensuring outputs that are clear and clinically useful.

During the inference stage, we introduce a more detailed prompt format to ensure the model’s outputs are presented in a consistent and structured way, making them easier to interpret for clinicians and data analysts. This phase’s prompt requires a numbered list of diseases, with precise formatting guidelines that are easy to follow and interpret. The concrete prompts for inference are presented in Table S4.

The added structure in the inference prompt help MINT produce outputs that are not only diagnostically accurate but also neatly formatted and suitable for clinical review. By implementing this dual-phase prompt strategy, we ensure that MINT could generate flexible and insightful responses during training while providing reliable, structured outputs in inference, which is the key for supporting human interpretation in clinical workflows.

Direct Preference Optimization (DPO):

DPO is a preference-based learning framework built upon a SFT model, designed to align model predictions with task-specific preferences. DPO builds directly on the SFT model by leveraging preference signals to prioritize chosen diagnoses yω over rejected diagnoses yl, ensuring the model aligns its outputs to reflect user-defined preferences. To achieve this, DPO compares the output probabilities of the trainable actor model Pθ() with those of a frozen reference model Pref(), which is a static copy of the SFT model. The reference model ensures that the preference-aligned actor model does not deviate too far from the base SFT model, thus preventing issues such as hallucinations or overly biased outputs. The DPO objective function is expressed as follows:

LDPO(Pθ;Pref)=E(x,yω,yl)D[log σ(βlogPθ(yωx)Pref(yωx)βlogPθ(ylx)Pref(ylx))]

Here, Pθ(yx) represents the conditional probability of the actor model, Pref(yx) represents the frozen reference model, x is the input phenotypic description, and β is a scaling factor to control the distance between actor model and reference model. By maximizing the relative log likelihood of chosen response over rejected response while referencing the base SFT model, DPO ensures the effective alignment with preferences without compromising the foundational consistency of the model.

MINT:

Our default implementation, MINT with ORPO, utilizes the Odds Ratio Preference Optimization (ORPO). Compared to DPO, the ORPO is a framework that combines Supervised Fine-Tuning (SFT) and preference learning in one stage.

Given a set of paired diagnoses yω,yl for each patient case, ORPO minimizes the preference loss LORPO which encourages higher scores for chosen diagnoses:

LORPO=Ex,yω,ylLSFT+βLOR,

where LSFT represents the conventional causal language model negative log-likelihood (NLL) loss function to maximize the likelihood of generating the reference tokens. LOR maximizes the odds ratio between the likelihood of generating the favored response yω over disfavored response yl. More specifically, the odds of y given x is defined by

Oddsθ(x):=Pθ(x)1Pθ(x).

Following the above, the odds ratio of a chosen response yω over yl, denote by ORθyω,yl, is defined by

ORθyω,yl:=Oddsθ(x)Oddsθ(x)

and the LOR=log σlog ORθyω,yl. In other words, the log odds ratio was wrapped with the log sigmoid function so that LOR could be minimized by increasing the log odds ratio between yω and yl.

Finally, in LORPO, the LOR is weighted with β adapts the pretrained model to the domain-specific task and meanwhile enforces separation between chosen and rejected outputs (see Figure S2), compelling the model to push disfavored generations further down.

Benchmark experiments

For rare disease prediction, we perform five methodological comparisons on Llama 3.2–3B-Instruct: (1) direct inference using the base model, (2) base model with RAG support, (3) base model after supervised fine-tuning (SFT), (4) DPO, and (5) MINT. Additionally, we also assess other four relatively small open-source models of different sizes: Llama 3.2–1B-Instruct, Llama 3.1–8B-Instruct, Gemma-2–2B-Instruct, and Gemma-2–9B-Instruct. For larger models like Llama 3.1–70B-Instruct and Llama 3.1–405B-Instruct, we limit our evaluation to their direct inference capabilities using the base model, as implementing additional techniques is not feasible due to computational resource constraints faced by the research community. All the results besides Llama 3.2–3B-Instruct are presented in Table S1.

Beyond these eight experiments, we also evaluate other LLMs, including the closed-source GPT-4o (access time Nov. 2024), on one-shot learning using CoT prompts. Sample responses are included in the Supplementary Material.

For tissue type classification, we perform four methodological comparisons on Llama 3.2-Vision-11B-Instruct: (1) direct inference using the base model, (2) base model after supervised fine-tuning (SFT), (3) DPO, and (4) MINT.

DISCUSSION

In this study, we introduce MINT (Multimodal Integrated kNowledge Transfer), a novel and effective framework for aligning unimodal LLMs with multimodal biomedical data through preference optimization, with ORPO serving as the default backbone mechanism in our implementation. Our comprehensive experiments demonstrate MINT’s effectiveness in two complex biomedical tasks: rare disease prediction from clinical texts and tissue type classification from histological images. These results support MINT’s potential as a versatile approach for enhancing domain-specific capabilities of large language models (LLMs) without compromising their general language and visual understanding.

Despite MINT’s promising results, several important limitations must be acknowledged. First, MINT’s performance diminishes substantially in zero-shot scenarios with completely unseen disease classes, as evidenced by the lower accuracy on disjoint disease subsets in our external validation. While this is expected for any fine-tuning approach, it highlights the need for complementary techniques such as RAG when applying MINT in real-world clinical settings where extremely rare, previously unseen diseases may be encountered. Second, the quality of the upstream multimodal model significantly influences MINT’s effectiveness. MINT inherits both the strengths and the weaknesses of its upstream model. If the upstream model contains biases or errors, these may propagate to MINT-enhanced LLM. This risk is particularly relevant in medical applications where demographic or geographic biases in training data could lead to disparities in diagnostic accuracy. Third, while MINT significantly improves specific task performance, it still falls short of specialized encoder-based classification models in some scenarios. This gap reflects the fundamental architectural differences between encoder models optimized for classification and decoder models designed for generative tasks. The auto-regressive nature of text generation introduces additional complexity compared to direct label selection, which may limit the upper bound of MINT’s performance on classification tasks.

The core innovation of MINT lies in its unified approach to aligning unimodal models with multimodal expertise through preference optimization. Rather than using traditional knowledge transfer methods that require parallel training of teacher and student models, MINT uses multimodal models as preference data generators to create structured datasets of likely and unlikely predictions. These preference datasets then guide the alignment of downstream LLMs through ORPO, enabling them to benefit from patterns identified in multimodal data while preserving their general capabilities. This approach offers several distinct advantages: First, MINT effectively bridges the modality gap between multimodal machine learning models and text-only or image-only LLMs. Our rare disease prediction experiments demonstrate that MINT can align LLMs with patterns from GestaltMML, which utilizes facial images, demographic information and clinical phenotypes. Although the downstream LLMs consume only text, they still benefit from facial image data indirectly through the preference-learning dataset, which encapsulates inter-modal relationships learned by GestaltMML. Second, MINT demonstrates similarly impressive performance in the tissue type classification task with preference data from PLIP, helping models prioritize relevant tissue types and avoid implausible ones more effectively than baseline approaches. Third, MINT’s preference optimization approach enables more nuanced learning than traditional supervised fine-tuning by incorporating both positive and negative examples during training, developing a more discriminative understanding of subtle features that differentiate similar conditions, as evidenced in our tissue type classification case study. Fourth, MINT-enhanced decoder-based LLMs show promising generalization capabilities compared to traditional encoder-only classifiers, particularly in zero-shot scenarios with previously unseen disease classes, demonstrating the decoder architecture’s inherent flexibility and reasoning advantages. Finally, key design choices contribute to MINT’s success: the balanced Acceptance-over-Rejection ratio provides clear contrastive learning signals; the odds ratio loss delivers stable gradients even with limited training data; and the approach preserves general model capabilities while enhancing domain-specific performance, making it particularly valuable for biomedical applications.

Looking forward, the MINT framework opens several promising directions for future research in biomedical AI. First, hybrid approaches combining MINT with retrieval mechanisms could address the zero-shot limitations while preserving MINT’s strong in-distribution performance. Our external validation results on disjoint disease subsets, where RAG outperformed MINT, suggest that such hybrid approaches might be particularly valuable for rare disease prediction, where unseen presentations are common. Second, extending MINT to additional biomedical tasks beyond disease prediction and tissue type classification could demonstrate its broader utility. Potential applications include medical image analysis, genomic interpretation, drug discovery, and clinical decision support. Each domain would require domain-specific upstream models but could benefit from MINT’s flexible knowledge transfer approach. Third, investigating the interpretability of MINT-enhanced models represents an important clinical research direction. While MINT improves predictive performance, understanding the reasoning behind its predictions remains crucial for clinical trust and adoption. Future work could explore techniques for extracting and visualizing the diagnostic patterns learned through preference optimization. For tissue classification specifically, integrating region-of-interest highlighting could help pathologists understand which histological features most influenced the model’s predictions. Finally, prospective clinical evaluations in real-world healthcare settings will be essential to validate MINT’s practical utility. Such studies should assess not only diagnostic accuracy but also impact on clinical decision-making, time-to-prediction for rare conditions, and overall patient outcomes.

In conclusion, MINT represents an advancement in biomedical AI by bridging the gap between specialized multimodal models and general-purpose large language models. By leveraging preference optimization to transfer domain-specific knowledge, MINT enhances the capabilities of open-source LLMs in critical biomedical tasks while maintaining their inherent flexibility and reasoning capabilities. The framework demonstrated consistent effectiveness across multiple tasks, models, and external validation datasets suggests broad applicability in the biomedical domain. As LLMs continue to evolve as interfaces for healthcare information, approaches like MINT that can effectively integrate specialized domain knowledge while preserving general capabilities will be increasingly valuable for advancing precision medicine and improving patient care.

ACKNOWLEDGEMENTS

We thank the GestaltMatcher Database (GMDB) which provides a collection of curated medical photography of genetic syndromes for training the multimodal model used in the current study. We thank patients and their families for contributing facial photos and phenotype descriptions to enable the establishment of computational models. This project is supported by NIH grant HG013031 and the CHOP Research Institute.

Footnotes

DECLARATIONS

Competing interests

The authors declare no competing interests.

Availability of data and materials

The GMDB (v1.0.9) database used in the current study can be obtained from https://db.gestaltmatcher.org/. All the software tools and computational workflow (as Jupyter Notebook) can be found at https://github.com/WGLab/MINT-LLM. This study did not generate any new material.

REFERENCES

  • 1.Wolf T, Debut L, Sanh V, et al. Transformers: State-of-the-Art Natural Language Processing. 2019:
  • 2.Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in neural information processing systems. 2017;30
  • 3.McKinzie B, Gan Z, Fauconnier J-P, et al. MM1: methods, analysis and insights from multimodal LLM pre-training. Springer; 2024:304–323.
  • 4.Tirumala K, Simig D, Aghajanyan A, Morcos A. D4: Improving llm pretraining via document deduplication and diversification. Advances in Neural Information Processing Systems. 2023;36:53983–53995. [Google Scholar]
  • 5.Shi W, Ajith A, Xia M, et al. Detecting pretraining data from large language models. arXiv preprint arXiv:231016789. 2023;
  • 6.Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018;
  • 7.Dubey A, Jauhri A, Pandey A, et al. The llama 3 herd of models. arXiv preprint arXiv:240721783. 2024;
  • 8.Meta. Llama 3.2 Model Card. Accessed 12/02/2024, 2024. models/llama3_2/MODEL_CARD.md
  • 9.Meta. Llama 3.3 Model Card. Accessed 02/03/2025, 2025. models/llama3_3/MODEL_CARD.md
  • 10.Liu A, Feng B, Xue B, et al. Deepseek-v3 technical report. arXiv preprint arXiv:241219437. 2024;
  • 11.Guo D, Yang D, Zhang H, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:250112948. 2025;
  • 12.Afantenos S, Karkaletsis V, Stamatopoulos P. Summarization from medical documents: a survey. Artificial intelligence in medicine. 2005;33(2):157–177. [DOI] [PubMed] [Google Scholar]
  • 13.Mamykina L, Vawdrey DK, Stetson PD, Zheng K, Hripcsak G. Clinical documentation: composition or synthesis? Journal of the American Medical Informatics Association. 2012;19(6):1025–1031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Mishra R, Bian J, Fiszman M, et al. Text summarization in the biomedical domain: a systematic review of recent research. Journal of biomedical informatics. 2014;52:457–467. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Gulden C, Kirchner M, Schüttler C, et al. Extractive summarization of clinical trial descriptions. International journal of medical informatics. 2019;129:114–121. [DOI] [PubMed] [Google Scholar]
  • 16.Tripathi S, Sukumaran R, Cook TS. Efficient healthcare with large language models: optimizing clinical workflow and enhancing patient care. Journal of the American Medical Informatics Association. 2024;31(6):1436–1440. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Qiu J, Lam K, Li G, et al. LLM-based agentic systems in medicine and healthcare. Nature Machine Intelligence. 2024;6(12):1418–1420. [Google Scholar]
  • 18.Gebreab SA, Salah K, Jayaraman R, ur Rehman MH, Ellaham S. Llm-based framework for administrative task automation in healthcare. IEEE; 2024:1–7. [Google Scholar]
  • 19.Kim J, Wang K, Weng C, Liu C. Assessing the utility of large language models for phenotype-driven gene prioritization in the diagnosis of rare genetic disease. The American Journal of Human Genetics. 2024;111(10):2190–2202. doi: 10.1016/j.ajhg.2024.08.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.do Olmo J, Logrono J, Mascias C, Martinez M, Isla J. Assessing DxGPT: Diagnosing Rare Diseases with Various Large Language Models. medRxiv. 2024:2024.05. 08.24307062.
  • 21.Gunel B, Du J, Conneau A, Stoyanov V. Supervised contrastive learning for pre-trained language model fine-tuning. arXiv preprint arXiv:201101403. 2020;
  • 22.Chen T, Liu S, Chang S, Cheng Y, Amini L, Wang Z. Adversarial robustness: From self-supervised pre-training to fine-tuning. 2020:699–708.
  • 23.Al-Moslmi T, Ocaña MG, Opdahl AL, Veres C. Named entity extraction for knowledge graphs: A literature overview. IEEE Access. 2020;8:32862–32881. [Google Scholar]
  • 24.Etzioni O, Cafarella M, Downey D, et al. Unsupervised named-entity extraction from the web: An experimental study. Artificial intelligence. 2005;165(1):91–134. [Google Scholar]
  • 25.Daiber J, Jakob M, Hokamp C, Mendes PN. Improving efficiency and accuracy in multilingual entity extraction. 2013:121–124.
  • 26.Yang J, Liu C, Deng W, et al. Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. Patterns. 2024;5(1) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Wu D, Yang J, Wang K. Exploring the reversal curse and other deductive logical reasoning in BERT and GPT-based large language models. Patterns. 2024;5(9) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Berglund L, Tong M, Kaufmann M, et al. The Reversal Curse: LLMs trained on” A is B” fail to learn” B is A”. arXiv preprint arXiv:230912288. 2023;
  • 29.Zhu H, Huang B, Zhang S, et al. Towards a Theoretical Understanding of the’Reversal Curse’via Training Dynamics. Advances in Neural Information Processing Systems. 2024;37:90473–90513. [Google Scholar]
  • 30.Huang T, Hu S, Ilhan F, Tekin SF, Liu L. Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:240918169. 2024;
  • 31.Qi X, Zeng Y, Xie T, et al. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:231003693. 2023;
  • 32.Vrbančič G, Podgorelec V. Transfer learning with adaptive fine-tuning. IEEE Access. 2020;8:196197–196211. [Google Scholar]
  • 33.Wortsman M, Ilharco G, Li M, et al. Robust fine-tuning of zero-shot models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR; ). 2021:7949–7961. [Google Scholar]
  • 34.Hong J, Lee N, Thorne J. Orpo: Monolithic preference optimization without reference model. 2024:11170–11189.
  • 35.Wu D, Yang J, Liu C, et al. GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts. ArXiv. 2024;
  • 36.Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems. 2020;33:9459–9474. [Google Scholar]
  • 37.Gao Y, Xiong Y, Gao X, et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:231210997. 2023;
  • 38.Jiang Z, Xu FF, Gao L, et al. Active retrieval augmented generation. arXiv preprint arXiv:230506983. 2023;
  • 39.Gamper J, Alemi Koohbanani N, Benet K, Khuram A, Rajpoot N. Pannuke: an open pan-cancer histology dataset for nuclei instance segmentation and classification. Springer; 2019:11–19. [Google Scholar]
  • 40.Gamper J, Koohbanani NA, Benes K, et al. Pannuke dataset extension, insights and baselines. arXiv preprint arXiv:200310778. 2020;
  • 41.Huang Z, Bianchi F, Yuksekgonul M, Montine TJ, Zou J. A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine. 2023;29(9):2307–2316. [DOI] [PubMed] [Google Scholar]
  • 42.Zemojtel T, Köhler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Science translational medicine. 2014;6(252):252ra123–252ra123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Jacobsen JO, Kelly C, Cipriani V, et al. Phenotype-driven approaches to enhance variant prioritization and diagnosis of rare disease. Human mutation. 2022;43(8):1071–1081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Trakadis YJ, Buote C, Therriault J-F, Jacques P-É, Larochelle H, Lévesque S. PhenoVar: a phenotype-driven approach in clinical genomics for the diagnosis of polymalformative syndromes. BMC medical genomics. 2014;7:1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Smedley D, Robinson PN. Phenotype-driven strategies for exome prioritization of human Mendelian disease genes. Genome medicine. 2015;7:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Ferreira CR. The burden of rare diseases. American journal of medical genetics Part A. 2019;179(6):885–892. [DOI] [PubMed] [Google Scholar]
  • 47.Bauskis A, Strange C, Molster C, Fisher C. The diagnostic odyssey: insights from parents of children living with an undiagnosed condition. Orphanet journal of rare diseases. 2022;17(1):233. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Hustinx A, Hellmann F, Sümer Ö, et al. Improving Deep Facial Phenotyping for Ultra-rare Disorder Verification Using Model Ensembles. 2023:5018–5028.
  • 49.Lesmann H, Lyon GJ, Caro P, et al. GestaltMatcher Database-a FAIR database for medical imaging data of rare disorders. MedRxiv. 2023;
  • 50.Kim W, Son B, Kim I. Vilt: Vision-and-language transformer without convolution or region supervision. PMLR; 2021:5583–5594.
  • 51.Hsieh T-C, Bar-Haim A, Moosa S, et al. GestaltMatcher facilitates rare disease matching using facial phenotype descriptors. Nature genetics. 2022;54(3):349–357. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Radford A, Kim JW, Hallacy C, et al. Learning transferable visual models from natural language supervision. PmLR; 2021:8748–8763.
  • 53.Rafailov R, Sharma A, Mitchell E, Manning CD, Ermon S, Finn C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems. 2024;36
  • 54.Hendrycks D, Burns C, Basart S, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:200903300. 2020;
  • 55.Lin S, Hilton J, Evans O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:210907958. 2021;
  • 56.Zellers R, Holtzman A, Bisk Y, Farhadi A, Choi Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:190507830. 2019;
  • 57.Chollet F. On the measure of intelligence. arXiv preprint arXiv:191101547. 2019;
  • 58.Cobbe K, Kosaraju V, Bavarian M, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:211014168. 2021;
  • 59.Wu D, Wang Z, Nguyen Q, Wang K. Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes. arXiv preprint arXiv:250312286. 2025;
  • 60.Li B, Wang R, Wang G, Ge Y, Ge Y, Shan Y. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:230716125. 2023;
  • 61.Zonios GI, Cothren RM, Arendt JT, et al. Morphological model of human colon tissue fluorescence. IEEE Transactions on biomedical engineering. 1996;43(2):113–122. [DOI] [PubMed] [Google Scholar]
  • 62.Akwari OE, Van Heerden JA, Foulk WT, Baggenstoss AH. Cancer of the bile ducts associated with ulcerative colitis. Annals of Surgery. 1975;181(3):303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Mir-Madjlessi SH, Farmer RG, Sivak MV. Bile duct carcinoma in patients with ulcerative colitis: Relationship to sclerosing cholangitis: Report of six cases and review of the literature. Digestive diseases and sciences. 1987;32:145–154. [DOI] [PubMed] [Google Scholar]
  • 64.Nakanuma Y, Tsuneyama K, Harada K. Pathology and pathogenesis of intrahepatic bile duct loss. Journal of hepato-biliary-pancreatic surgery. 2001;8:303–315. [DOI] [PubMed] [Google Scholar]
  • 65.Hinck L, Näthke I. Changes in cell and tissue organization in cancer of the breast and colon. Current opinion in cell biology. 2014;26:87–95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Schomacker KT, Frisoli JK, Compton CC, et al. Ultraviolet laser-induced fluorescence of colonic tissue: basic biology and diagnostic potential. Lasers in surgery and medicine. 1992;12(1):63–78. [DOI] [PubMed] [Google Scholar]
  • 67.Yasuda T, Shiozaki H. Esophageal reconstruction with colon tissue. Surgery today. 2011;41:745–753. [DOI] [PubMed] [Google Scholar]
  • 68.Das KM, Vecchi M, Sakamaki S. A shared and unique epitope (s) on human colon, skin, and biliary epithelium detected by a monoclonal antibody. Gastroenterology. 1990;98(2):464–469. [DOI] [PubMed] [Google Scholar]
  • 69.Sakaguchi K, Bras RL, Bhagavatula C, Choi Y. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM. 2021;64(9):99–106. [Google Scholar]
  • 70.Sree Ram Kiran Nag M, Srinivas G, Venkata Rao K, Vakkalanka S, Nagendram S. An efficient procedure for identifying the similarity between French and English languages with sequence matcher technique. Advances in Data Science and Management: Proceedings of ICDSM 2021. Springer; 2022:29–40. [Google Scholar]
  • 71.Hu EJ, Shen Y, Wallis P, et al. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:210609685. 2021;
  • 72.Liu J. LlamaIndex. 2024. https://github.com/jerryjliu/llama_index
  • 73.Zeeberg BR, Qin H, Narasimhan S, et al. High-Throughput GoMiner, an ‘industrial-strength’ integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID). BMC Bioinformatics. 2005;6:168. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Zhong S, Storch KF, Lipan O, Kao MC, Weitz CJ, Wong WH. GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space. Appl Bioinformatics. 2004;3(4):261–4. [DOI] [PubMed] [Google Scholar]
  • 75.Gu Y, Tinn R, Cheng H, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH). 2021;3(1):1–23. [Google Scholar]
  • 76.Yu C, Velu A, Vinitsky E, et al. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in neural information processing systems. 2022;35:24611–24624. [Google Scholar]
  • 77.Engstrom L, Ilyas A, Santurkar S, et al. Implementation matters in deep policy gradients: A case study on ppo and trpo. arXiv preprint arXiv:200512729. 2020;
  • 78.Engstrom L, Ilyas A, Santurkar S, et al. Implementation matters in deep rl: A case study on ppo and trpo. 2019:

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The GMDB (v1.0.9) database used in the current study can be obtained from https://db.gestaltmatcher.org/. All the software tools and computational workflow (as Jupyter Notebook) can be found at https://github.com/WGLab/MINT-LLM. This study did not generate any new material.


Articles from ArXiv are provided here courtesy of arXiv

RESOURCES