Table 1. Summary of unimodal (text) large language model (LLM) limitations in medicine and potential multimodal LLM solutions.
| Unimodal (text) LLM limitation | Description of unimodal limitation | Multimodal LLM solution | Description of multimodal solution |
| --- | --- | --- | --- |
| Lack of diagnostic imaging context | Unimodal LLMs in medicine can process only textual patient data and cannot interpret diagnostic images, which are vital in many clinical scenarios. | Integration of diagnostic imaging data | Multimodal models process and integrate diagnostic imaging information (eg, x-rays and MRIsa), improving diagnostic accuracy and patient outcomes. |
| Inability to analyze temporal data | Text LLMs often struggle to interpret time-series data, such as continuous monitoring data or disease progression, which is vital for tracking patient health over time. | Time-series data integration | Multimodal systems incorporate and analyze temporal data, such as ECGb readings or continuous monitoring data, enabling dynamic tracking of patient health and disease progression. |
| Absence of auditory data interpretation | Unimodal LLMs cannot analyze audio, which limits their effectiveness in health care applications that rely on processing spoken interactions or auditory signals. | Audio data processing | Multimodal systems can process and understand audio signals, such as patient verbal descriptions and heartbeats, enhancing diagnostic precision. |
| Limited comprehension of complex medical scenarios | Unimodal LLMs struggle to interpret complex medical conditions that require a multisensory understanding beyond text. | Multisensory data integration | By processing clinical notes, diagnostic images, and patient audio, multimodal systems offer more comprehensive analyses of complex medical conditions. |
| Overfitting to clinical textual patterns | Sole reliance on clinical texts can lead LLMs to overfit to textual anomalies, potentially overlooking critical patient information. | Diverse clinical data sources | Diversifying input types with clinical imaging and audio data allows multimodal systems to train on a larger and more varied set of data points, reducing overfitting and enhancing diagnostic reliability. |
| Bias and ethical concerns | Unimodal LLMs, especially text-based ones, can inherit biases and misconceptions present in their training data sets, affecting patient care quality. | Richer contextual patient data | Multimodal systems use diverse modalities, including patient interviews and diagnostic images, to provide a broader context that can mitigate biases in clinical decision-making. |
aMRI: magnetic resonance imaging.
bECG: electrocardiography.
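The "multisensory data integration" the table describes is often implemented as late fusion: each modality (clinical text, images, time series such as ECG) is encoded separately into a fixed-size embedding, and the embeddings are then combined into one joint representation. The sketch below illustrates only this data flow; the modality shapes, embedding dimension, random-projection "encoders", and fusion-by-concatenation choice are illustrative assumptions, not details from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoder: maps a raw input array to a fixed-size
# embedding. Real systems would use trained networks (eg, a vision encoder
# for images); a random projection stands in here purely to show the flow.
def encode(x: np.ndarray, out_dim: int = 8) -> np.ndarray:
    w = rng.normal(size=(x.size, out_dim))
    return x.flatten() @ w

# Simulated inputs for one patient (all shapes are illustrative):
text_feats = rng.normal(size=(16,))    # eg, a pooled clinical-note vector
image_feats = rng.normal(size=(4, 4))  # eg, a tiny diagnostic-image patch
ecg_feats = rng.normal(size=(32,))     # eg, a short ECG monitoring window

# Late fusion: encode each modality separately, then concatenate the
# embeddings into a single joint representation for downstream prediction.
joint = np.concatenate(
    [encode(text_feats), encode(image_feats), encode(ecg_feats)]
)
print(joint.shape)  # (24,) — three modalities, 8 dimensions each
```

A downstream classifier would operate on `joint` rather than on any single modality, which is what lets such systems combine, for example, note text with imaging and monitoring data in one decision.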