J Med Internet Res. 2024 Sep 25;26:e59505. doi: 10.2196/59505

Table 1.

Summary of unimodal (text) large language model (LLM) limitations in medicine and potential multimodal LLM solutions.

| Unimodal (text) LLM limitation | Description of unimodal limitation | Multimodal LLM solution | Description of multimodal solution |
|---|---|---|---|
| Lack of diagnostic imaging context | Unimodal LLMs in medicine can process only textual patient data and cannot interpret diagnostic images, which are vital in many clinical scenarios. | Integration of diagnostic imaging data | Multimodal models process and integrate diagnostic imaging information (eg, x-rays and MRIs^a), improving diagnostic accuracy and patient outcomes. |
| Inability to analyze temporal data | Text LLMs often struggle to interpret time-series data, such as continuous monitoring data or disease progression, which are vital for tracking patient health over time. | Time-series data integration | Multimodal systems incorporate and analyze temporal data, such as ECG^b readings or continuous monitoring data, enabling dynamic tracking of patient health and disease progression. |
| Absence of auditory data interpretation | Unimodal LLMs cannot analyze audio, which limits their effectiveness in health care applications that rely on processing spoken interactions or auditory signals. | Audio data processing | Multimodal systems can process and understand audio signals, such as patient verbal descriptions and heartbeats, enhancing diagnostic precision. |
| Limited comprehension of complex medical scenarios | Unimodal LLMs struggle to interpret complex medical conditions that require a multisensory understanding beyond text. | Multisensory data integration | By processing clinical notes, diagnostic images, and patient audio together, multimodal systems offer more comprehensive analyses of complex medical conditions (a minimal illustrative sketch follows this table). |
| Overfitting to clinical textual patterns | Sole reliance on clinical texts can lead LLMs to overfit to textual anomalies, potentially overlooking critical patient information. | Diverse clinical data sources | Adding clinical imaging and audio data diversifies the training inputs, increasing the number and variety of training examples, which reduces overfitting and enhances diagnostic reliability. |
| Bias and ethical concerns | Unimodal LLMs, especially text-based ones, can inherit biases and misconceptions present in their training data sets, affecting patient care quality. | Richer contextual patient data | Multimodal systems draw on diverse modalities, including patient interviews and diagnostic images, to provide a broader context that can mitigate biases in clinical decision-making. |

^a MRI: magnetic resonance imaging.

^b ECG: electrocardiography.
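
The right-hand columns of Table 1 describe multimodal integration only at a high level. As a rough, hypothetical illustration of the "multisensory data integration" row, the Python sketch below fuses toy per-modality embeddings by concatenation (late fusion). Every function, dimension, and input here is an assumed stand-in for real pretrained encoders, not a method taken from the article.

```python
# Illustrative late-fusion sketch (not from the article): toy per-modality
# encoders stand in for real pretrained models to show the shape of
# "multisensory data integration" -- one encoder per modality, with the
# resulting embeddings fused into a single joint representation.
import numpy as np

def encode_text(note: str, dim: int = 8) -> np.ndarray:
    """Toy text encoder: pseudo-random projection seeded by the note's
    characters; a real system would use an LLM embedding layer."""
    seed = sum(ord(c) for c in note) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def encode_image(pixels: np.ndarray, dim: int = 8) -> np.ndarray:
    """Toy image encoder: pixel-intensity summary statistics; a real
    system would use a vision encoder such as a ViT."""
    stats = np.array([pixels.mean(), pixels.std(), pixels.min(), pixels.max()])
    return np.resize(stats, dim)

def encode_timeseries(signal: np.ndarray, dim: int = 8) -> np.ndarray:
    """Toy temporal encoder: trend statistics from first differences,
    standing in for an ECG/continuous-monitoring encoder."""
    diffs = np.diff(signal)
    stats = np.array([diffs.mean(), diffs.std(), signal.mean(), signal.std()])
    return np.resize(stats, dim)

def fuse(*embeddings: np.ndarray) -> np.ndarray:
    """Late fusion by concatenation; attention-based fusion is a common
    alternative in real multimodal architectures."""
    return np.concatenate(embeddings)

# One hypothetical "patient": clinical note + imaging + monitoring trace.
note = "65M, 3 days of productive cough and fever"
image = np.random.default_rng(0).random((64, 64))      # stand-in x-ray
ecg = np.sin(np.linspace(0, 20 * np.pi, 500))          # stand-in ECG trace

joint = fuse(encode_text(note), encode_image(image), encode_timeseries(ecg))
print(joint.shape)  # (24,) -- one joint vector a downstream classifier could use
```

The design point the sketch isolates is that fusion happens at the representation level: each modality keeps its own encoder, so imaging, temporal, and audio inputs from the table's rows can be added or removed without retraining the text pathway.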