Abstract
Objective
This study presents a systematic review of natural language generation (NLG) methods and applications in the medical domain, providing quantitative and qualitative analyses to answer four key research questions regarding methods, evaluation, applications, and challenges of NLG in healthcare.
Methods
We searched PubMed, ACM Digital Library, Web of Science, Science Direct, Scopus, Embase, and ACL Anthology for NLG-related studies in healthcare from 2018 to 2024. Out of 3,988 research articles, 113 met the inclusion criteria and were analyzed across data modality, model architecture, evaluation metrics, and application domain.
Results and Conclusion
NLG in healthcare has grown substantially, with annual publications increasing from 2 in 2018 to 40 in 2024. Among the 113 included studies, text-to-text generation was the most common data modality (65.5%), followed by image-to-text (19.5%) and multimodal-to-text (15.0%). Transformer-based architectures were dominant, especially encoder–decoder models (61.6%). Automatic evaluation metrics such as ROUGE (81.4%) and BLEU (57.5%) were widely used, and human evaluation metrics, such as Likert scales (31.9%), were increasingly adopted. The four most prevalent application domains were summarization (e.g., discharge summaries, radiology reports), clinical documentation, medical dialogue, and data augmentation.
The transformer-based large language models (LLMs) and the accumulation of large-scale multimodal clinical datasets have remarkably advanced NLG in healthcare. However, challenges remain in factual consistency, explainability, evaluation robustness, and AI safety. Addressing these challenges is essential for the adoption of NLG in various healthcare applications.
Keywords: Natural language generation, Healthcare, Large language model, Transformer model, Clinical text, Systematic review
INTRODUCTION
Natural language generation (NLG) aims to develop algorithms that generate coherent and contextually relevant text from various forms of input[1]. Early-stage NLG systems relied on predefined templates and rules[2] and had limited generation ability. Recent breakthroughs in transformer-based large language models[3–5] (LLMs) have reshaped NLG[6], transforming it into a key technology for generative artificial intelligence (AI). LLM-based NLG models[7–9] have demonstrated potential in many medical discovery and healthcare applications through a generative ability not witnessed in previous generations of models.
Historically, natural language processing (NLP) researchers have explored various solutions for NLG, from early-stage statistical machine learning models[10] and Long Short-Term Memory (LSTM)-based sequence-to-sequence models[11] to recent transformer-based LLMs[12]. Transformers are a specific type of deep neural network composed of an encoder and a decoder[13]. LLM-based NLG systems use the encoder to transform an input sequence into vectors that capture latent semantics, and use the decoder to translate the vectors into natural language text. LLMs can be categorized into three types: (1) encoder-based LLMs (e.g., BERT[14]), which are implemented using the encoder module of the transformer; (2) decoder-based LLMs (e.g., GPT[15]), which are implemented using the decoder module of the transformer; and (3) encoder-decoder LLMs (e.g., T5[5]), which are implemented using both the encoder and decoder components. Among these, decoder-based and encoder-decoder LLMs are widely used to achieve generative AI and are known as generative LLMs.
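As a minimal illustration (assuming the Hugging Face transformers library and publicly available checkpoints, neither of which is specific to the studies reviewed here), the three LLM families map onto different model classes:

```python
# Illustrative sketch of the three LLM families described above, assuming the
# Hugging Face `transformers` library and public (non-medical) checkpoints.
from transformers import (
    BertModel,                   # (1) encoder-based, e.g., BERT
    GPT2LMHeadModel,             # (2) decoder-based, e.g., GPT
    T5ForConditionalGeneration,  # (3) encoder-decoder, e.g., T5
)

# Encoder-based models map input text to vector representations (no generation head).
encoder_only = BertModel.from_pretrained("bert-base-uncased")

# Decoder-based models generate text autoregressively from a prompt.
decoder_only = GPT2LMHeadModel.from_pretrained("gpt2")

# Encoder-decoder models encode an input sequence and decode an output sequence,
# the configuration most often used by generative NLG systems.
encoder_decoder = T5ForConditionalGeneration.from_pretrained("t5-small")
```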
In the medical domain, LLMs have demonstrated potential in a wide range of medical discovery and healthcare applications, including extracting patients’ information from electronic health records (EHRs)[16,17], differentiating diagnoses for clinical decision support[18,19], answering healthcare-related questions[16], and facilitating documentation of patient information in EHR systems[20]. There is increasing interest in LLM-based NLG for conversational and generative AI tasks at which previous generations of AI models have not performed well[10]. In the medical domain, NLG systems[21–23] utilize different modalities of medical data, such as structured EHR database tables, narrative clinical text, medical images, and videos. NLG systems that generate text from non-linguistic data are referred to as “data-to-text” generation, while NLG systems that generate text from text data are known as “text-to-text” generation. Recent studies report that LLM-based NLG systems can assist in documenting patients’ reports[24,25] with improved efficiency and quality[26,27], thereby contributing to more streamlined clinical workflows and enhanced healthcare delivery[28]. These emerging use cases demonstrate the potential of generative AI in real-world clinical practice[29,30].
Given the rapid advancements in generative AI, there is a need for a comprehensive review of NLG to reflect the recent breakthroughs in the medical domain. Existing reviews focused predominantly on traditional NLP tasks, such as information extraction[31], or narrowly on “image-to-text” generation[32]; hence, they do not reflect the multimodal nature of medical data and the rapid breakthroughs in generative AI. To address this gap, this article provides an overview of recent advances in NLG with a focus on generative LLM methods and applications in the medical domain. Following the PRISMA guidelines, we examine several key applications of NLG, including synthetic clinical text generation, automated clinical documentation, medical summarization, radiology report generation, and medical dialogue systems. This review article aims to answer the following questions: (1) What are the methods used by NLG systems in the medical domain? (2) What are the evaluation metrics for NLG? (3) What are the applications of NLG in the medical domain? (4) What are the capabilities, challenges, limitations, and future directions and opportunities for NLG in the medical domain? To the best of our knowledge, this is the first study to systematically review NLG with multiple modalities of medical data and across various healthcare applications.
METHODS
Data Sources and Literature Search
We searched six academic databases, including PubMed, ACM Digital Library, Web of Science (WoS), Science Direct, Scopus, and Embase, for peer-reviewed research articles. To capture relevant conference proceedings, the Association for Computational Linguistics (ACL) Anthology was included. The search targeted peer-reviewed research articles published between January 1, 2018, and December 31, 2024. The search used grouped keywords related to natural language generation (e.g., “text generation,” “natural language generation,” “sequence-to-sequence”) and healthcare (e.g., “clinical,” “medicine,” “EHR,” “EMR”), as shown in Table 1. Only articles published in English were included.
Table 1.
Literature search using keywords to identify articles.
| Component | Keyword Group | Search Strings |
|---|---|---|
| Search strings | Text Generation Group | 1. Text Generation; Report Generation |
| | | 2. Natural Language Generation; NLG |
| | | 3. Sequence to Sequence; Seq to Seq |
| | Healthcare Group | 4. Clinical; Medicine; Healthcare; Health |
| | | 5. EHR; EMR; Electronic Medical Records; Electronic Health Records |
| Combinations | | 6. 1 OR 2 OR 3 |
| | | 7. 4 OR 5 |
| | | 8. 6 AND 7 |
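For illustration only, the grouped keywords in Table 1 can be composed into the combined query (combination 8); the sketch below uses Python string composition and generic, database-agnostic syntax rather than the exact syntax submitted to each database.

```python
# Illustrative composition of the boolean query from the keyword groups in Table 1.
# Actual field tags and syntax differ across PubMed, Scopus, Embase, etc.
text_generation_group = [
    "text generation", "report generation",
    "natural language generation", "NLG",
    "sequence to sequence", "seq to seq",
]
healthcare_group = [
    "clinical", "medicine", "healthcare", "health",
    "EHR", "EMR", "electronic medical records", "electronic health records",
]

def or_clause(terms):
    """Join quoted terms into a parenthesized OR clause."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Combination 8 in Table 1: (keyword groups 1-3) AND (keyword groups 4-5)
query = f"{or_clause(text_generation_group)} AND {or_clause(healthcare_group)}"
print(query)
```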
Eligibility Criteria
Studies were included if they (1) were published as peer-reviewed original research articles between January 1, 2018, and December 31, 2024, and (2) presented concrete NLG methods or applications applicable to healthcare or medical research. Studies were excluded if they were editorials, commentaries, reviews, or focused solely on non-clinical applications (e.g., social media communication). Articles without full texts available were also excluded.
We conducted a two-stage screening protocol using Covidence, a web-based systematic literature review tool; duplicates were automatically removed by Covidence. Database searches were conducted in January 2025. Six reviewers screened titles and abstracts, with each article randomly assigned to two reviewers. Disagreements were resolved by a third reviewer. Articles that passed the title and abstract review were screened in a second-stage full-text review. To ensure good agreement among all reviewers, three training sessions were conducted. In each session, all reviewers reviewed a shared set of 15 articles, followed by group discussions to resolve discrepancies and refine the review guidelines.
Data extraction and analysis
Six reviewers conducted data extraction using a standardized spreadsheet. Studies that passed the two-stage screening were reviewed by at least two reviewers, who extracted the required information per the review guidelines. Discrepancies between reviewers were resolved through group discussion and consensus. A total of 10 data elements were extracted: data modality (e.g., text-to-text, image-to-text), model architecture (e.g., decoder-based, GAN), encoder architecture (the architecture of the encoder module, e.g., LSTM), decoder architecture (the architecture of the decoder module, e.g., GPT), data access (i.e., private or public), data source, clinical application, evaluation methods (i.e., machine evaluation or human evaluation), machine evaluation metrics (e.g., BLEU, ROUGE, METEOR), and human evaluation metrics (e.g., Likert scale ratings, expert assessments).
Following data extraction, studies were grouped by clinical application categories to enable thematic synthesis. For each application area, we analyzed trends in model architecture, data types and sources, evaluation strategies, and performance metrics.
RESULTS
Study Selection and Characterization
Figure 1 is a PRISMA flow diagram showing the study selection process. A total of 3,988 articles published from 2018 to 2024 were identified through database searches using keywords (see Table 1). After removing 761 duplicates, 3,227 unique articles remained. The title and abstract screening excluded 2,903 articles, leaving 324 articles (inter-reviewer agreement Cohen’s κ = 0.78). The 324 remaining articles were further assessed in a full-text review, where 211 studies not focused on healthcare settings or lacking concrete NLG methods were excluded. Inter-rater agreement for the full-text review stage was substantial, with a Cohen’s κ of 0.82. A total of 113 studies were included in this review, of which 76 (67.3%) were conference papers and 37 (32.7%) were journal publications. Detailed characteristics of each study—including model architecture, data source availability, evaluation methods, and application domain—are summarized in Supplementary Table S1.
Figure 1.

PRISMA flow diagram of included studies.
Figure 2 summarizes the number of publications related to NLG in the medical domain over the past seven years. The number of publications grew from 2 in 2018 to 40 in 2024. In terms of geographic distribution, Asia contributed the largest share of studies (50; 44.3%), followed by North America (41; 36.3%) and Europe (16; 14.2%), showing wide international engagement. Notably, the United States accounted for the largest proportion of studies, contributing 40 publications (35.4%).
Figure 2. Publications on natural language generation in the medical domain by region from 2018 to 2024.

(a) Temporal trends of publication count by continent, showing the annual number of publications for North America, Asia, Europe, Oceania, and the global total (Total); (b) Geographic distribution of publication counts by country.
Medical data modalities and text generation applications
Per our review of the 113 included studies, NLG systems in the medical domain primarily use data modalities including narrative text, structured EHRs, existing medical knowledge bases, and medical images. According to the input modalities, the 113 studies can be grouped into three categories: (1) text-to-text generation, where the input is purely text, (2) image-to-text generation, where the input is medical images, and (3) multimodal-to-text generation, where more than one modality is used. For each category, we examined the overall architecture, the encoder component, which is responsible for transforming input data into vector representations, and the decoder component, which is responsible for converting the vectors into natural language text. Figure 3 shows an overview of the data modalities and NLG methods.
Figure 3.

An overview of medical data modalities and natural language generation methods. EHR: electronic health record; CNN: convolutional neural network; RNN: recurrent neural network.
What are the methods used by NLG systems in the medical domain?
Table 2 summarizes the model architectures used by the 113 reviewed studies. Three families of model architectures were used for NLG: (1) encoder-decoder models, (2) decoder-only models, and (3) generative adversarial networks (GANs). Transformer-based models dominate across all modalities. In text-to-text generation (N=74), 58.1% (43/74) of studies used encoder-decoder transformer models, and 27.0% (20/74) used decoder-only transformer models. In image-to-text generation (N=22), hybrid models combining convolutional neural networks (CNNs) with transformer decoders—CNN+Transformer architectures—were particularly prevalent, leveraging CNNs for feature extraction from medical images and transformers for text generation. In multimodal-to-text generation (N=17), Transformer-based decoder architectures were dominant (94.1%, 16/17), reflecting the transition toward Transformer-based generative AI models.
Table 2.
Detailed model architectures and prevalence across data modalities.
| Table 2a. Text-to-Text generation (n=74) | | | |
|---|---|---|---|
| Overall Architecture | Encoder | Decoder | n (% within text-to-text) |
| Encoder-Decoder | RNN-family | RNN-family | 8 (10.8%) |
| Decoder-only | — | Transformer | 20 (27.0%) |
| Other (rare; each n=1) | — | — | 4 (5.4%) |
| Table 2b. Image-to-Text generation (n=22) | | | |
| Overall Architecture | Encoder | Decoder | n (% within image-to-text) |
| Encoder-Decoder | Transformer | Transformer | 3 (13.6%) |
| Other (rare; each n=1) | — | — | 3 (13.6%) |
| Table 2c. Multimodal-to-Text generation (n=17) | | | |
| Overall Architecture | Encoder | Decoder | n (% within multimodal-to-text) |
| Encoder-Decoder | Transformer | Transformer | 6 (35.3%) |
| Other (rare; each n=1) | — | — | 2 (11.8%) |
For readability, only mainstream architectures (n≥2 within each modality) are shown. Rare architectures (each n=1) are aggregated as “Other (rare)” and are detailed in Supplementary Table S2. Percentages are reported within each modality. “—” indicates not specified or not applicable.
Text-to-text generation refers to the task of generating narrative text from textual data, including structured text data (e.g., diagnosis codes, medications), semi-structured text data (e.g., templates or tabular EHR entries), and unstructured free text (e.g., clinical notes, medical reports). Guan et al.[24] proposed the Medical Text Generative Adversarial Network (mtGAN) to generate synthetic text using a conditional GAN framework with disease features as inputs. Early studies of text-to-text generation widely used recurrent neural networks (RNNs), particularly gated recurrent unit (GRU) and LSTM models, which worked well for generating short-to-medium-length text but struggled with long clinical narratives that require long-range dependency tracking and global coreference and coherence. Representative examples include an FNN–LSTM configuration to generate note sections from structured codes[25], an LSTM to generate privacy-preserving clinical notes[33], and attention-based GRU models to improve content selection and focus on clinically salient inputs for structured-to-narrative generation[34].
NLG has since shifted to Transformer-based models, owing to their ability to scale up pretraining, better capture long-distance dependencies, and support easy adaptation using human language as prompts. Within Transformers, encoder–decoder models constitute the largest proportion, followed by decoder-only Transformers, reflecting the rapid advance in transformer-based LLMs[7,35]. The pretrain–finetune strategy and various domain-specific adaptations remarkably enhanced the quality and relevance of text generation. Later, advanced techniques such as in-context learning and multi-task instruction tuning further advanced text generation[9,36] in the medical domain.
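As a hedged illustration of prompt-based (in-context) use of a generative LLM for clinical text-to-text generation (assuming the Hugging Face transformers library; the checkpoint below is a generic placeholder rather than a model used by the reviewed studies):

```python
# Sketch of prompt-based text-to-text generation with a decoder-only LLM.
# The checkpoint is a placeholder; the reviewed studies used a variety of models.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder checkpoint

prompt = (
    "Rewrite the following structured findings as a one-sentence impression.\n"
    "Findings: mild cardiomegaly; no pleural effusion; no pneumothorax.\n"
    "Impression:"
)
output = generator(prompt, max_new_tokens=30, do_sample=False)
print(output[0]["generated_text"])
```

In practice, the reviewed studies typically paired such prompting with domain-specific fine-tuning or instruction tuning rather than relying on a general-purpose checkpoint alone.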
Image-to-text generation refers to the task of generating text reports from medical images, such as chest X-rays, CT scans, pathology images, or other types of medical images. CNN–RNN models dominated the early studies, where CNNs served as visual feature extractors and RNNs served as decoders to generate text. Huang et al.[37] proposed a hierarchical model that incorporates CNNs with multi-attention mechanisms to enhance image feature representation and combines an LSTM layer to generate chest X-ray reports. Beddiar et al.[38] used a pre-trained CNN model to extract visual features and semantic features, which were then forwarded to an LSTM model for text generation. In more recent studies, Transformers have increasingly replaced RNNs and LSTMs as decoders owing to their advantages in long-context generation, while CNN-based encoders remain common because they are computationally efficient for feature extraction from medical images. Memory-augmented and clinically grounded model variants have also emerged. Miura et al.[39] combined CNNs with a memory-augmented transformer model to generate clinical findings from medical images. Vision Transformers (ViTs) enable end-to-end visual encoding; parameter-efficient ViT-based report generation systems, such as MedEPT[40], greatly improve sample efficiency and reduce training cost.
Multimodal-to-text generation refers to text generation from multimodal data sources, often combining more than one modality from medical images, medical text, and external knowledge graphs. Early studies typically extended the CNN–RNN architecture by incorporating visual and structured features through early or late fusion strategies. For example, Delbrouck et al.[41] proposed a model that integrated visual features from X-ray images into an RNN-based encoder–decoder framework to initialize hidden states and modulate encoder outputs and target embeddings, demonstrating improved performance on summarization tasks. Most recent studies have shifted to multimodal Transformer-based architectures that better model cross-modality interactions and incorporate intermediate vector representations for better text generation. For example, Dalla Serra et al.[42] developed a two-step pipeline utilizing multimodal Transformers to first extract clinically relevant triples (Entity1, Relation, Entity2) from radiology images and then apply another Transformer model to generate radiology reports. By controlling text generation using extracted triples and the original images, the proposed method improved both stylistic accuracy and clinical relevance. Among generative LLMs, multimodal foundation models / multimodal LLMs (MLLMs), such as vision-language models, have rapidly evolved from task-specific architectures into general-purpose systems that can be adapted through instruction tuning, prompt-based conditioning, and parameter-efficient fine-tuning. Representative examples include medical visual assistants trained with self-instructed biomedical image–text corpora (e.g., BiomedCLIP[43], LLaVA-Med[44], Med-Flamingo[45]) and open generalist vision–language foundation models designed to perform diverse biomedical tasks, including report writing/summarization and radiology VQA (e.g., BiomedGPT[21]).
What are the evaluation metrics for NLG?
Text generation tasks are typically evaluated by comparing the generated text with gold-standard references using automatic machine-based and/or human-based evaluation metrics. Table 3 summarizes the evaluation metrics and their prevalence per our review of the 113 articles.
Table 3.
Evaluation metrics for text generation.
| Evaluation Type | Metric Category | Metric | Counts |
|---|---|---|---|
| Automatic Evaluation | N-gram overlap | ROUGE[46] | 92 |
| | | BLEU[47] | 65 |
| | | METEOR[48] | 31 |
| | | CIDEr[49] | 18 |
| | Embedding-based | BERTScore[50] | 20 |
| | | BLEURT[51] | 10 |
| | Factuality | CheXpert[52] | 17 |
| | | CheXbert[53] | 9 |
| | | RadGraph[54] | 8 |
| | | AlignScore[55] | 2 |
| Human Evaluation | | Likert Scale[56] | 36 |
| | | Turing Test | 2 |
For automatic evaluation metrics, the ROUGE and BLEU scores are the most prevalent, used by 92 and 65 studies, respectively. Embedding-based metrics, such as BERTScore, are also popular (used by 20 studies). Factuality metrics, reflecting the safety-critical nature of the medical domain, have become essential components of evaluation in a large number of studies (e.g., CheXpert in 17 studies, CheXbert in 9 studies). For human evaluation metrics, the Likert scale is the most prevalent, used by 36 studies.
Automatic Evaluation Metrics
Automatic evaluation metrics compare string overlap, content overlap, string distance, or lexical diversity between the generated text and gold-standard reference text to calculate various quantitative scores. Standard metrics remain the ubiquitous baseline due to their efficiency.
ROUGE[46] (Recall-Oriented Understudy for Gisting Evaluation) is the most widely used metric among the reviewed studies (92/113, 81.4%). ROUGE focuses on recall of the surface text by exact matching and is particularly favored for summarization tasks. However, ROUGE may not reflect clinical semantics; for example, scores can remain high even when critical medical negations (e.g., “no pneumonia” vs. “pneumonia”) are missing.
BLEU[47] (Bilingual Evaluation Understudy), utilized in 65 studies, focuses on precision of exact matching by calculating the geometric mean of modified n-gram precisions. Like ROUGE, BLEU is sensitive to exact lexical matching and may correlate weakly with real-world semantics.
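A small, self-contained sketch (illustrative only, not drawn from any reviewed study) shows how surface n-gram overlap can stay high even when a clinically critical negation is dropped:

```python
# Toy illustration: surface unigram overlap rewards a clinically wrong output.
def unigram_f1(reference: str, candidate: str) -> float:
    """Simple unigram-overlap F1 between whitespace-tokenized strings."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum(min(ref.count(w), cand.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "no evidence of pneumonia or pleural effusion"
candidate = "evidence of pneumonia or pleural effusion"  # negation dropped: opposite meaning

print(f"unigram F1 = {unigram_f1(reference, candidate):.2f}")  # ~0.92 despite the clinical error
```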
To relax exact word matching, 31 studies employed METEOR[48] (Metric for Evaluation of Translation with Explicit Ordering), which aligns texts using stemming and synonymy, providing a more robust evaluation accounting for language variations. Additionally, 18 studies (primarily in image-to-text tasks) used CIDEr[49] (Consensus-based Image Description Evaluation). CIDEr computes the cosine similarity of TF-IDF weighted n-grams, effectively down-weighting common stop words (e.g., “the”, “is”) and prioritizing clinically important terms (e.g., “effusion”, “opacity”).
Embedding-based Metrics
Embedding-based metrics such as BERTScore[50] and BLEURT[51] calculate similarity scores by leveraging contextual embeddings from pretrained models (e.g., BERT) instead of surface-level text overlap.
BERTScore[50] utilizes embeddings from the widely used BERT[14] model to measure precision, recall, and F1 scores between generated and ground-truth reference texts. A related metric, BLEURT[51], further fine-tunes BERT on human judgments to focus on semantic similarity and contextual alignment. Both metrics have been utilized in the reviewed studies[57–60].
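For illustration (assuming the open-source bert-score Python package, which is one common implementation; the underlying model and rescaling settings vary across studies):

```python
# Sketch of computing BERTScore with the `bert-score` package.
from bert_score import score

candidates = ["Mild cardiomegaly without pleural effusion."]
references = ["The heart is mildly enlarged; no pleural effusion is seen."]

# Returns precision, recall, and F1 tensors computed from contextual embeddings,
# rewarding semantic overlap even when the surface wording differs.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```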
Factuality and Clinical Efficacy (CE) Metrics
A remarkable difference between medical NLG systems and general-purpose systems is the high stakes of medical errors. Therefore, new evaluation metrics have been proposed to evaluate the clinical factuality and clinical efficacy of medical NLG systems.
Label-Based Verification (CheXpert[52] & CheXbert[53]) is the most prevalent approach that evaluates diagnostic agreement by extracting 14 specific clinical observations (e.g., Pneumonia, Cardiomegaly). Early studies[42,61–64] used CheXpert, a rule-based labeler. Later, CheXbert[60,65–69] was used to account for linguistic variations. Based on a fine-tuned BERT model, CheXbert showed good sensitivity in detecting negations and uncertainty, providing a more robust measure for clinical safety than its rule-based predecessor.
Structure and Inference-Based Verification
Recent studies have started measuring structural consistency beyond flat labels. For example, RadGraph[54] constructs knowledge graphs to verify the relationships between anatomies and findings, penalizing hallucinations where pathologies are correctly identified but misplaced (e.g., right vs. left lung). For non-radiology tasks such as dialogue summarization, inference-based metrics such as AlignScore[55] are used to detect extrinsic hallucinations.
These factuality and clinical efficacy measures address the limitations of surface text-based scores but are limited to specific tasks and disease domains. CheXpert and RadGraph are primarily optimized for chest X-rays, and their evaluation power degrades remarkably when applied to other modalities (e.g., MRI) or general medical notes.
Human Evaluation
Human evaluation asks domain experts to evaluate the quality of generated text according to predefined aspects[70]. The Likert scale[56] is widely used in human evaluation, where evaluators rate various aspects of the generated text, such as fluency, coherence, and relevance, on a predefined scale such as 1–5; many studies[39,57,59,71–77] in the medical domain adopt this approach. Pairwise comparison is another widely used human evaluation method, where domain experts are asked to compare pairs of generated text to determine which one is better according to predefined criteria. For instance, Peng et al.[7] conducted a Turing Test comparing model-generated clinical notes and human-written notes in terms of readability and clinical relevance.
Figure 4 presents a hierarchical summary of the evaluation metrics. Human evaluation provides a comprehensive assessment considering nuanced linguistic measures such as fluency, coherence, factual correctness, and clinical relevance—criteria often beyond automated metrics. However, human evaluation is time-consuming and costly. Factuality metrics measure the factual consistency between the generated text and the gold standard. Embedding-based metrics such as BERTScore and BLEURT measure similarity using embeddings of generated and reference texts, capturing semantic similarity without requiring human experts. N-gram-based metrics, including ROUGE, BLEU, and METEOR, calculate similarity using surface-level lexical overlap. These metrics cannot account for linguistic variation, as the same meaning can be expressed with many different surface forms. Nevertheless, they are widely used because they are easy to calculate.
Figure 4.

Hierarchical structure of evaluation metrics.
What clinical applications can benefit from NLG?
Based on our review, we categorize NLG applications in the medical domain into four primary domains, as shown in Figure 5: (1) summarization, where models condense lengthy medical text into summaries without altering key information and the main idea; (2) clinical documentation, including the automatic drafting of clinical notes and diagnostic reports from text-based inputs or imaging data; (3) dialogue generation, which enables contextually relevant responses in healthcare interactions and counseling settings; and (4) data augmentation, where synthetic texts are generated to support healthcare research.
Figure 5.

Major applications for text generation in the medical domain.
Clinical Summarization
Clinical summarization aims to transform long clinical text into concise summaries while preserving clinically salient information[78]. Historically, research focused on specific domains such as radiology report summarization (generating IMPRESSION from FINDINGS)[79–82]. More recent studies focus on generating patient summaries and problem lists from clinical narratives. Joshi et al.[71] used text generation to generate patient summaries from telemedicine interactions. Two recent studies[59,83] focused on the summarization of patient problems from progress notes. Krishna et al.[72] applied text generation models to summarize doctor-patient encounters into Subjective, Objective, Assessment, and Plan (SOAP) format. In the clinical NLP community, open challenges and shared tasks have been organized to align NLG systems with real-world clinical workflows. For example, the MEDIQA-Chat 2023[84] shared task provided two dialogue-based summarization benchmarks[85–89]; a recent system applied parameter-efficient fine-tuning (PEFT) and hybrid extractive–abstractive strategies to improve discharge summarization and reduce hallucination[90–92]; the BioNLP 2024 “Discharge Me!” challenge focused on the generation of Brief Hospital Course and Discharge Instructions for emergency visits[93]. Mature systems, such as Nuance DAX[94] and Epic’s ambient summarization[95], have been deployed in hospital EHR systems.
Automated medical document generation
Automated document generation refers to the generation of medical documents from various data sources, including structured clinical variables, unstructured free-text, or multimodal inputs. It aims to reduce administrative burden, standardize note formats, and improve the completeness and consistency of clinical documentation[96]. A long-standing task is to translate discrete clinical variables—such as ICD codes, laboratory values, medication lists, or vital signs—into clinical narratives. For example, transforming admission ICD codes into discharge instructions[34] or generating progress notes from structured/tabular EHR data[97].
Many studies have focused on medical report generation from medical images and multimodal data sources such as chest X-rays, CT scans, ECG signals, or histopathology slides. For example, Yang et al.[98] used ECG data to generate medical reports, and other studies[99,100] explored radiology report generation to reduce the burden on radiologists; chest X-ray datasets, such as MIMIC-CXR[101] and IU X-Ray[102], are widely used public resources. A few studies, e.g., Paalvast et al.[76], address other types of data. There is an increasing effort to improve grounding and reduce hallucination by integrating advanced learning algorithms and additional resources, including memory modules[61,62], reinforcement learning[73,103], knowledge graphs[104], knowledge bases[105], and knowledge from patient history[106]. However, most studies on medical document generation remain research-oriented, such as generating synthetic data for simulation.
Data Augmentation
Data augmentation is an efficient solution for addressing data scarcity, privacy restrictions, and annotation costs[107,108]. By generating synthetic medical text that mimics the structure, content, and semantics of real-world medical records, researchers aim to create high-quality, privacy-preserving datasets for model training, benchmarking, and evaluation without compromising patient privacy[107,108]. For instance, Guan et al.[24] used respiratory disease features, such as “pneumonia” or “lung cancer,” to generate synthetic clinical note sections. Lee[25] proposed an RNN-based NLG model to generate synthetic chief complaints from structured input variables such as age, gender, and discharge diagnosis. Melamud & Shivade[33] proposed methods to generate privacy-preserving synthetic discharge summaries. Hoogi et al.[109] generated synthetic mammography reports to extend labeled medical report corpora. Amin-Nejad et al.[35] and Peng et al.[7] explored generative LLMs to generate synthetic clinical notes.
There is an increasing interest in using synthetic clinical narratives—such as chief complaints[25], discharge summaries[33], radiology reports[109], and medication histories—to augment training data for various applications, including text classification, readmission prediction, phenotype identification, and disease diagnosis support. For instance, synthetic data has been shown to improve cerebrovascular disease classification when annotated data is limited[108]. Synthetic note generation has been applied to support few-shot learning, enabling NLP systems for low-resource settings [109]. Synthetic data generation has also been applied to probabilistic modeling of longitudinal EHRs, simulation of pharmacokinetic/pharmacodynamic (PK/PD) profiles, and the construction of virtual patients for drug development and population-level modeling[110].
However, synthetic text may introduce subtle clinical inconsistencies, amplify biases, or leak sensitive information through memorization. Therefore, most of the studies on data augmentation are for research purposes. Future studies should examine the fidelity, diversity/coverage, and privacy risk of synthetic clinical text generation.
Medical Dialogue Generation
Medical dialogue generation aims to provide clinically meaningful, context-aware, and empathetic responses in real-time interactions between healthcare providers and patients. These systems must handle dynamic dialogue context, interpret patient intent, and provide safe, accurate, and appropriate responses[111]. Recent studies[112,113] focused on multi-turn medical dialogue systems that closely mimic real-world doctor–patient interactions to reduce the burden on healthcare professionals. Dialogue generation has increasingly been integrated into intelligent consultation systems for early-stage disease diagnosis[114] and chronic care support. For instance, privacy-aware conversational agents have been developed to assist patients with chronic conditions like hypertension, offering tailored advice and interactive monitoring while preserving data confidentiality[115]. Recent studies[113,116,117] have explored the quality, factual grounding, and interpretability of dialogue responses, particularly in virtual care environments.
DISCUSSION
This study reviewed the rapid development of NLG in the medical domain. Over the past seven years, we have witnessed rapid progress of NLG from early-stage CNNs and RNNs to the dominant transformer-based generative AI models that are based on the attention mechanism, self-supervised training, and encoder-decoder architectures[118]. Multiple modalities of medical data have been utilized for NLG, increasingly impacting medical discovery and healthcare applications.
NLG is transforming clinical documentation, decision-making, and patient care through diverse and innovative uses[119]. One important contribution is reducing the burden on healthcare providers. Clinicians constantly document, interpret, and summarize critical information in electronic health records (EHRs) to facilitate continuous and coordinated care. Appropriately summarizing patients’ clinical care information is a critical skill for clinicians to communicate with patients (e.g., discharge summaries), hand off work at shift changes, and work with colleagues with different expertise and across different clinical departments to handle complex cases (e.g., patients with multiple illnesses). However, the growing burden of clinical documentation has been a significant challenge in healthcare, contributing to physician burnout, reducing the efficiency of care delivery, and potentially compromising patient safety. Through automated summarization applications, NLG helps healthcare providers summarize various patient reports and doctor-patient conversations into concise summaries. Through automated documentation applications, NLG helps streamline the documentation process, reduce clinician burnout, and enhance healthcare delivery efficiency[120]. Through medical dialogue systems, NLG helps healthcare providers and patients interact by enhancing telemedicine consultations and mental health counseling[121]. NLG systems are translating AI from academic prototypes to real-world clinical workflows. For example, mature ambient intelligence systems, such as Nuance DAX[94], Epic’s ambient summarization[95], and Amazon HealthScribe[124], have been successfully deployed.
The sensitive nature of clinical text creates barriers to sharing patients’ data. Through synthetic medical text generation, NLG helps create large-scale, privacy-preserving synthetic datasets essential for medical research[125] and healthcare technology innovation[126]. Compared with the open domain, the medical domain lacks large-scale datasets to facilitate large-scale training of medical LLMs.
Challenges and Research Gaps
While NLG systems have demonstrated potential, challenges remain in model performance, evaluation, and deployment. Healthcare has a very low tolerance for errors (e.g., a misstated medication dosage, omitted contraindication, or fabricated allergy), which may cause disproportionate risk[127] to patients. There is a misalignment between technical evaluation and real-world clinical utility. Widely used evaluations still rely heavily on machine-based metrics such as ROUGE and BLEU because of their computational efficiency. While factuality-oriented metrics such as CheXpert and RadGraph have emerged, they are predominantly developed for chest X-rays. There is an urgent need for robust and generalizable evaluation metrics that measure “clinical correctness” rather than surface-level textual overlap, to better translate NLG systems into real-world healthcare applications. Without mature evaluation, the safety and reliability of NLG remain a critical challenge[128,129].
The potential AI biases for subpopulations and the performance drift over time need to be monitored. For example, NLG systems may generate biased or non-compliant text, as models trained on historical data may inadvertently perpetuate existing biases or fail to adhere to current medical guidelines and standards[130,131]. Future research should examine the AI ethics, interpretability, and transparency of NLG applications, as it is crucial for healthcare providers to understand and trust NLG systems[132].
Future work
Future work should continue to explore multiple modalities of medical data, such as clinical narratives, medical images, and omics data—for example, using vision transformers to process medical images in conjunction with text generation tasks. In addition, developing voice-activated NLG systems to enable hands-free operation could be particularly useful in clinical settings such as surgical operations. Human-in-the-loop design is also an important topic for future studies. As no AI system can be 100% accurate, it is important to enable clinicians to provide feedback to NLG systems to help refine the models and improve the quality and relevance of the generated content. Developing evaluation metrics that go beyond traditional surface measures like ROUGE and BLEU could help the advancement of NLG; such metrics should account for diverse aspects beyond the surface form of text, such as clinical relevance, factual accuracy, and potential impact on patient outcomes. Future work should also explore the efficient integration of NLG systems into real-world clinical workflows and test the efficacy of AI implementation.
Limitation
This study has limitations. Because we focus on NLG methods and applications, studies that discussed potential bias, security, risks, interpretability, and AI ethics without concrete applications were not included. The exclusion of non-English studies potentially narrows the scope of the review, especially for NLG in low-resource languages. We primarily focused on peer-reviewed scientific publications and only manually included high-impact arXiv preprints and industry technical reports, which may exclude new but potentially important preprints.
CONCLUSION
This systematic review underscores the rapid advancement and transformative potential of natural language generation (NLG) in healthcare, driven by the emergence of large language models and multimodal data integration. NLG has shown promising applications across clinical summarization, documentation, dialogue systems, and data augmentation, contributing to improved workflow efficiency and expanded research opportunities. However, challenges persist, including ensuring factual accuracy, minimizing bias, and developing robust evaluation methods beyond traditional surface-level metrics. Future research should prioritize the integration of human-in-the-loop systems, exploration of underutilized data modalities, and deployment of NLG models in real-world clinical settings to fully harness their capacity to enhance patient care and clinical decision-making.
Supplementary Material
Attached in a separate document.
ACKNOWLEDGMENTS
We gratefully acknowledge the support of NVIDIA Corporation and the NVIDIA AI Technology Center (NVAITC) UF program.
FUNDING STATEMENT
This study was partially supported by grants from the Patient-Centered Outcomes Research Institute® (PCORI®) Award (ME-2018C3-14754, ME-2023C3-35934), the PARADIGM program awarded by the Advanced Research Projects Agency for Health (ARPA-H), National Institute on Aging, NIA R56AG069880, U24AG098157, National Institute of Allergy and Infectious Diseases, NIAID R01AI172875, National Heart, Lung, and Blood Institute, R01HL169277, R01HL176844, National Institute on Drug Abuse, NIDA R01DA050676, R01DA057886, R01DA063631, and the UF Clinical and Translational Science Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding institutions.
Footnotes
COMPETING INTERESTS STATEMENT
Mengxian Lyu, Xiaohan Li, Ziyi Chen, Jinqian Pan, Cheng Peng, Sankalp Talankar, and Yonghui Wu have no conflicts of interest that are directly relevant to the content of this study.
REFERENCES
- 1.Dong C, Li Y, Gong H, et al. A survey of Natural Language Generation. ACM Comput Surv. 2023;55:1–38. [Google Scholar]
- 2.Cawsey AJ, Webber BL, Jones RB. Natural language generation in health care. J Am Med Inform Assoc. 1997;4:473–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models. arXiv [cs.CL]. 2023. [Google Scholar]
- 4.OpenAI, Achiam J, Adler S, et al. GPT-4 Technical Report. arXiv [cs.CL]. 2023. [Google Scholar]
- 5.Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21:5485–551. [Google Scholar]
- 6.Iqbal T, Qureshi S. The survey: Text generation models in deep learning. J King Saud Univ - Comput Inf Sci. 2022;34:2515–28. [Google Scholar]
- 7.Peng C, Yang X, Chen A, et al. A study of generative large language model for medical research and healthcare. NPJ Digit Med. 2023;6:210. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Xie Q, Chen Q, Chen A, et al. Me-LLaMA: Foundation large language models for medical applications. Research Square. 2024. [Google Scholar]
- 9.Tran H, Yang Z, Yao Z, et al. BioInstruct: instruction tuning of large language models for biomedical natural language processing. J Am Med Inform Assoc. 2024;31:1821–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Gatt A, Krahmer E. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. J Artif Intell Res. 2018;61:65–170. [Google Scholar]
- 11.Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with Neural Networks. arXiv [cs.CL]. 2014. [Google Scholar]
- 12.Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. arXiv [cs.CL]. 2017. [Google Scholar]
- 13.Lin T, Wang Y, Liu X, et al. A survey of transformers. AI Open. 2022;3:111–32. [Google Scholar]
- 14.Devlin J, Chang M-W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics; 2019:4171–86. [Google Scholar]
- 15.Radford A, Narasimhan K, Salimans T, et al. Improving Language Understanding by Generative Pre-Training.
- 16.Singhal K, Tu T, Gottweis J, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025;31:943–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Hu Y, Chen Q, Du J, et al. Improving large language models for clinical named entity recognition via prompt engineering. J Am Med Inform Assoc. 2024;31:1812–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.McDuff D, Schaekermann M, Tu T, et al. Towards accurate differential diagnosis with large language models. Nature. Published Online First: 9 April 2025. doi: 10.1038/s41586-025-08869-4 [DOI] [Google Scholar]
- 19.Liu X, Liu H, Yang G, et al. A generalist medical language model for disease diagnosis assistance. Nat Med. 2025;31:932–42. [DOI] [PubMed] [Google Scholar]
- 20.WangLab at MEDIQA-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models.
- 21.Zhang K, Zhou R, Adhikarla E, et al. A generalist vision-language foundation model for diverse biomedical tasks. Nat Med. 2024;30:3129–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616:259–65. [DOI] [PubMed] [Google Scholar]
- 23.Truhn D, Eckardt J-N, Ferber D, et al. Large language models and multimodal foundation models for precision oncology. NPJ Precis Oncol. 2024;8:72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Guan J, Li R, Yu S, et al. Generation of Synthetic Electronic Medical Record Text. Proceedings. 2018;374–80. [Google Scholar]
- 25.Lee S Natural Language Generation for Electronic Health Records. npj Digital Medicine. 2018;1:63. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Schiff GD, Bates DW. Can electronic clinical documentation help prevent diagnostic errors? N Engl J Med. 2010;362:1066–9. [DOI] [PubMed] [Google Scholar]
- 27.Amosa TI, Izhar LIB, Sebastian P, et al. Clinical errors from acronym use in electronic health record: A review of NLP-based disambiguation techniques. IEEE Access. 2023;11:59297–316. [Google Scholar]
- 28.Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc. 2018;25:1419–28. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Moy AJ, Schwartz JM, Chen R, et al. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. J Am Med Inform Assoc. 2021;28:998–1008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Mishra P, Kiang JC, Grant RW. Association of medical scribes in primary care with physician workflow and patient experience. JAMA Intern Med. 2018;178:1467–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Fu S, Chen D, He H, et al. Clinical concept extraction: A methodology review. J Biomed Inform. 2020;109:103526. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.A Systematic Review of Deep Learning-based Research on Radiology Report Generation. [Google Scholar]
- 33.Melamud O, Shivade C. Towards Automatic Generation of Shareable Synthetic Clinical Notes Using Neural Language Models. Association for Computational Linguistics; 2019. [Google Scholar]
- 34.J Kurisinkel L, Chen N. Set to Ordered Text: Generating Discharge Instructions from Medical Billing Codes. Association for Computational Linguistics; 2019. [Google Scholar]
- 35.Amin-Nejad A, Ive J, Velupillai S. Exploring Transformer Text Generation for Medical Dataset Augmentation. European Language Resources Association; 2020. [Google Scholar]
- 36.Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. arXiv [cs.CL]. 2023. [Google Scholar]
- 37.Huang X, Yan F, Xu W, et al. Multi-Attention and Incorporating Background Information Model for Chest X-Ray Image Report Generation. IEEE Access. 2019;7:154808–17. [Google Scholar]
- 38.Beddiar DR, Oussalah M, Seppänen T, et al. ACapMed: Automatic Captioning for Medical Imaging. NATO Adv Sci Inst Ser E Appl Sci. 2022;12:11092. [Google Scholar]
- 39.Miura Y, Zhang Y, Tsai E, et al. Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. Association for Computational Linguistics; 2021. [Google Scholar]
- 40.Li Q Harnessing the Power of Pre-trained Vision-Language Models for Efficient Medical Report Generation. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2023;1308–17. [Google Scholar]
- 41.Delbrouck J-B, Zhang C, Rubin D. QIAI at MEDIQA 2021: Multimodal Radiology Report Summarization. Association for Computational Linguistics; 2021. [Google Scholar]
- 42.Dalla Serra F, Clackett W, MacKinnon H, et al. Multimodal Generation of Radiology Reports using Knowledge-Grounded Extraction of Entities and Relations. Association for Computational Linguistics; 2022. [Google Scholar]
- 43.Zhang S, Xu Y, Usuyama N, et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv [cs.CV]. 2023. [Google Scholar]
- 44.Li C, Wong C, Zhang S, et al. LLaVA-Med: Training a Large Language-and-Vision Assistant for BioMedicine in one day. arXiv [cs.CV]. 2023. [Google Scholar]
- 45.Moor M, Huang Q, Wu S, et al. Med-flamingo: A multimodal medical few-shot learner. arXiv [cs.CV]. 2023;353–67. [Google Scholar]
- 46.Lin C-Y. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics; 2004:74–81. [Google Scholar]
- 47.Papineni K, Roukos S, Ward T, et al. Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 2002:311–8. [Google Scholar]
- 48.Banerjee S, Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005. [Google Scholar]
- 49.Vedantam R, Zitnick CL, Parikh D. CIDEr: Consensus-based image description evaluation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2015. [Google Scholar]
- 50.Zhang T, Kishore V, Wu F, et al. BERTScore: Evaluating Text Generation with BERT. arXiv [cs.CL]. 2019. [Google Scholar]
- 51.Sellam T, Das D, Parikh A. BLEURT: Learning robust metrics for text generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics; 2020. [Google Scholar]
- 52.Irvin J, Rajpurkar P, Ko M, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proc Conf AAAI Artif Intell. 2019;33:590–7. [Google Scholar]
- 53.Smit A, Jain S, Rajpurkar P, et al. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv [cs.CL]. 2020. [Google Scholar]
- 54.Jain S, Agrawal A, Saporta A, et al. RadGraph: Extracting clinical entities and relations from radiology reports. arXiv [cs.CL]. 2021. [Google Scholar]
- 55.Zha Y, Yang Y, Li R, et al. AlignScore: Evaluating factual consistency with A unified alignment function. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics; 2023. [Google Scholar]
- 56.Likert R A technique for the measurement of attitudes. Arch Psychol (Chic). 1932. [Google Scholar]
- 57.Ben Abacha A, Yim W-W, Fan Y, et al. An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. Association for Computational Linguistics; 2023. [Google Scholar]
- 58.Eremeev M, Valmianski I, Amatriain X, et al. Injecting knowledge into language generation: a case study in auto-charting after-visit care instructions from medical dialogue. Association for Computational Linguistics; 2023. [Google Scholar]
- 59.Gao Y, Dligach D, Miller T, et al. Summarizing Patients’ Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models. International Committee on Computational Linguistics; 2022. [Google Scholar]
- 60.Wang T, Zhao X, Rios A. UTSA-NLP at RadSum23: Multi-modal Retrieval-Based Chest X-Ray Report Summarization. Association for Computational Linguistics; 2023. [Google Scholar]
- 61.Chen Z, Shen Y, Song Y, et al. Cross-modal Memory Networks for Radiology Report Generation. Association for Computational Linguistics; 2021. [Google Scholar]
- 62.Chen Z, Song Y, Chang T-H, et al. Generating Radiology Reports via Memory-driven Transformer. Association for Computational Linguistics; 2020. [Google Scholar]
- 63.Lovelace J, Mortazavi B. Learning to Generate Clinically Coherent Chest X-Ray Reports. Association for Computational Linguistics; 2020. [Google Scholar]
- 64.Yan A, He Z, Lu X, et al. Weakly Supervised Contrastive Learning for Chest X-Ray Report Generation. Association for Computational Linguistics; 2021. [Google Scholar]
- 65.Hu J, Li Z, Chen Z, et al. Graph Enhanced Contrastive Learning for Radiology Findings Summarization. Association for Computational Linguistics; 2022. [Google Scholar]
- 66.Karn SK, Ghosh R, Kusuma P, et al. shs-nlp at RadSum23: Domain-Adaptive Pre-training of Instruction-tuned LLMs for Radiology Report Impression Generation. Association for Computational Linguistics; 2023. [Google Scholar]
- 67.Parres D, Albiol A, Paredes R. Improving radiology report generation quality and diversity through reinforcement learning and text augmentation. Bioengineering (Basel). 2024;11. doi: 10.3390/bioengineering11040351 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Nicolson A, Dowling J, Anderson D, et al. Longitudinal data and a semantic similarity reward for chest X-ray report generation. Inform Med Unlocked. 2024;50:101585. [Google Scholar]
- 69.Wu J, Shi D, Hasan A, et al. KnowLab at RadSum23: comparing pre-trained language models in radiology report summarization. Association for Computational Linguistics; 2023. [Google Scholar]
- 70.Zhou Y, Ringeval F, Portet F. A survey of evaluation methods of generated medical textual reports. Proceedings of the 5th Clinical Natural Language Processing Workshop. Stroudsburg, PA, USA: Association for Computational Linguistics; 2023. [Google Scholar]
- 71.Joshi A, Katariya N, Amatriain X, et al. Dr. Summarize: Global Summarization of Medical Dialogue by Exploiting Local Structures. Association for Computational Linguistics; 2020. [Google Scholar]
- 72.Krishna K, Khosla S, Bigham J, et al. Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques. Association for Computational Linguistics; 2021. [Google Scholar]
- 73.Nishino T, Ozaki R, Momoki Y, et al. Reinforcement Learning with Imbalanced Dataset for Data-to-Text Medical Report Generation. Association for Computational Linguistics; 2020. [Google Scholar]
- 74.Zhang L, Negrinho R, Ghosh A, et al. Leveraging Pretrained Models for Automatic Summarization of Doctor-Patient Conversations. Association for Computational Linguistics; 2021. [Google Scholar]
- 75.Karn SK, Liu N, Schuetze H, et al. Differentiable Multi-Agent Actor-Critic for Multi-Step Radiology Report Summarization. Association for Computational Linguistics; 2022. [Google Scholar]
- 76.Paalvast O, Nauta M, Koelle M, et al. Radiology report generation for proximal femur fractures using deep classification and language generation models. Artif Intell Med. 2022;128:102281. [DOI] [PubMed] [Google Scholar]
- 77.Roy K, Gaur M, Soltani M, et al. ProKnow: Process knowledge for safety constrained and explainable question generation for mental health diagnostic assistance. Front Big Data. 2022;5:1056728. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Shakil H, Farooq A, Kalita J. Abstractive text summarization: State of the art, challenges, and improvements. Neurocomputing. 2024;128255. [Google Scholar]
- 79.Sotudeh Gharebagh S, Goharian N, Filice R. Attend to Medical Ontologies: Content Selection for Clinical Abstractive Summarization. Association for Computational Linguistics; 2020. [Google Scholar]
- 80.MEDIQA 2021: Toward Improving Factual Correctness of Radiology Report Abstractive Summarization.
- 81.BDKG at MEDIQA 2021: System Report for the Radiology Report Summarization Task.
- 82.Optum at MEDIQA 2021: Abstractive Summarization of Radiology Reports using simple BART Finetuning.
- 83.Sharma B, Gao Y, Miller T, et al. Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning. Association for Computational Linguistics; 2023. [Google Scholar]
- 84.Ben Abacha A, Yim W-W, Adams G, et al. Overview of the MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations. Association for Computational Linguistics; 2023. [Google Scholar]
- 85.Teddysum at MEDIQA-Chat 2023: an analysis of fine-tuning strategy for long dialog summarization.
- 86.Team Converge at ProbSum 2023: Abstractive Text Summarization of Patient Progress Notes.
- 87.Team Cadence at MEDIQA-Chat 2023: Generating, augmenting and summarizing clinical dialogue with large language models.
- 88.clulab at MEDIQA-Chat 2023: Summarization and classification of medical dialogues.
- 89.Singh G, Pan Y, Andres-Ferrer J, et al. Large Scale Sequence-to-Sequence Models for Clinical Note Generation from Patient-Doctor Conversations. Association for Computational Linguistics; 2023. [Google Scholar]
- 90.Shimo Lab at “Discharge Me!”: Discharge Summarization by Prompt-Driven Concatenation of Electronic Health Record Sections.
- 91.“Discharge Me!”: Evaluating Constrained Generation of Discharge Summaries with Unstructured and Structured Information.
- 92.UF-HOBI at “Discharge Me!”: A Hybrid Solution for Discharge Summary Generation Through Prompt-based Tuning of GatorTronGPT Models.
- 93.Overview of the First Shared Task on Clinical Text Generation: RRG24 and “Discharge Me!”.
- 94.Haberle T, Cleveland C, Snow GL, et al. The impact of nuance DAX ambient listening AI documentation: a cohort study. J Am Med Inform Assoc. 2024;31:975–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95.AI for Clinicians. https://www.epic.com/software/ai-clinicians/ (accessed 21 December 2025)
- 96.Willis M, Jarrahi MH. Automating documentation: A critical perspective into the role of artificial intelligence in clinical documentation. Information in Contemporary Society. Cham: Springer International Publishing; 2019:200–9. [Google Scholar]
- 97.Toward Relieving Clinician Burden by Automatically Generating Progress Notes using Interim Hospital Data.
- 98.Yang C, Li Z, Fan H, et al. SERG: A Sequence-to-Sequence Model for Chinese ECG Report Generation. Proceedings of the 2022 4th International Conference on Robotics, Intelligent Control and Artificial Intelligence. 2022;708–12. [Google Scholar]
- 99.Monshi MMA, Poon J, Chung V. Deep learning in generating radiology reports: A survey. Artif Intell Med. 2020;106:101878. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100.Kaur N, Mittal A, Singh G. Methods for automatic generation of radiological reports of chest radiographs: a comprehensive survey. Multimed Tools Appl. 2022;81:13409–39. [Google Scholar]
- 101.Johnson AEW, Pollard TJ, Berkowitz SJ, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data. 2019;6:317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Demner-Fushman D, Kohli MD, Rosenman MB, et al. Preparing a collection of radiology examinations for distribution and retrieval. J Am Med Inform Assoc. 2016;23:304–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Kaur N, Mittal A. CADxReport: Chest x-ray report generation using co-attention mechanism and reinforcement learning. Comput Biol Med. 2022;145:105498. [DOI] [PubMed] [Google Scholar]
- 104.Zhang D, Ren A, Liang J, et al. Improving Medical X-ray Report Generation by Using Knowledge Graph. Appl Sci (Basel). 2022;12:11111. [Google Scholar]
- 105.Yang S, Wu X, Ge S, et al. Radiology report generation with a learned knowledge base and multi-modal alignment. Med Image Anal. 2023;86:102798. [DOI] [PubMed] [Google Scholar]
- 106.Mondal C, Pham D-S, Gupta A, et al. EfficienTransNet: An Automated Chest X-ray Report Generation Paradigm. Proceedings of the 1st International Workshop on Multimodal and Responsible Affective Computing. 2023;59–66. [Google Scholar]
- 107.Spasic I, Nenadic G. Clinical text data in machine learning: Systematic review. JMIR Med Inform. 2020;8:e17984. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108.Beaulieu-Jones BK, Wu ZS, Williams C, et al. Privacy-preserving generative deep neural networks support clinical data sharing. Circ Cardiovasc Qual Outcomes. 2019;12:e005122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109.Hoogi A, Mishra A, Gimenez F, et al. Natural Language Generation Model for Mammography Reports Simulation. IEEE J Biomed Health Inform. 2020;24. [Google Scholar]
- 110.How to use Language Models for Synthetic Text Generation in Cerebrovascular Disease-specific Medical Reports.
- 111.Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models.
- 112.Jiang Y, García-Durán A, Losada IB, et al. Generative models for synthetic data generation: application to pharmacokinetic/pharmacodynamic data. J Pharmacokinet Pharmacodyn. 2024;51:877–85. [DOI] [PubMed] [Google Scholar]
- 113.Medical Dialogue System: A Survey of Categories, Methods, Evaluation and Challenges.
- 114.Hu Z, Zhao H, Zhao Y, et al. T-agent: A term-aware agent for medical dialogue generation. 2024 International Joint Conference on Neural Networks (IJCNN). IEEE; 2024:1–8. [Google Scholar]
- 115.Srivastava A, Pandey I, Akhtar MS, et al. Response-act guided reinforced dialogue generation for mental health counseling. Proceedings of the ACM Web Conference 2023. New York, NY, USA: ACM; 2023. [Google Scholar]
- 116.Seeing is believing! Towards Knowledge-Infused Multi-modal Medical Dialogue Generation.
- 117.Montagna S, Aguzzi G, Ferretti S, et al. LLM-based Solutions for Healthcare Chatbots: a Comparative Analysis. 2024 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops). IEEE; 2024:346–51. [Google Scholar]
- 118.Zhao Y, Li Y, Wu Y, et al. Medical dialogue response generation with pivotal information recalling. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM; 2022. [Google Scholar]
- 119.Li D, Ren Z, Ren P, et al. Semi-supervised variational reasoning for medical dialogue generation. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM; 2021. [Google Scholar]
- 120.Islam S, Elmekki H, Elsebai A, et al. A comprehensive survey on applications of transformers for deep learning tasks. Expert Syst Appl. 2024;241:122666. [Google Scholar]
- 121.Yu P, Xu H, Hu X, et al. Leveraging generative AI and large language models: A comprehensive roadmap for healthcare integration. Healthcare (Basel). 2023;11:2776. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 122.Baumann LA, Baker J, Elshaug AG. The impact of electronic health record systems on clinical documentation times: A systematic review. Health Policy. 2018;122:827–36. [DOI] [PubMed] [Google Scholar]
- 123.Bickmore T, Giorgino T. Health dialog systems for patients and consumers. J Biomed Inform. 2006;39:556–71. [DOI] [PubMed] [Google Scholar]
- 124.AWS HealthScribe. https://aws.amazon.com/healthscribe/ (accessed 21 December 2025)
- 125.Esteva A, Robicquet A, Ramsundar B, et al. A guide to deep learning in healthcare. Nat Med. 2019;25:24–9. [DOI] [PubMed] [Google Scholar]
- 126.Gonzales A, Guruswamy G, Smith SR. Synthetic data in health care: A narrative review. PLOS Digit Health. 2023;2:e0000082. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127.Kim Y, Jeong H, Chen S, et al. Medical hallucination in foundation models and their impact on healthcare. medRxiv. 2025. [Google Scholar]
- 128.Johnson D, Goodman R, Patrinely J, et al. Assessing the accuracy and reliability of AI-generated medical responses: An evaluation of the chat-GPT model. Res Sq. Published Online First: 28 February 2023. doi: 10.21203/rs.3.rs-2566942/v1 [DOI] [Google Scholar]
- 129.Hüske-Kraus D. Text Generation in Clinical Medicine – a Review. Methods Inf Med. 2003;42:51–60. [PubMed] [Google Scholar]
- 130.Sun M, Oliwa T, Peek ME, et al. Negative patient descriptors: Documenting racial bias in the electronic health record. Health Aff (Millwood). 2022;41:203–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 131.Hajikhani A, Cole C. A critical review of large language models: Sensitivity, bias, and the path toward specialized AI. Quant Sci Stud. 2024;1–21. [Google Scholar]
- 132.Choudhury A, Chaudhry Z. Large language models and user trust: Focus on healthcare. arXiv [cs.CY]. 2024. [Google Scholar]
