Abstract
Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and human language, enabling computers to understand, generate, and derive meaning from it. NLP's potential applications in the medical field are extensive, ranging from extracting data from Electronic Health Records (one of its best-known and most frequently exploited uses) to investigating relationships among genetics, biomarkers, drugs, and diseases for the proposal of new medications. NLP can also be useful for clinical decision support, patient monitoring, and medical image analysis. Despite this vast potential, real-world application of NLP remains limited by various challenges and constraints, so its evolution continues predominantly within the research domain. However, with the increasingly widespread use of NLP, particularly through the availability of large language models such as ChatGPT, it is crucial for medical professionals to be aware of the status, uses, and limitations of these technologies.
Keywords: Artificial Intelligence, Natural Language Processing, Large Language Models, ChatGPT, Ophthalmology, Clinical Application
Introduction
Artificial Intelligence (AI) is the overarching field that encompasses developing systems capable of performing tasks that traditionally require human intelligence. These tasks include, but are not limited to, problem-solving, understanding natural language, recognizing patterns, and learning from experience. Machine Learning (ML), a subfield of AI, involves developing algorithms and statistical models that allow computers to perform tasks without being explicitly programmed to do so; instead, they learn and improve from the data they process. Deep Learning (DL) is a subset of ML inspired by the structure and function of the human brain. It uses artificial neural networks, especially deep neural networks, to learn from vast amounts of data.1 Within AI, Natural Language Processing (NLP) focuses on the interaction between computers and natural language, enabling computers to understand, generate, and derive meaning from human language. Large Language Models (LLMs) are typically built using DL techniques and are among the most advanced applications of NLP. LLMs are trained on extensive text data to generate human-like text and can perform a variety of complex language tasks.2 Generative AI (GenAI) is a category of AI encompassing models and algorithms capable of creating new outputs from input data, including images, music, or text. LLMs can be seen as a type of GenAI, since they generate text based on patterns learned from a training dataset.3 Fig. 1 illustrates a framework for understanding the relationships among these various domains of AI.
Fig. 1. Artificial Intelligence hierarchy.

While we can depict them in a hierarchical structure, such as AI > ML > Deep Learning > NLP/Generative AI > LLM, it is important to note that advancements in these fields continually reshape their relationships and boundaries.
Applications of DL to image-based data have become ubiquitous. For instance, AI algorithms have been deployed to automate the detection of skin cancer from pictures of lesions4 or diabetic retinopathy from retinal fundus images5. However, far fewer attempts have been made to utilize the unstructured data (free text) available in clinical office visit notes, which constitutes approximately 80 % of the data in electronic health records (EHRs).6 This is precisely where NLP can step in and play a decisive role. Thus, this article provides a review that serves as an introductory guide for clinicians seeking to understand how NLP works, its potential clinical applications, and its existing limitations.
Methods
We conducted a comprehensive literature search using MEDLINE/PUBMED and EMBASE databases without imposing any date restrictions. We used a combination of keywords, including “Natural Language Processing”, “Large Language Models”, and “Text Processing”. In EMBASE, we used Emtree terms and in PUBMED we used free terms and MeSH terms. To ensure comprehensive coverage, we employed Boolean operators (AND and OR) to refine the search strategy. Relevant full-text articles were gathered after reviewing the titles and abstracts of all English-language original articles.
Foundational concepts in NLP
NLP has two basic components: natural language understanding (NLU), which focuses on extracting meaning from text data that spans from brief, unstructured texts to extensive document collections, and natural language generation (NLG), which generates human-like text from structured data.7 Both incorporate a broad array of supervised and unsupervised methods. This flexibility allows NLP to find applications in areas such as predictive analytics for classification (e.g., disease diagnosis from clinical notes) and regression (e.g., predicting hospital readmission rates) problems that involve text data, entity extraction and labeling, question answering, and condensing extensive text into succinct summaries. NLP is also instrumental in machine translation from one language to another8 (e.g., Google Translate) and in conducting semantic searches based on text similarity using domain-specific vector embeddings.
Before a computer can understand natural language, it must first convert such unstructured data into numerical data. Methods like tokenization, stemming, and lemmatization (See Glossary) are among the common initial steps in an NLP workflow, as they help break a sentence down into words and their root forms. These ‘tokens’, as they are known in NLP, are then converted into numerical data through various embedding techniques (Fig. 2). Combinations of one or more tokens used together are known as n-grams: for example, a unigram is a single token, a bigram is two tokens, and so on.
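As an illustration, the tokenization and n-gram steps described above can be sketched in a few lines of Python. This is a deliberately simplified example; production pipelines typically rely on libraries such as NLTK or spaCy, and real clinical tokenizers also handle stemming, lemmatization, and negation:

```python
import re

def tokenize(text):
    # Lowercase and split on any non-alphanumeric characters (a simple word tokenizer)
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

note = "Patient reports blurred vision in the left eye."
tokens = tokenize(note)
print(tokens)             # unigrams
print(ngrams(tokens, 2))  # bigrams such as ('blurred', 'vision')
```

Each bigram here pairs adjacent tokens; downstream embedding techniques then map these tokens and n-grams to numbers.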
Fig. 2. Techniques of text representation in NLP.
The earliest embedding approaches involved using “n-grams,” breaking the text into smaller fragments for analysis. These n-grams are then used to represent the documents in a chart known as a “Document-Term Matrix,” where each document is a row and each n-gram is a column. This matrix serves as a foundation for various methods: the “Bag-Of-Words” model uses binary labels (0/1) as features to indicate whether a particular token or n-gram is present in the document, “Count Vectorization” tallies the frequency of each n-gram within a document, and term frequency-inverse document frequency (TF-IDF) vectorization calculates the importance of an n-gram by comparing its frequency in a document to its prevalence across all documents (Fig. 2). These techniques are vital for computers to understand and manipulate text data, facilitating tasks such as document sorting and text categorization (See Glossary).
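The three representations described above can be made concrete with a small, self-contained Python sketch over three invented clinical snippets. Libraries such as scikit-learn provide optimized, smoothed variants of TF-IDF; this toy version uses the unsmoothed textbook formula:

```python
import math

docs = [
    "blurred vision left eye",
    "left eye pain and redness",
    "vision loss right eye",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Count Vectorization: one row per document, one column per vocabulary term
counts = [[doc.count(term) for term in vocab] for doc in tokenized]

# Bag-of-Words (binary): 1 if the term occurs in the document, else 0
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

# TF-IDF: term frequency scaled by inverse document frequency.
# A term like "eye" that appears in every document gets an IDF of log(3/3) = 0,
# so it carries no weight; rarer terms score higher.
n_docs = len(docs)
df = {term: sum(1 for doc in tokenized if term in doc) for term in vocab}
tfidf = [[row[j] * math.log(n_docs / df[term]) for j, term in enumerate(vocab)]
         for row in counts]
```

Stacking the `counts` rows gives exactly the Document-Term Matrix described in the text, with documents as rows and n-grams (here, unigrams) as columns.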
Eventually, a need emerged to introduce context and meaning into these word vectors, thereby enabling the comparison of words. This led to the development of techniques like Word2Vec and GloVe to create denser vectors with reduced dimensionality compared to the large and sparse document-term vectors. These vectors can represent words as position vectors within a high-dimensional hyperspace, with the dimensionality determined by the length of the dense word vector (Fig. 2). Similarity between two words can be determined using the cosine of the angle between their position vectors. Words that often appear in similar contexts will have similar vector representations. State-of-the-art deep learning architectures like Transformers further improved the quality of embeddings for words and sentences using the Attention Mechanism.9 The current state-of-the-art LLMs use vast amounts of data to learn about patterns in language. Although these LLMs are trained on large-scale, publicly available data, they can be further fine-tuned on domain-specific datasets to perform more specific downstream tasks. For instance, the National Center for Biotechnology Information (NCBI) developed the BioCPT model10 for zero-shot biomedical information retrieval by fine-tuning PubMedBERT11, which was pre-trained using full-text articles from PubMedCentral and abstracts from PubMed. Fig. 3 shows the evolution of NLP techniques over time.
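For illustration, the cosine-similarity comparison described above can be computed as follows. The three-dimensional "embeddings" here are invented for the example; real Word2Vec or GloVe vectors have hundreds of dimensions learned from data:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two position vectors:
    # dot product divided by the product of their lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-dimensional embeddings (hypothetical values for illustration only)
vec = {
    "cataract": [0.9, 0.1, 0.3],
    "glaucoma": [0.8, 0.2, 0.4],
    "football": [0.1, 0.9, 0.1],
}
print(cosine(vec["cataract"], vec["glaucoma"]))  # high: terms used in similar contexts
print(cosine(vec["cataract"], vec["football"]))  # low: unrelated terms
```

A value near 1 indicates nearly parallel vectors (similar contexts), while a value near 0 indicates unrelated words.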
Fig. 3. Evolution of NLP techniques: from Bag of Words to Large Language Models.
NLP tasks and functions
Several NLP tasks break down human text and voice data in ways that help computers understand, generate, and extract meaningful information from human language.12 By combining multiple tasks, we obtain the various functions of NLP. Moreover, some of these tasks can serve specific purposes within the medical context. For example, speech recognition can be used to convert a doctor’s spoken notes into written text, which is particularly useful when interpreting radiology images. Table 1 summarizes the tasks and functions of NLP and the possible applications of these functions in the medical field.
Table 1. List of NLP tasks and functions and their potential uses in the medical field.
| Task | Description |
|---|---|
| Part of Speech Tagging | Labeling each word in a sentence with its grammatical role. |
| Named Entity Recognition | Identifying and categorizing entities in a text. |
| Co-reference Resolution | Determining when two words refer to the same entity in a text. |
| Speech Recognition | Transcribing spoken language into written form. |
| Relation Extraction | Extracting relations between entities. |
| Word Sense Disambiguation | Selecting the meaning of a word with multiple meanings. |
| Natural Language Generation | Putting structured information into human language. |

| Function | Description | Potential medical applications |
|---|---|---|
| Sentiment Analysis | Determining the attitude or emotion conveyed in a text. | Analyzing patient feedback on treatments, medicines, or services. |
| Syntax Analysis | Analyzing the grammatical structure of a sentence. | Understanding sentence structures in clinical notes; parsing complex sentences in medical research articles. |
| Semantic Analysis | Understanding the meaning of sentences. | Understanding the meaning of complex medical terminologies in a clinical note. |
| Information Extraction | Extracting structured information from unstructured text data. | Extracting diagnosis, treatment, and lab results from clinical notes. |
| Question Answering | Responding to text queries in a natural, human-like manner. | AI assistant providing medical information in response to patient queries. |
| Machine Translation | Translating a text from one language to another. | Translating patient health records or medical research articles into different languages. |
| Topic Modeling | Discovering the abstract topics that occur in a collection of documents. | Identifying key themes or topics from a collection of research articles on a specific disease. |
| Text Summarization | Generating a shorter version of a document while preserving key information. | Creating brief summaries of lengthy medical reports for easy comprehension. |
| Text Generation | Creating text, such as generating responses in a chatbot or creating a news story. | Generating patient discharge summaries or healthcare reminders. |
| Chatbots | Understanding and responding to vocal or written commands. | AI chatbots guiding patients through symptom checkers. |
Overall, the predominant application of NLP in medicine is information extraction: pulling relevant information, such as symptoms, diagnoses, and treatments, from medical documents or clinical records. This is particularly useful in creating structured patient databases from unstructured clinical notes and is crucial for automating processes and efficiently analyzing large volumes of medical data.13 Topic modeling is another popular NLP function in medical research. It involves identifying and categorizing topics or themes within a collection of documents, enabling researchers and practitioners to gain insights and identify patterns in the data. Both information extraction and topic modeling have proven valuable in the medical field.
Information extraction
Named Entity Recognition (NER) is an NLP task that identifies all instances of a particular entity type in free text and is an important step in information extraction. Initial studies applied NER to clinical notes in a less sophisticated manner, utilizing rule-based techniques for extracting demographic or clinical information. Such techniques could involve string pattern searches or conditional matching within the text. However, these rule-based methods exhibited significant drawbacks: enumerating all possible disease symptoms in free text requires comprehensive human domain knowledge, and the search process can be inefficient. Thus, ML-based approaches have become more commonly used for NER, and today there are more than 150 methods, pipelines, and algorithm implementations.13 Some of the more commonly used ML-based methods are Support Vector Machines (SVM), Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (Bi-LSTM), and, more recently, Bidirectional Encoder Representations from Transformers (BERT).13
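A minimal sketch of such a rule-based extractor shows why these methods are brittle: the lexicon must enumerate every surface form, and without negation handling the matcher also flags symptoms the patient denies. The lexicon and note below are invented for illustration:

```python
import re

# A hand-built lexicon: rule-based NER requires enumerating every surface form
SYMPTOM_LEXICON = ["blurred vision", "eye pain", "redness", "photophobia"]

def extract_symptoms(text):
    found = []
    for symptom in SYMPTOM_LEXICON:
        # Whole-phrase match, case-insensitive via lowercasing
        if re.search(r"\b" + re.escape(symptom) + r"\b", text.lower()):
            found.append(symptom)
    return found

note = "Patient complains of blurred vision and photophobia; denies eye pain."
print(extract_symptoms(note))
# Note: "eye pain" is extracted even though the patient *denies* it,
# illustrating why rule-based pipelines need separate negation handling.
```

ML-based NER models learn these surface forms and contexts from annotated examples instead of relying on a hand-curated lexicon.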
It is possible to train a state-of-the-art transformer pipeline on custom entities such as medical symptoms, prescribed antibiotics, or other medications using open-source software libraries such as spaCy (https://spacy.io/). Once sufficiently trained on all necessary entities, the model can identify any instance of those entities in any free text. For example, Wu et al.14 developed an NER model with medical knowledge integration that showed F1-scores above 85 % for all entity types, including problems, treatments, and lab tests.
NER may not suffice when the information required goes beyond the mere extraction of entities, for example, in more intricate queries or when extracting detailed responses, such as pinpointed keywords. This situation necessitates layering other NLP functions such as Question Answering into Information Retrieval systems, which can query documents against an incoming question and return the most suitable response(s). The Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ)15 is an integrated question-answering and information extraction framework. It was developed to succinctly answer free-text clinical questions using sources like general medical encyclopedias and patient data from EHRs. This innovation aimed to enhance existing medical information retrieval systems, which often only return superficial answers, requiring time-strapped clinicians to dig deeper to find exact answers. Achieving a precise understanding of user questions and providing incisive answers is an area where NLP can help.
Another study used the BERT-base transformer model to develop a “why” question-answering system based on patient-specific clinical text.16 The pre-trained BERT-base model was first adapted to the clinical context through the Clinical BERT model.17 It was then pre-finetuned on SQuAD 2.0, a large-scale dataset for training and evaluating question-answering (QA) systems developed at Stanford University, and finally fine-tuned on emrQA18, a large corpus for question answering on electronic medical records, restricted to queries containing a “why” question. The study identified gaps in the task’s design; using domain-specific embeddings with Clinical BERT and off-domain pre-finetuning with SQuAD 2.0 improved accuracy. Inaccurate answers provided by the model suggested that BERT might have relied on adjacent cues and recurring associations instead of truly understanding the data.
Topic modeling
Topic modeling is an unsupervised technique that can group documents into distinct, unlabeled topics, with each topic defined by a set of words that occur more frequently in association with it. Statistical models like Latent Dirichlet Allocation (LDA) and the more recent transformer-based BERTopic are commonly used. A recent study applied topic modeling to 200 ophthalmology-related articles on COVID-19 and grouped them into topics such as ocular manifestations, viral transmission, treatment strategies, patient care, and practice management.19 Likewise, Nguyen et al.20 described a novel, robust method for performing sentiment analysis and emotion detection on free text from web-based forums. Discussions and user information about a particular search term (oculoplastics) were mined from MedHelp, a web-based patient health forum. IBM Watson Natural Language Understanding was used to perform sentiment and emotion analysis, generating sentiment and emotion scores for the posts and identifying associated keywords. Sentiment and emotion scores were calculated for grouped keywords, with the sentiment score reflecting an overall positive, negative, or neutral attitude and the emotion scores (anger, fear, disgust, joy, sadness) representing the likelihood that these emotions are present. For instance, posts mentioning body parts were found to evoke strong emotions of sadness and fear, while posts concerning administration were primarily linked to anger. Thus, NLP can be used to understand patient attitudes toward medications, symptoms, and complications by analyzing free text on web-based forums.
Other NLP functions
Summarizing large texts like clinical notes or scientific publications could make them more patient-friendly by replacing the specialized jargon used by clinicians with layperson-friendly terms. However, most NLP-based summarization work has focused on biomedical literature (97 %), while clinical EHR notes account for only the remaining 3 %.21 This disparity underscores the need for NLP-based summarization in the clinical domain.
NLP-based digital scribes could help mitigate the significant burden of EHR documentation, decreasing physicians’ burnout.22 Besides enhancing documentation efficiency, these scribes could integrate with automated text extraction pipelines to create a comprehensive computer assistant. This assistant would enable physicians to focus more on their patients, potentially improving patient satisfaction and the quality of patient-physician relationships. However, the development of such scribes presents challenges, such as recording audio due to high ambient noise, complex medical vocabulary, false starts, non-lexical pauses, and the non-linear progression of topics during a conversation.23
Additionally, LLMs are well-suited to implementing chatbots: given some text (a prompt), an LLM generates a human-like response by repeatedly predicting the most likely, syntactically correct next word. With the powerful computers available today, LLMs can have hundreds of billions, or even trillions, of parameters with which to generate new text. Their outputs are so similar to human language that they may give the false impression that LLMs possess actual understanding or meaning. These models have become so relevant that they deserve a complete section of their own.
Large Language Models
LLMs represent a significant advancement in the field of NLP, offering powerful tools for understanding and generating human language, such as the chatbots ChatGPT and Gemini. These models are trained on extensive text datasets, enabling them to learn the statistical patterns and relationships inherent in natural language. Below, we provide an overview of different types of language models, their functionalities, and their applications in medicine, aiming to elucidate their relevance for clinical research.
Types of Language Models
Word n-grams: Word n-grams are sequences of n consecutive words from a text. For example, the phrase “blood pressure” is a 2-gram (bigram). N-grams have been utilized for text mining scientific literature24 and discovering regulatory elements in DNA sequences25. However, they do not capture the full context of the text, limiting their ability to represent complex relationships between words.
Convolutional Neural Networks (CNNs): CNNs employ convolutional filters to scan text or biological sequences, identifying specific features or patterns. They have been effectively used in a wide range of tasks, from detecting regulatory enhancers in DNA26 to triaging patients in the emergency department27. While CNNs excel at recognizing local patterns, they are less adept at capturing long-range dependencies and complex sentence structures. Despite these limitations, CNNs have outperformed other neural network architectures in certain genomics and transcriptomics analyses by using techniques such as dilation and scanning fields.
Long Short-Term Memory (LSTM) Networks: LSTMs are a type of recurrent neural network (RNN) designed to process sequential data by capturing long-range dependencies using both long- and short-term memory constructs. LSTMs have been applied in tasks like microbial diagnostics (i.e. identification of Escherichia coli strains)28 and electroencephalogram interpretation (i.e. for epileptic seizure detection)29. However, their sequential processing nature can lead to issues like the vanishing gradient phenomenon, where early information in long texts is lost. Additionally, LSTMs can be computationally intensive and slow to train due to their inability to leverage parallel computation.
Transformer Models: Transformers were introduced in 2017 and have revolutionized NLP by addressing the limitations of previous models.9 They utilize a multi-headed attention mechanism, allowing them to focus on different parts of the input text simultaneously. This capability enables transformers to capture long-range dependencies more effectively, overcoming the “forgetting” problem of LSTMs.30 Transformers also benefit from parallel processing, making them more efficient to train and deploy. Despite their high computational requirements, transformers have been successfully applied to a variety of medical questions, providing detailed insights into the statistical relationships between elements in a sequence.31
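At its core, the attention mechanism computes a weighted average of value vectors, with weights derived from query-key similarity. The following pure-Python sketch of single-head scaled dot-product attention illustrates the idea with toy two-dimensional embeddings (real transformers use learned projection matrices and many attention heads in parallel):

```python
import math

def softmax(xs):
    # Numerically stable softmax: exponentiate and normalize to sum to 1
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: each output token is a weighted mix of
    # the value vectors, weighted by how well its query matches each key
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Three toy token embeddings; queries = keys = values (self-attention)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
```

Because every query attends to every key in one step, a token can draw on distant context directly, which is what lets transformers avoid the long-range "forgetting" of recurrent models.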
Methodological framework for Natural Language Processing models
LLMs are trained on large volumes of text to learn which word is likely to come next in a sequence and to predict it. Thus, NLP is like the nervous system in the human body, allowing it to understand and respond to various stimuli, while LLMs, such as GPT-3, are like the brain’s cortex, processing and generating complex responses based on the input received.
Initial preprocessing of raw textual data is conducted through several key processes: tokenization, the removal of stop words, and lemmatization. This stage is critical for reducing complexity and enhancing the analytical feasibility of the text. Subsequently, the pre-processed text is transformed into a numerical format employing advanced techniques such as word embeddings, facilitating the computational handling of linguistic elements.
During the pre-training phase, the model is exposed to large corpora, enabling the acquisition of general language structures, grammatical patterns, and factual knowledge. This foundational training is crucial for the model’s ability to generalize across different linguistic contexts. Specific task-oriented fine-tuning follows, where the model is trained on a labeled dataset tailored to particular applications, such as sentiment analysis, enhancing its precision for designated tasks.
After fine-tuning, the model generates multiple candidate responses to given prompts or queries. These responses are evaluated and ranked by domain experts based on their relevance and accuracy. This expert feedback is instrumental in calibrating the model’s performance, as it informs subsequent adjustments to the learning algorithms. Through this iterative process of feedback and adjustment, the model progressively refines its capacity to generate more accurate and contextually appropriate responses (Fig. 4).
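The "predict the next word" objective underlying this pipeline can be illustrated with a toy bigram model trained on a three-sentence corpus. Real LLMs replace the frequency table with a transformer holding billions of parameters, but the training signal (learning which continuations follow which contexts) is analogous:

```python
from collections import Counter, defaultdict

corpus = (
    "the patient reports blurred vision . "
    "the patient denies pain . "
    "the patient reports headache ."
).split()

# "Training": count which word follows each word in the corpus
follows = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    follows[w1][w2] += 1

def predict_next(word):
    # "Inference": return the most frequent continuation seen during training
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))      # 'patient'
print(predict_next("patient"))  # 'reports' (seen twice vs. 'denies' once)
```

Chaining such predictions generates text one word at a time, which is exactly how chatbot responses are produced, only with vastly richer context than a single preceding word.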
Fig. 4. Flow of Large Language Models’ training to generate chatbots.
GPT (Generative Pre-trained Transformer) and other state-of-the-art transformer models like BERT, T5, and RoBERTa represent significant advancements in NLP and machine learning. GPT, developed by OpenAI, is especially noted for its ability to generate coherent and contextually relevant text based on a given prompt. This model is pre-trained on a vast corpus of text and fine-tuned for specific tasks, making it versatile across many applications. Similarly, BERT (Bidirectional Encoder Representations from Transformers)32 excels in understanding the context of words in sentences by processing text bidirectionally. T5 (Text-to-Text Transfer Transformer)33 extends the transformer model to convert all NLP problems into a text-to-text format, facilitating a more uniform approach to various tasks. RoBERTa (A Robustly Optimized BERT Pretraining Approach)34, a variant of BERT optimized with more data and training tweaks, has shown improvements in performance on several benchmarks. These models have set new standards in NLP by efficiently handling complex language tasks and enhancing the understanding of linguistic nuances in large-scale datasets.
The advent of transformer models has not only revolutionized the field of NLP but has also paved the way for advancements in multimodal applications. These applications, such as OpenAI’s CLIP and Google’s MUM, leverage transformers to understand and generate content that combines text, images, and sometimes audio. For instance, CLIP (Contrastive Language-Image Pre-training)35 excels in understanding and categorizing images based on textual descriptions, effectively bridging the gap between visual and linguistic data. Similarly, MUM (Multitask Unified Model)36 is designed to handle complex information requests by integrating insights across text, images, and other data formats, thus providing more comprehensive answers. This extension of transformer technology beyond pure text analysis illustrates its potential to contribute significantly to fields like computer vision, audio processing, and even cross-modal interactions, making it a cornerstone technology in the era of AI-driven multimodal understanding.
In the medical field, such chatbots could be used for patient interaction and to automate clinician workflows. Chatbots can be developed to classify patient symptoms and recommend appropriate medical specialties using a deep learning-based NLP pipeline.37 They could also interact with patients, collect data, and increase efficiency. One study categorized electronic dialog data into a fixed number of categories using NLP, leading to the development of a chatbot algorithm.38 Such a tool could monitor patients beyond consultations and empower and educate them.
Applications in medicine
Health screening
The use of ML for triaging ophthalmology referrals has been previously explored. Tan et al. used NLP to identify referrals requiring “category one” (urgent) prioritization and to emulate human triaging across three categories. In a pilot study, they achieved an AUC of 0.83 and an accuracy of 0.81 with CNNs, which outperformed other models such as artificial neural networks and logistic regression.27 Similarly, Peissig et al. developed a multi-modal EHR-based algorithm for identifying patients with age-related cataracts, utilizing a combination of database querying, NLP, and optical character recognition.39 This algorithm achieved positive predictive values over 95 %, illustrating the effectiveness of integrating multiple data processing methods to improve the identification and characterization of cataract cases in large healthcare datasets.
Clinical decision support systems: diagnosis and treatment suggestion
One prominent use of NLP in medicine is in interpreting unstructured clinical notes and medical records. By applying NLP techniques, valuable information hidden within these text-based documents can be extracted, helping improve diagnosis, guide treatment, improve patient care quality, streamline healthcare operations, and even predict patient outcomes. Our group evaluated GPT’s diagnostic accuracy and management recommendations for uveitis, finding that while it slightly lagged behind uveitis-trained ophthalmologists, it provided consistent management plans. These findings suggest that LLMs could eventually effectively supplement clinical expertise in specialized areas of medicine, providing a resource for diagnostic and therapeutic guidance.40,41 Likewise, Huang et al. conducted a comparative cross-sectional study to assess the diagnostic accuracy of GPT-4 against fellowship-trained ophthalmologists in glaucoma and retina management. The study revealed that GPT-4 matched or even outperformed the specialists, emphasizing the model’s capability in clinical decision-making with accuracy and completeness scores that were highly competitive.42 Such studies underscore the transformative potential of AI in enhancing medical diagnostics and patient management across various specialties.
Moreover, NLP can support Clinical Decision Support Systems (CDSS); these algorithms can analyze a patient’s history and symptoms, compare them against a vast library of medical knowledge, and provide physicians with potential diagnoses or treatment suggestions.43 ML models can also identify patients at risk for certain conditions by analyzing language patterns and trends in patient records, enabling early intervention; an example is NYUTron, a health system-scale language model for clinical operations used to predict 30-day all-cause readmission, in-hospital mortality, comorbidity index, length of stay, and insurance denial.44 Likewise, NLP techniques have been utilized in pharmacovigilance to analyze EHR data to identify and extract medications and their attributable adverse drug reactions.45
Ancillary test and images interpretation and description
Steimetz et al. evaluated AI chatbots for their ability to simplify pathology reports for easier patient comprehension. The study involved 1134 reports from a multispecialty hospital in New York and two chatbots: Bard (Google Inc) and GPT-4 (OpenAI). Both chatbots significantly improved the readability of the reports; Bard and GPT-4 interpreted 993 (87.57 %) and 1105 (97.44 %) reports correctly, respectively. However, errors and hallucinations were noted in the interpretations, suggesting that, while helpful, chatbot outputs require clinical review before patient distribution.46 Likewise, Mihalache et al. assessed the effectiveness of GPT-4 in interpreting ophthalmic images in a cross-sectional study. Using 448 images from 136 ophthalmic cases on the OCTCases platform at the University of Toronto, the chatbot answered 299 of 429 multiple-choice questions correctly (70 % accuracy). Performance varied across subspecialties, being best on retina-related questions (77 % correct) and worst in neuro-ophthalmology (58 % correct). GPT-4 was more accurate on non-image-based questions than on image-based questions, emphasizing the need for careful integration of AI tools in medical diagnostics.47
Improve clinical workflow and patient satisfaction
NLP can also help increase clinician and patient understanding and aid patient management. By extracting information from the medical literature, NLP enables clinicians to access evidence-based knowledge at the point of care.13,48–50 NLP-driven tools have been developed to analyze patient data and generate personalized recommendations, as well as to present and synthesize information from patients using adaptive EHRs.51 In addition, sentiment analysis, another aspect of NLP, can be employed on patient feedback and reviews to understand patient satisfaction and improve healthcare services.52 This helps hospitals and clinics tailor their services to patient needs and expectations, improving the overall patient experience.
Moreover, integrating LLM-based chatbots into electronic health systems could improve the quality of informed consent while reducing the documentation burden on physicians. Decker et al. tested this application by comparing informed consent documentation generated by an LLM-based chatbot with that written by surgeons. The study assessed 36 risk, benefit, and alternative (RBA) statements for six common surgical procedures. Results indicated that the LLM-based chatbot produced more readable, accurate, and complete informed consent documents than the surgeons. Readability scores were better for the LLM (12.9 vs. 15.7), and the LLM significantly outperformed surgeons in completeness and accuracy, especially in describing the benefits and alternatives of surgery.53
Patient engagement and support
Chatbots are being progressively integrated into patient communication channels, demonstrating significant potential in enhancing patient engagement and support. Recent empirical studies offer insight into the efficacy of AI in this domain. Ayers et al. reported that AI-generated responses to general medical queries on social media were preferred over those by human physicians, receiving higher ratings for both empathy and quality.54 This suggests a promising role for AI in reducing physician burnout and improving patient comprehension. Complementing these findings, Bernstein et al. evaluated the performance of AI in providing ophthalmology-specific advice.55 Their study indicated that AI responses were comparable to those from ophthalmologists in terms of accuracy, likelihood of harm, and adherence to community standards, demonstrating AI's capability to handle specialized medical queries. Moreover, a study by Chen et al. highlighted AI's effectiveness in oncology, where chatbot responses were consistently rated higher than those by oncologists in terms of empathy, quality, and readability.56 These studies collectively underline the potential of AI chatbots not only to support but also to enhance patient-doctor interactions across medical specialties, suggesting a transformative shift towards AI-augmented healthcare communication strategies. Such advancements hold promise for optimizing healthcare delivery and enhancing patient satisfaction, although their integration requires careful consideration to maintain the essential human elements of medical practice.
Public health monitoring
Furthermore, NLP has been employed in public health. By analyzing social media posts and news articles, NLP algorithms have drawn insights from local community stakeholders to improve recommendations for communicating about vaccination.57 Likewise, advanced machine learning models can detect fake news on social media related to medical conditions or interventions, for example helping to mitigate the spread of COVID-19 misinformation.58
Clinical trial assistance
Moreover, NLP has facilitated cohort identification and trial recruitment by automating the identification of eligible patients for clinical trials through analysis of EHRs. Studies have demonstrated the successful use of NLP to identify patients with specific conditions for enrollment in a clinical trial, saving time and resources.59 Beattie et al. demonstrated the use of LLMs, such as OpenAI's GPT series, in automating patient screening for clinical trials: GPT-4 achieved an accuracy of 0.87 and a sensitivity of 0.85, underscoring the potential of these technologies to reduce manual workload and error rates in patient selection.60 Similarly, Sun and Tao explored the application of NLP in automating the identification and normalization of named entities related to Alzheimer's disease within clinical trial eligibility criteria, achieving an F1-score of 0.816 in named entity recognition and an accuracy of 0.940 in named entity normalization.61 Furthermore, Cunningham et al. validated an NLP model for adjudicating heart failure hospitalizations in a multi-center clinical trial, which not only correlated closely with traditional physician-led review but also offers a scalable alternative that could streamline future clinical trials; the model achieved high agreement with expert adjudication, with an agreement rate of 93 % and a kappa of 0.83.62 These advancements highlight NLP's role in enhancing data extraction and decision-making, ultimately fostering more efficient clinical research environments.
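The agreement statistics reported in such adjudication studies are straightforward to compute. The sketch below uses hypothetical labels (not data from the cited trials) to show percent agreement and Cohen's kappa between an NLP adjudicator and a physician reviewer.

```python
# Percent agreement and Cohen's kappa for two binary raters.
# The label lists below are hypothetical, not data from the cited trials.
def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal positive rates
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

nlp_calls       = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
physician_calls = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
agreement = sum(x == y for x, y in zip(nlp_calls, physician_calls)) / len(nlp_calls)
kappa = cohens_kappa(nlp_calls, physician_calls)  # agreement corrected for chance
```

Kappa is preferred over raw agreement in trial adjudication because it discounts matches that would occur by chance alone.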
Medical education
NLP technologies enable the creation of dynamic, interactive learning environments that can adapt to individual learners' needs. For instance, as explored by Mohapatra et al., ChatGPT can serve as a teaching assistant in fields like plastic surgery, providing tailored educational content and simulating patient interactions.63 This not only enhances the learning experience but also prepares residents for real-life clinical scenarios by improving their decision-making and problem-solving skills. However, in a study using clinical vignettes, providing physicians with GPT-4 as a diagnostic tool did not markedly enhance clinical reasoning over traditional resources, though it did appear to boost aspects like efficiency. Notably, GPT-4 outperformed both groups of physicians on its own, indicating potential for further improvement in the collaboration between physicians and AI in clinical settings.64
Moreover, the automation of assessment processes using NLP can significantly enhance the efficiency and objectivity of evaluating medical trainees. Spadafore et al. developed an NLP model to automatically apply the Quality of Assessment for Learning (QuAL) tool to narrative supervisor comments.65 This model assesses the quality of feedback provided in competency-based medical education, highlighting areas for improvement and offering quantifiable data to guide educational strategies.
Additionally, NLP has been leveraged to improve the quality and effectiveness of feedback, which is crucial for the professional development of medical trainees. Solano et al. validated an NLP model that classifies the quality of feedback given to surgical trainees.66 This model helps programs identify high-quality feedback, which is essential for enhancing trainees' educational outcomes. However, a significant challenge in integrating NLP into medical education is the potential for automation bias, where learners might over-rely on AI recommendations without sufficient critical assessment. Nguyen et al. highlight this concern, suggesting that medical education must include AI literacy to prevent such biases and ensure that future physicians can effectively integrate AI tools with clinical judgment.67
Drug discovery
NLP has also shown promise in identifying potential drug candidates for new therapeutic indications, as well as drug-drug interactions. Researchers have used NLP techniques to analyze biomedical literature and identify existing drugs that could be repurposed for treating different diseases, extracting relationships between human diseases, genes, mutations, drugs, and metabolites.68,69
These real-life examples underscore the diverse applications of NLP in healthcare, showcasing its potential to improve patient care, drug research, the efficiency of healthcare operations, and public health initiatives. As advancements continue in AI and machine learning, the role of NLP in medicine is set to become even more crucial. Table 1 and Fig. 5 provide examples of potential real-world uses for the different functions of NLP.
Fig. 5. Applications of NLP in clinical practice.
Challenges and limitations
Given the substantial applications of NLP in analyzing medical text data, it is essential to evaluate its current limitations and challenges. The foremost challenge is the technological literacy of healthcare professionals: it is crucial for medical practitioners to comprehend the complexities involved in implementing NLP applications within healthcare settings, thereby promoting responsible and informed usage.70,71
NLP models, which rely heavily on data, require large, diverse datasets for training. Ideally, these datasets should encapsulate all possible scenarios the model may encounter upon deployment. Such comprehensive data coverage is vital to enable the model to understand the semantics and nuances of the data, thereby ensuring robust and generalizable results. It is fundamental to consider that NLP operates as a mathematical model, responsive only to the data it is fed. Research indicates that clinical notes—whether transcribed by speech software or professional transcriptionists72 or handwritten73—can contain errors, ranging from the documentation of non-existent findings to grammatical and typographical inaccuracies, as well as redundancy due to copy-and-paste.74,75 When such erroneous data is used as input, the output can also be misleading or incorrect. Consequently, manual verification of the input data is crucial to creating an effective NLP system, and clinicians must allocate additional time to ensure that clinical histories are thorough and accurate.
The importance of these limitations is exemplified by Wen et al.,16 where using Clinical BERT embeddings trained on over two million notes instead of the baseline BERT model led to a 6 % increase in accuracy. Interestingly, even pre-fine-tuning the BERT model on a generic, off-domain dataset like SQuAD before subjecting it to actual clinical questions resulted in a 3 % increase in accuracy. This result suggests that even off-domain, close-genre corpora can significantly enhance performance. Thus, the study concluded that enhancing data preparation could be a viable approach for further accuracy improvement.
Another relevant challenge is the control of bias. All AI models generate outputs based on the information on which they were trained. Therefore, when training datasets are unbalanced (e.g., over-representation of a specific ethnic group), the output can be biased. Although research on this topic is increasing, these limitations and challenges hamper the deployment of NLP developments in real-world clinical scenarios, resulting in a relatively low application rate of approximately 5 %.13
Moreover, establishing a framework for the secure handling of health data with NLP is a concern among hospitals and medical organizations. Most publicly available NLP tools currently operate in the cloud, and it is not always evident who provides the underlying cloud computing resources;76 it is therefore vital to determine whether the security risk is justified by the gains in cost and services. The AI Risk Management Framework (AI RMF 1.0) offers one way to confront these security challenges: it encourages responsible AI research and use, emphasizes data quality, and underscores the importance of keeping a human in the loop to rectify the models' answers.77
Determining responsibility for NLP outputs in a medical workflow is crucial. These models generate replies based on statistical patterns in data but are incapable of actual thinking or comprehension.78 As a consequence, a model may occasionally produce responses that lack coherence or employ words inappropriately within the given context. Topol79 emphasizes that AI systems aim to complement human intelligence rather than replace human decision-making, and therefore they cannot be held responsible. This underscores the pressing need for stricter control over the proliferation of these models and their impact on society and the economy. In the coming years, the generative AI community will have to closely monitor the ongoing debate between slower innovation with decentralized distribution and centralized, regulated adoption with well-defined ethical boundaries.80
Despite their potential, the deployment of LLMs in medicine faces significant challenges. LLMs currently exhibit lower diagnostic accuracy compared to human clinicians, particularly in complex cases, posing risks to patient safety.41,81 They struggle with consistently following clinical guidelines and interpreting laboratory results, often making hasty and incomplete diagnostic decisions. The models’ sensitivity to the phrasing and order of information further complicates their integration into clinical workflows, necessitating extensive clinician supervision. Additionally, the reliance on training data predominantly sourced from American healthcare contexts limits their generalizability across diverse medical systems and languages. These challenges underscore the need for substantial improvements in model robustness, reliability, and adaptability before LLMs can be trusted for autonomous clinical decision-making.81
Conclusion
NLP facilitates the analysis of vast amounts of text data from diverse sources, encompassing EHRs, scientific publications, public datasets, and social media platforms. However, implementing these technologies in real-world settings is not without its unique challenges and difficulties, primarily due to the inherent nature of unstructured data. This has resulted in a relatively low integration of these technologies into clinical practice. Therefore, clinicians must familiarize themselves with these technologies, and understand their applications and limitations, to foster their development and harness their potential benefits.
Acknowledgements
RA was supported by grants awarded by the National Medical Research Council (NMRC), Ministry of Health, Republic of Singapore grant number NRMC/CSAINV22jul-000, NMRC/CSAINV19nov-0007, and NMRC/CIRG21nov-0023. The funders had no role in study design, data collection and analysis, publication decisions, or manuscript preparation.
Glossary
- Embeddings
Embeddings are numerical, low-dimensional representations of text where similar words have similar representations in vector space. These are learned from the data and can capture semantic and syntactic relationships between terms. Medical Example: Embeddings can be used in ophthalmology to model relationships between different clinical terms, enhancing the understanding of symptom relationships and disease classification.
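A hypothetical sketch of how embeddings support such comparisons: cosine similarity between term vectors is high for related clinical concepts and low for unrelated ones. The 4-dimensional vectors below are invented; real embeddings have hundreds of dimensions learned from data.

```python
import math

# Cosine similarity between two vectors of equal length.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors: related terms point in similar directions.
emb = {
    "glaucoma":             [0.9, 0.1, 0.3, 0.0],
    "intraocular_pressure": [0.8, 0.2, 0.4, 0.1],
    "fracture":             [0.0, 0.9, 0.1, 0.8],
}
sim_related = cosine_similarity(emb["glaucoma"], emb["intraocular_pressure"])
sim_unrelated = cosine_similarity(emb["glaucoma"], emb["fracture"])
# sim_related is close to 1; sim_unrelated is close to 0
```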
- F1-score
The F1-score is a statistical measure used to evaluate the accuracy of a classification test. It considers both the precision and the recall of the test to compute the score, which provides a balance between the model’s correctness and completeness. The F1-score is the harmonic mean of precision and recall, where an F1-score reaches its best value at 1 (perfect precision and recall) and worst at 0. Medical Example: In ophthalmology, the F1-score can be used to evaluate the performance of a diagnostic AI model in correctly identifying patients with specific eye diseases like diabetic retinopathy or glaucoma, ensuring that the model reliably detects true cases while minimizing false diagnoses.
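As a worked example with invented confusion counts, the F1-score for a hypothetical diabetic retinopathy classifier can be computed directly from true positives, false positives, and false negatives:

```python
# F1 = harmonic mean of precision and recall (the counts below are invented).
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)  # fraction of flagged cases that truly have disease
    recall = tp / (tp + fn)     # fraction of true disease cases that were flagged
    return 2 * precision * recall / (precision + recall)

# Hypothetical screening results: 80 true positives, 10 false positives,
# 20 false negatives -> precision ~0.889, recall 0.80, F1 ~0.842
f1 = f1_score(tp=80, fp=10, fn=20)
```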
- Lemmatization
The process of reducing a word to its base or dictionary form. Unlike stemming, it produces a word that is still valid in the language. Medical Example: Converting the words “is”, “are”, and “am” to their base form “be” to standardize the terminology in medical records.
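A minimal sketch of the idea: real lemmatizers (e.g., NLTK's WordNetLemmatizer or spaCy) use full vocabularies and morphological rules, while the lookup table below is invented for illustration.

```python
# Toy dictionary-based lemmatizer; the table is illustrative, not exhaustive.
LEMMAS = {"is": "be", "are": "be", "am": "be", "eyes": "eye", "worsening": "worsen"}

def lemmatize(tokens):
    # Fall back to the lowercased token when no lemma is known.
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

lemmas = lemmatize("Both eyes are worsening".split())
# -> ['both', 'eye', 'be', 'worsen']
```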
- Natural Language Processing
The use of algorithms to identify, understand, and generate human language. Medical Example: Analyzing patient feedback to identify common concerns about hospital facilities.
- Prompt
A question or statement given to a language model to generate a response or action based on its training. Medical Example: Asking a language model: “What are the common side effects of chemotherapy?” to generate a list for patient education.
- Prompt engineering
The practice of designing, refining, or tailoring prompts to get better or more specific responses from a language model. Medical Example: Refining a question for a medical language model from “What’s this drug?” to “Provide a detailed overview, including side effects, of the drug Metformin.” to get a comprehensive answer about a specific medication.
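In practice, prompt engineering often amounts to maintaining reusable templates rather than writing ad-hoc questions. The template below is a hypothetical illustration, not the API of any particular model.

```python
# Hypothetical reusable prompt template for drug-overview queries.
TEMPLATE = (
    "Provide a detailed overview of the drug {drug}, including indications, "
    "typical dosing, and common side effects. Answer in at most {limit} words."
)

# Fill in the placeholders for a specific query.
prompt = TEMPLATE.format(drug="Metformin", limit=150)
```

Keeping such templates under version control lets a team refine the wording once and reuse it consistently across queries.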
- State-of-the-art
The highest current level of development or sophistication in a field: the latest and most advanced techniques, methodologies, and architectures in artificial intelligence research and application. Medical Example: The Transformer architecture and its derivatives, such as BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-to-Text Transfer Transformer), represent the state-of-the-art in natural language processing. The state-of-the-art in AI is continually evolving.
- Stemming
The process of reducing words to their root form by removing suffixes or prefixes. This can often lead to non-standard word forms. Medical Example: Reducing the words “running”, “runner”, and “ran” to the root “run” to analyze medical documents for consistency.
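A toy suffix-stripping stemmer illustrates both the mechanism and its limitation with irregular forms. Real systems use ordered rule sets such as the Porter stemmer; the suffix list here is invented.

```python
# Naive suffix-stripping stemmer (illustrative; real stemmers like Porter's
# apply ordered rules with conditions on the remaining stem).
SUFFIXES = ("ning", "ing", "ner", "ed", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# "running" and "runner" both reduce to "run", but the irregular past
# tense "ran" is left unchanged -- a known weakness of pure stemming.
stems = [stem(w) for w in ("running", "runner", "ran")]
```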
- Tokenization
The process of breaking up a sequence of text into smaller parts called tokens, which can be words, characters, or subwords. Medical Example: Splitting a patient’s medical note into individual words to analyze frequency of specific terms or to identify key elements.
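A minimal word-level tokenizer and term-frequency count over an invented clinical note (production pipelines typically use subword schemes such as byte-pair encoding):

```python
import re
from collections import Counter

# Simple word-level tokenizer; the note below is invented for illustration.
def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

note = "Patient reports blurred vision. Blurred vision worse at night."
freq = Counter(tokenize(note))  # e.g., freq['vision'] == 2
```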
- Transformer
A deep learning model architecture introduced in 2017, known for its ability to handle sequences of data, such as text. It relies on attention mechanisms to weigh the importance of different parts of input. Medical Example: Predicting the next word in a medical transcription or suggesting diagnosis based on a series of clinical notes.
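The attention mechanism at the heart of the Transformer can be sketched in a few lines: each query scores every key, the scores are normalized with softmax, and the output is the weighted sum of the values. The 2-dimensional toy vectors below are invented; real models use large learned matrices and multiple attention heads.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query over toy vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output = attention-weighted average of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# The query matches the first key more strongly, so the first value dominates.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
```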
- Close-genre corpora
Close-genre corpora are collections of text documents that are similar in style, format, or subject matter. These corpora are used to train or test natural language processing models to ensure the models perform well on specific types of texts. Medical Example: A close-genre corpus of medical journals and clinical case reports can be used to train a model designed to understand and generate medical texts, improving its accuracy in clinical decision support systems.
- emrQA dataset
The emrQA dataset is a large, publicly available dataset derived from electronic medical records (EMRs), structured for question answering (QA). It allows the development and evaluation of NLP models that can answer questions based on medical records content. Medical Example: The emrQA dataset can be used to train a model to automatically answer queries about patient information from EMRs, such as medication lists or historical diagnoses, enhancing efficiency in clinical settings.
- SQuAD 2.0 dataset
The Stanford Question Answering Dataset (SQuAD) 2.0 is an enhanced version of the original SQuAD, which is a benchmark dataset for machine learning models to practice the extraction of answers to questions based on context provided in a passage. Medical Example: SQuAD 2.0 includes not only questions that have answers in the provided passages but also questions that do not have an answer within the text. SQuAD 2.0 can be employed in medical NLP to develop models that not only fetch answers from medical texts but also identify when questions are unanswerable based on the provided information, which is critical for clinical decision-making support.
References
- 1. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–444. doi:10.1038/nature14539.
- 2. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–1940. doi:10.1038/s41591-023-02448-8.
- 3. Peruselli C, De Panfilis L, Gobber G, Melo M, Tanzi S. Artificial intelligence and palliative care: opportunities and limitations. Recent Prog Med. 2020;111(11):639–645. doi:10.1701/3474.34564.
- 4. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–118. doi:10.1038/nature21056.
- 5. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digit Med. 2018;1(1):39. doi:10.1038/s41746-018-0040-6.
- 6. Negro-Calduch E, Azzopardi-Muscat N, Krishnamurthy RS, Novillo-Ortiz D. Technological progress in electronic health record system optimization: systematic review of systematic literature reviews. Int J Med Inform. 2021;152:104507. doi:10.1016/j.ijmedinf.2021.104507.
- 7. Kavlakoglu E. NLP vs. NLU vs. NLG: the differences between three natural language processing concepts. IBM Blog; 2023. https://www.ibm.com/blog/nlp-vs-nlu-vs-nlg-the-differences-between-three-natural-language-processing-concepts/.
- 8. Brewster RCL, Gonzalez P, Khazanchi R, et al. Performance of ChatGPT and Google Translate for pediatric discharge instruction translation. Pediatrics. 2024;154(1):e2023065573. doi:10.1542/peds.2023-065573.
- 9. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. 2017. doi:10.48550/ARXIV.1706.03762.
- 10. Jin Q, Kim W, Chen Q, et al. BioCPT: contrastive pre-trained transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. 2023. doi:10.48550/ARXIV.2307.00589.
- 11. Gu Y, Tinn R, Cheng H, et al. Domain-specific language model pretraining for biomedical natural language processing. 2020. doi:10.48550/ARXIV.2007.15779.
- 12. IBM. What is natural language processing? IBM Blog; 2023. https://www.ibm.com/topics/natural-language-processing.
- 13. Fraile Navarro D, Ijaz K, Rezazadegan D, et al. Clinical named entity recognition and relation extraction using natural language processing of medical free text: a systematic review. Int J Med Inform. 2023;177:105122. doi:10.1016/j.ijmedinf.2023.105122.
- 14. Wu Y, Yang X, Bian J, Guo Y, Xu H, Hogan W. Combine factual medical knowledge and distributed word representation to improve clinical named entity recognition. AMIA Annu Symp Proc. 2018;2018:1110–1117.
- 15. Cairns BL, Nielsen RD, Masanz JJ, et al. The MiPACQ clinical question answering system. AMIA Annu Symp Proc. 2011;2011:171–180.
- 16. Wen A, Elwazir MY, Moon S, Fan J. Adapting and evaluating a deep learning language model for clinical why-question answering. JAMIA Open. 2020;3(1):16–20. doi:10.1093/jamiaopen/ooz072.
- 17. Alsentzer E, Murphy JR, Boag W, et al. Publicly available clinical BERT embeddings. 2019. doi:10.48550/ARXIV.1904.03323.
- 18. Pampari A, Raghavan P, Liang J, Peng J. emrQA: a large corpus for question answering on electronic medical records. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2018:2357–2368. doi:10.18653/v1/D18-1258.
- 19. Hallak JA, Scanzera AC, Azar DT, Chan RVP. Artificial intelligence in ophthalmology during COVID-19 and in the post COVID-19 era. Curr Opin Ophthalmol. 2020;31(5):447–453. doi:10.1097/ICU.0000000000000685.
- 20. Nguyen AXL, Trinh XV, Wang SY, Wu AY. Determination of patient sentiment and emotion in ophthalmology: infoveillance tutorial on web-based health forum discussions. J Med Internet Res. 2021;23(5):e20803. doi:10.2196/20803.
- 21. Mishra R, Bian J, Fiszman M, et al. Text summarization in the biomedical domain: a systematic review of recent research. J Biomed Inform. 2014;52:457–467. doi:10.1016/j.jbi.2014.06.009.
- 22. Dymek C, Kim B, Melton GB, Payne TH, Singh H, Hsiao CJ. Building the evidence-base to reduce electronic health record–related clinician burden. J Am Med Inform Assoc. 2021;28(5):1057–1061. doi:10.1093/jamia/ocaa238.
- 23. Quiroz JC, Laranjo L, Kocaballi AB, Berkovsky S, Rezazadegan D, Coiera E. Challenges of developing a digital scribe to reduce clinical documentation burden. npj Digit Med. 2019;2(1):114. doi:10.1038/s41746-019-0190-1.
- 24. Liu J, Wu H, Robertson DH, Zhang J. Text mining and portal development for gene-specific publications on Alzheimer's disease and other neurodegenerative diseases. BMC Med Inf Decis Mak. 2024;24(S3):98. doi:10.1186/s12911-024-02501-7.
- 25. Mejía-Guerra MK, Buckler ES. A k-mer grammar analysis to uncover maize regulatory architecture. BMC Plant Biol. 2019;19(1):103. doi:10.1186/s12870-019-1693-2.
- 26. Gao Y, Chen Y, Feng H, Zhang Y, Yue Z. RicENN: prediction of rice enhancers with neural network based on DNA sequences. Inter Sci Comput Life Sci. 2022;14(2):555–565. doi:10.1007/s12539-022-00503-5.
- 27. Tan Y, Bacchi S, Casson RJ, Selva D, Chan W. Triaging ophthalmology outpatient referrals with machine learning: a pilot study. Clin Exp Ophthalmol. 2020;48(2):169–173. doi:10.1111/ceo.13666.
- 28. Mao Q, Zhang X, Xu Z, Xiao Y, Song Y, Xu F. Identification of Escherichia coli strains using MALDI-TOF MS combined with long short-term memory neural networks. Aging. 2024. doi:10.18632/aging.205995. Published online June 29, 2024.
- 29. Zhao W, Wang WF, Patnaik LM, et al. Residual and bidirectional LSTM for epileptic seizure detection. Front Comput Neurosci. 2024;18:1415967. doi:10.3389/fncom.2024.1415967.
- 30. Liu L, Liu J, Han J. Multi-head or single-head? An empirical comparison for transformer training. 2021. doi:10.48550/ARXIV.2106.09650.
- 31. Lam HYI, Ong XE, Mutwil M. Large language models in plant biology. Trends Plant Sci. 2024. doi:10.1016/j.tplants.2024.04.013. Published online May 2024.
- 32. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. 2018. doi:10.48550/ARXIV.1810.04805.
- 33. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. 2019. doi:10.48550/ARXIV.1910.10683.
- 34. Liu Y, Ott M, Goyal N, et al. RoBERTa: a robustly optimized BERT pretraining approach. 2019. doi:10.48550/ARXIV.1907.11692.
- 35. Radford A, Kim JW, Hallacy C, et al. Learning transferable visual models from natural language supervision. 2021. http://arxiv.org/abs/2103.00020. Accessed 7 July 2024.
- 36. Nayak P. MUM: a new AI milestone for understanding information. Google; May 2021. https://blog.google/products/search/introducing-mum/.
- 37. Lee H, Kang J, Yeo J. Medical specialty recommendations by an artificial intelligence chatbot on a smartphone: development and deployment. J Med Internet Res. 2021;23(5):e27460. doi:10.2196/27460.
- 38. Zand A, Sharma A, Stokes Z, et al. An exploration into the use of a chatbot for patients with inflammatory bowel diseases: retrospective cohort study. J Med Internet Res. 2020;22(5):e15589. doi:10.2196/15589.
- 39. Peissig PL, Rasmussen LV, Berg RL, et al. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inf Assoc. 2012;19(2):225–234. doi:10.1136/amiajnl-2011-000456.
- 40. Rojas-Carabali W, Sen A, Agarwal A, et al. Chatbots vs. human experts: evaluating diagnostic performance of chatbots in uveitis and the perspectives on AI adoption in ophthalmology. Ocul Immunol Inflamm. 2023:1–8. doi:10.1080/09273948.2023.2266730. Published online October 13, 2023.
- 41. Rojas-Carabali W, Cifuentes-González C, Wei X, et al. Evaluating the diagnostic accuracy and management recommendations of ChatGPT in uveitis. Ocul Immunol Inflamm. 2023:1–6. doi:10.1080/09273948.2023.2253471. Published online September 18, 2023.
- 42. Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a large language model's responses to questions and cases about glaucoma and retina management. JAMA Ophthalmol. 2024;142(4):371. doi:10.1001/jamaophthalmol.2023.6917.
- 43. Berge GT, Granmo OC, Tveit TO, Munkvold BE, Ruthjersen AL, Sharma J. Machine learning-driven clinical decision support system for concept-based searching: a field trial in a Norwegian hospital. BMC Med Inf Decis Mak. 2023;23(1):5. doi:10.1186/s12911-023-02101-x.
- 44. Jiang LY, Liu XC, Nejatian NP, et al. Health system-scale language models are all-purpose prediction engines. Nature. 2023;619(7969):357–362. doi:10.1038/s41586-023-06160-y.
- 45. Wei Q, Ji Z, Li Z, et al. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. J Am Med Inform Assoc. 2020;27(1):13–21. doi:10.1093/jamia/ocz063.
- 46. Steimetz E, Minkowitz J, Gabutan EC, et al. Use of artificial intelligence chatbots in interpretation of pathology reports. JAMA Netw Open. 2024;7(5):e2412767. doi:10.1001/jamanetworkopen.2024.12767.
- 47. Mihalache A, Huang RS, Popovic MM, et al. Accuracy of an artificial intelligence chatbot's interpretation of clinical ophthalmic images. JAMA Ophthalmol. 2024;142(4):321. doi:10.1001/jamaophthalmol.2024.0017.
- 48. Campillos-Llanos L, Valverde-Mateos A, Capllonch-Carrión A, Moreno-Sandoval A. A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine. BMC Med Inf Decis Mak. 2021;21(1):69. doi:10.1186/s12911-021-01395-z.
- 49. Cao Y, Liu F, Simpson P, et al. AskHERMES: an online question answering system for complex clinical questions. J Biomed Inform. 2011;44(2):277–288. doi:10.1016/j.jbi.2011.01.004.
- 50. Wang LL, Lo K. Text mining approaches for dealing with the rapidly expanding literature on COVID-19. Brief Bioinform. 2021;22(2):781–799. doi:10.1093/bib/bbaa296.
- 51. Hsu W, Taira RK, El-Saden S, Kangarloo H, Bui AAT. Context-based electronic health record: toward patient specific healthcare. IEEE Trans Inf Technol Biomed. 2012;16(2):228–234. doi:10.1109/TITB.2012.2186149.
- 52. Ye J, Hai J, Wang Z, Wei C, Song J. Leveraging natural language processing and geospatial time series model to analyze COVID-19 vaccination sentiment dynamics on Tweets. JAMIA Open. 2023;6(2):ooad023. doi:10.1093/jamiaopen/ooad023.
- 53. Decker H, Trang K, Ramirez J, et al. Large language model–based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Netw Open. 2023;6(10):e2336997. doi:10.1001/jamanetworkopen.2023.36997.
- 54. Bernstein IA, Zhang Y, Govil D, et al. Comparison of ophthalmologist and large language model chatbot responses to online patient eye care questions. JAMA Netw Open. 2023;6(8):e2330320. doi:10.1001/jamanetworkopen.2023.30320.
- 55. Longwell JB, Hirsch I, Binder F, et al. Performance of large language models on medical oncology examination questions. JAMA Netw Open. 2024;7(6):e2417641. doi:10.1001/jamanetworkopen.2024.17641.
- 56. Chen D, Parsa R, Hope A, et al. Physician and artificial intelligence chatbot responses to cancer questions from social media. JAMA Oncol. 2024. doi:10.1001/jamaoncol.2024.0836. Published online May 16, 2024.
- 57. Wang Y, Willis E, Yeruva VK, Ho D, Lee Y. A case study of using natural language processing to extract consumer insights from tweets in American cities for public health crises. BMC Public Health. 2023;23(1):935. doi:10.1186/s12889-023-15882-7.
- 58. Alghamdi J, Lin Y, Luo S. Towards COVID-19 fake news detection using transformer-based models. Knowl-Based Syst. 2023;274:110642. doi:10.1016/j.knosys.2023.110642.
- 59. Meystre SM, Heider PM, Cates A, et al. Piloting an automated clinical trial eligibility surveillance and provider alert system based on artificial intelligence and standard data models. BMC Med Res Method. 2023;23(1):88. doi:10.1186/s12874-023-01916-6.
- 60. Beattie J, Neufeld S, Yang D, et al. Utilizing large language models for enhanced clinical trial matching: a study on automation in patient screening. Cureus. 2024. doi:10.7759/cureus.60044. Published online May 10, 2024.
- 61. Sun Z, Tao C. Named entity recognition and normalization for Alzheimer's disease eligibility criteria. In: Proceedings of the 2023 IEEE 11th International Conference on Healthcare Informatics (ICHI). IEEE; 2023:558–564. doi:10.1109/ICHI57859.2023.00100.
- 62. Cunningham JW, Singh P, Reeder C, et al. Natural language processing for adjudication of heart failure hospitalizations in a multi-center clinical trial. Preprint. 2023. doi:10.1101/2023.08.17.23294234. Published online August 23, 2023.
- 63..Mohapatra DP, Thiruvoth FM, Tripathy S, et al. Leveraging large language models (LLM) for the plastic surgery resident training: do they have a role? Indian J Plast Surg. 2023;56(05):413–420. 10.1055/s-0043-1772704. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64..Goh E, Gallo R, Hom J, et al. Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study; Published online March 14, 2024. 〈 10.1101/2024.03.12.24303785〉. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65..Spadafore M, Yilmaz Y, Rally V, et al. Using natural language processing to evaluate the quality of supervisor narrative comments in competency-based medical education. Acad Med. 2024;99(5):534–540. 10.1097/ACM.0000000000005634. [DOI] [PubMed] [Google Scholar]
- 66..Solano QP, Hayward L, Chopra Z, et al. Natural language processing and assessment of resident feedback quality. J Surg Educ. 2021;78(6):e72–e77. 10.1016/j.jsurg.2021.05.012. [DOI] [PubMed] [Google Scholar]
- 67..Nguyen T ChatGPT in medical education: a precursor for automation bias? JMIR Med Educ. 2024;10, e50174. doi: 10.2196/50174. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68..Cheng D, Knox C, Young N, Stothard P, Damaraju S, Wishart DS. PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res. 2008;36(Web Server): W399–W405. doi: 10.1093/nar/gkn296. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69..Zheng S, Dharssi S, Wu M, Li J, Lu Z. Text Mining for Drug Discovery. In: Larson RS, Oprea TI, eds. Bioinformatics and Drug Discovery. Vol 1939. Methods in Molecular Biology. Springer; New York; 2019:231–252. 10.1007/978-1-4939-9089-4_13. [DOI] [PubMed] [Google Scholar]
- 70..Hua D, Petrina N, Young N, Cho JG, Poon SK. Understanding the factors influencing acceptability of AI in medical imaging domains among healthcare professionals: a scoping review. Artif Intell Med. 2024;147, 102698. 10.1016/j.artmed.2023.102698. [DOI] [PubMed] [Google Scholar]
- 71..Laupichler MC, Aster A, Meyerheim M, Raupach T, Mergen M. Medical students’ AI literacy and attitudes towards AI: a cross-sectional two-center study using pre-validated assessment instruments. BMC Med Educ. 2024;24(1):401. 10.1186/s12909-024-05400-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72..Zhou L, Blackley SV, Kowalski L, et al. Analysis of errors in dictated clinical documents assisted by speech recognition software and professional transcriptionists. JAMA Netw Open. 2018;1(3), e180530. 10.1001/jamanetworkopen.2018.0530. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73..Weiner SJ, Wang S, Kelly B, Sharma G, Schwartz A. How accurate is the medical record? A comparison of the physician’s note with a concealed audio recording in unannounced standardized patient encounters. J Am Med Inform Assoc. 2020;27(5):770–775. 10.1093/jamia/ocaa027. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74..Cohen R, Elhadad M, Elhadad N. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies. BMC Bioinform. 2013;14(1):10. 10.1186/1471-2105-14-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75..Wang R, Carrington JM, Hammarlund N, Sanchez O, Revere L. An evaluation of copy and paste events in electronic notes of patients with hospital acquired conditions. Int J Med Inform. 2023;170, 104934. 10.1016/j.ijmedinf.2022.104934. [DOI] [PubMed] [Google Scholar]
- 76..Pais S, Cordeiro J, Jamil ML. NLP-based platform as a service: a brief review. J Big Data. 2022;9(1):54. 10.1186/s40537-022-00603-5. [DOI] [Google Scholar]
- 77..Dwivedi YK, Kshetri N, Hughes L, et al. Opinion paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int J Inf Manag. 2023; 71, 102642. 10.1016/j.ijinfomgt.2023.102642. [DOI] [Google Scholar]
- 78..Harrer S Attention is not all you need: the complicated case of ethically using large language models in healthcare and medicine. eBioMedicine. 2023;90, 104512. 10.1016/j.ebiom.2023.104512. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79..Topol E. When M.D. is a Machine Doctor. Ground Truths; Published 2023. 〈https://erictopol.substack.com/p/when-md-is-a-machine-doctor〉. [Google Scholar]
- 80..Larsen B, Narayan J. Generative AI: A Game-changer That Society and Industry Need to Be Ready For2023. Davos: World Economic Forum; 2023. 〈https://www.weforum.org/agenda/2023/01/davos23-generative-ai-a-game-changer-industries-and-society-code-developers/〉. [Google Scholar]
- 81..Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med. 2024. 10.1038/s41591-024-03097-1 [Published online July 4]. [DOI] [PMC free article] [PubMed] [Google Scholar]
