BMJ Health & Care Informatics
. 2025 Oct 22;32(1):e101631. doi: 10.1136/bmjhci-2025-101631

From words to action? A scoping review on automatic sentiment analysis of patient experience comments from online sources and surveys

Elma Jelin 1,, Lilja Charlotte Storset 2, Rebecka M Norman 1, Hilde Hestad Iversen 1, Lina Harvold Ellingsen-Dalskau 1, Petter Mæhlum 2, Erik Velldal 2, Lilja Øvrelid 2, Oyvind Bjertnaes 1
PMCID: PMC12548608  PMID: 41125311

Abstract

Background

Automatic analysis of free-text patient comments enables the efficient processing of large feedback volumes, reducing reliance on manual review. A 2021 review examined natural language processing (NLP) and sentiment analysis (SA) in patient experience research; however, recent advances in deep learning and generative artificial intelligence (AI) call for an updated synthesis.

Objectives

This scoping review aims to map and summarise recent studies applying SA to unstructured patient experience data related to healthcare services.

Methods

Following Joanna Briggs Institute methodology and PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines, we conducted a comprehensive search across Medline, CINAHL, Web of Science, Cochrane, Embase and APA PsycINFO. We included studies published from January 2020 to March 2024 in English or a Scandinavian language. Eligible studies analysed patient feedback using NLP techniques and described the development or validation of SA models. Two reviewers independently screened the studies and extracted data, which were presented in both tabular and narrative forms.

Results

30 studies were included, primarily from the USA, Europe and Asia. Patient comments were mostly sourced from online platforms such as social media. Feedback largely concerned hospital care. 18 studies employed rule-based SA approaches, while 12 applied supervised machine learning (ML) and only 4 used deep learning models. Few addressed visualisation or practical application in healthcare.

Conclusion

Despite significant progress, modern methods like deep learning and generative AI remain underused in SA of patient-experience data. Limited focus on implementation restricts SA’s role in quality improvement. Future research should assess advanced methods and their cost-effectiveness versus traditional ML.

Keywords: Health Services Research, Deep Learning, Machine Learning, Natural Language Processing, Patient-Centered Care

Introduction

Previous research has shown that automated text analyses can effectively extract information from patient feedback, supporting both research and quality improvement initiatives.1,3 However, the evidence base remains limited in this field, with only one systematic review to date, published in 2021, addressing the automated extraction of information from patient experience feedback using natural language processing (NLP).4 Khanbhai et al reviewed studies applying NLP and machine learning (ML) to patient feedback, focusing primarily on sentiment analysis (SA) and topic modelling, a method to identify abstract topics that occur in a collection of texts. They found that most studies (80%) applied language analysis to patient feedback obtained from social media platforms, while the remainder used data from patient experience surveys. Social media comments were predominantly assigned topics using unsupervised models trained on unlabelled data, whereas survey comments were analysed with supervised models trained on labelled data, mainly for SA. 10 of 19 studies combined SA with topic modelling. The review emphasised that human annotation remains the ‘gold standard’, highlighting that ML model performance depends heavily on the quality of the training data.4 The studies identified by Khanbhai et al mostly relied on rule-based or probabilistic statistical approaches for analysing unstructured patient experience data. This review thus provides an important foundation for understanding how NLP and SA have so far been applied to patient experience data, while also illustrating the methodological challenges that remain. The next section introduces the basic principles of NLP and SA, which will serve as a framework for the subsequent discussion of methods, applications and the specific objectives of this review.

NLP and SA

NLP is a multidisciplinary field, combining informatics and linguistics to enable machines to process natural language in ways that are useful to humans. Key NLP tasks include speech recognition, machine translation, text generation and text classification. SA falls under text classification and involves identifying subjective opinions and attitudes expressed in text. SA determines whether sentiment has a positive or negative polarity and identifies the targets (who/what) and the holders (who expresses the sentiment).5 In aspect-based SA, the task also involves mapping the identified targets to broader topic categories.4 6

In SA, different approaches are used to label polarity scores. Binary labels classify sentences as either positive or negative, while ternary labels include an additional neutral category. Polarity can also be presented on a continuous scale, where lower scores indicate negative sentiments and higher scores indicate positive sentiment.5 Sentences that contain both positive and negative sentiment are often referred to as mixed.
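As a concrete illustration of these labelling schemes, the sketch below maps a continuous polarity score in [−1, 1] onto ternary labels using a neutral band around zero. The function name and band width are illustrative choices, not taken from any of the reviewed studies.

```python
def to_ternary(score: float, neutral_band: float = 0.05) -> str:
    """Map a continuous polarity score in [-1, 1] to a ternary label.

    The width of the neutral band is a hypothetical choice for
    illustration; binary labelling would simply drop the middle case.
    """
    if score >= neutral_band:
        return "positive"
    if score <= -neutral_band:
        return "negative"
    return "neutral"

print(to_ternary(0.62))   # positive
print(to_ternary(-0.30))  # negative
print(to_ternary(0.01))   # neutral
```

Note that such thresholding cannot distinguish a genuinely neutral sentence from a mixed one whose positive and negative polarities cancel out, a limitation returned to later in this review.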

ML is often used to perform SA, as well as other tasks within NLP. ML enables computations based on patterns learnt from data, either from human-annotated data (serving as a gold standard) or from unannotated data, allowing the model to learn patterns directly. Models trained in this way can generalise to new data. The approach using annotated data is known as supervised learning, while models that learn from unlabelled data follow an unsupervised learning approach.

Rule-based methods

Although ML is widely used in the field today, several traditional NLP techniques, such as rule-based approaches, remain in use. In rule-based systems, the model does not learn from data but analyses text directly according to a set of predefined rules. Valence Aware Dictionary and sEntiment Reasoner (VADER) is a widely used rule-based SA tool that applies a lexicon of human-assigned sentiment scores to words and phrases.7 It assigns sentiment scores to new data, assuming that some of the words are included in the lexicon: each word in a sentence is matched against the predefined lexicon and assigned a score. Certain heuristics, such as negation and adverbs of degree, are applied before the scores are normalised. VADER reports the proportions of positive, neutral and negative content in a text, together with a normalised compound score between −1 and 1. While the reviewed literature often does not specify which of these outputs is used, the original authors note that the compound score is the most common choice.7
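To make the lexicon-based idea concrete, the following is a minimal sketch of such a scorer. The tiny lexicon, its hand-picked scores and the negation list are purely illustrative stand-ins, not the actual VADER lexicon (which contains thousands of human-rated entries and richer heuristics); only the final normalisation mirrors VADER's approach of squashing a raw sum into [−1, 1].

```python
import math

# Toy lexicon with hypothetical scores -- illustrative only.
LEXICON = {"great": 3.0, "friendly": 2.0, "slow": -1.5, "rude": -2.5}
NEGATORS = {"not", "never", "no"}

def lexicon_score(sentence: str) -> float:
    """Sum lexicon hits, flip the sign after a negator, then squash the
    total into [-1, 1] (VADER uses a similar normalisation with alpha=15)."""
    tokens = sentence.lower().split()
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            score = LEXICON[tok]
            if i > 0 and tokens[i - 1] in NEGATORS:
                score = -score  # crude negation heuristic
            total += score
    return total / math.sqrt(total * total + 15)

print(lexicon_score("the staff were friendly"))      # > 0
print(lexicon_score("the staff were not friendly"))  # < 0
```

Words absent from the lexicon contribute nothing, which is why such tools are static and domain-dependent: a lexicon built on general web text may miss clinically relevant vocabulary in patient comments.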

Supervised methods

Among supervised, classical, statistical and non-neural methods, Naïve Bayes is frequently used. This probabilistic algorithm applies Bayes’ theorem to classify data based on the probabilities of the different classes given the features of the data.8 The support vector machine (SVM) algorithm is also commonly applied to classification tasks; it finds the decision boundary that maximises the margin between the two classes.9
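A minimal from-scratch sketch of multinomial Naïve Bayes for sentiment classification follows; the four training comments and the labels are hypothetical, and real studies trained on thousands of annotated comments. It shows the Bayes'-theorem mechanics directly: a log prior per class plus Laplace-smoothed log likelihoods per word.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Fit multinomial Naive Bayes on (text, label) pairs:
    P(label | words) is proportional to P(label) * product of P(word | label)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label, n_docs in class_counts.items():
        lp = math.log(n_docs / total_docs)  # log prior
        n_words = sum(word_counts[label].values())
        for w in text.lower().split():
            # add-one (Laplace) smoothing so unseen words get a small probability
            lp += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Hypothetical labelled comments standing in for an annotated data set.
train = [
    ("the nurses were kind and helpful", "pos"),
    ("excellent care and friendly staff", "pos"),
    ("long wait and rude reception", "neg"),
    ("the ward was dirty and noisy", "neg"),
]
model = train_nb(train)
print(predict_nb(model, "kind and friendly nurses"))  # pos
print(predict_nb(model, "rude staff and long wait"))  # neg
```

An SVM would instead learn a geometric decision boundary over vectorised text features; both approaches share the need for labelled training data that the rule-based methods above avoid.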

Deep learning

Among deep learning models that learn patterns from large data sets or neural networks, BERT (Bidirectional Encoder Representations from Transformers) is a commonly used bidirectional language model based on the transformer architecture.10 Masked language models (MLMs) such as BERT have substantially advanced sentiment classification tasks.10 11 Unlike sequential models, BERT considers the context both to the left and right in a sentence when building a representation. Moreover, the transformer architecture has a self-attention mechanism that enables the model to focus on specific parts of the input sequence and weigh their importance. While large language models (LLMs) are strong in text generation,12 MLMs11 are better suited for domain-specific tasks such as sentiment classification of patient comments.
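The self-attention mechanism described above can be sketched numerically. The toy below computes single-head scaled dot-product attention; the random projection matrices are stand-ins for trained weights, and real models such as BERT stack many such heads and layers.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    Each token's output is a weighted mix of all tokens' value vectors,
    with weights from a softmax over query-key similarities -- this is how
    a transformer attends to context on both sides of a token.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ v, weights

# Toy input: 3 "tokens" with 4-dimensional embeddings, random projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, attn = self_attention(x, *(rng.normal(size=(4, 4)) for _ in range(3)))
print(attn.round(2))  # each row sums to 1: how strongly each token attends to the others
```

The attention matrix makes the "weigh their importance" point literal: row i gives the distribution of token i's attention over the whole sequence, left and right alike.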

Objectives

The review by Khanbhai et al included studies published between 2012 and 2020, with almost 80% (15 out of 19) published in the last 5 years of that period. Given the rapid development of deep learning models and generative artificial intelligence (AI), substantial development in SA applications within health services research is expected.

The objective of this scoping review is to map and summarise recent evidence on SA applied to unstructured patient experience data related to health services. This objective is addressed through the following research question:

  • How is SA developed and applied in the analysis of unstructured free-text comments from (1) patient experience surveys and (2) online sources?

SA development was examined by mapping and summarising the methods used, and how the data were annotated, labelled and evaluated. The SA application was assessed by mapping whether studies demonstrated practical applications and visualisations of results, and how these could support clinical practice and quality improvement.

Methods

A scoping review is a form of evidence synthesis designed to identify and map the breadth of available evidence and key factors related to a specific topic.13 This methodology is a suitable approach to our research question, as it allows for the synthesis of evidence from a diverse range of studies. We did not conduct a quality appraisal of individual studies included in the review.

Protocol and registration

The review was conducted according to Joanna Briggs Institute methodology for scoping reviews and is reported according to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) statement (online supplemental appendix 1).14 15 The protocol for this scoping review was published on the Open Science Framework (https://doi.org/10.17605/OSF.IO/ZVURW).

Eligibility criteria

Included studies had to report patient-reported experiences related to specific healthcare services, based on unstructured feedback from online platforms (eg, hospital or physician-review sites) or structured patient experience surveys, and include SA. Studies based on general discussions, or those focused on specific interventions (eg, apps, vaccines, medications) or topical issues at the population level (eg, care during COVID-19), were excluded. We restricted the language to English or a Scandinavian language. Furthermore, the studies were required to include at least minimal documentation of the development and/or validation of the NLP approach and to focus on the description, development or application of an NLP SA algorithm or pipeline for processing patient experience data.

Information sources and search strategy

The search strategy was developed by the authors in close collaboration with an experienced librarian at the Norwegian Institute of Public Health. The search strategy, including all identified keywords and index terms, was adapted for each included database (full search strategy in online supplemental appendix 2). The literature search was carried out in Medline, CINAHL, Web of Science, Scopus, Epistemonikos, Cochrane Database of Systematic Reviews, Embase and APA PsycINFO, covering the period from January 2020 to March 2024.

Selection of sources of evidence and synthesis of results

Following meetings with the author group and based on the research question, agreements were made regarding which data to extract, including patient groups, sources, countries, approaches for SA and maturity of ML models. The data were extracted and summarised in both tabular and narrative form.

Data charting process and data items

Titles and abstracts were screened, and potentially relevant articles were read in full text, by two researchers (EJ and OB). Any disagreement between the reviewers regarding the inclusion of an article or the extracted information was resolved through discussion between the authors or by involving a third researcher. The reference lists of the included articles were also screened for additional relevant articles. Full-text articles deemed irrelevant were excluded, each with a documented reason for exclusion.

Results

Selection of sources of evidence

A total of 418 unique records were identified through the database searches (figure 1). After screening of titles and abstracts, 37 articles were read in full text and of those, 30 were included in the scoping review. After screening the reference list of the included articles, we found two relevant articles that were also included in the final list of studies. The included articles are presented with some selected descriptive details in table 1.

Figure 1. PRISMA flow diagram. PRISMA 2020 flow diagram showing the selection process for studies identified through database searches. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses. Source: Page MJ, et al. BMJ 2021;372:n71.


Table 1. Characteristics of sources of evidence; patient group, source, country and language.

Study Title Patient group Source Country (language)
Patient experience surveys
Cammel et al1 How to automatically turn patient experience free-text responses into actionable insights: a natural language programming (NLP) approach General hospital patients The Netherlands (Dutch)
Nawab et al36 Natural language processing to extract meaningful information from patient experience feedback General hospital patients USA (English)
Khanbhai et al37 Using natural language processing to understand, facilitate and maintain continuity in patient experience across transitions of care Emergency, inpatient, outpatient and maternity hospital patients Friends and family test reports free-text comments UK (English)
van Buchem et al38 Analysing patient experiences using natural language processing: development and validation of the artificial intelligence patient reported experience measure (AI-PREM) Vestibular schwannoma medical centre patients The Netherlands (Dutch)
Vehviläinen-Julkunen et al39 Experience of ambulatory cancer care: understanding patients’ perspectives of quality using sentiment analysis Cancer hospital patients Finland (English)
Chekijian et al40 Emergency care and the patient experience: using sentiment analysis and topic modelling to understand the impact of the COVID-19 pandemic Emergency departments Press Ganey USA (English)
Online reviews
Park et al16 A sentiment analysis on online psychiatrist reviews to identify clinical attributes of psychiatrists that shape the therapeutic alliance Patients experience of psychiatrists healthgrades.com USA (English)
Alexander et al41 Automating large-scale healthcare service feedback analysis: sentiment analysis and topic modelling study General hospital patients NHS website structured and free text UK (English)
Almuhaideb et al42 Analysing Arabic Twitter-based patient experience sentiments using multidialect Arabic bidirectional encoder representations from transformers Private and public general hospital patients Tweets Saudi Arabia (Arabic)
Butler et al17 Building better paediatric surgeons: a sentiment analysis of online physician review websites Patients experience with paediatric surgeons healthgrades.com USA (English)
Chandrasekaran et al48 Face time with physicians: how do patients assess providers in video visits? Patient reviews of physicians completing in person or video visits to the physician Reviews (from zocsdoc.com) USA (English)
Yazdani et al18 Use of sentiment analysis for capturing hospitalised patients with cancer’s experience from free-text comments in the Persian language Cancer hospital patients Hospital online feedback system. Designed online form Iran (Persian)
Tang et al19 What are patients saying about minimally invasive spine surgeons online: a sentiment analysis of 2235 physician review website reviews Patients experience from spine surgeons healthgrades.com USA (English)
Pandey et al20 Advanced sentiment analysis for managing and improving patient experience: application for general practitioner (GP) classification in Northamptonshire Patient experience with GPs General practitioners’ websites UK (English)
Mammadova et al21 Development of decision-making technique based on sentiment analysis of crowdsourcing data in medical social media resources Patient/public opinion of clinics status Patient reviews obtained from database ‘cms_hospital_satisfaction_2020 of the Kaggle company’ generated on the basis of crowdsourcing of patient reviews on medical social media USA (English)
Leong and Dahnil22 Classification of healthcare service reviews with sentiment analysis to refine user satisfaction General hospital patient reviews Web scraping from patient reviews of five hospitals Malaysia (English)
Cho et al23 Sentiment analysis of online patient-written reviews of vascular surgeons Patient reviews of vascular surgeons healthgrades.com USA (English)
Cheng et al24 Sentiment analysis of pain physician reviews on Healthgrades: a physician review website Patient reviews of pain physicians healthgrades.com USA (English)
Hotchkiss et al43 Development of a model and method for hospice quality assessment from natural language processing (NLP) analysis of online caregiver reviews Patient hospice reviews Google and Yelp online caregiver written reviews USA (English)
Serrano-Guerrero et al25 How satisfied are patients with nursing care and why? A comprehensive study based on social media and opinion mining General hospital patient reviews Care Opinion UK (English)
Jo et al26 Physician review websites: understanding patient satisfaction with ophthalmologists using natural language processing Ophthalmology patient reviews healthgrades.com USA (English)
Rahim et al44 Hospital Facebook reviews analysis using a machine learning sentiment analyser and quality classifier General hospital patient reviews Facebook Malaysia (English)
Samah et al45 Classification and visualisation: Twitter sentiment analysis of Malaysia’s private hospitals Private hospital patient reviews Tweets Malaysia (Malay and English)
Shah et al46 Mining patient opinion to evaluate the service quality in healthcare: a deep-learning approach Patient reviews of physician (10 different specialities) Yelp.com USA (English)
Tang et al50 How are patients describing you online? A natural language processing-driven sentiment analysis of online reviews on CSRS surgeons Patient reviews of spine surgeons healthgrades.com USA (English)
Tang et al27 What are patients saying about you online? A sentiment analysis of online written reviews on Scoliosis Research Society surgeons Patient reviews of scoliosis surgeons healthgrades.com USA (English)
Vasan et al28 A natural language processing approach to uncover patterns among online ratings of otolaryngologists Patient reviews of otolaryngologists healthgrades.com USA (English)
Zakkar and Lizotte29 Analysing patient stories on social media using text analytics Patient health service reviews Care Opinion UK (English)
Tang et al30 Using sentiment analysis to understand what patients are saying about hand surgeons online Patient reviews of surgeons Healthgrades.com USA (English)
Liu et al47 Data mining of the reviews from online private doctors Patient reviews of private doctors Chinese physician review website (haodf.com) China (Chinese)

NHS, National Health Service.

Characteristics of sources of evidence

The studies originated from the USA (n=16), Europe (n=8) and Asia (n=6), using languages such as English, Dutch, Arabic, Persian and Malay. Most (n=24) used data from platforms like Facebook and Twitter (X), while six used patient experience surveys. Most of the studies included patients with experiences from general hospitals, with others covering emergency departments, cancer care and general practice. The remaining studies analysed patient experiences with various medical specialists, including ophthalmologists, spine and vascular surgeons and psychiatrists (table 1).

Synthesis of results

Approaches for SA

Table 2 provides an overview of the approaches for SA found in the included articles. Especially relevant for the SA task are the choice between supervised and rule-based approaches, whether the authors used annotated data, how many labels they employed and how they evaluated their methods.

Table 2. The approaches for sentiment analysis in the included studies.
Learning approach Annotated data Labels Evaluation
Study Supervised (n=12)
Deep learning (n=4)
Rule-based/predefined models (n=18)* HA
(n=12)
SR
(n=14)
No annotation
(n=4)
Binary (n=13)
Ternary (n=12)
Scale (n=5)
Evaluation (n=24)
No evaluation (n=6)
Cammel et al1 Python package, Pattern.nl HA Ternary Fleiss’ Kappa.
Nawab et al36 NN (Keras Sequential) HA Ternary No evaluation.
Khanbhai et al37 SVM, kNN, NB, gradient boosted trees. HA Ternary F1, accuracy.
van Buchem et al38 NN (BERT) HA Binary Separate scores for positive and negative classifier. F1, precision, recall.
Park et al16 VADER SR Scale Linear regression.
Alexander et al41 NB SR Binary No evaluation.
Almuhaideb et al42 SVM, NN (BERT) HA Binary F1, accuracy.
Butler et al17 VADER SR Binary Linear regression.
Chandrasekaran et al48 VADER Binary No evaluation.
Yazdani et al18 SentiStrength HA Binary F1, accuracy, precision, recall, specificity, AUC.
Tang et al19 VADER SR Binary Linear regression.
Pandey et al20 Lexicon-based method VADER HA Ternary K-fold validation on limited data.
Mammadova et al21 VADER Ternary Evaluation through MNB and SVM with VADER-classified data as gold. F1, accuracy, precision, recall.
Leong and Dahnil22 VADER SR Ternary F1, precision, recall per class.
Cho et al23 VADER SR Binary/ternary Linear regression.
Cheng et al24 VADER SR Ternary Student t-test, linear regression.
Chekijian et al40 Press Ganey tool HA Ternary Precision, recall, F1.
Hotchkiss et al43 Google Cloud NLP API SR Ternary No evaluation.
Serrano-Guerrero et al25 VADER Ternary No evaluation.
Jo et al26 VADER SR Binary Linear regression.
Rahim et al44 NB, SVM, logistic regression HA Binary F1, accuracy, precision, recall, Hamming loss.
Samah et al45 NB HA Ternary F1, accuracy, precision, recall.
Shah et al46 NN (CNN-LSTM models etc) SR Binary F1, accuracy, precision, recall.
Tang et al50 VADER SR Scale Linear regression.
Tang et al27 VADER SR Scale Linear regression.
Vasan et al28 VADER SR Scale Linear regression.
Vehviläinen-Julkunen et al39 Random forest, linear SVM, MNB, LR HA Binary Accuracy.
Zakkar and Lizotte29 VADER Ternary No evaluation.
Tang et al30 VADER SR Scale Linear regression.
Liu et al47 SnowNLP (NB) HA Binary AUC.
* The three studies that combined supervised and semisupervised approaches are counted in the supervised category.

Including ‘bronze’, ‘silver’, ‘gold’.

AUC, area under the curve; BERT, Bidirectional Encoder Representations from Transformers; CNN, multichannel convolutional neural network; HA, human annotators/raters; LR, logistic regression; MNB, multinomial Naïve Bayes; NB, Naïve Bayes; NLP, natural language processing; NN, neural network; Scale, non-converted sentiment scores; SR, star ratings; SVM, support vector machine; VADER, Valence Aware Dictionary and sEntiment Reasoner.

Rule-based methods

Table 2 illustrates the dominance of rule-based approaches (n=18).116,32 All of these articles used a lexicon-based method, with 16 applying VADER (see table 2). The remaining studies1 18 used SentiStrength and pattern.nl.33,35 These are lexicon-based as well, much like VADER.

Supervised methods

12 studies used supervised approaches for SA.36,47 Compared with the rule-based approaches, there is much greater variety in the methods chosen within the supervised scope; however, Naïve Bayes and SVM occur more frequently than others. Two studies employed a fine-tuned version of BERT.38 42 The remaining two studies employed different types of sequential neural networks, that is, networks that consider the context in one direction only (eg, left to right).36 46

Annotation and evaluation

All 12 studies in the supervised category employed an annotated data set of some form. Among these, nine reported using human annotations, while three used star ratings provided by the authors themselves. In the rule-based category, 3 articles used human-annotated data sets, 11 used star ratings and 4 did not report using any labelled data (table 2).

Regarding evaluation, the amount of training data, the methods and the performance metrics varied across the articles. Notably, linear regression and other statistical methods were frequently used for validation, often to assess the correlation between predicted sentiment scores and annotated data sets. Metrics more commonly used within NLP, such as the F1-score, precision and recall, were less frequent, although present. The amount of training data also varied substantially: Chekijian et al included a total of 5800 samples, whereas Pandey et al used only 34 (not shown in the table).20 40 Lastly, six studies included no evaluation of their approach25 29 36 41 43 48 (table 2).
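For readers less familiar with the NLP metrics mentioned here, the sketch below computes precision, recall and F1 for one class from scratch. The six gold labels and predictions are hypothetical, standing in for an annotated test set and a model's output.

```python
def prf1(gold, pred, positive="pos"):
    """Precision, recall and F1 for one class.
    Precision = TP/(TP+FP); recall = TP/(TP+FN); F1 is their harmonic mean."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold annotations vs model predictions for six comments.
gold = ["pos", "pos", "neg", "neg", "pos", "neg"]
pred = ["pos", "neg", "neg", "pos", "pos", "neg"]
print(prf1(gold, pred))  # precision = recall = F1 = 2/3 here
```

Unlike a correlation between predicted scores and star ratings, these metrics evaluate per-class classification quality directly, which is why they are the conventional choice within NLP.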

Label employment

Among the included studies, 12 used ternary labels and 13 employed binary labels. Five studies used a scaled approach, in which the continuous sentiment scores generated by VADER were reported directly rather than being converted into binary or ternary labels. None of the included studies used mixed as part of their label sets.

Maturity of ML models

Most studies (n=26) relied on more traditional approaches, including rule-based systems and probabilistic statistical models. Only 4 of the 30 studies used deep learning methods or neural networks.36 38 42 46 None of the studies used LLMs, foundation models or generative AI.

Mapping of the practical applications and visualisations of the results from SA

Most studies had minimal or no emphasis on practical application, often presenting only basic results or validation/testing of the models. While simple visualisations like word clouds and tables were common,17 18 21,23 27 31 36 37 40 42 45 only a few studies employed more advanced methods, such as spider plots or map-based visualisations.16 38 41

Discussion

Summary of evidence

The objective of this scoping review was to map and summarise the evidence on the use of SA applied to unstructured patient experience data in the context of health services. Our findings show that the majority of the included studies relied on more traditional rule-based approaches with data collected from online platforms. Only a small number of the studies used more advanced approaches such as deep learning. Furthermore, analysis and visualisations aimed at supporting quality improvement efforts were rarely reported, suggesting a gap between data analysis and the practical use of data in healthcare settings.

Most studies used data from online platforms like Facebook and Twitter (X), while few relied on patient experience surveys. Rule-based methods dominated, with VADER being the most commonly used tool. These methods are simple, resource-efficient, easy to interpret and inspect and require no annotated data. However, they are static, domain-dependent and struggle with context, ambiguity and novel input. A key limitation of VADER is its inability to distinguish between truly neutral sentences and those with opposing sentiments that cancel each other out, potentially confounding the neutral class.7

We observed a greater diversity in the approaches used within supervised methods. However, although deep learning is considered state-of-the-art among current ML approaches, only four of the included studies relied on such methods. Thus, considering both the review by Khanbhai et al and the studies included in the current scoping review, the field of patient experience analysis appears to have shown limited development in recent years when it comes to the use of modern ML approaches for automatic SA. One reason might be that modern approaches are complex and technically advanced, requiring specific funding and close collaboration between health services researchers and experts in the field of language technology and SA. The older models are simpler and can be applied more easily without specialist competence in language technology. An important avenue for further research is therefore to develop and test deep learning models for automatic SA of patient experience comments, including testing and contrasting MLMs and LLMs, which have different advantages. It is also crucial to assess the cost-utility of such advanced models in comparison to simpler ML approaches.

Model evaluation is essential to assess whether a model’s outputs accurately reflect the intended ‘ground truth’ (ie, what we as humans perceive as the reality and want the model to reflect). In supervised learning, this requires labelled data for both training and evaluation. Accordingly, all studies employing supervised approaches use annotated/labelled data sets. Although annotated/labelled data is not required to develop rule-based methods, it is still necessary to evaluate their performance. Despite this, we found that six studies (three using supervised approaches and three using rule-based methods) did not include an evaluation of their models. In some cases, authors present the model’s output without justifying its accuracy or its relevance to the real-world problem. Another factor affecting the reliability of a model’s performance is the size of the training and test data sets. While larger data sets generally lead to better and more robust models, the included studies vary widely in the amount of data used. Some employed substantial data sets, whereas others relied on very limited samples, making their results difficult to interpret or generalise.

Regarding label usage, some authors chose to focus only on binary sentiment, greatly reducing the complexity of the task, while others included a neutral label, often referred to as ternary labelling. One problem with these simplified scoring systems is that they fail to distinguish between truly neutral texts, that is, texts that contain no opinionated text and thus have no polarity, and sentences that contain polarity, but where the combination of positive and negative polarity results in a neutral classification. Sentences containing both positive and negative polarity are often called mixed, but few make an active distinction between these two types. It is worth noting that although a mixed label can be introduced in cases of opinions implying both positive and negative sentiment, none of the studies included mixed as part of their label sets. In addition, in most cases, it is unclear what is meant by neutral in the annotation setups. According to Liu’s definition, neutral sentences do not contain any evaluative content, that is, no opinions. However, the term neutral is sometimes used for sentences with mixed sentiment, where opposing sentiments cancel each other out, leading to seemingly neutral sentiment scores.5 This challenge has parallels in quantitative patient experience research, where neutral response options on Likert scales (eg, ‘neither agree nor disagree’) may conflate mixed experiences with a genuine absence of opinion.

AI applications in healthcare should demonstrate technical performance, but also further evidence of usability and healthcare impacts.49 Our review showed that the majority of studies placed limited or no emphasis on the practical implementation, visualisation or real-world application of ML. Instead, they primarily focused on model development and basic validation. This lack of attention to usability and interpretability hinders the integration of SA into healthcare quality improvement initiatives, thereby limiting the potential impact of these methods. Furthermore, SA using ML is emerging as a rapidly growing field, yet it remains largely disconnected from clinical practice. This reflects a broader gap between those developing computational models and those working within healthcare systems, underscoring the need for interdisciplinary collaboration to bridge methodological innovation with practical utility. At least for patient experience surveys, questions regarding how to integrate quantitative indicators and SA output in reports and dashboards are important, in addition to how information can be effectively presented and used at higher levels (eg, the sentiment for a whole hospital) and at lower levels (eg, sentiment of different aspects for hospital wards, across gender and age groups, etc).

Limitations

The search covered the period from January 2020 to March 2024, which means we cannot rule out potential studies in 2024–2025 using modern approaches. Given the fast developments within ML and NLP, we recommend conducting literature searches and updates in the field of patient experience at least annually in the coming years. Our review focused on peer-reviewed studies in order to ensure a minimum level of quality assurance. While this provides a robust overview of the research status, it also represents a limitation, as there are probably numerous examples in the grey literature and within commercial organisations of applying NLP and SA to patient experience data. These sources may contain more extensive discussion of application than what is reflected in the peer-reviewed literature. Another important limitation is the gap between the peer-reviewed literature and practice ‘on the ground’. While none of the studies included in our review used LLMs, it is likely that healthcare providers are already experimenting with general-purpose LLMs (eg, ChatGPT, Microsoft Copilot) to analyse patient feedback from surveys, online reviews and concerns. This suggests a growing level of ‘naïve’ use of advanced NLP technologies by non-specialists, which further underlines the importance of future interdisciplinary collaboration between healthcare professionals and language technology experts to ensure the development, validation and cost-effectiveness of advanced approaches such as deep learning and generative AI.

Conclusions

Despite rapid advancements in the field, the use of modern techniques such as deep learning and generative AI remains limited, suggesting a gap between technological developments and their application in healthcare. The use and description of annotated data, as well as the evaluation and justification of use and visualisation in clinical practice, varied considerably across the included studies. To bridge this gap, future research should promote interdisciplinary collaboration between healthcare and language technology experts, focusing on the development, validation and cost-effectiveness of advanced approaches such as deep learning and generative AI.

Supplementary material

online supplemental appendix 1
bmjhci-32-1-s001.docx (109.3KB, docx)
DOI: 10.1136/bmjhci-2025-101631
online supplemental appendix 2
bmjhci-32-1-s002.docx (60.5KB, docx)
DOI: 10.1136/bmjhci-2025-101631

Footnotes

Funding: This study was funded by Norges Forskningsråd (Project number 331770).

Provenance and peer review: Not commissioned; externally peer reviewed.

Patient consent for publication: Not applicable.

Ethics approval: Not applicable.

Data availability statement

Data sharing not applicable as no data sets were generated and/or analysed for this study.

References

  • 1.Cammel SA, De Vos MS, van Soest D, et al. How to automatically turn patient experience free-text responses into actionable insights: a natural language programming (NLP) approach. BMC Med Inform Decis Mak. 2020;20:97. doi: 10.1186/s12911-020-1104-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Wagland R, Recio-Saucedo A, Simon M, et al. Development and testing of a text-mining approach to analyse patients’ comments on their experiences of colorectal cancer care. BMJ Qual Saf. 2016;25:604–14. doi: 10.1136/bmjqs-2015-004063. [DOI] [PubMed] [Google Scholar]
  • 3.Wallace BC, Paul MJ, Sarkar U, et al. A large-scale quantitative analysis of latent factors and sentiment in online doctor reviews. J Am Med Inform Assoc. 2014;21:1098–103. doi: 10.1136/amiajnl-2014-002711. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Khanbhai M, Anyadi P, Symons J, et al. Applying natural language processing and machine learning techniques to patient experience feedback: a systematic review. BMJ Health Care Inform. 2021;28:e100262. doi: 10.1136/bmjhci-2020-100262. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Liu B. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge: Cambridge University Press; 2015. [Google Scholar]
  • 6.Ish D, Parker A, Osoba O, et al. Using natural language processing to code patient experience narratives: capabilities and challenges. Santa Monica, CA: RAND Corporation. Available: https://www.rand.org/pubs/research_reports/RRA628-1.html [Google Scholar]
  • 7.Hutto C, Gilbert E. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. ICWSM. 2014;8:216–25. doi: 10.1609/icwsm.v8i1.14550. [DOI] [Google Scholar]
  • 8.Jurafsky D, Martin JH. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition with language models. 3rd edn. [Google Scholar]
  • 9.Hearst MA, Dumais ST, Osuna E, et al. Support vector machines. IEEE Intell Syst Their Appl. 1998;13:18–28. doi: 10.1109/5254.708428. [DOI] [Google Scholar]
  • 10.Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems; 2017. [Google Scholar]
  • 11.Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT; Minneapolis, Minnesota: Association for Computational Linguistics; 2019. [Google Scholar]
  • 12.Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems. 2020. [Google Scholar]
  • 13.Munn Z, Pollock D, Khalil H, et al. What are scoping reviews? Providing a formal definition of scoping reviews as a type of evidence synthesis. JBI Evid Synth . 2022;20:950–2. doi: 10.11124/JBIES-21-00483. [DOI] [PubMed] [Google Scholar]
  • 14.Tricco AC, Lillie E, Zarin W, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann Intern Med. 2018;169:467–73. doi: 10.7326/M18-0850. [DOI] [PubMed] [Google Scholar]
  • 15.Peters MDJ, Godfrey CM, Khalil H, et al. Guidance for conducting systematic scoping reviews. Int J Evid Based Healthc. 2015;13:141–6. doi: 10.1097/XEB.0000000000000050. [DOI] [PubMed] [Google Scholar]
  • 16.Park SH, Cheng CP, Buehler NJ, et al. A sentiment analysis on online psychiatrist reviews to identify clinical attributes of psychiatrists that shape the therapeutic alliance. Front Psychiatry. 2023;14:1174154. doi: 10.3389/fpsyt.2023.1174154. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Butler LR, Tang JE, Hess SM, et al. Building better pediatric surgeons: A sentiment analysis of online physician review websites. J Child Orthop. 2022;16:498–504. doi: 10.1177/18632521221133812. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Yazdani A, Shamloo M, Khaki M, et al. Use of sentiment analysis for capturing hospitalized cancer patients’ experience from free-text comments in the Persian language. BMC Med Inform Decis Mak. 2023;23:275. doi: 10.1186/s12911-023-02358-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Tang J, White CA, Arvind V, et al. What Are Patients Saying About Minimally Invasive Spine Surgeons Online: A Sentiment Analysis of 2,235 Physician Review Website Reviews. Cureus . 2022;14:e24113. doi: 10.7759/cureus.24113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Pandey AR, Seify M, Okonta U, et al. Advanced Sentiment Analysis for Managing and Improving Patient Experience: Application for General Practitioner (GP) Classification in Northamptonshire. Int J Environ Res Public Health. 2023;20:6119. doi: 10.3390/ijerph20126119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Mammadova M, Jabrayilova Z, Shikhaliyeva N. Development of decision-making technique based on sentiment analysis of crowdsourcing data in medical social media resources. EEJET. 2023;5:75–85. doi: 10.15587/1729-4061.2023.289989. [DOI] [Google Scholar]
  • 22.Herng Leong K, Putri Dahnil D. Classification of Healthcare Service Reviews with Sentiment Analysis to Refine User Satisfaction. Int j electr comput eng syst (Online) 2022;13:323–30. doi: 10.32985/ijeces.13.4.8. [DOI] [Google Scholar]
  • 23.Cho LD, Tang JE, Pitaro N, et al. Sentiment Analysis of Online Patient-Written Reviews of Vascular Surgeons. Ann Vasc Surg. 2023;88:249–55. doi: 10.1016/j.avsg.2022.07.016. [DOI] [PubMed] [Google Scholar]
  • 24.Cheng CP, Owusu T, Shekane P, et al. Sentiment analysis of pain physician reviews on Healthgrades: a physician review website. Reg Anesth Pain Med. 2024;49:656–60. doi: 10.1136/rapm-2023-104650. [DOI] [PubMed] [Google Scholar]
  • 25.Serrano-Guerrero J, Bani-Doumi M, Chiclana F, et al. How satisfied are patients with nursing care and why? A comprehensive study based on social media and opinion mining. Inform Health Soc Care. 2024;49:14–27. doi: 10.1080/17538157.2023.2297307. [DOI] [PubMed] [Google Scholar]
  • 26.Jo JJ, Cheng CP, Ying S, et al. Physician Review Websites: Understanding Patient Satisfaction with Ophthalmologists Using Natural Language Processing. J Ophthalmol. 2023;2023:4762460. doi: 10.1155/2023/4762460. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Tang JE, Arvind V, White CA, et al. What are patients saying about you online? A sentiment analysis of online written reviews on Scoliosis Research Society surgeons. Spine Deform. 2022;10:301–6. doi: 10.1007/s43390-021-00419-y. [DOI] [PubMed] [Google Scholar]
  • 28.Vasan V, Cheng CP, Lerner DK, et al. A natural language processing approach to uncover patterns among online ratings of otolaryngologists. J Laryngol Otol. 2023;137:1384–8. doi: 10.1017/S0022215123000476. [DOI] [PubMed] [Google Scholar]
  • 29.Zakkar MA, Lizotte DJ. Analyzing Patient Stories on Social Media Using Text Analytics. J Healthc Inform Res. 2021;5:382–400. doi: 10.1007/s41666-021-00097-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Tang JE, Arvind V, White CA, et al. Using Sentiment Analysis to Understand What Patients Are Saying About Hand Surgeons Online. Hand (N Y) 2023;18:854–60. doi: 10.1177/15589447211060439. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Chandrasekaran R, Bapat P, Jeripity Venkata P, et al. Do Patients Assess Physicians Differently in Video Visits as Compared with In-Person Visits? Insights from Text-Mining Online Physician Reviews. Telemed J E Health. 2023;29:1557–65. doi: 10.1089/tmj.2022.0507. [DOI] [PubMed] [Google Scholar]
  • 32.Tang JE, Arvind V, Dominy C, et al. How Are Patients Reviewing Spine Surgeons Online? A Sentiment Analysis of Physician Review Website Written Comments. Global Spine J. 2023;13:2107–14. doi: 10.1177/21925682211069933. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.De Smedt T, Daelemans W. Pattern for Python. J Mach Learn Res. 2012;13:2063–7. [Google Scholar]
  • 34.Loria S. TextBlob documentation. Release 0.15. 2018. [Google Scholar]
  • 35.Thelwall M. In: Cyberemotions: collective emotions in cyberspace. Holyst JA, editor. Cham: Springer International Publishing; 2017. The heart and soul of the web? sentiment strength detection in the social web with sentistrength; pp. 119–34. [Google Scholar]
  • 36.Nawab K, Ramsey G, Schreiber R. Natural Language Processing to Extract Meaningful Information from Patient Experience Feedback. Appl Clin Inform. 2020;11:242–52. doi: 10.1055/s-0040-1708049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Khanbhai M, Warren L, Symons J, et al. Using natural language processing to understand, facilitate and maintain continuity in patient experience across transitions of care. Int J Med Inform. 2022;157:104642. doi: 10.1016/j.ijmedinf.2021.104642. [DOI] [PubMed] [Google Scholar]
  • 38.van Buchem MM, Neve OM, Kant IMJ, et al. Analyzing patient experiences using natural language processing: development and validation of the artificial intelligence patient reported experience measure (AI-PREM) BMC Med Inform Decis Mak. 2022;22:183. doi: 10.1186/s12911-022-01923-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Vehviläinen-Julkunen K, Turpeinen S, Kvist T, et al. Experience of Ambulatory Cancer Care: Understanding Patients’ Perspectives of Quality Using Sentiment Analysis. Cancer Nurs. 2021;44:E331–8. doi: 10.1097/NCC.0000000000000845. [DOI] [PubMed] [Google Scholar]
  • 40.Chekijian S, Li H, Fodeh S. Emergency care and the patient experience: Using sentiment analysis and topic modeling to understand the impact of the COVID-19 pandemic. Health Technol. 2021;11:1073–82. doi: 10.1007/s12553-021-00585-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Alexander G, Bahja M, Butt GF. Automating Large-scale Health Care Service Feedback Analysis: Sentiment Analysis and Topic Modeling Study. JMIR Med Inform. 2022;10:e29385. doi: 10.2196/29385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.AlMuhaideb S, AlNegheimish Y, AlOmar T, et al. Analyzing Arabic Twitter-Based Patient Experience Sentiments Using Multi-Dialect Arabic Bidirectional Encoder Representations from Transformers. Computers, Materials & Continua. 2023;76:195–220. doi: 10.32604/cmc.2023.038368. [DOI] [Google Scholar]
  • 43.Hotchkiss JT, Ridderman E, Bufkin W. Development of a model and method for hospice quality assessment from natural language processing (NLP) analysis of online caregiver reviews. Palliat Support Care. 2024;22:19–30. doi: 10.1017/S1478951523001001. [DOI] [PubMed] [Google Scholar]
  • 44.Rahim AIA, Ibrahim MI, Chua S-L, et al. Hospital Facebook Reviews Analysis Using a Machine Learning Sentiment Analyzer and Quality Classifier. Healthcare (Basel) 2021;9:1679. doi: 10.3390/healthcare9121679. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Abu Samah KAF, Nor Azharludin NM, Riza LS, et al. Classification and visualization: Twitter sentiment analysis of Malaysia’s private hospitals. IJ-AI. 2023;12:1793. doi: 10.11591/ijai.v12.i4.pp1793-1802. [DOI] [Google Scholar]
  • 46.Shah AM, Yan X, Shah SAA, et al. Mining patient opinion to evaluate the service quality in healthcare: a deep-learning approach. J Ambient Intell Human Comput. 2020;11:2925–42. doi: 10.1007/s12652-019-01434-8. [DOI] [Google Scholar]
  • 47.Liu J, Zhang W, Jiang X, et al. Data Mining of the Reviews from Online Private Doctors. Telemed J E Health. 2020;26:1157–66. doi: 10.1089/tmj.2019.0159. [DOI] [PubMed] [Google Scholar]
  • 48.Chandrasekaran R, Bapat P, Venkata PJ, et al. Face time with physicians: How do patients assess providers in video-visits? Heliyon. 2023;9:e16883. doi: 10.1016/j.heliyon.2023.e16883. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Jackson GP, Shortliffe EH. Understanding the evidence for artificial intelligence in healthcare. BMJ Qual Saf. 2025;34:421–4. doi: 10.1136/bmjqs-2025-018559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Tang J, Arvind V, White CA, et al. How are Patients Describing You Online? A Natural Language Processing Driven Sentiment Analysis of Online Reviews on CSRS Surgeons. Clin Spine Surg. 2023;36:E107–13. doi: 10.1097/BSD.0000000000001372. [DOI] [PubMed] [Google Scholar]



Articles from BMJ Health & Care Informatics are provided here courtesy of BMJ Publishing Group
