Abstract
Background
With the rise of large language models, the application of artificial intelligence in research is expanding, possibly accelerating specific stages of the research process. This study aims to compare the accuracy, completeness and relevance of chatbot-generated responses against human responses in evidence synthesis as part of a scoping review.
Methods
We employed a structured survey-based research methodology to analyse and compare the responses of two human researchers and four chatbots (ZenoChat, ChatGPT 3.5, ChatGPT 4.0, and ChatFlash) to questions based on a pre-coded sample of 407 articles. These questions were part of the evidence synthesis of a scoping review on digitally supported interaction between healthcare workers.
Results
The analysis revealed no significant differences in judgments of correctness between answers by chatbots and those given by humans. However, chatbots’ answers were found to recognise the context of the original text better, and they provided more complete, albeit longer, responses. Human responses were less likely to add new content to the original text or include interpretation. Amongst the chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 tying for third. Correct contextualisation of the answer was positively correlated with completeness and correctness of the answer.
Conclusions
Chatbots powered by large language models may be a useful tool to accelerate qualitative evidence synthesis. Given the current speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will very likely continue to expand over the coming years.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12874-025-02532-2.
Keywords: Artificial intelligence, Chatbot, Large language model, ChatGPT, ChatFlash, ZenoChat
Background
Since the public launch of ChatGPT (OpenAI) in November 2022, chatbots powered by large language models have received increasing attention from the public, politicians and science [1]. Their usability and challenges have been debated across various sectors, particularly in human-related areas such as education or healthcare [2, 3]. These advanced language models are trained on vast repositories of data and tuned to mimic human conversation [4]. Their scale, pre-training mechanisms, contextual understanding and flexibility set them apart from previous machine learning tools [5, 6].
With the rise of large language models, the application of artificial intelligence in analysing complex datasets, making predictions, and supporting literature reviews has opened up various new opportunities, particularly for healthcare [5, 6]. Its application in research is also expanding due to the anticipated potential to accelerate the research process and possibly improve transparency [7–9]. Especially in evidence synthesis such as systematic reviews, where the mean duration from literature search to publication extends beyond a year, large language models could expedite the preparation of evidence-based guidelines and, thereby, positively impact medical practice [10–12]. In qualitative research, ChatGPT performed well in reproducing specific themes, but less successfully in establishing interpretative themes and creating depth when coding inductively [7, 13, 14]. Qualitative research tools, such as MAXQDA and ATLAS.ti, have integrated artificial intelligence tools in partnership with OpenAI to support users at various stages of their research process [13]. Chatbots might further enhance and accelerate research by supporting tasks such as writing search strings and summaries [8, 9, 15].
While ChatGPT is a prominent example, it is only one of numerous chatbots to employ large language models [4]. In recent years, the market has seen the introduction of multiple alternatives. Despite sharing a similar core technology, their different training and fine-tuning processes may result in variations in their response generation and capabilities [16]. The aim of this study is to compare the accuracy, completeness and relevance of chatbot-generated responses to questions about pre-coded article excerpts, comparing these responses both against human responses and across different chatbots. This contributes to an advanced understanding of the use and performance of chatbots in evidence synthesis.
Methods
We deployed a randomised, blinded, survey-based process to assess the performance of chatbots in supporting evidence synthesis.
Context and sample
As part of a scoping review on digitally supported interaction between healthcare workers, 407 articles were included for data extraction and manually analysed (see Appendix 1) [17]. The analysis scheme was informed by Greenhalgh et al.’s NASSS framework (Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies) [18]. This framework contains seven categories (the condition, the technology, the value proposition, the adopters, the organization, the wider system and the evolution and adaptation over time) with specific questions on each [18]. It was adapted into a coding framework of seven codes and the following questions:
Context: Why was the technology introduced? Is there a specific disease area mentioned?
Technology: What are the key features of the technology? What knowledge and/or support is required to use the technology? What is the technology supply model? What is said about current technology use and uptake?
Value proposition: What is the technology’s positive value proposition in terms of e.g., desirability, efficacy, safety, and cost effectiveness amongst others, excluding beneficial effects on staff? What is the technology's negative value proposition in terms of e.g., desirability, efficacy, safety, and cost effectiveness amongst others, excluding beneficial effects on staff?
Adopter system: What changes in staff roles, practices and routines, and identities are implied? What is the technology’s desirability for staff, including beneficial effects on staff?
Organization: What is the organization’s readiness and/or capacity for technology-supported change? What is said about adoption and funding? What work is involved in implementation and who will do it?
Wider context: What is the context for program rollout (e.g., political, economic, regulatory, professional (e.g., medicolegal), and sociocultural)?
Embedding and adaptation over time: How much scope is there for adapting and coevolving the technology and the service over time?
The 407 articles pertained to four different settings: hospital, ambulant, intersectoral and others. For our study, we selected the 39 articles related to the ambulant setting and identified within this subset those containing coded entries for each code. A researcher then selected five articles for each of the six codes (condition, technology, value proposition, adopters, organization, and wider system), ensuring a diverse representation while maintaining randomness. The code “the evolution and adaptation over time” was applied in only two articles within the subset, both of which were included. Different coded parts of the same code in one article were combined into one text passage that varied in length and complexity. The text passages were then presented to two researchers and four chatbots with the task of answering the questions for each code, including all relevant information. The two researchers had been actively participating in the design and conduct of the research and were thus well-acquainted with the material and scope of the review. A coded text passage under each code is given as an example, together with an answer provided by one of the four chatbots or a human researcher, at the end of this section (Table 1).
Table 1.
Overview of codes
| Code | Question | Example text passage | Example answer |
|---|---|---|---|
| 1. Condition | 1A. Why was the technology introduced? | […] health information technology (HIT) systems for shared electronic medication plans. These systems enable patients and health care professionals to use and manage a common list of current medications across sectors and settings. Shared electronic medication plans have great potential to improve medication management and patient safety, […]a | ChatGPT 3.5: - The technology (HIT systems for shared electronic medication plans) was introduced to improve medication management and patient safety |
| | 1B. Is there a specific disease area mentioned? | | ChatGPT 3.5: - The paragraph does not mention a specific disease area. The focus is on medication management and patient safety in a broader context |
| 2. Technology | 2A. What are the key features of the technology? | A multidisciplinary digital communication tool, Congredi (Fast Guide, Oud Gastel, the Netherlands), was implemented to improve communication about care among providers. It includes an interactive multidisciplinary care plan and a secure e-mailing channel. It was selected on the basis of jointly developed functional specifications; The Tool Congredi is a communication tool for providers, and it was originally developed for primary care by a general practitioner. It consists of an option to share data in a care plan and asynchronous communication options using secure email. The included multidisciplinary care plan was based on the social, functional, mental, physical, and communication problem inventory domain model. Providers received a practical instruction training session lasting 4 h and demonstrating use of the Congredi tool. The patient had to give permission to begin a record and to invite providers to link. After patient permission was obtained, a provider opened a Congredi record for the patient and filled in a care plan. Other providers who were involved with the patient and had access to Congredi could be invited by link so that they could view the care plan and use it interactively. Provider actions were recorded in the system. The following are three main categories of actions: care actions (problem assessment, defining care goals, observing patients' health status, and adapting care goals), e-mailing, and inviting other providers to link. All providers had the ability to update the care plan and initiate actions; they received alerts when there were e-mails in their inboxes.b | ChatGPT 4.0: - An online platform for creating, using, and managing a list of all the medications a patient was taking and had taken in the past - Pilot project: addition of a shared electronic medication plan system onto the existing eHealth platform; new collaborative model based on the patient-GP-pharmacy triad to make primary care medication management safer in cases involving polypharmacy - information exchange between the patient and an intervening health care professional |
| | 2B. What knowledge and/or support is required to use the technology? | | ChatGPT 4.0: No information is provided in the given paragraph |
| | 2C. What is the technology supply model? | | ChatGPT 4.0: - The tool is a communication tool for providers, particularly developed for primary care - Originally developed by a general practitioner - Providers receive training on its use, and patient permission is obtained to initiate a record |
| | 2D. What is said about current technology use and uptake? | | ChatGPT 4.0: - No information is provided in the given paragraph |
| 3. Value proposition | 3B.1. What is the technology's positive value proposition in terms of e.g., desirability, efficacy, safety, and cost effectiveness amongst others, excluding beneficial effects on staff? | Among those who did have concerns, concerns for both personal and patient privacy were top of the list (Fig. 3).; In this study, we found that after 1 year of use, the implementation of a private Facebook group for a large, multi-site FHT was ultimately not successful, despite cross-site communication having previously been identified to be in need of improvement by FHT members.; Use of the group depended heavily on the study team and was not sustained. Survey respondents expressed high levels of concern about both patient and provider privacy and viewed the Facebook group as unnecessary, adding extra time and work to their already busy days; Our survey results suggest that FHT members were inherently suspicious of Facebook, even with guides on privacy settings in place and the recommendation to create a separate profile.c | Researcher B: - No information is provided in the given paragraph |
| | 3B.2. What is the technology's negative value proposition in terms of e.g., desirability, efficacy, safety, and cost effectiveness amongst others, excluding beneficial effects on staff? | | Researcher B: - High levels of concern about both patient and provider privacy (using a Facebook group) - Facebook group was viewed as unnecessary, adding extra time and work to the already busy day of the participants - FHT members were inherently suspicious of Facebook, even with guides on privacy settings in place and the recommendation to create a separate profile |
| 4. Adopters | 4A. What changes in staff roles, practices and routines, and identities are implied? | 4.2. Professional tasks and responsibilities reconfigured: The informants reported changes in how much information they sent and received due to the simpler means of communication. We might say that more exchange of information is the structural effect of the introduction of e-messaging that, in turn, leads to unforeseen changes in tasks and responsibilities. Two nurses report how they use e-messages below: Nurse A: We answer questions from the GP. We write general information if they [patients] have an appointment with the GP, and if we think they won't remember to bring up everything that they should; then we send some kind of update in advance, so that all issues are addressed. Interviewer: Before the implementation of the e-messages system, did you provide this information via telephone? Nurse B: Yes, but not always, because it was so cumbersome via the telephone, so you would just skip it. I think e-messaging has made it easier. You can inform via a few keystrokes instead of waiting on the telephone. Nurse B points to the fact that she informs the GP more often after the introduction of e-messaging, for example, before a patient arrives for his/her appointment with the GP. It means that the GP may receive information that is more comprehensive and up-to-date, ensuring a better quality of service. Similarly, some of the nurses reported that they use e-messages to inform the GP that patients have been admitted to hospital, signalling that the GP should be prepared for changes in the patients' medications. GPs are also of the opinion that they send more messages. However, one highlighted a problematic aspect of communication becoming too easy: We send more messages. Earlier, we used to let them accumulate until Tuesdays [when they had meetings with homecare], and then some problems were solved at that time. Now we send them at once, right. The volume of the messages they [homecare] receive has increased as a result. And that might be a problem. Cause we often 'shoot from the hip'. We see a problem, and then we send it [a message], so we won't forget. Earlier, we would have thought, "can we try to solve it?", "can we try to find out?", "can we ask the patient to bring the pill dispenser?", can we ask for a "discharge note from the hospital?" It [the increased volume of messages] becomes a burden for homecare services. (GP municipality B) The homecare nurses did not cite message overload or messages not carefully thought out as frequent problems. Thus, there may be a discrepancy between what is comprehended as 'too much' from a GP's perspective and from a homecare nurse's perspective. It is also necessary to keep in mind that we are examining individual experiences. However, some nurses did talk about receiving too much/irrelevant information via e-message, but this was explained as a consequence of (poor) functionality in the e-message system, where some message types automatically imported extensive amounts of information from the EPR and attached it to the message. The interviews show agreement across the informants that more information is sent and received, indicating that e-message users have more knowledge of patients than do nonusers. Two unintended consequences are worth noting from the example quotes presented above: (1) e-messaging may become a structural means of change, allowing nurses to become more active organisers and facilitators of GP's work, which in turn can mean better follow-up of patients. (2) It may also be argued that e-messaging becomes a tool for getting things done and eliminating cognitive overload but in doing so unnecessarily transfers one's own work to another actor.; 4.3. Empowering nurses: Another issue associated with HIT that was not addressed prior to the introduction of e-messaging is the power balance between homecare nurses and GPs. Shifts in the balance of power may occur as a result of the new form of communication. Through the automatic documentation of communication provided by e-messaging, the nurses felt their requests had more weight. Two quotes illustrate this: First, it is documented every time we have made an inquiry. And they [GPs] can't ignore an e-message. When there is a message, they have to answer. So, that is very much an advantage. (nurse municipality A) We haven't been spoilt when it comes to having contact with the GP. It's like...we try to get hold of them, and they are supposed to call back, but it doesn't always happen. Sometimes it feels like we are not taken seriously. But when you have sent an e-message, then it is there [in their EPR system]. And they have to open it, and they have to send an answer. (nurse municipality A) When the nurses talked about what e-messages meant to them, they often described the positive effect of the automatic documentation on their dialogues with the GPs. Previously, messages for GPs were often mediated through their medical secretaries and could get 'lost', according to the nurses – meaning that the GP never called back. According to the nurses, requests documented in the EPR have more weight than a post-it note from a secretary, and they found that it resulted in a more conscientious follow-up from the GP. Furthermore, nurses used phrases such as "having their back covered" and "having documented that you have done your job". Such expressions illustrate how nurses use e-messages for attaining accountability and to fulfil the demands of the public, their colleagues, patients and next of kin. E-messaging might be viewed as a tool that empowers nurses if used consciously.d | ZenoChat: - Changes in staff roles, practices, and routines are implied by the introduction of e-messaging - More exchange of information is the structural effect of the introduction of e-messaging, which leads to unforeseen changes in tasks and responsibilities - Nurses use e-messages to inform the GP more often and with more comprehensive and up-to-date information - E-messaging becomes a structural means of change, allowing nurses to become more active organizers and facilitators of GP's work, which in turn can mean better follow-up of patients - E-messaging becomes a tool for getting things done and eliminating cognitive overload but in doing so unnecessarily transfers one's own work to another actor - The power balance between homecare nurses and GPs may shift as a result of the new form of communication - Through the automatic documentation of communication provided by e-messaging, nurses feel their requests have more weight - Requests documented in the EPR (Electronic Patient Record) have more weight than a post-it note from a secretary, resulting in a more conscientious follow-up from the GP - E-messaging might be viewed as a tool that empowers nurses if used consciously |
| | 4B. What is the technology's desirability for staff? (Including beneficial effects on staff?) | | ZenoChat: - The technology's desirability for staff is not explicitly mentioned in the given paragraph - However, e-messaging may become a tool that empowers nurses and allows them to become more active organizers and facilitators of GP's work, which could lead to improved job satisfaction |
| 5. Organization | 5A. What is the organization's readiness and/or capacity for technology-supported change? | In Australia, video consultations have been funded by Medicare (the universal health insurance system) since 2011; however, general practitioners are not rebated to deliver telehealth services, only to attend patients in person to support video consultations from medical specialists […] Overall, it was very challenging to make video consultations a routine method of delivering general medical services to long term care facilities. Existing communications infrastructure was usually not suitable, with bandwidth being the greatest barrier.e | Researcher B: - Video consultations were very challenging as a routine method of delivering general medical services to long term care facilities, as existing communications infrastructure was usually not suitable, amongst others due to bandwidth |
| | 5B. What is said about adoption and funding? | | Researcher B: - Video consultations have been funded by Medicare (the universal health insurance system) since 2011 - A problem with funding is that general practitioners are not rebated to deliver telehealth services, only to attend patients in person to support video consultations from medical specialists |
| | 5C. What work is involved in implementation and who will do it? | | Researcher B: - The paragraph does not provide specific information about the work involved in implementation or who will do it |
| 6. Wider system | 6A. What is the context for program rollout (e.g., political, economic, regulatory, professional (e.g., medicolegal), and sociocultural)? | The Norwegian primary health care is managed by the municipalities and includes home health care, with approximately 31,000 nurses and 4,000 GPs being contracted. These two services are organized separately, governed and financed differently, and also have incompatible EHR systems; Legislation does not permit them to share EHR systems, but they are obliged to exchange patient information when required. The nurses in home health care and GPs have many common patients, thus indicating that ensuring a good clinical communication is essential.f | Researcher A: - Home health nurses and GPs work in different systems and are financed and governed differently - Home health nurses and GPs have incompatible EHR systems - Legislation does not allow sharing of EHR systems |
| 7. Embedding and adaptation over time | 7A. How much scope is there for adapting and coevolving the technology and the service over time? | 5.1. Incorporating other diseases and social determinants: A comprehensive integrated care delivery system needs to respond to a broad range of diseases and risks relevant to population health. We have laid some groundwork for taking our present system beyond pregnancy and early childhood. In parallel, the team has initiated a non-communicable disease CHW follow-up program that supports patients detected at the hospital level, and we plan to initiate a surgical follow-up program in the coming year. As an entry point into the social determinants of health, we have also developed an approach to measure household expenditures on health, medical debt, and medical impoverishment, and are currently analyzing these data. Over time, we aim to incorporate these various streams of household-level health, socio-economics, and disease risk data into the comprehensive integrated care digital platform.g | ChatFlash: - The team has initiated a non-communicable disease CHW follow-up program - The team plans to initiate a surgical follow-up program in the coming year - They have developed an approach to measure household expenditures on health, medical debt, and medical impoverishment - The team aims to incorporate household-level health, socio-economics, and disease risk data into the comprehensive integrated care digital platform over time |
The codes are displayed as proposed by Greenhalgh et al. in their NASSS framework (Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies) [18]. The questions under each code were taken from the NASSS framework and partially adapted. A coded text passage under each code is given as an example together with an answer provided by one of the four chatbots or a human researcher.
References for the example text passages:
a Bugnon B, Geissbuhler A, Bischoff T, et al. Improving Primary Care Medication Processes by Using Shared Electronic Medication Plans in Switzerland: Lessons Learned From a Participatory Action Research Study. JMIR Form Res 2021; 5: e22319
b de Jong CC, Ros WJG, van Leeuwen M, et al. Professionals' Use of a Multidisciplinary Communication Tool for Patients With Dementia in Primary Care. Comput Inform Nurs 2018; 36: 193–198
c Lofters AK, Slater MB, Nicholas Angl E, et al. Facebook as a tool for communication, collaboration, and informal knowledge exchange among members of a multisite family health team. J Multidiscip Healthc 2016; 9: 29–34
d Melby L, Hellesø R. Introducing electronic messaging in Norwegian healthcare: unintended consequences for interprofessional collaboration. Int J Med Inform 2014; 83: 343–353
e Wade V, Whittaker F, Hamlyn J. An evaluation of the benefits and challenges of video consulting between general practitioners and residential aged care facilities. J Telemed Telecare 2015; 21: 490–493
f Lyngstad M, Melby L, Grimsmo A, et al. Toward Increased Patient Safety? Electronic Communication of Medication Information Between Nurses in Home Health Care and General Practitioners. Home Health Care Management & Practice 2013; 25: 203–211
g Citrin D, Thapa P, Nirola I, et al. Developing and deploying a community healthcare worker-driven, digitally-enabled integrated care system for municipalities in rural Nepal. Healthc (Amst) 2018; 6: 197–204
Selection of chatbots and answer generation
We compared four chatbots: ZenoChat (TextCortex AI), ChatGPT 3.5, ChatGPT 4.0 (OpenAI) and the free version of ChatFlash (neuroflash GmbH). ChatFlash, ChatGPT 3.5 and ChatGPT 4.0 are based on GPT (Generative Pre-trained Transformer). Depending on the settings, ZenoChat uses either GPT 4.0, Sophos-2 or Mixtral as its underlying model. As even small changes can affect a language model's output, we included both ChatGPT 3.5 and 4.0, based on GPT 3.5 and 4, respectively, as well as the ZenoChat version based on Sophos-2 [16]. To ensure a standardised process, all chatbots were presented with the same prompt for the same pre-coded text: “Use the following paragraph to extract in academic style in bullet points the answers – if any answers are provided – to the following questions:”, followed by one to four questions pertaining to the specific code. No limit was set for the word count, as texts were heterogeneous in complexity and information density. The prompt was derived in an iterative process in which different prompt designs were tested. The answer generation for all chatbots was conducted in November 2023.
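For illustration, the sketch below shows how such a standardised prompt could be issued programmatically. It assumes the OpenAI Python SDK (v1.x); the model name, message structure and helper function are illustrative, and the study does not report whether the chatbots' web interfaces or an API were used.

```python
# Minimal sketch of the standardised answer-generation step, assuming the
# OpenAI Python SDK (v1.x). Model name and helper function are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Use the following paragraph to extract in academic style in bullet points "
    "the answers – if any answers are provided – to the following questions:"
)

def generate_answer(passage: str, questions: list[str], model: str = "gpt-3.5-turbo") -> str:
    """Send the fixed prompt, the code-specific questions and the pre-coded
    text passage to a chat model and return the generated answer."""
    user_message = PROMPT + "\n" + "\n".join(questions) + "\n\nParagraph:\n" + passage
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.choices[0].message.content
```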
Randomization and survey
We used digitally supported randomization in Microsoft Excel to select three text passages per code (two for the code “the evolution and adaptation over time”, for which only two passages were available), resulting in a total of 20 text passages that served as the basis of our survey. The selected text passages stemmed from the following articles: the condition [19–21], the technology [21–23], the value proposition [24–26], the adopters [22, 27, 28], the organization [27, 29, 30], the wider system [31–33] and the evolution and adaptation over time [30, 33]. Each text passage had four chatbot-written responses and two written by humans. A survey was designed to evaluate the responses, applying a consistent evaluation framework to each question: the length of the response was measured on a 1–3 Likert scale (1 – too short, 2 – appropriate length, 3 – too long). Completeness and correctness were likewise measured on 1–3 Likert scales (1 – complete / correct, 2 – partially complete / partially correct, 3 – significant part(s) missing / content displayed incorrectly). Three further questions evaluated the correct identification of the context (correct/incorrect), whether the answer added new content (yes/no), and whether it included interpretation beyond the original text (yes/no). The surveys were set up as Google Forms and respondents could provide open-ended feedback in a free-text field on both the response and the original text passage (Appendix 2). We used word counts to measure the length of the responses and calculated the word ratio by dividing the answer's word count by that of the original text.
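As a minimal sketch of this step, the snippet below illustrates random passage selection per code and the word-ratio measure. The data structure, tokenisation rule and function names are assumptions for illustration; the study's randomization was performed in Microsoft Excel and its exact word-counting rules are not reported.

```python
# Illustrative sketch of passage selection and the word-ratio measure.
import random

def select_passages(passages_by_code: dict[str, list[str]], k: int = 3) -> dict[str, list[str]]:
    """Randomly draw up to k coded text passages per code."""
    return {
        code: random.sample(passages, min(k, len(passages)))
        for code, passages in passages_by_code.items()
    }

def word_ratio(answer: str, original: str) -> float:
    """Length of an answer relative to the original passage (whitespace tokens)."""
    return len(answer.split()) / len(original.split())
```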
Six independent survey participants with a background in social sciences and differing professional expertise ranging from nursing to physiotherapy (henceforth: raters) were recruited from a department in the Bavarian Research Center for Digital Health and Social Care. These individuals were not involved in deriving answers but were familiar with the scoping review and the aim of the research. Prior to reviewing the text passages, they participated in a training session in which all criteria were explained at length to align their understanding and minimize subjective interpretation.
KN assigned each rater to a text passage with the corresponding answers to each question using random sequence allocation. Raters were blinded to the identity of the author of each answer (i.e., human researcher or specific chatbot model). The blinding was maintained until all raters had finished the evaluation. In total, 120 text passages were reviewed; each text passage was presented to three raters, and each rater assessed the same number of text passages in the same group constellation (e.g., raters A, B, C). Raters were not shown the ratings of other raters. To minimise recognition bias, we standardized the formatting of answers and omitted introductory phrases such as “The provided paragraph contains information to answer the questions related to the introduction of the technology and the specific disease area:”. Figure 1 summarises the research process and Table 1 provides an overview of the codes and examples of responses provided by the chatbots and human researchers.
Fig. 1.
Methodological process, enriched by comparison of the four chatbots used to answer the questions
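A minimal sketch of the blinded presentation described above is given below; it assumes each answer set is stored as a mapping from author label to answer text, and the shuffling and label-stripping shown are illustrative rather than the study's exact implementation.

```python
# Sketch of blinding: strip author labels and shuffle answer order so raters
# cannot infer whether an answer came from a researcher or a specific chatbot.
import random

def blind_answers(answers_by_author: dict[str, str]) -> list[str]:
    """Return the answers to one text passage in random order, without labels."""
    texts = list(answers_by_author.values())  # drop keys such as 'Researcher A' or 'ZenoChat'
    random.shuffle(texts)
    return texts
```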
Data analysis
Quantitative data obtained from the surveys were exported to Microsoft Excel (Microsoft Office, Professional Plus 2021) and analysed using descriptive and comparative statistical methods in Stata (Version 16). To identify statistically significant differences, we used the non-parametric Kruskal–Wallis test at a significance level of p < 0.05 1) over the entire dataset, 2) to compare responses from chatbots and humans, and 3) between each pair of answer-generating methods (e.g., Researcher A vs. ChatFlash, or between two chatbots). Interrater reliability was calculated using Cohen's Kappa (κ) for items with two response possibilities and a weighted Cohen's Kappa for items with three response possibilities, with weights equally distributed, i.e., 0.5 if the ratings differed by one step (e.g., Rater A = 1 and Rater B = 2) and 1 if the ratings were maximally different (e.g., Rater A = 1 and Rater B = 3). We followed Cohen's interpretation of Kappa, with values ≤ 0 being interpreted as no agreement, 0.01–0.20 as none to slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial and 0.81–1.00 as almost perfect agreement [34]. We further performed correlational analysis using Spearman's rho (ρ) to elucidate relationships between the individual variables. Qualitative data were transferred to a Microsoft Excel document and analysed inductively.
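The snippet below sketches the three statistical procedures described above (Kruskal–Wallis test, weighted Cohen's kappa, Spearman's rho), assuming the ratings are available as simple Python lists. The example data are invented, and the original analysis was run in Stata 16, so this illustrates the methods rather than reproducing the study's code.

```python
# Illustration of the statistical methods described above, using invented ratings.
from scipy.stats import kruskal, spearmanr
from sklearn.metrics import cohen_kappa_score

# Kruskal-Wallis test comparing, e.g., correctness ratings (1-3) of chatbot vs. human answers
chatbot_ratings = [1, 1, 2, 1, 3, 1, 2, 1]
human_ratings = [1, 2, 2, 1, 1, 3, 2, 2]
h_stat, p_value = kruskal(chatbot_ratings, human_ratings)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_value:.3f}")

# Weighted Cohen's kappa for two raters on a three-point item; 'linear' weights
# penalise a one-step disagreement half as much as a two-step disagreement,
# matching the weighting scheme described in the text.
rater_a = [1, 2, 3, 1, 2, 2, 3, 1]
rater_b = [1, 3, 3, 2, 2, 1, 3, 1]
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"Weighted Cohen's kappa: {kappa:.2f}")

# Spearman's rho between two questionnaire categories, e.g., correctness and completeness
rho, p_rho = spearmanr([1, 1, 2, 3, 2, 1, 1, 2], [1, 2, 2, 3, 3, 1, 1, 2])
print(f"Spearman's rho: {rho:.2f} (p = {p_rho:.3f})")
```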
Results
Raters had a fair Cohen's kappa inter-rater reliability of 0.30 (0.12–0.51), with a standard error of 0.11 (0.07–0.12). Cohen's kappa for context was lower (κ = 0.18 ± 0.09) than for the other variables, which ranged from 0.27 ± 0.11 for correctness to 0.39 ± 0.10 for length. The following results report statistically significant differences unless indicated otherwise.
Performance
Across the dataset, the correctness of the responses was rated similarly, while all other categories revealed statistically different results (Table 2). In general, the answers provided by chatbots were perceived to demonstrate a better recognition of the context (chatbot: 92.42% vs. human: 84.85%) and were also longer than those of humans, with a mean word ratio (answer length relative to the original text) of 0.45 ± 0.50 compared with 0.21 ± 0.26 for humans. Furthermore, the answers by chatbots were perceived to be more complete than those by humans (chatbot: 79.73% vs. human: 52.65%). Conversely, human answers were perceived to be superior in the absence of interpretation (human: 97.35% vs. chatbot: 81.44%) and of added material not included in the original text (human: 97.73% vs. chatbot: 81.82%). Across all chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash, with ChatGPT 3.5 and ChatGPT 4.0 tying for third with no statistical differences between their responses. An overview of the Kruskal–Wallis tests between individual chatbots and humans is available in the supplementary material (Appendix 3).
Table 2.
Overview of survey results
| Category | Value | Overall | Chatbot (total) | Human (total) | p-value Chatbot vs. Human | ChatFlash | ChatGPT 3.5 | ChatGPT 4.0 | ZenoChat | Researcher A | Researcher B | p-value overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Addition | yes | 102 | 96 (18.18%) | 6 (2.27%) | < 0.001* | 16 | 39 | 30 | 11 | 6 | 0 | < 0.001* |
| | no | 690 | 432 (81.82%) | 258 (97.73%) | | 116 | 93 | 102 | 121 | 126 | 132 | |
| Completeness | complete | 560 | 421 (79.73%) | 139 (52.65%) | < 0.001* | 108 | 108 | 104 | 101 | 65 | 74 | < 0.001* |
| | partial | 131 | 67 (12.69%) | 64 (24.24%) | | 17 | 18 | 14 | 18 | 34 | 30 | |
| | incomplete | 101 | 40 (7.58%) | 61 (23.11%) | | 7 | 6 | 14 | 13 | 33 | 28 | |
| Context | correct | 712 | 488 (92.42%) | 224 (84.85%) | 0.001* | 125 | 116 | 123 | 124 | 110 | 114 | 0.007* |
| | incorrect | 80 | 40 (7.58%) | 40 (15.15%) | | 7 | 16 | 9 | 8 | 22 | 18 | |
| Correctness | correct | 574 | 390 (73.86%) | 184 (69.70%) | 0.116 | 102 | 90 | 90 | 108 | 91 | 93 | 0.051 |
| | partial | 119 | 83 (15.72%) | 36 (13.64%) | | 20 | 25 | 26 | 12 | 16 | 20 | |
| | incorrect | 99 | 55 (10.42%) | 44 (16.67%) | | 10 | 17 | 16 | 12 | 25 | 19 | |
| Interpretation | yes | 105 | 98 (18.56%) | 7 (2.65%) | < 0.001* | 18 | 30 | 43 | 7 | 5 | 2 | < 0.001* |
| | no | 687 | 430 (81.44%) | 257 (97.35%) | | 114 | 102 | 89 | 125 | 127 | 130 | |
| Length | too short | 93 | 19 (3.60%) | 74 (28.03%) | < 0.001* | 1 | 3 | 4 | 11 | 38 | 36 | < 0.001* |
| | perfect | 537 | 363 (68.75%) | 174 (65.91%) | | 104 | 84 | 86 | 89 | 89 | 85 | |
| | too long | 162 | 146 (27.65%) | 16 (6.06%) | | 27 | 45 | 42 | 32 | 5 | 11 | |
Significant differences between groups are indicated by *. Results are presented as absolute values unless indicated otherwise; percentages are calculated within the Chatbot and Human columns
Length
Responses deemed too long had a word ratio (answer length relative to the original text) of 0.84 ± 0.65, compared to 0.27 ± 0.28 for responses of adequate length and 0.10 ± 0.11 for those deemed too short. ChatGPT 3.5's responses were longer (mean word ratio: 0.58 ± 0.70) than ChatFlash's (mean word ratio: 0.34 ± 0.36) and ZenoChat's (mean word ratio: 0.38 ± 0.37). In contrast, human responses were shorter, with word ratios of 0.16 ± 0.14 (Researcher A) and 0.25 ± 0.34 (Researcher B). Raters commented in free-text fields that they marked an answer as too long if it contained redundant, irrelevant or overly detailed information. Conversely, when responses were rated incomplete, raters sometimes noted the expected content in the free-text fields.
Completeness
The completeness of the chatbots' answers did not differ significantly between them, with comparable completeness scores ranging from 76.52% (ZenoChat) to 81.82% (ChatFlash and ChatGPT 3.5). However, all chatbots' answers were deemed more complete than those by the human researchers, whose completeness scores were 49.24% (Researcher A) and 56.06% (Researcher B).
Correctness
ZenoChat's answers were assessed as being more correct (81.82%) than those by ChatGPT 3.5 (68.18%) and ChatGPT 4.0 (68.18%) or those by the human researchers (A: 68.94% and B: 70.45%). ChatFlash's answers (77.27%) were not evaluated significantly differently from those of either the other chatbots or the humans.
Context
ChatFlash's responses were perceived to show a better understanding of the context than those of ChatGPT 3.5 (94.70% vs. 87.88% correct). Researcher B's responses (86.26% correct) were perceived to show a poorer understanding of the context than those of ZenoChat (93.94% correct) and ChatFlash (94.70% correct), with Researcher A's responses (83.33% correct) additionally being inferior to ChatGPT 4.0 (93.18% correct) in this respect.
Addition
ChatGPT 3.5 and ChatGPT 4.0 were evaluated as containing more addition than ZenoChat and ChatFlash, with 29.55% and 22.72% of answers containing an addition vs. 8.33% and 12.12%, respectively. Researcher A showed a slightly higher percentage of addition (4.55%) than Researcher B (0.00%). Researcher A's responses were nonetheless perceived to contain less addition than those of ChatGPT 3.5, ChatGPT 4.0 and ChatFlash; Researcher B's responses additionally contained less addition than ZenoChat's.
Interpretation
The responses by ChatGPT 4.0 were evaluated as containing more interpretation than those of ZenoChat and ChatFlash, with 32.58% of answers containing an interpretation vs. 5.30% and 13.66%, respectively. ZenoChat also contained fewer interpretations than ChatFlash and was the only chatbot that did not provide more interpretation in its answers than the human researchers (A: 5.30%, B: 3.79%). In the free-text fields, raters highlighted sentences they perceived as containing interpretation. Interestingly, many of these included words such as ‘potentially’, ‘suggesting’, ‘pointing to’, ‘could’, ‘appears to be’, ‘indicating’, ‘may reflect’ and ‘may lead’. Some raters also acknowledged that a certain degree of interpretation was necessary to correctly answer the question. This ranged from recognizing abbreviations – such as interpreting ‘MOH’ as ‘Ministry of Health’ – to demonstrating broader contextual understanding, as illustrated by the comment ‘The summary is not explicitly stated in the text. However, it is correct, but is only achieved through interpretation’ (Rater A). The same rater mentioned difficulty in ‘answer[ing] the question, because it almost requires an interpretation to grasp the right aspects’ (Rater A), and another noted that ‘the system therefore already needs to perform semantic or interpretative tasks in order to answer … [the] question’ (Rater B).
Correlational analysis
Correlational analysis between the variables revealed a moderate positive correlation between correctness and completeness (ρ = 0.63), between correctness and context (ρ = 0.56), and between length and word ratio (ρ = 0.56) (Table 3). The correlation between the variables in the human responses was higher than that for chatbots with regard to correctness and completeness (ρ = 0.71 vs. ρ = 0.60) and correctness and context (ρ = 0.72 vs. ρ = 0.44). There was a low positive correlation between context and completeness (ρ = 0.46), as well as between interpretation and addition (ρ = 0.35). A low negative correlation was found between completeness and length (ρ = −0.33), i.e., the shorter the answer, the more incomplete it was, as well as between correctness and addition (ρ = −0.35). In the human responses, there was a low negative correlation between length and correctness (ρ = −0.34), indicating a loss of correctness with shorter answers, and a moderate negative correlation between length and completeness (ρ = −0.66), indicating a loss of completeness with shorter answers. The answers by chatbots demonstrated a low negative correlation between addition and correctness, indicating a loss of correctness through the addition of new content.
Table 3.
Correlation of questionnaire categories
| | Addition | Completeness | Context | Correctness | Interpretation | Length | Ratio |
|---|---|---|---|---|---|---|---|
| Addition | 1.00 | ||||||
| Completeness | −0.04 | 1.00 | |||||
| Context | −0.12* | 0.46* | 1.00 | ||||
| Correctness | −0.35* | 0.63* | 0.56* | 1.00 | |||
| Interpretation | 0.35* | 0.05 | 0.07 | −0.20* | 1.00 | ||
| Length | −0.25* | −0.33* | −0.12* | −0.02 | −0.22* | 1.00 | |
| Ratio | −0.27* | −0.22* | −0.11* | −0.02 | −0.17* | 0.56* | 1.00 |
Spearman’s rho (ρ): correlation of questionnaire categories. |ρ| < 0.3 denotes no/little correlation, 0.3 ≤ |ρ| < 0.5 low correlation, 0.5 ≤ |ρ| < 0.7 moderate correlation and |ρ| ≥ 0.7 strong correlation [35]. Entries marked with an * refer to p-values < 0.05
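A correlation matrix of this form can be sketched with pandas, assuming the per-answer ratings are numerically coded in a DataFrame; the column names and example values below are invented for illustration and do not reproduce the study's data.

```python
# Sketch of computing a Spearman correlation matrix like Table 3; values are invented.
import pandas as pd

ratings = pd.DataFrame({
    "Addition": [0, 1, 0, 0, 1, 0],
    "Completeness": [1, 2, 1, 3, 2, 1],
    "Context": [1, 0, 1, 1, 1, 1],
    "Correctness": [1, 2, 1, 3, 2, 1],
    "Interpretation": [0, 1, 0, 0, 1, 0],
    "Length": [2, 3, 2, 1, 3, 2],
    "Ratio": [0.3, 0.8, 0.25, 0.1, 0.9, 0.2],
})

spearman_matrix = ratings.corr(method="spearman")
print(spearman_matrix.round(2))
```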
Discussion
The chatbots’ responses were regarded as better in recognising context and providing more complete, albeit longer, summaries, while humans were viewed as less prone to adding to or interpreting the material. Among the chatbots, ZenoChat provided the best-rated answers, followed by ChatFlash in second place, with ChatGPT 3.5 and ChatGPT 4.0 equally in third place. Statistical analysis indicated a positive correlation between correct contextualisation and completeness, and between correct contextualisation and correctness of the response. Qualitative feedback highlighted that longer answers often contained redundant information and raised the question of what role interpretation plays in effectively answering questions in evidence syntheses.
Indeed, correct contextualisation and the absence of addition and interpretation in the answer were important underlying factors in our research set-up. While Hamilton et al. found ChatGPT to have limited contextual understanding, our study found a clearer understanding of the context by the chatbots than displayed in the answers of the human researchers [14]. The distinction between recognising context – for example, that ‘EHR’ stands for ‘electronic health record’ – and interpreting content to answer a question is subtle and complex. This is closely linked to the debate over what constitutes the addition of new content versus the interpretation of the original text’s content.
To enhance the accuracy and relevance of chatbots’ responses, precise and specific instructions as prompts are essential [7, 12, 36]. In our study, chatbots tended to provide more extensive answers than humans, likely because they lacked an implicit understanding of the exercise: while humans aimed for brevity to expedite subsequent analysis, the prompt did not give the chatbots any guidance on response length. Since chatbots are designed to assume user expectations from the given prompt, they tend to comply with instructions instead of seeking clarification and responding according to their own skills and limitations [4, 37].
Recent research has demonstrated the efficacy of chatbots such as ChatGPT in generating discharge summaries and summaries of patients’ medical histories, and in summarising the scientific literature, achieving levels of quality and accuracy comparable to or exceeding traditional methods [15, 38, 39]. This highlights their ability to distil main ideas from a given text, mirroring the findings of our study. However, these studies did not compare the capacities of different chatbots. Some studies compared chatbots’ ability to answer complex medical examination questions; in these, ChatGPT 4.0 responded more accurately and concisely than Bard (Google), albeit scoring worse than the medical reference group [1, 40–42]. This suggests that chatbots’ ability to answer questions, as observed in our study, is contingent on the availability of an accurate reference text. Given this limitation, chatbots can be implemented in scenarios requiring idea extraction from a given resource, such as content analysis for qualitative research. In addition, the performance of chatbots in the context of evidence syntheses depends not only on adequate prompting, but also on the training and parameters of the respective chatbots, as is visible in the variation of results between the different chatbots. Further investigation of this aspect is needed.
This study provides an example of how AI-powered chatbots can enhance and accelerate the research process and successfully support humans in conducting research endeavours such as reviews. However, the authors – in concordance with other studies – highlight that human oversight and correction remain essential [8, 14, 36]. While recognising chatbots’ potential, it is imperative to be aware of their shortcomings, some of the most important being the potential inclusion of biases, non-disclosure of training data, incorrect information and the possibility of nonsensical responses [1, 8, 9, 13, 38, 43–46].
A major strength of this study is the comparison between different chatbots combined with an evaluation of their performance against human researchers, as most other studies only compare chatbots against one another. Despite this strength, it is essential to acknowledge the study’s constraints. Firstly, we selected four chatbots from a rapidly growing array of chatbots built on different underlying large language models. Secondly, caution is advised in generalising the findings of this study, as the text passages used for eliciting the answers were preselected by two human researchers for their topical relevance; eliciting responses from the full text might yield different outcomes. Lastly, we used a custom-developed metric to assess performance, which has not undergone formal validation.
Future research should aim to refine the prompt so that it better conveys the implicit human understanding of the context and the specific objective of the chatbot’s use, thereby improving the quality of the chatbots’ performance [7, 8]. This entails researchers assessing their underlying assumptions and intentions, clearly defining the criteria for evaluating responses, and employing prompt engineering methods to refine the prompt [7]. As the underlying large language models of chatbots are quickly evolving, longitudinal studies are crucial to offer insights into how chatbots’ capabilities and performance are changing. In line with Hamilton et al.’s recommendations, we also recommend assessing the chatbots’ capabilities to answer the questions when receiving the full text, instead of the curated, human-determined important parts [14]. This would not only assess the chatbots’ abilities, but might also assist in identifying additional information that escaped human judgment [9].
Conclusion
Our study demonstrates chatbots’ ability to provide complete and correct answers to questions about a given text, which may make them a useful tool to accelerate research processes, in particular qualitative evidence synthesis. Given the speed of chatbot development and fine-tuning, the successful applications of chatbots to facilitate research will very likely continue to expand.
Supplementary Information
Supplementary Material 1: Appendix 1. In-detail information about the articles included in the scoping review serving as a base for this study
Supplementary Material 2: Appendix 2. Survey
Supplementary Material 3: Appendix 3. Results of Kruskal-Wallis test
Acknowledgements
The authors would like to extend their gratitude to Setareh Rabbani and Julia Schulze Pröbsting for their participation in the study.
Declaration of generative AI in scientific writing
During the preparation of this work, the authors used GPT 4.0 in order to improve readability and language. After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility for the content of the publication.
Abbreviations
- AI
Artificial intelligence
- EHR
Electronic health record
- GPT
Generative Pre-trained Transformer
- NASSS
Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies
Authors’ contributions
KN: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Data Curation, Visualization, Project administration, Writing—Original Draft. SS: Methodology, Formal analysis, Resources, Visualization, Writing—Review & Editing. MSt: Resources, Writing—Review & Editing. JA: Resources, Writing—Review & Editing. MCR: Resources, Writing—Review & Editing. MSc: Methodology, Resources, Writing—Review & Editing. FF: Conceptualization, Methodology, Resources, Writing—Review & Editing, Supervision.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Declarations
Ethics approval and consent to participate
As the study did not involve sensitive data, no ethical clearance was necessary. Study participants were researchers acting as experts. They either provided answers to the questions or judged the answers. The respondents of the survey were able to drop out at any time without negative consequences.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Patil NS, Huang RS, van der Pol CB, Larocque N. Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment. Can Assoc Radiol J. 2023;0:1–7. 10.1177/08465371231193716. [DOI] [PubMed] [Google Scholar]
- 2.Alshami A, Elsayed M, Ali E, Eltoukhy AEE, Zayed T. Harnessing the Power of ChatGPT for Automating Systematic Review Process: Methodology, Case Study, Limitations, and Future Directions. Systems. 2023;11:351. 10.3390/systems11070351. [Google Scholar]
- 3.Khlaif ZN, Mousa A, Hattab MK, Itmazi J, Hassan AA, Sanmugam M, Ayyoub A. The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation. JMIR Med Educ. 2023;9: e47049. 10.2196/47049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Meyer JG, Urbanowicz RJ, Martin PCN, O’Connor K, Li R, Peng P-C, et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min. 2023;16:20. 10.1186/s13040-023-00339-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Chappell M, Edwards M, Watkins D, Marshall C, Graziadio S. Machine learning for accelerating screening in evidence reviews. Cochrane Evidence Synthesis and Methods. 2023. 10.1002/cesm.12021. [Google Scholar]
- 6.van de Schoot R, de Bruin J, Schram R, Zahedi P, de Boer J, Weijdema F, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell. 2021;3:125–33. 10.1038/s42256-020-00287-7. [Google Scholar]
- 7.Zhang H, Wu C, Xie J, Lyu Y, Cai J, Carroll JM. Redefining Qualitative Analysis in the AI Era: Utilizing ChatGPT for Efficient Thematic Analysis; 19.09.2023.
- 8.Huang Y-M, Rocha T, editors. Innovative Technologies and Learning: 6th International Conference, ICITL 2023 Porto, Portugal, August 28–30, 2023 Proceedings. Cham: Springer Nature; 2023.
- 9.Christou P. A Critical Perspective Over Whether and How to Acknowledge the Use of Artificial Intelligence (AI) in Qualitative Studies. TQR. 2023;28:1981–91. 10.46743/2160-3715/2023.6407. [Google Scholar]
- 10.de la Torre-López J, Ramírez A, Romero JR. Artificial intelligence to automate the systematic review of scientific literature. Computing. 2023;105:2171–94. 10.1007/s00607-023-01181-x. [Google Scholar]
- 11.Sampson M, Shojania KG, Garritty C, Horsley T, Ocampo M, Moher D. Systematic reviews can be produced and published faster. J Clin Epidemiol. 2008;61:531–6. 10.1016/j.jclinepi.2008.02.004. [DOI] [PubMed] [Google Scholar]
- 12.Mahuli SA, Rai A, Mahuli AV, Kumar A. Application ChatGPT in conducting systematic reviews and meta-analyses. Br Dent J. 2023;235:90–2. 10.1038/s41415-023-6132-y. [DOI] [PubMed] [Google Scholar]
- 13.Morgan DL. Exploring the Use of Artificial Intelligence for Qualitative Data Analysis: The Case of ChatGPT. Int J Qual Methods. 2023;22:1–10. 10.1177/16094069231211248. [Google Scholar]
- 14.Hamilton L, Elliott D, Quick A, Smith S, Choplin V. Exploring the Use of AI in Qualitative Analysis: A Comparative Study of Guaranteed Income Data. Int J Qual Methods. 2023;22:1–13. 10.1177/16094069231201504. [Google Scholar]
- 15.Michalowski M, Abidi SSR, Abidi S, editors. Artificial Intelligence in Medicine: 20th International Conference on Artificial Intelligence in Medicine, AIME 2022 Halifax, NS, Canada, June 14–17, 2022 Proceedings. Cham: Springer International Publishing; 2022.
- 16.Chen L, Zaharia M, Zou J. How is ChatGPT's behavior changing over time?. arXiv. 2023. 10.48550/arXiv.2307.09009.
- 17.Nordmann K, Sauter S, Möbius-Lerch P, Redlich M-C, Schaller M, Fischer F. Conceptualizing Interprofessional Digital Communication and Collaboration in Health Care: Protocol for a Scoping Review. JMIR Res Protoc. 2023;12: e45179. 10.2196/45179. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Greenhalgh T, Wherton J, Papoutsi C, Lynch J, Hughes G, A’Court C, et al. Beyond Adoption: A New Framework for Theorizing and Evaluating Nonadoption, Abandonment, and Challenges to the Scale-Up, Spread, and Sustainability of Health and Care Technologies. J Med Internet Res. 2017;19: e367. 10.2196/jmir.8775. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Overstreet K, Derouin A. Improving Interprofessional Collaboration Between Behavioral Mental Health and Pediatric Primary Care Providers Through Standardized Communication. J Pediatr Health Care. 2022;36:582–8. 10.1016/j.pedhc.2022.07.001. [DOI] [PubMed] [Google Scholar]
- 20.Legault F, Humbert J, Amos S, Hogg W, Ward N, Dahrouge S, Ziebell L. Difficulties encountered in collaborative care: logistics trumps desire. J Am Board Fam Med. 2012;25:168–76. 10.3122/jabfm.2012.02.110153. [DOI] [PubMed] [Google Scholar]
- 21.Bugnon B, Geissbuhler A, Bischoff T, Bonnabry P, von Plessen C. Improving Primary Care Medication Processes by Using Shared Electronic Medication Plans in Switzerland: Lessons Learned From a Participatory Action Research Study. JMIR Form Res. 2021;5: e22319. 10.2196/22319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Melby L, Hellesø R. Introducing electronic messaging in Norwegian healthcare: unintended consequences for interprofessional collaboration. Int J Med Inform. 2014;83:343–53. 10.1016/j.ijmedinf.2014.02.001. [DOI] [PubMed] [Google Scholar]
- 23.de Jong CC, Ros WJG, van Leeuwen M, Witkamp L, Schrijvers G. Professionals’ Use of a Multidisciplinary Communication Tool for Patients With Dementia in Primary Care. Comput Inform Nurs. 2018;36:193–8. 10.1097/CIN.0000000000000414. [DOI] [PubMed] [Google Scholar]
- 24.Lofters AK, Slater MB, Nicholas Angl E, Leung F-H. Facebook as a tool for communication, collaboration, and informal knowledge exchange among members of a multisite family health team. J Multidiscip Healthc. 2016;9:29–34. 10.2147/JMDH.S94676. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Knop M, Mueller M, Niehaves B. Investigating the Use of Telemedicine for Digitally Mediated Delegation in Team-Based Primary Care: Mixed Methods Study. J Med Internet Res. 2021;23: e28151. 10.2196/28151. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Howard J, Clark EC, Friedman A, Crosson JC, Pellerano M, Crabtree BF, et al. Electronic health record impact on work burden in small, unaffiliated, community-based primary care practices. J Gen Intern Med. 2013;28:107–13. 10.1007/s11606-012-2192-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.O’Malley AS, Draper K, Gourevitch R, Cross DA, Scholle SH. Electronic health records and support for primary care teamwork. J Am Med Inform Assoc. 2015;22:426–34. 10.1093/jamia/ocu029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Alanazi B, Butler-Henderson K, Alanazi MR. The Role of Electronic Health Records in Improving Communication Between Health Professionals in Primary Healthcare Centres in Riyadh: Perception of Health Professionals. Stud Health Technol Inform. 2019;264:499–503. 10.3233/SHTI190272. [DOI] [PubMed] [Google Scholar]
- 29.Wade V, Whittaker F, Hamlyn J. An evaluation of the benefits and challenges of video consulting between general practitioners and residential aged care facilities. J Telemed Telecare. 2015;21:490–3. 10.1177/1357633X15611771. [DOI] [PubMed] [Google Scholar]
- 30.Renfro CP, Ferreri S, Barber TG, Foley S. Development of a Communication Strategy to Increase Interprofessional Collaboration in the Outpatient Setting. Pharmacy. 2018. 10.3390/pharmacy6010004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Mastellos N, Car J, Majeed A, Aylin P. Using information to deliver safer care: a mixed-methods study exploring general practitioners’ information needs in North West London primary care. J Innov Health Inform. 2014;22:207–13. 10.14236/jhi.v22i1.77. [DOI] [PubMed] [Google Scholar]
- 32.Lyngstad M, Melby L, Grimsmo A, Hellesø R. Toward Increased Patient Safety? Electronic Communication of Medication Information Between Nurses in Home Health Care and General Practitioners. Home Health Care Manag Pract. 2013;25:203–11. 10.1177/1084822313480365. [Google Scholar]
- 33.Citrin D, Thapa P, Nirola I, Pandey S, Kunwar LB, Tenpa J, et al. Developing and deploying a community healthcare worker-driven, digitally- enabled integrated care system for municipalities in rural Nepal. Healthc. 2018;6:197–204. 10.1016/j.hjdsi.2018.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.McHugh ML. Interrater reliability: the kappa statistic. Biochem Med. 2012;22:276–82. [PMC free article] [PubMed] [Google Scholar]
- 35.Rovetta A. Raiders of the Lost Correlation: A Guide on Using Pearson and Spearman Coefficients to Detect Hidden Correlations in Medical Sciences. Cureus. 2020;12: e11794. 10.7759/cureus.11794. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Qureshi R, Shaughnessy D, Gill KAR, Robinson KA, Li T, Agai E. Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? Syst Rev. 2023;12:72. 10.1186/s13643-023-02243-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Shen Y, Heacock L, Elias J, Hentel KD, Reig B, Shih G, Moy L. ChatGPT and Other Large Language Models Are Double-edged Swords. Radiology. 2023;307: e230163. 10.1148/radiol.230163. [DOI] [PubMed] [Google Scholar]
- 38.Clough RAJ, Sparkes WA, Clough OT, Sykes JT, Steventon AT, King K. Transforming healthcare documentation: harnessing the potential of AI to generate discharge summaries. BJGP Open. 2024. 10.3399/BJGPO.2023.0116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Scott D, Hallett C, Fettiplace R. Data-to-text summarisation of patient records: using computer-generated summaries to access patient histories. Patient Educ Couns. 2013;92:153–9. 10.1016/j.pec.2013.04.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun C-H, Lam JSH, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770. 10.1016/j.ebiom.2023.104770. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Buhr CR, Smith H, Huppertz T, Bahr-Hamm K, Matthias C, Blaikie A, et al. ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions. JMIR Med Educ. 2023;9: e49183. 10.2196/49183. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Massey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J Am Acad Orthop Surg. 2023;31:1173–9. 10.5435/JAAOS-D-23-00396. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Christou P. Ηow to Use Artificial Intelligence (AI) as a Resource, Methodological and Analysis Tool in Qualitative Research? TQR. 2023;28:1968–80. 10.46743/2160-3715/2023.6406. [Google Scholar]
- 44.Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023;5:e107–8. 10.1016/S2589-7500(23)00021-3. [DOI] [PubMed] [Google Scholar]
- 45.Naumova EN. A mistake-find exercise: a teacher’s tool to engage with information innovations, ChatGPT, and their analogs. J Public Health Policy. 2023;44:173–8. 10.1057/s41271-023-00400-1. [DOI] [PubMed] [Google Scholar]
- 46.Stephens LD, Jacobs JW, Adkins BD, Booth GS. Battle of the (Chat)Bots: Comparing Large Language Models to Practice Guidelines for Transfusion-Associated Graft-Versus-Host Disease Prevention. Transfus Med Rev. 2023;37: 150753. 10.1016/j.tmrv.2023.150753. [DOI] [PubMed] [Google Scholar]