Scientific Reports. 2024 Jul 25;14:17118. doi: 10.1038/s41598-024-67429-4

A chatbot based question and answer system for the auxiliary diagnosis of chronic diseases based on large language model

Sainan Zhang 1, Jisung Song 1
PMCID: PMC11272932  PMID: 39054346

Abstract

In recent years, artificial intelligence has made remarkable strides, improving various aspects of our daily lives. One notable application is in intelligent chatbots that use deep learning models. These systems have shown tremendous promise in the medical sector, enhancing healthcare quality, treatment efficiency, and cost-effectiveness. However, their role in aiding disease diagnosis, particularly of chronic conditions, remains underexplored. Addressing this issue, this study employs large language models from the GPT series, in conjunction with deep learning techniques, to design and develop a diagnostic system targeted at chronic diseases. Specifically, we performed transfer learning and fine-tuning on the GPT-2 model, enabling it to assist in accurately diagnosing 24 common chronic diseases. To provide a user-friendly interface and seamless interactive experience, we further developed a dialog-based interface, naming it Chat Ella. This system can make precise predictions for chronic diseases based on the symptoms described by users. Experimental results indicate that our model achieved an accuracy rate of 97.50% on the validation set, and an area under the curve (AUC) value reaching 99.91%. Moreover, we conducted user satisfaction tests, which revealed that 68.7% of participants approved of Chat Ella, while 45.3% of participants found that the system made daily medical consultations more convenient. It can rapidly and accurately assess a patient’s condition based on the symptoms described and provide timely feedback, making it of significant value in the design of medical auxiliary products for household use.

Keywords: Artificial intelligence, Chatbots, Large models, CUQ test, Deep learning, User experience

Subject terms: Health services, Health care, Health occupations, Computer science, Information technology

Introduction

Chronic diseases are increasingly receiving attention on a global scale. According to statistics, nearly 25% of adults suffer from one or more chronic conditions, making chronic diseases a major challenge facing global health. Not only do chronic diseases result in poor health, disability, and even death, but they also account for a substantial portion of healthcare expenditures1. Therefore, the early diagnosis of chronic diseases becomes critically important2. Prompt intervention is required to prevent further progression of the disease. With changes in lifestyle and the impact of the pandemic, the incidence of chronic diseases is on the rise. Consequently, the future demand for health management as well as expenditures on healthcare systems are also gradually increasing3. At the same time, the demand for healthcare resources such as hospitals and doctors has also significantly increased. Patients need to spend more time waiting for medical treatment, and doctors are facing immense workloads and patient volumes. As a result, to reduce both time and financial costs, an increasing number of people are leaning towards contactless consultations and seeking medical advice online4. In a home setting, when experiencing physical discomfort, people are more inclined to conduct online searches for medical advice5. However, the complexity and uncertain reliability of online information present a potential problem. Therefore, there is an urgent need for remote medical interventions in the management of chronic diseases. Monitoring provided by remote healthcare service organizations can help in the early identification of symptoms and enable rapid responses to disease exacerbation6. Developing a system that can accurately assist in the diagnosis of chronic diseases therefore becomes especially important.

Telemedicine is a popular aspect of electronic medical applications and has seen increased usage with the development of mobile healthcare technologies. This growth has established the significance and value of remote interventions in the management of health services7,8. These technological solutions have the potential to make patients’ lives easier. For example, within clinical decision support systems (CDSS), models such as naive Bayes, neural networks, and K-nearest neighbors (KNN) are utilized to diagnose COVID-199. The VP-expert diagnostic system is specifically designed for populations in underdeveloped regions, aiming to rapidly diagnose diabetes and enhance the diagnostic efficiency of physicians10. The recommendation system for iron deficiency anemia, based on clinical oncology, serves as a CDSS, used for diagnosing and managing its treatment11. Furthermore, the smart tracking system for acute respiratory infections utilizes mobile health technologies to infer new facts from medical data collected during examinations of pediatric patients12. The use of CDSS to assist in enhancing the diagnostic efficiency of healthcare providers has become an established fact.

Currently, as the achievements of artificial intelligence in various domains have been effectively validated, deep learning has also exhibited exceptional performance in a variety of CDSS13,14. The concept of deep learning was introduced in 200615. The primary characteristic of deep learning is unsupervised feature learning, which provides a novel solution for dealing with abstract concepts in human thought. Deep learning algorithms have been effectively applied in numerous domains, including image retrieval and web-based knowledge retrieval16,17. For example, expert diagnostic systems for the automatic identification of asthma and chronic obstructive pulmonary disease (COPD) based on deep learning can interpret respiratory sounds recorded by a stethoscope, assisting doctors with remote access. Such systems can provide more timely and accurate diagnoses, offering a quicker and more efficient approach, especially for patients in resource-poor areas or those who cannot receive immediate medical attention18–20. Additionally, multi-modal learning using foundational architectures like long short-term memory can also enable early predictions of exacerbations in COPD. By analyzing a range of modal data, including clinical data and radiographic information, these models capture potential disease features and trends. This helps doctors take timely intervention measures and reduce the risk of disease exacerbation. Deep learning has also shown promise in predicting hospital mortality rates. Through model training and optimization, the risk of patient mortality during hospital stays can be more accurately predicted. This is crucial for the rational allocation of hospital resources, prioritized patient care, and treatment decision-making21,22.

As previously described, research has demonstrated the potential achievements of artificial intelligence in the realm of healthcare services. Among them, conversational agents (CAs), also known as AI chatbots, are applications capable of communicating via natural language23. With the rise of large language models, OpenAI’s generative pre-trained transformer (GPT) has gained prominence24. Trained on extensive text datasets, these models can generate natural language and be fine-tuned for a variety of language tasks25. For example, tasks such as language translation, text summarization, and text completion can be undertaken. After being trained on extensive conversational text datasets, these models can understand the context and intent of a conversation. They can be utilized as virtual assistants and other conversational interfaces that generate human-like responses in dialogue scenarios26. While initially mainly used for social chatting, these models are now increasingly being employed in specialized industries. Research in the field of intelligent diagnostics has demonstrated the feasibility of using large language models for radiological decision support in clinical settings, establishing them as effective tools for improving clinical workflows27. The isolation measures during the COVID-19 pandemic have heightened people’s awareness and understanding of online medical treatment. Doctors and patients have resorted to online consultations when face-to-face interactions are not possible. The number of remote consultations increased by 50%, further highlighting the importance of telemedicine28.

GPT, as a natural language processing model, serves various roles in healthcare. For instance, it can generate radiology reports to save radiologists time or assist in diagnostic decisions by providing differential and refined information. It can even communicate with patients, offering information regarding test results and follow-up treatment recommendations29. Additionally, some research has noted that digital therapy services provided by chatbots for mental health counseling have also gained recognition for their effectiveness30. Therefore, the future of healthcare will largely depend on the ability to accurately perform remote diagnoses. The collection of data remotely and its analysis through artificial intelligence will contribute to improved healthcare services and health outcomes. Regrettably, there is currently a relative paucity of research focusing on chronic disease auxiliary diagnosis based on large language models. Seizing this opportunity, our research aims to develop a system for aiding in the diagnosis of chronic diseases by leveraging large language models31. This aims to further enhance the quality of remote medical consultation services. This study is significant for the application of language models in healthcare and the development of systems that assist in the diagnosis of chronic diseases.

In the current technological context, this study comprehensively leverages the GPT-2 deep learning model. Through meticulous training, optimization, and packaging processes, we have successfully designed and developed an intelligent system for the auxiliary diagnosis of chronic diseases—Chat Ella. This system is capable of engaging in dialogues with patients, deeply inquiring into related symptoms, and thereby providing preliminary diagnostic results. Such an approach enables patients to gain an initial understanding of their health conditions and consult professional doctors in a timely manner when necessary. After a series of usability tests and evaluations on a validation set, including metrics such as accuracy and AUC, the system’s performance has met our anticipated standards. The chatbot-based medical auxiliary diagnosis model holds promise for broader promotion and application in the future.

The innovative aspects of this study are primarily reflected in the following areas:

  • The study employs the GPT-2, a large-scale pre-trained language model, for in-depth learning on a rich corpus of disease symptom text, opening up new possibilities for remote diagnosis of chronic diseases.

  • To allow users to more intuitively experience and utilize the training outcomes, we have specifically packaged the software and implemented a conversational interface, significantly enhancing the system’s interactivity and user-friendliness.

  • Through computer usability questionnaire (CUQ) usability testing, we have not only ensured the functional implementation of the system but also further validated its practical utility and user satisfaction in real-world scenarios, guaranteeing the system’s efficiency and reliability.

Methods

Data collection

In this study, features were extracted from publicly available raw data acquired from Kaggle (https://www.kaggle.com/), an online platform for data science and machine learning competitions. Kaggle is recognized as a community dedicated to machine learning and data science, providing essential assets for a wide array of research activities32–34. Based on the data provided, a model was successfully constructed and trained, which was endowed with the capability to interpret patient symptom descriptions and assist in diagnosis. To ensure the model’s efficiency and accuracy, the data were meticulously organized and preprocessed, providing a solid foundation for the training and performance optimization of the Chat Ella model. In Supplementary Table 1, a detailed list of 24 common chronic disease categories relevant to diagnostic decision-making is presented. This dataset includes 1200 instances describing symptoms of chronic diseases. Additionally, a subset of symptom descriptions was randomly extracted and matched with corresponding disease outcomes; detailed cases are documented in Supplementary Table 2. Interested readers are directed to access the relevant raw datasets on kaggle.com.

Construction, training, and evaluation of the auxiliary diagnosis model

Generative pre-trained transformer 2 (GPT-2) was released by OpenAI in 2019 as a model for natural language processing and generation. As a type of natural language generation model, GPT-2 employs a neural network architecture known as the “transformer”. It represents an advancement of the transformer architecture, designed to generate coherent and realistic text through unsupervised learning from large datasets35. Owing to its efficiency in handling sequential data, the transformer quickly became a mainstream technology in the field of natural language processing. The transformer model primarily relies on a self-attention mechanism, which enables the model to consider the relationships among different parts of the input sequence and generate output based on these relationships.

Developed based on several core principles, the pre-training and fine-tuning framework involves two stages: in the pre-training phase, the model is trained on a very large text dataset to learn a wide range of language patterns and structures. This stage is unsupervised, meaning it does not require manually labeled data. For training, GPT-2 utilizes a vast amount of text data collected from the internet, covering a wide range of topics and styles, which enables the model to understand and generate various forms of text. GPT-2 features an increased number of layers and parameters: it includes 1.5 billion parameters and 48 transformer layers, enhancing its capability to comprehend complex texts and produce detailed responses. This allows the model to consider the relationships between every word in the input sequence during text generation, facilitating the production of more coherent and contextually relevant text. However, with the emergence of larger language models such as GPT-3, the advantages of this fine-tuning might diminish, especially in scenarios requiring higher generalizability36. We therefore took this into account in our fine-tuning of GPT-2.

The GPT-2 architecture is primarily composed of large transformer decoder blocks that integrate position encodings37. Within each decoder block, the multi-head self-attention mechanism and the multilayer perceptron are enveloped by layer normalization and dropout layers38. Inputted time-series data are organized into batches (samples, time steps, features) within the multi-head attention mechanism, wherein each time step is associated with a set of features. These features are partitioned and allocated to various processing units, enabling the simultaneous processing of different aspects of the input sequence and facilitating comprehensive analysis. This design enhances the model’s capability to capture both short-term and long-term dependencies, thereby improving the prediction’s stability and accuracy. The architecture is illustrated in Fig. 1.

Figure 1.

Figure 1

GPT-2 structure diagram.
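To make the self-attention mechanism described above concrete, the following is a minimal single-head sketch in NumPy. The projection weights here are random placeholders for illustration, not the model’s actual parameters, and the batch shape mirrors the (samples, time steps, features) layout mentioned in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d_k):
    """Single attention head over a batch shaped (samples, time steps, features)."""
    rng = np.random.default_rng(0)
    d_in = x.shape[-1]
    W_q, W_k, W_v = (rng.normal(size=(d_in, d_k)) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # pairwise affinities between time steps
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ V                                # context-mixed representation

batch = np.ones((2, 5, 8))            # 2 samples, 5 time steps, 8 features
out = self_attention(batch, d_k=4)
```

In multi-head attention, several such heads run in parallel on partitioned feature slices and their outputs are concatenated, which is what allows different aspects of the input sequence to be processed simultaneously.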

Data preprocessing

During the data preprocessing stage, we initially carried out the following operations:

  1. Deduplication: to ensure the quality and diversity of the data, all duplicate entries were removed from the dataset.

  2. Text cleaning: irrelevant information, formatting errors, and special characters were removed from the text to ensure that each entry has a uniform and clear format.

  3. Tokenization and encoding: to make the text data amenable to model processing, we tokenized the text and used specific encoding methods to convert it into numerical sequences that the model can recognize.

Through these steps, we obtained an optimized dataset, laying a solid foundation for the subsequent fine-tuning of the model.
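The three preprocessing steps above can be sketched as follows. The raw entries and the whitespace tokenizer are illustrative assumptions; the study used the GPT-2 tokenizer and its own Kaggle-derived data.

```python
import re

# Hypothetical raw entries; in the real dataset each symptom description
# is paired with a chronic disease label.
raw = [
    "I have frequent thirst &amp; blurred vision!!",
    "I have frequent thirst &amp; blurred vision!!",   # duplicate entry
    "Chest tightness   and wheezing at night",
]

# 1. Deduplication: drop repeated entries while preserving order
entries = list(dict.fromkeys(raw))

# 2. Text cleaning: strip HTML remnants, special characters, extra whitespace
def clean(text):
    text = text.replace("&amp;", "and")
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

cleaned = [clean(t) for t in entries]

# 3. Tokenization and encoding: map each token to an integer id
vocab = {}
def encode(text):
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

encoded = [encode(t) for t in cleaned]
```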

Model fine-tuning methods, training details, and parameter settings

In the domain of deep learning applications, fine-tuning has emerged as a prevalent strategy. This involves conducting additional training on a pre-trained model with a specialized dataset to enhance the model's performance for a specific task. For the purpose of this study, which aims to enable GPT-2 to diagnose patient symptoms more accurately, the model was fine-tuned. The objective was to predict potential chronic diseases based on descriptions of patient symptoms, constituting a multi-class classification task. The data employed for fine-tuning was sourced from a preprocessed medical dataset, where each entry comprises a patient’s symptom description paired with the corresponding disease label.

The selection of an appropriate loss function was deemed critical for successful training. Consequently, cross-entropy loss was chosen due to its suitability for multi-class classification tasks. To thoroughly evaluate the model's performance across various metrics, accuracy, recall, precision, F1 score, and AUC were set as the primary evaluation metrics. To promote the effectiveness and generalizability of the training, the dataset was partitioned into a training set and a test set. Parameter settings were acknowledged as vital during the training process. Owing to the robust performance of the Adam optimizer across diverse deep-learning tasks, it was selected for this training session. The initial learning rate was established at 3e-5, incorporating a strategy for learning rate decay throughout the training duration. The batch size was fixed at 32, which not only facilitated the effective utilization of GPU resources but also contributed to the stability of the training process. Moreover, a plan was devised to complete five full rounds of training. To mitigate the risk of overfitting, a dropout strategy was integrated into the model. In the training of GPT-2, convergence criteria were implemented to prevent overfitting and enhance computational efficiency. Specifically, an early stopping method was utilized, where the training process was terminated if the mean squared error (MSE) remained below a predetermined threshold for three consecutive epochs. This threshold was established at 0.5, determined based on the results from the optimization of the learning rate using the cyclical learning rate (CLR) scheduler39–41. This same threshold and count limit were maintained, with preliminary tests demonstrating that a threshold of 0.5 resulted in optimal model performance. Separately, the dropout rate was also set at 0.5.
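The early stopping rule described above can be sketched as a small helper: training halts once the monitored loss has stayed below the 0.5 threshold for three consecutive epochs, within the planned five rounds. The loss sequence below is hypothetical; in the study the criterion was applied to the MSE monitored during GPT-2 fine-tuning.

```python
def train_with_early_stopping(epoch_losses, threshold=0.5, patience=3, max_epochs=5):
    """Return the epoch at which training stops under the stated rule."""
    below = 0
    for epoch, loss in enumerate(epoch_losses[:max_epochs], start=1):
        below = below + 1 if loss < threshold else 0   # count consecutive sub-threshold epochs
        if below >= patience:
            return epoch                               # early stop triggered
    return min(len(epoch_losses), max_epochs)          # otherwise run the full plan

# Hypothetical loss curve: falls under 0.5 from epoch 2 onward, so the
# three-epoch patience is satisfied at epoch 4.
stopped_at = train_with_early_stopping([0.9, 0.4, 0.3, 0.2, 0.1])
```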

Computational equipment and operating environment

All model training and evaluation for this study were conducted on a server equipped with an NVIDIA Tesla V100 GPU. This GPU, featuring 5120 CUDA cores and 16 GB of HBM2 memory, provided substantial computational support for large-scale deep learning models. The server ran on the Ubuntu 18.04 LTS operating system, noted for its stability and reliability. This operating system is widely used in the fields of big data and machine learning, benefiting from extensive open-source support and a robust community of resources. For the implementation and fine-tuning of the GPT-2 model, TensorFlow 2.4 was selected as the primary deep-learning framework. TensorFlow not only offers flexible capabilities for model design but also utilizes a powerful backend that maximizes the computational potential of the GPU. Additionally, Python libraries such as numpy and pandas were employed for data processing and analysis. For model serialization and storage, the h5py library was utilized.

Integrated system design of Chat Ella

In the integrated system design of Chat Ella, the user input for symptom descriptions, database retrieval, and conversational responses must be tightly synchronized to provide timely and accurate feedback on medical symptoms to the users. The overall architecture of the system takes into account both backend processing and frontend user interface design, collaboratively ensuring exceptional user experience and system performance. The backend system primarily handles the processing of user input, database retrieval, and dialog responses. Its core task is to ensure seamless coordination among user query intents, database retrieval, and conversational responses, facilitating a smooth integration across these components. As for the frontend user interface, serving as the bridge for user interaction with Chat Ella, its design philosophy focuses on clarity, intuitiveness, and ease of use. The interface not only guides users in providing detailed symptom input but also clearly displays the responses and recommendations offered by Chat Ella. Furthermore, the tight integration between the front end and back end ensures that user inputs are accurately relayed to the back end, and feedback from the back end is presented to the user in a coherent manner.

Backend system architecture

Chat Ella adopts a structured diagnostic approach based on the symptom descriptions provided by users. Initially, it optimally sorts the symptoms to ensure preliminary filtering based on their importance or prevalence. For each symptom description available in the database, Chat Ella is capable of real-time responses, facilitating real-time bi-directional communication with the user. Upon receiving a user query, the system retrieves relevant data from the database, processes it in the backend, and then sends it back to the user in the form of a message. The core of the diagnostic workflow lies in the comparative analysis between each symptom inputted by the user and the corresponding symptoms of common diseases stored in the database. Whenever a symptom aligns with a particular disease, that disease is added to a list of potential conditions. This process continues until all the symptoms described by the user have been matched. Upon the completion of the entire process, Chat Ella identifies the diseases that best match the user’s description, sorts them based on the probability of matching, and then provides the most likely disease information back to the user. This highly optimized workflow ensures the delivery of accurate and unambiguous diagnostic feedback to users within a short period, as illustrated in Fig. 2.

Figure 2.

Figure 2

Overall system architecture of Chat Ella.
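The symptom-to-disease matching loop described above can be sketched as follows. The two-disease database and symptom strings are hypothetical; the real system matches against the 24 chronic disease categories in its database and reports match probabilities.

```python
# Hypothetical symptom database; the real system covers 24 chronic diseases.
disease_db = {
    "type 2 diabetes": {"frequent thirst", "blurred vision", "fatigue"},
    "asthma": {"wheezing", "chest tightness", "shortness of breath"},
}

def rank_candidates(user_symptoms):
    """Match each reported symptom against the database and rank candidate diseases."""
    candidates = {}
    for symptom in user_symptoms:                    # compare every user-described symptom
        for disease, known in disease_db.items():
            if symptom in known:                     # symptom aligns with this disease
                candidates[disease] = candidates.get(disease, 0) + 1
    # sort by number of matched symptoms, a simple proxy for match probability
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_candidates(["frequent thirst", "fatigue", "wheezing"])
```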

Frontend user interface design and implementation

To present backend results visually while preserving a coherent user experience, our objective was to minimize user friction while maintaining a familiar interaction logic. The front end of Chat Ella was designed as a conversational interface, with the initial interface prototype created using Sketch version 83.142, which allows for the creation of interactive interfaces. The final responsive interface was implemented in React43 as reported in recent advancements44. During development, the display sizes of user mobile devices were considered, allowing access from computers and tablets in addition to smartphones. For ease of user testing, the system was deployed on Tencent Cloud servers, eliminating the need for users to download any software to access the service.

Usability testing

The survey was conducted online between July 20 and July 28, 2023, and a total of 64 adult participants responded; this sample size meets the standards for similar usability surveys45. To collect feedback on the usability and acceptability of Chat Ella, including adherence, user-friendliness, ease of use, appropriateness, user engagement, and user satisfaction, participants first interacted with Chat Ella and then answered a series of questions in the chatbot usability questionnaire (CUQ). The CUQ is a usability questionnaire designed specifically for chatbot assistants, similar in purpose to the system usability scale (SUS), which is a widely used tool for evaluating system usability with a benchmark score of 68 out of 100. However, because our test subject is a conversational system, the SUS is not recommended; the CUQ is better tailored for evaluating aspects related to chatbot personality, user experience, and error handling. The CUQ consists of 16 questions with five response options, ranging from ‘strongly agree’ to ‘strongly disagree’. Scores are calculated out of a total of 160 and then normalized to 100. The CUQ evaluates aspects related to the chatbot’s personality, onboarding process, user experience, and error handling46,47; the questionnaire items are provided in the supplementary material. Positive aspects of chatbot usability are assessed through odd-numbered questions, while negative aspects are assessed through even-numbered questions. Computational tools can be used to calculate each participant’s score and the average scores46. As recommended, CUQ scores were further analyzed using SPSS 26.0 to produce statistical data. All participants provided written informed consent prior to participating in the study, and all methods were carried out in accordance with relevant guidelines and regulations.
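One common CUQ scoring scheme (assumed here for illustration, not taken verbatim from the paper) rescales positively worded odd items and negatively worded even items so that the total falls on a 0 to 100 scale:

```python
def cuq_score(responses):
    """Score 16 Likert answers (1 = strongly disagree ... 5 = strongly agree)."""
    assert len(responses) == 16
    pos = sum(r - 1 for i, r in enumerate(responses) if i % 2 == 0)  # Q1, Q3, ... positive items
    neg = sum(5 - r for i, r in enumerate(responses) if i % 2 == 1)  # Q2, Q4, ... negative items
    return (pos + neg) / 64 * 100   # 64 is the maximum raw score under this scheme

best = cuq_score([5, 1] * 8)    # agrees with all positives, rejects all negatives
worst = cuq_score([1, 5] * 8)   # the opposite response pattern
```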

Table 3.

Chatbot usability questionnaire scores for study participants (n = 64) according to participant characteristics.

Mean SD Minimum Maximum SE
Gender
 Female 66.57 28.43 14.06 95 4.73
 Male 68.69 26.59 14.06 93.75 5.02
Education
 Bachelor 66.69 27.71 14.06 87.5 4.32
 Master 60.38 32.57 14.06 93.75 8.7
 Doctorate 82.25 5.59 75 95 1.86
Interactive device
 Computer 60.24 31.88 14.06 87.5 7.51
 Tablet computer 72.48 24.89 18.75 93.75 5.86
 Phone (iOS) 68.3 26.42 18.75 95 6.22
 Phone (Android) 70.15 26.45 17.19 87.5 8.36
Chat Ella can provide auxiliary diagnosis
 Yes 66.41 28.77 14.06 87.5 6.43
 No 66.31 27.91 14.06 95 5.18
 Maybe 71.25 26.25 18.75 93.75 6.77

Ethical statement

Ethical review and approval were waived for this study because respondents were only required to answer questions; no intervention was performed on human subjects, and the study involved no experimental research on the human body itself. It should also be emphasized that the participants in the chatbot usability testing were not patients. Participants were adults who voluntarily took part in the questionnaire and provided written informed consent. Because all 64 participants were of Chinese nationality, Article 32 of the “Measures for Ethical Review of Life Science and Medical Research Involving Human Beings” (No. 4, People’s Republic of China, February 18, 2023) applies: ethical review may be exempted if the research does not cause harm to human beings and does not involve sensitive personal information or commercial interests. Therefore, review and approval by the Hanyang University Institutional Review Board were not required. This statement is intended to safeguard the ethical compliance of the research project and to protect the rights and privacy of the participants. Participants’ signatures on the informed consent form are reflected in the publication consent statement.

Informed consent

Each participant was informed about the study and agreed to the use of their data for academic research only.

Result

Results of the auxiliary diagnostic model

To validate the efficacy of the fine-tuned GPT-2 model in chronic disease diagnosis tasks, we relied on multiple evaluation metrics to assess the model’s performance. Below are the detailed assessment results.

As can be observed from Table 1, the model demonstrates outstanding overall performance on the test set. Particularly noteworthy is the area under the curve (AUC) value, which approaches 1, indicating a high discriminative capacity of the model in distinguishing different chronic diseases. Additionally, the accuracy, precision, recall, and F1 scores all exceeded 0.97, signifying that the model performs well both in predicting the correct disease categories and in distinguishing between positive and negative samples. These findings suggest that the fine-tuned GPT-2 model is not only capable of effectively processing and understanding patients’ symptom descriptions, but it can also offer highly reliable chronic disease diagnostic recommendations to medical personnel. This provides robust technical support for the future development of intelligent healthcare systems.

Table 1.

Evaluation results of the auxiliary diagnostic model.

Methods Value
ACC 0.975
AUC 0.999
F1 0.974
Precision 0.977
Recall 0.975
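As a reminder of how the metrics in Table 1 are defined, here is a minimal sketch on hypothetical binary labels. The study’s task is 24-class, where these scores are computed per class and averaged; the label vectors below are illustrative only.

```python
# Hypothetical ground-truth labels and model predictions
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

# Confusion-matrix counts
tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))   # true positives
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))   # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

acc = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)      # harmonic mean
```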

Backend development results

In the construction of the backend application for this research, the Flask framework was employed, a choice known for its flexibility and simplicity in managing web applications. HTTP requests were effectively handled using Flask’s request object. Additionally, HTML templates were displayed and data was transmitted to the frontend by utilizing the render_template and jsonify functions. HTML rendering was facilitated by render_template, enabling dynamic web page generation based on backend data, while data formatting into JSON was accomplished using jsonify, enhancing the efficiency of data exchange between backend and frontend components.
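A hedged sketch of this request/response plumbing is shown below. The route name, query parameter, and the predict() stub are illustrative assumptions; in the actual system the stub is replaced by the fine-tuned GPT-2 classifier.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(symptoms):
    # Placeholder for the model call; returns (label, probability) pairs.
    return [("diabetes", 0.91), ("hypertension", 0.05)]

@app.route("/diagnose")
def diagnose():
    symptoms = request.args.get("symptoms", "")        # user-inputted description
    if not symptoms:
        return jsonify(error="please describe your symptoms"), 400
    results = [{"disease": d, "probability": p} for d, p in predict(symptoms)]
    return jsonify(results=results)                    # JSON back to the frontend

# Exercise the route without starting a server
client = app.test_client()
resp = client.get("/diagnose?symptoms=frequent+thirst")
```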

Model and tokenizer loading

In the main backend function, ‘index’, the pre-trained GPT-2 model along with its corresponding tokenizer (GPT2Tokenizer and GPT2ForSequenceClassification) were initially loaded from the Transformers library. These components, configured for text classification, assisted in establishing label mappings for disease tag classification, as depicted in Fig. 3.

Figure 3.

Figure 3

Disease condition tag classification.

Disease label mapping and request handling

To precisely map the prediction results, a mapping from indices to disease labels, denoted as ‘index_to_label’, was constructed, where each index corresponds to a disease category. When processing frontend requests, the ‘index’ function retrieves parameters from the request, specifically the user-inputted symptom descriptions, which are utilized for model classification and disease category identification.

Symptom description prediction

Upon obtaining the user-inputted symptom descriptions, the system employs the GPT-2 tokenizer to encode these descriptions, which are then used as inputs for classification prediction by the model. The output prediction result is a vector indicating the probabilities for each disease category.

Result presentation and data feedback

The predicted probability vector is first processed through the softmax function to obtain the predicted probability distribution. Subsequently, the system displays the predicted probabilities for each disease category, along with their corresponding labels, to the user. The prediction results are formatted into JSON and returned to the frontend via the ‘jsonify’ function. In the final part of the backend, conditional statements are utilized to ensure correct data input, and the Flask application is launched accordingly. The entire process is depicted in Fig. 4.

Figure 4.

Figure 4

Disease prediction process.
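The softmax and label-pairing step described above can be sketched as follows. The three-label ‘index_to_label’ mapping and the logit values are hypothetical; the real model outputs a 24-way logit vector.

```python
import numpy as np

# Toy mapping and logits for three hypothetical disease labels
index_to_label = {0: "diabetes", 1: "asthma", 2: "arthritis"}
logits = np.array([2.0, 0.5, -1.0])

# Softmax: subtract the max for numerical stability, then normalize
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # probabilities now sum to 1

# Pair each probability with its disease label, highest probability first
ranked = sorted(zip(index_to_label.values(), probs), key=lambda kv: -kv[1])
```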

Frontend page design and implementation

React components were used to design and implement the user interface. Emotion and user sentiment have been consistently regarded as lasting and effective key factors in user experience design48,49. In light of this, a virtual persona for the Ella chatbot was built and displayed on the page to more intuitively guide user interaction, as illustrated in Fig. 4a, b. Users can directly input their symptom descriptions into the interface and interact with the Chat Ella chatbot to receive relevant responses and suggestions. React’s functional components were used to build this interaction, and a series of Hooks50, such as ‘useState’, ‘useRef’, and ‘useEffect’, were employed to manage component state and references. Several state variables, such as ‘showMenu’, ‘inputTxt’, and ‘messages’, were established within the components. These variables are used to control the chat interface display and manage user input. The ‘useRef’ hook was also used to create references to DOM elements, such as ‘chatDiv’ and ‘menuRef’, which are essential for subsequent operations. Additionally, several event-handling functions, such as ‘onMenuClick’, ‘onBackClick’, and ‘onSendClick’, were integrated into the component to manage various user actions. To ensure the chat interface could display the latest messages in real time, ‘useEffect’ was employed to monitor changes to the messages array, and the scroll position of the chat interface was set accordingly, as depicted in Fig. 5a–c. The ‘axios’ library was utilized to implement HTTP communication with the backend server. When a user enters symptom descriptions in the chat interface and clicks the send button, the system evaluates the input, adds it to the messages array, and returns recommendations for the best-matching conditions, as shown in Fig. 5d. If no symptom-related statements are detected, a prompt is displayed, suggesting that the user provide more detailed symptom descriptions, as shown in Fig. 5e.
Furthermore, the chat interface features a menu option. When users click on the menu icon, they can view or hide various menu options, such as deleting records and exporting logs, as depicted in Fig. 5f. The server’s link is http://62.234.219.85:8080/, and the detailed source code has been packaged and stored in the supplementary materials.

Figure 5. Chat Ella dialogue system user interface design screenshot.

Quantitative analysis of usability results

A total of 64 adult participants, comprising 36 females and 28 males, responded to the survey. Their educational backgrounds comprised 41 with a bachelor’s degree, 14 with a master’s degree, and nine with a doctorate, as shown in Table 2.

Table 2.

Characteristics of the survey respondents (n = 64).

Education group | Female | Male | Total
Bachelor’s | 25 (60.98%) | 16 (39.02%) | 41 (64.06%)
Master’s | 8 (57.14%) | 6 (42.86%) | 14 (21.87%)
Doctoral | 3 (33.33%) | 6 (66.67%) | 9 (14.06%)
Total | 36 (56.25%) | 28 (43.75%) | 64 (100%)

CUQ findings

The odd-numbered questions focus on the positive aspects of the chatbot experience. Among the positive feedback regarding Chat Ella’s usability, the question “The chatbot is user-friendly during the initial setup” (Q7) and “It is easy to use” (Q15) scored the highest with an average rating of 3.76. The lowest average scores for positive questions were for questions 1, 5, and 11, with an average score of 3.70. The even-numbered questions focus on the negative aspects. For instance, question 8 “Using the chatbot can sometimes be confusing” scored the highest among these with an average rating of 3.4. On the other hand, question 16, which asked if “Using the chatbot is complicated,” received the lowest average score of 2.1, as shown in Fig. 6. Among the 64 participants, male respondents had a higher average CUQ score of 68.69 compared to females. Participants with a Ph.D. had the highest average score, registering at 82.24. The average score for those who interacted via tablets was higher (72.48) than for those who used other devices. Those who believed that Chat Ella could diagnose diseases had an average score of 66.41. The overall average score across all participants was 68.31, as shown in Table 3. These results indicate a generally positive perception of Chat Ella’s usability, though there is room for improvement, particularly in aspects that received lower scores. These insights are valuable for future updates and refinements to the system.
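The per-participant scores reported above (e.g., the overall average of 68.31) follow the standard CUQ scoring: sixteen 5-point Likert items, with odd items positively worded and even items negatively worded, rescaled to a 0–100 range. A minimal sketch of that scoring, assuming the published CUQ formula from the scale's authors (ref. 46):

```javascript
// Computes a CUQ score (0-100) from 16 Likert responses (1-5 each).
// Odd-numbered items (index 0, 2, ...) are positively worded; even-numbered
// items (index 1, 3, ...) are negatively worded.
function cuqScore(responses) {
  if (responses.length !== 16) {
    throw new Error('CUQ expects exactly 16 item responses');
  }
  let posSum = 0; // sum over positively worded (odd-numbered) items
  let negSum = 0; // sum over negatively worded (even-numbered) items
  responses.forEach((r, i) => {
    if (i % 2 === 0) posSum += r;
    else negSum += r;
  });
  // Each half ranges 8-40, so ((posSum - 8) + (40 - negSum)) ranges 0-64;
  // multiplying by 100/64 = 1.5625 rescales the total to 0-100.
  return ((posSum - 8) + (40 - negSum)) * 1.5625;
}
```

Under this formula, agreeing strongly with every positive item and disagreeing strongly with every negative item yields 100, while answering "neutral" throughout yields 50, which gives context for interpreting the group averages in Table 3.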

Figure 6. Average ranking of chatbot usability questionnaire.

Discussion

Amidst rapid advancements in information technology, artificial intelligence (AI) has undergone extensive validation and is increasingly gaining prominence. This trend is particularly notable in the healthcare sector, where AI-assisted medical services are receiving widespread attention and adoption, especially for specific use cases that benefit from efficient information retrieval. In this context, a chatbot named Chat Ella, grounded in large language models and deep learning techniques, was developed. Integrated with a comprehensive medical database, this chatbot is capable of accurately interpreting patients’ symptom descriptions and medical histories, subsequently providing diagnostic recommendations. The GPT-2 pre-trained model was employed for system development. To evaluate the practical application of the system, a usability assessment was conducted for users of Chat Ella. The results indicate that the average CUQ score among the participants reached 68.31, significantly surpassing general usability evaluation benchmarks. This indicates that Chat Ella has significant potential to assist in the auxiliary diagnosis of chronic diseases remotely and in a non-face-to-face manner.

The primary advantage of this system lies in its ability to conduct precise disease detection and matching based on user-provided symptom descriptions. This approach not only offers convenience but also ensures that patients do not overlook symptoms that might seem less significant, thereby providing them with a simple, efficient, and accurate means to understand their health status without the need for hospital visits or blind internet searches for symptoms. It is anticipated that this approach can not only effectively alleviate the issue of unequal distribution of medical resources but also enhance the work efficiency of medical professionals, thus offering broad application prospects in the healthcare sector.

In selecting an appropriate language model to assist in diagnostic research, GPT-2 was chosen for this study even though more advanced models, such as GPT-3 and newer versions of ChatGPT, are available in the current deep learning landscape. This decision was based on multiple considerations. First, GPT-2 is recognized for its computational and storage efficiency while offering a level of accuracy comparable to larger models, thus ensuring resource efficiency. Second, its smaller scale is associated with better model interpretability and stability, which are critical in healthcare applications. Additionally, because GPT-2 was released earlier and has been widely used, its performance on a variety of tasks has been extensively validated, providing a reliable foundation for our research. GPT-2 is also readily fine-tunable, which allows flexible adaptation to specific diagnostic tasks. Finally, adopting GPT-2 simplifies the deployment process. GPT-2 was therefore found to offer an optimal balance for this study.

This study does have certain limitations. First, the dataset we collected is limited in size and requires further expansion to ensure the model’s generalization capability. Second, the current chatbot primarily supports English interaction, and its response patterns are relatively fixed. In our usability testing, the majority of participants came from academic and professional backgrounds in digital design, computer science, and clinical medicine, and their tests focused on medical conditions they had personally been diagnosed with in the past, which may limit the applicability of our findings to a broader user base. Although our research focuses on the usability of a prototype model for assisting in the diagnosis of chronic diseases, aiming to save development time while ensuring product quality, issues might still arise in practical applications. Because GPT-2 is a large language model pre-trained on extensive datasets and operates as a black-box system, its reasoning processes cannot be fully explained or audited; it is therefore crucial to conduct extensive validation and close monitoring before its clinical adoption51. Additionally, the knowledge it contains only reflects the period up to which the training data were collected, so it may not provide accurate information or insights about developments after that point. Users may also describe their symptoms in non-standard or informal language, potentially making it difficult for the chatbot to interpret and respond accurately.

To address these challenges and limitations, it is essential to refine and optimize the training datasets to reduce biases and inaccuracies. Integrating external knowledge bases or fact-checking modules into the model can verify the accuracy of generated content. Human–machine interaction methods supervised by clinical physicians and featuring feedback mechanisms can continuously improve the system’s clinical logic. We also plan to further enrich and refine our database and optimize the related algorithms to improve diagnostic accuracy and adaptability in answering questions. Access to updated databases or real-time fact-checking tools can help the model make more accurate judgments and generate precise information. Finally, enhancing the model’s contextual understanding is crucial for grasping complex situations and incorporating relevant new information.

Conclusion

This study has constructed an innovative question–answering system for chronic medical conditions based on large language models. As large language models continue to advance, how to integrate them effectively into clinical workflows, enhancing rather than replacing human decision-making, becomes particularly important. The system incorporates an extensive corpus of medical knowledge and is designed as a chatbot for the intelligent diagnosis of common chronic diseases. Empirical results from a CUQ test demonstrated the system’s anticipated benefits and yielded positive feedback on usability. While current AI technology cannot fully replace medical professionals, the widespread application of chatbot technology in healthcare not only enhances the patient experience, expedites medical service response times, and boosts overall service efficiency, but also lowers healthcare costs. This technology offers an efficient and convenient option for patients who prefer not to, or are unable to, engage in face-to-face consultations.

Looking forward, we are considering upgrading the system to support voice input and image recognition. High-quality training data are pivotal for voice and image recognition technologies. This study sought support from medical collaborators to collect diverse training data covering a range of chronic disease types and varied conditions in language and imagery, and data augmentation techniques were employed to enhance the model’s generalization capability. The model was integrated into the existing question-and-answer system using a modular architecture designed to facilitate data exchange and function integration across different systems through application programming interfaces. Multimodal data fusion was crucial for effectively integrating information from various sources, including text, voice, and images, which is essential for improving the system’s diagnostic accuracy and response relevance; fusion network strategies in deep learning ensured the effective integration of data across modalities. Such enhancements will not only further improve the system’s usability but also allow it to handle a wider variety of data types, thereby facilitating follow-up doctor appointments after a chronic disease diagnosis. Leveraging deep learning and large language models, the application prospects of this technology extend beyond aiding in the diagnosis of chronic diseases; these technologies have the potential to be widely deployed in the design and development of various medical systems.

Supplementary Information

Supplementary Tables. (16.4KB, docx)

Author contributions

Data curation: Sainan Zhang. Formal analysis: Sainan Zhang. Investigation: Sainan Zhang, Jisung Song. Methodology: Sainan Zhang, Jisung Song. Project administration: Sainan Zhang, Jisung Song. Software: Sainan Zhang. Writing—original draft: Sainan Zhang. Writing—review & editing: Jisung Song.

Data availability

The raw data used during the current study are available from the corresponding author upon reasonable request.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-024-67429-4.

References

  • 1.Bauer, U. E., Briss, P. A., Goodman, R. A. & Bowman, B. A. Prevention of chronic disease in the 21st century: Elimination of the leading preventable causes of premature death and disability in the USA. Lancet384, 45–52 (2014). 10.1016/S0140-6736(14)60648-6 [DOI] [PubMed] [Google Scholar]
  • 2.Garcia, P. T., Dias, A. A., Souza, J. A. C. & Coltro, W. K. T. Batch injection analysis towards auxiliary diagnosis of periodontal diseases based on indirect amperometric detection of salivary α-amylase on a cupric oxide electrode. Anal. Chim. Acta1041, 50–57 (2018). 10.1016/j.aca.2018.08.039 [DOI] [PubMed] [Google Scholar]
  • 3.Steffler, M. et al. Trends in prevalence of chronic disease and multimorbidity in Ontario, Canada. CMAJ193, E270–E277 (2021). 10.1503/cmaj.201473 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Wangler, J. & Jansky, M. In Transdisciplinary Perspectives on Public Health in Europe (eds Cassens, M. et al.) 177–187 (Springer, 2022). [Google Scholar]
  • 5.Sassenberg, K. & Greving, H. Internet searching about disease elicits a positive perception of own health when severity of illness is high: A longitudinal questionnaire study. J. Med. Internet Res.18, e56 (2016). 10.2196/jmir.5140 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Bashshur, R. L. et al. The empirical foundations of telemedicine interventions for chronic disease management. Telemed. E Health20, 769–800 (2014). 10.1089/tmj.2014.9981 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Celik Ertugrul, D. & Ulusoy, A. H. Development of a knowledge-based medical expert system to infer supportive treatment suggestions for pediatric patients. ETRI J.41, 515–527 (2019). 10.4218/etrij.2018-0428 [DOI] [Google Scholar]
  • 8.Shah, A. et al. Digital health interventions for depression and anxiety among people with chronic conditions: Scoping review. J. Med. Internet Res.24, e38030 (2022). 10.2196/38030 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Afrash, M. R. et al. Machine learning-based clinical decision support system for automatic diagnosis of COVID-19 based on clinical data. J. Biostat. Epidemiol.10.18502/jbe.v8i1.10407 (2022). 10.18502/jbe.v8i1.10407 [DOI] [Google Scholar]
  • 10.Imanov, E., Altıparmak, H. & Imanova, G. E. In 13th International Conference on Theory and Application of Fuzzy Systems and Soft Computing—ICAFS-2018 (eds Aliev, R. A. et al.) 137–145 (Springer, 2019). [Google Scholar]
  • 11.Çelik Ertuğrul, D., Toygar, Ö. & Foroutan, N. A rule-based decision support system for aiding iron deficiency management. Health Inform. J.27, 14604582211066054 (2021). 10.1177/14604582211066054 [DOI] [PubMed] [Google Scholar]
  • 12.Ertugrul, D. Ç. et al. An intelligent tracking system: Application to acute respiratory tract infection (TrackARTI). In 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC) 137–142 (IEEE, 2017). [Google Scholar]
  • 13.Harutyunyan, H., Khachatrian, H., Kale, D. C., Ver Steeg, G. & Galstyan, A. Multitask learning and benchmarking with clinical time series data. Sci. Data6, 96 (2019). 10.1038/s41597-019-0103-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Avati, A. et al. Improving palliative care with deep learning. BMC Med. Inform. Decis. Mak.18, 122 (2018). 10.1186/s12911-018-0677-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Bengio, Y. Learning deep architectures for AI. Found. Trends Mach. Learn.2, 1–27 (2009). 10.1561/2200000006 [DOI] [Google Scholar]
  • 16.Lopes, N. & Ribeiro, B. Towards adaptive learning with improved convergence of deep belief networks on graphics processing units. Pattern Recognit.47, 114–127 (2014). 10.1016/j.patcog.2013.06.029 [DOI] [Google Scholar]
  • 17.Gao, H., Pei, J. & Huang, H. Demystifying dropout. In International Conference on Machine Learning 2112–2121 (PMLR, 2019). [Google Scholar]
  • 18.Hamet, P. & Tremblay, J. Artificial intelligence in medicine. Metabolism69, S36–S40 (2017). 10.1016/j.metabol.2017.01.011 [DOI] [PubMed] [Google Scholar]
  • 19.Srivastava, A. et al. Deep learning based respiratory sound analysis for detection of chronic obstructive pulmonary disease. PeerJ Comput. Sci.7, e369 (2021). 10.7717/peerj-cs.369 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Liu, Y. et al.IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech) 170–177 (IEEE, 2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Badnjevic, A., Gurbeta, L. & Custovic, E. An expert diagnostic system to automatically identify asthma and chronic obstructive pulmonary disease in clinical settings. Sci. Rep.8, 11645 (2018). 10.1038/s41598-018-30116-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Jin, M. et al. Improving hospital mortality prediction with medical named entities and multimodal learning. arXiv preprint arXiv:1811.12276 (2018).
  • 23.Dale, R. The return of the chatbots. Nat. Lang. Eng.22, 811–817 (2016). 10.1017/S1351324916000243 [DOI] [Google Scholar]
  • 24.Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog1, 1–24 (2019). [Google Scholar]
  • 25.Liu, Y. et al. Respiratory sounds feature learning with deep convolutional neural networks. In 2017 IEEE 15th Intl Conf on Dependable 170–177 (IEEE, 2017). [Google Scholar]
  • 26.Bates, M. Health care chatbots are here to help. IEEE Pulse10, 12–14 (2019). 10.1109/MPULS.2019.2911816 [DOI] [PubMed] [Google Scholar]
  • 27.Rao, A. et al. Evaluating GPT as an adjunct for radiologic decision making: GPT-4 versus GPT-3.5 in a breast imaging pilot. J. Am. Coll. Radiol.20, 990–997 (2023). 10.1016/j.jacr.2023.05.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Følstad, A. & Brandtzaeg, P. B. Users’ experiences with chatbots: Findings from a questionnaire study. Qual. User Exp.5, 3 (2020). 10.1007/s41233-020-00033-2 [DOI] [Google Scholar]
  • 29.Varshney, D., Zafar, A., Behera, N. K. & Ekbal, A. Knowledge graph assisted end-to-end medical dialog generation. Artif. Intell. Med.139, 102535 (2023). 10.1016/j.artmed.2023.102535 [DOI] [PubMed] [Google Scholar]
  • 30.Oh, J., Jang, S., Kim, H. & Kim, J. J. Efficacy of mobile app-based interactive cognitive behavioral therapy using a chatbot for panic disorder. Int. J. Med. Inform.140, 104171 (2020). 10.1016/j.ijmedinf.2020.104171 [DOI] [PubMed] [Google Scholar]
  • 31.Lecler, A., Duron, L. & Soyer, P. Revolutionizing radiology with GPT-based models: Current applications, future possibilities and limitations of ChatGPT. Diagn. Interv. Imaging104, 269–274 (2023). 10.1016/j.diii.2023.02.003 [DOI] [PubMed] [Google Scholar]
  • 32.Ibrahim, A. M., Elbasheir, M., Badawi, S., Mohammed, A. & Alalmin, A. F. M. Skin cancer classification using transfer learning by VGG16 architecture (case study on Kaggle dataset). J. Intell. Learn. Syst. Appl.15, 67–75 (2023). [Google Scholar]
  • 33.Shekhar, S. & Thakur, N. Deep learning framework for forecasting diabetic retinopathy: An innovative approach. Int. J. Innov. Res. Comput. Sci. Technol.12, 17–20 (2024). [Google Scholar]
  • 34.Saha, R., Malviya, L., Jadhav, A. & Dangi, R. Early stage HIV diagnosis using optimized ensemble learning technique. Biomed. Signal Process. Control89, 105787 (2024). 10.1016/j.bspc.2023.105787 [DOI] [Google Scholar]
  • 35.Chavez, M. R., Butler, T. S., Rekawek, P., Heo, H. & Kinzler, W. L. Chat generative pre-trained transformer: Why we should embrace this technology. Am. J. Obstet. Gynecol.228, 706–711 (2023). 10.1016/j.ajog.2023.03.010 [DOI] [PubMed] [Google Scholar]
  • 36.Li, T. et al. CancerGPT for few shot drug pair synergy prediction using large pretrained language models. Npj Digit. Med.7, 40 (2024). 10.1038/s41746-024-01024-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog1, 9 (2019). [Google Scholar]
  • 38.Xu, J., Sun, X., Zhang, Z., Zhao, G. & Lin, J. Understanding and improving layer normalization. Adv. Neural. Inf. Process Syst.32, 4383–4393 (2019). [Google Scholar]
  • 39.Zhuo, J., Zhu, J. & Zhang, B. Adaptive dropout rates for learning with corrupted features. IJCAI24, 4126–4133 (2015). [Google Scholar]
  • 40.Ko, B., Kim, H.-G., Oh, K.-J. & Choi, H.-J. Controlled dropout: A different approach to using dropout on deep neural network. IEEE Int. Conf. Big Data Smart Comput.10.1109/BIGCOMP.2017.788169 (2017). 10.1109/BIGCOMP.2017.788169 [DOI] [Google Scholar]
  • 41.Gao, H., Pei, J. & Huang, H. Demystifying dropout. In International Conference on Machine Learning 2112–2121 (PMLR, 2019). [Google Scholar]
  • 42.Cowan, B. R., Clark, L., Candello, H. & Tsai, J. Introduction to this special issue: Guiding the conversation: New theory and design perspectives for conversational user interfaces. Hum. Comput. Interact.38, 159–167 (2023). 10.1080/07370024.2022.2161905 [DOI] [Google Scholar]
  • 43.Banks, A. & Porcello, E. Learning React: Functional Web Development with React and Redux (O’Reilly Media, Inc., 2017). [Google Scholar]
  • 44.Iqbal, M. W., Naqvi, M. R., Khan, M. A., Khan, F. & Whangbo, T. Mobile devices interface adaptivity using ontologies. Comput. Mater. Contin.71, 4767–4784 (2022). [Google Scholar]
  • 45.Larbi, D., Denecke, K. & Gabarron, E. Usability testing of a social media chatbot for increasing physical activity behavior. J. Pers. Med.12, 828 (2022). 10.3390/jpm12050828 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Holmes, S. et al. Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces? In 31st European Conference on Cognitive Ergonomics (2019). [Google Scholar]
  • 47.Larbi, D., Gabarron, E. & Denecke, K. Social media chatbot for increasing physical activity: Usability study. In pHealth 227–232 (IOS Press, 2021). [DOI] [PubMed] [Google Scholar]
  • 48.Dirin, A., Nieminen, M. & Laine, T. H. Feelings of being for mobile user experience design. Int. J. Hum. Comput. Interact.38, 1–21 (2022). [Google Scholar]
  • 49.Haugeland, I. K. F., Følstad, A., Taylor, C. & Alexander, C. Understanding the user experience of customer service chatbots: An experimental study of chatbot interaction design. Int. J. Hum. Comput. Stud.161, 102788 (2022). 10.1016/j.ijhcs.2022.102788 [DOI] [Google Scholar]
  • 50.Banks, A. & Porcello, E. Learning React: Modern Patterns for Developing React Apps (O’Reilly Media, 2020). [Google Scholar]
  • 51.Tredinnick, L. & Laybats, C. Black-box creativity and generative artificial intelligence. Bus. Inf. Rev.40, 98–102 (2023). [Google Scholar]



Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
