Abstract
The performance of large language models (LLMs) during acute pain assessment in cats has not been evaluated. This study evaluated the agreement of Feline Grimace Scale (FGS) scoring between four chatbots (ChatGPT, Gemini, Claude AI, and Perplexity) and an expert veterinarian, including bias and limits of agreement (LoA), and whether bias would be reduced when retested after two months. Fifty cat facial images were scored twice, two months apart, by each chatbot using the FGS (ear position, orbital tightening, muzzle tension, whiskers change and head position). The Bland–Altman method was used to analyze bias and LoA. Chatbots showed positive bias, indicating underestimation of FGS scores. Claude AI presented an acceptable bias (< 0.1), suggesting good agreement after retesting. However, its LoA spanned the FGS threshold for analgesia (0.39). The LoA of ChatGPT did not span the threshold, but its bias was unacceptable (> 0.1). Gemini showed unacceptable bias, and its LoA spanned the FGS threshold. Perplexity showed unacceptable bias, and its LoA spanned the threshold after retesting. Most chatbots showed poor agreement and could have compromised analgesia during testing and/or retesting; pain scoring could be overestimated or underestimated to an extent that would cause overtreatment or undertreatment of pain.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-27404-z.
Keywords: Analgesia, Artificial intelligence, Cats, Image analysis, Large language models, Pain assessment
Subject terms: Health care, Predictive medicine
Introduction
Pain impairs feline health and welfare1–3. Oligoanalgesia, which is the medical failure to recognize and provide analgesia in patients with acute pain, is a global issue in small animal practice4. Feline pain is historically underdiagnosed and under-treated, with cats receiving less veterinary care and pain management than dogs3,4. Indeed, feline pain assessment is considered to be challenging for veterinary students, veterinarians, and even experts in the field5–8. Several factors were found to hinder the adoption of pain assessment scales, including lack of familiarity, time constraints, and lack of training9. A greater understanding of pain management enhances the use of analgesics in small animals10,11, whereas inadequate knowledge results in poor analgesic practices, including limited use of analgesic drugs12. Together, these factors make cats more prone to oligoanalgesia, causing pain and suffering and compromising their welfare.
The Feline Grimace Scale (FGS) is a reliable and responsive acute pain assessment instrument based on changes in facial expressions. Construct validity of the FGS was confirmed by its strong correlation with the Glasgow composite measure pain scale13. This instrument has demonstrated reliability in cats with medical and surgical pain, in addition to those undergoing dental extractions13,14. Inter-rater reliability was good among cat owners, veterinarians, veterinary students, and nurses when using total FGS ratio scores. The FGS can be used reliably for acute pain assessment in cats, even by untrained individuals, and potentially in home environments13,15,16. The assessment focuses on the changes of five action units (AU), comprising ear position, orbital tightening, muzzle tension, whiskers change, and head position. Each AU is scored from 0 to 2 (0 = absence of AU; 1 = moderate appearance of AU or uncertainty over the presence; 2 = obvious appearance of AU)13. The final FGS scores are calculated as the sum of the scores assigned to each action unit divided by the total possible score, excluding those marked as not possible to score (e.g. 3/10 = 0.3 or 4/8 = 0.5)13,15.
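The scoring rule above can be sketched in a few lines of code. This is an illustrative example only; the function name and inputs are hypothetical and not taken from any published FGS tooling.

```python
def fgs_ratio(au_scores):
    """Compute the total FGS ratio score from five action-unit (AU) scores.

    Each AU is scored 0, 1, or 2; an AU marked "not possible to score" is
    given as None and excluded from both numerator and denominator,
    following the rule described in the text.
    """
    scorable = [s for s in au_scores if s is not None]
    if not scorable:
        raise ValueError("no scorable action units")
    # Each AU has a maximum score of 2, so the denominator is 2 per scorable AU.
    return sum(scorable) / (2 * len(scorable))

# Ears=1, eyes=1, muzzle=1, whiskers=0, head=0 -> 3/10 = 0.3
print(fgs_ratio([1, 1, 1, 0, 0]))     # 0.3
# One AU not scorable: 4/8 = 0.5
print(fgs_ratio([2, 1, 1, 0, None]))  # 0.5
```

With a threshold of 0.39 for rescue analgesia (see Results), the first example would fall below the analgesic cut-off and the second above it.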
The application of artificial intelligence (AI) for medical use has gained popularity. Deep learning algorithms have been employed to encode pain features from multiple modalities, with multimodal approaches showing significant improvements, especially when incorporating temporal dimensions17. Automated facial analysis for pain assessment is a thriving field in medical research18. In human healthcare, “PainChek”, a mobile phone application, has been integrated into the clinical setting for infants and people with dementia19,20. In veterinary medicine, AI-powered models have been published for pain assessment in cats, horses, laboratory rats, and rhesus macaques, among others6,21–25. Additionally, deep learning and convolutional neural networks have been developed to detect cat emotions and their associated facial expressions26,27. These studies showed promising performance of AI models, suggesting the feasibility of automated pain recognition in animals21,23–25.
Deep learning is a type of machine learning algorithm that processes data using neural networks, mimicking the human brain28. A deep learning model used annotated cat facial images for FGS scoring. This model combined convolutional neural networks with machine learning to detect facial landmarks and perform pain scoring, achieving high accuracy (~ 95.5%) in classifying painful and non-painful cats21. Large language models (LLMs) like ChatGPT are a specialized subset of deep learning models that use natural language processing (NLP) to produce human-like conversations by understanding the inputs from users and generating contextually relevant responses29. LLMs demonstrate capabilities in medical knowledge retrieval, research support and clinical decision-making30. While LLMs excel in language comprehension, they face challenges in processing medical images. However, integrating LLMs with computer-aided diagnosis (CAD) networks can enhance the interpretation of medical images, including those used in pain assessment31. Multimodal LLMs are being developed to process diverse data types, including medical imaging and electronic health records, potentially improving diagnostic accuracy30.
LLMs have shown promising performance in human pain detection32,33. There is likely an interest among cat caregivers in using LLMs to learn more about their cats’ health, including acute pain assessment using the FGS. However, to the authors’ knowledge, their agreement, precision and accuracy when compared with veterinary experts in feline pain assessment have not been reported. These LLMs (i.e., chatbots) could contribute to the advancement of feline health and welfare, including acute pain assessment, particularly helping cat caregivers with early disease detection if they feel something “is not right” with their cats.
The objectives of this study were to determine the agreement of FGS scores between four chatbots and an expert rater, including bias and limits of agreement (LoA), and whether bias would be reduced when retesting after two months. The hypotheses were that the four chatbots would have good agreement with the expert rater without exceeding LoA and that bias/LoA would be reduced after retesting.
Results
Bias and LOA of total FGS scores between the expert rater and chatbots
For ChatGPT, the LoA in trial-1 ranged from − 0.13 to 0.59 with a bias of 0.22 (Fig. 1A). In trial-2, the LoA ranged from − 0.14 to 0.57 with a bias of 0.21 (Fig. 1B). The bias was > 0.1 in both trials, suggesting poor agreement between ChatGPT and the expert rater at both time points. The LoA did not span the analgesic threshold of 0.39 (difference between bias and lower or upper LoA) at either time point. The positive slope of the linear model showed that as the mean value increased, the difference also increased in both trial-1 (estimate = 0.50; p < 0.001) and trial-2 (estimate = 0.51; p < 0.001), indicating proportional bias (Fig. 1).
Fig. 1.
Bland–Altman plots showing the agreement of Feline Grimace Scale (FGS) scores between ChatGPT and an expert rater, who is considered the gold standard (GS). (A) Agreement between ChatGPT and the GS in trial-1. (B) Agreement between ChatGPT and the GS in trial-2. Bias (central red dashed line), upper and lower limits of agreement (black dashed lines), and linear regression line (blue continuous lines) are indicated on each respective plot.
For Claude AI, the LoA in trial-1 ranged from − 0.37 to 0.59 with a bias of 0.11 (Fig. 2A). In trial-2, the LoA ranged from − 0.39 to 0.49 with a bias of 0.05 (Fig. 2B). The bias was < 0.1 in trial-2, suggesting good agreement between Claude AI and the expert rater. The LoA spanned the analgesic threshold of 0.39 at both time points. The slope of the linear model was slightly positive in trial-1 (estimate = 0.01; p = 0.91) and slightly negative in trial-2 (estimate = − 0.03; p = 0.77); neither slope was significant, indicating no proportional bias (Fig. 2).
Fig. 2.
Bland–Altman plots showing the agreement of Feline Grimace Scale (FGS) scores between Claude AI and an expert rater, who is considered the gold standard (GS). (A) Agreement between Claude AI and the GS in trial-1. (B) Agreement between Claude AI and the GS in trial-2. Bias (central red dashed line), upper and lower limits of agreement (black dashed lines), and linear regression line (blue continuous lines) are indicated on each respective plot.
For Gemini, the LoA in trial-1 ranged from − 0.31 to 0.75 with a bias of 0.22 (Fig. 3A). In trial-2, the LoA ranged from − 0.29 to 0.79 with a bias of 0.25 (Fig. 3B). The bias was > 0.1 in both trials, suggesting poor agreement between Gemini and the expert rater at both time points. The LoA spanned the analgesic threshold of 0.39 at both time points. The positive slope of the linear model showed that as the mean value increased, the difference also increased in trial-1 (estimate = 0.92; p < 0.001) and trial-2 (estimate = 1.16; p < 0.001), indicating proportional bias (Fig. 3).
Fig. 3.
Bland–Altman plots showing the agreement of Feline Grimace Scale (FGS) scores between Gemini and an expert rater, who is considered the gold standard (GS). (A) Agreement between Gemini and the GS in trial-1. (B) Agreement between Gemini and the GS in trial-2. Bias (central red dashed line), upper and lower limits of agreement (black dashed lines), and linear regression line (blue continuous lines) are indicated on each respective plot.
For Perplexity, the LoA in trial-1 ranged from − 0.18 to 0.59 with a bias of 0.21 (Fig. 4A). In trial-2, the LoA ranged from − 0.29 to 0.65 with a bias of 0.18 (Fig. 4B). The bias was > 0.1 in both trials, suggesting poor agreement between Perplexity and the expert rater at both time points. The LoA did not span the analgesic threshold of 0.39 in trial-1, but did in trial-2. The positive slope of the linear model showed that as the mean value increased, the difference also increased in trial-1 (estimate = 0.27; p = 0.009) and trial-2 (estimate = 0.29; p = 0.02), indicating proportional bias (Fig. 4).
Fig. 4.
Bland–Altman plots showing the agreement of Feline Grimace Scale (FGS) scores between Perplexity and an expert rater, who is considered the gold standard (GS). (A) Agreement between Perplexity and the GS in trial-1. (B) Agreement between Perplexity and the GS in trial-2. Bias (central red dashed line), upper and lower limits of agreement (black dashed lines), and linear regression line (blue continuous lines) are indicated on each respective plot.
Agreement of each action unit scores between the expert rater and chatbots
ChatGPT showed the highest agreement with the expert rater for orbital tightening in both trial-1 (k = 0.65) and trial-2 (k = 0.70). Claude AI showed the highest agreement with the expert rater for head position in trial-1 (k = 0.45) and for orbital tightening in trial-2 (k = 0.49), Table 1.
Table 1.
Agreement between the gold standard and ChatGPT, Claude AI, Gemini and Perplexity at the action unit level in trial-1 and trial-2. The table shows weighted Cohen’s Kappa (k) coefficient values.
| Chatbot | Trial | Ear | Orbital | Muzzle | Whisker | Head |
|---|---|---|---|---|---|---|
| ChatGPT | Trial-1 | 0.47 | 0.65 | 0.10 | 0.11 | 0.39 |
| ChatGPT | Trial-2 | 0.37 | 0.70 | 0.11 | 0.04 | 0.43 |
| Claude AI | Trial-1 | 0.34 | 0.39 | 0.32 | 0.28 | 0.45 |
| Claude AI | Trial-2 | 0.35 | 0.49 | 0.35 | 0.40 | 0.47 |
| Gemini | Trial-1 | 0.16 | 0.22 | 0.14 | 0.06 | 0.17 |
| Gemini | Trial-2 | 0.25 | 0.38 | 0.06 | 0.02 | 0.23 |
| Perplexity | Trial-1 | 0.46 | 0.57 | 0.21 | 0.14 | 0.17 |
| Perplexity | Trial-2 | 0.24 | 0.40 | 0.21 | 0.31 | 0.26 |
Discussion
This study hypothesized that the four chatbots would show good agreement with the expert rater’s FGS scores and that the bias would be reduced after retesting. However, this hypothesis was not corroborated. Except for Claude AI in trial-2, all chatbots exhibited a bias > 0.1, suggesting poor agreement with the expert rater. The LoA of most chatbots spanned the analgesic cut-off of 0.39, with the exception of ChatGPT in both trials and Perplexity in trial-1. Bias was reduced by approximately 10% for most chatbots during retesting; Claude AI reduced its bias by 55%.
Claude AI in trial-2 demonstrated good agreement, suggesting it can provide similar FGS scores to those of the expert rater. The other chatbots (ChatGPT, Gemini, and Perplexity) demonstrated poor agreement with the expert rater, contrasting with a study that found good agreement between cat owners, veterinary students and nurses, and experienced veterinarians15. LLMs are text generation models rather than systems explicitly designed for fine-grained visual analysis or medical scoring. This can lead to conservative outputs when they are tasked with an unfamiliar prediction task such as the FGS assessment, and each model carries its own unique biases34. For instance, an LLM may rely on its prior knowledge, such as “cats usually hide pain well” or, when uncertain, “assume no pain”, which can systematically skew its outputs34. This discrepancy may also be attributed to variations in image brightness, contrast and color balance, as the image quality was not standardized beforehand21. In a study employing pre-trained deep learning models for a smartphone application, the developers incorporated geometric descriptors and transformations to account for variations in face morphology due to age, sex, coat color, breed, etc.21. However, fine-tuning for vision tasks is unavailable for chatbots such as GPT-4o and Gemini 1.535. These factors could have contributed to the overall poor agreement of chatbots with the expert rater.
All chatbots exhibited a positive bias, indicating pain could have been underestimated using the FGS scores. This finding aligns with a study showing that novice raters (seven small animal veterinarians) underestimated pain scores, both before and after FGS training36. In contrast, cat owners, veterinary students and nurses slightly overestimated FGS pain scores assigned by experienced veterinarians15. Our results also contrasted with a finding in human medicine on emergency triage with ChatGPT-4, which overestimated pain severity when compared with human nurses37. Several technical and training-related factors may explain these differences in pain scores between human raters and chatbots. LLMs possess training data bias and knowledge gaps. Compared to the vast human medical data, LLMs have less knowledge about feline pain38–40. The training of LLMs is unlikely to include many examples of detailed descriptions or labeled FGS pain scores40. Thus, a model might not learn the threshold at which subtle cues indicate a given pain score, and this lack of domain-specific calibration may lead LLMs to systematically underestimate FGS scores41. Modern LLMs are also influenced by alignment training, resulting in conservative language42. With reinforcement learning from human feedback, they are tuned to avoid making extreme statements without sufficient evidence, especially in sensitive domains like health. This often leads to a conservative tone and could nudge their baseline assumption towards underestimation43.
The LoA of most chatbots consistently exceeded the FGS analgesic threshold (0.39). The large upper and lower LoA demonstrate that some of these cats could have been under- or overscored on the FGS by most chatbots. Some pain-free cats could have received unnecessary analgesia, whereas some painful cats would not have received analgesia if these chatbots were used for pain scoring in the clinical setting. In contrast, human raters in a previous study15, like ChatGPT here, showed LoA below the threshold. In the comparison between the expert rater and ChatGPT, the positive slope of the linear model indicated that as the mean value increased, the difference also increased, suggesting the presence of proportional bias. This finding implies that as pain levels in cats increased, the disparity between the expert rater and the chatbots (ChatGPT, Gemini, and Perplexity) also increased. However, proportional bias was not significant when comparing the expert rater with Claude AI. While Claude AI showed acceptable bias among the tested chatbots, producing FGS scores comparable to the expert rater, its LoA surpassed the analgesic threshold. Therefore, even if overall pain scoring was comparable to the expert rater, inappropriate analgesic decisions, as mentioned above, could have occurred in some cats. Conversely, although ChatGPT had poor agreement with the expert rater and slightly underestimated pain scores, its LoA did not exceed the cut-off point in either trial. It is thus less likely to cause either overtreatment or undertreatment if used for pain scoring and the administration of analgesia.
Most chatbots demonstrated a modest bias reduction of approximately 10% after retesting. This is considerably less than the remarkable 50% bias reduction observed in a previous study of FGS scoring among seven veterinarians following training36. Although a direct comparison is inappropriate due to the differing methodologies, the example highlights the potential for bias reduction through targeted interventions. Moreover, the current statistical analysis is limited by its inability to quantify the impact of bias on the clinical FGS assessment44. Hence, even though the chatbots’ bias decreased after retesting (trial-2), the clinical impact of this reduction remains uncertain. Claude AI showed the greatest bias reduction, but the clinical relevance of this change remains questionable, as its bias went from 0.11 in trial-1 to 0.05 in trial-2, i.e., from just above to below the 0.1 threshold for good agreement. Potential model improvements made by developers between trials, incremental software updates, or improvements in image processing capabilities might partially explain the bias reduction. However, these chatbots, as publicly accessible proprietary models, do not learn from user interactions in real time45. Therefore, it is unlikely that the observed bias reduction resulted directly from the interactions in trial-1.
Whiskers change and muzzle tension exhibited the lowest agreement with the expert rater across both trials, followed by head position and orbital tightening. Whiskers change and muzzle tension are inherently challenging for both human raters and AI-based assessments, a phenomenon also observed in rabbits21,43,46. This suggests shared interpretative challenges in assessing certain subtle facial features. However, another deep-learning-based FGS study highlighted difficulties with ear position rather than muzzle tension6. Such differences demonstrate that agreement variability may arise from model-specific factors, dataset characteristics or methodological differences. Therefore, while whiskers change and muzzle tension presented the least agreement in our findings, caution should be exercised in assuming this would universally apply across all types of AI assessments or visual analysis methods. The FGS assessment is susceptible to several factors, including fur color, image background, breed, morphological variations, lighting conditions and frontal/non-frontal position of the cat15. The baseline muzzle shape and visibility of the whisker pad differ greatly, for example, between a brachycephalic Persian and a long-snouted Siamese. Without geometric normalization using deep learning models, LLMs apply a one-size-fits-all approach. This contextual ambiguity makes certain AUs difficult to assess21. Moreover, the static, two-dimensional nature of image analysis may impair the accurate identification of whisker changes and muzzle tension6,15.
This study has several limitations. Currently, it may not be possible to understand how each chatbot arrives at its scores due to their “black box” nature. The study did not compare the inherent limitations of LLMs in precise landmark detection with traditional deep learning systems. The original FGS AI study used deep learning to analyze cat images with raw pixel data and data annotations to extract facial landmarks such as ear angle21. By contrast, pre-trained LLMs do not inherently process raw images but rely on textual input, for example, a list of observed facial cues. This means LLMs lack the rich raw visual data that computer vision models possess. The training of vision language models (VLMs) for vision tasks, such as the PaliGemma VLM, improved the accuracy of facial attribute recognition such as human emotion classification35. Moreover, we did not standardize the image quality. The variability in resolutions, lighting conditions, face orientations, and blurred and noisy content potentially introduced bias into the FGS scoring. This contrasts with a previous study that pre-processed images to reduce variations in brightness, contrast and/or color balance21. The two-dimensional nature of images may also impair the identification of certain AUs and thus affect the total FGS ratio scores6. Furthermore, the current statistical analysis could not assess the clinical impact of bias changes, making it difficult to interpret the clinical relevance of bias reduction after retesting. Moreover, this study only evaluated short-term performance, and the findings represent a snapshot in time; they may not generalize to future chatbot performance, as model updates are unpredictable. Finally, all comparisons were made with scores from a single expert rater using 50 images. It is not known how the results would have changed with other types and numbers of raters, or with different images, for example, of cats of different breeds.
Based on the findings from 50 cat images and the scores of an expert rater, chatbots underestimated acute pain when scoring with the FGS. The large LoA revealed considerable individual variability, which could influence analgesic decisions. In some individual cases, pain scoring could be overestimated or underestimated to an extent that would cause overtreatment or undertreatment of pain. ChatGPT is less likely to miss painful cats requiring analgesics or to give analgesics to non-painful cats. However, ChatGPT may underestimate pain when it is high (at higher FGS scores). On the other hand, Claude AI, despite good overall agreement with the expert rater, misclassified feline pain in some individuals. As people increasingly rely on LLMs for information, it is essential to ensure that these models provide accurate and reliable data on feline pain assessment. The intersection of pain assessment, feline health and welfare, and LLMs represents a promising area of research with the potential to transform our understanding and management of pain in cats. By leveraging advanced technologies and AI, we can enhance the quality of life of cats, improve veterinary practices, and promote a deeper understanding of animal welfare. This study highlights the need for an accurate, automated pain assessment deep learning model using the FGS.
Methods
Chatbots
The large language models selected for this study were chosen based on their market share, as widespread adoption increases potential clinical relevance. According to recent market analyses, ChatGPT (OpenAI), Copilot (Microsoft), Gemini (Google), Perplexity AI (Perplexity) and Claude AI (Anthropic) represented over 95% of the global market as of early 202447,48. Microsoft Copilot was excluded due to technical limitations in processing feline facial images effectively, primarily related to image compatibility and processing capabilities. The latest available versions at the time of the study (April 2024) were used: OpenAI ChatGPT Plus (GPT-4o), Google Gemini Advanced (Gemini 1.5 Pro-002), Anthropic Claude (3.5 Sonnet) and Perplexity AI (version 2.27.2)49–52.
Image selection
A total of 50 feline facial images of domestic short hair or domestic long hair cats were randomly obtained from a database of the expert veterinarian who developed and validated the FGS (PVS). The randomization in image selection was achieved using a random sequence generator available at www.randomization.com15. Images were acquired from previous clinical trials that received ethical approval from the institutional animal care and use committee of the Université de Montréal. The cats were of different ages and sexes, and images were representative of timepoints before or after analgesic treatment or surgery, and with different degrees of pain. The gold standard scores for each action unit and the total FGS ratio were previously determined by PVS and used for statistical analysis. The expert rater (gold standard) was a board-certified veterinary anesthesiologist with extensive research and clinical experience in feline pain management and a co-developer of the FGS13,15, as comparisons with non-experts would not align with the Bland–Altman method’s requirement for a validated reference.
Data collection
To ensure that each chatbot’s evaluation was unbiased by prior interactions, all assessments were performed using freshly created chatbot sessions. The previous conversation history was deleted before each assessment. A standardized, structured prompt (“Please assess pain of the following cat facial image using the Feline Grimace Scale and provide scores for each of the five action units—ear position, orbital tightening, muzzle tension, whisker changes, and head position—as well as the total FGS ratio”) was used. This structured prompt aimed to ensure consistency across all chatbot interactions, reducing variability in the assessment. Moreover, one computer was used, and one person was transcribing the results to avoid confounding factors.
Cat facial images were individually uploaded as JPG or PNG files directly into the chatbot interfaces. Each image was assessed independently three separate times during each trial, with repeated assessments spaced by at least 24 h to minimize potential memory or bias effects. Images were randomized using an online randomizer (https://www.randomizer.org) before each assessment to further reduce order effects53.
Assessments were conducted at two separate time points (trial-1 and trial-2). Retesting (trial-2) was performed two months later to evaluate potential improvements or changes in chatbot performance over time using the same chatbot version. The two-month interval was chosen due to the timeline of the study and the dynamic nature of LLM updates. Each chatbot’s responses, including scores for each individual action unit and the total FGS ratio, were documented via manual transcription.
Statistical analysis
Mean FGS scores from the three independent assessments were calculated for each image and AU during trials-1 and 2, separately. Agreement and bias between chatbots and the expert rater were assessed using the Bland–Altman method54. Based on a previous publication, a bias of < 0.1 (< 1 unit in the FGS score) was considered acceptable, indicating good agreement, whereas a bias of > 0.1 (> 1 unit in the FGS score) was considered unacceptable, indicating poor agreement36,55. A bias with a negative value would suggest an overestimation of the FGS score by the novice rater (i.e., the chatbot in this case) compared with the gold standard, whereas a positive bias would suggest an underestimation of pain by the chatbot15. The LoA should not span the analgesic threshold of the FGS (0.39); otherwise, this could indicate that pain-free cats would have received unnecessary analgesia or that painful cats would not have received analgesia15,55. A linear regression model was fitted between the mean and difference values of each Bland–Altman plot to analyze proportional bias. Model slopes, estimates and p values are reported. Agreement between chatbots and the expert rater for each AU was also calculated using weighted Cohen’s Kappa (k) in an attempt to understand how each AU contributed to the agreement of FGS total scores during each chatbot prompting. Cohen’s Kappa coefficients were computed from the three independent assessments for each AU, and the average k is reported for trials-1 and 2, separately. All statistical analyses were carried out using RStudio with R version 4.2.356.
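The Bland–Altman computation described above can be sketched as follows. This is a minimal illustration, assuming paired per-image total FGS scores; the data values and function name are hypothetical, and the study's actual analysis was performed in R (supplementary file 2).

```python
import statistics

def bland_altman(expert, chatbot):
    """Return the bias (mean of expert - chatbot differences) and the
    95% limits of agreement (bias ± 1.96 × SD of the differences)."""
    diffs = [e - c for e, c in zip(expert, chatbot)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical paired total FGS ratio scores for six images.
expert  = [0.3, 0.5, 0.7, 0.2, 0.6, 0.4]
chatbot = [0.2, 0.3, 0.4, 0.2, 0.4, 0.3]

bias, (lower, upper) = bland_altman(expert, chatbot)
# A positive bias means the chatbot underestimated the FGS scores.
# Agreement is interpreted as good if bias < 0.1, and the LoA should not
# span the analgesic threshold of 0.39.
```

A proportional-bias check, as in the study, would additionally regress the differences on the pairwise means and test whether the slope differs from zero.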
Supplementary Information
Below is the link to the electronic supplementary material.
Acknowledgements
Not applicable.
Author contributions
STN: Conceptualization, Investigation, Methodology, Data curation, Writing—original draft, Writing—review & editing. SSUHB: Formal data analysis, Visualization, Writing—review & editing. SAS: Writing—review & editing. PVS: Conceptualization, Investigation, Methodology, Data curation, Writing—original draft, Writing—review & editing. All authors were involved in the research, contributed to the preparation and approved the final manuscript.
Funding
Not applicable.
Data availability
The raw dataset and R analysis codes are available in the supplementary file 1 and 2, respectively.
Declarations
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Monteiro, B. P. et al. 2022 WSAVA guidelines for the recognition, assessment and treatment of pain. J. Small Anim. Pract. 64, 177–254 (2023).
- 2. Steagall, P. V. et al. 2022 ISFM consensus guidelines on the management of acute pain in cats. J. Feline Med. Surg. 24, 4–30 (2022).
- 3. Rodan, I. et al. 2022 AAFP/ISFM cat friendly veterinary interaction guidelines: Approach and handling techniques. J. Feline Med. Surg. 24, 1093–1132 (2022).
- 4. Simon, B. T., Scallan, E. M., Carroll, G. & Steagall, P. V. The lack of analgesic use (oligoanalgesia) in small animal practice. J. Small Anim. Pract. 58, 543–554 (2017).
- 5. Steagall, P. V. et al. Perceptions and opinions of Canadian pet owners about anaesthesia, pain and surgery in small animals. J. Small Anim. Pract. 58, 380–388 (2017).
- 6. Feighelstein, M. et al. Explainable automated pain recognition in cats. Sci. Rep. 13, 1–16 (2023).
- 7. Väisänen, M.-M., Tuomikoski-Alin, S. K., Brodbelt, D. C. & Vainio, O. M. Opinions of Finnish small animal owners about surgery and pain management in small animals. J. Small Anim. Pract. 49, 626–632 (2008).
- 8. Turner, D. C. The mechanics of social interactions between cats and their owners. Front. Vet. Sci. 8, 1–6 (2021).
- 9. Menéndez, S., Cabezas, M. A. & Gomez de Segura, I. A. Attitudes to acute pain and the use of pain assessment scales among Spanish small animal veterinarians. Front. Vet. Sci. 10, 1302528 (2023).
- 10. Joubert, K. E. Anaesthesia and analgesia for dogs and cats in South Africa undergoing sterilisation and with osteoarthritis - An update from 2000. J. S. Afr. Vet. Assoc. 77, 224–228 (2006).
- 11. Berry, S. H. Analgesia in the perioperative period. Vet. Clin. North Am. Small Anim. Pract. 45, 1013–1027 (2015).
- 12. Hugonnard, M., Leblond, A., Keroack, S., Cadore, J. & Troncy, E. Attitudes and concerns of French veterinarians towards pain and analgesia in dogs and cats. Vet. Anaesth. Analg. 31, 154–163 (2004).
- 13. Evangelista, M. C. et al. Facial expressions of pain in cats: The development and validation of a Feline Grimace Scale. Sci. Rep. 9, 1–11 (2019).
- 14. Watanabe, R. et al. Inter-rater reliability of the Feline Grimace Scale in cats undergoing dental extractions. Front. Vet. Sci. 7, 4–9 (2020).
- 15. Evangelista, M. C. & Steagall, P. V. Agreement and reliability of the Feline Grimace Scale among cat owners, veterinarians, veterinary students and nurses. Sci. Rep. 11, 1–9 (2021).
- 16. Cheng, A. J., Malo, A., Garbin, M., Monteiro, B. P. & Steagall, P. V. Construct validity, responsiveness and reliability of the Feline Grimace Scale in kittens. J. Feline Med. Surg. 25, 1098612X231211765 (2023).
- 17. Gkikas, S. & Tsiknakis, M. Automatic assessment of pain based on deep learning methods: A systematic review. Comput. Methods Programs Biomed. 231, 107365 (2023).
- 18. Gama, F. et al. Implementation frameworks for artificial intelligence translation into health care practice: Scoping review. J. Med. Internet Res. 24, 1–13 (2022).
- 19. Hoti, K., Chivers, P. T. & Hughes, J. D. Assessing procedural pain in infants: A feasibility study evaluating a point-of-care mobile solution based on automated facial analysis. Lancet Digit. Health 3, e623–e634 (2021).
- 20. Babicova, I., Cross, A., Forman, D., Hughes, J. & Hoti, K. Evaluation of the psychometric properties of PainChek® in UK aged care residents with advanced dementia. BMC Geriatr. 21, 1–8 (2021).
- 21. Steagall, P. V., Monteiro, B. P., Marangoni, S., Moussa, M. & Sautié, M. Fully automated deep learning models with smartphone applicability for prediction of pain using the Feline Grimace Scale. Sci. Rep. 13, 1–13 (2023).
- 17.Gkikas, S. & Tsiknakis, M. Automatic assessment of pain based on deep learning methods: A systematic review. Comput. Methods Programs Biomed.231, 107365 (2023). [DOI] [PubMed] [Google Scholar]
- 18.Gama, F. et al. Implementation frameworks for artificial intelligence translation into health care practice: Scoping review. J. Med. Internet Res.24, 1–13 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Hoti, K., Chivers, P. T. & Hughes, J. D. Assessing procedural pain in infants: A feasibility study evaluating a point-of-care mobile solution based on automated facial analysis. Lancet Digit. Heal.3, e623–e634 (2021). [DOI] [PubMed] [Google Scholar]
- 20.Babicova, I., Cross, A., Forman, D., Hughes, J. & Hoti, K. Evaluation of the psychometric properties of PainChek® in UK aged care residents with advanced dementia. BMC Geriatr.21, 1–8 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Steagall, P. V., Monteiro, B. P., Marangoni, S., Moussa, M. & Sautié, M. Fully automated deep learning models with smartphone applicability for prediction of pain using the Feline Grimace Scale. Sci. Rep.13, 1–13 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Martvel, G. et al. Automated video-based pain recognition in cats using facial landmarks. Sci. Rep.14, 1–11 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Lencioni, G. C., de Sousa, R. V., de Souza Sardinha, E. J., Corrêa, R. R. & Zanella, A. J. Pain assessment in horses using automatic facial expression recognition through deep learning-based modeling. PLoS ONE16, 1–12 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Sotocinal, S. G. et al. The rat grimace scale: A partially automated method for quantifying pain in the laboratory rat via facial expressions. Mol. Pain7, 1–10 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Morozov, A., Parr, L. A., Gothard, K., Paz, R. & Pryluk, R. Automatic recognition of macaque facial expressions for detection of affective states. eNeuro8, 0117-21 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Martvel, G. et al. Computational investigation of the social function of domestic cat facial signals. Sci. Rep.14, 27533 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Abubakar, A., Onana Oyana, C. L. N. & Salum, O. S. Domestic cats facial expression recognition based on convolutional neural networks. Int. J. Eng. Adv. Technol.13, 45–52 (2024). [Google Scholar]
- 28.Cui, S., Tseng, H. H., Pakela, J., Ten Haken, R. K. & El Naqa, I. Introduction to machine and deep learning for medical physicists. Med. Phys.47, e127–e147 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Deng, J. & Lin, Y. The benefits and challenges of ChatGPT: An overview. Front. Comput. Intell. Syst.2, 81–83 (2023). [Google Scholar]
- 30.Yuan, M. et al. Large language models illuminate a progressive pathway to artificial intelligent healthcare assistant. Med. Plus1, 100030 (2024). [Google Scholar]
- 31.Wang, S. et al. Interactive computer-aided diagnosis on medical image using large language models. Commun. Eng.3, 1–9 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Jokar, M., Abdous, A. & Rahmanian, V. AI chatbots in pet health care: Opportunities and challenges for owners. Vet. Med. Sci.10, 1–3 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Biswas, S. S. Role of chat GPT in public health. Ann. Biomed. Eng.51, 868–869 (2023). [DOI] [PubMed] [Google Scholar]
- 34.Kang, K., Wallace, E., Tomlin, C., Kumar, A. & Levine, S. Unfamiliar Finetuning Examples Control How Language Models Hallucinate. arXiv 1–15 (2024).
- 35.AlDahoul, N., Tan, M. J. T., Kasireddy, H. R. & Zaki, Y. Exploring Vision Language Models for Facial Attribute Recognition: Emotion, Race, Gender, and Age. arXiv 1–52 (2024).
- 36.Robinson, A. R. & Steagall, P. V. Effects of training on Feline Grimace Scale scoring for acute pain assessment in cats. J. Feline Med. Surg.26, 1–7 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Haim, G. B. et al. Evaluating large language model-assisted emergency triage: A comparison of acuity assessments by GPT-4 and medical experts. J. Clin. Nurs.10.1111/jocn.17490 (2024). [DOI] [PubMed] [Google Scholar]
- 38.Robertson, S. & Lascelles, B.D.X. Long-term pain in cats: How much do we know about this important welfare issue? J. Feline Med. Surg.12, 188–199 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Taylor, P. M. & Robertson, S. A. Pain management in cats - Past, present and future. Part 1. The cat is unique. J. Feline Med. Surg.6, 313–320 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Drazen, J. M. et al. Benefits, limits, and risks of GPT-4 as an AI Chatbot for medicine. New Engl. J. Med.388, 1–7 (2023). [DOI] [PubMed] [Google Scholar]
- 41.Zhang, N. et al. Pruning as a Domain-specific LLM Extractor. Find. Assoc. Comput. Linguist. NAACL 2024 - Find. 1417–1428 (2024) 10.18653/v1/2024.findings-naacl.91.
- 42.Wang, C. et al. Hybrid alignment training for large language models. Find. Assoc. Comput. Linguist. ACL 2024, 11389–11403 (2024). [Google Scholar]
- 43.Zhang, Y.-F. et al. Debiasing Multimodal Large Language Models. arXiv 1–38 (2024).
- 44.Savović, J. et al. Influence of reported study design characteristics on intervention effect estimates from randomised controlled trials: Combined analysis of meta-epidemiological studies. Health Technol. Assess. (Rockv)16, 1–81 (2012). [DOI] [PubMed] [Google Scholar]
- 45.Johnson, S. & Hyland-Wood, D. A Primer on Large Language Models and their Limitations. arXiv 1–33 (2024).
- 46.Banchi, P., Quaranta, G., Ricci, A. & Von Degerfeld, M. M. Reliability and construct validity of a composite pain scale for rabbit (CANCRS) in a clinical environment. PLoS ONE15, 1–12 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Mckinsey. The state of AI in 2023: Generative AI’s breakout year. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year (2023).
- 48.Statista. Artificial Intelligence - Global: Statista market forecast. https://www.statista.com/outlook/tmo/artificial-intelligence/worldwide#market-size (2024).
- 49.OpenAI. Introducing GPT-4o and more tools to ChatGPT free users. https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/ (2024).
- 50.Gemini. Introducing Gemini, your new personal AI assistant. https://gemini.google/assistant/ (2024).
- 51.Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. https://www.anthropic.com/news/3-5-models-and-computer-use (2024).
- 52.Perplexity. What is Perplexity? https://www.perplexity.ai/hub/getting-started (2024).
- 53.Urbaniak, G. C. & Plous, S. Research randomizer (Version 4.0). https://www.randomizer.org/ (2013).
- 54.Bland, J. M. & Altman, D. G. Measuring agreement in method comparison studies. Stat. Methods Med. Res.8, 135–160 (1999). [DOI] [PubMed] [Google Scholar]
- 55.Evangelista, M. C. et al. Clinical applicability of the Feline Grimace Scale: Real-time versus image scoring and the influence of sedation and surgery. PeerJ8, e8967 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.RStudio. https://www.rstudio.com/ (2025).
Associated Data
Supplementary Materials
Data Availability Statement
The raw dataset and the R analysis code are available in supplementary files 1 and 2, respectively.