Scientific Reports
. 2026 Jan 24;16:3258. doi: 10.1038/s41598-025-29591-1

Evaluating the performance of a generative AI model in assessing qualitative health research articles' adherence to objective reporting standards

Aloysius Wei-Yan Chia 1, Winnie Li-Lian Teo 2, Yasmin Lynda Munro 4, Rinkoo Dalan 1,3
PMCID: PMC12835271  PMID: 41580466

Abstract

As qualitative research increasingly informs patient-centred care, rapid assessment of existing evidence against research guidelines is needed to inform practice settings. We evaluate the performance of Claude, a generative AI model, in assessing qualitative articles' adherence to a consensus-based reporting guideline. The Consolidated Criteria for Reporting Qualitative Research (COREQ), commonly used in qualitative research, is used as a reference criteria list to test the performance of Claude. 15 articles from a systematic scoping review were extracted for analysis. Structured prompts were applied to Claude to evaluate whether each criterion in COREQ is met for each article. Two independent reviewers checked model results for concordance and accuracy. F1 scores, balanced accuracy (BA), the Matthews correlation coefficient (MCC) and other performance metrics were tabulated at the criterion, criterion domain, and article level. 4 main categories were identified from performance results, namely: (1) balanced (6/32 criteria, 18.75%), (2) under-reported (2/32, 6.25%), (3) mixed errors (9/32, 28.13%), and (4) information limited (15/32, 46.88%) clusters. Results show heterogeneity amongst different clusters of criteria. While balanced criteria perform consistently across a range of metrics, criteria in under- or over-reported clusters require targeted prompt adjustments. Information limited criteria require a larger sample of articles to verify results. Clearly defined criteria outperformed criteria that were broadly defined or required interpretation. Segmenting criteria into performance clusters allows researchers to identify areas of incongruence, so that specific strategies to modify prompts may be applied to any given set of research articles. Customised approaches that are expertly crafted can allow for the rapid extraction of valuable insights that may inform patient-centred recommendations and practice guidelines.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-29591-1.

Keywords: AI performance evaluation, Generative AI, Evidence synthesis, Adherence to reporting standards, Adherence to standardised checklists and guidelines, Research checklists

Subject terms: Health care, Medical research

Introduction

Assessing the adherence of qualitative research studies to consensus-based guidelines is a crucial step in determining the reliability and robustness of health and medical research. Checklists and guidelines allow readers to validate whether study components have been adequately fulfilled and to assess whether a study has been systematically conducted. For qualitative research, the 32-item Consolidated Criteria for Reporting Qualitative Research (COREQ) (supplementary information file 3, table A1)1 and the Standards for Reporting Qualitative Research (SRQR) checklist2 are two of the most frequently used guidelines to determine whether qualitative articles are adequately reported. Given how patient-centric and scalable solutions are increasingly needed in dynamic, fast-paced healthcare settings, the adoption of insights from qualitative research can help provide diverse perspectives and pragmatic nuance in the deployment of innovative technologies or treatments.

Recent developments in the use of large language models (LLMs) have shown potential in helping researchers improve the efficiency of qualitative analysis3. Bijker et al.4 applied ChatGPT to 539 user-generated forum messages related to sugar consumption to generate qualitative codes aimed at identifying behavioural change mechanisms. Findings indicate that inductively generated codes had higher agreement with experts than deductive codes that followed a pre-determined framework (κ = 0.69–0.84 for inductive coding; κ = 0.52–0.73 for unconstrained deductive coding; κ = 0.66 for structured deductive coding).

Prescott et al.5 tested the ability of ChatGPT and Google's Bard to perform thematic analysis on 40 short SMS messages, part of a digital intervention program to promote medication adherence amongst methamphetamine users with HIV. Although intercoder reliability between the LLM tools and human coders ranged from fair to moderate for ChatGPT (ICR = 47% and 37% for inductive and deductive analysis) and Bard (ICR = 37% and 36%), thematic agreement was good (71% and 50% for ChatGPT; 71% and 58% for Bard, for inductive and deductive analysis respectively). More notably, ChatGPT took only 15 and 25 min to conduct inductive and deductive analysis respectively, and Bard took 20 min for each, compared to 492 and 705 min for humans, yielding time savings of 97% for both types of analysis for ChatGPT, and 96% and 97% for Bard.

Despite its wide-ranging potential and demonstrable proficiency in analysing text, current use of generative AI tools is largely confined to individual studies, where models are applied to an assemblage of data for qualitative coding to generate insights. Less explored are analyses conducted at the meta-analytic level, where a model is tasked with checking the adherence of multiple research studies to consensus-based or objective checklists, which remains a laborious, time-consuming and manual process in health and clinical research. By evaluating a model's performance against a commonly used objective framework, this study aims to contribute to emerging research on AI-assisted evidence synthesis6–8.

Study aims

The aim of this study is to evaluate the performance of Claude 3.5 Sonnet (released June 2024)9 in assessing qualitative health research articles' adherence to a consensus-based, objective set of qualitative reporting guidelines. A 2-step iterative zero-shot approach was used to evaluate 15 articles extracted from a previously conducted scoping review. Output was validated for accuracy and summarised in a confusion matrix at the article and criterion level by 1 reviewer, followed by a second reviewer on 5 articles to check for inter-rater reliability. Additional robustness checks were performed. Model performance was primarily evaluated using F1, balanced accuracy (BA) scores, and the Matthews correlation coefficient (MCC), derived from accuracy, precision, recall (sensitivity) and specificity scores. A number of additional supporting performance metrics were tabulated to holistically evaluate performance results. Results are reported at the criterion level, the criterion domain level as defined in COREQ, and the article level. Quantitative error analysis was conducted to understand which criteria tend to be falsely evaluated by Claude.

We evaluate articles extracted from a scoping review rather than conduct a separate systematic search tailored specifically for model evaluation, so as to better understand the basic performance of a LLM when used as a checklist tool embedded within a larger scoping review.

The full list of articles evaluated can be found in supplementary information file 3.

Methods

Overview

The overall sequence of this study is as follows:

  1. 15 qualitative articles were extracted from a scoping review conducted separately in a previous study, to evaluate the performance of Claude in assessing whether extracted articles meet an objective list of reporting criteria as defined in COREQ.

Initial prompt testing and adjustments using Claude 3 Opus:

  2. The list of COREQ criteria as described in Tong et al. (2007) was first uploaded to Claude.

  3. Instructions were given to create a 32 by 4 table, consisting of:
     • a numeric number, in ascending order;
     • the list of criteria, as described in the COREQ guideline;
     • a 'Yes/No' column to check for the presence/absence of each criterion; and
     • a column to justify reasons for each 'Yes/No' response.

  4. Qualitative full-text articles were uploaded to Claude individually for evaluation, with clear and specific prompt instructions provided for Claude to evaluate each article. [Study supplements were not uploaded to Claude.]

  5. Prompt adjustments were made based on initial output and errors generated (see Figs. 2 and 3 for full prompt wordings and adjustments made).

Application of prompts to Claude 3.5 Sonnet:

  6. Steps (2) and (3) were repeated with updated prompts.

  7. Qualitative full-text articles were re-uploaded individually to Claude for evaluation on a separate message thread. [Article supplementary files/appendices (if any) were not uploaded.]

  8. Standardised prompts were applied to all articles individually.

Evaluation of results:

  9. Output generated for all articles was checked and evaluated by a reviewer.

  10. A second reviewer independently evaluated 5 randomly selected articles.

  11. Results were classified as true positive/negative or false positive/negative by both reviewers and added into separate, individual confusion tables, summarised at the criterion/article level.

  12. The 2 reviewers convened to compare evaluations and discuss discordant criteria.

  13. The final confusion table after consensus agreement was tabulated using a range of performance metrics.

An overview of study procedures is illustrated in Fig. 1 below.

Figure 1. Study roadmap and analytical approach.

Data sources

15 qualitative articles were retrieved from a previously conducted scoping review on patient-physician communication of health and risk information pertaining to cardiovascular diseases and diabetes10. A comprehensive database search was conducted for articles published between 1 January 2000 and 3 October 2023. Of 8378 articles screened, 88 were reviewed, of which 30 were included, comprising 15 qualitative, 14 quantitative and 1 mixed-methods studies. The PRISMA flow diagram, search terms and key characteristics of included studies can be found in additional files 1 to 3 of the referenced scoping review article.

Ethical considerations

This study is exempt from ethics approval, as only journal articles are included as data points for analysis. No human/patient identifiers or information on research subjects were collected.

Guiding prompts and model output

2-step prompt sequence

First, the original COREQ article with description of each criterion was uploaded into Claude as contextual information, followed by prompt instructions to create a 32 by 4 blank table. Claude is then instructed to populate the first and second columns of each row with a sequential number and to include the name of each criterion respectively. In the third column, a ‘Yes/No’ option is included to indicate whether a criterion is mentioned in an article. If the third column is indicated as ‘Yes’ in any given article, the model is tasked to provide evidence from the article justifying the presence of each criterion in the fourth column. If the third column is ‘No’, then ‘N.A.’ should be indicated by the model in the fourth column. The first prompt sequence is to be applied only once within each dialogue thread.
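The row-population rule described above can be sketched as a minimal helper. This is an illustrative Python sketch, not the study's actual tooling; the field names are hypothetical:

```python
def make_row(number: int, criterion: str, present: bool, evidence: str) -> dict:
    """One row of the 32 x 4 evaluation table.

    If the 'Yes/No' column is 'No', the evidence column must read 'N.A.';
    otherwise it carries verbatim text justifying the 'Yes'.
    """
    return {
        "no": number,                              # column 1: sequential number
        "criterion": criterion,                    # column 2: COREQ criterion name
        "present": "Yes" if present else "No",     # column 3: Yes/No judgement
        "evidence": evidence if present else "N.A.",  # column 4: quote or N.A.
    }
```

For example, a criterion judged absent always yields `"N.A."` in the fourth column, regardless of any text supplied.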

Each article is then uploaded to Claude individually for assessment. Prompts for each article include additional instructions to clarify the scope of extraction, including a request that the textual evidence in column 4 be extracted and stated verbatim rather than paraphrased, so that any potential 'hallucinations' generated can be verified transparently. Hallucination refers to an LLM presenting wrong information as if it were correct and true, a failure mode to which LLMs are occasionally prone11,12.

The 2nd prompt was used each time a new qualitative article was uploaded for assessment. The 2-prompt sequence was first tested on Claude 3 Opus, a precursor to Claude 3.5 Sonnet. Claude 3 Opus was used for evaluation as it was the most advanced Claude model available during initial test-runs. Claude 3.5 Sonnet superseded Claude 3 Opus shortly after the initial testing of prompts was completed.

Prompt adjustments

After applying the 2-step prompt procedure to Claude 3 Opus on a few qualitative articles and generating some output, adjustments were made to the 2nd prompt due to persistent errors among a few criteria. For criteria 2 and 3 ('credentials' and 'occupation'), a prompt qualifier was added to emphasise that credentials or occupation are not the same as the address of an institution, with which Claude 3 Opus often conflated them. For criterion 4 ('gender'), a prompt was added to infer gender from a person's name. Although such an inference is not always valid, as names can be gender-neutral, we wanted to test the inference ability of the model; Claude 3 Opus initially tended to state 'no' even when gender was mentioned.

Once prompts were finalised, the 2-step prompt sequence was applied to all 15 articles individually in Claude 3.5 Sonnet for evaluation. The first prompt was applied once, while the second prompt was applied repeatedly for each additional article uploaded. Prompts were applied with no change in wordings so that difference in output generated cannot be attributed to changing prompts arbitrarily. Probing of model results through additional prompts were also avoided to ensure successive outputs within the same dialogue thread were not affected.

The preliminary and final prompts applied are shown in Figs. 2 and 3.

Figure 2. 1st prompt sequence.

Figure 3. 2nd prompt sequence.

Human evaluation and truth-value classification

After all articles were assessed, a reviewer (AC) checked the accuracy of output generated by Claude manually and assigned each criterion into one of 4 truth-value categories:

  • 'True or false positive' (TP or FP) – Claude correctly/wrongly classifies a criterion as mentioned in an article.

  • 'True or false negative' (TN or FN) – Claude correctly/wrongly classifies a criterion as not mentioned in an article.

A description of truth-values and how each assessed result is assigned is described below in Table 1.

Table 1.

Truth-value assignment by human reviewer.

Criterion appeared or was mentioned in qualitative article | Model output | Truth-value assignment by human evaluator
Yes | Yes | True positive
Yes | No | False negative
No | Yes | False positive
No | No | True negative
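The assignment rule in Table 1 can be expressed as a small helper, shown here as an illustrative Python sketch (not part of the study's tooling):

```python
def truth_value(criterion_present: bool, model_says_yes: bool) -> str:
    """Map a ground-truth label and a model 'Yes/No' output to one of the
    four confusion-matrix categories used for evaluation."""
    if criterion_present and model_says_yes:
        return "TP"   # model correctly flags a reported criterion
    if criterion_present and not model_says_yes:
        return "FN"   # model misses a reported criterion
    if not criterion_present and model_says_yes:
        return "FP"   # model claims a criterion that is absent
    return "TN"       # model correctly reports absence
```

Note that a false negative arises when a criterion is present but the model says 'No', and a false positive when the model says 'Yes' for an absent criterion.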

Inter-rater reliability and robustness checks

A second, independent reviewer (WT) evaluated 5 randomly selected articles to check for concordance/discordance in results and inter-rater reliability. Statistical tests were performed to ensure the robustness of results. Pre-consensus raw agreement between reviewers was high at 0.944 (Wilson's CI 0.90–0.97)13. To confirm that results were not due to chance or to each reviewer's positive or negative inclination, Cohen's κ was tabulated, yielding a relatively high value of 0.880 (CI 0.80–0.96)14. Confidence intervals were obtained by bootstrapping articles using 2,000 resamples15. To adjust for sensitivity to class imbalance and asymmetric category use between reviewers, prevalence- and bias-adjusted κ (PABAK) as well as Gwet's AC1 were tabulated, both of which achieved equally high results of 0.888 and 0.894 respectively. PABAK rescales percent agreement assuming categories are equally likely, while Gwet's AC1 adjusts for chance by factoring in how raters actually use categories during evaluation16,17. A prevalence index (PI) of 0.256 indicates notable imbalance tilted towards positive cases, while a very low bias index (BI) score of 0.02 indicates low systematic rater bias. A non-significant McNemar's χ2 test (χ2 = 1.00, p = 0.317), used to evaluate whether there were systematic differences between the ratings of the 2 reviewers on categorical, paired data, presents no evidence that one reviewer was more inclined than the other to indicate a criterion as being reported18.
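The chance-corrected agreement statistics used here can be illustrated with a short sketch. This is a simplified Python illustration under the standard definitions of raw agreement, Cohen's κ, and PABAK for two binary raters; it is not the study's analysis code:

```python
def agreement_stats(r1, r2):
    """Raw agreement, Cohen's kappa, and PABAK for two binary raters.

    r1, r2: equal-length sequences of 0/1 ratings over the same items.
    """
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n   # observed agreement
    p1, p2 = sum(r1) / n, sum(r2) / n              # each rater's 'yes' rate
    # Chance agreement from the raters' marginal distributions
    pe = p1 * p2 + (1 - p1) * (1 - p2)
    kappa = (po - pe) / (1 - pe) if pe < 1 else 1.0
    # PABAK fixes chance agreement at 0.5 (equally likely categories)
    pabak = 2 * po - 1
    return po, kappa, pabak
```

For example, two raters agreeing on 4 of 5 items yield a raw agreement of 0.8 and a PABAK of 0.6, while κ depends on how often each rater said 'yes'.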

A full documentation of inter-rater evaluation results and robustness tabulation can be found in supplementary information file 2.

Confusion matrix summary

Consensus results between the 2 reviewers were summarised in a confusion matrix table and tabulated at the criterion/article level (supplementary information file 1, Table 1). Numeric totals for each truth-value category (TP, TN, FP, FN) were summed, yielding a total of 480 values (32 criteria × 15 articles). The summary at the criterion level was further demarcated into 3 domains as categorised in COREQ. Confusion matrices allow for the tabulation of accuracy, precision, recall (sensitivity) and specificity scores, which in turn allow for the calculation of F1, balanced accuracy and the Matthews correlation coefficient, as well as an associated range of other scores used for performance analysis.

Performance metrics

F1, balanced accuracy scores and the Matthews correlation coefficient

F1 scores were tabulated at the criterion and article level to understand the model’s performance in identifying positive cases. The score measures the harmonic mean between precision (the accuracy of positive predictions) and recall (the rate of true positive identification). Results range from 0 to 1, where 0 indicates no precision/recall, and 1 perfect precision and recall. The F1 score is a standardised metric commonly used to evaluate classification models applied to disease prediction and natural language processing in healthcare19,20.

Since the F1 score does not account for true negatives, we calculate balanced accuracy (BA) and the Matthews correlation coefficient (MCC) to provide a more balanced measure21,22. BA is the average of sensitivity (TP/(TP + FN)) and specificity (TN/(TN + FP)), while MCC is a discriminator metric that reflects true agreement under class imbalance, suited to a moderately rather than extremely imbalanced dataset such as ours, which consists of 35.21% (169/480) true negatives out of all cases (Table 2). Using BA and MCC provides a more meaningful basis for comparison across criteria, as both metrics balance criteria with a larger proportion of true negatives, which the F1 score would otherwise overlook.
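The three principal metrics follow directly from the cells of a 2 × 2 confusion table. The sketch below is a plain Python illustration of the standard formulas (it assumes no zero denominators; the Haldane–Anscombe correction described later handles that case):

```python
import math

def metrics(tp: int, fp: int, tn: int, fn: int):
    """F1, balanced accuracy, and MCC from one 2x2 confusion table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    ba = (recall + specificity) / 2
    # MCC uses all four cells, so it is robust to class imbalance
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, ba, mcc
```

For instance, a criterion with TP = 8, FP = 2, TN = 6, FN = 4 gives F1 ≈ 0.727, BA ≈ 0.708 and MCC ≈ 0.408.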

Table 2.

Description of class imbalance over total cases, with percentages.

Class True True (%) False False (%) Total
Positive 283 58.96% 10 2.08% 293
Negative 169 35.21% 18 3.75% 187
Total 452 94.17% 28 5.83% 480

Other performance metrics measured include the difference between F1 and BA scores, false positive rate (FPR), false negative rate (FNR), actual positives (n+) and actual negatives (n−). All metrics and their definitions are described in Table 3. It was decided that the text generated by Claude in the 32 × 4 table (column 4) would not be included for quantitative analysis, since embedding textual analysis in the context of a closed-set confusion matrix would present labelling-related challenges. An illustrative problem that may arise from labelling is described in Table A2, supplementary information file 3. Class imbalance and proportions for the present dataset are described in Table 2.

Table 3.

List of key performance metrics and definitions used for evaluation.

No. | Performance metric | Tabulation | Performance range | Definition
1 | F1 score | F1 = 2TP / (2TP + FP + FN) | 0 to 1 (1 = best) | Harmonic mean of precision and recall. Does not include TN in tabulation.
2 | Balanced accuracy (BA) | BA = (TP/(TP + FN) + TN/(TN + FP)) / 2 | 0 to 1 | Average of TP and TN rates; adjusts for class imbalance.
3 | Matthews correlation coefficient (MCC) | MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | −1 to 1 (1 = perfect, 0 = chance, −1 = inverse of 1) | Correlation of prediction and truth (actual cases) using all cells; robust to class imbalance.
4 | ∆BA−F1 | BA − F1 | N.A. | Shows difference between BA and F1 scores.
5 | False positive rate (FPR) | 1 − specificity = FP / (FP + TN) | 0 to 1 (lower is better) | Probability that a negative is mislabelled as positive by the LLM.
6 | False negative rate (FNR) | 1 − sensitivity = FN / (FN + TP) | 0 to 1 (lower is better) | Probability that a positive is missed by the LLM.
7 | Actual positives (n+) | TP + FN | N.A. | Number of actual positives in the dataset (i.e. across all articles assessed by the LLM).
8 | Actual negatives (n−) | TN + FP | N.A. | Number of actual negatives in the dataset (i.e. across all articles assessed by the LLM).

Imputation and metric stability

To enable item-level comparability and avoid undefined metrics where denominators were zero, the Haldane–Anscombe correction was applied by adding 0.5 to each cell of a 2 × 2 confusion table (TP, FP, TN, FN) prior to computing sensitivity, specificity, BA, F1 scores, and MCCs23. This approach allows undefined or unsupported values to gravitate towards more central, stable values and symmetrically reduces small-sample bias. Imputation was applied at the criterion and criterion domain levels, but not at the article level.
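The effect of the correction can be seen in a short sketch. This illustrative Python version applies the 0.5 addition before computing the metrics; it reproduces the values reported for criteria rated positive in all 15 articles with no errors:

```python
import math

def ha_metrics(tp: float, fp: float, tn: float, fn: float):
    """F1, BA and MCC after the Haldane-Anscombe correction
    (add 0.5 to each cell of the 2x2 confusion table)."""
    tp, fp, tn, fn = tp + 0.5, fp + 0.5, tn + 0.5, fn + 0.5
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    ba = (sens + spec) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, ba, mcc
```

A criterion with TP = 15 and FP = TN = FN = 0 yields F1 ≈ 0.969, BA ≈ 0.734 and MCC ≈ 0.469, matching the all-positive rows in Table 5.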

Results

To ensure a holistic understanding of Claude's performance across all articles, we examine a range of key performance metrics, paying particular attention to criterion and criterion domain performance. Three complementary metrics, MCC, balanced accuracy (BA) and the F1 score, serve as the principal measures, supplemented by Δ(BA–F1) and FPR/FNR to identify error direction, while counts of actual positives (n⁺) and negatives (n⁻) were tabulated to determine the interpretability of estimates.

Criteria were categorised using specific thresholds, then consolidated into performance clusters. Based on the overall results, 4 main clusters were identified, namely: (1) balanced, (2) under-reported, (3) mixed errors, and (4) information limited criteria. Performance thresholds for each cluster are described in Table 4 below:

Table 4.

Performance thresholds and error profile/explanation of each cluster.

Cluster no. | Cluster name | Performance indicator | Error profile/explanation | No. of criteria
1 | Balanced | F1 ≥ 0.85, BA ≥ 0.75, MCC ≥ 0.60, |Δ(BA–F1)| < 0.05, n+ ≥ 3 and n− ≥ 3 | Low FPR and FNR; no bias | 6
2 | Under-reported | FNR ≥ FPR, Δ(BA–F1) ≥ +0.02 and MCC < 0.80; does not meet cluster 1 | Criteria prone to not being reported | 2
3 | Mixed errors | Mixed metrics; does not meet cluster 1 or 2 | Heterogeneous; mild calibration required due to section/phrasing variability | 9
4 | Information limited | min(n+, n−) < 2 | Driven by prevalence of one class of cases; not possible to evaluate discrimination | 15

Criterion level analysis

Cluster 1: balanced criterion

Based on a combination of assessed metrics and performance indicators, 6 out of 32 criteria (18.8%) displayed a balanced profile. Criteria with a balanced profile include 'occupation' (C3) (BA = 0.845, MCC = 0.660, FPR/FNR = 0.227/0.083), 'experience and training' (C5) (BA = 0.929, MCC = 0.858, FPR/FNR = 0.100/0.042), 'participant knowledge of interviewer' (C7) (BA = 0.939, MCC = 0.879, FPR/FNR = 0.071/0.050), 'field notes' (C20) (BA = 0.936, MCC = 0.871, FPR/FNR = 0.083/0.045), 'data saturation' (C22) (BA = 0.889, MCC = 0.768, FPR/FNR = 0.150/0.071), and 'software' (C27) (BA = 0.939, MCC = 0.879, FPR/FNR = 0.071/0.050). Criteria in this cluster have high discriminative scores (MCC ≥ 0.66), a very low absolute difference between F1 and BA scores (|Δ(BA–F1)| < 0.05), consistently low FPR (< 0.227) and FNR (< 0.083), and a good mix of actual positive (n+) and negative (n−) cases. A balanced profile suggests close alignment between criterion definitions and the range of descriptions represented in the articles evaluated by Claude. The low FPR/FNR numbers confirm that the robustness of the BA/MCC scores is not due to model mislabelling.

Cluster 2: under-reported criterion

2 out of 32 (6.3%) criteria are grouped together as under-reported, due to markedly higher FNR than FPR and a difference between BA and F1 scores of ≥ +0.02, indicating misclassification of actual positive cases as negative. Criteria in this cluster include 'relationship established' (C6) (BA = 0.890, MCC = 0.758, FPR/FNR = 0.083/0.136) and 'interviewer characteristics' (C8) (BA = 0.841, MCC = 0.606, FPR/FNR = 0.125/0.192). Claude tends to be more conservative in evaluating these criteria, indicating variable or implicit meaning in their definitions.

Cluster 3: mixed errors criterion

9 out of 32 (28.1%) criteria were grouped into a mixed errors cluster, due to heterogeneous performance and results that do not align neatly with clusters 1 or 2. Criteria in this cluster may be further demarcated into 4 sub-clusters: (i) near balanced, (ii) under-reported inclined, (iii) over-reported inclined and (iv) ambiguous. 4 criteria can be classified as near balanced, due to high BA scores and MCC, but with 1 or 2 metrics marginally incongruent with the main indicators, such as a high FPR. Criteria in this sub-cluster include 'gender' (C4) (BA = 0.863, MCC = 0.653, FPR = 0.107), 'non-participant' (C13) (BA = 0.852, MCC = 0.739, FPR = 0.250), 'transcripts returned' (C23) (BA = 0.899, MCC = 0.798, FPR = 0.167) and 'participant checking' (C28) (BA = 0.899, MCC = 0.798, FPR = 0.167).

Criteria in the under-reported inclined sub-cluster have modest to relatively high BA scores and MCC, but also a high FNR. Criteria in this sub-cluster include 'interviewer/facilitator' (C1) (BA = 0.793, MCC = 0.653, FNR = 0.375), 'setting of data collection' (C14) (BA = 0.788, MCC = 0.575, FNR = 0.300), and 'number of data coders' (C24) (BA = 0.796, MCC = 0.640, FNR = 0.357). 1 criterion, 'credentials' (C2) (BA = 0.719, MCC = 0.479, FPR = 0.500), may be classified as over-reported inclined, mirroring the under-reported inclined sub-cluster but with a high FPR instead. 1 criterion, 'clarity of minor themes' (C32) (BA = 0.678, MCC = 0.316, FPR/FNR = 0.269/0.375), is classified as ambiguous due to unstable performance across all indicators.

Cluster 4: information limited criterion

Almost half of all criteria (15 out of 32, 46.88%) assessed by Claude had results falling mostly or entirely in one class (positive or negative), leading to performance indicator values that cannot be sufficiently interpreted given the near-complete absence of one class of values. Criteria whose results fall into positive classes only include 'methodological orientation and theory' (C9), 'sampling' (C10), 'sample size' (C12), 'description of sample' (C16), 'audio/visual recording' (C19), 'derivation of themes' (C26), 'quotations presented' (C29), 'data and findings consistent' (C30), and 'clarity of major themes' (C31). All these criteria had identical F1 scores of 0.969, reflecting a high prevalence of positive cases, with moderately high BA scores of 0.734 and modest MCCs of 0.469 after incorporating true negative results. Criteria whose results fall mostly within positive classes include 'method of approach' (C11) (BA = 0.758, MCC = 0.365), 'interview guide' (C17) (BA = 0.608, MCC = 0.297) and 'duration' (C21) (BA = 0.825, MCC = 0.549). The criterion whose results comprise negative classes only is 'repeat interviews' (C18) (BA = 0.734, MCC = 0.469), while criteria with results consisting mostly of negative classes are 'presence of non-participants' (C15) (BA = 0.825, MCC = 0.549) and 'description of the coding tree' (C25) (BA = 0.608, MCC = 0.297).

A full list of criteria grouped by clusters and sub-clusters, with the corresponding performance metrics, can be found in Fig. 4 and Table 5.

Figure 4. Criterion grouped by cluster (1 to 4), then ranked by MCC performance. F1 adjusted for prevalence = 0.5 + 0.5 × (F1 − p)/(1 − p), where p is the prevalence of positive cases and 0.5 = chance. MCC re-scaled to 0 to 1 for comparability: MCC′ = (MCC + 1)/2. Adjustments made to ensure comparability between metrics.
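The rescaling used to put the two metrics on a common chance-anchored scale can be sketched as follows. This assumes, as in the Fig. 4 caption, that p denotes the prevalence of positive cases (illustrative Python, not the study's plotting code):

```python
def rescale_for_chart(f1: float, mcc: float, prevalence: float):
    """Rescale F1 and MCC so that 0.5 corresponds to chance performance.

    F1 equal to the positive-class prevalence maps to 0.5 (chance),
    and MCC = 0 (chance) also maps to 0.5.
    """
    f1_adj = 0.5 + 0.5 * (f1 - prevalence) / (1 - prevalence)
    mcc_01 = (mcc + 1) / 2
    return f1_adj, mcc_01
```

For example, a chance-level result (F1 equal to prevalence, MCC of 0) maps to (0.5, 0.5) on the rescaled axes.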

Table 5.

Criterion clustered by performance categories with summary of key performance metrics.

Criterion no. Domain* Criterion name F1 score Balanced accuracy (BA) Matthews correlation coefficient (MCC) ∆BA-F1 False positive rate (FPR) False negative rate (FNR) Actual positives (n+) Actual negatives (n−)
Balanced criterion
 C3 1 Occupation 0.850 0.845 0.660 − 0.005 0.227 0.083 10 5
 C5 1 Experience and training 0.900 0.929 0.858 0.029 0.100 0.042 4 11
 C7 1 Participant knowledge of the interviewer 0.929 0.939 0.879 0.011 0.071 0.050 6 9
 C20 2 Field notes 0.917 0.936 0.871 0.019 0.083 0.045 5 10
 C22 2 Data saturation 0.895 0.889 0.768 − 0.005 0.150 0.071 9 6
 C27 3 Software 0.929 0.939 0.879 0.011 0.071 0.050 6 9
Under-reported criterion
 C6 1 Relationship established 0.846 0.890 0.758 0.044 0.083 0.136 5 10
 C8 1 Interviewer characteristics 0.700 0.841 0.606 0.141 0.125 0.192 3 12
Mixed-errors criterion
 Near balanced
  C4 1 Gender 0.926 0.863 0.653 − 0.063 0.107 0.167 13 2
  C13 2 Non-participant 0.818 0.852 0.739 0.034 0.250 0.045 5 10
  C23 2 Transcripts returned 0.833 0.899 0.798 0.065 0.167 0.036 2 13
  C28 3 Participant checking 0.833 0.899 0.798 0.065 0.167 0.036 2 13
 Under-reported inclined
  C1 1 Interviewer/Facilitator 0.926 0.793 0.653 − 0.133 0.038 0.375 12 3
  C14 2 Setting of data collection 0.875 0.788 0.575 − 0.088 0.125 0.300 11 4
  C24 3 Number of data coders 0.864 0.796 0.640 − 0.067 0.050 0.357 9 6
 Over-reported inclined
  C2 1 Credentials 0.643 0.719 0.479 0.076 0.500 0.063 8 7
 Ambiguous
   C32 3 Clarity of minor themes 0.792 0.678 0.316 − 0.114 0.269 0.375 12 3
Information limited criterion
 C9 2 Methodological orientation and theory 0.969 0.734 0.469 − 0.234 0.031 0.500 15 0
 C10 2 Sampling 0.969 0.734 0.469 − 0.234 0.031 0.500 15 0
 C11 2 Method of approach 0.852 0.758 0.365 − 0.094 0.233 0.250 14 1
 C12 2 Sample size 0.969 0.734 0.469 − 0.234 0.031 0.500 15 0
 C15 2 Presence of non-participants 0.600 0.825 0.549 0.225 0.250 0.100 1 14
 C16 2 Description of sample 0.969 0.734 0.469 − 0.234 0.031 0.500 15 0
 C17 2 Interview guide 0.935 0.608 0.297 − 0.327 0.033 0.750 14 1
 C18 2 Repeat interviews 0.500 0.734 0.469 0.234 0.500 0.031 0 15
 C19 2 Audio/visual recording 0.969 0.734 0.469 − 0.234 0.031 0.500 15 0
 C21 2 Duration 0.931 0.825 0.549 − 0.106 0.100 0.250 14 1
 C25 3 Description of the coding tree 0.333 0.608 0.297 0.275 0.750 0.033 1 14
 C26 3 Derivation of themes 0.969 0.734 0.469 − 0.234 0.031 0.500 15 0
 C29 3 Quotations presented 0.969 0.734 0.469 − 0.234 0.031 0.500 15 0
 C30 3 Data and findings consistent 0.969 0.734 0.469 − 0.234 0.031 0.500 15 0
 C31 3 Clarity of major themes 0.969 0.734 0.469 − 0.234 0.031 0.500 15 0

*Domain 1 - Research team and reflexivity; Domain 2 - Study design; Domain 3 - Analysis and findings.

Criterion domain level analysis

Results at the criterion domain level show the aggregate performance of criterion when grouped collectively with other related criterion, following domain categories as defined in COREQ. Limited information criteria were excluded from analysis to allow for fair intra- and inter-domain comparisons. Median confidence intervals for each domain were obtained by bootstrapping the median of all evaluable criterion using 2,000 resamples (supplementary information file 1).
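The domain-level interval estimates rest on a percentile bootstrap of the median. A minimal Python sketch of that procedure, assuming a simple percentile method over 2,000 resamples (the study's exact implementation may differ):

```python
import random
import statistics

def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median.

    Resamples the input with replacement n_boot times and takes the
    alpha/2 and 1 - alpha/2 percentiles of the resampled medians.
    """
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot))
    lo = medians[int(n_boot * alpha / 2)]
    hi = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Because resampled medians of a small set can only take values drawn from that set, the resulting interval is bounded by the smallest and largest observed criterion-level MCCs in a domain.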

Domain 2, 'study design', achieved the highest overall median MCC (0.768, CI 0.575–0.871), followed by domain 3, 'analysis and findings' (0.719, CI 0.316–0.879), and domain 1, 'research team and reflexivity' (0.656, CI 0.606–0.808). Domain 1 had the highest proportion of evaluable criteria, with all criteria (8/8, 100.0%) of evaluable quality, followed by domains 3 and 2, which had moderate to low proportions of evaluable criteria per domain at 44.4% (4/9) and 33.3% (5/15) respectively. Although most criteria in these domains were not evaluable due to extreme class imbalance (all positive or all negative), the remaining criteria within each domain were still evaluated proficiently. These include criteria such as 'field notes' (C20) and 'data saturation' (C22) in domain 2, and 'software' (C27) in domain 3, which are the most straightforward, quantifiable criteria within these 2 domains.

A lower proportion of evaluable criteria within domains 2 and 3 suggests that a larger sample of articles may be needed to gauge a model’s true performance within these domains: to determine whether a model’s assessment of extremely imbalanced criteria reflects its true discriminative ability, and whether imbalance towards one class can be attributed to performance rather than the tendency of a specific domain to group together criteria with diverse characteristics.

Conversely, a high proportion of evaluable criteria within domain 1 suggests that most criteria within the ‘research team and reflexivity’ domain are consistently well described, with clear evaluable qualities for generative AI assessment, even though its median MCC (0.656, CI 0.606–0.808) is relatively modest compared to the other two domains (0.768, CI 0.575–0.871 and 0.719, CI 0.316–0.879 for domains 2 and 3 respectively). The confidence interval of domain 1 (CI width: 0.202) shows a narrower range than those of domains 2 (CI width: 0.293) and 3 (CI width: 0.563), indicating greater precision of estimates, in contrast to the greater variability in performance and output results for domains 2 and 3. Domain 3 has the widest confidence interval, extending to MCC < 0.500, suggesting that estimates may fall within a lower range, likely because a larger proportion of its criteria have more open-ended meanings that are susceptible to interpretation.

Median MCC results by domain, and numeric criterion totals by cluster within each domain can be found in Tables 6 and 7 respectively. The list of all criteria and corresponding domains can be found in Table 5.

Table 6.

Median MCC at the criterion domain level (excluding information limited criteria).

Domain Median MCC (CI)* % criteria MCC ≥ 0.60
Study design (Domain 2, C9-C23) 0.768 (0.575–0.871) 4/5 (80.0%)
Analysis and findings (Domain 3, C24-C32) 0.719 (0.316–0.879) 3/4 (75.0%)
Research team and reflexivity (Domain 1, C1-C8) 0.656 (0.606–0.808) 7/8 (87.5%)

*CIs are percentile bootstrap over per-criterion MCCs (B = 2,000).

Table 7.

Number of criteria by performance category within each domain.

Domain Balanced Under-reported Mixed errors Information limited Proportion of criteria evaluable Total no. of criteria
Research team and reflexivity (Domain 1, C1-C8) 3/8 (37.5%) 2/8 (25.0%) 3/8 (37.5%) 0/8 (0%) 8/8 (100.0%) 8
Analysis and findings (Domain 3, C24-C32) 1/9 (11.1%) 0/9 (0%) 3/9 (33.3%) 5/9 (55.6%) 4/9 (44.4%) 9

Study design (Domain 2, C9-C23) 2/15 (13.3%) 0/15 (0%) 3/15 (20.0%) 10/15 (66.7%) 5/15 (33.3%) 15

Article-level analysis

Analysis at the article level, excluding information limited criteria, displayed generally high scores across all key performance metrics, indicating strong model discrimination, with an overall mean F1 score of 0.904, BA of 0.911, and MCC of 0.827. Several articles achieved perfect scores (1.000) across all three metrics, while one article fared poorly because of a lack of positive cases. The typical article performed well, with an overall median F1 score of 0.875, BA of 0.929, and MCC of 0.789. Since article-level analysis subsumes multiple types of criteria, each with unique performance thresholds, an overall score alone cannot fully appraise a model’s performance. Article-level analysis is reported in supplementary information file 1, Table 4.
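The three article-level metrics can all be derived from a binary confusion matrix; a minimal sketch follows, using hypothetical counts rather than values from the study:

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """F1, balanced accuracy (BA) and MCC from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity / TPR
    specificity = tn / (tn + fp) if (tn + fp) else 0.0     # TNR
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    ba = (recall + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, ba, mcc

# Hypothetical counts for one article across its evaluable criteria
f1, ba, mcc = classification_metrics(tp=12, fp=1, fn=1, tn=6)
```

Note that when one class is entirely absent (as for the poorly faring article with no positive cases), the MCC denominator collapses to zero, which is why such articles cannot be scored meaningfully.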

Error analysis

Error analysis was tabulated quantitatively at the criterion and criterion domain level, excluding information limited criteria. The balanced error rate (BER), the average of the false positive and false negative rates, was tabulated to reflect the overall rate of misclassification; a higher rate indicates a model more prone to false positive or false negative errors. Results show domain 3, ‘analysis and findings’, to have the highest aggregate BER at 0.151 (FPR/FNR, 0.111/0.190), followed by domain 1, ‘research team and reflexivity’, at 0.091 (0.068/0.115) and domain 2, ‘study design’, at 0.071 (0.030/0.111). Although most individual criteria had a low to moderate BER of < 0.25, three criteria had elevated error rates: description of the coding tree (C25), clarity of minor themes (C32), and credentials (C2), with BERs of 0.392 (0.750/0.033), 0.322 (0.269/0.375) and 0.281 (0.500/0.063) respectively. While C25 and C2 were more prone to FP errors, C32 was inclined towards FN errors.

BER for each criterion is illustrated graphically in Fig. 5, with full results in supplementary information file 1, Table 3.

Figure 5. Balanced error rate, grouped by domain. BER = 0.5 × (FPR + FNR), where BER > 0.5 indicates error rates worse than chance. Information limited criteria are excluded.
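The BER definition in the caption above can be written out directly; the counts used here are hypothetical, chosen only to illustrate an FP-prone criterion:

```python
def balanced_error_rate(fp, fn, tp, tn):
    """BER = 0.5 * (FPR + FNR); values above 0.5 are worse than chance."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # false negative rate
    return 0.5 * (fpr + fnr), fpr, fnr

# Hypothetical counts for an FP-prone criterion across 15 articles
ber, fpr, fnr = balanced_error_rate(fp=3, fn=0, tp=11, tn=1)
```

Averaging the two error rates, rather than pooling raw errors, prevents the majority class from dominating the score when criteria are imbalanced.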

Discussion

We undertake a rigorous, data-driven approach to evaluate the performance of Claude in assessing qualitative articles against a consensus-based, objective set of criteria, with clinical implications for evidence-based medicine. LLMs such as Claude can be used as evaluation assistants where consensus-based criteria have been clearly demarcated and adequately defined in standards or checklists, to accelerate research in health communication, medication adherence, patient literacy and other health domains24–26.

A range of quantitative metrics was used to identify areas where Claude performs well, without extensive pre-prompting, in assessing adherence to a comprehensive criteria list. Results reveal four key performance clusters with varied outcomes: (a) balanced, (b) under-reported, (c) mixed errors, and (d) information limited. Criteria that fall within the balanced cluster, such as ‘occupation’ (C3) and ‘data saturation’ (C22), are typically clearly defined, distinct and well reported, encapsulating performance consistency that can be extrapolated across a diverse set of articles. Criteria categorised in the near balanced inclined sub-cluster, such as ‘gender’ (C4) and ‘participant checking’ (C28), require further prompt enclosure for a model to achieve balanced assessment levels. Enclosed prompt adjustments include specifying or suggesting where each criterion may be located within an article or how it is conventionally described: for example, stating within a prompting sequence that the “participant checking (C28) criterion is usually reported in the methods section…”, or providing information ‘flags’ commonly related to a criterion that allow a model to detect it more sensitively. Information flags for C28 may include the following example phrase within a prompt: “Participants in a study are usually provided a summary of results and are asked for their feedback. Feedback refers to thoughts and opinions about the study that they have participated in.”
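A prompt enclosure of this kind could be assembled as below. This is an illustrative sketch only: the wording, structure and function names are assumptions, not the study's actual prompts.

```python
# Sketch of an "information flag" prompt enclosure for COREQ C28
# (participant checking). All wording is illustrative.
CRITERION = "C28 (participant checking): Did participants provide feedback on the findings?"

INFORMATION_FLAGS = [
    "This criterion is usually reported in the methods section.",
    "Participants in a study are usually provided a summary of results "
    "and are asked for their feedback.",
    "Feedback refers to thoughts and opinions about the study "
    "that they have participated in.",
]

def build_prompt(article_text: str) -> str:
    """Enclose the criterion with location and phrasing hints before the article."""
    flags = "\n".join(f"- {flag}" for flag in INFORMATION_FLAGS)
    return (
        f"Criterion: {CRITERION}\n"
        f"Hints on where and how this criterion is typically reported:\n"
        f"{flags}\n\n"
        f"Article:\n{article_text}\n\n"
        "Answer strictly 'met' or 'not met'."
    )

prompt = build_prompt("...full article text...")
```

Constraining the output to a fixed vocabulary ('met'/'not met') also simplifies downstream extraction into the confusion-matrix counts used throughout the analysis.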

For criteria that fall within the under-reported cluster or the mostly under-reported sub-cluster within the mixed errors group (where FNR > FPR), results suggest that multi-shot prompts that clarify criterion definitions iteratively may be required to further parse the definitions provided in guidelines. Criteria with broadly defined definitions include ‘relationship established’ (C6) and ‘interviewer characteristics’ (C8) in the under-reported cluster, and ‘setting of data collection’ (C14) and ‘number of data coders’ (C24) in the under-reported inclined sub-cluster. Paraphrasing or elaborating on primary definitions allows the meaning of fundamental words to be explicated in detail so that polysemantic meanings are narrowed. This means interrogating the meaning of concepts such as ‘relationship’, ‘characteristics’, ‘setting’, and ‘data coders’ that are often syntactically determined and contextually dependent. In contrast, criteria that fall within the over-reported inclined or ambiguous sub-clusters, such as ‘credentials’ (C2) or ‘clarity of minor themes’ (C32), require prompt strategies that aim for exactness, through multiple stated examples or hypothetical scenarios that concretely explicate the preferred output that an LLM should generate for a given criterion.

Interestingly, even though effort was taken at the outset to define ‘credentials’ (C2) in this study (see Figure 3), where an illustration was provided of what an output should look like, Claude’s output was lacklustre for this criterion, resulting in a low MCC score and a high BER inclined towards FP errors (FPR/FNR = 0.500/0.063). Such intractable criteria may require a combination of prompt approaches to produce optimal outcomes. Approaches include pre-requesting examples from an AI model to describe or expand upon a given conceptual term prior to prompt iteration, allowing for chain-of-thought reasoning, using multi-shot prompts, and avoiding common errors such as providing irrelevant, redundant or conflicting instructions27–30.

The substantial proportion of criteria categorised within the information limited cluster highlights that a larger number of articles is needed to confirm whether discriminative ability holds true for each criterion in this category. This means collating sufficient articles that trend in a balanced way towards both positive and negative cases for fair evaluation. In practice, gathering a balanced dataset may not be feasible for health and clinical research groups driven mainly by clinical hypotheses, since the evaluation of articles usually comes after a pool of articles has already been extracted from databases. Future research should consider a large-scale study dedicated solely to examining the performance of generative AI models in assessing articles, to explore both narrower and wider health domains and to compare performance between domains.
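One way such a study could screen its pool before evaluation is a simple evaluability check on each criterion's ground-truth labels; the threshold and function name below are illustrative assumptions, not part of the study's protocol:

```python
def is_evaluable(labels, min_minority=2):
    """Treat a criterion as evaluable only if both classes are represented.

    labels: ground-truth values for one criterion across the article pool
            (1 = criterion met, 0 = not met).
    min_minority: minimum count required for the rarer class (assumed value).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    return min(positives, negatives) >= min_minority

# Both classes present with at least two cases each -> evaluable
balanced = is_evaluable([1, 0, 1, 1, 0])
# All positives -> extreme class imbalance, not evaluable
imbalanced = is_evaluable([1, 1, 1, 1, 1])
```

This mirrors the exclusion rule applied in the domain-level analysis, where criteria with all-positive or all-negative truth values were set aside.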

Implications for clinical research

Results suggest that a customised or tailored strategy may be needed for researchers who plan to use AI models to assess whether research articles adhere to consensus-based standardised guidelines. Careful preparation should be given to the development of prompts and the use of performance metrics to ensure coherence between guidelines and output results. A balanced profile, indicated by consistent performance over a range of metrics, provides confirmation that a criterion can be evaluated by a model reliably. For criteria that indicate under-reported, over-reported or ambiguous outcomes, additional clarificatory or narrower, demarcating prompts may be required to adjust for optimal outputs.

Categorisation into performance clusters can help clinical teams stratify levels of performance and determine where an AI model or prompt sequence needs fine-tuning. Researchers should be cognizant that current consensus-based or standardised reporting guidelines may not be formatted ideally for models to evaluate (e.g. STROBE for cross-sectional, observational studies; PRISMA for systematic reviews)31,32, and should thus tailor prompt approaches to typologies of criteria even as new AI-relevant guidelines such as PRISMA-AI are being developed33. The new AI guidelines plan to look specifically at the reporting of systematic reviews on AI topics such as machine learning, deep learning and neural networks, but it is unclear whether this will include the use of AI to check adherence to standardised checklists or guidelines.

It is likely that new reporting checklists will be required in the future for generative AI assisted scoping or systematic reviews, similar to the CONSORT-AI guidelines used for reporting AI systems as interventions34 or the TRIPOD-AI guidelines used for diagnostic or prognostic prediction models35. These checklists will need to be developed and structured optimally based on real-world feedback from healthcare professionals and users, and be sensitive to how systematic, scoping or literature reviews are usually conducted. Checklists should ideally provide additional detailed guidance on text descriptions that may appear in varied forms across articles.

To the best of our knowledge, this is the first study to use a quantitative, data-driven approach to assess qualitative research articles’ adherence to a consensus-based, objective guideline using a generative AI model.

Limitations

One limitation of this study is the relatively small number and range of research articles assessed, which limits the scope of results. Evaluating a larger pool of articles can provide a more robust understanding of how a generative AI model performs when presented with a more extensive or complex dataset, generate more precise estimates, and test the discriminant abilities of a model. A larger-scale study comparing different models, using multiple criteria or standardised checklists as well as multi-modal benchmark tools simultaneously, could provide a deeper understanding of how generative AI outputs align with human inputs in different contexts, and gauge which specific models are more reliable for research evaluation.

Although prompts in this study were developed iteratively at the initial stage using an approximation approach, a full-fledged, systematic ‘prompt engineering’ strategy would be ideal to test prompts extensively before applying them to different articles. It is known that the quality of prompts can substantially affect the quality of outputs a model generates36–38. One specific technique, substituting similar words and phrases iteratively based on an initial parsing of results until satisfactory wordings are attained39, would have enhanced the generation of model outputs and allowed for a more multi-faceted analysis.

An additional limitation is the use of binary quantitative results (TP, TN, FP, FN) extracted from a classification (confusion) matrix as the main driver of model assessment. Text generated by Claude was not analysed in this study, which would have allowed for a more comprehensive performance analysis or a comparison between textual and quantitative results.

Conclusion

Although LLMs can support and accelerate the evaluation of qualitative research findings based on consensus-based guidelines, the quality of output depends significantly on prompts that are well calibrated and measured using a range of performance metrics. Near balanced criteria require prompt enclosure adjustments or information flags to achieve a more balanced performance, while under-reported criteria require paraphrasing or interrogation of key concepts so that polysemantic meanings are narrowed. Over-reported criteria require the use of concretely stated examples or hypothetical scenarios to achieve balance.

Conversely, criteria with limited information require a larger sample of articles to determine that performance is not primarily due to the propensity of truth-values to fall within one class (T/F). Segmenting criteria into performance clusters allows researchers to identify areas of incongruence, so that specific strategies to modify prompts can be applied for any given set of research articles. Customised approaches that are expertly crafted can allow for the rapid extraction of valuable insights from articles to inform patient-centred recommendations and practice guidelines.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (117.7KB, xlsx)
Supplementary Material 2 (25.7KB, xlsx)
Supplementary Material 3 (29.4KB, docx)

Author contributions

Study design and conceptualisation: AC, WT; Prompt development and adjustments: AC; Model run and testing: AC; Data extraction from model output: AC; Intercoder validation: AC, WT; Statistical/quantitative analysis: AC; Drafting and preparation of manuscript: AC; Review of manuscript: AC, WT, RD. All authors approved the final version of this manuscript.

For the scoping review (source of extracted articles for evaluation): database search and extraction of article list: YM; screening and review of articles: AC, WT.

Funding

This research study is kindly supported and generously funded by the Ng Teng Fong Foundation (Grant Reference: NTF_SRP_P1) and NHG Health as part of the Personalised Cardiometabolic Risk Management program (Predict to Prevent, ‘P2P study’). The P2P study is a population health program that aims to monitor, predict, and delay the risk of macrovascular complications through early risk identification and stratification amongst patients and population groups at high risk of developing cardiovascular disease. The funding agencies were not involved in the design, planning, screening, analysis or interpretation of the findings of this study, as well as the preparation of this manuscript in any way.

Data availability

All articles evaluated in this study were obtained from publicly available journal databases; article references may be found in supplementary file 3. The COREQ checklist and definition of each criterion is provided in supplementary information file 3. Human-coded classifications of output data generated by Claude 3.5 Sonnet are available in supplementary information files 1 and 2. No personal patient or identifiable data was used in this study.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Tong, A., Sainsbury, P. & Craig, J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. Int. J. Qual. Health Care. 19 (6), 349–357. 10.1093/intqhc/mzm042 (2007). [DOI] [PubMed] [Google Scholar]
  • 2.O’Brien, B. C., Harris, I. B., Beckman, T. J., Reed, D. A. & Cook, D. A. Standards for reporting qualitative research: a synthesis of recommendations. Acad. Med.89 (9), 1245–1251. 10.1097/ACM.0000000000000388 (2014). [DOI] [PubMed] [Google Scholar]
  • 3.Mishra, T. et al. Use of large language models as artificial intelligence tools in academic research and publishing among global clinical researchers. Sci. Rep.14 (1), 31672. 10.1038/s41598-024-81370-6 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Bijker, R., Merkouris, S. S., Dowling, N. A. & Rodda, S. N. ChatGPT for automated qualitative research: content analysis. J. Med. Internet Res.26, e59050. 10.2196/59050 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Prescott, M. R. et al. Comparing the efficacy and efficiency of human and generative AI: qualitative thematic analyses. JMIR AI. 3, e54482. 10.2196/54482 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Gartlehner, G. et al. Data extraction for evidence synthesis using a large language model: A proof-of-concept study. Res. Synthesis Methods. 15 (4), 576–589. 10.1002/jrsm.1710 (2024). [DOI] [PubMed] [Google Scholar]
  • 7.Ovelman, C., Kugley, S., Gartlehner, G. & Viswanathan, M. The use of a large language model to create plain language summaries of evidence reviews in healthcare: A feasibility study. Cochrane Evid. Synthesis Methods. 2 (2), e12041. 10.1002/cesm.12041 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Spillias, S. et al. Human-AI collaboration to identify literature for evidence synthesis. Cell. Rep. Sustain.1 (7). 10.1016/j.crsus.2024.100132 (2024).
  • 9.Anthropic. Claude AI (Sonnet 3.5, June 2024 release) [Large language model]. Anthropic. (2024). https://www.anthropic.com.
  • 10.Chia, A. W. Y., Teo, W. L. L., Acharyya, S., Munro, Y. L. & Dalan, R. Patient-physician communication of health and risk information in the management of cardiovascular diseases and diabetes: a systematic scoping review. BMC Med.23 (1), 96. 10.1186/s12916-025-03873-x (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Agarwal, V. et al. MedHalu: hallucinations in responses to healthcare queries by large language models. arXiv preprint arXiv:2409.19492. 10.48550/arXiv.2409.19492 (2024). [Google Scholar]
  • 12.Zhang, Y. et al. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219. 10.48550/arXiv.2309.01219 (2023). [Google Scholar]
  • 13.Wilson, E. B. Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc.22 (158), 209–212. 10.1080/01621459.1927.10502953 (1927). [Google Scholar]
  • 14.Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas.20 (1), 37–46. 10.1177/001316446002000104 (1960). [Google Scholar]
  • 15.Efron, B. Better bootstrap confidence intervals. J. Am. Stat. Assoc.82 (397), 171–185. 10.2307/2289144 (1987). [Google Scholar]
  • 16.Byrt, T., Bishop, J. & Carlin, J. B. Bias, prevalence and kappa. J. Clin. Epidemiol.46 (5), 423–429. 10.1016/0895-4356(93)90018-v (1993). [DOI] [PubMed] [Google Scholar]
  • 17.Gwet, K. L. Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Math. Stat. Psychol.61 (1), 29–48. 10.1348/000711006X126600 (2008). [DOI] [PubMed] [Google Scholar]
  • 18.McNemar, Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika12 (2), 153–157. 10.1007/BF02295996 (1947). [DOI] [PubMed] [Google Scholar]
  • 19.Grandini, M., Bagli, E. & Visani, G. Metrics for multi-class classification: an overview. arXiv preprint arXiv:2008.05756. 10.48550/arXiv.2008.05756 (2020). [Google Scholar]
  • 20.Hicks, S. A. et al. On evaluation metrics for medical applications of artificial intelligence. Sci. Rep.12 (1), 5979. 10.1038/s41598-022-09954-8 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Brodersen, K. H., Ong, C. S., Stephan, K. E. & Buhmann, J. M. The balanced accuracy and its posterior distribution. In 2010 20th international conference on pattern recognition (pp. 3121–3124). IEEE. (2010). 10.1109/ICPR.2010.764.
  • 22.Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom.21 (1), 6. 10.1186/s12864-019-6413-7 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Anscombe, F. J. On estimating binomial response relations. Biometrika43 (3/4), 461–464 (1956). [Google Scholar]
  • 24.Liu, C. et al. What is the meaning of health literacy? A systematic review and qualitative synthesis. Family Med. Commun. Health. 8 (2), e000351 (2020). 10.1136/fmch-2020-000351. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Marshall, I. J., Wolfe, C. D. & McKevitt, C. Lay perspectives on hypertension and drug adherence: systematic review of qualitative research. BMJ 345. 10.1136/bmj.e3953 (2012). [DOI] [PMC free article] [PubMed]
  • 26.Mentrup, S., Harris, E., Gomersall, T., Köpke, S. & Astin, F. Patients’ experiences of cardiovascular health education and risk communication: a qualitative synthesis. Qual. Health Res.30 (1), 88–104. 10.1177/1049732319887949 (2020). [DOI] [PubMed] [Google Scholar]
  • 27.Google Cloud. Overview of prompting strategies. Generative AI on Vertex AI — Google Cloud. (2025). https://cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/prompt-design-strategies.
  • 28.Meskó, B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J. Med. Internet. Res.25, e50638 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.OpenAI. GPT-5 prompting guide. OpenAI Cookbook. (2025). https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide.
  • 30.Microsoft. Prompt engineering techniques. Microsoft Learn. (2025, September 30). https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/prompt-engineering.
  • 31.Von Elm, E. et al. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet370 (9596), 1453–1457. 10.1136/bmj.39335.541782.AD (2007). [DOI] [PubMed] [Google Scholar]
  • 32.Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., & Moher, D. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372. 10.1136/bmj.n71 (2021). [DOI] [PMC free article] [PubMed]
  • 33.Cacciamani, G. E., Chu, T. N., Sanford, D. I., Abreu, A., Duddalwar, V., Oberai, A., & Hung, A. J. PRISMA AI reporting guidelines for systematic reviews and meta-analyses on AI in healthcare. Nat. Med.29(1), 14–15. 10.1038/s41591-022-02139-w (2023). [DOI] [PubMed]
  • 34.Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit. Health. 2 (10), e537–e548. 10.1038/s41591-020-1034-x (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Collins, G. S. et al. TRIPOD + AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ385, 1. 10.1136/bmj-2023-078378 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Kim, J. et al. Which is better? Exploring prompting strategy for LLM-based metrics. arXiv preprint arXiv:2311.03754. 10.48550/arXiv.2311.03754 (2023). [Google Scholar]
  • 37.Yugeswardeenoo, D., Zhu, K. & O’Brien, S. Question-analysis prompting improves LLM performance in reasoning tasks. arXiv preprint arXiv:2407.03624. 10.48550/arXiv.2407.03624 (2024). [Google Scholar]
  • 38.Sun, S., Zhuang, S., Wang, S. & Zuccon, G. An investigation of prompt variations for zero-shot LLM-based rankers. arXiv preprint arXiv:2406.14117. 10.1007/978-3-031-88711-6_12 (2024). [Google Scholar]
  • 39.Wang, B., Deng, X. & Sun, H. Iteratively prompt pre-trained language models for chain of thought. arXiv preprint arXiv:2203.08383. 10.48550/arXiv.2203.08383 (2022). [Google Scholar]
