Table 1. Summary of the included studies
Author, year [ref] | Data type and sample size | Model | Summary of most important results
---|---|---|---
Chen et al., 2023 [48] | EMR, data from >2 million patients | NLP (NER, POS) | Synonym-based pain-level detection tool accurately identified patients with moderate–severe pain due to OA |
Saini et al., 2023 [29] | X-ray image reports, structured EHR data, 4508 patients | CNN, YOLO v4, Transformer, BERT | High performance in predicting knee OA severity and generating reports, with AUROCs from 0.897 to 0.9582
Benavent et al., 2024 [5] | Unstructured EHR data, 4337 patients | NLP-based system | High precision, recall and F1 scores (>0.80) for detecting clinical entities related to SpA |
Li et al., 2022 [46] | EMRs, 1600 clinical notes | BERT | Improved NER in clinical notes with an F1 score of 0.936 (a token-classification sketch follows the table)
Krusche et al., 2024 [31] | Patient vignettes, 20 real-world patient vignettes | GPT-4 | Diagnostic accuracy comparable to rheumatologists for IRDs, with a top-diagnosis accuracy of 35%
Madrid-García et al., 2023 [39] | Rheumatology questions from the Spanish specialty access exam, 145 questions | GPT-4 | GPT-4 showed 93.71% accuracy in answering rheumatology questions
Irfan and Yaqoob, 2023 [23] | Database of peer-reviewed articles and clinical guidelines | GPT-4 | Provided insights into SS, highlighting key characteristics and management details
Nelson et al., 2015 [49] | Free-text infusion notes, 115 patients, 2029 infliximab infusions | Custom rule-based NLP software | Improved sensitivity (0.858) and PPV (0.976) for identifying infliximab infusion dates and doses
Liu et al., 2023 [25] | Chinese EMRs (CEMRs), 1986 records | MC-BERT-BiLSTM-CRF, MC-BERT + FFNN | Achieved F1 scores of 92.96% for NER and 95.29% for relation extraction
Humbert-Droz et al., 2023 [30] | Clinical notes from the RISE registry, 854 628 patients | NLP pipeline (Spacy) | Sensitivity, PPV and F1 scores of 95%, 87% and 91%, respectively, for RA outcome measures extraction |
Benavent et al., 2023 [6] | Free-text and structured clinical information, 758 patients | EHRead technology | High performance in identifying clinical variables for axSpA and PsA, precision of 0.798 and recall of 0.735 for PsA |
VanSchaik et al., 2023 [53] | PubMed abstracts, 2350 abstracts | ELECTRA-based model | Extracted causal relationships with an F1 score of 0.91 |
Walsh et al., 2020 [40] | Clinical notes, structured EHR data, 600 patients | NLP algorithms with random forest | AUROC of 0.96 for full algorithm in identifying axSpA |
Yoshida et al., 2024 [42] | EHR notes and Medicare claims data, 500 patients | LASSO | Combined model showed an AUROC of 0.731 for identifying gout flares (an L1-penalized regression sketch follows the table)
Li et al., 2023 [52] | FAQ-based question-answering pairs, 176 questions | BERT, RoBERTa, ALBERT, MacBERT | Achieved top-1 precision of 0.551 and MRR of 0.660 in an RA question-answering system |
Ye et al., 2024 [33] | Patient-generated rheumatology questions, 17 patients | GPT-4 | Patients rated AI responses similarly to physician responses; rheumatologists rated AI lower in comprehensiveness |
Coskun et al., 2024 [23] | Questions on methotrexate use, 23 questions | GPT-4, GPT-3.5, Bard | GPT-4 achieved 100% accuracy in providing information on methotrexate use
Liao et al., 2010 [36] | Narrative and codified EMR data, 29 432 subjects | HITEx system | Improved RA classification accuracy with a PPV of 94% using narrative and codified data |
Lin et al., 2015 [24] | Structured and unstructured EHR data, 5903 patients | Apache cTAKES, ML | PPV of 0.756, sensitivity of 0.919 and F1 score of 0.829 for identifying methotrexate-induced liver toxicity |
Wang et al., 2017 [32] | Spontaneous reports, EMRs, 138 000 patients | MedEx, UMLS, MedDRA PT codes | Detected 152 signals for biologics and 147 for DMARDs from clinical notes |
Uz and Umay, 2023 [34] | Structured EHR data and internet search data | ChatGPT | Reliability scores ranged from 4 to 7, with the highest for OA (5.62); usefulness scores highest for AS (5.87) |
Luedders et al., 2023 [37] | Chest CT reports, 650 patients | Automated regular expressions | Improved PPV to 94.6% for RA-ILD identification (a regex sketch follows the table)
Osborne et al., 2024 [41] | Chief complaint text from the emergency department, 8037 CCs | Rule-based and BERT-based algorithms | BERT-GF achieved an F1 score of 0.57 for detecting gout flares
Yang et al., 2024 [26] | Responses from ChatGPT and Bard, 20 treatments | ChatGPT, Bard | ChatGPT had an 80% concordance rate with AAOS CPGs, while Bard had 60%
England et al., 2024 [38] | Clinical notes from EHRs, 7485 patients | NLP | 95.8% of NLP-derived FVC values were within 5% predicted of the values from PFT equipment
Love et al., 2011 [54] | EMR notes, billing codes, 2318 patients | NLP with random forest | PPV of 90% at sensitivity of 87% for PsA classification using NLP and coded data |
Deng et al., 2024 [12] | Structured EHR data, clinical notes, 472 patients | MetaMap, logistic regression | Identified lupus nephritis phenotype with an F1 score of 0.79 at NU and 0.93 at VUMC |
van Leeuwen et al., 2024 [50] | EHRs, 287 patients | AI tool, NLP | Sensitivity of 97.0% in training and 98.0% in validation centres for AAV identification |
Román Ivorra et al., 2024 [47] | EHRs, 13 958 patients | EHRead, NLP, ML | Achieved precision of 79.4% for ILD detection and 76.4% for RA detection |
Zhao et al., 2020 [43] | EHRs, 7853 patients | NLP, ICD codes, logistic regression | Sensitivity of 0.78, specificity of 0.94 and AUROC of 0.93 for identifying axSpA |
Kerr et al., 2015 [45] | Clinical narrative data from EMRs, 2280 patients | NLP system | Compliance rates for gout QIs: QI 1, 92.1%; QI 2, 44.8%; QI 3, 7.7% |
Redd et al., 2014 [44] | Structured and unstructured EHR data, 4272 patients | NLP, SVM | Precision of 0.814 and recall of 0.973 for identifying SSc patients at risk for SRC |
Oliveira et al., 2024 [35] | Chief complaint notes from the emergency department, 8037 CCs | RoBERTa-large, BioGPT | Achieved F1 scores of 0.80 (2019 dataset) and 0.85 (2020 dataset) for detecting gout flares
Gräf et al., 2022 [28] | Survey data, clinical vignettes, 132 vignettes | ADA | ADA’s diagnostic accuracy for IRDs was higher than that of physicians (70% vs 54%)
AAOS: American Academy of Orthopaedic Surgeons; AAV: ANCA-associated vasculitis; ADA: AI-based symptom checker; AS: ankylosing spondylitis; AUROC: area under the receiver operating characteristic curve; axSpA: axial spondyloarthritis; BERT: Bidirectional Encoder Representations from Transformers; BiLSTM: bidirectional long short-term memory; BioGPT: Biomedical Generative Pre-trained Transformer; CC: chief complaint; CNN: convolutional neural network; CPG: clinical practice guideline; CRF: conditional random field; CT: computed tomography; cTAKES: clinical Text Analysis and Knowledge Extraction System; DMARD: disease-modifying antirheumatic drug; EHR: electronic health record; EHRead: Electronic Health Read; EMR: electronic medical record; FFNN: feedforward neural network; FVC: forced vital capacity; GPT: Generative Pre-trained Transformer; HITEx: Health Information Text Extraction; ICD: International Classification of Diseases; ILD: interstitial lung disease; IRD: inflammatory rheumatic disease; LASSO: least absolute shrinkage and selection operator; MedDRA: Medical Dictionary for Regulatory Activities; ML: machine learning; MRR: mean reciprocal rank; NER: named entity recognition; NLP: natural language processing; NU: Northwestern University; OA: osteoarthritis; PFT: pulmonary function test; POS: parts of speech; PPV: positive predictive value; PsA: psoriatic arthritis; PT: preferred term; QI: quality indicator; RA: rheumatoid arthritis; RISE: Rheumatology Informatics System for Effectiveness; RoBERTa: A Robustly Optimized BERT Pretraining Approach; SpA: spondyloarthritis; SRC: scleroderma renal crisis; SS: Sjögren’s syndrome; SSc: systemic sclerosis; SVM: support vector machine; UMLS: Unified Medical Language System; VUMC: Vanderbilt University Medical Center; YOLO: You Only Look Once.
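
To make the BERT-based NER rows concrete (e.g. Li et al., 2022 [46]), the following is a minimal sketch of token classification with the Hugging Face `transformers` pipeline. The checkpoint name `my-clinical-bert-ner` is a hypothetical placeholder, not the study's model, which was fine-tuned on 1600 annotated clinical notes.

```python
from transformers import pipeline

# A minimal NER sketch; "my-clinical-bert-ner" is a placeholder checkpoint,
# not the fine-tuned model reported by Li et al. [46].
ner = pipeline(
    "token-classification",
    model="my-clinical-bert-ner",   # hypothetical fine-tuned BERT checkpoint
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

note = "Patient with rheumatoid arthritis started methotrexate 15 mg weekly."
for entity in ner(note):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```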
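For the gout-flare row of Yoshida et al., 2024 [42], the sketch below shows the general LASSO approach: an L1-penalized logistic regression scored by AUROC. The features are synthetic stand-ins; the study's actual note- and claims-derived predictors are not listed in the table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features (assumed analogues of note and claims signals).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The L1 penalty performs LASSO-style selection: coefficients of
# uninformative features are shrunk exactly to zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUROC: {auc:.3f}, nonzero coefficients: {(model.coef_ != 0).sum()}")
```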
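Finally, the rule-based rows (e.g. Luedders et al., 2023 [37]; Nelson et al., 2015 [49]) rely on pattern matching over clinical free text. The sketch below shows the general shape of such a regex pipeline with naive negation handling; the patterns are illustrative assumptions, not the validated expressions from either study.

```python
import re

# Illustrative ILD patterns only -- assumptions, not the validated
# expressions from Luedders et al. [37].
ILD_PATTERNS = [
    re.compile(r"\binterstitial\s+(lung\s+disease|pneumonia|pneumonitis)\b", re.I),
    re.compile(r"\b(usual|nonspecific)\s+interstitial\s+pneumonia\b", re.I),
    re.compile(r"\bhoneycombing\b", re.I),
]

# Naive negation cue within the same sentence, before the match.
NEGATION = re.compile(r"\b(no|without|negative\s+for)\b[^.]{0,40}$", re.I)

def flag_ild(report: str) -> bool:
    """Return True if any ILD pattern matches outside an obvious negation."""
    for sentence in report.split("."):
        for pattern in ILD_PATTERNS:
            match = pattern.search(sentence)
            if match and not NEGATION.search(sentence[: match.start()]):
                return True
    return False

print(flag_ild("Findings: subpleural honeycombing consistent with UIP."))  # True
print(flag_ild("No evidence of interstitial lung disease."))               # False
```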