Rheumatol Adv Pract. 2024 Sep 19;8(4):rkae120. doi: 10.1093/rap/rkae120

Table 1. Summary of the included studies

| Author, year [ref] | Data type and sample size | Model | Summary of most important results |
|---|---|---|---|
| Chen et al., 2023 [48] | EMR, data from >2 million patients | NLP (NER, POS) | Synonym-based pain-level detection tool accurately identified patients with moderate–severe pain due to OA |
| Saini et al., 2023 [29] | X-ray image reports, structured EHR data, 4508 patients | CNN, YOLO v4, Transformer, BERT | High performance in predicting knee OA severity and generating reports, with AUROCs from 0.897 to 0.9582 |
| Benavent et al., 2024 [5] | Unstructured EHR data, 4337 patients | NLP-based system | High precision, recall and F1 scores (>0.80) for detecting clinical entities related to SpA |
| Li et al., 2022 [46] | EMRs, 1600 clinical notes | BERT | Improved NER in clinical notes, with an F1 score of 0.936 |
| Krusche et al., 2024 [31] | Patient vignettes, 20 real-world vignettes | GPT-4 | Diagnostic accuracy comparable to rheumatologists for IRDs, with a top-diagnosis accuracy of 35% |
| Madrid-García et al., 2023 [39] | Exam questions from the Spanish access exam, 145 questions | GPT-4 | Achieved 93.71% accuracy in answering rheumatology questions |
| Irfan and Yaqoob, 2023 [23] | Database of peer-reviewed articles and clinical guidelines | GPT-4 | Provided insights into SS, highlighting key characteristics and management details |
| Nelson et al., 2015 [49] | Medical text infusion notes, 115 patients, 2029 infliximab infusions | Custom rule-based NLP software | Improved sensitivity (0.858) and PPV (0.976) for identifying infliximab infusion dates and doses |
| Liu et al., 2023 [25] | Chinese EMRs, 1986 CEMRs | MC-BERT-BiLSTM-CRF, MC-BERT + FFNN | Achieved F1 scores of 92.96% for NER and 95.29% for relation extraction |
| Humbert-Droz et al., 2023 [30] | Clinical notes from the RISE registry, 854 628 patients | NLP pipeline (spaCy) | Sensitivity, PPV and F1 scores of 95%, 87% and 91%, respectively, for extraction of RA outcome measures |
| Benavent et al., 2023 [6] | Free-text and structured clinical information, 758 patients | EHRead technology | High performance in identifying clinical variables for axSpA and PsA; precision of 0.798 and recall of 0.735 for PsA |
| VanSchaik et al., 2023 [53] | PubMed abstracts, 2350 abstracts | ELECTRA-based model | Extracted causal relationships with an F1 score of 0.91 |
| Walsh et al., 2020 [40] | Clinical notes, structured EHR data, 600 patients | NLP algorithms with random forest | AUROC of 0.96 for the full algorithm in identifying axSpA |
| Yoshida et al., 2024 [42] | EHR notes and Medicare claims data, 500 patients | LASSO | Combined model showed an AUROC of 0.731 for identifying gout flares |
| Li et al., 2023 [52] | FAQ-based question-answering pairs, 176 questions | BERT, RoBERTa, ALBERT, MacBERT | Achieved top-1 precision of 0.551 and MRR of 0.660 in an RA question-answering system |
| Ye et al., 2024 [33] | Patient-generated rheumatology questions, 17 patients | GPT-4 | Patients rated AI responses similarly to physician responses; rheumatologists rated AI responses lower in comprehensiveness |
| Coskun et al., 2024 [23] | Questions on methotrexate use, 23 questions | GPT-4, GPT-3.5, Bard | GPT-4 achieved 100% accuracy in providing information on methotrexate use |
| Liao et al., 2010 [36] | Narrative and codified EMR data, 29 432 subjects | HITEx system | Improved RA classification accuracy, with a PPV of 94% using narrative and codified data |
| Lin et al., 2015 [24] | Structured and unstructured EHR data, 5903 patients | Apache cTAKES, ML | PPV of 0.756, sensitivity of 0.919 and F1 score of 0.829 for identifying methotrexate-induced liver toxicity |
| Wang et al., 2017 [32] | Spontaneous reports, EMRs, 138 000 patients | MedEx, UMLS, MedDRA PT codes | Detected 152 signals for biologics and 147 for DMARDs from clinical notes |
| Uz and Umay, 2023 [34] | Structured EHR data and internet search data | ChatGPT | Reliability scores ranged from 4 to 7, highest for OA (5.62); usefulness scores highest for AS (5.87) |
| Luedders et al., 2023 [37] | Chest CT reports, 650 patients | Automated regular expressions | Improved PPV to 94.6% for RA-ILD identification |
| Osborne et al., 2024 [41] | Chief complaint text from the emergency department, 8037 CCs | Rule-based and BERT-based algorithms | BERT-GF achieved an F1 score of 0.57 for detecting gout flares |
| Yang et al., 2024 [26] | Responses from ChatGPT and Bard, 20 treatments | GPT, Bard | ChatGPT had an 80% concordance rate with AAOS CPGs, while Bard had 60% |
| England et al., 2024 [38] | Clinical notes from EHRs, 7485 patients | NLP | 95.8% of NLP-derived FVC values were within 5% predicted of the PFT equipment values |
| Love et al., 2011 [54] | EMR notes, billing codes, 2318 patients | NLP with random forest | PPV of 90% at a sensitivity of 87% for PsA classification using NLP and coded data |
| Deng et al., 2024 [12] | Structured EHR data, clinical notes, 472 patients | MetaMap, logistic regression | Identified the lupus nephritis phenotype with F1 scores of 0.79 at NU and 0.93 at VUMC |
| van Leeuwen et al., 2024 [50] | EHRs, 287 patients | AI tool, NLP | Sensitivity of 97.0% in training and 98.0% in validation centres for AAV identification |
| Román Ivorra et al., 2024 [47] | EHRs, 13 958 patients | EHRead, NLP, ML | Achieved precision of 79.4% for ILD detection and 76.4% for RA detection |
| Zhao et al., 2020 [43] | EHRs, 7853 patients | NLP, ICD codes, logistic regression | Sensitivity of 0.78, specificity of 0.94 and AUROC of 0.93 for identifying axSpA |
| Kerr et al., 2015 [45] | Clinical narrative data from EMRs, 2280 patients | NLP system | Compliance rates for gout QIs: QI 1, 92.1%; QI 2, 44.8%; QI 3, 7.7% |
| Redd et al., 2014 [44] | Structured and unstructured EHR data, 4272 patients | NLP, SVM | Precision of 0.814 and recall of 0.973 for identifying SSc patients at risk of SRC |
| Oliveira et al., 2024 [35] | Chief complaint notes from the emergency department, 8037 CCs | RoBERTa-large, BioGPT | Achieved F1 scores of 0.8 (2019 dataset) and 0.85 (2020 dataset) for detecting gout flares |
| Gräf et al., 2022 [28] | Survey data, clinical vignettes, 132 vignettes | ADA | ADA's diagnostic accuracy for IRDs was higher than that of physicians (70% vs 54%) |

CCs: chief complaints; NER: named entity recognition; POS: parts of speech; CNN: convolutional neural network; YOLO: You Only Look Once; IRD: inflammatory rheumatic disease; FVC: forced vital capacity; QI: quality indicator; PFT: pulmonary function test; ADE: adverse drug event; RISE: Rheumatology Informatics System for Effectiveness; SRC: scleroderma renal crisis; GPA: granulomatosis with polyangiitis; MPA: microscopic polyangiitis; EGPA: eosinophilic granulomatosis with polyangiitis; ML: machine learning; HCPCS: Healthcare Common Procedure Coding System; LASSO: least absolute shrinkage and selection operator; MAP: maximum a posteriori; RoBERTa: A Robustly Optimized BERT Pretraining Approach; BioGPT: Biomedical Generative Pre-trained Transformer; NU: Northwestern University; VUMC: Vanderbilt University Medical Center; HITEx: Health Information Text Extraction; EHRead: Electronic Health Read; ADA: AI-based symptom checker; FFNN: feedforward neural network.
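Most studies in the table report overlapping extraction metrics: precision (equivalent to PPV), recall (equivalent to sensitivity) and their harmonic mean, F1. As an illustrative aside that is not drawn from any of the included studies, the short Python sketch below shows how these quantities relate; the raw counts are hypothetical, chosen only so that the resulting PPV (~0.87) and sensitivity (~0.95) roughly match the values reported by Humbert-Droz et al. [30], for which the implied F1 (~0.91) agrees with the table.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision (PPV), recall (sensitivity) and F1 from raw counts."""
    precision = tp / (tp + fp)   # PPV: correct extractions / all items extracted
    recall = tp / (tp + fn)      # sensitivity: correct extractions / all true items
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts (not taken from any study) giving PPV ~0.87 and
# sensitivity ~0.95, i.e. the order of magnitude reported in ref. [30].
tp, fp, fn = 950, 142, 50
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"PPV={p:.2f}, sensitivity={r:.2f}, F1={f1:.2f}")  # PPV=0.87, sensitivity=0.95, F1=0.91
```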