Table 1. Summary of the included studies
Author, year [ref] | Data type and sample size | Model | Summary of most important results
---|---|---|---
Chen et al., 2023 [48] | EMR, data from >2 million patients | NLP (NER, POS) | Synonym-based pain-level detection tool accurately identified patients with moderate–severe pain due to OA |
Saini et al., 2023 [29] | X-ray image reports, structured EHR data, 4508 patients | CNN, YOLO v4, Transformer, BERT | High performance in predicting knee OA severity and generating reports, with AUROCs from 0.897 to 0.9582
Benavent et al., 2024 [5] | Unstructured EHR data, 4337 patients | NLP-based system | High precision, recall and F1 scores (>0.80) for detecting clinical entities related to SpA |
Li et al., 2022 [46] | EMRs, 1600 clinical notes | BERT | Improved NER in clinical notes with an F1 score of 0.936 (a token-classification sketch follows the table)
Krusche et al., 2024 [31] | Patient vignettes, 20 real-world patient vignettes | GPT-4 | Diagnostic accuracy comparable to rheumatologists for IRDs, with a top-diagnosis accuracy of 35%
Madrid-García et al., 2023 [39] | Rheumatology questions from the Spanish specialty access exam, 145 questions | GPT-4 | GPT-4 showed 93.71% accuracy in answering rheumatology questions
Irfan and Yaqoob, 2023 [23] | Database of peer-reviewed articles and clinical guidelines | GPT-4 | Provided insights into SS, highlighting key characteristics and management details
Nelson et al., 2015 [49] | Free-text infusion notes, 115 patients, 2029 infliximab infusions | Custom rule-based NLP software | Improved sensitivity (0.858) and PPV (0.976) for identifying infliximab infusion dates and doses
Liu et al., 2023 [25] | Chinese EMRs (CEMRs), 1986 records | MC-BERT-BiLSTM-CRF, MC-BERT + FFNN | Achieved F1 scores of 92.96% for NER and 95.29% for relation extraction
Humbert-Droz et al., 2023 [30] | Clinical notes from the RISE registry, 854 628 patients | NLP pipeline (Spacy) | Sensitivity, PPV and F1 scores of 95%, 87% and 91%, respectively, for RA outcome measures extraction |
Benavent et al., 2023 [6] | Free-text and structured clinical information, 758 patients | EHRead technology | High performance in identifying clinical variables for axSpA and PsA, precision of 0.798 and recall of 0.735 for PsA |
VanSchaik et al., 2023 [53] | PubMed abstracts, 2350 abstracts | ELECTRA-based model | Extracted causal relationships with an F1 score of 0.91 |
Walsh et al., 2020 [40] | Clinical notes, structured EHR data, 600 patients | NLP algorithms with random forest | AUROC of 0.96 for full algorithm in identifying axSpA |
Yoshida et al., 2024 [42] | EHR notes and Medicare claims data, 500 patients | LASSO | Combined model showed an AUROC of 0.731 for identifying gout flares (an L1-penalized regression sketch follows the table)
Li et al., 2023 [52] | FAQ-based question-answering pairs, 176 questions | BERT, RoBERTa, ALBERT, MacBERT | Achieved top-1 precision of 0.551 and MRR of 0.660 in an RA question-answering system |
Ye et al., 2024 [33] | Patient-generated rheumatology questions, 17 patients | GPT-4 | Patients rated AI responses similarly to physician responses; rheumatologists rated AI lower in comprehensiveness |
Coskun et al., 2024 [23] | Questions on methotrexate use, 23 questions | GPT-4, GPT-3.5, Bard | GPT-4 achieved 100% accuracy in providing information on methotrexate use
Liao et al., 2010 [36] | Narrative and codified EMR data, 29 432 subjects | HITEx system | Improved RA classification accuracy with a PPV of 94% using narrative and codified data |
Lin et al., 2015 [24] | Structured and unstructured EHR data, 5903 patients | Apache cTAKES, ML | PPV of 0.756, sensitivity of 0.919 and F1 score of 0.829 for identifying methotrexate-induced liver toxicity |
Wang et al., 2017 [32] | Spontaneous reports, EMRs, 138 000 patients | MedEx, UMLS, MedDRA PT codes | Detected 152 signals for biologics and 147 for DMARDs from clinical notes |
Uz and Umay, 2023 [34] | Structured EHR data and internet search data | ChatGPT | Reliability scores ranged from 4 to 7, with the highest for OA (5.62); usefulness scores highest for AS (5.87) |
Luedders et al., 2023 [37] | Chest CT reports, 650 patients | Automated regular expressions | Improved PPV to 94.6% for RA-ILD identification (a regex sketch follows the table)
Osborne et al., 2024 [41] | Chief complaint text from the emergency department, 8037 CCs | Rule-based and BERT-based algorithms | BERT-GF achieved an F1 score of 0.57 for detecting gout flares
Yang et al., 2024 [26] | Responses from ChatGPT and Bard, 20 treatments | ChatGPT, Bard | ChatGPT had an 80% concordance rate with AAOS CPGs, while Bard had 60%
England et al., 2024 [38] | Clinical notes from EHRs, 7485 patients | NLP | 95.8% of NLP-derived FVC values were within 5% predicted of the values from PFT equipment
Love et al., 2011 [54] | EMR notes, billing codes, 2318 patients | NLP with random forest | PPV of 90% at sensitivity of 87% for PsA classification using NLP and coded data |
Deng et al., 2024 [12] | Structured EHR data, clinical notes, 472 patients | MetaMap, logistic regression | Identified lupus nephritis phenotype with an F1 score of 0.79 at NU and 0.93 at VUMC |
van Leeuwen et al., 2024 [50] | EHRs, 287 patients | AI tool, NLP | Sensitivity of 97.0% in training and 98.0% in validation centres for AAV identification |
Román Ivorra et al., 2024 [47] | EHRs, 13 958 patients | EHRead, NLP, ML | Achieved precision of 79.4% for ILD detection and 76.4% for RA detection |
Zhao et al., 2020 [43] | EHRs, 7853 patients | NLP, ICD codes, logistic regression | Sensitivity of 0.78, specificity of 0.94 and AUROC of 0.93 for identifying axSpA |
Kerr et al., 2015 [45] | Clinical narrative data from EMRs, 2280 patients | NLP system | Compliance rates for gout QIs: QI 1, 92.1%; QI 2, 44.8%; QI 3, 7.7% |
Redd et al., 2014 [44] | Structured and unstructured EHR data, 4272 patients | NLP, SVM | Precision of 0.814 and recall of 0.973 for identifying SSc patients at risk for SRC |
Oliveira et al., 2024 [35] | Chief complaint notes from the emergency department, 8037 CCs | RoBERTa-large, BioGPT | Achieved F1 scores of 0.80 (2019 dataset) and 0.85 (2020 dataset) for detecting gout flares
Gräf et al., 2022 [28] | Survey data, clinical vignettes, 132 vignettes | ADA | ADA’s diagnostic accuracy for IRDs was higher than that of physicians (70% vs 54%)
AAOS: American Academy of Orthopaedic Surgeons; AAV: ANCA-associated vasculitis; ADA: AI-based symptom checker; AS: ankylosing spondylitis; AUROC: area under the receiver operating characteristic curve; axSpA: axial spondyloarthritis; BERT: Bidirectional Encoder Representations from Transformers; BiLSTM: bidirectional long short-term memory; BioGPT: Biomedical Generative Pre-trained Transformer; CC: chief complaint; CNN: convolutional neural network; CPG: clinical practice guideline; CRF: conditional random field; CT: computed tomography; cTAKES: clinical Text Analysis and Knowledge Extraction System; DMARD: disease-modifying antirheumatic drug; EHR: electronic health record; EHRead: Electronic Health Read; EMR: electronic medical record; FFNN: feedforward neural network; FVC: forced vital capacity; GPT: Generative Pre-trained Transformer; HITEx: Health Information Text Extraction; ICD: International Classification of Diseases; ILD: interstitial lung disease; IRD: inflammatory rheumatic disease; LASSO: least absolute shrinkage and selection operator; MedDRA: Medical Dictionary for Regulatory Activities; ML: machine learning; MRR: mean reciprocal rank; NER: named entity recognition; NLP: natural language processing; NU: Northwestern University; OA: osteoarthritis; PFT: pulmonary function test; POS: parts of speech; PPV: positive predictive value; PsA: psoriatic arthritis; PT: preferred term; QI: quality indicator; RA: rheumatoid arthritis; RISE: Rheumatology Informatics System for Effectiveness; RoBERTa: A Robustly Optimized BERT Pretraining Approach; SpA: spondyloarthritis; SRC: scleroderma renal crisis; SS: Sjögren’s syndrome; SSc: systemic sclerosis; SVM: support vector machine; UMLS: Unified Medical Language System; VUMC: Vanderbilt University Medical Center; YOLO: You Only Look Once.
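
To make the BERT-based NER rows concrete (e.g. Li et al., 2022 [46]), the following is a minimal sketch of token classification with the Hugging Face `transformers` pipeline. The checkpoint name `my-clinical-bert-ner` is a hypothetical placeholder, not the study's model, which was fine-tuned on 1600 annotated clinical notes.

```python
from transformers import pipeline

# A minimal NER sketch; "my-clinical-bert-ner" is a placeholder checkpoint,
# not the fine-tuned model reported by Li et al. [46].
ner = pipeline(
    "token-classification",
    model="my-clinical-bert-ner",   # hypothetical fine-tuned BERT checkpoint
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

note = "Patient with rheumatoid arthritis started methotrexate 15 mg weekly."
for entity in ner(note):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```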
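For the gout-flare row of Yoshida et al., 2024 [42], the sketch below shows the general LASSO approach: an L1-penalized logistic regression scored by AUROC. The features are synthetic stand-ins; the study's actual note- and claims-derived predictors are not listed in the table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features (assumed analogues of note and claims signals).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The L1 penalty performs LASSO-style selection: coefficients of
# uninformative features are shrunk exactly to zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUROC: {auc:.3f}, nonzero coefficients: {(model.coef_ != 0).sum()}")
```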
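Finally, the rule-based rows (e.g. Luedders et al., 2023 [37]; Nelson et al., 2015 [49]) rely on pattern matching over clinical free text. The sketch below shows the general shape of such a regex pipeline with naive negation handling; the patterns are illustrative assumptions, not the validated expressions from either study.

```python
import re

# Illustrative ILD patterns only -- assumptions, not the validated
# expressions from Luedders et al. [37].
ILD_PATTERNS = [
    re.compile(r"\binterstitial\s+(lung\s+disease|pneumonia|pneumonitis)\b", re.I),
    re.compile(r"\b(usual|nonspecific)\s+interstitial\s+pneumonia\b", re.I),
    re.compile(r"\bhoneycombing\b", re.I),
]

# Naive negation cue within the same sentence, before the match.
NEGATION = re.compile(r"\b(no|without|negative\s+for)\b[^.]{0,40}$", re.I)

def flag_ild(report: str) -> bool:
    """Return True if any ILD pattern matches outside an obvious negation."""
    for sentence in report.split("."):
        for pattern in ILD_PATTERNS:
            match = pattern.search(sentence)
            if match and not NEGATION.search(sentence[: match.start()]):
                return True
    return False

print(flag_ild("Findings: subpleural honeycombing consistent with UIP."))  # True
print(flag_ild("No evidence of interstitial lung disease."))               # False
```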