Author manuscript; available in PMC: 2026 Feb 14.
Published in final edited form as: Arthritis Care Res (Hoboken). 2025 Oct 16;78(1):54–65. doi: 10.1002/acr.25654

Development and External Validation of a Multivariable Predictive Model for Progression to Difficult-to-Treat Rheumatoid Arthritis in Biologic-Experienced Patients

Misti L Paudel 1,2, Nancy Shadick 1,2, Michael Weinblatt 1,2, George Reed 3, Heather J Litman 4, Joel M Kremer 5, Dimitrios A Pappas 4,5,6, Daniel H Solomon 1,2
PMCID: PMC12904143  NIHMSID: NIHMS2141374  PMID: 40977501

Abstract

Objective:

Approximately 20% of patients with rheumatoid arthritis (RA) cycle through multiple therapies without achieving treatment goals and are classified as “difficult-to-treat” RA (D2T-RA); however, no risk-prediction tools exist to identify which patients are at highest risk. Our aim was to develop and validate a predictive model for progression to D2T-RA among patients with RA.

Methods:

We used data from two large independent observational cohorts of patients with RA to develop and externally validate a multivariable prediction model to identify participants at risk of D2T-RA, defined using EULAR 2021 criteria. We developed a multivariable predictive model for D2T-RA using Random Survival Forests in participants treated with their first biologic and/or targeted synthetic disease modifying anti-rheumatic drug (b/tsDMARD) (derivation cohort). We validated the model in a cohort of participants initiating or switching b/tsDMARD therapies.

Results:

A total of 700 participants were in the derivation cohort (84% females, mean age=55 years, median follow-up=40 months, 113 (16%) with D2T-RA), and 2,070 participants were included in the validation cohort (79% females, mean age=56 years, median follow-up=8 months, 571 (28%) with D2T-RA). We observed C-index values of 0.643 (95% CI 0.585-0.698; derivation cohort) and 0.620 (95% CI 0.596-0.643; validation cohort). Calibration measures suggested overall moderate predictive ability. Worsened functional status, pain, fatigue, and global disease activity were consistently top predictors across both cohorts.

Conclusions:

Our model demonstrated moderate discrimination and calibration, highlighting the challenge in accurately predicting D2T-RA outcomes. These findings underscore the need for further research to improve predictive performance, potentially through the incorporation of additional biomarkers.

INTRODUCTION

Rheumatoid arthritis (RA) is a chronic inflammatory disease that affects 0.3-1.0% of people worldwide, including approximately 1.3 million adults in the United States (US).1,2 The past several decades have seen advances in RA management, in part due to the increasing availability of new biologic and/or targeted synthetic disease-modifying antirheumatic drugs (b/tsDMARDs).3 Despite these advances, a subset of patients with RA remains persistently disabled with chronic pain and functional limitations.

Prior studies have observed that approximately one in five patients with RA cycle through multiple treatments without experiencing relief in symptoms and/or remission and are classified as ‘difficult-to-treat’ RA (D2T-RA).4–7 Additionally, one sixth of patients with RA have been treated with five different classes of b/tsDMARD therapies.7 Treatment non-response and cycling pose a significant challenge for rheumatologists due to limited evidence on how to identify and manage these patients.8 In 2021, after reviewing results of an international survey9 and existing literature, the European League Against Rheumatism (EULAR) published criteria for assessing D2T-RA, stating that three conditions must all be present: 1) multiple DMARD failure, 2) signs suggestive of active/progressive disease, and 3) management of symptoms viewed as problematic according to the rheumatologist and/or patient.10 Some have criticized this definition as too broad,11 but its strength lies in the requirement of the three domains of treatment cycling, persistent disease activity, and management difficulties, ensuring accurate identification of an important phenotype of this subgroup of patients with RA.

Prior studies have identified several clinical characteristics that are potential risk factors for D2T-RA, including older age; higher rheumatoid factor (RF); higher clinical disease activity index (CDAI); lower methotrexate dosage; a greater proportion with hypertension; a greater proportion with diabetes; and greater treatment delay.12 In another cohort, higher RF levels, greater disease activity, and comorbid pulmonary disease were associated with D2T-RA risk.13 A predictive model that incorporates clinical, demographic, and patient-reported outcomes (PROs) to predict the risk of progressing to D2T-RA would be a clinically useful tool to assist rheumatologists in identifying patients with RA who could benefit from alternative treatment and management approaches.

We used data from two longitudinal cohorts of patients with RA with comprehensive longitudinal assessments of RA-related treatments, clinical factors, and PROs to develop and externally validate a predictive model to identify which patients with RA are most likely to advance to this state of multi-treatment failure.

PATIENTS AND METHODS

Data sources.

We used data from the Brigham and Women’s Hospital Rheumatoid Arthritis Sequential Study (BRASS) and the Comparative Effectiveness Registry to study Therapies for Arthritis and Inflammatory Conditions (CERTAIN). The BRASS registry is a single-center, prospective observational cohort that enrolled participants from the Brigham and Women’s Hospital Arthritis Center in Boston, MA from March 2003 to May 2019. Participants were eligible if they had a diagnosis of RA using American College of Rheumatology RA classification criteria or based on the opinion of a rheumatologist. Participants completed annual visits with a rheumatologist and interim 6-month questionnaires. Study follow-up continued through April 2023. Details of the BRASS registry, including its overall goals and design, have been reported elsewhere.14 All participants provided informed consent, and approval was obtained from the Institutional Review Board of Brigham and Women’s Hospital (2002P001762). The BRASS clinical trial registration number is NCT01793103.

The CERTAIN study is a 12-month prospective substudy of the large multi-center US-based registry, CorEvitas (formerly CORRONA). Patients with at least moderate disease activity (CDAI >10) who were starting or switching to a new biologic agent that was an anti-TNF (adalimumab, etanercept, certolizumab, golimumab, or infliximab) or a non-anti-TNF agent (tocilizumab, abatacept, or rituximab) were eligible to participate. Participants completed visits every three months for up to one year, at which point longitudinal follow-up continued in the parent CorEvitas registry. Participants needing to discontinue and switch to a different biologic agent during CERTAIN follow-up were offered an opportunity to re-enroll in the study, provided they met study inclusion criteria. Details of the CERTAIN study have been reported elsewhere.15 All participants provided informed consent, and approval was obtained from the New England Institutional Review Board (NEIRB 120160610). The CERTAIN clinical trial registration number is NCT01625650. For academic investigative sites in CERTAIN that did not receive a waiver to use the central IRB, approval was obtained from the respective governing IRBs and documentation of approval was submitted to the Sponsor prior to initiating any study procedures. All CERTAIN subjects were required to provide written informed consent prior to participating.

Study design.

We used data from BRASS to develop a predictive algorithm for progression to D2T-RA, and we externally validated the algorithm using data from CERTAIN. For brevity, we will refer to these as the derivation and validation cohorts throughout the manuscript. In the BRASS cohort we subset to participants with RA who were treated with their first b/tsDMARD and defined the index date as the earliest recorded study visit where first b/tsDMARD treatment was documented. In CERTAIN we restricted to participants who were not currently classified as D2T-RA at enrollment. Also, for CERTAIN participants who switched therapies and were subsequently re-enrolled, we strung consecutive enrollments together and re-calculated follow-up time from study entry. Across both cohorts, patients missing demographic data were excluded. In secondary analyses using CERTAIN, we subset to those who were treated with their first b/tsDMARD (naïve). Both sets of analyses are presented.

Difficult-to-treat RA.

We mapped available study variables to each component of EULAR’s D2T-RA criteria and identified the earliest study visit in which participants met all three components: DMARD failure, persistent disease activity, and perception of a persistent problem during follow-up. Details of EULAR’s D2T-RA algorithm and mapped variables for each cohort are provided in Table 1. Details of how this algorithm was implemented in BRASS have been previously published.16 We made modifications to D2T-RA sub-criteria when needed, based on the availability of data elements in CERTAIN. These modifications included extra-articular manifestations, prednisone use, and erosions and have been described in the methods section.

Table 1.

Description of EULAR criteria for difficult-to-treat rheumatoid arthritis in BRASS and CERTAIN

| Criterion | EULAR Definition | BRASS Definition | CERTAIN Definition |
|---|---|---|---|
| DMARD Failure | Criterion 1: Failure of ≥2 b/tsDMARDs with different mechanisms of action after failing conventional synthetic DMARD therapy | Patient reported treatment with ≥2 b/tsDMARDs with different mechanisms of action for at least three months | Patient reported treatment with 2 b/tsDMARDs with different mechanisms of action for at least three months |
| Persistent disease activity | Criterion 2: At least one of the following conditions (a-e) | | |
| | a) At least moderate disease activity such as DAS28-ESR >3.2 or CDAI >10 | DAS28-CRP >3.2 or CDAI >10 | DAS28-CRP >3.2 or CDAI >10 |
| | b) Signs (acute phase reactants and imaging) and/or symptoms suggestive of active disease | Presence of extra-articular manifestations in the past year | History of selected extra-articular manifestations |
| | c) Inability to taper glucocorticoid treatment below 7.5 mg/day prednisone or equivalent | Glucocorticoid most frequent average dose in past six months ≥6 mg | Current treatment with prednisone with a reported dose ≥7.5 mg |
| | d) Rapid radiographic progression (with or without signs of active disease) | Change in Sharp score of 5 or more points | History of ever having erosive disease |
| | e) Well-controlled disease according to above standards but still having RA symptoms that are causing a reduction in quality-of-life | RAPID3 score >3 (moderate/high disease severity) and does not meet criteria a-d | RAPID3 score >3 (moderate/high disease severity) and does not meet criteria a-d |
| Perception of a persistent problem | Criterion 3: Management of RA and signs/symptoms are perceived as problematic by the rheumatologist and/or the patient | PtGA score >3 or PrGA score >3 | PtGA score >3 or PrGA score >3 |

b/tsDMARD: biologic and/or targeted synthetic DMARD; BRASS: Brigham and Women’s Hospital Rheumatoid Arthritis Sequential Study; CDAI: Clinical Disease Activity Index; CERTAIN: Comparative Effectiveness Registry to study Therapies for Arthritis and Inflammatory Conditions; DAS28: 28-joint DAS; PtGA: Patient Global Disease Activity; PrGA: Provider Global Assessment; RAPID3: Routine Assessment of Patient Index Data-3.
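To make the three-component logic in Table 1 concrete, the following is a minimal sketch of how a single visit might be checked and how the earliest qualifying visit might be located. It is not the registries' actual algorithm: the `Visit` fields, the function names, and the assumption that PtGA and PrGA are expressed on a 0-10 scale (as the ">3" thresholds in Table 1 suggest) are all hypothetical, and the glucocorticoid threshold is left as a parameter because Table 1 lists different cut-offs for BRASS and CERTAIN.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Visit:
    """One study visit. All field names are hypothetical, not registry variables."""
    n_btsdmard_mechanisms: int  # distinct b/tsDMARD mechanisms taken >= 3 months so far
    das28_crp: float
    cdai: float
    extra_articular: bool       # proxy for criterion 2b
    glucocorticoid_mg: float    # daily prednisone-equivalent dose
    radiographic_flag: bool     # proxy for criterion 2d
    rapid3: float
    ptga: float                 # assumed 0-10 scale
    prga: float                 # assumed 0-10 scale


def meets_d2t_ra(v: Visit, gc_threshold_mg: float = 7.5) -> bool:
    """Return True only when all three EULAR components are met at this visit."""
    dmard_failure = v.n_btsdmard_mechanisms >= 2
    # Criterion 2, conditions a-d: any one is sufficient.
    a_to_d = (
        v.das28_crp > 3.2
        or v.cdai > 10
        or v.extra_articular
        or v.glucocorticoid_mg >= gc_threshold_mg
        or v.radiographic_flag
    )
    # Condition e applies only when a-d are not met.
    cond_e = (not a_to_d) and v.rapid3 > 3
    persistent_activity = a_to_d or cond_e
    perceived_problem = v.ptga > 3 or v.prga > 3
    return dmard_failure and persistent_activity and perceived_problem


def first_d2t_visit(visits: List[Visit]) -> Optional[int]:
    """Index of the earliest visit meeting all three components, else None."""
    for i, v in enumerate(visits):
        if meets_d2t_ra(v):
            return i
    return None
```

Requiring all three components at the same visit mirrors the text's "earliest study visit in which participants met all three components"; a different reading (components met at different visits) would need a different implementation.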

Potential Predictors.

We included several potential predictors in RSF models, spanning RA-related treatments, clinical factors, demographics and PROs. We provide more details of each of these measures from both cohorts in the following paragraphs.

RA-related treatments.

RA-related treatments included conventional synthetic (cs) and biologic/targeted synthetic (b/ts) DMARD therapies, and glucocorticoid use. We classified b/tsDMARD use as TNF inhibitors (etanercept, adalimumab, infliximab, golimumab, certolizumab pegol), T-cell co-stimulation inhibitors (abatacept), IL-1 inhibitors (anakinra), B-cell depletion (rituximab), IL-6 inhibitors (tocilizumab, sarilumab), and JAK inhibitors (tofacitinib, baricitinib, upadacitinib). Glucocorticoid use, such as prednisone or methylprednisolone, was assessed at each visit and expressed as the most common daily dosage in the past six months (BRASS) or current treatment (CERTAIN).

Measures of disease activity.

Rheumatologists completed a Disease Activity Score-28-C-reactive protein-3 (DAS28-CRP3)17 including a tender joint count (TJC) and swollen joint count (SJC), patient global assessment (PtGA), and CRP laboratory values. The Clinical Disease Activity Index (CDAI) was also assessed, with scores >10 indicating at least moderate disease activity. CDAI consists of SJC, TJC, PtGA, and provider global assessment (PrGA). In both cohorts, PtGA scores ranged from zero (arthritis was not at all active) to 100 (arthritis was extremely active).18,19 Scores of ≥30 were indicative of patient perception of active RA disease. In addition, rheumatologists completed a global assessment of disease activity that ranged from 0 (no arthritis activity) to 100 (severe arthritis activity).20 Scores ≥30 were indicative of provider perception of active RA disease.
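As an illustration of how the four CDAI components combine, the sketch below sums the 28-joint swollen and tender counts with the two global assessments. The function names are hypothetical, and the division by 10 is an assumption made here to bring the 0-100 global scores described above onto the 0-10 scale used by the standard CDAI; it is not stated in the text.

```python
def cdai(sjc28: int, tjc28: int, ptga_0_100: float, prga_0_100: float) -> float:
    """Clinical Disease Activity Index from its four components.

    Global assessments are assumed to be recorded on 0-100 scales and are
    rescaled to 0-10 here (an assumption of this sketch).
    """
    if not (0 <= sjc28 <= 28 and 0 <= tjc28 <= 28):
        raise ValueError("joint counts must be between 0 and 28")
    if not (0 <= ptga_0_100 <= 100 and 0 <= prga_0_100 <= 100):
        raise ValueError("global assessments must be between 0 and 100")
    return sjc28 + tjc28 + ptga_0_100 / 10.0 + prga_0_100 / 10.0


def at_least_moderate_activity(cdai_score: float, threshold: float = 10.0) -> bool:
    """CDAI > 10 is the cut-off used in the text for active disease."""
    return cdai_score > threshold
```

For example, `cdai(6, 6, 44, 27)` adds 12 joints to 4.4 and 2.7 on the rescaled global scales, which exceeds the threshold of 10.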

Radiographic measures and erosions.

In BRASS, bilateral hand and wrist radiographs were performed at enrollment and every two years during follow-up, and were scored by four BWH radiologists according to the Sharp/van der Heijde method.21,22 Erosions were scored for sixteen joints on each side of the body and a total erosion score summed erosions in hands and wrists. Joint space narrowing (JSN) was summed across hands and wrists with range 0–120.22,23 The Sharp/van der Heijde score (SHS) was computed as total erosion score + total JSN score.24 In CERTAIN, a history of joint erosions was recorded.

Extra-articular manifestations.

Extra-articular manifestations of RA experienced by the participant were recorded in BRASS and included pericarditis, pleuritis, pulmonary fibrosis, bronchiectasis, pulmonary nodules, bronchiolitis obliterans with organizing pneumonia, subcutaneous rheumatoid nodules, Sjogren’s Syndrome/Keratoconjunctivitis sicca, large granular lymphocytic leukemia, cervical myelopathy, neuropathy, Felty’s Syndrome, cutaneous vasculitis, pyoderma gangrenosum, scleritis/episcleritis, glomerulonephritis, vasculitis of other organs, amyloidosis, and lymphadenopathy. In CERTAIN these consisted of secondary Sjogren’s; history of subcutaneous nodules; interstitial lung disease; pulmonary fibrosis; rheumatoid nodules; or rheumatoid pleurisy.

The Routine Assessment of Patient Index Data 3 (RAPID3)25 is a patient-reported measure of RA severity that includes three subcomponents of the multidimensional Health Assessment Questionnaire (MDHAQ)26: function, pain, and global estimate of status. Standard RAPID3 cut-offs for disease activity were remission (≤3), low disease activity (3.1-6), moderate disease activity (6.1-12), and high disease activity (>12).
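The RAPID3 categories can be expressed as a small lookup. This is only a sketch: the function name is hypothetical, the 0-30 score range reflects the usual RAPID3 scoring, and treating scores up to and including 3 as remission is an assumption about how the boundary is handled.

```python
def rapid3_category(score: float) -> str:
    """Map a RAPID3 score (0-30) to a disease activity category."""
    if not 0 <= score <= 30:
        raise ValueError("RAPID3 is scored from 0 to 30")
    if score <= 3:
        return "remission"
    if score <= 6:
        return "low"
    if score <= 12:
        return "moderate"
    return "high"
```

Under this mapping, the RAPID3 >3 threshold used in Table 1 corresponds to any category other than remission.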

Demographics and other non-RA related clinical factors.

Demographics (age, sex, education, ethnicity), RA disease duration, comorbid medical conditions (physician-reported history of diabetes mellitus, hypertension, and cardiovascular disease [angina, heart attack, and heart failure]), age at RA onset, RF, and anti-CCP positivity were assessed at enrollment. Body mass index (BMI) and smoking status were assessed at baseline and during follow-up, and we used a last-value-carried-forward approach to fill in missing values at interim six-month and missed visits.
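The carry-forward step for BMI and smoking status can be sketched as follows. The function name and the use of `None` to mark a missing visit value are assumptions for illustration; values missing before the first observation are left missing in this sketch.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def carry_forward(values: List[Optional[T]]) -> List[Optional[T]]:
    """Fill each missing (None) entry with the most recent observed value.

    Entries before the first observed value stay None.
    """
    filled: List[Optional[T]] = []
    last: Optional[T] = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Example: BMI recorded at annual visits, missing at interim questionnaires.
bmi_by_visit = [27.1, None, None, 28.4, None]
bmi_filled = carry_forward(bmi_by_visit)
```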

Statistical analysis:

Descriptive analyses were performed to evaluate demographic, clinical characteristics, and PROs in each cohort, separately. Follow-up time was computed from index date (as defined above for each cohort) until the earliest visit where D2T-RA criteria were met or last study visit. This study was conducted in accordance with the TRIPOD+AI guidelines for transparent reporting of multivariable prediction models using machine learning.27

Predictive model.

To build a predictive model, we first performed necessary data cleaning and harmonization steps, evaluating outliers and missing values. We excluded variables with very low or very high prevalence (i.e., <3% or >97%), which are unlikely to have predictive utility. We imputed missing values using the impute-only feature in Random Survival Forests under the assumption that data are missing at random (MAR).28 We used random survival forests (RSF) to develop prediction models using the ‘randomForestSRC’ package in R.28,29 RSF are an extension of random forest methods for right-censored time-to-event data in which survival trees are generated using bootstrapped data and features (predictors) are selected randomly when splitting tree nodes. RSF constructs predictions by drawing a set of bootstrap samples from the data for the number of specified trees (ntree), leaving some observations (~1/3) out-of-bag (OOB). We computed prediction accuracy and variable importance measures on the OOB samples. We tuned the model using an iterative process to optimize the ntree, mtry, and nodesize hyperparameters, with the overall goal of minimizing the OOB error rate. We evaluated Harrell’s concordance index (C-index) from the model, which extends the area under the receiver operating characteristic curve (AUROC) to survival models and estimates the probability that, of two patients, the one who became D2T-RA first had the worse predicted outcome (a C-index of 1 indicates perfect discrimination and 0.5 indicates discrimination no better than chance). We used bootstrapping to obtain 95% confidence intervals for C-index values. We calculated variable importance (VIMP) values, which compare model performance with and without the variable included. Initial models were run on 45 total predictors (Supplementary Table S1). After reviewing initial model performance metrics, we removed predictors with VIMP values that were zero or negative (indicating that the variable may be negatively impacting model performance) from the model.
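The C-index and its bootstrap interval can be illustrated with a short, dependency-free sketch. This is not the randomForestSRC implementation: the function names are hypothetical, ties in follow-up time are simply skipped, and the percentile bootstrap shown here is one common choice that the text does not specify.

```python
import random
from typing import List, Sequence, Tuple


def harrell_c(times: Sequence[float], events: Sequence[int],
              risk: Sequence[float]) -> float:
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when i had the event and did so before j's
    follow-up time. It is concordant when i also has the higher predicted
    risk; tied risks count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def bootstrap_ci(times: Sequence[float], events: Sequence[int],
                 risk: Sequence[float], n_boot: int = 500,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the C-index (resampling participants)."""
    rng = random.Random(seed)
    n = len(times)
    stats: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(harrell_c([times[k] for k in idx],
                                   [events[k] for k in idx],
                                   [risk[k] for k in idx]))
        except ValueError:
            continue  # resample had no comparable pairs; skip it
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper
```

The pairwise loop is quadratic in the number of participants, which is acceptable for cohorts of a few thousand but would be replaced by a sorted or tree-based algorithm at larger scale.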

We plotted Kaplan-Meier-Like survival curves based on predicted probabilities for each time point from the RSF model. These curves resemble Kaplan-Meier (KM) plots in appearance and interpretation but differ as they are created from a machine-learning survival model and not a nonparametric KM estimator.
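A minimal sketch of how such a curve could be assembled is shown below: each participant's predicted survival probabilities on a shared time grid are averaged at every time point. The function name and the input layout (one list of probabilities per participant) are assumptions for illustration and do not reflect the output format of any particular package.

```python
from typing import List, Sequence


def mean_survival_curve(pred_surv: Sequence[Sequence[float]]) -> List[float]:
    """Average predicted survival probabilities across participants.

    pred_surv[i][k] is participant i's predicted probability of remaining
    free of the outcome at the k-th point of a shared time grid.
    """
    if not pred_surv:
        raise ValueError("need at least one participant")
    n_times = len(pred_surv[0])
    if any(len(row) != n_times for row in pred_surv):
        raise ValueError("all participants must share the same time grid")
    n = len(pred_surv)
    return [sum(row[k] for row in pred_surv) / n for k in range(n_times)]


# Two hypothetical participants evaluated at three time points.
curve = mean_survival_curve([[1.0, 0.8, 0.6], [1.0, 0.6, 0.2]])
```

Because every point is a model prediction rather than a count of observed events, the resulting curve is smooth in a way a step-function Kaplan-Meier estimate is not.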

In RSF models, calibration refers to how closely survival times predicted by the model match observed survival times. We plotted calibration curves that evaluate the percentage of prediction error by time, including a reference model curve (null model with no predictors) and a curve for the final predictive model. Values closer to zero indicate better calibration. In addition, we evaluated standardized continuous ranked probability scores (CRPS), which quantify the difference between the predicted cumulative probability distribution and observed outcomes, standardized by follow-up time. Values from a standardized CRPS are scaled to range from 0 to 1, with those closer to 0 indicating better probabilistic prediction. Although there is no threshold to indicate adequacy, in general, values below 0.5 suggest substantial improvement over a reference model.

We evaluated overall model performance by assessing the integrated Brier Score (IBS). The IBS is the average of Brier Scores across each time interval, which are computed by calculating the mean squared difference between predicted and actual outcomes at specific time points. Values for IBS range from zero (perfect prediction) upward. There are no thresholds to indicate adequacy; however, in general, lower IBS values indicate better overall predictive performance, and values for models with good performance often range between 0.1 and 0.3, although this can vary. We evaluated SHAP (SHapley Additive exPlanations)30 values to better understand the contribution of each variable in the model to the overall prediction. We present SHAP values in a beeswarm plot, where each row represents a predictor in the model, ranked (highest to lowest) by mean absolute SHAP value. Dots in a beeswarm plot represent the SHAP values for each participant and predictor in the analysis. The color of the dots denotes feature value, with higher values in orange and lower values in dark red. Predictors with a positive SHAP value push the model’s prediction toward higher risk, while those with negative values push it toward lower risk, and the magnitude of the SHAP value indicates how important that variable was to the prediction.
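The relationship between the time-specific Brier Score and the IBS can be sketched as below. This is a deliberately simplified version and an assumption of this example: participants censored before the evaluation time are skipped, whereas packaged implementations typically correct for censoring with inverse-probability weights. Function names and the input layout are hypothetical.

```python
from typing import List, Sequence


def brier_at(t: float, times: Sequence[float], events: Sequence[int],
             surv_at_t: Sequence[float]) -> float:
    """Mean squared difference between predicted survival and observed status at t.

    Simplification: participants censored before t are excluded rather than
    reweighted.
    """
    total, n = 0.0, 0
    for follow_up, event, s in zip(times, events, surv_at_t):
        if follow_up <= t and event:
            observed = 0.0      # outcome occurred by time t
        elif follow_up > t:
            observed = 1.0      # still outcome-free at time t
        else:
            continue            # censored before t: status unknown
        total += (observed - s) ** 2
        n += 1
    return total / n if n else float("nan")


def integrated_brier(grid: Sequence[float], times: Sequence[float],
                     events: Sequence[int],
                     surv: Sequence[Sequence[float]]) -> float:
    """Trapezoidal average of Brier Scores over an increasing time grid.

    surv[i][k] is participant i's predicted survival probability at grid[k].
    """
    if len(grid) < 2:
        raise ValueError("need at least two time points")
    scores: List[float] = [
        brier_at(t, times, events, [row[k] for row in surv])
        for k, t in enumerate(grid)
    ]
    area = sum((grid[k + 1] - grid[k]) * (scores[k] + scores[k + 1]) / 2
               for k in range(len(grid) - 1))
    return area / (grid[-1] - grid[0])
```

In this sketch, a model that predicts a survival probability of 0.5 for everyone at every time point scores 0.25, which is a useful mental reference when reading reported IBS values.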

We externally validated our predictive model in CERTAIN data, assessing the same metrics for discrimination and calibration. In primary analyses, we assessed model performance on the full analytic sample in CERTAIN, consisting of b/tsDMARD-naïve and b/tsDMARD-experienced participants. In secondary analyses, we restricted to b/tsDMARD-naïve participants to better align with the characteristics of the derivation cohort, who were all treated with their first b/tsDMARD. We also performed an additional secondary analysis refitting the model in the full sample of CERTAIN participants.

RESULTS

A total of 1,581 participants were enrolled in BRASS, and of these, 708 (45%) had at least one visit where they were recorded as being treated with their first b/tsDMARD. After excluding 8 (1%) participants with missing demographics, the final derivation cohort consisted of 700 (99%) participants. In the validation cohort, 2,350 participants enrolled in CERTAIN, of which we excluded 2 (<1%) who had missing demographics and 278 (12%) who met D2T-RA criteria at study entry. Our final analytic sample for validation consisted of 2,070 (88%) participants, of which 1,043 (50%) were b/tsDMARD naïve (Supplementary Figure S1).

Descriptive information on demographics, RA-related clinical factors, and PROs in each cohort is reported in Table 2. In general, derivation and validation cohorts had similar mean ages (55-56 years) and similar proportions of female participants (78-83%). We noted differences in BMI, prevalence of cardiovascular disease, and multiple measures of disease activity and PROs between derivation and validation cohorts.

Table 2.

Patient Characteristics at Baseline in BRASS and CERTAIN

Values are N (%) or mean (SD).

| Characteristic | Derivation cohort (BRASS; N=700) | Validation cohort (CERTAIN), full sample (N=2,070) | Validation cohort (CERTAIN), b/tsDMARD naïve (N=1,043) |
|---|---|---|---|
| Follow-up, months | 80 (63) | 8.2 (5.7) | 9.5 (5.6) |
| Demographics and comorbidities | | | |
| Current age, years | 55 (14) | 56 (13) | 56 (13) |
| 18-44 years | 167 (24%) | 393 (19%) | 202 (19%) |
| 45 – 65 years | 370 (53%) | 1,160 (56%) | 597 (57%) |
| ≥ 65 years | 163 (23%) | 517 (25%) | 244 (23%) |
| Age at RA diagnosis, years | 42 (15) | 49 (14) | 51 (14) |
| Female sex | 584 (83%) | 1,644 (79%) | 817 (78%) |
| Education | | | |
| High school diploma or less | 120 (17%) | 815 (39%) | 435 (42%) |
| At least some technical college or professional school | 154 (22%) | | |
| Graduated college | 190 (27%) | 1,255 (61%) | 608 (58%) |
| Graduate school | 236 (34%) | | |
| Smoking status | | | |
| Never smoked | 425 (61%) | 997 (48%) | 513 (49%) |
| Former smoker | 231 (33%) | 665 (32%) | 305 (29%) |
| Current smoker | 44 (6.3%) | 408 (20%) | 225 (22%) |
| Body mass index, BMI, kg/m² | | | |
| <30 | 527 (75%) | 1,199 (58%) | 607 (58%) |
| ≥30 | 173 (25%) | 871 (42%) | 436 (42%) |
| History of type 2 diabetes | 36 (5.1%) | 174 (8.4%) | 87 (8.3%) |
| History of hypertension | 197 (28%) | 565 (27%) | 271 (26%) |
| History of cardiovascular disease* | 52 (7.4%) | 1,267 (61%) | 616 (59%) |
| RA-Related Clinical Factors | | | |
| Rheumatoid Factor (RF) positivity | 482 (69%) | 1,376 (66%) | 685 (66%) |
| Anti-CCP-Positivity | 489 (70%) | 1,268 (61%) | 616 (59%) |
| DAS28-CRP | 3.4 (1.5) | 4.8 (1.1) | 4.8 (1.1) |
| ≤ 3.2 | 348 (50%) | 142 (6.9%) | 62 (5.9%) |
| 3.3-5.1 | 249 (36%) | 1,159 (56%) | 600 (58%) |
| > 5.1 | 103 (15%) | 769 (37%) | 381 (37%) |
| CDAI >10 | 404 (58%) | 2,046 (99%) | 1,031 (99%) |
| C-Reactive Protein (CRP) | 6.1 (13.7) | 10 (18) | 11 (19) |
| Tender joint count (0-28) | 6 (7) | 11 (7) | 11 (7) |
| Swollen joint count (0-28) | 6 (6) | 8 (6) | 8 (6) |
| Patient Global Activity Score (PtGA, VAS 0-100) | 4.4 (2.8) | 5.3 (2.5) | 5.2 (2.5) |
| PtGA ≥ 30 | 486 (69%) | 1,773 (86%) | 889 (85%) |
| Provider Global Assessment Score (PrGA, VAS 0-100) | 2.7 (1.9) | 5.0 (2.0) | 5.1 (1.9) |
| PrGA Score ≥ 30 | 347 (50%) | 1,919 (93%) | 985 (94%) |
| RAPID3 > 3 | 479 (68%) | 1,960 (95%) | 982 (94%) |
| One or more extra-articular manifestations | 156 (22%) | 578 (28%) | 252 (24%) |
| MDHAQ total score (0-100 scale) | 18 (16) | 32 (23) | 31 (22) |
| MDHAQ Fatigue scale (0-100) | 39 (29) | 54 (29) | 53 (30) |
| MDHAQ Pain scale (0-100) | 30 (25) | 54 (27) | 53 (27) |
| Methotrexate concomitant therapy | 367 (55%) | 1,257 (61%) | 710 (68%) |
| Current prednisone use | 139 (21%) | 691 (33%) | 333 (32%) |
| b/tsDMARD use | | | |
| Adalimumab | 177 (25%) | 439 (21%) | 322 (32%) |
| Etanercept | 409 (58%) | 346 (17%) | 270 (26%) |
| Other TNFis | 75 (11%) | 641 (31%) | 309 (30%) |
| Other b/tsDMARDs | 39 (5.6%) | 644 (31%) | 142 (14%) |
| b/tsDMARD Treatment naïve | 700 (100%) | 1,043 (50%) | 1,043 (100%) |

BRASS: Brigham and Women’s Hospital Rheumatoid Arthritis Sequential Study; b/tsDMARD: biologic and/or targeted synthetic DMARD; CDAI: Clinical Disease Activity Index; CERTAIN: Comparative Effectiveness Registry to study Therapies for Arthritis and Inflammatory Conditions; DAS28: 28-joint DAS; RA: rheumatoid arthritis; RAPID3: Routine Assessment of Patient Index Data-3; TNFi: Tumor necrosis factor inhibitor; VAS: Visual Analog Scale.

In the derivation cohort, 113 (16%) of participants progressed to D2T-RA over a mean of 80 months of follow-up. In initial RSF models, we included 45 variables and calculated VIMP values. We removed 22 variables that had zero, or negative VIMP values to improve model performance. The final predictive model consisted of 23 variables, including D2T-RA flag, follow-up time, age, age at RA diagnosis, education, BMI categories, smoking status, history of type 2 diabetes, history of hypertension, history of cardiovascular disease, current prednisone use, DAS28-CRP score, DAS28-CRP categories, C-reactive protein, PrGA score, PrGA categories, PtGA score, swollen joint count, MDHAQ total score, MDHAQ anxiety scale, MDHAQ fatigue scale, MDHAQ pain scale, and RAPID3 score. An overview of all variables considered and included in final models is presented in Supplementary Table S1. After performing model tuning processes, the optimal settings for the final RSF model were 1,000 trees, terminal node size of 10, and mtry of 6. We additionally selected the default of sampling without replacement for model training.

In the final derivation model, we observed a Harrell’s C-index of 0.64 (95% CI 0.59 – 0.70), standardized CRPS of 0.136, and integrated Brier Score of 0.064 (Table 3). Our visual inspection of a prediction error curve by time indicates overall low prediction error that only slightly increases over time, remaining below a maximum of 0.2 (Figure 1).

Table 3.

Model Performance Metrics for Random Survival Forests

| Metrics | Derivation cohort (BRASS) | Validation cohort (CERTAIN), full sample | Validation cohort (CERTAIN), b/tsDMARD-naïve subset |
|---|---|---|---|
| Sample size (n) | 700 | 2,070 | 1,043 |
| Number of outcomes (D2T-RA) | 113 | 571 | 66 |
| Number of trees (ntree) | 1,000 | 1,000 | 1,000 |
| Terminal node size (nodesize) | 10 | 10 | 10 |
| Average number of terminal nodes | 38.5 | 38.5 | 38.5 |
| Total number of variables | 23 | 23 | 23 |
| Number of variables tried at each split (mtry) | 6 | 6 | 6 |
| Continuous ranked probability scores (CRPS) | 29.4 | 85.14 | 51.7 |
| Standardized CRPS | 0.136 | 0.394 | 0.239 |
| Integrated Brier Score (IBS; full follow-up) | 0.064 | 0.473 | 0.313 |
| Harrell’s C-index (95% CI) | 0.64 (0.59 – 0.70) | 0.62 (0.60 – 0.65) | 0.63 (0.54 – 0.71) |

Sampling without replacement and log rank random splitting used to grow trees.

Abbreviations: b/tsDMARD: biologic and/or targeted synthetic DMARD; BRASS: Brigham and Women’s Hospital Rheumatoid Arthritis Sequential Study; CERTAIN: Comparative Effectiveness Registry to study Therapies for Arthritis and Inflammatory Conditions; D2T-RA: difficult-to-treat rheumatoid arthritis.

Figure 1. Plot of Prediction Error by Time in BRASS and CERTAIN.


These plots depict how prediction error changes over follow-up in the models. The black reference line represents a null model (without predictors), and the red line represents the Random Survival Forest (RSF) prediction model. Plots are provided for the derivation cohort (BRASS), as well as for the validation in the full sample and in the b/tsDMARD-naïve subset. In an ideal scenario, the prediction model would maintain a low, consistent prediction error over time and demonstrate improvement over the reference model.

In the validation cohort of the full sample, we observed that 571 (28%) participants progressed to D2T-RA over a median of 8.2 months of follow-up. We externally validated the final predictive model in CERTAIN data and obtained a Harrell’s C-index of 0.62 (95% CI 0.60 – 0.65), standardized CRPS of 0.394, and integrated Brier Score of 0.473 (Table 3). A plot of the prediction error curve by time indicates a nearly linear increase in prediction error across time, with better calibration observed across shorter follow-up times (Figure 1). In a secondary analysis, we validated the model in a b/tsDMARD naïve subset of participants enrolled in CERTAIN. We observed that 66 (6%) progressed to D2T-RA over a median of 9.5 months of follow-up. We observed a Harrell’s C-index of 0.63 (95% CI 0.54 – 0.71), standardized CRPS of 0.239, and integrated Brier Score of 0.313 (Table 3). A visual inspection of the calibration curve suggested improvement in prediction error for the first 12-15 months of follow-up, compared to the full sample (Figure 1).

We present KM-like survival curves in Figure 2 for the derivation and validation cohorts, which demonstrate a much steeper decline for participants in the validation cohorts than in the derivation cohort, despite differences in follow-up.

Figure 2. Kaplan-Meier-Like Survival Curves in Derivation and Validation Cohorts.


These curves resemble a Kaplan-Meier plot in appearance and interpretation, but were constructed by obtaining and averaging the predicted probabilities for each time point from a machine learning survival model and not a non-parametric Kaplan-Meier estimator. Abbreviations: BRASS: Brigham and Women’s Hospital Rheumatoid Arthritis Sequential Study; CERTAIN: Comparative Effectiveness Registry to study Therapies for Arthritis and Inflammatory Conditions.

In our derivation model, we observed that the MDHAQ PROs and PtGA score were the top predictors in terms of mean absolute SHAP value (Figure 3). In general, feature values were higher where SHAP values were positive, indicating that greater values of MDHAQ or PtGA score were predictive of a higher risk of D2T-RA, although the relationship did not appear to be linear for most variables. The importance and spread of top predictors in our validation analysis showed some differences. For example, the most important predictors in validation data were PtGA score, MDHAQ pain subscale, BMI ≥30, and RAPID3 score. Additionally, smoking status, BMI, and RAPID3 appeared to be more important in validation data than in the derivation analysis.

Figure 3. Beeswarm Plot of SHAP Values.


These plots provide a global overview of the contribution of each of the top 15 predictors in the models. Predictors (features) are ordered vertically from highest to lowest mean absolute SHAP value, and each dot in a row represents an individual’s SHAP value in the dataset. Feature value is indicated by color, with higher values represented by yellow colors.

Finally, results of sensitivity analyses refitting our model to CERTAIN data demonstrated similar discrimination (C-index = 0.62, 95% CI 0.60 – 0.65, Supplementary Table S2), and calibration to the derivation cohort (CRPS = 0.200; IBS = 0.084).

DISCUSSION

We used data from two large prospective cohorts of individuals with RA to develop and externally validate a predictive model to identify patients at the highest risk of progressing to D2T-RA. Overall, our predictive model demonstrated moderate discrimination and calibration in derivation and validation cohorts. In addition, our results indicated that top predictors of difficult-to-treat RA consistently included PROs of functioning, pain, and fatigue and PtGA scores, while additional predictors differed in ranking of variable importance between the two models.

Our finding that PROs of pain, fatigue, functioning, and PtGA score were top predictors of progression to D2T-RA has not been previously reported. However, persistent unresolved pain in patients with RA treated with b/tsDMARDs has been reported in the literature. In one such study, inflammation was not able to fully explain pain trajectories, and non-inflammatory predictors, such as greater disability and smoking history, were predictive of persistent pain.31 These findings highlight the importance of integrating PROs into predictive models, and the potential clinical value of including them in treatment-related decision-making, as they may reflect unmeasured disease mechanisms, such as central sensitization or psychosocial burden, which influence disease progression and treatment response. Future research should continue to explore trajectories of PROs and risk of D2T-RA.

In addition to PROs, we identified several other predictors of D2T-RA, including PrGA score, obesity, and swollen joint count. These results overlap with, but are not in full alignment with, prior studies evaluating risk factors for D2T-RA. For example, in a study of 672 patients with RA, investigators observed that over a mean follow-up of 48 months, 8% of participants progressed to D2T-RA, and the primary predictors in a multivariable adjusted logistic regression model were high baseline RF (>156.4 IU/mL), greater DAS28-ESR, and co-existing pulmonary disease.12 In our study, RF and extra-articular manifestations, such as pulmonary disease, were not strong predictors of D2T-RA and were not selected for our final model. However, we similarly observed that DAS28-CRP score was a top risk factor in both cohorts. Other studies have also reported that female sex, delay in initial treatment, higher disease activity, and/or number of prior treatments are associated with risk of refractory RA.32,33 We did not observe these same associations, as sex was not selected in our final model based on low VIMP. We did not have a measure for treatment delay in our cohort and were unable to assess this risk factor. Finally, our derivation cohort consisted of participants with RA who were treated with their first b/tsDMARD therapy, and thus number of prior b/tsDMARDs was not a factor.

We included conventional DMARD therapies in model derivation, and observed that concurrent methotrexate use (meaning in combination with b/tsDMARD) was inversely associated with D2T-RA risk, indicating that participants who were not concurrently treated with methotrexate at index date had a higher risk of D2T-RA. However, in the validation cohort, concurrent methotrexate use was not identified as an important risk factor, despite it being slightly more prevalent. We are unable to assess why methotrexate non-use was an important risk factor for D2T-RA, and this should be explored in future research.

Our derivation and validation cohorts are robust longitudinal studies of patients with RA, enriched with several clinical, laboratory, and PRO assessments. They also have several key differences in study design and participant characteristics. Despite these notable differences, our results indicated moderate discrimination, which was similar across both derivation and validation cohorts. Our results also showed slightly worse calibration in the validation cohort, which improved slightly when we restricted the analysis to participants who were bDMARD naïve. These results indicate that our model effectively distinguished between higher and lower risk participants but was less accurate for individual-level predictions. However, in the validation cohort, calibration was strongest in the first 12-15 months of follow-up with adequate performance, likely reflecting the greater number of observed events during this period. Taken together, these results highlight the robustness and generalizability of our model in handling variations in patient characteristics. Additionally, the model’s ability to generalize to a higher-risk population in our validation (CERTAIN cohort) enhances its clinical utility. It is important to note that our goal was to develop and validate an algorithm to predict the broad definition of D2T-RA that EULAR intended, which captures a heterogeneous mix of patients with RA who have inflammatory and non-inflammatory evidence. It is possible that focusing on a narrower definition of D2T-RA, such as inflammatory D2T-RA, could improve model performance. We were unable to assess this as we observed very little non-inflammatory D2T-RA in prior analyses using BRASS.16 Therefore, it is important for future studies to continue to validate the model in additional cohorts, including populations with differences in characteristics and/or additional heterogeneity. Navigating heterogeneity across studies highlights a common challenge with predictive modeling, and the importance of training models on diverse populations.

This study has many strengths. We used rigorous statistical methodology on two robust longitudinal cohorts of participants with established RA to develop and validate a predictive model in adherence to TRIPOD+AI best practices. We also used a comprehensive definition of D2T-RA based on EULAR criteria and incorporated an extensive set of commonly available clinical and PRO risk factors in our model.

We note several limitations. We harmonized measures across cohorts to the extent possible; however, some variables were not measured equivalently. We have noted these examples in our methods section, which include extra-articular manifestations, prednisone use, MDHAQ anxiety subscale, and erosions. In addition, erythrocyte sedimentation rate (ESR) was not assessed in either cohort, and thus we used C-reactive protein in our measures of disease activity. The cohorts also had notable differences in clinical characteristics, including length of follow-up and disease activity. As we have noted above, differences in characteristics and follow-up between the derivation and validation sets did not have a substantive impact on model performance and calibration. Finally, we used RSF to build predictive models, but we could have selected alternative methods, such as gradient boosted trees or extremely randomized survival trees. Given that RSF is a robust method and our data lack high dimensionality, it is unlikely that results would be substantially improved with alternative models.

In summary, we developed and externally validated a predictive model to identify b/tsDMARD-experienced patients with RA who are at highest risk of progressing to a multi-treatment failure state of D2T-RA. Our model demonstrated moderate performance using measures of demographics, RA-related clinical factors, and PROs of pain, fatigue, and functioning. Future research should continue to better understand the role of pain, fatigue, and functioning in multi-treatment failure, and explore additional predictors, such as candidate inflammatory biomarkers, to improve model performance and assess clinical utility.

Supplementary Material

Significance and Innovations.

  • This is the first study to develop and externally validate a risk prediction model to identify which patients with rheumatoid arthritis (RA) are at highest risk of progressing to ‘difficult-to-treat’ RA (D2T-RA).

  • Participants with worse functioning, greater pain, fatigue, and disease activity have increased risk of progressing to D2T-RA.

  • These findings underscore the need for further research to improve predictive performance, through the incorporation of additional biomarkers.

Acknowledgements:

The authors would like to thank all the participating investigators, study staff, and participants in the BRASS and CorEvitas Rheumatoid Arthritis Registries who contributed data for this study.

Funding statement:

This work was supported by NIH P30-AR072577 and CorEvitas LLC. The BRASS cohort is additionally supported by Bristol Myers Squibb, Sanofi, Aqtual, and Janssen. CorEvitas has been supported through contracted subscriptions in the last two years by AbbVie, Amgen, Inc., Arena, Boehringer Ingelheim, Bristol Myers Squibb, Chugai, Eli Lilly and Company, Genentech, GlaxoSmithKline, Janssen Pharmaceuticals, Inc., LEO Pharma, Novartis, Ortho Dermatologics, Pfizer, Inc. Sun Pharmaceutical Industries Ltd., and UCB S.A.

REFERENCES

  • 1. Almutairi K, Nossent J, Preen D, Keen H, Inderjeeth C. The global prevalence of rheumatoid arthritis: a meta-analysis based on a systematic review. Rheumatol Int 2021; 41(5): 863–77.
  • 2. Hunter TM, Boytsov NN, Zhang X, Schroeder K, Michaud K, Araujo AB. Prevalence of rheumatoid arthritis in the United States adult population in healthcare claims databases, 2004-2014. Rheumatol Int 2017; 37(9): 1551–7.
  • 3. Fraenkel L, Bathon JM, England BR, et al. 2021 American College of Rheumatology Guideline for the Treatment of Rheumatoid Arthritis. Arthritis Rheumatol 2021; 73(7): 1108–23.
  • 4. Buch MH. Defining refractory rheumatoid arthritis. Ann Rheum Dis 2018; 77(7): 966–9.
  • 5. Buch MH, Eyre S, McGonagle D. Persistent inflammatory and non-inflammatory mechanisms in refractory rheumatoid arthritis. Nat Rev Rheumatol 2021; 17(1): 17–33.
  • 6. de Hair MJH, Jacobs JWG, Schoneveld JLM, van Laar JM. Difficult-to-treat rheumatoid arthritis: an area of unmet clinical need. Rheumatology (Oxford) 2018; 57(7): 1135–44.
  • 7. Matsson A, Solomon DH, Crabtree MM, Harrison RW, Litman HJ, Johansson FD. Patterns in the sequential treatment of rheumatoid arthritis patients starting a b/tsDMARD: 10-year experience from a US-based registry. Res Sq 2023.
  • 8. Nagy G, Roodenrijs NMT, Welsing PMJ, et al. EULAR points to consider for the management of difficult-to-treat rheumatoid arthritis. Ann Rheum Dis 2022; 81(1): 20–33.
  • 9. Roodenrijs NMT, de Hair MJH, van der Goes MC, et al. Characteristics of difficult-to-treat rheumatoid arthritis: results of an international survey. Ann Rheum Dis 2018; 77(12): 1705–9.
  • 10. Nagy G, Roodenrijs NMT, Welsing PM, et al. EULAR definition of difficult-to-treat rheumatoid arthritis. Ann Rheum Dis 2021; 80(1): 31–5.
  • 11. Tan Y, Buch MH. ‘Difficult to treat’ rheumatoid arthritis: current position and considerations for next steps. RMD Open 2022; 8(2).
  • 12. Watanabe R, Hashimoto M, Murata K, et al. Prevalence and predictive factors of difficult-to-treat rheumatoid arthritis: the KURAMA cohort. Immunol Med 2022; 45(1): 35–44.
  • 13. Watanabe R, Okano T, Gon T, et al. Difficult-to-treat rheumatoid arthritis: Current concept and unsolved problems. Front Med (Lausanne) 2022; 9: 1049875.
  • 14. Bykerk VP, Shadick N, Frits M, et al. Flares in rheumatoid arthritis: frequency and management. A report from the BRASS registry. J Rheumatol 2014; 41(2): 227–34.
  • 15. Pappas DA, Kremer JM, Reed G, Greenberg JD, Curtis JR. Design characteristics of the CORRONA CERTAIN study: a comparative effectiveness study of biologic agents for rheumatoid arthritis patients. BMC Musculoskelet Disord 2014; 15: 113.
  • 16. Paudel ML, Li R, Naik C, Shadick N, Weinblatt ME, Solomon DH. Prevalence and Characteristics of Adults with Difficult-to-Treat Rheumatoid Arthritis in a Large Patient Registry. Rheumatology (Oxford) 2024.
  • 17. Prevoo ML, van ’t Hof MA, Kuper HH, van Leeuwen MA, van de Putte LB, van Riel PL. Modified disease activity scores that include twenty-eight-joint counts. Development and validation in a prospective longitudinal study of patients with rheumatoid arthritis. Arthritis Rheum 1995; 38(1): 44–8.
  • 18. Felson DT, Anderson JJ, Boers M, et al. The American College of Rheumatology preliminary core set of disease activity measures for rheumatoid arthritis clinical trials. The Committee on Outcome Measures in Rheumatoid Arthritis Clinical Trials. Arthritis Rheum 1993; 36(6): 729–40.
  • 19. Pincus T, Callahan LF, Sale WG, Brooks AL, Payne LE, Vaughn WK. Severe functional declines, work disability, and increased mortality in seventy-five rheumatoid arthritis patients studied over nine years. Arthritis Rheum 1984; 27(8): 864–72.
  • 20. Wolfe F, Michaud K, Pincus T, Furst D, Keystone E. The disease activity score is not suitable as the sole criterion for initiation and evaluation of anti-tumor necrosis factor therapy in the clinic: discordance between assessment measures and limitations in questionnaire use for regulatory purposes. Arthritis Rheum 2005; 52(12): 3873–9.
  • 21. Landewe R, van der Heijde D. Radiographic progression in rheumatoid arthritis. Clin Exp Rheumatol 2005; 23(5 Suppl 39): S63–8.
  • 22. van der Heijde D. How to read radiographs according to the Sharp/van der Heijde method. J Rheumatol 2000; 27(1): 261–3.
  • 23. Boini S, Guillemin F. Radiographic scoring methods as outcome measures in rheumatoid arthritis: properties and advantages. Ann Rheum Dis 2001; 60(9): 817–27.
  • 24. Lillegraven S, Prince FH, Shadick NA, et al. Remission and radiographic outcome in rheumatoid arthritis: application of the 2011 ACR/EULAR remission criteria in an observational cohort. Ann Rheum Dis 2012; 71(5): 681–6.
  • 25. Pincus T, Yazici Y, Bergman MJ. RAPID3, an index to assess and monitor patients with rheumatoid arthritis, without formal joint counts: similar results to DAS28 and CDAI in clinical trials and clinical care. Rheum Dis Clin North Am 2009; 35(4): 773–8, viii.
  • 26. Pincus T, Summey JA, Soraci SA Jr., Wallston KA, Hummon NP. Assessment of patient satisfaction in activities of daily living using a modified Stanford Health Assessment Questionnaire. Arthritis Rheum 1983; 26(11): 1346–53.
  • 27. Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024; 385: e078378.
  • 28. Ishwaran H, Kogalur UB. Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC). 2025. https://cran.r-project.org/package=randomForestSRC.
  • 29. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random Survival Forests. The Annals of Applied Statistics 2008; 2(3): 841–60.
  • 30. Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. In: Guyon I, Luxburg UV, Bengio S, et al., editors; 2017.
  • 31. McWilliams DF, Dawson O, Young A, Kiely PDW, Ferguson E, Walsh DA. Discrete Trajectories of Resolving and Persistent Pain in People With Rheumatoid Arthritis Despite Undergoing Treatment for Inflammation: Results From Three UK Cohorts. J Pain 2019; 20(6): 716–27.
  • 32. Becede M, Alasti F, Gessl I, et al. Risk profiling for a refractory course of rheumatoid arthritis. Semin Arthritis Rheum 2019; 49(2): 211–7.
  • 33. Messelink MA, Roodenrijs NMT, van Es B, et al. Identification and prediction of difficult-to-treat rheumatoid arthritis patients in structured and unstructured routine care data: results from a hackathon. Arthritis Res Ther 2021; 23(1): 184.
