Author manuscript; available in PMC: 2020 Aug 1.
Published in final edited form as: Semin Arthritis Rheum. 2019 Jan 4;49(1):84–90. doi: 10.1016/j.semarthrit.2019.01.002

Identifying Lupus Patients in Electronic Health Records: Development and Validation of Machine Learning Algorithms and Application of Rule-Based Algorithms

April Jorge 1, Victor M Castro 2, April Barnado 3, Vivian Gainer 2, Chuan Hong 4, Tianxi Cai 2,6, Tianrun Cai 5,6, Robert Carroll 7, Joshua C Denny 7, Leslie Crofford 3, Karen H Costenbader 5, Katherine P Liao 5,6, Elizabeth W Karlson 5, Candace H Feldman 5
PMCID: PMC6609504  NIHMSID: NIHMS1519139  PMID: 30665626

Abstract

Objective:

To utilize electronic health records (EHRs) to study systemic lupus erythematosus (SLE), algorithms are needed to accurately identify these patients. We used machine learning to generate data-driven SLE EHR algorithms and assessed the performance of existing rule-based algorithms.

Methods:

We randomly selected subjects with ≥1 SLE ICD-9/10 code from our EHR and identified gold standard definite and probable SLE cases by chart review, based on the 1997 ACR or 2012 SLICC Classification Criteria. From a training set, we extracted coded and narrative concepts using natural language processing and generated algorithms using penalized logistic regression to classify definite or definite/probable SLE. We assessed predictive characteristics in internal and external cohort validations. We also tested the performance characteristics of published rule-based algorithms with pre-specified permutations of ICD-9 codes, laboratory tests, and medications in our EHR.

Results:

At a specificity of 97%, our machine learning coded algorithm for definite SLE had a 90% positive predictive value (PPV) and 64% sensitivity; for definite/probable SLE, it had a 92% PPV and 47% sensitivity. In the external validation, at 97% specificity, the definite/probable algorithm had a 94% PPV and 60% sensitivity. Adding NLP concepts did not improve performance metrics. The PPVs of published rule-based algorithms ranged from 45% to 79% in our EHR.

Conclusion:

Our machine learning SLE algorithms performed well in internal and external validation. Rule-based SLE algorithms did not transport as well to our EHR. Unique EHR characteristics, clinical practices and research goals regarding the desired sensitivity and specificity of the case definition must be considered when applying algorithms to identify SLE patients.

Introduction:

Electronic health records (EHRs) allow for the identification of large numbers of patients for clinical and translational research studies. In addition, EHRs can provide granular data regarding medication prescribing, laboratory tests, comorbidities and clinical disease activity and severity. Because a diagnosis of systemic lupus erythematosus (SLE) often requires the complex integration of a patient’s history, laboratory tests, and physical examination findings, the accurate identification of patients with SLE from the EHR is an important challenge. The use of billing codes alone to identify SLE patients is suboptimal; the positive predictive value (PPV) of one or two SLE ICD-9 codes is 50–60%.(1, 2) Previously published rule-based algorithms to identify patients with SLE from EHRs, which involve clinicians pre-specifying combinations of ICD-9 codes, medications and laboratory tests based on clinical practice and judgment, demonstrated improved accuracy over ICD codes alone.(3) However, such algorithms may have limited portability across institutions.(4)

An alternative approach to rule-based algorithms utilizes machine learning, a data-driven approach in which a training set of gold standard cases and non-cases is combined with statistical methods that account for predictive ability, collinearity, and generalization error to derive an optimal algorithm from all candidate features. Prior studies in rheumatoid arthritis (RA) found that a machine learning-based algorithm incorporating both structured (coded) data and natural language processing (NLP) of narrative data had higher PPVs than rule-based algorithms using structured data alone, and was portable across institutions.(4, 5) This machine learning approach has not been previously applied to SLE.

We aimed to develop novel SLE algorithms derived from our EHR using machine learning methods and incorporating both coded and NLP data, to assess whether this strategy could accurately identify patients with SLE from our EHR. We also aimed to assess the portability of recently published rule-based algorithms to correctly identify patients with SLE in our EHR.

Methods:

Data Source

To develop our machine learning algorithms and to test the portability of the published rule-based algorithms, we utilized a multicenter academic EHR at Partners HealthCare. Our primary source population was the Partners HealthCare Biobank, a large cohort of consented subjects with a biospecimen repository linked with a centralized longitudinal EHR database, the Research Patient Data Repository (RPDR).(6, 7) We also used the larger Partners HealthCare EHR population, which includes the Biobank population, as a validation cohort because we planned to apply the algorithm to this larger cohort. The Partners HealthCare EHR began in 1994 and includes 4.6 million patients from two large academic health centers, Massachusetts General Hospital and Brigham and Women’s Hospital, as well as seven community hospitals and over 20 affiliated community health centers and primary care practices. The Partners HealthCare Biobank includes >68,000 patients (approximately 2% of the EHR population). This study was approved by the Partners HealthCare Institutional Review Board.

SLE Case Definition and Cohort Development

Individuals in the Partners HealthCare Biobank with ≥1 International Classification of Diseases, Ninth and Tenth Revision (ICD-9/ICD-10-CM) code for SLE (ICD-9 710.0; ICD-10 M32.1*) were considered screen positive for possible SLE (N=1,322). From 400 randomly selected charts, we identified gold standard cases, classified as definite SLE, probable SLE, or non-SLE by detailed medical record review. Two rheumatologists (AJ and CF) independently reviewed the same initial 100 subjects, adjudicated discrepancies, and applied the same rules to the subsequent 300 subjects. Definite cases were defined as meeting the 1997 ACR or 2012 SLICC Classification Criteria for SLE(8, 9) and being diagnosed as SLE by their rheumatologist. Those with partial criteria, usually three ACR criteria, considered to have likely or “probable” SLE by the treating rheumatologist and the reviewer, were defined as probable SLE. Individuals not meeting criteria for either definite or probable SLE were classified as definite non-cases. We divided these charts into a training set (the first randomly selected n=200) and a validation set (the subsequent randomly selected n=200). Both sets were reviewed prior to the development of the algorithms to prevent the introduction of unintentional bias from knowledge of algorithm features at the time of chart review.

We created a secondary validation cohort using 100 subjects from the larger Partners HealthCare EHR, as we intended to apply our algorithm to this overarching population. We randomly selected among all subjects with ≥1 ICD-9/10 code for SLE and two or more visits to a primary care practice affiliated with our healthcare system, to establish a cohort of patients with sufficient medical record data in our EHR. AJ and CF reviewed these charts and identified definite SLE, probable SLE, and non-SLE cases following the aforementioned rules.

External Validation Cohort

We identified a cohort from another academic medical center to externally test our algorithms. This cohort included 173 subjects randomly selected from Vanderbilt University’s Synthetic Derivative among all subjects with ≥1 ICD-9 code for SLE.(10) These subjects’ charts were reviewed by a rheumatologist (AB) and classified as SLE or non-SLE using a broader case definition that counted all subjects with a specialist-reported SLE diagnosis (rheumatologist, dermatologist, or nephrologist) as SLE cases, as previously described.(3) Fulfillment of ACR or SLICC criteria was not required to define an SLE case because criteria data were missing in many charts, even for SLE cases followed by a rheumatologist, reflecting regional differences in SLE classification and documentation.

Candidate Feature Curation for Machine Learning Algorithms

We obtained candidate features for our algorithms using both disease-related codes and NLP features extracted from unstructured data for SLE-related concepts. Disease experts (CF, AJ, and EK) defined a list of EHR codes related to SLE diagnosis. To define a dictionary of clinical concepts for analyzing narrative text data and to expand beyond only those terms experts might think of, clinical terms related to SLE were extracted from online knowledge sources (Wikipedia, Medscape, Merck Manuals Professional Edition, Mayo Clinic Diseases and Conditions, and MedlinePlus Medical Encyclopedia) through an automated process using named entity recognition software. The clinical terms were mapped to concept unique identifiers (CUIs) in the Unified Medical Language System (UMLS). The CUIs appearing in more than half of the source articles were retained to create an NLP dictionary.(11)
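As an illustration of the dictionary-filtering step, the following Python sketch applies the majority-source rule described above; the source names mirror those listed, while the CUI values and data layout are hypothetical placeholders, not the study's actual extracts.

```python
from collections import Counter

def build_nlp_dictionary(source_cuis: dict[str, set[str]]) -> set[str]:
    """Retain only the CUIs that appear in more than half of the
    online knowledge sources, per the filtering rule described above."""
    n_sources = len(source_cuis)
    counts = Counter(cui for cuis in source_cuis.values() for cui in cuis)
    return {cui for cui, n in counts.items() if n > n_sources / 2}

# Hypothetical CUI sets extracted from the five knowledge sources
source_cuis = {
    "Wikipedia":   {"C0000001", "C0000002", "C0000003"},
    "Medscape":    {"C0000001", "C0000002"},
    "Merck":       {"C0000001"},
    "MayoClinic":  {"C0000001", "C0000002", "C0000003"},
    "MedlinePlus": {"C0000001"},
}
# Retains CUIs found in >2.5 of 5 sources: C0000001 (5x) and C0000002 (3x)
print(build_nlp_dictionary(source_cuis))
```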

Coded data, including ICD-9 and ICD-10 diagnosis codes, laboratory orders and medications, were extracted from visit encounters and medication lists using an EHR-wide query tool. All algorithm variables were included as counts (such as the number of times a dsDNA laboratory test was ordered) rather than laboratory results, except for ANA, which was also included as screen positive (≥1:40) or negative. Medications were categorized as antimalarials (including hydroxychloroquine, chloroquine, and quinacrine), oral corticosteroids, SLE-related immunosuppressives (including azathioprine, methotrexate, mycophenolate, cyclophosphamide, rituximab, belimumab, tacrolimus, cyclosporine, and leflunomide), and anti-tumor necrosis factor agents/other biologics (including etanercept, adalimumab, infliximab, certolizumab, golimumab, abatacept, tofacitinib, tocilizumab, secukinumab, and ustekinumab). We extracted narrative data for the candidate CUIs by performing NLP with the Health Information Text Extraction (HITEx) system(12) on healthcare provider notes, radiology reports, pathology reports, discharge summaries, and operative reports in typed format, as previously described.(13) Coded and narrative data were extracted from all available dates within the medical record for each patient.
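As a schematic example of the counts-based feature construction, a sketch along the following lines (with hypothetical column and concept names, not the actual RPDR schema) converts per-encounter coded events into per-patient count features:

```python
import pandas as pd

# Hypothetical long-format extract: one row per coded event
# (diagnosis code, laboratory order, or medication) per patient.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3],
    "concept": ["icd_sle", "lab_dsdna", "icd_sle",
                "icd_ra", "med_antimalarial", "icd_sle"],
})

# Each algorithm variable is a count, e.g., the number of times
# a dsDNA laboratory test was ordered for the patient.
features = (
    events.groupby(["patient_id", "concept"])
    .size()
    .unstack(fill_value=0)
)
print(features)
```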

Algorithm Development Using Machine Learning Methods

We generated four different algorithms using the first randomly selected 200 patient charts from the Partners Biobank cohort, the “training set,” based on two SLE case definitions with two sets of features: coded only or coded plus NLP. Algorithms using coded-only data would potentially be the most readily portable to other EHRs. For SLE case definition one, only definite SLE cases were classified as SLE, and we classified probable SLE cases and definite non-cases together as non-SLE. For SLE case definition two, we classified both definite SLE and probable SLE as SLE. For coded-only algorithms, all expert-curated coded features were included as covariates in the supervised machine learning algorithm training against gold standard labels. For coded plus NLP algorithms, because of the significantly larger number of candidate features, we first performed unsupervised surrogate-assisted feature extraction (SAFE) over all candidate features, and only the features selected by SAFE were included in the supervised algorithm training.(14) This method has been previously used in clinical phenotype algorithm development when there is a long list of candidate features. It has been shown to reduce over-fitting caused by including large numbers of uninformative features in an algorithm, and when compared with a method of expert-curated NLP features, the SAFE method resulted in improved AUCs in models developed for classifying patients with RA and coronary artery disease.(11)

Penalized logistic regression with an adaptive LASSO penalty was used for the supervised algorithm training.(15) This approach simultaneously identifies potentially influential variables and obtains estimates of the model parameters, while accounting for potential collinearity.(5, 13) The optimal penalty parameter was determined based on the Bayesian information criterion. All models included age, and all predictors were standardized to have unit variance. These algorithms provide equations to calculate the predicted probability of being a SLE case vs. non-case for each patient, as previously described in the development of an algorithm for RA.(13) Subjects whose predicted probability exceeded a threshold value (determined by setting a minimum specificity cutoff) were classified as having SLE. Depending on the nature of the study, higher vs. lower specificity cutoffs may be chosen, corresponding to different PPVs. We also evaluated the models’ area under the receiver operating characteristic curve (AUC), which summarizes the overall classification performance of machine learning algorithms.(16)
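The following is a minimal Python sketch of this training procedure, not the authors' implementation: it imposes the adaptive LASSO via the standard feature-rescaling trick and selects the penalty parameter by minimizing the BIC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def adaptive_lasso_logistic(X, y, Cs=np.logspace(-2, 2, 25)):
    """Adaptive-LASSO logistic regression with BIC penalty selection.
    Predictors are standardized to unit variance, as in the text."""
    y = np.asarray(y)
    Xs = StandardScaler().fit_transform(X)
    n = len(y)

    # An initial lightly-penalized fit supplies the adaptive weights.
    init = LogisticRegression(penalty="l2", C=1e3, max_iter=5000).fit(Xs, y)
    w = np.abs(init.coef_.ravel()) + 1e-8
    Xw = Xs * w  # an L1 penalty on Xw acts as an adaptive L1 penalty on Xs

    best_bic, best_model = np.inf, None
    for C in Cs:
        m = LogisticRegression(penalty="l1", solver="liblinear",
                               C=C, max_iter=5000).fit(Xw, y)
        p = np.clip(m.predict_proba(Xw)[:, 1], 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        df = np.count_nonzero(m.coef_)          # number of selected features
        bic = -2 * loglik + df * np.log(n)      # Bayesian information criterion
        if bic < best_bic:
            best_bic, best_model = bic, m
    # Coefficients on the standardized scale are best_model.coef_ * w
    return best_model, w
```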

Internal Validation of the SLE Algorithms

We applied these algorithms to the internal validation cohort (n=200) from the Partners Biobank and to the second validation cohort (n=100) from the larger Partners HealthCare EHR population. We compared overall AUCs. We chose a 97% specificity cut-off and compared PPVs of the algorithms in the training and validation sets at that same cut-off. We also assessed algorithm predictive characteristics over a range of specificity cut-offs corresponding with PPVs greater than 80%.
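As an illustrative sketch (not the study code), a threshold satisfying a minimum specificity cutoff can be located by scanning the predicted probabilities and reporting the sensitivity and PPV at that operating point:

```python
import numpy as np

def operating_point(y_true, p_hat, min_specificity=0.97):
    """Return the lowest probability threshold whose specificity meets
    the cutoff, with the corresponding sensitivity and PPV."""
    y_true, p_hat = np.asarray(y_true), np.asarray(p_hat)
    for t in np.sort(np.unique(p_hat)):  # ascending thresholds
        pred = p_hat >= t
        tn = np.sum(~pred & (y_true == 0))
        fp = np.sum(pred & (y_true == 0))
        if tn / (tn + fp) >= min_specificity:
            tp = np.sum(pred & (y_true == 1))
            fn = np.sum(~pred & (y_true == 1))
            sens = tp / (tp + fn)
            ppv = tp / (tp + fp) if (tp + fp) else float("nan")
            return t, sens, ppv
    return None  # the requested specificity is unattainable
```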

External Validation of the Machine Learning SLE Algorithm

To examine the portability of our machine learning-based SLE algorithm, we applied our coded-only, second SLE definition algorithm (i.e., definite SLE plus probable SLE) to the external validation cohort at Vanderbilt University Medical Center, since this algorithm’s SLE case definition was closer to the case definition used in determining gold standard SLE case status in the external cohort.(3) We assessed predictive characteristics over a range of predicted probability cutoffs.

Assessment of the Portability of Previously Published, Externally-Generated Rule-Based Algorithms to our EHR

We applied the highest-performing published rule-based algorithms for identifying subjects with SLE, derived from the external cohort, to 300 randomly selected screen-positive patients from the Partners Biobank cohort.(3) Algorithm variables, such as the number of ICD-9 billing codes for SLE (710.0), dermatomyositis (710.3), and systemic sclerosis (SSc) (710.1), were obtained from our EHR. We used only ICD-9 billing codes and not ICD-10, to be consistent with the original published algorithms, and counted repeat ICD codes only if they were recorded on different days. We obtained information on antimalarial use, other disease modifying anti-rheumatic drug (DMARD) use, and steroid use from coded medication lists. We defined antimalarials and DMARDs using the medication lists from the original publication.(3) However, MedEx, the system used to extract medication data in the original rule-based algorithm development,(17, 18) had not been validated in our system and was therefore not used to extract medication data from our EHR. ANA titers were obtained from coded laboratory values using our EHR search engine. However, ANA titers above 1:40 could not be reliably interpreted from the coded data due to variations in laboratory reporting, and manual review was required. We determined the predictive characteristics of these algorithms in our cohort, assessing algorithm performance for both SLE case definitions.
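For illustration, one of the published rules (“≥4 ICD-9 codes for SLE and ever antimalarial use”) might be applied to coded extracts as in the sketch below; the table schemas are hypothetical stand-ins for our EHR extracts, and repeat codes count only when recorded on different days, as described above.

```python
import pandas as pd

ANTIMALARIALS = {"hydroxychloroquine", "chloroquine", "quinacrine"}

def rule_sle_positive(dx: pd.DataFrame, meds: pd.DataFrame) -> bool:
    """Published rule: >=4 SLE ICD-9 codes (710.0) on distinct days
    plus ever antimalarial use. Hypothetical schemas: dx has columns
    (icd9, date) and meds has column (drug) for one patient."""
    sle_code_days = dx.loc[dx["icd9"] == "710.0", "date"].nunique()
    ever_antimalarial = meds["drug"].str.lower().isin(ANTIMALARIALS).any()
    return sle_code_days >= 4 and ever_antimalarial
```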

Results:

Patient characteristics

1,322 patients, 1.92% of the total Partners HealthCare Biobank population (N=68,784), had ≥1 ICD SLE code. In the training set (n=200), the prevalence of definite SLE was 33% (n=66 cases), and the prevalence of definite SLE/probable SLE was 47% (n=94 definite/probable SLE cases). In the validation set (n=200), the prevalence of definite SLE was 24% (n=48 cases), and the prevalence of definite SLE/probable SLE was 35% (n=69 definite/probable SLE cases). In the combined training and test sets, nearly all the definite SLE cases (97.4%) had ≥4 of the 1997 revised ACR classification criteria (mean number of criteria 4.7 [SD 1.1]), and the three that did not meet ACR criteria did meet SLICC criteria with biopsy-proven lupus nephritis (Table 1). While 46.9% of probable SLE cases met ACR criteria based on chart review (mean 3.4 [SD 1.3]), they were thought to have probable and not definite SLE by their treating rheumatologist. Hydroxychloroquine was used by 70.2% of definite and 72.9% of probable SLE cases, as well as 31.3% of non-SLE subjects. Other DMARDs were used by 50.9% of those with definite SLE, 37.5% with probable SLE, and 36.5% with definite non-SLE. The mean number of ICD counts for the SLE code (710.0) was 54.0 for definite SLE, 20.4 for probable SLE, and 3.8 for non-SLE.

Table 1.

Characteristics of the Partners Biobank Cohort of Possible SLE (≥1 ICD Code for SLE)

| Characteristic | Overall | Definite SLE | Probable SLE | Non-SLE |
| --- | --- | --- | --- | --- |
| Total subjects (N) | 400 | 114 | 49 | 237 |
| Age (mean years, SD) | 54.8 (15.2) | 49.0 (15.2) | 52.1 (11.3) | 58.1 (14.9) |
| Female (N, %) | 348 (87.4) | 103 (90.4) | 44 (89.8) | 201 (85.5) |
| Race (N, %) | | | | |
|   White | 294 (73.9) | 68 (59.7) | 39 (79.6) | 187 (79.6) |
|   Black | 55 (13.8) | 21 (18.4) | 6 (12.2) | 28 (11.9) |
|   Asian | 14 (3.5) | 8 (7.0) | 1 (2.0) | 5 (2.1) |
|   Hispanic | 23 (5.8) | 11 (9.7) | 3 (6.1) | 9 (3.8) |
| Ethnicity (N, %) | | | | |
|   Hispanic | 23 (5.8) | 11 (9.7) | 3 (6.1) | 9 (3.8) |
|   Non-Hispanic | 375 (93.8) | 103 (90.4) | 46 (93.9) | 226 (95.4) |
| Marital status (N, %) | | | | |
|   Single | 147 (36.9) | 51 (44.7) | 16 (32.7) | 80 (34.0) |
|   Married | 186 (46.7) | 49 (43.0) | 24 (49.0) | 113 (48.1) |
|   Divorced | 33 (8.3) | 4 (3.5) | 6 (12.2) | 23 (9.8) |
|   Widowed | 15 (3.8) | 3 (2.6) | 1 (2.0) | 11 (4.7) |
|   Other | 17 (4.3) | 7 (6.1) | 2 (4.1) | 8 (3.4) |
| Diagnosis number (mean, SD) | | | | |
|   ICD-9 | 20.1 (42.3) | 54.0 (62.2) | 20.4 (36.7) | 3.8 (8.4) |
|   ICD-9 + ICD-10 | 32.2 (66.5) | 89.0 (96.7) | 32.3 (52.1) | 4.8 (11.9) |
| Hydroxychloroquine use (N, %) | 188 (47.6) | 80 (70.2) | 35 (72.9) | 73 (31.3) |
| DMARD use (N, %) | 161 (40.7) | 58 (50.9) | 18 (37.5) | 85 (36.5) |
| Number of ACR criteria (mean, SD) | 2.5 (1.9) | 4.7 (1.1) | 3.4 (1.3) | 1.1 (1.0) |
| At least 4 ACR criteria (N, %) | 137 (34.3) | 111 (97.4) | 23 (46.9) | 4 (1.7) |
|   ANA ≥ 1:40 | 295 (73.8) | 108 (94.7) | 44 (89.8) | 143 (60.3) |
|   Malar rash | 59 (14.8) | 44 (38.6) | 10 (20.4) | 5 (2.1) |
|   Discoid rash | 36 (8.8) | 21 (18.4) | 2 (4.1) | 13 (5.5) |
|   Photosensitivity | 51 (12.3) | 31 (27.2) | 11 (22.5) | 9 (3.8) |
|   Oral ulcers | 37 (9.3) | 21 (18.4) | 9 (18.4) | 7 (3.0) |
|   Arthritis | 183 (45.8) | 86 (75.4) | 38 (77.6) | 59 (24.9) |
|   Serositis | 50 (12.5) | 36 (31.6) | 11 (22.5) | 3 (1.3) |
|   Renal disorder | 53 (13.3) | 46 (40.4) | 5 (10.2) | 2 (0.8) |
|   Seizure or psychosis | 8 (2.0) | 6 (5.3) | 1 (2.0) | 1 (0.4) |
|   Hematologic* | 69 (17.3) | 53 (46.5) | 13 (26.5) | 3 (1.3) |
|   Immunologic disorder | 133 (32.3) | 85 (74.6) | 22 (44.9) | 26 (11.0) |

ANA titers were manually reviewed for the 300 charts used for the Vanderbilt validation.

DMARD: Disease Modifying Anti-rheumatic Drug; includes azathioprine, methotrexate, mycophenolate, cyclophosphamide, rituximab, etanercept, adalimumab, infliximab, and abatacept, as originally defined by the Vanderbilt algorithm.

Renal disorder defined as persistent proteinuria >0.5 g/day or >3+, cellular casts, or lupus nephritis confirmed on renal biopsy.

* Hematologic disorder defined as hemolytic anemia, leukopenia <4,000/mm³, lymphopenia <1,500/mm³, or thrombocytopenia <100,000/mm³.

Immunologic disorder defined as positive anti-DNA antibody, anti-Smith antibody, or antiphospholipid antibody.

Machine Learning SLE Algorithms

All variables considered for algorithm inclusion(14) following the SAFE procedure are shown in Supplemental Table 1. The complete NLP dictionary of candidate features is included as Supplemental Table 2. This list includes 123 CUIs, comprising clinical concepts such as pericarditis and discoid lupus as well as narrative mention of medications such as belimumab. The variables included in the equations for our final SLE algorithms are shown in Table 2 along with their beta coefficients. For example, the logistic regression equation for Algorithm 1 is:

predicted probability of SLE = 1 / (1 + e^(−η)), where

η = −2.193 − 0.016 × (age) − 0.307 × (# facts) + 1.322 × (# SLE codes) + 0.096 × (# chronic renal failure codes) − 0.124 × (# RA codes) − 0.696 × (# sicca syndrome codes) − 0.810 × (# UCTD codes) + 0.349 × (# dsDNA lab codes) + 0.294 × (# complement lab codes) − 0.014 × (# anti-TNF/other biologics codes)
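Translated directly into code (a sketch using the published Table 2 coefficients; note that the model was fit on predictors standardized to unit variance, so raw counts would first need to be scaled by the training-set standard deviations):

```python
import math

# Algorithm 1 intercept and standardized coefficients from Table 2.
INTERCEPT = -2.193
COEFS = {
    "age": -0.016,
    "n_facts": -0.307,
    "chronic_renal_failure_codes": 0.096,
    "ra_codes": -0.124,
    "sicca_syndrome_codes": -0.696,
    "sle_codes": 1.322,
    "uctd_codes": -0.810,
    "dsdna_lab_codes": 0.349,
    "complement_lab_codes": 0.294,
    "anti_tnf_biologic_codes": -0.014,
}

def predicted_probability_sle(features: dict) -> float:
    """Logistic model: P(SLE) = 1 / (1 + exp(-(intercept + sum(beta * x))))."""
    eta = INTERCEPT + sum(b * features.get(k, 0.0) for k, b in COEFS.items())
    return 1.0 / (1.0 + math.exp(-eta))
```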

Therefore, the counts of the various positive and negative predictors of SLE under the first case definition are each weighted in the equation to determine a given patient’s probability of having SLE. Using this algorithm, all patients with a calculated predicted probability of SLE above a threshold chosen by the researcher would be classified as having SLE. This predicted probability threshold varies across a range of specificities, sensitivities, and PPVs (Table 3); whether a given patient is classified as SLE therefore depends on the chosen threshold, which sets the tradeoff among specificity, sensitivity, and PPV. In the Partners Biobank validation cohort, the coded-only algorithm for the first SLE definition (Algorithm 1) had the highest AUC of 0.947. The AUCs were 0.922 for the coded-only algorithm for the second SLE definition (Algorithm 2), 0.912 for the coded plus NLP algorithm for the first SLE definition (Algorithm 3), and 0.909 for the coded plus NLP algorithm for the second SLE definition (Algorithm 4). At 97% specificity, the PPVs and sensitivities for each algorithm were: 90% PPV and 64% sensitivity for Algorithm 1, 92% PPV and 47% sensitivity for Algorithm 2, 87% PPV and 46% sensitivity for Algorithm 3, and 90% PPV and 41% sensitivity for Algorithm 4 (Table 3). For Algorithm 2, the sensitivity could be increased to 86%, allowing a PPV of 81% and specificity of 86% (Table 3).

Table 2.

Variables Selected for the SLE Algorithms from Penalized Logistic Regression

| Variable | Standardized Regression Coefficient | Standard Error |
| --- | --- | --- |
| **Algorithm 1. Coded-Only, First Case Definition: Definite SLE** | | |
|   Age | −0.016 | 0.016 |
|   Number of facts* | −0.307 | 0.258 |
|   Chronic renal failure code | 0.096 | 0.131 |
|   Rheumatoid arthritis code | −0.124 | 0.178 |
|   Sicca syndrome code | −0.696 | 0.360 |
|   SLE code | 1.322 | 0.252 |
|   Unspecified connective tissue disease code | −0.810 | 0.360 |
|   Anti-dsDNA lab code | 0.349 | 0.280 |
|   Complement lab code | 0.294 | 0.244 |
|   Anti-TNF/other biologics code† | −0.014 | 0.170 |
| **Algorithm 2. Coded-Only, Second Case Definition: Definite and Probable SLE** | | |
|   Age | −0.010 | 0.016 |
|   Number of facts* | −0.667 | 0.312 |
|   Chronic renal failure code | 0.047 | 0.145 |
|   SLE code | 0.994 | 0.261 |
|   Anti-dsDNA lab code | 0.419 | 0.350 |
|   Complement lab code | 0.652 | 0.382 |
|   Antimalarial medication code‡ | 0.265 | 0.195 |
| **Algorithm 3. Coded and NLP, First Case Definition: Definite SLE** | | |
|   Number of facts* | −0.928 | 0.326 |
|   SLE code | 1.423 | 0.378 |
|   Anti-dsDNA lab code | 0.249 | 0.299 |
|   Complement lab code | 0.010 | 0.217 |
|   NLP CUI “Lupus Erythematosus” | 0.294 | 0.253 |
| **Algorithm 4. Coded and NLP, Second Case Definition: Definite and Probable SLE** | | |
|   Age | −0.001 | 0.016 |
|   Number of facts* | −0.967 | 0.340 |
|   SLE code | 0.966 | 0.391 |
|   Anti-dsDNA lab code | 0.463 | 0.430 |
|   Complement lab code | 0.665 | 0.462 |
|   Antimalarial medication code‡ | 0.281 | 0.220 |
|   NLP CUI “Systemic lupus erythematosus” | 0.192 | 0.238 |

Algorithm 1 intercept= −2.193; Algorithm 2 intercept= −0.096; Algorithm 3 intercept= −0.791; Algorithm 4 intercept= 0.309.

NLP, natural language processing; CUI, concept unique identifier

* Facts are related to the number of entries a subject has in the EHR, a measure of healthcare utilization.

† Includes etanercept, adalimumab, infliximab, abatacept, tofacitinib, tocilizumab, certolizumab, golimumab, secukinumab, and ustekinumab.

‡ Includes hydroxychloroquine, chloroquine, and quinacrine.

Table 3.

Machine Learning-Derived Algorithm Performance Characteristics in the Partners HealthCare Biobank Validation Cohort

Coded-term only = Algorithms 1 and 2*; coded + NLP = Algorithms 3 and 4.†

| Outcome | Spec. (%), coded-only | Sens. (%), coded-only | PPV (%), coded-only | Spec. (%), coded + NLP | Sens. (%), coded + NLP | PPV (%), coded + NLP |
| --- | --- | --- | --- | --- | --- | --- |
| Definite SLE vs. Probable SLE or Non-SLE (First Definition) | 99 | 51 | 96 | 99 | 31 | 93 |
| | 98 | 60 | 93 | 98 | 43 | 90 |
| | 97 | 64 | 90 | 97 | 46 | 87 |
| | 96 | 67 | 87 | 96 | 48 | 84 |
| | 95 | 68 | 85 | 95 | 50 | 81 |
| Definite SLE and Probable SLE vs. Non-SLE (Second Definition) | 99 | 22 | 94 | 99 | 16 | 92 |
| | 98 | 37 | 93 | 98 | 30 | 91 |
| | 97 | 47 | 92 | 97 | 41 | 90 |
| | 96 | 55 | 90 | 96 | 49 | 90 |
| | 95 | 62 | 89 | 95 | 56 | 89 |
| | 93 | 69 | 87 | 94 | 65 | 88 |
| | 91 | 77 | 85 | 92 | 70 | 86 |
| | 90 | 79 | 84 | 89 | 77 | 83 |
| | 86 | 86 | 81 | 86 | 82 | 80 |

* Algorithm 1 (coded-only, first case definition: Definite SLE); Algorithm 2 (coded-only, second case definition: Definite SLE and Probable SLE).

† Algorithm 3 (coded plus natural language processing, first case definition: Definite SLE); Algorithm 4 (coded plus natural language processing, second case definition: Definite SLE and Probable SLE).

In the entire Partners HealthCare EHR, we identified 2,241 patients with ≥1 ICD SLE code and ≥2 visits to a primary care practice within the Partners HealthCare system. In the randomly selected 100 charts, the prevalence of definite SLE was 33% and definite SLE plus probable SLE was 50% (Supplemental Table 3). The mean (SD) age of definite SLE cases was 49.6 (14.6) and non-cases 56.3 (16.2). The mean (SD) number of ICD-9 codes among definite cases was 44.4 (46.2), 28.9 (40.3) among probable cases, and 4.7 (5.5) among non-cases; 85% of definite cases had received hydroxychloroquine compared to 82% of probable cases and 34% of non-cases. When we applied the coded-only algorithms to this secondary validation cohort, the AUCs were 0.875 for Algorithm 1 and 0.926 for Algorithm 2. At 97% specificity, Algorithm 1 had 86% PPV and 38% sensitivity and Algorithm 2 had 95% PPV and 60% sensitivity.

External Validation of Machine Learning-based Algorithms

In the external validation cohort (N=173), the prevalence of SLE per their case definition was 49%. The mean (SD) age of SLE cases was 55.8 (15.3) and non-cases 62.4 (15.8) (Supplemental Table 4). The mean (SD) number of ICD-9 codes among cases was 18.3 (19.8) and 3.9 (7.6) among non-cases; 80% of cases had received hydroxychloroquine compared to 40% of non-cases. When applied to this cohort, the coded-only algorithm for the second SLE definition (Algorithm 2) had an AUC of 0.898 and, at 97% specificity, a PPV of 94% and sensitivity of 61% (Table 4).

Table 4.

Application of the Machine-Learning Coded-only, Second Case Definition SLE Algorithm (Algorithm 2) to an External Validation Cohort at Vanderbilt University Medical Center

Outcome: Definite/Probable SLE vs. Non-SLE

| Specificity (%) | Sensitivity (%) | PPV (%) |
| --- | --- | --- |
| 100 | 25 | 100 |
| 98 | 42 | 95 |
| 97 | 61 | 94 |
| 96 | 63 | 93 |

External cohort includes 173 randomly selected subjects at Vanderbilt University Medical Center with ≥ 1 ICD-9 code for SLE after the year 2000. SLE case definition= specialist-reported SLE diagnosis.

Application of Rule-based Algorithms to Partners Biobank

Of the 300 randomly selected screen-positive patients from the Partners Biobank, 277 had available ANAs in our EHR. The algorithms that did not require ANAs were applied to the full cohort (n=300) in which the prevalence of definite SLE was 28%, and definite/probable SLE was 41%. Among those with documented ANAs (N=277), the prevalence of definite SLE was 30%, and the prevalence of definite/probable SLE was 44%. We found that 12% of our 121 definite/probable SLE cases had ANA titers of 1:40 but not of 1:160, and all patients with ≥1:40 titers also had ≥4 ICD-9 codes for SLE. Among those with available ANAs classified as definitely not having SLE (N=156), 28% had ANA values <1:40, 20% were between 1:40 and <1:160, and 52% were ≥1:160. As per the original paper by Barnado et al., we also applied each algorithm with the exclusion of ICD-9 codes for dermatomyositis (ICD-9 710.3) and SSc (ICD-9 710.1) (Table 5).

Table 5.

Application of Published Rule-Based SLE Algorithms from Vanderbilt University Medical Center to an External EHR at Partners HealthCare

Outcome: Definite SLE vs. Probable or Non-SLE (First Definition), N=85

| Algorithm | Sens. (%) | Spec. (%) | PPV (%) | Sens. (%), excl. DM/SSc | Spec. (%), excl. DM/SSc | PPV (%), excl. DM/SSc |
| --- | --- | --- | --- | --- | --- | --- |
| ≥3 ICD-9 codes for SLE plus ANA ≥1:40 and ever DMARD* and ever steroid use | 58 | 81 | 57 | 43 | 83 | 53 |
| ≥4 ICD-9 codes for SLE plus ANA ≥1:40 and ever DMARD and ever steroid use | 57 | 84 | 61 | 42 | 85 | 56 |
| ≥4 ICD-9 codes for SLE plus ANA ≥1:40 and ever antimalarial use | 83 | 73 | 60 | 58 | 76 | 51 |
| ≥4 ICD-9 codes for SLE and ever antimalarial use | 86 | 66 | 50 | 60 | 73 | 47 |
| ≥3 ICD-9 codes for SLE and ever antimalarial use | 86 | 60 | 46 | 60 | 71 | 45 |
| ≥3 ICD-9 codes for SLE and ever DMARD use and ever steroid use | 61 | 70 | 45 | 45 | 80 | 47 |
| ≥4 ICD-9 codes for SLE and ever DMARD use and ever steroid use | 60 | 76 | 50 | 44 | 82 | 49 |
| ≥4 ICD-9 codes for SLE plus ANA ≥1:160 | 80 | 75 | 58 | 60 | 75 | 51 |

Outcome: Definite/Probable SLE vs. Non-SLE (Second Definition), N=124

| Algorithm | Sens. (%) | Spec. (%) | PPV (%) | Sens. (%), excl. DM/SSc | Spec. (%), excl. DM/SSc | PPV (%), excl. DM/SSc |
| --- | --- | --- | --- | --- | --- | --- |
| ≥3 ICD-9 codes for SLE plus ANA ≥1:40 and ever DMARD* and ever steroid use | 53 | 86 | 74 | 39 | 87 | 69 |
| ≥4 ICD-9 codes for SLE plus ANA ≥1:40 and ever DMARD and ever steroid use | 51 | 89 | 78 | 38 | 89 | 73 |
| ≥4 ICD-9 codes for SLE plus ANA ≥1:40 and ever antimalarial use | 77 | 85 | 79 | 56 | 82 | 71 |
| ≥4 ICD-9 codes for SLE and ever antimalarial use | 81 | 73 | 68 | 56 | 78 | 65 |
| ≥3 ICD-9 codes for SLE and ever antimalarial use | 84 | 69 | 65 | 58 | 77 | 64 |
| ≥3 ICD-9 codes for SLE and ever DMARD use and ever steroid use | 56 | 73 | 59 | 40 | 82 | 60 |
| ≥4 ICD-9 codes for SLE and ever DMARD use and ever steroid use | 54 | 80 | 65 | 39 | 85 | 64 |
| ≥4 ICD-9 codes for SLE plus ANA ≥1:160 | 73 | 82 | 76 | 55 | 79 | 68 |

Sens., sensitivity; Spec., specificity; Excl., excluding; DM, dermatomyositis; SSc, systemic sclerosis; DMARD, disease modifying anti-rheumatic drug.

* DMARD category includes azathioprine, methotrexate, mycophenolate, cyclophosphamide, rituximab, etanercept, adalimumab, infliximab, and abatacept, including generic and brand names, as per the original paper.(3)

For algorithms including ANA titers, n=277 with 84 definite cases and 121 definite plus probable cases; charts without ANAs available within our EHR system were excluded. ANAs were obtained from structured data, and the highest ANA titer was counted.

The published rule-based SLE algorithms had PPVs ranging from 45% to 79% (Table 5). Excluding dermatomyositis and SSc codes reduced sensitivity, increased specificity slightly, and resulted in modestly lower PPVs. All algorithms performed better when applied to the definite/probable SLE case definition. Upon re-review of the charts for subjects misclassified as SLE by these algorithms, none had diagnoses of definite or probable SLE by a rheumatologist, dermatologist or nephrologist. The diagnoses of misclassified cases included cutaneous/discoid lupus, drug-induced lupus, undifferentiated connective tissue disease, inflammatory myopathy with interstitial lung disease, polymyositis, RA, seronegative inflammatory arthritis, psoriatic arthritis, chronic inflammatory demyelinating polyneuropathy, fibromyalgia, and antiphospholipid antibody syndrome.

Discussion:

We have demonstrated that EHR algorithms can be developed using machine learning methods to identify SLE cases with high PPV using readily available, coded data. We generated different algorithms to identify subjects meeting a strict definition of SLE and for a broader definition of both probable and definite SLE.

The novel algorithms we generated using machine learning methods had very good performance characteristics in our Partners Biobank training set and in the Partners Biobank validation set, as well as in the larger Partners HealthCare secondary validation set and in the external Vanderbilt validation set. Comparing our Biobank cohort to the Partners validation cohort, the Partners cohort had a similar mean age and gender distribution but a greater percentage of African American individuals. In addition, the Partners cohort had, on average, fewer ICD-9 and ICD-10 codes for SLE. Comparing our Biobank cohort to the Vanderbilt cohort, the Vanderbilt cohort was slightly older on average, a higher percentage received hydroxychloroquine overall, and there were significantly fewer SLE ICD-9 and ICD-10 codes overall and among cases. These differences between the Biobank cohort and the overall Partners and Vanderbilt cohorts highlight the portability of this algorithm to cohorts with different demographic characteristics and medication use patterns. In addition, the successful portability to cohorts with fewer SLE-related ICD-9/10 codes suggests that the coded algorithm can identify SLE patients who may have fewer subspecialty visits and shorter follow-up, and those receiving care at institutions with different coding practices.

While a prior study demonstrated greater accuracy for predicting RA diagnosis with the addition of narrative data analyzed by NLP to algorithms also including coded data,(5) we found that our algorithm relying on coded data only yielded the best performance statistics in our EHR, was easily portable, and performed very well when externally validated. Our approach to incorporating NLP terms relied on the data to select the potentially important variables for the algorithm.(14) Based on the data, the NLP concepts were less informative than the coded data for classifying SLE. We believe this occurred because there were many potential narrative concepts pertaining to a diagnosis of SLE but a low prevalence of these individual features within a population of SLE patients, reflecting the heterogeneity of the disease. The more inclusive coded concepts may have already captured the same information as the most prevalent NLP terms. Regardless, this is a useful finding, as coded data are most readily available, and applying NLP-containing algorithms may not be feasible in all EHRs or in administrative databases. In addition, the one feature that required manual review (ANA positivity) did not prove to be an important component of any of our algorithms. We allowed for two different definitions of SLE (i.e., a stricter definition of definite SLE and a more inclusive definition of definite SLE and probable SLE) to improve the applicability of our algorithms for various potential uses. In using penalized logistic regression to develop these algorithms, we also allowed for added flexibility, as different predicted probability cutoffs can be selected, permitting tradeoffs between PPV and sensitivity in detecting subjects with SLE depending on the intended use.
For instance, to identify patients with SLE for genetic and other translational studies, an algorithm with a high PPV and a strict definition of SLE, such as Algorithm 1 with a chosen specificity of 97% or higher, would be needed to identify subjects with a clearly defined phenotype. For population health services research, a more sensitive algorithm that identifies large numbers of patients across a broader range of phenotypes, such as Algorithm 2 with a chosen specificity of 86% allowing 86% sensitivity, may be preferred.

We also demonstrated a challenge when applying high performing rule-based algorithms, which were developed and validated in an external EHR, to our EHR. We found lower performance characteristics for all published rule-based algorithms in our EHR cohort despite excellent performance in the original external EHR. It is important to note process differences and challenges, including differences in case definitions, SLE prevalence rates, medical billing practices, reporting of ANA titers, and medication data extraction methods, which may have contributed to this discrepancy. SLE is a heterogeneous disease with many potential mimics and there is no “gold standard” test to allow for uniform diagnosis and classification of SLE in clinical practice. Therefore, practice patterns in the classification of patients as having SLE may vary between different institutions. We chose a case definition using the 1997 revised ACR classification criteria or 2012 SLICC criteria to allow for a standard and reproducible classification scheme and a high degree of diagnostic certainty. However, the Vanderbilt algorithms were derived using a broader case definition of specialist-determined SLE both with the goal of developing a more inclusive cohort to identify all potential SLE cases in the medical system, and because ACR classification criteria were not described in a significant proportion of the charts that were reviewed, even in rheumatologists’ notes. This provided a “real world” application of algorithms across institutions. Differences in prevalence rates of SLE, which will be determined by the particular SLE case definition used as well as variability in different patient populations between institutions, could also affect an algorithm’s performance. For example, the prevalence of SLE in the Vanderbilt training set was about 10 percentage points higher than the prevalence of definite and probable SLE in our cohorts. Therefore, the PPVs of algorithms would similarly be expected to be higher. The disease severity of the patient population may also impact algorithm performance.

Furthermore, the use of medical billing codes for SLE and alternative diagnoses may vary between institutions. For instance, in the Biobank cohort, SLE cases had a mean ICD-9 code count of 54 vs. 3.8 for non-SLE cases. In contrast, SLE cases from the Vanderbilt cohort had a mean code count of 18 vs. 4 codes for the non-SLE cases. This may reflect EHR structural differences as well as differences in duration of follow-up, visit frequency, or coding and billing practices at the respective institutions. There may also be differences in the use of medical billing codes for SLE between community hospitals and academic medical centers and across different EHR platforms, which could further impact the performance of algorithms that incorporate billing codes. The reporting of related laboratory values such as ANA titers also varies among institutions, as does the sensitivity of the assays, potentially limiting the utility of ANA titers and other laboratory results within a SLE algorithm across institutions. For example, in the Vanderbilt cohort, using an ANA cutoff of 1:40 versus 1:160 affected algorithm performance.(3) In our EHR, because of differences in laboratory reporting practices, and changes to these practices over time, manual review of ANA titers was required to apply these algorithms. We found that the majority of patients with ANA titers of ≥1:40 also had titers ≥1:160, and all patients with ANA titers ≥1:40 also had ≥4 ICD-9 codes for SLE, which likely contributed to our observation of similar PPVs across these different algorithms. In our machine learning algorithms, ANA positivity did not remain in the model as a significant variable.

Both medication prescribing and the reporting of medications in the EHR also vary between institutions, which can impact the performance of algorithms that include medication use. In our EHR cohort, subjects with probable SLE had similar antimalarial usage as subjects with definite SLE. As a result, the rule-based algorithm relying on antimalarial use had a poor ability to discriminate definite from probable SLE cases. Correspondingly, this medication was not included in our machine learning-based coded-only algorithm to detect definite SLE but was included in the coded-only algorithm to detect definite/probable SLE. Furthermore, the Synthetic Derivative, Vanderbilt’s EHR, utilizes MedEx to process narrative data and extract medication information.(17–19) However, at the time of this study, the search engine for our EHR relied on coded medication lists, which could have influenced the algorithms’ performance. In addition, differences in medical charting between institutions may impact the utility of narrative data extraction. Overall, these challenges highlight the potentially limited portability of such algorithms based on practice, cohort and EHR-based differences, and the need to consider the purpose of an algorithm as well as the setting in which it will be applied prior to use.

Another set of SLE phenotype algorithms was recently developed using a different approach, incorporating initial expert physician-determined rules for ICD code counts with subsequent machine-learning, using a variation of the EasyEnsemble method with the Labeling with Noisy Labels technique, to train predictive algorithms.(20) The authors also incorporated medication orders and laboratory test results, as well as narrative text from all clinical notes in each patient’s electronic medical record. Their algorithms performed very well with AUCs of 0.94 (for a more inclusive SLE definition) and 0.97 (for a strict definition) in an internal test set which was enriched for a higher prevalence of SLE, by requiring ≥1 ICD-9 code associated with SLE and antibody positivity highly associated with SLE in at least half of the subjects. Like our algorithms, their machine-learning approach allows for different predictive probability thresholds to be chosen depending on the sensitivity/specificity tradeoff needed for the study question of interest. This work complements our findings by demonstrating the strengths of a data-driven approach to SLE phenotype algorithm development. However, as their algorithms have not been externally validated, and narrative text processing and laboratory result curation require significant infrastructure, further work is needed to understand the portability of this method to different EHRs.

While we provide a set of alternative algorithms that had high predictive characteristics in our EHR and when externally validated, we encourage critical data exploration in other EHRs before applying any published algorithm to discriminate SLE cases from non-cases. In selecting an algorithm to identify patients with SLE from a separate EHR where it has not been validated, investigators must first consider their individual data source and research questions. For example, the most specific algorithm may be preferred for genetic studies, but in other instances a more sensitive SLE algorithm would be preferable to ensure that sufficient resources are allocated to care for all individuals with potential SLE. Using a machine learning-derived algorithm like the one we propose provides more flexibility than a rule-based algorithm, allowing for different cutoffs to be chosen based on the question of interest. The most appropriate case definition of SLE must then be identified to address such questions, considering both the data available in the EHR (e.g., whether it is common practice to include ACR criteria in clinic notes) and how strict vs. inclusive one wishes to be. We recommend then considering the distribution of relevant data variables such as billing codes, laboratory values, and medications among the population of interest. These data variables must be similar to those used to generate an existing algorithm, and the case definition must be appropriate, for such an algorithm to perform well.

In the era of personalized medicine and “big data,” the use of algorithms to identify subjects with SLE and other diagnoses from the EHR will likely increase. While we have shown that the development of such algorithms with high internal accuracy and external portability is feasible, investigators must take caution when applying any externally generated algorithm to their own data. When used appropriately, such algorithms can serve as a powerful tool to enable clinical and translational SLE research.

Supplementary Material


Acknowledgments

Funding:

Dr. Jorge is supported in part by NIAMS/NIH T32-AR-007258.

Dr. Barnado is supported in part by NIAMS/NIH K08-AR-07275701 and NICHD/NIH 5K12-HD-043483012.

Dr. Denny is supported in part by NLM R01-LM-010685.

Drs. Costenbader, Liao, and Karlson are supported in part by NIAMS/NIH P30-AR-072577.

Dr. Karlson is also supported in part by NHGRI/NIH U01-HG-008685.

Dr. Feldman is supported by NIAMS/NIH K23-AR-071500.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Commercial Interests:

The authors have no relevant commercial interests or other conflicts of interest to disclose.

References:

1. Moores KG, Sathe NA. A systematic review of validated methods for identifying systemic lupus erythematosus (SLE) using administrative or claims data. Vaccine 2013;31 Suppl 10:K62–73.
2. Bernatsky S, Linehan T, Hanly JG. The accuracy of administrative data diagnoses of systemic autoimmune rheumatic diseases. J Rheumatol 2011;38(8):1612–6.
3. Barnado A, Casey C, Carroll RJ, Wheless L, Denny JC, Crofford LJ. Developing electronic health record algorithms that accurately identify patients with systemic lupus erythematosus. Arthritis Care & Research 2017;69(5):687–93.
4. Carroll RJ, Thompson WK, Eyler AE, Mandelin AM, Cai T, Zink RM, et al. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. J Am Med Inform Assoc 2012;19(e1):e162–9.
5. Liao KP, Cai T, Savova GK, Murphy SN, Karlson EW, Ananthakrishnan AN, et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 2015;350:h1885.
6. Wright A, Bates DW. Chapter 6: Patients, Doctors, and Information Technology at Brigham and Women’s Hospital and Partners Healthcare. In: Greenes RA, editor. Clinical Decision Support: The Road to Broad Adoption. Academic Press; 2014.
7. Gainer VS, Cagan A, Castro VM, Duey S, Ghosh B, Goodson AP, et al. The Biobank Portal for Partners Personalized Medicine: a query tool for working with consented biobank samples, genotypes, and phenotypes using i2b2. Journal of Personalized Medicine 2016;6(1).
8. Hochberg MC. Updating the American College of Rheumatology revised criteria for the classification of systemic lupus erythematosus. Arthritis Rheum 1997;40(9):1725.
9. Petri M, Orbai AM, Alarcon GS, Gordon C, Merrill JT, Fortin PR, et al. Derivation and validation of the Systemic Lupus International Collaborating Clinics classification criteria for systemic lupus erythematosus. Arthritis Rheum 2012;64(8):2677–86.
10. Roden DM, Pulley JM, Basford MA, Bernard GR, Clayton EW, Balser JR, et al. Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin Pharmacol Ther 2008;84(3):362–9.
11. Yu S, Liao KP, Shaw SY, Gainer VS, Churchill SE, Szolovits P, et al. Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. J Am Med Inform Assoc 2015;22(5):993–1000.
12. Goryachev S, Sordo M, Zeng QT. A suite of natural language processing tools developed for the I2B2 project. AMIA Annu Symp Proc 2006:931.
13. Liao KP, Cai T, Gainer V, Goryachev S, Zeng-Treitler Q, Raychaudhuri S, et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care & Research 2010;62(8):1120–7.
14. Yu S, Chakrabortty A, Liao KP, Cai T, Ananthakrishnan AN, Gainer VS, et al. Surrogate-assisted feature extraction for high-throughput phenotyping. J Am Med Inform Assoc 2017;24(e1):e143–e149.
15. Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc 2006;101:1418–29.
16. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 1997;30(7):1145–59.
17. Jiang M, Wu Y, Shah A, Priyanka P, Denny JC, Xu H. Extracting and standardizing medication information in clinical text – the MedEx-UIMA system. AMIA Summits on Translational Science Proceedings 2014;2014:37–42.
18. Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC. MedEx: a medication information extraction system for clinical narratives. J Am Med Inform Assoc 2010;17(1):19–24.
19. Marmor MF. Comparison of screening procedures in hydroxychloroquine toxicity. Arch Ophthalmol 2012;130(4):461–9.
20. Murray SG, Avati A, Schmajuk G, Yazdany J. Automated and flexible identification of complex disease: building a model for systemic lupus erythematosus using noisy labeling. J Am Med Inform Assoc 2018.
