AMIA Annual Symposium Proceedings. 2025 May 22; 2024:172–181.

A data-driven approach to discover and quantify systemic lupus erythematosus etiological heterogeneity from electronic health records

Marco Barbero Mota 1, John M Still 1, Jorge L Gamboa 2, Eric V Strobl 3, Charles M Stein 2, Vivian K Kawai 2, Thomas A Lasko 1,4,*
PMCID: PMC12099369  PMID: 40417520

Abstract.

Systemic lupus erythematosus (SLE) is a complex heterogeneous disease with many manifestational facets. We propose a data-driven approach to discover probabilistically independent sources from multimodal, imperfect EHR data. These sources represent exogenous variables in the causal graph of the data generating process that estimate latent root causes of the presence of SLE in the health record. We objectively evaluated the sources against the original variables from which they were discovered by training supervised models to discriminate SLE from negative health records using a reduced set of labelled instances. We found 19 predictive sources with high clinical validity whose EHR signatures define independent factors of SLE heterogeneity. Using the sources as the input patient data representation enables models to provide rich explanations that better capture the clinical reasons why a particular record is (or is not) an SLE case. Providers may be willing to trade some discrimination for this patient-level interpretability, especially in challenging cases.

Introduction

Systemic lupus erythematosus (SLE) is a complex relapsing disease that manifests through various combinations of symptoms and clinical signs. SLE’s heterogeneity makes its recognition in the health record slow and subjective. An SLE diagnosis is usually reached only after excluding other likely explanations for a patient’s symptoms1.

Current practice relies on the application of imperfect classification criteria by non-specialists to trigger a subsequent referral to a rheumatologist. These tools have traditionally focused on either sensitivity or specificity at the expense of the other2, with the most recent revision balancing both metrics3. These criteria were designed to form uniform phenotypic cohorts for research, but they disregard the high variability of clinical presentations among SLE patients4. Some ineligible patients are at high risk of being missed or misdiagnosed, a phenomenon referred to as the spectrum effect5, which leads to high false negative rates1,4,6. These include patients at early stages of the condition, where some clinical signs may not yet have developed; single organ/system-dominant forms; negative antinuclear antibody (ANA) test cases; and rare but severe presentations7.

SLE’s heterogeneity is also a reason why few new drugs have been successfully developed. The one-size-fits-all approach applied in other autoimmune diseases such as rheumatoid arthritis (RA) has not been efficacious for SLE8,9. Even in positive trials, only marginal treatment effects have been found, which highlights the need to dissect patient populations into clinically homogeneous subgroups, a step that would increase clinical trial efficiency9.

Delimiting patient subgroups at scale requires pulling apart the components that make up SLE’s heterogeneity. Clinically meaningful sources of variation must link manifestational criteria and the underlying causal mechanisms of disease (i.e. etiology)10. Identifying the independent causes of SLE diagnosis may allow targets for novel therapies to be defined accurately in more efficient trials11,12.

Some SLE subtypes have been recognized in the literature as constellations of non-specific clinical manifestations and laboratory values1,6. A recent study proposed a model that divided SLE into two types (type 1 and type 2)13,14. There are also established SLE subtypes, defined by their significant differences in prognosis and treatment, such as lupus nephritis15 or lupus with antiphospholipid syndrome16. However, these have been defined based on clinical experience and have neither been quantitatively characterized nor causally validated.

Electronic health records (EHR) are a rich source of medical information routinely collected in clinical practice. Prior research has leveraged EHR data to develop phenotyping algorithms that attempt to recognize SLE records at scale: expert-guided decision trees17 and supervised machine learning (ML) models trained on chart review18,19 or noisy20 labels. A recent study also incorporated genetic data and found no gain over EHR data alone21. All these systems were optimized for accuracy, which places greater emphasis on unsurprising, easy cases at the expense of the long tail of less common SLE facets, a phenomenon known as hidden stratification22. The patterns these algorithms uncover are likely not causal, which prevents the models from fully capturing the data generating process. Using causal predictors enhances model interpretability, which may be preferred over better accuracy when recognizing heterogeneous diseases23,24.

In this study, we demonstrate that large amounts of health record data and independence-based pattern discovery25 are sufficient to identify the imprints left in the EHR by the independent latent sources of the SLE label, which describe the disease’s heterogeneity. The disentanglement of these patterns, or signatures, is done in a completely unsupervised manner without any prior knowledge: we let the data tell us what the signatures are. Under a set of causal assumptions, the corresponding independent sources have been shown to represent the root causes of the data generating process26. Lasko et al. showed that using this causal representation of patient data as input to ML models was more accurate than the original clinical variables for predicting the malignancy of solitary lung nodules. Notably, the most predictive latent sources appeared to represent undiagnosed cancer in some patients25. In this work we do not try to forecast future disease but rather to recognize SLE in the EHR across its clinical facets. We also probe whether the signatures of predictive sources can provide insight into SLE etiological heterogeneity.

In this work we make the following contributions:

  1. We infer 2000 probabilistically independent latent sources and their EHR signatures from noisy, sparse, and irregular data from a broad sample of rheumatology patients.

  2. We evaluate the benefits of using the sources to recognize SLE in the health record, when compared to using estimates of the observed clinical variables at the same points in time.

  3. We identify 19 probabilistically independent sources predictive of SLE whose EHR signatures describe clinically recognizable pictures of SLE heterogeneity.

  4. We demonstrate the higher expressivity and interpretability that the sources bring to supervised models, as they allow quantification of the patient-level root causes of SLE being present in the health record.

Methods

This study was conducted at Vanderbilt University Medical Center (VUMC) and was determined by the VUMC’s IRB to be non-human-subjects research.

Data.

We extracted all data from VUMC’s de-identified EHR mirror, the Synthetic Derivative27 (SD), which hosts longitudinal information for over 3 million patients. With a cutoff date of 02/16/2018, we collected historical data for patients who had had an ANA test done, with or without a titer result. This cohort constitutes our discovery set and spans a wide spectrum of autoimmune and non-autoimmune conditions.

We collected data from diagnosis codes, medications, clinical measurements, and demographic (race, age, and biological sex) variables, which we refer to as channels. We discarded channels with fewer than 1000 total events or appearing in fewer than 10 records in the discovery set. We mapped ICD-9/10 codes to SNOMED concepts.

Data denoising, synchronization and fusion.

We followed Lasko et al.25 for EHR data preprocessing and clinical signature discovery (figure 1A) and refer the reader to that paper for full details. In summary, we computed longitudinal curves for all considered channels at daily resolution from the discrete, asynchronous, noisy observations. Each data modality follows a specific curve generation process:

  1. Conditions (SNOMED codes): We inferred smooth code-intensity curves using a variation of Random Average Shifted Histograms28 that accounts for the non-stationarity of the event-arrival distribution. In the event of missing data, we imputed a constant curve with a baseline annual event-arrival probability of 1/20.

  2. Clinical measurements (laboratory values and BMI): We used the univariate monotonic cubic interpolator PCHIP29 to generate smooth curves without overshooting. Records without observations for a measurement were imputed a constant curve with the population median.

  3. Medications (RxNorm): Medication data only exist when noted in a visit’s medication list and are missing otherwise. We built binary curves through a best-guess approximation of drug-taking between visits. We considered a regimen continued between contiguous visits if it was noted at both. Conversely, if a particular medication was missing at the later visit, the midpoint between the two dates was taken as the stopping time. We imputed a constant zero curve for records without any mention of a particular drug.

  4. Demographics: Race and sex were constant binary variables. Age curves were linearly increasing integers.
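As a concrete illustration of step 2 above, the measurement-curve generation can be sketched as follows (a minimal sketch with hypothetical function and variable names; the hold-last-value behavior outside the observation window is our assumption, since the paper defers curve-generation details to Lasko et al.25):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def lab_curve(obs_days, obs_values, record_days, population_median):
    """Daily-resolution curve for one lab channel in one record.

    Monotonic cubic (PCHIP) interpolation generates a smooth curve
    without overshooting between sparse observations; a record with no
    observations gets a constant curve at the population median.
    """
    if len(obs_days) == 0:  # missing channel: impute population median
        return np.full(len(record_days), float(population_median))
    interp = PchipInterpolator(obs_days, obs_values, extrapolate=False)
    curve = interp(record_days)
    # assumption: hold the first/last observed value outside the window
    curve = np.where(record_days < obs_days[0], obs_values[0], curve)
    curve = np.where(record_days > obs_days[-1], obs_values[-1], curve)
    return curve
```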

Figure 1.


Data flow summary. A) Noisy, sparse observations are used to generate longitudinal curves, one per clinical variable (channel) and per patient. For each individual, all channel curves are sampled at random times. Cross-sections for the discovery set are aggregated into a dense data matrix (X) that is decomposed into 2000 approximately independent source expressions at the input timepoints and their EHR signatures. B) Supervised ML models are trained to discriminate SLE cases from ‘near miss’ (negative) records. Learning happens from either the last value of the channel curves or their projections onto the signature space (i.e. the source expression levels).

Unlike Lasko et al.25, we did not perform post-computation curve smoothing. Instead, we directly stacked each patient’s curves longitudinally to form a p × τ curveset, where p is the number of clinical variables and τ is the number of days spanned by the record. Next, we serially concatenated all curvesets and performed n cross-sections at uniformly random times with an average density of one sample per record-year. Merging all data produced a dense data matrix X ∈ ℝp×n whose columns represent synchronized, complete instances of patient records. We standardized X variable-wise to bring all variables onto roughly the same scale. Measurements, log-transformed condition code intensities, and age were mean centered and scaled by two standard deviations30. Binary variables (medications, race, and biological sex) were left untouched30.
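The variable-wise standardization can be sketched as follows (a minimal illustration of the two-standard-deviation scaling30; the function and variable names are ours):

```python
import numpy as np

def standardize(X, is_binary):
    """Variable-wise standardization of a p x n data matrix.

    Continuous rows (measurements, log code intensities, age) are mean
    centered and divided by two standard deviations, putting them on
    roughly the same scale as the untouched binary rows.
    """
    Xs = np.asarray(X, dtype=float).copy()
    cont = ~np.asarray(is_binary)          # mask of continuous variables
    mu = Xs[cont].mean(axis=1, keepdims=True)
    sd = Xs[cont].std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0                      # guard against constant rows
    Xs[cont] = (Xs[cont] - mu) / (2 * sd)
    return Xs
```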

Signature discovery.

We use probabilistic independence as our guiding principle to discover clinical signatures from X (figure 1A). This choice comes naturally from the assumption that latent mechanisms of disease (or sources) operate independently and leave a pattern imprint (or signature) in the clinical history of a patient25,26,31. We formalized our unsupervised decomposition through an Independent Component Analysis (ICA) model32:

X=AS. (1)

For implementation we used the fastICA algorithm33 (Python 3.10, scikit-learn 1.1.3) to obtain A ∈ ℝp×k and S ∈ ℝk×n, the mixing and source matrices, respectively (figure 1). Each column of A (A•,i) depicts a signature: the p linear mixing weights that describe the trace left in the record by the corresponding latent source (Si,•). The j-th column of S (S•,j) quantifies the source expressions for the j-th data instance (X•,j), or how active each source was in the j-th cross-section. We set the number of latent sources to k = 2000 due to memory constraints.
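A toy sketch of the decomposition with scikit-learn’s FastICA, with dimensions shrunk for illustration (the study uses p = 7947, n = 646,775, k = 2000):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
p, n, k = 20, 5000, 5                       # toy stand-ins for 7947, 646775, 2000
S_true = rng.laplace(size=(k, n))           # independent non-Gaussian sources
A_true = rng.normal(size=(p, k))            # mixing matrix (signatures)
X = A_true @ S_true                         # observed channels, X = AS

# scikit-learn expects samples in rows, so the p x n matrix is transposed
ica = FastICA(n_components=k, whiten="unit-variance", random_state=0,
              max_iter=1000)
S = ica.fit_transform(X.T).T                # k x n source expressions
A = ica.mixing_                             # p x k signatures
X_hat = A @ S + ica.mean_[:, None]          # reconstruction X ≈ AS (+ mean)
```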

We made a first approach to evaluating the signatures’ clinical face validity by subjectively assessing their resemblance to common clinical pictures. For this task we used the signature description diagrams (see Results, figure 5) that quantitatively display the signature (A•,i) and the histogram of its expressions (Si,•) across the discovery set. All signatures are given a random sequential identifier by the algorithm (e.g. S-650). We provide descriptive names for those that are relevant to the discussion of recognizing SLE in the health record.

Figure 5.


Signature description diagrams for the top 5 most predictive sources. Bar length indicates the normalized change of a channel per unit of expression. This change is shown inside parentheses in the original data space. For example, if a patient expresses 10 units of S-650, the record would experience a 75.4% change per unit in SLE code intensity, corresponding to a factor of 1.754^10 = 275.6, an increase in ANA titer of 10 × 31.64 = 316.4, and a 10 × 0.023 = 0.23 higher probability of female sex. Diagrams show the top 10 contributing channels ordered by normalized change. Insets are log-scaled histograms of expression levels for all cross-sections in the discovery set. Expression units are individually scaled for each signature such that the standard deviation is 0.5, placing 95% of all expressions within the interval [-1, 1].

SLE w/ org./syst. involv.: SLE with organ/system involvement; GN: Glomerulonephritis; DNA-ds Ab: DNA double stranded antibody test; TM: Toxic maculopathy.

Objective signatures evaluation.

We evaluated the disentangled sources by quantifying their predictive power in recognizing SLE health records within a labelled set of patients, or learning set (figure 1B). The learning set included the records in our discovery set with at least one SLE ICD-9/10 code, no codes for systemic sclerosis or dermatomyositis, and existing EHR-linked genomic data, restricted to patients exclusively of European ancestry (i.e. white race). Patients in the learning set were assigned a binary label through chart review by three SLE experts (JG, VK, and CMS) who followed the definition of an SLE case used by Barnado et al.17: an SLE case must contain an explicit diagnosis by a VUMC or external rheumatologist, dermatologist, or nephrologist in the clinical notes. Patients with only cutaneous lupus or drug-induced lupus were not considered SLE cases. This definition included cases both easy and hard to diagnose. Negative records were labelled “near misses” due to their historical suspicion of SLE.

We extracted the last available cross-section of the channel curves (xt) for all learning set patients and projected them onto the signature space following the learned ICA model (equation 1):

st = A−1xt. (2)
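Since A is non-square (p × k), the inverse in equation (2) is in practice the learned unmixing transform, equivalently the Moore–Penrose pseudo-inverse of A. A sketch under that assumption, on toy data:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
p, n, k = 12, 4000, 4
A_true = rng.normal(size=(p, k))
X = A_true @ rng.laplace(size=(k, n))       # discovery-set stand-in

ica = FastICA(n_components=k, whiten="unit-variance", random_state=0,
              max_iter=1000)
ica.fit(X.T)
A = ica.mixing_                             # p x k, non-square

# project a held-out cross-section x_t onto the signature space
x_t = A_true @ rng.laplace(size=k)
s_t = ica.components_ @ (x_t - ica.mean_)   # learned unmixing transform
s_pinv = np.linalg.pinv(A) @ (x_t - ica.mean_)  # pseudo-inverse view
```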

xt and st are therefore two representations of the same information in the channel and signature spaces, respectively. We used these datasets as input to train four ML architectures to discriminate SLE cases from “near misses”. We trained Elastic Net (ENet)34 (scikit-learn 0.24.2, python 3.9.12) and Adaptive Elastic Net (AdaNet)35 equipped with sure independence screening36 (in-house implementation in python 3.9.12) penalized logistic regression. These architectures introduce L2 and L1 penalties in their loss functions, which provide stability against collinearity and automatic feature selection, respectively. The ENet penalties exhibit the grouping effect, which leaves sets of strongly correlated predictors either in or out of the model together. This behavior may zero out important features and include noise variables in the model. Under a set of weak regularity conditions, the AdaNet overcomes this problem, as it achieves the oracle property, guaranteeing variable selection consistency35.
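scikit-learn has no built-in adaptive elastic net, but its behavior can be emulated by fitting an initial ENet and refitting with per-feature penalty weights |β̂|^γ applied through column rescaling (a sketch on synthetic data; γ, C, and the rescaling trick are illustrative assumptions, not the paper’s in-house implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=0)

# initial elastic-net logistic fit supplies the adaptive weights
init = LogisticRegression(penalty="elasticnet", solver="saga", C=1.0,
                          l1_ratio=0.5, max_iter=5000).fit(X, y)
gamma = 1.0
w = np.abs(init.coef_.ravel()) ** gamma
w = np.maximum(w, 1e-8)                    # keep screened-out features finite

# refit on rescaled columns: the standard trick for weighted L1 penalties
ada = LogisticRegression(penalty="elasticnet", solver="saga", C=1.0,
                         l1_ratio=0.5, max_iter=5000).fit(X * w, y)
beta_ada = ada.coef_.ravel() * w           # coefficients in the original scale
```

Features with small initial coefficients receive large effective penalties and tend to be dropped, which is the variable-selection consistency the adaptive weighting targets.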

Despite their simplicity, linear models may be less interpretable when they do not fully capture the true data generating function24. Decision tree ensembles are flexible non-linear models that tend to outperform linear architectures in many supervised tasks. We trained Random Forest37 (scikit-learn 0.24.2, python 3.9.12) and XGBoost38 (xgboost 1.7.4, python 3.9.12) to account for any non-linearity that may persist after the ICA transformation. We made a random 70/30 split of our learning set to form training and test sets. We tuned each model architecture in a two-step process. First, we searched the hyperparameter space using Bayesian optimization39 (optuna 3.1.0, python 3.9.12) with 10-fold cross-validation area under the receiver operating characteristic curve (10f-CV AUROC) as the target and a universal computational budget of 4000 trials. Since the variance of K-fold CV may be very large40, we employed a second step to increase tuning robustness. From the initial search results, we identified the subset of hyperparameter combinations that were statistically comparable in performance to the best model through overlap of DeLong41,42 logistic43 confidence intervals (CI) (α = 0.2). We then optimized for mean out-of-bag AUROC on 100 bootstrapped samples to find the final model training AUROC, for which we report Wald 95% CIs. This tuning process was performed for all architectures on both the signatures and the channels. To select the final model architecture, we computed the test set performance of all four architectures on both data representations and quantified the statistical significance of their pairwise differences via the DeLong test (α = 0.05) for correlated AUROCs.

For each combination of model architecture and data representation, we also computed the integrated calibration index (ICI)44 and cross-entropy loss along with their pivot-based 95% CIs from 1000 bootstrapped samples. The former quantifies the accuracy of individual predictions, while the latter combines both calibration and discrimination.

We benchmarked our two final models against the SLE phenotyping algorithm with the highest PPV (95%) in Barnado et al.17 by computing its recall, specificity, and precision on our test set.

Feature importance.

We trained our final signature model on 500 bootstrapped samples of the training set to obtain the empirical distribution of global feature importance, computed as the mean absolute SHapley Additive exPlanations (SHAP)45 values across the test set. We used linear SHAP45 and assumed feature independence to stay true to the model46 and align with the causal definition of SHAP47. The independence assumption is automatically satisfied for the signatures model predictors under the ICA model but does not hold for the channels.
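Under feature independence, the SHAP value of feature j in a linear model has a closed form, w_j(x_j − E[x_j]). A sketch on synthetic data reproducing what linear SHAP computes, without requiring the shap library (data and model are illustrative stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
w = model.coef_.ravel()

# per-instance SHAP values in log-odds units: w_j * (x_ij - mean_j)
phi = (X - X.mean(axis=0)) * w
# global importance: mean absolute SHAP value per feature
global_importance = np.abs(phi).mean(axis=0)

# local accuracy: contributions sum to f(x) - E[f(x)] in log-odds
logits = model.decision_function(X)
assert np.allclose(phi.sum(axis=1), logits - logits.mean())
```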

Record-level root causes.

Figure 2A describes the causal data generation process for this study’s problem of recognizing SLE in the health record. The disease process, its causes (e.g. genetics and environmental exposures), and its effects (e.g. pathophysiologic or lifestyle changes) are the most upstream latent nodes. These cause a set of disease observations (OD) that are recorded by a clinician. The provider may then recommend a treatment to the patient, which they may or may not adhere to (T). We do not observe the actual treatment, only what the clinician documents in the EHR (OT). The actual treatment causes a set of downstream effects to which we also do not have direct access (DE) (e.g. drug side effects). T, OT, and DE go on to cause additional observations in the record by the clinician (ODE). As described above, a binary chart review label is assigned to each patient by examining the record (i.e. OD, OT, and ODE, which include the clinical note that reflects the physician’s diagnosis).

Figure 2.


Two views of the causal graph that describes the data generating process for a particular patient. A) includes conceptual latent variables, while in B) unobserved nodes are compressed into error terms. These are root causes to the degree the observed data allows for upstream identification. Shaded grey nodes correspond to observed dimensions, dashed lines define latent nodes, and the green box represents the binary SLE label. D: Disease; O: Observations; T: Treatment; DE: Downstream effects; E: Error term.

The true predictive root causes behind an individual patient’s label are the parentless latent nodes in figure 2A. Ideally, we would observe these and use them as input to predictive models. However, we only have access to imperfect observations of them48. Under the LiNGAM model assumptions49, latent sources discovered using probabilistic independence represent unobserved root causes of the target sink node in the data generating process, to the extent these causes are identifiable from the observed data26. Formally, the ICA sources correspond to the exogenous independent error terms of the structural equation model that we assume describes our problem’s causal process (figure 2B). Causal relations are transitive: A causes C if A causes B and B causes C. The discovery of the error terms from observational data allows for a compact representation of the unobserved causes (dashed nodes) in figure 2A using the data we have access to (gray rounded nodes). We can thus equivalently represent the data generating process using the causal graph in figure 2B, where we have removed T and DE from figure 2A without introducing confounding50.

The sources in matrix S are equivalent to the predictive error terms (E), each associated with an observed variable26. The individual predictivity of each source for each patient corresponds mathematically to its SHAP value (ϕ) given a model that estimates P(Label | S)26. If ϕi > 0 then the i-th source is a root cause of the SLE label, while if ϕi < 0 the source is protective. Thus, SHAP values provide a quantitative causal explanation of why a patient record did or did not receive an SLE label. We obtained SHAP explanations for the same patient using both the signatures and channels final models to demonstrate the added expressivity and information richness of the sources.

Results

Cohorts and dimensionality.

63,775 patients fulfilled our inclusion criteria and comprised the discovery set. The final EHR data dimensionality was p = 7947, composed of 6218 SNOMED condition codes, 839 clinical measurements, 879 medications, race, age, and biological sex. Curve sampling generated n = 646,775 cross-sections stacked in X ∈ ℝ7947×646,775 and fed to ICA for its decomposition into S ∈ ℝ2000×646,775 and A ∈ ℝ7947×2000. The learning set consisted of 490 SLE cases and 261 negative but difficult to diagnose “near misses”. Demographics for patients in all these cohorts are shown in table 1.

Table 1.

Demographics for the different patient cohorts. Std: standard deviation.


Model performance.

There were no significant differences in training accuracy among architectures for either data representation (table 2). The channels achieved higher training AUROC point estimates than the signatures. On the test set, the channel models were statistically more accurate. No architecture showed statistically better performance than the others, which suggests that the underlying function was sufficiently captured by the linear models. Test accuracy was generally higher than in training, which aligns with the idea that the test set had a larger share of easy-to-classify cases. AdaNet and XGBoost showed the best calibration among models trained on the signatures and the channels, respectively. In general, the signatures models were better calibrated, with XGBoost as the exception. Cross-entropy was slightly better for both linear models using the signatures and for ENet and XGBoost on the channels data representation (figure 3).

Table 2.

Supervised model performance and calibration. AUROC: Area under the receiver operating characteristic curve (higher is better). ICI: Integrated calibration index (lower is better). CE: Cross entropy (lower is better). Brackets enclose 95% CIs. Bolding denotes statistical equivalence within columns by pairwise DeLong test (α = 0.05). *: Random Forest and XGBoost AUROC difference was significant.


Figure 3.


Test cross-entropy.

We chose AdaNet as the architecture for the rest of the analysis due to its performance and calibration reported above, as well as its feature selection consistency (oracle property) and higher interpretability (low complexity with fewer nonzero features).

Our final models outperformed the Barnado et al. algorithm17 on the test set (figure 4A and B), even though the estimate of its performance there is probably optimistic: their work and ours share the same data source but use completely independent train/test splits, which may have leaked patients from their training set into our test set.

Figure 4.


Test set A) receiver operating characteristic curve and B) precision-recall curve for both final models. The green dot shows the performance of Barnado et al.17. Mean absolute SHAP value (in log-odds units) empirical distributions of the nonzero features for C) the 5 most predictive signatures that make up the final signatures model and D) the additional 14 nonzero sources found to be predictive under training set sampling variation. The y-axis symmetric logarithmic scale threshold and KDE bin width were adjusted source-wise to enhance visualization. Each source’s model coefficient sign is provided in parentheses to indicate the directionality of its effect on the prediction (-: away from the SLE label). SHAP waterfall diagrams for the E) channels and F) signatures final models explaining the same patient. Gray numbers on the y-axis indicate feature values. Arrows show each predictor’s SHAP value, which quantifies the marginal contribution to the model output (f(x)) starting from the expected estimate over the training set (E[f(X)]). APL: Antiphospholipid syndrome; HCQ: Hydroxychloroquine.

Predictive sources of SLE’s heterogeneity.

The channels model learnt from the full training set selected 8 clinical variables: SNOMED codes (SLE, SLE with organ/system involvement, chest pain, ECG normal, and low back pain) and laboratory tests (ANA titer, DNA double-stranded antibody test, and complement C4).

The final signatures model selected 5 latent sources (figure 4C). S-650 describes a recognizable picture of SLE with the particularity that the systemic autoimmunity has advanced to damage a particular organ or system (figure 5), a strong predictor of definite SLE diagnosis6. S-1289 shows the characteristic picture of a generic SLE patient: a slightly anemic female with increased intensity of the unspecific SLE code. S-1497 represents the SLE complication of lupus nephritis15. S-1588 depicts typical SLE laboratory abnormalities: high ANA titer, low complements, and an elevated SLE-specific DNA double-stranded antibody test1. Finally, S-1683 shows a picture of toxic maculopathy in patients with codes for autoimmune diseases. Manual review revealed that this source is highly expressed in records documenting long-term use of hydroxychloroquine among patients with confirmed autoimmune conditions such as SLE, RA, and Sjögren’s syndrome. S-1683’s signature accounts for increases in the probability of exposure to the drug of up to 0.18; however, this change is not shown in the diagram as it fell below the display cutoff (figure 5). Manual review of selected clinical notes revealed that providers use the toxic maculopathy code for insurance purposes when referring hydroxychloroquine users for their annual eye exam. This may explain why the extremely rare disease toxic maculopathy was found to be the top contributor to S-1683. Graphically, S-1683 is an example of EDE in the causal graph (figure 2B), while the other 4 predictive sources are error terms of the set depicted by ED.

Sample variability testing revealed 14 additional predictive sources (figure 4D), most of which have signatures that are recognizable clinical patterns in SLE care (diagrams not shown for space). We identified three distinct isotypes of antiphospholipid syndrome: complete16 (S-999), anticardiolipin IgG (S-1378), and anticardiolipin IgA (S-1475)51. The model also recognized sources for the two main autoimmune conditions that overlap with SLE, RA (S-788) and Sjögren’s syndrome (S-976), the expression of which moves the prediction away from SLE. Treatment of SLE with hydroxychloroquine (S-1581) was also identified as predictive. Increased expression of the ‘Injuries’ source (S-295) moved the prediction away from SLE, which aligns with the lifestyle of these patients.

Record-specific prediction interpretability.

Figures 4E and 4F display the SHAP explanations for the same “near miss” patient by the channels and signatures models. Clinical note review revealed a long history (>30 years) of autoimmune diagnoses with overlapping RA and SLE and clinician disagreement. Assessment by a rheumatologist confirmed this was a case of cutaneous lupus with no evidence of the systemic damage required to qualify as SLE. It is an example of the type of situation where higher patient-level interpretability may be more useful than accuracy. Even though both models give a wrongly confident prediction that this is an SLE record, the explanations differ in both the depth and clinical relevance of the information they offer. Small negative expressions of the SLE laboratory pattern (S-1588), lupus nephritis (S-1497), and long-term hydroxychloroquine user (S-1683) signatures pull the prediction toward the correct label. However, the high intensity of SLE codes, captured by the expressions of S-650 and S-1289, misleads the model into assigning this record P(SLE) = 0.82. Conversely, the channels’ view does not provide as detailed an explanation even though it uses more variables. It does capture the high intensity of SLE codes, but it also conveys a mix of codes and laboratory values that may be misleading. Almost all of them lead the model to wrongly assign P(SLE) = 0.76 to this patient.

Discussion and Conclusions

We proposed a data-driven approach to address the problem of recognizing SLE’s heterogeneity in the health record. We discovered 2000 independent latent sources and their corresponding EHR signatures from large amounts of unlabeled, noisy, sparse, and high-dimensional data from rheumatology patients. ICA-inferred sources represent the most upstream causes of the SLE label within the data generating process, insofar as the observed data permits their identification. Assuming a linear structural equation model, the sources are the predictive exogenous error terms of each patient’s causal graph. Using the source expressions as the input data representation forces supervised models to exploit the causal paths that generated the SLE labels, which might be more predictive, interpretable, and clinically informative. We tested the sources against the original 7947 clinical variables (or channels) by training analogous SLE recognition models. Under internal validation, the latent sources provide supervised models with higher interpretability, more clinically relevant explanations, and better calibration, although slightly lower discrimination. Notably, both models outperformed an optimistic view of the current go-to SLE phenotyping algorithm.

One limitation of our study is the lack of a bijective relationship between the observed variables and the sources. Having fewer sources than channels (2000 < 7947) implies that some of the sources may be linear combinations of error terms in the causal graph. If this were the case, the sources’ predictivity would be obscured, which would explain our findings. However, this limitation does not invalidate the causal value and higher interpretability of the sources vs. the channels in recognizing SLE heterogeneity.

The sources were discovered with no human input: we let the data tell us which linear combinations of changes in clinical variables define each signature. These independent patterns seem to reflect what clinicians look for when making sense of a record, whether known or unknown, current or emergent. We identified 19 clinical signatures that had high clinical face validity and defined meaningful factors of SLE heterogeneity in our training set. We suspect that the true heterogeneity of SLE is likely greater, with more than 19 latent sources. Our study was limited by our learning set, which was mainly of white race and not representative of the SLE population at large. It is well known that SLE has higher prevalence and severity among minority populations, so it is reasonable to expect that their inclusion would reveal additional latent sources. This is a promising line of future work.

At the patient level, providers may be willing to trade off a small amount of accuracy for more meaningful causal explanations that quantify the why behind a specific patient’s prediction. The equivalence between the ICA sources and the error terms in the patient-specific causal graph allows us to use SHAP as a framework to both quantify the predictivity of each source and explain the signatures model’s predictions, making them easier to comprehend than those from the channels model. The signatures model is also more interpretable, which is especially useful for challenging cases where the clinical history is uninformative, incoherent, or reflects several overlapping conditions. In such scenarios, a provider would benefit from causal explanations that can be used to audit the model and ensure that its behavior complies with domain knowledge before making an interventional decision.
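
A convenient property behind these explanations: for a linear model over probabilistically independent inputs, SHAP values have a closed form with no interaction terms. The sketch below illustrates this with numpy only; the source expressions, coefficients, and intercept are hypothetical placeholders, not values from our models.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_sources = 200, 6

# Hypothetical per-patient source expressions and a hypothetical
# linear SLE model (real sources come from ICA over EHR channels).
S = rng.laplace(size=(n_patients, n_sources))
coef = np.array([1.5, -0.8, 0.0, 0.4, 0.0, -1.2])
intercept = -0.3

def shap_linear(S, coef):
    """Exact SHAP values for a linear model over independent inputs:
    the contribution of source j is coef_j * (s_j - E[s_j])."""
    return coef * (S - S.mean(axis=0))

phi = shap_linear(S, coef)

# Local additivity: base value + per-source contributions recover each
# patient's raw score, giving a per-patient 'why' for the prediction.
scores = S @ coef + intercept
base_value = S.mean(axis=0) @ coef + intercept
print(np.allclose(phi.sum(axis=1) + base_value, scores))  # True

# Global predictivity of each source: mean absolute contribution.
global_importance = np.abs(phi).mean(axis=0)
```

The additivity check is exact by algebra, and sources with zero coefficient contribute exactly zero, which is what makes per-source attributions easy to audit against domain knowledge.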

Acknowledgements

MBM received fellowships from Fulbright Spain and “la Caixa” Foundation (ID: 100010434, code: LCF/BQ/EU22/11930087). This project was funded by NIAMS/NIH R01 AR076516, Lupus Research Alliance BMS Accelerator Award, and CTSA award UL1 TR002243 from the National Center for Advancing Translational Sciences.

References

  • 1. Bertsias GK, Pamfil C, Fanouriakis A, Boumpas DT. Diagnostic criteria for systemic lupus erythematosus: has the time come? Nat Rev Rheumatol. 2013. doi: 10.1038/nrrheum.2013.103.
  • 2. Petri M, Orbai AM, Alarcón GS, et al. Derivation and Validation of Systemic Lupus International Collaborating Clinics Classification Criteria for Systemic Lupus Erythematosus. Arthritis Rheum. 2012.
  • 3. Aringer M, Costenbader K, Daikh D, et al. 2019 European League Against Rheumatism/American College of Rheumatology classification criteria for systemic lupus erythematosus. Annals of the Rheumatic Diseases. 2019.
  • 4. Aggarwal R, Ringold S, Khanna D, et al. Distinctions Between Diagnostic and Classification Criteria? Arthritis Care & Research. 2015.
  • 5. Mulherin SA, Miller WC. Spectrum Bias or Spectrum Effect? Subgroup Variation in Diagnostic Test Evaluation. Ann Intern Med. 2002.
  • 6. Fanouriakis A, Tziolos N, Bertsias G, Boumpas DT. Update on the diagnosis and management of systemic lupus erythematosus. Annals of the Rheumatic Diseases. 2021.
  • 7. Schattner A. Unusual Presentations of Systemic Lupus Erythematosus: A Narrative Review. The American Journal of Medicine. 2022.
  • 8. Fraenkel L, Bathon JM, England BR, et al. 2021 American College of Rheumatology Guideline for the Treatment of Rheumatoid Arthritis. Arthritis Care Res (Hoboken). 2021.
  • 9. Ehrenstein MR, Shipa M. SLE is not a one-size-fits-all disease. Journal of Experimental Medicine. 2023.
  • 10. Schoenbach VJ, Rosamond W. Understanding the Fundamentals of Epidemiology: an evolving text. 2000.
  • 11. Ehrenstein MR, Mauri C. If the treatment works, do we need to know why?: the promise of immunotherapy for experimental medicine. Journal of Experimental Medicine. 2007.
  • 12. Agache I, Akdis CA. Precision medicine and phenotypes, endotypes, genotypes, regiotypes, and theratypes of allergic diseases. The Journal of Clinical Investigation. 2019.
  • 13. Pisetsky DS, Lipsky PE. New insights into the role of antinuclear antibodies in systemic lupus erythematosus. Nat Rev Rheumatol. 2020.
  • 14. Eudy AM, Rogers JL, Corneli A, et al. Intermittent and Persistent Type 2 lupus: patient perspectives on two distinct patterns of Type 2 SLE symptoms. Lupus Sci Med. 2022.
  • 15. Anders HJ, Saxena R, Zhao MH, et al. Lupus nephritis. Nat Rev Dis Primers. 2020.
  • 16. Pons-Estel GJ, Andreoli L, Scanzi F, Cervera R, Tincani A. The antiphospholipid syndrome in patients with systemic lupus erythematosus. J Autoimmun. 2017.
  • 17. Barnado A, Casey C, Carroll RJ, et al. Developing Electronic Health Record Algorithms That Accurately Identify Patients With Systemic Lupus Erythematosus. Arthritis Care and Research. 2017.
  • 18. Jorge A, Castro VM, Barnado A, et al. Identifying lupus patients in electronic health records: Development and validation of machine learning algorithms and application of rule-based algorithms. Seminars in Arthritis and Rheumatism. 2019.
  • 19. Adamichou C, Genitsaridi I, Nikolopoulos D, et al. Lupus or not? SLE Risk Probability Index (SLERPI): a simple, clinician-friendly machine learning-based model to assist the diagnosis of systemic lupus erythematosus. Annals of the Rheumatic Diseases. 2021.
  • 20. Murray SG, Avati A, Schmajuk G, Yazdany J. Automated and flexible identification of complex disease: building a model for systemic lupus erythematosus using noisy labeling. Journal of the American Medical Informatics Association. 2019.
  • 21. Barnado A, Wheless L, Camai A, et al. Phenotype Risk Score but Not Genetic Risk Score Aids in Identifying Individuals With Systemic Lupus Erythematosus in the Electronic Health Record. Arthritis & Rheumatology. 2023.
  • 22. Oakden-Rayner L, Dunnmon J, Carneiro G, Ré C. Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging. Proc ACM Conf Health Inference Learn. 2020.
  • 23. Shortliffe EH, Sepúlveda MJ. Clinical Decision Support in the Era of Artificial Intelligence. JAMA. 2018.
  • 24. Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence. 2020.
  • 25. Lasko TA, Still JM, Li TZ, Mota MB, Stead WW, Strobl EV, et al. Unsupervised Discovery of Clinical Disease Signatures Using Probabilistic Independence. arXiv. 2024.
  • 26. Strobl EV, Lasko TA. Identifying patient-specific root causes of disease. In: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB ’22). New York, NY, USA: Association for Computing Machinery; 2022. pp. 1–10.
  • 27. Roden DM, Pulley JM, Basford MA, et al. Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clinical Pharmacology and Therapeutics. 2008.
  • 28. Bourel M, Fraiman R, Ghattas B. Random average shifted histograms. Computational Statistics & Data Analysis. 2014.
  • 29. Fritsch FN, Butland J. A Method for Constructing Local Monotone Piecewise Cubic Interpolants. 2006.
  • 30. Gelman A. Scaling regression inputs by dividing by two standard deviations. Statistics in Medicine. 2008.
  • 31. Lasko TA, Mesa DA. Computational Phenotype Discovery via Probabilistic Independence. Proc KDD Appl Data Sci for Healthcare (DSHealth 2019). 2019.
  • 32. Hyvärinen A, Karhunen J, Oja E. Independent Component Analysis. 1st ed. New York: Wiley-Interscience; 2001.
  • 33. Hyvärinen A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neural Netw. 1999.
  • 34. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005.
  • 35. Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics. 2009;37:1733. doi: 10.1214/08-AOS625.
  • 36. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008.
  • 37. Breiman L. Random Forests. Machine Learning. 2001.
  • 38. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA: ACM; 2016. pp. 785–94.
  • 39. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). New York, NY, USA: Association for Computing Machinery; 2019. pp. 2623–31.
  • 40. Breiman L. Heuristics of instability and stabilization in model selection. The Annals of Statistics. 1996.
  • 41. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988.
  • 42. Sun X, Xu W. Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Processing Letters. 2014.
  • 43. Qin G, Hotilovac L. Comparison of non-parametric confidence intervals for the area under the ROC curve of a continuous-scale diagnostic test. Statistical Methods in Medical Research. 2008;17:207–21. doi: 10.1177/0962280207087173.
  • 44. Austin PC, Steyerberg EW. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Statistics in Medicine. 2019.
  • 45. Lundberg SM, Lee SI. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems. 2017.
  • 46. Chen H, Janizek JD, Lundberg S, Lee SI. True to the Model or True to the Data? 2020.
  • 47. Janzing D, Minorics L, Bloebaum P. Feature relevance quantification in explainable AI: A causal problem. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. PMLR; 2020. pp. 2907–16.
  • 48. Lasko TA, Strobl EV, Stead WW. Why do probabilistic clinical models fail to transport between sites. npj Digit Med. 2024.
  • 49. Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A. A Linear Non-Gaussian Acyclic Model for Causal Discovery. Journal of Machine Learning Research. 2006.
  • 50. Strobl EV, Lasko TA. Sample-Specific Root Causal Inference with Latent Variables. In: Proceedings of the Second Conference on Causal Learning and Reasoning. PMLR; 2023. pp. 895–915.
  • 51. Johns Hopkins Lupus Center. Antiphospholipid Antibodies.
