Pulmonary Circulation. 2023 Jun 6;13(2):e12237. doi: 10.1002/pul2.12237

A claims‐based, machine‐learning algorithm to identify patients with pulmonary arterial hypertension

Bethany Hyde 1, Carly J Paoli 2, Sumeet Panjabi 2, Katherine C Bettencourt 3, Karimah S Bell Lynum 3, Mona Selej 4
PMCID: PMC10243208  PMID: 37287599

Abstract

Many patients with pulmonary arterial hypertension (PAH) experience substantial delays in diagnosis, which is associated with worse outcomes and higher costs. Tools for diagnosing PAH sooner may lead to earlier treatment, which may delay disease progression and adverse outcomes including hospitalization and death. We developed a machine‐learning (ML) algorithm to identify patients at risk for PAH earlier in their symptom journey and distinguish them from patients with similar early symptoms not at risk for developing PAH. Our supervised ML model analyzed retrospective, de‐identified data from the US‐based Optum® Clinformatics® Data Mart claims database (January 2015 to December 2019). Propensity score matched PAH and non‐PAH (control) cohorts were established based on observed differences. Random forest models were used to classify patients as PAH or non‐PAH at diagnosis and at 6 months prediagnosis. The PAH and non‐PAH cohorts included 1339 and 4222 patients, respectively. At 6 months prediagnosis, the model performed well in distinguishing PAH and non‐PAH patients, with area under the curve of the receiver operating characteristic of 0.84, recall (sensitivity) of 0.73, and precision of 0.50. Key features distinguishing PAH from non‐PAH cohorts were a longer time between first symptom and the prediagnosis model date (i.e., 6 months before diagnosis); more diagnostic and prescription claims, circulatory claims, and imaging procedures, leading to higher overall healthcare resource utilization; and more hospitalizations. Our model distinguishes between patients with and without PAH at 6 months before diagnosis and illustrates the feasibility of using routine claims data to identify patients at a population level who might benefit from PAH‐specific screening and/or earlier specialist referral.

Keywords: early diagnosis, rare disease, real‐world evidence

INTRODUCTION

Pulmonary arterial hypertension (PAH) is a rare disease associated with rapid progression and poor outcomes. 1 , 2 Typical symptoms of PAH, such as breathlessness and fatigue, are nonspecific and are often mistaken for other conditions; diagnosis is therefore challenging. 3 This leads to a substantial delay (>2 years, on average) between symptom onset and a confirmed PAH diagnosis, with most newly diagnosed patients already experiencing severe symptoms. 4 A survey conducted in the United Kingdom reported that almost half of patients saw at least four doctors before being diagnosed and over a third waited at least 2 years for a diagnosis. 5 , 6 Delays of around 2 years from symptom onset to diagnosis have also been reported in other studies. 7 A more recent study, on registry data from the Pulmonary Hypertension Society of Australia and New Zealand, reported mean diagnostic delay of 2.5 years, which highlights that, despite better awareness of PAH over the years, the situation with delays in diagnosis has not improved. 8 , 9

Therapeutic options for PAH have expanded over the past two decades, 10 yet despite this, estimates of 1‐year mortality range from 8% to 17% and 3‐year mortality from 25% to 44%. 4 , 11 , 12 , 13 Patients with less severe disease at diagnosis have better outcomes than those with more advanced disease. 11 , 14 Moreover, a clinical study showed that among patients with PAH treated with an endothelin receptor antagonist, those with New York Heart Association functional class I/II had better survival than those with functional class III/IV (1‐year survival 100% vs. 78%, respectively). 15 These data suggest that the ability to identify PAH earlier in the course of the disease could greatly improve patient outcomes.

The application of artificial intelligence and machine‐learning algorithms in healthcare is rising and is expected to offer a considerable contribution to real‐world clinical decisions. 16 , 17 Machine‐learning algorithms evaluate large amounts of data to identify repeating themes or patterns that are then used to predict relationships. 17 These algorithms can offer obvious benefits in tasks that require consideration of several factors, such as diagnosis and prediction of outcomes. Recently, machine‐learning algorithms have been developed that use large databases of routinely collected patient data to screen for disease or identify patients at high risk; for example, electronic health record data have been utilized in diabetes 18 and heart failure, 19 as well as in PAH. 20 In this article, we describe the development of a machine‐learning algorithm based on retrospective US healthcare claims data for early identification of patients with PAH.

METHODS

Study design

This was a retrospective, predictive analysis using a supervised machine‐learning model to distinguish patients with PAH from patients without PAH who present with similar symptoms. We used data collected between January 2015 and December 2019. Deidentified data were obtained from the US‐based Optum® Clinformatics® Data Mart claims database (CDM), which holds information on demographics, insurance, healthcare resource utilization and costs, diagnosis, medication, and physician specialty. This information spans all 50 US states and is considered to be broadly representative of the US population enrolled in commercial health plans.

Patients with shared early symptomology typical of PAH, including dyspnea or breathlessness, fatigue, chest pain, light‐headedness, lower limb edema, and syncope (Supporting Information: Tables S1 and S2), were identified from the database and divided into two cohorts for analysis: a PAH cohort and a non‐PAH (control) cohort with other cardiovascular or respiratory disease. The final diagnoses for the non‐PAH cohort are provided in Supporting Information: Table S3.

The process and timeline for cohort selection are shown in Supporting Information: Figure S1. The date of the first report of early symptomology typical of PAH was captured as the “first symptom date.” The date of diagnosis was the date of the first confirmed diagnosis of PAH (PAH cohort) or of the cardiovascular or respiratory diagnosis related to the early symptomology (non‐PAH cohort). A baseline washout period of 12 months before the diagnosis date was applied to both cohorts to ensure that the confirmed diagnosis was not present before the diagnosis date. For the PAH cohort, the final diagnosis was PAH; for the non‐PAH cohort, it was their final diagnosis of other cardiovascular or respiratory disease. Both cohorts were screened for PAH treatment during the 12‐month washout period. The PAH cohort had to be continuously enrolled for at least 12 months from the first symptom date or at least 6 months from the diagnosis date, whichever was later. The non‐PAH cohort had to be continuously enrolled for at least 24 months from the first symptom date. The extended enrollment period for the non‐PAH cohort was intended to ensure sufficient time for a PAH diagnosis to be obtained, if appropriate. Additionally, analysis of the data indicated that the majority of patients received a diagnosis within that timeframe.

An initial version of the model was designed as a baseline to determine if a model could distinguish between patients with PAH and patients without PAH when using the same detailed information as physicians (as captured in claims data). This model sought to classify patients as either PAH or non‐PAH at the time of diagnosis and thus had an index date as the diagnosis date (at‐diagnosis index) and included data from up to 12 months before the diagnosis date. This version of the model incorporated pre‐existing comorbidities or healthcare treatments as well as all data available from the entire diagnostic journey. However, as our primary goal was to identify patients with PAH earlier in the disease course, rather than at the time of diagnosis, a second version of the model (the main model) was produced. The main model included data up to 6 months before diagnosis only (i.e., more recent data [0–6 months before diagnosis] were removed) and the index date was set as 6 months before the diagnosis date (prediagnosis model date; Supporting Information: Figure S1). Data from up to 12 months before the prediagnosis model date were included and the 1‐year look‐back period therefore consisted of data from 18 months to 6 months before the at‐diagnosis index date.
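The two index dates and their 12‐month look‐back windows can be illustrated with a short sketch. This is only an illustration of the date arithmetic described above, not the study's code; the helper names (`shift_months`, `lookback_window`) and the day‐clamping rule are assumptions.

```python
from datetime import date


def shift_months(d: date, months: int) -> date:
    """Move d back by `months` calendar months.

    The day is clamped to 28 so the result is always a valid date
    (a simplification chosen for this sketch).
    """
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    return date(year, month0 + 1, min(d.day, 28))


def lookback_window(diagnosis_date: date, main_model: bool = True):
    """Return (window_start, index_date) for the 12-month look-back.

    main_model=True  -> index is 6 months before diagnosis
                        (the prediagnosis model date)
    main_model=False -> index is the diagnosis date (at-diagnosis index)
    """
    index_date = shift_months(diagnosis_date, 6) if main_model else diagnosis_date
    return shift_months(index_date, 12), index_date


# Hypothetical diagnosis date, used only to show the two windows.
print(lookback_window(date(2019, 3, 15)))         # 18 to 6 months before diagnosis
print(lookback_window(date(2019, 3, 15), False))  # 12 to 0 months before diagnosis
```

For the main model, claims dated after the returned index date would be dropped, which is what removes the most recent 6 months of the diagnostic journey.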

Patient population

Inclusion criteria for the PAH cohort were at least one claim for early PAH symptomology, and evidence of PAH diagnosis. Since claims databases may contain inaccuracies with diagnostic coding, we deployed a code‐based algorithm to detect and confirm PAH diagnosis. 21 Specifically, patients in the PAH cohort were required to have at least one inpatient claim for PAH or pulmonary hypertension (PH) based on International Classification of Diseases, Ninth Revision (ICD‐9) or Tenth Revision (ICD‐10) diagnostic codes (ICD‐9: 416.x; ICD‐10: I27.x; full code list in Supporting Information: Figure S2), or at least two outpatient claims on 2 separate days; they also had to have at least one claim for right heart catheterization (the accepted method of making a definitive diagnosis of PAH 3 ) within 6 months of the first PAH or PH diagnosis and at least one claim for a PAH‐specific medication (including endothelin receptor antagonist, phosphodiesterase‐5 inhibitor, prostacyclin, prostacyclin analog, or soluble guanylate cyclase stimulator) within 6 months of the first PAH or PH diagnosis. Diagnosis codes for PH were used in addition to PAH codes because of the relatively low use of PAH codes in clinical practice. In addition, patients had to be ≥18 years of age at the index date (at‐diagnosis index date in the initial version of the model and prediagnosis model date in the main model) and to have no PAH or PH diagnosis, no PAH‐specific medications, and no right heart catheterization for at least 12 months before the diagnosis date (washout period). They were also required to have continuous medical and pharmacy health plan enrollment for at least 12 months during the washout period, until at least 12 months following the initial qualifying early symptom diagnosis code or at least 6 months following the diagnosis date, whichever was later.

Non‐PAH patients were included if they had at least one claim for early PAH symptomology, and at least three claims listed as a primary diagnosis for a new cardiovascular or respiratory disease following the original symptom presentation (e.g., three instances of asthma from different medical visits in the absence of a previous asthma diagnosis). A confirmation was conducted to ensure that the cardiovascular or respiratory disease was not present in the previous 12 months. The requirement for confirmation of the initial cardiovascular or respiratory disease diagnosis by two subsequent diagnostic codes for the same disease addressed any potential issues with coding accuracy that can be present in real‐world evidence data sources such as claims data. Non‐PAH patients also had to be ≥18 years of age at the at‐diagnosis index date in the initial version of the model and prediagnosis model date in the main model and to have never had a PAH or PH diagnosis, PAH‐specific medications, or right heart catheterization. Additionally, there was a 12‐month look‐back from the at‐diagnosis index or prediagnosis model date, to confirm that the non‐PAH patients did not have their final diagnosis before the first PH symptom. Non‐PAH patients were also required to have continuous medical and pharmacy health plan enrollment for at least 12 months during the washout period until at least 24 months (to ensure correct assignment) following the initial qualifying early symptom diagnosis code. Patients with chronic thromboembolic pulmonary hypertension (ICD‐10 I27.24, I27.0, I27.2, I27.20, I27.21, I27.29, I27.89, I27.9) were excluded from the study.

Population features such as sociodemographic and general health factors can influence the probability of an individual receiving treatment and/or diagnosis, which can either obscure significant differences between populations or create erroneous differences. Moreover, it is known that patients with PAH are generally older and more likely to be female than patients without PAH. 22 Thus, propensity score matching was used to control for disparity between PAH and non‐PAH cohorts. Propensity score matching was conducted using the features of age, sex, race/ethnicity, insurance type, geographic region, income, education level, and year of first early PAH symptom. The propensity‐matched non‐PAH cohort was randomly downsampled to create a non‐PAH to PAH cohort ratio of approximately 3:1. This ratio was chosen to demonstrate the low prevalence of PAH in real‐world populations, while still allowing the algorithm to detect signals that could distinguish PAH from non‐PAH.
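The downsampling step can be sketched as a seeded random draw from the already matched controls. The article does not describe how the draw was implemented, so the function name, the fixed seed, and the toy identifiers below are all assumptions made for illustration.

```python
import random


def downsample_controls(controls, n_cases, ratio=3.0, seed=0):
    """Randomly keep about ratio * n_cases controls.

    If fewer controls are available than the target, all are kept.
    The seed makes the draw reproducible.
    """
    target = min(len(controls), round(ratio * n_cases))
    rng = random.Random(seed)
    return rng.sample(controls, target)


# Hypothetical identifiers standing in for propensity-matched controls.
controls = [f"ctrl_{i}" for i in range(1000)]
kept = downsample_controls(controls, n_cases=100)
print(len(kept))  # 300 controls for 100 cases, i.e. 3:1
```

Sampling without replacement keeps each control at most once, so the retained cohort remains a subset of the matched population.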

Variables included in the model

Variables included in the model and their definitions are provided in Supporting Information: Table S4 and classification of US states can be found in Supporting Information: Table S5. Demographic characteristics included in the model were patient age, sex, US geographic region, insurance type (commercial, or Medicare), race/ethnicity, education level, and household income (Supporting Information: Table S4). Clinical characteristics were the Deyo‐modified Charlson Comorbidity Index, 23 PH‐related comorbidities (connective tissue disease, portal hypertension, congenital heart disease, schistosomiasis, substance abuse, and human immunodeficiency virus infection) and patient history comorbidities reported during the 12 months before the at‐diagnosis index or prediagnosis model period, and patient medication history (aggregate number of medication procedure codes per month) (Supporting Information: Table S4).

All‐cause healthcare utilization per patient per month during the 12 months before the at‐diagnosis index or prediagnosis model period was included for inpatient hospitalization, emergency room visits, office visits, and other outpatient visits; diagnosis‐related group hospitalization reasons were used (Supporting Information: Table S4). Out‐of‐pocket costs were included with 6‐month look‐back from the at‐diagnosis index or prediagnosis model date and from start of the year until the at‐diagnosis index or prediagnosis model date (which could be a variable amount of time). These two cost periods were included to account for the varying effect of deductibles throughout a year. Standard costs (for inpatient hospitalization costs, outpatient costs [including emergency room visits, office visits, and other outpatient visits], and total inpatient and outpatient costs) were included with 6‐month look‐back from the at‐diagnosis index or prediagnosis model date. Costs were adjusted to 2019 US dollars using the medical care component of the consumer price index.
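The inflation adjustment follows the usual index‐ratio rule: a cost incurred in year y is multiplied by the 2019 index value divided by the year‐y index value. The sketch below assumes that rule; the index values are placeholders chosen for easy arithmetic, not actual consumer price index figures.

```python
def to_2019_dollars(cost, year, cpi_by_year):
    """Rescale a cost from `year` dollars to 2019 dollars using an index ratio."""
    return cost * cpi_by_year[2019] / cpi_by_year[year]


# Placeholder index values (NOT real medical-care CPI data).
cpi = {2015: 100.0, 2017: 105.0, 2019: 110.0}

print(to_2019_dollars(200.0, 2015, cpi))  # 2015 cost expressed in 2019 dollars
print(to_2019_dollars(200.0, 2019, cpi))  # 2019 costs are unchanged
```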

Model analysis plan

The machine‐learning approach used a random forest algorithm, a supervised classification method built from decision trees and applied here to a binary outcome. The target variable was PAH diagnosis, coded as a binary variable (yes or no). The random forest classifier is an ensemble method that trains several decision trees in parallel. Trees were developed using a bagging approach (i.e., bootstrapping followed by aggregation). In the bootstrapping step, several individual decision trees were trained in parallel on various subsets of the training data set using different subsets of available features. Bootstrapping ensured that each individual decision tree in the random forest is unique, which reduces the overall variance of the random forest classifier. For the final decision, the random forest classifier aggregated the decisions of individual trees.
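The bagging idea can be shown with a deliberately small sketch. It is not the study's model: each "tree" here is a one‐split stump, the random feature subset has size one, and the data are synthetic. All function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bootstrapping step)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, feature):
    """Fit a one-split tree on a single feature.

    The split point is the sample mean of the feature; each side
    predicts the majority label of the rows that fall on it.
    """
    threshold = sum(x[feature] for x, _ in rows) / len(rows)
    sides = {True: Counter(), False: Counter()}
    for x, y in rows:
        sides[x[feature] > threshold][y] += 1
    overall = Counter(y for _, y in rows).most_common(1)[0][0]
    vote = {side: (c.most_common(1)[0][0] if c else overall)
            for side, c in sides.items()}
    return lambda x: vote[x[feature] > threshold]


def fit_forest(rows, n_trees=25, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and feature."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    trees = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        feature = rng.randrange(n_features)  # random feature subset of size 1
        trees.append(fit_stump(sample, feature))
    return trees


def predict(trees, x):
    """Aggregate the individual tree decisions by majority vote."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]


# Synthetic rows: (features, label); the label is 1 when the first feature >= 10.
rows = [((float(i), float(2 * i)), int(i >= 10)) for i in range(20)]
trees = fit_forest(rows)
print(predict(trees, (18.0, 36.0)), predict(trees, (1.0, 2.0)))
```

Because every stump sees a different bootstrap sample, the split points differ from tree to tree, and the majority vote averages out that variation.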

The performance of the model was evaluated using a five‐fold cross‐validation approach. The model was trained on 60% of the data, with 20% reserved for test data. The final 20% of data (validation set) was used to determine model performance and generalizability and did not include any data used to train or optimize the model. As the model was designed to provide insights and not intended for clinical use at the individual patient level, the test sample consisted of blinded data reflecting the downsampled population ratio. Performance was measured using the area under the receiver operating characteristic curve (AUC ROC), recall, and precision. Accuracy was calculated as the number of patients with PAH correctly identified by the model. Recall, also described as sensitivity, was defined using a confusion matrix as the ratio of the number of true positives to the number of true positives plus the number of false negatives. Precision was defined as the ratio of the number of true positives to the number of all predicted positives (true positives plus false positives). The model was tuned to prioritize sensitivity, in recognition that the risk of failing to identify a patient with PAH is higher than the risk of falsely identifying a non‐PAH patient. Sensitivity thresholds were adjusted based on the AUC and false positive rates, to ensure optimal performance.
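The confusion‐matrix definitions of recall and precision, together with specificity (reported in the Results), can be written out as a minimal sketch. The labels and predictions below are invented for illustration; 1 denotes PAH and 0 denotes non‐PAH.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(tp, fn):
    """Sensitivity: true positives over all actual positives."""
    return tp / (tp + fn)


def precision(tp, fp):
    """True positives over all predicted positives."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """True negative rate: true negatives over all actual negatives."""
    return tn / (tn + fp)


# Invented example: 3 PAH patients and 5 non-PAH patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(recall(tp, fn), precision(tp, fp), specificity(tn, fp))
```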

Key features and their impact on the likelihood of a PAH diagnosis were calculated using the SHapley Additive exPlanations (SHAP) method. 24 A Shapley value is the average marginal contribution of an instance of a feature among all possible subgroups. The objective of SHAP is to calculate the Shapley values for each feature, where each Shapley value denotes the effect that the feature to which it is connected produces in the prediction. Descriptive analysis evaluated differences between the cohorts across top features identified as important by the machine‐learning model.
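To make the Shapley definition concrete, the sketch below computes exact Shapley values for a toy three‐feature value function by enumerating every subset. This brute‐force enumeration is exponential in the number of features and is shown only to illustrate the definition; it is not how the SHAP method was run in this study, and the toy value function is an assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n):
    """Exact Shapley values for a set function value(frozenset) over n features.

    Each feature's value is the weighted average of its marginal
    contribution value(S | {i}) - value(S) over all subsets S that
    do not contain it.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(present):
    """Toy model output given the set of 'known' features.

    Feature 0 adds 2, feature 1 adds 1, and features 0 and 2 together add 4.
    """
    out = 0.0
    if 0 in present:
        out += 2.0
    if 1 in present:
        out += 1.0
    if 0 in present and 2 in present:
        out += 4.0
    return out


phi = shapley_values(toy_value, 3)
print(phi)  # the interaction term is shared equally between features 0 and 2
```

The values sum to the difference between the full‐feature output and the empty‐set output, which is the additivity property that gives SHAP its name.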

RESULTS

Patient population

Optum's deidentified CDM included 5,602,123 patients who had at least one claim for early symptomology and 12 months of continuous enrollment. After application of other study inclusion criteria and exclusion of patients without demographic data, 1339 patients were included in the PAH cohort. For the non‐PAH cohort, 3,824,922 patients with at least one early symptomology claim and 24 months of continuous enrollment were identified, which reduced to 64,648 patients following application of the inclusion criteria. Patient disposition for the PAH and non‐PAH cohorts is shown in Supporting Information: Figure S2. After propensity matching and downsampling to create a non‐PAH to PAH cohort ratio of approximately 3:1, the non‐PAH cohort included 4222 patients. Propensity scores before and after propensity matching are shown in Supporting Information: Figure S3.

Key patient demographics can be found in Table 1. As expected, due to propensity score matching, mean age of patients as well as sex and race/ethnicity ratios were similar in both cohorts. The mean ages of patients were 69 and 71 years for the PAH and non‐PAH cohorts, respectively. Approximately two‐thirds were female and around two‐thirds were White.

Table 1.

Key demographics for PAH and non‐PAH cohorts

Characteristic                   PAH cohort (N = 1339)   Non‐PAH cohort (N = 4222)
Mean age, years                  69                      71
Female sex, n (%)                849 (63.4)              2519 (59.7)
Ethnicity, n (%)
  White                          888 (66.3)              2824 (66.9)
  Black                          264 (19.7)              787 (18.6)
  Hispanic                       139 (10.4)              449 (10.6)
  Asian                          28 (2.1)                105 (2.5)
  Other/unknown/Missing          20 (1.5)                57 (1.4)
US geographic region, n (%)
  South Atlantic                 397 (29.6)              1368 (32.4)
  West South Central             203 (15.2)              629 (14.9)
  East North Central             204 (15.2)              596 (14.1)
  Mountain                       159 (11.9)              483 (11.4)
  Pacific                        118 (8.8)               421 (10.0)
  Middle Atlantic                83 (6.2)                238 (5.6)
  West North Central             78 (5.8)                224 (5.3)
  East South Central             57 (4.3)                157 (3.7)
  New England                    40 (3.0)                99 (2.3)
  Unknown/Missing                0                       7 (0.2)
Insurance type, n (%)
  Commercial                     448 (33.5)              1526 (36.1)
  Medicare                       891 (66.5)              2696 (63.9)
Education, n (%)
  <High school                   7 (0.5)                 26 (0.6)
  High school diploma            465 (34.7)              1434 (34.0)
  <Bachelor's degree             711 (53.1)              2276 (53.9)
  Bachelor's degree              155 (11.6)              482 (11.4)
  Unknown/Missing                1 (0.07)                4 (0.09)

Abbreviation: PAH, pulmonary arterial hypertension.

Model performance

The initial version of the model (including data from up to 12 months before the diagnosis date) confirmed that the algorithm could accurately identify patients with PAH at the time of diagnosis. The AUC was 0.94, recall (sensitivity) was 0.86, specificity (true negative rate) was 0.84 and precision of the model was 0.64 (Figure 1a).

Figure 1.

Performance of the machine‐learning model (a) at diagnosis and (b) at 6 months before diagnosis as assessed in the validation set. AUC, area under the curve; PAH, pulmonary arterial hypertension; ROC, receiver operating characteristic.

The main version of the model was designed to test the ability of the model to identify patients with PAH 6 months earlier in the diagnosis journey. All of the following results are focused on this earlier prediagnosis model. In this version of the model, data from the 6‐month period immediately before and including diagnosis were excluded. The performance of the model declined slightly but remained robust; AUC was 0.84, recall was 0.73, specificity was 0.78 and precision was 0.50 (Figure 1b).
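The reported recall, specificity, and precision can be cross‐checked against the cohort ratio, because precision depends on how many controls there are per case. The sketch below assumes an exact 3:1 ratio of non‐PAH to PAH patients in the evaluation data; the function name is hypothetical.

```python
def expected_precision(recall, specificity, controls_per_case):
    """Precision implied by recall, specificity, and the control-to-case ratio.

    Per PAH patient: true positives = recall, and false positives =
    (1 - specificity) * controls_per_case.
    """
    true_pos = recall
    false_pos = (1.0 - specificity) * controls_per_case
    return true_pos / (true_pos + false_pos)


print(round(expected_precision(0.86, 0.84, 3.0), 2))  # 0.64 (at-diagnosis model)
print(round(expected_precision(0.73, 0.78, 3.0), 2))  # 0.53 (prediagnosis model)
```

The at‐diagnosis figure matches the reported precision of 0.64. The prediagnosis figure is close to, but slightly above, the reported 0.50; the gap presumably reflects rounding of the published metrics and the exact composition of the validation set, although the article does not say.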

To reflect the very low incidence of PAH in the general population, we biased our population so that there were three times as many patients in the non‐PAH cohort (75%) as in the PAH cohort (25%). Thus, the probability of our model correctly identifying a patient with PAH by chance would be 25%. As accuracy is an inappropriate performance measure for imbalanced classification problems, we relied more on recall and precision. The AUC/ROC curve shows the overall model performance relative to a random chance model. A threshold was selected for the trade‐off of precision and recall, to maximize recall. Precision/recall curves for model training and validation are shown in Supporting Information: Figures S4 and S5, respectively.
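One way to choose a threshold that favors recall is to sweep the candidate thresholds and keep the one with the highest recall among those that still meet a minimum precision. The article does not state the exact rule used, so the precision floor, the function name, and the scores below are assumptions for illustration only.

```python
def pick_threshold(scores, labels, min_precision):
    """Return (recall, precision, threshold) with the highest recall
    among thresholds whose precision is at least min_precision.

    A patient is flagged when score >= threshold. Returns None if no
    threshold meets the precision floor. Assumes at least one positive label.
    """
    best = None
    for thr in sorted(set(scores)):
        flagged = [s >= thr for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        if prec >= min_precision and (best is None or rec > best[0]):
            best = (rec, prec, thr)
    return best


# Invented model scores and true labels (1 = PAH, 0 = non-PAH).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(pick_threshold(scores, labels, min_precision=0.6))
```

Lowering the floor admits lower thresholds, which raises recall at the cost of more false positives; this is the trade‐off the precision/recall curves display.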

The SHAP diagram (Figure 2a) and feature importance diagram (Figure 2b) illustrate the importance and directionality of the top 20 features that enable the model to distinguish between patients with PAH and non‐PAH patients. Information from the SHAP analysis was then further explored in post hoc descriptive analysis of these features (detailed in the following section).

Figure 2.

Importance and directionality of the top 20 features enabling the model to distinguish between patients with PAH and non‐PAH: (a) SHAP diagram; (b) feature importance diagram. On the SHAP diagram, high values of the features are shown in red and low values are shown in blue; for example, a longer period between the first symptom and diagnosis is shown in red and a shorter period is shown in blue. The SHAP scores, shown on the x‐axis, measure the impact of features on the model; negative values indicate that the feature is likely to be associated with non‐PAH (green area of the diagram), while positive values indicate that it is likely to be associated with PAH (purple area of the diagram). The further away from zero that the SHAP value is, the more impact it has on the model. The model was more likely to identify a patient with a long period between first symptom and diagnosis as having PAH and a patient with a short period between first symptom and diagnosis as being non‐PAH. Therefore, the red values (longer period) fall into the PAH (purple) area of the diagram and the blue values (shorter period) fall into the non‐PAH (green) area. CCI, Charlson Comorbidity Index; OOP, out‐of‐pocket; PAH, pulmonary arterial hypertension; SHAP, SHapley Additive exPlanations.

Features identifying patients with PAH at 6 months before diagnosis

The model successfully identified the key features associated with PAH at 6 months before diagnosis (Figure 3). More time between the first symptom and prediagnosis model date strongly predicted a PAH diagnosis. However, a substantial proportion of patients (PAH n = 535, non‐PAH n = 1763) had their first symptom less than 6 months before diagnosis. Consequently, for these patients, the time from first symptom to prediagnosis model date feature was grouped together as “less than 6 months.” Patients within the “less than 6 months” group were imputed with the value of −1 for their time from first symptom to prediagnosis model date. Interpretation of the period between first symptom and prediagnosis model date did not include imputed data. When looking at time from first symptom to prediagnosis model date, the mean time from first symptom to diagnosis was 403 days for the PAH cohort and 232 days for the non‐PAH cohort. The longer mean time from first symptom to diagnosis was mainly driven by the number of patients in the PAH cohort who experienced very long delays before diagnosis (Figure 4).
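The handling of the “less than 6 months” group can be sketched as a sentinel value: when the first symptom falls after the prediagnosis model date, the feature is set to −1, and summary statistics skip that sentinel. The dates and helper names below are hypothetical.

```python
from datetime import date

SHORT_HISTORY = -1  # sentinel for first symptom less than 6 months before diagnosis


def symptom_to_model_days(first_symptom: date, model_date: date) -> int:
    """Days from first symptom to the prediagnosis model date.

    Returns SHORT_HISTORY when the first symptom occurs after the model date.
    """
    delta = (model_date - first_symptom).days
    return delta if delta >= 0 else SHORT_HISTORY


def mean_excluding_imputed(values):
    """Mean of the observed values, ignoring the imputed sentinel."""
    observed = [v for v in values if v != SHORT_HISTORY]
    return sum(observed) / len(observed) if observed else None


# Hypothetical patients sharing one prediagnosis model date.
model_date = date(2018, 9, 15)
first_symptoms = [date(2017, 9, 15), date(2018, 12, 1), date(2018, 6, 15)]
feature = [symptom_to_model_days(d, model_date) for d in first_symptoms]
print(feature, mean_excluding_imputed(feature))
```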

Figure 3.

Profile of patients with PAH versus non‐PAH patients at 6 months before diagnosis. ᵃMean costs in US dollars. Out‐of‐pocket expense includes co‐pay plus deductible plus co‐insurance. Co‐pay was assessed with 6‐month look‐back and from start of year. Start‐of‐year expense was assessed to ensure that the variation in deductibles over the year was not a confounder. ᵇImaging procedures (i.e., electrocardiography, echocardiography [echo], X‐ray, ventilation perfusion scan, and transthoracic echo) in the 1‐year look‐back (18 months to 6 months before diagnosis). CCI, Charlson Comorbidity Index; OOP, out‐of‐pocket; PAH, pulmonary arterial hypertension.

Figure 4.

Time from first symptom to prediagnosis model date for PAH versus non‐PAH cohorts. Patient counts are shown. The long tail on the graph illustrates the high number of patients with PAH who experienced long delays before diagnosis. PAH, pulmonary arterial hypertension.

Worse overall health also consistently indicated PAH. The percentage of patients with a Charlson Comorbidity Index score of 0 (indicating a very low risk of 1‐year mortality, due to few or low severity comorbidities) was 71% in the non‐PAH cohort compared with 32% in the PAH cohort. Overall, the PAH cohort had a greater proportion of patients with higher Charlson Comorbidity Index scores compared with the non‐PAH cohort (Figure 5). A higher number of diagnostic and prescription claims was also indicative of PAH for both unique and total claims of each type. During the 1‐year look‐back period (from 18 months to 6 months before diagnosis), for the PAH versus non‐PAH cohort, the mean number of unique diagnostic codes was 36 versus 23, the mean number of total diagnostic codes was 224 versus 116, the mean number of unique prescription codes was 15 versus 10, and the mean number of total prescription codes was 60 versus 36.

Figure 5.

Charlson Comorbidity Index (CCI) scores in the 1‐year look‐back (18 months to 6 months before diagnosis). The CCI predicts the 1‐year mortality for a patient who may have a range of comorbid conditions, such as heart disease, AIDS, or cancer, and considers a total of 22 conditions. Higher scores indicate poorer health status and greater risk of 1‐year mortality. CCI, Charlson Comorbidity Index; PAH, pulmonary arterial hypertension.

A higher number of diagnostic claims related to circulatory system diseases (ICD codes I00–I99, inclusive) during the 1‐year look‐back period also indicated PAH; 71% of patients with PAH had a claim related to circulatory system diseases compared with 29% of non‐PAH patients, and there was a mean of 3.2 unique codes for patients with PAH versus 0.9 for non‐PAH patients. Claims related to circulatory system disease diagnoses accounted for 15% of the total number of claims among patients with PAH versus 5% among non‐PAH patients. The most common diagnoses in this category among patients with PAH were essential hypertension, heart failure, and coronary artery disease (Figure 6a).
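The circulatory‐claim features (total codes, unique codes, and share of all claims) can be derived from a patient's list of ICD‐10 diagnosis codes by testing for the I00–I99 range. The sketch below assumes ICD‐10 codes only and uses an invented claim list; the function names are hypothetical.

```python
def is_circulatory(icd10_code: str) -> bool:
    """True for ICD-10 codes in the range I00-I99 (letter I plus two digits)."""
    code = icd10_code.strip().upper()
    return len(code) >= 3 and code[0] == "I" and code[1:3].isdigit()


def circulatory_summary(codes):
    """Total and unique circulatory codes, and their share of all claims."""
    circ = [c.strip().upper() for c in codes if is_circulatory(c)]
    share = len(circ) / len(codes) if codes else 0.0
    return {"total": len(circ), "unique": len(set(circ)), "share": share}


# Invented diagnosis codes from one patient's look-back period.
claims = ["I10", "I10", "I50.9", "J45.909", "R06.02", "E11.9", "I25.10", "R53.83"]
print(circulatory_summary(claims))
```

Counting unique and total codes separately mirrors the distinction the model draws between how many different circulatory diagnoses a patient has and how often they are recorded.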

Figure 6.

(a) Percentage of patients (≥10%) with a unique circulatory code in the 1‐year look‐back (18 months to 6 months before diagnosis); (b) standard healthcare costs and out‐of‐pocket expenses; and (c) imaging procedures in the 1‐year look‐back (18 months to 6 months before diagnosis). Standard healthcare costs, defined as an estimate of the allowed amount for the facility charges related to the confinement, were assessed 12 months to 6 months before diagnosis. Out‐of‐pocket expense includes co‐pay plus deductible plus co‐insurance. Out‐of‐pocket expenses were assessed for two periods: (1) from the start of the calendar year up to 6 months before diagnosis; (2) from 12 months to 6 months before diagnosis. Expenses from the start of the calendar year were assessed to ensure that the variation in deductibles over the year was not a confounder. PAH, pulmonary arterial hypertension.

Higher healthcare resource utilization also predicted PAH in our model. Mean standard healthcare costs were US$333 for patients with PAH and US$281 for non‐PAH patients in the period from 12 months to 6 months before diagnosis (Figure 6b). For PAH and non‐PAH patients, respectively, mean out‐of‐pocket expenses were US$1200 and US$644 in the period from 12 months before diagnosis to 6 months before diagnosis, and US$1417 and US$751 from the start of the calendar year to 6 months before diagnosis (Figure 6b).

The number of imaging procedures also played an important role in the model's ability to predict PAH. Patients with PAH had more imaging procedures, that is, a mean of 6.5 imaging procedures in total and 2.6 unique imaging procedures, compared with a mean of 3.6 total procedures and 2.0 unique procedures for non‐PAH patients (Figure 6c).

Hospitalizations

The number of hospitalizations did not reliably predict PAH in the model at 6 months before diagnosis. However, patients with PAH were hospitalized more often than non‐PAH patients in the 18 months before their diagnosis, and the difference was particularly marked during the 6‐month period before diagnosis (when our model could already differentiate between PAH and non‐PAH patients), when 42% of patients with PAH were hospitalized compared with 21% of non‐PAH patients (Figure 7).

Figure 7.

Percentage of patients hospitalized during the 18 months before diagnosis. PAH, pulmonary arterial hypertension.

DISCUSSION

Earlier diagnosis of PAH is expected to improve patient well‐being and outcomes and to decrease healthcare utilization costs. Previous studies have demonstrated the benefit of early diagnosis and treatment on PAH outcomes and costs. 11 , 14 , 15 , 25 , 26

Our model correctly distinguished between patients with PAH and those without PAH at diagnosis and at 6 months before diagnosis. The AUC of the early identification model was 0.84, indicating very good discrimination between patients with PAH and non‐PAH patients. Recall (sensitivity) was 0.73, which shows that the model correctly identified 73% of patients with PAH. While precision was only 0.50 (i.e., when the model predicted PAH, it was correct 50% of the time), the model was designed to prioritize sensitivity, because the risks related to a false positive (e.g., undergoing some additional, unnecessary noninvasive PAH screening) are far lower than the risk associated with delayed diagnosis of PAH, a highly progressive, fatal disease. The results of the model are intended to flag patients who might benefit from further assessment and possibly a referral to a PAH specialist if additional clinical datapoints are consistent with the model flag.

Key features that identified patients with PAH were a longer time between the first symptom and diagnosis (5.6 months longer, on average, for patients with PAH than for non‐PAH patients for the full cohort), worse overall health, higher number of diagnostic and prescription codes, higher number of codes related to circulatory system diseases (ICD codes I00–I99, inclusive), higher healthcare resource utilization and costs, and a greater number of imaging procedures. Although the time from first symptom to prediagnosis model date was a strong predictor of PAH, as expected, other features also played a prominent role in the model (Figure 2b).

We also found that patients with PAH have a higher healthcare burden than those without PAH, even after removing the additional costs of screenings and procedures that occur late in the diagnostic journey. These findings are in line with those of a recent study that reported an average of 25 interactions with a hospital in the 3 years before PAH diagnosis. 27 It is well known that PAH incurs a high economic and healthcare resource burden, particularly in patients with more severe disease. 28 , 29 , 30

Although hospitalization did not predict PAH in the model at 6 months before diagnosis, our descriptive analysis illustrated that patients with PAH are hospitalized substantially more often than non‐PAH patients in the 18 months before diagnosis, with the PAH cohort having nearly double the hospitalization rate in every 6‐month window examined. In the 6‐month period immediately before diagnosis, this difference is particularly stark, with 42% of the PAH cohort hospitalized. Thus, shifting the diagnosis even 6 months earlier, as indicated in our model, could have a considerable impact on hospitalization rates. Moreover, PAH‐related hospitalization is associated with poor prognosis; for example, in‐hospital mortality for patients with PAH and right heart failure has been reported as 14% overall, but as high as 48% for those admitted to an intensive care unit. 31 , 32 This finding further underscores that reducing hospitalizations through earlier diagnosis could improve outcomes for patients with PAH.

The current study also highlighted the high rates of imaging procedures experienced by patients with PAH before receiving a confirmed diagnosis. While only about half of the study population had imaging procedure claims in the 1‐year look‐back shifted 6 months before the diagnosis date (therefore 18 months to 6 months before diagnosis), patients with PAH had more unique (2.6) and total (6.5) imaging procedure claims compared with non‐PAH patients (2.0 and 3.6, respectively). That patients with PAH had almost double the number of total imaging procedures versus non‐PAH patients, but still had longer times to diagnosis, underscores the need for improved screening processes and diagnostic procedures. Machine‐learning models for earlier diagnosis of PAH could also facilitate development and validation of newer diagnostic techniques. Several are in development, including cardiac magnetic resonance imaging, blood‐based biomarkers, exercise testing, and machine‐learning approaches to existing imaging techniques. 2 , 33

Machine‐learning algorithms using routinely collected patient data now exist to screen for disease or identify patients at high risk. 18 , 19 , 20 A machine‐learning model based on electronic health record data in England (Hospital Episode Statistics) has been shown to predict idiopathic PAH. 20 As in our own model, the UK study found that the timing and frequency of clinical specialty seen, frequency of comorbidities, and patient age were important variables. 20 A machine‐learning approach leveraging electronic health record data to identify patients likely to have pulmonary hypertension has also been reported. 34 Kogan et al. 34 used the Optum EHR database to create and train an algorithm to identify PH patients. Similar to our study, they created a PH cohort and a control cohort, all of whom had the same PH‐like symptoms, and applied an algorithm incorporating patient demographics, physician visits, diagnoses, procedures, prescriptions and laboratory test results to distinguish patients with PH from control patients. The model predicted PH with an AUROC of 0.92 that remained above 0.80 for the prediction of PH up to and beyond 18 months before diagnosis. This is comparable to our result of 0.84. Among the patients with PH, there were also subgroups of patients with PAH and patients with CTEPH, with AUROCs of 0.79–0.90 and 0.87–0.96, respectively. As with our algorithm, this model is yet to be externally validated and prospectively tested.

Another machine‐learning algorithm was published by Schuler et al. 35 They used patients with a physician‐confirmed diagnosis of PAH from a single health system to create a machine‐learning algorithm for identifying PAH in claims databases. In contrast to the aim of the Schuler and colleagues study, our focus was to identify patients with PAH sooner than is achieved by the typical current PAH diagnostic pathway, based on patients presenting with PAH‐like symptoms. The algorithm developed by Schuler and colleagues is designed to correctly identify PAH in claims‐based databases, whereas our algorithm is designed for system‐wide screening for patients likely to have PAH who should be referred to specialists sooner for further workup and confirmed diagnosis. Although the Schuler study applied different cohort selection criteria to an electronic medical record (EMR) database, its similar machine‐learning approach (random forest) yielded a sensitivity of 0.88, comparable to our at‐diagnosis model sensitivity of 0.86, which lends support to the sensitivity of 0.73 in our prediagnosis model.

However, there are limitations associated with the use of claims data in any study, including missing data and potential diagnosis inaccuracies. A further limitation is that, although we used a study population sample that included more non‐PAH patients than patients with PAH, the true prevalence of PAH in real‐world populations is substantially lower. The selected control cohort was required to be enrolled until at least 2 years after the first early symptom claim to ensure correct assignment. Analysis of the data indicated that the majority of patients received a diagnosis within that timeframe. It is possible that this requirement could exclude patients who continue to experience symptoms without diagnosis or include patients with PAH who have not received a diagnosis. The model was not intended to be deployed in the individual clinical setting; instead, it was developed to generate novel insights into the prediagnosis PAH population to be used within an integrated delivery network to identify patients who would benefit from further referral. A machine‐learning model designed for broader clinical application would need to be validated in another claims data set that has access to clinical notes to ensure that it is applicable outside the CDM. In addition, it would need to incorporate the well‐known imbalances in the PAH population such as age and sex, as these are valuable predictors that we excluded through propensity score matching to allow more novel scientific insights.
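The prevalence limitation noted above can be illustrated with Bayes' rule, which relates precision (positive predictive value) to sensitivity, specificity, and prevalence. In the sketch below, the sensitivity is the reported prediagnosis recall of 0.73, whereas the specificity and the prevalence values are hypothetical placeholders chosen for illustration only; they are not study results.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (positive predictive value) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


SENSITIVITY = 0.73  # reported recall of the prediagnosis model
SPECIFICITY = 0.80  # hypothetical value, not reported in this section

# Hypothetical prevalences: an enriched study-like sample versus rarer settings.
for prevalence in (0.25, 0.01, 0.0001):
    value = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.4f} -> precision={value:.4f}")
```

For fixed sensitivity and specificity, precision falls as prevalence falls, which is why performance measured in an enriched cohort would be expected to overstate precision in a general population.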

Despite these limitations, machine‐learning‐based approaches are well suited for evaluating data sets from healthcare claims databases. Specifically, our model was able to incorporate categorical and continuous variables, as well as high‐dimensional and nonlinear data. Additionally, this work leverages data from a large administrative claims database, making it highly generalizable to the overall US population. This data set also has the benefit of providing information related to cost and healthcare resource utilization before diagnosis, in addition to diagnosis and procedural information.

In conclusion, our model was able to distinguish between patients with PAH and those without PAH at 6 months before diagnosis. The performance of our model illustrates the feasibility of identifying patients at a population level who might warrant further PAH‐specific screening, and suggests that claims data can identify features, beyond current established factors, that indicate PAH before a confirmed diagnosis. Patients with PAH face higher costs and disease burden than those without PAH even before diagnosis; therefore, earlier diagnosis of PAH might not only improve patient outcomes through timelier intervention, but may also help to decrease overall costs to patients, health systems, and payers. The ideal use of this model is implementation by integrated delivery network healthcare providers for early identification of patients with PAH, a rare but chronic disease with significant healthcare utilization, expensive medications, and high‐cost specialty care. 36 This is especially important as healthcare providers and payers brace for waves of delayed and deferred care that have started with the COVID‐19 pandemic. 37

AUTHOR CONTRIBUTIONS

Concept and design: Katherine C. Bettencourt, Bethany Hyde, Carly J. Paoli, and Sumeet Panjabi. Acquisition, analysis, and interpretation of data: Katherine C. Bettencourt, Bethany Hyde, Carly J. Paoli, Sumeet Panjabi, Karimah S. Bell Lynum, and Mona Selej. Drafting of the manuscript: Katherine C. Bettencourt and Bethany Hyde.

CONFLICT OF INTEREST STATEMENT

All authors are/were employees of Janssen Pharmaceuticals US, Inc., Titusville, NJ, USA at the time of study analysis and design and manuscript development.

ETHICS STATEMENT

Not applicable.

Supporting information

Supporting information.

ACKNOWLEDGMENTS

Medical writing support was provided by Mary Greenacre and Ify Sargeant on behalf of Twist Medical, and was funded by Actelion Pharmaceuticals US, Inc., a Janssen Pharmaceutical Company of Johnson & Johnson.

Hyde B, Paoli CJ, Panjabi S, Bettencourt KC, Bell Lynum KS, Selej M. A claims‐based, machine‐learning algorithm to identify patients with pulmonary arterial hypertension. Pulm Circ. 2023;13:e12237. 10.1002/pul2.12237

DATA AVAILABILITY STATEMENT

The data sharing policy of the Janssen Pharmaceutical Companies of Johnson & Johnson is available at: https://www.janssen.com/clinical-trials/transparency. As noted on this site, requests for access to the study data can be submitted through the Yale Open Data Access (YODA) Project site at: http://yoda.yale.edu.

REFERENCES

  • 1. Humbert M, Gerry Coghlan J, Khanna D. Early detection and management of pulmonary arterial hypertension. Eur Respir Rev. 2012;21(126):306–12.
  • 2. Kiely DG, Lawrie A, Humbert M. Screening strategies for pulmonary arterial hypertension. Eur Heart J Suppl. 2019;21(Suppl K):K9–20.
  • 3. Gibbs JSR. Making a diagnosis in PAH. Eur Respir Rev. 2007;16:8–12. 10.1183/09059180.00010203
  • 4. Humbert M, Sitbon O, Chaouat A, Bertocchi M, Habib G, Gressin V, Yaici A, Weitzenblum E, Cordier JF, Chabot F, Dromer C, Pison C, Reynaud‐Gaubert M, Haloun A, Laurent M, Hachulla E, Simonneau G. Pulmonary arterial hypertension in France: results from a national registry. Am J Respir Crit Care Med. 2006;173(9):1023–30.
  • 5. Armstrong I, Harries C, Yorke J. The Impahct survey: living with pulmonary arterial hypertension. Am J Respir Crit Care Med [Internet]. 2011;183:A6130. https://www.atsjournals.org/doi/abs/10.1164/ajrccm-conference.2011.183.1_MeetingAbstracts.A6130
  • 6. Armstrong I, Rochnia N, Harries C, Bundock S, Yorke J. The trajectory to diagnosis with pulmonary arterial hypertension: a qualitative study. BMJ Open. 2012;2(2):e000806.
  • 7. Strange G, Gabbay E, Kermeen F, Williams T, Carrington M, Stewart S, Keogh A. Time from symptoms to definitive diagnosis of idiopathic pulmonary arterial hypertension: the delay study. Pulm Circ. 2013;3(1):89–94.
  • 8. Khou V, Anderson JJ, Strange G, Corrigan C, Collins N, Celermajer DS, Dwyer N, Feenstra J, Horrigan M, Keating D, Kotlyar E, Lavender M, McWilliams TJ, Steele P, Weintraub R, Whitford H, Whyte K, Williams TJ, Wrobel JP, Keogh A, Lau EM. Diagnostic delay in pulmonary arterial hypertension: insights from the Australian and New Zealand pulmonary hypertension registry. Respirology. 2020;25(8):863–71.
  • 9. Weatherald J, Humbert M. The ‘great wait’ for diagnosis in pulmonary arterial hypertension. Respirology. 2020;25(8):790–2.
  • 10. Humbert M, Kovacs G, Hoeper MM, Badagliacca R, Berger RMF, Brida M, Carlsen J, Coats AJS, Escribano‐Subias P, Ferrari P, Ferreira DS, Ardeschir Ghofrani H, Giannakoulas G, Kiely DG, Mayer E, Meszaros G, Nagavci B, Olsson KM, Pepke‐Zaba J, Quint JK, Radegran G, Simonneau G, Sitbon O, Tonia T, Toshner T, Vachiery J‐L, Vonk Noordegraaf A, Delcroix M, Rosenkranz S, the ESC/ERS Scientific Document Group. 2022 ESC/ERS guidelines for the diagnosis and treatment of pulmonary hypertension. Eur Resp J. 2023;61:2200879.
  • 11. Humbert M, Sitbon O, Chaouat A, Bertocchi M, Habib G, Gressin V, Yaïci A, Weitzenblum E, Cordier J, Chabot F, Dromer C, Pison C, Reynaud‐Gaubert M, Haloun A, Laurent M, Hachulla E, Cottin V, Degano B, Jaïs X, Montani D, Souza R, Simonneau G. Survival in patients with idiopathic, familial, and anorexigen‐associated pulmonary arterial hypertension in the modern management era. Circulation. 2010;122(2):156–63.
  • 12. Thenappan T, Shah SJ, Rich S, Tian L, Archer SL, Gomberg‐Maitland M. Survival in pulmonary arterial hypertension: a reappraisal of the NIH risk stratification equation. Eur Respir J. 2010;35(5):1079–87.
  • 13. Launay D, Sitbon O, Hachulla E, Mouthon L, Gressin V, Rottat L, Clerson P, Cordier JF, Simonneau G, Humbert M. Survival in systemic sclerosis‐associated pulmonary arterial hypertension in the modern management era. Ann Rheum Dis. 2013;72(12):1940–6.
  • 14. Dimopoulos K, Inuzuka R, Goletto S, Giannakoulas G, Swan L, Wort SJ, Gatzoulis MA. Improved survival among patients with Eisenmenger syndrome receiving advanced therapy for pulmonary arterial hypertension. Circulation. 2010;121(1):20–5.
  • 15. Launay D, Sitbon O, Le Pavec J, Savale L, Tcherakian C, Yaici A, Achouh L, Parent F, Jais X, Simonneau G, Humbert M. Long‐term outcome of systemic sclerosis‐associated pulmonary arterial hypertension treated with bosentan as first‐line monotherapy followed or not by the addition of prostanoids or sildenafil. Rheumatology. 2010;49(3):490–500.
  • 16. Bohr A, Memarzadeh K. The rise of artificial intelligence in healthcare applications. In: Bohr A, Memarzadeh K, editors. Artificial intelligence in healthcare. Academic Press; 2020. p. 25–60.
  • 17. Aung YYM, Wong DCS, Ting DSW. The promise of artificial intelligence: a review of the opportunities and challenges of artificial intelligence in healthcare. Br Med Bull. 2021;139(1):4–15.
  • 18. Lai H, Huang H, Keshavjee K, Guergachi A, Gao X. Predictive models for diabetes mellitus using machine learning techniques. BMC Endocr Disord. 2019;19(1):101.
  • 19. Kwon J, Kim K‐H, Jeon K‐H, Lee SE, Lee HY, Cho HJ, Choi JO, Jeon ES, Kim MS, Kim JJ, Hwang KK, Chae SC, Baek SH, Kang SM, Choi DJ, Yoo BS, Kim KH, Park HY, Cho MC, Oh BH. Artificial intelligence algorithm for predicting mortality of patients with acute heart failure. PLoS One. 2019;14(7):e0219302.
  • 20. Kiely DG, Doyle O, Drage E, Jenner H, Salvatelli V, Daniels FA, Rigg J, Schmitt C, Samyshkin Y, Lawrie A, Bergemann R. Utilising artificial intelligence to determine patients at risk of a rare disease: idiopathic pulmonary arterial hypertension. Pulm Circ. 2019;9(4):1–9.
  • 21. Sprecher VP, Didden EM, Swerdel JN, Muller A. Evaluation of code‐based algorithms to identify pulmonary arterial hypertension and chronic thromboembolic pulmonary hypertension patients in large administrative databases. Pulm Circ. 2020;10(4):1–10.
  • 22. Memon HA, Park MH. Pulmonary arterial hypertension in women. Methodist Debakey Cardiovasc J. 2017;13(4):224–37.
  • 23. Deyo R. Adapting a clinical comorbidity index for use with ICD‐9‐cm administrative databases. JCE. 1992;45(6):613–9.
  • 24. Lundberg S, Lee S. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems [Internet]. 2017 December [cited 2023 Apr 13]. 4768–77. Available from: https://dl.acm.org/doi/10.5555/3295222.3295230
  • 25. Burger CD, Ghandour M, Padmanabhan Menon D, Helmi H, Benza RL. Early intervention in the management of pulmonary arterial hypertension: clinical and economic outcomes. ClinicoEconomics Outcomes Res. 2017;9:731–9.
  • 26. Tran‐Duy A, Morrisroe K, Clarke P, Stevens W, Proudman S, Sahhar J, Nikpour M. Cost‐effectiveness of combination therapy for patients with systemic sclerosis‐related pulmonary arterial hypertension. J Am Heart Assoc. 2021;10(7):e015816.
  • 27. Bergemann R, Allsopp J, Jenner H, Daniels FA, Drage E, Samyshkin Y, Schmitt C, Wood S, Kiely DG, Lawrie A. High levels of healthcare utilization prior to diagnosis in idiopathic pulmonary arterial hypertension support the feasibility of an early diagnosis algorithm: the SPHInX project. Pulm Circ. 2018;8(4):1–9.
  • 28. Dufour R, Pruett J, Hu N, Lickert C, Stemkowski S, Tsang Y, Lane D, Drake W. Healthcare resource utilization and costs for patients with pulmonary arterial hypertension: real‐world documentation of functional class. J Med Econ. 2017;20(11):1178–86.
  • 29. Exposto F, Hermans R, Nordgren Å, Taylor L, Sikander Rehman S, Ogley R, Davies E, Yesufu‐Udechuku A, Beaudet A. Burden of pulmonary arterial hypertension in England: retrospective HES database analysis. Ther Adv Respir Dis. 2021;15:175346662199504.
  • 30. Zozaya N, Abdalla F, Casado Moreno I, Crespo‐Diz C, Ramírez Gallardo AM, Rueda Soriano J, Alcalá Galán M, Hidalgo‐Vega Á. The economic burden of pulmonary arterial hypertension in Spain. BMC Pulm Med. 2022;22(1):105.
  • 31. Campo A, Mathai SC, Le Pavec J, Zaiman AL, Hummers LK, Boyce D, Housten T, Lechtzin N, Chami H, Girgis RE, Hassoun PM. Outcomes of hospitalisation for right heart failure in pulmonary arterial hypertension. Eur Respir J. 2011;38(2):359–67.
  • 32. Huynh TN, Weigt SS, Sugar CA, Shapiro S, Kleerup EC. Prognostic factors and outcomes of patients with pulmonary hypertension admitted to the intensive care unit. J Crit Care. 2012;27(6):739.e7–13.
  • 33. Deshwal H, Weinstein T, Sulica R. Advances in the management of pulmonary arterial hypertension. J Investig Med. 2021;69(7):1270–80.
  • 34. Kogan E, Didden E‐M, Lee E, Nnewihe A, Stamatiadis D, Mataraso S, Quinn D, Rosenberg D, Chehoud C, Bridges C. A machine learning approach to identifying patients with pulmonary hypertension using real‐world electronic health records. Int J Cardiol. 2023;374:95–9.
  • 35. Schuler KP, Hemnes AR, Annis J, Farber‐Eger E, Lowery BD, Halliday SJ, Brittain EL. An algorithm to identify cases of pulmonary arterial hypertension from the electronic medical record. Respir Res. 2022;23(1):138.
  • 36. Hanover L. Artificial intelligence saves payers time and money. Manag Healthc Exec [Internet]. 2021;31(8):18–20. https://www.managedhealthcareexecutive.com/view/artificial-intelligence-saves-payers-time-and-money
  • 37. King R. How AI can help payers navigate a coming wave of delayed and deferred care [Internet]. 2020 13 Aug [cited 2023 Apr 13]. Available from: https://www.fiercehealthcare.com/payer/how-ai-can-help-payers-navigate-a-coming-wave-delayed-and-deferred-care

Articles from Pulmonary Circulation are provided here courtesy of Wiley