Abstract
Background
An algorithm to classify heart failure (HF) endpoints inclusive of contemporary measures of biomarkers and echocardiography was recently proposed by an international expert panel. Our objective was to assess agreement of HF classification by this contemporaneous algorithm with that by a standardized physician reviewer panel, when applied to data abstracted from community-based hospital records.
Methods and Results
During 2005-2007, all hospitalizations were identified from four U.S. communities under surveillance as part of the Atherosclerosis Risk in Communities (ARIC) study. Potential HF hospitalizations were sampled by ICD discharge codes and demographics from men and women aged 55 years and older. The HF classification algorithm was automated and applied to 2,729 hospitalizations (N=13,854 weighted) in which either BNP measures or ejection fraction were documented (mean age 75 years). There were 1,403 (54%, N=7,534 weighted) events classified as acute decompensated HF (ADHF) by the automated algorithm, and 1,748 (68%, N=9,376 weighted) such events by the ARIC reviewer panel. The chance-corrected agreement between ADHF by physician reviewer panel and the automated algorithm was moderate (Kappa=0.39). Sensitivity and specificity of the automated algorithm with the ARIC reviewer panel as the referent standard were 0.68 (95% CI, 0.67 - 0.69) and 0.75 (95% CI, 0.74 - 0.76), respectively.
Conclusions
Although the automated classification improved efficiency and decreased costs, its accuracy in classifying HF hospitalizations was modest compared to a standardized physician reviewer panel.
Keywords: heart failure, classification, BNP, ejection fraction, ARIC
Clinical research and epidemiologic studies of heart failure (HF) have been hindered by the lack of a consensus definition of HF as an event or endpoint that is valid, repeatable and cost-effective 1-4. The pleomorphic nature of the HF syndrome contributes to the difficulty in defining and classifying HF. HF manifestations can be vague, as well as shared with other conditions that are often comorbid with HF, such as respiratory and renal disease5. Thus, the current gold standard for HF classification is expert review of medical records and adjudication, 6, 7 although classification of HF by an expert reviewer panel is subject to more misclassification than for events such as myocardial infarction and stroke. A standardized and repeatable event review by a reviewer panel is expensive and time consuming, and thus not practical for most studies, and further difficulties include the use of diverse classification schema. Although the Framingham, modified Boston and NHANES classification schema are widely used, their relevance to contemporary classifications of HF events is questionable4, since most extant HF classification schema were created prior to the clinical use of biomarkers and cardiac imaging in HF diagnosis and care. Furthermore, they largely do not consider whether an HF event is new, or decompensated2.
To develop a contemporary definition of HF, an international group of cardiovascular clinical trialists, biostatisticians, National Institutes of Health (NIH) scientists, regulators, and pharmaceutical industry scientists published recommendations for an updated classification of HF for clinical trials and observational studies of HF4. Extending prior HF classifications, biomarker and echocardiographic information was included, and the distinction between 3 types of HF events was emphasized (Table 1). The 3 types of HF events are: a new diagnosis of HF, a new event without prior HF, and a new event with a history of HF. The first two groups largely differ by severity and setting of presentation. Henceforth we refer to this expert panel as the Cardiovascular Clinical Trialists (CCT) Workshop4 and to their proposed HF event definition as the HF algorithm. As far as we know, this algorithm has yet to be implemented or evaluated. We therefore operationalized and automated a modified version of the HF algorithm proposed by the CCT for hospitalized events of HF, regardless of history of HF. We examined its performance characteristics on data abstracted by trained personnel from medical records of a population-based sample of HF hospitalizations in four U.S. communities. Hospitalizations included all men and women aged 55 years and older with ICD-coded discharge diagnoses related to HF during 2005-2007 in these areas8. We tested the concordance of this automated HF classification algorithm with an established panel of standardized physician reviewers of the Atherosclerosis Risk in Communities (ARIC) study8.
Table 1. Cardiovascular Clinical Trialists' (CCT) definitions of heart failure for a new diagnosis, new event, or recurrent event, followed by the adapted automated algorithm used in this study (adapted based on the highlighted column).
| | 1. New onset HF as a diagnosis | 2. HF as a new event* | 3. HF as an event | Adapted automated algorithm† |
|---|---|---|---|---|
| History of HF | No | No | Yes | Yes or No |
| HF Signs and Symptoms | +/-* | Yes | Yes | Yes |
| Treatment for HF | Yes* | Yes | Yes | Yes |
| Imaging and biomarkers | Yes | Yes | No | Yes |
*For new onset HF as a diagnosis, treatment must be for HF symptoms, but there is no requirement that at least 2 symptoms be present, as there is for the other 2 categories.
†Modified to include all 4 categories of “HF as a new event” (column 2), extended to those with or without a prior history of HF documented in the medical record. Outpatient visits are not included in the study sample.
Methods
Automated Classification of HF
To examine the applicability and usefulness of the HF algorithm in population settings, we ascertained the HF classification criteria items from hospital medical records, so that the approach is applicable to clinical research studies using electronic health records (EHR) and to epidemiologic surveillance studies. Accordingly, we deferred classification of HF according to history of HF and instead examined performance of a modified version of the HF algorithm that does not consider HF history. We modified the criteria identified by the CCT as “HF as a new event,” to achieve wider interest and applicability (Table 1). If a classification algorithm performs sufficiently well, the distinction of events according to their prevalent or incident nature is typically done as an analytic step and not as an event classification category. Furthermore, we did not include death due to HF. See Supplemental Methods Section for details. Given that the purpose of this study is to test the automated algorithm in the real-world setting of hospital medical records, there was neither an echocardiogram reading center nor a central laboratory for the measurement of BNP. Hospital records that made no reference to measures of either ejection fraction or BNP/NT-proBNP were considered not indicative of HF for the missing measure. Records missing both BNP and ejection fraction measures were excluded to preserve the validity of the comparison.
Study Population
The Atherosclerosis Risk in Communities (ARIC) study has conducted population-based retrospective surveillance for coronary heart disease since 1987 8. HF has been a target for community surveillance in ARIC since 2005, based on a sample of hospital discharges in four geographically defined areas in the U.S., for all residents aged 55 years and older 9. Because ARIC began automatically classifying some of the eligible hospitalizations in 2008, we limit this analysis to 2005-2007. The four ARIC study areas are the city of Jackson, Mississippi; Washington County, Maryland; eight northwestern suburbs of Minneapolis, Minnesota; and Forsyth County, North Carolina. In 2005, these four regions had an overall population of 177,000 aged 55 and older. Non-black and non-white race groups are excluded due to small numbers. The institutional review boards from each study site approved the ARIC study.
Ascertainment of Hospitalizations for Heart Failure
Annually, lists of hospital discharges meeting a target list of International Classification of Disease, 9th Revision, Clinical Modification (ICD-9-CM) codes were obtained from the hospitals in the 4 ARIC communities (31 hospitals in 2005). See Supplemental Table 1 for a list of targeted HF ICD-9-CM codes. For 91% of the sample, a ‘428’ for “congestive HF” was listed as one of the codes. For all community residents aged 55 years and older, hospitalizations were sampled using stratified probabilistic sampling by HF ICD-9-CM code, age, gender, race and area of residence in the community. Sampling probabilities by strata were selected to optimize variance estimates for HF event rates within strata, and were based on the pre-specified maximum number of events planned for data abstraction9. Results are weighted for these sampling probabilities to maintain population estimates for the distribution of ICD codes and other factors that may affect concordance.
Abstraction and Classification of Heart Failure Events
Medical records were abstracted by trained study personnel following a standardized protocol. Each record was first abstracted to answer 6 screening questions for ADHF; if any of the answers were positive a full abstraction ensued. The 6 screening items included mention of any of the following: increasing or new onset shortness of breath, peripheral edema, paroxysmal dyspnea, orthopnea, hypoxia, or HF as a cause for hospitalization. Of all records with a HF ICD code, 36% did not meet the screening criteria and were not abstracted in full, and were not included in these analyses. A separate analysis examined the effect of this efficiency-based screening in a subset of 797 medical records, based on a full data abstraction for medical records that would have been screened out. We found that 48% (N=386) had either BNP or a measure of ejection fraction and thus would have qualified for analysis. Of the 386 medical records with biomarker or imaging information, 11.7% were found to have definite or possible ADHF per ARIC reviewer panel. In comparison, 68% of the records fully abstracted for this study had definite or possible HF by ARIC reviewer panel. Thus, screening prior to full record abstraction was effective in yielding a low number of false negatives.
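The screening step described above can be sketched as a simple any-of rule. This is a minimal illustration only: the field names are hypothetical and are not taken from the ARIC abstraction form, and items that are not mentioned in a record are assumed to count as negative.

```python
# Hypothetical field names for the six ADHF screening items listed in the text.
SCREENING_ITEMS = (
    "shortness_of_breath",            # increasing or new onset
    "peripheral_edema",
    "paroxysmal_dyspnea",
    "orthopnea",
    "hypoxia",
    "hf_reason_for_hospitalization",
)


def needs_full_abstraction(record: dict) -> bool:
    """Return True if any screening item is positive.

    Assumption: an item absent from the record is treated as negative.
    """
    return any(bool(record.get(item, False)) for item in SCREENING_ITEMS)


# Share of the 797 screened-out sub-study records that had BNP or ejection fraction.
share_with_biomarker_or_ef = 386 / 797
```

A record with a single positive item, such as `{"orthopnea": True}`, would proceed to full abstraction, while a record with no positive items would be screened out.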
Full record abstraction using the heart failure abstraction form comprehensively incorporated the pertinent elements for classification of HF, and history of comorbid conditions as described previously (abstraction form available at http://drupal.cscc.unc.edu/aric/hf-forms).9 A computer-based classification was applied to the abstracted data to arrive at the appropriate classification for the CCT automated algorithm (Table 1 and Table 2). Secondarily, conventional HF criteria (Framingham10, Boston11, NHANES12, and Gothenburg13) were also defined from abstracted data (results are presented in the Supplement).9 Eligible hospitalizations were independently reviewed by one or two trained physician reviewer(s), with resolution of disagreements by an adjudicator. Physicians followed ARIC HF classification guidelines when evaluating medical records, and applied judgment to arrive at a classification of definite acute decompensated HF (ADHF), possible ADHF, chronic HF, HF unlikely, or unclassifiable9. Here definite and possible ADHF have been combined into a single category of ‘ADHF present’, and the other 3 categories have been combined as ‘ADHF absent’.
Table 2. Adapted automated Cardiovascular Clinical Trialists' (CCT) algorithm for a hospitalized event of ADHF (either new or recurrent). All 3 criteria elements must be met to define a heart failure event.
| Criterion | Definition |
|---|---|
| 1) Signs and Symptoms | Presence of ≥ 2 HF signs or symptoms among the following: shortness of breath or dyspnea on exertion, orthopnea, paroxysmal nocturnal dyspnea, fatigue or reduced exercise tolerance, pulmonary edema, rales, peripheral edema, JVD, S3, hepatojugular reflux, altered hemodynamics, cardiomegaly |
| 2) Treatment | Initiation or increase in treatment with loop diuretics, or IV vasoactive agents. The CCT criteria specify that this treatment should be specifically for the above symptoms; however, our abstraction only confirms that such treatment was provided during this hospitalization |
| 3) Biomarkers and Imaging | At least one of the following: |
Note: Values are shown in SI units (ng/L = pg/mL).
Elevated NT-proBNP defined as: if <50 years then ≥ 450 ng/L; if 50-75 years then ≥ 900 ng/L; if >75 years then ≥ 1800 ng/L. Moderately elevated NT-proBNP defined with 300 ng/L as the bottom cutpoint for all age groups: if <50 years then 300-450 ng/L; if 50-75 years then 300-900 ng/L; if >75 years then 300-1800 ng/L.
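The age-specific NT-proBNP cutpoints and the all-three-criteria rule of Table 2 can be sketched as follows. This is an illustrative sketch, not the study's implementation. It assumes that the oldest age band means older than 75 years and that a value exactly at an upper cutpoint counts as elevated. Because the full list of qualifying biomarker and imaging findings is not reproduced here, criterion 3 is passed in as a pre-computed boolean; all function and parameter names are hypothetical.

```python
def nt_probnp_category(age_years: float, nt_probnp_ng_per_l: float) -> str:
    """Classify NT-proBNP with the age-specific cutpoints from the Table 2 footnote."""
    if age_years < 50:
        elevated_cutpoint = 450
    elif age_years <= 75:
        elevated_cutpoint = 900
    else:  # assumption: oldest band is >75 years
        elevated_cutpoint = 1800

    if nt_probnp_ng_per_l >= elevated_cutpoint:
        return "elevated"
    if nt_probnp_ng_per_l >= 300:  # common lower cutpoint for all age groups
        return "moderately elevated"
    return "not elevated"


def adhf_by_algorithm(n_signs_symptoms: int,
                      hf_treatment: bool,
                      biomarker_or_imaging: bool) -> bool:
    """All 3 criteria elements must be met to define an ADHF event."""
    return n_signs_symptoms >= 2 and hf_treatment and biomarker_or_imaging
```

For example, an NT-proBNP of 950 ng/L would be classified as elevated at age 60 but only moderately elevated at age 80 under these cutpoints.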
Classification of Heart Failure Events in ARIC
The ARIC classification guidelines have been described 9. Classification of definite acute decompensated HF (ADHF) required clear evidence of HF with active decompensation, and the presence of HF with certainty as to the cause of the presentation. Possible acute decompensated HF included criteria similar to definite ADHF, without as much certainty that HF is the cause of the presentation. A classification of chronic HF applied to a history of HF that was not decompensated.
Statistical Analysis
All estimates were weighted to account for the sampling design and to maintain the population distribution of ICD codes and other factors that may affect concordance. We cannot reliably link hospitalizations to identify repeat events; therefore, all hospitalizations are assumed to be independent. The positive and negative agreement, the kappa coefficient, and the prevalence and bias adjusted Kappa (PABAK) were calculated relating the automated algorithm to the ARIC reviewer panel classifications. The PABAK was calculated because the prevalences of positive and negative tests were not balanced, which can result in a Kappa with low reliability even when observed agreement is good14, 15. Measures of validity were calculated for the components of the automated algorithm individually and for the schema overall. Positive and negative predictive values were calculated for several different disease prevalences16. Formulas are specified in the table footnotes.
Results
There were 2,729 sampled hospitalizations eligible for review during 2005-2007, which resulted in 15,484 events after applying weights to account for sampling fractions. The tables and their discussion refer to the weighted number of events. Of these, 10.5% (N=1,630 weighted) were missing BNP measures and ejection fraction and thus were excluded, leaving a sample of 13,854 for this analysis. Of those classified as ADHF by the automated algorithm, 85% (69% + 16%) were classified as definite or possible ADHF by the ARIC reviewer panel (See Table 3, with unweighted numbers in Supplemental Table 2). Of those classified as not having ADHF by the automated algorithm, 47% were classified as ADHF and 20% as chronic HF by ARIC panel review.
Table 3. Hospitalizations* during 2005-2007 among residents ages 55 years and older of four U.S. communities, identified as possible heart failure, according to the automated HF classification algorithm and the ARIC reviewer panel.
| ARIC panel HF classification | Numbers before exclusion, N* (%) | Numbers after exclusion† due to missing BNP and ejection fraction, N* (%) | Automated algorithm: ADHF present, N* (%) | Automated algorithm: ADHF absent, N* (%) |
|---|---|---|---|---|
| ADHF present | | | | |
| Definite HF | 6,888 (45%) | 6,813 (49%) | 5,211 (69%) | 1,602 (25%) |
| Possible HF | 2,864 (19%) | 2,563 (19%) | 1,200 (16%) | 1,363 (22%) |
| ADHF absent | | | | |
| Chronic HF | 2,005 (13%) | 1,738 (13%) | 464 (6%) | 1,274 (20%) |
| Not HF | 2,389 (15%) | 1,811 (13%) | 358 (5%) | 1,453 (23%) |
| Unclassifiable | 1,338 (8%) | 929 (7%) | 301 (4%) | 629 (10%) |
| Total | 15,484 (100%) | 13,854 (100%) | 7,534 (100%) | 6,321 (100%) |
*All numbers were weighted to account for sampling fractions.
†Overall, 10.5% were excluded because both BNP and ejection fraction values were missing.
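The column percentages quoted in the Results can be recovered directly from the weighted counts in Table 3. The short sketch below is a worked arithmetic check, with variable names chosen for illustration only.

```python
# Weighted counts from Table 3:
# ARIC category -> (automated ADHF present, automated ADHF absent)
table3 = {
    "Definite HF": (5211, 1602),
    "Possible HF": (1200, 1363),
    "Chronic HF": (464, 1274),
    "Not HF": (358, 1453),
    "Unclassifiable": (301, 629),
}
ARIC_ADHF_PRESENT = ("Definite HF", "Possible HF")

# Column totals for the automated algorithm.
present_total = sum(present for present, _ in table3.values())
absent_total = sum(absent for _, absent in table3.values())

# Share of algorithm-positive events that the ARIC panel also called ADHF.
share_confirmed = sum(table3[k][0] for k in ARIC_ADHF_PRESENT) / present_total

# Share of algorithm-negative events that the ARIC panel called ADHF.
share_missed = sum(table3[k][1] for k in ARIC_ADHF_PRESENT) / absent_total

# Share of all weighted hospitalizations excluded for missing BNP and ejection fraction.
share_excluded = 1630 / 15484
```

These reproduce the 85% and 47% figures reported in the text, and the 10.5% exclusion.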
Overall, characteristics of patients with ADHF per the ARIC reviewer panel and the automated algorithm did not differ appreciably (Supplemental Table 3). In each group the mean age was 75 years, with 51-52% women, and 28-30% African Americans. Hypertension (83%) and diabetes (46-48%) were common for both groups.
Table 4 shows the characteristics of those classified with agreement and disagreement when comparing the automated algorithm to the ARIC reviewer panel. The overt differences between groups were few, but informative. The frequency of end stage renal disease was highest (34%) in those without ADHF by both criteria, and next highest (21%) in those with ADHF per the automated algorithm but not by the ARIC reviewer panel. The mean levels of BNP and NT-proBNP were visibly lower in the group classified as ADHF absent by the automated algorithm but present according to the ARIC reviewer panel. Those given IV diuretics were more likely to be classified with agreement as ADHF present (85% of those classified as ADHF present by both methods, as compared to 55-69% for those misclassified).
Table 4. Characteristics of hospitalizations according to agreement between the automated HF classification algorithm and the ARIC reviewer panel for the classification of Acute Decompensated HF (ADHF) for hospitalizations identified as eligible for review as possible HF†.
| | Disagreement | Disagreement | Agreement | Agreement |
|---|---|---|---|---|
| ARIC Reviewer Panel | ADHF present | ADHF absent | ADHF present | ADHF absent |
| Automated algorithm | ADHF absent | ADHF present | ADHF present | ADHF absent |
| Weighted* number | N=2,966 | N=1,123 | N=6,411 | N=3,356 |
| Demographics (%, unless stated) | | | | |
| Age, mean (SD) | 75 (24) | 76 (23) | 75 (24) | 73 (21) |
| African-American | 24 | 28 | 30 | 19 |
| Women | 52 | 61 | 51 | 54 |
| Teaching Hospital | 43 | 34 | 34 | 42 |
| Comorbidities (%) | | | | |
| Coronary heart disease | 45 | 48 | 47 | 50 |
| Diabetes mellitus | 52 | 44 | 47 | 45 |
| Hypertension | 82 | 80 | 84 | 81 |
| COPD | 35 | 43 | 33 | 46 |
| End Stage Renal Disease | 10 | 21 | 14 | 34 |
| Atrial fibrillation | 37 | 30 | 32 | 28 |
| Heart block or other bradycardia | 5 | 9 | 5 | 6 |
| HF Signs and Symptoms (%) | | | | |
| ≥ 2 HF signs and symptoms | 93 | 89 | 97 | 74 |
| Biomarkers and Imaging | | | | |
| BNP level, ng/L, mean (SD) | 376 (1,400) | 1,084 (2,764) | 1,846 (6,849) | 178 (614) |
| NT-proBNP level, ng/L, mean (SD) | 3,075 (15,896) | 7,544 (19,492) | 10,224 (21,162) | 725 (1,730) |
| Ejection Fraction, %, mean (SD) | 46 (35) | 51 (30) | 39 (38) | 53 (28) |
| Ejection Fraction < 40% | 27 | 20 | 46 | 28 |
| Treatment (%) | | | | |
| Diuretics (at admission or during) | 69 | 67 | 74 | 64 |
| IV inotropes | 7 | 7 | 10 | 4 |
| IV diuretics | 69 | 55 | 85 | 30 |
| IV diuretics or inotropes | 88 | 86 | 95 | 32 |

*All numbers were weighted to account for sampling fractions.
Table 5 shows measures of test validity calculated for the automated algorithm and its components, compared to the ARIC reviewer panel as a referent. The sensitivity was 0.68 and specificity 0.75 for the automated algorithm overall, with a positive predictive value of 0.85 and negative predictive value of 0.53. The prevalence of ADHF was 68% in this enriched sample of hospitalized events. Since predictive values differ according to prevalences we calculated predictive values for lower disease prevalences (e.g., for a prevalence of HF in the sample of 25%, the PPV = 0.48, and NPV = 0.88). As for the individual components of the algorithm, notably, elevated BNP or NT-proBNP taken in isolation showed comparable levels of validity to the algorithm overall (a sensitivity of 0.78 and specificity of 0.64), although this represents a smaller group (81% of the sampled hospital records) with non-missing biomarkers.
Table 5. Sensitivity, specificity, positive and negative predictive values, and positive and negative likelihood ratios for the modified automated algorithm's classification and its components, compared with a referent standard of ADHF (definite or possible) classification by the ARIC reviewer panel.
| Elements of Automated HF algorithm | Weighted* Number in analysis | Sensitivity | Specificity | Positive Predictive Value† | Negative Predictive Value† | Likelihood Ratio Positive | Likelihood Ratio Negative |
|---|---|---|---|---|---|---|---|
| HF signs and symptoms (≥ 2) | 13,855 | 0.99 | 0.06 | 0.69 | 0.75 | 1.05 | 0.17 |
| IV diuretics or inotropes | 13,842 | 0.92 | 0.23 | 0.72 | 0.60 | 1.19 | 0.35 |
| Diastolic dysfunction | 7,536 | 0.20 | 0.78 | 0.74 | 0.25 | 0.91 | 1.03 |
| Systolic dysfunction | 11,474 | 0.46 | 0.73 | 0.81 | 0.35 | 1.70 | 0.74 |
| High BNP or NT-proBNP | 11,216 | 0.78 | 0.64 | 0.85 | 0.52 | 2.17 | 0.34 |
| Automated algorithm | 13,854 | 0.68 | 0.75 | 0.85 | 0.53 | 2.72 | 0.43 |
Hospitalizations were screened and include only those that mention at least one of six signs or symptoms of ADHF.
*All numbers were weighted to account for sampling fractions.
Formulas used in calculations:
Sensitivity = a/(a+c); specificity = d/(b+d); PPV = a/(a+b); NPV = d/(c+d). Using a 2 × 2 table with ARIC as the gold standard, a−d are defined as follows: a = +ARIC, +Automated algorithm; b = −ARIC, +Automated algorithm; c = +ARIC, −Automated algorithm; and d = −ARIC, −Automated algorithm.
Positive Likelihood Ratio = sensitivity/(1−specificity) = true positive rate/false positive rate
Negative Likelihood Ratio = (1−sensitivity)/specificity = false negative rate/true negative rate
†Note: The positive and negative predictive values vary as a function of disease prevalence in the population. The prevalence of definite or possible ADHF by ARIC reviewer panel = 9376/13854 = 0.68. To calculate the PPV and NPV for populations with different disease prevalences, use the following formulas: PPV = (sensitivity × prevalence)/((sensitivity × prevalence) + ((1−specificity) × (1−prevalence))); NPV = (specificity × (1−prevalence))/(((1−sensitivity) × prevalence) + (specificity × (1−prevalence))). Thus for a prevalence of 0.5, the PPV = 0.73, and NPV = 0.71; and for a prevalence of 0.4, the PPV = 0.64, and the NPV = 0.78.
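The dependence of the predictive values on prevalence can be illustrated with a short sketch of the formulas above. It uses the rounded sensitivity (0.68) and specificity (0.75) reported for the overall automated algorithm; the function names are illustrative.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (LR+, LR-), which do not depend on prevalence."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity


ppv_25, npv_25 = predictive_values(0.68, 0.75, 0.25)
ppv_40, npv_40 = predictive_values(0.68, 0.75, 0.40)
lr_pos, lr_neg = likelihood_ratios(0.68, 0.75)
```

At a prevalence of 0.25 this gives PPV ≈ 0.48 and NPV ≈ 0.88, and at 0.4 it gives PPV ≈ 0.64 and NPV ≈ 0.78, while the likelihood ratios stay at about 2.72 and 0.43.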
In Table 6 (also in Supplemental Table 4), the agreement and validity statistics for ADHF by the ARIC reviewer panel were compared to the automated algorithm. The prevalence and bias adjustment of Kappa (PABAK) does not suggest a large influence of internal imbalance in these data on the Kappa statistic.
Table 6. Measures of agreement and validity for the classification of acute decompensated heart failure (ADHF) using the automated algorithm as compared with a referent standard of the ARIC reviewer panel (N=13,855*).
| | ARIC Reviewer Panel: ADHF present, N | ARIC Reviewer Panel: ADHF absent, N |
|---|---|---|
| Automated Algorithm, ADHF present | 6,411 | 1,123 |
| Automated Algorithm, ADHF absent | 2,966 | 3,356 |
| Measures of Agreement | | |
| Positive agreement | 0.76 | |
| Negative agreement | 0.62 | |
| Cohen's Kappa | 0.39 (0.38, 0.41) | |
| Prevalence Adjusted Bias Adjusted Kappa (PABAK) | 0.41 | |
| Prevalence Index | 0.22 | |
| Bias Index | 0.13 | |
| Measures of Validity with ARIC Reviewer Panel as Referent Standard | | |
| Sensitivity | 0.68 (0.67, 0.69) | |
| Specificity | 0.75 (0.74, 0.76) | |
| Positive Predictive Value | 0.85 (0.84, 0.86) | |
| Negative Predictive Value | 0.53 (0.52, 0.54) | |
| Likelihood Ratio positive | 2.72 (2.59, 2.87) | |
| Likelihood Ratio negative | 0.43 (0.41, 0.44) | |
*All numbers were weighted to account for sampling fractions.
Formulas used in calculations: Using a 2 × 2 table with ARIC as the gold standard, a−d are defined as follows: a = +ARIC, +Automated algorithm; b = −ARIC, +Automated algorithm; c = +ARIC, −Automated algorithm; and d = −ARIC, −Automated algorithm.
Sensitivity = a/(a+c); specificity = d/(b+d); PPV = a/(a+b); NPV = d/(c+d).
Positive Likelihood Ratio = sensitivity/(1−specificity) = true positive rate/false positive rate
Negative Likelihood Ratio = (1−sensitivity)/specificity = false negative rate/true negative rate
PABAK = 2 × (observed agreement) − 1, where the observed agreement = (a+d)/N. The prevalence of ADHF by ARIC = 9376/13854 = 0.68.
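The point estimates for the agreement measures in Table 6 can be reproduced from its weighted 2 × 2 cell counts, as in the sketch below. The prevalence index, bias index, and positive and negative agreement are computed with their commonly used definitions (|a−d|/N, |b−c|/N, 2a/(2a+b+c), and 2d/(2d+b+c)); these definitions are an assumption here, since they are not spelled out in the footnotes. Confidence intervals are not reproduced because they depend on the survey weighting.

```python
def agreement_measures(a: int, b: int, c: int, d: int) -> dict:
    """Agreement statistics for a 2 x 2 table.

    a = +ARIC, +algorithm; b = -ARIC, +algorithm;
    c = +ARIC, -algorithm; d = -ARIC, -algorithm.
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Agreement expected by chance, from the row and column margins.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return {
        "kappa": (observed - expected) / (1 - expected),
        "pabak": 2 * observed - 1,
        "prevalence_index": abs(a - d) / n,
        "bias_index": abs(b - c) / n,
        "positive_agreement": 2 * a / (2 * a + b + c),
        "negative_agreement": 2 * d / (2 * d + b + c),
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
    }


# Weighted cell counts from Table 6.
results = agreement_measures(a=6411, b=1123, c=2966, d=3356)
```

Rounded to two decimals, these match the tabulated Kappa (0.39), PABAK (0.41), prevalence index (0.22), and bias index (0.13).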
Discussion
We assessed the applicability and classification properties of an algorithm proposed for the classification of HF endpoints in clinical trials or observational studies that incorporates diagnostic tools routinely used in current medical practice. We evaluated the performance of this algorithm in the setting of hospitalizations sampled from a large number of hospitals from four regions of the US that participate in an NHLBI-sponsored epidemiology study of HF. The evaluation of an automated HF algorithm that incorporates biomarkers and echocardiographic measures is novel in the context of a large, population-based sample of hospitalizations, and is notable for its scope and generalizability. In addition to signs and symptoms as elements of the HF syndrome, the availability of echocardiographic imaging and biomarker information abstracted from records generated in the course of routine medical care indicates that application of an automated algorithm is feasible under these circumstances. We found that 89.5% of hospital medical records sampled during the period 2005-2007 contained either BNP/NT-proBNP or echocardiographic measures suitable for use in applying this algorithm. Further, by adding detail and some modifications to the definitions published by Zannad et al4, we were able to operationalize an algorithmic definition for HF. The ability to apply an automated algorithm to real-world settings and EHR highlights the potential efficiencies in the classification of HF events for research and administrative applications based on hospital medical records.
To our knowledge this is the first study to assess the classification performance of this algorithm and its validity relative to a standardized HF classification method by a panel of physician reviewers. Since HF is a clinical syndrome for which there is no consensus definition, its classification is difficult. Additional complexity is added by the episodes of acute decompensation that characterize HF. This study focused on an accurate and reproducible algorithmic classification of ADHF. Based on 15,484 (weighted) hospitalizations sampled during 2005-2007 from all hospitals that serve the residents of four regions in the U.S., we found modest agreement between ADHF defined by the automated algorithm and by the ARIC reviewer panel (Kappa of 0.39, PABAK of 0.41). Chance adjusted agreement, as measured by Cohen's Kappa was slightly higher here than agreement between existing HF criteria and the ARIC reviewer panel as shown in a prior publication (Framingham K=0.32, Modified Boston K=0.18)9. Given that the ARIC HF panel reviewers considered BNP measures and echocardiography findings in classifying HF events, we expected the automated algorithm (which includes criteria elements for these measures) to have better agreement with ARIC's ADHF than the other schema considered which do not consider these measures. In addition, existing HF schemas do not distinguish acute decompensated HF from chronic HF, whereas the automated algorithm and the ARIC reviewer classification do.
Employing ARIC's classification of ADHF as a referent standard, we found a sensitivity of 0.68 (95% CI, 0.67-0.69) for the automated algorithm, i.e., 68% of those with ADHF by the ARIC reviewer panel (reference standard) were also found to have ADHF per the automated algorithm. Thus, 32% of those with ADHF were missed as false negatives when applying the automated algorithm. Specificity was estimated as 0.75 (95% CI, 0.74-0.76), implying that 75% of those who did not have ADHF by the reference standard also were found not to have ADHF by the automated algorithm (true negatives), and 25% of those without ADHF were false positives according to the automated algorithm. Using the ARIC reviewer panel as the referent standard, the automated algorithm performed with higher specificity and lower sensitivity compared to other commonly used HF classification schema (Supplemental Table 5). The automated algorithm did not perform at higher validity on both sensitivity and specificity when compared to the existing criteria; thus the relative value of sensitivity and specificity, and the cost of each type of misclassification, need to be considered in the particular setting for which the classification of HF events is needed. The varied settings in which HF classification may be applied include the identification of potential participants with HF for a clinical trial, the identification of HF as an adverse event, and case-finding efforts that search through large databanks of electronic medical records.
Overall the automated algorithm had a better balance of sensitivity and specificity than any one individual component. The overall balance between sensitivity and specificity for the automated algorithm was closest to that for the BNP levels as individual criterion element, although the biomarkers achieved higher sensitivity than specificity. Of note, hospital records containing BNP measures may reflect a different spectrum of disease or patient population than the overall sample of hospitalized residents of these study areas. Since biomarkers and echocardiographic measurements may be performed differentially in clinical settings, the automated algorithm is not likely to perform as well in circumstances where these measures would not be obtained routinely. Furthermore, our focus is the performance of this algorithm in real world settings, and thus we did not limit the analysis to those with both measures. We would expect different results in a population that had both biomarkers and echocardiogram measures performed during a hospitalization, but both of these measures are not routinely obtained in HF hospitalizations.
Our study material included only hospitalizations lasting at least 24 hours. This may be a limitation in that milder cases of ADHF that could have been managed in the emergency department or admitted overnight to observation care would mostly be missed here. It is therefore unknown how the automated HF classification algorithm performs on data that include milder forms of ADHF. An additional limitation is that eligible hospital records with a qualifying ICD code for HF were abstracted only in part when the record did not include reference to increasing or new onset shortness of breath, peripheral edema, paroxysmal dyspnea, orthopnea, hypoxia or documentation that the reason for the event was HF. Across all hospitals included in this study, 36% of medical records did not meet the above screening criteria and were not abstracted in full. A calibration sub-study of hospital records that did not meet these inclusion criteria and were fully abstracted (N = 797) found that only 48% of those records that were screened out would have met criteria for classification using this automated algorithm, and that only 11.7% of those were identified as ADHF by ARIC. Thus, the criterion used to select hospital records eligible for full abstraction had only a small impact on the results reported here. Lastly, as expected in the setting of community-based hospitals, the biomarker assays and echocardiography measurements were not interpreted in a central reading center or laboratory; therefore, some (unmeasured) variability is to be expected. Further studies should assess this algorithm in other settings, such as in a clinical trial, in which the goal is usually to define HF endpoints of ADHF in those known to have HF.
Although centrally analyzed biomarker levels and centrally read imaging may be available from most clinical trials at baseline, it is relevant to note that many large multi-center clinical trials with HF hospitalization as endpoints are also dependent on the clinical infrastructure for imaging and biomarker measures, in place of a centralized processing of these measures4.
Among the strengths of this report are the novelty of the application of an automated algorithm for the classification of HF in a population-based setting and the rigorous evaluation of its performance characteristics in contemporaneous hospital-based practice. Additional strengths include the use of a large database of hospital records sampled to represent hospitalizations among the residents of four regions, and their abstraction by trained study personnel following a standardized protocol. Because there is as yet no agreed-upon gold standard to classify HF, a systematic physician review and classification according to standardized criteria represents the best available reference standard. Our reliance on a comprehensive and standardized protocol for the classification of ADHF that included a panel of calibrated physician reviewers adds strength to the information reported here.
In conclusion, we were able to apply an algorithm recommended by an international panel of experts for the classification of HF to medical records sampled from diverse hospitals in geographically well-defined areas in the U.S., and to automate this algorithm efficiently for use on data abstracted from records by trained personnel. The validity (accuracy) of the automated algorithm for ADHF was moderate at best compared to the classification of ADHF by ARIC's reviewer panel, although the agreement and specificity for the automated algorithm were greater than for the commonly used HF criteria that do not account for contemporary measures of BNP or echocardiography (Supplemental Table 5). The development of HF classification criteria that agree with the highest reference standard of physician reviewer classification, and their evaluation in the setting of medical practice, are priorities for clinical and population-based research. If such an algorithm is used to classify all hospital admissions, rather than only those with high prior odds of HF as was done in this study, then the concordance will be extremely high, because most records will not have HF by either criterion.
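The dependence of concordance on the mix of records can be illustrated with a minimal sketch. The 2×2 counts below are hypothetical and are not the study's data; they were chosen only so that sensitivity and specificity in the "selected" sample equal the reported 0.68 and 0.75. The function name `agreement_stats` and the assumption that every added admission is classified as non-HF by both the algorithm and the reviewers are likewise illustrative.

```python
def agreement_stats(tp, fp, fn, tn):
    """Agreement measures for a 2x2 table (algorithm vs. reference standard).

    tp: positive by both; fp: algorithm positive, reference negative;
    fn: algorithm negative, reference positive; tn: negative by both.
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n                 # raw (observed) agreement
    alg_pos = (tp + fp) / n                  # share positive by the algorithm
    ref_pos = (tp + fn) / n                  # share positive by the reference
    # Agreement expected by chance, given the two marginal distributions
    expected = alg_pos * ref_pos + (1 - alg_pos) * (1 - ref_pos)
    return {
        "agreement": observed,
        "kappa": (observed - expected) / (1 - expected),  # Cohen's kappa
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical sample with high prior odds of HF (NOT the ARIC counts).
selected = agreement_stats(tp=680, fp=100, fn=320, tn=300)

# Same records plus 20,000 hypothetical admissions that neither the
# algorithm nor the reviewers would call HF (assumed all concordant).
all_admissions = agreement_stats(tp=680, fp=100, fn=320, tn=300 + 20_000)

for label, stats in (("selected", selected), ("all admissions", all_admissions)):
    print(label, {k: round(v, 3) for k, v in stats.items()})
```

Under these assumptions sensitivity is unchanged, while raw agreement and specificity rise simply because concordant non-HF records dominate the table. In practice some of the added records would be discordant, so the size of the increase shown here should not be read as an estimate.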
Diastolic dysfunction, a common finding in the elderly population without HF, does not contribute much to the ability to classify HF. Unlike for systolic dysfunction, the CCT algorithm requires that those with diastolic dysfunction also have moderately elevated biomarkers to meet criteria for ADHF. It is possible that uniform measurement of diastolic parameters, which are not often reported in clinical echocardiograms, and research to define the set of parameters that best characterizes diastolic dysfunction may improve its utility for classification. The ability to classify ADHF with an up-to-date, automated classification algorithm and evaluate its performance characteristics is a critical step toward the establishment and standardized application of consensus criteria for HF. Advantages derived from the use of such criteria would include the ability to exploit large medical records databases, as well as efficiencies in time and costs.
Supplementary Material
Supplemental Table 1. ICD-9 codes that were eligible for probabilistic sampling
Supplemental Table 2. Unweighted number of hospitalizations* during 2005-2007 among residents ages 55 years and older of four communities in the U.S., identified as eligible for review as possible heart failure, according to the automated HF classification algorithm and the ARIC reviewer panel
Supplemental Table 3. Characteristics of hospitalizations, by presence of Acute Decompensated Heart Failure (ADHF) according to Cardiovascular Clinical Trialists' (CCT) algorithm and ARIC classification
Supplemental Table 4. Measures of Agreement between hospitalized Acute Decompensated Heart Failure (ADHF) as defined by the Cardiovascular Clinical Trialists' (CCT) algorithm and other classification schema (N=13,855*), as compared to a referent standard of the ARIC reviewer panel
Supplemental Table 5. Sensitivity, specificity, positive and negative predictive values, and likelihood ratios positive and negative for the comparison of the Cardiovascular Clinical Trialists' (CCT) algorithm and other HF classification schema for heart failure, compared to a referent standard of the ARIC reviewer panel
Acknowledgments
The authors thank the staff and participants of the ARIC study for their important contributions.
Sources of Funding: The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C).
Footnotes
Disclosures: None
References
1. Vasan RS, Levy D. Defining diastolic heart failure: A call for standardized diagnostic criteria. Circulation. 2000;101:2118–2121. doi: 10.1161/01.cir.101.17.2118.
2. Mosterd A, Deckers JW, Hoes AW, Nederpel A, Smeets A, Linker DT, Grobbee DE. Classification of heart failure in population based research: An assessment of six heart failure scores. European Journal of Epidemiology. 1997;13:491–502. doi: 10.1023/a:1007383914444.
3. Di Bari M, Pozzi C, Cavallini MC, Innocenti F, Baldereschi G, De Alfieri W, Antonini E, Pini R, Masotti G, Marchionni N. The diagnosis of heart failure in the community. Comparative validation of four sets of criteria in unselected older adults: The ICARe Dicomano Study. Journal of the American College of Cardiology. 2004;44:1601–1608. doi: 10.1016/j.jacc.2004.07.022.
4. Zannad F, Stough WG, Pitt B, Cleland JG, Adams KF, Geller NL, Torp-Pedersen C, Kirwan BA, Follath F. Heart failure as an endpoint in heart failure and non-heart failure cardiovascular clinical trials: The need for a consensus definition. European Heart Journal. 2008;29:413–421. doi: 10.1093/eurheartj/ehm603.
5. Rutten FH, Cramer MJ, Lammers JW, Grobbee DE, Hoes AW. Heart failure and chronic obstructive pulmonary disease: An ignored combination? Eur J Heart Fail. 2006;8:706–711. doi: 10.1016/j.ejheart.2006.01.010.
6. Heckbert SR, Kooperberg C, Safford MM, Psaty BM, Hsia J, McTiernan A, Gaziano JM, Frishman WH, Curb JD. Comparison of self-report, hospital discharge codes, and adjudication of cardiovascular events in the Women's Health Initiative. Am J Epidemiol. 2004;160:1152–1158. doi: 10.1093/aje/kwh314.
7. Schellenbaum GD, Heckbert SR, Smith NL, Rea TD, Lumley T, Kitzman DW, Roger VL, Taylor HA, Psaty BM. Congestive heart failure incidence and prognosis: Case identification using central adjudication versus hospital discharge diagnoses. Annals of Epidemiology. 2006;16:115–122. doi: 10.1016/j.annepidem.2005.02.012.
8. White AD, Folsom AR, Chambless LE, Sharret AR, Yang K, Conwill D, Higgins M, Williams OD, Tyroler HA. Community surveillance of coronary heart disease in the Atherosclerosis Risk in Communities (ARIC) study: Methods and initial two years' experience. Journal of Clinical Epidemiology. 1996;49:223–233. doi: 10.1016/0895-4356(95)00041-0.
9. Rosamond WD, Chang PP, Baggett C, Johnson A, Bertoni AG, Shahar E, Deswal A, Heiss G, Chambless LE. Classification of heart failure in the Atherosclerosis Risk in Communities (ARIC) study: A comparison of diagnostic criteria. Circ Heart Fail. 2012;5:152–159. doi: 10.1161/CIRCHEARTFAILURE.111.963199.
10. Ho KK, Anderson KM, Kannel WB, Grossman W, Levy D. Survival after the onset of congestive heart failure in Framingham Heart Study subjects. Circulation. 1993;88:107–115. doi: 10.1161/01.cir.88.1.107.
11. Carlson KJ, Lee DC, Goroll AH, Leahy M, Johnson RA. An analysis of physicians' reasons for prescribing long-term digitalis therapy in outpatients. J Chronic Dis. 1985;38:733–739. doi: 10.1016/0021-9681(85)90115-8.
12. Schocken DD, Arrieta MI, Leaverton PE, Ross EA. Prevalence and mortality rate of congestive heart failure in the United States. Journal of the American College of Cardiology. 1992;20:301–306. doi: 10.1016/0735-1097(92)90094-4.
13. Eriksson H, Caidahl K, Larsson B, Ohlson LO, Welin L, Wilhelmsen L, Svardsudd K. Cardiac and pulmonary causes of dyspnoea--validation of a scoring test for clinical-epidemiological use: The Study of Men Born in 1913. European Heart Journal. 1987;8:1007–1014. doi: 10.1093/oxfordjournals.eurheartj.a062365.
14. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. Journal of Clinical Epidemiology. 1993;46:423–429. doi: 10.1016/0895-4356(93)90018-v.
15. Cunningham M. More than just the kappa coefficient: A program to fully characterize inter-rater reliability between two raters. In: Proceedings of the SAS® Global Forum 2009 Conference; Cary, NC. SAS Institute Inc.; 2009.
16. Altman DG, Bland JM. Diagnostic tests 2: Predictive values. BMJ. 1994;309:102. doi: 10.1136/bmj.309.6947.102.
