Abstract
EHR-based phenotype development and validation are extremely time-consuming and carry considerable monetary cost. Creating a phenotype currently requires both clinical experts and experts in the data to be queried. The new approach presented here demonstrates a computational alternative to the classification of patient cohorts based on automatic weighting of ICD codes. This approach was applied to data from six different clinics within the University of Arkansas for Medical Sciences (UAMS) health system. The results were compared with phenotype algorithms designed by clinicians and informaticians for asthma and melanoma. Relative to traditional phenotype development, this method shows potential to considerably reduce time requirements and monetary costs while delivering comparable results.
Introduction
Today, only a fraction of clinical decisions are based upon evidence derived from the gold standard of knowledge discovery. For example, a recent study reports that only 18% of recommendations in primary care are based on strong evidence.1 At the same time, Electronic Health Record (EHR) use has climbed to 96% in US hospitals and 87% in clinics, generating copious amounts of data newly available to science.2 Opportunistically, researchers and the National Institutes of Health are calling for and exploring responsible use of routine care data to probe unanswered clinical questions.3
EHR-based phenotypes, or computable phenotypes, are algorithms for using EHR data to identify patient cohorts, aspects of their care, and outcomes. Phenotype development using EHR data is an active area of research.4,5 Notable articles report differences in phenotype results based on subtle algorithmic differences, emphasizing the need for generalized phenotype design and validation methods.7 Further, there is mounting evidence that the same algorithm used in different data sources can yield unexpectedly different results.7 Validation to measure sensitivity, specificity, and positive and negative predictive value requires chart review with large sample sizes to detect false negatives.8
While secondary use of EHR data has been seen as a panacea, phenotype development and validation are extremely time-consuming and have considerable monetary cost.8 Phenotype development and validation have commonly been accomplished deductively by experts with clinical knowledge and experience in (1) defining the phenotypes, i.e., classification rules, based on diagnostic criteria, condition-specific treatment, or other clinical data common and specific to the disease or disorder of interest and (2) validating them against a gold standard, usually other clinicians reviewing charts and manually classifying the cases.8 A closely related approach is adapting existing phenotypes to the desired use. There has been some exploration of inductive approaches such as clustering or associative methods, alone or within machine learning algorithms.9
Methods
The approach presented in this paper is an inductive weighting approach that automatically defines diagnosis-based phenotypes within EHR data by leveraging and adapting the Fellegi-Sunter method for classification.6 This approach seeds the generation of a probabilistic classification model using an existing diagnosis code for a disease. The weight for each ICD-9 code is calculated within a dataset based on a user-selected, single diagnosis that serves as a seed to initiate the algorithm. A score is then calculated for each patient by summing the weights assigned to each diagnosis present for the patient. Patients are then classified by comparing the score against a user-defined threshold. The five major steps of this approach are outlined in Table 1 below.
Table 1.
Algorithm Schema
1. All patients with a user-selected (seed) diagnosis code are counted in the dataset. For example, consider that 10 out of 40 total patients contain the seed diagnosis code "493.x" (asthma), giving P(d) = 10/40 = 0.25, where d is the event that the patient has one or more encounters containing the predefined (seed) ICD code for the disease.
2. The probability of each code in the dataset is similarly calculated as the number of patients that meet the criteria in Step 1 and have that diagnosis code, divided by the total number of patients in the dataset, i.e., the joint probability P(c ∩ d), where c is the event that the patient has one or more encounters containing the ICD code for which the weight is being calculated. For example, consider that 5 of the 10 patients from the previous step meet these criteria for a given diagnosis code. Also, consider that 5 of the 30 patients that do not meet the criteria in Step 1 also have that diagnosis code.
3. The weight for this diagnosis code is calculated using Equations 1-3. The ICD code weight is the log2 of the ratio of the two conditional probabilities. First, mi is calculated as Ei/Eh = 5/10 = 0.5. Then, ui is calculated as !Ei/!Eh = 5/30 = 0.1667. Finally, the "agreement" weight is log2(mi/ui) = log2(0.5/0.1667) = 1.5847.
4. Each patient is then scored by summing the weights for all of the ICD codes present in their encounters.
5. The user identifies an initial threshold for classification based on the highest-scoring patient; Fellegi-Sunter classification is performed for each patient in the dataset (considering diagnoses from all encounters a patient may have) using the threshold.
This approach assigns a weight to each ICD code present in a set of EHR data based on frequency of the code in the dataset. The sum of the weights of all ICD codes for each patient across all encounters in the dataset is used to classify each patient into one of two different outcomes: patient has the disease versus patient does not have the disease.
The scoring algorithm seeded the initial weight calculation by using a single ICD code to designate the presence of the disease for a patient. We use asthma (493.x) and melanoma (172.x) to demonstrate the classification. This research used legacy data and thus ICD-9 codes to describe patient encounters, where "x" is a wildcard denoting any diagnosis code below the seed code in the ICD-9 hierarchy (e.g., any code beginning with "493."). Though a single ICD code often does not positively identify a patient as having a disease with a high level of confidence, we use it to initialize the calculation of the weights for ICD codes. Presence of the code is used as an indication that the patient has the disease for the weight calculation process. The weight for a given ICD code is applied only when the patient has one or more encounters containing that ICD code.
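The wildcard matching described above amounts to a prefix test over the ICD-9 hierarchy. A minimal sketch (the original tool's exact matching logic is not published; `matches_seed` is an illustrative name):

```python
def matches_seed(icd_code: str, seed: str = "493") -> bool:
    """True if icd_code is the seed code or any child of it in the
    ICD-9 hierarchy, e.g. "493.02" matches the seed "493"."""
    return icd_code == seed or icd_code.startswith(seed + ".")
```

Note that a plain `startswith(seed)` would be too loose, since "4930" is not a child of "493"; requiring the trailing dot restricts matches to the subtree.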
ICD Code Weight Calculation
The Fellegi-Sunter probabilistic model for estimated weights was used to calculate the "agreement" weight for each ICD code in the dataset.6 Two conditional probabilities were calculated. The first probability (mi = P(c | d)) is the conditional probability of event c given event d, where:
d – the patient has one or more encounters containing the predefined (seed) ICD code for the disease (493.x for asthma and 172.x for melanoma)
c – the patient has one or more encounters containing the ICD code for which the weight is being calculated
The probability that a patient with the seed code also has a given ICD code (mi) was calculated as the number of patients with an encounter containing the defined asthma ICD code (493.x) that also had the given ICD code (Ei), divided by the total number of patients with an encounter containing the defined asthma ICD code (493.x) (Eh). This calculation is expressed in Equation 1 below.
mi = Ei / Eh (1)
Second, the conditional probability that a patient without the seed code has an encounter containing a given ICD code (ui = P(c | ¬d)) was calculated as the number of patients with no encounters containing the defined asthma ICD code (493.x) that had the given ICD code (!Ei), divided by the total number of patients that did not have an encounter containing the defined asthma ICD code (493.x) (!Eh). The calculation for probability ui is expressed in Equation 2 below.
ui = !Ei / !Eh (2)
Using the probabilities mi and ui, the "agreement" weight of each individual attribute was calculated and applied using Equation 3. This weight corresponds to "agreement", i.e., the ICD code being present in the patient's encounters.
wi = log2(mi / ui) (3)
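Equations 1-3 and the worked example from Table 1 can be sketched in Python. This is an illustrative sketch, not the original tool; the data layout (a dict mapping each patient to the set of ICD codes seen across all of that patient's encounters) is an assumption:

```python
from math import log2

def agreement_weights(patient_codes, seed_prefix):
    """Compute the Fellegi-Sunter agreement weight (Equations 1-3)
    for every ICD code in the dataset.

    patient_codes: dict mapping patient id -> set of ICD codes seen
    across all of that patient's encounters.
    seed_prefix: the seed diagnosis prefix, e.g. "493" for asthma.
    """
    # Step 1: patients whose encounters contain the seed code (493.x).
    seeded = {p for p, codes in patient_codes.items()
              if any(c.startswith(seed_prefix) for c in codes)}
    e_h = len(seeded)                    # Eh: patients with the seed code
    not_e_h = len(patient_codes) - e_h   # !Eh: patients without it

    weights = {}
    for code in set().union(*patient_codes.values()):
        e_i = sum(1 for p in seeded if code in patient_codes[p])
        not_e_i = sum(1 for p, codes in patient_codes.items()
                      if p not in seeded and code in codes)
        m_i = e_i / e_h            # Equation 1
        u_i = not_e_i / not_e_h    # Equation 2
        if m_i > 0 and u_i > 0:    # weight is undefined if either is zero
            weights[code] = log2(m_i / u_i)  # Equation 3
    return weights
```

With the Table 1 example (5 of 10 seeded patients and 5 of 30 non-seeded patients carrying a code), this yields log2(0.5/0.1667) ≈ 1.585.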
Results
Data
To test this approach, de-identified data was extracted from the UAMS Data Warehouse within the context of phenotype development and validation for an EHR Data Quality study (UAMS IRB protocol 206786). Data from six different family medicine clinics in the UAMS health system were processed through the tool. Table 2 below lists summary information for each of the datasets.
Table 2.
Summary of Each Data Source
| Clinic | Participants | Asthma Count | Asthma (%) | Melanoma Count | Melanoma (%) |
|---|---|---|---|---|---|
| 1 | 30,267 | 1,761 | 5.82 | 8 | 0.03 |
| 2 | 19,126 | 936 | 4.89 | 13 | 0.07 |
| 3 | 27,553 | 1,572 | 5.71 | 16 | 0.06 |
| 4 | 22,217 | 1,078 | 4.85 | 12 | 0.05 |
| 5 | 10,381 | 640 | 6.17 | 0 | 0 |
| 6 | 43,828 | 2,528 | 5.77 | 13 | 0.03 |
| Total | 153,372 | 8,518 | 5.55 | 62 | 0.04 |
Across all six sources, there were 869,543 total records, each describing a patient encounter. Each clinic was processed and evaluated separately, which allowed evaluation of the algorithm's performance within a single clinic, typical of most applications of phenotypes to real-world clinical data.
In this research, two clinical conditions were selected for testing: asthma as the high-frequency condition and melanoma as the low-frequency condition. The results for these two conditions are shown in the Results section.
Proof of Concept
A Python program was written to automatically calculate the agreement weights for the data from each clinic based on the ICD code weight calculation described previously. For the initial implementation and testing, only the agreement weight was applied. The tool was configured to calculate weights to identify patients with asthma and melanoma. For asthma, a single occurrence of ICD-9 code "493.x" for a given patient was used as the initial seed for the weights; for melanoma, a single occurrence of ICD-9 code "172.x" was used. Since the attribute weights were calculated within each source, the weight values for each ICD code varied by source. The full list of weights is too large to include in this paper.
Given that the truth is currently unknown for these data, the expert-defined phenotypes for asthma and melanoma from the EHR Data Quality study were used as a benchmark. The EHR Data Quality research used phenotype algorithms validated by physicians and compared the results with patients' self-report data. Precision, recall, and F-measure were calculated relative to the performance of the clinical-expert-developed phenotypes. The expert-defined phenotypes are best described as screening phenotypes, meaning that the experts sought wide definitions of individuals who likely had the disease.
The algorithm returned a positive outcome when it classified a patient as having the disease. A true positive (TP) was a positive outcome that agreed with the expert-defined phenotype classification; a false positive (FP) was a positive outcome that did not. The algorithm returned a negative outcome when it classified the patient as not having the disease. A true negative (TN) was a negative outcome that agreed with the expert-defined phenotype classification; a false negative (FN) was a negative outcome that did not. Each of these outcomes was counted and input into the calculations for precision, recall, and F-measure, as outlined in Equations 4, 5, and 6 below.
Precision = TP / (TP + FP) (4)
Recall = TP / (TP + FN) (5)
F-measure = 2 × (Precision × Recall) / (Precision + Recall) (6)
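Equations 4-6 translate directly to code; a minimal sketch:

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Precision, recall, and F-measure (Equations 4-6) from outcome
    counts; the F-measure is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```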
Each of the sources was iteratively processed through the algorithm with different thresholds for disease classification until the highest F-measure was found. The initial threshold value for each source was the maximum patient score for that source; the threshold was then reduced in steps of 0.5 down to 0. The F-measure and threshold for each iteration were stored. For each source, the metrics presented were calculated at the threshold that produced the best F-measure.
We ran the algorithm across this range of thresholds and selected the threshold that maximized the F-measure, yielding the best available balance of precision and recall. This result makes us optimistic that combining orthogonal computational approaches may be fruitful for automating classification.
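The threshold sweep can be sketched as follows. This is an illustrative sketch, not the published tool; the inputs (per-patient summed scores and benchmark labels from the expert phenotype) are assumed data structures:

```python
def best_threshold(scores, benchmark):
    """Sweep the classification threshold from the maximum patient
    score down to 0 in steps of 0.5, returning the (F-measure,
    threshold) pair that maximizes F-measure against the benchmark.

    scores: dict patient id -> summed ICD-code weight.
    benchmark: dict patient id -> True if the expert-defined phenotype
    classifies the patient as having the disease.
    """
    best = (0.0, None)  # (F-measure, threshold)
    threshold = max(scores.values())
    while threshold >= 0:
        tp = sum(1 for p, s in scores.items() if s >= threshold and benchmark[p])
        fp = sum(1 for p, s in scores.items() if s >= threshold and not benchmark[p])
        fn = sum(1 for p, s in scores.items() if s < threshold and benchmark[p])
        if tp:  # precision and recall are undefined without true positives
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            f = 2 * precision * recall / (precision + recall)
            if f > best[0]:
                best = (f, threshold)
        threshold -= 0.5
    return best
```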
Calculated Metrics
For this initial testing and early proof of concept, the algorithm performed remarkably well. The maximum patient score, classification threshold, precision, recall, and F-measure for each source for asthma are presented in Table 4.
Table 4.
Algorithm Result Summary for Asthma
| Source | Maximum Patient Score | Classification Threshold | Precision | Recall | F-Measure |
|---|---|---|---|---|---|
| Clinic 1 | 53.6360 | 12.6360 | 0.9410 | 0.9335 | 0.9373 |
| Clinic 2 | 58.3873 | 12.3873 | 0.9129 | 0.9412 | 0.9268 |
| Clinic 3 | 55.6088 | 12.6088 | 0.9033 | 0.9281 | 0.9155 |
| Clinic 4 | 45.8018 | 11.8018 | 0.9707 | 0.9526 | 0.9616 |
| Clinic 5 | 64.4364 | 11.4364 | 0.8137 | 0.9296 | 0.8678 |
| Clinic 6 | 55.5678 | 10.5678 | 0.9611 | 0.9584 | 0.9598 |
The maximum patient score, classification threshold, precision, recall, and F-measure for each source for melanoma are presented in Table 5.
Table 5.
Algorithm Result Summary for Melanoma
| Source | Maximum Patient Score | Classification Threshold | Precision | Recall | F-Measure |
|---|---|---|---|---|---|
| Clinic 1 | 125.4725 | 54.9725 | 1.0000 | 0.2857 | 0.4444 |
| Clinic 2 | 110.3233 | 33.8233 | 0.6364 | 0.5833 | 0.6087 |
| Clinic 3 | 167.5986 | 58.5986 | 1.0000 | 0.4000 | 0.5714 |
| Clinic 4 | 67.3017 | 28.3017 | 0.8000 | 0.3636 | 0.5000 |
| Clinic 5 | No patient has melanoma in both benchmark and scoring results | | | | |
| Clinic 6 | 106.7645 | 40.7645 | 0.8750 | 0.5833 | 0.7000 |
For asthma, the algorithm achieved an F-measure ranging from 0.8678 to 0.9616 across the six sources used in this initial proof of concept. For melanoma, Clinic 5 had no patients with the disease in either the benchmark or the scoring algorithm results; across the remaining five sources, the F-measure ranged from 0.4444 to 0.7. Overall, the algorithm achieved an average F-measure of 0.9281 across the six sources for the high-frequency condition (asthma) and 0.5649 across the five sources for the low-frequency condition (melanoma). Considering the resources required to generate these results relative to the cost of creating a phenotype, this algorithm is remarkably cost-effective.
Discussion
This research presents a modification of the Fellegi-Sunter model of record linkage for the classification of diseases within cohorts using clinical data. The initial implementation and testing of this approach show considerable promise. For the diseases selected, across the six sources tested, the prototype produced classifications very close to those of phenotypes designed from expert knowledge over the course of three years. With additional improvements and further testing, it is possible that this approach will match or surpass traditional phenotype algorithms with reduced time and resource requirements. Further, the approach may have greater utility in assessing phenotype performance as a relative measure for validating different datasets in multicenter studies based on EHR data.
The initial prototype applied only agreement weights for ICD codes based on patient encounters. Several potential improvements are currently being tested. First, adding disagreement weights should give the algorithm finer-grained ability to discriminate between patients with the disease and those without. One of the most powerful components of the Fellegi-Sunter model lies in its ability to augment automated machine classification with manual review judgments by subject matter experts. The time cost and effectiveness of blending these two methods in a refined machine learning approach need to be further explored.
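For illustration, the standard Fellegi-Sunter disagreement weight could be sketched as below. This is a proposed extension, not part of the prototype described above:

```python
from math import log2

def disagreement_weight(m_i: float, u_i: float) -> float:
    """Fellegi-Sunter disagreement weight, applied when an ICD code is
    absent from all of a patient's encounters. It is negative whenever
    the code is more common among disease patients (m_i > u_i), so
    absence of such a code lowers the patient's score."""
    return log2((1 - m_i) / (1 - u_i))
```

Using the worked example from Table 1 (mi = 0.5, ui = 1/6), the disagreement weight is log2(0.5 / (5/6)) ≈ -0.737, penalizing patients who lack a code that is characteristic of the disease.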
Finally, the presence of a single ICD code in a patient’s set of encounters represents an extremely simplistic use case. It is possible that considering and weighing more complex ICD code relationships will further boost the accuracy of this algorithm. For example, the co-occurrence of two or more different ICD codes for a given patient or multiple occurrences of the same ICD code across encounters for a patient leverages more diagnostic features than a single ICD code. Beyond the ICD codes provided by encounters, there is a considerable amount of additional information describing a patient that has the potential to further increase the accuracy of results.
Conclusion
In addition to improving the efficiency and accuracy of the classification of patients into cohorts, the scoring-based approach can provide mechanisms for potentially assessing the accuracy of disease classification before the results can be properly audited and vetted by a subject matter expert or through contact with a patient. Similar approaches have applied the Fellegi-Sunter model for quality assessment of record linkage results in the past.6 It is possible that this same technique could be adapted for disease classification.
Table 3.
Frequency of Clinical Conditions in the Research Data
| Condition | Participants with Condition (%) |
|---|---|
| Asthma | 5.56% |
| Melanoma | 0.04% |
References
1. Ebell MH, Sokol R, Lee A, Simons C, Early J. How good is the evidence to support primary care practice? Evid Based Med. 2017 Jun;22(3):88-92. doi: 10.1136/ebmed-2017-110704.
2. Office of the National Coordinator for Health IT (ONC). Health IT Dashboard quick statistics. 2018. Retrieved from https://dashboard.healthit.gov/quickstats/pages/
3. Kahn MG, Callahan TJ, Barnard J. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. EGEMS. 2016;4(1):1244. doi: 10.13063/2327-9214.1244.
4. Electronic Health Records-Based Phenotyping. 2018. Retrieved from http://www.rethinkingclinicaltrials.org/resources/ehr-phenotyping/
5. Spratt SE, Pereira K, Granger BB. Assessing electronic health record phenotypes against gold-standard diagnostic criteria for diabetes mellitus. J Am Med Inform Assoc. 2017.
6. Fellegi IP, Sunter AB. A Theory for Record Linkage. Journal of the American Statistical Association. 1969;64(328):1183-1210.
7. Richesson RL, Rusincovitch SA, Wixted D, Batch BC, Feinglos MN, Miranda ML, Hammond WE, Califf RM, Spratt SE. A comparison of phenotype definitions for diabetes mellitus. J Am Med Inform Assoc. 2013;20:e319-e326. doi: 10.1136/amiajnl-2013-001952.
8. Richesson RL, Sun J, Pathak J, Kho AN, Denny JC. Clinical phenotyping in selected national networks: demonstrating the need for high-throughput, portable, and computational methods. Artif Intell Med. 2016 Jul;71:57-61. doi: 10.1016/j.artmed.2016.05.005.
9. Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, Lai AM. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc. 2014;21(2):221-230. doi: 10.1136/amiajnl-2013-001935.