Improving the accuracy and precision of disease identification when utilizing EHR data for research: the case for hepatocellular carcinoma

Carrie R Wong; Yvonne N Flores; Analissa Avila; Lina Tieu; Catherine M Crespi; Folasade P May; Douglas S Bell; Beth Glenn; Roshan Bastani

doi:10.1186/s13104-025-07465-z

. 2025 Oct 1;18:410. doi: 10.1186/s13104-025-07465-z

Improving the accuracy and precision of disease identification when utilizing EHR data for research: the case for hepatocellular carcinoma

Carrie R Wong ^1,^2,^✉, Yvonne N Flores ^2,^3,⁴, Analissa Avila ^2,⁵, Lina Tieu ², Catherine M Crespi ^2,⁵, Folasade P May ^1,², Douglas S Bell ⁶, Beth Glenn ^2,³, Roshan Bastani ^2,³

PMCID: PMC12487526 PMID: 41035064

Abstract

Objective

We assessed the performance of International Classification of Diseases (ICD) codes to identify patients with hepatocellular carcinoma (HCC) in a large academic health system and determined whether employing an algorithm using a combination of ICD codes could deliver higher accuracy and precision than single ICD codes in identifying HCC cases using electronic health record (EHR) data.

Results

The use of a single ICD code entry for HCC (ICD-9-CM 155.0 or ICD-10-CM C22.0) in our cohort of 1,007 established ambulatory care patients with potential HCC yielded 58% false positives (not true HCC cases) based on chart reviews. We developed an ICD code-based algorithm that prioritized positive predictive value (PPV), F-score, and accuracy to minimize false positives and negatives. Using manual chart reviews as the gold standard, the highest performing algorithm required at least 10 ICD code entries for HCC and the sum of ICD code entries for HCC to exceed the sum of ICD code entries for non-HCC malignancies. The algorithm demonstrated high performance (PPV 97.4%, F-score 0.92, accuracy 94%), which was internally validated (PPV 92.3%, F-score 0.90, accuracy 91%) using a separate sample of 285 cancer registry cases. Our findings support the need to assess the accuracy and precision of ICD codes before using EHR data to study HCC more broadly.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13104-025-07465-z.

Keywords: Electronic health record, Electronic medical record, International Classification of Diseases, ICD code, Hepatocellular carcinoma, Liver cancer, Chronic liver disease, Accuracy, Precision, Algorithm

Introduction

Hepatocellular carcinoma (HCC) is the most common type of liver cancer among adults and the sixth leading cause of cancer deaths in the United States [1]. Epidemiological trends for HCC vary based on race/ethnicity, sex, and age, and outcomes are conditional on patient and tumor characteristics and treatment options. Population-based estimates of the epidemiology and outcomes of HCC have been largely derived from Surveillance Epidemiology and End Results (SEER) or Veterans Affair (VA) administrative data [2, 3]. SEER captures large patient samples using validated measures from histology and radiology reports but lacks treatment-related factors, which limits its ability to compare treatments and outcomes [4]. While the use of a single International Classification of Diseases (ICD) code for HCC has been previously validated in the VA with a high positive predictive value (PPV) of 86%, findings from VA administrative data have limited generalizability beyond the veteran population [5].

Large data from health systems that provide longitudinal care for patients with HCC are needed to obtain more precise estimates of HCC epidemiology and treatment outcomes. A study based in an academic health system examined the accuracy of ICD, Ninth Revision, Clinical Modification (ICD-9-CM) codes from an administrative database to identify patients with HCC and found that a combination of two occurrences of ICD-9-CM codes for HCC and two ICD-9-CM codes for chronic liver disease and/or cirrhosis were needed to achieve an 87% PPV in capturing true HCC cases [6]. Since transitioning to the ICD, Tenth Revision, Clinical Modification (ICD-10-CM) coding system, the precision, accuracy, and validity of ICD codes in identifying HCC cases have not been thoroughly assessed using health system administrative data outside of SEER and VA sources.

We sought to assess the performance of ICD-9-CM and ICD-10-CM diagnostic codes to identify patients with HCC with established ambulatory care at a large academic health system and determine whether employing an ICD code-based algorithm could deliver higher accuracy and precision in identifying HCC cases.

Methods

We used ICD-9-CM and ICD-10-CM codes (Table 1 in Supplement) to assemble a study cohort of adult patients with chronic liver disease between 2006 and 2022 using electronic health record (EHR) data at an academic health system (UCLA Health), which included two hospitals and over 200 medical clinics across Southern California (Fig. 1). All patients had a minimum of two ambulatory care visits in primary care at least one year apart to be considered established ambulatory care patients in the health system (n = 26,439). We defined the follow-up period as the time between the first ICD code entry for HCC to the most recent encounter. We first queried for patients with potential HCC, which we defined as having at least one entry of ICD-9-CM (155.0) or ICD-10-CM (C22.0) code for HCC (n = 1,007). The ICD codes appeared across different dates. To assess the performance of a single ICD code entry for HCC, we assembled a development sample, which was a random pool of 300 patients from the potential HCC sample, for chart review by three physicians (SJS, AB, SB) using a structured abstraction form. The reviewers were aware that the development sample included potential HCC cases with at least one ICD code for HCC but were blinded to the ICD codes. The inter-rater agreement for a random set of 135 duplicated cases from the development sample was 72%. Separate chart reviews for any discrepant cases were conducted by a transplant hepatologist (CRW). HCC diagnosis was established using a combination of imaging, histology, and clinical notes in the EHR. Abstraction of clinical data from chart reviews provided information about true HCC cases (true positive) and false HCC cases (false positive), which served as the gold standard comparison. From our chart reviews using the development sample, we identified the most frequent non-HCC malignancies, which were included in our algorithm. Since the algorithm was based on the development sample using manual chart reviews as the gold standard, we also performed a sensitivity analysis using the institution’s cancer registry as a reference. This sensitivity analysis was applied to 285 patients from the development sample diagnosed with HCC between 2006 and 2020. The sensitivity analysis excluded 15 patients because they were diagnosed with HCC outside of the registry date range (e.g. after 2020 when registry data was unavailable at the time of this study).

Performance measurements, including sensitivity, specificity, PPV (precision), negative predictive value (NPV), F-score (harmonic mean of PPV and sensitivity), and accuracy (percentage of patients who were correctly classified) accompanied each algorithm iteration (Table 1). We selected the best performing algorithm with the highest PPV, F-score, and accuracy to reduce the number of false positive and negative cases. We internally validated the highest performing algorithm using a different random sample of 300 patients from the pool of potential patients with HCC (Fig. 1).

Table 1.

Performance measurements of ICD code-based algorithm iterations

Open in a new tab

PPV; positive predictive value, NPV; negative predictive value; HCC, hepatocellular carcinoma

Results

The cohort of 26,439 established patients with ICD codes indicative of chronic liver disease included 1,007 patients with potential HCC based on a single ICD-9-CM or ICD-10-CM code entry. Chart reviews of the random sample of 300 potential HCC cases revealed that 58% of the potential cases of HCC were false positives (not true HCC cases). Of the 174 false HCC cases, 64 patients (36.8%) were found to have non-HCC malignancies, including 20 with cholangiocarcinoma, 16 with metastatic colorectal cancer, 9 with metastatic neuroendocrine tumors, 4 with metastatic pancreatic adenocarcinoma, and 15 with other metastatic malignancies. The second largest group of false positives (32.1%) included patients undergoing surveillance for HCC in the setting of chronic liver disease (n = 32) and other malignancies (n = 24) including history of renal cell carcinoma, breast cancer, and head and neck cancers. Other false HCC cases included patients who were found to have non-HCC liver lesions (n = 32; 18.4%) such as hemangiomas and hepatocellular adenomas or had diagnostics performed for other indications including abnormal liver enzymes (n = 22; 12.6%).

In the development sample (n = 300), the median age (interquartile range [IQR]) was similar between the true and false HCC cases, which were 64 (57–71) and 65 (53–73) years, respectively (p = 0.69). Median follow-up was also not significantly different between the HCC cases and non-cases: 6.4 (2.6–10.3) versus 5.7 (2.2–9.4) years, respectively (p = 0.21). Notably, there were significantly more females among false HCC cases (50%) compared to true HCC cases (31%), p < 0.01.

Following prior work [6], we first examined the performance measurements of Algorithm 1 by HCC code frequency alone (Table 1). The performance metrics improved with increasing frequency of ICD codes for HCC until the inflection point at 10 or more ICD codes for HCC (Fig. 2). Since we prioritized the highest PPV, F-score, and accuracy to reduce the number of false positive and negative cases, the Algorithm 1 iteration with at least 10 ICD code entries for HCC had the best performance. For Algorithm 2, we assessed the performance metrics of each iteration that did not have ICD codes for a commonly miscoded non-HCC malignancy. Algorithm 3 combined the best performing iteration from Algorithm 1 (≥ 10 ICD code entries for HCC) with each non-HCC malignancy as done for Algorithm 2. The F-score and accuracy for Algorithm 3 iterations for pancreatic cancer, colorectal cancer, or neuroendocrine tumor were more favorable than the iterations for secondary malignancy or cholangiocarcinoma. However, if we eliminated the Algorithm 3 iterations for secondary malignancy or cholangiocarcinoma based on performance metrics alone, the algorithm could have more false positives based on the higher specificities for secondary malignancy (95.4%) and cholangiocarcinoma (98.9%) compared to the specificities in Algorithm 1. There could also be more false negatives based on the lower sensitivities seen in Algorithm 3 iterations for secondary malignancy (83.3%) and cholangiocarcinoma (77.8%) compared to the sensitivities in Algorithm 1. Balancing the risk of false positives and false negatives and recognizing that increasing frequency of ICD codes for HCC had improved performance (Algorithm 1), we applied the sum of ICD codes for HCC to exceed the sum of ICD codes for non-HCC malignancies in Algorithm 4.

Fig. 2 — PPV, F-score, and accuracy for Algorithm 1 and 4. PPV; positive predictive value. Performance measurements of each algorithm were obtained using the Development Sample, which was a random pool of 300 potential HCC patients with at least 1 ICD code entry for HCC. Algorithm 1 included an increasing number of ICD code entries for HCC using ICD codes 155.0 and C22.0. Algorithm 4 included iterations from Algorithm 1, which included an increasing number of ICD code entries for HCC and required the sum of ICD code entries for HCC to exceed the sum of ICD code entries for non-HCC malignancies (secondary malignancy, cholangiocarcinoma, pancreatic cancer, colorectal cancer, neuroendocrine tumor)

The best performing algorithm was an iteration of Algorithm 4 which had a of PPV of 97.4%, F-score of 0.92, and accuracy of 94% using the development sample. When we tested the algorithm using the cancer registry as the gold standard, the algorithm had similar performance metrics (PPV 94.3%, F-score 0.93, accuracy 94.4%). Our internal validation results revealed a consistently high performance with a PPV of 92.3%, F-score of 0.90, and accuracy of 91%.

Discussion

This study assessed the performance of multiple ICD-9-CM and ICD-10-CM for HCC to precisely and accurately identify patients with HCC and longitudinal ambulatory care in a large academic health system. In contrast to findings from the VA Corporate Data Warehouse [3], a single ICD code for HCC performed poorly with a 58% false positive rate (42% accuracy). An algorithm requiring at least 10 ICD code entries for HCC in combination with the sum of HCC ICD code entries exceeding the sum of non-HCC malignancy ICD code entries identified true HCC cases with at least 90% accuracy and precision in our development and validation samples.

The discrepancy between the performance metrics of a single ICD code entry in our health system compared to the VA may be related to different coding practices and different patient populations including a higher prevalence of patients undergoing specialized cancer treatments in our health system. Compared to the VA, our health system includes patients with more female-specific malignancies, such as breast cancer, which require monitoring for liver metastasis. Specialized cancer treatments offered in our health system also require frequent surveillance imaging. As a referral center for specialty care, including transplant hepatology and hepatobiliary surgery, our health system likely encounters more consultations for evaluation of suspicious liver lesions, which may be miscoded using ICD-9-CM and ICD-10-CM codes for HCC. The PPV of the algorithm increased with an increasing number of ICD code entries for HCC as similarly reported in a previous study conducted in an academic health system [6]. As suggested by our chart reviews, potential real-world causes of miscoding include frequent imaging-based cancer surveillance and premature linkage of labs to disease codes during diagnostic work-up. To mitigate such issues, future research ought to consider using only ICD codes specifically linked to clinical encounters with providers.

Limitations and strengths

Several limitations should be considered when interpreting our findings. First, UCLA Health includes a tertiary care center, which receives referrals for specialized care in oncology, hepatobiliary surgery, and liver transplantation. Therefore, the prevalence of HCC is expectedly high in the study’s population, which could have inflated the PPV of our algorithm. Nonetheless, a higher prevalence of non-HCC malignancies would also be expected, which was countered by including the sum of ICD code entries for HCC to exceed the sum of ICD code entries for non-HCC malignancies in the algorithm. We also included additional performance metrics, such as the specificity and F-score, which are not as affected by disease prevalence. Second, while chart reviews were considered the gold standard to develop the best performing algorithm, human error during manual abstractions of EHR data could bias results. Duplicate chart reviews were not performed for all 300 potential HCC cases in the development sample, which could limit the reliability of the reference standard. Third, the study cohort was defined using ICD codes for chronic liver diseases, which were similarly subject to misclassification and could have led to downstream effects on the development and performance of our algorithm. In recognition of these limitations, we tested the performance of the algorithm using cancer registry data, which identified HCC cases independently from our study team using separate criteria, as a sensitivity test which showed similar results. Fourth, our algorithm was developed and validated exclusively within a single academic health system which limits its generalizability. Future directions include external validation of this algorithm in other academic and non-academic settings.

Conclusion

In conclusion, this study demonstrated that a combination of ICD-9-CM and ICD-10-CM codes for HCC can be used via an algorithm with over 90% accuracy and precision, whereas the use of a single ICD code entry to identify HCC cases requires caution in similar large health systems. With emerging new treatments for HCC [7] and use of artificial intelligence to capture HCC characteristics and outcomes using EHR data [8], large samples of patients with accurate, precise, and valid HCC diagnoses are needed to derive population-based estimates of treatment receipt and response, compare effectiveness of different treatments, and assess for differences in treatment uptake and outcomes by patient characteristics. Future studies that leverage EHR-based data to conduct studies about HCC epidemiology and treatment outcomes are encouraged to validate the accuracy and precision of ICD codes for HCC diagnoses before generating population-based estimates.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1.^{(17.9KB, docx)}

Acknowledgements

We would like to acknowledge Drs. Saroja Bangaru, Aileen Bui, and Sarah J. Stettner and the Cancer Registry Department of Performance Strategy and Analytics at the University of California, Los Angeles for their contributions.

Abbreviations

HCC: Hepatocellular carcinoma
SEER: Surveillance Epidemiology and End Results
VA: Veterans Affair
ICD: International Classification of Diseases
PPV: Positive predictive value
ICD-9-CM: ICD, Ninth Revision, Clinical Modification
ICD-10-CM: ICD, Tenth Revision, Clinical Modification
NPV: Negative predictive value
HER: Electronic health records

Author contributions

C.R.W., Y.N.F., A.A., L.T., F.P.M., D.B., B.G., and R.B. contributed to the study conception and design. Y.N.F., C.M.C., F.P.M., D.B., B.G., and R.B. acquired the data. A.A. provided primary statistical support. All authors contributed to the methods and statistical analyses. C.R.W. wrote the main manuscript text. All authors reviewed and edited the manuscript.

Funding

Jonsson Comprehensive Cancer Center, David Geffen School of Medicine, and Clinical Translational Science Institute at the University of California, Los Angeles (UL1TR001881). National Center for Advancing Translational Science of the National Institutes of Health.

Data availability

Study data cannot be openly shared to protect the private health information of patients at UCLA Health.

Declarations

Ethics approval and consent to participate

The current study was approved by the Institutional Review Board at the University of California, Los Angeles. Written informed consent was not required for the use of de-identified electronic health data.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1.Liver Cancer Causes, Risk Factors, and Prevention - NCI. 2022. https://www.cancer.gov/types/liver/what-is-liver-cancer/causes-risk-factors. Accessed 15 Aug 2024.
2.SEER. Cancer of the liver and intrahepatic bile duct - cancer stat facts. https://seer.cancer.gov/statfacts/html/livibd.html. Accessed 2 Apr 2024.
3.Ju MR, Karalis JD, Chansard M, Augustine MM, Mortensen E, Wang SC, et al. Variation of hepatocellular carcinoma treatment patterns and survival across geographic regions in a veteran population. Ann Surg Oncol. 2022;29(13):8413–20. 10.1245/s10434-022-12390-7. [DOI] [PubMed] [Google Scholar]
4.Noone AM, Lund JL, Mariotto A, Cronin K, McNeel T, Deapen D, et al. Comparison of SEER treatment data with medicare claims. Med Care. 2016;54(9):e55–64. 10.1097/MLR.0000000000000073. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Overview of VHA Patient Veteran, and, Non-veteran Populations, and Characteristics. In: National Healthcare Quality and Disparities Report: Chartbook on Healthcare for Veterans. Agency for Healthcare Research and Quality (US); 2020. https://www.ncbi.nlm.nih.gov/books/NBK578553/. Accessed 22 Aug 2024.
6.Goldberg DS, Lewis JD, Halpern SD, Weiner MG, Lo Re V. Validation of a coding algorithm to identify patients with hepatocellular carcinoma in an administrative database. Pharmacoepidemiol Drug Saf. 2013;22(1):103–7. 10.1002/pds.3367. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Llovet JM, Kelley RK, Villanueva A, Singal AG, Pikarsky E, Roayaie S, et al. Hepatocellular carcinoma. Nat Rev Dis Primers. 2021;7(1):1–28. 10.1038/s41572-020-00240-3. [DOI] [PubMed] [Google Scholar]
8.Ge J, Li M, Delk MB, Lai JC. A comparison of a large language model vs manual chart review for the extraction of data elements from the electronic health record. Gastroenterology. 2024;166(4):707–e7093. 10.1053/j.gastro.2023.12.019. [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

Ge J, Li M, Delk MB, Lai JC. A comparison of a large language model vs manual chart review for the extraction of data elements from the electronic health record. Gastroenterology. 2024;166(4):707–e7093. 10.1053/j.gastro.2023.12.019. [DOI] [PMC free article] [PubMed]

Supplementary Materials

Supplementary Material 1.^{(17.9KB, docx)}

Data Availability Statement

Study data cannot be openly shared to protect the private health information of patients at UCLA Health.

[CR1] 1.Liver Cancer Causes, Risk Factors, and Prevention - NCI. 2022. https://www.cancer.gov/types/liver/what-is-liver-cancer/causes-risk-factors. Accessed 15 Aug 2024.

[CR2] 2.SEER. Cancer of the liver and intrahepatic bile duct - cancer stat facts. https://seer.cancer.gov/statfacts/html/livibd.html. Accessed 2 Apr 2024.

[CR3] 3.Ju MR, Karalis JD, Chansard M, Augustine MM, Mortensen E, Wang SC, et al. Variation of hepatocellular carcinoma treatment patterns and survival across geographic regions in a veteran population. Ann Surg Oncol. 2022;29(13):8413–20. 10.1245/s10434-022-12390-7. [DOI] [PubMed] [Google Scholar]

[CR4] 4.Noone AM, Lund JL, Mariotto A, Cronin K, McNeel T, Deapen D, et al. Comparison of SEER treatment data with medicare claims. Med Care. 2016;54(9):e55–64. 10.1097/MLR.0000000000000073. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Overview of VHA Patient Veteran, and, Non-veteran Populations, and Characteristics. In: National Healthcare Quality and Disparities Report: Chartbook on Healthcare for Veterans. Agency for Healthcare Research and Quality (US); 2020. https://www.ncbi.nlm.nih.gov/books/NBK578553/. Accessed 22 Aug 2024.

[CR6] 6.Goldberg DS, Lewis JD, Halpern SD, Weiner MG, Lo Re V. Validation of a coding algorithm to identify patients with hepatocellular carcinoma in an administrative database. Pharmacoepidemiol Drug Saf. 2013;22(1):103–7. 10.1002/pds.3367. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Llovet JM, Kelley RK, Villanueva A, Singal AG, Pikarsky E, Roayaie S, et al. Hepatocellular carcinoma. Nat Rev Dis Primers. 2021;7(1):1–28. 10.1038/s41572-020-00240-3. [DOI] [PubMed] [Google Scholar]

[CR8] 8.Ge J, Li M, Delk MB, Lai JC. A comparison of a large language model vs manual chart review for the extraction of data elements from the electronic health record. Gastroenterology. 2024;166(4):707–e7093. 10.1053/j.gastro.2023.12.019. [DOI] [PMC free article] [PubMed]

PERMALINK

Improving the accuracy and precision of disease identification when utilizing EHR data for research: the case for hepatocellular carcinoma

Carrie R Wong

Yvonne N Flores

Analissa Avila

Lina Tieu

Catherine M Crespi

Folasade P May

Douglas S Bell

Beth Glenn

Roshan Bastani

Abstract

Objective

Results

Supplementary Information

Introduction

Methods

Fig. 1.

Table 1.

Results

Fig. 2.

Discussion

Limitations and strengths

Conclusion

Supplementary Information

Acknowledgements

Abbreviations

Author contributions

Funding

Data availability

Declarations

Ethics approval and consent to participate

Competing interests

Footnotes

References

Associated Data

Data Citations

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases