Open Forum Infectious Diseases. 2025 Aug 15;12(8):ofaf496. doi: 10.1093/ofid/ofaf496

Development and Validation of a Machine Learning–Based Screening Algorithm to Predict High-Risk Hepatitis C Infection

Suk-Chan Jang 1, Wei-Hsuan Lo-Ciganic 2,3, Pilar Hernandez-Con 4, Chanakan Jenjai 5, James Huang 6, Ashley Stultz 7, Shunhua Yan 8, Debbie L Wilson 9, Ashley Norse 10, Faheem W Guirgis 11, Robert L Cook 12, Christine Gage 13, Khoa A Nguyen 14, Patrick Hornes 15, Yonghui Wu 16, David R Nelson 17, Haesuk Park 18,✉,2
PMCID: PMC12378832  PMID: 40874186

Abstract

Background

Amid the opioid epidemic in the United States, hepatitis C virus (HCV) infections are rising, with one-third of individuals with infection unaware due to the asymptomatic nature. This study aimed to develop and validate a machine learning (ML)-based algorithm to screen individuals at high risk of HCV infection.

Methods

We conducted prognostic modeling using the 2016–2023 OneFlorida+ database of all-payer electronic health records. The study included individuals aged ≥18 years who were tested for HCV antibodies, RNA, or genotype. We identified 275 features of HCV, including sociodemographic and clinical characteristics, during a 6-month period before the test result date. Four ML algorithms—elastic net (EN), random forest (RF), gradient boosting machine (GBM), and deep neural network (DNN)—were developed and validated to predict HCV infection. We stratified patients into deciles based on predicted risk.

Results

Among 445 624 individuals, 11 823 (2.65%) tested positive for HCV. Training (75%) and validation (25%) samples had similar characteristics (mean [standard deviation] age, 45 [16] years; 62.86% female; 54.43% White). The GBM model (C statistic, 0.916 [95% confidence interval = .911–.921]) outperformed the EN (0.885 [.879–.891]), RF (0.854 [.847–.861]), and DNN (0.908 [.903–.913]) models (P < .0001). Using the Youden index, GBM achieved 79.39% sensitivity and 89.08% specificity, identifying 1 positive HCV case per 6 tests. Among patients with HCV, 75.63% and 90.25% were captured in the top risk decile and the top 3 risk deciles, respectively.

Conclusions

ML algorithms effectively predicted and stratified HCV infection risk, offering a promising targeted screening tool for clinical settings.

Keywords: HCV infection, hepatitis C virus, high-risk prediction, machine learning, screening algorithm


Our machine learning–based screening algorithm improves hepatitis C virus detection efficiency, optimizing resource allocation and supporting timely diagnosis and treatment among high-risk individuals, addressing limitations of universal screening, and guiding clinicians and policymakers in enhancing public health strategies.


Hepatitis C virus (HCV) infection is the most prevalent chronic blood-borne infection and a leading cause of liver-related morbidity, liver transplantation, and mortality in the United States [1]. In 2022, more than 12 000 HCV-related deaths were reported, exceeding the total number of deaths from other infectious diseases combined, including human immunodeficiency virus (HIV) [2–4]. The HCV infection rate in 2022 amid the opioid epidemic and injection drug use (IDU) was more than 3 times as high as in 2014 [3, 5–7]. Despite this increase, HCV often remains undiagnosed due to its asymptomatic nature, leaving approximately one-third of infected individuals unaware of their infections [2, 8]. While direct-acting antiviral therapies offer a high cure rate, with sustained virologic response exceeding 95%, the HCV epidemic continues to worsen due to undiagnosed individuals unknowingly transmitting the virus [9].

The U.S. Centers for Disease Control and Prevention (CDC) recommends universal screening of HCV for all adults and frequent testing for individuals with high-risk behaviors [10]. However, universal screening presents practical and resource-related challenges, including unnecessary testing and staffing requirements, which can be costly. Given HCV's high transmissibility [11], there is a need for sustainable, predictive screening approaches in clinical practice. While targeted screening based on risk factors like IDU offers an alternative, it is often limited by clinicians' time constraints and discomfort in discussing sensitive topics [12, 13]. To improve identification of high-risk individuals without disrupting clinical workflows, a more integrated approach is needed.

The widespread adoption of standardized electronic health records (EHRs) and the rapid advancement of machine learning (ML) offer new opportunities to detect undiagnosed HCV more efficiently [14]. Therefore, we developed and validated an ML-based algorithm to identify patients at high risk for HCV infection using all-payer EHR data by comparing multiple ML models to identify the most effective one to enhance targeted screening in clinical settings.

METHODS

Data Source

We conducted prognostic modeling with retrospective cohort data using the OneFlorida+ database, an all-payer EHR database that captures diverse patient data across Florida. The OneFlorida+ database includes information from over 26 million patients. Comprehensive patient data are collected based on the PCORnet Common Data Model, encompassing sociodemographics, clinical specialties, disease diagnoses, medical procedures, laboratory tests, and prescribed medications [15].

Study Population and Design

We included adults aged 18–79 years who were tested for HCV (antibody, RNA, or genotype) between 1 January 2016, and 31 July 2023. We excluded individuals who had a prior diagnosis of HCV or treatment for HCV before their first HCV test, or individuals without a test result. Case and control groups were categorized based on their test results (Supplementary Table 1). The case group comprised individuals having a positive RNA or genotype test result, and the control group comprised individuals having a negative antibody or RNA test result with no record of positive antibody, RNA, or genotype test results. The index date was the date of the first positive RNA or genotype test result for the case group and the date of the last negative antibody or RNA test result for controls (Supplementary Figure 1). To avoid misclassification, we further excluded individuals (1) with inconsistent test results or (2) having insufficient RNA levels (ie, <3 log IU/mL) unless 2 results were available (Supplementary Figure 2).
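The case and control definitions above can be read as a simple decision rule. The following is a minimal sketch of that rule only, assuming each individual's tests arrive as hypothetical `(test_type, result)` pairs. The function name and data layout are illustrative and not taken from the study, and the additional exclusions for inconsistent results and low RNA levels are deliberately left out.

```python
def classify_individual(tests):
    """Classify one individual from a list of (test_type, result) pairs.

    test_type: "antibody", "rna", or "genotype" (hypothetical encoding)
    result:    "positive" or "negative"
    Note: this sketch ignores the study's extra exclusions
    (inconsistent results, RNA < 3 log IU/mL without a second result).
    """
    # Case: any positive RNA or genotype result.
    if any(t in ("rna", "genotype") and r == "positive" for t, r in tests):
        return "case"

    # Control: a negative antibody or RNA result and no positive result at all.
    has_negative = any(t in ("antibody", "rna") and r == "negative" for t, r in tests)
    has_positive = any(r == "positive" for _, r in tests)
    if has_negative and not has_positive:
        return "control"

    # Everything else (e.g. antibody-positive only, or no results) is neither.
    return "excluded"


print(classify_individual([("antibody", "positive"), ("rna", "positive")]))
print(classify_individual([("antibody", "negative")]))
```

Under this reading, an individual with only a positive antibody result is neither a case nor a control, because the control definition requires no record of any positive result.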

This study adhered to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis + Artificial Intelligence reporting guideline (Supplementary Table 2) [16].

Study Outcome

The primary outcome was HCV infection, defined by the aforementioned case criteria.

Predictor Candidates

Based on the prior literature, we identified a series of baseline candidate predictors (n = 275) encompassing sociodemographic and clinical factors during the 6 months prior to the index date, excluding the index date (Supplementary Table 3). We chose the 6-month period to assess potential exposure based on the CDC's recommendation and previous literature, considering HCV incubation time [17, 18]. Sensitivity analyses were performed using a 12-month predictor assessment window. Sociodemographic predictors included age, sex, ethnicity (Hispanic and non-Hispanic), race (Asian, Black, White, and other), preferred spoken language (English, Spanish, and other), and payer type (Medicare, Medicaid, commercial, self-pay, and other). Clinical predictors included risk factors for HCV infection (eg, HIV), conditions, symptoms, or markers potentially related to illicit drug use, liver conditions or diseases, extrahepatic manifestations of HCV, other conditions (eg, dyslipidemia), and healthcare utilization (eg, specialty visits).

ML Approaches and Prediction Performance Evaluation

Our analysis followed 5 main steps (Figure 1). In step 1 (preprocessing), we randomly allocated 75% of individuals to a training sample to develop the prediction algorithm, and the remaining 25% to a validation sample to evaluate the algorithm's prediction performance. Missing values were imputed, and continuous variables were standardized through normalization for the collected predictor candidates. For step 2 (ML modeling), algorithms for HCV infection were trained using 4 commonly used ML approaches—elastic net (EN), random forest (RF), gradient boosting machine (GBM), and deep neural network (DNN) [14]. Details for each of the ML approaches used are described in the Supplementary Methods. For step 3 (performance assessment), discrimination performance was assessed by comparing C statistics (with >0.8 considered good and >0.9 indicating excellent) and precision-recall curves across methods using the DeLong test [19]. A P-value <.005 was considered statistically significant due to multiple comparisons (n = 10; modified P = .05/10) using the Bonferroni correction [20]. In addition to C statistics, we reported 8 evaluation metrics to comprehensively assess prediction performance: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio, negative likelihood ratio, the number needed to screen (NNS) to identify 1 HCV infection, and the estimated positive alert rate (Supplementary Table 4). When prediction performance was comparable across methods, we selected the model with the fewest predictors. Since no single prediction probability threshold fits every purpose, we presented prediction metrics at multiple levels of sensitivity and specificity, using 90% sensitivity as an anchor and a threshold with balanced sensitivity and specificity determined by the Youden index [21]. 
Model fairness was examined by assessing the balance of false-negative rates (FNRs) across race, ethnicity, sex, age group, place of HCV testing at the index date, and frequent emergency department (ED) use, defined as 2 or more ED visits during the 6-month window [22–25]. In step 4 (risk stratification), we stratified individuals into decile subgroups of predicted risk score, using risk threshold levels established from the training sample. This approach allowed for a more granular analysis of high-risk individuals. In the final step, step 5 (identifying important predictors), we identified the top 25 most important predictors using the SHapley Additive exPlanations (SHAP) values from the best-performing model. This approach offers insights into the key variables driving prediction and can inform clinical practice [26]. For comparisons between characteristics of subgroups, we used analysis of variance for continuous variables and χ2 tests for categorical variables. All analyses were performed using SAS, version 9.4 (SAS Institute Inc.), and Python, version 3.12 (Python Software Foundation).
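As an illustration of the threshold selection described in step 3, the Youden index picks the cutoff that maximizes J = sensitivity + specificity − 1. The sketch below is a plain-Python illustration under that standard definition, not the study's code. The function name and the toy scores and labels are hypothetical.

```python
def youden_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    scores: predicted risk scores; labels: 1 = HCV case, 0 = control.
    A score >= threshold counts as a positive alert.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both cases and controls are required")

    best_j, best = -1.0, None
    for t in sorted(set(scores)):  # try every observed score as a cutoff
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / positives, tn / negatives
        j = sens + spec - 1
        if j > best_j:  # ties keep the lowest (most sensitive) cutoff
            best_j, best = j, (t, sens, spec)
    return best


# Toy data for illustration only
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(youden_threshold(scores, labels))
```

This quadratic-time loop is meant for readability. On a cohort of this size, a sorted single pass or a library ROC routine would be the practical choice.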

Figure 1.

Data analysis pipeline—machine learning algorithms. Icon designed by FLATICON (https://www.flaticon.com/). Abbreviations: HCV, hepatitis C virus; SHAP, SHapley Additive exPlanations.

RESULTS

Cohort Characteristics

Individuals in the training (n = 334 218) and validation (n = 111 406) samples had similar sociodemographic and clinical characteristics (Table 1). Overall, the mean (standard deviation) age of included individuals was 45 (16) years, 62.86% (n = 280 114) were female, 2.43% (n = 10 844) were Asian, 26.49% (n = 118 056) were Black, 54.43% (n = 242 565) were White, and 16.14% (n = 71 926) were Hispanic. In total, 34.01% (n = 151 561) had commercial insurance, and 18.59% (n = 82 839) were covered by Medicaid. We found that 18.05% (n = 80 427) received HIV testing, and 1.29% (n = 5765) tested positive for HIV. Additionally, 1.38% (n = 6159) of the participants were diagnosed with opioid-related IDU, 1.64% (n = 7300) with IDU related to other substances, and 2.70% (n = 12 026) with a substance use disorder (SUD). Overall, 2.65% (n = 11 823) tested positive for HCV and were classified as cases, while the remaining 97.35% (n = 433 801) were classified as controls.

Table 1.

Baseline Sociodemographic and Clinical Characteristics Stratified by Training and Validation Sample

| Characteristic | Training Sample (n = 334 218) | Validation Sample (n = 111 406) |
| --- | --- | --- |
| Age, mean (SD), y | 45.28 (15.88) | 45.37 (15.87) |
| Age group, n (%), y | | |
| 18–29 | 68 977 (20.64) | 22 821 (20.48) |
| 30–39 | 69 514 (20.80) | 23 136 (20.77) |
| 40–49 | 52 918 (15.83) | 17 534 (15.74) |
| 50–59 | 66 673 (19.95) | 22 191 (19.92) |
| 60–79 | 76 136 (22.78) | 25 724 (23.09) |
| Sex, n (%) | | |
| Male | 124 118 (37.14) | 41 247 (37.02) |
| Female | 209 996 (62.83) | 70 118 (62.94) |
| Unknown | 104 (0.03) | 41 (0.04) |
| Ethnicity, n (%) | | |
| Hispanic | 53 907 (16.13) | 18 019 (16.17) |
| Non-Hispanic | 214 665 (64.23) | 71 792 (64.44) |
| Unknown | 65 646 (19.64) | 21 595 (19.38) |
| Race, n (%) | | |
| Asian | 8156 (2.44) | 2688 (2.41) |
| Black | 88 459 (26.47) | 29 597 (26.57) |
| White | 181 794 (54.39) | 60 771 (54.55) |
| Other^a | 2214 (0.66) | 697 (0.63) |
| Unknown | 53 595 (16.04) | 17 653 (15.85) |
| Payer type, n (%) | | |
| Medicare | 23 959 (7.17) | 7991 (7.17) |
| Medicaid | 62 235 (18.62) | 20 604 (18.49) |
| Commercial | 113 693 (34.02) | 37 868 (33.99) |
| Self-pay | 9567 (2.86) | 3200 (2.87) |
| Other | 4961 (1.48) | 1723 (1.55) |
| Unknown | 119 803 (35.85) | 40 020 (35.92) |
| Comorbidity, n (%) | | |
| Cirrhosis | 5093 (1.52) | 1732 (1.55) |
| Coronary artery disease | 20 245 (6.06) | 6690 (6.01) |
| Depression | 12 466 (3.73) | 4159 (3.73) |
| Dyslipidemia | 62 172 (18.60) | 20 879 (18.74) |
| HIV | 4369 (1.31) | 1396 (1.25) |
| Injection drug use (opioid) | 4622 (1.38) | 1537 (1.38) |
| Injection drug use (other) | 5432 (1.63) | 1868 (1.68) |
| NAFLD/NASH | 8490 (2.54) | 2822 (2.53) |
| Sexually transmitted disease | 175 (0.05) | 60 (0.05) |
| Substance use disorder | 8943 (2.68) | 3083 (2.77) |
| Procedures and tests, n (%) | | |
| Complete blood count test | 116 244 (34.78) | 38 507 (34.56) |
| HIV test | 60 182 (18.01) | 20 245 (18.17) |
| Liver function test | 99 203 (29.68) | 32 832 (29.47) |
| Liver procedure | 1375 (0.41) | 502 (0.45) |
| Prothrombin time test | 40 340 (12.07) | 13 476 (12.10) |
| Sodium test | 188 019 (56.26) | 62 651 (56.24) |
| Smoking, n (%) | 40 551 (12.13) | 13 567 (12.18) |
| Number of visits, mean (SD) | | |
| Emergency department | 0.81 (3.85) | 0.82 (3.90) |
| Inpatient hospital stay | 0.75 (5.77) | 0.76 (5.68) |
| Outpatient visit | 6.76 (15.13) | 6.79 (15.67) |
| Other | 4.98 (13.62) | 4.98 (13.74) |

Abbreviations: HIV, human immunodeficiency virus; n, number; NAFLD/NASH, nonalcoholic fatty liver disease/nonalcoholic steatohepatitis; SD, standard deviation.

^a Other races included American Indian/Alaska Native, Native Hawaiian or Other Pacific Islander, and >1 race.

Prediction Performance Across ML Methods

As shown in Figure 2A, all 4 ML approaches demonstrated C statistics higher than 0.8: DNN, C statistic = 0.908, 95% confidence interval (CI) = .903–.913; GBM, C statistic = 0.916, 95% CI = .911–.921; RF, C statistic = 0.854, 95% CI = .847–.861; and EN, C statistic = 0.885, 95% CI = .879–.891. Among these algorithms, the GBM model achieved the highest C statistic, outperforming all other models based on the DeLong test (all P < .0001). For precision-recall performance, assessed by the area under the curve (AUC), the GBM and DNN exhibited similarly high AUC values (Figure 2B). However, while the GBM required 238 predictors, the DNN utilized all 275 predictors. The prediction performance measures at different levels of sensitivity and specificity are detailed in Supplementary Table 5. When we balanced sensitivity and specificity using the Youden index, the GBM approach achieved a sensitivity of 79.39% and a specificity of 89.08%, with a PPV of 16.53%, an NPV of 99.37%, an NNS of 6 to identify 1 HCV infection, and 13 positive alerts per 100 patients. When sensitivity was set to 90% (aiming to identify 90% of actual HCV diagnoses), the GBM had a specificity of 72.32%, a PPV of 8.14%, an NPV of 99.62%, an NNS of 12 to identify 1 HCV infection, and 29 positive alerts per 100 patients (Supplementary Table 5 and Figure 2C and D). In sensitivity analysis, expanding the predictor assessment window to 12 months resulted in a GBM C statistic of 0.921 (95% CI = .917–.925; Supplementary Figure 3). The performance of the GBM model was consistent across ethnicity, sex, age group, place of HCV testing at the index date, and frequent ED use categories, although it indicated a higher FNR for Asian individuals compared with other race categories (Supplementary Figure 4).
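The PPV, NPV, NNS, and alert-rate figures above follow arithmetically from sensitivity, specificity, and the cohort positivity rate. The sketch below reproduces that arithmetic under two assumptions that are not stated in this section: that prevalence is the overall 2.65% positivity rate, and that NNS is taken as 1/PPV (the study's own definitions are in Supplementary Table 4). The function name is hypothetical.

```python
def screening_metrics(sensitivity, specificity, prevalence):
    """Derive PPV, NPV, NNS, and alerts per 100 patients from rates.

    Assumption: NNS is computed as 1 / PPV, i.e. the number of
    flagged individuals tested per HCV infection found.
    """
    tp = sensitivity * prevalence              # true positives per patient
    fn = (1 - sensitivity) * prevalence        # false negatives per patient
    fp = (1 - specificity) * (1 - prevalence)  # false positives per patient
    tn = specificity * (1 - prevalence)        # true negatives per patient
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "ppv": ppv,
        "npv": npv,
        "nns": 1 / ppv,
        "alerts_per_100": 100 * (tp + fp),
    }


PREVALENCE = 0.0265  # assumed: overall cohort HCV-positivity rate

youden = screening_metrics(0.7939, 0.8908, PREVALENCE)
anchor90 = screening_metrics(0.90, 0.7232, PREVALENCE)
for name, m in (("Youden", youden), ("90% sensitivity", anchor90)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

With these inputs, the computed values land close to the reported ones (PPV near 16.5% and 8.1%, NNS near 6 and 12, and roughly 13 and 29 alerts per 100 patients). Small differences are expected because the reported metrics come from validation-sample counts rather than rounded rates.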

Figure 2.

Performance matrix of machine learning models for HCV infection. A, Areas under the receiver operating characteristic curves (AUC), with C statistics and 95% confidence intervals. B, Precision-recall curves (precision = positive predictive value, and recall = sensitivity). Precision-recall curves for methods closer to the upper-right corner or above another method have improved performance. C, The number needed to be evaluated by different cutoffs of sensitivity. D, Positive alerts per 100 patients by various sensitivity cutoffs. Abbreviations: DNN, deep neural network; EN, elastic net; GBM, gradient boosting machine; RF, random forest.

Risk Stratification by Decile Risk Subgroup

Figure 3 illustrates the actual HCV infection rate for individuals in each risk decile subgroup. Within the top risk decile of the validation sample, the first (n = 1115), second to fifth (n = 4456), and sixth to tenth (n = 5570) percentile subgroups had PPVs of 76.05%, 22.80%, and 6.66%, respectively. Among the 2955 individuals diagnosed with HCV in the validation set, 75.63% (n = 2235) and 90.25% (n = 2667) were captured in the top risk decile and the top 3 risk deciles, respectively. Individuals in the top decile and in the top 3 deciles had HCV-positivity rates 28 and 22 times as high, respectively, as those in the remaining deciles (HCV-positivity rate: decile 1 = 20.06% vs deciles 2–10 = 0.72%; deciles 1–3 = 7.98% vs deciles 4–10 = 0.37%). Individuals with higher risk scores, identified in the higher risk decile subgroups, were more likely to be White individuals, male, covered by Medicaid or uninsured, and to have a history of HIV diagnosis, HIV testing, IDU diagnosis, and SUD diagnosis (all P < .0001; Supplementary Table 6).
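The positivity rates and rate ratios in this paragraph can be checked from the reported counts. The sketch below does so in Python. The only figure not given in the text is the size of the top 3 deciles, which is assumed here to be 3 times the top-decile size (3 × 11 141). The helper name is hypothetical.

```python
def rate_ratio(cases_in, n_in, cases_total, n_total):
    """Positivity rate inside a subgroup, in the remainder, and their ratio."""
    rate_in = cases_in / n_in
    rate_out = (cases_total - cases_in) / (n_total - n_in)
    return rate_in, rate_out, rate_in / rate_out


N_VALIDATION = 111_406
CASES_VALIDATION = 2_955
TOP_DECILE_N = 1_115 + 4_456 + 5_570  # percentile subgroups of decile 1

# Decile 1 versus the remaining nine deciles
top1 = rate_ratio(2_235, TOP_DECILE_N, CASES_VALIDATION, N_VALIDATION)

# Deciles 1-3 versus deciles 4-10 (assumed size: 3 x top-decile size)
top3 = rate_ratio(2_667, 3 * TOP_DECILE_N, CASES_VALIDATION, N_VALIDATION)

for label, (r_in, r_out, ratio) in (("decile 1", top1), ("deciles 1-3", top3)):
    print(f"{label}: {r_in:.2%} vs {r_out:.2%}, ratio {ratio:.1f}")
```

The comparison group for decile 1 is everyone outside it: 720 of the 2955 cases fall among the other 100 265 individuals, which gives the rate of roughly 0.72%.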

Figure 3.

Hepatitis C virus (HCV) infection by risk subgroup assessed using the GBM model with the validation sample. The overall HCV-positivity rate was 2.65% for the cohort that underwent HCV screening. In risk stratification analysis, the positivity rate was 20.06% in the top decile, 11.22% across the first 2 deciles, and 7.98% across the first 3 deciles, whereas it was 0.37% for deciles 4–10. The positivity rate was calculated by dividing the number of positive cases by the number of patients. For example, 20.06% = [(1115 × 0.761) + (4456 × 0.228) + (5570 × 0.067)]/(1115 + 4456 + 5570) and 11.22% = [(1115 × 0.761) + (4456 × 0.228) + (5570 × 0.067) + (11 140 × 0.024)]/(1115 + 4456 + 5570 + 11 140). Universal screening for all individuals.

Important Predictors

Figure 4 presents variable importance plots utilizing SHAP values derived from the GBM model. The predictor “history of HIV test” (listed in the figure as “HIV test”) showed the strongest positive correlation with HCV infection. Other positively correlated predictors included identifying as a non-Hispanic, White, or Black individual; smoking; being older; having a history of HIV; undergoing prothrombin time or liver function tests; and having a higher number of ED visits. A higher number of outpatient visits was inversely associated with HCV infection. Having commercial insurance, a history of dyslipidemia, and other internal medicine visits were also associated with a reduced risk of HCV infection.

Figure 4.

Top 25 important predictors for hepatitis C virus (HCV) infection selected by the gradient boosting machine model. The y-axis lists each feature, and the x-axis indicates the SHapley Additive exPlanations value; the color bar on the right side illustrates the descending order of the average impact on HCV infection for each feature, showing positive (red) or negative (blue) correlations, with overlapping points jittered along the y-axis. Abbreviations: HIV, human immunodeficiency virus; NAFLD/NASH, nonalcoholic fatty liver disease/nonalcoholic steatohepatitis; Rx, prescription; SUD, substance use disorder.

DISCUSSION

This prognostic modeling study used a 2016–2023 all-payer EHR database to develop and validate ML-based algorithms to predict high-risk HCV infection. All 4 assessed ML models demonstrated strong discrimination performance, with C statistics ranging from 0.854 to 0.916, and the GBM model performed the best. The GBM algorithm effectively stratified the risk, capturing 75.63% and 90.25% of individuals living with HCV in the top risk decile and top first to third risk decile subgroups, respectively. These excellent performance metrics and the model's ability to effectively stratify risk highlight its potential for enabling efficient targeted screening.

To the best of our knowledge, 3 studies have developed ML algorithms to predict HCV risk [27–29]. One study used tree-based models with claims data and reported good performance, but it lacked clinical and laboratory information and likely misclassified undiagnosed cases as controls due to reliance on diagnosis codes alone [28]. Additionally, its use of a predictor window spanning over 4 years poses a challenge for real-world applicability. Two other studies utilized XGBoost or gradient-boosted tree models with EHR data, similar to our study. However, both were limited by misclassification of undiagnosed HCV cases in the control group, a 2-year predictor collection period, or suboptimal ML performances (C statistics <0.8) [27, 29]. In contrast, our study addressed these limitations by using laboratory-confirmed HCV test results to accurately classify cases and controls, avoiding misclassification. We also adopted a more practical 6-month predictor window aligned with the CDC recommendations, enabling real-time clinical use. Results from the sensitivity analysis using a 12-month window were similar, suggesting a longer predictor window does not enhance algorithm performance. Leveraging a large, recent all-payer EHR dataset (n = 445 624; 2016–2023), our models achieved strong discriminatory performance and improved the feasibility of targeted HCV screening in clinical practice.

Consistent with previous ML studies [27–29], our findings confirmed that histories of IDU and SUD are important predictors of HCV infection, in line with numerous studies highlighting their crucial role in HCV transmission [5–7]. The majority of individuals in the United States become infected with HCV by sharing needles, syringes, or other drug-injection equipment [3]. Similarly, SUD can be a marker for IDU that is not captured by a recorded history of IDU, or for other risky behaviors (eg, unprotected, high-risk sexual activities) that can increase the risk of HCV exposure [30]. Laboratory tests related to liver function, such as prothrombin time and liver function tests, were also identified as important predictors in our study and by others [27–29].

Distinct from previous studies, our study identified that HIV diagnosis and testing serve as important predictors of HCV infection. HIV and HCV share common transmission routes, predominantly through blood-borne pathways and IDU [6, 11]. Individuals co-infected with HIV and HCV exhibit higher HCV RNA levels due to HIV-related immunosuppression, which impairs the body's ability to clear HCV, subsequently enhancing viral replication and transmission risk [31]. It is important to note that in our analysis, HIV testing refers to whether an individual underwent an HIV test, regardless of whether they ultimately received an HIV diagnosis. In our dataset, 18.05% of individuals had a record of HIV testing, while only 1.29% tested positive for HIV, indicating that HIV testing history encompasses a much broader at-risk population than HIV diagnosis alone. Notably, unlike previous studies, our study included a history of undergoing an HIV test as a potential predictor and found that it was the most important predictor, regardless of the test result. This finding highlights the importance of including HIV testing in the HCV prediction model. In fact, over 25% of individuals infected with HIV in the United States also have HCV, and HCV infection can serve as a marker of HIV exposure, as HCV is most often transmitted before HIV [32, 33]. This finding also underscores the importance of integrated screenings for HCV and HIV infections, particularly within populations at risk for blood-borne infections.

Our study findings also suggested an association between healthcare access and the risk of HCV infection. Specifically, we found that frequent ED visits were associated with an increased risk of HCV infection, while frequent outpatient visits were associated with a decreased risk of HCV infection. We believe that these results may reflect distinct populations: individuals with access to healthcare resources versus individuals without insurance or with low socioeconomic status who rely on the ED as their primary source of care. Additionally, individuals who frequently access healthcare are more likely to undergo testing and receive an early diagnosis. Indeed, the ED often serves as the primary source of healthcare for patients at high risk of HCV or HIV, as well as for populations with limited access to care, including individuals with IDU and uninsured individuals without access to other healthcare [34–37]. This finding emphasizes the potential of the ED as a key and strategic component for extending HCV testing, supporting efforts such as HCV/HIV testing initiatives implemented in EDs [38–42]. However, challenges remain in optimizing HCV testing effectively and efficiently in ED settings without disrupting workflow during time-constrained visits.

Our ML-based HCV prediction algorithms offer more precise identification of HCV cases, reducing the need to screen large populations while improving detection rates of undiagnosed infections, thereby aiding clinical decision-making [14]. In practice, healthcare systems can integrate this algorithm into EHRs to generate alerts for individuals at high risk of HCV infection. By setting appropriate risk score thresholds, healthcare providers can adjust the desired HCV detection rate. The risk score can be updated regularly in the EHR at a predefined interval (eg, biweekly). For example, when a patient presents to the clinic, a precomputed risk score could trigger a real-time provider alert, enabling timely screening. Setting a cutoff at the risk score corresponding to the lower bound of the third-highest risk decile could potentially identify 90% of individuals with HCV. Additionally, implementing this algorithm could substantially reduce unnecessary testing, improving healthcare system efficiency and optimizing resource allocation [43]. However, it is important to note that while our model demonstrated robust performance, it exhibited a higher FNR for Asian individuals, who had the lowest HCV-positive rates compared with individuals in the other race categories assessed. This issue may be mitigated by applying a different risk score cutoff for Asian patients.
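One way to picture the alerting workflow described above is a lookup of a precomputed risk score against decile cutoffs fixed in advance from the training sample. The sketch below is a hypothetical illustration using only the Python standard library. The function names, the toy training scores, and the choice to alert on the top 3 deciles are assumptions for illustration, not the study's implementation.

```python
import statistics
from bisect import bisect_right


def decile_cutoffs(training_scores):
    """Nine cut points splitting the training scores into 10 equal groups."""
    return statistics.quantiles(training_scores, n=10, method="inclusive")


def risk_decile(score, cutoffs):
    """Map a score to a risk decile, where 1 is the highest-risk decile."""
    # bisect_right counts the cut points at or below the score (0 to 9).
    return 10 - bisect_right(cutoffs, score)


def should_alert(score, cutoffs, top_deciles=3):
    """Flag a patient whose precomputed score falls in the top deciles."""
    return risk_decile(score, cutoffs) <= top_deciles


# Toy training scores for illustration only (0.00, 0.01, ..., 1.00)
training_scores = [i / 100 for i in range(101)]
cutoffs = decile_cutoffs(training_scores)

for score in (0.95, 0.75, 0.65):
    print(score, risk_decile(score, cutoffs), should_alert(score, cutoffs))
```

Keeping the cutoffs fixed from the training sample means a new patient's decile does not shift with whoever else happens to be scored that day. A subgroup-specific cutoff, as suggested for Asian patients, would amount to passing a different `top_deciles` value or a separate set of cut points.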

Our risk prediction algorithms are easy to apply, as the predictors in our study utilized historical data that are routinely captured in clinical practice. Additionally, to address the lack of transparency commonly associated with ML models, we provided key predictors. This approach helps clinicians interpret the risk scores in the context of the important predictors [44]. Moreover, since our algorithm was developed using data from the PCORnet Common Data Model, its findings are generalizable to a broad array of datasets, enhancing its applicability and relevance in diverse clinical settings [15].

Despite the promising findings, several limitations need to be acknowledged. First, our analysis was based on data from the Florida all-payer EHR database, which may limit the generalizability of our findings for populations outside Florida. External validation in other regions and settings is necessary to confirm the applicability of the model. Second, we acknowledge that predictive performance may decrease as the prediction window is extended further into the future. A strategy of regularly updating risk scores can help mitigate this performance loss and maintain clinical relevance, though the optimal update interval should be established by future prospective studies. Third, due to the high FNRs among Asian individuals, applying different risk score cutoffs for this population may introduce additional complexity in clinical settings. Ensuring that these adjustments do not result in inequities in healthcare provision will require careful management and continuous monitoring. Fourth, the predictive ML methods used in this study generated associative rather than causative results; therefore, individual predictors should be carefully interpreted. Fifth, we were unable to directly compare our algorithm to existing screening strategies, such as universal screening. Future prospective studies are needed to evaluate the effectiveness of the algorithm and external validation in real-world clinical settings.

CONCLUSIONS

Our prognostic modeling study demonstrated the efficacy of ML algorithms, particularly a GBM model, for identifying individuals at high risk of HCV infection. By addressing the limitations of previously developed algorithms, we developed more advanced predictive algorithms capable of effectively identifying and stratifying individuals at high risk of HCV. Our findings identified the most important HCV risk factors, including a history of IDU, SUD, or HIV diagnosis and testing, and frequent ED visits. The findings of this study underscore the potential for implementing targeted screening in clinical settings to improve HCV screening as part of public health strategies.

Notes

Author contributions. All authors contributed to the conceptualization and design of the study. H. P. and S.-C. J. curated the data. S.-C. J. and J. H. performed the data analyses. S.-C. J., W.-H. L.-C., J. H., and H. P. interpreted the data, with assistance from all other authors. S.-C. J. and H. P. drafted the original manuscript. All authors reviewed, edited, and approved the final manuscript.

Patient consent. This study was conducted in accordance with the Declarations of Helsinki and Istanbul. The study, based on a retrospective cohort, was approved by the institutional review board of the University of Florida (approval number 202301553). As participants' personal information was anonymized, the study did not require informed consent.

Data availability. The data used in this study come from the OneFlorida+ database, accessible through appropriate authorization (https://onefl.net/). Researchers seeking access to data for replication or verification should contact the OneFlorida+ Data Consortium.

Financial support. Research reported in this publication was supported by the National Institute on Drug Abuse of the National Institutes of Health (R01DA057886). The views presented here are those of the authors alone and do not necessarily represent the views of the National Institute on Drug Abuse of the National Institutes of Health.

Contributor Information

Suk-Chan Jang, College of Pharmacy, University of Florida, Gainesville, Florida, USA.

Wei-Hsuan Lo-Ciganic, Department of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, USA; North Florida/South Georgia Veterans Health System Geriatric Research, Education and Clinical Center (GRECC), Gainesville, Florida, USA.

Pilar Hernandez-Con, College of Pharmacy, University of Florida, Gainesville, Florida, USA.

Chanakan Jenjai, College of Pharmacy, University of Florida, Gainesville, Florida, USA.

James Huang, College of Pharmacy, University of Florida, Gainesville, Florida, USA.

Ashley Stultz, College of Pharmacy, University of Florida, Gainesville, Florida, USA.

Shunhua Yan, College of Pharmacy, University of Florida, Gainesville, Florida, USA.

Debbie L Wilson, College of Pharmacy, University of Florida, Gainesville, Florida, USA.

Ashley Norse, College of Medicine, University of Florida, Jacksonville, Florida, USA.

Faheem W Guirgis, Department of Emergency Medicine, College of Medicine, University of Florida, Gainesville, Florida, USA.

Robert L Cook, Department of Epidemiology, University of Florida, Gainesville, Florida, USA.

Christine Gage, College of Medicine, University of Florida, Jacksonville, Florida, USA.

Khoa A Nguyen, College of Pharmacy, University of Florida, Gainesville, Florida, USA.

Patrick Hornes, College of Medicine, University of Florida, Gainesville, Florida, USA.

Yonghui Wu, College of Medicine, University of Florida, Gainesville, Florida, USA.

David R Nelson, Department of Emergency Medicine, College of Medicine, University of Florida, Gainesville, Florida, USA.

Haesuk Park, College of Pharmacy, University of Florida, Gainesville, Florida, USA.

Supplementary Data

Supplementary materials are available at Open Forum Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.

References

  • 1. Stanaway JD, Flaxman AD, Naghavi M, et al. The global burden of viral hepatitis from 1990 to 2013: findings from the Global Burden of Disease Study 2013. Lancet 2016; 388:1081–8.
  • 2. Lewis KC, Barker LK, Jiles RB, Gupta N. Estimated prevalence and awareness of hepatitis C virus infection among US adults: National Health and Nutrition Examination Survey, January 2017-March 2020. Clin Infect Dis 2023; 77:1413–5.
  • 3. Centers for Disease Control and Prevention. Viral Hepatitis Surveillance Report—United States, 2022. https://www.cdc.gov/hepatitis-surveillance-2022/about/index.html. Accessed 6 January 2025.
  • 4. Centers for Disease Control and Prevention. Hepatitis C Kills More Americans than Any Other Infectious Disease. https://archive.cdc.gov/www_cdc_gov/media/releases/2016/p0504-hepc-mortality.html. Accessed 14 February 2025.
  • 5. Liang TJ, Ward JW. Hepatitis C in injection-drug users—a hidden danger of the opioid epidemic. N Engl J Med 2018; 378:1169–71.
  • 6. Akbarzadeh V, Mumtaz GR, Awad SF, Weiss HA, Abu-Raddad LJ. HCV prevalence can predict HIV epidemic potential among people who inject drugs: mathematical modeling analysis. BMC Public Health 2016; 16:1216.
  • 7. Bull-Otterson L, Huang YA, Zhu W, King H, Edlin BR, Hoover KW. Human immunodeficiency virus and hepatitis C virus infection testing among commercially insured persons who inject drugs, United States, 2010–2017. J Infect Dis 2020; 222:940–7.
  • 8. US Preventive Services Task Force, Owens DK, Davidson KW, et al. Screening for hepatitis C virus infection in adolescents and adults: US Preventive Services Task Force recommendation statement. JAMA 2020; 323:970–5.
  • 9. Falade-Nwulia O, Suarez-Cuervo C, Nelson DR, Fried MW, Segal JB, Sulkowski MS. Oral direct-acting agent therapy for hepatitis C virus infection: a systematic review. Ann Intern Med 2017; 166:637–48.
  • 10. Schillie S, Wester C, Osborne M, Wesolowski L, Ryerson AB. CDC recommendations for hepatitis C screening among adults—United States, 2020. MMWR Recomm Rep 2020; 69:1–17.
  • 11. Taylor LE, Swan T, Mayer KH. HIV coinfection with hepatitis C virus: evolving epidemiology and treatment paradigms. Clin Infect Dis 2012; 55(Suppl 1):S33–42.
  • 12. Anderson ES, Galbraith JW, Deering LJ, et al. Continuum of care for hepatitis C virus among patients diagnosed in the emergency department setting. Clin Infect Dis 2017; 64:1540–6.
  • 13. Burrell CN, Sharon MJ, Davis S, et al. Using the electronic medical record to increase testing for HIV and hepatitis C virus in an Appalachian emergency department. BMC Health Serv Res 2021; 21:524.
  • 14. Hastie T. The elements of statistical learning: data mining, inference, and prediction. New York, NY, USA: Springer, 2009.
  • 15. Hogan WR, Shenkman EA, Robinson T, et al. The OneFlorida data trust: a centralized, translational research data infrastructure of statewide scope. J Am Med Inform Assoc 2022; 29:686–93.
  • 16. Collins GS, Moons KGM, Dhiman P, et al. TRIPOD + AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024; 385:e078378.
  • 17. Centers for Disease Control and Prevention. Viral Hepatitis: Guide for Investigating Outbreaks in Health Care Settings. https://www.cdc.gov/hepatitis/php/health-care-outbreak-toolkit/guide.html?utm_source=chatgpt.com. Accessed 20 February.
  • 18. Hajarizadeh B, Grebely J, Dore GJ. Epidemiology and natural history of HCV infection. Nat Rev Gastroenterol Hepatol 2013; 10:553–62.
  • 19. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988; 44:837–45.
  • 20. Armstrong RA. When to use the Bonferroni correction. Ophthalmic Physiol Opt 2014; 34:502–8.
  • 21. Fluss R, Faraggi D, Reiser B. Estimation of the Youden Index and its associated cutoff point. Biom J 2005; 47:458–72.
  • 22. Xu J, Xiao Y, Wang WH, et al. Algorithmic fairness in computational medicine. EBioMedicine 2022; 84:104250.
  • 23. Birmingham LE, Cheruvu VK, Frey JA, Stiffler KA, VanGeest J. Distinct subgroups of emergency department frequent users: a latent class analysis. Am J Emerg Med 2020; 38:83–8.
  • 24. LaCalle E, Rabin E. Frequent users of emergency departments: the myths, the data, and the policy implications. Ann Emerg Med 2010; 56:42–8.
  • 25. Soril LJ, Leggett LE, Lorenzetti DL, Noseworthy TW, Clement FM. Characteristics of frequent users of the emergency department in the general adult population: a systematic review of international healthcare systems. Health Policy 2016; 120:452–61.
  • 26. Nohara Y, Matsumoto K, Soejima H, Nakashima N. Explanation of machine learning models using improved Shapley additive explanation. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Niagara Falls, NY, USA: Association for Computing Machinery, 2019:546.
  • 27. Fong A, Hughes J, Gundapenini S, et al. Evaluation of structured, semi-structured, and free-text electronic health record data to classify hepatitis C virus (HCV) infection. GI Disorders 2023; 5:115–26.
  • 28. Doyle OM, Leavitt N, Rigg JA. Finding undiagnosed patients with hepatitis C infection: an application of artificial intelligence to patient claims data. Sci Rep 2020; 10:10521.
  • 29. Rigg J, Doyle O, McDonogh N, et al. Finding undiagnosed patients with hepatitis C virus: an application of machine learning to US ambulatory electronic medical records. BMJ Health Care Inform 2023; 30:e100651.
  • 30. Platt L, Easterbrook P, Gower E, et al. Prevalence and burden of HCV co-infection in people living with HIV: a global systematic review and meta-analysis. Lancet Infect Dis 2016; 16:797–808.
  • 31. Sulkowski MS. Viral hepatitis and HIV coinfection. J Hepatol 2008; 48:353–67.
  • 32. Centers for Disease Control and Prevention. National Center for HIV, Viral Hepatitis, STD, and Tuberculosis Prevention. HIV and Viral Hepatitis. https://stacks.cdc.gov/view/cdc/48584/cdc_48584_DS1.pdf. Accessed 20 February 2025.
  • 33. Vickerman P, Hickman M, May M, Kretzschmar M, Wiessing L. Can hepatitis C virus prevalence be used as a measure of injection-related human immunodeficiency virus risk in populations of injecting drug users? An ecological analysis. Addiction 2010; 105:311–8.
  • 34. Centers for Disease Control and Prevention. National Hospital Ambulatory Medical Care Survey: 2017 Emergency Department Summary Table. https://archive.cdc.gov/www_cdc_gov/nchs/data/nhamcs/web_tables/2017_ed_web_tables-508.pdf. Accessed 20 February 2025.
  • 35. Hsieh YH, Patel AV, Loevinsohn GS, Thomas DL, Rothman RE. Emergency departments at the crossroads of intersecting epidemics (HIV, HCV, injection drug use and opioid overdose)-Estimating HCV incidence in an urban emergency department population. J Viral Hepat 2018; 25:1397–400.
  • 36. Liaw W, Petterson S, Rabin DL, Bazemore A. The impact of insurance and a usual source of care on emergency department use in the United States. Int J Family Med 2014; 2014:842847.
  • 37. Becker WC, Fiellin DA, Merrill JO, et al. Opioid use disorder in the United States: insurance status and treatment access. Drug Alcohol Depend 2008; 94:207–13.
  • 38. Schechter-Perkins EM, Miller NS, Hall J, et al. Implementation and preliminary results of an emergency department nontargeted, opt-out hepatitis C virus screening program. Acad Emerg Med 2018; 25:1216–26.
  • 39. Calner P, Sperring H, Ruiz-Mercado G, et al. HCV screening, linkage to care, and treatment patterns at different sites across one academic medical center. PLoS One 2019; 14:e0218388.
  • 40. Ford JS, Chechi T, Toosi K, et al. Universal screening for hepatitis C virus in the ED using a best practice advisory. West J Emerg Med 2021; 22:719–25.
  • 41. Daniel Moore J, Galbraith J, Humphries R, Havens JR. Prevalence of hepatitis C virus infection identified from nontargeted screening among adult visitors in an academic Appalachian regional emergency department. Open Forum Infect Dis 2021; 8(8):ofab374.
  • 42. Hluhanich R, Ford JS, Bruce D, et al. Comparing hepatitis C virus screening in clinics versus the emergency department. West J Emerg Med 2022; 23:312–7.
  • 43. Durham DP, Skrip LA, Bruce RD, et al. The impact of enhanced screening and treatment on hepatitis C in the United States. Clin Infect Dis 2016; 62:298–304.
  • 44. Yang G, Ye Q, Xia J. Unbox the black-box for the medical explainable AI via multi-modal and multi-centre data fusion: a mini-review, two showcases and beyond. Inf Fusion 2022; 77:29–52.

Articles from Open Forum Infectious Diseases are provided here courtesy of Oxford University Press
