Abstract
Summary
Background
Cardiovascular outcomes for people with familial hypercholesterolaemia can be improved with diagnosis and medical management. However, 90% of individuals with familial hypercholesterolaemia remain undiagnosed in the USA. We aimed to accelerate early diagnosis and timely intervention for more than 1·3 million undiagnosed individuals with familial hypercholesterolaemia at high risk for early heart attacks and strokes by applying machine learning to large health-care encounter datasets.
Methods
We trained the FIND FH machine learning model using deidentified health-care encounter data, including procedure and diagnostic codes, prescriptions, and laboratory findings, from 939 clinically diagnosed individuals with familial hypercholesterolaemia (395 of whom had a molecular diagnosis) and 83 136 individuals presumed free of familial hypercholesterolaemia, sampled from four US institutions. The model was then applied to a national health-care encounter database (170 million individuals) and an integrated health-care delivery system dataset (174 000 individuals). Individuals used in model training and those evaluated by the model were required to have at least one cardiovascular disease risk factor (eg, hypertension, hypercholesterolaemia, or hyperlipidaemia). A Health Insurance Portability and Accountability Act of 1996-compliant programme was developed to allow providers to receive identification of individuals likely to have familial hypercholesterolaemia in their practice.
Findings
Using a model with a measured precision (positive predictive value) of 0·85, recall (sensitivity) of 0·45, area under the precision–recall curve of 0·55, and area under the receiver operating characteristic curve of 0·89, we flagged 1 331 759 of 170 416 201 patients in the national database and 866 of 173 733 individuals in the health-care delivery system dataset as likely to have familial hypercholesterolaemia. Familial hypercholesterolaemia experts reviewed a sample of flagged individuals (45 from the national database and 103 from the health-care delivery system dataset) and applied clinical familial hypercholesterolaemia diagnostic criteria. Of those reviewed, 87% (95% CI 73–100) in the national database and 77% (68–86) in the health-care delivery system dataset were categorised as having a high enough clinical suspicion of familial hypercholesterolaemia to warrant guideline-based clinical evaluation and treatment.
Interpretation
The FIND FH model successfully scans large, diverse, and disparate health-care encounter databases to identify individuals with familial hypercholesterolaemia.
Introduction
Familial hypercholesterolaemia is a common inherited condition affecting approximately one in every 250 individuals worldwide and causes lifelong elevations in LDL cholesterol and premature coronary artery disease.1 Timely identification of familial hypercholesterolaemia and initiation of guideline-based therapies can markedly attenuate the risk of coronary artery disease. Unfortunately, fewer than 10% of individuals with familial hypercholesterolaemia have been identified, creating a huge reservoir of unidentified and untreated individuals with the condition.1–3 Current guidelines from the American College of Cardiology and American Heart Association, WHO, and the Centers for Disease Control and Prevention recommend screening to identify families with familial hypercholesterolaemia.1,4–7
The best method for large-scale screening has yet to be established but is likely to include both efficient cascade screening and effective index identification.8–12 Machine learning models have promising applications to medicine13–15 because they can efficiently analyse datasets at scale and have the potential to predict outcomes across disciplines.16 Characteristics of familial hypercholesterolaemia in large health-care-related datasets (eg, laboratory values, procedure codes, diagnosis codes, or prescribing information) are amenable to this approach. We have previously reported results of a successful application of a machine learning model to identify undiagnosed individuals with familial hypercholesterolaemia built from and applied to single health-care institutions.15 The current study, which was conceived of in parallel, aims to build a model that can be applied at both the institutional and national health-care database scales to identify new index cases.
To this end, we constructed the FIND FH machine learning model. We aimed to define model characteristics using individuals with familial hypercholesterolaemia and individuals presumed not to have familial hypercholesterolaemia from four health-care systems; assess whether the derived model can be effectively applied to different sources of health-care encounter data, including a national database and an integrated health-care delivery system; and determine whether the derived model successfully identifies individuals with medical profiles consistent with familial hypercholesterolaemia in independent clinical settings. We hypothesised that targeted screening with a machine learning model applied to large health-care datasets can identify individuals highly likely to have familial hypercholesterolaemia.
Methods
Study design
In this study, we used electronic health record (EHR) data from large academic health systems to build and train the FIND FH machine learning model. The training dataset was split using the 80–20 holdout method to tune the parameters and measure model performance. The model was then externally validated on two independent, real-world datasets.
Training dataset
The FIND FH initiative predated the availability of an International Classification of Diseases, Tenth Revision (ICD-10) code for familial hypercholesterolaemia. Therefore, we needed an alternative means of identifying true-positive familial hypercholesterolaemia training examples for the model. We chose to work with familial hypercholesterolaemia experts from various health systems to ensure we had high-quality true positives from across the USA. To this end, we used structured EHR data from four different health systems: Stanford University, University of Pennsylvania, Geisinger Medical Center, and Ohio State University. These datasets include all individuals who interacted with these health systems, regardless of age or insurance coverage. The data cover 3 years of individual history, from Sept 1, 2013, to Aug 31, 2016.
The data used included prescription, procedure, diagnosis, and laboratory result data. Diagnosis and procedure data in the study used ICD-9, ICD-10, and Current Procedural Terminology standard formats. Common data dictionaries were developed for prescription and laboratory result data. Unstructured data, including clinical notes and other data from patient medical files (eg, notes about family history of cardiovascular disease), were not included. To train the model, we exposed machine learning technology to data from individuals with clinically diagnosed familial hypercholesterolaemia and found relevant medical patterns in the years leading up to their diagnosis.
A supervised learning analysis requires an adequate number of clinically diagnosed individuals as well as a large number of presumed controls to train the model. We focused model development on individuals with at least one cardiovascular disease or primary prevention (eg, with hypertension, hypercholesterolaemia, or hyperlipidaemia without cardiovascular disease) comorbidity claim, and trained the model to differentiate those with familial hypercholesterolaemia from those presumed without familial hypercholesterolaemia within this large subset (see appendix pp 11–24 for the full list of comorbidities and their medical codes). For our training examples, a case was defined as an individual with a clinical diagnosis of familial hypercholesterolaemia by a lipid expert3 and a presumed control was defined as an individual with no previous diagnosis of familial hypercholesterolaemia by a lipid expert in their medical record. The total sample of training examples from the four datasets comprised 939 individuals with familial hypercholesterolaemia (42% of whom were genetically confirmed) and 83 136 presumed controls (table 1). Training examples were drawn from multiple institutions, where there might be small differences in therapeutic and coding patterns, to make the model more robust.
Table 1:
Clinical and demographic data for the training dataset
| | Stanford University: cases | Stanford University: presumed controls | University of Pennsylvania: cases | University of Pennsylvania: presumed controls | Geisinger Medical Center: cases | Geisinger Medical Center: presumed controls | Ohio State University: cases | Ohio State University: presumed controls |
|---|---|---|---|---|---|---|---|---|
| Number of participants | 106 | 7699 | 293 | 34 797 | 446 | 32 640 | 94 | 8000 |
| Age, years | 49 (14) | 67 (15) | 58 (15) | 60 (19) | 63 (16) | 62 (17) | 53 (14) | 60 (16) |
| Female participants | 54 (50.9%) | 3379 (43.9%) | 195 (66.6%) | 20 244 (58.2%) | 272 (61.0%) | 15 639 (47.9%) | 54 (57.4%) | 4031 (50.4%) |
| Genetically confirmed variant of familial hypercholesterolaemia | 20 (18.9%) | .. | 159 (54.3%) | .. | 216 (48.4%) | .. | .. | .. |
| Individuals with data on laboratory results | 97 (91.5%) | 7517 (97.6%) | 273 (93.2%) | 28 964 (83.2%) | 446 (100%) | 32 244 (98.8%) | 73 (77.7%) | 5868 (73.4%) |
| Individuals with prescription data | 89 (84.0%) | 6346 (82.4%) | 284 (96.9%) | 33 723 (96.9%) | 445 (99.8%) | 32 485 (99.5%) | 93 (98.9%) | 7073 (88.4%) |
| Existing diagnoses | ||||||||
| Arteriosclerotic cardiovascular disease | 25 (23.6%) | 2066 (26.8%) | 147 (50.2%) | 10 904 (31.3%) | 172 (38.6%) | 8606 (26.4%) | 70 (74.5%) | 3973 (49.7%) |
| Hypercholesterolaemia | 92 (86.8%) | 3022 (39.3%) | 279 (95.2%) | 16 714 (48.0%) | 92 (20.6%) | 5930 (18.2%) | 80 (85.1%) | 3659 (45.7%) |
| Diabetes | 8 (7.5%) | 1585 (20.6%) | 26 (8.9%) | 7951 (23%) | 24 (5.4%) | 2472 (7.6%) | 21 (22.3%) | 1607 (20.4%) |
| Hypertension | 18 (17.0%) | 5151 (66.9%) | 119 (40.6%) | 18 230 (52.4%) | 57 (12.8%) | 6676 (20.5%) | 50 (53.2%) | 4550 (56.9%) |
| Using any statin | 71 (67.0%) | 2088 (27.1%) | 224 (76.5%) | 14 187 (40.8%) | 392 (87.9%) | 18 347 (56.2%) | 85 (90.4%) | 3240 (40.5%) |
Data are n (%) or mean (SD). In the training dataset, cases are individuals with diagnosed familial hypercholesterolaemia and presumed controls are individuals presumed to not have familial hypercholesterolaemia.
Model features developed for individuals in the familial hypercholesterolaemia and non-familial hypercholesterolaemia datasets included demographic features (ie, age and sex) and other features built from health-care encounters and laboratory data. More than 60 000 features were tested. The most frequently used features were counting features, such as the number of times a given individual received a specific prescription or underwent a specific procedure within a time window. Combinations of long-term prescription and laboratory result histories were assessed.
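Counting features of this kind can be illustrated with a minimal sketch. The record layout, the example codes, and the `count_features` helper are hypothetical and chosen only for illustration; the study's actual feature pipeline is not described at this level of detail.

```python
from collections import Counter
from datetime import date

# Hypothetical encounter records: (patient_id, record_type, code, date).
# Codes are illustrative examples only.
encounters = [
    ("p1", "rx", "atorvastatin", date(2015, 1, 10)),
    ("p1", "rx", "atorvastatin", date(2015, 7, 2)),
    ("p1", "dx", "E78.00", date(2015, 7, 2)),
    ("p1", "rx", "atorvastatin", date(2012, 3, 1)),  # outside the window
    ("p2", "dx", "I10", date(2014, 5, 20)),
]


def count_features(records, patient_id, start, end):
    """Count how often each (type, code) pair occurs for one patient in [start, end]."""
    counts = Counter()
    for pid, rtype, code, when in records:
        if pid == patient_id and start <= when <= end:
            counts[(rtype, code)] += 1
            counts[(rtype, "total")] += 1  # total number of codes of this type
    return counts


# The 3-year data window described for the training dataset.
window = (date(2013, 9, 1), date(2016, 8, 31))
features_p1 = count_features(encounters, "p1", *window)
print(features_p1[("rx", "atorvastatin")])  # 2
print(features_p1[("dx", "total")])         # 1
```

Each patient then contributes one row of such counts to the feature matrix, with absent codes counted as zero.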
The expected prevalence of familial hypercholesterolaemia is an important input parameter as the performance of the model varies with it. Due to the prespecified cardiovascular disease or primary prevention selection criteria, the expected prevalence of familial hypercholesterolaemia in our data will be higher than in the general population. We estimated the prevalence of familial hypercholesterolaemia in this cohort by calculating the ratio of the number of individuals with undiagnosed familial hypercholesterolaemia in the USA (1·3 million, assuming 1:250 prevalence) to the total number of individuals in the health-care encounter dataset (92 million, after applying the comorbidity requirement). Assuming that our comorbidity criteria are loose enough to capture the vast majority of these individuals, an expected prevalence ratio of 1:71 was derived. This expected prevalence ratio was used to balance the number of individuals in our positive and negative training datasets by down-sampling the negative data to remove bias in the training procedure.
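The arithmetic behind the 1:71 ratio, and one possible way to down-sample controls to match it, can be sketched as follows. The `downsample_controls` helper is hypothetical, and reading 1:71 as one case per 71 individuals (ie, 70 controls per case) is an assumption made for this sketch; the text does not state which convention was used.

```python
import random

undiagnosed_fh = 1_300_000    # estimated undiagnosed individuals in the USA (from the text)
eligible_cohort = 92_000_000  # individuals meeting the comorbidity requirement (from the text)

one_in = eligible_cohort / undiagnosed_fh
print(round(one_in))  # 71


def downsample_controls(controls, n_cases, one_in=71, seed=0):
    """Randomly keep enough controls for a prevalence of one case in `one_in`.

    Hypothetical helper: assumes 1:71 means one case per 71 individuals,
    so (one_in - 1) controls are kept per case.
    """
    target = min(len(controls), n_cases * (one_in - 1))
    return random.Random(seed).sample(controls, target)


cases = list(range(100))
controls = list(range(10_000))
kept = downsample_controls(controls, len(cases))
print(len(kept))  # 7000
```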
The FIND FH model
The FIND FH machine learning model consists of two consecutive random forest17 model layers (figure 1). The first random forest layer selects the highest ranked features (ie, those that improve predictive value). The training data are then filtered to include only those highly ranked features, which are passed to the second random forest. Random forests were used since they are easily interpretable, prevent overtraining, and can capture non-linear relationships among features. Implementing two consecutive random forest layers was found to both improve performance and increase the portability of the model across different hospital network datasets. Model building tested the inclusion of more than 60 000 input features. We tested several other algorithms for the second model layer, including gradient-boosted and adaptive-boosted decision trees, support vector machines, and a logistic regression model. When considering performance metrics (ie, precision and recall) and examining the decisions made by each model, the random forest performed best and was less likely to exhibit overfitting compared with other tree-based methods. We did not investigate resampling techniques because the model performance met our goals.
Figure 1: Model training.
Training and optimisation setup for the FIND FH model. Individuals with and without familial hypercholesterolaemia from four institutions were pooled together to train the model. Model performance was measured on an 80–20 holdout tuning dataset. *Presumed controls were down-sampled to meet the 1:71 prevalence ratio.
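The two-layer design can be sketched with scikit-learn as below. This is an illustration under stated assumptions, not the study's implementation: the data are synthetic, the number of trees and all other hyperparameters are arbitrary, and the code targets current Python 3 rather than the Python 2.7 used in the study. Only the 80–20 holdout and the 75 features passed between layers come from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real model used health-care encounter features.
X, y = make_classification(n_samples=2000, n_features=200, n_informative=15,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # 80-20 holdout

N_TOP = 75  # number of features passed to the second layer (75 in the text)

# Layer 1: fit on all features and rank them by importance.
ranker = RandomForestClassifier(n_estimators=50, random_state=0)
ranker.fit(X_train, y_train)
top_idx = np.argsort(ranker.feature_importances_)[::-1][:N_TOP]

# Layer 2: refit on the top-ranked features only.
# oob_score=True exposes the out-of-bag estimate used to monitor overfitting.
classifier = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
classifier.fit(X_train[:, top_idx], y_train)

# Scores on the holdout set: estimated probability of the positive class.
scores = classifier.predict_proba(X_hold[:, top_idx])[:, 1]
print(scores.shape, round(classifier.oob_score_, 3))
```

Restricting the second layer to a short feature list also means that only those features need to be checked for consistency when the model is moved to a new dataset.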
Internal model parameters were optimised on the training data on the basis of the final F1 score. The F1 statistic is the harmonic mean of the model’s precision (positive predictive value) and recall (sensitivity), and so provides a single balanced measure of the two. After model parameters were set, we used a holdout tuning dataset to define a threshold for probable familial hypercholesterolaemia. This threshold was set to identify and flag individuals with familial hypercholesterolaemia with precision greater than 0·6. Operationally, once the identified individuals are screened, the model can be run a second time with a threshold set at a lower precision to identify additional individuals if resources are available.
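One way to derive such a threshold from holdout scores is sketched below. The `threshold_for_precision` helper and the toy labels and scores are hypothetical; the study does not describe its threshold search in detail.

```python
def threshold_for_precision(y_true, scores, target_precision):
    """Return the lowest score threshold whose precision exceeds the target.

    Individuals with score >= threshold are flagged. Returns None if no
    threshold reaches the target.
    """
    best = None
    for t in sorted(set(scores), reverse=True):
        flagged = [y for y, s in zip(y_true, scores) if s >= t]
        precision = sum(flagged) / len(flagged)
        if precision > target_precision:
            best = t  # lowest qualifying threshold seen so far
    return best


# Toy holdout data: 1 = case, 0 = presumed control.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]

print(threshold_for_precision(y_true, scores, 0.6))  # 0.8
print(threshold_for_precision(y_true, scores, 0.5))  # 0.55
```

Lowering the precision target on a second pass, as the text describes, moves the threshold down and flags additional individuals at the cost of more false positives.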
Model performance and validation
To create the holdout tuning dataset, we excluded 20% of the individuals with and without familial hypercholesterolaemia in the combined training dataset from the training process. The FIND FH model performance was then computed in this 20% to generate a measure of the F1 score, precision, recall, area under the precision–recall curve (AUPRC), and area under the receiver operating characteristic curve (AUROC), independent of the training procedure. AUROC is the area under the curve defined by a model’s recall and false-positive rate (1 − specificity) measured at different thresholds. AUPRC is the area under the curve defined by a model’s precision and recall measured at different thresholds. In this study, the AUPRC is a more appropriate metric of model performance because our data are imbalanced and precision-based metrics better emphasise the efficiency of a model to correctly identify less frequent positive cases.
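The metric definitions above can be made concrete with a small pure-Python sketch. The helper names and toy data are hypothetical, and the step-wise average precision used here is only one common estimator of the AUPRC; the text does not say which estimator the study used.

```python
def confusion_metrics(y_true, scores, threshold):
    """Precision, recall, and F1 when scores >= threshold are flagged."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auroc(y_true, scores):
    """Probability that a random case outscores a random control (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    tp, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank  # precision at each recovered case
    return total / sum(y_true)


y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.8, 0.9]

precision, recall, f1 = confusion_metrics(y_true, scores, threshold=0.5)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
print(round(auroc(y_true, scores), 3))                      # 0.952
print(round(average_precision(y_true, scores), 3))          # 0.917
```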
We did not do a full cross validation, instead choosing the holdout method for simplicity; additionally, we could monitor overfitting or underfitting with random forest out-of-bag errors. Given the size of our dataset, we had a sufficient number of cases for the holdout method to produce consistent datasets; additionally, we verified that results were consistent by varying the random splitting.
To both externally validate and test the robustness and portability of the FIND FH model, we applied the model to two types of health-care encounter data: a national health-care encounter database and data from a large tertiary-care academic medical centre. In both cases, the datasets that the model was applied to were completely distinct from the training and tuning datasets. For alignment with the training phase and comparison purposes, individuals were required to have a documented history of at least one cardiovascular disease risk factor and to have no previous diagnosis of familial hypercholesterolaemia. For completeness, we investigated the dependence of model performance on familial hypercholesterolaemia prevalence and found differences to be negligible for a wide region of values around our established prevalence ratio of 1:71.
When applying the model to new datasets, we did an extensive analysis to flag data elements that were systematically missing or under-represented in the included individuals. Once these issues were found, they could be fixed or resolved with the data provider in a way that does not introduce bias into the results. One of the benefits of using two random forest layers is that we only needed to ensure that data characteristics used in the 75 features chosen by the model were consistent in these new datasets.
Our first external validation test set was a national database of health-care encounters consisting of diagnosis, procedure, and medication transactions on 170 416 201 American residents, secured from IQVIA (Durham, NC, USA), a company that provides health-care encounter data aggregated across hospital, payer, and pharmacy sources. Laboratory result data on 15 368 850 of these individuals were provided by Laboratory Corporation of America Holdings (LabCorp; Burlington, NC, USA). Each individual had a unique, anonymised identifier to allow linkage of encounter data with laboratory result data and to permit removal of identifying information. After merging the data, the FIND FH model was applied. Individuals with values above the threshold for probable familial hypercholesterolaemia were flagged; those that had been previously diagnosed were removed from the dataset. We developed an outreach programme, compliant with the Health Insurance Portability and Accountability Act of 1996 (HIPAA), that allowed us to notify health-care providers of these flagged individuals. Flagged individuals were associated with their primary attending physician. The FH Foundation then contacted chosen physicians with at least five flagged individuals in their practices. The targeted clinicians were all considered to be experts in the diagnosis and treatment of familial hypercholesterolaemia. The study group discussed the FIND FH model with these physicians and asked them if they wanted to know the identity of the flagged individuals in their practices. If they did, LabCorp, a covered entity (ie, health-care provider) under HIPAA, provided individual identification. The physicians were asked to determine the likelihood of familial hypercholesterolaemia in all identified patients using four methods: the three accepted diagnostic criteria (Dutch Lipid Clinic Network, Make Early Diagnosis to Prevent Early Deaths [MEDPED], and Simon Broome) and their expert clinical judgment. 
Using all of these criteria, the physicians categorised flagged individuals into one of the following categories: definite, probable, possible, inconclusive, or unlikely familial hypercholesterolaemia. Individuals with risk labelled possible or greater for any of the evaluations were considered candidates for further screening and follow-up.
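The candidate rule described above (possible or greater on any of the four evaluations) can be written as a short check. The dictionary keys and the `is_candidate` helper are hypothetical names used for illustration.

```python
# Categories that count as "possible or greater".
QUALIFYING = {"possible", "probable", "definite"}

# Hypothetical keys for the three diagnostic criteria and expert judgment.
METHODS = ("dutch_lipid_clinic_network", "medped", "simon_broome", "expert_judgment")


def is_candidate(evaluations):
    """True if any of the four evaluations is possible, probable, or definite."""
    return any(evaluations.get(method) in QUALIFYING for method in METHODS)


patient_a = {
    "dutch_lipid_clinic_network": "unlikely",
    "medped": "inconclusive",
    "simon_broome": "unlikely",
    "expert_judgment": "possible",
}
patient_b = {method: "unlikely" for method in METHODS}

print(is_candidate(patient_a))  # True
print(is_candidate(patient_b))  # False
```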
For the second external validation, we obtained independent anonymised data for 173 733 individuals who had previously had contact with the Oregon Health & Science University (OHSU) health-care system over 3 years (March 27, 2015–March 27, 2018). Data included diagnoses, procedures, laboratory tests, and medications. Individuals with an ICD-10 code for familial hypercholesterolaemia were removed from the list to limit the data to individuals without a previously documented diagnosis of familial hypercholesterolaemia, in keeping with the patient profile that the model was explicitly designed to identify. The FIND FH model was applied to this dataset and approximately 100 flagged individuals with at least one LDL cholesterol laboratory result in their history were chosen for review by the data analysts in the study (WH, DS, KDM), as this would give reviewers meaningful clinical data for assessment. Not all flagged individuals were reviewed owing to time limitations. Assessments included a review of all laboratory results, genetic tests, outpatient visits, inpatient or emergency department visits, family history, and established diagnoses. Records were reviewed on the OHSU Epic electronic medical record system (Epic Systems Corporation; Verona, WI, USA), including the Care Everywhere network, which integrates health-care records (including outpatient visits, inpatient or emergency department visits, and all diagnostic testing) for any hospital or health-care system using the Epic platform. Similar to the national database application, each individual was given a categorisation for each of the three diagnostic criteria as well as the physician’s clinical opinion.
The FIND FH model was built using open-source tools and libraries. The model and plotting code is written in Python (version 2.7), using the NumPy, SciPy, and scikit-learn libraries, including the scikit-learn random forest implementation.
Role of the funding source
The FH Foundation funded this study. The sponsors of the FH Foundation had no role in the study design, data procurement, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.
Results
The final FIND FH model included a mix of features: demographic, conditional (ie, designed to capture patient health response, both positive and negative, throughout the course of therapy), prescription based, diagnosis based, procedure based, and laboratory result based (table 2). These features were ranked by their importance in the model, in proportion to the number of decision nodes over the whole forest in which that feature was used and how effectively that feature distinguished between individuals with or without familial hypercholesterolaemia at those nodes (table 2). To quantify which categories of features had the greatest effect on model performance, we grouped the features and present the average importance of each category with mean rankings.
Table 2:
Breakdown of feature categories
| | Number of features investigated | Number of investigated features in the top 75 | Mean rank of all features in category | Mean rank of features in the top 75 | Top four features in category |
|---|---|---|---|---|---|
| Demographic | 2 | 2 | 18 | 18 | Age, sex |
| Conditional | 21 | 18 | 39 | 24 | High LDL cholesterol with no lipid-lowering therapies, high LDL cholesterol with high-intensity statin prescription, high LDL cholesterol with moderate-intensity statin prescription, high LDL cholesterol with statins and ezetimibe |
| Prescription based | 14 000 | 6 | 29 825 | 33 | Total number of prescription codes, number of atorvastatin prescriptions, number of rosuvastatin prescriptions, number of evolocumab prescriptions |
| Diagnosis based | 26 700 | 14 | 29 392 | 35 | Number of E78.00 codes (hypercholesterolaemia), total number of diagnosis codes, number of E78.4 or E78.5 codes (hyperlipidaemia), number of I10 codes (hypertension) |
| Procedure based | 14 200 | 8 | 31 806 | 47 | Total number of procedure codes, number of 93000 codes (electrocardiogram), number of 99214 codes (outpatient services), number of 36415 codes (venipuncture) |
| Laboratory result based | 3900 | 27 | 19 510 | 49 | Maximum value of total cholesterol, maximum value of LDL cholesterol, average value of LDL cholesterol, average value of total cholesterol |
Table shows the number and rank of all features investigated in a given category and of those features that were included in the top 75 features (ie, that were eligible for inclusion in the second random forest layer). Top features in each category are listed by rank.
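The mean rank per category reported in table 2 can be computed from a ranked feature list as in the sketch below. The feature names and importance values are invented for illustration and are not results from the study.

```python
from collections import defaultdict

# Hypothetical (feature, category, importance) triples; values are made up.
features = [
    ("age", "demographic", 0.09),
    ("max_ldl", "laboratory", 0.20),
    ("mean_ldl", "laboratory", 0.12),
    ("n_atorvastatin_rx", "prescription", 0.15),
    ("n_e78_00_dx", "diagnosis", 0.07),
    ("sex", "demographic", 0.02),
]

# Rank 1 is the most important feature.
ranked = sorted(features, key=lambda f: f[2], reverse=True)

ranks_by_category = defaultdict(list)
for rank, (_, category, _) in enumerate(ranked, start=1):
    ranks_by_category[category].append(rank)

mean_rank = {c: sum(r) / len(r) for c, r in ranks_by_category.items()}
print(mean_rank["laboratory"])   # 2.0
print(mean_rank["demographic"])  # 5.0
```

A lower mean rank indicates that features in the category tended to be more important to the model.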
Prescription-based, diagnosis-based, and procedure-based features include those created from the frequency of relevant health-care encounter transactions. Laboratory result features were used when available. These four categories provided the majority of the top 75 features, with laboratory-based features serving as the most frequent contributor, followed by health-care encounter-based features (ie, prescription, diagnosis, and procedure). Demographic features, including age and sex, also factored in prominently. Many conditional features based on specific combinations of the long-term prescription and laboratory result histories of individuals proved effective.
The FIND FH model performance was measured on the holdout tuning dataset (prevalence 1:71), yielding a precision (positive predictive value) of 0·85 and a recall (ie, sensitivity) of 0·45. The AUPRC of the model was found to be 0·55, whereas the AUROC was 0·89 (appendix pp 1–2). The optimisation of the model performance was primarily a scan of number of features passed between random forest layers. This is shown and discussed in the appendix (pp 6–7).
In the first external validation dataset (ie, the national database), 1 331 759 of 170 416 201 individuals were flagged by the model as likely to have familial hypercholesterolaemia, of whom 45 were reviewed (table 3; figure 2). A breakdown of the physician and reviewer diagnoses is presented in the appendix (pp 9–10). In agreement with the precision measurement on the holdout tuning data, 87% (95% CI 73–100) of the evaluated individuals in the national database were identified as having possible, probable, or definite familial hypercholesterolaemia by at least one of the diagnostic criteria or by the attending physician. Applying a simple test of LDL cholesterol greater than 190 mg/dL to each individual’s medical history would have only identified 18 (46%) of the 39 new cases of familial hypercholesterolaemia. The results from the FIND FH model application to the national database can be visualised at a national scale by showing the number of yet-to-be-diagnosed individuals with familial hypercholesterolaemia as a choropleth map aggregated to three-digit ZIP code geographies (figure 3). Visualising at this scale highlights high-density regions within the country and helps stakeholders to allocate resources to efficiently reach the unmet need. For example, we find high-density regions that track with founder populations known to have high rates of familial hypercholesterolaemia, including Ashkenazi Jewish and Amish populations. This trend holds both with crude counts (figure 3) and population-adjusted counts (appendix p 8).
Table 3:
Clinical and demographic data for the external validation datasets
| | National database: complete dataset | National database: flagged by model | National database: reviewed by attending physician | OHSU dataset: complete dataset | OHSU dataset: flagged by model | OHSU dataset: reviewed by expert | |
|---|---|---|---|---|---|---|---|
| Number of participants | 170 416 201 | 1 331 759 | 45 | 173 733 | 866 | 103 | |
| Age, years | 54 (21) | 61 (15) | 55 (17) | 50 (23) | 59 (14) | 45 (15) | |
| Female participants | 90 974 711 (53.4%) | 597 409 (44.9%) | 24 (53.3%) | 91 294 (52.5%) | 376 (43.4%) | 48 (46.6%) | |
| Individuals with data on laboratory results | 15 368 850 (9.0%) | 307 507 (23.1%) | 43 (95.6%) | 105 605 (60.8%) | 656 (75.8%) | 102 (99.0%) | |
| Individuals with prescription data | 82 979 553 (48.7%) | 1 137 815 (85.4%) | 45 (100%) | 126 624 (72.9%) | 649 (74.9%) | 100 (97.1%) | |
| Existing diagnoses | |||||||
| Arteriosclerotic cardiovascular disease | 21 568 927 (12.7%) | 335 180 (25.2%) | 12 (26.7%) | 24 104 (13.9%) | 570 (65.8%) | 44 (42.7%) | |
| Hypercholesterolaemia | 40 165 393 (23.6%) | 734 534 (55.2%) | 29 (64.4%) | 28 679 (16.5%) | 766 (88.5%) | 88 (85.4%) | |
| Diabetes | 20 199 157 (11.9%) | 279 631 (21.0%) | 4 (8.9%) | 17 326 (10.0%) | 100 (11.5%) | 12 (11.7%) | |
| Hypertension | 49 731 923 (29.2%) | 587 272 (44.1%) | 10 (22.2%) | 40 411 (23.3%) | 303 (35.0%) | 28 (27.2%) | |
| Using any statin | 26 677 718 (15.7%) | 513 639 (38.6%) | 31 (68.9%) | 25 567 (14.7%) | 469 (54.2%) | 71 (68.9%) | |
OHSU=Oregon Health & Science University.
Figure 2: Clinical review data Venn diagrams.
Clinical review validation data from individuals flagged in the FIND FH national database and OHSU dataset. Numbers indicate individuals categorised as possible, probable, or definite familial hypercholesterolaemia. Each individual was evaluated with the Dutch Lipid Clinic Network, MEDPED, and Simon Broome criteria and expert opinion; individuals identified with the Simon Broome criteria (11 in the national database and 29 in the OHSU dataset) were consistent with evaluations by the Dutch Lipid Clinic Network or MEDPED. In the national database, identification by an attending physician indicates the presence of an E78.01 (International Classification of Diseases, Tenth Revision code for familial hypercholesterolaemia) by a physician in the individual’s history subsequent to the model scoring and clinical review. In the OHSU dataset, expert assessment is clinical chart review without an in-person clinical assessment. MEDPED=Make Early Diagnosis to Prevent Early Deaths. OHSU=Oregon Health & Science University.
Figure 3: FIND FH heatmap.
Number of undiagnosed individuals with familial hypercholesterolaemia aggregated to three-digit ZIP code geographies. Values were not population adjusted; a population-adjusted version is presented in the appendix (p 8). The frequencies shown reflect the geographical coverage of the national health-care encounter and laboratory datasets and not a true, unbiased measure of the US population distribution.
In the second external validation dataset (ie, OHSU), 866 of 173 733 individuals were flagged by the model as likely to have familial hypercholesterolaemia, of whom 103 were reviewed (table 3; figure 2). Of the individuals reviewed, 77% (95% CI 68–86) were identified as having possible, probable, or definite familial hypercholesterolaemia by at least one of the diagnostic criteria or by the familial hypercholesterolaemia expert, in agreement with the precision measurement on the holdout tuning dataset. Applying a simple threshold of LDL cholesterol greater than 190 mg/dL to the individual’s medical history would have only identified 37 (47%) of the 79 new cases of familial hypercholesterolaemia.
Discussion
We describe here the development of a machine learning model, FIND FH, which is designed to identify phenotypic familial hypercholesterolaemia when applied to large medical datasets. FIND FH was built on longitudinal medical data from individuals with at least one documented cardiovascular disease risk factor in their history. This requirement was applied to allow us to collect data for model training and to optimise the model to the envisioned test cases, finding individuals embedded in a health-care system. Additionally, it forces the model to focus on discriminating between familial hypercholesterolaemia and other cardiac conditions that might mimic familial hypercholesterolaemia.
The FIND FH model is a precision screening tool and does not replace clinical evaluations or existing diagnostic criteria. It differs from the traditional diagnostic criteria (ie, Dutch Lipid Clinic Network, MEDPED, or Simon Broome) in that it does not require specific information, such as tendon xanthomas or family history, that either might not be present or cannot be regularly and reliably extracted from current EHR data at the national scale. FIND FH was structured to identify a phenotype consistent with familial hypercholesterolaemia whereas the other criteria were designed to evaluate the likelihood of having a positive genetic test for familial hypercholesterolaemia.18,19 This is why we validated the model using familial hypercholesterolaemia expert clinical evaluation.1 As genetic testing results become more readily available, we intend to include this feature in future versions.
Importantly, the FIND FH model does not rely only on predetermined thresholds for lipid concentrations. The algorithm was developed with data from patients with an existing diagnosis of familial hypercholesterolaemia; at some point in the past, these patients would have met the lipid diagnostic criteria for this condition, but those data were not in the EHR for many of the patients on whom the algorithm was trained. This is a key difference between FIND FH and conventional scoring systems. Although lipid concentrations are helpful to the machine learning algorithm, many patients identified in this study either had no lipid levels recorded and were flagged by other characteristics, or were taking lipid-lowering medications and had lipid levels below pre-treatment diagnostic thresholds. Future studies should assess how model performance changes when lipid levels are available for a higher percentage of the cohort.
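The two situations described above, no lipid values on record and treated values below the diagnostic threshold, are exactly where a fixed-threshold screen stays silent. The Python sketch below illustrates this with the LDL cholesterol threshold of 190 mg/dL used elsewhere in this article. The three patient records are invented for illustration and are not study data, and the function is a hypothetical stand-in for a simple screen, not part of FIND FH.

```python
LDL_THRESHOLD_MG_DL = 190  # simple screening threshold quoted in the article


def flagged_by_ldl_screen(ldl_values_mg_dl):
    """Return True if any recorded LDL cholesterol value exceeds the threshold."""
    return any(value > LDL_THRESHOLD_MG_DL for value in ldl_values_mg_dl)


# Hypothetical records (illustrative only): recorded LDL cholesterol in mg/dL.
patients = {
    "untreated, labs on file": [212, 205],
    "on statin, treated values only": [128, 117],
    "no lipid panel recorded": [],
}

screen = {label: flagged_by_ldl_screen(values) for label, values in patients.items()}

for label, flagged in screen.items():
    print(f"{label}: {'flagged' if flagged else 'missed'}")
```

Only the first record is flagged; the other two would need evidence from outside the lipid panel, such as diagnoses, procedures, or prescriptions.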
Application of FIND FH to a national database consisting of diagnosis, procedure, and medication transactions in more than 170 million Americans flagged 1 331 759 individuals with medical profiles consistent with familial hypercholesterolaemia. FIND FH was also applied to a health-care delivery system dataset consisting of structured EHR data in more than 170 000 individuals, in which it flagged 866 individuals. The proportion of individuals flagged in these cohorts, which is empirically lower than the training prevalence of 1:71, is a direct reflection of the high precision threshold chosen. We chose this threshold to avoid the possibility of too many false positives at the start of the outreach programme to attending physicians. Chart review of the flagged individuals categorised 77–87% of them as having possible, probable, or definite familial hypercholesterolaemia, indicating a high enough clinical suspicion of familial hypercholesterolaemia to warrant a guideline-based, formal clinical evaluation and treatment. More than half of the individuals flagged would not have been identified with a simple screen for elevated LDL cholesterol levels alone;20,21 this test does not capture data crucial to conventional diagnostic criteria, such as family history, and misses situations such as an individual on statins with LDL cholesterol levels below threshold. Furthermore, the model flagged both individuals undergoing statin therapy and those who were not. Of the 79 individuals categorised as having risk of possible familial hypercholesterolaemia or greater in the OHSU dataset, 31 had no record of any statin prescriptions in the previous 2 years. These results indicate that application of a machine learning approach such as FIND FH to medical big data might be feasible for identifying many undiagnosed individuals with familial hypercholesterolaemia.
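The comparison between the flagged proportion and the training prevalence can be checked with simple arithmetic. The Python sketch below uses the counts reported in the Findings (1 331 759 of 170 416 201 in the national database; 866 of 173 733 in the health-care delivery system dataset) and the 1:71 training prevalence stated above. The helper name is illustrative.

```python
def one_in_n(flagged, population):
    """Express a flag rate as 'one in N' individuals."""
    return population / flagged


# Counts as reported in the Findings of this article.
national = one_in_n(1_331_759, 170_416_201)
delivery_system = one_in_n(866, 173_733)

TRAINING_ONE_IN_N = 71  # training prevalence of 1:71 stated in the text

print(f"national database: about 1 in {national:.0f}")
print(f"health-care delivery system: about 1 in {delivery_system:.0f}")
print(f"training prevalence: 1 in {TRAINING_ONE_IN_N}")
```

Both cohorts are flagged at a rate well below one in 71, as expected for an operating point chosen for high precision at the cost of recall.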
FIND FH performed comparably across two types of health-care data: a national health-care encounter database and an integrated health-care delivery system with a structured EHR database. This portability was a design consideration and arises from the fact that the model is built on structured health-care encounter and laboratory result data. Although we15 and others have found success when including unstructured EHR data and clinical notes in machine learning models,14,22 to our knowledge, no national database with such data currently exists. The fact that the model performs similarly in distinct health-care data frameworks suggests that it might be generalisable to other institutions, agencies, employers, and health-care delivery systems. Our previous model15 took the complementary approach and showed good performance in identifying previously undiagnosed familial hypercholesterolaemia patients within a single institution. Importantly, the fact that the individuals identified in this latter case were already within the institution led to easier and quicker individual engagement.
We have developed a novel HIPAA-compliant outreach process to notify health-care providers of their patients flagged by the FIND FH model, a disclosure that is permitted by the treatment exception to HIPAA privacy rules. This process can be easily implemented across diverse health-care systems. Providers, or an integrated delivery system, can opt in to participate in the programme and then learn the identities of individuals in their practices flagged as having probable familial hypercholesterolaemia. These individuals can then be evaluated by their providers and, if formally diagnosed with familial hypercholesterolaemia, receive the necessary medical therapy.
This study has several limitations and caveats. By design, the model is given a 3-year snapshot of an individual’s full medical history to calculate their likelihood of familial hypercholesterolaemia. It is not possible to account for important pieces of information outside of the 3-year interval. The time windows chosen for building and applying the model to a dataset are a balance between the positive contribution of more data and the increased costs and other issues associated with using longer time windows. We investigated shorter and longer windows in previous FIND FH versions (data not shown) and found that 3 years yielded a good balance. In the training data, this limitation is mitigated by using a large number of individuals from multiple health-care systems.
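To make the window limitation concrete, the Python sketch below shows how a fixed 3-year lookback hides earlier records from a model. This section does not describe the actual FIND FH feature pipeline, so the record layout, field names, example encounters, and boundary convention are all hypothetical.

```python
from datetime import date

WINDOW_YEARS = 3  # window length stated in the text


def window_start(index_date, years=WINDOW_YEARS):
    """Return the date `years` before index_date (29 Feb falls back to 28 Feb)."""
    try:
        return index_date.replace(year=index_date.year - years)
    except ValueError:
        return index_date.replace(year=index_date.year - years, day=28)


def encounters_in_window(encounters, index_date):
    """Keep encounters dated after the window start and up to index_date."""
    start = window_start(index_date)
    return [e for e in encounters if start < e["date"] <= index_date]


# Hypothetical encounter history for one individual (illustrative only).
encounters = [
    {"date": date(2014, 5, 2), "code": "LDL cholesterol laboratory result"},
    {"date": date(2016, 9, 14), "code": "statin prescription"},
    {"date": date(2018, 1, 30), "code": "hypercholesterolaemia diagnosis"},
]

visible = encounters_in_window(encounters, index_date=date(2018, 6, 30))
print([e["code"] for e in visible])
```

In this example the 2014 laboratory result falls outside the window, so the pre-treatment lipid value is not available to the model even though it exists in the individual's history.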
For logistical reasons, the physician review of flagged individuals could only be done on a small subset of those identified. We cannot rule out selection bias in the validation results because neither scenario was perfectly random. In the national database, we relied on professional (second to third degree) connections to collect physician reviews. In the OHSU dataset, we imposed the practical requirement that the patient have at least one LDL cholesterol laboratory result, so that the physicians could readily apply conventional diagnostic criteria to those flagged. Individuals with LDL cholesterol values represent the simplest patients to assess, and we expect this cohort to be the most commonly reviewed in practice.
An additional limitation of the study is that the mean age of individuals identified by the FIND FH model was 61 years (SD 15) in the national dataset and 59 years (SD 14) in the OHSU dataset, despite the fact that familial hypercholesterolaemia is a genetic condition and therefore present from birth. This result probably stems from two factors: first, the model was trained on individuals diagnosed in specialty lipid clinics, to which individuals are typically referred later in the course of their preventive or cardiac care (mean ages of cases and presumed controls in the institutional training datasets were 49–63 years and 60–67 years, respectively),23 and second, the very low prevalence of lipid data in individuals younger than 40 years of age in the databases. The final model identified those patients that it was trained to find, namely older individuals with familial hypercholesterolaemia. The best value from the model might be achieved by successful cascade screening of family members of identified and diagnosed cases. Adding relevant clinical notes, particularly family history, would be an important development. However, there is currently no database at the national scale with this information available, nor are these data routinely included in EHRs. Therefore, addition of these data to the model would prevent using the model to scan the full national population.
In summary, when applied to two distinct types of large medical datasets, FIND FH identified a large number of individuals with probable familial hypercholesterolaemia who had not been previously diagnosed. Additional validation and demonstration of clinical utility of FIND FH will be needed before large-scale adoption of this approach. A crucial hurdle will be engaging providers to become familiar with machine learning approaches designed to reconnect them with their patients regarding new diagnoses not presented in previous medical encounters. This new tool carries the promise of finding new individuals with familial hypercholesterolaemia at scale and leading to more effective preventive therapy for them and newly identified family members.
Supplementary Material
Research in context.
Evidence before this study
Familial hypercholesterolaemia is a dominantly inherited genetic disorder that affects roughly one in 250 individuals in the USA. Evidence shows that untreated familial hypercholesterolaemia leads to premature atherosclerotic cardiovascular disease in 25% of women and 50% of men. National and international guidelines describe the need for screening programmes and recommend, in particular, family-based cascade screening approaches after the initial diagnosis of a proband. A literature search of databases such as MEDLINE, PubMed, and Google Scholar, including the terms “familial hypercholesterolemia” and “screening”, as well as reference lists in guideline documents, found several studies of published implementations. From 1994 to 2014, the Netherlands ran a government-subsidised national screening programme. This programme was highly effective and identified more than 28 000 previously undiagnosed individuals with familial hypercholesterolaemia. More recently, in the UK, another study has shown the effectiveness of incorporating child–parent screening at routine immunisation visits to identify young, undiagnosed individuals with familial hypercholesterolaemia. However, in the USA, with the strict privacy rules around the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and little acceptance of universal screening, familial hypercholesterolaemia remains underdiagnosed; current data suggest that fewer than 10% of the more than 1 million people with familial hypercholesterolaemia in the USA have been identified. In fact, efficient cascade screening and effective index identification, both crucial components of previous successful screening programmes, remain challenges in the USA.
Use of machine learning algorithms or other large-scale query methods to mine electronic health records data and find suspected index cases offers promise. In our previous analysis, the Familial Hypercholesterolemia Foundation (FH Foundation) showed the utility of a machine learning approach to train and identify previously undiagnosed patients with familial hypercholesterolaemia from data within a health-care institution.
Added value of this study
To our knowledge, this study is the first to show the use of a machine learning algorithm to successfully and efficiently screen individuals for familial hypercholesterolaemia at a national scale in the USA. The algorithm accurately flagged individuals in both a national health-care database and an integrated health-care delivery system. Additionally, the FH Foundation presents a novel HIPAA-compliant programme that allows individual physician practitioners to opt in and receive identification of the individuals with probable familial hypercholesterolaemia in their practice. Furthermore, we show national heat maps for the first time to simplify the identification of regions of the country with higher concentrations of individuals with probable familial hypercholesterolaemia who are undiagnosed. Data at this scale will allow resources to be efficiently allocated to meet the need of these individuals.
Implications of all the available evidence
Our study shows an efficient machine learning approach towards screening individuals with undiagnosed familial hypercholesterolaemia. Our algorithm successfully identifies medical profiles consistent with familial hypercholesterolaemia in data at both a national level and within an integrated health-care delivery system. Additionally, the FH Foundation provides the framework for a HIPAA-compliant method to contact these identified individuals. The FIND FH model was developed to recognise the clinical phenotype for familial hypercholesterolaemia; future models could be developed to identify individuals who would have high probability of a familial hypercholesterolaemia genetic variant.
Acknowledgments
The FH Foundation, a 501(c)(3) organisation, funded this study. Support was received from Amgen, Sanofi, and Regeneron. Amgen is the founding sponsor of the FIND FH initiative. JWK received support from the American Heart Association (grant 15IRG222930034), the Stanford Data Science Initiative, the Stanford Diabetes Research Center (P30DK116074), and the National Institutes of Health (grant U41HG009649). MDS is supported by the National Institutes of Health (grant K12HD043488). The FH Foundation would like to thank Penn Medicine, Stanford University Medical Center, Geisinger Medical Center, The Ohio State University Wexner Medical Center, Oregon Health & Science University, and Laboratory Corporation of America for providing vital data and shared vision for the FIND FH initiative. Moreover, individuals beyond the authors from each institution played a key role in providing data to help to train, test, and validate the FIND FH model. We thank Yuliya Borovskly and Dan Soffer (Penn Medicine); Kylie McElheran, Beth Wilson, and Kathy Lee (Oregon Health & Science University); Kelly J Scheiderer, Brian Myers, and Jing Ding (Ohio State University); Jim Fleming, Wade Tanico, Arren Fisher, Sherrie Duke, Eric Rotthoff, Lee Terrell, and M J Lewis (Laboratory Corporation of America) for their critical contributions.
Funding The FH Foundation funded this study. Support was received from Amgen, Sanofi, and Regeneron.
Footnotes
Declaration of interests
KW, SSG, DZ, and LW are employees of the FH Foundation; KDM, WH, and DS are paid consultants for the FH Foundation. JWK is the unpaid chief research advisor for the FH Foundation and the FIND FH project and has enrolled patients and adjudicated outcomes in PCSK9i trials. DJR serves on the science advisory board for Alnylam, Novartis, and Pfizer and is an unpaid advisor to the FH Foundation. MDS is supported by NIH K12HD043488. MFM reports receiving grants from Regeneron Pharmaceuticals as well as personal fees from Invitae, both outside the scope of the study. SJB serves on the scientific advisory board for Amgen, Sanofi, Novartis, Regeneron, and Akcea; is a consultant at Sanofi, Amgen, Cleveland Heart Labs, GLG Group, Guidepoint Global, Regeneron, Novo Nordisk, and Akcea; and is a national speaker for Amgen, Aralez, Boehringer Ingelheim Pharmaceutical, Novo Nordisk, and Akcea. NHS is a co-founder and scientific advisor to Cardinal Analytx and an advisor to TwoXaR. WCC was an employee of the Laboratory Corporation of America Holdings. ET is an employee of Pfizer. All other authors declare no competing interests.
References
- 1. Gidding SS, Champagne MA, de Ferranti SD, et al. The agenda for familial hypercholesterolemia: a scientific statement from the American Heart Association. Circulation 2015; 132: 2167–92.
- 2. Nordestgaard BG, Chapman MJ, Humphries SE, et al. European Atherosclerosis Society Consensus Panel. Familial hypercholesterolemia is underdiagnosed and undertreated in the general population: guidance for clinicians to prevent coronary heart disease: consensus statement of the European Atherosclerosis Society. Eur Heart J 2013; 34: 3478–90a.
- 3. Knowles JW, Rader DJ, Khoury MJ. Cascade screening for familial hypercholesterolemia and the use of genetic testing. JAMA 2017; 318: 381–82.
- 4. Grundy SM, Stone NJ, Bailey AL, et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APhA/ASPC/NLA/PCNA guideline on the management of blood cholesterol: executive summary. J Am Coll Cardiol 2018; 73: 3168–209.
- 5. Harada-Shiba M, Arai H, Oikawa S, et al. Guidelines for the management of familial hypercholesterolemia. J Atheroscler Thromb 2012; 19: 1043–60.
- 6. WHO. Familial hypercholesterolemia: report of a second WHO consultation. Geneva, Switzerland: World Health Organization, 1999.
- 7. Centers for Disease Control and Prevention. More detailed information on key tier 1 applications—familial hypercholesterolemia. March, 2014. http://www.cdc.gov/genomics/implementation/toolkit/FH_1.htm (accessed Oct 11, 2019).
- 8. Wald DS, Bestwick JP, Morris JK, Whyte K, Jenkins L, Wald NJ. Child–parent familial hypercholesterolemia screening in primary care. N Engl J Med 2016; 375: 1628–37.
- 9. McCrindle BW, Gidding SS. What should be the screening strategy for familial hypercholesterolemia? N Engl J Med 2016; 375: 1685–86.
- 10. Louter L, Defesche J, Roeters van Lennep J. Cascade screening for familial hypercholesterolemia: practical consequences. Atheroscler Suppl 2017; 30: 77–85.
- 11. Andersen R, Andersen L. Examining barriers to cascade screening for familial hypercholesterolemia in the United States. J Clin Lipidol 2016; 10: 225–27.
- 12. Lázaro P, Pérez de Isla L, Watts GF, et al. Cost-effectiveness of a cascade screening program for the early detection of familial hypercholesterolemia. J Clin Lipidol 2017; 11: 260–71.
- 13. Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med 2016; 375: 1216–19.
- 14. Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med 2018; 1: 18.
- 15. Banda JM, Sarranju A, Abbasi F, et al. Finding missed cases of familial hypercholesterolemia in health systems using machine learning. NPJ Digit Med 2019; 2: 23.
- 16. Jordan MI, Mitchell TM. Machine learning: trends, perspectives, and prospects. Science 2015; 349: 255–60.
- 17. Breiman L. Random forests. Mach Learn 2001; 45: 5–32.
- 18. Benn M, Watts GF, Tybjærg-Hansen A, Nordestgaard BG. Mutations causative of familial hypercholesterolaemia: screening of 98,098 individuals from the Copenhagen General Population Study estimated a prevalence of 1 in 217. Eur Heart J 2016; 37: 1384–94.
- 19. Amor-Salamanca A, Castillo S, Gonzalez-Vioque E, et al. Genetically confirmed familial hypercholesterolemia in patients with acute coronary syndrome. J Am Coll Cardiol 2017; 70: 1732–40.
- 20. Khera AV, Won HH, Peloso GM, et al. Diagnostic yield and clinical utility of sequencing familial hypercholesterolemia genes in patients with severe hypercholesterolemia. J Am Coll Cardiol 2016; 67: 2578–89.
- 21. Abul-Husn NS, Manickam K, Jones LK, et al. Genetic identification of familial hypercholesterolemia within a single US health care system. Science 2016; 354: 6319.
- 22. Afzal N, Sohn S, Abram S, et al. Mining peripheral arterial disease cases from narrative clinical notes using natural language processing. J Vasc Surg 2017; 65: 1753–61.
- 23. deGoma EM, Ahmad ZS, O’Brien EC, et al. Treatment gaps in adults with heterozygous familial hypercholesterolemia in the United States: data from the CASCADE-FH registry. Circ Cardiovasc Genet 2016; 9: 240–49.