Precision screening for familial hypercholesterolaemia: a machine learning study applied to electronic health encounter data

Kelly D Myers; Joshua W Knowles; David Staszak; Michael D Shapiro; William Howard; Mrinal Yadava; David Zuzick; Latoya Williamson; Nigam H Shah; Juan M Banda; Joe Leader; William C Cromwell; Ed Trautman; Michael F Murray; Seth J Baum; Seth Myers; Samuel S Gidding; Katherine Wilemon; Daniel J Rader

doi:10.1016/S2589-7500(19)30150-5

Author manuscript; available in PMC: 2021 Apr 30.

Published in final edited form as: Lancet Digit Health. 2019 Oct 21;1(8):e393–e402. doi: 10.1016/S2589-7500(19)30150-5

¹The Familial Hypercholesterolemia Foundation, Pasadena, CA, USA (K D Myers BS, J W Knowles MD, D Zuzick MBA, L Williamson MS, S S Gidding MD, K Wilemon BS, Prof D J Rader MD); Division of Cardiovascular Medicine and Cardiovascular Institute (J W Knowles) and Stanford Center for Biomedical Informatics Research (N H Shah PhD, J M Banda PhD), Stanford University, Stanford, CA, USA; Atomo, Austin, TX, USA (K D Myers, D Staszak PhD, W Howard PhD, S Myers PhD); Department of Medicine, Center for Preventive Cardiology, Knight Cardiovascular Institute, Oregon Health & Science University, Portland, OR, USA (M D Shapiro DO, M Yadava MD); Geisinger Health System, Danville, PA, USA (J Leader BS); Lipoprotein & Metabolic Disorders Institute, Raleigh, NC, USA (W C Cromwell MD); Laboratory Corporation of America Holdings, Burlington, NC, USA (E Trautman PhD); Yale Center for Genomic Health, New Haven, CT, USA (Prof M F Murray MD); Department of Integrated Medical Sciences, Charles E Schmidt College of Medicine, Florida Atlantic University, Boca Raton, FL, USA (S J Baum MD); and Department of Genetics (Prof D J Rader), Department of Medicine (Prof D J Rader), and Department of Pediatrics (Prof D J Rader), Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA

Contributors

KDM, JWK, KW, DJR, SJB, MFM, NHS, JMB, and SSG conceived, designed, and oversaw the development of the FIND FH algorithm and this study. KDM, SM, WH, and DS developed the model and did the main analysis. MDS, MY, and SJB did clinical patient reviews and contributed to the analysis. KDM, JL, WCC, DZ, and ET developed the HIPAA-compliant methodology and acquired necessary data for this work. LW and DS searched the literature. KDM and DS wrote the initial publication draft and created the figures. All authors contributed to the critical review, editing, and final approval of this manuscript.

Correspondence to: Mr Kelly D Myers, The Familial Hypercholesterolemia Foundation, Pasadena, CA 91106, USA, km@thefhfoundation.org

PMCID: PMC8086528 NIHMSID: NIHMS1678511 PMID: 33323221

Abstract

Summary

Background

Cardiovascular outcomes for people with familial hypercholesterolaemia can be improved with diagnosis and medical management. However, 90% of individuals with familial hypercholesterolaemia remain undiagnosed in the USA. We aimed to accelerate early diagnosis and timely intervention for more than 1·3 million undiagnosed individuals with familial hypercholesterolaemia at high risk for early heart attacks and strokes by applying machine learning to large health-care encounter datasets.

Methods

We trained the FIND FH machine learning model using deidentified health-care encounter data, including procedure and diagnostic codes, prescriptions, and laboratory findings, from 939 clinically diagnosed individuals with familial hypercholesterolaemia (395 of whom had a molecular diagnosis) and 83 136 individuals presumed free of familial hypercholesterolaemia, sampled from four US institutions. The model was then applied to a national health-care encounter database (170 million individuals) and an integrated health-care delivery system dataset (174 000 individuals). Individuals used in model training and those evaluated by the model were required to have at least one cardiovascular disease risk factor (eg, hypertension, hypercholesterolaemia, or hyperlipidaemia). A Health Insurance Portability and Accountability Act of 1996-compliant programme was developed to allow providers to receive identification of individuals likely to have familial hypercholesterolaemia in their practice.

Findings

Using a model with a measured precision (positive predictive value) of 0·85, recall (sensitivity) of 0·45, area under the precision–recall curve of 0·55, and area under the receiver operating characteristic curve of 0·89, we flagged 1 331 759 of 170 416 201 patients in the national database and 866 of 173 733 individuals in the health-care delivery system dataset as likely to have familial hypercholesterolaemia. Familial hypercholesterolaemia experts reviewed a sample of flagged individuals (45 from the national database and 103 from the health-care delivery system dataset) and applied clinical familial hypercholesterolaemia diagnostic criteria. Of those reviewed, 87% (95% CI 73–100) in the national database and 77% (68–86) in the health-care delivery system dataset were categorised as having a high enough clinical suspicion of familial hypercholesterolaemia to warrant guideline-based clinical evaluation and treatment.

Interpretation

The FIND FH model successfully scans large, diverse, and disparate health-care encounter databases to identify individuals with familial hypercholesterolaemia.

Introduction

Familial hypercholesterolaemia is a common inherited condition affecting approximately one in every 250 individuals worldwide and causes lifelong elevations in LDL cholesterol and premature coronary artery disease.¹ Timely identification of familial hypercholesterolaemia and initiation of guideline-based therapies can markedly attenuate the risk of coronary artery disease. Unfortunately, fewer than 10% of individuals with familial hypercholesterolaemia have been identified, creating a huge reservoir of unidentified and untreated individuals with the condition.^1–3 Current guidelines from the American College of Cardiology and American Heart Association, WHO, and the Centers for Disease Control and Prevention recommend screening to identify families with familial hypercholesterolaemia.^1,4–7

The best method for large-scale screening has yet to be established but is likely to include both efficient cascade screening and effective index identification.^8–12 Machine learning models have promising applications to medicine^13–15 because they can efficiently analyse datasets at scale and have the potential to predict outcomes across disciplines.¹⁶ Characteristics of familial hypercholesterolaemia in large health-care-related datasets (eg, laboratory values, procedure codes, diagnosis codes, or prescribing information) are amenable to this approach. We have previously reported results of a successful application of a machine learning model to identify undiagnosed individuals with familial hypercholesterolaemia built from and applied to single health-care institutions.¹⁵ The current study, which was conceived of in parallel, aims to build a model that can be applied at both the institutional and national health-care database scales to identify new index cases.

To this end, we constructed the FIND FH machine learning model. We aimed to define model characteristics using individuals with familial hypercholesterolaemia and individuals presumed not to have familial hypercholesterolaemia from four health-care systems; assess whether the derived model can be effectively applied to different sources of health-care encounter data, including a national database and an integrated health-care delivery system; and determine whether the derived model successfully identifies individuals with medical profiles consistent with familial hypercholesterolaemia in independent clinical settings. We hypothesised that targeted screening with a machine learning model applied to large health-care datasets can identify individuals highly likely to have familial hypercholesterolaemia.

Methods

Study design

In this study, we used electronic health record (EHR) data from large academic health systems to build and train the FIND FH machine learning model. The training dataset was split using the 80–20 holdout method to tune the parameters and measure model performance. The model was then externally validated on two independent, real-world datasets.

Training dataset

The FIND FH initiative predated the availability of an International Classification of Diseases, Tenth Revision (ICD-10) code for familial hypercholesterolaemia. Therefore, we needed an alternative means of identifying true-positive familial hypercholesterolaemia training examples for the model. We chose to work with familial hypercholesterolaemia experts from various health systems to ensure we had high-quality true positives from across the USA. To this end, we used structured EHR data from four different health systems: Stanford University, University of Pennsylvania, Geisinger Medical Center, and Ohio State University. These datasets include all individuals who interacted with these health systems, regardless of age or insurance coverage. The data cover 3 years of individual history, from Sept 1, 2013, to Aug 31, 2016.

The data used included prescription, procedure, diagnosis, and laboratory result data. Diagnosis and procedure data in the study used ICD-9, ICD-10, and Current Procedural Terminology standard formats. Common data dictionaries were developed for prescription and laboratory result data. Unstructured data, including clinical notes and other data from patient medical files (eg, notes about family history of cardiovascular disease), were not included. To train the model, we exposed machine learning technology to data from individuals with clinically diagnosed familial hypercholesterolaemia and found relevant medical patterns in the years leading up to their diagnosis.

A supervised learning analysis requires an adequate number of clinically diagnosed individuals as well as a large number of presumed controls to train the model. We focused model development on individuals with at least one cardiovascular disease or primary prevention (eg, with hypertension, hypercholesterolaemia, or hyperlipidaemia without cardiovascular disease) comorbidity claim, and trained the model to differentiate those with familial hypercholesterolaemia from those presumed without familial hypercholesterolaemia within this large subset (see appendix pp 11–24 for the full list of comorbidities and their medical codes). For our training examples, a case was defined as an individual with a clinical diagnosis of familial hypercholesterolaemia by a lipid expert³ and a presumed control was defined as an individual with no previous diagnosis of familial hypercholesterolaemia by a lipid expert in their medical record. The total sample of training examples from the four datasets comprised 939 individuals with familial hypercholesterolaemia (42% of whom were genetically confirmed) and 83 136 presumed controls (table 1). We used training examples from multiple institutions, where therapeutic and coding patterns might differ slightly, to make the model more robust.

Table 1:

Clinical and demographic data for the training dataset

	Stanford University		University of Pennsylvania		Geisinger Medical Center		Ohio State University

	Cases	Presumed controls	Cases	Presumed controls	Cases	Presumed controls	Cases	Presumed controls
Number of participants	106	7699	293	34 797	446	32 640	94	8000
Age, years	49 (14)	67 (15)	58 (15)	60 (19)	63 (16)	62 (17)	53 (14)	60 (16)
Female participants	54 (50.9%)	3379 (43.9%)	195 (66.6%)	20 244 (58.2%)	272 (61.0%)	15 639 (47.9%)	54 (57.4%)	4031 (50.4%)
Genetically confirmed variant of familial hypercholesterolaemia	20 (18.9%)	..	159 (54.3%)	..	216 (48.4%)	..	..	..
Individuals with data on laboratory results	97 (91.5%)	7517 (97.6%)	273 (93.2%)	28 964 (83.2%)	446 (100%)	32 244 (98.8%)	73 (77.7%)	5868 (73.4%)
Individuals with prescription data	89 (84.0%)	6346 (82.4%)	284 (96.9%)	33 723 (96.9%)	445 (99.8%)	32 485 (99.5%)	93 (98.9%)	7073 (88.4%)
Existing diagnoses
Arteriosclerotic cardiovascular disease	25 (23.6%)	2066 (26.8%)	147 (50.2%)	10 904 (31.3%)	172 (38.6%)	8606 (26.4%)	70 (74.5%)	3973 (49.7%)
Hypercholesterolaemia	92 (86.8%)	3022 (39.3%)	279 (95.2%)	16 714 (48.0%)	92 (20.6%)	5930 (18.2%)	80 (85.1%)	3659 (45.7%)
Diabetes	8 (7.5%)	1585 (20.6%)	26 (8.9%)	7951 (23%)	24 (5.4%)	2472 (7.6%)	21 (22.3%)	1607 (20.4%)
Hypertension	18 (17.0%)	5151 (66.9%)	119 (40.6%)	18 230 (52.4%)	57 (12.8%)	6676 (20.5%)	50 (53.2%)	4550 (56.9%)
Using any statin	71 (67.0%)	2088 (27.1%)	224 (76.5%)	14 187 (40.8%)	392 (87.9%)	18 347 (56.2%)	85 (90.4%)	3240 (40.5%)

Data are n (%) or mean (SD). In the training dataset, cases are individuals with diagnosed familial hypercholesterolaemia and presumed controls are individuals presumed to not have familial hypercholesterolaemia.

Model features developed for individuals in the familial hypercholesterolaemia and non-familial hypercholesterolaemia datasets included demographic features (ie, age and sex) and other features built from health-care encounters and laboratory data. More than 60 000 features were tested. The most frequently used features were counting features, such as the number of times a given individual received a specific prescription or underwent a specific procedure within a time window. Combinations of long-term prescription and laboratory result histories were assessed.
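The counting features described above can be illustrated with a minimal sketch. The record layout, the feature naming scheme, and the example rows below are assumptions made for illustration and are not taken from the study's code; only the 3-year window dates come from the text.

```python
from collections import Counter
from datetime import date

# Hypothetical encounter records: (patient_id, record_type, code, service_date).
encounters = [
    ("p1", "rx", "atorvastatin", date(2015, 3, 1)),
    ("p1", "rx", "atorvastatin", date(2015, 9, 1)),
    ("p1", "dx", "E78.00", date(2016, 1, 15)),
    ("p1", "rx", "atorvastatin", date(2012, 1, 1)),  # falls outside the window
    ("p2", "px", "93000", date(2014, 6, 30)),
]


def count_features(records, start, end):
    """Count, per patient, how often each (type, code) pair occurs in [start, end]."""
    features = {}
    for patient_id, record_type, code, service_date in records:
        if start <= service_date <= end:
            counts = features.setdefault(patient_id, Counter())
            counts[f"n_{record_type}_{code}"] += 1   # code-specific count
            counts[f"n_{record_type}_total"] += 1    # total count for the record type
    return features


# Window matching the 3 years of history described for the training data.
features = count_features(encounters, date(2013, 9, 1), date(2016, 8, 31))
print(features["p1"])
```

Each distinct code yields its own column, which is one plausible way a feature space could grow to tens of thousands of candidates.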

The expected prevalence of familial hypercholesterolaemia is an important input parameter as the performance of the model varies with it. Due to the prespecified cardiovascular disease or primary prevention selection criteria, the expected prevalence of familial hypercholesterolaemia in our data will be higher than in the general population. We estimated the prevalence of familial hypercholesterolaemia in this cohort by calculating the ratio of the number of individuals with undiagnosed familial hypercholesterolaemia in the USA (1·3 million, assuming 1:250 prevalence) to the total number of individuals in the health-care encounter dataset (92 million, after applying the comorbidity requirement). Assuming that our comorbidity criteria are loose enough to capture the vast majority of these individuals, an expected prevalence ratio of 1:71 was derived. This expected prevalence ratio was used to balance the number of individuals in our positive and negative training datasets by down-sampling the negative data to remove bias in the training procedure.
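The prevalence arithmetic and the down-sampling step can be sketched as follows. The two population figures are those quoted above; reading 1:71 as one case in every 71 individuals (so 70 controls per case), the random sampling scheme, and the helper name `downsample_controls` are assumptions for illustration only.

```python
import random

undiagnosed_fh = 1_300_000        # estimate quoted in the text
eligible_population = 92_000_000  # individuals meeting the comorbidity requirement

# 92 000 000 / 1 300 000 is about 70.8, which rounds to a prevalence of 1 in 71.
one_in_n = round(eligible_population / undiagnosed_fh)
controls_per_case = one_in_n - 1  # assumption: 1 case per 71 individuals in total


def downsample_controls(case_ids, control_ids, ratio, seed=0):
    """Randomly keep at most ratio * len(case_ids) controls (hypothetical helper)."""
    target = min(len(control_ids), ratio * len(case_ids))
    return random.Random(seed).sample(control_ids, target)


cases = [f"case_{i}" for i in range(939)]
controls = [f"control_{i}" for i in range(83_136)]
kept_controls = downsample_controls(cases, controls, controls_per_case)
print(one_in_n, len(kept_controls))
```

Under this reading, 939 cases would be paired with 65 730 of the 83 136 presumed controls.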

The FIND FH model

The FIND FH machine learning model consists of two consecutive random forest¹⁷ model layers (figure 1). The first random forest layer selects the highest ranked features (ie, those that improve predictive value). The training data were then filtered to include only those highly ranked features, which were passed to the second random forest. Random forests were used because they are easily interpretable, resist overtraining, and can capture non-linear relationships among features. Implementing two consecutive random forest layers was found to both improve performance and increase the portability of the model across different hospital network datasets. Model building tested the inclusion of more than 60 000 input features. We tested several other algorithms for the second model layer, including gradient-boosted and adaptive-boosted decision trees, support vector machines, and a logistic regression model. When considering performance metrics (ie, precision and recall) and examining the decisions made by each model, the random forest performed best and was less likely to exhibit overfitting compared with other tree-based methods. We did not investigate resampling techniques because the model performance met our goals.
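The two-layer arrangement can be sketched with scikit-learn, which the study reports using. Everything else here is an assumption for illustration: the synthetic data, the hyperparameters, the use of impurity-based `feature_importances_` as the ranking, and keeping 20 features rather than the 75 reported later in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 600 individuals, 200 features, only the first 5 informative.
n_samples, n_features, n_keep = 600, 200, 20
X = rng.normal(size=(n_samples, n_features))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n_samples) > 2.0).astype(int)

# Layer 1: fit on all features and rank them by impurity-based importance.
ranker = RandomForestClassifier(n_estimators=200, random_state=0)
ranker.fit(X, y)
top_idx = np.argsort(ranker.feature_importances_)[::-1][:n_keep]

# Layer 2: refit using only the highest-ranked features.
# oob_score=True exposes an out-of-bag accuracy estimate for monitoring fit.
classifier = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
classifier.fit(X[:, top_idx], y)

# Probability of the positive class, to be compared against a flagging threshold.
scores = classifier.predict_proba(X[:, top_idx])[:, 1]
print(sorted(top_idx.tolist())[:5], round(classifier.oob_score_, 3))
```

Because the second layer depends on a short list of features only, a new dataset needs to be checked for consistency on that short list rather than on every candidate feature.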

Internal model parameters were optimised on the training data on the basis of the final F1 score. The F1 statistic is a performance metric that provides a properly balanced measure of the model’s precision (positive predictive value) and recall (sensitivity). After model parameters were set, we used a holdout tuning dataset to define a threshold for probable familial hypercholesterolaemia. This threshold was set to identify and flag individuals with familial hypercholesterolaemia with precision greater than 0·6. Operationally, once the identified individuals are screened, the model can be run a second time with a threshold set at a lower precision to identify additional individuals if resources are available.
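As a worked illustration, the F1 score is the harmonic mean of precision P and recall R, F1 = 2PR / (P + R). The sketch below computes these metrics and picks the lowest score threshold whose precision exceeds a target on a holdout set. The toy labels and scores and the helper names are hypothetical, and the selection rule is one plausible reading of the thresholding step rather than the study's procedure.

```python
def precision_recall_f1(labels, scores, threshold):
    """Precision, recall, and F1 when individuals with score >= threshold are flagged."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def lowest_threshold_meeting_precision(labels, scores, min_precision):
    """Return the smallest observed score whose precision exceeds min_precision."""
    for threshold in sorted(set(scores)):
        precision, _, _ = precision_recall_f1(labels, scores, threshold)
        if precision > min_precision:
            return threshold
    return None  # no threshold reaches the target


# Toy holdout data (1 = familial hypercholesterolaemia, 0 = presumed control).
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10]

threshold = lowest_threshold_meeting_precision(labels, scores, min_precision=0.6)
print(threshold, precision_recall_f1(labels, scores, threshold))
```

Choosing the lowest qualifying threshold keeps recall as high as the precision target allows; lowering the target later corresponds to the second, more permissive run described above.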

Model performance and validation

To create the holdout tuning dataset, we excluded 20% of the individuals with and without familial hypercholesterolaemia in the combined training dataset from the training process. The FIND FH model performance was then computed in this 20% to generate a measure of the F1 score, precision, recall, area under the precision–recall curve (AUPRC), and area under the receiver operating characteristic curve (AUROC), independent of the training procedure. AUROC is the area under the curve defined by a model’s recall and false-positive rate (1 − specificity) measured at different thresholds. AUPRC is the area under the curve defined by a model’s precision and recall measured at different thresholds. In this study, the AUPRC is a more appropriate metric of model performance because our data are imbalanced and precision-based metrics better emphasise the efficiency of a model to correctly identify less frequent positive cases.
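The two area metrics defined above can be computed by sweeping the threshold over every score, as in this minimal sketch. It assumes distinct scores, uses the trapezoidal rule for AUROC and the step-wise average-precision sum for AUPRC, and runs on hypothetical toy data; it is not the study's evaluation code.

```python
def roc_and_pr_areas(labels, scores):
    """Return (AUROC, AUPRC) by lowering the threshold one individual at a time.

    Assumes all scores are distinct and both classes are present.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_recall = prev_fpr = 0.0
    auroc = auprc = 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        recall = tp / n_pos          # sensitivity
        fpr = fp / n_neg             # 1 - specificity
        precision = tp / (tp + fp)   # positive predictive value
        auroc += (fpr - prev_fpr) * (recall + prev_recall) / 2  # trapezoid
        auprc += (recall - prev_recall) * precision             # step-wise sum
        prev_recall, prev_fpr = recall, fpr
    return auroc, auprc


labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10]
auroc, auprc = roc_and_pr_areas(labels, scores)
print(round(auroc, 3), round(auprc, 3))
```

AUROC uses the false-positive rate, whose denominator is the large pool of negatives, so it changes little when a few extra controls are flagged; precision drops directly, which is why AUPRC is the more informative summary for rare positives.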

We did not do a full cross validation, instead choosing the holdout method for simplicity; additionally, we could monitor overfitting or underfitting with random forest out-of-bag errors. Given the size of our dataset, we had a sufficient number of cases for the holdout method to produce consistent datasets; additionally, we verified that results were consistent by varying the random splitting.

To both externally validate and test the robustness and portability of the FIND FH model, we applied the model to two types of health-care encounter data: a national health-care encounter database and data from a large tertiary-care academic medical centre. In both cases, the datasets that the model was applied to were completely distinct from the training and tuning datasets. For alignment with the training phase and comparison purposes, individuals were required to have a documented history of at least one cardiovascular disease risk factor and to have no previous diagnosis of familial hypercholesterolaemia. For completeness, we investigated the dependence of model performance on familial hypercholesterolaemia prevalence and found differences to be negligible for a wide region of values around our established prevalence ratio of 1:71.

When applying the model to new datasets, we did an extensive analysis to flag data elements that were systematically missing or under-represented in the included individuals. Once these issues were found, they could be fixed or resolved with the data provider in a way that does not introduce bias into the results. One of the benefits of using two random forest layers is that we only needed to ensure that data characteristics used in the 75 features chosen by the model were consistent in these new datasets.

Our first external validation test set was a national database of health-care encounters consisting of diagnosis, procedure, and medication transactions on 170 416 201 American residents, secured from IQVIA (Durham, NC, USA), a company that provides health-care encounter data aggregated across hospital, payer, and pharmacy sources. Laboratory result data on 15 368 850 of these individuals were provided by Laboratory Corporation of America Holdings (LabCorp; Burlington, NC, USA). Each individual had a unique, anonymised identifier to allow linkage of encounter data with laboratory result data and to permit removal of identifying information. After merging the data, the FIND FH model was applied. Individuals with values above the threshold for probable familial hypercholesterolaemia were flagged; those who had been previously diagnosed were removed from the dataset. We developed an outreach programme, compliant with the Health Insurance Portability and Accountability Act of 1996 (HIPAA), that allowed us to notify health-care providers of these flagged individuals. Flagged individuals were associated with their primary attending physician. The FH Foundation then contacted chosen physicians with at least five flagged individuals in their practices. The targeted clinicians were all considered to be experts in the diagnosis and treatment of familial hypercholesterolaemia. The study group discussed the FIND FH model with these physicians and asked them if they wanted to know the identity of the flagged individuals in their practices. If they did, LabCorp, a covered entity (ie, health-care provider) under HIPAA, provided individual identification. The physicians were asked to determine the likelihood of familial hypercholesterolaemia in all identified patients using four methods: the three accepted diagnostic criteria (Dutch Lipid Clinic Network, Make Early Diagnosis to Prevent Early Deaths [MEDPED], and Simon Broome) and their expert clinical judgment. Using all of these criteria, the physicians categorised flagged individuals into one of the following categories: definite, probable, possible, inconclusive, or unlikely familial hypercholesterolaemia. Individuals with risk labelled possible or greater for any of the evaluations were considered candidates for further screening and follow-up.
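The decision rule in the last sentence above (possible or greater on any of the four evaluations) can be expressed as a short sketch. The dictionary keys and the relative ordering of "unlikely" and "inconclusive" are assumptions; only the rule itself comes from the text.

```python
# Categories ordered from lowest to highest suspicion (the relative order of the
# two lowest categories is an assumption and does not affect the rule).
CATEGORIES = ["unlikely", "inconclusive", "possible", "probable", "definite"]
RANK = {name: position for position, name in enumerate(CATEGORIES)}


def is_candidate(evaluations):
    """True if any evaluation rates the individual 'possible' or higher."""
    return any(RANK[rating] >= RANK["possible"] for rating in evaluations.values())


# Hypothetical review of one flagged individual.
patient = {
    "dutch_lipid_clinic_network": "unlikely",
    "medped": "inconclusive",
    "simon_broome": "possible",
    "expert_judgment": "inconclusive",
}
print(is_candidate(patient))
```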

For the second external validation, we obtained independent anonymised data for 173 733 individuals who had previously had contact with the Oregon Health & Science University (OHSU) health-care system over 3 years (March 27, 2015–March 27, 2018). Data included diagnoses, procedures, laboratory tests, and medications. Individuals with an ICD-10 code for familial hypercholesterolaemia were removed from the list to limit the data to individuals without a previously documented diagnosis of familial hypercholesterolaemia, in keeping with the patient profile that the model was explicitly designed to identify. The FIND FH model was applied to this dataset and approximately 100 flagged individuals with at least one LDL cholesterol laboratory result in their history were chosen for review by the data analysts in the study (WH, DS, KDM), as this would give reviewers meaningful clinical data for assessment. Not all flagged individuals were reviewed owing to time limitations. Assessments included a review of all laboratory results, genetic tests, outpatient visits, inpatient or emergency department visits, family history, and established diagnoses. Records were reviewed on the OHSU Epic electronic medical record system (Epic Systems Corporation; Verona, WI, USA), including the Care Everywhere network, which integrates health-care records (including outpatient visits, inpatient or emergency department visits, and all diagnostic testing) for any hospital or health-care system using the Epic platform. Similar to the national database application, each individual was given a categorisation for each of the three diagnostic criteria as well as the physician’s clinical opinion.

The FIND FH model was built using open-source tools and libraries. The model and plotting code is written in Python (version 2.7), using the NumPy, SciPy, and scikit-learn libraries, including the scikit-learn random forest implementation.

Role of the funding source

The FH Foundation funded this study. The sponsors of the FH Foundation had no role in the study design, data procurement, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Results

The final FIND FH model included a mix of features: demographic, conditional (ie, designed to capture patient health response, both positive and negative, throughout the course of therapy), prescription based, diagnosis based, procedure based, and laboratory result based (table 2). These features were ranked by their importance in the model, in proportion to the number of decision nodes over the whole forest in which that feature was used and how effectively that feature distinguished between individuals with or without familial hypercholesterolaemia at those nodes (table 2). To quantify which categories of features had the greatest effect on model performance, we grouped the features and present the average importance of each category with mean rankings.

Table 2:

Breakdown of feature categories

	Number of features investigated	Number of investigated features in the top 75	Mean rank of all features in category	Mean rank of features in the top 75	Top four features in category
Demographic	2	2	18	18	Age, sex
Conditional	21	18	39	24	High LDL cholesterol with no lipid-lowering therapies, high LDL cholesterol with high-intensity statin prescription, high LDL cholesterol with moderate-intensity statin prescription, high LDL cholesterol with statins and ezetimibe
Prescription based	14 000	6	29 825	33	Total number of prescription codes, number of atorvastatin prescriptions, number of rosuvastatin prescriptions, number of evolocumab prescriptions
Diagnosis based	26 700	14	29 392	35	Number of E78.00 codes (hypercholesterolaemia), total number of diagnosis codes, number of E78.4 or E78.5 codes (hyperlipidaemia), number of I10 codes (hypertension)
Procedure based	14 200	8	31 806	47	Total number of procedure codes, number of 93000 codes (electrocardiogram), number of 99214 codes (outpatient services), number of 36415 codes (venipuncture)
Laboratory result based	3900	27	19 510	49	Maximum value of total cholesterol, maximum value of LDL cholesterol, average value of LDL cholesterol, average value of total cholesterol

Table shows the number and rank of all features investigated in a given category and of those features that were included in the top 75 features (ie, that were eligible for inclusion in the second random forest layer). Top features in each category are listed by rank.

Prescription-based, diagnosis-based, and procedure-based features include those created from the frequency of relevant health-care encounter transactions. Laboratory result features were used when available. These four categories provided the majority of the top 75 features, with laboratory-based features serving as the most frequent contributor, followed by health-care encounter-based features (ie, prescription, diagnosis, and procedure). Demographic features, including age and sex, also factored in prominently. Many conditional features based on specific combinations of the long-term prescription and laboratory result histories of individuals proved effective.
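The per-category mean rank reported in table 2 can be derived from a feature ranking as in the sketch below. The feature names and importance values are invented for illustration, and ranking by descending importance is an assumption about how the table was produced.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (feature name, category, importance) triples.
features = [
    ("age", "demographic", 0.09),
    ("max_ldl_cholesterol", "laboratory", 0.12),
    ("high_ldl_no_therapy", "conditional", 0.10),
    ("n_atorvastatin_rx", "prescription", 0.05),
    ("sex", "demographic", 0.02),
    ("mean_ldl_cholesterol", "laboratory", 0.07),
]


def mean_rank_by_category(items):
    """Rank features by descending importance (1 = best), then average per category."""
    ranked = sorted(items, key=lambda item: item[2], reverse=True)
    ranks = defaultdict(list)
    for rank, (_, category, _) in enumerate(ranked, start=1):
        ranks[category].append(rank)
    return {category: mean(values) for category, values in ranks.items()}


mean_ranks = mean_rank_by_category(features)
print(mean_ranks)
```

A lower mean rank indicates that features in that category tend to sit nearer the top of the importance list.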

The FIND FH model performance was measured on the holdout tuning dataset (prevalence 1:71), yielding a precision (positive predictive value) of 0·85 and a recall (ie, sensitivity) of 0·45. The AUPRC of the model was found to be 0·55, whereas the AUROC was 0·89 (appendix pp 1–2). The optimisation of the model performance was primarily a scan of number of features passed between random forest layers. This is shown and discussed in the appendix (pp 6–7).

In the first external validation dataset (ie, the national dataset), 1 331 759 of 170 416 201 individuals were flagged by the model as likely to have familial hypercholesterolaemia, of whom 45 were reviewed (table 3; figure 2). A breakdown of the physician and reviewer diagnoses is presented in the appendix (pp 9–10). In agreement with the precision measurement on the holdout tuning data, 87% (95% CI 73–100) of the evaluated individuals in the national database were identified as having possible, probable, or definite familial hypercholesterolaemia by at least one of the diagnostic criteria or by the attending physician. Applying a simple test of LDL cholesterol greater than 190 mg/dL to each individual’s medical history would have only identified 18 (46%) of the 39 new cases of familial hypercholesterolaemia. The results from the FIND FH model application to the national database can be visualised at a national scale by showing the number of yet-to-be-diagnosed individuals with familial hypercholesterolaemia as a choropleth map aggregated to three-digit ZIP code geographies (figure 3). Visualising at this scale highlights high-density regions within the country and helps stakeholders to allocate resources efficiently to meet the unmet need. For example, we find high-density regions that track with founder populations known to have high rates of familial hypercholesterolaemia, including Ashkenazi Jewish and Amish populations. This trend holds both with crude counts (figure 3) and population-adjusted counts (appendix p 8).

Table 3:

Clinical and demographic data for the external validation datasets

	National database			OHSU dataset

	Complete dataset	Flagged by model	Reviewed by attending physician	Complete dataset	Flagged by model	Reviewed by expert
Number of participants	170 416 201	1 331 759	45	173 733	866	103
Age, years	54 (21)	61 (15)	55 (17)	50 (23)	59 (14)	45 (15)
Female participants	90 974 711 (53.4%)	597 409 (44.9%)	24 (53.3%)	91 294 (52.5%)	376 (43.4%)	48 (46.6%)
Individuals with data on laboratory results	15 368 850 (9.0%)	307 507 (23.1%)	43 (95.6%)	105 605 (60.8%)	656 (75.8%)	102 (99.0%)
Individuals with prescription data	82 979 553 (48.7%)	1 137 815 (85.4%)	45 (100%)	126 624 (72.9%)	649 (74.9%)	100 (97.1%)
Existing diagnoses
Arteriosclerotic cardiovascular disease	21 568 927 (12.7%)	335 180 (25.2%)	12 (26.7%)	24 104 (13.9%)	570 (65.8%)	44 (42.7%)
Hypercholesterolaemia	40 165 393 (23.6%)	734 534 (55.2%)	29 (64.4%)	28 679 (16.5%)	766 (88.5%)	88 (85.4%)
Diabetes	20 199 157 (11.9%)	279 631 (21.0%)	4 (8.9%)	17 326 (10.0%)	100 (11.5%)	12 (11.7%)
Hypertension	49 731 923 (29.2%)	587 272 (44.1%)	10 (22.2%)	40 411 (23.3%)	303 (35.0%)	28 (27.2%)
Using any statin	26 677 718 (15.7%)	513 639 (38.6%)	31 (68.9%)	25 567 (14.7%)	469 (54.2%)	71 (68.9%)

OHSU=Oregon Health & Science University.

Figure 2: Clinical review validation data from individuals flagged in the FIND FH national database and OHSU dataset. Numbers indicate individuals categorised as possible, probable, or definite familial hypercholesterolaemia. Each individual was evaluated with the Dutch Lipid Clinic Network, MEDPED, and Simon Broome criteria and expert opinion; individuals identified with the Simon Broome criteria (11 in the national database and 29 in the OHSU dataset) were consistent with evaluations by the Dutch Lipid Clinic Network or MEDPED. In the national database, identification by an attending physician indicates the presence of an E78.01 (International Classification of Diseases, Tenth Revision code for familial hypercholesterolaemia) by a physician in the individual’s history subsequent to the model scoring and clinical review. In the OHSU dataset, expert assessment is clinical chart review without an in-person clinical assessment. MEDPED=Make Early Diagnosis to Prevent Early Deaths. OHSU=Oregon Health & Science University.

Figure 3: Number of undiagnosed individuals with familial hypercholesterolaemia aggregated to three-digit ZIP code geographies. Values were not population adjusted; a population-adjusted version is presented in the appendix (p 8). The frequencies shown reflect the geographical coverage of the national health-care encounter and laboratory datasets and not a true, unbiased measure of the US population distribution.

In the second external validation dataset (ie, OHSU), 866 of 173 733 individuals were flagged by the model as likely to have familial hypercholesterolaemia, of whom 103 were reviewed (table 3; figure 2). Of the individuals flagged by the model, 77% (95% CI 68–86) were identified as having possible, probable, or definite familial hypercholesterolaemia by at least one of the diagnostic criteria or by the familial hypercholesterolaemia expert, in agreement with the precision measurement on the holdout tuning dataset. Applying a simple threshold of LDL cholesterol greater than 190 mg/dL to the individual’s medical history would have identified only 37 (47%) of the 79 new cases of familial hypercholesterolaemia.
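The percentages in this paragraph can be checked from the reported counts. The following is a minimal Python sketch of that arithmetic, taking 79 of the 103 reviewed individuals as the numerator (consistent with the reported 77%). The variable names are illustrative, and the Wilson score interval is an assumed, commonly used method for a binomial proportion; the method used for the reported 95% CI is not stated in this section, so the computed bounds need not match 68–86 exactly.

```python
import math

# Counts reported in the text for the OHSU validation dataset.
reviewed = 103       # flagged individuals who underwent chart review
confirmed = 79       # reviewed individuals with possible, probable, or definite FH
ldl_over_190 = 37    # confirmed individuals an LDL > 190 mg/dL rule would have found


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


precision = confirmed / reviewed        # share of reviewed individuals confirmed
ldl_capture = ldl_over_190 / confirmed  # share of confirmed cases the LDL rule finds
low, high = wilson_interval(confirmed, reviewed)

print(f"precision among reviewed: {precision:.1%}")
print(f"captured by LDL threshold: {ldl_capture:.1%}")
print(f"Wilson 95% interval: {low:.1%} to {high:.1%}")
```

The complement of the LDL capture share is the proportion of these cases that a threshold-only screen would have missed.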

Discussion

We describe here the development of a machine learning model, FIND FH, which is designed to identify phenotypic familial hypercholesterolaemia when applied to large medical datasets. FIND FH was built on longitudinal medical data from individuals with at least one documented cardiovascular disease risk factor in their history. This requirement was applied to allow us to collect data for model training and to optimise the model to the envisioned test cases, finding individuals embedded in a health-care system. Additionally, it forces the model to focus on discriminating between familial hypercholesterolaemia and other cardiac conditions that might mimic familial hypercholesterolaemia.

The FIND FH model is a precision screening tool and does not replace clinical evaluations or existing diagnostic criteria. It differs from the traditional diagnostic criteria (ie, Dutch Lipid Clinic Network, MEDPED, or Simon Broome) in that it does not require specific information, such as tendon xanthomas or family history, that either might not be present or cannot be regularly and reliably extracted from current EHR data at the national scale. FIND FH was structured to identify a phenotype consistent with familial hypercholesterolaemia whereas the other criteria were designed to evaluate the likelihood of having a positive genetic test for familial hypercholesterolaemia.^18,19 This is why we validated the model using familial hypercholesterolaemia expert clinical evaluation.¹ As genetic testing results become more readily available, we intend to include this feature in future versions.

Importantly, the FIND FH model does not rely only on predetermined thresholds for lipid concentrations. The algorithm was developed with data from patients with an existing diagnosis of familial hypercholesterolaemia; at some point in the past, these patients would have met the lipid diagnostic criteria for this condition, but those data were not in the EHR for many of the patients on whom the algorithm was trained. This is a key difference between FIND FH and conventional scoring systems. Although lipid concentrations are helpful to the machine learning algorithm, many patients identified in this study either had no lipid levels recorded and were flagged by other characteristics, or were taking lipid-lowering medications and had lipid levels below pre-treatment diagnostic thresholds. Future studies should assess how model performance changes when lipid levels are available for a higher percentage of the cohort.

Application of FIND FH to a national database consisting of diagnosis, procedure, and medication transactions in more than 170 million Americans flagged 1 345 477 individuals with medical profiles consistent with familial hypercholesterolaemia. FIND FH was also applied to a health-care delivery system dataset consisting of structured EHR data in more than 170 000 individuals, in which it flagged 866 individuals. The proportion of individuals flagged in these cohorts, which is empirically lower than the training prevalence of 1:71, is a direct reflection of the high precision threshold chosen. We chose this threshold to avoid the possibility of too many false positives at the start of the outreach programme to attending physicians. Chart review of the flagged individuals categorised 77–87% of them as having possible, probable, or definite familial hypercholesterolaemia, indicating a high enough clinical suspicion of familial hypercholesterolaemia to warrant a guideline-based, formal clinical evaluation and treatment. More than half of the individuals flagged would not have been identified with a simple screen for elevated LDL cholesterol levels alone;^20,21 this test does not capture data crucial to conventional diagnostic criteria, such as family history, and misses situations such as an individual on statins with LDL cholesterol levels below threshold. Furthermore, the model flagged both individuals undergoing statin therapy and those who were not. Of the 79 individuals categorised as having risk of possible familial hypercholesterolaemia or greater in the OHSU dataset, 31 had no record of any statin prescriptions in the previous 2 years. These results indicate that application of a machine learning approach such as FIND FH to medical big data might be feasible for identifying many undiagnosed individuals with familial hypercholesterolaemia.
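The comparison between the flagged proportions and the training prevalence can be illustrated with a short Python sketch. It uses only counts given in the text; because the national denominator is reported as "more than 170 million", the value of 170 million used here is an approximation, and the variable names are illustrative.

```python
# Counts reported in the text. The national denominator is given only as
# "more than 170 million", so 170 million is an approximation (assumption).
national_flagged = 1_345_477
national_total_approx = 170_000_000
ohsu_flagged = 866
ohsu_total = 173_733
training_prevalence = 1 / 71  # prevalence in the training data

national_rate = national_flagged / national_total_approx
ohsu_rate = ohsu_flagged / ohsu_total

# OHSU individuals categorised as possible FH or greater, and the subset
# with no statin prescription recorded in the previous 2 years.
confirmed = 79
no_statin = 31
no_statin_share = no_statin / confirmed

for label, rate in [("national", national_rate), ("OHSU", ohsu_rate),
                    ("training", training_prevalence)]:
    print(f"{label}: about 1 in {round(1 / rate)}")
print(f"no statin record: {no_statin_share:.0%}")
```

Both flagged proportions fall below 1:71, as expected when the decision threshold is set to favour precision over the number of individuals flagged.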

FIND FH performed comparably across two types of health-care data: a national health-care encounter database and an integrated health-care delivery system with a structured EHR database. This portability was a design consideration and arises from the fact that the model is built on structured health-care encounter and laboratory result data. Although we¹⁵ and others have found success when including unstructured EHR data and clinical notes in machine learning models,^14,22 to our knowledge, no national database with such data currently exists. The fact that the model performs similarly in distinct health-care data frameworks suggests that it might be generalisable to other institutions, agencies, employers, and health-care delivery systems. Our previous model¹⁵ took the complementary approach and showed good performance in identifying previously undiagnosed familial hypercholesterolaemia patients within a single institution. Importantly, the fact that the individuals identified in this latter case were already within the institution led to easier and quicker individual engagement.

We have developed a novel HIPAA-compliant outreach process to notify health-care providers of their patients flagged by the FIND FH model, a disclosure that is permitted by the treatment exception to HIPAA privacy rules. This process can be easily implemented across diverse health-care systems. Providers, or an integrated delivery system, can opt in to participate in the programme and then learn the identities of individuals in their practices flagged as having probable familial hypercholesterolaemia. These individuals can then be evaluated by their providers and, if formally diagnosed with familial hypercholesterolaemia, receive the necessary medical therapy.

This study has several limitations and caveats. By design, the model is given a 3-year snapshot of an individual’s full medical history to calculate their likelihood of familial hypercholesterolaemia. It is not possible to account for important pieces of information outside of the 3-year interval. The time windows chosen for building and applying the model to a dataset are a balance between the positive contribution of more data and the increased costs and other issues associated with using longer time windows. We investigated shorter and longer windows in previous FIND FH versions (data not shown) and found that 3 years yielded a good balance. In the training data, this limitation is mitigated by using a large number of individuals from multiple health-care systems.
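The effect of the 3-year snapshot can be pictured with a small Python sketch that keeps only the encounter records falling inside the window before a scoring date. This is an illustration only: the feature extraction used by FIND FH is not described in this section, and the function names, record layout, and example events are hypothetical.

```python
from datetime import date


def window_start(as_of: date, years: int = 3) -> date:
    """Return the date `years` before `as_of` (hypothetical helper)."""
    try:
        return as_of.replace(year=as_of.year - years)
    except ValueError:
        # as_of is 29 February and the target year is not a leap year.
        return as_of.replace(year=as_of.year - years, day=28)


def events_in_window(events, as_of: date, years: int = 3):
    """Keep (date, description) records dated within the window ending at as_of."""
    start = window_start(as_of, years)
    return [(d, desc) for d, desc in events if start <= d <= as_of]


# Invented example history for one individual.
events = [
    (date(2014, 5, 1), "LDL cholesterol laboratory result"),
    (date(2016, 8, 15), "statin prescription"),
    (date(2018, 2, 3), "hypercholesterolaemia diagnosis code"),
]

kept = events_in_window(events, as_of=date(2018, 6, 30))
for d, desc in kept:
    print(d.isoformat(), desc)
```

In this invented history the 2014 laboratory result falls outside the window and is dropped, which is the kind of information loss the limitation refers to.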

For logistical reasons, the physician review of flagged individuals could only be done on a small subset of those identified. We cannot rule out selection bias in the validation results because neither scenario was perfectly random. In the national database, we relied on professional (second to third degree) connections to collect physician reviews. In the OHSU dataset, we imposed the practical requirement that the patient have at least one LDL cholesterol laboratory result, so that the physicians could easily assess those flagged with conventional diagnostic criteria. Individuals with LDL cholesterol values represent the simplest patients to assess, and we expect this cohort to be the most commonly reviewed in practice.

An additional limitation of the study is that the mean age of individuals identified by the FIND FH model was 61 years (SD 15) in the national dataset and 59 years (SD 14) in the OHSU dataset, despite the fact that familial hypercholesterolaemia is a genetic condition and therefore present from birth. This result probably stems from two factors: first, the model was trained on individuals diagnosed in specialty lipid clinics where individuals are typically referred later in the course of their preventive or cardiac care (mean age of cases and presumed controls in the institutional training datasets were 49–63 years and 60–67 years, respectively),²³ and second, the very low prevalence of lipid data in individuals younger than 40 years of age in the databases. The final model identified those patients that it was trained to find—namely, older individuals with familial hypercholesterolaemia. The best value from the model might be achieved by successful cascade screening of family members of identified and diagnosed cases. Adding relevant clinical notes—particularly family history—would be an important development. However, there is currently no database at the national scale with this information available, nor are these data routinely included in EHRs. Therefore, addition of these data to the model would prevent using the model to scan the full national population.

In summary, when applied to two distinct types of large medical datasets, FIND FH identified a large number of individuals with probable familial hypercholesterolaemia who had not been previously diagnosed. Additional validation and demonstration of clinical utility of FIND FH will be needed before large-scale adoption of this approach. A crucial hurdle will be engaging providers to become familiar with machine learning approaches designed to reconnect them with their patients regarding new diagnoses not presented in previous medical encounters. This new tool carries the promise of finding new individuals with familial hypercholesterolaemia at scale and leading to more effective preventive therapy for them and newly identified family members.

Supplementary Material

Supplementary Appendix (NIHMS1678511-supplement-33323221_Supplementary_Appendix.pdf; 592 KB, PDF)

Research in context.

Evidence before this study

Familial hypercholesterolaemia is a dominantly inherited genetic disorder that affects roughly one in 250 individuals in the USA. Evidence shows that untreated familial hypercholesterolaemia leads to premature atherosclerotic cardiovascular disease in 25% of women and 50% of men. National and international guidelines describe the need for screening programmes and recommend, in particular, family-based cascade screening approaches after the initial diagnosis of a proband. A literature search of databases such as MEDLINE, PubMed, and Google Scholar, including terms “familial hypercholesterolemia” and “screening”, as well as reference lists in guideline documents, found several studies of published implementations. From 1994 to 2014, the Netherlands underwent a government-subsidised national screening programme. This programme was highly effective and identified more than 28 000 previously undiagnosed individuals with familial hypercholesterolaemia. More recently, in the UK, another study has shown the effectiveness of incorporating child–parent screening at routine immunisation visits to identify young, undiagnosed individuals with familial hypercholesterolaemia. However, in the USA, with the strict privacy rules around the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and little acceptance of universal screening, familial hypercholesterolaemia remains underdiagnosed; current data suggest that less than 10% of the more than 1 million people with familial hypercholesterolaemia in the USA have been identified. In fact, efficient cascade screening and effective index identification—both crucial components modelled by previous successful screening programmes—remain challenges in the USA.

Use of machine learning algorithms or other large-scale query methods to mine electronic health records data and find suspected index cases offers promise. In our previous analysis, the Familial Hypercholesterolemia Foundation (FH Foundation) showed the utility of a machine learning approach to train and identify previously undiagnosed patients with familial hypercholesterolaemia from data within a health-care institution.

Added value of this study

To our knowledge, this study is the first to show the use of a machine learning algorithm to successfully and efficiently screen individuals for familial hypercholesterolaemia at a national scale in the USA. The algorithm accurately flagged individuals in both a national health-care database and an integrated health-care delivery system. Additionally, the FH Foundation presents a novel HIPAA-compliant programme that allows individual physician practitioners to opt in and receive identification of the individuals with probable familial hypercholesterolaemia in their practice. Furthermore, we show national heat maps for the first time to simplify the identification of regions of the country with higher concentrations of individuals with probable familial hypercholesterolaemia who are undiagnosed. Data at this scale will allow resources to be efficiently allocated to meet the need of these individuals.

Implications of all the available evidence

Our study shows an efficient machine learning approach towards screening individuals with undiagnosed familial hypercholesterolaemia. Our algorithm successfully identifies medical profiles consistent with familial hypercholesterolaemia in data at both a national level and within an integrated health-care delivery system. Additionally, the FH Foundation provides the framework for a HIPAA-compliant method to contact these identified individuals. The FIND FH model was developed to recognise the clinical phenotype for familial hypercholesterolaemia; future models could be developed to identify individuals who would have high probability of a familial hypercholesterolaemia genetic variant.

Acknowledgments

The FH Foundation, a 501(c)(3) organisation, funded this study. Support was received from Amgen, Sanofi, and Regeneron. Amgen is the founding sponsor of the FIND FH initiative. JWK received support from the American Heart Association (grant 15IRG222930034), the Stanford Data Science Initiative, the Stanford Diabetes Research Center (P30DK116074), and the National Institutes of Health (grant U41HG009649). MDS is supported by the National Institutes of Health (grant K12HD043488). The FH Foundation would like to thank Penn Medicine, Stanford University Medical Center, Geisinger Medical Center, The Ohio State University Wexner Medical Center, Oregon Health & Science University, and Laboratory Corporation of America for providing vital data and shared vision for the FIND FH initiative. Moreover, individuals beyond the authors from each institution played a key role in providing data to help to train, test, and validate the FIND FH model. We thank Yuliya Borovskly and Dan Soffer (Penn Medicine); Kylie McElheran, Beth Wilson, and Kathy Lee (Oregon Health & Science University); Kelly J Scheiderer, Brian Myers, and Jing Ding (Ohio State University); Jim Fleming, Wade Tanico, Arren Fisher, Sherrie Duke, Eric Rotthoff, Lee Terrell, and M J Lewis (Laboratory Corporation of America) for their critical contributions.

Funding The FH Foundation funded this study. Support was received from Amgen, Sanofi, and Regeneron.

Footnotes

Declaration of interests

KW, SSG, DZ, and LW are employees of the FH Foundation; KDM, WH, and DS are paid consultants for the FH Foundation. JWK is the unpaid chief research advisor for the FH Foundation and the FIND FH project and has enrolled patients and adjudicated outcomes in PCSK9i trials. DJR serves on the science advisory board for Alnylam, Novartis, and Pfizer and is an unpaid advisor to the FH Foundation. MDS is supported by NIH K12HD043488. MFM reports receiving grants from Regeneron Pharmaceuticals as well as personal fees from Invitae, both outside the scope of the study. SJB serves on the scientific advisory board for Amgen, Sanofi, Novartis, Regeneron, and Akcea; is a consultant at Sanofi, Amgen, Cleveland Heart Labs, GLG Group, Guidepoint Global, Regeneron, Novo Nordisk, and Akcea; and is a national speaker for Amgen, Aralez, Boehringer Ingelheim Pharmaceutical, Novo Nordisk, and Akcea. NHS is a co-founder and scientific advisor to Cardinal Analytx and an advisor to TwoXaR. WCC was an employee of the Laboratory Corporation of America Holdings. ET is an employee of Pfizer. All other authors declare no competing interests.

References

1. Gidding SS, Champagne MA, de Ferranti SD, et al. The agenda for familial hypercholesterolemia: a scientific statement from the American Heart Association. Circulation 2015; 132: 2167–92.
2. Nordestgaard BG, Chapman MJ, Humphries SE, et al. European Atherosclerosis Society Consensus Panel. Familial hypercholesterolemia is underdiagnosed and undertreated in the general population: guidance for clinicians to prevent coronary heart disease: consensus statement of the European Atherosclerosis Society. Eur Heart J 2013; 34: 3478–90a.
3. Knowles JW, Rader DJ, Khoury MJ. Cascade screening for familial hypercholesterolemia and the use of genetic testing. JAMA 2017; 318: 381–82.
4. Grundy SM, Stone NJ, Bailey AL, et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APhA/ASPC/NLA/PCNA guideline on the management of blood cholesterol: executive summary. J Am Coll Cardiol 2018; 73: 3168–209.
5. Harada-Shiba M, Arai H, Oikawa S, et al. Guidelines for the management of familial hypercholesterolemia. J Atheroscler Thromb 2012; 19: 1043–60.
6. WHO. Familial hypercholesterolemia: report of a second WHO consultation. Geneva, Switzerland: World Health Organization, 1999.
7. Centers for Disease Control and Prevention. More detailed information on key tier 1 applications—familial hypercholesterolemia. March, 2014. http://www.cdc.gov/genomics/implementation/toolkit/FH_1.htm (accessed Oct 11, 2019).
8. Wald DS, Bestwick JP, Morris JK, Whyte K, Jenkins L, Wald NJ. Child–parent familial hypercholesterolemia screening in primary care. N Engl J Med 2016; 375: 1628–37.
9. McCrindle BW, Gidding SS. What should be the screening strategy for familial hypercholesterolemia? N Engl J Med 2016; 375: 1685–86.
10. Louter L, Defesche J, Roeters van Lennep J. Cascade screening for familial hypercholesterolemia: practical consequences. Atheroscler Suppl 2017; 30: 77–85.
11. Andersen R, Andersen L. Examining barriers to cascade screening for familial hypercholesterolemia in the United States. J Clin Lipidol 2016; 10: 225–27.
12. Lázaro P, Pérez de Isla L, Watts GF, et al. Cost-effectiveness of a cascade screening program for the early detection of familial hypercholesterolemia. J Clin Lipidol 2017; 11: 260–71.
13. Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med 2016; 375: 1216–19.
14. Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med 2018; 1: 18.
15. Banda JM, Sarranju A, Abbasi F, et al. Finding missed cases of familial hypercholesterolemia in health systems using machine learning. NPJ Digit Med 2019; 2: 23.
16. Jordan MI, Mitchell TM. Machine learning: trends, perspectives, and prospects. Science 2015; 349: 255–60.
17. Breiman L. Random forests. Mach Learn 2001; 45: 5–32.
18. Benn M, Watts GF, Tybjærg-Hansen A, Nordestgaard BG. Mutations causative of familial hypercholesterolaemia: screening of 98,098 individuals from the Copenhagen General Population Study estimated a prevalence of 1 in 217. Eur Heart J 2016; 37: 1384–94.
19. Amor-Salamanca A, Castillo S, Gonzalez-Vioque E, et al. Genetically confirmed familial hypercholesterolemia in patients with acute coronary syndrome. J Am Coll Cardiol 2017; 70: 1732–40.
20. Khera AV, Won HH, Peloso GM, et al. Diagnostic yield and clinical utility of sequencing familial hypercholesterolemia genes in patients with severe hypercholesterolemia. J Am Coll Cardiol 2016; 67: 2578–89.
21. Abul-Husn NS, Manickam K, Jones LK, et al. Genetic identification of familial hypercholesterolemia within a single US health care system. Science 2016; 354: 6319.
22. Afzal N, Sohn S, Abram S, et al. Mining peripheral arterial disease cases from narrative clinical notes using natural language processing. J Vasc Surg 2017; 65: 1753–61.
23. deGoma EM, Ahmad ZS, O’Brien EC, et al. Treatment gaps in adults with heterozygous familial hypercholesterolemia in the United States: data from the CASCADE-FH registry. Circ Cardiovasc Genet 2016; 9: 240–49.
