Journal of the American Medical Informatics Association : JAMIA. 2010 Sep-Oct;17(5):568–574. doi: 10.1136/jamia.2010.004366

Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease

Iftikhar J Kullo 1, Jin Fan 1, Jyotishman Pathak 2, Guergana K Savova 2, Zeenat Ali 1, Christopher G Chute 2
PMCID: PMC2995686  PMID: 20819866

Abstract

Background

There is significant interest in leveraging the electronic medical record (EMR) to conduct genome-wide association studies (GWAS).

Methods

A biorepository of DNA and plasma was created by recruiting patients referred for non-invasive lower extremity arterial evaluation or stress ECG. Peripheral arterial disease (PAD) was defined as a resting/post-exercise ankle-brachial index (ABI) less than or equal to 0.9, a history of lower extremity revascularization, or having poorly compressible leg arteries. Controls were patients without evidence of PAD. Demographic data and laboratory values were extracted from the EMR. Medication use and smoking status were established by natural language processing of clinical notes. Other risk factors and comorbidities were ascertained based on ICD-9-CM codes, medication use and laboratory data.

Results

Of 1802 patients with an abnormal ABI, 115 had non-atherosclerotic vascular disease such as vasculitis, Buerger's disease, trauma and embolism (phenocopies) based on ICD-9-CM diagnosis codes and were excluded. The PAD cases (66±11 years, 64% men) were older than controls (61±8 years, 60% men) but had similar geographical distribution and ethnic composition. Among PAD cases, 1444 (85.6%) had an abnormal ABI, 233 (13.8%) had poorly compressible arteries and 10 (0.6%) had a history of lower extremity revascularization. In a random sample of 95 cases and 100 controls, risk factors and comorbidities ascertained from EMR-based algorithms had good concordance compared with manual record review; the precision ranged from 67% to 100% and recall from 84% to 100%.

Conclusion

This study demonstrates use of the EMR to ascertain phenocopies, phenotype heterogeneity and relevant covariates to enable a GWAS of PAD. Biorepositories linked to EMR may provide a relatively efficient means of conducting GWAS.

Introduction

There is considerable interest in identifying genetic susceptibility variants for common diseases to facilitate novel diagnostic and therapeutic strategies. Genome-wide association studies (GWAS) have been successful in identifying hundreds of susceptibility variants for common diseases.1 However, this approach has yet to be applied to many diseases and quantitative traits. Furthermore, the odds ratios of common susceptibility alleles identified thus far are modest, and large sample sizes will be needed to identify additional susceptibility alleles for diseases/traits already studied. Whereas genotyping and sequencing costs continue to drop, recruitment and phenotyping of patients remain effort and time-intensive processes, thereby creating a bottleneck in genomics research.

When matched to biorepositories, the electronic medical record (EMR) can be leveraged for high throughput phenotyping of large numbers of patients for genomics research,2 3 thereby substantially reducing the effort and time required to identify genetic variants that influence disease susceptibility. The EMR may be used to ascertain other conditions that mimic the disease of interest (ie, phenocopies), phenotype heterogeneity and covariates relevant for investigating gene–environment interactions. Data from disparate sources (eg, diagnosis and procedure codes, laboratory data, medication use, imaging studies, etc.) can be mined from the EMR and processed using phenotype-specifying algorithms, allowing for efficient data extraction and construction of databases for genetic/biomarker analyses.4 In contrast, conventional phenotype characterization of clinical cohorts by manual review of medical records is effort-intensive, time-consuming, expensive, and the accuracy may vary depending on the experience of the abstractor.

We describe the use of the Mayo Clinic's EMR to annotate a biorepository with relevant clinical covariates in order to conduct a GWAS of a relatively common manifestation of atherosclerotic vascular disease—lower extremity peripheral arterial disease (PAD). The biorepository consisted of DNA and plasma of patients referred for non-invasive lower extremity arterial evaluation to rule out PAD, and stress ECG to rule out coronary artery disease. We ascertained case–control status, comorbidities and cardiovascular risk factors from the EMR—including the corresponding laboratory databases—and compared EMR data with those obtained by manual review of medical records in 95 PAD cases and 100 controls.

Methods

The present study was conducted as part of the electronic MEdical Records and GEnomics (eMERGE) Network (http://www.gwas.net) funded by the National Human Genome Research Institute, and comprising five institutions across the USA (including the Mayo Clinic, Rochester, Minnesota, USA). eMERGE is a national consortium formed to develop, disseminate and apply approaches to combine DNA biorepositories with EMR systems for large-scale, high-throughput genetic studies. The aim of the Mayo eMERGE study is to leverage the EMR to identify genetic variants influencing susceptibility to PAD.

Study population

In October 2006, a biorepository of plasma and DNA samples was initiated by recruiting patients referred for lower extremity arterial evaluation to the Mayo Clinic non-invasive vascular laboratory and individuals referred to the stress ECG laboratory to screen for coronary artery disease. Between October 2006 and May 2009, 3527 patients were recruited. All participants gave their written consent for participation in the studies and the use of their data for future research. The study protocol was approved by the Institutional Review Board of the Mayo Clinic.

Demographic factors

Specific data elements from the EMR were selected because of their potential relevance to PAD. These include birth date, sex and race. The categories of race were ‘white’, ‘black or African American’, ‘American Indian or Alaskan’, ‘Asian’, ‘other’, and ‘unknown’. We assessed the geographical distribution of the enrolled patients using the addresses provided in the EMR. A graphical representation of geographical distribution was created using the package ‘maps’ in R (http://www.r-project.org) and by downloading the average latitude and longitude for US states from the website http://www.maxmind.com/app/state_latlon. Height, weight and computed body mass index (BMI) closest to index date (defined as the date of vascular laboratory evaluation or stress ECG testing) were abstracted from the EMR directly.

Case status

Patients at the Mayo Clinic suspected of having PAD are referred for lower extremity arterial evaluation in the non-invasive vascular laboratory. The evaluation consists of measurement of the ankle-brachial index (ABI) at rest and one minute post-exercise, as well as continuous wave Doppler and pulse volume recordings. The ABI is the ratio of blood pressure (BP) at the ankle to the BP in the arm. Normally greater than 1.0, the ratio falls in the setting of atherosclerosis of the leg arteries and subsequent arterial narrowing. An ABI of 0.9 or less is commonly used in the clinical setting to diagnose PAD. The ABI is measured according to a standardized protocol in the laboratory in the supine position. Appropriately sized BP cuffs are placed over each brachial artery and above each malleolus, and systolic BP measured using a hand-held 8.3 MHz Doppler probe. The higher of the two systolic brachial BP is used to calculate the ABI. A standard treadmill test with a speed of 1–2 mph and a fixed grade of 10° with continuous ECG monitoring is performed to measure post-exercise ABI as well as pain-free and maximum walking distances. In patients with PAD, ankle systolic BP decreases during low levels of workload5 6 and post-exercise ABI values are more sensitive for detecting PAD.7 Since 1997, laboratory findings have been recorded in an electronic database employing an in-house software package for data archiving and retrieval; these data become part of Mayo Clinic's EMR. We used the following criteria to define the presence of PAD: an ABI of 0.9 or less at rest or one minute after exercise; or the presence of poorly compressible arteries; or normal ABI but a history of revascularization for PAD.
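The case definition above can be sketched in a few lines of code. This is a minimal illustration rather than the study's implementation: the function and parameter names are hypothetical, and it assumes the ABI is computed from a single ankle systolic pressure and the higher of the two brachial systolic pressures.

```python
from typing import Optional

ABI_THRESHOLD = 0.9  # cutoff stated in the text


def ankle_brachial_index(ankle_sbp: float,
                         left_brachial_sbp: float,
                         right_brachial_sbp: float) -> float:
    """Ankle systolic BP divided by the higher of the two brachial systolic BPs."""
    return ankle_sbp / max(left_brachial_sbp, right_brachial_sbp)


def is_pad_case(resting_abi: Optional[float],
                post_exercise_abi: Optional[float],
                poorly_compressible: bool,
                prior_revascularization: bool) -> bool:
    """Return True if any one of the three criteria in the text is met."""
    low_abi = any(abi is not None and abi <= ABI_THRESHOLD
                  for abi in (resting_abi, post_exercise_abi))
    return low_abi or poorly_compressible or prior_revascularization
```

A missing post-exercise ABI is passed as None, so a patient who did not undergo the treadmill test is judged on the remaining criteria.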

Phenocopies and phenotype heterogeneity

Several non-atherosclerotic vascular diseases can result in a low ABI, thereby mimicking atherosclerotic PAD. Such phenocopies include several vasculitides, Buerger's disease, embolism, trauma to leg arteries and other rare arteriopathies. Patients who had an abnormal ABI secondary to these conditions were excluded using a set of appropriate International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnosis codes. Poor compressibility of arteries results from calcification in the medial layer and such a phenomenon in the lower extremities represents a distinct subset of PAD. The presence of poorly compressible arteries was established based on: an average ABI in one leg greater than 1.4; or lower extremity BP of 255 mm Hg or greater; or the use of text searching techniques to identify the terms ‘stiff artery/vessel’, ‘calcified arteries/vessels’, or ‘poorly compressible arteries/vessels’ in the physician's interpretative report.
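The three criteria for poorly compressible arteries can be expressed as a simple rule. The sketch below is illustrative only: the function name is hypothetical, the regular expression accepts singular and plural forms as an assumption, and it does not handle negated phrases (for example "no calcified vessels"), which a real text search would need to address.

```python
import re
from typing import Optional

# Phrases named in the text; singular/plural handling is an assumption.
_REPORT_PATTERN = re.compile(
    r"\bstiff\s+(?:arter(?:y|ies)|vessels?)\b"
    r"|\bcalcified\s+(?:arter(?:y|ies)|vessels?)\b"
    r"|\bpoorly\s+compressible\s+(?:arter(?:y|ies)|vessels?)\b",
    re.IGNORECASE,
)


def has_poorly_compressible_arteries(leg_abi: Optional[float],
                                     leg_bp_mmhg: Optional[float],
                                     report_text: Optional[str]) -> bool:
    """True if the ABI, the leg BP, or the interpretative report meets a criterion."""
    high_abi = leg_abi is not None and leg_abi > 1.4
    high_bp = leg_bp_mmhg is not None and leg_bp_mmhg >= 255
    text_hit = bool(_REPORT_PATTERN.search(report_text or ""))
    return high_abi or high_bp or text_hit
```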

Control subjects

Controls were identified from patients referred to the cardiovascular health clinic for ruling out coronary artery disease. A majority (87%) of these patients underwent stress exercise ECG as part of the screening process. We excluded patients who had a positive stress ECG, who were younger than 50 years, or who had history of PAD. A proportion (36%, n=670) of the subjects who underwent exercise ECG also underwent measurement of ABI. The prevalence of an abnormal ABI in patients who had a negative stress ECG was less than 1% and those with qualifying ABI values were classified as having PAD.

Natural language processing

High throughput phenotyping based on EMR often requires the reliable transformation of unstructured data in the EMR to structured data. To extract data from the free-text clinical narrative, we used the free-text data annotations generated by the clinical text analytics and knowledge extraction system (cTAKES)8 9—a natural language processing (NLP) system developed at the Mayo Clinic (the software pipeline and documentation are available as open source from the Open Health Natural Language Consortium at http://www.ohnlp.org). The cTAKES parses the clinical narrative to identify types of clinically relevant concepts referred to as named entities along with qualifying attributes (negated, non-negated, current, history of, family history of). The named entity types discovered by cTAKES are drugs, diseases, signs/symptoms, anatomical sites and procedures. Each class is mapped to a domain-specific terminology to provide the concept code, thus dealing with language variations. In particular, Systematized Nomenclature of Medicine–Clinical Terms (SNOMED–CT)10 is used as the concept backbone for signs/symptoms, anatomical sites, diseases/disorders and procedures, while RxNorm—a standardized nomenclature for clinical drugs and drug delivery devices11—serves as the target terminology for drugs. In the present study, we utilized the drug annotations with their RxNorm codes to ascertain medication use, and clinical description of smoking history to establish smoking status.

Medications

The cTAKES NLP pipeline was used to discover drug named-entity mentions from clinical notes as medications are often recorded in clinical notes as free text. A time window of one year before and after the arterial evaluation was used to ascertain medication use. In addition to extracting drug mentions (coded to RxNorm) our approach identified drug dosage, frequency and route. For example, a sentence of the form ‘Simvastatin 80 mg oral tablet daily started on March 1, 2005’ was represented as Simvastatin (text), 200345 (associated code), 80 mg (strength), March 1, 2005 (start date), 1.0 (frequency), daily (frequency unit), oral (route). Furthermore, as in many cases it is practical to query and identify patients based on a class of drugs (eg, find all patients who have been prescribed statin medications), we mapped the individual RxNorm codes to drug classes as represented by the national drug file reference terminology (NDF–RT). Continuing with the above example, drugs such as simvastatin 80 mg oral tablet (RxNorm code=200345) were categorized under the drug class ‘antilipemic agents’ (NDF–RT code=C8912). The patient level accuracy was tested in 210 documents from 95 PAD cases and 100 controls.
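As an illustration of the structured representation described above, the simvastatin example could be held in a small record type and mapped to its drug class. This is a sketch, not the cTAKES data model: the class, field and function names are hypothetical, the lookup table contains only the one mapping given in the text, and the one-year window is approximated as 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Optional, Tuple


@dataclass
class DrugMention:
    text: str
    rxnorm_code: str
    strength: str
    start_date: date
    frequency: float
    frequency_unit: str
    route: str


# Only the mapping stated in the text; a real table would be far larger.
RXNORM_TO_NDFRT: Dict[str, Tuple[str, str]] = {
    "200345": ("C8912", "Antilipemic agents"),
}


def drug_class(mention: DrugMention) -> Optional[Tuple[str, str]]:
    """Return (NDF-RT code, class name), or None if the code has no mapping."""
    return RXNORM_TO_NDFRT.get(mention.rxnorm_code)


def in_window(note_date: date, evaluation_date: date, days: int = 365) -> bool:
    """True if the note falls within one year before or after the evaluation."""
    return abs(note_date - evaluation_date) <= timedelta(days=days)


simvastatin = DrugMention("Simvastatin", "200345", "80 mg",
                          date(2005, 3, 1), 1.0, "daily", "oral")
```

Returning None for an unmapped code keeps products that lack an NDF–RT counterpart visible instead of silently assigning them to a class.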

Smoking status

The Mayo Clinic smoking status classification system,12 based on cTAKES, was first developed in 2006 and was recently remodeled to extend functionality and improve performance.13 The entire EMR up to six months after the arterial evaluation was mined to assign smoking status. Rule-based and machine learning techniques were applied to clinical narratives and smoking status was classified as ‘past smoker’, ‘current smoker’, ‘smoker’, ‘non-smoker’, or ‘unknown’. A ‘past smoker’ label was assigned if a patient had not smoked for at least one year; a ‘current smoker’ label was assigned if a patient had smoked within the past year; and a ‘smoker’ label was assigned if there was evidence of smoking but not enough information to classify the patient as either a current or a past smoker. Smoking status for each patient was determined by a combination of precedence rules and document-level smoking status frequency.12 The level 1 and level 2 components of the classifier are rule based (keyword search). To train the machine learners for the level 3 classifier, we manually extracted smoking-related sentences and annotated their smoking status using 975 documents from 45 PAD patients. The patient-level smoking status classification was tested on held-out data: 829 documents from 41 PAD patients and 910 documents from 43 control subjects.
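The label definitions alone (not the NLP classifier, whose rules and features are not detailed here) can be written as a small decision function. The sketch assumes the evidence has already been reduced to two hypothetical structured inputs: whether any smoking was documented, and the time since the patient last smoked.

```python
from typing import Optional


def smoking_label(ever_smoked: Optional[bool],
                  years_since_last_smoked: Optional[float]) -> str:
    """Map two hypothetical structured inputs onto the five labels in the text."""
    if ever_smoked is None:
        return "unknown"
    if not ever_smoked:
        return "non-smoker"
    if years_since_last_smoked is None:
        # Evidence of smoking, but timing is unknown.
        return "smoker"
    return "past smoker" if years_since_last_smoked >= 1 else "current smoker"
```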

Cardiovascular risk factors and comorbidities

To ascertain risk factors and comorbidities we used ICD-9-CM diagnosis and procedure codes, medication use and patient laboratory data in the EMR. The presence of relevant ICD-9-CM codes up to six months following the index date was considered. Total, high-density lipoprotein, and low-density lipoprotein cholesterol, triglycerides, fasting blood sugar and glycosylated hemoglobin were obtained from the laboratory database using a window of one year around the time of the index date. Resting systolic and diastolic BP were obtained as a structured observation from the EMR. The diagnosis of hypertension was established based on two BP readings of 140/90 mm Hg or greater within three months closest to the index date, or a previous diagnosis of hypertension and current treatment with antihypertensive medication. Similarly, diabetes was diagnosed if a patient had fasting plasma glucose of 126 mg/dl or greater, or a random glucose greater than 200 mg/dl, or hemoglobin A1c of more than 6.5%, or had a previous diagnosis and was on treatment with oral hypoglycemic agent(s) or insulin. Dyslipidemia was defined as total cholesterol greater than 220 mg/dl, or high-density lipoprotein cholesterol less than 40 mg/dl (in men), less than 45 mg/dl (in women), or triglycerides greater than 200 mg/dl or the use of a lipid-lowering medication. Coronary heart disease was defined as the presence of ICD-9-CM diagnosis codes for ischemic heart disease including 410.××–414.××, or a history of percutaneous coronary intervention or coronary artery bypass surgery (ICD-9-CM procedure codes 36.10–36.14). Cerebrovascular disease was defined as the presence of ICD-9-CM diagnosis codes 430.××–438.×× or a history of carotid stenting or endarterectomy (ICD-9-CM procedure codes 00.61, 00.63, 38.10).
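The threshold-based definitions above translate directly into rules. The following sketch is an illustration under stated assumptions: function and parameter names are hypothetical, a BP reading counts as elevated if either the systolic or the diastolic value meets its cutoff, the readings passed in are assumed to be already restricted to the three-month window, and missing laboratory values are passed as None.

```python
from typing import Optional, Sequence, Tuple


def _above(value: Optional[float], threshold: float) -> bool:
    return value is not None and value > threshold


def has_hypertension(bp_readings: Sequence[Tuple[float, float]],
                     prior_diagnosis: bool,
                     on_antihypertensive: bool) -> bool:
    """bp_readings: (systolic, diastolic) pairs in mm Hg."""
    elevated = sum(1 for systolic, diastolic in bp_readings
                   if systolic >= 140 or diastolic >= 90)
    return elevated >= 2 or (prior_diagnosis and on_antihypertensive)


def has_diabetes(fasting_glucose: Optional[float],
                 random_glucose: Optional[float],
                 hba1c_percent: Optional[float],
                 prior_diagnosis: bool,
                 on_glucose_lowering_drug: bool) -> bool:
    """Glucose values in mg/dl."""
    return ((fasting_glucose is not None and fasting_glucose >= 126)
            or _above(random_glucose, 200)
            or _above(hba1c_percent, 6.5)
            or (prior_diagnosis and on_glucose_lowering_drug))


def has_dyslipidemia(total_cholesterol: Optional[float],
                     hdl: Optional[float],
                     triglycerides: Optional[float],
                     is_male: bool,
                     on_lipid_lowering_drug: bool) -> bool:
    """Lipid values in mg/dl; the HDL cutoff depends on sex."""
    hdl_cutoff = 40 if is_male else 45
    return (_above(total_cholesterol, 220)
            or (hdl is not None and hdl < hdl_cutoff)
            or _above(triglycerides, 200)
            or on_lipid_lowering_drug)
```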

The medical records of 95 randomly selected PAD cases and 100 controls underwent manual review by a trained abstractor. Using manual medical record review as the gold standard, we assessed the concordance of cardiovascular risk factors and comorbidities identified from the EMR.

Statistical analyses

Continuous variables were reported as mean±SD and categorical variables as percentages. Differences in covariates were tested using Student's t test and the χ² test. Precision, recall, and F-measure, which are well-established metrics in the NLP and information retrieval community,14 were used for evaluating EMR-based algorithms compared with manual medical record review. Precision is the proportion of retrieved examples that were correct. Recall is the proportion of correct examples that were retrieved. F-measure is the harmonic mean of precision and recall with equal weights, defined as (2×precision×recall)/(precision+recall). Macro and micro-averages were also calculated: the former by calculating the metric for each class and then averaging across classes, the latter by pooling the counts across all classes and calculating the metric from the pooled totals. We assessed micro and macro-averaged precision, recall and F-measure for each comorbidity and risk factor separately. A two-tailed p value of less than 0.05 was considered significant. All the data were analyzed using the SAS 9.1.3 statistical package.
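To make the difference between the two averaging schemes concrete, the sketch below computes both from paired gold-standard and algorithm labels. It is an illustration in Python, not the SAS code used in the study; the function names are hypothetical, and the macro F-measure is taken as the mean of the per-class F-measures, which is one common convention.

```python
from typing import Dict, List, Sequence, Tuple

Metrics = Tuple[float, float, float]  # (precision, recall, F-measure)


def prf(tp: int, fp: int, fn: int) -> Metrics:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def evaluate(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, Metrics]:
    """Macro- and micro-averaged metrics for single-label classification."""
    classes: List[str] = list(dict.fromkeys(list(gold) + list(predicted)))
    counts = {c: [0, 0, 0] for c in classes}  # per class: [tp, fp, fn]
    for g, p in zip(gold, predicted):
        if g == p:
            counts[g][0] += 1
        else:
            counts[p][1] += 1  # false positive for the predicted class
            counts[g][2] += 1  # false negative for the gold class
    per_class = [prf(*counts[c]) for c in classes]
    # Macro: average the per-class metrics.
    macro = tuple(sum(m[i] for m in per_class) / len(per_class) for i in range(3))
    # Micro: pool the counts, then compute the metrics once.
    micro = prf(*(sum(counts[c][i] for c in classes) for i in range(3)))
    return {"macro": macro, "micro": micro}


gold = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
pred = ["yes", "yes", "no", "no", "no", "no", "no", "yes"]
result = evaluate(gold, pred)
```

When every record receives exactly one label, each error is both a false positive for one class and a false negative for another, so micro-averaged precision, recall and F-measure all equal the overall accuracy (0.75 in this toy example). Macro-averaging weights every class equally and is therefore more sensitive to performance on rare classes.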

Results

Demographic characteristics are shown in table 1.

Table 1.

Demographic characteristics

PAD cases (n=1687) Controls (n=1725) p Value
Age, years 66±11 61±8 <0.0001
Men 1073 (64%) 1035 (60%) 0.0303
Race NS
 White 1566 (93%) 1592 (92%)
 Black or African American 11 (0.6%) 5 (0.3%)
 Native American Indian or Alaskan 5 (0.2%) 2 (0.1%)
 Asian 0 (0%) 4 (0.2%)
 Other 10 (0.6%) 7 (0.4%)
 Unknown 89 (5.3%) 110 (6.4%)
 Missing 6 (0.4%) 5 (0.3%)
Geographical distribution <0.0001
 Minnesota 918 (54%) 1047 (61%)
 Iowa 204 (12%) 96 (5%)
 Wisconsin 125 (7%) 77 (4%)
 Illinois 109 (6%) 120 (7%)
 Michigan 67 (4%) 62 (4%)
 Other 264 (16%) 323 (19%)

Age is presented as mean±SD.

Categorical variables are presented as percentages (%).

PAD, peripheral arterial disease.

Age, sex and BMI

No mismatches for age and sex were noted between EMR-mined data and manually abstracted data. In 14 patients, BMI values were implausibly high or low. On further review, the implausible values were found to result either from values in US customary units being entered in fields expecting metric units, or from height and weight being entered in each other's fields. BMI was missing in 101 individuals. Trained nurse abstractors were able to obtain BMI by reviewing the medical record in 78 individuals. Nonetheless, in 33 individuals, BMI could not be ascertained.
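A plausibility check of this kind can be sketched as follows. This is an illustration only, not the procedure used in the study: the function names are hypothetical and the plausible BMI range of 12 to 70 kg/m2 is an assumed screening threshold. Because a single implausible pair can be consistent with more than one data entry error, the function lists candidate explanations for manual review instead of correcting the value.

```python
from typing import List

LB_TO_KG = 0.45359237
IN_TO_CM = 2.54
PLAUSIBLE_BMI = (12.0, 70.0)  # assumed screening range, kg/m2


def bmi(weight_kg: float, height_cm: float) -> float:
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def is_plausible(weight_kg: float, height_cm: float) -> bool:
    if weight_kg <= 0 or height_cm <= 0:
        return False
    low, high = PLAUSIBLE_BMI
    return low <= bmi(weight_kg, height_cm) <= high


def candidate_explanations(weight: float, height: float) -> List[str]:
    """For an implausible pair, list the entry errors that would make it plausible."""
    candidates = []
    # Pounds and inches entered where kilograms and centimetres were expected.
    if is_plausible(weight * LB_TO_KG, height * IN_TO_CM):
        candidates.append("customary_units")
    # Height and weight entered in each other's fields.
    if is_plausible(height, weight):
        candidates.append("height_weight_swapped")
    return candidates
```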

Race and geographical distribution

Based on self-reported race, the study subjects were predominantly white (93%). A small proportion (6%) reported race as unknown. Genotyping results obtained later showed that 99% of such individuals were of European origin (data not shown). Figure 1 shows the geographical distribution of cases and controls. Five states in the upper Midwest contributed to 84% of cases and 81% of controls. Furthermore, 95% of cases and 83% of controls resided within a 500 mile radius around Rochester, Minnesota; 21% of cases and 36% of controls were from Olmsted County, Minnesota, USA.

Figure 1.


Geographical distribution of peripheral arterial disease cases and controls. Size of the circles is proportional to the number of study participants from a state. The dark shading within a circle represents the proportion of cases.

Phenocopies and phenotype heterogeneity

Of the 3527 patients, 1802 had possible PAD and 1725 were controls. Of the 1802 patients with possible PAD, 115 had ICD-9-CM diagnosis codes for non-atherosclerotic vascular diseases including vasculitis, Buerger's disease, embolism and trauma (phenocopies) and were excluded. Of the remaining 1687 PAD patients, 1444 had an abnormal ABI (85.6%), 233 had poorly compressible arteries (13.8%) and 10 (0.6%) had a history of revascularization for PAD with normal ABI. As shown in table 2, the mean ABI of the PAD patients (without poorly compressible arteries) was 0.69±0.19 and that in the controls was 1.10±0.12. More than two-thirds (69%) of the PAD patients underwent a treadmill walking test, and the mean post-exercise ABI was 0.48±0.24.

Table 2.

Vascular laboratory and stress test characteristics

PAD cases (n=1687) Controls (n=1725) p Value
ABI available 1677 (99.4%) 670 (36%) <0.0001
 ABI, at rest 0.69±0.19 1.10±0.12 <0.0001
 Poorly compressible arteries 233 (13.8%) 0 (0%) <0.0001
 Walking test 1168 (69%) 105 (6.1%) <0.0001
 ABI, post exercise 0.48±0.24
Exercise stress ECG 356 (21%) 1493 (87%) <0.0001

Continuous variables are presented as mean±SD.

Categorical variables are presented as percentages (%).

ABI, ankle-brachial index; PAD, peripheral arterial disease.

Medications

Use of the various medication classes is summarized in table 3. This information was obtained after mapping the individual RxNorm codes to drug classes as represented by the NDF–RT terminology. Medication use was more frequent in patients with PAD than in controls: 72% of PAD patients and 40% of controls were on antiplatelet agents, 55% of PAD patients and 30% of controls were on lipid-lowering agents, and 80% of PAD patients and 40% of controls were on BP-lowering medications. Several RxNorm drug products (semantic clinical drugs) did not have a corresponding entry in NDF–RT. Similarly, some drug products in NDF–RT were missing from RxNorm. Most of the mismatches could be attributed to differences in dosage, strength and route form.

Table 3.

Medication use in PAD cases and controls

NDF–RT code NDF–RT class name PAD cases (n=1687) Controls (n=1725)
C8824 [BL117] Antiplatelets 1209 (71.7%) 693 (40.2%)
C8902 [Cv100] β-Blockers/related 862 (51%) 287 (17%)
C8904 [Cv150] α-Blockers/related 135 (8%) 92 (5.3%)
C8906 [Cv200] Calcium antagonists 432 (25%) 142 (8.2%)
C8912 [Cv350] Antilipemic agents 927 (55%) 520 (30%)
C8914 [Cv400] Antihypertensive combinations 94 (5.6%) 49 (3%)
C8916 [Cv490] Antihypertensives 78 (4.6%) 11 (0.6%)
C8918 [Cv500] Peripheral vasodilators 2 (0.1%) 2 (0.1%)
C8924 [Cv701] Thiazides/related diuretics 402 (24%) 206 (12%)
C8926 [Cv702] Loop diuretics 346 (20%) 52 (3%)
C8928 [Cv703] Carbonic anhydrase inhibitor diuretics 2 (0.1%) 3 (0.2%)
C8930 [Cv704] Potassium sparing/combinations diuretics 183 (11%) 116 (6.7%)
C8934 [Cv800] ACE inhibitors 714 (42%) 284 (16%)
C8936 [Cv805] Angiotensin II inhibitor 258 (15%) 111 (6.4%)
C9118 [Hs500] Blood glucose regulation agents 6 (0.4%) 4 (0.2%)
C9120 [Hs501] Insulin 266 (15.8%) 33 (1.9%)

NDF–RT, national drug file reference terminology; PAD, peripheral arterial disease.

Risk factors and comorbidities

Prevalences of cardiovascular risk factors and comorbidities in PAD patients and controls are shown in table 4. As expected, patients with PAD were older than controls (66±11 vs 61±8 years) and had a greater prevalence of cardiovascular risk factors. Based on the EMR, of the 1687 PAD patients, 34% had diabetes, 73% had hypertension, 82% had dyslipidemia and 83% had a history of smoking. Among 1725 controls, 10% had diabetes, 38% had hypertension, 65% had dyslipidemia and 60% had a history of smoking.

Table 4.

Cardiovascular risk factors and comorbidities

PAD cases (n=1687) Controls (n=1725) Missing n p Value
BMI, kg/m2 29.0±5.5 28.5±5.3 33 0.0132
Triglycerides, mg/dl 187.1±235.6 139.0±79.5 246 <0.0001
Total cholesterol, mg/dl 190.8±50.7 202.8±37.2 245 <0.0001
HDL-cholesterol, mg/dl 48.2±15.4 56.5±17.2 253 <0.0001
LDL-cholesterol, mg/dl 106.3±38.8 118.7±32.8 253 <0.0001
Ever smoker 1405 (83%) 1036 (60%) 166 <0.0001
Hypertension 1237 (73%) 658 (38%) <0.0001
Diabetes 577 (34%) 164 (10%) <0.0001
Dyslipidemia 1380 (82%) 1113 (65%) <0.0001
CHD 927 (55%) 270 (16%) <0.0001
Cerebrovascular disease 559 (33%) 79 (4.6%) <0.0001

Continuous variables are presented as mean±SD.

Categorical variables are presented as percentages (%).

BMI, body mass index; CHD, coronary heart disease; HDL, high-density lipoprotein; LDL, low-density lipoprotein; PAD, peripheral arterial disease.

Comparison of EMR-based algorithms with manual medical record review

The comparison results are summarized in table 5. The EMR-based algorithms encompassing the validated ICD-9-CM codes for each of the cardiovascular risk factors and comorbidities had moderate to excellent agreement with manual medical record review. The precision ranged from 67% to 100% and recall from 84% to 100%. The EMR-based algorithms were accurate in identifying diabetes and coronary heart disease in PAD patients, and achieved micro F-measure scores above 0.92 for most of the comorbidities and risk factors in our study. The lowest precision was for identifying cerebrovascular disease in controls. Our NLP algorithms to classify smoking status in cases and ascertain medication use from free-text clinical narrative data also performed well, with macro and micro-averaged F-measures of 0.92 and 0.90, respectively, for patient-level smoking status classification. Ascertainment of medication class at the patient level reached 100% precision and recall. Performance for smoking status was slightly lower in the controls, with macro and micro-averaged F-measures of 0.86 and 0.84, respectively. The lower precision and recall in controls, a significant proportion of whom were non-smokers, suggest that the negation detection algorithm of the NLP tool may need further optimization.

Table 5.

Comorbidities and risk factors; comparison of EMR-based algorithms to manual medical record review in PAD patients (n=95) and controls (n=100)

Precision (PAD) Precision (Control) Recall (PAD) Recall (Control) F-measure (PAD) F-measure (Control)
Hypertension
 Macro-averaged 0.93 0.98 0.94 0.98 0.93 0.98
 Micro-averaged 0.95 0.98 0.95 0.98 0.95 0.98
Diabetes
 Macro-averaged 1.00 0.94 1.00 0.94 1.00 0.94
 Micro-averaged 1.00 0.98 1.00 0.98 1.00 0.98
Dyslipidemia
 Macro-averaged 0.90 0.97 0.98 0.93 0.94 0.95
 Micro-averaged 0.97 0.96 0.97 0.96 0.97 0.96
Coronary heart disease
 Macro-averaged 1.00 0.88 1.00 0.98 1.00 0.92
 Micro-averaged 1.00 0.96 1.00 0.96 1.00 0.96
Cerebrovascular disease
 Macro-averaged 0.94 0.67 0.95 0.99 0.95 0.75
 Micro-averaged 0.95 0.98 0.95 0.98 0.95 0.98
Medication class use
 Macro-averaged 1.00 1.00 1.00 1.00 1.00 1.00
 Micro-averaged 1.00 1.00 1.00 1.00 1.00 1.00
Smoking*
 Macro-averaged 0.92 0.87 0.91 0.90 0.92 0.86
 Micro-averaged 0.90 0.84 0.90 0.84 0.90 0.84
*Smoking status classification was compared using 829 documents from 41 peripheral arterial disease (PAD) cases and 910 documents from 43 control subjects.

EMR, electronic medical record.

Discussion

In this study, we used the EMR to confirm the phenotype of interest, identify phenocopies (ie, mimics of atherosclerotic PAD) and phenotype heterogeneity, and ascertain comorbidities and risk factors associated with PAD to enable a GWAS of PAD. PAD is a highly prevalent disease affecting approximately 8 million individuals aged 40 years or older in the USA, with nearly 20% of the elderly (>70 years) patients seen in general medical practice affected by the disease.15–18 Peripheral arterial disease is associated with significant mortality and morbidity, underscoring the necessity of a rigorous investigation of genetic variants that influence susceptibility to PAD. Although manual abstraction of medical records can provide high-quality data, for large studies such as genetic association studies, manual review of medical records can be prohibitively expensive and time-consuming. Our study demonstrates that the EMR can be used as a potential means of overcoming these challenges, as it offers several significant advantages over traditional approaches to genomic medicine research by simplifying logistics, and reducing time lines and overall costs through efficient data acquisition.

Since the turn of the 20th century, every Mayo Clinic patient has been assigned a unique identifier, and information from any encounter (whether inpatient, outpatient, emergency room, home visit, nursing home, or autopsy) is contained within a unit medical record. In addition, the diagnoses for each encounter are coded using the ICD-9-CM system and entered into a central index for ease of retrieval.19 A federated warehouse of patient data, the Mayo Enterprise Data Trust,20 has been derived from EMR data sources throughout the clinic since 1997. It accommodates most EMR contents for over 8 million patients, including demographics, highly annotated full-text clinical notes, laboratory data, diagnostic findings and related clinical data. It contains significant clinical phenotype information in both free-text and structured form. Using this infrastructure, we extracted relevant clinical variables on study participants that could confound the association of genetic susceptibility variants with PAD.

Age, sex and BMI

Expectedly, EMR ascertainment of age and sex was accurate. However, obtaining BMI from the EMR proved to be challenging. In several cases, the values were implausible as data were incorrectly entered in the EMR. In a small proportion (n=101 patients) the data were missing. Using manual record review, we were able to obtain BMI in 78 of these patients.

Race and geographical distribution

While population stratification is best assessed by actual genotype data, the EMR can provide an initial assessment of the potential for stratification as it contains information on self-reported race as well as residential addresses. Self-reported race was obtained in all of the participants, although 6% listed race as unknown. We were also able to assess the geographical distribution of cases and controls. Most of the cases and controls were non-Hispanic whites from the Upper Midwest, within a 500-mile radius of Rochester, Minnesota (89%), and nearly a third belonged to Olmsted County, Minnesota (29%), indicating a low probability of significant population admixture.

Phenocopies and phenotype heterogeneity

Using ICD-9-CM diagnosis codes for vasculitides, embolism and trauma, we were able to exclude patients with non-atherosclerotic vascular diseases that mimic atherosclerotic PAD (ie, phenocopies) such as vasculitides, Buerger's disease, trauma, embolism or thrombosis in situ and other rare arteriopathies. Although we did not attempt to do so in the present study, it would be possible to further mine the vascular laboratory database to obtain additional phenotypic information about PAD, such as disease location and disease severity. Such information may be useful for future studies with large numbers of PAD patients, in which subset analyses would be feasible. We were able to classify individuals with PAD into two broad subsets of PAD; the common form, in which the ABI is measurable and disease severity can be quantified, and the other subtype of poorly compressible arteries, in which ABI cannot be reliably calculated. The latter appears to be a distinct form of PAD, characterized by medial artery calcification,21 and is typically found in the elderly22 23 and patients with diabetes,24–26 or end-stage renal disease.27 It is possible that genetic susceptibility variants for PAD with poorly compressible arteries may be distinct from those that are associated with the conventional form of PAD. Our work indicates the potential of the EMR to characterize disease phenotypes in depth, and also to assess for phenotype heterogeneity.

Comparison of EMR versus manual medical record review

Most of the cardiovascular risk factors and comorbidities were captured from the EMR with an accuracy of over 90%. The F-measure scores for risk factors and comorbidities were generally higher in PAD patients than in controls, likely due to a higher prevalence of comorbidities and risk factors in the former. F-measure scores for smoking were lower in controls. Similarly, the low macro-averaged F-measure score for cerebrovascular disease among controls may be due to the relatively low prevalence of cerebrovascular disease in controls (one event among the 100 controls reviewed). Diabetes and dyslipidemia had higher accuracy scores than others, likely because objective data are used in the algorithms, leading to better concordance.

Identifying covariates

Most genetic susceptibility variants identified thus far for atherosclerotic cardiovascular disease have been ‘orthogonal’ markers (ie, they do not mediate risk through conventional risk factors).28 However, there are exceptions, and it is important in genetic association studies to adjust for potential confounding factors, such as conventional risk factors. For example, a single nucleotide polymorphism (SNP) was associated with susceptibility to PAD as well as lung cancer29; however, the effect of the SNP was mediated by increased nicotine dependence, and patients with this SNP tended to have a higher degree of tobacco use over the years. Similarly, SNPs that alter circulating lipid levels have also been associated with the risk of myocardial infarction.30
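One simple way to see why such covariates matter is a stratified analysis. The sketch below compares a crude odds ratio with a Mantel-Haenszel odds ratio pooled across smoking strata. The counts are invented so that the SNP has no effect within either stratum, and stratification stands in here for the regression adjustment a GWAS would more likely use. Whether smoking is then treated as a confounder or as a mediator is a question of interpretation that the arithmetic does not settle.

```python
from typing import Sequence, Tuple

# One 2x2 table per stratum:
# (carrier cases, carrier controls, non-carrier cases, non-carrier controls)
Table = Tuple[int, int, int, int]


def crude_odds_ratio(tables: Sequence[Table]) -> float:
    """Odds ratio after collapsing all strata into a single table."""
    a, b, c, d = (sum(t[i] for t in tables) for i in range(4))
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(tables: Sequence[Table]) -> float:
    """Mantel-Haenszel pooled odds ratio across strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Hypothetical counts: carriers smoke more, and smokers have more PAD.
smokers: Table = (60, 20, 30, 10)
nonsmokers: Table = (5, 20, 20, 80)

print("crude OR:", crude_odds_ratio([smokers, nonsmokers]))
print("smoking-adjusted OR:", mantel_haenszel_odds_ratio([smokers, nonsmokers]))
```

In this constructed example the crude odds ratio is well above 1 while the stratified estimate is 1, which is the pattern expected when an association with disease runs entirely through tobacco use.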

Gene–environment interactions

Another important reason to characterize risk factors, particularly environmental risk factors, is to enable investigation of gene–environment interactions. The depth and breadth of the data extracted from the EMR may be useful for investigating the complex interactions between genetic susceptibility and environmental factors. For PAD, the most important environmental factor is smoking. Genetic studies of PAD will need to assess potential interactions of genetic variants with smoking in mediating the risk of PAD, although this will be challenging given the high proportion of PAD patients who smoke. Confirming smoking status is more challenging than the extraction of other demographic variables because it requires NLP of free-text portions of the EMR. Previous work describes some of the problems involved in extracting smoking status from the EMR, and how NLP has been used to overcome these challenges.31 32 A limitation of EMR-ascertained smoking status is that quantifying pack-years of smoking is typically not feasible.
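A multiplicative gene–smoking interaction can be sketched as the ratio of the genotype odds ratios in smokers and non-smokers. The example below uses invented counts and Woolf's large-sample standard error for the log odds ratio. It is an illustration under those assumptions rather than the analysis plan of the study, and it would fail on a table containing a zero cell.

```python
import math
from typing import Tuple

# (carrier cases, carrier controls, non-carrier cases, non-carrier controls)
Table = Tuple[int, int, int, int]


def log_odds_ratio(table: Table) -> Tuple[float, float]:
    """Log odds ratio and its Woolf standard error (all cells must be > 0)."""
    a, b, c, d = table
    return math.log((a * d) / (b * c)), math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def interaction_ratio(smokers: Table, nonsmokers: Table,
                      z: float = 1.96) -> Tuple[float, float, float]:
    """Ratio of odds ratios (smokers / non-smokers) with an approximate 95% CI."""
    log_or_s, se_s = log_odds_ratio(smokers)
    log_or_n, se_n = log_odds_ratio(nonsmokers)
    diff = log_or_s - log_or_n
    se = math.sqrt(se_s ** 2 + se_n ** 2)
    return math.exp(diff), math.exp(diff - z * se), math.exp(diff + z * se)


# Hypothetical counts with few non-smoking cases, as expected for PAD.
ratio, lower, upper = interaction_ratio(smokers=(120, 40, 180, 120),
                                        nonsmokers=(6, 30, 24, 120))
print(f"ratio of odds ratios: {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because the standard error includes the reciprocal of every cell count, the small number of non-smoking cases dominates the width of the interval, which is one way to see why interaction analyses are difficult when most PAD patients smoke.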

Limitations and strengths

The present study uses data from just one medical center. However, the methods are exportable to the EMR of other medical centers. Missing data are another issue when using the EMR. In the present study, fewer than 7.5% of patients had missing lipid levels and only 2% had missing BMI values. Patients may have limited membership duration or short follow-up periods during which important variables (eg, exclusion or inclusion criteria) might not have been recorded in the EMR. However, in the present study only 77 patients had a single visit and only 575 had five or fewer visits. Although ICD-9-CM codes are easily available at a relatively low cost, systematic misclassification and exclusion of conditions or procedures not pertinent to reimbursement are potential limitations to their use.33 The completeness and quality of EMR data may vary with different practices among medical staff and clinicians,34 35 and the consistency of clinical definitions may vary with different providers36 or when data are extracted from free text,37 potentially affecting the accuracy of phenotype definitions. Comprehensive and standardized EMR-based algorithms that include laboratory values and medications may be needed to increase precision and generalizability. Future work will involve improving information structure in the EMR so that it is more usable for clinical research. The exchange of EMR-based data across institutions in a structured way, based on national and international standards, offers great potential for diverse research studies, including those related to understanding the genetic bases of common diseases.38

Conclusion

In summary, we demonstrate the feasibility of leveraging a biorepository linked to the EMR to enable a GWAS. We annotated a clinical laboratory-based biorepository with PAD case–control status and obtained relevant covariates as a step towards the identification of genetic susceptibility variants associated with PAD. We were also able to use the EMR to assess phenotype heterogeneity in cases and to exclude phenocopies. Ascertainment of cardiovascular risk factors and comorbidities using ICD-9-CM codes, laboratory data, medication use and NLP was reasonably accurate when compared with more detailed manual medical record review. The data presented here support the feasibility of EMR-based genomic research. The potential of the EMR for genomic medicine may be enhanced by modifying existing clinical processes to allow for research-grade data collection. Biorepositories matched to EMR may be a rapid, cost-saving and efficient means of conducting studies to identify genetic susceptibility variants of common cardiovascular diseases. Replication of previously detected genetic susceptibility variants, as well as detection of novel variants, will validate the use of EMR in genomic research.

Acknowledgments

The authors wish to acknowledge Vicki M Schmidt for help with manuscript preparation, Lacey Hart MBA for project coordination, Jeremy Palbicki and Kevin Bruce for informatics support, and Keyue Ding PhD for helpful discussion.

Footnotes

Funding: The eMERGE Network was initiated and funded by the National Human Genome Research Institute, with additional funding from the National Institute of General Medical Sciences. The Mayo eMERGE study was supported by grant no U01-HG04599.

Competing interests: None.

Patient consent: Obtained.

Ethics approval: The study protocol was approved by the Institutional Review Board of the Mayo Clinic.

Provenance and peer review: Not commissioned; externally peer reviewed.

References

1. Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. J Clin Invest 2008;118:1590–605.
2. Gerhard G, Langer R, Carey D, et al. Electronic medical records in genomic medicine practice and research. In: Willard H, Ginsburg G, eds. Genomic and personalized medicine. San Diego, CA: Elsevier, 2009.
3. Hernandez-Boussard T, Woon M, Klein TE, et al. Integrating large-scale genotype and phenotype data. OMICS 2006;10:545–54.
4. Butte AJ. Translational bioinformatics applications in genome medicine. Genome Med 2009;1:64.
5. Stahler C, Strandness DE Jr. Ankle blood pressure response to graded treadmill exercise. Angiology 1967;18:237–41.
6. Feringa H, Bax J, van Waning V, et al. The long-term prognostic value of the resting and postexercise ankle-brachial index. Arch Intern Med 2006;166:529.
7. Ouriel K, McDonnell AE, Metz CE, et al. Critical evaluation of stress testing in the diagnosis of peripheral vascular disease. Surgery 1982;91:686–93.
8. Savova GK, Kipper-Schuler K, Buntrock JD, et al. UIMA-based clinical information extraction system. LREC 2008: towards enhanced interoperability for large HLT systems: UIMA for NLP. Morocco: LREC, 2008.
9. Savova G, Masanz J, Ogren P, et al. Mayo Clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010;17:507–13.
10. SNOMED Clinical Terms® User Guide. July 2008 International release: the international health terminology standards development organisation. 2008. http://www.ihtsdo.org/snomed-ct/ (accessed 21 Jan 2010).
11. US National Library of Medicine. Unified medical language system® (UMLS®): RxNorm. Bethesda, MD. 2004. http://www.nlm.nih.gov/research/umls/rxnorm/ (accessed 21 Jan 2010).
12. Savova GK, Ogren PV, Duffy PH, et al. Mayo Clinic NLP system for patient smoking status identification. J Am Med Inform Assoc 2008;15:25–8.
13. Sohn S, Savova GK. Mayo Clinic smoking status classification system: extensions and improvements. AMIA Annu Symp Proc 2009;2009:619–23.
14. Manning CD, Schutze H. Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press, 1999.
15. Hirsch AT, Criqui MH, Treat-Jacobson D, et al. Peripheral arterial disease detection, awareness, and treatment in primary care. JAMA 2001;286:1317–24.
16. McDermott MM, Kerwin DR, Liu K, et al. Prevalence and significance of unrecognized lower extremity peripheral arterial disease in general medicine practice. J Gen Intern Med 2001;16:384–90.
17. Rosamond W, Flegal K, Furie K, et al. Heart disease and stroke statistics – 2008 update: a report from the American Heart Association Statistics Committee and Stroke Statistics Subcommittee. Circulation 2008;117:e25–146.
18. Allison MA, Ho E, Denenberg JO, et al. Ethnic-specific prevalence of peripheral arterial disease in the United States. Am J Prev Med 2007;32:328–33.
19. Leibson CL, Naessens JM, Brown RD, et al. Accuracy of hospital discharge abstracts for identifying stroke. Stroke 1994;25:2348–55.
20. Chute CG, Beck SA, Fisk TB, et al. The Enterprise Data Trust at Mayo Clinic: a semantically integrated warehouse of biomedical data. J Am Med Inform Assoc 2010;17:131–5.
21. Resnick HE, Lindsay RS, McDermott MM, et al. Relationship of high and low ankle brachial index to all-cause and cardiovascular disease mortality: the Strong Heart Study. Circulation 2004;109:733–9.
22. Coni N, Tennison B, Troup M. Prevalence of lower extremity arterial disease among elderly people in the community. Br J Gen Pract 1992;42:149–52.
23. Aboyans V, Lacroix P, Preux PM, et al. Variability of ankle-arm index in general population according to its mode of calculation. Int Angiol 2002;21:237–43.
24. Quigley FG, Faris IB, Duncan HJ. A comparison of Doppler ankle pressures and skin perfusion pressure in subjects with and without diabetes. Clin Physiol 1991;11:21–5.
25. Lehto S, Niskanen L, Suhonen M, et al. Medial artery calcification. A neglected harbinger of cardiovascular complications in non-insulin-dependent diabetes mellitus. Arterioscler Thromb Vasc Biol 1996;16:978–83.
26. Creager MA, Jones DW, Easton JD, et al. Atherosclerotic vascular disease conference: writing group V: medical decision making and therapy. Circulation 2004;109:2634–42.
27. Fishbane S, Youn S, Kowalski EJ, et al. Ankle-arm blood pressure index as a marker for atherosclerotic vascular diseases in hemodialysis patients. Am J Kidney Dis 1995;25:34–9.
28. Ding K, Kullo IJ. Genome-wide association studies for atherosclerotic vascular disease and its risk factors. Circ Cardiovasc Genet 2009;2:63–72.
29. Thorgeirsson TE, Geller F, Sulem P, et al. A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature 2008;452:638–42.
30. Kathiresan S, Melander O, Anevski D, et al. Polymorphisms associated with cholesterol and risk of cardiovascular events. N Engl J Med 2008;358:1240–9.
31. Uzuner O, Goldstein I, Luo Y, et al. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc 2008;15:14–24.
32. Zeng QT, Goryachev S, Weiss S, et al. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak 2006;6:30.
33. Tirschwell DL, Longstreth WT Jr. Validating administrative data in stroke research. Stroke 2002;33:2465–70.
34. Thiru K, Hassey A, Sullivan F. Systematic review of scope and quality of electronic patient record data in primary care. BMJ 2003;326:1070.
35. Treweek S. The potential of electronic medical record systems to support quality improvement work and research in Norwegian general practice. BMC Health Serv Res 2003;3:10.
36. de Lusignan S. The optimum granularity for coding diagnostic data in primary care: report of a workshop of the EFMI Primary Care Informatics Working Group at MIE 2005. Inform Prim Care 2006;14:133–7.
37. Voorham J, Denig P. Computerized extraction of information on the quality of diabetes care from free text in electronic patient records of general practitioners. J Am Med Inform Assoc 2007;14:349–54.
38. Gerdsen DIF, Müller DIS, Jablonski IS, et al. Standardized exchange of medical data between a research database, an electronic patient record and an electronic health record using CDA/SCIPHOX. AMIA Annu Symp Proc 2005;2005:963.

Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
