Journal of the American Medical Informatics Association : JAMIA. 2011 Nov 19;19(2):212–218. doi: 10.1136/amiajnl-2011-000439

Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study

Abel N Kho 1, M Geoffrey Hayes 1, Laura Rasmussen-Torvik 1, Jennifer A Pacheco 1, William K Thompson 1, Loren L Armstrong 1, Joshua C Denny 2, Peggy L Peissig 3, Aaron W Miller 3, Wei-Qi Wei 4, Suzette J Bielinski 4, Christopher G Chute 4, Cynthia L Leibson 4, Gail P Jarvik 5, David R Crosslin 5, Christopher S Carlson 6, Katherine M Newton 7, Wendy A Wolf 8, Rex L Chisholm 1, William L Lowe 1
PMCID: PMC3277617  PMID: 22101970

Abstract

Objective

Genome-wide association studies (GWAS) require high specificity and large numbers of subjects to identify genotype–phenotype correlations accurately. The aim of this study was to identify type 2 diabetes (T2D) cases and controls for a GWAS, using data captured through routine clinical care across five institutions using different electronic medical record (EMR) systems.

Materials and Methods

An algorithm was developed to identify T2D cases and controls based on a combination of diagnoses, medications, and laboratory results. The performance of the algorithm was validated at three of the five participating institutions compared against clinician review. A GWAS was subsequently performed using cases and controls identified by the algorithm, with samples pooled across all five institutions.

Results

The algorithm achieved 98% and 100% positive predictive values for the identification of diabetic cases and controls, respectively, as compared against clinician review. By standardizing and applying the algorithm across institutions, 3353 cases and 3352 controls were identified. Subsequent GWAS using data from five institutions replicated the TCF7L2 gene variant (rs7903146) previously associated with T2D.

Discussion

By applying stringent criteria to EMR data collected through routine clinical care, cases and controls for a GWAS were identified that subsequently replicated a known genetic variant. The use of standard terminologies to define data elements enabled pooling of subjects and data across five different institutions to achieve the robust numbers required for GWAS.

Conclusions

An algorithm using commonly available data from five different EMR can accurately identify T2D cases and controls for genetic study across multiple institutions.

Keywords: Analytics, application of biological knowledge to clinical care, bioinformatics, biomedical informatics, clinical phenotyping, controlled terminologies and vocabularies, data mining, EHR, EMR secondary and meaningful use, genetic epidemiology, genetics, genome-wide association studies, genomics, HIT data standards, improving the education and skills training of health professionals, infection control, information retrieval, knowledge representations, linking the genotype and phenotype, medical informatics, modeling, natural-language processing, ontologies, pharmacogenomics, phenotyping, reuseability, translational research


Type 2 diabetes (T2D) is an increasing public health problem.1 2 Although environmental factors including diet and physical activity contribute to the etiology of T2D, a genetic contribution to T2D has been well documented.3–6 Genome-wide association studies (GWAS) have been most effective in identifying T2D susceptibility genes, although the susceptibility alleles identified through these approaches have typically conferred only small increases in risk, so large numbers of subjects are required to detect them.7

To date, the traditional approach to case–control studies has been to recruit and phenotype study subjects prospectively. An alternative and more efficient approach would be to identify large numbers of cases and controls among patients receiving routine medical care. Until recently, it was unclear whether data collected through routine clinical care could achieve similar data quality compared with prospective study collection. The Electronic Medical Records and Genomics (eMERGE) consortium formed in 2007 to investigate if electronic medical record (EMR) linked DNA biorepositories can be leveraged for high-throughput genomic research.8 One of eMERGE's main goals is to assess whether EMR used in routine clinical care provide suitable data to identify individuals with specific phenotypes for GWAS. To date, the eMERGE consortium has successfully developed and validated several algorithms to identify accurately individuals with specific phenotypes.9–11 With recent initiatives promoting the widespread adoption of EMR, algorithms leveraging EMR-derived data for the identification of phenotypes may take on increasing importance.12 13

Candidate gene and GWAS have demonstrated a large number of genetic variants that contribute to T2D susceptibility providing known targets to test our hypothesis.3–7 14 15 Previous studies have documented that data captured through routine clinical care as part of the EMR can successfully identify patients with diabetes,16 17 but T2D is a particularly challenging phenotype. First, it must be distinguished from type 1 diabetes (T1D), which has a similar phenotype of hyperglycemia and shares at least one treatment, insulin, with T2D. However, the genetic underpinnings of T1D and T2D differ.18 19 Second, although T2D is increasingly common in youth and young adults, onset is typically later in life, complicating the identification of a control group at low risk of T2D.

Objective

This study aimed to develop an algorithm, using commonly collected data across multiple EMR systems, to identify individuals with T2D and to test the hypothesis that EMR-derived phenotypes can be used as an alternative to the prospective collection of disease cohorts to identify genetic variants associated with T2D. Described here is an algorithm that successfully identified a multi-ethnic cohort of T2D cases and controls across five institutions and permitted replication of the TCF7L2 variant rs7903146, the polymorphism most strongly and most consistently associated with T2D in genome-wide studies.

Research design and methods

eMERGE site overview

Five institutions participated in this study. All institutions obtained appropriate approval from their respective institutional review boards, and made use of a common data use agreement to enable data sharing between institutions. Each institution used an EMR for documentation of routine clinical care linked to a research specimen biorepository. Table 1 lists key features of each institution's EMR and biorepository. Additional details have been published previously.8 20 Notably, each site obtained appropriate patient consent for all study participants, with one site, Vanderbilt University (VU) making use of an opt-out consent model.21 Study participants represent the subset of patients who receive routine clinical care at study institutions, and also consented to participation within the institutional biorepository. Each eMERGE center selected a primary phenotype for investigation; T2D was led by Northwestern University (NU). The algorithm described below was used to identify all possible T2D case and control individuals at NU, and supplemented with all possible African ancestry cases and controls at VU. These samples were then supplemented with T2D cases and controls identified using the same algorithm from other eMERGE sites where individuals were selected for genotyping using independently derived algorithms for phenotypes of interest at that particular institution (eg, QRS duration at VU, cataracts at Marshfield, vascular disease at Mayo Clinic and dementia at Group Health Cooperative).

Table 1.

Overview of participating institutions' EMR and biorepositories and recruitment models

Institution | Biorepository overview | Recruitment model | Repository size | EMR summary
Marshfield Clinic Research Foundation (Marshfield, Wisconsin, USA) | Personalized medicine research project: geographically defined cohort within an integrated regional healthcare system | Population based | 20 000; 98% Caucasian | Comprehensive internally developed EMR since 1985; 75% participants have 20+ years medical history
Northwestern University (Chicago, Illinois, USA) | Nugene project: Northwestern affiliated hospitals and outpatient clinics | Population based | 10 000; 12% AA, 8% Hispanic | Comprehensive vendor-based inpatient (Cerner, Kansas City, Missouri, USA) and outpatient (Epic Systems, Verona, Wisconsin, USA) EMR since 2000; 20+ years ICD-9 data
Vanderbilt University (Nashville, Tennessee, USA) | BioVU: Vanderbilt Clinic, diverse outpatient clinic | Population based | 100 000; 11% AA | Comprehensive internally developed EMR since 2000; 35+ years medical history data
Group Health Cooperative (Seattle, Washington, USA) | GHC Biobank: Alzheimer's disease patient registry and adult changes in thought study | Disease-specific cohort | 4000 (>96% Caucasian) | Comprehensive vendor-based (Epic Systems) EMR since 2004; 20+ years pharmacy data; 15+ years ICD-9 data
Mayo Clinic (Rochester, Minnesota, USA) | Vascular diseases biorepository | Disease-specific cohort | 3500 (>96% Caucasian) | Comprehensive internally developed EMR since 1995; 40-year history of data extraction

AA, African ancestry; EMR, electronic medical record; GHC, Group Health Cooperative; ICD-9, International Classification of Diseases, version 9.

Algorithm development

We used the existing clinical diagnostic criteria developed by the American Diabetes Association to develop an approach to identify T2D using commonly captured EMR data, including diagnostic codes, medications, and laboratory test results. T2D is typically diagnosed clinically by documenting hyperglycemia, a fasting glucose greater than or equal to 126 mg/dl or random glucose greater than or equal to 200 mg/dl. Our study focused on a non-pregnant adult population and did not utilize the results of oral glucose tolerance tests, which are most commonly used to screen for gestational diabetes. Each site used as many years of EMR data as were available for their study population.

To ensure comparability across sites, we identified the appropriate national standards to define diagnoses, medications, and laboratory tests (see supplementary appendices 1–3, available online only). All sites utilized International Classification of Diseases, 9th revision, clinical modification (ICD-9-CM) diagnostic codes. We defined medications using unique RxNorm codes at an ingredient level and defined laboratory tests using the Logical Observation Identifiers Names and Codes (LOINC) standard.22 We identified the ‘best fit’ LOINC codes by units of measurement and overall frequency of clinical use. We included all patients with ICD-9-CM codes of 250.x0 or 250.x2, except for codes 250.10 and 250.12 (indicative of T2D with ketoacidosis, a condition also closely associated with T1D), patients on T2D medications and/or insulin at any time, and all patients with abnormal glucose (>200 mg/dl) or hemoglobin A1c (HbA1c; ≥6.5%) laboratory test results.
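As a rough illustration of the inclusion criteria just described, the diagnosis-code pattern and the laboratory cut-offs could be expressed as below. This is a minimal sketch, not the consortium's implementation: the function names are hypothetical, codes are assumed to arrive as dotted five-digit ICD-9-CM strings, and glucose values are assumed to be in mg/dl.

```python
import re
from typing import Optional

# ICD-9-CM 250.x0 or 250.x2 (any fourth digit, fifth digit 0 or 2).
T2D_CODE_PATTERN = re.compile(r"^250\.\d[02]$")
# Ketoacidosis codes are excluded from the T2D set, as stated in the text.
EXCLUDED_T2D_CODES = {"250.10", "250.12"}


def is_t2d_code(code: str) -> bool:
    """Return True if the code counts as a T2D diagnosis under the stated rule."""
    code = code.strip()
    return bool(T2D_CODE_PATTERN.match(code)) and code not in EXCLUDED_T2D_CODES


def has_abnormal_inclusion_lab(glucose_mg_dl: Optional[float] = None,
                               hba1c_percent: Optional[float] = None) -> bool:
    """Apply the inclusion cut-offs: glucose >200 mg/dl or HbA1c >=6.5%."""
    glucose_abnormal = glucose_mg_dl is not None and glucose_mg_dl > 200
    hba1c_abnormal = hba1c_percent is not None and hba1c_percent >= 6.5
    return glucose_abnormal or hba1c_abnormal
```

Keeping the code pattern and thresholds in one place makes it easier for each site to swap in its own extraction layer while applying identical criteria.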

The algorithm was originally developed at an institution with primarily structured data housed within a clinical data warehouse, easily accessed through Structured Query Language (SQL) queries. Each site adapted the algorithm to suit the clinical data stored within their institutional EMR, and to take advantage of local data extraction and analytical tools. All sites utilized a data warehouse, separate from their transactional EMR system to execute the algorithm and avoid impacting EMR performance. Subsequent sites required varying degrees of natural language processing to extract structured data from otherwise unstructured free text clinical notes. One site utilized all diabetes codes due to a noted consistent billing pattern of using ICD-9-CM code 250.00 for both T1D and T2D patients.

In order to increase algorithm specificity and reduce the risk that an outlying observation might inadvertently capture a miscoded diagnosis, we required some redundancy in diagnostic criteria. We required cases with a T2D diagnosis code to have either an abnormal laboratory test or a prescription for a T2D medication. We required cases without a T2D diagnosis code in the EMR to have documentation of both a prescription for a T2D medication and an abnormal glucose (random glucose >200 mg/dl, fasting glucose >125 mg/dl) or a HbA1c laboratory test result of 6.5% or greater (figure 1).

Figure 1.


Algorithm for the identification of subjects with type 2 diabetes. *Random glucose >200 mg/dl, fasting glucose >125 mg/dl, HbA1c ≥6.5%. HbA1c, hemoglobin A1c; ICD-9, International Classification of Diseases, version 9; T1DM, type 1 diabetes mellitus; T2DM, type 2 diabetes mellitus.

Patients with a diabetes diagnosis (ICD-9-CM 250.xx) and only documented as on insulin predictably proved the most difficult to categorize as T1D or T2D patients. To differentiate between T1D and T2D in subjects who were treated with insulin, we required that patients treated with insulin alone either have a past prescription for a T2D medication or meet the following criteria: no T1D diagnoses and greater than or equal to two T2D diagnoses entered by a clinician (ie, not billing coders) on different dates.
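The case rules described in the preceding paragraphs can be sketched as a single decision function. This is an illustrative reading of the prose only, not a transcription of figure 1: the record fields and function name are hypothetical, and applying the stricter insulin-alone rule before the laboratory rule is an assumption about precedence.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    has_t2d_dx: bool               # ICD-9-CM 250.x0/250.x2, excluding 250.10 and 250.12
    has_t1d_dx: bool               # any T1D diagnosis in the record
    ever_on_t2d_medication: bool   # non-insulin T2D medication, current or past
    on_insulin: bool
    has_abnormal_lab: bool         # random glucose >200, fasting glucose >125, or HbA1c >=6.5%
    clinician_t2d_dx_dates: int    # distinct dates with a clinician-entered T2D diagnosis


def is_t2d_case(p: PatientRecord) -> bool:
    """Return True if the record meets the T2D case criteria described in the text."""
    if p.has_t2d_dx:
        if p.ever_on_t2d_medication:
            return True
        if p.on_insulin:
            # Insulin alone: require no T1D diagnosis and at least two
            # clinician-entered T2D diagnoses on different dates.
            return (not p.has_t1d_dx) and p.clinician_t2d_dx_dates >= 2
        # No diabetes medication on record: fall back on an abnormal laboratory result.
        return p.has_abnormal_lab
    # No T2D diagnosis code: require both a T2D medication and an abnormal lab.
    return p.ever_on_t2d_medication and p.has_abnormal_lab
```

For example, a patient with a T2D code, insulin only, and a single clinician-entered diagnosis would not qualify under this sketch, whereas the same patient with two such diagnoses on different dates and no T1D code would.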

We similarly developed an algorithm to identify control subjects without diabetes (figure 2). We excluded patients with any diabetes diagnosis (ICD-9-CM codes 250.xx) or any of the diagnoses listed in supplementary table 1 (available online only), patients on insulin or any of the medications listed in supplementary table 2 (available online only), patients who used any diabetic supplies (ie, insulin syringes, glucose monitors), and patients with any abnormal glucose (≥110 mg/dl) or HbA1c (≥6.0%) laboratory values. We also excluded patients with a family history of diabetes, either in the EMR or in questionnaire data if available. We also required controls to have had at least one normal glucose measurement and at least two in-person clinician encounters to ensure that patients had sufficient data in their EMR to determine confidently that they did not have diabetes. Figure 2 depicts the final algorithm for choosing control subjects.

Figure 2.


Algorithm for the identification of type 2 diabetes controls. *Glucose ≥110 mg/dl, HbA1c ≥6.0%. HbA1c, hemoglobin A1c; ICD-9, International Classification of Diseases, version 9; T1DM, type 1 diabetes mellitus; T2DM, type 2 diabetes mellitus.
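The control criteria can be sketched in the same style. The field and function names below are hypothetical, and the sketch assumes that diagnoses, medications, and supplies have already been matched against the supplementary tables upstream; it only encodes the exclusion and minimum-data rules stated in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ControlCandidate:
    has_any_diabetes_dx: bool = False          # ICD-9-CM 250.xx or a supplementary table 1 diagnosis
    on_diabetes_med_or_insulin: bool = False   # insulin or a supplementary table 2 medication
    uses_diabetic_supplies: bool = False       # eg, insulin syringes, glucose monitors
    family_history_of_diabetes: bool = False
    glucose_values_mg_dl: List[float] = field(default_factory=list)
    hba1c_values_percent: List[float] = field(default_factory=list)
    in_person_encounters: int = 0


def is_control(c: ControlCandidate) -> bool:
    """Return True if the record meets the control criteria described in the text."""
    if (c.has_any_diabetes_dx or c.on_diabetes_med_or_insulin
            or c.uses_diabetic_supplies or c.family_history_of_diabetes):
        return False
    # Any abnormal value excludes: glucose >=110 mg/dl or HbA1c >=6.0%.
    if any(g >= 110 for g in c.glucose_values_mg_dl):
        return False
    if any(h >= 6.0 for h in c.hba1c_values_percent):
        return False
    # Minimum data: at least one (necessarily normal) glucose result
    # and at least two in-person clinician encounters.
    return len(c.glucose_values_mg_dl) >= 1 and c.in_person_encounters >= 2
```

Because every abnormal glucose value has already excluded the record by the final step, requiring one recorded glucose there is equivalent to requiring one normal measurement.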

Algorithm validation

As an interim step to validate the results of our algorithms before genotyping, three sites conducted a blinded chart review of at least 100 total cases and controls, with one site conducting an additional 50 chart reviews for a related study. Two sites (NU, VU) utilized clinician reviewers, and one site (MCRF) used trained chart reviewers. We reviewed charts of patients identified as either cases or controls in order to assess completely the positive predictive value (PPV) of our algorithm to identify cases and controls accurately for subsequent GWAS. Using manual chart review as a comparison standard, we generated a PPV for both cases and controls, for both iterations of the automated algorithms. Statistical analyses were performed using R,23 specifically the epiR24 package.

Genetic analysis

Genotyping was performed at the Broad Institute and Center for Inherited Disease Research on the Illumina 660W and 1M Bead Chips (Illumina Inc, San Diego, CA, USA). Genotype cleaning and quality control were performed collaboratively by all five sites using a previously described approach.25 Genotype data for rs7903146 in TCF7L2 on chromosome 10 from individuals who passed quality control and were identified as a T2D case or control were used in this analysis to test the validity of the algorithm. Demographic differences between cohorts were assessed by analysis of variance in R.23

Associations between genotype and T2D case–control status were assessed through logistic regression assuming an additive model, and adjusting for site, age, sex, body mass index (BMI), and ancestry using PLINK.26 We used age and median BMI at the time of earliest diabetes diagnosis or diabetes medication prescription for cases, and age and median BMI at the time of enrollment in the biobank (age when biospecimen was collected) for controls. We excluded BMI measures collected during pregnancy for both cases and controls. The OR, SE of the OR, and p value from each of the five European ancestry cohorts were combined in a meta-analysis weighting each stratum by the number of samples using default settings in PLINK. This was repeated for the two African ancestry cohorts, and for all seven cohorts in total. We report p values and OR for a fixed-effect model.

Results

In combination, the five sites identified a total of 3353 cases and 3352 controls, of which 3266 cases and 3286 controls passed genomic quality control testing and were included in the genetic analysis. Table 2 lists the demographics from each site for samples that passed genomic quality control testing. The cohorts differed significantly from one another (p<0.001) with respect to age, sex, and BMI; we adjusted for these differences in our subsequent genetic analysis. Table 3 summarizes validation results at three participating sites, which ranged from 98.2% to 100% PPV for the identification of cases and from 98% to 100% PPV for the identification of controls. The association results are presented in table 4. Three of the five sites (Group Health, Marshfield, and NU) produced moderate to strong associations between rs7903146 and T2D in European ancestry cohorts (p value range 0.0025 to 9.27×10⁻⁵). At two sites, Mayo Clinic and VU, associations trended in the same direction and approached nominal significance (p=0.1177 and 0.0601, respectively). Allele frequencies for the VU European-American (EA) cohort were similar to those of the remaining EA cohorts, but the association failed to reach significance because of a much smaller sample size. Among the two African-American (AA) cohorts, NU and VU yielded similar case and control frequencies of the associated (T) allele at rs7903146, but the small NU AA cohort (N=294) yielded a p value of 0.0867, while the larger VU AA cohort achieved high significance (p=2.25×10⁻⁶), with the difference attributable to sample size. Our cross-site cohort meta-analyses produced very similar results with even smaller p values that are all highly significant: p=2.98×10⁻¹⁰ for EA, p=5.30×10⁻⁷ for AA, and p=2.05×10⁻¹⁵ for all subjects across all sites. The cross-cohort OR for rs7903146 was 1.46, which is similar to what has been found previously in other populations.27

Table 2.

Demographics of genotyped cases and controls

Cohort | Samples (n) | Male (%) | Age* mean | Age* SE | BMI mean | BMI SE
Cases
GHC EA | 441 | 50.34 | 74.75 | 0.36 | 29.71 | 0.72
Marshfield EA | 562 | 52.49 | 63.75 | 0.46 | 33.25 | 0.31
Mayo EA | 518 | 70.46 | 64.03 | 0.40 | 32.28 | 0.41
NU EA | 561 | 55.26 | 57.36 | 0.49 | 33.44 | 0.34
VU EA | 331 | 53.78 | 57.77 | 0.70 | 33.29 | 0.47
Cross cohort EA | 2413 | 56.78 | 63.51 | 0.25 | 32.40 | 0.21
NU AA | 184 | 35.33 | 53.30 | 0.85 | 36.22 | 0.75
VU AA | 626 | 38.02 | 53.70 | 0.57 | 34.52 | 0.37
Cross cohort AA | 810 | 37.41 | 53.61 | 0.48 | 34.97 | 0.34
Controls
GHC EA | 379 | 41.69 | 74.64 | 0.32 | 25.02 | 0.22
Marshfield EA | 432 | 31.94 | 57.94 | 0.45 | 26.91 | 0.20
Mayo EA | 687 | 57.50 | 60.80 | 0.32 | 27.43 | 0.31
NU EA | 670 | 44.33 | 50.05 | 0.50 | 26.85 | 0.20
VU EA | 224 | 49.55 | 58.10 | 0.91 | 28.04 | 0.48
Cross cohort EA | 2392 | 45.94 | 59.21 | 0.27 | 26.80 | 0.13
NU AA | 112 | 20.54 | 40.28 | 1.06 | 30.46 | 0.76
VU AA | 761 | 34.82 | 45.55 | 0.59 | 29.85 | 0.31
Cross cohort AA | 873 | 32.99 | 44.87 | 0.54 | 29.96 | 0.29

*Age, age of onset for cases, and age of enrollment for controls.

BMI, body mass index, median BMI at age of onset for cases, and median BMI at last age collected for controls.

Analysis of variance significant at p<0.001.

AA, African ancestry; EA, European ancestry; GHC, Group Health Cooperative; NU, Northwestern University; VU, Vanderbilt University.

Table 3.

Summary of chart review results at three participating sites

Rows show the EMR prediction; columns show the manual chart review result at each site.

EMR prediction | Northwestern University*: Case | Control | Total | Vanderbilt University†: Case | Control | Total | Marshfield Clinic‡: Case | Control | Total
Case | 56 | 1 | 57 | 50 | 0 | 50 | 99 | 1 | 100
Control | 0 | 43 | 43 | 0 | 50 | 50 | 1 | 49 | 50
Total | 56 | 44 | 100 | 50 | 50 | 100 | 100 | 50 | 150

*Clinician reviewers, ANK, WLL.

†Clinician reviewers not authors on this study.

‡Trained chart reviewers.

EMR, electronic medical record.
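The positive predictive values quoted in the text follow directly from the counts in table 3. The short sketch below recomputes them; the dictionary layout and function name are illustrative only, and no confidence intervals are computed here (the study itself used the epiR package in R).

```python
def positive_predictive_value(confirmed: int, not_confirmed: int) -> float:
    """PPV = algorithm-assigned labels confirmed by chart review / all assigned labels."""
    return confirmed / (confirmed + not_confirmed)


# (confirmed by chart review, not confirmed) for each algorithm-assigned label,
# taken from the rows of table 3.
chart_review_counts = {
    "Northwestern University": {"case": (56, 1), "control": (43, 0)},
    "Vanderbilt University": {"case": (50, 0), "control": (50, 0)},
    "Marshfield Clinic": {"case": (99, 1), "control": (49, 1)},
}

ppv_by_site = {
    site: {label: positive_predictive_value(*counts) for label, counts in labels.items()}
    for site, labels in chart_review_counts.items()
}

for site, values in ppv_by_site.items():
    print(f"{site}: case PPV {values['case']:.1%}, control PPV {values['control']:.1%}")
```

The lowest case PPV is 56/57 at Northwestern University, which corresponds to the 98.2% lower bound reported in the Results.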

Table 4.

Association results for rs7903146 in TCF7L2 with type 2 diabetes

Cohort | N | T allele frequency, cases | T allele frequency, controls | OR | L95 | U95 | p Value
GHC EA | 813 | 0.327 | 0.248 | 1.59 | 1.25 | 2.02 | 0.0002
Marshfield EA | 930 | 0.323 | 0.243 | 1.49 | 1.15 | 1.93 | 0.0025
Mayo EA | 1159 | 0.303 | 0.281 | 1.17 | 0.96 | 1.43 | 0.1177
NU EA | 1229 | 0.353 | 0.274 | 1.51 | 1.23 | 1.86 | 9.27×10⁻⁵
VU EA | 396 | 0.334 | 0.263 | 1.42 | 0.99 | 2.04 | 0.0601
Cross cohort EA (Meta) | | 0.328 | 0.265 | 1.41 | | | 2.98×10⁻¹⁰
NU AA | 294 | 0.353 | 0.241 | 0.94 | 0.24 | 2.39 | 0.08672
VU AA | 1021 | 0.353 | 0.260 | 1.35 | 0.11 | 2.07 | 2.25×10⁻⁶
Cross cohort AA (Meta) | | 0.353 | 0.258 | 1.64 | | | 5.30×10⁻⁷
Cross-cohort all (Meta) | | 0.334 | 0.263 | 1.46 | | | 2.05×10⁻¹⁵

AA, African ancestry; EA, European ancestry; GHC, Group Health Cooperative; L95, 95% CI lower bound; NU, Northwestern University; U95, 95% CI upper bound; VU, Vanderbilt University.
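To illustrate how per-cohort estimates combine into a cross-cohort estimate, the sketch below pools the five EA odds ratios from table 4 with a fixed-effect model. It is not the PLINK procedure used in the study: it assumes inverse-variance weighting rather than the sample-size weighting described in the methods, and it approximates each standard error from the rounded 95% CI bounds. All names are hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

# (OR, L95, U95) for the five EA cohorts in table 4.
ea_cohorts = {
    "GHC EA": (1.59, 1.25, 2.02),
    "Marshfield EA": (1.49, 1.15, 1.93),
    "Mayo EA": (1.17, 0.96, 1.43),
    "NU EA": (1.51, 1.23, 1.86),
    "VU EA": (1.42, 0.99, 2.04),
}


def fixed_effect_meta(estimates):
    """Pool odds ratios on the log scale with inverse-variance weights.

    Returns (pooled OR, lower 95% bound, upper 95% bound, two-sided p value).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for odds_ratio, lower, upper in estimates:
        # Approximate SE of log(OR) from the width of the reported CI.
        se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
        weight = 1.0 / se ** 2
        weighted_sum += weight * math.log(odds_ratio)
        weight_total += weight
    pooled_log_or = weighted_sum / weight_total
    pooled_se = math.sqrt(1.0 / weight_total)
    z = pooled_log_or / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return (
        math.exp(pooled_log_or),
        math.exp(pooled_log_or - Z_95 * pooled_se),
        math.exp(pooled_log_or + Z_95 * pooled_se),
        p_value,
    )


pooled_or, pooled_l95, pooled_u95, pooled_p = fixed_effect_meta(ea_cohorts.values())
print(f"Pooled EA OR {pooled_or:.2f} (95% CI {pooled_l95:.2f} to {pooled_u95:.2f}), p={pooled_p:.2e}")
```

Under these assumptions the pooled estimate lands close to the cross-cohort EA OR of 1.41 in table 4, which shows how cohorts that individually miss nominal significance still contribute to a highly significant combined result.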

Discussion

In this study, we developed and validated an algorithm to identify cases with T2D and controls using standardized data elements captured through routine clinical care across five different EMR systems. Despite variations in data capture and completeness across the different systems, by applying stringent minimum criteria, and clear definition of data elements through an iterative process, we developed a final algorithm with a 98% PPV for cases and a 100% PPV for controls. We subsequently used identified samples pooled across sites to perform a GWAS.

The association tests between rs7903146 and T2D in the five EMR-derived cohorts yielded similar results to those from purposefully collected T2D case and control cohorts. In a recent meta-analysis of 29 195 T2D control subjects and 17 202 T2D case subjects from 27 populations spanning the globe, Cauchi et al27 found that the OR for developing T2D was 1.46 per copy of the rs7903146 T allele. We generated the same OR point estimate (1.46) for our EMR-derived samples using meta-analytical techniques in the pooled samples. Perhaps more importantly, our work demonstrates the power that can be achieved by combining samples across sites, evidenced by the highly significant p values from the cross-cohort analyses.

Our work expands on earlier studies to identify patients with diabetes from EMR. Previously, Wilke et al17 at Marshfield Clinic developed an effective algorithm to identify diabetes mellitus patients, but did not specifically differentiate between T1D and T2D. Other studies have utilized laboratory values, or diagnoses, laboratory tests and natural language processing to achieve high specificity for the identification of T1D and T2D.16 28 A related study used diagnoses and medications to identify patients with conditions that are risk factors for T2D, which were in turn used to identify patients with undiagnosed diabetes.29

We identified and addressed a number of specific challenges when developing the algorithms. We created specific definitions for cases and controls to avoid confounding by the inclusion of cases with T1D and, as much as possible, controls at risk of T2D, which has not, as yet, manifested itself. In EMR, fasting status at the time of blood draw for a patient was frequently not available. We therefore assumed that all glucose laboratory test results were not taken during the fasting state, so we used a lower glucose cut-off for controls, which resulted in lower sensitivity but higher specificity. In developing the final algorithm, a potential source of bias was recognized in that initially T2D subjects who were treated with insulin alone were excluded, although subjects with diabetes on insulin together with one of the diabetes medications listed above were eligible for inclusion. This approach would select against T2D subjects with significant pancreatic β-cell failure. Another problem presented by patients on insulin alone and an ICD-9-CM code for T2D is that some of these patients could represent patients with T1D, which was misclassified as T2D because of the age of onset or other issues. To address this, we identified patients on insulin alone as cases if they had been on a T2D medication in the past, or if they had at least two visits (on different dates) with a clinician who entered T2D diagnoses (ie, in the problem list or the encounter diagnosis).

Identifying controls presented a challenge to ensure that the control group was not ‘contaminated’ with cases, which would negatively impact power in genetic studies. We operated on the principle that absence of a diagnosis, prescribed medications, laboratory results, or other data in the EMR did not necessarily correlate with true patient status, but may reflect the selective capture of data within the EMR. Particularly at tertiary care centers, some patients receive only a portion of their care at the center. To address this challenge, we required that controls have a minimal amount of data represented in the EMR. In particular, we required controls to have had glucose testing with normal results at least once and to have at least two in-person clinician encounters. Moreover, to eliminate younger patients at increased risk of T2D but in whom the disease was not manifest, potential controls with a family history of diabetes were excluded. Another potential confounder was patients with diet controlled diabetes, although our algorithm was developed with the assumption that these patients would either have an ICD-9-CM code for T2D or an abnormal laboratory test result, which would exclude them from the control group.

Lack of standardization across EMR posed a challenge for the cross-site implementation and even within a given site where different EMR were in use. As a consortium, we identified the consolidated health informatics standards as the common lingua franca to achieve comparability of data across sites. For medications, we mapped medications to RxNorm codes at the generic name level as the common link between sites.30 For purposes of easier cross-institution sharing, we identified ingredient level RxNorm codes (included in the supplementary appendix, available online only) to reduce the total number of codes. We used LOINC codes specifically to define tests for glucose and HbA1c levels and ICD-9-CM codes for diagnoses. Despite these efforts, the portability of algorithms across diverse sites poses a significant challenge and our future work is focused on developing methods to scale phenotyping more broadly. For example, we noted significant differences across sites in algorithm computing time, ranging from less than 10 s at a site using an optimized commercial data warehouse to 40 h at a site sequentially extracting categories of data using statistical software on their data warehouse. To this end, we include a link to our data dictionary, sample SQL code, and a data workflow built on an open source data mining tool for other investigators to explore: https://www.mc.vanderbilt.edu/victr/dcc/projects/acc/index.php/Library_of_Phenotype_Algorithms#Type_II_Diabetes.

Our study had a number of limitations. Study sites represent institutions with a significant research focus, and this may affect how data are routinely captured within the EMR. Study sites varied in the number of years of data available in the EMR and the degree of care fragmentation. Preliminary evidence suggests that the absence of longitudinal data and fragmentation of care across sites may decrease the specificity of our algorithm. Additional studies are under way to quantify these effects in greater detail. Rates of T2D varied across sites from 1.0% of the total available biorepository to 14.8% at Mayo, compared with an approximate rate of 8% for diabetes (all types) for the general population.31 Rate differences are likely to be due to bias in sample selection for genotyping, for which only NU selected all possible T2D cases and controls for genotyping. Other sites performed the T2D case and control algorithms on their already genotyped cohorts, which were selected for genotyping based on their suitability for other phenotypes (eg, QRS duration at VU, cataracts at Marshfield, vascular disease at Mayo Clinic and dementia at Group Health Cooperative/UW). Other sources of bias include variation in biorepository recruitment (eg, Mayo Clinic's biorepository focused on patients with vascular disease, strongly associated with T2D) and variation in local coding practices.32

While the Mayo EA, VU EA, and NU AA results do not reach nominal significance, they do approach it (p=0.12, 0.06, and 0.09, respectively) and all trend in the same direction as the remaining subcohorts. The most likely explanation for the VU EA and NU AA lack of significance is reduced power from relatively small sample size for a GWAS. The Mayo EA lack of significance may be due to selection bias, as these samples were not selected for genotyping based on the T2D case and control algorithm, but rather on an algorithm designed to identify cardiovascular disease phenotypes. As noted, 14.8% of this biased Mayo cohort were identified as T2D cases, significantly higher than the national population prevalence of this disease. We suspect increased co-occurrence of cardiovascular and metabolic diseases may contribute to the reduction in significance through an increased prevalence of undiagnosed T2D among the controls. Importantly, despite the failure to achieve significance for replication of TCF7L2 at individual sites, pooling samples across sites achieved highly significant results, supporting our collective approach.

In conclusion, we describe a practical approach to the identification of T2D cases and controls for GWAS using data captured in routine clinical care across five distinct EMR. To achieve the high specificity required for GWAS, we refined an algorithm over multiple iterations, and applied stringent criteria and nationally recognized coding standards to facilitate portability across different EMR. Although the overall number of cases and controls decreased with the increased specificity needed for GWAS, by generalizing the algorithm across diverse EMR we identified the large number of cases and controls needed for a well-powered GWAS, and generated the exact OR point estimate we expected from the literature. Applying this approach across a large number of institutions provides an alternative approach for generating a large cohort of T2D cases and controls to understand better the associations between genetics and expressions of disease.


Footnotes

Funding: The eMERGE Network was initiated and funded by NHGRI, with additional funding from NIGMS through the following grants: U01-HG-004610 (Group Health Cooperative); U01-HG-004608 (Marshfield Clinic); U01-HG-04599 (Mayo Clinic); U01HG004609 (Northwestern University); U01-HG-04603 (Vanderbilt University, also serving as the Coordinating Center), and the State of Washington Life Sciences Discovery Fund award to the Northwest Institute of Medical Genetics. The Northwestern University Enterprise Data Warehouse was funded in part by a grant from the National Center for Research Resources, UL1RR025741. The genetic data are deposited in dbGaP (accession numbers phs000170, phs000188, phs000203, phs000234, phs000237).

Competing interests: None.

Patient consent: Obtained.

Ethics approval: Ethics approval was provided by institutional review boards from all participating sites (Northwestern University, Vanderbilt University, Group Health Cooperative, Marshfield Clinic, Mayo Clinic).

Contributors: ANK developed the algorithm, collected and analyzed data, performed chart reviews, and wrote the manuscript. MGH conducted genetic analyses and wrote the manuscript. LRT conducted genetic analyses, and reviewed and edited the manuscript. JAP collected and analyzed data, and wrote the manuscript. WKT created the standardized data and workflow within KNIME. LLA conducted genetic analyses. JCD collected data, performed chart reviews, and reviewed and edited the manuscript. PLP collected and analyzed data, and reviewed and edited the manuscript. AWM collected and analyzed data. WQW collected and analyzed data, and reviewed and edited the manuscript. SJB collected and analyzed data. CGC reviewed and edited the manuscript. CLL collected and analyzed data. GPJ collected and analyzed data. DRC performed genetic analyses. CSC collected and analyzed data. KMN reviewed and edited the manuscript. WAW reviewed and edited the manuscript. RLC reviewed and edited the manuscript. WLL developed the algorithm and wrote the manuscript.

Provenance and peer review: Not commissioned; externally peer reviewed.



Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
