Genetically-guided algorithm development and sample size optimization for age-related macular degeneration cases and controls in electronic health records from the VA Million Veteran Program

Christopher W Halladay; Tamer Hadi; Matthew D Anger; Paul B Greenberg; Jack M Sullivan; P Eric Konicki; Neal S Peachey; Robert P Igo, Jr; Sudha K Iyengar; Wen-Chih Wu; Dana C Crawford; for the VA Million Veteran Program

. 2019 May 6;2019:153–162.

PMCID: PMC6568141 PMID: 31258967

Abstract

Electronic health records (EHRs) linked to extensive biorepositories and supplemented with lifestyle, behavioral, and environmental exposure data have enormous potential to contribute to genomic discovery, a necessary step in the pathway towards translational or precision medicine. A major bottleneck in incorporating EHRs into genomic studies is the extraction of research-grade variables for analysis, particularly when gold-standard measurements are not available or accessible. Here we develop algorithms for age-related macular degeneration (AMD), a common cause of blindness among the elderly, and controls free of AMD. These computable phenotypes were developed using billing codes (ICD-9-CM and ICD-10-CM) and Current Procedural Terminology (CPT) codes and evaluated in two study sites of the Veterans Affairs Million Veteran Program: Louis Stokes Cleveland VA Medical Center and the Providence VA Medical Center. After establishing high overall positive and negative predictive values (93% and 95%, respectively) through manual chart review, the candidate algorithm was deployed in the full VA MVP dataset of >500,000 participants. The algorithm was then optimized in a data cube using a variety of approaches, including adjusting inclusion age thresholds by examining previously-reported genetic associations for CFH (rs10801555, a proxy for rs1061170) and ARMS2 (rs10490924). The algorithm with the smallest p-values for the known genetic associations was selected for downstream and on-going AMD genomic discovery efforts. This two-phase approach to developing research-grade case/control variables for AMD genomic studies capitalizes on established genetic associations, resulting in high precision and optimized sample sizes, an approach that can be applied to other large-scale biobanks linked to EHRs for precision medicine research.

Keywords: age-related macular degeneration, electronic health records, Million Veteran Program, genetic association study

Introduction

The Department of Veterans Affairs Veterans Health Administration (VHA) established the Million Veteran Program (MVP) to facilitate large-scale analysis of combined genetics and electronic health records (EHRs)¹. MVP has enrolled >500,000 Veterans from over 50 sites in the continental United States and Puerto Rico. MVP currently provides a text version of the VA EHR, genome-wide data from a custom Affymetrix Axiom Biobank Array, and two survey instruments. Future genomic data will include imputed genome-wide data as well as whole-exome and whole-genome sequencing data. MVP thus provides a rich dataset for analysis of genomic contributions to disease, response to treatment, and related questions.

Age-related macular degeneration (AMD) is a leading cause of irreversible blindness in the developed world². AMD has a substantial genetic component, with 52 independently-associated variants across 34 genomic or locus regions identified in the largest genome-wide association study (GWAS) to date for AMD³. These variants collectively explain 27.2% of AMD disease risk and more than half of the risk attributed to genetics³. While much of the genetic architecture of AMD has been revealed since its first GWAS in 2005⁴, gaps remain including the identification of additional variants (including rare variants) and the inclusion of diverse populations to identify population-specific associations.

With its size and diversity, MVP offers an important opportunity to expand our understanding of AMD genetics within and between populations. A major challenge to this goal is the accurate identification of cases (patients with AMD) and controls (patients without AMD) using a data repository of EHRs without access to the gold standard clinical images used for AMD diagnosis (fundus photography, optical coherence tomography imaging). With access to extensive structured clinical data in the form of International Classification of Diseases (ICD) codes and Current Procedural Terminology (CPT) codes, we examine here the extent to which billing codes and other criteria relevant to AMD provide an accurate identification of AMD cases and controls within the VA Computerized Patient Record System (CPRS)⁵ for downstream genomic discovery studies.

Methods

Study population

The MVP is a national research program launched in 2011 by the Department of Veterans Affairs Office of Research & Development. The rationale, study design, and data collection for the MVP have been previously described¹. In brief, upon informed consent, veterans provide biospecimens, access to their EHR, and contribute lifestyle, behavioral, and other health-related data via a baseline survey and an optional lifestyle survey (https://www.research.va.gov/MVP/). The present study was approved by Institutional Review Boards of the Louis Stokes Cleveland VA Medical Center (LSCVAMC) and Providence VA Medical Center (PVAMC).

MVP intends to recruit at least one million participants. As of 2016, 504,027 MVP participants were available for study with accessible EHRs and of these, 352,953 had genome-wide genotype data available. Genome-wide data were generated using the Affymetrix Axiom Biobank Array (~723,000 markers) by two contracted vendors¹. Basic quality control was performed by the two vendors followed by additional quality control as part of the MVP Genomic Working Group. Standard quality control metrics^6,7 included testing for batch effects; calculating missingness by batch, sample call rates, duplicate sample concordance, minor allele frequencies; and performing sex checks and sample contamination checks. Imputed data were not available at the time of algorithm development but are now available for genetic associations within MVP⁸.

Statistical analyses

Among non-Hispanic European Americans in the MVP, we performed tests of association between AMD case status and two common variants consistently and strongly associated with AMD in European-descent populations: CFH rs10801555 (a proxy for rs1061170⁹ with r²=1 in the 1000 Genomes Project Phase 3 European subset¹⁰) and ARMS2 rs10490924. Both variants have been associated with AMD with large genetic effect sizes (odds ratios) ranging from 1.5-3.0 in most populations of European-descent¹¹. Tests of association were performed in PLINK v1.90b4.4^12,13 using logistic regression assuming an additive genetic model and adjusting for sex and 10 principal components to account for population structure^14,15. Prior to performing the tests of association, we characterized genetic ancestry from global admixture proportions determined by ADMIXTURE¹⁶; only samples with < 10% non-European ancestry were included in these analyses.
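
The association model described above can be written out explicitly. The notation here is an assumption introduced for illustration and does not appear in the original text: G_i is the count (0, 1, or 2) of the tested allele carried by participant i under the additive model, PC_ik is that participant's k-th principal component, and the reported odds ratio corresponds to the exponentiated genotype coefficient.

```latex
\operatorname{logit} P(\mathrm{AMD}_i = 1)
  = \beta_0 + \beta_G\, G_i + \beta_{\mathrm{sex}}\, \mathrm{sex}_i
    + \sum_{k=1}^{10} \beta_k\, \mathrm{PC}_{ik},
\qquad
\mathrm{OR} = e^{\beta_G}
```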

Results

To classify participants as definite/probable cases, controls, or unknown for AMD, we developed algorithms utilizing structured data [International Classification of Diseases, 9th and 10th Revisions, Clinical Modification (ICD-9-CM and ICD-10-CM) and CPT codes] in the EHRs. Imaging data are not yet available in MVP; consequently, these algorithms do not include fundus photography, optical coherence tomography, or other gold-standard tools that define AMD.

Phase 1: Local algorithm development

We developed an initial algorithm and test sets of cases and controls at the LSCVAMC via detailed chart and imaging reviews by retinal specialists. Based on the chart reviews from identified cases, the initial algorithm was refined and subsequent versions were developed and tested at both VA medical centers (LSCVAMC and PVAMC).

For the initial algorithm, we identified cases and controls (≥65 years of age) among those with comprehensive eye exams (CPT codes 92004 or 92014) within the last two years. Age was defined by date of birth and calculated in years at the time the data were accessed. Cases were identified based on the presence of one mention of ICD-9-CM 362.51 (nonexudative senile macular degeneration) or 362.52 (exudative senile macular degeneration) or ICD-10-CM H35.31 (nonexudative AMD) or H35.32 (exudative AMD) and the absence of ICD-9-CM 362.55 (toxic maculopathy) or ICD-10-CM H35.389 (toxic maculopathy, unspecified eye). Controls were defined as absence of AMD-related ICD-9-CM codes (362.51 and 362.52) and ICD-10-CM codes (H35.31 and H35.32). The initial algorithm was implemented at LSCVAMC, and a chart review of 50 identified cases revealed nine false positives, resulting in a low preliminary positive predictive value (PPV; 82%). Likewise, a review of 50 identified controls revealed seven false negatives (one wet AMD and six dry AMD cases), resulting in a low preliminary negative predictive value (NPV; 86%).

In the first revised algorithm (Algorithm 1, Table 2), we required two mentions of AMD-related ICD-9-CM or ICD-10-CM codes. We then implemented the revised case and control algorithms at the two VAMCs (Cleveland and Providence) and reviewed a fraction of identified cases and controls from each site (Table 3). The PPV ranged from 92% to 94%, with an overall PPV of 93% (standard error = 0.017). The NPV ranged from 89% to 99%, with an overall NPV of 95% (standard error = 0.014). Total sensitivity and specificity were both high: 95% and 93%, respectively.

Table 2.

Revised algorithm (Algorithm 1) to identify AMD cases and controls in electronic health records.

AMD case definition	AMD control definition
≥65 years of age	≥65 years of age
AND	AND
At least one mention within the last two years of • CPT code 92004 or • CPT code 92014	At least one mention within the last two years of • CPT code 92004 or • CPT code 92014
AND	AND
At least two mentions (on separate clinic visits) or only at the most recent visit to the Eye Clinic of • ICD-9-CM codes 362.51 or 362.52 or • ICD-10-CM codes H35.31 or H35.32	Absence of • ICD-9-CM codes 362.51 and 362.52 or • ICD-10-CM codes H35.31 and H35.32
AND
Absence of • ICD-9-CM code 362.55 or • ICD-10-CM code H35.389
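
The rules in Table 2 can be sketched as a small classifier. This is a minimal illustration, not the authors' implementation: the `Patient` record, the `classify` function, the 730-day window, and the "unknown" label are all hypothetical names and choices. Two simplifying assumptions are made. First, the "only at the most recent visit to the Eye Clinic" clause is not modeled, so a case needs AMD codes on at least two distinct dates. Second, codes are matched exactly as written in the table.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Code lists copied from Table 2 (exact-match only; an assumption of this sketch).
AMD_CODES = {"362.51", "362.52", "H35.31", "H35.32"}
CASE_EXCLUSION_CODES = {"362.55", "H35.389"}
EYE_EXAM_CPT = {"92004", "92014"}


@dataclass
class Patient:
    """Hypothetical minimal record: birth date plus (code, date) events."""
    birth_date: date
    cpt_events: list = field(default_factory=list)  # [(cpt_code, date), ...]
    icd_events: list = field(default_factory=list)  # [(icd_code, date), ...]


def age_in_years(birth: date, as_of: date) -> int:
    """Completed years between birth date and the data access date."""
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    return as_of.year - birth.year - (0 if had_birthday else 1)


def classify(p: Patient, as_of: date, min_age: int = 65) -> str:
    """Return 'case', 'control', or 'unknown' following Table 2."""
    if age_in_years(p.birth_date, as_of) < min_age:
        return "unknown"
    # Comprehensive eye exam within the last two years (730 days assumed).
    window_start = as_of - timedelta(days=730)
    if not any(c in EYE_EXAM_CPT and window_start <= d <= as_of
               for c, d in p.cpt_events):
        return "unknown"
    amd_dates = {d for c, d in p.icd_events if c in AMD_CODES}
    if not amd_dates:
        return "control"
    has_exclusion = any(c in CASE_EXCLUSION_CODES for c, _ in p.icd_events)
    if len(amd_dates) >= 2 and not has_exclusion:
        return "case"
    # Single mentions or excluded patients are neither cases nor controls.
    return "unknown"
```

Returning "unknown" for a single AMD mention keeps such patients out of both groups, which mirrors how the stricter two-mention rule trades sample size for precision.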

Table 3.

Algorithm 1 performance, by VHA study site. Abbreviations: Louis Stokes Cleveland VA Medical Center (LSCVAMC), negative predictive value (NPV), positive predictive value (PPV), Providence VA Medical Center (PVAMC).

	Number of AMD cases	Number of controls	Number of false positives	Number of false negatives	PPV	NPV
LSCVAMC	138	126	11	1	92	99
PVAMC	100	100	6	11	94	89
Total	238	226	17	12	93	95
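
The percentages in Table 3 follow directly from the chart-review counts. The sketch below recomputes them; the function name `predictive_values` and the `SITES` dictionary are illustrative, and the counts are copied from the table. True positives are taken as reviewed cases minus false positives, and true negatives as reviewed controls minus false negatives.

```python
def predictive_values(n_cases, n_controls, false_pos, false_neg):
    """Return PPV, NPV, sensitivity, and specificity as fractions."""
    true_pos = n_cases - false_pos      # algorithm cases confirmed by review
    true_neg = n_controls - false_neg   # algorithm controls confirmed by review
    return {
        "ppv": true_pos / n_cases,
        "npv": true_neg / n_controls,
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
    }


# (cases reviewed, controls reviewed, false positives, false negatives)
SITES = {
    "LSCVAMC": (138, 126, 11, 1),
    "PVAMC": (100, 100, 6, 11),
    "Total": (238, 226, 17, 12),
}

for site, counts in SITES.items():
    metrics = predictive_values(*counts)
    summary = ", ".join(f"{name}={100 * value:.0f}%" for name, value in metrics.items())
    print(f"{site}: {summary}")
```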

To capture additional cases of AMD and controls free of AMD, we varied 1) the billing codes for exclusion (Algorithm 2), 2) the billing codes for inclusion (Algorithm 3), and/or 3) the requirement of a recent eye exam (for control definition only; Algorithm 4). For Algorithm 2, we expanded the list of case exclusion codes representing other eye diseases (primary open-angle glaucoma and diabetic retinopathy; Table 4) and applied these exclusions to Algorithm 1. For Algorithm 3, we added ICD-9-CM codes 362.50, 362.51, and 362.52 and ICD-10-CM codes H35.30, H35.31%, and H35.32% to the case inclusion list. We also expanded the CPT codes list representing ophthalmological services received (92002, 92012, 92004, and 92014). Finally, Algorithm 4 used the case definition from Algorithm 3 and relaxed the control definition to include patients without evidence of an eye exam.
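
The Algorithm 3 code list mixes exact codes with patterns such as H35.31%. The sketch below assumes the trailing `%` is a prefix wildcard in the style of SQL `LIKE`, and treats the trailing `*` used in Table 4 the same way; the original text does not state this, and the helper names `matches` and `is_case_code` are hypothetical.

```python
# Case inclusion codes for Algorithm 3 as listed in the text.
ALGORITHM_3_CASE_PATTERNS = [
    "362.50", "362.51", "362.52",
    "H35.30", "H35.31%", "H35.32%",
]


def matches(code: str, pattern: str) -> bool:
    """Exact match, or prefix match when the pattern ends in % or *.

    Treating % and * as prefix wildcards is an assumption of this sketch.
    """
    pattern = pattern.strip()
    if pattern.endswith(("%", "*")):
        return code.startswith(pattern[:-1].strip())
    return code == pattern


def is_case_code(code: str, patterns=ALGORITHM_3_CASE_PATTERNS) -> bool:
    """True if a billing code satisfies any case-inclusion pattern."""
    return any(matches(code, p) for p in patterns)
```

Under this reading, a longer laterality- or stage-specific code beginning with H35.31 or H35.32 would qualify, whereas H35.389 would not.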

Table 4.

ICD-9-CM and ICD-10-CM codes added as exclusions for AMD case status. The list of billing code exclusions for primary open-angle glaucoma and diabetic retinopathy was applied to Algorithm 1 to create Algorithm 2. The list of exclusions was developed to reduce the number of false positive cases, particularly in African American patients where both primary open-angle glaucoma and diabetic retinopathy are more prevalent compared with other racial/ethnic groups^{17, 18}.

ICD-9-CM code	Description
250.5*	Diabetes with ophthalmic manifestations
361	Retinal detachments and defects
361.0*	Retinal detachment with retinal defect
361.1*	Retinoschisis and retinal cysts
361.2	Serous retinal detachment
361.3*	Retinal defects without detachment
361.8*	Other forms of retinal detachment
361.9	Unspecified retinal detachment
362	Other retinal disorders
362.0	Diabetic retinopathy
362.1*	Other background retinopathy and retinal vascular changes
362.2	Other proliferative retinopathy
362.20*	Retinopathy of prematurity, unspecified
362.3*	Retinal vascular occlusion
362.7*	Hereditary retinal dystrophies
362.8*	Other retinal disorders
362.9	Unspecified retinal disorder (convert 362.9 to ICD-10-CM)

ICD-10-CM code	Description
H33	Retinal detachments and breaks
H33.0	Retinal detachment with retinal break
H33.001, .002, .003, .009	Unspecified retinal detachment with retinal break
H33.01, .011, .012, .013, .019	Retinal detachment with single break
H33.02, .021, .022, .023, .029	Retinal detachment with multiple breaks
H33.03, .031, .032, .033, .039	Retinal detachment with giant retinal tear
H33.04, .041, .042, .043, .049	Retinal detachment with retinal dialysis
H33.05, .051, .052, .053, .059	Total retinal detachment
H33.1	Retinoschisis and retinal cysts
H33.10, .101, .102, .103, .109	Unspecified retinoschisis
H33.11, .111, .112, .113, .119	Cyst of ora serrata
H33.12, .121, .122, .123, .129	Parasitic cyst of retina
H33.19, .191, .192, .193, .199	Other retinoschisis and retinal cysts
H33.2, .20, .21, .22, .23	Serous retinal detachment
H33.3	Retinal breaks without detachment
H33.30, .301, .302, .303, .309	Unspecified retinal break
H33.31, .311, .312, .313, .319	Horseshoe tear of retina without detachment
H33.32, .321, .322, .323, .329	Round hole of retina without detachment
H33.33, .331, .332, .333, .339	Multiple defects of retina without detachment
H33.4, .40, .41, .42, .43	Traction detachment of retina
H33.8	Other retinal detachments
E10.3/E11.3	Type 1/Type 2 diabetes mellitus with ophthalmic complications
E10.31/E11.31, .311, .319	Type 1/Type 2 diabetes mellitus with unspecified diabetic retinopathy
E10.32/E11.32	Type 1/Type 2 diabetes mellitus with mild nonproliferative diabetic retinopathy
E10.321/E11.321, .3211, .3212, .3213, .3219	Type 1/Type 2 diabetes mellitus with mild nonproliferative diabetic retinopathy with macular edema
E10.329/E11.329, .3291, .3292, .3293, .3299	Type 1/Type 2 diabetes mellitus with mild nonproliferative diabetic retinopathy without macular edema
E10.33/E11.33	Type 1/Type 2 diabetes mellitus with moderate nonproliferative diabetic retinopathy
E10.331/E11.331, .3311, .3312, .3313, .3319	Type 1/Type 2 diabetes mellitus with moderate nonproliferative diabetic retinopathy with macular edema
E10.339/E11.339, .3391, .3392, .3393, .3399	Type 1/Type 2 diabetes mellitus with moderate nonproliferative diabetic retinopathy without macular edema
E10.34/E11.34	Type 1/Type 2 diabetes mellitus with severe nonproliferative diabetic retinopathy
E10.341/E11.341, .3411, .3412, .3413, .3419	Type 1/Type 2 diabetes mellitus with severe nonproliferative diabetic retinopathy with macular edema
E10.349/E11.349, .3491, .3492, .3493, .3499	Type 1/Type 2 diabetes mellitus with severe nonproliferative diabetic retinopathy without macular edema
E10.35/E11.35	Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy
E10.351/E11.351, .3511, .3512, .3513, .3519	Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy with macular edema
E10.352/E11.352, .3521, .3522, .3523, .3529	Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy with traction retinal detachment involving the macula
E10.353/E11.353, .3531, .3532, .3533, .3539	Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy with traction retinal detachment not involving the macula
E10.354/E11.354, .3541, .3542, .3543, .3549	Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy with combined traction retinal detachment and rhegmatogenous retinal detachment
E10.355/E11.355, .3551, .3552, .3553, .3559	Type 1/Type 2 diabetes mellitus with stable proliferative diabetic retinopathy
E10.359/E11.359, .3591, .3592, .3593, .3599	Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy without macular edema
E10.36/E11.36	Type 1/Type 2 diabetes mellitus with diabetic cataract
E10.37/E11.37, .37X1, .37X2, .37X3, .37X9	Type 1/Type 2 diabetes mellitus with diabetic macular edema, resolved following treatment
E10.39/E11.39	Type 1/Type 2 diabetes mellitus with other diabetic ophthalmic complication
H40.11	Primary open-angle glaucoma
H40.1110-.1114	Primary open-angle glaucoma, right eye
H40.1120-.1124	Primary open-angle glaucoma, left eye
H40.1130-.1134	Primary open-angle glaucoma, bilateral
H40.1190-.1194	Primary open-angle glaucoma, unspecified eye
H34	Retinal vascular occlusions
H34.00-.03	Transient retinal artery occlusion
H34.10-.13	Central retinal artery occlusion
H34.2	Other retinal artery occlusions
H34.211, .212, .213, .219	Partial retinal artery occlusion
H34.231, .232, .233, .239	Retinal artery branch occlusion
H34.8	Other retinal vascular occlusions
H34.81	Central retinal vein occlusion
H34.8110-.8112	Central retinal vein occlusion, right eye
H34.8120-.8122	Central retinal vein occlusion, left eye
H34.8130-.8132	Central retinal vein occlusion, bilateral
H34.8190-.8192	Central retinal vein occlusion, unspecified eye
H34.821, .822, .823, .829	Venous engorgement
H34.83	Tributary (branch) retinal vein occlusion
H34.8310-.8312	Tributary (branch) retinal vein occlusion, right eye
H34.8320-.8322	Tributary (branch) retinal vein occlusion, left eye
H34.8330-.8332	Tributary (branch) retinal vein occlusion, bilateral
H34.8390-.8392	Tributary (branch) retinal vein occlusion, unspecified eye
H34.9	Unspecified retinal vascular occlusion

Phase II: Sample size optimization for genetic association studies

We further evaluated algorithms using various case and control age thresholds and well-established genetic associations to maximize sample size and statistical power. To do this, we created a data cube of cases and controls defined by a total of eight algorithms based on the Algorithms 1-4 described above and two age thresholds (Table 5). These algorithms were then applied to the MVP EHRs hosted by Veterans Informatics and Computing Infrastructure (VINCI), a partner of the VHA Corporate Data Warehouse⁵.

Table 5.

Algorithms included in the data cube for AMD case and control sample size optimization. The data cube contains Algorithms 1-4, with each represented by two different age thresholds as specified. Potential cases only have one mention of an AMD-qualifying ICD-9-CM or ICD-10-CM code compared with the required two mentions of AMD-qualifying codes.

Algorithm	Case age threshold	Control age threshold	Case count (Potential cases)	Control count	Excluded
1	65	65	14,453 (6,717)	131,075	351,782
1	50	60	19,351 (9,503)	237,546	237,627
2	65	65	14,375 (6,674)	131,075	351,903
2	50	60	19,216 (9,429)	237,546	237,836
3	50	65	28,609 (15,143)	173,500	286,775
3	50	60	28,609 (15,143)	225,172	235,103
4	50	65	28,609 (15,143)	267,581	192,694
4	50	60	28,609 (15,143)	322,539	137,736
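
Each row of Table 5 partitions the same 504,027 participants into cases, potential cases, controls, and excluded participants. The sketch below checks that bookkeeping using the counts copied from the table; the tuple layout and the name `row_total` are illustrative.

```python
MVP_TOTAL = 504_027  # participants with accessible EHRs (Study population)

# (algorithm, case age, control age, cases, potential cases, controls, excluded)
DATA_CUBE = [
    (1, 65, 65, 14_453, 6_717, 131_075, 351_782),
    (1, 50, 60, 19_351, 9_503, 237_546, 237_627),
    (2, 65, 65, 14_375, 6_674, 131_075, 351_903),
    (2, 50, 60, 19_216, 9_429, 237_546, 237_836),
    (3, 50, 65, 28_609, 15_143, 173_500, 286_775),
    (3, 50, 60, 28_609, 15_143, 225_172, 235_103),
    (4, 50, 65, 28_609, 15_143, 267_581, 192_694),
    (4, 50, 60, 28_609, 15_143, 322_539, 137_736),
]


def row_total(row):
    """Sum of cases, potential cases, controls, and excluded participants."""
    _, _, _, cases, potential, controls, excluded = row
    return cases + potential + controls + excluded


for row in DATA_CUBE:
    algorithm, case_age, control_age, cases, _, controls, _ = row
    assert row_total(row) == MVP_TOTAL
    print(f"Algorithm {algorithm} ({case_age}/{control_age}): "
          f"{controls / cases:.1f} controls per two-mention case")
```

The controls-per-case ratio printed at the end is only a convenience summary and is not reported in the original table.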

We then used the known genetic associations between AMD and CFH (rs10801555, a proxy for rs1061170)⁹ and ARMS2 (rs10490924) to examine the impact that age threshold, case/control-defining code lists, and number of code mentions have on sample size and on the resulting genetic effect size and statistical significance. To do this, we accessed the Axiom genotype data for all MVP samples through the Genomic Information System for Integrative Science (GenISIS) and performed standard tests of association assuming an additive genetic model.

In general, the requirement of two case-defining codes yielded stronger genetic effect sizes for both loci tested (Table 6a) despite the smaller sample sizes compared with relaxing this requirement to only one mention (Table 6b). The stricter case definition also resulted in smaller p-values compared with the more permissive case definition. Across the four algorithms, Algorithm 4 consistently yielded genetic associations with the smallest p-values regardless of case/control age threshold (Table 6). Within Algorithm 4, the lower case (50 years) and higher control (65 years) age thresholds yielded the smallest p-values, and this combination was therefore selected as the optimal algorithm for downstream MVP AMD genetic association studies.

Table 6.

Data cube algorithm assessment using well-established AMD genetic associations. We tested for an association between AMD case status and rs10801555 and rs10490924 using logistic regression adjusted for sex and 10 principal components and assuming an additive genetic model. For each test performed, odds ratios and p-values are shown by algorithm when two mentions of case-defining billing codes are required (a) and when only one mention of case-defining billing codes is required (b). P-values for rs10801555 were estimated using the approximation for the extreme tail of the normal distribution from Karagiannidis and Lioumpas¹⁹. The final algorithm and case/control age thresholds selected for MVP AMD genetic association studies going forward are Algorithm 4 with thresholds of 50/65.

a) Requiring two mentions of case-defining codes
Algorithm	Case/control age threshold	CFH rs10801555 odds ratio	CFH rs10801555 p-value	ARMS2 rs10490924 odds ratio	ARMS2 rs10490924 p-value
1	65/65	1.775	1.81×10^-290	1.680	5.57×10^-201
2	65/65	1.775	6.90×10^-289	1.681	8.91×10^-201
1	50/60	1.746	5.85×10^-374	1.659	5.47×10^-259
2	50/60	1.747	2.42×10^-372	1.663	1.49×10^-259
3	50/65	1.689	2.73×10^-428	1.613	5.91×10^-293
3	50/60	1.667	6.40×10^-423	1.607	2.59×10^-298
4	50/65	1.661	2.25×10^-431	1.600	2.26×10^-303
4	50/60	1.647	4.56×10^-424	1.588	2.36×10^-299

b) Requiring only one mention of case-defining codes
Algorithm	Case/control age threshold	CFH rs10801555 odds ratio	CFH rs10801555 p-value	ARMS2 rs10490924 odds ratio	ARMS2 rs10490924 p-value
1	65/65	1.629	5.37×10^-287	1.548	1.92×10^-188
2	65/65	1.629	4.71×10^-286	1.548	1.18×10^-187
1	50/60	1.593	7.54×10^-365	1.534	1.09×10^-249
2	50/60	1.595	7.54×10^-365	1.536	1.84×10^-249
3	50/65	1.524	1.82×10^-378	1.472	1.08×10^-250
3	50/60	1.505	3.06×10^-373	1.465	1.59×10^-256
4	50/65	1.501	2.83×10^-384	1.459	1.30×10^-262
4	50/60	1.488	7.70×10^-377	1.448	1.47×10^-258
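
Several p-values in Table 6 are smaller than the smallest positive double-precision number (roughly 10^-324), which is why a tail approximation is needed at all. The sketch below illustrates the idea in log space. It does not implement the Karagiannidis and Lioumpas formula cited in the caption; it swaps in the simpler classical leading-order tail approximation Q(z) ≈ φ(z)/z, which is only reasonable for large |z|. The function names are hypothetical.

```python
import math


def log10_two_sided_p(z: float) -> float:
    """Approximate log10 of the two-sided normal p-value for a large |z|.

    Uses Q(z) ~ phi(z) / z, the leading-order tail approximation, evaluated
    in log space so that extremely small p-values do not underflow to zero.
    Not intended for |z| below about 5; z must be nonzero.
    """
    z = abs(z)
    log_density = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)  # ln phi(z)
    log_p = math.log(2.0) + log_density - math.log(z)           # ln(2 * phi(z) / z)
    return log_p / math.log(10.0)


def format_p(z: float) -> str:
    """Render the approximate p-value as mantissa x 10^exponent."""
    log10_p = log10_two_sided_p(z)
    exponent = math.floor(log10_p)
    mantissa = 10.0 ** (log10_p - exponent)
    return f"{mantissa:.2f}x10^{exponent}"


# Direct evaluation underflows for large test statistics, while the
# log-space version still returns a finite exponent.
z_example = 45.0  # hypothetical Wald statistic, not taken from the paper
print(math.erfc(z_example / math.sqrt(2.0)))
print(format_p(z_example))
```

For moderate statistics such as z = 10, where the exact value is still representable, the approximation agrees with `math.erfc` to within about one percent, which gives some reassurance about its use further out in the tail.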

Discussion

We accessed the MVP structured data available in the EHRs to extract cases and controls for downstream genetic studies of AMD. Here we demonstrate that cases and controls for this complex ocular disease can be extracted with high PPV and NPV despite the lack of availability of imaging data considered gold-standard for AMD phenotyping. We further demonstrate that the algorithms initially developed can be optimized using known, well-established genetic associations.

Few EHR-based AMD algorithms are available in the literature or in public repositories. The Marshfield Clinic’s Personalized Medicine Research Project (PMRP), as part of the electronic Medical Records and Genomics network (eMERGE), developed a basic AMD algorithm, which is available in the phenotype knowledgebase (phekb.org) repository²⁰. The algorithm uses ICD-9-CM codes only and has been minimally validated within a single patient population that was uniformly of European descent^21–23. Another eMERGE study site at Northwestern University developed an ICD-9-CM-based algorithm to extract AMD cases and controls free of AMD with high PPV and NPV²⁴. Similar to the present study, imaging data were not included in the respective algorithms from the individual eMERGE study sites. Other ocular disease algorithms have been developed for primary open-angle glaucoma²⁵ and diabetic retinopathy²⁶. Like the eMERGE study sites’ AMD algorithms^22,24, these are limited to ICD-9-CM codes and lack the gold-standard imaging data.

The present study has both weaknesses and strengths. In addition to the lack of imaging data for the MVP cohort, we did not have access to the clinical notes or communications between specialists (e.g., ²⁵) that could be used to confirm potential cases or to rule out patients as controls. Indeed, phenotype misclassification potentially accounts for the smaller-than-expected odds ratios observed for the known AMD loci (1.6 versus ~ 2.5 to 3.0)^4,11,27–29. A previous study of cataracts in the eMERGE network suggests natural language processing of clinical notes^30,31 has the potential to capture additional cases or controls as well as offer useful phenotypic granularity (e.g., severity) for the identified cases. Another limitation is that the known genetic associations for AMD used for sample size optimization are limited primarily to European-descent populations^32,33.

Despite these limitations, a major strength of the present study is the VA EHR, representing the largest integrated healthcare system in the United States with relatively uniform coding practices. The large patient population ascertained through various VA Medical Centers provides opportunities to validate algorithms across different Centers and offers the sample size to optimize algorithm performance across the entire MVP EHR. The availability of genome-wide data, a resource that continues to grow with continuing ascertainment and investments in generating genomic data, is another major asset to MVP algorithm development for phenotypes with established genetic associations.

In summary, we have developed algorithms and strategies to extract AMD cases and controls free of AMD accessing only structured data contained in EHRs from the MVP. Evaluation between two independent VA Medical Centers coupled with genetically-guided optimization resulted in a large case-control dataset with high PPV and NPV suitable for downstream genomic analyses such as the on-going genome-wide association studies in MVP for AMD. The strategies outlined here are relevant to other national efforts, such as All of Us, that access only structured data within the EHR for computable or electronic phenotyping³⁴.

Table 1.

Million Veteran Program study population. Sample sizes available for the current study, presented as total (and genotyped) participants by race/ethnicity.

	Not Hispanic (with genotypes)	Hispanic (with genotypes)	Unknown (with genotypes)
African American	94,891 (65,983)	1,279 (861)	695 (457)
Native American/Alaska Native	4,562 (3,101)	962 (647)	29 (18)
Asian	4,884 (3,132)	251 (167)	38 (21)
Pacific Islander	1,778 (1,126)	474 (307)	54 (27)
White	350,142 (247,301)	23,227 (16,171)	1,814 (1,171)
Other	2,951 (2,170)	3,966 (2,909)	21 (16)
Unknown	4,462 (2,936)	2,775 (1,706)	4,772 (2,726)
TOTAL	463,670 (325,749)	32,934 (22,768)	7,423 (4,436)

Acknowledgements

This research is based on data from the Million Veteran Program, Office of Research and Development, Veterans Health Administration, and was supported by award I01 BX003364 and by Research to Prevent Blindness. This publication does not reflect the views of the Department of Veterans Affairs or the United States Government. We are grateful to the VINCI and GENISIS support teams, and to the MVP Core Statistical Analysis team.

References

1. Gaziano JM, Concato J, Brophy M. Million Veteran Program: A mega-biobank to study genetic influences on health and disease. J Clin Epidemiol. 2016;70:214–223. doi: 10.1016/j.jclinepi.2015.09.016.
2. Wong WL, Su X, Li X. Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. Lancet Glob Heal. 2014;2:e106–16. doi: 10.1016/S2214-109X(13)70145-1.
3. Fritsche LG, Igl W, Cooke Bailey JN. A large genome-wide association study of age-related macular degeneration highlights contributions of rare and common variants. Nat Genet. 2016;48:134–143. doi: 10.1038/ng.3448.
4. Klein RJ, Zeiss C, Chew EY. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308:385–389. doi: 10.1126/science.1109557.
5. Fihn SD, Francis J, Clancy C. Insights from advanced analytics at the Veterans Health Administration. Health Aff. 2014;33:1203–1211. doi: 10.1377/hlthaff.2014.0054.
6. Turner S, Armstrong LL, Bradford Y. Quality control procedures for genome-wide association studies. Curr Protoc Hum Genet. 2011;68:1–19. doi: 10.1002/0471142905.hg0119s68.
7. Kvale MN, Hesselson S, Hoffmann TJ. Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort. Genetics. 2015;200:1051–1060. doi: 10.1534/genetics.115.178905.
8. Klarin D, Damrauer SM, Cho K. Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program. Nat Genet. 2018;50:1514–1523. doi: 10.1038/s41588-018-0222-9.
9. Cipriani V, Leung H-T, Plagnol V. Genome-wide association study of age-related macular degeneration identifies associated variants in the TNXB-FKBPL-NOTCH4 region of chromosome 6p21.3. Hum Mol Genet. 2012;21:4138–4150. doi: 10.1093/hmg/dds225.
10. 1000 Genomes Project Consortium, Auton A, Brooks LD. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
11. Restrepo NA, Spencer KL, Goodloe R. Genetic determinants of age-related macular degeneration in diverse populations from the PAGE Study. Invest Ophthalmol Vis Sci. 2014;55:6839–6850. doi: 10.1167/iovs.14-14246.
12. Purcell S, Neale B, Todd-Brown K. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–575. doi: 10.1086/519795.
13. Chang CC, Chow CC, Tellier LCAM. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:1–16. doi: 10.1186/s13742-015-0047-8.
14. Price AL, Patterson NJ, Plenge RM. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. doi: 10.1038/ng1847.
15. Hellwege JN, Keaton JM, Giri A. Population stratification in genetic association studies. Curr Protoc Hum Genet. 2017;95:1.22.1–1.22.23. doi: 10.1002/cphg.48.
16. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–1664. doi: 10.1101/gr.094052.109.
17. Zhang X, Saaddine JB, Chou CF. Prevalence of diabetic retinopathy in the United States, 2005-2008. JAMA. 2010;304:649–656. doi: 10.1001/jama.2010.1111.
18. Friedman DS, Wolfs RC, O’Colmain BJ. Prevalence of open-angle glaucoma among adults in the United States. Arch Ophthalmol. 2004;122:532–538. doi: 10.1001/archopht.122.4.532.
19. Karagiannidis GK, Lioumpas AS. An improved approximation for the Gaussian Q-Function. IEEE Communication Letters. 2007;11:644–646.
20. Kirby JC, Speltz P, Rasmussen LV. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inform Assoc. 2016;23:1046–1052. doi: 10.1093/jamia/ocv202.
21. McCarty C, Chisholm R, Chute C. The eMERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011;4:13. doi: 10.1186/1755-8794-4-13.
22. Brilliant MH, Vaziri K, Connor TB. Mining retrospective data for virtual prospective drug repurposing: L-DOPA and age-related macular degeneration. Am J Med. 2016;129:292–298. doi: 10.1016/j.amjmed.2015.10.015.
23. Ritchie MD, Verma SS, Hall MA. Electronic medical records and genomics (eMERGE) network exploration in cataract: several new potential susceptibility loci. Mol Vis. 2014;20:1281–1295.
24. Simonett JM, Sohrab MA, Pacheco J. A validated phenotyping algorithm for genetic association studies in age-related macular degeneration. Sci Rep. 2015;5:12875. doi: 10.1038/srep12875.
25. Restrepo NA, Farber-Eger E, Goodloe R. Extracting primary open-angle glaucoma from electronic medical records for genetic association studies. PLoS ONE. 2015;10:e0127817. doi: 10.1371/journal.pone.0127817.
26. Restrepo NA, Farber-Eger E, Crawford DC. Searching in the dark: phenotyping diabetic retinopathy in a de-identified electronic medical record sample of African Americans. AMIA Jt Summits Transl Sci Proc. 2016;2016:221–230.
27. Haines JL, Hauser MA, Schmidt S. Complement factor H variant increases the risk of age-related macular degeneration. Science. 2005;308:419–421. doi: 10.1126/science.1110359.
28.Despriet DD, Klaver CC, Witteman JC. Complement factor H polymorphism, complement activators, and risk of age-related macular degeneration. JAMA. 2006;296:301–309. doi: 10.1001/jama.296.3.301. [DOI] [PubMed] [Google Scholar]
29.Yu Y, Bhangale TR, Fagerness J. Common variants near FRK/COL10A1 and VEGFA are associated with advanced age-related macular degeneration. Hum Mol Genet. 2011;20:3699–3709. doi: 10.1093/hmg/ddr270. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Peissig PL, Rasmussen LV, Berg RL. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc. 2012;19:225–234. doi: 10.1136/amiajnl-2011-000456. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Rasmussen LV, Peissig PL, McCarty CA. Development of an optical character recognition pipeline for handwritten form fields from an electronic health record. J Am Med Inform Assoc. 2012;19:e90–e95. doi: 10.1136/amiajnl-2011-000182. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Bustamante CD, De La Vega FM, Burchard EG. Genomics for the world. Nature. 2011;475:163–165. doi: 10.1038/475163a. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016;538:161–164. doi: 10.1038/538161a. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Collins FS, Varmus H. A new initiative on precision medicine. NEJM. 2015;372:793–795. doi: 10.1056/NEJMp1500523. [DOI] [PMC free article] [PubMed] [Google Scholar]
