Abstract
Electronic health records (EHRs) linked to extensive biorepositories and supplemented with lifestyle, behavioral, and environmental exposure data have enormous potential to contribute to genomic discovery, a necessary step in the pathway towards translational or precision medicine. A major bottleneck in incorporating EHRs into genomic studies is the extraction of research-grade variables for analysis, particularly when gold-standard measurements are not available or accessible. Here we develop algorithms to identify cases of age-related macular degeneration (AMD), a common cause of blindness among the elderly, and controls free of AMD. These computable phenotypes were developed using billing codes (ICD-9-CM and ICD-10-CM) and Current Procedural Terminology (CPT) codes and evaluated at two study sites of the Veterans Affairs Million Veteran Program (MVP): the Louis Stokes Cleveland VA Medical Center and the Providence VA Medical Center. After establishing high overall positive and negative predictive values (93% and 95%, respectively) through manual chart review, the candidate algorithm was deployed in the full VA MVP dataset of >500,000 participants. The algorithm was then optimized in a data cube using a variety of approaches, including adjusting inclusion age thresholds, by examining previously reported genetic associations for CFH (rs10801555, a proxy for rs1061170) and ARMS2 (rs10490924). The algorithm with the smallest p-values for the known genetic associations was selected for downstream and on-going AMD genomic discovery efforts. This two-phase approach to developing research-grade case/control variables for AMD genomic studies capitalizes on established genetic associations, resulting in high precision and optimized sample sizes, and can be applied to other large-scale biobanks linked to EHRs for precision medicine research.
Keywords: age-related macular degeneration, electronic health records, Million Veteran Program, genetic association study
Introduction
The Department of Veterans Affairs Veterans Health Administration (VHA) established the Million Veteran Program (MVP) to facilitate large-scale analysis of combined genetics and electronic health records (EHRs)1. MVP has enrolled >500,000 Veterans from over 50 sites in the continental United States and Puerto Rico. MVP currently provides a text version of the VA EHR, genome-wide data from a custom Affymetrix Axiom Biobank Array, and two survey instruments. Future genomic data will include imputed genome-wide data as well as whole-exome and whole-genome sequencing data. MVP thus provides a rich dataset for analysis of genomic contributions to disease, response to treatment, and related questions.
Age-related macular degeneration (AMD) is a leading cause of irreversible blindness in the developed world2. AMD has a substantial genetic component, with 52 independently associated variants across 34 genomic loci identified in the largest genome-wide association study (GWAS) to date for AMD3. These variants collectively explain 27.2% of AMD disease risk and more than half of the risk attributed to genetics3. While much of the genetic architecture of AMD has been revealed since its first GWAS in 20054, gaps remain, including the identification of additional variants (including rare variants) and the inclusion of diverse populations to identify population-specific associations.
With its size and diversity, MVP offers an important opportunity to expand our understanding of AMD genetics within and between populations. A major challenge to this goal is the accurate identification of cases (patients with AMD) and controls (patients without AMD) using a data repository of EHRs without access to the gold standard clinical images used for AMD diagnosis (fundus photography, optical coherence tomography imaging). With access to extensive structured clinical data in the form of International Classification of Diseases (ICD) codes and Current Procedural Terminology (CPT) codes, we examine here the extent to which billing codes and other criteria relevant to AMD provide an accurate identification of AMD cases and controls within the VA Computerized Patient Record System (CPRS)5 for downstream genomic discovery studies.
Methods
Study population
The MVP is a national research program launched in 2011 by the Department of Veterans Affairs Office of Research & Development. The rationale, study design, and data collection for the MVP have been previously described1. In brief, upon informed consent, Veterans provide biospecimens and access to their EHR, and contribute lifestyle, behavioral, and other health-related data via a baseline survey and an optional lifestyle survey (https://www.research.va.gov/MVP/). The present study was approved by the Institutional Review Boards of the Louis Stokes Cleveland VA Medical Center (LSCVAMC) and the Providence VA Medical Center (PVAMC).
MVP intends to recruit at least one million participants. As of 2016, 504,027 MVP participants were available for study with accessible EHRs and of these, 352,953 had genome-wide genotype data available. Genome-wide data were generated using the Affymetrix Axiom Biobank Array (~723,000 markers) by two contracted vendors1. Basic quality control was performed by the two vendors followed by additional quality control as part of the MVP Genomic Working Group. Standard quality control metrics6,7 included testing for batch effects; calculating missingness by batch, sample call rates, duplicate sample concordance, minor allele frequencies; and performing sex checks and sample contamination checks. Imputed data were not available at the time of algorithm development but are now available for genetic associations within MVP8.
Statistical analyses
Among non-Hispanic European Americans in the MVP, we performed tests of association between AMD case status and two common variants consistently and strongly associated with AMD in European-descent populations: CFH rs10801555 (a proxy for rs1061170 (ref. 9) with r²=1 in the 1000 Genomes Project Phase 3 European subset10) and ARMS2 rs10490924. Both variants have been associated with AMD with large genetic effect sizes (odds ratios ranging from 1.5 to 3.0) in most populations of European descent11. Tests of association were performed in PLINK12,13 (v1.90b4.4) using logistic regression assuming an additive genetic model and adjusting for sex and 10 principal components to account for population structure14,15. Prior to performing the tests of association, we characterized genetic ancestry from global admixture proportions determined by ADMIXTURE16; only samples with <10% non-European ancestry were included in these analyses.
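To make the additive genetic model concrete, the following is a minimal sketch of a logistic regression of case status on risk-allele dosage (0, 1, or 2), fitted by Newton-Raphson on grouped counts. It is not the PLINK implementation used in the study: the covariates (sex and principal components) are omitted, and the genotype counts are hypothetical values chosen so that the odds of being a case double with each additional risk allele.

```python
import math

def fit_additive_logistic(groups, iterations=25):
    """Fit logit(P(case)) = b0 + b1 * dosage by Newton-Raphson.

    groups: list of (dosage, n_cases, n_controls), dosage in {0, 1, 2}.
    Returns (b0, b1); exp(b1) is the per-allele odds ratio.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]]
        g0 = g1 = h00 = h01 = h11 = 0.0
        for dosage, n_cases, n_controls in groups:
            n = n_cases + n_controls
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * dosage)))
            resid = n_cases - n * p
            w = n * p * (1.0 - p)
            g0 += resid
            g1 += resid * dosage
            h00 += w
            h01 += w * dosage
            h11 += w * dosage * dosage
        # Newton step: solve the 2x2 system H * delta = g in closed form
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical counts: (risk-allele dosage, cases, controls)
toy_counts = [(0, 100, 100), (1, 100, 50), (2, 100, 25)]
b0, b1 = fit_additive_logistic(toy_counts)
odds_ratio = math.exp(b1)
print(f"per-allele odds ratio = {odds_ratio:.3f}")
```

Under the additive model a single coefficient describes the change in log-odds per copy of the risk allele, which is why the results are reported as one odds ratio per variant.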
Results
To classify participants as definite/probable cases, controls, or unknown for AMD, we developed algorithms utilizing structured data [International Classification of Diseases, 9th and 10th Revisions, Clinical Modification (ICD-9-CM and ICD-10-CM) and CPT codes] in the EHRs. Imaging data are not yet available in MVP; consequently, these algorithms do not include fundus photography, optical coherence tomography, or other gold-standard tools that define AMD.
Phase 1: Local algorithm development
We developed an initial algorithm and test sets of cases and controls at the LSCVAMC via detailed chart and imaging reviews by retinal specialists. Based on the chart reviews from identified cases, the initial algorithm was refined and subsequent versions were developed and tested at both VA medical centers (LSCVAMC and PVAMC).
For the initial algorithm, we identified cases and controls (≥65 years of age) among those with comprehensive eye exams (CPT codes 92004 or 92014) within the last two years. Age was defined by date of birth and calculated in years at the time the data were accessed. Cases were identified based on the presence of one mention of ICD-9-CM 362.51 (nonexudative senile macular degeneration) or 362.52 (exudative senile macular degeneration) or ICD-10-CM H35.31 (nonexudative AMD) or H35.32 (exudative AMD) and the absence of ICD-9-CM 362.55 (toxic maculopathy) or ICD-10-CM H35.389 (toxic maculopathy, unspecified eye). Controls were defined by the absence of AMD-related ICD-9-CM codes (362.51 and 362.52) and ICD-10-CM codes (H35.31 and H35.32). The initial algorithm was implemented at LSCVAMC, and a chart review of 50 identified cases revealed nine false positives, resulting in a low preliminary positive predictive value (PPV; 82%). Likewise, a review of 50 identified controls revealed seven false negatives (one wet AMD and six dry AMD cases), resulting in a low preliminary negative predictive value (NPV; 86%).
In the first revised algorithm (Algorithm 1, Table 2), we required two mentions of AMD-related ICD-9-CM or ICD-10-CM codes. We then implemented both the case and control revised algorithms at the two VAMCs (Cleveland and Providence) and reviewed a fraction of identified cases and controls from each site (Table 3). The PPV ranged from 92% to 94%, with an overall PPV of 93% (standard error = 0.017). The NPV ranged from 89% to 99%, with an overall NPV of 95% (standard error = 0.014). Total sensitivity and specificity were both high: 95% and 93%, respectively.
Table 2.
Revised algorithm (Algorithm 1) to identify AMD cases and controls in electronic health records.
| AMD case definition | AMD control definition |
|---|---|
| ≥65 years of age | ≥65 years of age |
| AND | AND |
| At least one mention within the last two years of • CPT code 92004 or • CPT code 92014 | At least one mention within the last two years of • CPT code 92004 or • CPT code 92014 |
| AND | AND |
| At least two mentions (on separate clinic visits) or only at the most recent visit to the Eye Clinic of • ICD-9-CM codes 362.51 or 362.52 or • ICD-10-CM codes H35.31 or H35.32 | Absence of • ICD-9-CM codes 362.51 and 362.52 or • ICD-10-CM codes H35.31 and H35.32 |
| AND | |
| Absence of • ICD-9-CM code 362.55 or • ICD-10-CM code H35.389 | |
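The rules in Table 2 can be sketched as a small classification function. This is an illustrative simplification, not the code deployed in the study: the `Visit` record and function name are hypothetical, codes are matched exactly as written in the table, the two-year window is approximated as 730 days, and the "only at the most recent visit to the Eye Clinic" clause is not modeled.

```python
from dataclasses import dataclass
from datetime import date, timedelta

AMD_CODES = {"362.51", "362.52", "H35.31", "H35.32"}
EXCLUSION_CODES = {"362.55", "H35.389"}
EYE_EXAM_CPT = {"92004", "92014"}

@dataclass
class Visit:
    """Hypothetical record: one clinic visit and the codes billed at it."""
    visit_date: date
    codes: set  # ICD-9-CM, ICD-10-CM, and CPT codes as strings

def classify_algorithm1(age, visits, as_of, min_age=65):
    """Return 'case', 'control', or 'unknown' under a simplified Table 2."""
    if age < min_age:
        return "unknown"
    window_start = as_of - timedelta(days=730)  # assumed two-year window
    recent_exam = any(
        bool(v.codes & EYE_EXAM_CPT) and window_start <= v.visit_date <= as_of
        for v in visits
    )
    if not recent_exam:
        return "unknown"
    # Distinct visit dates carrying an AMD code (separate clinic visits)
    amd_visit_dates = {v.visit_date for v in visits if v.codes & AMD_CODES}
    has_exclusion = any(v.codes & EXCLUSION_CODES for v in visits)
    if not amd_visit_dates:
        return "control"
    if len(amd_visit_dates) >= 2 and not has_exclusion:
        return "case"
    # One mention only, or an exclusion code present
    return "unknown"

example_visits = [
    Visit(date(2016, 3, 1), {"92014", "362.51"}),
    Visit(date(2016, 9, 15), {"H35.31"}),
]
print(classify_algorithm1(72, example_visits, as_of=date(2017, 1, 1)))
```

Patients with a single AMD mention fall into the "unknown" group here, which corresponds to the participants neither counted as cases nor used as controls under Algorithm 1.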
Table 3.
Algorithm 1 performance, by VHA study site. Abbreviations: Louis Stokes Cleveland VA Medical Center (LSCVAMC), negative predictive value (NPV), positive predictive value (PPV), Providence VA Medical Center (PVAMC).
| Site | Number of AMD cases | Number of controls | Number of false positives | Number of false negatives | PPV (%) | NPV (%) |
|---|---|---|---|---|---|---|
| LSCVAMC | 138 | 126 | 11 | 1 | 92 | 99 |
| PVAMC | 100 | 100 | 6 | 11 | 94 | 89 |
| Total | 238 | 226 | 17 | 12 | 93 | 95 |
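The performance figures can be reproduced from the chart-review counts in Table 3. The sketch below assumes that the "cases" and "controls" columns are algorithm-assigned labels and that false positives and false negatives were determined by chart review; the function name is illustrative.

```python
def review_metrics(n_cases, n_controls, false_pos, false_neg):
    """Compute PPV, NPV, sensitivity, and specificity from review counts."""
    true_pos = n_cases - false_pos      # algorithm cases confirmed by review
    true_neg = n_controls - false_neg   # algorithm controls confirmed by review
    return {
        "PPV": true_pos / n_cases,
        "NPV": true_neg / n_controls,
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
    }

# (cases, controls, false positives, false negatives) from Table 3
table3 = {
    "LSCVAMC": (138, 126, 11, 1),
    "PVAMC": (100, 100, 6, 11),
    "Total": (238, 226, 17, 12),
}

for site, counts in table3.items():
    metrics = review_metrics(*counts)
    rounded = {name: round(100 * value) for name, value in metrics.items()}
    print(site, rounded)
```

Because cases and controls were sampled for review in roughly equal numbers, the sensitivity and specificity computed this way describe the reviewed sample and not necessarily the full cohort.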
To capture additional cases of AMD and controls free of AMD, we varied 1) the billing codes for exclusion (Algorithm 2), 2) the billing codes for inclusion (Algorithm 3), and/or 3) the requirement of a recent eye exam (for control definition only; Algorithm 4). For Algorithm 2, we expanded the list of case exclusion codes representing other eye diseases (primary open-angle glaucoma and diabetic retinopathy; Table 4) and applied these exclusions to Algorithm 1. For Algorithm 3, we added ICD-9-CM codes 362.50, 362.51, and 362.52 and ICD-10-CM codes H35.30, H35.31%, and H35.32% to the case inclusion list. We also expanded the list of CPT codes representing ophthalmological services received (92002, 92012, 92004, and 92014). Finally, Algorithm 4 used the case definition from Algorithm 3 and relaxed the control definition to include patients without evidence of an eye exam.
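The Algorithm 3 inclusion list mixes exact codes with patterns such as H35.31%. A minimal sketch of how such a list might be matched is given below, under the assumption that a trailing "%" acts as an SQL-style wildcard matching any remaining characters (the text does not define the symbol); the helper names are hypothetical.

```python
ALG3_CASE_PATTERNS = ["362.50", "362.51", "362.52", "H35.30", "H35.31%", "H35.32%"]

def matches_code(code, pattern):
    """Exact match, unless the pattern ends in '%' (assumed prefix wildcard)."""
    if pattern.endswith("%"):
        return code.startswith(pattern[:-1])
    return code == pattern

def is_alg3_case_code(code):
    """True if the code matches any Algorithm 3 case-inclusion pattern."""
    return any(matches_code(code, pattern) for pattern in ALG3_CASE_PATTERNS)

for code in ["362.50", "H35.3211", "H35.33", "362.55"]:
    print(code, is_alg3_case_code(code))
```

Prefix matching matters for ICD-10-CM because laterality and stage are encoded as trailing characters, so a single pattern can cover many billable codes.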
Table 4.
ICD-9-CM and ICD-10-CM codes added as exclusions for AMD case status. The list of billing code exclusions for primary open-angle glaucoma and diabetic retinopathy was applied to Algorithm 1 to create Algorithm 2. The list of exclusions was developed to reduce the number of false positive cases, particularly in African American patients where both primary open-angle glaucoma and diabetic retinopathy are more prevalent compared with other racial/ethnic groups17, 18.
| ICD-9-CM | ICD-10-CM |
|---|---|
| 250.5* Diabetes with ophthalmic manifestations; 361 Retinal detachments and defects; 361.0* Retinal detachment with retinal defect; 361.1* Retinoschisis and retinal cysts; 361.2 Serous retinal detachment; 361.3* Retinal defects without detachment; 361.8* Other forms of retinal detachment; 361.9 Unspecified retinal detachment; 362 Other retinal disorders; 362.0 Diabetic retinopathy; 362.1* Other background retinopathy and retinal vascular changes; 362.2 Other proliferative retinopathy; 362.20* Retinopathy of prematurity, unspecified; 362.3* Retinal vascular occlusion; 362.7* Hereditary retinal dystrophies; 362.8* Other retinal disorders; 362.9 Unspecified retinal disorder | H33 Retinal detachments and breaks; H33.0 Retinal detachment with retinal break; H33.001, .002, .003, .009 Unspecified retinal detachment with retinal break; H33.01, .011, .012, .013, .019 Retinal detachment with single break; H33.02, .021, .022, .023, .029 Retinal detachment with multiple breaks; H33.03, .031, .032, .033, .039 Retinal detachment with giant retinal tear; H33.04, .041, .042, .043, .049 Retinal detachment with retinal dialysis; H33.05, .051, .052, .053, .059 Total retinal detachment; H33.1 Retinoschisis and retinal cysts; H33.10, .101, .102, .103, .109 Unspecified retinoschisis; H33.11, .111, .112, .113, .119 Cyst of ora serrata; H33.12, .121, .122, .123, .129 Parasitic cyst of retina; H33.19, .191, .192, .193, .199 Other retinoschisis and retinal cysts; H33.2, .20, .21, .22, .23 Serous retinal detachment; H33.3 Retinal breaks without detachment; H33.30, .301, .302, .303, .309 Unspecified retinal break; H33.31, .311, .312, .313, .319 Horseshoe tear of retina without detachment; H33.32, .321, .322, .323, .329 Round hole of retina without detachment; H33.33, .331, .332, .333, .339 Multiple defects of retina without detachment; H33.4, .40, .41, .42, .43 Traction detachment of retina; H33.8 Other retinal detachments; E10.3/E11.3 Type 1/Type 2 diabetes mellitus with ophthalmic complications; E10.31/E11.31, .311, .319 Type 1/Type 2 diabetes mellitus with unspecified diabetic retinopathy; E10.32/E11.32 Type 1/Type 2 diabetes mellitus with mild nonproliferative diabetic retinopathy; E10.321/E11.321, .3211, .3212, .3213, .3219 Type 1/Type 2 diabetes mellitus with mild nonproliferative diabetic retinopathy with macular edema; E10.329/E11.329, .3291, .3292, .3293, .3299 Type 1/Type 2 diabetes mellitus with mild nonproliferative diabetic retinopathy without macular edema; E10.33/E11.33 Type 1/Type 2 diabetes mellitus with moderate nonproliferative diabetic retinopathy; E10.331/E11.331, .3311, .3312, .3313, .3319 Type 1/Type 2 diabetes mellitus with moderate nonproliferative diabetic retinopathy with macular edema; E10.339/E11.339, .3391, .3392, .3393, .3399 Type 1/Type 2 diabetes mellitus with moderate nonproliferative diabetic retinopathy without macular edema; E10.34/E11.34 Type 1/Type 2 diabetes mellitus with severe nonproliferative diabetic retinopathy; E10.341/E11.341, .3411, .3412, .3413, .3419 Type 1/Type 2 diabetes mellitus with severe nonproliferative diabetic retinopathy with macular edema; E10.349/E11.349, .3491, .3492, .3493, .3499 Type 1/Type 2 diabetes mellitus with severe nonproliferative diabetic retinopathy without macular edema; E10.35/E11.35 Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy; E10.351/E11.351, .3511, .3512, .3513, .3519 Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy with macular edema; E10.352/E11.352, .3521, .3522, .3523, .3529 Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy with traction retinal detachment involving the macula; E10.353/E11.353, .3531, .3532, .3533, .3539 Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy with traction retinal detachment not involving the macula; E10.354/E11.354, .3541, .3542, .3543, .3549 Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy with combined traction retinal detachment and rhegmatogenous retinal detachment; E10.355/E11.355, .3551, .3552, .3553, .3559 Type 1/Type 2 diabetes mellitus with stable proliferative diabetic retinopathy; E10.359/E11.359, .3591, .3592, .3593, .3599 Type 1/Type 2 diabetes mellitus with proliferative diabetic retinopathy without macular edema; E10.36/E11.36 Type 1/Type 2 diabetes mellitus with diabetic cataract; E10.37/E11.37, .37X1, .37X2, .37X3, .37X9 Type 1/Type 2 diabetes mellitus with diabetic macular edema, resolved following treatment; E10.39/E11.39 Type 1/Type 2 diabetes mellitus with other diabetic ophthalmic complication; H40.11 Primary open-angle glaucoma; H40.1110-.1114 Primary open-angle glaucoma, right eye; H40.1120-.1124 Primary open-angle glaucoma, left eye; H40.1130-.1134 Primary open-angle glaucoma, bilateral; H40.1190-.1194 Primary open-angle glaucoma, unspecified eye; H34 Retinal vascular occlusions; H34.00-.03 Transient retinal artery occlusion; H34.10-.13 Central retinal artery occlusion; H34.2 Other retinal artery occlusions; H34.211, .212, .213, .219 Partial retinal artery occlusion; H34.231, .232, .233, .239 Retinal artery branch occlusion; H34.8 Other retinal vascular occlusions; H34.81 Central retinal vein occlusion; H34.8110-.8112 Central retinal vein occlusion, right eye; H34.8120-.8122 Central retinal vein occlusion, left eye; H34.8130-.8132 Central retinal vein occlusion, bilateral; H34.8190-.8192 Central retinal vein occlusion, unspecified eye; H34.821, .822, .823, .829 Venous engorgement; H34.83 Tributary (branch) retinal vein occlusion; H34.8310-.8312 Tributary (branch) retinal vein occlusion, right eye; H34.8320-.8322 Tributary (branch) retinal vein occlusion, left eye; H34.8330-.8332 Tributary (branch) retinal vein occlusion, bilateral; H34.8390-.8392 Tributary (branch) retinal vein occlusion, unspecified eye; H34.9 Unspecified retinal vascular occlusion |
Phase 2: Sample size optimization for genetic association studies
We further evaluated algorithms using various case and control age thresholds and well-established genetic associations to maximize sample size and statistical power. To do this, we created a data cube of cases and controls defined by a total of eight algorithms based on Algorithms 1-4 described above and two age thresholds (Table 5). These algorithms were then applied to the MVP EHRs hosted by the VA Informatics and Computing Infrastructure (VINCI), a partner of the VHA Corporate Data Warehouse5.
Table 5.
Algorithms included in the data cube for AMD case and control sample size optimization. The data cube contains Algorithms 1-4, with each represented by two different age thresholds as specified. Potential cases have only one mention of an AMD-qualifying ICD-9-CM or ICD-10-CM code, rather than the two mentions required for cases.
| Algorithm | Case age threshold | Control age threshold | Case count (Potential cases) | Control count | Excluded |
|---|---|---|---|---|---|
| 1 | 65 | 65 | 14,453 (6,717) | 131,075 | 351,782 |
| 1 | 50 | 60 | 19,351 (9,503) | 237,546 | 237,627 |
| 2 | 65 | 65 | 14,375 (6,674) | 131,075 | 351,903 |
| 2 | 50 | 60 | 19,216 (9,429) | 237,546 | 237,836 |
| 3 | 50 | 65 | 28,609 (15,143) | 173,500 | 286,775 |
| 3 | 50 | 60 | 28,609 (15,143) | 225,172 | 235,103 |
| 4 | 50 | 65 | 28,609 (15,143) | 267,581 | 192,694 |
| 4 | 50 | 60 | 28,609 (15,143) | 322,539 | 137,736 |
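As a consistency check on Table 5, the sketch below treats cases, potential cases, controls, and excluded participants as mutually exclusive groups (our reading of the table, not a statement from the text) and confirms that each row accounts for all 504,027 participants with accessible EHRs. It also reports the control-to-case ratio, which shows how the relaxed control definitions enlarge the comparison group.

```python
MVP_TOTAL = 504_027  # participants with accessible EHRs, as stated in Methods

# (algorithm, case age, control age, cases, potential cases, controls, excluded)
TABLE5 = [
    (1, 65, 65, 14_453, 6_717, 131_075, 351_782),
    (1, 50, 60, 19_351, 9_503, 237_546, 237_627),
    (2, 65, 65, 14_375, 6_674, 131_075, 351_903),
    (2, 50, 60, 19_216, 9_429, 237_546, 237_836),
    (3, 50, 65, 28_609, 15_143, 173_500, 286_775),
    (3, 50, 60, 28_609, 15_143, 225_172, 235_103),
    (4, 50, 65, 28_609, 15_143, 267_581, 192_694),
    (4, 50, 60, 28_609, 15_143, 322_539, 137_736),
]

def summarize(row):
    """Return the row total and the number of controls per case."""
    algorithm, case_age, control_age, cases, potential, controls, excluded = row
    return {
        "algorithm": algorithm,
        "ages": f"{case_age}/{control_age}",
        "classified": cases + potential + controls + excluded,
        "controls_per_case": controls / cases,
    }

for row in TABLE5:
    summary = summarize(row)
    status = "ok" if summary["classified"] == MVP_TOTAL else "mismatch"
    print(summary["algorithm"], summary["ages"],
          f"{summary['controls_per_case']:.1f} controls per case", status)
```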
We then used the known genetic associations between AMD and CFH (rs10801555, a proxy for rs1061170)9 and ARMS2 (rs10490924) to examine the impact that age threshold, case/control-defining code lists, and number of code mentions have on sample size and on the resulting genetic effect size and statistical significance. To do this, we accessed the Axiom genotype data for all MVP samples through the Genomic Information System for Integrative Science (GenISIS) and performed standard tests of association assuming an additive genetic model.
In general, the requirement of two case-defining codes yielded stronger genetic effect sizes for both loci tested (Table 6a) despite the smaller sample sizes compared with relaxing this requirement to only one mention (Table 6b). The stricter case definition also resulted in smaller p-values compared with the more permissive case definition. Across the four algorithms, Algorithm 4 consistently yielded genetic associations with the smallest p-values regardless of case/control age threshold (Table 6). Within Algorithm 4, the lower case (50 years) and higher control (65 years) age thresholds yielded the smallest p-values, and this combination was therefore selected as the optimal algorithm for downstream MVP AMD genetic association studies.
Table 6.
Data cube algorithm assessment using well-established AMD genetic associations. We tested for an association between AMD case status and rs10801555 and rs10490924 using logistic regression adjusted for sex and 10 principal components and assuming an additive genetic model. For each test performed, odds ratios and p-values are shown by algorithm when two mentions of case-defining billing codes are required (a) and when only one mention of case-defining billing codes is required (b). P-values for rs10801555 were estimated using the approximation for the extreme tail of the normal distribution from Karagiannidis and Lioumpas19. The final algorithm and case/control age-thresholds selected for MVP AMD genetic association studies going forward is bolded.
a) Requiring two mentions of case-defining codes
| Algorithm | Case/control age threshold | CFH rs10801555 odds ratio | CFH rs10801555 p-value | ARMS2 rs10490924 odds ratio | ARMS2 rs10490924 p-value |
|---|---|---|---|---|---|
| 1 | 65/65 | 1.775 | 1.81×10⁻²⁹⁰ | 1.680 | 5.57×10⁻²⁰¹ |
| 2 | 65/65 | 1.775 | 6.90×10⁻²⁸⁹ | 1.681 | 8.91×10⁻²⁰¹ |
| 1 | 50/60 | 1.746 | 5.85×10⁻³⁷⁴ | 1.659 | 5.47×10⁻²⁵⁹ |
| 2 | 50/60 | 1.747 | 2.42×10⁻³⁷² | 1.663 | 1.49×10⁻²⁵⁹ |
| 3 | 50/65 | 1.689 | 2.73×10⁻⁴²⁸ | 1.613 | 5.91×10⁻²⁹³ |
| 3 | 50/60 | 1.667 | 6.40×10⁻⁴²³ | 1.607 | 2.59×10⁻²⁹⁸ |
| 4 | 50/65 | 1.661 | 2.25×10⁻⁴³¹ | 1.600 | 2.26×10⁻³⁰³ |
| 4 | 50/60 | 1.647 | 4.56×10⁻⁴²⁴ | 1.588 | 2.36×10⁻²⁹⁹ |
b) Requiring only one mention of case-defining codes
| Algorithm | Case/control age threshold | CFH rs10801555 odds ratio | CFH rs10801555 p-value | ARMS2 rs10490924 odds ratio | ARMS2 rs10490924 p-value |
|---|---|---|---|---|---|
| 1 | 65/65 | 1.629 | 5.37×10⁻²⁸⁷ | 1.548 | 1.92×10⁻¹⁸⁸ |
| 2 | 65/65 | 1.629 | 4.71×10⁻²⁸⁶ | 1.548 | 1.18×10⁻¹⁸⁷ |
| 1 | 50/60 | 1.593 | 7.54×10⁻³⁶⁵ | 1.534 | 1.09×10⁻²⁴⁹ |
| 2 | 50/60 | 1.595 | 7.54×10⁻³⁶⁵ | 1.536 | 1.84×10⁻²⁴⁹ |
| 3 | 50/65 | 1.524 | 1.82×10⁻³⁷⁸ | 1.472 | 1.08×10⁻²⁵⁰ |
| 3 | 50/60 | 1.505 | 3.06×10⁻³⁷³ | 1.465 | 1.59×10⁻²⁵⁶ |
| 4 | 50/65 | 1.501 | 2.83×10⁻³⁸⁴ | 1.459 | 1.30×10⁻²⁶² |
| 4 | 50/60 | 1.488 | 7.70×10⁻³⁷⁷ | 1.448 | 1.47×10⁻²⁵⁸ |
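Several p-values in Table 6 are smaller than the smallest positive double-precision number (roughly 10⁻³⁰⁸), so they cannot be computed directly and must be handled on a log scale. The sketch below illustrates one way to do this with an approximation of the Karagiannidis and Lioumpas form, erfc(x) ≈ (1 − e^(−Ax)) e^(−x²) / (B√π x). The constants A = 1.98 and B = 1.135 are the values commonly quoted for this approximation and should be checked against reference 19; the function names and the example test statistic are hypothetical, and this is not necessarily how the reported values were produced.

```python
import math

# Constants commonly quoted for the approximation (assumption: verify in ref. 19)
A = 1.98
B = 1.135

def log10_upper_tail(z):
    """Approximate log10 of Q(z) = P(Z > z) for a standard normal Z, z > 0.

    Uses Q(z) = 0.5 * erfc(z / sqrt(2)) with the approximation of erfc,
    evaluated entirely in logarithms to avoid floating-point underflow.
    """
    x = z / math.sqrt(2.0)
    ln_q = (
        math.log(0.5)
        + math.log1p(-math.exp(-A * x))
        - x * x
        - math.log(B * math.sqrt(math.pi) * x)
    )
    return ln_q / math.log(10.0)

def log10_two_sided_p(z):
    """Approximate log10 of the two-sided p-value for a z statistic."""
    return math.log10(2.0) + log10_upper_tail(abs(z))

# Hypothetical Wald z statistic, far beyond what direct evaluation can handle
z_example = 40.0
print(f"log10(p) is approximately {log10_two_sided_p(z_example):.2f}")

# Comparison at a moderate z, where the exact tail is still representable
exact = math.log10(math.erfc(5.0 / math.sqrt(2.0)))
print(f"z = 5: approx {log10_two_sided_p(5.0):.3f}, exact {exact:.3f}")
```

Working with log10(p) preserves the ordering of the algorithms even when the p-values themselves underflow, which is all that the comparison in Table 6 requires.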
Discussion
We accessed the MVP structured data available in the EHRs to extract cases and controls for downstream genetic studies of AMD. Here we demonstrate that cases and controls for this complex ocular disease can be extracted with high PPV and NPV despite the lack of availability of imaging data considered gold-standard for AMD phenotyping. We further demonstrate that the algorithms initially developed can be optimized using strong, well-established genetic associations.
Few EHR-based AMD algorithms are available in the literature or in public repositories. The Marshfield Clinic’s Personalized Medicine Research Project (PMRP), as part of the electronic Medical Records and Genomics network (eMERGE), developed a basic AMD algorithm, which is available in the phenotype knowledgebase (phekb.org) repository20. The algorithm uses ICD-9-CM codes only and has been minimally validated within a single patient population that was uniformly of European descent21–23. Another eMERGE study site at Northwestern University developed an ICD-9-CM-based algorithm to extract AMD cases and controls free of AMD with high PPV and NPV24. Similar to the present study, imaging data were not included in the respective algorithms from the individual eMERGE study sites. Other ocular disease algorithms have been developed for primary open-angle glaucoma25 and diabetic retinopathy26. Like the eMERGE study sites’ AMD algorithms22,24, these are limited to ICD-9-CM codes and lack the gold-standard imaging data.
The present study has both weaknesses and strengths. In addition to the lack of imaging data for the MVP cohort, we did not have access to the clinical notes or communications between specialists (e.g., 25) that could be used to confirm potential cases or to rule out patients as controls. Indeed, phenotype misclassification potentially accounts for the smaller-than-expected odds ratios observed for the known AMD loci (1.6 versus ~2.5 to 3.0)4,11,27–29. A previous study of cataracts in the eMERGE network suggests natural language processing of clinical notes30,31 has the potential to capture additional cases or controls as well as offer useful phenotypic granularity (e.g., severity) for the identified cases. Another limitation is that the known genetic associations for AMD used for sample size optimization are limited primarily to European-descent populations32,33.
Despite these limitations, a major strength of the present study is the VA EHR, representing the largest integrated healthcare system in the United States with relatively uniform coding practices. The large patient population ascertained through various VA Medical Centers provides opportunities to validate algorithms across different Centers and offers the sample size to optimize algorithm performance across the entire MVP EHR. The availability of genome-wide data, a resource that continues to grow with continuing ascertainment and investments in generating genomic data, is another major asset to MVP algorithm development for phenotypes with established genetic associations.
In summary, we have developed algorithms and strategies to extract AMD cases and controls free of AMD accessing only structured data contained in EHRs from the MVP. Evaluation between two independent VA Medical Centers coupled with genetically-guided optimization resulted in a large case-control dataset with high PPV and NPV suitable for downstream genomic analyses such as the on-going genome-wide association studies in MVP for AMD. The strategies outlined here are relevant to other national efforts such as All of Us accessing only structured data within the EHR for computable or electronic phenotyping34.
Table 1.
Million Veteran Program study population. Sample sizes available for the current study, presented as total (and genotyped) participants by race/ethnicity.
| Race | Not Hispanic (with genotypes) | Hispanic (with genotypes) | Unknown (with genotypes) |
|---|---|---|---|
| African American | 94,891 (65,983) | 1,279 (861) | 695 (457) |
| Native American/Alaska Native | 4,562 (3,101) | 962 (647) | 29 (18) |
| Asian | 4,884 (3,132) | 251 (167) | 38 (21) |
| Pacific Islander | 1,778 (1,126) | 474 (307) | 54 (27) |
| White | 350,142 (247,301) | 23,227 (16,171) | 1,814 (1,171) |
| Other | 2,951 (2,170) | 3,966 (2,909) | 21 (16) |
| Unknown | 4,462 (2,936) | 2,775 (1,706) | 4,772 (2,726) |
| TOTAL | 463,670 (325,749) | 32,934 (22,768) | 7,423 (4,436) |
Acknowledgements
This research is based on data from the Million Veteran Program, Office of Research and Development, Veterans Health Administration, and was supported by award I01 BX003364 and by Research to Prevent Blindness. This publication does not reflect the views of the Department of Veterans Affairs or the United States Government. We are grateful to the VINCI and GenISIS support teams, and to the MVP Core Statistical Analysis team.
References
- 1. Gaziano JM, Concato J, Brophy M. Million Veteran Program: A mega-biobank to study genetic influences on health and disease. J Clin Epidemiol. 2016;70:214–223. doi: 10.1016/j.jclinepi.2015.09.016.
- 2. Wong WL, Su X, Li X. Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. Lancet Glob Health. 2014;2:e106–16. doi: 10.1016/S2214-109X(13)70145-1.
- 3. Fritsche LG, Igl W, Cooke Bailey JN. A large genome-wide association study of age-related macular degeneration highlights contributions of rare and common variants. Nat Genet. 2016;48:134–143. doi: 10.1038/ng.3448.
- 4. Klein RJ, Zeiss C, Chew EY. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308:385–389. doi: 10.1126/science.1109557.
- 5. Fihn SD, Francis J, Clancy C. Insights from advanced analytics at the Veterans Health Administration. Health Aff. 2014;33:1203–1211. doi: 10.1377/hlthaff.2014.0054.
- 6. Turner S, Armstrong LL, Bradford Y. Quality control procedures for genome-wide association studies. Curr Protoc Hum Genet. 2011;68:1–19. doi: 10.1002/0471142905.hg0119s68.
- 7. Kvale MN, Hesselson S, Hoffmann TJ. Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort. Genetics. 2015;200:1051–1060. doi: 10.1534/genetics.115.178905.
- 8. Klarin D, Damrauer SM, Cho K. Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program. Nat Genet. 2018;50:1514–1523. doi: 10.1038/s41588-018-0222-9.
- 9. Cipriani V, Leung H-T, Plagnol V. Genome-wide association study of age-related macular degeneration identifies associated variants in the TNXB-FKBPL-NOTCH4 region of chromosome 6p21.3. Hum Mol Genet. 2012;21:4138–4150. doi: 10.1093/hmg/dds225.
- 10. 1000 Genomes Project Consortium, Auton A, Brooks LD. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
- 11. Restrepo NA, Spencer KL, Goodloe R. Genetic determinants of age-related macular degeneration in diverse populations from the PAGE Study. Invest Ophthalmol Vis Sci. 2014;55:6839–6850. doi: 10.1167/iovs.14-14246.
- 12. Purcell S, Neale B, Todd-Brown K. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–575. doi: 10.1086/519795.
- 13. Chang CC, Chow CC, Tellier LCAM. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:1–16. doi: 10.1186/s13742-015-0047-8.
- 14. Price AL, Patterson NJ, Plenge RM. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. doi: 10.1038/ng1847.
- 15. Hellwege JN, Keaton JM, Giri A. Population stratification in genetic association studies. Curr Protoc Hum Genet. 2017;95:1.22.1–1.22.23. doi: 10.1002/cphg.48.
- 16. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–1664. doi: 10.1101/gr.094052.109.
- 17. Zhang X, Saaddine JB, Chou CF. Prevalence of diabetic retinopathy in the United States, 2005-2008. JAMA. 2010;304:649–656. doi: 10.1001/jama.2010.1111.
- 18. Friedman DS, Wolfs RC, O’Colmain BJ. Prevalence of open-angle glaucoma among adults in the United States. Arch Ophthalmol. 2004;122:532–538. doi: 10.1001/archopht.122.4.532.
- 19. Karagiannidis GK, Lioumpas AS. An improved approximation for the Gaussian Q-Function. IEEE Communications Letters. 2007;11:644–646.
- 20. Kirby JC, Speltz P, Rasmussen LV. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inform Assoc. 2016;23:1046–1052. doi: 10.1093/jamia/ocv202.
- 21. McCarty C, Chisholm R, Chute C. The eMERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011;4:13. doi: 10.1186/1755-8794-4-13.
- 22. Brilliant MH, Vaziri K, Connor TB. Mining retrospective data for virtual prospective drug repurposing: L-DOPA and age-related macular degeneration. Am J Med. 2016;129:292–298. doi: 10.1016/j.amjmed.2015.10.015.
- 23. Ritchie MD, Verma SS, Hall MA. Electronic medical records and genomics (eMERGE) network exploration in cataract: several new potential susceptibility loci. Mol Vis. 2014;20:1281–1295.
- 24. Simonett JM, Sohrab MA, Pacheco J. A validated phenotyping algorithm for genetic association studies in age-related macular degeneration. Sci Rep. 2015;5:12875. doi: 10.1038/srep12875.
- 25. Restrepo NA, Farber-Eger E, Goodloe R. Extracting primary open-angle glaucoma from electronic medical records for genetic association studies. PLoS ONE. 2015;10:e0127817. doi: 10.1371/journal.pone.0127817.
- 26. Restrepo NA, Farber-Eger E, Crawford DC. Searching in the dark: phenotyping diabetic retinopathy in a de-identified electronic medical record sample of African Americans. AMIA Jt Summits Transl Sci Proc. 2016;2016:221–230.
- 27. Haines JL, Hauser MA, Schmidt S. Complement factor H variant increases the risk of age-related macular degeneration. Science. 2005;308:419–421. doi: 10.1126/science.1110359.
- 28. Despriet DD, Klaver CC, Witteman JC. Complement factor H polymorphism, complement activators, and risk of age-related macular degeneration. JAMA. 2006;296:301–309. doi: 10.1001/jama.296.3.301.
- 29. Yu Y, Bhangale TR, Fagerness J. Common variants near FRK/COL10A1 and VEGFA are associated with advanced age-related macular degeneration. Hum Mol Genet. 2011;20:3699–3709. doi: 10.1093/hmg/ddr270.
- 30. Peissig PL, Rasmussen LV, Berg RL. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc. 2012;19:225–234. doi: 10.1136/amiajnl-2011-000456.
- 31. Rasmussen LV, Peissig PL, McCarty CA. Development of an optical character recognition pipeline for handwritten form fields from an electronic health record. J Am Med Inform Assoc. 2012;19:e90–e95. doi: 10.1136/amiajnl-2011-000182.
- 32. Bustamante CD, De La Vega FM, Burchard EG. Genomics for the world. Nature. 2011;475:163–165. doi: 10.1038/475163a.
- 33. Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016;538:161–164. doi: 10.1038/538161a.
- 34. Collins FS, Varmus H. A new initiative on precision medicine. NEJM. 2015;372:793–795. doi: 10.1056/NEJMp1500523.
