To the Editor: A recent study1 highlighted how electronic health record (EHR)–linked biobanks can be used to investigate penetrance in a diverse population and concluded that most pathogenic variants have low penetrance, using data from 2 large biobanks, BioMe and UK Biobank.
Despite our enthusiasm for the innovative approach of this study,1 we recommend interpreting its results with caution. The high-throughput methods used to ascertain cases from EHRs may have significantly underestimated penetrance, which would provide false reassurance if applied clinically. Because genetic diagnoses are frequently under-documented in EHRs, a substantial fraction of affected individuals lack the appropriate International Classification of Diseases (ICD) codes used in this study to measure penetrance. A study2 using UK Biobank data found evidence of significant underdiagnosis of α1-antitrypsin deficiency (AATD). Only 9 of 140 individuals with a pathogenic genotype associated with AATD had documentation of a diagnosis in the EHR. Many of these individuals had phenotypic manifestations of AATD, indicating that they were affected by the disease.2
Even when diagnosed as having a genetic disease, a significant fraction of patients do not receive the relevant ICD code, either because a nonspecific code is used or because no appropriate code exists. A 2019 study3 found that relevant ICD codes were missing in up to 25% of cases in which a clinical diagnosis was recorded elsewhere in the record.
A second concern is that EHRs associated with biobanks in the US provide limited ability to assess lifetime risks because the clinical information is often fragmented across institutions and follow-up may be abbreviated. Although the authors acknowledged the potential for misclassification in their study, the true magnitude of this problem is unknown and, based on previous research, may be a leading cause of their primary finding of low penetrance for pathogenic variants. Although some of these risks are unavoidable when using EHR-based high-throughput methods, there is substantial room for improvement by working toward higher-fidelity phenotyping that is sensitive to both the disease label and phenotypic manifestations of the disease, as well as by quantifying the degree of ascertainment bias. With these improvements, penetrance estimates will better reflect the biological significance of pathogenic variants and avoid underestimation due to missing or incomplete documentation in EHRs.
Footnotes
Conflict of Interest Disclosures: Dr Bastarache reported receiving personal fees from Galatea Bio Inc and having a patent for Nashville Biosciences with royalties paid. No other disclosures were reported.
References
- 1. Forrest IS, Chaudhary K, Vy HMT, et al. Population-based penetrance of deleterious clinical variants. JAMA. 2022;327(4):350–359. doi: 10.1001/jama.2021.23686
- 2. Nakanishi T, Forgetta V, Handa T, et al. The undiagnosed disease burden associated with alpha-1 antitrypsin deficiency genotypes. Eur Respir J. 2020;56(6):2001441. doi: 10.1183/13993003.01441-2020
- 3. Bastarache L, Hughey JJ, Goldstein JA, et al. Improving the phenotype risk score as a scalable approach to identifying patients with mendelian disease. J Am Med Inform Assoc. 2019;26(12):1437–1447. doi: 10.1093/jamia/ocz179
