Replication of COPD susceptibility loci in a large Chinese elderly population using a validated, multi-setting EHR phenotype

Haonan Pan; Peng Wu; Kunyan Sun; Zhixin Xie; Ziyu Qiu; Qiaoshi Zhang; Zijian Tian; Xiangqing Hou; Shiteng Gao; Ying Chen; Xiaozhou Zhou; Yao Cheng; Jian Shao; Benrui Wu; Qian Li; Wanqing Dong; Anjie Peng; Yuxuan Du; Ying Pan; Kaixin Zhou; Tian Xie

2026 Jan 5;26:47. doi: 10.1186/s12890-025-04098-7

PMCID: PMC12870439 PMID: 41491456

Abstract

Background

Validated frameworks for defining chronic obstructive pulmonary disease (COPD) using real-world electronic health record (EHR) data in non-European populations and across multiple healthcare settings remain limited. This study aimed to develop and validate an EHR-based phenotype for COPD in a large Chinese elderly cohort and to assess its utility by replicating known COPD loci.

Methods

We analyzed EHR and genetic data from over 130,000 adults enrolled in the Kunshan Aging Research with E-health (KARE) cohort. An iterative, validation-driven process was used to develop the algorithm for identifying COPD cases using EHR data from primary care, hospital outpatient and inpatient services, and disease registries. Positive predictive value (PPV) was assessed via independent chart review by two respiratory physicians. We then tested associations between COPD and five established COPD susceptibility loci using logistic regression adjusted for age, sex, ever-smoking history, smoking duration, and genetic principal components.

Results

The final algorithm achieved a PPV of 88.3%. We identified 4,944 COPD cases using this algorithm and selected 86,561 controls. Four COPD loci were successfully replicated: FAM13A (OR = 1.143, P = 6.53 × 10^−7), AGER (OR = 1.130, P = 1.24 × 10^−4), DSP (OR = 1.066, P = 0.014), and TGFB2 (OR = 1.052, P = 0.046).

Conclusion

We present a validated, multi-setting EHR-based COPD definition that replicates known genetic associations in a large Chinese elderly population. These results support the integration of biorepositories and real-world clinical data as a scalable strategy to enhance population diversity of COPD genetic research.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12890-025-04098-7.

Keywords: Chronic obstructive pulmonary disease, EHR, Genetic association, Chinese population, Phenotype

Introduction

Chronic obstructive pulmonary disease (COPD) is a leading cause of morbidity and mortality worldwide, posing a substantial and growing burden on healthcare systems, particularly among aging populations [1]. Despite its clinical importance, the complex etiology of COPD—driven by environmental exposures, genetic susceptibility, and their interactions—remains incompletely understood, especially in populations of non-European ancestry [2].

In recent years, routinely collected data from electronic health records (EHRs) have offered opportunities to unravel COPD’s complex etiology, particularly through linkage of EHR data with genomic information [3]. However, because EHR systems are primarily designed for routine clinical care rather than research purposes, COPD definitions based solely on diagnosis codes may be inaccurate [4]. For example, one validation study found that only 63.5% of COPD diagnoses identified using ICD-9-CM codes were confirmed by physician review [5].

To address these challenges, several studies have developed EHR-based algorithms to accurately identify COPD cases. Algorithms that combine diagnosis codes with additional clinical data—such as medication prescriptions and spirometry results—have shown improved accuracy [6, 7]. In addition, two prominent phenotyping platforms—the UK CALIBER [8] and the U.S. PheKB [9]—have established rigorous frameworks for developing, validating, and sharing reproducible EHR phenotype algorithms. A key methodological insight from these efforts is the implementation of an iterative phenotyping framework that combines algorithm development with physician validation in a block-randomized manner—an approach that balances computational efficiency with clinical accuracy [10].

However, the majority of existing COPD EHR algorithms have been developed in European populations, where clinical workflows, coding practices, and risk factor profiles differ substantially from those in China. For example, primary care spirometry is less frequently performed in China, and COPD risk is also strongly influenced by ambient air pollution and household biomass exposure in addition to heavy tobacco use [11, 12]. Such differences may attenuate the validity of established Western COPD algorithms when they are applied directly to Chinese EHR environments. Moreover, although several COPD definitions have been developed in Chinese cohorts [5, 13, 14], they have typically relied on hospital-based data alone. For instance, the China Kadoorie Biobank (CKB) validated COPD using inpatient ICD-10 records only [13] and therefore lacked outpatient and primary care diagnoses, which are essential for capturing the full clinical spectrum of COPD.

China bears a rapidly increasing COPD burden [15, 16] yet remains underrepresented in large-scale genomic studies [17]. Together, these gaps highlight an urgent need for robust, validated, multi-setting EHR phenotypes to support genome-informed studies of COPD in Chinese populations.

Therefore, in this study, we utilized EHR data from multiple healthcare settings within a large elderly Chinese cohort, applying an iterative phenotyping framework to (1) develop and validate an EHR-based COPD definition tailored to the Chinese healthcare context, and (2) evaluate the feasibility of this definition for genetic studies by assessing its ability to replicate associations with established COPD susceptibility loci.

Methods

Study design and data source

This was a case-control study conducted within the Kunshan Aging Research with E-health (KARE) cohort, using a phenotyping framework adapted from the UK CALIBER [8] and the U.S. PheKB [9] platforms (Fig. 1). KARE is a population-based, longitudinal study that follows over 130,000 elderly individuals who reside in Kunshan City, China [18], and integrates extensive EHR and genomic data.

Fig. 1 — Study design. COPD: Chronic obstructive pulmonary disease; PPV: positive predictive value. COPD case definition algorithm development and refinement included: (1) initial sampling and manual chart review of individuals with ICD-10 J41-J44 diagnoses to estimate the preliminary PPV; (2) using physician-reviewed cases as a training set to identify the most informative EHR features when PPV fell below the predefined threshold (≥ 85%); (3) iterative refinement of a feature-based algorithm, guided by statistical feature-importance analyses and clinical expertise, until PPV ≥ 85% was achieved in the training set; (4) validation of the refined algorithm using a second, independent sample to estimate the true PPV. If the PPV in the validation sample did not reach the predefined threshold, the validation sample was used as a new training set to further refine the algorithm, followed by another independent validation. This iterative training-validation cycle continued until PPV ≥ 85% was achieved in both the training and validation samples, at which point the algorithm was adopted as the final definition and used for downstream genetic analyses

The EHR data comprise longitudinal medical records for all KARE participants, derived from four healthcare settings within Kunshan City between January 2014 and August 2024: (1) primary care, delivered through community health centers, which serve as primary care hubs in China; (2) hospital outpatient services; (3) hospital inpatient services; and (4) a city-wide disease registry. These records contain both structured data, such as ICD-10 diagnostic codes, laboratory test results, spirometry, and prescription information, and unstructured data, including narrative chest computed tomography (CT) reports and clinical notes documenting medical history and symptoms.

As of June 2025, the KARE cohort has enrolled 135,787 participants, all of whom were included in the present analyses. The cohort continues to expand in both size and content, offering a valuable platform for genome-informed research in COPD and other chronic diseases.

Development of EHR-based COPD definition

The algorithm for identifying COPD cases from EHR data was developed and refined through an iterative, validation-driven process.

Initial case pool and chart review

We first identified an initial candidate pool of 32,047 individuals who had at least one ICD-10 code J41-J44 (J41: simple and mucopurulent chronic bronchitis; J42: unspecified chronic bronchitis; J43: emphysema; J44: other chronic obstructive pulmonary disease) recorded in any of the four healthcare settings. This ICD-based definition is widely used in prior EHR studies but is relatively non-specific [19].

From this pool, 400 individuals were randomly sampled and independently reviewed by two respiratory physicians, who were blinded to the algorithm design. Inter-rater agreement was assessed, and cases with discordant or uncertain classifications were further evaluated by a third senior respiratory physician. Physicians determined whether each individual represented a true COPD diagnosis based on comprehensive review of respiratory-related records across all care settings, including spirometry (when available), radiological examinations, medical history (risk factors and respiratory symptoms), and prescription records. These physician-confirmed diagnoses served as the gold standard (“verified cases”).

Feature selection and algorithm refinement

The initial algorithm defined COPD cases solely by the presence of ICD-10 codes J41–J44. The positive predictive value (PPV) of the algorithm was calculated as the proportion of reviewed cases confirmed as true COPD by physician review. If the PPV did not meet the predefined threshold (≥ 85%), the 400 reviewed cases from that iteration were treated as a training set for algorithm refinement. We constructed multivariable classification models such as logistic regression and decision tree models using a comprehensive set of candidate EHR predictors. These predictors included, but were not limited to: respiratory diagnoses (J40-J44, J45-J46) and their frequency and care setting, COPD-related medication prescriptions (e.g., bronchodilators), documented medical history of chronic bronchitis, emphysema, or COPD, chest CT findings suggestive of emphysema or chronic airway disease, symptom histories (cough, sputum production, dyspnea), spirometry records when available, and demographic variables such as sex. Smoking status was intentionally excluded from algorithm development and reserved for adjustment in downstream association analyses.
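The PPV check that gates each iteration can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the function names `ppv` and `meets_threshold` and the example counts are hypothetical, and only the 85% threshold and the 400-record sample size come from the text.

```python
PPV_THRESHOLD = 0.85  # predefined threshold stated in the text


def ppv(confirmed: int, reviewed: int) -> float:
    """Proportion of reviewed algorithm-identified cases confirmed by chart review."""
    if reviewed <= 0:
        raise ValueError("reviewed must be a positive count")
    if not 0 <= confirmed <= reviewed:
        raise ValueError("confirmed must lie between 0 and reviewed")
    return confirmed / reviewed


def meets_threshold(confirmed: int, reviewed: int,
                    threshold: float = PPV_THRESHOLD) -> bool:
    """True when the estimated PPV reaches the predefined threshold."""
    return ppv(confirmed, reviewed) >= threshold


# Hypothetical review outcomes for two 400-record samples.
print(meets_threshold(129, 400))  # low PPV: the sample becomes a training set
print(meets_threshold(353, 400))  # PPV above threshold: proceed to validation
```

When the check fails, the reviewed sample is reused as the training set for refinement, as described above.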

To identify the EHR features most informative for COPD classification, we evaluated predictor importance using multiple complementary approaches, including LASSO regression, random forest variable importance, and Johnson’s Relative Weights analysis. Because variable importance differed across methods, we combined these quantitative results with clinical judgment from additional respiratory experts (not involved in chart review) and evidence from prior studies to select features that were both statistically informative and clinically meaningful.

Algorithm refinement consisted of modifying rule-based criteria, for example by incorporating bronchodilator prescriptions as supporting evidence, combining symptom histories (cough, sputum, dyspnea), and requiring multiple diagnoses recorded in outpatient and primary-care settings. This trial-and-error process was repeated until the refined algorithm achieved the predefined PPV threshold (≥ 85%) in the training set.

NLP-based feature extraction

To identify EHR features from unstructured data, natural language processing (NLP) techniques were applied. Specifically, we used rule-based text mining to extract mentions of chronic cough and dyspnea symptoms, as well as histories of COPD-related diseases (chronic bronchitis, emphysema, and COPD) from the descriptive clinical notes. Similarly, narrative chest CT reports were processed using NLP to extract imaging descriptions suggestive of emphysema, chronic bronchitis, or COPD. The accuracy of NLP-derived features was assessed through iterative manual validation of randomly sampled raw text entries, with rule refinement performed until no further extraction errors were observed. NLP-derived variables were used as supporting evidence within a multi-component phenotyping algorithm rather than as standalone determinants of COPD status. All NLP tasks were performed in Python 3.9.7; further details are provided in the Supplementary Methods.
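As an illustration of what rule-based text mining of this kind can look like, the sketch below matches keyword patterns and skips mentions preceded by a negation cue in the same clause. It is a hypothetical example, not the study's rule set: the English keywords, the negation cues, and the function name `extract_features` are all assumptions made for illustration, and the actual rules are described in the Supplementary Methods.

```python
import re

# Hypothetical keyword rules (English stand-ins chosen for illustration).
FEATURE_PATTERNS = {
    "chronic_cough": re.compile(r"\bchronic cough\b", re.IGNORECASE),
    "dyspnea": re.compile(r"\b(?:dyspnea|shortness of breath)\b", re.IGNORECASE),
    "emphysema": re.compile(r"\bemphysema\b", re.IGNORECASE),
}

# A negation cue counts when no clause boundary (. ; ,) separates it
# from the keyword that follows.
NEGATION_BEFORE = re.compile(r"\b(?:no|not|denies|without)\b[^.;,]*$", re.IGNORECASE)


def extract_features(note: str) -> dict:
    """Return one boolean per feature: True if a non-negated mention exists."""
    found = {}
    for name, pattern in FEATURE_PATTERNS.items():
        found[name] = False
        for match in pattern.finditer(note):
            preceding_text = note[:match.start()]
            if not NEGATION_BEFORE.search(preceding_text):
                found[name] = True
                break
    return found


note = "Chronic cough for 2 years. Denies dyspnea. CT: no evidence of emphysema."
print(extract_features(note))
```

Rules like these are typically refined by inspecting sampled raw text, which mirrors the iterative manual validation described above.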

Algorithm application and validation

The refined algorithm was then applied to all KARE participants. To estimate the true PPV of the refined algorithm, a non-overlapping second random sample of 400 algorithm-identified cases was independently reviewed by two respiratory physicians. If the PPV in this independent validation sample failed to reach the predefined threshold (≥ 85%), the algorithm entered a new refinement cycle. In this case, the validation sample was treated as a new training set to further optimize the algorithm, followed by another non-overlapping validation sample drawn from algorithm-identified cases. This iterative training–validation process was repeated until the algorithm achieved PPV ≥ 85% in both the training and validation samples, at which point the algorithm was finalized.
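The control flow of this training-validation cycle can be sketched with a toy simulation. Everything concrete here is an assumption made for illustration: the synthetic population, the stored `true_copd` field standing in for physician chart review, and the idea that "refinement" simply raises the number of indicators required. Only the sample size of 400, the 85% threshold, and the rule that a failed validation sample becomes the next training set come from the text.

```python
import random


def ppv_on(sample, algorithm):
    """PPV among the records in `sample` that `algorithm` flags as cases."""
    flagged = [r for r in sample if algorithm(r)]
    if not flagged:
        return 0.0
    return sum(r["true_copd"] for r in flagged) / len(flagged)


def make_algorithm(min_indicators):
    """Toy rule: flag a record with at least `min_indicators` indicators."""
    return lambda r: r["n_indicators"] >= min_indicators


def develop(population, sample_size=400, threshold=0.85, max_level=5, seed=0):
    rng = random.Random(seed)
    reviewed_ids = set()
    level = 1
    algorithm = make_algorithm(level)

    def draw_sample():
        # Non-overlapping random sample of algorithm-identified cases.
        pool = [r for r in population
                if algorithm(r) and r["id"] not in reviewed_ids]
        sample = rng.sample(pool, min(sample_size, len(pool)))
        reviewed_ids.update(r["id"] for r in sample)
        return sample

    training = draw_sample()
    while True:
        # Refine until the training-set PPV reaches the threshold.
        while ppv_on(training, algorithm) < threshold:
            if level >= max_level:
                raise RuntimeError("no rule reached the PPV threshold")
            level += 1
            algorithm = make_algorithm(level)
        validation = draw_sample()
        validation_ppv = ppv_on(validation, algorithm)
        if validation_ppv >= threshold:
            return level, validation_ppv
        training = validation  # failed validation becomes the new training set


# Synthetic population; true_copd plays the role of the chart-review verdict.
population = [
    {"id": i, "n_indicators": i % 6, "true_copd": (i % 6) >= 4}
    for i in range(6000)
]
final_level, final_ppv = develop(population)
print(final_level, final_ppv)
```

Tracking reviewed identifiers is what keeps each validation sample independent of earlier review rounds.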

This validation framework constitutes internal validation, as both algorithm development and validation were conducted within the same EHR dataset. Because chart review was restricted to algorithm-identified cases, validation focused on estimating PPV rather than sensitivity, consistent with many other EHR studies [5, 13, 14].

Selection of controls

To replicate known COPD susceptibility loci, we applied a strict rule-based algorithm to select controls. First, individuals with any prior diagnoses of chronic bronchitis, emphysema, COPD, asthma, or bronchiectasis (ICD-10 codes: J41–J47) were excluded. To further ensure the absence of subclinical airway disease, individuals were excluded if they met any of the following criteria:

Clinical notes or chest CT reports suggesting chronic bronchitis, emphysema, or COPD;
Any spirometry record showing forced expiratory volume in 1 second (FEV1)/forced vital capacity (FVC) < 0.7;
Prescription records for bronchodilator medications, including short-acting β2-agonists (SABA), long-acting β2-agonists (LABA), short-acting muscarinic antagonists (SAMA), long-acting muscarinic antagonists (LAMA), or theophylline.

These stringent exclusion criteria ensure that our control group comprises individuals free from both diagnosed and undiagnosed airway disease, thereby minimizing misclassification and strengthening the reliability of genetic association analyses.
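The exclusion rules above translate naturally into a single eligibility predicate. The sketch below is a hypothetical rendering, not the authors' implementation: the `Participant` fields and the function name `is_eligible_control` are invented for illustration, and ICD-10 codes are assumed to be stored as three-character categories.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# ICD-10 categories J41 to J47, as listed in the text.
AIRWAY_DISEASE_CODES = {"J41", "J42", "J43", "J44", "J45", "J46", "J47"}


@dataclass
class Participant:
    icd10_categories: Set[str] = field(default_factory=set)
    text_suggests_airway_disease: bool = False  # from clinical notes or chest CT
    lowest_fev1_fvc: Optional[float] = None     # None when no spirometry exists
    bronchodilator_prescribed: bool = False     # SABA, LABA, SAMA, LAMA, theophylline


def is_eligible_control(p: Participant) -> bool:
    """Apply each exclusion criterion in turn; any hit rules the person out."""
    if p.icd10_categories & AIRWAY_DISEASE_CODES:
        return False
    if p.text_suggests_airway_disease:
        return False
    if p.lowest_fev1_fvc is not None and p.lowest_fev1_fvc < 0.7:
        return False
    if p.bronchodilator_prescribed:
        return False
    return True


print(is_eligible_control(Participant(icd10_categories={"I10"})))
print(is_eligible_control(Participant(lowest_fev1_fvc=0.65)))
```

Using the lowest recorded ratio reflects the wording "any spirometry record": a single value below 0.7 is enough to exclude.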

Selection of known COPD susceptibility loci and genotyping

We selected five well-established COPD susceptibility genes (HHIP, FAM13A, DSP, AGER, and TGFB2) based on the framework summarized by Werder et al. [20]. In their review, these genes were identified as biologically credible COPD genes according to three criteria: (1) genome-wide significance in both lung function and COPD across multiple GWAS; (2) strong functional evidence, including gene expression in lung tissue; and (3) relevance to developmental biology, consistent with the developmental origins of COPD risk. From these five loci, Werder et al. compiled 34 single-nucleotide polymorphisms (SNPs) that had been previously associated with COPD and/or quantitative lung function traits. When multiple SNPs were reported within a single locus, we prioritized independent SNPs (linkage disequilibrium (LD) r² < 0.1 in the 1000 Genomes East Asian population [21]) that showed the strongest association signals in Biobank Japan (BBJ) [22]. When selecting among candidate variants, we considered both the association strengths observed in BBJ [22] and the broader body of literature supporting locus-level COPD associations. For example, although the DSP variant reported in BBJ did not reach genome-wide significance and showed a modest effect size, DSP remains a well-established COPD locus with strong prior association and functional evidence. We therefore retained the DSP SNP to enable locus-level replication in our population. Ultimately, six independent candidate SNPs were selected for replication: rs13141641 (HHIP), rs2609260 (FAM13A), rs2076295 (DSP), rs2070600 (AGER), and rs796395 and rs993925 (TGFB2). Detailed information about these SNPs is provided in Table 1.

Table 1.

Details and power estimates for six SNPs at five known COPD loci

Gene	SNP	CHR	POS (build 37)	EA	EAF	Beta in BBJ	P value in BBJ	Needed cases
HHIP	rs13141641	4	145,506,456	T	0.674	0.059	0.015	6002
FAM13A	rs2609260	4	89,836,819	T	0.608	0.129	1.13E-07	1041
DSP	rs2076295	6	7,563,232	T	0.529	0.028	0.233	38,653
AGER	rs2070600	6	32,151,443	C	0.781	0.146	1.09E-05	1128
TGFB2	rs796395	1	218,681,971	A	0.221	0.097	0.002	2655
TGFB2	rs993925	1	218,860,068	C	0.536	0.051	0.029	7275


BBJ Biobank Japan, CHR Chromosome, EA Effect allele, EAF Effect allele frequency, obtained from 1000G East-Asian population, POS Position. For TGFB2, LD r² between rs796395 and rs993925 is 0.001 in 1000G East-Asian population

These six SNPs were then extracted from KARE genetic data using PLINK 1.9 for subsequent genetic association analyses with the validated COPD phenotype. Genotyping in the KARE cohort was performed using the CAS array, a custom-designed Axiom genotyping array optimized for biobanking in the Chinese population [23]. The genotype data underwent rigorous quality control (QC), including the removal of duplicated or monomorphic variants, non-SNP markers, and SNPs with low call rates (< 98%) or deviation from Hardy-Weinberg equilibrium (p < 1 × 10^−6). Samples were excluded for low call rates (< 90%), evidence of contamination or duplication (via heterozygosity and identity-by-descent metrics), sex mismatches, or population outliers identified through principal component analysis. The samples were then imputed using the NyuWa Genome resource, a Chinese population reference panel [24]. Imputed SNPs were filtered by imputation information score > 0.3 and minor allele frequency > 0.01, resulting in 100,028 participants and approximately 6.6 million high-quality SNPs.
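The variant-level thresholds in this paragraph can be written as two small predicates. This is a sketch under stated assumptions, not the pipeline used in the study (which relied on PLINK 1.9 and imputation software): the `VariantStats` type and both function names are hypothetical, and only the numeric cutoffs are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    call_rate: float        # fraction of samples with a genotype call
    hwe_p: float            # Hardy-Weinberg equilibrium test p-value
    is_monomorphic: bool = False


def passes_genotype_qc(v: VariantStats) -> bool:
    """Keep genotyped variants with call rate >= 98% and HWE p >= 1e-6."""
    return (not v.is_monomorphic) and v.call_rate >= 0.98 and v.hwe_p >= 1e-6


def passes_imputation_filter(info_score: float, maf: float) -> bool:
    """Keep imputed variants with info score > 0.3 and MAF > 0.01."""
    return info_score > 0.3 and maf > 0.01


print(passes_genotype_qc(VariantStats(call_rate=0.995, hwe_p=0.2)))
print(passes_imputation_filter(info_score=0.25, maf=0.05))
```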

Statistical analysis

Descriptive statistics were used to summarize participant characteristics. For genetic association analyses, logistic regression was used to evaluate the associations between the validated EHR-based COPD phenotype and the six selected SNPs under an additive genetic model. Analyses were adjusted for age, sex, whether participants had ever smoked, years since smoking initiation, and the first five genetic principal components to control for population stratification. BMI was not included as a covariate because it may have a bidirectional association with COPD [25, 26], and conditioning on BMI could introduce collider bias [27]; nevertheless, a sensitivity analysis additionally adjusting for BMI was performed to assess the robustness of the results. Smoking-related variables with missing data (< 2%) were imputed using multivariate imputation based on age and sex. All analyses were performed in R (version 4.1.0). Because this study aimed to replicate a small, biologically informed set of variants located within five loci, we used a nominal p-value threshold of p < 0.05 for replication, consistent with previous replication studies in EHR-linked biobanks [28]. To enhance statistical transparency, we additionally report Bonferroni-corrected significance for six independent tests (p < 0.0083). To evaluate the potential advantages of the newly defined phenotype, we compared it with a conventional, code-only COPD definition based on ICD-10 codes J41-J44, which is widely used but relatively non-specific.

We performed power calculations [29] to assess our ability to replicate each association at 80% power. Based on previously reported effect sizes for these six SNPs in the BBJ COPD GWAS (β = 0.028 to 0.146, odds ratios (ORs) = 1.03 to 1.16), we estimated the required case numbers using an additive logistic regression framework (α = 0.05, power = 0.80), allele frequencies in East-Asian populations, a COPD prevalence of 13.6% in China [16], a PPV of 85%, and 60,000 controls. The necessary case counts ranged from approximately 1,041 to 38,653 depending on SNP allele frequency and effect size (Table 1). More details on power calculation are provided in the Supplementary Methods.
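The stated odds-ratio range follows from exponentiating the BBJ log-odds coefficients in Table 1. The short sketch below shows that conversion; the dictionary name `BBJ_BETAS` and the function name `odds_ratio` are illustrative choices, while the beta values are copied from Table 1.

```python
import math

# Per-allele log-odds (beta) reported in BBJ, copied from Table 1.
BBJ_BETAS = {
    "rs13141641 (HHIP)": 0.059,
    "rs2609260 (FAM13A)": 0.129,
    "rs2076295 (DSP)": 0.028,
    "rs2070600 (AGER)": 0.146,
    "rs796395 (TGFB2)": 0.097,
    "rs993925 (TGFB2)": 0.051,
}


def odds_ratio(beta: float) -> float:
    """Convert a logistic-regression coefficient to an odds ratio."""
    return math.exp(beta)


for snp, beta in BBJ_BETAS.items():
    print(f"{snp}: beta = {beta:.3f}, OR = {odds_ratio(beta):.3f}")
```

The smallest and largest betas (0.028 and 0.146) give the endpoints of the OR range quoted above, which is why the DSP variant requires by far the largest number of cases.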

Results

Final algorithm for identifying COPD cases

After iterative refinement and validation, the finalized algorithm (illustrated in Fig. 2) achieved the targeted PPV by defining COPD cases with a two-tiered approach:

(1) Confirmed by spirometry: individuals with recorded pre- and post-bronchodilator FEV1/FVC ratios < 0.7;

(2) Otherwise, individuals without post-bronchodilator spirometry were classified as COPD cases if they either had an ICD-10 code J44 recorded in inpatient or disease-registry records, or met at least three of the following five risk indicators: (a) pre-bronchodilator FEV1/FVC < 0.7; (b) chronic cough or sputum ≥ 6 months, or documented dyspnea; (c) chest CT evidence of emphysema; (d) use of COPD-specific bronchodilators, or bronchodilators prescribed with asthma excluded; (e) ICD-10 J44 recorded ≥ 2 times in outpatient or primary care settings.
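The two-tiered rule can be expressed as a small decision function. The sketch below is one possible reading of the description, not the authors' code: the `Record` fields, the returned labels, and the function names are hypothetical. In particular, it assumes that individuals who have post-bronchodilator spirometry but do not meet the spirometry criterion are not evaluated under the second tier, which is how the wording "individuals without post-bronchodilator spirometry" is interpreted here.

```python
from dataclasses import dataclass
from typing import Optional

RATIO_CUTOFF = 0.7


@dataclass
class Record:
    pre_bd_ratio: Optional[float] = None    # pre-bronchodilator FEV1/FVC
    post_bd_ratio: Optional[float] = None   # post-bronchodilator FEV1/FVC
    j44_inpatient_or_registry: bool = False
    j44_outpatient_or_primary_count: int = 0
    chronic_cough_or_sputum_6m: bool = False
    dyspnea: bool = False
    ct_emphysema: bool = False
    copd_specific_bronchodilator: bool = False
    bronchodilator_with_asthma_excluded: bool = False


def count_risk_indicators(r: Record) -> int:
    """Count how many of the five indicators (a) to (e) are met."""
    indicators = [
        r.pre_bd_ratio is not None and r.pre_bd_ratio < RATIO_CUTOFF,           # (a)
        r.chronic_cough_or_sputum_6m or r.dyspnea,                               # (b)
        r.ct_emphysema,                                                          # (c)
        r.copd_specific_bronchodilator or r.bronchodilator_with_asthma_excluded,  # (d)
        r.j44_outpatient_or_primary_count >= 2,                                  # (e)
    ]
    return sum(indicators)


def classify(r: Record) -> Optional[str]:
    """Return the tier that makes the record a COPD case, or None."""
    if r.post_bd_ratio is not None:
        # Tier 1: both pre- and post-bronchodilator ratios below 0.7.
        confirmed = (r.pre_bd_ratio is not None
                     and r.pre_bd_ratio < RATIO_CUTOFF
                     and r.post_bd_ratio < RATIO_CUTOFF)
        return "spirometry" if confirmed else None
    # Tier 2: applies only without post-bronchodilator spirometry (assumed reading).
    if r.j44_inpatient_or_registry:
        return "inpatient_or_registry"
    if count_risk_indicators(r) >= 3:
        return "risk_indicators"
    return None


print(classify(Record(pre_bd_ratio=0.62, post_bd_ratio=0.65)))
print(classify(Record(pre_bd_ratio=0.66, ct_emphysema=True, dyspnea=True)))
```

Returning the tier label rather than a bare boolean makes it straightforward to tabulate how many cases each route contributes.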

This multimodal algorithm integrates data across all healthcare settings and demonstrated robust performance, achieving a PPV of 88.3% in the validation sample based on manual chart review. The inter-rater agreement between the two respiratory physicians was over 85% in each manual chart review.

Using this algorithm, 4,944 COPD cases were identified from all participants in the KARE cohort. Among them, only 394 individuals (8.0%) were classified directly based on spirometry criteria, 2,221 (45.0%) were identified through inpatient or disease registry records, and 2,329 (47.0%) were identified using COPD risk indicators.

Among these cases, 2,822 (57.1%) had COPD-related diagnoses (ICD-10 codes J41-J44) recorded in inpatient or disease registry data, 1,944 (39.3%) had relevant diagnoses only from hospital outpatient or primary care records, and 178 (3.6%) had no recorded COPD-related diagnoses, suggesting they may have been underdiagnosed in routine clinical care.

In contrast, the initial broad definition—classifying all individuals with ICD-10 codes J41-J44 as COPD—identified 32,047 individuals but yielded a much lower PPV of only 32.3%.

Characteristics of identified COPD cases and selected controls

The basic characteristics of the overall study population, defined COPD cases, and selected controls are shown in Table 2. Among the 135,787 participants, 4,944 individuals were identified as COPD cases, and 86,561 were classified as controls after applying strict criteria. A further 44,282 participants, who had been diagnosed with lower respiratory tract diseases (ICD-10 codes J41–J47) or exhibited COPD-related features but did not meet the COPD case definition, were excluded from both the COPD and control groups.

Table 2.

Characteristics of KARE participants, identified COPD cases, and selected controls

Characteristics	KARE participants	Identified COPD cases	Selected controls
N	135,787	4944	86,561
Age (years)	67.49 ± 9.60	75.67 ± 7.84	65.89 ± 9.56
COPD-related diagnoses (J41 ~ J44)	32,047 (23.6%)	4,766 (96.4%)	-
Disease registry	3061 (2.3%)	2211 (44.7%)	-
Hospital inpatient (excluding registry cases)	1082 (0.8%)	611 (12.4%)	-
Hospital outpatient (excluding the above settings)	6246 (4.6%)	1316 (26.6%)	-
Primary care only (no other healthcare settings involved)	21,658 (15.9%)	628 (12.7%)	-
No COPD-related diagnosis	103,740 (76.4%)	178 (3.6%)	86,561 (100%)
Sex
Male	59,329 (43.7%)	3393 (68.6%)	38,368 (40.8%)
Female	76,456 (56.3%)	1551 (31.4%)	55,767 (59.2%)
Smoking status
Never smoked	105,249 (79.0%)	3398 (68.9%)	68,321 (78.9%)
Current smoker	24,399 (18.3%)	1187 (24.1%)	14,238 (16.4%)
Former smoker	3524 (2.6%)	344 (7.0%)	2124 (2.5%)
Unknown	2615 (1.9%)	15 (0.3%)	1878 (2.2%)
Smoking duration (years)	12.16 ± 19.96	26.54 ± 25.56	9.98 ± 18.11
Chronic cough or sputum ≥ 6 months	17,256 (12.7%)	3782 (76.5%)	1448 (1.7%)
Dyspnea	3329 (2.5%)	648 (13.1%)	864 (1.0%)
Prescription record of bronchodilators	19,402 (14.3%)	4451 (90.0%)	-
COPD-specific bronchodilators	2088 (1.5%)	1349 (27.3%)	-
General bronchodilators	18,953 (14.0%)	4302 (87.0%)	-
Chest CT evidence of emphysema	11,703 (8.6%)	3328 (67.3%)	-
Have Pre-BD FEV1/FVC	5257 (3.9%)	1302 (26.3%)	1190 (1.9%)
Pre-BD FEV1/FVC < 0.7	3103 (2.3%)	1302 (26.3%)	-
Have Post-BD FEV1/FVC	849 (0.6%)	394 (7.9%)	53 (0.1%)
Post-BD FEV1/FVC < 0.7	401 (0.2%)	394 (7.9%)	-


Continuous variables are presented as mean ± SD, and categorical variables as n (%). In the selected control group, individuals with COPD-specific bronchodilator usage, CT evidence of emphysema, or spirometry showing FEV1/FVC < 0.7 were excluded, resulting in a 0% prevalence of these features. COPD-specific bronchodilators refer to long-acting muscarinic antagonists (LAMA) or combination therapies containing LAMA. BD: bronchodilator

Compared to controls, the COPD group was older (mean age 75.67 vs. 65.89 years) and comprised a higher proportion of males (68.6% vs. 40.8%) as well as current and former smokers. Because individuals with COPD-related characteristics were excluded from the control group, the prevalence of some features (CT-detected emphysema, use of bronchodilator medications, and an FEV1/FVC ratio less than 0.7) was 0% in controls. In contrast, chronic cough or sputum and dyspnea were not part of the exclusion rules, yet remained rare in controls (chronic cough or sputum: 1.7%; dyspnea: 1.0%), suggesting their potential value as indicators of the disease.

Replication of COPD susceptibility loci

Among the identified COPD cases and selected controls with genotype data, 3,965 cases and 59,284 controls were available for analysis. Four SNPs across four COPD susceptibility loci (FAM13A, AGER, DSP, and TGFB2) showed significant associations with COPD at the nominal threshold (p < 0.05). After Bonferroni correction for six independent tests (p < 0.0083), FAM13A and AGER remained statistically significant, while DSP and TGFB2 showed supportive but sub-threshold associations (Fig. 3). The HHIP locus demonstrated a borderline significant association (p = 0.056, Fig. 3). In addition, the effect sizes we observed were generally consistent with those reported in BBJ (Table 1). Given that only 3,965 cases were analyzed—fewer than the 6,002 cases indicated by our power calculations for HHIP (Table 1)—our study may have been underpowered to detect associations with smaller effect sizes. Allele frequencies and unadjusted results are provided in Supplementary Table 1. Additional adjustment for BMI in sensitivity analyses did not materially change the effect estimates (Supplementary Table 2), indicating that our main results are robust to BMI adjustment.
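The two significance tiers used here can be reproduced from the p-values reported in the text. The sketch below is illustrative: the dictionary and function names are hypothetical, the p-values are those stated in the Abstract and in this paragraph, and only one TGFB2 p-value appears because the text reports a single value for that locus.

```python
ALPHA = 0.05
N_TESTS = 6                       # six independent SNP tests
BONFERRONI = ALPHA / N_TESTS      # about 0.0083, as stated in the Methods

# Locus-level p-values as reported in the text.
REPORTED_P = {
    "FAM13A": 6.53e-7,
    "AGER": 1.24e-4,
    "DSP": 0.014,
    "TGFB2": 0.046,
    "HHIP": 0.056,
}


def replication_status(p: float) -> str:
    """Classify a p-value against the Bonferroni and nominal thresholds."""
    if p < BONFERRONI:
        return "Bonferroni-significant"
    if p < ALPHA:
        return "nominally significant"
    return "not replicated"


for locus, p in REPORTED_P.items():
    print(f"{locus}: p = {p:g} -> {replication_status(p)}")
```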

Fig. 3 — Association of COPD with known genetic loci using the final multimodal algorithm (N = 3965 cases and 59284 controls), and using the initial definition (ICD-10 codes J41–J44, N = 24914 cases and 59284 controls). EA Effect allele, OR Odds ratio

Importantly, when compared to the initial broad definition (ICD-10 codes J41-J44), the refined algorithm not only replicated a greater number of loci but also yielded stronger effect sizes and smaller P values (Fig. 3, Supplementary Table 1). These findings underscore that improving the accuracy of COPD phenotype definition enhances the power and precision of genetic association studies.

Discussion

In this large Chinese elderly cohort, we developed and validated a multimodal EHR-based definition of COPD with high accuracy (PPV = 88.3%), integrating ICD-10 codes, spirometry, clinical symptoms, medications, and chest CT data from multiple healthcare settings. Using this definition and strictly selected controls, we replicated four well-established COPD susceptibility loci: FAM13A, AGER, DSP, and TGFB2. These findings demonstrate the utility of validated EHR-based COPD phenotypes in genetic epidemiology and support their application in precision medicine for COPD.

Our study provides a clear blueprint for developing an accurate EHR-based COPD phenotype. Through an iterative process of algorithm development, independent physician chart review, and refinement, we achieved a COPD algorithm with a PPV of 88.3%, which is higher than the typical range of 63.5% to 85% reported in most previous studies [5, 13, 14, 30–32]. We further systematically evaluated the performance of previously published COPD algorithms [5–7, 13, 14, 30, 31, 33–44] in the KARE cohort. Most existing COPD algorithms showed low PPV (32.25%-72.41%) in the KARE cohort. Although a few algorithms reached relatively high PPV (80%-85%) through machine-learning approaches or very restrictive criteria, they identified substantially fewer COPD cases (< 2000) in the KARE cohort (Supplementary Table 3). Consequently, no existing COPD algorithm simultaneously achieved high accuracy and adequate sample size, underscoring the necessity of an iterative, multi-setting approach tailored to the Chinese healthcare system. Moreover, this methodology not only advances EHR-based identification of COPD but also offers a transferable framework for refining phenotypes for other complex diseases.

One key insight from our validation process was the variability in accuracy of ICD-10 diagnostic codes across healthcare settings. We observed that disease registry and hospital inpatient records yielded the highest coding accuracy (with PPV > 85%), followed by hospital outpatient and primary care (Supplementary Table 4). This hierarchy reflects real-world clinical practice in China, where primary care centers often provide basic consultations and lack spirometry equipment, limiting their ability to make accurate COPD diagnoses. Consequently, reliance on ICD-10 codes alone from outpatient or primary care records may misclassify COPD status. To address this challenge, our algorithm combines multimodal EHR data such as ICD-10 codes, spirometry, medication use, symptoms, and chest CT findings to define COPD cases in hospital outpatient or primary care settings, yielding considerable improvements in case identification accuracy compared to code-only methods (Supplementary Table 4). Importantly, stratified validation demonstrated that the algorithm maintained high predictive accuracy even in the absence of post-bronchodilator spirometry (PPV 87.5%), while achieving the highest accuracy among spirometry-confirmed cases (PPV 95.0%) (Supplementary Table 4). These findings indicate that our multi-component EHR algorithm can maintain a high PPV despite the lack of spirometry. This strategy also aligns with recent studies demonstrating that combining imaging and symptom data can identify previously unrecognized COPD patients who are at elevated risk of adverse respiratory outcomes [45].

Beyond improving clinical case identification, this phenotyping strategy has important implications for downstream genetic analyses. Using our multi-setting, validated COPD definition, we successfully replicated genetic associations at known COPD susceptibility loci, including FAM13A, AGER, DSP, and TGFB2. These findings are consistent with Ritchie et al.’s demonstration that validated EHR-based phenotypes can robustly replicate known genotype-phenotype associations across multiple diseases [28]. Compared to the initial algorithm based solely on ICD‑10 codes (J41-J44), our refined multimodal approach not only replicated more loci but also produced stronger effect sizes and more significant p-values. The ICD-only definition represents a high-sensitivity but low-specificity approach, capturing a broad pool of individuals with chronic respiratory diagnoses but introducing substantial outcome misclassification. In contrast, our multimodal algorithm prioritized high PPV (88.3%) through iterative clinician validation and integration of complementary EHR features, thereby enriching for individuals more likely to have established COPD. This improvement in phenotype accuracy reduced dilution of genetic effect estimates and enhanced statistical power, consistent with prior work in genetic epidemiology [46]. Accordingly, the resulting case definition is not intended to estimate COPD prevalence, but to support analyses that require high-specificity case identification, such as genetic association studies. Although this stringency likely reduces sensitivity, such a trade-off is appropriate for genetic association studies, where minimizing false-positive case classification is critical.

It is worth noting that genetic replication alone is not a rigorous or sufficient validation strategy. Our primary validation relied on clinician-confirmed chart review, which directly assessed case accuracy and yielded an estimate of PPV. The genetic findings therefore complement this clinical validation and provide orthogonal evidence demonstrating that this definition can be feasibly applied in genetic studies within a Chinese population. A stronger test of generalizability would require external validation in an independent EHR-linked biobank. Such resources are currently scarce in China, particularly those with both multi-setting EHR data and linked genetic information. As national EHR-integrated biobanks evolve, validation in independent Chinese or East Asian populations will be an essential next step.

An important strength of this study is that it evaluates EHR-based COPD phenotyping and genetic replication in a large Chinese elderly population, addressing a major gap in a literature dominated by European-ancestry studies [47]. Notably, the Chinese population in our study differs from European-ancestry populations in two respects. First, environmental risk profiles differ substantially. In our cohort, only 31.1% of individuals had a history of cigarette smoking, which is lower than that reported for European-ancestry cohorts [48]. By contrast, environmental exposures such as ambient air pollution and household biomass or solid-fuel use are more prevalent COPD risk factors in China [11, 12], which is likely to generate different patterns of gene-environment interaction. Second, East Asian populations exhibit distinct allele frequencies and linkage disequilibrium structures at known COPD loci, which may modify association strength compared with European-based studies. Despite these differences in environmental risk profiles and genetic architecture, we replicated multiple established COPD loci, supporting the robustness of this phenotype in a non-European ancestry setting and demonstrating the feasibility of extending EHR-based genetic studies to underrepresented populations.

Another strength of our study is its incorporation of multimodal EHR data across multiple healthcare settings. Our algorithm was designed to piece together health-related information from diverse data sources, including primary care, hospital outpatient and inpatient, and disease registry systems. This comprehensive, multi-setting algorithm can capture a more complete spectrum of COPD patients, addressing known gaps where health events documented in one setting may not appear in others [49]. Notably, approximately 40% of COPD cases in our study were identified solely from primary care or outpatient data, highlighting the value of multi-setting data integration.
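The share of cases identified solely from a subset of settings can be computed with plain set operations. The sketch below is illustrative only: the setting names, the participant identifiers, and the helper `solely_from` are hypothetical and are not part of the study's pipeline.

```python
def solely_from(sources, settings):
    """Return IDs that appear in at least one of `settings` and in no
    source outside `settings`.

    sources:  dict mapping a setting name to a set of case IDs
    settings: iterable of setting names to treat as the subset of interest
    """
    settings = set(settings)
    inside = set().union(*(sources[name] for name in settings))
    outside = set().union(
        *(ids for name, ids in sources.items() if name not in settings)
    )
    return inside - outside


# Hypothetical case IDs per setting (NOT study data).
example_sources = {
    "primary_care": {1, 2, 3},
    "outpatient": {3, 4},
    "inpatient": {4, 5},
    "registry": {5, 6},
}

all_cases = set().union(*example_sources.values())
only_pc_or_op = solely_from(example_sources, ["primary_care", "outpatient"])
share = len(only_pc_or_op) / len(all_cases)
print(sorted(only_pc_or_op), share)
```

Cases in this subset would be missed entirely by a definition restricted to inpatient and registry records, which is the gap that multi-setting integration is meant to close.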

The study also has limitations. First, some key clinical data, such as spirometry and imaging, were not available for all participants, potentially leading to under-detection of COPD cases, especially milder ones. To mitigate this, we incorporated multiple complementary EHR data sources, including medications, clinical notes, and CT imaging reports parsed using NLP, to capture a broader spectrum of diagnostic signals and reduce reliance on any single modality. Second, the genetic analysis was limited to previously reported loci and did not explore novel genetic variants. This was a deliberate choice, made so that established loci could serve as a benchmark for validating the accuracy of our phenotype algorithm. The validated algorithm, however, can be readily used in future studies to discover novel COPD-associated variants. Third, our validation was internal rather than external, because both algorithm development and validation were conducted within the same EHR dataset. Nonetheless, the core data types leveraged by our algorithm are widely available and structured similarly across major Chinese hospital information systems. Future studies will focus on external validation in other populations and healthcare environments to further evaluate and improve the algorithm's generalizability.

Conclusion

In conclusion, we present a validated EHR-based definition of COPD that enables the replication of known COPD susceptibility loci in a large Chinese elderly population. Our study supports the broader utility of real-world data for genetic studies and highlights the promise of EHR-integrated biobank research to advance precision medicine in chronic respiratory disease.

Supplementary Information

Supplementary Material 1 (DOCX, 190.8 KB).

Acknowledgements

We gratefully acknowledge the support of all participants and staff who have contributed to the study.

Abbreviations

COPD: Chronic Obstructive Pulmonary Disease
EHR: Electronic Health Records
KARE: Kunshan Aging Research with E-health
PPV: Positive Predictive Value
CKB: China Kadoorie Biobank
NLP: Natural Language Processing
CT: Computed Tomography
SABA: Short-acting β2-agonists
LABA: Long-acting β2-agonists
SAMA: Short-acting Muscarinic Antagonists
LAMA: Long-acting Muscarinic Antagonists
SNP: Single-nucleotide Polymorphism
BBJ: Biobank Japan
LD: Linkage Disequilibrium
QC: Quality Control
FEV1: Forced Expiratory Volume in 1 second
FVC: Forced Vital Capacity
BD: Bronchodilator
SD: Standard Deviation
EA: Effect Allele
CI: Confidence Interval
OR: Odds Ratio

Authors’ contributions

T.X. and K.Z. led the study design and provided overall scientific guidance. H.P. and P.W. developed the analytical framework and performed data analysis. Z.T. contributed to genetic data processing. K.S., Z.X., Z.Q., and Q.Z. conducted manual chart review of EHR. X.H., S.G., Y.C., X.Z., Y.C., J.S., B.W., Q.L., W.D., A.P., Y.D., and Y.P. were responsible for data collection and cleaning. H.P., P.W., T.X., and K.Z. drafted the manuscript. All authors reviewed the manuscript, contributed feedback, and approved the final version.

Funding

This study is supported by the National Natural Science Foundation of China (Grant No. 32500516), the Guangdong Provincial High-level Talent Program (Grant No. 2024QN11Y205), and the Young Scientists Program of Guangzhou Laboratory (Grant No. QNPG24-16).

Data availability

The datasets used and analysed during the current study are available on reasonable request to the corresponding author, subject to relevant ethical and data protection regulations.

Declarations

Ethics approval and consent to participate

The KARE study protocol was approved by the Independent Ethics Committee at the First People’s Hospital of Kunshan (IEC-C007-A07-V3.0). Written informed consent was obtained from all participants prior to their inclusion in the study. The study was conducted in accordance with the principles of the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Haonan Pan and Peng Wu contributed equally to this work.

Kaixin Zhou and Tian Xie jointly supervised this work.

Contributor Information

Kaixin Zhou, Email: zhou_kaixin@gzlab.ac.cn.

Tian Xie, Email: xie_tian01@gzlab.ac.cn.

References

1. de Oca MM, Perez-Padilla R, Celli B, Aaron SD, Wehrmeister FC, Amaral AFS, et al. The global burden of COPD: epidemiology and effect of prevention strategies. Lancet Respiratory Med. 2025. Cited 2025 Jul 24. 10.1016/S2213-2600(24)00339-4.
2. Cho MH, Hobbs BD, Silverman EK. Genetics of chronic obstructive pulmonary disease: Understanding the pathobiology and heterogeneity of a complex disorder. Lancet Respiratory Med. 2022;10:485–96. 10.1016/S2213-2600(21)00510-5.
3. Wei W-Q, Denny JC. Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. 2015;7:41. 10.1186/s13073-015-0166-y.
4. Whittaker H, Quint JK. Using routine health data for research: the devil is in the detail. Thorax. 2020;75:714–5. 10.1136/thoraxjnl-2020-214821.
5. Ho T-W, Ruan S-Y, Huang C-T, Tsai Y-J, Lai F, Yu C-J. Validity of ICD9-CM codes to diagnose chronic obstructive pulmonary disease from National health insurance claim data in Taiwan. Int J Chron Obstruct Pulmon Dis. 2018;13:3055–63. 10.2147/COPD.S174265.
6. Lee TM, Tu K, Wing LL, Gershon AS. Identifying individuals with physician-diagnosed chronic obstructive pulmonary disease in primary care electronic medical records: a retrospective chart abstraction study. NPJ Prim Care Respir Med. 2017;27:34. 10.1038/s41533-017-0035-9.
7. Chu SH, Wan ES, Cho MH, Goryachev S, Gainer V, Linneman J, et al. An independently validated, portable algorithm for the rapid identification of COPD patients using electronic health records. Sci Rep. 2021;11:19959. 10.1038/s41598-021-98719-w.
8. Denaxas S, Gonzalez-Izquierdo A, Direk K, Fitzpatrick NK, Fatemifar G, Banerjee A, et al. UK phenomics platform for developing and validating electronic health record phenotypes: CALIBER. J Am Med Inform Assoc. 2019;26:1545–59. 10.1093/jamia/ocz105.
9. Kirby JC, Speltz P, Rasmussen LV, Basford M, Gottesman O, Peissig PL, et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inf Assoc. 2016;23:1046–52. 10.1093/jamia/ocv202.
10. Newton KM, Peissig PL, Kho AN, Bielinski SJ, Berg RL, Choudhary V, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inf Assoc. 2013;20:e147–54. 10.1136/amiajnl-2012-000896.
11. Wang C, Xu J, Yang L, Xu Y, Zhang X, Bai C, et al. Prevalence and risk factors of chronic obstructive pulmonary disease in China (the China pulmonary health [CPH] study): a National cross-sectional study. Lancet. 2018;391:1706–17. 10.1016/S0140-6736(18)30841-9.
12. Li J, Qin C, Lv J, Guo Y, Bian Z, Zhou W, et al. Solid fuel use and incident COPD in Chinese adults: findings from the China kadoorie biobank. Environ Health Perspect. 2019;127:57008. 10.1289/EHP2856.
13. Kurmi OP, Vaucher J, Xiao D, Holmes MV, Guo Y, Davis KJ, et al. Validity of COPD diagnoses reported through nationwide health insurance systems in the people’s Republic of China. Int J Chron Obstruct Pulmon Dis. 2016;11:419–30. 10.2147/COPD.S100736.
14. Kwok W, Tam TC, Sing C, Chan EW, Cheung C. Validation of diagnostic coding for chronic obstructive pulmonary disease in an electronic health record system in Hong Kong. Hong Kong Med J. 2024. Cited 2024 Nov 28. 10.12809/hkmj2210657.
15. Zhong N, Wang C, Yao W, Chen P, Kang J, Huang S, et al. Prevalence of chronic obstructive pulmonary disease in china: a large, population-based survey. Am J Respir Crit Care Med. 2007;176:753–60. 10.1164/rccm.200612-1749OC.
16. Fang L, Gao P, Bao H, Tang X, Wang B, Feng Y, et al. Chronic obstructive pulmonary disease in china: a nationwide prevalence study. Lancet Respiratory Med. 2018;6:421–30. 10.1016/S2213-2600(18)30103-6.
17. Tobin MD, Izquierdo AG. Improving ethnic diversity in respiratory genomics research. Eur Respiratory J. 2021;58. Cited 2025 Jun 10. 10.1183/13993003.01615-2021.
18. Xie T, Pan Y, Lu K, Wei Y, Chen F, Tian Z, et al. Cohort profile: Kunshan aging research with E-health (KARE). Int J Epidemiol. 2025;54:dyaf041. 10.1093/ije/dyaf041.
19. Sivakumaran S, Alsallakh MA, Lyons RA, Quint JK, Davies GA. Identifying COPD in routinely collected electronic health records: a systematic scoping review. ERJ Open Res. 2021;7:00167–2021. 10.1183/23120541.00167-2021.
20. Werder RB, Zhou X, Cho MH, Wilson AA. Breathing new life into the study of COPD with genes identified from genome-wide association studies. Eur Respir Rev. 2024;33:240019. 10.1183/16000617.0019-2024.
21. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. 10.1038/nature15393.
22. Sakaue S, Kanai M, Tanigawa Y, Karjalainen J, Kurki M, Koshiba S, et al. A cross-population atlas of genetic associations for 220 human phenotypes. Nat Genet. 2021;53:1415–24. 10.1038/s41588-021-00931-x.
23. Tian Z, Chen F, Wang J, Wu B, Shao J, Liu Z, et al. CAS array: design and assessment of a genotyping array for Chinese biobanking. Precis Clin Med. 2023;6:pbad002. 10.1093/pcmedi/pbad002.
24. Zhang P, Luo H, Li Y, Wang Y, Wang J, Zheng Y, et al. NyuWa genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep. 2021;37:110017. 10.1016/j.celrep.2021.110017.
25. Grigsby MR, Siddharthan T, Pollard SL, Chowdhury M, Rubinstein A, Miranda JJ, et al. Low body mass index is associated with higher odds of COPD and lower lung function in low- and middle-income countries. COPD. 2019;16:58–65. 10.1080/15412555.2019.1589443.
26. Prescott E, Almdal T, Mikkelsen KL, Tofteng CL, Vestbo J, Lange P. Prognostic value of weight change in chronic obstructive pulmonary disease: results from the Copenhagen City heart Study. Eur Respir J. 2002;20:539–44. 10.1183/09031936.02.00532002.
27. Munafò MR, Tilling K, Taylor AE, Evans DM, Davey Smith G. Collider scope: when selection bias can substantially influence observed associations. Int J Epidemiol. 2018;47:226–35. 10.1093/ije/dyx206.
28. Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, Pulley JM, et al. Robust replication of Genotype-Phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet. 2010;86:560–72. 10.1016/j.ajhg.2010.03.003.
29. Dupont WD, Plummer WD. Power and sample size calculations. A review and computer program. Control Clin Trials. 1990;11:116–28. 10.1016/0197-2456(90)90005-m.
30. Cooke CR, Joo MJ, Anderson SM, Lee TA, Udris EM, Johnson E, et al. The validity of using ICD-9 codes and pharmacy records to identify patients with chronic obstructive pulmonary disease. BMC Health Serv Res. 2011;11:37. 10.1186/1472-6963-11-37.
31. Quint JK, Müllerova H, DiSantostefano RL, Forbes H, Eaton S, Hurst JR, et al. Validation of chronic obstructive pulmonary disease recording in the clinical practice research datalink (CPRD-GOLD). BMJ Open. 2014;4:e005540. 10.1136/bmjopen-2014-005540.
32. Kadhim-Saleh A, Green M, Williamson T, Hunter D, Birtwhistle R. Validation of the diagnostic algorithms for 5 chronic conditions in the Canadian primary care Sentinel surveillance network (CPCSSN): A Kingston Practice-based research network (PBRN) report. J Am Board Family Med. 2013;26:159–67. 10.3122/jabfm.2013.02.120183.
33. Mapel DW, Frost FJ, Hurley JS, Petersen H, Roberts M, Marton JP, et al. An algorithm for the identification of undiagnosed COPD cases using administrative claims data. J Manag Care Pharm. 2006;12:457–65.
34. Macaulay D, Sun SX, Sorg RA, Yan SY, De G, Wu EQ, et al. Development and validation of a claims-based prediction model for COPD severity. Respir Med. 2013;107:1568–77. 10.1016/j.rmed.2013.05.012.
35. Gini R, Francesconi P, Mazzaglia G, Cricelli I, Pasqua A, Gallina P, et al. Chronic disease prevalence from Italian administrative databases in the VALORE project: a validation through comparison of population estimates with general practice databases and National survey. BMC Public Health. 2013;13:15. 10.1186/1471-2458-13-15.
36. Gershon AS, Wang C, Guan J, Vasilevska-Ristovska J, Cicutto L, To T. Identifying individuals with physcian diagnosed COPD in health administrative databases. COPD. 2009;6:388–94. 10.1080/15412550903140865.
37. Erdem E. Prevalence of chronic conditions among medicare part A beneficiaries in 2008 and 2010: are medicare beneficiaries getting sicker? Prev Chronic Dis. 2014;11:130118. 10.5888/pcd11.130118.
38. Dalal AA, Shah M, D’Souza AO, Crater GD. Rehospitalization risks and outcomes in COPD patients receiving maintenance pharmacotherapy. Respir Med. 2012;106:829–37. 10.1016/j.rmed.2011.11.012.
39. Turner RM, DePietro M, Ding B. Overlap of asthma and chronic obstructive pulmonary disease in patients in the united states: analysis of Prevalence, Features, and subtypes. JMIR Public Health Surveill. 2018;4:e60. 10.2196/publichealth.9930.
40. Raymakers AJN, Sadatsafavi M, Sin DD, De Vera MA, Lynd LD. The impact of Statin drug use on All-Cause mortality in patients with COPD: A Population-Based cohort study. Chest. 2017;152:486–93. 10.1016/j.chest.2017.02.002.
41. Mcguire K, Aviña-Zubieta JA, Esdaile JM, Sadatsafavi M, Sayre EC, Abrahamowicz M, et al. Risk of incident chronic obstructive pulmonary disease in rheumatoid arthritis: A Population-Based cohort study. Arthritis Care Res (Hoboken). 2019;71:602–10. 10.1002/acr.23410.
42. Lacasse Y, Montori VM, Lanthier C, Maltis F. The validity of diagnosing chronic obstructive pulmonary disease from a large administrative database. Can Respir J. 2005;12:251–6. 10.1155/2005/567975.
43. Wilchesky M, Tamblyn RM, Huang A. Validation of diagnostic codes within medical services claims. J Clin Epidemiol. 2004;57:131–41. 10.1016/S0895-4356(03)00246-4.
44. Hansell A, Hollowell J, McNiece R, Nichols T, Strachan D. Validity and interpretation of mortality, health service and survey data on COPD and asthma in England. Eur Respir J. 2003;21:279–86. 10.1183/09031936.03.00006102.
45. COPDGene 2025 Diagnosis Working Group and CanCOLD Investigators, Bhatt SP, Abadi E, Anzueto A, Bodduluri S, Casaburi R, et al. A multidimensional diagnostic approach for chronic obstructive pulmonary disease. JAMA. 2025. Cited 2025 Jun 23. 10.1001/jama.2025.7358.
46. Dahl A, Thompson M, An U, Krebs M, Appadurai V, Border R, et al. Phenotype integration improves power and preserves specificity in biobank-based genetic studies of major depressive disorder. Nat Genet. 2023;55:2082–93. 10.1038/s41588-023-01559-9.
47. Lee H, Kim W, Kwon N, Kim C, Kim S, An J-Y. Lessons from National biobank projects utilizing whole-genome sequencing for population-scale genomics. Genom Inf. 2025;23:8. 10.1186/s44342-025-00040-9.
48. Maselli DJ, Bhatt SP, Anzueto A, Bowler RP, DeMeo DL, Diaz AA, et al. Clinical epidemiology of COPD. Chest. 2019;156:228–38. 10.1016/j.chest.2019.04.135.
49. Herrett E, Shah AD, Boggon R, Denaxas S, Smeeth L, van Staa T, et al. Completeness and diagnostic validity of recording acute myocardial infarction events in primary care, hospital care, disease registry, and National mortality records: cohort study. BMJ. 2013;346:f2350. 10.1136/bmj.f2350.

[CR1] 1.de Oca MM, Perez-Padilla R, Celli B, Aaron SD, Wehrmeister FC, Amaral AFS, et al. The global burden of COPD: epidemiology and effect of prevention strategies. Lancet Respiratory Med. 2025. Cited 2025 Jul 24. 10.1016/S2213-2600(24)00339-4. [DOI] [PubMed] [Google Scholar]

[CR2] 2.Cho MH, Hobbs BD, Silverman EK. Genetics of chronic obstructive pulmonary disease: Understanding the pathobiology and heterogeneity of a complex disorder. Lancet Respiratory Med. 2022;10:485–96. 10.1016/S2213-2600(21)00510-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Wei W-Q, Denny JC. Extracting research-quality phenotypes from electronic health records to support precision medicine. Genome Med. 2015;7:41. 10.1186/s13073-015-0166-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Whittaker H, Quint JK. Using routine health data for research: the devil is in the detail. Thorax. 2020;75:714–5. 10.1136/thoraxjnl-2020-214821. [DOI] [PubMed] [Google Scholar]

[CR5] 5.Ho T-W, Ruan S-Y, Huang C-T, Tsai Y-J, Lai F, Yu C-J. Validity of ICD9-CM codes to diagnose chronic obstructive pulmonary disease from National health insurance claim data in Taiwan. Int J Chron Obstruct Pulmon Dis. 2018;13:3055–63. 10.2147/COPD.S174265. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Lee TM, Tu K, Wing LL, Gershon AS. Identifying individuals with physician-diagnosed chronic obstructive pulmonary disease in primary care electronic medical records: a retrospective chart abstraction study. NPJ Prim Care Respir Med. 2017;27:34. 10.1038/s41533-017-0035-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Chu SH, Wan ES, Cho MH, Goryachev S, Gainer V, Linneman J, et al. An independently validated, portable algorithm for the rapid identification of COPD patients using electronic health records. Sci Rep Nat Publishing Group. 2021;11:19959. 10.1038/s41598-021-98719-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Denaxas S, Gonzalez-Izquierdo A, Direk K, Fitzpatrick NK, Fatemifar G, Banerjee A, et al. UK phenomics platform for developing and validating electronic health record phenotypes: CALIBER. J Am Med Inform Assoc. 2019;26:1545–59. 10.1093/jamia/ocz105. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Kirby JC, Speltz P, Rasmussen LV, Basford M, Gottesman O, Peissig PL, et al. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inf Assoc. 2016;23:1046–52. 10.1093/jamia/ocv202. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Newton KM, Peissig PL, Kho AN, Bielinski SJ, Berg RL, Choudhary V, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inf Assoc. 2013;20:e147–54. 10.1136/amiajnl-2012-000896. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Wang C, Xu J, Yang L, Xu Y, Zhang X, Bai C, et al. Prevalence and risk factors of chronic obstructive pulmonary disease in China (the China pulmonary health [CPH] study): a National cross-sectional study. Lancet. 2018;391:1706–17. 10.1016/S0140-6736(18)30841-9. [DOI] [PubMed] [Google Scholar]

[CR12] 12.Li J, Qin C, Lv J, Guo Y, Bian Z, Zhou W, et al. Solid fuel use and incident COPD in Chinese adults: findings from the China kadoorie biobank. Environ Health Perspect. 2019;127:57008. 10.1289/EHP2856. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Kurmi OP, Vaucher J, Xiao D, Holmes MV, Guo Y, Davis KJ, et al. Validity of COPD diagnoses reported through nationwide health insurance systems in the people’s Republic of China. Int J Chron Obstruct Pulmon Dis. 2016;11:419–30. 10.2147/COPD.S100736. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Kwok W, Tam TC, Sing C, Chan EW, Cheung C. Validation of diagnostic coding for chronic obstructive pulmonary disease in an electronic health record system in Hong Kong. Hong Kong Med J. 2024. Cited 2024 Nov 28.10.12809/hkmj2210657. [DOI] [PubMed] [Google Scholar]

[CR15] 15.Zhong N, Wang C, Yao W, Chen P, Kang J, Huang S, et al. Prevalence of chronic obstructive pulmonary disease in china: a large, population-based survey. Am J Respir Crit Care Med. 2007;176:753–60. 10.1164/rccm.200612-1749OC. [DOI] [PubMed] [Google Scholar]

[CR16] 16.Fang L, Gao P, Bao H, Tang X, Wang B, Feng Y, et al. Chronic obstructive pulmonary disease in china: a nationwide prevalence study. Lancet Respiratory Med Elsevier. 2018;6:421–30. 10.1016/S2213-2600(18)30103-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR17] 17.Tobin MD, Izquierdo AG. Improving ethnic diversity in respiratory genomics research. Eur Respiratory J Eur Respiratory Soc. 2021;58. Cited 2025 Jun 10. 10.1183/13993003.01615-2021. [DOI] [PubMed]

[CR18] 18.Xie T, Pan Y, Lu K, Wei Y, Chen F, Tian Z, et al. Cohort profile: Kunshan aging research with E-health (KARE). Int J Epidemiol Engl. 2025;54:dyaf041. 10.1093/ije/dyaf041. [DOI] [PubMed] [Google Scholar]

[CR19] 19.Sivakumaran S, Alsallakh MA, Lyons RA, Quint JK, Davies GA. Identifying COPD in routinely collected electronic health records: a systematic scoping review. ERJ Open Res. 2021;7:00167–2021. 10.1183/23120541.00167-2021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Werder RB, Zhou X, Cho MH, Wilson AA. Breathing new life into the study of COPD with genes identified from genome-wide association studies. Eur Respir Rev. 2024;33:240019. 10.1183/16000617.0019-2024. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] 21.A global reference for. Human genetic variation. Nature. 2015;526:68–74. 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] 22.Sakaue S, Kanai M, Tanigawa Y, Karjalainen J, Kurki M, Koshiba S, et al. A cross-population atlas of genetic associations for 220 human phenotypes. Nat Genet Nat Publishing Group. 2021;53:1415–24. 10.1038/s41588-021-00931-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Tian Z, Chen F, Wang J, Wu B, Shao J, Liu Z, et al. CAS array: design and assessment of a genotyping array for Chinese biobanking. Precis Clin Med. 2023;6:pbad002. 10.1093/pcmedi/pbad002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR24] 24.Zhang P, Luo H, Li Y, Wang Y, Wang J, Zheng Y, et al. NyuWa genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep. 2021;37:110017. 10.1016/j.celrep.2021.110017. [DOI] [PubMed] [Google Scholar]

[CR25] 25.Grigsby MR, Siddharthan T, Pollard SL, Chowdhury M, Rubinstein A, Miranda JJ, et al. Low body mass index is associated with higher odds of COPD and lower lung function in low- and middle-income countries. COPD. 2019;16:58–65. 10.1080/15412555.2019.1589443. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR26] 26.Prescott E, Almdal T, Mikkelsen KL, Tofteng CL, Vestbo J, Lange P. Prognostic value of weight change in chronic obstructive pulmonary disease: results from the Copenhagen City heart Study. European respiratory journal. Eur Respiratory Soc. 2002;20:539–44. 10.1183/09031936.02.00532002. [DOI] [PubMed] [Google Scholar]

[CR27] 27.Munafò MR, Tilling K, Taylor AE, Evans DM, Davey Smith G. Collider scope: when selection bias can substantially influence observed associations. Int J Epidemiol. 2018;47:226–35. 10.1093/ije/dyx206. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] 28.Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, Pulley JM, et al. Robust replication of Genotype-Phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet. 2010;86:560–72. 10.1016/j.ajhg.2010.03.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR29] 29.Dupont WD, Plummer WD. Power and sample size calculations. A review and computer program. Control Clin Trials. 1990;11:116–28. 10.1016/0197-2456(90)90005-m. [DOI] [PubMed] [Google Scholar]

[CR30] 30.Cooke CR, Joo MJ, Anderson SM, Lee TA, Udris EM, Johnson E, et al. The validity of using ICD-9 codes and pharmacy records to identify patients with chronic obstructive pulmonary disease. BMC Health Serv Res. 2011;11:37. 10.1186/1472-6963-11-37. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] 31.Quint JK, Müllerova H, DiSantostefano RL, Forbes H, Eaton S, Hurst JR, et al. Validation of chronic obstructive pulmonary disease recording in the clinical practice research datalink (CPRD-GOLD). BMJ Open. 2014;4:e005540. 10.1136/bmjopen-2014-005540. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Kadhim-Saleh A, Green M, Williamson T, Hunter D, Birtwhistle R. Validation of the diagnostic algorithms for 5 chronic conditions in the Canadian primary care Sentinel surveillance network (CPCSSN): A Kingston Practice-based research network (PBRN) report. J Am Board Family Med. 2013;26:159–67. 10.3122/jabfm.2013.02.120183. [DOI] [PubMed] [Google Scholar]

[CR33] 33.Mapel DW, Frost FJ, Hurley JS, Petersen H, Roberts M, Marton JP, et al. An algorithm for the identification of undiagnosed COPD cases using administrative claims data. J Manag Care Pharm. 2006;12:457–65. [PubMed] [Google Scholar]

[CR34] 34.Macaulay D, Sun SX, Sorg RA, Yan SY, De G, Wu EQ, et al. Development and validation of a claims-based prediction model for COPD severity. Respir Med. 2013;107:1568–77. 10.1016/j.rmed.2013.05.012. [DOI] [PubMed] [Google Scholar]

[CR35] 35.Gini R, Francesconi P, Mazzaglia G, Cricelli I, Pasqua A, Gallina P, et al. Chronic disease prevalence from Italian administrative databases in the VALORE project: a validation through comparison of population estimates with general practice databases and National survey. BMC Public Health. 2013;13:15. 10.1186/1471-2458-13-15. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR36] 36.Gershon AS, Wang C, Guan J, Vasilevska-Ristovska J, Cicutto L, To T. Identifying individuals with physcian diagnosed COPD in health administrative databases. COPD. 2009;6:388–94. 10.1080/15412550903140865. [DOI] [PubMed] [Google Scholar]

[CR37] 37.Erdem E. Prevalence of chronic conditions among medicare part A beneficiaries in 2008 and 2010: are medicare beneficiaries getting sicker? Prev Chronic Dis. 2014;11:130118. 10.5888/pcd11.130118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] 38.Dalal AA, Shah M, D’Souza AO, Crater GD. Rehospitalization risks and outcomes in COPD patients receiving maintenance pharmacotherapy. Respir Med. 2012;106:829–37. 10.1016/j.rmed.2011.11.012. [DOI] [PubMed] [Google Scholar]

[CR39] 39.Turner RM, DePietro M, Ding B. Overlap of asthma and chronic obstructive pulmonary disease in patients in the united states: analysis of Prevalence, Features, and subtypes. JMIR Public Health Surveill. 2018;4:e60. 10.2196/publichealth.9930. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] 40.Raymakers AJN, Sadatsafavi M, Sin DD, De Vera MA, Lynd LD. The impact of Statin drug use on All-Cause mortality in patients with COPD: A Population-Based cohort study. Chest. 2017;152:486–93. 10.1016/j.chest.2017.02.002. [DOI] [PubMed] [Google Scholar]

[CR41] 41.Mcguire K, Aviña-Zubieta JA, Esdaile JM, Sadatsafavi M, Sayre EC, Abrahamowicz M, et al. Risk of incident chronic obstructive pulmonary disease in rheumatoid arthritis: A Population-Based cohort study. Arthritis Care Res (Hoboken). 2019;71:602–10. 10.1002/acr.23410. [DOI] [PubMed] [Google Scholar]

[CR42] 42. Lacasse Y, Montori VM, Lanthier C, Maltis F. The validity of diagnosing chronic obstructive pulmonary disease from a large administrative database. Can Respir J. 2005;12:251–6. 10.1155/2005/567975.

[CR43] 43. Wilchesky M, Tamblyn RM, Huang A. Validation of diagnostic codes within medical services claims. J Clin Epidemiol. 2004;57:131–41. 10.1016/S0895-4356(03)00246-4.

[CR44] 44. Hansell A, Hollowell J, McNiece R, Nichols T, Strachan D. Validity and interpretation of mortality, health service and survey data on COPD and asthma in England. Eur Respir J. 2003;21:279–86. 10.1183/09031936.03.00006102.

[CR45] 45. COPDGene 2025 Diagnosis Working Group and CanCOLD Investigators, Bhatt SP, Abadi E, Anzueto A, Bodduluri S, Casaburi R, et al. A multidimensional diagnostic approach for chronic obstructive pulmonary disease. JAMA. 2025. Cited 2025 Jun 23. 10.1001/jama.2025.7358.

[CR46] 46. Dahl A, Thompson M, An U, Krebs M, Appadurai V, Border R, et al. Phenotype integration improves power and preserves specificity in biobank-based genetic studies of major depressive disorder. Nat Genet. 2023;55:2082–93. 10.1038/s41588-023-01559-9.

[CR47] 47. Lee H, Kim W, Kwon N, Kim C, Kim S, An J-Y. Lessons from national biobank projects utilizing whole-genome sequencing for population-scale genomics. Genom Inf. 2025;23:8. 10.1186/s44342-025-00040-9.

[CR48] 48. Maselli DJ, Bhatt SP, Anzueto A, Bowler RP, DeMeo DL, Diaz AA, et al. Clinical epidemiology of COPD. Chest. 2019;156:228–38. 10.1016/j.chest.2019.04.135.

[CR49] 49. Herrett E, Shah AD, Boggon R, Denaxas S, Smeeth L, van Staa T, et al. Completeness and diagnostic validity of recording acute myocardial infarction events in primary care, hospital care, disease registry, and national mortality records: cohort study. BMJ. 2013;346:f2350. 10.1136/bmj.f2350.
