JMIR Medical Informatics. 2021 Jul 6;9(7):e24796. doi: 10.2196/24796

Machine Learning Methods for the Diagnosis of Chronic Obstructive Pulmonary Disease in Healthy Subjects: Retrospective Observational Cohort Study

Shigeo Muro 1, Masato Ishida 2, Yoshiharu Horie 3, Wataru Takeuchi 4, Shunki Nakagawa 4, Hideyuki Ban 4, Tohru Nakagawa 5, Tetsuhisa Kitamura 6
Editor: Gunther Eysenbach
Reviewed by: Chintan Gandhi
PMCID: PMC8293159  PMID: 34255684

Abstract

Background

Airflow limitation is a critical physiological feature in chronic obstructive pulmonary disease (COPD), for which long-term exposure to noxious substances, including tobacco smoke, is an established risk. However, not all long-term smokers develop COPD, meaning that other risk factors exist.

Objective

This study aimed to predict the risk factors for COPD diagnosis using machine learning in an annual medical check-up database.

Methods

In this retrospective observational cohort study (ARTDECO [Analysis of Risk Factors to Detect COPD]), annual medical check-up records for all Hitachi Ltd employees in Japan collected from April 1998 to March 2019 were analyzed. Employees who provided informed consent via an opt-out model were screened and those aged 30 to 75 years without a prior diagnosis of COPD/asthma or a history of cancer were included. The database included clinical measurements (eg, pulmonary function tests) and questionnaire responses. To predict the risk factors for COPD diagnosis within a 3-year period, the Gradient Boosting Decision Tree machine learning (XGBoost) method was applied as a primary approach, with logistic regression as a secondary method. A diagnosis of COPD was made when the ratio of the prebronchodilator forced expiratory volume in 1 second (FEV1) to prebronchodilator forced vital capacity (FVC) was <0.7 during two consecutive examinations.

Results

Of the 26,101 individuals screened, 1286 met the exclusion criteria, and thus, 24,815 individuals were included in the analysis. The top 10 predictors for COPD diagnosis were FEV1/FVC, smoking status, allergic symptoms, cough, pack years, hemoglobin A1c, serum albumin, mean corpuscular volume, percent predicted vital capacity, and percent predicted value of FEV1. The areas under the receiver operating characteristic curves of the XGBoost model and the logistic regression model were 0.956 and 0.943, respectively.

Conclusions

Using a machine learning model in this longitudinal database, we identified a number of risk factors other than smoking exposure and lung function that may help general practitioners and occupational health physicians predict the development of COPD. Further research to confirm our results is warranted, as our analysis involved a database used only in Japan.

Keywords: chronic obstructive pulmonary disease, airflow limitation, medical check-up, Gradient Boosting Decision Tree, logistic regression

Introduction

Chronic obstructive pulmonary disease (COPD) is characterized by airflow limitation associated with persistent respiratory symptoms. Most patients with COPD experience exacerbation of symptoms and are at high risk of developing comorbidities such as cardiovascular disease [1].

Long-term exposure to tobacco smoke, vapor, gas, dust, and fumes is an established major risk factor for COPD [2]. However, only a small percentage of smokers develop airflow limitation, while nonsmokers can develop COPD [3]. These inconsistencies indicate that risk factors other than long-term smoking are associated with COPD [4].

The prevalence of COPD has been reported to be 12% to 13% among smokers [5]. However, only 9.4% of patients with airflow limitation have a previous diagnosis of COPD, and European data indicate that up to 80% of COPD cases are undiagnosed [6], suggesting delays in the diagnosis of COPD. The ARCTIC observational cohort study showed that late COPD diagnosis was associated with a higher exacerbation rate and increased comorbidities and costs compared with early diagnosis [7].

To address the issue of undiagnosed COPD, significant risk factors for airflow limitation other than smoking should be identified and evaluated in routine clinical practice. In a cohort study of 9040 individuals from the Japanese general population, concomitant Chlamydia pneumoniae and Mycoplasma pneumoniae seropositivity was found to be an independent risk factor for airflow limitation [8]. Additionally, Sato et al employed an annual health examination with pulmonary function tests measuring airflow limitation to identify undiagnosed patients with COPD among the Japanese population and found that iron deficiency might be associated with COPD development [9]. However, the follow-up duration of these cohorts was short (<3 years), limiting their ability to identify risk factors for COPD in the general population.

A large questionnaire-based surveillance demonstrated some improvement in diagnostic rates for COPD; however, approximately 60% of eligible participants failed to respond to the questionnaire [10]. While these results suggest that identifying robust and relevant risk factors is likely to improve early diagnosis, the slow progression and heterogeneity of the disease have hindered the identification of such risk factors for COPD development.

The recently reported “Subtype and Stage Inference” machine learning computational model identified subtypes of patients with COPD [11]. Compared with traditional approaches, the advantages of machine learning include the ability to process complex nonlinear relationships between predictors and to provide novel outputs. Therefore, the aim of this study was to apply machine learning methods to predict possible risk factors for the development of airflow limitation, an essential feature of COPD diagnosis, using a Japanese medical check-up database comprising data from a number of healthy subjects to support the early diagnosis of COPD by general practitioners and occupational health physicians.

Methods

Study Design and Population

This was a retrospective observational cohort study to predict the risk factors for COPD diagnosis in healthy individuals. The analysis data set comprised individuals aged ≥30 years who had undertaken more than two medical check-ups, had no history of lung cancer or asthma at the first medical check-up, and could be classified as either having a diagnosis of COPD or as not having COPD. This study was designed according to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines for prognostic studies [12].

The study protocol was reviewed and approved by the ethics committee of MINS (a nonprofit organization in Tokyo, Japan) and the Research & Development Group and Corporate Hospital Group of Hitachi, Ltd (Tokyo, Japan) prior to the start of data analysis. Individual informed consent was obtained using an opt-out model in agreement with the Institutional Review Board at Hitachi, Ltd. This study was conducted in accordance with the ethical principles of the Declaration of Helsinki.

Data Source

The data source was annual medical check-up data for all Hitachi employees from April 1998 to March 2019. Data were archived in a high-security server that was managed with limited access rights by Hitachi. The annual medical check-up includes clinical measurements and questionnaires to examine the health of employees (Multimedia Appendix 1). Such questionnaires are utilized by Japanese organizations to evaluate their employees’ health and give advice about health promotion, such as giving up smoking and exercising regularly based on the second term of the National Health Promotion Movement in the 21st century (Health Japan 21) issued by the Ministry of Health, Labour, and Welfare in Japan [13].

Definition of COPD

COPD was defined as a prebronchodilator (pre-BD) forced expiratory volume in 1 second/forced vital capacity (FEV1/FVC) ratio of <0.7 at two consecutive annual lung function tests, a criterion previously employed in a large population-based cohort study [14]. Individuals with a pre-BD FEV1/FVC ≥0.7 in at least three consecutive annual lung function tests were classified as non-COPD. For individuals in the non-COPD group with more than three records, the most recent three records were analyzed. Individuals with fewer than two lung function tests were excluded from all analyses. Spirometers were calibrated and spirometry was performed by trained paramedical personnel according to the American Thoracic Society/European Respiratory Society guidelines [15,16].
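The classification rule above can be illustrated with a minimal Python sketch. It is one possible reading of the definition, not the authors' code: the function names, the label strings, and the handling of edge cases (for example, treating a record in which every value is <0.7 as already diagnosed, in line with the exclusions described in the Results) are assumptions made for illustration.

```python
from typing import List, Optional

THRESHOLD = 0.7  # pre-BD FEV1/FVC cut-off stated in the study


def first_consecutive_below(ratios: List[float], run: int = 2) -> Optional[int]:
    """Index of the first of `run` consecutive ratios below THRESHOLD, or None."""
    streak = 0
    for i, ratio in enumerate(ratios):
        streak = streak + 1 if ratio < THRESHOLD else 0
        if streak == run:
            return i - run + 1
    return None


def classify(ratios: List[float]) -> str:
    """Label one person's chronologically ordered annual FEV1/FVC ratios.

    The label names are hypothetical; the rules follow the text as read here.
    """
    if len(ratios) < 2:
        return "excluded"          # fewer than two lung function tests
    if all(r < THRESHOLD for r in ratios):
        return "prevalent"         # below 0.7 throughout: treated as already diagnosed
    if first_consecutive_below(ratios) is not None:
        return "COPD"              # two consecutive measurements below 0.7
    if any(r < THRESHOLD for r in ratios):
        return "unclassified"      # isolated low value only
    if len(ratios) >= 3:
        return "non-COPD"          # at least three consecutive values >= 0.7
    return "unclassified"


ratios = [0.80, 0.69, 0.68]
ages = [50, 51, 52]
label = classify(ratios)
diagnosis_age = ages[first_consecutive_below(ratios)] if label == "COPD" else None
```

In this sketch the age at diagnosis is read from the first of the two consecutive low measurements, which matches the definition given for the age at COPD diagnosis.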

Statistical Analysis

Age at COPD Diagnosis

The age distribution for disease diagnosis was evaluated and stratified by smoking status (current smoker, exsmoker, or nonsmoker). The age at COPD diagnosis was defined as the age at the first of two consecutive measurements in which the pre-BD FEV1/FVC was <0.7.

Risk Factor Prediction Using Machine Learning

Two types of models were constructed for predicting the risk factors for COPD diagnosis within 3 years as follows: a machine learning method (Gradient Boosting Decision Tree machine learning [XGBoost] [17]) and an established statistical method (logistic regression [18]). Individuals who did not meet the study inclusion criteria and/or had lung cancer/asthma were excluded from the analyses. Any individuals with missing data during the 3 years prior to the diagnosis year in the COPD group or during the most recent 3 years in the non-COPD group were excluded from the analyses. Propensity scores were calculated based on age, sex, smoking status, BMI, eosinophil count (EOS), and FEV1.
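The study does not describe the matching algorithm used with the propensity scores, so the following is only a sketch of one common approach: greedy 1:1 nearest-neighbor matching without replacement, accepting a pair only when the score difference lies within a caliper (the study reports a caliper of 0.2). The function name, the identifiers, and the choice to apply the caliper directly on the score scale are assumptions.

```python
from typing import Dict, List, Tuple


def match_with_caliper(
    cases: Dict[str, float],
    controls: Dict[str, float],
    caliper: float = 0.2,
) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbor matching on propensity scores.

    `cases` and `controls` map a (hypothetical) person ID to a propensity score.
    Each control is used at most once; pairs further apart than `caliper`
    are rejected and the case stays unmatched.
    """
    available = dict(controls)
    pairs: List[Tuple[str, str]] = []
    # Fixed processing order keeps the result reproducible.
    for case_id, case_score in sorted(cases.items(), key=lambda kv: kv[1]):
        if not available:
            break
        best = min(available, key=lambda c: abs(available[c] - case_score))
        if abs(available[best] - case_score) <= caliper:
            pairs.append((case_id, best))
            del available[best]
    return pairs


pairs = match_with_caliper(
    cases={"c1": 0.30, "c2": 0.80},
    controls={"n1": 0.32, "n2": 0.35, "n3": 0.10},
)
```

Here `c1` is paired with its closest control `n1`, whereas `c2` has no control within the caliper and is left unmatched.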

Data were randomly divided into a training data set and a test data set at a ratio of 7:3, with the same ratio of COPD to non-COPD individuals in each. Propensity scoring was used to balance the characteristics of COPD and non-COPD individuals (caliper: 0.2) in the training and test data sets. Next, the training data set was randomly divided 8:2 for model construction (XGBoost and logistic regression) and evaluation of model performance, respectively. The data split, model construction, and evaluation processes were repeated five times for cross-validation (5-CV approach) [19]. Model parameters, including the depth of the tree and regularization factor, were tuned during performance evaluation by the 5-CV approach. The final model was then generated by applying the best parameters identified by the 5-CV approach. To evaluate model performance on unseen data, this final model was applied to the test data set.
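As a rough illustration of the splitting scheme, the sketch below performs a stratified 7:3 split and then generates five folds in which each fold holds out about 20% of the training data (an 8:2 division). It uses only the Python standard library; the function names and the fixed seeds are assumptions, and the authors' actual tooling is not described beyond Python 3.6.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split row indices so each class keeps the same train:test ratio."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def k_folds(
    indices: Sequence[int], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (fit, validation) index lists; with k=5 each split is about 8:2."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]


labels = [1] * 10 + [0] * 90          # toy data: 10 COPD, 90 non-COPD
train_idx, test_idx = stratified_split(labels)
splits = list(k_folds(train_idx))
```

In a tuning loop, each candidate setting (for example, tree depth and regularization factor) would be fitted on `fit` and scored on the validation fold, and the setting with the best mean score would be refitted on the full training set before a single evaluation on `test_idx`.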

Model construction by logistic regression was performed in a similar way to the XGBoost method. Models were constructed in the training data set (randomly sampled data from the entire data set) and subsequently validated in the test data set after model evaluation.

Following model construction, the area under the receiver operating characteristic curve (AUC), positive predictive value, sensitivity, specificity, and F1-measure were calculated for each model to evaluate the performance under both 5-CV and test conditions [20]. The feature importance of the machine learning model was calculated to examine the contribution of each predictor to the model constructed using the Gini impurity method [19]. The feature weight of the logistic regression model was also calculated. All analyses were performed using Python 3.6 software (Python Software Foundation).
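The performance measures listed above follow standard definitions, which can be sketched in plain Python as follows. This is a generic illustration rather than the authors' implementation; the function names are hypothetical, and the AUC is computed through its rank interpretation (the probability that a randomly chosen COPD case receives a higher score than a randomly chosen non-COPD individual, with ties counted as one half).

```python
from typing import Dict, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Threshold-based metrics for binary labels (1 = COPD, 0 = non-COPD)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity) if ppv + sensitivity else 0.0
    return {
        "ppv": ppv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": (tp + tn) / len(pairs),
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.45 else 0 for s in scores]  # 0.45 is an arbitrary cut-off
metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

The pairwise AUC is quadratic in the number of records, which is fine for a toy example but would be replaced by a rank-based computation on a data set of this study's size.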

Results

Individuals

Data from 26,101 individuals (employees and their families) aged 30 to 75 years, who underwent annual check-ups between April 1998 and March 2019 were included in our analysis. The total number of medical check-up records was 318,568. All 26,101 individuals had lung function test measurements for 3 consecutive years. The medical records for 73 individuals aged <30 years at the first medical check-up, 67 individuals with a history of cancer, and 727 individuals with a history of asthma were excluded, as were data from 419 individuals who had already been diagnosed with COPD (subjects for whom all data points of pre-BD FEV1/FVC were <0.7 during the observational period) or had not been classified as either COPD or non-COPD (subjects with pre-BD FEV1/FVC <0.7 without two consecutive measurements). Accordingly, data for 24,815 individuals (corresponding to 67,438 records) were included in the analyses (Figure 1).
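The exclusion counts reported above can be checked with a few lines of arithmetic. The dictionary keys are shorthand labels introduced here for illustration; the numbers are those given in the text.

```python
screened = 26_101

# Exclusion counts as reported in the text (labels are shorthand).
excluded = {
    "aged <30 years at first check-up": 73,
    "history of cancer": 67,
    "history of asthma": 727,
    "already diagnosed or unclassifiable": 419,
}

total_excluded = sum(excluded.values())
included = screened - total_excluded
print(f"excluded: {total_excluded}, included: {included}")
```

The four exclusion groups sum to 1286, which leaves the 24,815 individuals analyzed.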

Figure 1. Flow diagram of the study. COPD: chronic obstructive pulmonary disease.

Baseline Characteristics

Table 1 shows the baseline characteristics of the COPD and non-COPD groups. Overall, 1489 individuals were considered as having COPD (pre-BD FEV1/FVC <0.7 at two consecutive measurements during annual lung function tests). In comparison with the non-COPD group, the COPD group had a lower BMI, worse lung function (pre-BD FEV1, pre-BD percent predicted value of FEV1 [%FEV1], and pre-BD FEV1/FVC), and greater emphysematous change and chronic inflammation as determined by computed tomography. Furthermore, comorbidities, such as arrhythmia, duodenal ulcer, colorectal polyp, angina, stomach ulcer, and kidney disease, were more prevalent in the COPD group. Statistically significant differences in hematological parameters (mean corpuscular volume [MCV], mean corpuscular hemoglobin concentration [MCHC], mean corpuscular hemoglobin [MCH], hemoglobin [Hb], and hematocrit [HT] [15]) between the COPD and non-COPD groups were also observed. Inflammatory markers, particularly white blood cell (WBC) count and EOS, were also significantly higher in the COPD group.

Table 1. Subject characteristics stratified by chronic obstructive pulmonary disease status.

| Characteristic | Non-COPD^a (n=23,326) | COPD (n=1489) | P value |
| --- | --- | --- | --- |
| Age (years), mean (SD) | 42 (9.1) | 48 (9.3) | <.001 |
| Female, n (%) | 3841 (16.5%) | 58 (3.9%) | <.001 |
| Smoking status, n (%) | | | <.001 |
| Current smoker | 10,632 (45.6%) | 1021 (68.6%) | |
| Exsmoker | 3534 (15.2%) | 202 (13.6%) | |
| Nonsmoker | 9153 (39.3%) | 266 (17.9%) | |
| Unknown/missing | 7 (0.0%) | 0 (0.0%) | |
| BMI (kg/m²), mean (SD) | 23 (3.2) | 22 (2.7) | <.001 |
| Lung function test, mean (SD) | | | |
| Prebronchodilator FEV1^b | 3.4 (0.7) | 3.1 (0.6) | <.001 |
| Prebronchodilator FVC^c | 4.1 (0.8) | 4.2 (0.8) | <.001 |
| Prebronchodilator FEV1/FVC | 83.7 (5.4) | 74.9 (5.1) | <.001 |
| Comorbidity, n (%) | | | |
| Arrhythmia | 107 (0.5%) | 16 (1.1%) | .003 |
| Duodenal ulcer | 158 (0.7%) | 19 (1.3%) | .02 |
| Colorectal polyp | 43 (0.2%) | 13 (0.9%) | <.001 |
| Angina | 56 (0.2%) | 10 (0.7%) | .006 |
| Stomach ulcer | 180 (0.8%) | 29 (1.9%) | <.001 |
| Kidney disease | 77 (0.3%) | 12 (0.8%) | .01 |
| Computed tomography finding, n (%) | | | |
| Bulla, bleb | 108 (0.5%) | 31 (2.1%) | <.001 |
| Moderate emphysema | 18 (0.1%) | 13 (0.9%) | <.001 |
| Mild emphysema | 96 (0.4%) | 27 (1.8%) | <.001 |
| Calcification of left anterior descending coronary artery | 128 (0.5%) | 16 (1.1%) | .02 |
| Chronic inflammation | 342 (1.5%) | 43 (2.9%) | <.001 |
| Laboratory parameters, mean (SD) | | | |
| Albumin (U/L) | 4.4 (0.2) | 4.3 (0.2) | <.001 |
| Alanine aminotransferase (U/L) | 209.5 (53.7) | 215.2 (54.6) | <.001 |
| Aspartate aminotransferase (U/L) | 26.4 (14.8) | 24.3 (12.8) | <.001 |
| Blood urea nitrogen (mg/dL) | 14.1 (3.2) | 14.7 (3.3) | <.001 |
| Cholinesterase (U/L) | 320.8 (60.1) | 307.8 (58.8) | <.001 |
| Estimated glomerular filtration rate (mL/min/1.73 m²) | 83.6 (14.6) | 80.2 (14.1) | <.001 |
| Eosinophil count (cells/mm³) | 183.2 (124.5) | 195.4 (125.9) | <.001 |
| Gamma-glutamyl transferase (U/L) | 42.7 (34.4) | 45.8 (34.5) | <.001 |
| Hemoglobin (g/dL) | 14.7 (1.4) | 14.9 (1.1) | <.001 |
| Hemoglobin A1c (%) | 5.3 (0.7) | 5.4 (0.7) | <.001 |
| Hematocrit (%) | 44.0 (3.6) | 44.6 (3.1) | <.001 |
| MCH^d (pg) | 30.5 (1.8) | 31.1 (1.7) | <.001 |
| MCHC^e (g/L) | 33.4 (1.0) | 33.3 (0.8) | <.001 |
| MCV^f (fL) | 91.3 (4.6) | 93.4 (4.4) | <.001 |
| WBC^g count (×10² cells/µL) | 58.8 (15.0) | 62.8 (15.5) | <.001 |

^a COPD: chronic obstructive pulmonary disease.
^b FEV1: forced expiratory volume in 1 second.
^c FVC: forced vital capacity.
^d MCH: mean corpuscular hemoglobin.
^e MCHC: mean corpuscular hemoglobin concentration.
^f MCV: mean corpuscular volume.
^g WBC: white blood cell.

Percentage of Individuals With COPD

The overall percentage of individuals with COPD was 6.0% (1489/24,815). According to smoking status, the percentage of individuals with COPD was 8.8% (1021/11,653) among current smokers and 5.4% (202/3736) among exsmokers. Notably, 2.8% (266/9419) of nonsmokers had developed COPD. The peak age at diagnosis of COPD among current smokers and exsmokers was 55 years and 65 years, respectively (Figure 2).
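The percentages above follow directly from the counts in Table 1, as the short check below illustrates. The variable and function names are introduced here for illustration only.

```python
# (COPD cases, all individuals) per smoking status, from Table 1.
counts = {
    "current smoker": (1021, 10_632 + 1021),
    "exsmoker": (202, 3534 + 202),
    "nonsmoker": (266, 9153 + 266),
}


def percent(cases: int, total: int) -> float:
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * cases / total, 1)


prevalence = {status: percent(cases, total) for status, (cases, total) in counts.items()}
overall = percent(1489, 24_815)
print(prevalence, overall)
```

The three smoking categories account for 24,808 individuals; the remaining 7 non-COPD individuals had unknown or missing smoking status.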

Figure 2. Diagnostic age for chronic obstructive pulmonary disease (COPD) according to smoking status.

Risk Factors for COPD Diagnosis

Overall, 20,265 individuals (COPD: n=954; non-COPD: n=19,311) with 51,432 records (COPD: n=2435; non-COPD: n=48,997) out of 24,815 individuals who met the criteria (Multimedia Appendix 2) were included in the machine learning analysis. Table 2 shows the model performance of the XGBoost and logistic regression models. For both models, the AUC, accuracy, sensitivity, specificity, and F-measure were generally similar between the training and test data sets. The XGBoost model had a higher positive predictive value (0.505) than the logistic regression model (0.441). The AUC was high in the training and test sets for both models (range: 0.892-0.956). Additionally, the accuracy and specificity exceeded 0.883 and 0.879, respectively, for both models.

Table 2. Comparison of performance of the Gradient Boosting Decision Tree machine learning (XGBoost) and logistic regression models.

| Variable | XGBoost^a model: training, mean (SE) | XGBoost^a model: test, mean | Logistic regression model: training, mean (SE) | Logistic regression model: test, mean |
| --- | --- | --- | --- | --- |
| Positive predictive value | 0.505 (0.099) | 0.362 | 0.441 (0.110) | 0.285 |
| AUC^b | 0.956 (0.015) | 0.898 | 0.943 (0.022) | 0.892 |
| Accuracy | 0.917 (0.032) | 0.918 | 0.884 (0.049) | 0.883 |
| Sensitivity | 0.845 (0.021) | 0.877 | 0.874 (0.039) | 0.901 |
| Specificity | 0.960 (0.016) | 0.919 | 0.946 (0.025) | 0.882 |
| F-measure | 0.370 (0.107) | 0.513 | 0.306 (0.110) | 0.434 |

^a XGBoost: Gradient Boosting Decision Tree machine learning.
^b AUC: area under the receiver operating characteristic curve.
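Assuming the F-measure reported here is the F1-measure named in the Methods (the harmonic mean of positive predictive value and sensitivity), the test-set values in Table 2 can be reproduced approximately from the other rows:

```latex
\[
F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sensitivity}}{\mathrm{PPV} + \mathrm{Sensitivity}}
\]
\[
F_1^{\text{XGBoost, test}} = \frac{2 \times 0.362 \times 0.877}{0.362 + 0.877} \approx 0.512,
\qquad
F_1^{\text{logistic, test}} = \frac{2 \times 0.285 \times 0.901}{0.285 + 0.901} \approx 0.433
\]
```

Both results agree with the tabulated 0.513 and 0.434 to within the rounding of the inputs. The same identity need not hold for the training columns, because those entries are means over cross-validation folds.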

The most important predictive factors for COPD diagnosis were lung function tests (ie, FEV1/FVC, percent vital capacity [%VC], and %FEV1) and smoking status, followed by cough, hematological indices (ie, MCV, MCHC, MCH, Hb, and HT), treatment with antidiabetic drugs, hemoglobin A1c, serum albumin, total protein, and BMI. Other predictive risk factors were EOS, serum alanine aminotransferase, WBC count, and urinary WBC count (Table 3). Logistic regression analysis showed that low FEV1/FVC and %FEV1; high %VC; high MCV, MCHC, and Hb; and low HT and MCH were related factors, and that individuals treated with antidiabetic drugs had a higher number of associated risk factors for COPD. Low serum albumin, low total protein, and low BMI were also confirmed as risk factors (Multimedia Appendix 3).

Table 3. Importance of each predictor in the XGBoost model.

| Variable | Importance value |
| --- | --- |
| Forced expiratory volume in 1 second/forced vital capacity | 0.2824 |
| Smoking status | 0.0329 |
| Allergic symptoms (yes/no) | 0.0303 |
| Symptom-cough (yes/no) | 0.0294 |
| Smoking-pack year | 0.0222 |
| Hemoglobin A1c | 0.0197 |
| Albumin | 0.0195 |
| Mean corpuscular volume | 0.0177 |
| %Vital capacity | 0.0165 |
| %Forced expiratory volume in 1 second | 0.0164 |
| Treatment with an antidiabetic drug (yes/no) | 0.0162 |
| Allergic disease (yes/no) | 0.0146 |
| Hematocrit | 0.0144 |
| Urinary red blood cells | 0.0143 |
| Hemoglobin | 0.0138 |
| Age | 0.0128 |
| Smoking duration | 0.0127 |
| High density lipoprotein cholesterol | 0.0123 |
| Mean corpuscular hemoglobin concentration | 0.0122 |
| Total protein | 0.0118 |
| BMI | 0.0118 |
| Number of eosinophils | 0.0115 |
| Mean corpuscular hemoglobin | 0.0114 |
| Serum white blood cells | 0.0111 |
| Fasting blood sugar | 0.0110 |
| Serum alanine aminotransferase | 0.0108 |
| Pulse rate | 0.0108 |
| Forced expiratory volume in 1 second | 0.0107 |
| Urinary white blood cells | 0.0104 |
| Diastolic blood pressure | 0.0103 |
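The Methods state that feature importance was calculated using the Gini impurity method. As a generic illustration of that idea (not the internal computation of XGBoost, and not the authors' code), the sketch below computes the impurity decrease produced by a single binary split; a predictor's importance in a tree ensemble is typically an aggregate of such decreases over all splits that use it. The function names and toy labels are assumptions.

```python
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of binary labels: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def impurity_decrease(
    parent: Sequence[int], left: Sequence[int], right: Sequence[int]
) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = [1, 1, 0, 0]                                        # 1 = COPD, 0 = non-COPD
perfect = impurity_decrease(parent, [1, 1], [0, 0])          # split separates classes
useless = impurity_decrease(parent, [1, 0], [1, 0])          # split changes nothing
```

A split that separates the classes completely removes all of the parent's impurity, whereas a split that leaves the class mix unchanged contributes nothing, which is why a strongly discriminating predictor such as FEV1/FVC accumulates a much larger importance value.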

For future utilization of risk factors for disease assessment in daily clinical practice, the machine learning process was validated using a questionnaire to predict risk factors for COPD development (Multimedia Appendix 1). Of 30 variables, 25 were clinical parameters that overlapped between the two methods. The top 30 risk factors also included the following five questions: “I am regularly doing exercise,” “I have chest compression and pain,” “Average sleeping time in the past 1 month,” “I have breakfast every day,” and “Body fat ratio” (Multimedia Appendix 4). Among these, logistic regression analysis showed that insufficient sleeping time and not having breakfast every day were risk factors for COPD (Multimedia Appendix 3).

Discussion

This study applied a machine learning method, a powerful tool to analyze large quantities of complex data, to predict risk factors for COPD. This is the first study to investigate more than 300,000 records from working-age adults in Japan utilizing an annual medical check-up database. This system allows healthy employees to track their health conditions over time by clinical measurements and questionnaires. We found that the most significant predictor of COPD diagnosis was the absolute value of FEV1/FVC, indicating that low FEV1 in early adulthood is an important factor in the development of COPD. Childhood asthma is associated with impaired lung function, lower lung function in adulthood, and higher risk of COPD even for nonsmoker participants, as previously reported by Martinez et al [21]. We speculate that some of the nonsmokers with COPD might have had a history of childhood asthma, increasing susceptibility to passive smoke exposure or airway pollution and resulting in the early diagnosis of COPD in nonsmokers compared with exsmokers in the study. Smoking status had the second highest impact on disease diagnosis. Among individuals with a smoking history, the peak age of COPD diagnosis was older in exsmokers than in current smokers. This finding suggests that smoking cessation delays the diagnosis of COPD, consistent with a previous study in which smoking cessation was reported to affect the natural history of COPD [22].

Erythrocyte indices (MCV and MCHC) might also serve as predictors of COPD diagnosis in addition to lung function measurements. These data are supported by a previous report in which continuous smoking had a significant effect on hematological parameters compared with nonsmoking, and it may be associated with an increased risk of COPD [23]. The increased levels of MCV and MCHC in individuals with COPD support a previous finding that impaired lung function has a strong association with ischemic heart disease [24]. Conversely, the presence of an allergic disease appeared to have a preventive effect on airflow limitation, which is in contrast with observations from the Tasmanian Longitudinal Health Study in which the presence of allergic diseases was an early predictor of lung trajectories toward COPD [25]. However, the Hokkaido cohort study showed that subjects with multiple asthma-like features had slower lung function decline [26]. From the findings of our observational study in Japan, we can speculate that early diagnosis and intervention for allergic diseases may have less impact on lung function and that regular and frequent medical intervention could lead to an overall increase in life expectancy among patients who can readily access appropriate treatment by respiratory specialists.

Furthermore, individuals with decreased levels of serum albumin and total protein, as well as lower hemoglobin A1c and BMI may be at risk of developing cachexia, a common condition among patients with COPD [27]. With respect to other identified risk factors, a retrospective cross-sectional study showed an association between EOS and airflow limitation in patients with COPD [28]. Given that increased alanine aminotransferase levels have been observed in patients with obstructive sleep apnea [29], individuals at risk of developing COPD might be exposed to intermittent hypoxia, indicating that a reduced sleeping time, as determined in the study questionnaire, might also represent a risk factor for COPD. Even minor changes in hematological parameters might be attributable to hypoxic conditions, leading to sleep disruption. Additionally, frequently missing breakfast might accelerate malnutrition in the COPD group. Furthermore, significantly higher prevalence rates of chronic neck and lower back pain in patients with COPD compared with healthy individuals were observed in a population-based study, although the findings were not confirmed by logistic regression analysis [30], and the link between COPD and back pain remains unknown. The observation of increased WBC counts in patients with COPD compared with healthy controls [31] suggests that systemic inflammation may be involved in the pathogenesis of COPD [32].

Our results also indicate that smoking cessation should be prioritized for the prevention of COPD and that smokers with sleep disturbances, back pain, and/or low BMI and malnutrition may be at increased risk of developing COPD and should be considered as candidates for lifestyle intervention therapy. Furthermore, the five key questions included in our questionnaire should be validated in future investigations and potentially implemented in daily practice as part of an annual medical check-up to prevent COPD.

The positive predictive value of the XGBoost model was comparable to that of a self-scored persistent airflow obstruction screening questionnaire in the Japanese population previously reported by Samukawa et al [33]. However, our models discriminated better: their sensitivity and specificity were higher, and their AUCs exceeded 0.9 in training (approximately 0.89 in the test data set), whereas the AUC of the questionnaire ranged from 0.595 to 0.612. The AUCs of the XGBoost and logistic regression models were similar, and the most important factor related to COPD diagnosis was FEV1/FVC in both models. However, some variables differed in importance in each model. Kuhn et al reported that machine learning approaches can incorporate high-order nonlinear interactions among predictors that cannot be addressed by traditional modeling approaches (eg, logistic regression models) [34]. However, machine learning methods cannot elucidate whether a causal relationship exists between the identified variable and the disease. Thus, the association between risk factors detected using a machine learning model and COPD requires validation in future prospective studies.

A strength of this study was the use of longitudinal lung function test data from healthy individuals from April 1998 to March 2019. In general, medical checkup data are not linked to medical records, meaning that profiles of lung function tests over time could not be investigated. However, it was possible to evaluate longitudinal lung function tests because the database included data from individuals from a point in time when they were healthy until they had developed COPD. Additionally, data from healthy individuals were included, allowing lung function test results from when they were diagnosed with COPD to be investigated. Finally, both clinical measurements and questionnaire variables were included in the database, thereby increasing the potential to identify several different risk factors for COPD.

The limitations of this study include the definition of COPD diagnosis by airflow limitation with pulmonary function tests. Instead of post-BD spirometry data as suggested by the American Thoracic Society/European Respiratory Society guidelines, we employed pre-BD spirometry data for the diagnosis of COPD since no post-BD spirometry was performed in the annual medical check-up. The precise diagnosis of COPD cannot always be demonstrated by airflow limitation alone; however, we believe that the diagnostic approach was reasonable from a clinical perspective as airflow limitation has been reported to be a poor prognostic factor in the general population [35]. Low lung function values (FEV1/FVC <0.7) might be observed at a single time point in some individuals for no discernible reason. Therefore, we defined COPD as an FEV1/FVC <0.7 on two consecutive occasions. In terms of differentiation between asthma and COPD, we cannot exclude the possibility of misclassification of asthma as COPD in some patients since reversibility tests were not performed in the annual medical check-up, but participants with a medical history of asthma were excluded. Additionally, a database from a single organization was analyzed in this study; thus, the results might include bias based on the type of industry or the organizational structure of the company, limiting the generalizability of the findings. To obtain more generalizable findings, studies using other databases are necessary. Finally, some unknown confounders may have remained; therefore, we plan to perform model validation by analyzing other databases. Well-controlled prospective studies should be conducted to confirm the predictive factors for COPD diagnosis.

In conclusion, our machine learning method applied to longitudinal medical check-up data, including general questionnaires and laboratory parameters, identified hematological, nutritional, and inflammatory parameters as potential risk factors for COPD. These parameters, along with lung function and smoking status, may be useful in identifying at-risk individuals and may lead to an earlier diagnosis.

Acknowledgments

The authors thank ND Smith, Y Baba, and K Dohi of EMC Japan (Osaka, Japan) for critical reading and native checking of the manuscript. We also thank Tricia Newell and Clare Cox of Edanz Evidence Generation for providing editing support. This study was funded by AstraZeneca KK.

Abbreviations

5-CV: five times for cross-validation
AUC: area under the receiver operating characteristic curve
BD: bronchodilator
COPD: chronic obstructive pulmonary disease
EOS: eosinophil count
FEV1: forced expiratory volume in 1 second
FVC: forced vital capacity
Hb: hemoglobin
HT: hematocrit
MCHC: mean corpuscular hemoglobin concentration
MCV: mean corpuscular volume
WBC: white blood cell
XGBoost: Gradient Boosting Decision Tree machine learning method

Appendix

Multimedia Appendix 1

Clinical assessments and questions used in the analysis, and list of variables.

Multimedia Appendix 2

Numbers of records and individuals included in the machine learning model.

Multimedia Appendix 3

Association between chronic obstructive pulmonary disease and the top 30 variables of importance based on logistic regression.

Multimedia Appendix 4

Importance of each predictor in the XGBoost model (including questionnaire items).

Footnotes

Authors' Contributions: SM and TK contributed to data interpretation and reviewed the manuscript. YH planned the analyses, contributed to data interpretation, and drafted the manuscript. MI designed the study and drafted the manuscript. SN, WT, and HB conducted the analyses. TN corrected and provided the data, and reviewed the analysis including data cleansing and preprocessing. All authors take full responsibility for the content and editorial decisions, and approved the final version.

Conflicts of Interest: SM has received honoraria from AstraZeneca KK, Boehringer Ingelheim Japan, GlaxoSmithKline KK, Novartis Pharma KK, Meiji Seika Pharma Co, Ltd, Kyorin Pharmaceutical Co, Ltd, Otsuka Pharmaceutical Co, Ltd, Teijin Pharma Ltd, CHEST MI, Inc, Daiichi Sankyo Co, Ltd, Chugai Pharmaceutical Co, Ltd, Sanofi KK, Actelion Pharmaceuticals Japan Ltd, and Olympus Corporation. MI and YH are employees of AstraZeneca KK. WT, SN, HB, and TN are employees of Hitachi, Ltd.

References

  • 1.Vogelmeier CF, Criner GJ, Martinez FJ, Anzueto A, Barnes PJ, Bourbeau J, Celli BR, Chen R, Decramer M, Fabbri LM, Frith P, Halpin DMG, López Varela MV, Nishimura M, Roche N, Rodriguez-Roisin R, Sin DD, Singh D, Stockley R, Vestbo J, Wedzicha JA, Agustí A. Global Strategy for the Diagnosis, Management, and Prevention of Chronic Obstructive Lung Disease 2017 Report. GOLD Executive Summary. Am J Respir Crit Care Med. 2017 Mar 01;195(5):557–582. doi: 10.1164/rccm.201701-0218PP. [DOI] [PubMed] [Google Scholar]
  • 2. de Marco R, Accordini S, Marcon A, Cerveri I, Antó JM, Gislason T, Heinrich J, Janson C, Jarvis D, Kuenzli N, Leynaert B, Sunyer J, Svanes C, Wjst M, Burney P, European Community Respiratory Health Survey (ECRHS). Risk factors for chronic obstructive pulmonary disease in a European cohort of young adults. Am J Respir Crit Care Med. 2011 Apr 01;183(7):891–7. doi: 10.1164/rccm.201007-1125OC.
  • 3. Fletcher C, Peto R. The natural history of chronic airflow obstruction. Br Med J. 1977 Jun 25;1(6077):1645–8. doi: 10.1136/bmj.1.6077.1645. http://europepmc.org/abstract/MED/871704
  • 4. Stang P, Lydick E, Silberman C, Kempel A, Keating ET. The prevalence of COPD: using smoking rates to estimate disease frequency in the general population. Chest. 2000 May;117(5 Suppl 2):354S–9S. doi: 10.1378/chest.117.5_suppl_2.354s.
  • 5. Fukuchi Y, Nishimura M, Ichinose M, Adachi M, Nagai A, Kuriyama T, Takahashi K, Nishimura K, Ishioka S, Aizawa H, Zaher C. COPD in Japan: the Nippon COPD Epidemiology study. Respirology. 2004 Nov;9(4):458–65. doi: 10.1111/j.1440-1843.2004.00637.x.
  • 6. Soriano JB, Zielinski J, Price D. Screening for and early detection of chronic obstructive pulmonary disease. The Lancet. 2009 Aug 29;374(9691):721–732. doi: 10.1016/S0140-6736(09)61290-3.
  • 7. Larsson K, Janson C, Ställberg B, Lisspers K, Olsson P, Kostikas K, Gruenberger J, Gutzwiller FS, Uhde M, Jorgensen L, Johansson G. Impact of COPD diagnosis timing on clinical and economic outcomes: the ARCTIC observational cohort study. Int J Chron Obstruct Pulmon Dis. 2019;14:995–1008. doi: 10.2147/COPD.S195382.
  • 8. Muro S, Tabara Y, Matsumoto H, Setoh K, Kawaguchi T, Takahashi M, Ito I, Ito Y, Murase K, Terao C, Kosugi S, Yamada R, Sekine A, Nakayama T, Chin K, Mishima M, Matsuda F, Nagahama Study Group. Relationship Among Chlamydia and Mycoplasma Pneumoniae Seropositivity, IKZF1 Genotype and Chronic Obstructive Pulmonary Disease in A General Japanese Population: The Nagahama Study. Medicine (Baltimore). 2016 Apr;95(15):e3371. doi: 10.1097/MD.0000000000003371.
  • 9. Sato K, Shibata Y, Inoue S, Igarashi A, Tokairin Y, Yamauchi K, Kimura T, Nemoto T, Sato M, Nakano H, Machida H, Nishiwaki M, Kobayashi M, Yang S, Minegishi Y, Furuyama K, Yamamoto T, Watanabe T, Konta T, Ueno Y, Kato T, Kayama T, Kubota I. Impact of cigarette smoking on decline in forced expiratory volume in 1s relative to severity of airflow obstruction in a Japanese general population: The Yamagata-Takahata study. Respir Investig. 2018 Mar;56(2):120–127. doi: 10.1016/j.resinv.2017.11.011.
  • 10. Jordan RE, Adab P, Sitch A, Enocson A, Blissett D, Jowett S, Marsh J, Riley RD, Miller MR, Cooper BG, Turner AM, Jolly K, Ayres JG, Haroon S, Stockley R, Greenfield S, Siebert S, Daley AJ, Cheng KK, Fitzmaurice D. Targeted case finding for chronic obstructive pulmonary disease versus routine practice in primary care (TargetCOPD): a cluster-randomised controlled trial. The Lancet Respiratory Medicine. 2016 Sep;4(9):720–730. doi: 10.1016/S2213-2600(16)30149-7.
  • 11. Young AL, Bragman FJS, Rangelov B, Han MK, Galbán CJ, Lynch DA, Hawkes DJ, Alexander DC, Hurst JR, COPDGene Investigators. Disease Progression Modeling in Chronic Obstructive Pulmonary Disease. Am J Respir Crit Care Med. 2020 Feb 01;201(3):294–302. doi: 10.1164/rccm.201908-1600OC. http://europepmc.org/abstract/MED/31657634
  • 12. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med. 2015 Jan 06;162(1):55–63. doi: 10.7326/M14-0697.
  • 13. Sugiyama K, Tomata Y, Takemi Y, Tsushita K, Nakamura M, Hashimoto S, Miyachi M, Yamagata Z, Yokoyama T, Tsuji I. Awareness and health consciousness regarding the national health plan "Health Japan 21" (2nd edition) among the Japanese population in 2013 and 2014. Nihon Koshu Eisei Zasshi. 2016;63(8):424–31. doi: 10.11236/jph.63.8_424.
  • 14. Terzikhan N, Verhamme KMC, Hofman A, Stricker BH, Brusselle GG, Lahousse L. Prevalence and incidence of COPD in smokers and non-smokers: the Rotterdam Study. Eur J Epidemiol. 2016 Aug;31(8):785–92. doi: 10.1007/s10654-016-0132-z. http://europepmc.org/abstract/MED/26946425
  • 15. Miller MR, Hankinson J, Brusasco V, Burgos F, Casaburi R, Coates A, Crapo R, Enright P, van der Grinten CPM, Gustafsson P, Jensen R, Johnson DC, MacIntyre N, McKay R, Navajas D, Pedersen OF, Pellegrino R, Viegi G, Wanger J, ATS/ERS Task Force. Standardisation of spirometry. Eur Respir J. 2005 Aug;26(2):319–38. doi: 10.1183/09031936.05.00034805. http://erj.ersjournals.com/cgi/pmidlookup?view=long&pmid=16055882
  • 16. Celli BR, MacNee W, ATS/ERS Task Force. Standards for the diagnosis and treatment of patients with COPD: a summary of the ATS/ERS position paper. Eur Respir J. 2004 Jun;23(6):932–46. doi: 10.1183/09031936.04.00014304. http://erj.ersjournals.com/cgi/pmidlookup?view=long&pmid=15219010
  • 17. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13-17, 2016; San Francisco, CA. 2016. pp. 785–794.
  • 18. Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ. Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Series B Stat Methodol. 2012 Mar;74(2):245–266. doi: 10.1111/j.1467-9868.2011.01004.x. http://europepmc.org/abstract/MED/25506256
  • 19. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. Springer Texts in Statistics, vol 103. New York, USA: Springer; 2013. Introduction.
  • 20. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988 Sep;44(3):837–45.
  • 21. Martinez FJ, Han MK, Allinson JP, Barr RG, Boucher RC, Calverley PMA, Celli BR, Christenson SA, Crystal RG, Fagerås M, Freeman CM, Groenke L, Hoffman EA, Kesimer M, Kostikas K, Paine R, Rafii S, Rennard SI, Segal LN, Shaykhiev R, Stevenson C, Tal-Singer R, Vestbo J, Woodruff PG, Curtis JL, Wedzicha JA. At the Root: Defining and Halting Progression of Early Chronic Obstructive Pulmonary Disease. Am J Respir Crit Care Med. 2018 Jun 15;197(12):1540–1551. doi: 10.1164/rccm.201710-2028PP. http://europepmc.org/abstract/MED/29406779
  • 22. Bai J, Chen X, Liu S, Yu L, Xu J. Smoking cessation affects the natural history of COPD. Int J Chron Obstruct Pulmon Dis. 2017;12:3323–3328. doi: 10.2147/COPD.S150243.
  • 23. Malenica M, Prnjavorac B, Bego T, Dujic T, Semiz S, Skrbo S, Gusic A, Hadzic A, Causevic A. Effect of Cigarette Smoking on Haematological Parameters in Healthy Population. Med Arch. 2017 Apr;71(2):132–136. doi: 10.5455/medarh.2017.71.132-136. http://europepmc.org/abstract/MED/28790546
  • 24. Eriksson B, Lindberg A, Müllerova H, Rönmark E, Lundbäck B. Association of heart diseases with COPD and restrictive lung function--results from a population survey. Respir Med. 2013 Jan;107(1):98–106. doi: 10.1016/j.rmed.2012.09.011. https://linkinghub.elsevier.com/retrieve/pii/S0954-6111(12)00352-6
  • 25. Bui DS, Lodge CJ, Burgess JA, Lowe AJ, Perret J, Bui MQ, Bowatte G, Gurrin L, Johns DP, Thompson BR, Hamilton GS, Frith PA, James AL, Thomas PS, Jarvis D, Svanes C, Russell M, Morrison SC, Feather I, Allen KJ, Wood-Baker R, Hopper J, Giles GG, Abramson MJ, Walters EH, Matheson MC, Dharmage SC. Childhood predictors of lung function trajectories and future COPD risk: a prospective cohort study from the first to the sixth decade of life. Lancet Respir Med. 2018 Jul;6(7):535–544. doi: 10.1016/S2213-2600(18)30100-0.
  • 26. Suzuki M, Makita H, Konno S, Shimizu K, Kimura H, Kimura H, Nishimura M, Hokkaido COPD Cohort Study Investigators. Asthma-like Features and Clinical Course of Chronic Obstructive Pulmonary Disease. An Analysis from the Hokkaido COPD Cohort Study. Am J Respir Crit Care Med. 2016 Dec 01;194(11):1358–1365. doi: 10.1164/rccm.201602-0353OC.
  • 27. von Haehling S, Anker MS, Anker SD. Prevalence and clinical impact of cachexia in chronic illness in Europe, USA, and Japan: facts and numbers update 2016. J Cachexia Sarcopenia Muscle. 2016 Dec;7(5):507–509. doi: 10.1002/jcsm.12167.
  • 28. Huang W, Huang C, Wu P, Chen C, Cheng Y, Chen H, Lee C, Wu M, Hsu J. The association between airflow limitation and blood eosinophil levels with treatment outcomes in patients with chronic obstructive pulmonary disease and prolonged mechanical ventilation. Sci Rep. 2019 Sep 17;9(1):13420. doi: 10.1038/s41598-019-49918-z.
  • 29. Chin K, Nakamura T, Takahashi K, Sumi K, Ogawa Y, Masuzaki H, Muro S, Hattori N, Matsumoto H, Niimi A, Chiba T, Nakao K, Mishima M, Ohi M, Nakamura T. Effects of obstructive sleep apnea syndrome on serum aminotransferase levels in obese patients. Am J Med. 2003 Apr 01;114(5):370–6. doi: 10.1016/s0002-9343(02)01570-x.
  • 30. de Miguel-Díez J, López-de-Andrés A, Hernandez-Barrera V, Jimenez-Trujillo I, Del Barrio JL, Puente-Maestu L, Martinez-Huedo MA, Jimenez-García R. Prevalence of Pain in COPD Patients and Associated Factors: Report From a Population-based Study. Clin J Pain. 2018 Sep;34(9):787–794. doi: 10.1097/AJP.0000000000000598.
  • 31. Biljak VR, Pancirov D, Cepelak I, Popović-Grle S, Stjepanović G, Grubišić TŽ. Platelet count, mean platelet volume and smoking status in stable chronic obstructive pulmonary disease. Platelets. 2011;22(6):466–70. doi: 10.3109/09537104.2011.573887.
  • 32. Agustí A, Edwards LD, Rennard SI, MacNee W, Tal-Singer R, Miller BE, Vestbo J, Lomas DA, Calverley PMA, Wouters E, Crim C, Yates JC, Silverman EK, Coxson HO, Bakke P, Mayer RJ, Celli B, Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE) Investigators. Persistent systemic inflammation is associated with poor clinical outcomes in COPD: a novel phenotype. PLoS One. 2012;7(5):e37483. doi: 10.1371/journal.pone.0037483. https://dx.plos.org/10.1371/journal.pone.0037483
  • 33. Samukawa T, Matsumoto K, Tsukuya G, Koriyama C, Fukuyama S, Uchida A, Mizuno K, Miyahara H, Kiyohara Y, Ninomiya T, Inoue H. Development of a self-scored persistent airflow obstruction screening questionnaire in a general Japanese population: the Hisayama study. Int J Chron Obstruct Pulmon Dis. 2017;12:1469–1481. doi: 10.2147/COPD.S130453.
  • 34. Kuhn S, Egert B, Neumann S, Steinbeck C. Building blocks for automated elucidation of metabolites: machine learning methods for NMR prediction. BMC Bioinformatics. 2008 Sep 25;9:400. doi: 10.1186/1471-2105-9-400. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-400
  • 35. Akkermans RP, Biermans M, Robberts B, ter Riet G, Jacobs A, van Weel C, Wensing M, Schermer T. COPD prognosis in relation to diagnostic criteria for airflow obstruction in smokers. Eur Respir J. 2014 Jan;43(1):54–63. doi: 10.1183/09031936.00158212. http://erj.ersjournals.com/cgi/pmidlookup?view=long&pmid=23563262

Articles from JMIR Medical Informatics are provided here courtesy of JMIR Publications Inc.
