Author manuscript; available in PMC: 2018 Nov 1.
Published in final edited form as: Thorax. 2017 Jun 21;72(11):998–1006. doi: 10.1136/thoraxjnl-2016-209846

DO “COPD SUBTYPES” REALLY EXIST?

Assessment of COPD Heterogeneity and Clustering Reproducibility in 17,154 Individuals Across Ten Independent Cohorts

Peter J Castaldi 1,2,*, Marta Benet 3,4,5,*, Hans Petersen 6,*, Nicholas Rafaels 7, James Finigan 8, Matteo Paoletti 9, H Marike Boezen 10, Judith M Vonk 10, Russell Bowler 8, Massimo Pistolesi 9, Milo A Puhan 11, Josep Anto 3,4,5,12, Els Wauters 13,14,15, Diether Lambrechts 13,14, Wim Janssens 15, Francesca Bigazzi 9, Gianna Camiciottoli 9, Michael H Cho 1,16, Craig P Hersh 1,16, Kathleen Barnes 7, Stephen Rennard 17,18, Meher Preethi Boorgula 7, Jennifer Dy 19, Nadia H Hansel 20,21, James D Crapo 8, Yohannes Tesfaigzi 6, Alvar Agusti 22, Edwin K Silverman 1,17, Judith Garcia-Aymerich 3,4,5
PMCID: PMC6013053  NIHMSID: NIHMS975271  PMID: 28637835

Abstract

Background

COPD is a heterogeneous disease, but there is little consensus on specific definitions for COPD subtypes. Unsupervised clustering offers the promise of “unbiased” data-driven assessment of COPD heterogeneity. Multiple groups have identified COPD subtypes using cluster analysis, but there has been no systematic assessment of the reproducibility of these subtypes.

Objective

We performed clustering analyses across ten cohorts in North America and Europe in order to assess the reproducibility of 1) correlation patterns of key COPD-related clinical characteristics and 2) clustering results.

Methods

We studied 17,154 individuals with COPD using identical methods and common COPD-related characteristics across cohorts (FEV1, FEV1/FVC, FVC, BMI, MMRC score, asthma, and cardiovascular comorbid disease). Correlation patterns between these clinical characteristics were assessed by principal components analysis (PCA). Cluster analysis was performed using k-medoids and hierarchical clustering, and concordance of clustering solutions was quantified with normalized mutual information (NMI), a metric that ranges from 0 to 1 with higher values indicating greater concordance.

Results

The reproducibility of COPD clustering subtypes across studies was modest (median NMI range 0.17 – 0.43). For methods that excluded individuals that did not clearly belong to any cluster, agreement was better but still suboptimal (median NMI range 0.32 – 0.60). Continuous representations of COPD clinical characteristics derived from PCA were much more consistent across studies.

Conclusions

Identical clustering analyses across multiple COPD cohorts showed modest reproducibility. COPD heterogeneity is better characterized by continuous disease traits coexisting in varying degrees within the same individual, rather than by mutually exclusive COPD subtypes.

Keywords: COPD epidemiology

INTRODUCTION

Chronic obstructive pulmonary disease (COPD) is characterized by significant disease heterogeneity [1,2], but there is little consensus regarding specific definitions for distinct COPD subtypes or phenotypes, terms which have been used interchangeably in the literature. Unsupervised clustering is intuitively appealing because it offers a data-driven, objective assessment of COPD heterogeneity, and several groups have used cluster analysis to identify COPD subtypes [3–9]. However, a recent systematic review showed substantial differences in clustering results across studies [10], calling the reproducibility of these subtypes into question. Since clinical translation of COPD subtypes depends on reproducibility, this is a critical question for the clinical application of clustering-defined subtypes.

On the other hand, the conclusions that may be drawn from the previously mentioned systematic review are limited, since the wide variety of methods used in the different studies precluded quantitative meta-analysis and subject-level assessment of cluster reproducibility. By comparing average COPD-related characteristics across clusters, the authors identified two COPD subtypes that seemed to be reasonably replicable across studies. These subtypes were characterized by 1) severe airflow limitation, low BMI and poor health status and 2) moderate airflow limitation, high BMI and cardiovascular co-morbidities.

To directly assess the reproducibility of COPD clustering subtypes, we performed uniform clustering analyses in 10 independent large cohorts of COPD patients for which the authors had access to individual patient data. These analysis results were shared across cohorts in order to 1) assess the similarity of correlation patterns between selected COPD clinical characteristics and 2) determine the reproducibility of unsupervised clustering across cohorts. These experiments demonstrate that for many important COPD-related clinical characteristics such as FEV1, emphysema, and health-related quality of life, subjects with COPD are distributed along a continuous spectrum rather than being clustered into clearly distinct subgroups. As a result, clustering results are only modestly reproducible across independent studies, and continuous representations of COPD clinical variability are more consistent.

METHODS

Subjects

The participating study populations were CLIPCOPD[7], COPDGene[11], ECLIPSE[12], ICE COLD ERIC[13], LifeLines[14], Lovelace[15], Leuven[16], Lung Health Study[17], the National Jewish Health cohort, and PAC-COPD[18]. Subjects included in this analysis were self-described Caucasian subjects meeting spirometric criteria for COPD, defined as a post-bronchodilator ratio of forced expiratory volume in the first second (FEV1) to forced vital capacity (FVC) < 0.7, with the exception of one cohort[14] that used pre-bronchodilator values. Institutional Review Board approval was obtained from the relevant participating academic centers for all study populations. Further details are provided in the online data supplement.

Clustering Features

Features used as inputs for the clustering analysis were selected based on availability within all ten studies, excluding age and pack-years, which may be drivers of disease itself rather than manifestations. Accordingly, the clustering features finally selected were: FEV1 percent of predicted, FVC percent of predicted, FEV1/FVC ratio, body mass index (BMI), modified Medical Research Council (MMRC) dyspnea score (zero to four), and self-reported asthma and cardiovascular disease diagnosis. Additional details on clustering features are included in the online data supplement.

Statistical and Clustering Analyses

All analyses were performed in R (v3.1.0). To assess the similarity of the correlation patterns between variables, we first performed Principal Component Analysis (PCA) in each cohort, and then we compared the feature loadings for each principal component (PC) across datasets.
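As an illustration of the loading comparison described above, the following minimal sketch computes first-PC loadings in two toy cohorts and compares them. It is not the authors' R code. The cohort values are invented, the power-iteration PCA and the absolute cosine similarity are assumptions chosen to keep the example dependency-free, and all function names are hypothetical.

```python
import math

def standardize(rows):
    """Center each column to mean 0 and scale to unit (population) SD."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def correlation_matrix(rows):
    """Correlation matrix of the columns, computed from standardized data."""
    z = standardize(rows)
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]

def first_pc_loadings(rows, iterations=500):
    """Leading eigenvector of the correlation matrix via power iteration."""
    c = correlation_matrix(rows)
    p = len(c)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(c[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def loading_similarity(a, b):
    """Absolute cosine similarity; the sign of a principal component is arbitrary."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return abs(dot) / (norm_a * norm_b)

# Invented toy data, columns: FEV1 % predicted, FVC % predicted, BMI.
# Cohort B is more severe than cohort A, so the distributions differ.
cohort_a = [(90, 110, 24), (80, 100, 27), (70, 95, 25),
            (60, 85, 28), (50, 80, 23), (85, 105, 26)]
cohort_b = [(50, 80, 26), (40, 70, 24), (30, 65, 27),
            (45, 75, 23), (25, 60, 25), (35, 72, 28)]

pc_a = first_pc_loadings(cohort_a)
pc_b = first_pc_loadings(cohort_b)
similarity = loading_similarity(pc_a, pc_b)
print("PC1 loadings, cohort A:", [round(x, 2) for x in pc_a])
print("PC1 loadings, cohort B:", [round(x, 2) for x in pc_b])
print("Loading similarity:", round(similarity, 2))
```

In this toy setting the two cohorts have different means, yet the first component is dominated by the two spirometric measures in both, which is the kind of agreement a loading comparison is meant to detect.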

To determine reproducibility of clustering solutions, we identified clusters in each cohort using hierarchical and k-medoids clustering according to the methods outlined by Shi and Horvath[19] using a pre-determined range of parameter settings, then we transferred these clustering solutions across cohorts by using supervised random forests predictive models (Figure 1). The predictive accuracy of these models was quantified by out-of-bag (OOB) cross-validation.
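The transfer step can be illustrated with a small sketch: a classifier is trained on the cluster labels of a source cohort and then used to predict labels in a target cohort. The study used supervised random forests for this purpose. The sketch below swaps in a nearest-centroid rule purely to stay dependency-free, so it should be read as a stand-in rather than the authors' method. The data values, cluster names, and function names are all hypothetical.

```python
import math
from collections import defaultdict

def fit_centroids(rows, labels):
    """Mean feature vector per cluster label in the source cohort."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {label: [sum(col) / len(members) for col in zip(*members)]
            for label, members in groups.items()}

def transfer_labels(centroids, rows):
    """Assign each target-cohort subject to the nearest source centroid."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [min(centroids, key=lambda lab: dist(centroids[lab], row)) for row in rows]

# Invented source cohort, columns: FEV1 % predicted, BMI (unscaled for simplicity).
source_rows = [(30, 22), (35, 23), (28, 21), (70, 30), (75, 31), (68, 29)]
source_labels = ["severe", "severe", "severe", "moderate", "moderate", "moderate"]

# Invented target cohort into which the source solution is "predicted".
target_rows = [(25, 24), (80, 28), (40, 22), (65, 32)]

centroids = fit_centroids(source_rows, source_labels)
transferred = transfer_labels(centroids, target_rows)
print(transferred)  # ['severe', 'moderate', 'severe', 'moderate']
```

The transferred labels can then be compared with the clusters generated natively in the target cohort, which is where a concordance measure such as NMI comes in.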

Figure 1.

Overview of Cluster Generation, Transfer, and Concordance Assessment. For each cohort, 23 “source” clustering solutions (S1 to S23) are generated (total of 230 solutions across the 10 cohorts). Each solution is transferred to the other cohorts via a predictive model (T1 to T23). Each solution is also labeled according to its parent cohort, thus source solution 1 from cohort 1 = S1C1. Each cohort ultimately produces 230 cluster solutions (23 source solutions, and 207 transferred solutions which are “predicted into” each cohort). The green, red, and dark blue colors correspond to cluster results generated by a specific cluster method and set of parameters (for example, “k-medoids with k=2”).
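The counts in the Figure 1 caption follow from simple arithmetic, sketched here with hypothetical helper names. The only inputs are the 10 cohorts and 23 source solutions per cohort stated in the caption.

```python
N_COHORTS = 10
SOLUTIONS_PER_COHORT = 23

def solution_counts(n_cohorts, per_cohort):
    """Return (total source solutions, solutions transferred into each cohort,
    total solutions evaluated in each cohort)."""
    source_total = n_cohorts * per_cohort
    transferred_into_each = (n_cohorts - 1) * per_cohort
    per_cohort_total = per_cohort + transferred_into_each
    return source_total, transferred_into_each, per_cohort_total

def solution_label(solution, cohort):
    """Label a source solution by its index and parent cohort, e.g. S1C1."""
    return f"S{solution}C{cohort}"

print(solution_counts(N_COHORTS, SOLUTIONS_PER_COHORT))  # (230, 207, 230)
print(solution_label(1, 1))  # S1C1
```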

We generated 23 clustering solutions per cohort in order to explore a wide range of possible solutions for the methods under study, for a total of 230 solutions. A distinct feature of the hierarchical clustering algorithm is that it identifies “poorly clustered” subjects that are not sufficiently similar to other members in their assigned cluster[20]. In subsequent analyses, the hierarchical clustering results were analysed with and without these “poorly clustered” individuals.
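For readers unfamiliar with k-medoids, the sketch below shows the basic idea on invented data. It is a simplified alternating heuristic with Euclidean distance, not the procedure used in the study, which relied on the Shi and Horvath random forest approach[19]. The starting medoids, data values, and function names are assumptions made for illustration, and the sketch assumes all points are distinct.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, medoids):
    """Label each point with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: euclidean(p, points[medoids[c]]))
            for p in points]

def k_medoids(points, k, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = list(range(k))  # deterministic start: the first k points (an assumption)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            # The medoid is the member with the smallest total distance to the others.
            best = min(members,
                       key=lambda i: sum(euclidean(points[i], points[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)

# Invented subjects, columns: FEV1 % predicted, BMI.
subjects = [(30, 22), (35, 23), (28, 21), (70, 30), (75, 31), (68, 29)]

medoids, labels = k_medoids(subjects, k=2)
print(medoids)  # [0, 3]
print(labels)   # [0, 0, 0, 1, 1, 1]
```

Repeating such a run over a range of values of k, and over more than one clustering method, is one way to obtain a set of candidate solutions per cohort like the 23 described above.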

We quantified the extent to which each “source” clustering solution matched the clusters generated in the other cohorts using normalized mutual information (NMI), a measure of subject-level agreement[21]. For each cohort, the best NMI solutions were considered the most reproducible cluster solutions, and the COPD-related characteristics of these clusters were described by means of descriptive statistics. We determined, based on the average characteristics of each cluster solution, whether any of the clusters resembled the previously mentioned frequently reported COPD subtypes (i.e., the “severe airflow limitation, low BMI and poor health status”, and the “moderate airflow limitation, high BMI and cardiovascular co-morbidities”).
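To make the NMI measure concrete, the following minimal sketch computes it for two label vectors over the same subjects. Several normalizations of mutual information exist, and this section does not state which one was used, so the arithmetic-mean form NMI = I(U;V) / ((H(U) + H(V)) / 2) is an assumption here. The label vectors and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same subjects."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum((nij / n) * math.log(nij * n / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())

def nmi(a, b):
    """Mutual information normalized by the mean of the two entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both labelings put everyone in a single cluster
    return mutual_information(a, b) / ((h_a + h_b) / 2)

# Invented cluster assignments for four subjects.
source = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]    # same partition with the cluster names swapped
unrelated = [0, 1, 0, 1]    # carries no information about `source`
partial = [0, 0, 0, 1]      # agrees with `source` for some subjects only

print(nmi(source, relabeled))
print(nmi(source, unrelated))
print(nmi(source, partial))
```

Because NMI depends only on how subjects are grouped, not on the names given to the clusters, the relabeled solution scores as a perfect match, which is the property needed when comparing source and transferred solutions whose cluster numbering is arbitrary.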

A more comprehensive set of features was explored in two study cohorts, COPDGene and ECLIPSE (COPDGene-ECLIPSE substudy). These features included all of the features in the main study, as well as airway wall thickness (Pi10), quantitative emphysema (LAA950), number of self-reported respiratory exacerbations over the previous 12 months, chronic bronchitis symptoms, and the Saint George’s Respiratory Questionnaire (SGRQ) total score. Additional details are included in the online supplement.

RESULTS

Clinical Characteristics of the Study Samples

The clinical characteristics of the analyzed subjects from all ten cohorts are shown in Table 1.

Table 1.

Description of sociodemographic and clinical characteristics of 17,154 subjects with COPD by cohort

| Characteristic | CLIPCOPD | COPDGene | ECLIPSE | ICECOLDERIC | LEUVEN | LifeLines | Lovelace | LHS | NJH | PAC-COPD |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | Italy | USA | Europe and USA | Switzerland and The Netherlands | Belgium | The Netherlands | Southwestern USA | USA | Colorado USA | Spain |
| N | 367 | 4471 | 2094 | 403 | 548 | 5198 | 539 | 3132 | 60 | 342 |
| Age (years) | 68.3 (8.9) | 57.1 (8.6) | 63.4 (7.1) | 67.3 (9.9) | 67.7 (8.6) | 53.2 (9.1) | 60.4 (8.8) | 49.3 (6.6) | 72.5 (10.0) | 67.9 (8.6) |
| Sex: male, % | 80 | 56 | 66 | 57 | 76 | 48 | 34 | 64 | 53 | 93 |
| Smoking: current, % | 35 | 43 | 36 | 38 | 43 | 28 | 56 | 100 | 7 | 35 |
| FEV1 (% predicted) | 63.9 (24.0) | 57.4 (22.8) | 43.9 (15.0) | 55.4 (16.7) | 49.8 (18.7) | 90.8 (14.8) | 72.8 (18.9) | 76.8 (9.0) | 37.7 (15.1) | 52.3 (16.2) |
| FEV1/FVC (%) | 53.5 (11.4) | 52.2 (13.4) | 44.6 (11.5) | 51.8 (11.8) | 45.2 (12.0) | 64.7 (5.6) | 59.8 (9.3) | 63.1 (5.3) | 54.1 (10.0) | 53.4 (12.0) |
| FVC (% predicted) | 93.8 (24.7) | 81.9 (20.4) | 79.6 (19.9) | 87.3 (19.6) | 45.2 (12.0) | 115.7 (15.8) | 92.7 (17.4) | 95.7 (10.4) | 65.2 (19.3) | 72.6 (16.4) |
| BMI (kg/m2) | 26.3 (4.7) | 27.9 (6.1) | 26.5 (5.6) | 26.1 (5.2) | 24.9 (5.2) | 25.9 (3.7) | 26.8 (5.9) | 25.5 (3.8) | 27.4 (8.3) | 28.2 (4.7) |
| MMRC (0–4) | 2.1 (1.0) | 1.9 (1.5) | 1.7 (1.1) | 1.9 (1.5) | 1.9 (1.1) | 0.3 (0.7) | 1.3 (1.2) | 0.5 (0.7) | 2.9 (0.9) | 1.7 (1.2) |
| Asthma, % | 1 | 23 | 22 | 4 | 0 | 13 | 25 | 8 | 7 | 67 |
| CVD, % | 45 | 20 | 22 | 20 | 38 | 5 | 28 | 1 | 23 | 25 |

Values are mean (SD) unless otherwise noted.

The number of subjects in each cohort ranged from 60 to 5,198. Some studies included COPD patients with a wide range of airflow limitation, whereas others had a predominance of severely affected or less severely affected subjects. Studies drew from populations in the United States and Northern and Southern Europe.

Correlation Patterns and Clustering Importance of COPD Clinical Features

PCA demonstrated that the correlation pattern between variables was extremely similar across cohorts (Figure 2), despite the fact that the distribution of variables differed across them (Table 1). The majority of the variance was captured by the first 3 PCs in all participating cohorts (Figure E1). In addition, when the data were visualized with multi-dimensional scaling, they resembled a continuous surface that tracked closely with spirometric disease severity in all study populations (Figure E2). Thus, the correlation pattern and general structure of the data were highly consistent across cohorts, but the data were not clustered in distinct groups.

Figure 2.

Loadings of input features (cluster variables) for the first four principal components in all cohorts.

As explained in Methods, prior to clustering, features were automatically weighted by the clustering procedure. The importance of each feature for determining cluster membership was very similar between datasets (Figure 3). FEV1 percent of predicted contributed most to the clustering solutions across all participating study populations, followed by FEV1/FVC and FVC. MMRC and BMI contributed to cluster solutions in some study populations but not others, and self-reported asthma and cardiovascular comorbidity did not contribute meaningfully to any clustering solutions.

Figure 3.

Heatmap of Relative Feature Importance for Clustering by Cohort. Colors represent importance values generated by unsupervised random forests clustering. Higher values indicate that a given feature had a larger impact on the clustering results than other features in that dataset. Results for primary analysis in all ten cohorts are shown in Panel A. Results for the COPDGene and ECLIPSE substudy with more clustering features are shown in Panel B.

Reproducibility of Clustering Results Across Cohorts

Figure 4 shows that, within each of the cohorts, the reproducibility of the k-medoids and hierarchical clustering results was modest (range of median NMI across ten cohorts 0.17 – 0.43 and max NMI 0.29 – 0.72). However, when poorly classifiable subjects (identified by the hierarchical clustering method) were excluded, agreement across cohorts was higher (range of median NMI 0.32 – 0.60 and max NMI 0.61 – 1.0). The most highly reproducible cluster solutions varied greatly in terms of the number of identified clusters and cluster characteristics between cohorts. The clinical characteristics of these clusters are shown in Tables E2–E11. The median accuracy of the supervised prediction models used to transfer cluster solutions between cohorts was 90.3% (IQR 82.3–96.3%).

Figure 4.

Reproducibility of Different Clustering Methods Across Ten Cohorts. Distribution of normalized mutual information (NMI*) is shown for clustering with partitioning around medoids (PAM, in blue), hierarchical clustering including unclassified subjects (HC + U, in green), and hierarchical clustering excluding unclassified subjects (HC, in red).

* NMI ranges from 0 (poor reproducibility) to 1 (excellent reproducibility).

We also examined whether these “best NMI” solutions resembled the two clusters identified in the review by Pinto et al[10]. Due to small cluster size, the NJH cohort solutions were not considered. Six of the nine best NMI solutions identified a cluster with severe airflow limitation and moderate MMRC dyspnea scores (Table 2), and three study populations identified a cluster characterized by increased BMI and cardiovascular co-morbidities with mild-moderate airflow limitation (Table 3). While these clusters appeared similar in their average characteristics, the average concordance of subject assignment to these clusters across different cohorts ranged from 50% to 86%.

Table 2.

Clinical characteristics of patients included in the clusters resembling the “severe airflow limitation, low BMI and poor health status” subtype

| Characteristic | CLIPCOPD | COPDGene | ECLIPSE | ICECOLDERIC | LEUVEN | PAC-COPD |
|---|---|---|---|---|---|---|
| N (% of the cohort) | 144 (39%) | 880 (20%) | 250 (12%) | 51 (13%) | 95 (17%) | 58 (17%) |
| FEV1 (% pred) | 41.8 (11.6) | 26.8 (7.4) | 24.8 (4.7) | 27.9 (6.7) | 29.6 (6.4) | 32 (8.0) |
| FEV1/FVC (%) | 47.7 (11.2) | 34.9 (7.6) | 30.3 (4.2) | 38.2 (10.3) | 37.5 (7.4) | 41.1 (8.0) |
| FVC (% pred) | 71.4 (15.8) | 58.8 (12.3) | 63.7 (11.9) | 63.0 (14.9) | 63.2 (9.9) | 57.9 (9.3) |
| BMI (kg/m2) | 25.4 (4.3) | 26.2 (5.7) | 23.9 (4.1) | 23.5 (3.2) | 23.7 (5.3) | 24.9 (3.5) |
| MMRC (0–4) | 2.3 (1.0) | 3.2 (0.7) | 2.5 (0.8) | 2.5 (1.3) | 2.4 (1.1) | 1.9 (1.3) |
| Asthma, % | 1 | 27 | 26 | 2 | 0 | 79 |
| CVD, % | 36 | 23 | 20 | 14 | 37 | 7 |

Values are mean (SD) unless otherwise noted. Complete descriptions of the best NMI cluster solutions for each cohort are available in Supplemental Tables 2–11.

Table 3.

Clinical characteristics of patients included in the clusters resembling the “moderate airflow limitation, high BMI and cardiovascular co-morbidities” subtype

| Characteristic | ICECOLDERIC | LEUVEN | PAC-COPD |
|---|---|---|---|
| N (% of the cohort) | 90 (22.3%) | 60 (10.9%) | 45 (13.2%) |
| FEV1 (% pred) | 71.2 (6.3) | 50.6 (7.2) | 63.8 (4.1) |
| FEV1/FVC (%) | 63.3 (3.6) | 58.2 (6.7) | 65.8 (4.3) |
| FVC (% pred) | 91.9 (10.6) | 68.7 (7.8) | 71.7 (4.7) |
| BMI (kg/m2) | 29.1 (6.0) | 30.1 (5.6) | 31.5 (3.5) |
| MMRC (0–4) | 1.3 (1.3) | 2.1 (0.9) | 1.4 (0.8) |
| Asthma, % | 1 | 0 | 64 |
| CVD, % | 21 | 45 | 31 |

Values are mean (SD) unless otherwise noted. Complete descriptions of the best NMI cluster solutions for each cohort are available in Supplemental Tables 2–11.

COPDGene-ECLIPSE Substudy with a More Extensive Set of COPD-Related Features

We considered the possibility that the modest reproducibility may be due to the limited set of variables common to all ten cohorts. To observe the reproducibility of clustering on a more comprehensive set of variables, we applied the same clustering methods to a larger set of COPD-related clinical measures in subjects in spirometric GOLD Stages 2–4 in the COPDGene and ECLIPSE studies. In addition to the seven features used in the main study, this analysis included measures of airway wall thickness (Pi10), quantitative emphysema from chest CT (LAA950), prior 12-month exacerbation history, chronic bronchitis, and SGRQ score. The variable importance measures demonstrate that spirometric measures contribute the most to these cluster solutions, with the next most important measures being LAA950, MMRC and SGRQ score (Figure 3). These analyses confirmed the findings from the main study, demonstrating modest reproducibility for the clusters that included all subjects and higher reproducibility for clustering approaches that allowed a proportion of subjects to be unclassified. PCA plots of these data also confirm that these data are distributed along a continuum rather than in discrete clusters (Figure 5).

Figure 5.

PCA Plot of Clustering Variables Used in COPDGene k-means Clustering. Visualization of data by the first three principal components in the COPDGene clustering analysis with spirometric, chest CT imaging, and clinical data.

We also considered the possibility that our observed modest cluster reproducibility may be due to differences in the underlying data distributions between cohorts. To address this question, we performed a clustering analysis in the COPDGene-ECLIPSE substudy limited to subjects in GOLD spirometric stage 2 only. The reproducibility of these clustering solutions is comparable to our other experiments (Figure E3).

Because some of the solutions allowing for unclassified subjects did demonstrate high reproducibility, we examined the characteristics of these clusters in both COPDGene and ECLIPSE. The COPDGene analysis identified three clusters that corresponded to a healthier group (higher FEV1 % of predicted, less emphysema, and less airway wall thickening), an emphysema-predominant group, and an airway-predominant group (Table E12). However, the proportion of unclustered subjects was extremely high (86% of all subjects). The most reproducible clustering solution in ECLIPSE identified six clusters, and also demonstrated a high rate of unclassified subjects (52%).

DISCUSSION

This study is the first investigation of the reproducibility of COPD clustering results across multiple independent cohorts, and it demonstrates that 1) COPD subtypes identified through clustering show only modest reproducibility and 2) the variable manifestations of COPD are best represented by continuous traits, such as airflow limitation or quantitative emphysema, that can coexist to varying degrees within the same individual, rather than categorizations of patients in mutually exclusive COPD subtypes/phenotypes. These findings have a number of implications for the future study of COPD subtypes. First, the concept of continuous representations of COPD, similar to the concept of “treatable traits”[22], is a useful alternative to clusters that highlights distinct aspects of COPD, while allowing for the fact that these treatable traits may be present to varying degrees in different subjects. Second, for some sets of variables, standard data-driven clustering methods may not demonstrate levels of reproducibility appropriate for clinical use.

Interpretation of results

The clustering data used in this study capture many important aspects of COPD pathology and have been used in previous attempts to classify COPD[3,6,7,22,23]. The modest reproducibility of clustering solutions can be explained by the fact that these data do not have strong clustering structure and are better characterized by a continuum of disease severity. However, this observation applies only to the limited set of COPD clinical characteristics used in this study. It is possible that other COPD-related characteristics may lead to more reproducible clusters.

Despite modest clustering reproducibility, certain clusters tend to recur across multiple studies. Clustering often identifies a “severe COPD” cluster with low FEV1, low BMI, and dyspnea. The COPDGene-ECLIPSE substudy confirms that this cluster also has extensive CT emphysema. The other commonly occurring cluster is an “airway-predominant cluster” characterized by moderately impaired FEV1 and elevated BMI. In the COPDGene-ECLIPSE substudy, this group also had thickened airway walls and relatively little CT emphysema. These two clusters resemble the clusters identified by Pinto et al, providing additional support to the concept of “emphysema-predominant” and “airway-predominant” COPD.

While our results demonstrate limitations of clustering, they do not indicate that phenotypic differences between subjects with COPD are small or negligible. On the contrary, our data confirm that COPD encompasses a wide range of clinical presentations, because the average characteristics of clusters were quite different. It is also important to note that 1) reproducibility can vary by subtype and 2) many subtype definitions are reproducible in the sense that predictive models can be used to identify groups of subjects in other datasets with similar characteristics. Thus, our findings demonstrate that clustering, as a means to define subtypes in an unbiased manner, is only modestly reproducible for a set of variables that includes many of the most commonly used phenotypic measures of COPD.

Implications of findings

This study has a number of important implications for the future study of COPD subtypes. First, it demonstrates that reproducibility of clustering results cannot be assumed across independent cohorts. Second, it demonstrates that continuous representations of COPD clinical variability are an alternative approach to characterizing COPD heterogeneity that is better suited to the continuous nature of many key COPD-related phenotypic measures. These continuous representations are similar to the concept of “treatable traits” that has been previously proposed as a strategy to improve the management and prognosis of patients with COPD[22]. Unlike clusters, treatable traits are not mutually exclusive since any given patient can manifest more than one “phenotypic” trait. For instance, for two patients with the same amount of airflow limitation and emphysema, one may have bronchiectasis and the other may not, and both of them may or may not have pulmonary hypertension. Third, it may be useful to use differences in clinically relevant outcomes such as risk of exacerbation, mortality, or FEV1 decline to define group boundaries and COPD subtypes. This entails a shift in the general conception of COPD subtypes, because it implies that there may be multiple distinct sets of subtypes that depend on the specific clinical outcome of interest. However, the concept of treatment-specific or outcome-specific subtypes is already well-established in clinical practice (e.g., roflumilast for subjects with COPD and chronic bronchitis to reduce exacerbations). Fourth, the definition of COPD subtypes may benefit from the identification of novel features, including genomic or proteomic features, that more effectively identify distinct COPD subtypes. Fifth, clustering methods that identify a “core” of clustered individuals are more reproducible than methods that assume that all subjects can be classified. Finally, clustering can be useful for data exploration, as long as its potential limitations regarding reproducibility are recognized.

Strengths and limitations

This study has a number of strengths. As noted by Pinto et al, previous efforts to address cluster reproducibility in COPD have been limited by extensive heterogeneity in methods between studies[10]. Our collaborative effort addressed this issue by performing identical clustering analyses across multiple cohorts, resulting in insights that would have been difficult to obtain from studying these cohorts individually. We used multiple clustering methods and explored a wide range of clustering parameters. To our knowledge, this is the largest and most comprehensive replication effort for cluster-based complex disease subtype identification.

This study also has important limitations. First, because the variables used in the primary analysis were limited to those available in all participating study populations, this set of features does not fully capture the phenotypic spectrum of COPD. However, the clustering data used in this study capture many important aspects of COPD pathology and have been used in previous attempts to classify COPD[3,6,7,23,24]. In addition, when a more comprehensive set of variables was assessed in the COPDGene-ECLIPSE substudy, the level of reproducibility was still modest. Second, while all studies included subjects with FEV1/FVC < 0.7, there were still differences in the distribution of variables, enrollment criteria, and subject selection between studies. This variability may have limited the concordance of clustering solutions across studies. However, to address this concern, we performed clustering for an even more well-defined group of only GOLD 2 subjects in COPDGene and ECLIPSE, and the results of this analysis were consistent with the overall study results, suggesting that incomplete sampling was not likely to be a major driver of these results. Third, certain variables related to medical history, such as asthma or cardiovascular disease, are ascertained primarily by self-report and may not be uniform across studies. This would limit the ability to identify potential clusters related specifically to those variables. Fourth, our analysis of clustering methods was not exhaustive. It was outside the scope of this effort to survey the performance of all available clustering methods. Fifth, for those methods that allowed for “unclustered” subjects, the unclustered rate was quite high for the best NMI solutions in some cohorts. This likely reflects the poor separability of the underlying data rather than a shortcoming of the specific clustering method, since this method has been applied successfully in other scenarios[20]. Finally, non-smoking COPD subjects are under-represented in these cohorts, and characterization of heterogeneity in non-smoking COPD requires further study[25].

Conclusions

This study of the replicability of clustering-defined COPD subtypes across multiple international cohorts found that COPD heterogeneity is best represented by continuous traits (such as airflow limitation or quantitative emphysema) coexisting in varying degrees within the same individual, rather than by mutually exclusive COPD subtypes/phenotypes. This is an important perspective to inform future efforts to characterize COPD heterogeneity.

What is the key question?

Are COPD subtypes identified through clustering algorithms reproducible in independent patient populations?

What is the bottom line?

COPD subtypes identified through clustering algorithms have modest reproducibility in the contexts studied, but continuous representations of COPD clinical characteristics are more reproducible.

Why read on?

This is the largest, multi-cohort study explicitly designed to assess the reproducibility of COPD subtypes, and it provides novel insights about the nature of clinical variability in COPD.

Acknowledgments

Funding:

CLIP-COPD: CLIP - COPD was funded by the Ministry of the University and the Ministry of Health of Italy.

COPDGene: The COPDGene Study (NCT00608764) was supported by Award Number R01HL089897 (JDC), R01HL089856 (EKS) and R01 HL075478 (EKS) from the National Heart, Lung, and Blood Institute. The COPDGene project is also supported by the COPD Foundation through contributions made to an Industry Advisory Board comprised of AstraZeneca, Boehringer Ingelheim, Novartis, Pfizer, Siemens and Sunovion. This work was supported by U.S. National Institutes of Health (NIH) grants R01 HL124233 and R01 HL126596 (PJC), R01 HL113264, and the Alpha-1 Foundation (MHC). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, And Blood Institute or the National Institutes of Health.

ECLIPSE: The ECLIPSE Study was funded by GSK (NCT00292552).

ICE COLD ERIC: This study was supported by the Swiss National Science Foundation [Grant 3233B0/115216/1], Dutch Asthma Foundation [Grant 3.4.07.045], and Zurich Lung League (unrestricted grant).

LifeLines: LifeLines has been funded by a number of public sources, notably the Dutch Government, The Netherlands Organization of Scientific Research NWO, the Northern Netherlands Collaboration of Provinces (SNN), the European fund for regional development, Dutch Ministry of Economic Affairs, Pieken in de Delta, Provinces of Groningen and Drenthe, the Target project, BBMRI-NL, the University of Groningen, and the University Medical Center Groningen, The Netherlands.

The Lovelace Smokers Cohort: The Lovelace Smokers Cohort was funded by the State of New Mexico (appropriation from the Tobacco Settlement Fund) and by Institutional funds.

Lung Health Study: This research was supported by GENEVA (U01HG004738 ). The Lung Health Study I was supported by contract NIH/N01-HR-46002.

NJH Cohort: The NJH cohort was supported by National Jewish Health internal funds.

PAC-COPD: The PAC-COPD Study was supported by grants from the Fondo de Investigación Sanitaria (grants PI020541, PI052486, PI052302, and PI060684), Ministry of Health, Madrid, Spain; the Agència d’Avaluació de Tecnologia i Recerca Mèdiques (grant 035/20/02), Catalonia Government, Barcelona, Spain; the Spanish Society of Pneumology and Thoracic Surgery (grant 2002/137); the Catalan Foundation of Pneumology (grant 2003 Beca Marià Ravà); the Red Respira (grant C03/11); the Red de Centros de Investigación Cooperativa en Epidemiología y Salud Pública (grant C03/09); the Fundació La Marató de TV3 ( grant 041110); and Novartis Farmacèutica, Barcelona, Spain. The CIBERESP is funded by the Instituto de Salud Carlos III, Ministry of Health, Madrid, Spain.

Footnotes

Author Contributions: Conception and design: PJC, JGA; Acquisition, Analysis and/or interpretation: PJC, MB, HP, JF, MP, HMB, JMV, MAP, EW, DL, WJ, MHC, KB, SR, MPB, JDC, YT, EKS; Drafting the manuscript for important intellectual content: all authors.
