Abstract
Rationale
Chronic obstructive pulmonary disease (COPD) exhibits considerable progression heterogeneity. We hypothesized that elastic principal graph analysis (EPGA) would identify distinct clinical phenotypes and their longitudinal relationships.
Objectives
Our primary objective was to create a map of COPD phenotypes and their connectivity using EPGA. Secondarily, we used longitudinal and external data sets to test the validity and reproducibility of this map.
Methods
Cross-sectional data from 8,972 tobacco-exposed COPDGene participants, with and without COPD, were used to train a model with EPGA, using 30 clinical, physiologic, and CT features. Longitudinal trajectories were tested in 4,585 participants from COPDGene Phase 2, and external reproducibility was tested in 2,652 participants from SPIROMICS.
Measurements and Main Results
Our analysis used cross-sectional data to create an elastic principal tree, where time is associated with distance on the tree. Six clinically distinct tree segments were identified that differed by lung function, symptoms, and CT features: Subclinical (SC); Parenchymal Abnormality (PA); Chronic Bronchitis (CB); Emphysema Male (EM); Emphysema Female (EF); and Severe Airways (SA) disease. Five-year data from COPDGene were used to map longitudinal changes onto the tree, and longitudinal trajectories demonstrated a net flow of patients from SC towards EM and EF, including trajectories through airway disease predominant phenotypes, CB and SA. Cross-sectional SPIROMICS data projected onto the tree showed clinically similar patient groupings.
Conclusions
This novel analytic methodology provides an approach to defining longitudinal phenotypic trajectories using cross-sectional data. These insights are clinically relevant and could facilitate precision therapy and future trials to modify disease progression.
Clinical trial registered with www.clinicaltrials.gov (NCT00608764 and NCT01969344).
Keywords: COPD, phenotypes, emphysema, airway disease, disease progression
At a Glance Commentary
Scientific Knowledge on the Subject
Chronic obstructive pulmonary disease is a heterogeneous disease characterized by multiple phenotypes and variable progression. However, how individual patients progress through the disease, from one phenotype to another, has been unknown.
What This Study Adds to the Field
This is the first study, to our knowledge, to apply the elastic principal graph method to a large chronic obstructive pulmonary disease dataset, providing an approach to defining longitudinal phenotypic trajectories using cross-sectional data. These insights are clinically relevant and could facilitate precision therapy and future trials to modify disease progression.
Chronic obstructive pulmonary disease (COPD) is characterized by significant heterogeneity in disease presentation and progression (1). It is evident that obstructive physiology does not completely define the spectrum of smoking-induced lung disease. Computed tomographic (CT) imaging has provided key insights into disease pathogenesis, demonstrating that small airway abnormality precedes the development of emphysema (2). Although accelerated lung function decline is a known feature of COPD (3, 4), how individual patients progress from one clinical presentation to another remains poorly understood. Reducing large longitudinal datasets to clinically interpretable patterns is difficult, and traditional clustering approaches using cross-sectional data have difficulty separating disease subtype from disease severity. In addition, clustering algorithms struggle to account for time. Hence, identifying clinically meaningful and recognizable phenotypic groupings and trajectories in COPD has proved challenging.
In this study, we investigated elastic principal graph analysis (EPGA), a sophisticated machine learning method (5–7), which has been used in other contexts for single-cell and clinical trajectory analysis (8, 9), and shows potential for relational subtyping in COPD (10). This methodology combines elements of principal component analysis and graph theory to determine and visualize underlying structure in complex datasets. We hypothesized that we could use this approach to determine a map of COPD subtypes and trajectories from a large cross-sectional dataset. This map takes the form of a bifurcating tree, where each patient is mapped to a segment of the tree, and intersections may represent decisive points for disease progression. Under the assumption that the distance between any two points on the tree, via its branches, reflects the degree of disease progression, we derived “pseudo-time,” a measurement of trajectory progression for each participant. We hypothesized that EPGA could identify the relationship between specific phenotypes over time, overcoming a major problem with traditional clustering approaches where disease severity tends to dominate group assignments.
To do this, we applied EPGA to cross-sectional, baseline COPDGene (Genetic Epidemiology of COPD) cohort data to build a bifurcating tree that identifies phenotypes and infers their longitudinal relationships (11). We evaluated model reproducibility using baseline SPIROMICS (Subpopulations and Intermediate Outcome Measures in COPD Study) cohort data. We then overlaid true 5-year longitudinal COPDGene data onto the tree to explore the degree to which the model characterized real phenotypic changes over time. Preliminary results of this study were reported in abstracts first presented at the European Respiratory Society meeting in 2021 (12) and later, including longitudinal analysis, at the American Thoracic Society meeting in 2023 (13).
Methods
Study Population and Assessments
Subjects participating in the NIH-funded COPDGene study (11), a large multicenter longitudinal observational cohort, had either a current or former tobacco smoke exposure history of ≥10 pack-years and were aged 40–80 years at enrollment. SPIROMICS is an NIH-funded COPD cohort with current or former smoking history of ≥20 pack-years (14). In both cohorts, those with and without airflow obstruction were enrolled, although those with restrictive physiology were excluded from the SPIROMICS protocol. Both studies conducted extensive baseline characterization, including symptomatic assessment, spirometry, and quantitative imaging, as previously described (11, 14). COPD was defined by post-bronchodilator FEV1/FVC <0.70 per the Global Initiative for Chronic Obstructive Lung Disease (GOLD) (3).
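The spirometric categories used throughout this report (PRISm and GOLD 0 to 4) can be written as a simple rule. The sketch below is illustrative only: the text above states just the FEV1/FVC < 0.70 criterion, so the FEV1 percent predicted cut points (80, 50, and 30) and the PRISm rule (preserved ratio with FEV1 below 80% predicted) are taken from the standard GOLD and PRISm definitions and are assumptions here, as is the function name `spirometric_category`.

```python
def spirometric_category(fev1_fvc: float, fev1_pct_pred: float) -> str:
    """Classify post-bronchodilator spirometry as PRISm or GOLD 0-4.

    Cut points other than FEV1/FVC < 0.70 are assumed from the standard
    GOLD/PRISm definitions; they are not stated in the surrounding text.
    """
    if fev1_fvc >= 0.70:
        # Preserved ratio: no airflow obstruction by the fixed-ratio criterion.
        return "GOLD 0" if fev1_pct_pred >= 80 else "PRISm"
    # Airflow obstruction: grade severity by FEV1 percent predicted.
    if fev1_pct_pred >= 80:
        return "GOLD 1"
    if fev1_pct_pred >= 50:
        return "GOLD 2"
    if fev1_pct_pred >= 30:
        return "GOLD 3"
    return "GOLD 4"
```

Applied to segment means rather than individuals, such a rule gives only a rough orientation; classification in the cohorts was performed per participant.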
We used data from the baseline COPDGene visit and the second visit that occurred roughly 5 years after enrollment. At both visits, paired inspiratory and expiratory CT scans were obtained at TLC and FRC. Expiratory CT scans in SPIROMICS were acquired at residual volume. Parametric response mapping (PRM) analysis was performed on paired registered inspiratory and expiratory images using Lung Density Analysis software (Imbio LLC) (15). PRM distinguishes functional small airway disease (PRMfSAD) from emphysema (PRMEmph), as well as identifying a normally functioning lung (PRMNorm) (15). For this analysis, we included parenchymal disease (PRMPD) (16) as well as “rapidly emptying emphysema” (PRMEE) (17); see Figure E1 in the data supplement for a complete PRM subcategory description. COPDGene used Thirona software for quantitative CT analysis of the airways, whereas SPIROMICS used VIDA (18).
Statistical Analyses
Data preparation
Thirty clinical, physiologic, and radiologic variables were selected from the COPDGene dataset (Table 1) for training a model. Variables were selected in consultation with clinician scientists on the basis of their importance to the clinical diagnosis and management of COPD; PRM was included to draw on the expertise of the leading research group. A total of 8,972 participants from the baseline visit (P1) of the COPDGene study were included in this analysis, based on having a complete PRM profile. Of these, 4,585 individuals had phase II (P2) data available for analysis. A total of 2,652 participants from the SPIROMICS study were analyzed. These were selected on the basis of tobacco exposure and variable availability to maximize concordance with the COPDGene dataset. The same 30 variables in SPIROMICS were selected or computed to be comparable with COPDGene (Table 1). Using a previously described method (19), predicted values of emphysema were obtained for all available subjects with the required data from P2 of the COPDGene Study; the specific model used included clinical, transcriptomic, and proteomic predictors.
Table 1.
Dynamic Phenotyping Segment Features from COPDGene Phase I
| | Subclinical (n = 3,198) | Parenchymal Abnormality (n = 1,855) | Bridge 1 (n = 999) | Chronic Bronchitis (n = 738) | Bridge 2 (n = 777) | Severe Airway Disease (n = 343) | Emphysema Male (n = 562) | Emphysema Female (n = 482) |
|---|---|---|---|---|---|---|---|---|
| GOLD spirometric grade (PRISm/GOLD 0/1/2/3/4) | 165/2,260/475/292/5/1 | 529/1,099/71/128/26/2 | 236/444/85/226/8/0 | 131/42/19/354/177/15 | 24/26/55/542/128/2 | 4/1/1/106/189/42 | 0/0/0/24/264/274 | 0/0/0/55/236/191 |
| Age, yr | 57.2 (8.0) | 53.1 (6.2) | 66.3 (7.4) | 60.1 (8.2) | 65.2 (8.0) | 63.9 (8.9) | 65.6 (7.6) | 65.6 (7.5) |
| Sex, n (% female) | 707 (22) | 1,373 (74) | 786 (79) | 444 (60) | 265 (36) | 126 (37) | 27 (5) | 438 (91) |
| Race, n (% Black) | 925 (29) | 1,318 (71) | 53 (5) | 234 (32) | 101 (14) | 53 (15) | 78 (14) | 117 (24) |
| BMI, kg/m2 | 27.2 (4.6) | 31.5 (7.1) | 31.0 (6.2) | 33.5 (6.8) | 27.0 (5.0) | 27.5 (5.4) | 25.3 (4.8) | 25.3 (5.2) |
| Currently smoking, n (%) | 1,909 (60) | 1,587 (86) | 137 (14) | 412 (56) | 284 (38) | 161 (47) | 152 (27) | 83 (17) |
| Smoking pack-years | 39.1 (20.6) | 36.5 (20.0) | 42.2 (22.1) | 54.1 (28.0) | 55.1 (28.2) | 58.9 (28.7) | 64.0 (31.1) | 47.3 (23.3) |
| Age when started smoking, yr | 16.9 (4.4) | 17.0 (5.3) | 17.8 (4.1) | 15.9 (4.3) | 16.7 (4.4) | 16.1 (4.8) | 16.1 (3.8) | 17.7 (4.7) |
| SGRQ | 13.8 (15.5) | 27.9 (21.3) | 15.2 (13.8) | 52.5 (18.6) | 30.6 (17.9) | 47.6 (18.9) | 52.3 (17.2) | 49.3 (17.7) |
| mMRC score | 0.5 (1.0) | 1.5 (1.4) | 0.8 (1.1) | 2.8 (1.1) | 1.5 (1.3) | 2.4 (1.3) | 2.8 (1.0) | 2.9 (1.0) |
| Chronic bronchitis, n (%) | 430 (13) | 317 (17) | 45 (5) | 329 (45) | 165 (22) | 138 (40) | 193 (34) | 93 (19) |
| Exacerbations in the prior year | 0.1 (0.4) | 0.2 (0.6) | 0.2 (0.6) | 1.1 (1.4) | 0.5 (0.9) | 0.9 (1.2) | 1.0 (1.4) | 1.2 (1.5) |
| 6-minute-walk distance, ft | 1,593.6 (329.2) | 1,289.3 (350.3) | 1,424.6 (289.9) | 1,047.5 (351.1) | 1,355.0 (308.1) | 1,177.3 (349.4) | 1,021.5 (362.5) | 946.9 (355.9) |
| FEV1, L | 3.1 (0.6) | 2.2 (0.5) | 2.0 (0.4) | 1.6 (0.5) | 1.8 (0.5) | 1.3 (0.5) | 1.0 (0.4) | 0.8 (0.3) |
| FEV1, % predicted | 94.9 (14.3) | 84.8 (16.4) | 81.5 (13.3) | 59.3 (14.5) | 62.8 (13.4) | 44.9 (13.3) | 30.8 (10.6) | 34.2 (12.2) |
| FVC, L | 4.2 (0.8) | 2.8 (0.6) | 2.8 (0.5) | 2.6 (0.7) | 3.4 (0.8) | 3.0 (0.8) | 2.8 (0.8) | 2.0 (0.5) |
| FVC, % predicted | 98.8 (13.2) | 86.1 (15.8) | 85.1 (12.4) | 74.1 (14.5) | 87.2 (15.0) | 78.9 (17.6) | 66.1 (16.7) | 66.6 (16.1) |
| FEV1/FVC | 0.74 (0.08) | 0.78 (0.07) | 0.73 (0.08) | 0.62 (0.10) | 0.55 (0.10) | 0.43 (0.11) | 0.35 (0.09) | 0.39 (0.09) |
| Pre-/post-bronchodilator FEV1 % change (%) | 3.8 (6.4) | 2.6 (7.8) | 5.1 (6.9) | 9.1 (11.3) | 7.8 (9.3) | 30.2 (17.3) | 6.4 (10.8) | 6.3 (9.6) |
| Pre-/post-bronchodilator FVC % change (%) | 1.6 (7.3) | 0.2 (9.3) | 2.6 (8.2) | 7.5 (11.8) | 6.0 (9.1) | 34.2 (33.2) | 5.3 (12.1) | 4.7 (10.9) |
| TLC, L | 6.2 (1.1) | 4.1 (0.7) | 4.8 (0.7) | 4.9 (0.9) | 6.4 (1.2) | 6.4 (1.3) | 7.6 (1.1) | 5.4 (0.7) |
| FRC, L | 3.2 (0.7) | 2.3 (0.5) | 2.6 (0.5) | 3.1 (0.6) | 4.0 (0.8) | 4.6 (1.0) | 5.8 (0.9) | 3.9 (0.6) |
| Mean HU inspiration | −837.1 (23.2) | −784.3 (36.2) | −825.9 (20.6) | −812.2 (23.3) | −855.9 (17.2) | −855.2 (23.1) | −877.0 (19.1) | −870.5 (19.4) |
| Mean HU expiration | −692.6 (48.1) | −633.6 (54.9) | −690.1 (44.4) | −714.7 (42.4) | −773.8 (28.2) | −801.1 (31.2) | −839.0 (27.7) | −824.8 (26.9) |
| Segmental wall area, % | 47.1 (7.4) | 52.0 (8.4) | 48.4 (6.0) | 59.7 (7.2) | 52.6 (7.2) | 60.1 (7.1) | 56.0 (7.3) | 52.3 (6.8) |
| Pi10 | 2.0 (0.5) | 2.4 (0.6) | 2.1 (0.4) | 3.0 (0.6) | 2.5 (0.5) | 3.1 (0.5) | 2.9 (0.5) | 2.7 (0.4) |
| PRMNorm (%) | 63.4 (10.8) | 52.9 (15.9) | 61.1 (9.0) | 50.5 (10.9) | 42.3 (10.9) | 34.6 (11.2) | 21.0 (8.1) | 24.7 (7.8) |
| PRMfSAD (%) | 12.5 (9.5) | 5.8 (5.8) | 12.3 (7.8) | 16.9 (8.5) | 29.2 (10.2) | 34.7 (9.5) | 36.8 (8.9) | 34.9 (8.6) |
| PRMEmph (%) | 1.5 (2.5) | 0.3 (0.8) | 1.2 (1.7) | 2.1 (3.5) | 8.6 (7.2) | 11.5 (9.9) | 26.1 (12.7) | 22.7 (11.7) |
| PRMPD (%) | 20.7 (6.6) | 40.6 (16.6) | 24.0 (6.6) | 29.4 (8.9) | 17.3 (3.8) | 17.8 (5.4) | 14.6 (3.4) | 16.0 (3.9) |
| PRMEE (%) | 1.8 (2.0) | 0.5 (0.6) | 1.5 (1.4) | 1.2 (1.3) | 2.6 (2.0) | 1.5 (1.3) | 1.4 (1.1) | 1.7 (1.4) |
Definition of abbreviations: BMI = body mass index; COPDGene = Genetic Epidemiology of chronic obstructive pulmonary disease; GOLD = Global Initiative for Chronic Obstructive Lung Disease; HU = Hounsfield units; mMRC = modified Medical Research Council; Pi10 = square root of the wall area of a theoretical airway of 10-mm luminal perimeter; PRM = parametric response mapping; SGRQ = St. George’s Respiratory Questionnaire.
All values expressed as mean (SD), except categorical variables expressed as count (percent).
Due to missing data, age, BMI, smoking start age, sex, smoking status, exacerbation history, race and chronic bronchitis, n = 8,971; SGRQ, n = 8,970; pack-years, n = 8,969; FEV1, FEV1 percent predicted, FVC, FVC percent predicted, FEV1/FVC, n = 8,968; mMRC, n = 8,960; mean inspiratory HU and Pi10, n = 8,959; wall area percentage, n = 8,958; FEV1 and FVC bronchodilator reversibility, n = 8,860; FRC and mean expiratory HU, n = 8,042.
Clinical trajectory analysis
The 8,972 × 30 data matrix for the COPDGene P1 cohort was processed using Jupyter notebooks, following the previously described methodology (8), adapted from templates hosted on GitHub (https://github.com/auranic/ClinTrajan), and using consensus tree modeling (5), to determine an elastic principal tree (EPT) summarizing the data. Examination of the characteristics of participants along the EPT identified one branch containing relatively healthy (subclinical) subjects. The terminal node of this segment was designated as the root node for studying cross-sectional trajectory profiles; pseudo-time values were calculated as the geodesic distance from this root node. Further details, including methods for reproducibility work using the SPIROMICS cohort, longitudinal (real-time) analysis using COPDGene P2 data, and Kaplan-Meier survival analysis, are provided in the data supplement. Statistical tests were performed with MATLAB (R2023a; MathWorks) at the 5% significance level.
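To make the pseudo-time definition concrete, the following minimal sketch computes the geodesic (along-branch) distance from a root node to every other node of a weighted tree. It is not the ClinTrajan implementation: the function name `pseudo_time`, the node labels, and the edge lengths are hypothetical, and the toy topology is only loosely modeled on the tree described here.

```python
import heapq

def pseudo_time(edges, root):
    """Return the along-branch distance from `root` to every node of a tree.

    edges: iterable of (node_a, node_b, length) tuples with length >= 0.
    """
    graph = {}
    for a, b, length in edges:
        graph.setdefault(a, []).append((b, length))
        graph.setdefault(b, []).append((a, length))

    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for neighbor, length in graph.get(node, []):
            candidate = d + length
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

# Hypothetical toy tree: the root is the terminal node of the subclinical segment.
toy_edges = [
    ("SC_end", "hub", 2.0),
    ("hub", "PA_end", 1.5),
    ("hub", "B1", 1.0),
    ("B1", "CB_end", 2.5),
]
toy_pt = pseudo_time(toy_edges, root="SC_end")
```

In a tree there is exactly one path between any two nodes, so a plain traversal would suffice; Dijkstra's algorithm is used here only so that the same sketch also behaves sensibly on a general graph. Participants projected onto an edge rather than a node would additionally need their fractional position along that edge added to the distance of the nearer endpoint.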
Results
Baseline data (n = 8,972), highlighting the 30 variables used to create the EPT, are illustrated in Table E1, consisting of at-risk individuals with normal spirometry (n = 3,872), preserved ratio impaired spirometry (PRISm; n = 1,089), GOLD 1–2 (n = 2,438), and GOLD 3–4 (n = 1,573). The 30 variables included 1) demographics such as age, sex, body mass index (BMI), smoking history, and respiratory symptoms; 2) lung function; and 3) quantitative CT imaging data including the square root of the wall area of a theoretical airway of 10 mm luminal perimeter (Pi10), wall thickness, and PRM data.
The EPT is displayed in Figure 1. This model defines not only unique subgroups but also intermediary states that may act as bridges connecting them. The method uses patient data from all disease stages to create a model of patient states or phenotypes across pseudo-time (distance along tree segments). In other words, the tree suggests the relationship between various phenotypes over time as the disease progresses. Heatmaps for all variables used to train the model are included in Figure E2.
Figure 1.
Graphic representation of the consensus elastic principal tree model with baseline COPDGene (Genetic Epidemiology of chronic obstructive pulmonary disease) data. This simple graph was generated on the basis of a Kamada-Kawai force-directed algorithm. Points projected onto a node are plotted in a random direction from that node; all other points are randomly directed to one side of the segment they were projected onto. Distance from the graph (faded line) is directly proportional to the projection distance in the principal space. Point colors represent Global Initiative for Chronic Obstructive Lung Disease (GOLD) grade (green, GOLD 0; purple, preserved ratio impaired spirometry; blue, GOLD 1; yellow, GOLD 2; orange, GOLD 3; and red, GOLD 4). Segments are annotated with phenotypic labels based on the descriptive statistics of projected participants.
We identified six distinct terminal segments (phenotypes) and two bridging segments (B1 and B2). The clinical characteristics for the segments and bridges are provided in Table 1. Segments were named on the basis of assessment of the clinical characteristics of individuals in each segment: subclinical (SC); parenchymal abnormality (PA); chronic bronchitis (CB); severe airway disease (SA); and two severe emphysema groups, a predominantly male (EM) and a predominantly female (EF) segment, that split primarily on the basis of sex (95% male/5% female vs. 9% male/91% female). Representative PRM images for each segment are presented in Figure E3. It is important to note, however, that there is still heterogeneity even within a single segment, as can be seen in Figure 1.
When considering each segment in aggregate, the PA group was the youngest, followed by the SC group. The PA group was notable for having the most PRISm subjects (n = 529; 28.5%), the highest FEV1/FVC ratio of any group (0.78 ± 0.07), and the highest proportion of current smokers (n = 1,587; 86%). The PA group also had the highest mean Hounsfield units (HUs) at TLC of any group (−784.3 ± 36.2) and the highest percentage of PRM parenchymal abnormality (40.6 ± 16.6%). The SC group had the highest proportion of GOLD 0 individuals (n = 2,260; 70.7%), was also relatively young with a mean age of 57.2 ± 8.0 years, had the least amount of airway thickening as measured by Pi10, and had the greatest amount of PRMNorm (63.4 ± 10.8%) of any group.
The CB group was characterized by the highest St. George’s Respiratory Questionnaire total score relative to any other group (mean, 52.5 ± 18.6). The BMI for this group was also the highest of any segment (33.5 ± 6.8 kg/m2). The mean exacerbations in the prior year was also high at 1.1 ± 1.4, second only to the EF group, suggesting that this is a fairly symptomatic group, despite modest lung function impairment with mean FEV1 of 59.3 ± 14.5% predicted. This group was characterized by large airway disease, with 45% meeting criteria for chronic bronchitis, the largest for any group. This group had the second greatest CT measures of large airway disease, with a segmental wall area percentage of 59.7 ± 7.2% and Pi10 of 3.0 ± 0.6. Overall, characteristics of segment members appear to be similar to those traditionally thought of as “blue bloaters” (20).
The SA group had evidence of both large and small airway disease, with numerically the highest segmental wall area percentage (60.1 ± 7.1%) and Pi10 (3.1 ± 0.5) of any group. The SA group also had levels of PRMfSAD numerically similar to the other two severe groups (%PRMfSAD 34.7 ± 9.5 vs. 36.8 ± 8.9 in EM and 34.9 ± 8.6 in EF) but less emphysema (%PRMEmph 11.5 ± 9.9 vs. 26.1 ± 12.7 in EM [P < 0.001] and 22.7 ± 11.7 in EF [P < 0.001]).
Of the two severe emphysema groups, the male predominant group had more % PRMEmph (26.1 ± 12.7 vs. 22.7 ± 11.7; P < 0.001), more chronic bronchitis (34% vs. 19% [P < 0.001]), thicker airways (Pi10, 2.9 ± 0.5 vs. 2.7 ± 0.4 [P < 0.001]; segmental wall area, 56.0 ± 7.3% vs. 52.3 ± 6.8% [P < 0.001]) as well as greater smoking history (64 vs. 47.3 pack-years) as compared with the female predominant group. However, both emphysema groups are characterized by the lowest BMI values, 25.3 ± 5.2 and 25.3 ± 4.8 kg/m2 for EF and EM, respectively, and may represent a “pink puffer” phenotype (20). Predicted emphysema was assessed in 2,846 patients (see Figure E4). EM and EF had the highest predicted emphysema as expected, followed by SA and B2. Participants in the PA group had noticeably less predicted emphysema than all other groups.
Model reproducibility was analyzed using cross-sectional SPIROMICS data. Table E2 provides the whole-segment features for each of the phenotypes. We qualitatively compared principal component analysis output (in the principal plane) and projection onto the same EPT structure between the two cohorts (Figure E5), observing very similar distributions of points with respect to GOLD classification, despite notable limitations, including both the lack of PRISm cases in SPIROMICS and different methods of airway wall analysis (Thirona vs. VIDA). We quantitatively assessed similarity between the two cohorts across model segments in Figure E6, which showed similarity in FEV1 percent predicted, FEV1 (L), and FEV1/FVC for COPDGene and SPIROMICS cases projected to the same segment, and tested statistical equivalence between cohorts by segment in FEV1 for two possible minimal clinically significant difference thresholds of 100 ml and 140 ml, based on the literature (21).
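One common way to test equivalence within a margin is the two one-sided tests (TOST) procedure. The text above does not state which equivalence procedure was used, so the sketch below is an assumption for illustration only: it applies TOST with a large-sample normal approximation to two hypothetical samples of FEV1 (in liters), using the 100 ml and 140 ml margins mentioned above. The function name `tost_equivalent` and the toy data are not from the study.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def tost_equivalent(a, b, margin, alpha=0.05):
    """Two one-sided tests for equivalence of means within +/- margin.

    Uses a normal approximation with unpooled variances (assumption).
    Returns (p_value, is_equivalent).
    """
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist()
    # H0 (lower): diff <= -margin; rejected when diff is well above -margin.
    p_lower = 1.0 - z.cdf((diff + margin) / se)
    # H0 (upper): diff >= +margin; rejected when diff is well below +margin.
    p_upper = z.cdf((diff - margin) / se)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha

# Hypothetical FEV1 values (L) for one segment in two cohorts.
cohort_a = [1.50 + 0.01 * i for i in range(21)]
cohort_b = [1.52 + 0.01 * i for i in range(21)]

for margin_l in (0.100, 0.140):  # 100 ml and 140 ml expressed in liters
    p, ok = tost_equivalent(cohort_a, cohort_b, margin_l)
    print(f"margin {margin_l:.3f} L: p = {p:.4g}, equivalent = {ok}")
```

Equivalence is declared only when both one-sided null hypotheses are rejected, which is why the larger of the two p-values is reported. For small samples a t-distribution would be more appropriate than the normal approximation used here.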
To study real-time changes in relation to our model, we examined longitudinal data from COPDGene. In Figure 2A, we present a Sankey plot based on 1,322 participants (29% of participants with longitudinal data) who changed segments between P1 and P2. In this figure, we present proportional net changes in segment membership, showing the dominant directions of change from P1 to P2, to understand bias in longitudinal progression. Several predominant patterns emerged (Figure 2B). For simplicity, EM and EF were merged in this analysis. There was a strong bias in patients moving from SC directly to B2 (42%) with a comparatively smaller flow to B1 (5%). The dominant flow of patients leaving B2 transitioned to emphysema (63%). There was also a notable flow of SC patients who transitioned to CB (42%). Patients in CB appear to have transitioned predominantly in one of two directions, to SA (47%) or directly to emphysema (53%). Not surprisingly, for those who transitioned from SA, the net flow is dominated by ending in emphysema only. Proportionately, significantly less flow bias occurred in the PA group. Of those who did shift (n = 242), we saw flow dominance in the SC (44%) and CB (56%) directions.
Figure 2.
Analysis of the directional bias in longitudinal segment changes for COPDGene (Genetic Epidemiology of chronic obstructive pulmonary disease) participants that changed segment between phase I and phase II of the COPDGene study (1,322/4,585 = 29% of participants with phase II data). (A) A Sankey diagram expressing proportional net change in segment membership between phase I (left) and phase II (right) in the COPDGene study. Boxes correspond to tree segments (see Figure 1) and are annotated externally with their segment label. Only positive net changes (i.e., the dominant directions of change from phase I to phase II [left to right] between pairs of segments) are shown. Intermediate bar widths are calculated as the net change divided by the total number of participants shifting in either direction for every pair of segments (i.e., showing the net change proportional to the total traffic for each pair of segments). Boxes are labeled with the percentage of proportional net change leaving phase I (left) or arriving at phase II (right) for each segment. Intermediate bars are labeled with the percentage of proportional net change leaving or arriving at a specific segment. (B) The elastic principal tree (Figure 1, grayscale) overlaid with annotations expressing the dominant direction and magnitude of proportional net changes in segment membership. The widths of the arrows are directly proportional to the proportional net change for the indicated pair of segments. Pairs with a net change of less than 10 participants were excluded. B1 = bridge 1; B2 = bridge 2; CB = chronic bronchitis; EMPH = emphysema (pooled male and female); PA = parenchymal abnormality; SA = severe airway disease; SC = subclinical.
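The proportional net change described in this caption (net change divided by total traffic for each pair of segments) can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function name `proportional_net_changes`, the `min_net` parameter, and the toy transition counts are hypothetical. The further normalization to percentages leaving or arriving at each segment is not reproduced.

```python
from collections import Counter

def proportional_net_changes(transitions, min_net=10):
    """Net change over total traffic for each pair of segments.

    transitions: iterable of (segment_at_P1, segment_at_P2) per participant.
    Returns {(source, target): net / (forward + backward)} keyed by the
    dominant direction; pairs with |net| < min_net are dropped.
    """
    counts = Counter((a, b) for a, b in transitions if a != b)
    result = {}
    seen = set()
    for a, b in list(counts):
        pair = frozenset((a, b))
        if pair in seen:
            continue
        seen.add(pair)
        forward, backward = counts[(a, b)], counts[(b, a)]
        net = forward - backward
        if net == 0 or abs(net) < min_net:
            continue  # no dominant direction, or too few participants
        direction = (a, b) if net > 0 else (b, a)
        result[direction] = abs(net) / (forward + backward)
    return result

# Hypothetical toy data: 30 move SC->B2, 10 move B2->SC, 6 SC->PA, 5 PA->SC.
toy_transitions = (
    [("SC", "B2")] * 30 + [("B2", "SC")] * 10
    + [("SC", "PA")] * 6 + [("PA", "SC")] * 5
)
toy_flows = proportional_net_changes(toy_transitions)
```

In the toy data, the SC and B2 pair has a net change of 20 over a total traffic of 40, giving 0.5 in the SC to B2 direction, whereas the SC and PA pair (net change of 1) falls below the threshold and is omitted.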
Membership counts for segments at P1 and P2 are presented in Figure E7. A Sankey plot showing segment membership change for all participants who changed segment is included as Figure E8. A Sankey plot showing all transitions can be seen in Figure E9. An analysis of displacement in participants who did not change segment is provided in Figure E10; in brief, average pseudo-time changes were small but consistently directed away from the end of the subclinical segment, with the exception of the SA segment, and in accelerated decline cases (FEV1 drop of at least 60 ml per year), the displacements were notably larger and in the same direction (toward disease terminals), except in the SA segment, where magnitude of displacement decreased.
Patient survival was assessed via the Kaplan-Meier method (Figure 3) using longitudinal COPDGene data (median follow-up, 7.8 yr), which demonstrated notable differences in mortality for each phenotype, assessed using a log-rank test (P < 0.05). This analysis was performed on cases on the distal half of each terminal segment to study survival in the most phenotypically differentiated and progressed subjects, as well as subsets of the bridging segments to study mortality change across them. EM and EF segments had the lowest survival estimate, followed by SA, CB, PA, and then SC. The survival rate of men (EM) with emphysema appeared to be notably worse than that among women (EF), with 44.5% and 54.8% surviving, respectively. Analysis of the bridging segments showed declining survival rates with increasing distance from SC.
Figure 3.
Kaplan-Meier survival analysis for terminal and bridging segments. This analysis was conducted from a set of 7,975 participants from phase I of the COPDGene (Genetic Epidemiology of chronic obstructive pulmonary disease) study who had survival data available and were followed for an average of 2,716 (±709) days. Pseudo-time for this analysis was measured from the node at the intersection of SC, PA, and B1. (A) Survival analysis of the terminal segments. Analysis was performed per segment on participants with above average pseudo-time values, focusing on the distal terminal regions as indicated. (B) Survival analysis of bridging segments. Segments are split into Prox. (proximal) and Dist. (distal) parts for participants with below or above average pseudo-time values, respectively. Plots are annotated with the total mortality rate (%) and count (N) of each subset. This analysis was performed in MATLAB via the ecdf (empirical cumulative distribution function) function. B1 = bridge 1; B2 = bridge 2; CB = chronic bronchitis; EF = emphysema female; EM = emphysema male; PA = parenchymal abnormality; SA = severe airway disease; SC = subclinical.
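For readers unfamiliar with the Kaplan-Meier (product-limit) estimator, a minimal sketch is given below. The analysis in this study was performed in MATLAB; this Python version is a stand-in for illustration, and the function name `kaplan_meier` and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times: follow-up time for each participant.
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (time, survival_probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all participants whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # deaths and censored both leave the risk set
    return curve

# Hypothetical toy data: follow-up in years, 1 = died, 0 = censored.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
```

Censored participants contribute to the risk set up to their last follow-up but never reduce the survival estimate directly, which is what allows cohorts with unequal follow-up to be compared.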
Discussion
In this analysis, we demonstrated that EPGA allows us to take cross-sectional clinical, physiologic, and CT data to identify distinct COPD phenotypes and provide a model of their longitudinal relationships. We examined reproducibility of the model using the external SPIROMICS dataset, demonstrating that the principal component projections of the two datasets were similar, as were the characteristics of patients for the modeled subtypes. We also mapped COPDGene 5-year longitudinal data onto the tree, allowing us to determine how patients behaved longitudinally relative to the model. Our results not only provide a new nosological framework for clinically recognizable COPD subtypes but also help to define trajectories as their longitudinal relationships.
This analysis helps us understand the ways in which patients with predisease and mild disease might progress. To provide further context to our findings, others have described and modeled various lung function trajectories in COPD (22–24). Although it is difficult to translate data from these analyses to clinically recognizable phenotypes or to understand how phenotypes relate to one another, there are similar themes comparing this analysis with a prior analysis of the COPDGene and ECLIPSE cohorts performed by Young and colleagues in which the authors used a machine learning tool called “Subtype and Stage Inference” (SuStaIn) to analyze longitudinal imaging data (25). They identified two primary progression pathways. The predominant one (70% of subjects) was characterized by development of small airway disease and emphysema first as measured by PRM, which was used in this analysis, followed by changes in the larger airways. The second pathway (30% of subjects) was characterized by changes first in the large airways followed by small airway disease and emphysema. Young’s primary pathway may correspond to the one we noted where individuals progressed from SC through the bridging segments to emphysema. The secondary pathway noted by Young could relate to participants progressing through the CB and SA segments.
Several other findings from this analysis are worthy of further discussion. Although we reported fewer net transitions from the PA group relative to the SC group, there does appear to be a transition pathway between the PA and CB segments. The PA segment also contains the largest percentage of individuals in the PRISm spirometric category and the greatest amount of parenchymal abnormality defined by PRMPD that identifies lung tissue that is of greater density at TLC (16), which could represent parenchymal inflammation. This group is also notable for the highest percentage of Black participants in both COPDGene (71%) and SPIROMICS (43%), which could suggest social disadvantage risk factors for COPD in this segment. A question for future investigation is whether prevention and treatment strategies for individuals who are in the PA segment or originate from the PA segment should be approached differently from a risk factor modification or treatment standpoint, particularly if, for instance, early-life factors play a significant role in the development of this phenotype.
Differences in sex distribution were also seen throughout the map. In milder disease, the SC and SA groups leaned more heavily male, whereas the PA and CB groups leaned more heavily female. The model split the severe emphysema groups almost entirely into female and male groups, notable for significantly lower pack-years of tobacco exposure among the female group. Sex differences in COPD have been described previously (26). It has been reported that women may have more disease for the same level of smoking history (27). This was apparent when comparing the EM and EF groups, which differed by roughly 17 pack-years of smoking exposure (64.0 vs. 47.3) yet had very similar amounts of emphysema.
Limitations
We acknowledge several limitations to this analysis. Although it is a convenient assumption, the pseudo-time connectivity implied by our graph is unlikely to correspond to all real-time pathways of disease; patient-specific temporal dynamics are likely more complex than what we captured with a simple tree. However, we present our work as a method to capture a structure within which patients move in time, and we have provided compelling evidence to support its potential to describe general and clinically recognizable trends in disease subtypes. It is important to note that there is still heterogeneity among individuals along any individual segment, with more extreme phenotypes seen at the terminal ends. Data required to map individuals in this analysis may not be available in clinical practice, and the imputation of missing data may be insufficient for accurate assessment. However, we are encouraged by the fact that the projection of participants from the SPIROMICS cohort to the tree appeared similar to that of the COPDGene cohort and that the patient characteristics for the phenotypes appeared similar between the two cohorts, despite differences in study protocol. Furthermore, the phenotypes identified are also clinically recognizable, adding to the strength of the results, although we acknowledge additional longitudinal analysis is needed to provide further validity, particularly to confirm how pathways link over longer periods of time. Finally, although we demonstrate differences in mortality between the various segments, it is evident that each segment still represents a spectrum of disease. Further refinements of the segments may be important.
Our work raises two important clinical questions: 1) To what extent does an individual’s starting point influence subsequent disease trajectory and phenotype? and 2) Are two individuals within the same segment truly similar, regardless of their journey to arrive there? In other words, does the disease progression pathway define the phenotype itself? This would be akin to the concept of the “rapid progressor” phenotype, but the phenotyping approach outlined here allows this to be understood in a much more granular way. Because our data focus on older individuals, we also do not know where the entry point is for all patients. Although we have made a conceptual assumption that the SC segment is the “root,” SC, PA, or B1 could each be an entry point onto the tree for individuals in young adulthood. Ultimately, applying this deeper phenotyping approach to more patients over longer periods of time, as well as studying specific interventions where responses can be monitored, will help us answer these questions.
Conclusions
Using two large COPD cohorts, we have demonstrated that cross-sectional data can be used not only to identify disease phenotypes but also to infer their longitudinal relationships. We believe this provides a new potential framework for understanding the relationship between various predisease states and subsequent disease phenotypes, representing a new way of conceptualizing how patients progress through this complex disease beyond simply declines in FEV1. The ultimate future utility of this approach would be to phenotype specific patients, determine if we can predict longitudinal progression for them, and finally understand how phenotype predicts disease response to targeted interventions (28).
Footnotes
Supported by NHLBI grants R01 HL089897, R01 HL089856, R01 HL150023, R01 HL167072, K24 HL138188, K01 HL166705; the NIH/NHLBI (HHSN268200900013C, HHSN268200900014C, HHSN268200900015C, HHSN268200900016C, HHSN268200900017C, HHSN268200900018C, HHSN268200900019C, HHSN268200900020C); grants from the NIH/NHLBI (U01 HL137880 and U24 HL141762); and supplemented by contributions made through the Foundation for the National Institutes of Health and the COPD Foundation from AstraZeneca/MedImmune; Bayer; Bellerophon Therapeutics; Boehringer Ingelheim Pharmaceuticals, Inc.; Chiesi Farmaceutici S.p.A.; Forest Research Institute, Inc.; GlaxoSmithKline; Grifols Therapeutics, Inc.; Ikaria, Inc.; Novartis Pharmaceuticals Corporation; Nycomed GmbH; ProterixBio; Regeneron Pharmaceuticals, Inc.; Sanofi; Sunovion; Takeda Pharmaceutical Company; Theravance Biopharma; and Mylan. Additional support was provided by the UK Research and Innovation and the Engineering and Physical Sciences Research Council Turing AI Fellowship ARaISE EP/V025295 (A.N.G.), and by an Alpha-1 Foundation Research Grant (A.B.).
Author Contributions: A.J.B., M.K.H., and C.J.G. conceived the study and hypotheses. A.J.B. managed data preparation, statistical analysis, software implementation, experimentation, and presentation of results. M.K.H. and A.J.B. wrote the initial draft of the manuscript. M.K.H. and C.J.G. supervised the project. M.K.H. and W.W.L. provided domain expert advice and evaluation of the project. S.M. provided consultancy on the statistical analyses. C.R.H. processed COPDGene computed tomography data for parametric response mapping. A.Z. supported algorithmic development and implementation of elastic principal graphs for this project. A.Z., E.M.M., and A.N.G. provided expert support in the theory and application of elastic principal graphs and related data analysis. V.I., P.M., and E.K. supported phenotyping and longitudinal analysis in COPDGene with elastic principal graphs. E.M., R.S., A.B., and P.J.C. analyzed COPDGene clinical and omics data for emphysema prediction. All authors provided substantive critical reviews and approved of the submitted manuscript.
A data supplement for this article is available via the Supplements tab at the top of the online article.
Originally Published in Press as DOI: 10.1164/rccm.202401-0127OC on September 13, 2024
Author disclosures are available with the text of this article at www.atsjournals.org.
References
- 1. Han MK, Agusti A, Calverley PM, Celli BR, Criner G, Curtis JL, et al. Chronic obstructive pulmonary disease phenotypes: the future of COPD. Am J Respir Crit Care Med. 2010;182:598–604. doi: 10.1164/rccm.200912-1843CC.
- 2. Labaki WW, Gu T, Murray S, Hatt CR, Galban CJ, Ross BD, et al. Voxel-wise longitudinal parametric response mapping analysis of chest computed tomography in smokers. Acad Radiol. 2019;26:217–223. doi: 10.1016/j.acra.2018.05.024.
- 3. Agusti A, Celli BR, Criner GJ, Halpin D, Anzueto A, Barnes P, et al. Global Initiative for Chronic Obstructive Lung Disease 2023 report: GOLD executive summary. Am J Respir Crit Care Med. 2023;207:819–837. doi: 10.1164/rccm.202301-0106PP.
- 4. Backman H, Blomberg A, Lundquist A, Strandkvist V, Sawalha S, Nilsson U, et al. Lung function trajectories and associated mortality among adults with and without airway obstruction. Am J Respir Crit Care Med. 2023;208:1063–1074. doi: 10.1164/rccm.202211-2166OC.
- 5. Albergante L, Mirkes E, Bac J, Chen H, Martin A, Faure L, et al. Robust and scalable learning of complex intrinsic dataset geometry via ElPiGraph. Entropy (Basel). 2020;22:296. doi: 10.3390/e22030296.
- 6. Gorban AN, Sumner NR, Zinovyev AY. Topological grammars for data approximation. Appl Math Lett. 2007;20:382–386.
- 7. Zinovyev A, Mirkes E. Data complexity measured by principal graphs. Comput Math Appl. 2013;65:1471–1482.
- 8. Golovenkin SE, Bac J, Chervov A, Mirkes EM, Orlova YV, Barillot E, et al. Trajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data. Gigascience. 2020;9:giaa128. doi: 10.1093/gigascience/giaa128.
- 9. Chen H, Albergante L, Hsu JY, Lareau CA, Lo Bosco G, Guan J, et al. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat Commun. 2019;10:1903. doi: 10.1038/s41467-019-09670-4.
- 10. Maiorino E, De Marzio M, Xu Z, Yun JH, Chase RP, Hersh CP, et al. Joint clinical and molecular subtyping of COPD with variational autoencoders [preprint]. 2024. https://pubmed.ncbi.nlm.nih.gov/38260473/
- 11. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, et al. Genetic Epidemiology of COPD (COPDGene) study design. COPD. 2010;7:32–43. doi: 10.3109/15412550903499522.
- 12. Bell A, Ram S, Labaki W, Murray S, Kazerooni E, Galbán S, et al. Elastic principal graphs for clinical trajectory analysis in COPD: a COPDGene study [abstract]. Eur Respir J. 2021;58(Suppl 65):OA1284.
- 13. Bell AJ, Ram S, Labaki WW, Murray S, Kazerooni E, Galban S, et al. Clinical trajectory analysis with longitudinal validation in COPD: a COPDGene study [abstract]. Am J Respir Crit Care Med. 2023;207:A6589.
- 14. Couper D, LaVange LM, Han M, Barr RG, Bleecker E, Hoffman EA, et al.; SPIROMICS Research Group. Design of the Subpopulations and Intermediate Outcomes in COPD Study (SPIROMICS). Thorax. 2014;69:491–494. doi: 10.1136/thoraxjnl-2013-203897.
- 15. Galban CJ, Han MK, Boes JL, Chughtai KA, Meyer CR, Johnson TD, et al. Computed tomography-based biomarker provides unique signature for diagnosis of COPD phenotypes and disease progression. Nat Med. 2012;18:1711–1715. doi: 10.1038/nm.2971.
- 16. Galbán CJ, Boes JL, Bule M, Kitko CL, Couriel DR, Johnson TD, et al. Parametric response mapping as an indicator of bronchiolitis obliterans syndrome after hematopoietic stem cell transplantation. Biol Blood Marrow Transplant. 2014;20:1592–1598. doi: 10.1016/j.bbmt.2014.06.014.
- 17. Bhatt SP. Imaging small airway disease: probabilities and possibilities. Ann Am Thorac Soc. 2019;16:975–977. doi: 10.1513/AnnalsATS.201903-231ED.
- 18. Smith BM, Hoffman EA, Rabinowitz D, Bleecker E, Christenson S, Couper D, et al. Comparison of spatially matched airways reveals thinner airway walls in COPD. The Multi-Ethnic Study of Atherosclerosis (MESA) COPD Study and the Subpopulations and Intermediate Outcomes in COPD Study (SPIROMICS). Thorax. 2014;69:987–996. doi: 10.1136/thoraxjnl-2014-205160.
- 19. Suryadevara R, Gregory A, Lu R, Xu Z, Masoomi A, Lutz SM, et al.; COPDGene investigators. Blood-based transcriptomic and proteomic biomarkers of emphysema. Am J Respir Crit Care Med. 2024;209:273–287. doi: 10.1164/rccm.202301-0067OC.
- 20. Scadding JG. Meaning of diagnostic terms in broncho-pulmonary disease. Br Med J. 1963;2:1425–1430. doi: 10.1136/bmj.2.5370.1425.
- 21. Jones PW, Beeh KM, Chapman KR, Decramer M, Mahler DA, Wedzicha JA. Minimal clinically important differences in pharmacological trials. Am J Respir Crit Care Med. 2014;189:250–255. doi: 10.1164/rccm.201310-1863PP.
- 22. Ross JC, Castaldi PJ, Cho MH, Hersh CP, Rahaghi FN, Sanchez-Ferrero GV, et al. Longitudinal modeling of lung function trajectories in smokers with and without chronic obstructive pulmonary disease. Am J Respir Crit Care Med. 2018;198:1033–1042. doi: 10.1164/rccm.201707-1405OC.
- 23. Bui DS, Lodge CJ, Burgess JA, Lowe AJ, Perret J, Bui MQ, et al. Childhood predictors of lung function trajectories and future COPD risk: a prospective cohort study from the first to the sixth decade of life. Lancet Respir Med. 2018;6:535–544. doi: 10.1016/S2213-2600(18)30100-0.
- 24. Lange P, Celli B, Agusti A, Boje Jensen G, Divo M, Faner R, et al. Lung-function trajectories leading to chronic obstructive pulmonary disease. N Engl J Med. 2015;373:111–122. doi: 10.1056/NEJMoa1411532.
- 25. Young AL, Bragman FJS, Rangelov B, Han MK, Galban CJ, Lynch DA, et al.; COPDGene Investigators. Disease progression modeling in chronic obstructive pulmonary disease. Am J Respir Crit Care Med. 2020;201:294–302. doi: 10.1164/rccm.201908-1600OC.
- 26. Perez TA, Castillo EG, Ancochea J, Pastor Sanz MT, Almagro P, Martínez-Camblor P, et al. Sex differences between women and men with COPD: a new analysis of the 3CIA study. Respir Med. 2020;171:106105. doi: 10.1016/j.rmed.2020.106105.
- 27. Amaral AFS, Strachan DP, Burney PGJ, Jarvis DL. Female smokers are at greater risk of airflow obstruction than male smokers. UK Biobank. Am J Respir Crit Care Med. 2017;195:1226–1235. doi: 10.1164/rccm.201608-1545OC.
- 28. Cazzola M, Rogliani P, Barnes PJ, Blasi F, Celli B, Hanania NA, et al. An update on outcomes for COPD pharmacological trials: a COPD Investigators report—reassessment of the 2008 American Thoracic Society/European Respiratory Society statement on outcomes for COPD pharmacological trials. Am J Respir Crit Care Med. 2023;208:374–394. doi: 10.1164/rccm.202303-0400SO.