Author manuscript; available in PMC: 2024 Oct 11.
Published in final edited form as: Proc IEEE Int Symp Biomed Imaging. 2021 May 25;2021:1568–1572. doi: 10.1109/isbi48211.2021.9434172

UNSUPERVISED CLUSTERING OF AIRWAY TREE STRUCTURES ON HIGH-RESOLUTION CT: THE MESA LUNG STUDY

Artur Wysoczanski 1, Elsa D Angelini 1,2, Benjamin M Smith 3,4, Eric A Hoffman 5,6, Grant T Hiura 3, Yifei Sun 7, R Graham Barr 3,8, Andrew F Laine 1
PMCID: PMC11467910  NIHMSID: NIHMS2027029  PMID: 39399779

Abstract

The morphology of the proximal human airway tree is highly variable in the general population, and known variants in airway branching patterns are associated with increased risk of COPD and with polymorphisms in growth factors involved in pulmonary development. Variation in the geometry and topology of the airway tree remains incompletely characterized, and its clinical implications are not yet understood. In this work, we present an approach to unsupervised clustering of airway tree structures in Billera-Holmes-Vogtmann tree-space. We validate our pipeline on synthetic airway tree data, and apply our algorithm to identify reproducible and morphologically distinct airway tree subtypes in the MESA Lung CT cohort.

Keywords: Airway Morphology, Computational Anatomy, Lung CT, Community Detection

1. INTRODUCTION

Within the conserved structures of the proximal human airway tree, there exists remarkable diversity in the geometry of airway branches, in their spatial distribution within the lung lobes, and in the order in which they arise from the lobar bronchi [1]. Furthermore, at the level of the segmental airways, branch variants are observed in over 25% of the general population [2]: “accessory” airway branches are often seen that are not present in the standard anatomy, while some standard segmental branches are frequently absent. Accessory sub-superior and absent right medial-basal airway variants are associated with significantly increased risk of developing chronic obstructive pulmonary disease (COPD) [2], and the latter is also strongly associated with genetic variants in FGF10, a regulator of airway budding during pulmonary development [3].

The high variability of airway geometry and topology in anatomic case series strongly suggests that (1) multiple morphologic subtypes of the proximal airway tree exist in the general population, and (2) this variation is developmentally regulated and may be a biomarker of altered pulmonary function and disease risk. However, both the subject-level patterns of such variation and their clinical implications remain poorly understood. Aggregate statistics such as count and caliber of CT-resolved airway branches are strongly associated with COPD [4, 5], but by necessity discard higher-order features of tree structure readily observed in other vertebrates [6]. Procrustes analysis, meanwhile, encodes high-level features of tree geometry and has previously been used to analyze airway morphology [7], but requires anatomic labeling of all landmark points and does not explicitly address tree topology.

Tree-structure analysis in Billera-Holmes-Vogtmann (BHV) tree space [8] addresses each of these shortcomings directly. Given any set of trees with identifiable root nodes and a shared set of unique labels for each terminal branch, the distance between trees in BHV tree-space is well-defined, and the existence and uniqueness of a mean tree-structure is guaranteed, with no additional constraints on tree geometry or topology. Prior applications have demonstrated the utility of BHV geodesic distance for hypothesis testing and classification on airway trees of COPD subjects and healthy controls [9], as well as visualization of inter-tree anatomic variation [10] and unsupervised clustering of phylogenetic trees for outlier gene detection [11, 12]. In this work, we introduce a novel unsupervised-learning pipeline for airway tree structures based on BHV geodesic distance, and apply it to the segmented airway trees of Multi-Ethnic Study of Atherosclerosis (MESA) Exam 5 participants [13] to identify airway structural subtypes.

2. MATERIALS AND METHODS

2.1. Dataset and airway segmentation

MESA is a population-based sample of 6,814 participants (53% female, 39% white) recruited in 2000-2002 at six US field centers, aged 45-84 and free of clinical cardiovascular disease at baseline. As part of the MESA Lung study, full-lung high-resolution computed-tomography (HRCT) scans were acquired at full inspiration for 3,195 MESA Exam 5 participants from 2010-2012, with slice thickness 0.625 or 0.75 mm, and isotropic in-plane resolution in the range [0.4668, 0.9180] mm.

Airway segmentation and anatomical labeling were provided for all subjects using the VIDA Diagnostics, Inc. image analysis service and the Apollo 2.0 software suite. The airway subtrees from the carina through the segmental airways were extracted for further analysis. The trachea was excluded from the study due to variability in the superior cutoff of the CT scans. Segmental airways are labeled following standard anatomic nomenclature, encoded in this work as being in the left (L) or right (R) lung and then indexed (1 to 10) indicative of their corresponding pulmonary segment.

2.2. Airway tree pre-processing

From this cohort we found that 2,781 trees contained a single connected component, possessed labels for all conserved segmental airways (R1-R10 and L1-L10), excluding the medial basal airways bilaterally due to their frequent absence, and had no segmental airway descended from another. Each airway tree is translated to center the carina at the origin of a fixed Cartesian coordinate frame, and the geometry of each airway segment is encoded as a vector in $\mathbb{R}^{15}$, consisting of the coordinates of five equally-spaced landmark points interpolated along the airway centerline (Fig. 1). Airway branches are classified topologically based on the segmental airway branches they subtend, and when comparing multiple subjects, branches are considered equivalent or “matched” if and only if they subtend the same set of segmental airways. Structural variability due to subject orientation and morphometry was canceled out by registration of all airway trees to the Fréchet mean, computed as described below.
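The landmark encoding described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes each centerline is available as an ordered polyline of 3D points, that "equally spaced" means equal arc length with both endpoints included, and the function names `resample_centerline` and `encode_branch` are hypothetical.

```python
import numpy as np

def resample_centerline(points, n_landmarks=5):
    """Return n_landmarks points equally spaced by arc length along a polyline."""
    points = np.asarray(points, dtype=float)            # shape (m, 3)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative arc length
    targets = np.linspace(0.0, arc[-1], n_landmarks)     # assumption: endpoints included
    # Interpolate each coordinate independently against arc length.
    return np.column_stack(
        [np.interp(targets, arc, points[:, d]) for d in range(3)]
    )

def encode_branch(points, carina, n_landmarks=5):
    """Encode one branch as a flat vector, with the carina moved to the origin."""
    landmarks = resample_centerline(points, n_landmarks)
    landmarks = landmarks - np.asarray(carina, dtype=float)
    return landmarks.reshape(-1)                         # 5 landmarks x 3 coords = 15

# Toy centerline (made-up coordinates), with the carina at its first point.
centerline = [[1, 1, 1], [1, 1, 3], [1, 4, 3]]
vec = encode_branch(centerline, carina=[1, 1, 1])
```

Flattening the five landmarks in order gives one 15-dimensional coordinate block per branch, which is the per-branch unit that the tree-space encoding concatenates.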

Fig. 1.


BHV encoding of a representative airway tree. The tree is truncated at the level of the segmental airways (L1-L10, R1-R10), with landmarks sampled at five equidistant points along the medial axis of each airway branch.

Full tree-structures are described by vectors in $(\mathbb{R}^{15})^s$, where $s$ is equal to the number of possible subsets of segmental airway branches. BHV tree-space is then the subset of this Euclidean space containing all acyclic tree structures, where no two branches of the same tree may subtend the same segmental airway unless one of the branches is descended from the other. The geodesic distance $d(x,y)$ between any two trees $x$ and $y$ in BHV tree-space is the length of the shortest path between $x$ and $y$ which is contained entirely in tree-space. Where $x$ and $y$ are topologically equivalent, BHV geodesic distance reduces to the standard Euclidean distance metric. The Fréchet mean of a set of $N$ airway trees in the tree-space $T$ is defined as:

$$\bar{t} = \underset{x \in T}{\arg\min} \sum_{i=1}^{N} d^2(x, x_i) \qquad (1)$$

and was computed following Sturm’s algorithm on the BHV distance metric, as implemented in the TreeStats software package [9].
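The iteration behind Sturm's algorithm can be sketched in a few lines. The sketch below is not the TreeStats implementation: it replaces the BHV geodesic with a straight-line Euclidean geodesic (`euclidean_geodesic`, a stand-in that is only valid when all trees share one topology), and the iteration count and seed are arbitrary assumptions. A BHV version would swap in a function returning the point a fraction `t` along the tree-space geodesic.

```python
import numpy as np

def euclidean_geodesic(a, b, t):
    """Point a fraction t along the straight line from a to b.

    Stand-in for a BHV geodesic; valid only within a single topology.
    """
    return (1.0 - t) * a + t * b

def sturm_mean(samples, geodesic=euclidean_geodesic, n_iter=20000, seed=0):
    """Approximate the Frechet mean by Sturm's random-walk iteration."""
    rng = np.random.default_rng(seed)
    mean = samples[rng.integers(len(samples))]       # start at a random sample
    for k in range(1, n_iter + 1):
        x = samples[rng.integers(len(samples))]      # draw a sample uniformly
        mean = geodesic(mean, x, 1.0 / (k + 1))      # step 1/(k+1) toward it
    return mean

rng = np.random.default_rng(1)
samples = rng.normal(size=(50, 3))
estimate = sturm_mean(samples)
```

In the Euclidean case the Fréchet mean is the arithmetic mean, so `estimate` should approach `samples.mean(axis=0)` as the number of iterations grows.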

Registration was optimized using gradient descent under 3D rotation and anisotropic scaling, minimizing the squared Euclidean distance between anatomically-matched landmark points of each tree and the Fréchet mean.
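A registration of this kind can be sketched as below, under several assumptions not stated in the text: the rotation is parameterized by an axis-angle vector, scaling is applied per axis before rotation and parameterized by log-scales, the gradient is estimated by central finite differences, and the step size is chosen by simple backtracking. All function names are hypothetical.

```python
import numpy as np

def rotation_matrix(rotvec):
    """Rodrigues' formula: rotation matrix from an axis-angle vector."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    kx, ky, kz = rotvec / theta
    K = np.array([[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def transform(params, landmarks):
    """Scale each axis (params[3:] are log-scales), then rotate (params[:3])."""
    scaled = landmarks * np.exp(params[3:])
    return scaled @ rotation_matrix(params[:3]).T

def loss(params, landmarks, target):
    """Squared Euclidean distance between matched landmark points."""
    return float(np.sum((transform(params, landmarks) - target) ** 2))

def numeric_grad(params, landmarks, target, eps=1e-6):
    """Central finite-difference gradient (assumption: no analytic gradient)."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (loss(params + step, landmarks, target)
                   - loss(params - step, landmarks, target)) / (2.0 * eps)
    return grad

def register(landmarks, target, n_iter=300, lr0=1.0):
    """Gradient descent with backtracking; the loss never increases."""
    params = np.zeros(6)
    current = loss(params, landmarks, target)
    for _ in range(n_iter):
        grad = numeric_grad(params, landmarks, target)
        lr = lr0
        while lr > 1e-12:
            candidate = params - lr * grad
            candidate_loss = loss(candidate, landmarks, target)
            if candidate_loss < current:
                params, current = candidate, candidate_loss
                break
            lr *= 0.5                                 # halve the step and retry
        else:
            break                                     # no improving step found
    return params, current

# Toy example: recover a made-up transform between two landmark sets.
rng = np.random.default_rng(0)
tree = rng.normal(size=(20, 3))
true_params = np.array([0.10, -0.05, 0.08, 0.10, -0.10, 0.05])
mean_tree = transform(true_params, tree)
initial = loss(np.zeros(6), tree, mean_tree)
fitted, final = register(tree, mean_tree)
```

Using log-scales keeps every scale factor positive without a constraint, which is one simple way to prevent the optimizer from reflecting an axis.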

2.3. Airway tree clustering

Pairwise geodesic distance was computed on the full cohort using the GTP software package [14]. A subset of airway trees demonstrated high dissimilarity to all other members of the cohort, which would be expected in the case of mislabeled tree anatomy or connectivity. The top 2% of trees ranked by nearest-neighbor BHV distance (n = 57 of 2,781) were therefore excluded from further study. Undirected graphs Gk(E,V) are then constructed from the pairwise distance matrix of registered trees, with the vertices V consisting of airway trees and edge weights E defined by the relation:

$$E(i,j) = \Big(\max_{i,j}\big(d(i,j)\,\mathrm{kNN}(i,j)\big) - d(i,j)\Big)\,\mathrm{kNN}(i,j) \qquad (2)$$

where kNN(i,j)=1 if tree i is one of the k nearest neighbors of tree j in the dataset, and zero otherwise. The number k of included nearest-neighbor edges, which regulates graph density, is treated as a tuned parameter. Clustering on the graphs Gk is then performed with Infomap community detection [15], which identifies the partition that optimally compresses the Huffman code of random walks on the graph. Tree centroids are defined for each cluster using the Fréchet mean.
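As a rough illustration of the graph construction, the sketch below builds the weighted adjacency matrix from a pairwise distance matrix. It reads Eq. 2 as a similarity: the largest distance among retained edges minus the edge's own distance. Two details are assumptions: a tree is never counted as its own neighbor, and the neighbor indicator is symmetrized with a logical OR so that the graph is undirected. The name `knn_edge_weights` is hypothetical.

```python
import numpy as np

def knn_edge_weights(D, k):
    """Weighted adjacency matrix of the k-nearest-neighbor graph (cf. Eq. 2)."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    off = D.copy()
    np.fill_diagonal(off, np.inf)                  # a tree is not its own neighbor
    nearest = np.argsort(off, axis=1)[:, :k]       # row j: k nearest neighbors of j
    knn = np.zeros((n, n), dtype=bool)
    # knn[i, j] is True when tree i is among the k nearest neighbors of tree j.
    knn[nearest.ravel(), np.repeat(np.arange(n), k)] = True
    knn |= knn.T                                   # assumption: symmetrize with OR
    d_max = D[knn].max()                           # largest distance over kept edges
    return np.where(knn, d_max - D, 0.0)           # similarity on edges, 0 elsewhere

# Toy example: four "trees" placed on a line at made-up positions.
positions = np.array([0.0, 1.0, 3.0, 7.0])
D = np.abs(positions[:, None] - positions[None, :])
W = knn_edge_weights(D, k=1)
```

Note that the retained edge with the largest distance receives weight zero under this reading, so it is indistinguishable from a missing edge.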

2.4. Cluster validation and evaluation

Airway tree registration was evaluated on two synthetic datasets generated from MESA Exam 5. For all airway trees both before and after registration, the anatomic labels of the right upper lobe segmental airways (R1, R2, R3) were randomly permuted, generating six subsets of trees with distinct branch arrangements. The distribution of tree subsets in BHV tree-space was examined qualitatively by t-SNE for both datasets, and quantitatively by nearest-neighbor and Infomap-based label prediction.
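Nearest-neighbor label prediction from a precomputed distance matrix can be sketched as follows. This assumes a leave-one-out, single-nearest-neighbor rule on the pairwise BHV distances; the toy distances below are Euclidean and made up, and the function name is hypothetical.

```python
import numpy as np

def loo_nearest_neighbor_accuracy(D, labels):
    """Leave-one-out accuracy of 1-nearest-neighbor label prediction."""
    D = np.asarray(D, dtype=float).copy()
    labels = np.asarray(labels)
    np.fill_diagonal(D, np.inf)            # exclude each tree from its own search
    predicted = labels[D.argmin(axis=1)]   # label of the closest other tree
    return float(np.mean(predicted == labels))

# Toy example: two well-separated groups on a line.
positions = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(positions[:, None] - positions[None, :])
accuracy = loo_nearest_neighbor_accuracy(D, [0, 0, 0, 1, 1, 1])
```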

Reproducibility of learned airway subtypes was assessed by constructing four subsets of MESA Exam 5 airway trees, each containing 75% of all subjects, and repeating the clustering procedure for each subset independently. Cluster indices for airway trees in each subset were then permuted using the Hungarian algorithm to optimally match clusters learned on the full training set. Clustering reproducibility is defined quantitatively as the average inter-rater agreement of the clusters learned on the training data subsets with those learned on the full dataset:

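$$R = \frac{1}{4}\sum_{n=1}^{4} \frac{\sum_{j=1}^{N_{\mathrm{clusters}}} \left|\{c_n = j\} \cap \{c_{\mathrm{full}} = j\}\right|}{\sum_{j=1}^{N_{\mathrm{clusters}}} \left|\{c_{\mathrm{full}} = j\}\right|} \qquad (3)$$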

where $c_{\mathrm{full}}$ denotes the label assigned to a given tree after learning on the full dataset, and $c_n$ denotes the label assigned to a given tree after learning on the $n$th subset of the training data. Inter-cluster variability in tree structure is described by the morphology of cluster centroids, as well as the relative frequency of distinct tree topologies within each cluster.
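The matching and agreement computation can be sketched with SciPy's `linear_sum_assignment`, which solves the same assignment problem as the Hungarian algorithm. The sketch makes assumptions the text leaves open: agreement for each subset is computed only over the trees in that subset, cluster labels are non-negative integers, and clusters left unmatched (for example when a subset yields more clusters than the full dataset) are relabeled -1 and count as disagreements. Function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels(c_sub, c_full):
    """Relabel c_sub so its clusters best overlap the clusters of c_full."""
    c_sub, c_full = np.asarray(c_sub), np.asarray(c_full)
    sub_ids, full_ids = np.unique(c_sub), np.unique(c_full)
    overlap = np.array([[np.sum((c_sub == s) & (c_full == f)) for f in full_ids]
                        for s in sub_ids])
    rows, cols = linear_sum_assignment(-overlap)    # negate: the solver minimizes
    mapping = {sub_ids[r]: full_ids[c] for r, c in zip(rows, cols)}
    return np.array([mapping.get(s, -1) for s in c_sub])   # unmatched -> -1

def agreement(c_sub, c_full):
    """Fraction of trees whose matched label equals the full-data label."""
    return float(np.mean(match_labels(c_sub, c_full) == np.asarray(c_full)))

def reproducibility(subset_runs, c_full):
    """Average agreement over runs; each run is (tree indices, labels)."""
    c_full = np.asarray(c_full)
    return float(np.mean([agreement(c, c_full[idx]) for idx, c in subset_runs]))

# Toy example with made-up labels: two subset runs over eight trees.
c_full = [0, 0, 0, 1, 1, 2, 2, 2]
idx = [0, 1, 3, 4, 5, 6]
runs = [(idx, [5, 5, 7, 7, 9, 9]), (idx, [5, 5, 7, 9, 9, 9])]
R = reproducibility(runs, c_full)
```

In the toy example the first run agrees on all six trees and the second on five of six, so `R` is the mean of those two fractions.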

3. RESULTS

3.1. Synthetic Data

On the synthetic airway tree datasets, nearest-neighbor classification predicted the label permutation with a leave-one-out accuracy of 91.2% and 89.0% using registered and unregistered airway trees, respectively. This finding, coupled with t-SNE visualization of the synthetic datasets (Fig. 2), demonstrates that the structural dissimilarity induced by permutation of anatomic labels is preserved under airway tree registration. On the registered synthetic airway trees, Infomap graph partitioning detected 8 clusters and achieved classification accuracy of 97.4% after graph density tuning (k=55) and cluster mapping using the Hungarian algorithm.

Fig. 2.


t-SNE projections of the synthetic airway tree dataset A) before and B) after registration to the Fréchet mean under 3D rotation and anisotropic scaling. Colors encode the 6 permutations of right upper lobe branch labels.

3.2. Airway Tree Clustering

Pairwise BHV geodesic distance between registered airway trees falls within the range [5.72, 162.45], with IQR [53.55, 67.85]. Ordering the pairwise distance matrix by average linkage (Fig. 3A) reveals that airway trees in the dataset partition into visually-appreciable subgroups, with distinct patterns in their relative distances to other members.

Fig. 3.


A) Pairwise BHV distances between airway trees in the full MESA Exam 5 dataset (n=2,781). Subjects are ordered by average-linkage distance. B) Number of tree clusters detected by Infomap graph partitioning as a function of neighborhood parameter k. The dashed red line denotes the beginning of the “plateau” region for cluster number with respect to graph density, and the solid red line indicates the optimal k applied in subsequent analysis.

The number of clusters identified by Infomap graph partitioning decreases as the neighborhood size k (and therefore graph density) increases (Fig. 3B). Cluster number drops sharply for low values of k, plateaus for intermediate values (25 < k ≤ 55), then again drops sharply before ending with a single cluster for k ≥ 68. We therefore tune kopt individually for all training experiments to the maximum value of k attained on the plateau, which maximizes graph density while returning a non-trivial number of airway subtypes.
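One way to automate this choice is sketched below. The text does not give a formal definition of the plateau, so the rule here is an assumption: the plateau is taken as the longest run of consecutive tested k values whose cluster counts differ by at most `tol` and stay at or above `min_clusters`, and kopt is the largest k in that run. The function name and the cluster counts in the example are made up for illustration.

```python
def plateau_k(cluster_counts, tol=1, min_clusters=2):
    """Largest k on the longest near-constant run of cluster counts.

    cluster_counts maps each tested k to the number of clusters found.
    """
    ks = sorted(cluster_counts)
    best_length, best_k = 0, None
    for start in range(len(ks)):
        lo = hi = cluster_counts[ks[start]]
        if lo < min_clusters:
            continue                       # skip trivial partitions
        for end in range(start, len(ks)):
            count = cluster_counts[ks[end]]
            if count < min_clusters:
                break
            lo, hi = min(lo, count), max(hi, count)
            if hi - lo > tol:
                break                      # run is no longer flat
            if end - start + 1 > best_length:
                best_length, best_k = end - start + 1, ks[end]
    return best_k

# Made-up cluster counts with a plateau between k=25 and k=45.
counts = {5: 40, 10: 25, 15: 18, 20: 12, 25: 11, 30: 10, 35: 10,
          40: 10, 45: 10, 50: 6, 55: 3, 60: 1}
k_opt = plateau_k(counts)
```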

Clustering of registered airway trees with tuned kopt yields 10 distinct airway tree subtypes, whose centroids (Fig. 4A) vary geometrically in both length and 3D orientation of airway branches. Clusters are strongly associated with distinct tree topologies, described by sequences of internal airway branches (Fig. 4B).

Fig. 4.


A) BHV tree-space centroids of ten clusters learned on the registered MESA Exam 5 airway dataset (n=2,724), with their respective cluster sizes. B) Log-odds ratio of observing 20 airway branches in each cluster, ordered by decreasing total prevalence. For example, R(1,2,3) denotes the internal airway branch with descendants R1, R2 and R3.

3.3. Cluster Reproducibility

When re-learning on 4 subsets of MESA Exam 5, each comprising 75% of the available training data, Infomap partitioning detected {11,10,10,10} clusters at kopt of {42,47,43,41}. Compared with the 10 clusters learned on the full dataset, these models achieved 90.1%, 92.6%, 89.6% and 86.3% inter-observer clustering agreement (cf. Eq. 3), respectively. Mean inter-observer clustering agreement reached 89.6%.

The discovered clusters were furthermore found to be highly robust to variation in graph density, controlled by the neighborhood parameter k. For all subsets of MESA Exam 5 subjects, as well as the full training set, Infomap cluster assignments at intermediate graph densities were in high agreement with cluster assignments at k=kopt (Fig. 5). When learning on the full training set, in particular, agreement with the learned subtypes exceeds 80% for all neighborhood parameter values k in the range [19, kopt]. The stability of the clustering output over a wide range of graph densities indicates that the discovered airway tree subtypes represent distinct subpopulations within MESA Exam 5 participants, rather than an artifact of graph construction.

Fig. 5.


Inter-observer clustering agreement, as a function of neighborhood parameter k, between Infomap partitions of graphs Gk and Gk,opt for the MESA Exam 5 training set (n=2,724) and four subsets.

4. DISCUSSION

In this work, we have introduced and validated a novel pipeline for clustering proximal airway tree structures extracted from HRCT scans, based on BHV geodesic distance. Our findings on sets of synthetic airway trees demonstrate that our pre-processing pipeline and the BHV distance metric preserve salient airway structural features, and that graph-based clustering identifies subpopulations of structurally-dissimilar airway trees with high reproducibility. In the MESA Exam 5 cohort, our work suggests the existence of 10 reproducible airway tree subtypes, which exhibit unique geometric and topological properties, and are robust to user-defined parameters of the clustering pipeline.

Although BHV tree-space places almost no constraints on the geometry or topology of admissible airway trees, it does require that the “leaf set”, or set of segmental airway branches, be conserved across all airway trees in the dataset, and each leaf must have a unique anatomic label. In this study, we satisfied this requirement by excluding frequently absent segmental branches (e.g., subsuperior, R7 and L7 airways), as well as trees where segmental airways of interest were either absent or unlabeled. Methods for comparing trees with different leaf sets have been proposed in the literature [16], and incorporating them will be a focus of future extensions of our pipeline.

Future work will center on structural and clinical characterization of the discovered subtypes. Subtree analysis, quantifying dissimilarity across subjects in anatomic subunits of the lungs, will serve to localize characteristic structural features of airway subtypes. We aim to combine the topological features discussed in this work to describe subject-level branching patterns. We further plan to extend our analysis to independent CT cohorts, such as the SubPopulations and Intermediate Outcomes in COPD (SPIROMICS) study [17], to replicate the airway tree subtypes discovered in our current work, and to examine the association between airway tree subtypes, pulmonary function and clinical outcomes in the pooled sample.

5. COMPLIANCE WITH ETHICAL STANDARDS

Institutional review board approval was obtained at each of the MESA study sites (mesa-nhlbi.org). Written informed consent was obtained from all participants.

6. ACKNOWLEDGMENTS

Dr. Hoffman is a shareholder in VIDA Diagnostics, Inc. This research was supported by NIH R01-HL130506, NIH 2 R01 HL121270-05 and by contracts 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168 and N01-HC-95169 from the National Heart, Lung, and Blood Institute, and by grants UL1-TR-000040, UL1-TR-001079, and UL1-TR-001420 from the National Center for Advancing Translational Sciences (NCATS). The authors thank the other investigators, the staff, and the participants of the MESA study for their valuable contributions. A full list of participating MESA investigators and institutions can be found at http://www.mesa-nhlbi.org.

7. REFERENCES

  • [1] Yamashita H, Roentgenologic Anatomy of the Lung. 1978.
  • [2] Smith BM et al., "Human airway branch variation and chronic obstructive pulmonary disease," Proc Natl Acad Sci USA, vol. 115, no. 5, pp. E974–E981, Jan 30 2018, doi: 10.1073/pnas.1715564115.
  • [3] Bellusci S, Grindley J, Emoto H, Itoh N, and Hogan BL, "Fibroblast growth factor 10 (FGF10) and branching morphogenesis in the embryonic mouse lung," Development, vol. 124, no. 23, pp. 4867–78, Dec 1997. Available: https://www.ncbi.nlm.nih.gov/pubmed/9428423.
  • [4] Kirby M et al., "Total Airway Count on Computed Tomography and the Risk of Chronic Obstructive Pulmonary Disease Progression. Findings from a Population-based Study," Am J Respir Crit Care Med, vol. 197, no. 1, pp. 56–65, Jan 1 2018, doi: 10.1164/rccm.201704-0692OC.
  • [5] Smith BM et al., "Association of Dysanapsis With Chronic Obstructive Pulmonary Disease Among Older Adults," JAMA, vol. 323, no. 22, pp. 2268–2280, Jun 9 2020, doi: 10.1001/jama.2020.6918.
  • [6] Metzger RJ, Klein OD, Martin GR, and Krasnow MA, "The branching programme of mouse lung development," Nature, vol. 453, no. 7196, pp. 745–50, Jun 5 2008, doi: 10.1038/nature07005.
  • [7] Humphries SM, Hunter KS, Shandas R, Deterding RR, and DeBoer EM, "Analysis of pediatric airway morphology using statistical shape modeling," Medical & Biological Engineering & Computing, vol. 54, no. 6, pp. 899–911, Jun 2016, doi: 10.1007/s11517-015-1445-x.
  • [8] Billera LJ, Holmes SP, and Vogtmann K, "Geometry of the Space of Phylogenetic Trees," Advances in Applied Mathematics, vol. 27, no. 4, pp. 733–767, 2001, doi: 10.1006/aama.2001.0759.
  • [9] Feragen A et al., "Tree-Space Statistics and Approximations for Large-Scale Analysis of Anatomical Trees," presented at the Information Processing in Medical Imaging, 2013.
  • [10] Amenta N et al., "Quantification and Visualization of Variation in Anatomical Trees," in Research in Shape Modeling, Cham: Springer International Publishing, 2015, pp. 57–79.
  • [11] Zairis S, Khiabanian H, Blumberg AJ, and Rabadan R, "Genomic data analysis in tree spaces," arXiv e-prints, arXiv:1607.07503. Available: https://ui.adsabs.harvard.edu/abs/2016arXiv160707503Z
  • [12] Gori K, Suchan T, Alvarez N, Goldman N, and Dessimoz C, "Clustering Genes of Common Evolutionary History," Mol Biol Evol, vol. 33, no. 6, pp. 1590–605, Jun 2016, doi: 10.1093/molbev/msw038.
  • [13] Bild DE et al., "Multi-Ethnic Study of Atherosclerosis: objectives and design," Am J Epidemiol, vol. 156, no. 9, pp. 871–81, Nov 1 2002, doi: 10.1093/aje/kwf113.
  • [14] Owen M and Provan JS, "A fast algorithm for computing geodesic distances in tree space," IEEE/ACM Trans Comput Biol Bioinform, vol. 8, no. 1, pp. 2–13, Jan-Mar 2011, doi: 10.1109/TCBB.2010.3.
  • [15] Rosvall M, Axelsson D, and Bergstrom CT, "The map equation," The European Physical Journal Special Topics, vol. 178, no. 1, pp. 13–23, 2010, doi: 10.1140/epjst/e2010-01179-1.
  • [16] Grindstaff G and Owen M, "Geometric comparison of phylogenetic trees with different leaf sets," arXiv e-prints, arXiv:1807.04235. Available: https://ui.adsabs.harvard.edu/abs/2018arXiv180704235G
  • [17] Couper D et al., "Design of the Subpopulations and Intermediate Outcomes in COPD Study (SPIROMICS)," Thorax, vol. 69, no. 5, pp. 491–4, May 2014, doi: 10.1136/thoraxjnl-2013-203897.
