Machine learning (ML) is a branch of artificial intelligence in which a model or set of rules is derived on the basis of an initial training set. This model is then used to evaluate a new dataset (1). ML has made important contributions to the analysis of high-resolution computed tomography scans in predicting progression of idiopathic pulmonary fibrosis (IPF) (2, 3). The work by Huang and colleagues (pp. 444–454) reported in this issue of the Journal has now extended the role of ML to proteomic analysis in pulmonary fibrosis, to enhance diagnostic ability and increase our understanding of the disease process of interstitial lung disease (ILD) (4). The objective of the present study was to identify proteins that separate and classify patients with connective tissue disease (CTD)–associated ILD from those with IPF.
The study cohort was drawn from four registries: the Pulmonary Fibrosis Foundation Patient Registry at the University of Virginia, at the University of Chicago, and at the University of California, Davis (5), and the United Kingdom RECITAL (Rituximab versus Cyclophosphamide in Connective Tissue Disease–ILD) clinical trial (6). Together, these provided both patients with IPF (n = 1,247) and those with CTD–ILD (n = 352), with matched proteomic and clinical data. Olink (Proteomics), an unsupervised proteomics platform, had an output of 2,912 proteins across CTD–ILD and IPF. The model was derived using the Pulmonary Fibrosis Foundation cohort as the training cohort. For the appropriate downstream analysis, patient numbers and gender for included diseases such as IPF, rheumatoid arthritis (RA)–associated ILD, and scleroderma-associated ILD had to be balanced against each other. This was done using a technique called random subsampling, in which subjects from the cohorts are randomly selected but balanced for diagnosis and gender.
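The balancing step can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes that balancing means randomly downsampling each diagnosis to the size of the smallest diagnosis group within each gender stratum, and the function name, field names, and patient counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_subsample(patients, seed=0):
    """Randomly downsample so that, within each sex, every diagnosis
    contributes the same number of patients (assumed balancing rule)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for patient in patients:
        strata[(patient["diagnosis"], patient["sex"])].append(patient)

    diagnoses = sorted({dx for dx, _ in strata})
    sexes = sorted({sex for _, sex in strata})

    selected = []
    for sex in sexes:
        # The smallest diagnosis group within this sex sets the quota.
        quota = min(len(strata[(dx, sex)]) for dx in diagnoses)
        for dx in diagnoses:
            selected.extend(rng.sample(strata[(dx, sex)], quota))
    return selected


# Toy cohort with made-up counts (not the study's numbers).
toy_counts = {
    ("IPF", "F"): 30, ("IPF", "M"): 70,
    ("RA-ILD", "F"): 25, ("RA-ILD", "M"): 15,
    ("SSc-ILD", "F"): 20, ("SSc-ILD", "M"): 10,
}
cohort = [
    {"id": f"{dx}-{sex}-{i}", "diagnosis": dx, "sex": sex}
    for (dx, sex), n in toy_counts.items()
    for i in range(n)
]

subsample = balanced_subsample(cohort)
balanced_counts = Counter((p["diagnosis"], p["sex"]) for p in subsample)
print(sorted(balanced_counts.items()))
```

In this toy cohort, each diagnosis ends up with the same number of women and the same number of men, at the cost of discarding patients from the larger groups.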
Once balancing was complete, the authors developed their classifier, that is, their model of the proteomic signature that differentiates CTD–ILD from IPF. They used recursive feature elimination (RFE). In this selection method, the weakest features are progressively removed in an iterative process until a specific number of strong(er) features is reached (7). In addition, RFE helps reduce multicollinearity, which arises when two presumed independent variables (proteins) correlate with each other. RFE ranked 37 proteins as a single classifier between CTD–ILD and IPF.
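The iterative logic of RFE can be sketched in a few lines. This is an illustration only, under stated assumptions: the base model is a plain logistic regression fitted by gradient descent (the editorial does not say which estimator the authors used), features are assumed to be on comparable scales so that coefficient size reflects importance, and the data and protein names are synthetic.

```python
import math
import random


def fit_logistic(X, y, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent; return feature weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def recursive_feature_elimination(X, y, names, n_keep):
    """Refit the model, drop the feature with the smallest |weight|, repeat."""
    active = list(range(len(names)))
    eliminated = []
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        weights = fit_logistic(X_sub, y)
        weakest = min(range(len(active)), key=lambda k: abs(weights[k]))
        eliminated.append(names[active.pop(weakest)])
    return [names[j] for j in active], eliminated


# Synthetic data: two informative "proteins" and three pure-noise features.
rng = random.Random(0)
names = ["protein_A", "protein_B", "noise_1", "noise_2", "noise_3"]
X, y = [], []
for i in range(200):
    label = i % 2  # 1 = IPF, 0 = CTD-ILD (arbitrary coding)
    shift = 1.5 if label else -1.5
    informative = [rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
    X.append(informative + noise)
    y.append(label)

kept, eliminated = recursive_feature_elimination(X, y, names, n_keep=2)
print("kept:", kept)
print("eliminated (weakest first):", eliminated)
```

The model is refitted after every removal, which is what distinguishes RFE from a single one-off ranking: the importance of the remaining features is re-estimated each time a weak feature leaves the set.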
The 37-protein classifier was then subjected to several ML techniques, including support vector machine, which helps solve binary problems, in this case the proteomic separation of CTD–ILD versus IPF; the least absolute shrinkage selection operator, which is a regression analysis technique that was focused on specific variables (proteins) to establish the diagnosis; and random forest (RF), in which predictions from different decision trees are merged. The “imbalanced” RF is used when the cohorts have different numbers of patients. These ML modalities in the University of Virginia and Chicago classification cohorts and the RECITAL/University of California, Davis, prediction cohorts had accuracy values of 83% and 77% with the support vector machine, 80% and 83% with the least absolute shrinkage selection operator, 83.1% and 82.5% with RF, and 77.8% and 80.7% with imbalanced RF, respectively, in separating CTD–ILD from IPF (Figure 1). The authors concluded that “multiple machine learning models trained with large cohort proteomic datasets consistently distinguished CTD-ILD from IPF.”
Figure 1.
(A) The schematic of the study. The PFF cohort constituted the training set, which underwent random subsampling and recursive feature elimination to obtain a 37-protein signature. These 37 proteins were subjected to ML analysis from the UVA/Chicago classification cohorts and the RECITAL (Rituximab versus Cyclophosphamide in Connective Tissue Disease–ILD)/UC-Davis prediction cohorts. (B) Proportion of patients with different composite diagnosis scores (CDSs). Patients with CDSs of 0 or 1 were classified as CTD–ILD, those with CDSs of 2 were unclassified, and those with CDSs of 3 or 4 were classified as IPF. CTD = connective tissue disease; ILD = interstitial lung disease; IPF = idiopathic pulmonary fibrosis; LASSO = least absolute shrinkage selection operator; PFF = Pulmonary Fibrosis Foundation; RF = random forest; UC-Davis = University of California, Davis; UVA = University of Virginia.
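The scoring rule in Figure 1B can be illustrated with a minimal sketch. It assumes that the CDS is simply the number of the four models (support vector machine, LASSO, RF, and imbalanced RF) that vote for IPF, which the caption implies but does not state. The function names and the example votes are hypothetical.

```python
def composite_diagnosis_score(votes):
    """Count how many models vote for IPF (assumed definition of the CDS).

    `votes` maps a model name to 1 (IPF) or 0 (CTD-ILD).
    """
    return sum(votes.values())


def classify_from_cds(cds):
    """Apply the thresholds given in the Figure 1B caption."""
    if cds in (0, 1):
        return "CTD-ILD"
    if cds == 2:
        return "unclassified"
    if cds in (3, 4):
        return "IPF"
    raise ValueError("CDS must be an integer from 0 to 4")


# Hypothetical votes for a single patient.
example_votes = {"SVM": 1, "LASSO": 1, "RF": 1, "imbalanced RF": 0}
example_cds = composite_diagnosis_score(example_votes)
example_label = classify_from_cds(example_cds)
print(example_cds, example_label)
```

Under this reading, a patient is left unclassified only when the four models split evenly.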
The diagnosis of CTD is based on guidelines from the American College of Rheumatology for different autoimmune conditions on the basis of clinical features, organ involvement, and serology (8). The pooled prevalence of ILD was 11% for RA, 47% for systemic sclerosis, 41% and 56% for mixed CTD (39–72%), and 6% for systemic lupus erythematosus (3–10%). Usual interstitial pneumonia was the most prevalent ILD pattern for RA (pooled prevalence of 46%), while nonspecific interstitial pneumonia was the most common ILD pattern for all other CTD subtypes (pooled prevalence range, 27–76%) (9).
In contrast, the diagnosis of IPF is based on American Thoracic Society/European Respiratory Society/Japanese Respiratory Society guidelines, with subpleural, bibasal honeycombing comprising the radiological (computed tomography) pulmonary fibrosis pattern of usual interstitial pneumonia (10). The significant difference in the radiology and histology between CTD–ILD and IPF is reflected in the proteomics of respective blood samples.
The authors did not specifically aim to find a diagnostic proteomic test to classify CTD–ILD and IPF, and the clinical need for such a test is limited. Indeed, the lack of a confident diagnosis in 30% of ILD cases and the reclassification in 10% of ILD cases in multidisciplinary discussion are unlikely to pertain to CTD–ILD versus IPF, as the clinical, serological, and radiological features differ between the two conditions. However, there is a fibrosing phenotype within CTD–ILD that may have clinical and radiological overlap with IPF. The prevalence of the fibrosing phenotype of ILD is 40% in RA lung disease, 32% in systemic sclerosis, and 24% in mixed CTD–ILD (11–13). The authors do not describe whether the proteomics study was performed in a fibrosing CTD–ILD cohort. This would be informative, as the proteomic analysis would have greater relevance for separating fibrosing CTD–ILD from IPF. Such a separation would be helpful for diagnosis when fibrosing CTD–ILD occurs before the onset of autoantibody positivity and so may be (mis)classified as IPF. This scenario can lead the multidisciplinary discussion participants to reclassify IPF as fibrosing CTD–ILD when the full clinical and serological picture unveils itself. Furthermore, proteomic comparisons between fibrosing CTD–ILD and IPF might also delineate important differences in pathogenic pathways in the respective conditions.
The proteomic patterns contrasting CTD–ILD and IPF demonstrate a predominantly inflammatory profile in CTD–ILD, in contrast to proteins associated with epithelial injury, fibrosis, and extracellular matrix turnover in IPF. This result is in line with the pathogenesis of CTD–ILD and IPF, respectively. There are also “counterintuitive” proteins and pathways that have not been previously associated with these conditions (14, 15). One such example in this paper is the pathway associated with response to silicon dioxide. Further work is required to determine whether these proteins have been “falsely” ranked by the proteomic, statistical, or ML methods or whether these are common novel pathways worth pursuing. Conundrums such as this will continue to pose a challenge when using ML and bioinformatics, so increasingly sophisticated approaches will be required to gain further granularity about protein ranking.
A further consideration for this study is the acceptable threshold for proteomic separation between CTD–ILD and IPF. The authors propose “PC37,” the number of proteins used in their training set to separate CTD–ILD from IPF. How do we address the separation if fewer than 37 proteins are detected in a particular patient? Would we then reject a diagnosis of CTD–ILD versus IPF? Perhaps this could be addressed in practice by analyzing proteomic outputs in the setting of clinical features to inform decisions.
Given that omics platforms will be used increasingly with ML, a major hurdle will be the cost-effectiveness of using these expensive assays and sophisticated ML to make a diagnosis. Hopefully, this will not turn out to be a major concern as these technologies become more mainstream and cheaper. Proteomics and ML might find new relationships, pathogenic/diagnostic pathways, and clusters that will change the diagnosis and management of patients in the future. This brave new world should be embraced.
Footnotes
Supported by the Centre for Research Excellence in Pulmonary Fibrosis, Medical Research Future Fund grant 2031474, and National Health and Medical Research Council grant APP 1147776.
Originally Published in Press as DOI: 10.1164/rccm.202403-0603ED on April 9, 2024
Author disclosures are available with the text of this article at www.atsjournals.org.
References
- 1. Rashidi HH, Tran NK, Betts EV, Howell LP, Green R. Artificial intelligence and machine learning in pathology: the present landscape of supervised methods. Acad Pathol. 2019;6:2374289519873088. doi: 10.1177/2374289519873088.
- 2. Pan J, Hofmanninger J, Nenning KH, Prayer F, Röhrich S, Sverzellati N, et al. Unsupervised machine learning identifies predictive progression markers of IPF. Eur Radiol. 2023;33:925–935. doi: 10.1007/s00330-022-09101-x.
- 3. Walsh SLF, Mackintosh JA, Calandriello L, Silva M, Sverzellati N, Larici AR, et al. Deep learning-based outcome prediction in progressive fibrotic lung disease using high-resolution computed tomography. Am J Respir Crit Care Med. 2022;206:883–891. doi: 10.1164/rccm.202112-2684OC.
- 4. Huang Y, Ma SF, Oldham JM, Adegunsoye A, Zhu D, Murray S, et al. Machine learning of plasma proteomics classifies diagnosis of interstitial lung disease. Am J Respir Crit Care Med. 2024;210:444–454. doi: 10.1164/rccm.202309-1692OC.
- 5. Wang BR, Edwards R, Freiheit EA, Ma Y, Burg C, de Andrade J, et al. The Pulmonary Fibrosis Foundation Patient Registry: rationale, design, and methods. Ann Am Thorac Soc. 2020;17:1620–1628. doi: 10.1513/AnnalsATS.202001-035SD.
- 6. Saunders P, Tsipouri V, Keir GJ, Ashby D, Flather MD, Parfrey H, et al. Rituximab versus cyclophosphamide for the treatment of connective tissue disease-associated interstitial lung disease (RECITAL): study protocol for a randomised controlled trial. Trials. 2017;18:275. doi: 10.1186/s13063-017-2016-2.
- 7. Li L, Ching WK, Liu ZP. Robust biomarker screening from gene expression data by stable machine learning-recursive feature elimination methods. Comput Biol Chem. 2022;100:107747. doi: 10.1016/j.compbiolchem.2022.107747.
- 8. Fischer A, Strek ME, Cottin V, Dellaripa PF, Bernstein EJ, Brown KK, et al. Proceedings of the American College of Rheumatology/Association of Physicians of Great Britain and Ireland connective tissue disease-associated interstitial lung disease summit: a multidisciplinary approach to address challenges and opportunities. Arthritis Rheumatol. 2019;71:182–195. doi: 10.1002/art.40769.
- 9. Joy GM, Arbiv OA, Wong CK, Lok SD, Adderley NA, Dobosz KM, et al. Prevalence, imaging patterns and risk factors of interstitial lung disease in connective tissue disease: a systematic review and meta-analysis. Eur Respir Rev. 2023;32:220210. doi: 10.1183/16000617.0210-2022.
- 10. Raghu G, Remy-Jardin M, Myers JL, Richeldi L, Ryerson CJ, Lederer DJ, et al.; American Thoracic Society, European Respiratory Society, Japanese Respiratory Society, and Latin American Thoracic Society. Diagnosis of idiopathic pulmonary fibrosis: an official ATS/ERS/JRS/ALAT clinical practice guideline. Am J Respir Crit Care Med. 2018;198:e44–e68. doi: 10.1164/rccm.201807-1255ST.
- 11. Olson A, Hartmann N, Patnaik P, Wallace L, Schlenker-Herceg R, Nasser M, et al. Estimation of the prevalence of progressive fibrosing interstitial lung diseases: systematic literature review and data from a physician survey. Adv Ther. 2021;38:854–867. doi: 10.1007/s12325-020-01578-6.
- 12. Wijsenbeek M, Kreuter M, Olson A, Fischer A, Bendstrup E, Wells CD, et al. Progressive fibrosing interstitial lung diseases: current practice in diagnosis and management. Curr Med Res Opin. 2019;35:2015–2024. doi: 10.1080/03007995.2019.1647040.
- 13. Zamora-Legoff JA, Krause ML, Crowson CS, Ryu JH, Matteson EL. Progressive decline of lung function in rheumatoid arthritis-associated interstitial lung disease. Arthritis Rheumatol. 2017;69:542–549. doi: 10.1002/art.39971.
- 14. Moodley YP, Corte TJ, Oliver BG, Glaspole IN, Livk A, Ito J, et al. Analysis by proteomics reveals unique circulatory proteins in idiopathic pulmonary fibrosis. Respirology. 2019;24:1111–1114. doi: 10.1111/resp.13668.
- 15. Clynick B, Corte TJ, Jo HE, Stewart I, Glaspole IN, Grainge C, et al. Biomarker signatures for progressive idiopathic pulmonary fibrosis. Eur Respir J. 2022;59:2101181. doi: 10.1183/13993003.01181-2021.