Author manuscript; available in PMC: 2026 Apr 10.
Published in final edited form as: Proc SPIE Int Soc Opt Eng. 2025 Apr 10;13409:134090J. doi: 10.1117/12.3047189

Automated multi-lesion annotation in chest X-rays: annotating over 450,000 images from public datasets using the AI-based Smart Imagery Framing and Truthing (SIFT) system

Lin Guo a, Fleming YM Lure b, Teresa Wu c, Fulin Cai c, Stefan Jaeger d, Bin Zheng b, Jordan Fuhrman e, Hui Li e, Maryellen L Giger e, Andrei Gabrielian f, Alex Rosenthal f, Darrell E Hurt f, Ziv Yaniv f, Li Xia a,#, Weijun Fang g,#, Jingzhe Liu h,#
PMCID: PMC12034099  NIHMSID: NIHMS2076087  PMID: 40297478

Abstract

This work used an artificial intelligence (AI)-based image annotation tool, Smart Imagery Framing and Truthing (SIFT), to annotate pulmonary lesions and abnormalities and their corresponding boundaries on 452,602 chest X-ray (CXR) images (22 different types of desired lesions) from four publicly available datasets (CheXpert Dataset, ChestX-ray14 Dataset, MIDRC Dataset, and NIAID TB Portals Dataset). SIFT is based on Multi-task, Optimal-recommendation, and Max-predictive Classification and Segmentation (MOM ClaSeg) technologies, which identify and delineate 65 different abnormal regions of interest (ROI) on CXR images, provide a confidence score for each labeled ROI, and offer recommendations of alternative abnormalities for each ROI when the confidence score is not high enough. The MOM ClaSeg system, which integrates Mask R-CNN and a Decision Fusion Network, was developed on a training dataset of over 300,000 CXRs, containing over 240,000 confirmed abnormal CXRs with over 300,000 confirmed ROIs corresponding to 65 different abnormalities and over 67,000 normal (i.e., “no finding”) CXRs. After quality control, the CXRs were entered into the SIFT system to automatically predict the abnormality type (“Predicted Abnormality”) and corresponding boundary locations for the ROIs displayed on each original image. The results indicate that SIFT determines the abnormality types of labeled ROIs and their boundary coordinates efficiently: radiologists using SIFT as an aid were 7.92 times more efficient than radiologists using a traditional semi-automatic method. The SIFT system achieved an average sensitivity of 89.38%±11.46% across the four datasets. This can significantly improve the quality and quantity of training and testing sets for developing AI technologies.

Keywords: Chest radiograph, artificial intelligence, lesion annotation, multiple pulmonary lesion

1. INTRODUCTION

Large-scale, high-quality, accurately labeled imaging datasets are essential for the development of artificial intelligence (AI) in the medical field. A growing number of medical imaging datasets (e.g., chest X-rays, CT scans) are publicly available, but most include only image-level labels extracted from radiology reports using natural language processing (NLP) techniques. While such image-level labeling is adequate for classification tasks, diagnosis, intervention, and disease management require knowledge of the location and specific area of the lesions recognized by AI. Hence, training data must include the locations and detailed boundary delineations of the desired lesions, together with accompanying comorbidities where identifiable. Moreover, with the rapid development of AI, there is a growing demand for medical imaging datasets that cover diverse populations, disease progression stages, imaging modalities, and imaging conditions, so that models trained on them generalize effectively across clinical settings. However, the lack of lesion-based annotations significantly limits the usefulness of public datasets, and manual marking of lesion areas by radiologists is both time-consuming and costly.

To address these issues, several semi-automatic approaches based on machine learning (ML) / deep learning (DL) techniques have been developed. Wu et al.[1] proposed an automatic approach for labeling 13,911 images from the ChestX-ray14 dataset in 6 standardized lung regions (bounding boxes), with each bounding box labeled as either positive (with opacity) or negative (no opacity). More recently, a labeling algorithm was developed that converts German thoracic radiology reports into CheXpert labels; it was applied to two collected datasets and successfully generated image-level class labels [2]. However, these semi-automatic approaches cannot process large volumes of images quickly, accurately, and consistently, and none of them delineates lesion boundaries for more than 10 lesion types. Developing annotation methods that are widely acceptable and accurately lesion-based therefore remains a key research focus. Progress in this area could greatly enhance the value of existing public datasets and promote ML/DL algorithms.

This paper demonstrates an AI-based annotation tool called SIFT, designed to simultaneously label and delineate 65 different abnormalities/diseases on diverse, publicly available datasets of chest X-ray (CXR) images, so that annotating radiologists only have to review the SIFT-annotated results. The SIFT system, with its graphical user interface, allows radiologists to review pre-annotated results and supports the rapid, accurate generation of lesion labels and boundaries across multiple abnormality types. This tool is valuable in advancing ML/DL medical applications because it enables efficient annotation of large CXR image volumes. The segmentations generated by the SIFT system can serve as pseudo-labels[3] for the various datasets. Such pseudo-labels are useful surrogates for manual segmentation, enabling the development and evaluation of new segmentation networks and algorithms [4].

2. METHODOLOGY

2.1. Model development

The SIFT system was initially developed based on Multi-task, Optimal-recommendation, and Max-predictive Classification and Segmentation (MOM ClaSeg) technologies to label and delineate 65 abnormal regions of interest (ROI) on CXR images. It provides a confidence score for each labeled ROI and recommendations of alternative abnormalities for each ROI. MOM ClaSeg was developed on a dataset of 310,333 confirmed adult CXR images collected from multiple hospitals, containing 243,262 abnormal images representing 65 different abnormalities (307,415 ROIs) and 67,071 normal images[5]. SIFT was later augmented using self-rating curriculum learning[6], with training on an additional 2,829 confirmed COVID-19 CXRs with 5,267 ROIs from publicly available datasets, including the Brixia COVID-19 dataset (https://paperswithcode.com/dataset/brixia) and The Cancer Imaging Archive (TCIA) dataset [7].

2.2. Annotated publicly available datasets

Figure 1 illustrates the compilation of CXR images from four publicly available datasets used to independently test the generalization of our SIFT system: (a) CheXpert Dataset: 224,316 frontal and lateral CXR images of 65,240 patients from Stanford Health Care, with image-level labels for “No Finding” and 13 other chest radiographic observations marked as positive, negative, or uncertain [8]; (b) National Institutes of Health (NIH) ChestX-ray14 Dataset: an extension of the original ChestX-ray8 dataset containing 112,120 frontal-view CXR images of 30,805 patients with a “No Finding” label and 14 common abnormality/disease labels [9,10], including 880 radiographs with bounding-box annotations of lesions; (c) Medical Imaging & Data Resource Center (MIDRC) Dataset: 164,590 frontal and lateral CXR images from 67,371 COVID-19 positive or negative patients from the MIDRC Data Commons (https://data.midrc.org/); (d) National Institute of Allergy and Infectious Diseases (NIAID) TB Portals Dataset: 10,795 CXR images from 8,868 tuberculosis patients from the NIAID TB Portals Program (https://data.tbportals.niaid.nih.gov/).

Figure 1.


Compilation of CXR images from four publicly available datasets. (a) CheXpert Dataset with 13 abnormal label categories; (b) ChestX-ray14 Dataset with 13 abnormal label categories; (c) MIDRC Dataset from COVID-19 patients; (d) NIAID TB Portals Dataset from tuberculosis patients.

2.3. Quality control process

The exclusion criteria were as follows: (1) lateral-view CXR images; (2) duplicate study identifiers; (3) images labeled as “No Finding”; (4) images with label categories not included in the SIFT system (i.e., infiltration); (5) images with missing study identifiers; and (6) images lacking test results. This quality control process ensures a high-quality dataset for supporting lesion delineation and subsequent research efforts.
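The exclusion criteria above can be read as a simple per-image filter. The following is a minimal sketch only, not the authors' pipeline: the record fields (`study_id`, `view`, `labels`, `has_test_result`) are hypothetical names, and it assumes that an image is dropped if any of its labels is unsupported and that a duplicate is any repeat of an already accepted study identifier.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set

# Assumption: "Infiltration" is the only unsupported label named in the text.
UNSUPPORTED_LABELS = frozenset({"Infiltration"})


@dataclass(frozen=True)
class CxrRecord:
    """Hypothetical metadata for one CXR image."""
    study_id: Optional[str]
    view: str                      # "frontal" or "lateral"
    labels: FrozenSet[str]         # image-level labels from the source dataset
    has_test_result: bool = True   # e.g. a recorded test result for the patient


def passes_quality_control(rec: CxrRecord, seen_study_ids: Set[str]) -> bool:
    """Return True if the record survives exclusion criteria (1)-(6)."""
    if rec.view == "lateral":                 # (1) lateral-view images
        return False
    if not rec.study_id:                      # (5) missing study identifier
        return False
    if rec.study_id in seen_study_ids:        # (2) duplicate study identifier
        return False
    if "No Finding" in rec.labels:            # (3) images labeled "No Finding"
        return False
    if rec.labels & UNSUPPORTED_LABELS:       # (4) label not covered by SIFT
        return False
    if not rec.has_test_result:               # (6) images lacking test results
        return False
    seen_study_ids.add(rec.study_id)          # remember accepted identifiers
    return True
```

A caller would keep one `seen_study_ids` set per dataset and pass every record through `passes_quality_control` in order.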

2.4. Workflow and output features

After the quality control process, each retained DICOM CXR image was entered into the SIFT system, which predicts the abnormality type and corresponding boundary location of each ROI and displays them on the original image. All CXR images were then categorized according to the predicted ROI results. The SIFT system provides (1) SIFT Automatically Predicted Outputs: several ROIs, each with a predicted abnormality and the corresponding ROI boundary; (2) SIFT Automatically Predicted Other Recommended Abnormalities for Orange Labeled ROI: when a predicted abnormality has a lower probability of being a specific abnormality, or when radiologists may hold different opinions (two radiologists do not agree 100% on every diagnosis), the SIFT system recommends other possible abnormalities, and the annotator can choose either the original predicted abnormality or one of the additional recommended abnormalities; (3) Output in Excel File: the file includes the predicted abnormality, the predicted score, the boundary location, and the additional recommended abnormalities for each ROI.
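The per-ROI output described above (predicted abnormality, score, boundary, and additional recommendations) can be pictured as a small record type. This is an illustrative sketch under stated assumptions: the class, field, and column names are hypothetical and do not describe SIFT's actual file schema, and the scores in any example are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RoiPrediction:
    """One ROI record; all names here are hypothetical."""
    predicted_abnormality: str
    score: float                                  # confidence score of the prediction
    boundary: List[Tuple[int, int]]               # polygon vertices in pixel coordinates
    recommendations: List[Tuple[str, float]] = field(default_factory=list)

    def ranked_recommendations(self) -> List[Tuple[str, float]]:
        """Recommendations ordered by decreasing score."""
        return sorted(self.recommendations, key=lambda item: item[1], reverse=True)

    def choose(self, label: str) -> str:
        """Accept the predicted abnormality or one of the recommended ones."""
        allowed = [self.predicted_abnormality] + [name for name, _ in self.recommendations]
        if label not in allowed:
            raise ValueError(f"{label!r} is neither predicted nor recommended")
        return label

    def to_row(self) -> Dict[str, str]:
        """Flatten the record into one spreadsheet-style row."""
        return {
            "predicted_abnormality": self.predicted_abnormality,
            "score": f"{self.score:.2f}",
            "boundary": ";".join(f"{x},{y}" for x, y in self.boundary),
            "recommended": ";".join(name for name, _ in self.ranked_recommendations()),
        }
```

Keeping the boundary as a vertex list and flattening it only when a row is written is one possible design choice; it lets the same record feed a spreadsheet export or any other output format.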

2.5. Ground truth (GT) establishment and evaluation metrics

To evaluate abnormality/disease classification performance, we use sensitivity (also known as recall) as the evaluation metric, comparing the results generated by SIFT with the GT for the abnormality/disease categories of the four datasets. To assess ROI delineation performance, we use the Dice coefficient. Because calculating Dice scores requires the original lesion location information, which only the ChestX-ray14 dataset provides, this study currently evaluates localization performance on that dataset alone. For the ChestX-ray14 dataset, we compared the SIFT-generated results with the original set of 880 images containing 984 hand-labeled bounding boxes.
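For reference, both metrics have standard definitions: sensitivity is TP / (TP + FN), and the Dice coefficient of two regions A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes Dice for two axis-aligned boxes given as `(x, y, width, height)`. Treating the SIFT output as a box is an assumption made for illustration; the text does not state how SIFT boundaries were matched to the hand-labeled boxes.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), hypothetical layout


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Sensitivity (recall): share of GT-positive cases that were detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("sensitivity is undefined without positive cases")
    return true_positives / total


def box_dice(a: Box, b: Box) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    total_area = aw * ah + bw * bh
    return 2.0 * intersection / total_area if total_area > 0 else 0.0
```

For example, two 10×10 boxes offset horizontally by 5 pixels overlap in a 5×10 strip, so `box_dice((0, 0, 10, 10), (5, 0, 10, 10))` evaluates to 0.5.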

3. RESULTS

In accordance with the image quality control standards, a total of 452,602 CXR images were analyzed and sorted into different categories using the SIFT system. The images were entered into the SIFT system, which predicted the abnormality type and corresponding boundary locations of the ROIs and displayed them on the original images (Figure 2). In this example, two ROIs labeled as secondary pulmonary tuberculosis are shown in different colors. The red-labeled ROI corresponds to a predicted abnormality with a very high score, for which the system provides no additional recommended abnormalities. The orange-labeled ROI represents a predicted abnormality with a high score, for which the system offers three additional recommendations of possible abnormalities in decreasing order of score. Because different diseases can share imaging features and the same disease can present in varying ways on CXR, even experienced radiologists often find it challenging to differentiate between them [9]. The recommendation functionality of the SIFT system supports interpretation by giving radiologists alternative candidates for diagnosis. The radiologist, a well-trained annotator, can accept, delete, or edit the originally predicted abnormality, select one of the additional recommended abnormalities, or add new abnormalities. The final output is stored in Excel, JSON, and DICOM files.

Figure 2.


Example of the SIFT graphical user interface: Two classes of abnormality at different locations and their boundary locations are predicted. The red label indicates that the score for the predicted abnormality is very high, and SIFT does not provide additional recommended abnormalities for the first “Predicted Abnormality”. The orange label indicates that the score for the predicted abnormality is high, so the finding is probably the predicted abnormality. Three additional recommended abnormalities are provided for the second “Predicted Abnormality”.

For manual annotation, a well-trained radiologist typically needs an average of 63.90 seconds (ranging from 56.10 to 72.53 seconds) per CXR image [12]. A previous study reported that when SIFT is used, a radiologist spends 7% of the total time verifying the SIFT results [12]. In the traditional image annotation method, accounting for the 30% intra-observer variation of the first annotating radiologist and the 7% of the total time spent by the second, verifying radiologist, the total time required for the manual process is therefore 87.54 seconds (87.54=63.90*(7% + 30%)+63.90). With SIFT, the automatic annotation takes 0.2 seconds per CXR image with an average sensitivity of 90.0%. Adding the verification time and the time the radiologist needs to re-annotate all ROIs when he/she does not agree with SIFT (taken here as the remaining 10% of cases) brings the total time to 11.06 seconds (11.06=0.2+63.9*(7% + 10%)). The radiologist’s efficiency thus improves by a factor of 7.92 (7.92=87.54/11.06).
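The arithmetic above can be checked with a few lines of code. This is a sketch of the worked example only: the function and parameter names are hypothetical, and treating the 10% re-annotation share as one minus the 90.0% sensitivity is an interpretation of the text rather than something it states explicitly. Intermediate times are rounded to two decimals before the ratio is taken, as in the text.

```python
MANUAL_SECONDS = 63.90  # average manual annotation time per CXR image [12]


def traditional_time(manual: float = MANUAL_SECONDS,
                     verify_fraction: float = 0.07,
                     intra_observer: float = 0.30) -> float:
    """Manual annotation plus verification and intra-observer overhead."""
    return manual * (verify_fraction + intra_observer) + manual


def sift_time(manual: float = MANUAL_SECONDS,
              auto_seconds: float = 0.2,
              verify_fraction: float = 0.07,
              sensitivity: float = 0.90) -> float:
    """Automatic annotation, verification, and re-annotation of disagreements."""
    reannotate_fraction = 1.0 - sensitivity  # assumption: 10% at 90.0% sensitivity
    return auto_seconds + manual * (verify_fraction + reannotate_fraction)


def efficiency_gain() -> float:
    """Traditional time divided by SIFT-assisted time, from rounded times."""
    return round(traditional_time(), 2) / round(sift_time(), 2)


if __name__ == "__main__":
    print(round(traditional_time(), 2), round(sift_time(), 2), round(efficiency_gain(), 2))
```

With the default values this reproduces 87.54 seconds, 11.06 seconds, and a ratio of 7.92. Dividing the unrounded times instead gives roughly 7.91, so the last digit depends on where rounding is applied.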

The sensitivity of SIFT and the radiologist’s verification rate vary across lesions. Figure 3 illustrates how annotation efficiency varies as the SIFT system’s sensitivity and the radiologist’s approved verification rate change. Efficiency is defined as the ratio of the time radiologists spend using the traditional annotation method to the time they spend using the SIFT system, so values above 1 mean that SIFT is faster. Approved verification rate refers to the percentage of time radiologists spend verifying the results generated by the SIFT system. The results show that annotation efficiency improves significantly across all verification rates as SIFT sensitivity increases. This suggests that higher sensitivity allows radiologists to rely more on SIFT, reducing the time needed for manual corrections or verifications. Efficiency increases more noticeably when verification rates are lower (e.g., 10% and 20%), as radiologists spend less time verifying SIFT results. At higher verification rates (e.g., 30%, 40%, and 50%), efficiency increases more gradually because of the added time spent reviewing and confirming the SIFT results. Efficiency remains close to 1 at low sensitivity levels (e.g., below 40%) but grows rapidly at high sensitivity levels (e.g., above 70%), especially at lower verification rates. Notably, the highest efficiency, nearly 10 times faster than traditional methods, is reached when SIFT sensitivity is close to 100% and the verification rate is at its minimum of 10%.

Figure 3.


Relationship between annotation efficiency improvement and SIFT sensitivity at various approved verification rates. Efficiency is defined as the ratio of the time radiologists spend using the traditional annotation method to the time they spend using SIFT. The x-axis represents the sensitivity of SIFT, while the y-axis represents the efficiency improvement. Different colored lines correspond to different approved verification rates (10%, 20%, 30%, 40%, and 50%).

CheXpert Dataset

All 13 types of chest radiographic observations in the dataset overlapped with the abnormality/disease categories identified by SIFT (Figure 4a). The SIFT system demonstrated an average sensitivity of 86.73% with a standard deviation of 8.32%, with per-condition sensitivity ranging from 70.36% to 96.37%. High sensitivity (above 90%) was observed for 7 out of 13 conditions: Edema (96.37%), Atelectasis (94.84%), Lung Lesion (92.08%), Pneumonia (91.11%), Enlarged Cardiomediastinum (91.04%), Pleural Other (90.82%), and Lung Opacity (90.70%), indicating the SIFT system’s robustness in detecting these observations. Such output could support further clinical diagnosis and subsequent research efforts. Low sensitivity (below 80%) was observed for 2 out of 13 conditions, Pleural Effusion at 70.36% and Consolidation at 71.19%, suggesting challenges in identifying these specific abnormalities in this dataset, likely due to complex imaging patterns or dataset limitations.

Figure 4.


(a-c) Sensitivity comparison across four datasets for multiple abnormalities/diseases; (d) Dice coefficient comparison on ChestX-ray14 Dataset.

ChestX-ray14 Dataset

The SIFT system achieved an average sensitivity of 92.29% on the ChestX-ray14 dataset, with a standard deviation of 12.91% (Figure 4b). The SIFT system showed 100% sensitivity for several conditions, including Cardiomegaly (2,776 cases), Pneumonia (1,431 cases), Fibrosis (1,686 cases), and Pleural Thickening (3,385 cases), followed by 99.91% sensitivity for Pneumothorax (5,302 cases), 99.65% for Nodule (6,331 cases), and 99.42% for Consolidation (4,667 cases). The lowest sensitivity in the ChestX-ray14 Dataset, 57.97% for Edema, is far below the 96.37% obtained for Edema on the CheXpert Dataset. This discrepancy may be due to differences in labeling protocols. ChestX-ray14 uses NLP to extract mentions from radiology reports, a process that relies heavily on the accuracy and completeness of the reports and can therefore introduce variability. CheXpert also uses NLP but handles uncertainty differently, categorizing observations as positive, negative, or uncertain. The discrepancy highlights how variations in labeling protocols directly influence the performance metrics of AI models across different datasets.

MIDRC Dataset

The SIFT system showed a sensitivity of 68.86% for COVID-19 (Figure 4c). The dataset categorizes patients as either COVID-19 positive or negative based on RT-PCR or rapid antigen test results without providing individual results for each follow-up CXR examination. Our SIFT system uses all the qualified CXR images to delineate COVID-19 lesions and assesses performance using the patient-level labels provided by the dataset. This approach may introduce errors in the sensitivity measurement, as it does not account for the temporal changes in the disease state reflected in follow-up CXR examinations. This highlights the challenges in correlating static patient-level labels with dynamic imaging findings, particularly for those diseases (like COVID-19) in which the radiographic manifestations change over time.

NIAID TB Portals Dataset

In this dataset, the classification of TB patient images involves a combination of clinical, radiological, and microbiological criteria. Additionally, the CXR data come from many countries worldwide and vary in quality, and this may cause our SIFT system to miss some cases that do not present clear radiological signs alone, resulting in a sensitivity of 78.35% (Figure 4c). The use of multiple criteria in this dataset increases diagnostic complexity. As this dataset combines different standards, chest radiographic findings represent only one part of the diagnostic process. Some TB patients may meet the clinical and microbiological criteria but lack typical radiological signs of TB on CXR images, potentially reducing the AI model’s sensitivity to radiological features. While such a comprehensive standard is crucial for accurate TB diagnosis, it may hinder the model’s performance in medical imaging annotation scenarios, where AI mainly focuses on CXR analysis. For example, in several datasets where radiologists review CXR images and assess the presence of TB lesions, our AI model can achieve sensitivities ranging from 85.7% to 97.8% [13,14]. In our study, the SIFT results on TB Portals, with a sensitivity of 78%, are significant as they reflect real-world variability and highlight the differences between image-based CAD and the multimodal approach clinicians take.

Overall, the SIFT system achieves an average sensitivity of 89.38%±11.46% across the four datasets and an average Dice score of 0.64±0.12 (Figure 4d). Beyond the observations labeled in the four datasets, our SIFT system also labeled additional abnormalities/diseases, which may include comorbidities or complications of existing abnormal categories, or new abnormalities not covered by the original datasets. This capability may enhance the diversity of data available for AI development, promoting deeper exploration of varied health conditions and bringing novel insights into pulmonary imaging analysis. However, it may also increase false positive rates, so a balanced approach is required in practice.

4. CONCLUSIONS

A novel AI-based SIFT system was developed and applied to over 450,000 CXR images with very high efficiency and sensitivity to assist radiologists in determining the abnormality/disease types of ROIs, their boundary coordinates, confidence levels, and possible recommended types of abnormalities. SIFT is promising for assisting researchers in labeling images with multiple observations. Specifically, the high efficiency of SIFT predictions makes it attractive for tuberculosis diagnostics and monitoring. Multi-domain, patient-centric datasets like NIAID TB Portals could be used to align SIFT predictions with retrospective data on treatment history, drug sensitivity, history of relapses, influence of comorbidities, and treatment outcomes. Connecting SIFT predictions with these clinical data would allow sophisticated and sensitive radiology models to be built to improve diagnostics and long-term survival for lung diseases.

ACKNOWLEDGEMENTS

This study has received funding from the Shenzhen Science and Technology Program [Grant No.: KQTD2017033110081833; JCYJ20220531093817040], the Guangzhou Science and Technology Planning Project [Grant No.: 2023A03J0536; 2024A03J0583], and the Inner Mongolia Autonomous Region Science and Technology Program Project [Grant No.: 2024SGGZ059]. This work was supported in part by the Lister Hill National Center for Biomedical Communications of the National Library of Medicine (NLM), National Institutes of Health. MLG, HL, and JF are part of MIDRC (the Medical Imaging and Data Resource Center), which is supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health under contract 75N92020D00021. This work has been funded in part with Federal funds from the National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health, Department of Health and Human Services under BCBB Support Services Contract HHSN316201300006W/75N93022F00001 to Guidehouse, Inc.

REFERENCES

  • [1] Wu J, Gur Y, Karargyris A, et al. Automatic bounding box annotation of chest x-ray data for localization of abnormalities[C]//2020 IEEE 17th International symposium on biomedical imaging (ISBI). IEEE, 2020: 799–803.
  • [2] Wollek A, Hyska S, Sedlmeyr T, et al. German CheXpert Chest X-ray Radiology Report Labeler[C]//RöFo-Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren. Georg Thieme Verlag KG, 2024.
  • [3] Lee DH. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks[C]//Workshop on challenges in representation learning, ICML. 2013, 3(2): 896.
  • [4] Kantipudi K, Bui V, Yu H, et al. Semantic Segmentation of TB in Chest X-rays: a New Dataset and Generalization Evaluation. SPIE Medical Imaging: Computer-Aided Diagnosis, 2025.
  • [5] Guo L, Hong K, Xiao Q, et al. Developing and assessing an AI-based multi-task prediction system to assist radiologists detecting lung diseases in reading chest x-ray images[C]//Medical Imaging 2023: Image Perception, Observer Performance, and Technology Assessment. SPIE, 2023, 12467: 73–90.
  • [6] Hong K, Guo L, Lure YF. Self-Rating Curriculum Learning for Localization and Segmentation of Tuberculosis on Chest Radiograph[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022: 686–695.
  • [7] Desai S, Baghal A, Wongsurawat T, et al. Data from Chest Imaging with Clinical and Genomic Correlates Representing a Rural COVID-19 Positive Population [Data set]. The Cancer Imaging Archive, 2020, DOI: 10.7937/tcia.2020.py71-5978.
  • [8] Irvin J, Rajpurkar P, Ko M, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison[C]//Proceedings of the AAAI conference on artificial intelligence. 2019, 33(01): 590–597.
  • [9] Wang X, Peng Y, Lu L, et al. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2097–2106.
  • [10] Summers RM. NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories. https://nihcc.app.box.com/v/ChestXray-NIHCC/file/220660789610. Accessed May 2019.
  • [11] Stefanidis K, Konstantelou E, Yusuf GT, et al. Radiological, epidemiological and clinical patterns of pulmonary viral infections[J]. European Journal of Radiology, 2021, 136: 109548.
  • [12] Guo L, Hong K, Zhang Z, et al. Assessing an AI-based smart imagery framing and truthing (SIFT) system to assist radiologists annotating lung abnormalities on chest x-ray images for development of deep learning models[C]//Medical Imaging 2023: Computer-Aided Diagnosis. SPIE, 2023, 12465: 147–155.
  • [13] Nijiati M, Zhang Z, Abulizi A, et al. Deep learning assistance for tuberculosis diagnosis with chest radiography in low-resource settings[J]. Journal of X-ray Science and Technology, 2021, 29(5): 785–796.
  • [14] Zhang X, Wang Q, Xia L, et al. Clinical evaluation of chest X-radiograph computer aided diagnostic system for pulmonary tuberculosis applied in primary hospitals[J]. Journal of Tuberculosis and Lung Disease, 2022, 3(2): 96–101.
