The International Journal of Tuberculosis and Lung Disease
Letter. 2023 Feb 1;27(2):157–160. doi: 10.5588/ijtld.22.0437

CAD4TB software updates: different triaging thresholds require caution by users and regulation by authorities

J Fehr 1,2,3, R Gunda 1,4,5, M J Siedner 1,6,7,8, W Hanekom 1, T Ndung’u 1,5,9,10, A Grant 1,4,11,12, C Lippert 2,3,13, E B Wong 1,14
PMCID: PMC9904401  PMID: 36853104

Dear Editor,

The recent recommendations by the WHO for systematic screening for TB with digital chest X-ray (CXR) and automated imaging interpretation1 have led to an explosion in the use of computer-assisted diagnostic (CAD) algorithms. We previously found that the performance of CAD4TB (©Delft Imaging, Hertogenbosch, The Netherlands) is comparable to that of a human radiologist during community-based TB screening in rural South Africa.2 CAD4TB quantifies lung field abnormalities suggestive of active TB, assigning a score between 0 and 100. Using CAD4TB requires screening programmes to select a triaging threshold above which participants receive sputum testing. Triaging thresholds are not universal and require adjustment based on demographic characteristics, laboratory capacities, budget, healthcare settings and programmatic goals.2–7 CAD4TB is updated annually, and the 7th version has been released recently. Screening programmes might be eager to use new versions because studies report improved performance,8,9 but no recommendations on adopting software updates currently exist. Here, we have evaluated the triaging performance characteristics and optimal thresholds of the latest version of CAD4TB (v7) compared with the two preceding versions (v5 and v6).

During the first year of a community-based multi-morbidity study in a rural district in KwaZulu-Natal, South Africa, 9,912 local residents above 15 years of age received free TB screening at a mobile camp (as described previously).2,10 Briefly, TB screening included digital posterior-anterior CXR imaging and assessment of symptoms. Following WHO guidelines for TB prevalence surveys,11 participants were triaged for sputum collection for any TB-related symptom (fever, weight loss, cough or night sweats) or for any CXR lung field abnormality. CXRs were analysed using CAD4TB v5 and scored between 0 and 100 to indicate the likelihood of TB-related lung field abnormality. As described previously,2 those with a CAD4TB v5 score >25 (a triaging threshold with a sensitivity of 85% for lung field abnormality)2 were triaged for sputum examination using Xpert® MTB/RIF Ultra (Cepheid, Sunnyvale, CA, USA) and MGIT (BD, Franklin Lakes, NJ, USA) liquid culture; participants were defined as having microbiologically confirmed TB if either test was positive. Among the 9,912 participants who underwent CXR, 5,594 (56.4%) were referred for sputum testing, 4,976 (89.0%) of whom were able to produce sputum. A total of 99 (1.0%) participants had microbiologically positive sputum. A senior radiologist (blinded to CAD4TB scores and patient information) interpreted CXRs as having normal or abnormal lung fields. CXRs were retrospectively analysed using CAD4TB v6 and v7. The distribution of CAD4TB scores (v5–v7) and the percentage of participants who would require sputum testing were compared among all CXRs (n = 9,912). Performance characteristics and the triage threshold that most closely matched the radiologist’s performance were compared (v5–v7) among individuals with sputum test results (n = 4,976). Participants provided written informed consent to participate in the study. Ethics approval was obtained from the University of KwaZulu-Natal Biomedical Research Ethics Committee (BE560/17), KwaZulu-Natal, South Africa; the London School of Hygiene & Tropical Medicine Ethics Committee (14722), London, UK; and the Mass General Brigham Institutional Review Board, Boston, MA, USA (2018P001802).
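The performance characteristics described above can be illustrated with a minimal sketch. It assumes the convention stated in the text (a score above the threshold triggers sputum testing); the function name `triage_metrics` and the toy data are hypothetical and are not taken from the study. Note that the study computed sensitivity and specificity among participants with sputum results (n = 4,976) and the percentage triaged among all participants (n = 9,912), whereas this sketch uses a single toy set for all three quantities.

```python
from typing import Sequence


def triage_metrics(scores: Sequence[float], tb_positive: Sequence[bool], threshold: float) -> dict:
    """Sensitivity, specificity and fraction triaged when score > threshold triggers testing."""
    if len(scores) != len(tb_positive):
        raise ValueError("scores and tb_positive must have the same length")
    flagged = [s > threshold for s in scores]
    tp = sum(f and p for f, p in zip(flagged, tb_positive))            # flagged, TB confirmed
    fn = sum((not f) and p for f, p in zip(flagged, tb_positive))      # not flagged, TB confirmed
    tn = sum((not f) and (not p) for f, p in zip(flagged, tb_positive))
    fp = sum(f and (not p) for f, p in zip(flagged, tb_positive))
    return {
        # Both ratios assume at least one positive and one negative reference result.
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fraction_triaged": (tp + fp) / len(scores),
    }


# Made-up toy data (not from the study): six CXR scores with reference results.
toy_scores = [12, 35, 48, 60, 22, 41]
toy_tb = [False, False, True, True, False, False]
toy_result = triage_metrics(toy_scores, toy_tb, threshold=40)
print(toy_result)
```

Evaluating the same function over a range of thresholds gives the kind of threshold-by-threshold comparison reported for each software version.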

The overall performance of CAD4TB v5, v6 and v7 (area under the curve [AUC] v5: 0.78, 95% CI 0.73–0.83; v6: 0.79, 95% CI 0.73–0.84; v7: 0.80, 95% CI 0.75–0.85; P > 0.1; Figure Panel A) was similar, but the distribution of scores across the 100-point scale varied greatly across the three versions (median scores were v5: 28, interquartile range [IQR] 22–41; v6: 35, IQR 16–46; and v7: 11, IQR 5.2–27; P < 0.001; Figure Panel B). Between the three versions, each numerical threshold had strikingly different performance. For example, triaging with a CAD4TB threshold of 40 would result in a range of screening sensitivities (v5: 79.8%, v6: 88.9%, v7: 66.7%) and specificities (v5: 57.4%, v6: 33.3%, v7: 84.6%). As no threshold from any version met the WHO target product profile of ≥90% sensitivity and ≥70% specificity,12 we identified one threshold for each CAD4TB version that most closely matched the radiologist sensitivity of 80.8% (95% CI 71.7–88.0). The matching thresholds were v5: 40 (79.8%, 95% CI 70.5–87.2); v6: 47 (82.8%, 95% CI 73.9–89.7); and v7: 20 (79.8%, 95% CI 70.5–87.2). At these thresholds, the three CAD4TB versions had lower specificity than the radiologist (radiologist: 66.9%, 95% CI 65.6–68.2; v5: 40, 57.4%, 95% CI 56.0–58.8; v6: 47, 62.6%, 95% CI 61.2–64.0; v7: 20, 56.6%, 95% CI 55.2–58.0), leading to a higher percentage of participants who would require microbiological sputum testing relative to all participants (n = 9,912) (radiologist: 20.2%; v5: 40, 27.0%; v6: 47, 23.7%; v7: 20, 33.5%; Figure Panel C and Supplementary Data S1). Substantial variations were also observed in the number of cases of microbiologically positive sputum that would be ‘missed’ using potential triaging thresholds for the different CAD4TB versions (Figure Panel D). For example, triaging with a CAD4TB threshold of 40 would result in sputum testing for 27.0% (v5), 45.9% (v6) and 10.3% (v7) of participants. At the same threshold, the percentage of microbiologically confirmed TB cases missed would be 20.2% (v5), 11.1% (v6) and 33.3% (v7). Of note, despite previous reports that showed improved performance with newer versions,8,9 in these real-world data v7 did not outperform v6, as measured by AUC and by specificity at the threshold matched to the radiologist’s sensitivity. Despite similar AUC, v7 performed at higher specificity but lower sensitivity at each triaging threshold compared to v5 and v6 (Supplementary Table S1).
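The relationship between the sensitivities and the percentages of missed cases at threshold 40 can be checked with simple arithmetic: the missed percentage is 100% minus the sensitivity. The sketch below uses detected-case counts that are our own back-calculation, assuming a denominator of the 99 microbiologically confirmed cases; the letter itself reports only the percentages.

```python
# Back-calculated counts (an assumption, not reported in the letter):
# detected cases out of 99 microbiologically confirmed TB cases at threshold 40.
TOTAL_CASES = 99
detected_at_40 = {"v5": 79, "v6": 88, "v7": 66}

summary = {}
for version, detected in detected_at_40.items():
    sensitivity = round(100 * detected / TOTAL_CASES, 1)
    missed = round(100 * (TOTAL_CASES - detected) / TOTAL_CASES, 1)
    summary[version] = (sensitivity, missed)
    print(f"{version}: sensitivity {sensitivity}%, missed {missed}%")
```

Under this assumption the rounded values agree with the reported sensitivities (79.8%, 88.9%, 66.7%) and missed percentages (20.2%, 11.1%, 33.3%), which correspond to 20, 11 and 33 missed cases respectively.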

Figure. Performance of CAD4TB v5, v6 and v7 in identifying microbiologically confirmed TB. TB was defined as sputum positive on either Xpert Ultra or microbiological culture. A) For individuals with sputum results (n = 4,976), performance is shown in terms of sensitivity, specificity and AUC. Annotations show the thresholds that most closely matched the radiologist’s sensitivity; B) distributions and most frequent CAD4TB scores of all three versions obtained for all chest X-rays (n = 9,912); C) percentage of participants triaged for sputum testing among all participants at each CAD4TB threshold (n = 9,912); D) percentage of missed positive sputum among TB-positive individuals (n = 99) at each CAD4TB threshold. The performance of the senior radiologist is marked with a cross (A) and dashed lines (C, D). CAD4TB thresholds that matched the radiologist’s performance (v5: 40, v6: 47, v7: 20) are marked with numbers (A) and grey vertical lines (C, D). AUC = area under the receiver operating characteristic curve.

The change in scales and the resulting wide variations in triaging thresholds between different CAD4TB versions pose a risk to end-users in TB screening programmes, who may unintentionally introduce systematic screening errors by adopting software updates without adjusting the selected triaging thresholds. Using incorrect triaging thresholds may have severe consequences and result in missing people with TB (triaging threshold inadvertently too high) or utilising microbiological testing excessively (triaging threshold inadvertently too low). To accommodate inter-version variation, screening programmes need to select new triaging thresholds for each new software update. Previous work2,13,14 and the developer15 suggest that it is necessary to conduct pilot studies to find triage thresholds that optimally serve the goals of each screening exercise. It is currently unclear whether each software update requires new piloting for re-adjustment or whether this can be achieved through retrospective analysis of the newest version’s performance against population-specific CXR collections. It is unknown whether our findings of significant variation between CAD4TB versions are applicable to other image interpretation algorithms used for TB screening – this information needs to be established urgently.15 For anyone designing TB screening programmes, decisions about programmatic adjustments to new versions are especially difficult because the underlying algorithmic or data changes between software versions are not communicated by manufacturers. Changes to the underlying reference standard for algorithm training may require readjustment of triaging thresholds, whereas small changes for faster radiograph interpretation might not. However, information about the changes between versions is not transparently shared with the community because it has been considered proprietary by developers.15
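One way to picture the retrospective re-adjustment discussed above is a sketch that, for a given software version, picks the threshold whose sensitivity is closest to a target value on a local CXR collection with reference results. This mirrors the matching to the radiologist’s sensitivity described in this letter, but the function names, the tie-breaking rule and the toy scores below are illustrative assumptions, not the study’s procedure or data.

```python
from typing import Iterable, Sequence


def sensitivity_at(scores: Sequence[float], tb_positive: Sequence[bool], threshold: float) -> float:
    """Share of reference-positive cases with a score above the threshold."""
    positives = [s for s, p in zip(scores, tb_positive) if p]
    if not positives:
        raise ValueError("at least one reference-positive case is required")
    return sum(s > threshold for s in positives) / len(positives)


def match_threshold(scores: Sequence[float], tb_positive: Sequence[bool],
                    target_sensitivity: float,
                    candidates: Iterable[int] = range(0, 100)) -> int:
    """Candidate threshold whose sensitivity is closest to the target.

    With ascending candidates, min() keeps the lowest threshold on ties,
    an assumed rule that errs on the side of testing more people.
    """
    return min(candidates,
               key=lambda t: abs(sensitivity_at(scores, tb_positive, t) - target_sensitivity))


# Made-up scores for the same seven CXRs from two hypothetical software versions.
tb = [True, True, True, True, True, False, False]
old_scores = [55, 62, 38, 71, 45, 20, 30]
new_scores = [25, 31, 12, 48, 19, 4, 9]

old_threshold = match_threshold(old_scores, tb, target_sensitivity=0.8)
new_threshold = match_threshold(new_scores, tb, target_sensitivity=0.8)
print(old_threshold, new_threshold)
```

In this toy example the two versions rank the images identically yet need different thresholds to reach the same sensitivity, which is the situation a programme would face when a software update shifts the score scale. A real calibration would also need to weigh specificity and testing volume.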

Based on the results presented here, we call for regulation to require CAD-developing companies to communicate changes between software versions and give guidance for medical or public health end-users to effectively adopt software version updates in TB screening programmes. Continued vigilance and performance auditing of successive CAD software versions should be an integral requirement for authorisation by the WHO and regulatory agencies. These findings also contribute to ongoing scientific debates on how to successfully adopt artificial intelligence-based tools for healthcare.

Funding Statement

This research was funded in part by the Wellcome Trust, London, UK (Grant number 201433/Z/16/A).

References

  • 1. World Health Organization. WHO consolidated guidelines on tuberculosis. Module 2: Screening. Systematic screening for tuberculosis disease. Geneva, Switzerland: WHO; 2021.
  • 2. Fehr J, et al. Computer-aided interpretation of chest radiography reveals the spectrum of tuberculosis in rural South Africa. Npj Digit Med. 2021;4(1):20. doi: 10.1038/s41746-021-00471-y.
  • 3. Zaidi SMA, et al. Evaluation of the diagnostic accuracy of Computer-Aided Detection of tuberculosis on chest radiography among private sector patients in Pakistan. Sci Rep. 2018;8(1):1–9. doi: 10.1038/s41598-018-30810-1.
  • 4. Koesoemadinata RC, et al. Computer-assisted chest radiography reading for tuberculosis screening in people living with diabetes mellitus. Int J Tuberc Lung Dis. 2018;22(9):1088–1094. doi: 10.5588/ijtld.17.0827.
  • 5. Qin ZZ, et al. Tuberculosis detection from chest x-rays for triaging in a high tuberculosis-burden setting: an evaluation of five artificial intelligence algorithms. Lancet Digit Health. 2021;3(9):e543–554. doi: 10.1016/S2589-7500(21)00116-3.
  • 6. Khan FA, et al. Chest x-ray analysis with deep learning-based software as a triage test for pulmonary tuberculosis: a prospective study of diagnostic accuracy for culture-confirmed disease. Lancet Digit Health. 2020;2(11):e573–581. doi: 10.1016/S2589-7500(20)30221-1.
  • 7. Tavaziva G, et al. Chest X-ray analysis with deep learning-based software as a triage test for pulmonary tuberculosis: an individual patient data meta-analysis of diagnostic accuracy. Clin Infect Dis. 2022;74(8):1390–1400. doi: 10.1093/cid/ciab639.
  • 8. Qin ZZ, et al. Comparing different versions of computer-aided detection products when reading chest X-rays for tuberculosis. PLoS Digit Health. 2022;1(6):e0000067. doi: 10.1371/journal.pdig.0000067.
  • 9. Murphy K, et al. Computer aided detection of tuberculosis on chest radiographs: an evaluation of the CAD4TB v6 system. Sci Rep. 2020;10(1):1–11. doi: 10.1038/s41598-020-62148-y.
  • 10. Wong EB, et al. Convergence of infectious and non-communicable disease epidemics in rural South Africa: a cross-sectional, population-based multimorbidity study. Lancet Glob Health. 2021;9(7):e967–976. doi: 10.1016/S2214-109X(21)00176-5.
  • 11. World Health Organization. Tuberculosis prevalence surveys: a handbook. The Lime Book. 2nd ed. Geneva, Switzerland: WHO; 2011.
  • 12. World Health Organization. High-priority target product profiles for new tuberculosis diagnostics: report of a consensus meeting. Geneva, Switzerland: WHO; 2014.
  • 13. Qin ZZ, et al. Using artificial intelligence to read chest radiographs for tuberculosis detection: a multi-site evaluation of the diagnostic accuracy of three deep learning systems. Sci Rep. 2019;9:15000. doi: 10.1038/s41598-019-51503-3.
  • 14. World Health Organization, UNICEF/UNDP/World Bank/WHO Special Programme for Research and Training in Tropical Diseases. Determining the local calibration of computer-assisted detection (CAD) thresholds and other parameters: a toolkit to support the effective use of CAD for TB screening. Geneva, Switzerland: WHO; 2021.
  • 15. Qin ZZ, et al. A new resource on artificial intelligence powered computer automated detection software products for tuberculosis programmes and implementers. Tuberculosis. 2021;127:102049. doi: 10.1016/j.tube.2020.102049.

