Abstract
This study develops, validates, and deploys deep learning for automated total kidney volume (TKV) measurement (a marker of disease severity) on T2-weighted MRI studies of autosomal dominant polycystic kidney disease (ADPKD). The model was based on the U-Net architecture with an EfficientNet encoder, developed using 213 abdominal MRI studies in 129 patients with ADPKD. Patients were randomly divided into 70% training, 15% validation, and 15% test sets for model development. Model performance was assessed using Dice similarity coefficient (DSC) and Bland-Altman analysis. External validation in 20 patients from outside institutions demonstrated a DSC of 0.98 (IQR, 0.97–0.99) and a Bland-Altman difference of 2.6% (95% CI: 1.0%, 4.1%). Prospective validation in 53 patients demonstrated a DSC of 0.97 (IQR, 0.94–0.98) and a Bland-Altman difference of 3.6% (95% CI: 2.0%, 5.2%). Last, the efficiency of model-assisted annotation was evaluated on the first 50% of prospective cases (n = 28), with a 51% mean reduction in contouring time (P < .001), from 1724 seconds (95% CI: 1373, 2075) to 723 seconds (95% CI: 555, 892). In conclusion, our deployed artificial intelligence pipeline accurately performs automated segmentation for TKV estimation of polycystic kidneys and reduces expert contouring time.
Keywords: Convolutional Neural Network (CNN), Segmentation, Kidney
ClinicalTrials.gov identification no.: NCT00792155
Supplemental material is available for this article.
© RSNA, 2022
Summary
A deep learning model was developed and deployed to automatically segment kidneys in autosomal dominant polycystic kidney disease MRI studies for determining total kidney volume, a metric of disease severity.
Key Points
■ Model performance on validation MRI studies (53 prospective, 20 external institution) demonstrated a median Dice similarity coefficient of 0.97 or greater (first quartile ≥ 0.94) and a Bland-Altman mean percentage difference in total kidney volume (model vs manual reference) of 3.6% or less (95% CI: 2.0%, 5.2%).
■ Model deployment into the clinical pipeline was accomplished by pushing new MRI studies to a clinical server within the picture archiving and communication system firewall that runs inference on axial T2-weighted sequences and outputs results formatted for radiologist label refinement and final reporting.
■ Prospective assessment in the first 50% of prospective cases (n = 28) showed a 51% mean reduction in radiologist time for model-assisted segmentation compared with segmenting without the model (P < .001).
Introduction
Autosomal dominant polycystic kidney disease (ADPKD) is the most common inherited renal disease, affecting 12.5 million people worldwide and accounting for 5%–10% of cases of end-stage renal disease (1). Total kidney volume (TKV) indexed to patient height (htTKV) (2,3) is an important imaging biomarker for assessing ADPKD severity (4,5). For patients with ADPKD with typical diffuse cystic kidney disease (Mayo class 1), the Mayo classification predicts the time until dialysis is required using htTKV at a single time point, adjusted for age and estimated kidney growth rate (6,7). Tolvaptan can slow the rate of progression of chronic kidney disease in patients at risk for rapid progression on the basis of age, creatinine level, and htTKV (8). The importance of htTKV for determining tolvaptan therapy eligibility and for following patients taking tolvaptan has resulted in increased demand for MRI measurements of TKV. As TKV increases only 2%–5% per year (9), a highly reproducible, operator-independent method for measuring kidney volumes is needed.
In this study, we developed, validated, and deployed a segmentation model using a U-Net architecture with an EfficientNet encoder for accurate polycystic kidney segmentation at MRI. Furthermore, we evaluated the efficiency improvement of model-assisted annotation.
Material and Methods
Patient Datasets
This Health Insurance Portability and Accountability Act–compliant study was approved by the institutional review board. Individuals enrolled in the Rogosin Institute ADPKD Repository study provided signed informed consent. Existing axial MRI studies (n = 213 in 129 patients) were utilized for model development and retrospective validation. MRI studies in another 20 patients with ADPKD, performed at other institutions, were utilized for external validation. MRI studies performed beginning June 1, 2021, (n = 53) were included in a prospective assessment of the clinical implementation. This study was registered on ClinicalTrials.gov with identification number NCT00792155.
MRI Protocol
All participants used for model training and prospective evaluation underwent a standardized imaging protocol at 1.5 T using a body phased‐array coil, including axial T2‐weighted single-shot fast spin-echo and steady-state free precession (SSFP) images (details described in Appendix E1 [supplement]). Participants used for external validation all underwent axial T2-weighted imaging with various MRI machines (Table E1 [supplement]).
Code, Model, and Data Availability
All code was developed in the Python programming language (version 3.8.5) using the PyTorch deep learning framework (version 1.7.1) (10). The training code, inference code, and model weights for this project are publicly available on our code repository (https://github.com/aksg87/adpkd-segmentation-pytorch) (11). Data availability is discussed at the end of the article.
Data Preparation
De-identified Digital Imaging and Communications in Medicine (DICOM) axial T2-weighted and SSFP MRI studies were converted into NIfTI (Neuroimaging Informatics Technology Initiative) file format using the NiBabel package in Python (https://nipy.org/nibabel/) (12). Segmentation masks were saved in NIfTI format within ITK-SNAP. Normalization was applied to each scan by linearly scaling 16-bit grayscale values to [0, 1] (single-precision floating point), on the basis of the minimum and maximum grayscale value.
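The normalization step can be illustrated with a minimal sketch. It assumes the scan has already been loaded into a NumPy array (for example from a NIfTI file); the function name `minmax_normalize`, the synthetic array, and the handling of a constant image are illustrative assumptions, not the authors' code.

```python
import numpy as np


def minmax_normalize(volume: np.ndarray) -> np.ndarray:
    """Linearly rescale a scan to [0, 1] (float32) using its own min and max."""
    volume = volume.astype(np.float32)
    lo, hi = float(volume.min()), float(volume.max())
    if hi == lo:
        # Constant image: avoid division by zero (assumed handling).
        return np.zeros_like(volume)
    return (volume - lo) / (hi - lo)


# Synthetic 16-bit values standing in for one loaded MRI volume.
scan = np.array([[[100, 600], [1100, 2100]]], dtype=np.uint16)
normalized = minmax_normalize(scan)
print(normalized.dtype, normalized.min(), normalized.max())
```

Because the scaling uses each scan's own minimum and maximum, the result is insensitive to the absolute intensity range of the scanner but sensitive to outlier voxels.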
TKV Reference Standard
The reference standard TKV was obtained using ITK-SNAP (13) to manually segment kidneys by a radiology fellow (A.G.) and radiology attending physician (M.R.P.) with a combined 25 years of experience imaging ADPKD. All annotations were performed without interpolation with attention to detail for accuracy at the polycystic kidney boundary. Annotation voxels were summed to derive TKV.
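As a hedged illustration of deriving a volume from a segmentation mask, the sketch below counts labeled voxels and multiplies by the physical volume of one voxel. The conversion through voxel spacing and the htTKV helper (TKV divided by height in meters) follow common convention and are assumptions here; the function names and the toy mask are hypothetical.

```python
import numpy as np


def total_kidney_volume_ml(mask: np.ndarray, voxel_spacing_mm) -> float:
    """TKV in mL: number of labeled voxels x volume of one voxel (mm^3) / 1000."""
    voxel_volume_mm3 = float(np.prod(voxel_spacing_mm))
    n_voxels = int(np.count_nonzero(mask))
    return n_voxels * voxel_volume_mm3 / 1000.0


def height_adjusted_tkv(tkv_ml: float, height_m: float) -> float:
    """htTKV in mL/m, assuming the usual TKV / height convention."""
    return tkv_ml / height_m


# Toy mask: 2 x 5 x 5 = 50 labeled voxels, each 1.5 x 1.5 x 5.0 mm.
mask = np.zeros((4, 10, 10), dtype=np.uint8)
mask[1:3, 2:7, 2:7] = 1
tkv = total_kidney_volume_ml(mask, (1.5, 1.5, 5.0))
print(tkv, height_adjusted_tkv(tkv, 1.70))
```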
Data Stratification
Patients (n = 129 with 213 studies) were split on the basis of patient ID for model development into 70% training, 15% validation, and 15% holdout test sets, with a stratified random split controlling for the statistical distribution of TKV. The training, validation, and test sets contained 154 MRI studies (5570 DICOM images), 28 MRI studies (1312 DICOM images), and 31 MRI studies (1050 DICOM images), respectively. No patient was represented in more than one split. The dataset split was fixed prior to model training to prevent data leakage. Cross-validation (k-fold) was not performed because of limited computing resources and our inclusion of prospective and external validation.
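One way to realize a patient-level split that controls for the TKV distribution is sketched below. It is an assumption about how such a split could be built, not the authors' procedure: patients are ranked by TKV, grouped into quantile bins, and each bin is divided 70/15/15. The number of bins, the seed, and the synthetic TKV values are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_patient_split(patient_tkv, fractions=(0.70, 0.15, 0.15),
                             n_bins=4, seed=0):
    """Split patient IDs into train/val/test, stratified by TKV quantile bin."""
    rng = random.Random(seed)
    ranked = sorted(patient_tkv, key=patient_tkv.get)
    bins = defaultdict(list)
    for rank, pid in enumerate(ranked):
        bins[rank * n_bins // len(ranked)].append(pid)

    splits = {"train": [], "val": [], "test": []}
    for members in bins.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_val = round(fractions[1] * len(members))
        splits["train"] += members[:n_train]
        splits["val"] += members[n_train:n_train + n_val]
        splits["test"] += members[n_train + n_val:]
    return splits


# Synthetic cohort: 129 patient IDs with made-up TKV values (mL).
_rng = random.Random(42)
patient_tkv = {f"P{i:03d}": _rng.uniform(300, 6000) for i in range(129)}
splits = stratified_patient_split(patient_tkv)
print({name: len(ids) for name, ids in splits.items()})
```

All MRI studies of a patient would then follow that patient into its split, which is what keeps any patient from appearing in more than one set.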
Model Development, Data Augmentation, and Training Procedure
MRI studies and ADPKD kidney segmentations were used to train models for the objective of kidney segmentation (Fig 1). ADPKD kidney segmentation was formulated as a two-dimensional problem, such that each slice in an MRI study provided a single sample. To generate a three-dimensional segmentation, inference was performed on all slices in a volume and then combined. The model architecture was based on the encoder-decoder U-Net (14), with EfficientNet comprising the encoder component (15); EfficientNet has shown top performance on various image segmentation tasks (16). Further details on the model architecture are shown in Table E2 and Figure E1 (supplement).
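The slice-wise formulation can be sketched as follows. The trained network is replaced by a stand-in function, and the axis order (slices first) and the 0.5 probability threshold are assumptions made for illustration.

```python
import numpy as np


def segment_volume(volume: np.ndarray, predict_slice, threshold: float = 0.5) -> np.ndarray:
    """Apply a 2D model to every axial slice and stack the results into a 3D mask."""
    masks = [predict_slice(volume[z]) >= threshold for z in range(volume.shape[0])]
    return np.stack(masks, axis=0).astype(np.uint8)


def toy_predict_slice(slice_2d: np.ndarray) -> np.ndarray:
    """Stand-in for the trained network: treats intensity as a probability."""
    return slice_2d


# Synthetic normalized volume: 6 slices of 32 x 32 pixels.
volume = np.random.default_rng(0).random((6, 32, 32)).astype(np.float32)
mask_3d = segment_volume(volume, toy_predict_slice)
print(mask_3d.shape, mask_3d.dtype)
```

In practice `predict_slice` would wrap a forward pass of the U-Net; the stacking step is unchanged.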
Figure 1:
Schematic summarizes project infrastructure on deep learning server. Training is highlighted with a light pink background. Deployment and inference are highlighted with a light blue background.
Training data were augmented using affine transformations (17). Model checkpoints were saved on the basis of improving performance on the validation set that was based on the metric, predicted-TKV error. Upon the completion of training, the single best model checkpoint was evaluated against the internal holdout test dataset. Further details on model development, data augmentation, and training procedure are discussed in Appendix E1 (supplement).
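Checkpoint selection on a validation metric can be sketched as below. The exact definition of the predicted-TKV error is given in the supplement and is not reproduced here; the sketch assumes an absolute percentage error, and the per-epoch values are made up.

```python
def tkv_percent_error(predicted_ml: float, reference_ml: float) -> float:
    """Absolute percentage error of predicted TKV (assumed metric definition)."""
    return abs(predicted_ml - reference_ml) / reference_ml * 100.0


def select_best_checkpoint(val_errors):
    """Return (best_epoch, best_error, epochs_at_which_a_checkpoint_was_saved)."""
    best_epoch, best_error = None, float("inf")
    saved_epochs = []
    for epoch, error in enumerate(val_errors):
        if error < best_error:  # save only when the validation error improves
            best_epoch, best_error = epoch, error
            saved_epochs.append(epoch)
    return best_epoch, best_error, saved_epochs


# Made-up mean validation errors (%) for five epochs.
val_errors = [12.0, 8.5, 9.1, 6.2, 6.8]
best_epoch, best_error, saved_epochs = select_best_checkpoint(val_errors)
print(best_epoch, best_error, saved_epochs)
```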
Model Deployment into the Clinical Routine and Prospective Validation
The model was deployed on a clinical Linux server within the picture archiving and communication system (PACS) firewall (Fig 1). Remote connection to the server for the purposes of annotation and validation was achieved using XFCE (18), an open-source desktop environment. The single best model checkpoint from training was used and remained fixed throughout external and prospective validation. The data flow was as follows. First, new MRI studies were pushed to the Linux server by a radiologist via our clinical PACS. Second, inference was triggered upon the arrival of new axial T2-weighted images. Third, model predictions were saved as NIfTI files that a radiologist could fully edit (using ITK-SNAP 3.8.0 running on the Linux server) to correct any errors. Time savings of model-assisted annotation versus manual annotation were recorded for the first consecutive 50% of prospective cases. The possibility of bias from repeated annotation of the same scan was minimized by enforcing a minimum 1-week interval between manual and model-assisted annotation.
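The text does not state how inference is triggered when studies arrive. One simple possibility, shown here purely as an assumption, is to scan an incoming folder and process any study that has no saved prediction yet. The directory layout, the `_pred.nii.gz` naming, and the `run_inference` callback are hypothetical.

```python
import tempfile
from pathlib import Path


def find_unprocessed_studies(incoming_dir, output_dir):
    """Return study folders in incoming_dir that have no prediction file yet."""
    pending = []
    for study in sorted(p for p in Path(incoming_dir).iterdir() if p.is_dir()):
        if not (Path(output_dir) / f"{study.name}_pred.nii.gz").exists():
            pending.append(study)
    return pending


def process_pending(incoming_dir, output_dir, run_inference):
    """Run inference once per new study and return the names processed."""
    processed = []
    for study in find_unprocessed_studies(incoming_dir, output_dir):
        out_path = Path(output_dir) / f"{study.name}_pred.nii.gz"
        run_inference(study, out_path)  # expected to write the NIfTI prediction
        processed.append(study.name)
    return processed


# Demonstration with temporary folders and a fake inference function.
with tempfile.TemporaryDirectory() as tmp:
    incoming, output = Path(tmp) / "incoming", Path(tmp) / "predictions"
    incoming.mkdir()
    output.mkdir()
    (incoming / "study_001").mkdir()
    (incoming / "study_002").mkdir()

    def fake_inference(study, out_path):
        out_path.write_bytes(b"")  # placeholder for a real NIfTI file

    first_pass = process_pending(incoming, output, fake_inference)
    second_pass = process_pending(incoming, output, fake_inference)

print(first_pass, second_pass)
```

Running the same pass twice processes each study only once, which is the property a periodically scheduled job would rely on.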
Statistical Analysis
Comparison between the manual reference standard (ground truth) and the model was performed using Dice similarity coefficient (DSC) defined as
$$\mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|},$$
where A and B represent the ground truth and model-derived segmentations, respectively (19). Mean absolute error and Bland-Altman analysis were used to assess agreement in TKV calculation. The significance of time savings of model-assisted annotation was assessed with the (paired) Student t test.
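The two agreement measures can be sketched in a few lines. The Dice similarity coefficient is computed as twice the overlap divided by the sum of the two segmentation sizes. For the Bland-Altman percentage difference, the sketch assumes the difference is expressed relative to the mean of the two measurements and reports the bias with 95% limits of agreement; the study's exact convention (for example, normalizing by the reference, or reporting a confidence interval of the mean) may differ. All inputs are synthetic.

```python
import numpy as np


def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return float(2.0 * np.logical_and(a, b).sum() / denom)


def bland_altman_percent(model, reference):
    """Mean percentage difference (bias) and 95% limits of agreement."""
    model = np.asarray(model, dtype=float)
    reference = np.asarray(reference, dtype=float)
    diff_pct = 100.0 * (model - reference) / ((model + reference) / 2.0)
    bias = float(diff_pct.mean())
    sd = float(diff_pct.std(ddof=1))
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)


# Two 4 x 4 masks of 8 pixels each that overlap in 4 pixels.
truth = np.zeros((4, 4), dtype=np.uint8)
truth[0:2, :] = 1
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, :] = 1
dsc = dice(truth, pred)

# Synthetic TKV values (mL) for three studies.
bias, limits = bland_altman_percent([1030, 490, 2050], [1000, 500, 2000])
print(dsc, bias, limits)
```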
Results
Table 1 shows baseline characteristics of the patients with ADPKD used for model development (n = 129), outside external test cohort (n = 20), and internal prospective test cohort (n = 53).
Table 1:
Characteristics of Development and Validation Data
Figure 2 shows Bland-Altman plots for agreement in TKV calculations and DSC by TKV plots for external and prospective internal validation. Model performance was validated on external data with a median DSC of 0.98 (IQR, 0.97–0.99) and mean percentage difference in TKV of 2.6% (95% CI: 1.0%, 4.1%). Additional validation was performed on prospective internal data with a median DSC of 0.97 (IQR, 0.94–0.98) and mean percentage difference in TKV of 3.6% (95% CI: 2.0%, 5.2%) (Fig 2, Table 2). Model performance on the internal test set is shown in Figure E3 (supplement).
Figure 2:
External dataset (top) and prospective dataset (bottom) validation with Bland-Altman agreement analysis and Dice similarity coefficient by htTKV. BA = Bland-Altman, htTKV = TKV indexed to patient height, TKV = total kidney volume.
Table 2:
Internal Validation Set and Test Set Results and Dataset Characteristics
Multiple visualizations were evaluated to assess performance. Model predictions and the reference standard were visualized with three-dimensional overlapping surface renderings (Fig 3). Predictions were also visualized as two-dimensional images (Figs E2, E4 [supplement]).
Figure 3:
Prospective dataset (top row) and external dataset (bottom row) example surface renderings of ground truth reference (red) and model prediction (blue) volumes with 50% opacity. Overlapping concordant predictions are visualized in shades of purple. Yellow mannequin illustrates orientation of the surface renderings.
Time savings analysis in prospective patients revealed that manual contouring (starting with no segmentations) required a mean of 1724 seconds (95% CI: 1373, 2075), compared with 723 seconds (95% CI: 555, 892) for model-assisted contouring, a 51% reduction in time for model-assisted contouring (P < .001).
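The paired comparison can be sketched as below, using made-up timings rather than study data. The sketch also separates two summaries that are easy to conflate: the mean of per-case percentage reductions and the percentage reduction of the mean times. For the reported means, 1 − 723/1724 is approximately 58%, so the 51% figure is presumably the former, as the abstract's phrase "51% mean reduction" suggests. Only the t statistic and degrees of freedom are computed; obtaining a P value would require a t distribution (for example from SciPy).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Paired Student t statistic and degrees of freedom for before vs after."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def mean_percent_reduction(before, after):
    """Average of the per-case percentage reductions."""
    return mean(100.0 * (b - a) / b for b, a in zip(before, after))


def reduction_of_means(before, after):
    """Percentage reduction computed from the two mean times."""
    return 100.0 * (1.0 - mean(after) / mean(before))


# Synthetic contouring times in seconds (not study data).
manual = [2400, 1500, 900, 2100]
assisted = [900, 800, 500, 1000]

t_stat, dof = paired_t_statistic(manual, assisted)
print(t_stat, dof)
print(mean_percent_reduction(manual, assisted), reduction_of_means(manual, assisted))
```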
Failure analysis was conducted to evaluate the largest model errors. Examples of the most significant prospective errors and the corresponding radiologist corrections are shown in Figure 4, which include a fluid-filled stomach partially labeled as left cystic kidney, the urinary bladder labeled as cystic kidney, and mischaracterization of cysts at the border of liver and right kidney, among other examples.
Figure 4:
Examples of the most significant prospective model inference errors and the corresponding radiologist corrections. Inference label is red, radiologist additions are green, and radiologist subtractions are indicated by blue arrows. (A) Fluid-filled stomach partially labeled as left cystic kidney. (B) Urinary bladder labeled as cystic kidney. (C) Liver cyst labeled as kidney. (D) Renal cyst at liver border missed by inference. (E) Complex hemorrhagic left renal cyst incompletely labeled. (F) Collapsed descending colon labeled as left kidney. (G) Renal cyst at liver border missed by inference. (H) Left elbow medial epicondyle fat labeled as left kidney in a patient imaged with arms in the field of view.
Discussion
Imaging patients with ADPKD demands accurate, reproducible kidney volume measurements for calculating htTKV. These data from 173 patients with ADPKD (129 development, 20 external validation, and 53 acquired after deployment) demonstrate a deep learning segmentation model with exceptional multi-institutional and postdeployment prospective performance. The prospective and external validation demonstrated an excellent median DSC of 0.97 or greater and a low Bland-Altman mean percentage difference in TKV of 3.6% or less. Moreover, model-assisted segmentation required 51% less time than manual contouring without model assistance.
Conventionally, planimetry tracing (2) and stereologic methods (22) are used to measure TKV at MRI. These methods comprise a two-step process in which an initial organ contour is drawn, followed by tedious slice-by-slice correction of contour errors. Additionally, the manual ellipsoid method of calculating TKV has attempted to simplify kidney measurement using only three orthogonal measurements but is far less accurate and less reproducible (7,19). Operator exhaustion from the tedious contouring task may naturally lead to human error, particularly given the precision required at contour borders. Alternatively, in model-assisted annotation as deployed in this study, we modify the contouring task significantly and reduce operator fatigue. We believe this may enable radiologists to produce more accurate contours of kidneys in ADPKD.
As established in many studies, deep learning excels at computer vision tasks (20–22), but deep learning studies often lack prospective validation and often fail to impact radiologic clinical routines. In contrast, our study includes external and postdeployment prospective validation and incorporates recommendations from the Checklist for AI in Medical Imaging (CLAIM) guidelines (23). While our model architecture is novel in applying an EfficientNet encoder to ADPKD kidney segmentation, our study’s primary contribution is demonstrating clinical impact from model-assisted annotation in a real-world health care setting. Prior studies, such as those of Kline et al (24) and Sharma et al (25), have published impressive results on similar tasks; however, they do not include external multi-institutional validation or an assessment of time savings after deployment into the clinical routine, as our study does. Furthermore, our code repository, model weights, and select experiments are available for other researchers (11).
While our model had a robust performance, our failure analysis demonstrated notable errors during select clinical scenarios (ie, fluid-filled stomach, distended urinary bladder, hemorrhagic renal cysts, and hepatic cysts at the border of liver and right kidney). Despite these edge cases, our model-assisted annotation saves substantial radiologist time, as at baseline, the radiologist performs manual contouring at our institution. Furthermore, model performance is likely to improve with retraining our model on radiologist-corrected annotations and via ensemble learning methods. Last, our study evaluates model performance at a single time point, and ultimately assessing kidney growth over time remains an important clinical end point. We plan to address this limitation with direct assessment of automated ADPKD kidney growth tracking in the near future.
Supported by the Weill Cornell Medicine Clinical & Translational Science Center and the Shaw Foundation.
Data sharing: Data generated or analyzed during the study are available from the corresponding author by request and institutional review.
Disclosures of Conflicts of Interest: A.G. No relevant relationships. G.S. Leadership roles as co-chair of Society for Imaging Informatics in Medicine machine learning committee, co-chair for Society of Abdominal Radiology AI committee, co-director of Radiological Society of North America AI certificate course; assistant editor of Radiology: Artificial Intelligence. S.R. No relevant relationships. S.J. No relevant relationships. H.D. No relevant relationships. R.H. No relevant relationships. D.R. No relevant relationships. K.T. No relevant relationships. J.D.B. Grants or contracts from Vertex Pharmaceuticals; secretary of American Journal of Hypertension. I.B. No relevant relationships. I.C. No relevant relationships. H.R. No relevant relationships. M.R.P. No relevant relationships.
Abbreviations:
- ADPKD: autosomal dominant polycystic kidney disease
- DICOM: Digital Imaging and Communications in Medicine
- DSC: Dice similarity coefficient
- htTKV: TKV indexed to patient height
- NIfTI: Neuroimaging Informatics Technology Initiative
- PACS: picture archiving and communication system
- SSFP: steady-state free precession
- TKV: total kidney volume
References
- 1. Chapman AB, Devuyst O, Eckardt KU, et al. Autosomal-dominant polycystic kidney disease (ADPKD): executive summary from a Kidney Disease: Improving Global Outcomes (KDIGO) Controversies Conference. Kidney Int 2015;88(1):17–27.
- 2. Kistler AD, Poster D, Krauer F, et al. Increases in kidney volume in autosomal dominant polycystic kidney disease can be detected within 6 months. Kidney Int 2009;75(2):235–241.
- 3. Chapman AB, Guay-Woodford LM, Grantham JJ, et al. Renal structure in early autosomal-dominant polycystic kidney disease (ADPKD): The Consortium for Radiologic Imaging Studies of Polycystic Kidney Disease (CRISP) cohort. Kidney Int 2003;64(3):1035–1045.
- 4. Sedman A, Bell P, Manco-Johnson M, et al. Autosomal dominant polycystic kidney disease in childhood: a longitudinal study. Kidney Int 1987;31(4):1000–1005.
- 5. Fick-Brosnahan GM, Belz MM, McFann KK, Johnson AM, Schrier RW. Relationship between renal volume growth and renal function in autosomal dominant polycystic kidney disease: a longitudinal study. Am J Kidney Dis 2002;39(6):1127–1134.
- 6. Yu ASL, Shen C, Landsittel DP, et al. Baseline total kidney volume and the rate of kidney growth are associated with chronic kidney disease progression in Autosomal Dominant Polycystic Kidney Disease. Kidney Int 2018;93(3):691–699.
- 7. Irazabal MV, Rangel LJ, Bergstralh EJ, et al. Imaging classification of autosomal dominant polycystic kidney disease: a simple model for selecting patients for clinical trials. J Am Soc Nephrol 2015;26(1):160–172.
- 8. Torres VE, Chapman AB, Devuyst O, et al. Tolvaptan in patients with autosomal dominant polycystic kidney disease. N Engl J Med 2012;367(25):2407–2418.
- 9. Grantham JJ, Chapman AB, Torres VE. Volume progression in autosomal dominant polycystic kidney disease: the major factor determining clinical outcomes. Clin J Am Soc Nephrol 2006;1(1):148–157.
- 10. Paszke A, Gross S, Massa F, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703 [preprint] http://arxiv.org/abs/1912.01703. Posted December 3, 2019. Accessed April 14, 2021.
- 11. Goel A. ADPKD Segmentation in PyTorch. https://github.com/aksg87/adpkd-segmentation-pytorch/tree/v1.0. Published 2021. Accessed May 2021.
- 12. Brett M, Hanke M, Cipollini B, et al. nibabel: 2.1.0. Zenodo. https://zenodo.org/record/60808. Published August 24, 2016. Accessed May 2021.
- 13. Yushkevich PA, Piven J, Hazlett HC, et al. User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 2006;31(3):1116–1128.
- 14. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [preprint] http://arxiv.org/abs/1505.04597. Posted May 18, 2015. Accessed April 14, 2021.
- 15. Tan M, Le QV. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946 [preprint] http://arxiv.org/abs/1905.11946. Posted September 11, 2020. Accessed November 18, 2020.
- 16. Siddique N, Paheding S, Alom MZ, Devabhaktuni VK. Recurrent residual U-Net with EfficientNet encoder for medical image segmentation. In: Alam MS, ed. Proceedings of SPIE: SPIE defense + commercial sensing 2021—pattern recognition and tracking XXXII. Vol 11735. Bellingham, Wash: International Society for Optics and Photonics, 2021; 117350L.
- 17. Buslaev A, Iglovikov VI, Khvedchenya E, Parinov A, Druzhinin M, Kalinin AA. Albumentations: fast and flexible image augmentations. Information (Basel) 2020;11(2):125.
- 18. Fourdan O. Xfce: A Lightweight Desktop Environment. In: Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta, GA, October 10–14, 2000. Berkeley, Calif: USENIX Association, 2000. https://www.usenix.org/conference/als-2000/xfce-lightweight-desktop-environment.
- 19. Higashihara E, Nutahara K, Okegawa T, et al. Kidney volume estimations with ellipsoid equations by magnetic resonance imaging in autosomal dominant polycystic kidney disease. Nephron 2015;129(4):253–262.
- 20. Lin YC, Lin CH, Lu HY, et al. Deep learning for fully automated tumor segmentation and extraction of magnetic resonance radiomics features in cervical cancer. Eur Radiol 2020;30(3):1297–1305.
- 21. Ge C, Gu IYH, Jakola AS, Yang J. Deep semi-supervised learning for brain tumor classification. BMC Med Imaging 2020;20(1):87.
- 22. Conze PH, Kavur AE, Cornec-Le Gall E, et al. Abdominal multi-organ segmentation with cascaded convolutional and adversarial deep networks. Artif Intell Med 2021;117:102109.
- 23. Mongan J, Moy L, Kahn CE Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers. Radiol Artif Intell 2020;2(2):e200029.
- 24. Kline TL, Korfiatis P, Edwards ME, et al. Performance of an Artificial Multi-observer Deep Neural Network for Fully Automated Segmentation of Polycystic Kidneys. J Digit Imaging 2017;30(4):442–448.
- 25. Sharma K, Rupprecht C, Caroli A, et al. Automatic Segmentation of Kidneys using Deep Learning for Total Kidney Volume Quantification in Autosomal Dominant Polycystic Kidney Disease. Sci Rep 2017;7(1):2049.
- 26. Yakubovskiy P. Segmentation Models Pytorch. https://github.com/qubvel/segmentation_models.pytorch. Published 2020. Accessed May 2021.
- 27. Liu L, Jiang H, He P, et al. On the variance of the adaptive learning rate and beyond. arXiv:1908.03265 [preprint] https://arxiv.org/abs/1908.03265. Posted August 8, 2019. Accessed May 2021.