Automated machine learning (autoML) allows clinicians and researchers to access deep learning (DL) for medical image analysis without the computational expertise and expensive hardware otherwise required.1 Many autoML platforms have been developed in industry and academia, with variable strengths and limitations.2 Our previous work shows that autoML holds promise in clinical research, but comparative studies pitting autoML platforms against one another are lacking.3 Here, we compared Amazon Rekognition, the highest performing autoML platform for image analysis according to our systematic review,3 with H2O.ai Driverless AI, an emerging platform with promising results outside medicine.
We trialled the autoML platforms in four clinical contexts, using publicly available datasets: classifying pneumonia and normal chest X-ray (CXR) images; Alzheimer's disease, mild cognitive impairment, and normal brain magnetic resonance imaging (MRI) scans; glaucoma and healthy fundus photographs; and malignant and benign pigmented lesions in dermoscopic photographs. Training dataset sizes were 4,684, 5,120, 563, and 2,637 images, respectively; external validation dataset sizes were 1,172, 1,280, 142, and 660 images, respectively. Our hardware had no bearing on the results, as both platforms conducted analysis and model-building in the cloud using provider computational resources. Platforms were compared in terms of their technical features (Table 1), and performance in each clinical task was gauged with the F1-score (the harmonic mean of precision and recall) on an external validation dataset.
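For clarity, the F1-score can be computed directly from a model's predictions on the external validation set. A minimal sketch in Python follows; the labels are purely illustrative (not our data), but any binary task, such as pneumonia versus normal CXR, uses the same arithmetic:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Purely illustrative binary labels for an external validation set
# (1 = disease, 0 = normal); these values are hypothetical.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-9

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# precision=0.750 recall=0.750 F1=0.750
```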
Table 1.
Comparison of platforms' technical features
| Category | Feature | H2O.ai Driverless AI | Amazon Rekognition |
| --- | --- | --- | --- |
| Cost | | Unclear (free R/Python libraries) | $4/hr inference; $1/hr training (free trial) |
| Accessibility | Code requirement | None (optional) | None (optional) |
| | Computing locus | Local/cloud | Cloud |
| | Data format | Structured/unstructured | Unstructured |
| Technical features | Feature extraction/selection | Yes | Yes |
| | Model selection/training | Yes | Yes |
| | Hyperparameter optimisation | Yes | Yes |
| | Evaluation | Yes | Yes |
| | Explainability analysis | Yes | No |
| Portability | Exportability | Yes | No |
| | Interpretability | Yes | No |
Although autoML is already being adopted as a tool in clinical research, validation is essential to demonstrate that approaches are accurate, reliable and fair. Most studies cited as evidence of validation fail to compare autoML platforms with alternative modalities, and Rekognition and Driverless AI had not previously been trialled on the same clinical task.3 Here, platform performances were similar: Driverless AI outperformed Rekognition on CXR (F1 = 0.985 vs 0.976) and brain MRI (F1 = 0.991 vs 0.982); Rekognition outperformed Driverless AI on dermoscopic photographs (F1 = 0.915 vs 0.910); and the platforms exhibited equivalent discriminative ability on fundus photographs (F1 = 1.000 for both). This performance compares well with bespoke computational models and clinical experts.4,5 Driverless AI facilitated model export for local deployment without requiring image upload to the cloud, whereas Rekognition did not. Driverless AI also provided explainability analysis, which illustrated the salient features driving the DL algorithms' decisions (Fig 1). While algorithms tended to focus on clinically relevant regions of images in each modality, the dermoscopic photograph classifier often made decisions based on the peri-lesional region.
Fig 1.
Driverless AI explainability analysis, illustrating the salient features driving the DL algorithms' decisions.
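To make the portability contrast concrete: Rekognition Custom Labels serves predictions only through its cloud API, so each image must be uploaded at inference time, whereas Driverless AI can export a scoring pipeline (e.g. a MOJO) that runs entirely on local hardware. A minimal boto3 sketch of the cloud-only route follows; the project version ARN and file name are hypothetical placeholders, not values from our study.

```python
import boto3

# Hypothetical ARN of a trained Rekognition Custom Labels model;
# a real ARN is generated when the model version is trained.
PROJECT_VERSION_ARN = (
    "arn:aws:rekognition:eu-west-1:123456789012:"
    "project/cxr-classifier/version/v1/1600000000000"
)

client = boto3.client("rekognition")

with open("example_cxr.png", "rb") as f:  # hypothetical local image
    response = client.detect_custom_labels(
        ProjectVersionArn=PROJECT_VERSION_ARN,
        Image={"Bytes": f.read()},  # image bytes leave the local machine
        MinConfidence=0,            # return all labels with confidences
    )

# Each custom label carries a name and a confidence score (0-100)
for label in response["CustomLabels"]:
    print(label["Name"], label["Confidence"])
```

For sensitive data that cannot leave a hospital network, an exported model avoids this upload step altogether.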
The performance of diagnostic autoML models is promising, and autoML has great potential to democratise DL.6 Limitations of autoML include concerns regarding ‘black box’ algorithms, data security with cloud-based models, and lack of customisability. To be applied clinically, autoML must facilitate model export for use with sensitive data, comparison with conventional models, and explainability analysis. Alternative uses of autoML include education, allowing novices to gain ‘hands-on’ experience sooner, and pilot studies, allowing clinicians to trial DL themselves before approaching computational researchers and applying for research grants with convincing preliminary results. Clinicians and researchers should consider their requirements, their capabilities, and the technical features of platforms when deciding how to apply autoML.
References
- 1. Waring J, Lindvall C, Umeton R. Automated machine learning: review of the state-of-the-art and opportunities for healthcare. Artif Intell Med 2020;104:101822.
- 2. Korot E, et al. Code-free deep learning for multi-modality medical image classification. Nat Mach Intell 2021;3:288–98.
- 3. Thirunavukarasu AJ, et al. Clinical applications of automated machine learning: a systematic review. [unpublished].
- 4. Aggarwal R, et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. npj Digit Med 2021;4:65.
- 5. Nagendran M, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020;368:m689.
- 6. Korot E, et al. Clinician-driven artificial intelligence in ophthalmology: resources enabling democratization. Curr Opin Ophthalmol 2021;32:445–51.