See also the article by Pan et al in this issue.

Eliot L. Siegel, MD, is a radiologist and professor at the University of Maryland, an adjunct professor of computer science and biomedical engineering, and chief of imaging for the VA Maryland Healthcare System, where he created the world’s first filmless health care enterprise. He currently serves as a senior consulting editor for this journal, has editorial responsibilities at multiple other journals, and is co-chair of the annual Conference on Machine Intelligence in Medical Imaging.
Diversity: The art of thinking independently together.
– Malcolm Forbes
The positive value of a second opinion has been well documented in the radiology literature. For example, in a study that involved reinterpretation of breast US and MRI examinations, Coffey et al (1) documented “a change in interpretation in more than one-fourth of submitted studies,” which resulted in detection of additional cancers in 5% of patients and averted biopsies in 4%. Other specialties in diagnostic imaging have shown similar benefits of a second reader for CT colonography (2), neuroradiology (3), musculoskeletal imaging (4), MRI of the pelvis (5), and many others. Recently, deep learning algorithms have been proposed as a less time- and labor-intensive alternative (6).
The incremental value of more than one additional opinion was explored by Hukkinen et al (7), who determined the change in sensitivity and specificity for cancer detection when combinations of up to eight readers (four experienced mammographers, two general radiologists, and two residents) interpreted a set of mammograms from 200 women. Sensitivity increased from 57% for the best single reader to 67% when the two best readers were combined, and further to 75% with four readers. Interestingly, but perhaps not surprisingly, they also observed a major drop in sensitivity when the studies were read together in a consensus fashion (presumably because of groupthink) compared with having the studies read independently and the interpretations subsequently combined.
In a manner analogous to human second opinions, the effectiveness and versatility of ensemble systems in machine learning have been well described in the literature (8). Ensemble learning is a paradigm in which multiple machine learning models are combined, either prospectively or retrospectively, to address a single task.
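The core idea can be sketched in a few lines. The numbers below are invented for illustration (they are not from the challenge or the study); the point is simply that averaging the predictions of two imperfect models can yield a lower mean absolute deviation than either model alone when their errors tend to cancel:

```python
# Minimal sketch of ensemble averaging for a regression task such as bone age
# estimation. All values are hypothetical, in months.
truth = [120.0, 96.0, 150.0, 60.0, 132.0]          # reference standard (consensus)

model_a = [118.0, 101.0, 146.0, 63.0, 128.0]       # one model's predictions
model_b = [124.0, 93.0, 155.0, 58.0, 137.0]        # a second model's predictions

def mad(pred, truth):
    """Mean absolute deviation of the predictions from the reference standard."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

# A two-model ensemble: average the two predictions case by case.
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(mad(model_a, truth))    # error of model A alone
print(mad(model_b, truth))    # error of model B alone
print(mad(ensemble, truth))   # lower here because the two models err in opposite directions
```

In this toy setup the two models tend to err on opposite sides of the truth, so their average lands closer to it; when errors are strongly correlated, averaging helps far less, which previews the diversity effect discussed below.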
The original research article published in Radiology: Artificial Intelligence, “Improving Automated Pediatric Bone Age Estimation Using Ensembles of Models from the 2017 RSNA Machine Learning Challenge” (9), describes substantially improved performance when multiple models work in combination and discusses successful strategies for optimizing ensemble performance. The 48 models utilized in this study were created by teams throughout the world in response to the 2017 RSNA Pediatric Bone Age Machine Learning Challenge (10). The purpose of the challenge was to demonstrate an application of machine learning in medical imaging, promote “collaboration to catalyze AI model creation,” and “identify innovators in medical imaging” (10). The task for each machine learning model was to predict, as closely as possible, the consensus of four pediatric radiologists who used the Greulich and Pyle atlas of hand radiographs to determine the skeletal age corresponding to each hand radiograph. The potential clinical application of these models would be to assist radiologists in rapidly and accurately assessing skeletal maturation in children with a wide variety of congenital and acquired developmental anomalies.
The authors utilized an experimental design that tested ensembles (combinations) of one to 10 models chosen from the 48 submitted models. Each combination was tested using 1000 simulated validation-test splits, created to avoid using the same data for both model selection and evaluation. This technique of random sampling with replacement to estimate a measure of accuracy, such as the mean absolute deviation (MAD) from the reference standard consensus of experts, is referred to as bootstrapping.
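As a rough illustration of the bootstrapping idea (with simulated per-case errors, not the study’s data), one can resample a set of absolute deviations with replacement many times and summarize the resulting distribution of MAD estimates:

```python
import random

# Sketch of bootstrapping an accuracy estimate. The per-case errors are
# simulated here purely for illustration.
random.seed(0)

# errors[i]: a model's absolute deviation (in months) from the consensus for case i
errors = [abs(random.gauss(0, 5)) for _ in range(200)]

def bootstrap_mad(errors, n_resamples=1000):
    """Resample the cases WITH replacement and record the MAD of each resample."""
    mads = []
    for _ in range(n_resamples):
        resample = random.choices(errors, k=len(errors))  # sampling with replacement
        mads.append(sum(resample) / len(resample))
    return mads

mads = bootstrap_mad(errors)
point_estimate = sum(mads) / len(mads)
ordered = sorted(mads)
lo, hi = ordered[25], ordered[975]  # approximate 95% interval for the MAD estimate
```

The spread of the resampled MAD values (`lo` to `hi`) conveys how much the accuracy estimate depends on which cases happen to be in the test set, which is why the authors repeated their evaluation over 1000 splits rather than relying on a single one.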
The average expert radiologist in the RSNA Challenge consensus panel differed from the weighted consensus of all four radiologists (reference standard) by 5.8 months. The 48 computer models were assessed according to their MAD from the consensus standard. The top machine learning model had an MAD of 4.27 months, which was substantially lower than that of the average radiologist in the panel. The authors then tested the hypothesis that ensembles of two to 10 models would further improve performance beyond even the best single model, and they quantified the improvement provided by these various-sized ensembles.
In the “second opinion” case using an ensemble of two models, the top 10 two-model pairings each achieved an MAD of less than 4 months, representing a substantial jump in performance over the best single model. Interestingly, the top performance was obtained not by combining the two best models but by pairing the first-ranked with the 16th-ranked model, which together achieved an MAD of 3.78 months. The next four best-performing two-model ensembles all contained the fourth-ranked model, which not only had a low correlation with the other models but also utilized alternative statistical and machine learning techniques, unlike most of the other entries, which used deep learning. Unfortunately, the group that submitted the 16th-ranked model did not specify their approach, but, as with the fourth-ranked model, their results had a relatively low correlation with the other models, suggesting that their algorithm development technique likely differed from most others as well. Continuing this trend, there was further improvement with three- and four-model ensembles. As in the Hukkinen study of human mammographers, the optimal ensemble size was found to be four, with slight degradation in performance for ensembles larger than five.
As was the case for the top two-model ensemble, the top four-model ensemble did not consist simply of the top four performing models. Consistent with the observation that relatively high-performing but diverse (low correlation) models improved ensemble performance, the best four-model ensemble included model 4 and model 16 in addition to models 1 and 3.
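This selection effect is easy to reproduce with toy numbers. In the hypothetical sketch below (all values invented for illustration), model_1 and model_2 make highly correlated errors, while model_3, though the weakest solo performer, errs in the opposite direction; exhaustively scoring every two-model average shows the best pair combining model_1 with the more “diverse” model_3 rather than with the second-ranked model_2:

```python
from itertools import combinations

# Hypothetical predictions (months) for four test cases; truth is the consensus.
truth = [100.0, 110.0, 120.0, 130.0]

models = {
    "model_1": [102.0, 108.0, 122.0, 128.0],  # best single model (MAD 2.0)
    "model_2": [103.0, 107.0, 123.0, 127.0],  # errors mirror model_1's (MAD 3.0)
    "model_3": [97.8, 112.2, 117.8, 132.2],   # weakest solo (MAD 2.2), errors anti-correlated
}

def mad(pred):
    """Mean absolute deviation from the reference standard."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def ensemble(*names):
    """Case-by-case average of the named models' predictions."""
    return [sum(models[n][i] for n in names) / len(names) for i in range(len(truth))]

# Score all two-model ensembles and pick the pair with the lowest MAD.
pairs = {pair: mad(ensemble(*pair)) for pair in combinations(models, 2)}
best_pair = min(pairs, key=pairs.get)
# The diverse pairing (model_1, model_3) wins; (model_1, model_2) barely improves on model_1.
```

Because model_1 and model_2 err in lockstep, averaging them cannot cancel much error, whereas model_3’s opposite-signed errors nearly cancel model_1’s, mirroring the first-plus-16th-ranked result in the study.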
This original research report demonstrated multiple important points and potential strategies to advance the current state of the art in artificial intelligence applications in medical imaging. As in building the best-performing human teams, machine learning ensembles perform best when relatively strong performers with a diversity of “opinions” and approaches are selected. In their article, having models 4 and 16 on the “team” resulted in high performance, largely because these models utilized approaches that were not in lockstep with those of the other team members.
The authors pointed out that contests such as the RSNA Pediatric Bone Age Machine Learning Challenge attract a wide diversity of participants and provide an excellent source of models to combine into an ensemble because all entrants use the same training, validation, and testing datasets to achieve the same goal. In fact, the top five models were themselves created from ensembles, according to their developers. The contestants submitted not the actual models but only their results, which limited the potential to further enhance ensemble performance using techniques such as boosting, in which cases that are poorly predicted by one or more models are given higher weight. Future contests that require submission of the models themselves could explore more of the potential of advanced ensemble approaches. Another advantage of having the models available would be the ability to test various ensembles on additional research or clinical datasets beyond the one made available for the RSNA challenge.
The strategy of combining more than one model could similarly be utilized in clinical practice. For example, numerous machine learning algorithms are available to detect the presence of intracranial hemorrhage (a subset of which are already Food and Drug Administration cleared). A clinical site with sufficient technical and clinical expertise and with access to these models could derive its own ensembles using its own datasets, and these would likely outperform any individual commercial or research algorithm. Furthermore, a commercial algorithm platform provider could also develop and offer ensembles of algorithms, as could a third-party ensemble vendor or ensemble validation group. Such combinations of algorithms or models need not be constrained to the ensemble paradigm of working together on a common problem. They could provide added clinical value by combining functions such as image segmentation (eg, dividing the brain into multiple regions), hemorrhage detection, quantification (eg, assessing the volume or composition of hemorrhage), diagnostic or etiologic assessment (eg, detection of aneurysm, stroke, or tumor), and change detection (eg, change in the size of hemorrhage over time). These could work together cooperatively to provide much more sophisticated quantification, detection, diagnostic, and recommendation functionality than would be practical from a single provider.
It is becoming increasingly clear that the ultimate partnership in diversity will be between humans and machines, which will inevitably develop very different but complementary approaches to challenges in diagnostic imaging and health care in general. Freeing ourselves of the constraint of trying to develop computer algorithms that merely emulate human approaches to problem solving will accelerate the arrival and efficacy of the next generation of artificial intelligence applications.
Footnotes
Disclosures of Conflicts of Interest: E.L.S. disclosed no relevant relationships.
References
1. Coffey K, D’Alessio D, Keating DM, Morris EA. Second-opinion review of breast imaging at a cancer center: is it worthwhile? AJR Am J Roentgenol 2017;208(6):1386–1391.
2. Bodily KD, Fletcher JG, Engelby T, et al. Nonradiologists as second readers for intraluminal findings at CT colonography. Acad Radiol 2005;12(1):67–73.
3. Zan E, Yousem DM, Carone M, Lewin JS. Second-opinion consultations in neuroradiology. Radiology 2010;255(1):135–141.
4. Chalian M, Del Grande F, Thakkar RS, Jalali SF, Chhabra A, Carrino JA. Second-opinion subspecialty consultations in musculoskeletal radiology. AJR Am J Roentgenol 2016;206(6):1217–1221.
5. Lakhman Y, D’Anastasi M, Miccò M, et al. Second-opinion interpretations of gynecologic oncologic MRI examinations by sub-specialized radiologists influence patient care. Eur Radiol 2016;26(7):2089–2098.
6. Chan S, Siegel EL. Will machine learning end the viability of radiology as a thriving medical specialty? Br J Radiol 2019;92(1094):20180416.
7. Hukkinen K, Kivisaari L, Vehmas T. Impact of the number of readers on mammography interpretation. Acta Radiol 2006;47(7):655–659.
8. Zhang C, Ma Y, eds. Ensemble machine learning: methods and applications. New York, NY: Springer Science & Business Media, 2012.
9. Pan I, Thodberg HH, Halabi S, Kalpathy-Cramer J, Larson D. Improving Automated Pediatric Bone Age Estimation Using Ensembles of Models from the 2017 RSNA Machine Learning Challenge. Radiol Artif Intell 2019;1(6):e190053.
10. Halabi SS, Prevedello LM, Kalpathy-Cramer J, et al. The RSNA pediatric bone age machine learning challenge. Radiology 2019;290(2):498–503.
