See also the article by Rauschecker and Rudie et al in this issue.

Dr Zaharchuk is a professor of radiology at Stanford University in the division of neuroimaging. He received his MD degree from Harvard Medical School and his PhD degree from the Harvard-MIT Health Sciences and Technology program, with clinical training at the University of California, San Francisco. He directs the Center for Advanced Functional Neuroimaging at Stanford University, where his research focuses on advanced medical imaging techniques and algorithms (including AI) with the goal of alleviating the burden of neurologic disease.
In 2012, a method based on deep convolutional neural networks won the ImageNet Large Scale Visual Recognition Challenge for object identification by a wide margin, almost halving the error rate from the previous year (1). Since then, excitement about the power of deep networks has pervaded every area of artificial intelligence (AI) research, including radiology. But relying on these methods to identify pathologic features on radiologic images has proved harder than many expected. One reason is the small size of most medical imaging data sets. Also, unlike many computer vision tasks, identifying pathologic features on medical images often requires examination of small and inconspicuous portions of the image. This is why we train radiologists through prolonged residency and fellowship curricula, with careful guidance and attention from senior physicians. In this issue of Radiology, Rauschecker et al (2) launch a combined assault on the task of automated disease classification by using the dual forces of deep learning and Bayesian inference, mimicking the way radiologists are trained. They show good performance on a small but diverse data set covering 19 common and uncommon brain diseases, with 91% accuracy in placing the correct entity among the top three most probable conditions, similar to the accuracy of academic neuroradiologists (86%).
How does automated disease classification work? The first step is to let deep learning do something it is pretty good at: segmenting lesions on images and classifying them into human-interpretable concepts such as location, signal intensity, or enhancement (3). Given the progress driven by segmentation challenges such as the Brain Tumor Segmentation challenge, or BraTS (4,5), common architectures and tricks have been established using tomographic data sets of hundreds of cases. In this implementation, lesions were first identified by their high signal on T2-weighted fluid-attenuated inversion recovery (FLAIR) images, an MRI pulse sequence that is sensitive for pathologic features but by itself nonspecific. The authors then used a variety of non-AI tools, most involving thresholding and size criteria, to extract individual “imaging features” from these lesions. Of the 18 features they included, five related to signal intensity on other imaging sequences, six were volumetric features such as lesion size and mass effect, and seven were spatial features related to location. Most of the “states” of these features were simple, such as signal on gradient-recalled-echo images (low, high, or absent) or enhancement (present or absent).
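To make this concrete, here is a minimal sketch (in Python, not the authors' code) of what such rule-based feature discretization might look like: a segmented lesion mask plus co-registered sequence volumes are reduced to a few discrete, human-interpretable states. The function name, feature names, and every threshold below are illustrative placeholders, not the published criteria.

```python
import numpy as np

def discretize_lesion_features(lesion_mask: np.ndarray,
                               gre: np.ndarray,
                               post_contrast: np.ndarray,
                               voxel_volume_ml: float = 0.001) -> dict:
    """Reduce a binary lesion mask and two co-registered volumes to discrete states."""
    voxels = lesion_mask.astype(bool)
    if not voxels.any():
        return {"size": "absent", "gre_signal": "absent", "enhancement": "absent"}

    # Volumetric feature: coarse size bins (thresholds are placeholders).
    volume_ml = voxels.sum() * voxel_volume_ml
    size = "small" if volume_ml < 1 else "medium" if volume_ml < 10 else "large"

    # Signal feature: lesion GRE intensity relative to the whole-image median.
    gre_state = "low" if np.median(gre[voxels]) < 0.8 * np.median(gre) else "high"

    # Enhancement feature: present if post-contrast signal inside the lesion
    # clearly exceeds the image background (again, an illustrative criterion).
    enhancing = np.median(post_contrast[voxels]) > 1.2 * np.median(post_contrast)

    return {"size": size,
            "gre_signal": gre_state,
            "enhancement": "present" if enhancing else "absent"}
```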
While the selection of features was somewhat arbitrary and required much trial and error, the authors' choices cover the typical items that radiologists look for in the reading room and include in reports as an “explanation” for favoring certain pathologic entities over others. Besides imaging features, the authors included clinical features: age, sex, and duration of symptoms (acute or chronic). The 20% absolute drop in AI performance when these features were excluded shows their importance. This highlights that a radiologist’s job involves not only looking at images (6) but also integrating imaging and clinical features, an inconvenient truth sometimes forgotten by our computer vision colleagues.
With this list of features in hand, the authors applied the “magic” of Bayesian networks (7,8). In brief, a Bayesian network allows one to assign a probability to an event (for example, finding a certain feature, such as high signal on diffusion-weighted images, given the presence of a specific disease, such as acute stroke). This table of probabilities and diseases is large (19 diseases × 18 features, each with two or three states) and requires considerable domain knowledge to populate. The probabilities were drawn from the literature where it existed, filled in with the best guesses of neuroradiology experts where it did not, and then further tuned on the images in the training set. This approach has been applied to radiology studies with success in many settings, but it typically requires features to be hand-fed to the system rather than arising from an algorithm (8–10). The novelty here is the automatic generation of the features, although which features to include must still be determined by the investigators. In theory, a full “end-to-end” deep learning system should be able to learn the relevant features as well, but for the reasons mentioned earlier, the number of training cases required would likely exceed any currently existing database, and the gain would come at the expense of interpretability.
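As a toy illustration of what a sliver of such a table might look like (the disease and feature names echo the examples above, but every number below is an invented placeholder, not a value from the paper):

```python
# Prior probability of each disease before any image is seen.
PRIORS = {"acute_stroke": 0.5, "high_grade_glioma": 0.5}

# P(feature = state | disease); each inner dict sums to 1 over its states.
CPT = {
    "acute_stroke": {
        "diffusion_signal": {"high": 0.95, "low": 0.01, "absent": 0.04},
        "enhancement":      {"present": 0.20, "absent": 0.80},
    },
    "high_grade_glioma": {
        "diffusion_signal": {"high": 0.30, "low": 0.20, "absent": 0.50},
        "enhancement":      {"present": 0.85, "absent": 0.15},
    },
}
```

The real table is vastly larger (19 diseases × 18 features), which is exactly why it demanded so much domain knowledge to fill in.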
Armed with this large table of probabilities and the features identified by the deep networks, the authors used an implementation known as naive Bayesian inference. This approach takes the probability of each observed feature given a candidate disease, multiplies these probabilities together, and yields a final likelihood (or posterior probability) for that disease. The posterior probabilities are then compared with one another, allowing a ranking of disease likelihood, much in the way a radiologist puts together a differential diagnosis (eg, high-grade tumor most likely, tumefactive multiple sclerosis next, and so on). While it would be infeasible to ask radiologists to rank 19 different diseases for each case, it is trivial for the network to produce such a list.
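Continuing the toy table from the sketch above, a minimal version of naive Bayesian ranking might look like this; the product of many small probabilities is computed in log space to avoid numeric underflow:

```python
import math

def rank_diseases(observed: dict, priors: dict, cpt: dict) -> list:
    """Naive Bayes: sum log P(feature | disease) over the observed features,
    add the log prior, and sort diseases by the resulting posterior score."""
    scores = {
        disease: math.log(prior) + sum(
            math.log(cpt[disease][feature][state])
            for feature, state in observed.items())
        for disease, prior in priors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# A diffusion-bright, non-enhancing lesion favors stroke in the toy table.
print(rank_diseases({"diffusion_signal": "high", "enhancement": "absent"},
                    PRIORS, CPT))  # ['acute_stroke', 'high_grade_glioma']
```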
The authors took this automated ranked list and compared it against radiologists with various levels of training (thankfully, they asked the radiologists to rank only their top three diagnoses!). The AI method placed the correct diagnosis in one of the top three ranked positions with accuracy similar to that of attending neuroradiologists, while outperforming generalists and radiologists with less training. They also found that human readers struggled with rare entities, whereas the AI method performed similarly on common and rare conditions. Humans, however, appear better at narrowing a differential diagnosis to a single top choice, which is arguably our main job. While the difference was not quite significant in this small cohort, attending neuroradiologists can still take solace in the fact that we outperform this AI method by about 50% when the rubber hits the road!
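The top-three metric itself is simple to state in code. A hedged sketch (the function name and toy data are mine, not the authors' evaluation code):

```python
def top_k_accuracy(ranked_lists, truths, k=3):
    """Fraction of cases whose true diagnosis appears within the top k
    entries of the model's ranked differential."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)

# Two toy cases: the truth is ranked second in the first, missing in the second.
print(top_k_accuracy([["stroke", "glioma", "ms"], ["ms", "stroke", "glioma"]],
                     ["glioma", "abscess"], k=3))  # 0.5
```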
This study is an impressive feat, combining deep learning and other quantitative methods with human domain knowledge. But it is a point on the journey rather than the destination. The study had many limitations, not the least of which was the requirement that a lesion be visible at FLAIR imaging and that only a single disease be present. The chosen conditions cover a wide range of entities that come under consideration for T2-bright lesions but exclude some common diseases, such as intracerebral hemorrhage. Also, many important diseases do not manifest as parenchymal FLAIR lesions at all, including extra-axial hemorrhages, aneurysms, and leptomeningeal disease. Finally, the construct was artificial, with a distribution of cases not typical of daily practice. But when viewed as a possible adjunct tool to help jog the radiologist’s mind, especially toward rarer diseases, the AI system could be very helpful, somewhat like a “fellow in a box” for the overburdened solo radiologist. If only it could also protocol the study and write the draft report!
But it is the thought process behind the AI system that I find most attractive. It mimics the way radiologists actually look at a study: identifying features on the image, integrating the clinical context, and then applying training and experience to judge and rank the likelihood of disease. Furthermore, because a Bayesian network is built from a series of probabilities, it should be possible to adjust the AI system for different practice settings by including prior probabilities for each disease that reflect the expected distribution in the local population. Finally, it gives insight into the decision process by reporting the features it found, thus allowing the AI model to “explain” its rankings, which can then be analyzed for plausibility.
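Reusing the toy rank_diseases function and table from the sketches above, swapping in site-specific priors re-ranks the differential without retraining anything (the 90/10 split for a hypothetical stroke center is, again, purely illustrative):

```python
# Same toy evidence, two different practice settings.
evidence = {"diffusion_signal": "high", "enhancement": "present"}

# With uniform priors, the enhancing lesion ranks glioma first ...
print(rank_diseases(evidence, PRIORS, CPT))
# ['high_grade_glioma', 'acute_stroke']

# ... but at a hypothetical stroke center, a 90/10 prior flips the ranking.
stroke_center_priors = {"acute_stroke": 0.9, "high_grade_glioma": 0.1}
print(rank_diseases(evidence, stroke_center_priors, CPT))
# ['acute_stroke', 'high_grade_glioma']
```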
At the end of the day, I am left with two thoughts. One is the incredible amount of work that went into designing this AI system and its good performance on a narrow task, for which the authors should be congratulated. The other is how long the road ahead is before such a tool becomes a must-have in the reading room, given the vast number of diseases that radiologists must recognize. It is possible that this methodologic approach will not scale well as more features and diseases are added. We are still in the early days of AI for automated image diagnosis, but they are exciting days. As Winston Churchill said: “Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.”
The work of Rauschecker et al (2) charts a plausible way forward to an automated diagnosis system that combines many AI methods guided by the domain expertise and training methods of human radiologists.
Footnotes
Disclosures of Conflicts of Interest: G.Z. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: is a board member of and holds stock/stock options in Subtle Medical; has grants/grants pending from Bayer Healthcare, GE Healthcare, and National Institutes of Health; has deep learning–related patents (planned, pending, or issued); receives royalties from Cambridge University Press. Other relationships: has received GPU donations from Nvidia.
References
1. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: NIPS’12: Proceedings of the 25th International Conference on Neural Information Processing Systems; 2012:1097–1105.
2. Rauschecker AM, Rudie JD, Xie L, et al. Artificial intelligence system approaching neuroradiologist-level differential diagnosis accuracy at brain MRI. Radiology 2020;295(3):626–637.
3. Korfiatis P, Kline TL, Erickson BJ. Automated segmentation of hyperintense regions in FLAIR MRI using deep learning. Tomography 2016;2(4):334–340.
4. Pereira S, Pinto A, Alves V, Silva CA. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Trans Med Imaging 2016;35(5):1240–1251.
5. Menze BH, Jakab A, Bauer S, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans Med Imaging 2015;34(10):1993–2024.
6. Agten C. Real radiologist at work - time-lapse. https://www.youtube.com/watch?v=vP8gLKM6v2w. Published 2019. Accessed February 25, 2020.
7. Pearl J, Mackenzie D. The Book of Why. New York, NY: Basic Books, 2018.
8. Burnside ES. Bayesian networks: computer-assisted diagnosis support in radiology. Acad Radiol 2005;12(4):422–430.
9. Kahn CE Jr, Laur JJ, Carrera GF. A Bayesian network for diagnosis of primary bone tumors. J Digit Imaging 2001;14(2 Suppl 1):56–57.
10. Kahn CE Jr, Roberts LM, Shaffer KA, Haddawy P. Construction of a Bayesian network for mammographic diagnosis of breast cancer. Comput Biol Med 1997;27(1):19–29.
