Abstract
Artificial intelligence (AI) systems can detect subtle features in diagnostic imaging scans that radiologists may miss, including higher-order features that lack obvious visual correlates. This may enable earlier disease detection and non-invasive lesion phenotyping, but also introduces risks due to AI’s reliance on correlations rather than causation, potential demographic and technical biases, and uninterpretable reasoning. This perspective explores how radiologists and AI learn to perceive details in medical images differently, leading to potential discrepancies in medical decision-making.
Subject terms: Object vision, Pattern vision, Computational biology and bioinformatics, Data processing, Diagnostic markers
Introduction
Diagnostic imaging scans encode a wealth of multidimensional data reflecting physiologic and pathophysiologic processes across genetic, molecular, cellular, and macroscopic scales. Using artificial intelligence (AI) to help interpret these scans has the potential to enhance the speed and accuracy of radiological diagnoses1, but also comes with certain challenges, such as conflicts between radiologist and AI interpretations of a scan. Many common imaging modalities (e.g., CT and MRI) capture three-dimensional data that can show grossly visible pathology as well as subtle, barely detectable findings. Beyond visual analysis, meaningful patterns in imaging data can be identified by computational techniques. Radiomics, akin to genomics, proteomics, and other “-omics”, refers to the comprehensive extraction of quantitative or semi-quantitative features from medical images, treating the image as a data source that captures pertinent disease-specific or subject-specific information without the need for explicit visual analysis2. This includes applying mathematical transformations to extract higher-order statistical patterns that are not directly visually apparent (“nonvisual”), and extracting features that are technically visible but practically imperceptible, even to highly trained radiologists, because of subtle textures, small size, camouflaged colors, or spatially complex characteristics (“subvisual”). Radiomics, which extracts quantitative and semi-quantitative imaging features using explicit, often advanced image processing methods, is distinct from AI; the two are closely related, however, in that radiomic features can serve as input data to AI (e.g., a machine learning classifier). Modern AI approaches, such as deep convolutional neural networks, no longer rely on hand-crafted features that must be explicitly defined and designed by researchers to capture specific image characteristics. Instead, deep learning networks automatically learn hierarchical feature representations directly from training data and optimize the combination of those features during training to maximize predictive performance.
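As a concrete (and deliberately simplified) illustration, hand-crafted radiomic features of the kind described above can be extracted with the open-source pyradiomics package; the file names, settings, and feature classes below are hypothetical choices rather than a prescribed pipeline.

```python
# Minimal sketch of hand-crafted radiomic feature extraction with pyradiomics.
# File paths and settings are illustrative placeholders.
from radiomics import featureextractor

settings = {"binWidth": 25, "resampledPixelSpacing": None}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("firstorder")  # histogram-based features
extractor.enableFeatureClassByName("glcm")        # second-order texture features

# "image.nii.gz" and "mask.nii.gz" are hypothetical NIfTI files:
# the scan and a segmented region of interest (e.g., a lesion).
features = extractor.execute("image.nii.gz", "mask.nii.gz")

for name, value in features.items():
    if name.startswith("original_"):  # skip diagnostic metadata entries
        print(name, value)
```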
Despite potential invisibility, algorithmically derived imaging signatures (encompassing both classical radiomics and deep learning-derived features) can be clinically useful to predict disease presence, disease progression, or treatment response3. For example, a scan deemed normal by radiologists might contain subvisual or nonvisual patterns strongly predicting occult disease4, and lesions that are visible in a scan might still harbor prognostically significant features that are invisible to human observers5. Although the internal processing of deep learning systems cannot be directly interpreted as concepts, growing evidence and emerging use cases suggest that they may base decisions on features that cannot be visually verified by radiologists6–9. These systems also lack general world knowledge, the ability to reason from causal principles, and an intuitive understanding of interventions and their consequences10. This lack of causal “common sense” and of the ability to extrapolate beyond learned patterns creates the risk of learning spurious “shortcuts” (e.g., scanner artifacts, demographic biases) rather than true pathophysiologic signals, limiting generalizability11. Regardless of clinical utility, making clinical decisions based on something that humans cannot independently perceive presents an ethical and medicolegal problem, as it would require complete trust in an AI system while physicians retain responsibility for diagnosis and management. This conundrum will become both more prevalent and more pressing over time as AI-guided image analyses transition from research to practice.
In this work, we detail differences in visual perception between radiologists and AI and discuss the benefits and risks of using AI to help guide radiologic diagnoses. We primarily focus on deep learning but also consider traditional machine learning analyses of radiomic features (e.g., logistic regression, support vector machines, Fig. 1).
Fig. 1. Taxonomy of artificial intelligence methods.

Machine learning is a common application of AI, which infers patterns from data. Deep learning is a type of machine learning that utilizes various types of deep neural network architectures. Adapted from ref. 120.
Radiologists and AI are trained differently
How radiologists are trained to interpret medical imaging data
To interpret medical imaging data, radiologists undergo a lengthy medical education grounded in biological principles and clinical correlation that teaches them how diseases appear in radiological images. Initially, novices rely on detecting specific, well-defined imaging features (e.g., Kerley B lines suggesting pulmonary edema). Over time, repeated practice develops effective perceptual processes whereby the visual system extracts diagnostically relevant features from images, stores them in working memory, and compares them against disease patterns and imaging templates stored in long-term memory12. Attending radiologists use more direct visual search patterns and extract more complex information from imaging studies, and do so in less time, than trainees13,14. While interpreting an imaging study may involve components of intuition or “gestalt perception” that are partially outside of conscious awareness15, radiologists can typically explain their reasoning behind a diagnosis, including discussing the parts of the study that appeared normal or abnormal16. Radiologists also have clinical knowledge, both generally and of the patient case in question, that allows them to discuss the clinical significance of what they see, or do not see, in a given scan. When faced with rare diseases, unusual anatomic distortions, or atypical artifacts, a radiologist can engage foundational biomedical and anatomical reasoning rather than having to rely solely on pattern recognition. This also allows radiologists to comment on variants of uncertain significance. In contrast, AI systems lacking causal reasoning and broader world knowledge struggle to generalize to features falling outside their training data distribution.
How AI is trained
In most cases, AI has neither relevant medical knowledge nor the ability to reason based on causal principles, but it offers the advantage that machine learning models can be trained to analyze medical imaging studies and identify very complex imaging patterns (Fig. 2). In supervised machine learning, AI systems learn by a training process in which they are fed input data and told what the correct output should be (“ground truth”)17. This type of machine learning typically occurs inside neural networks, a common AI architecture (reviewed in refs. 17,18). With access to ground truth labels, the AI system optimizes how it processes data so that it can eventually learn to consistently predict correct (ground truth) outputs for unseen input data. For instance, AI models can be trained on chest radiographs labeled with the correct radiographic diagnoses as assigned by expert radiologists19. Over time, the model learns to analyze chest radiographs in a way that aligns with the diagnosis agreed upon by the expert radiologist (or panel of experts), even though different imaging features may be used to arrive at the same diagnosis. In other words, AI does not faithfully mimic radiologists’ visual search processes, although, depending on the architecture, deep learning can replicate top-down or bottom-up visual search to some extent20,21. In another example, an AI model could be trained with MRI scans of brain tumors, labeled with histopathology results about tumor gene mutations22. To the degree that the MRI scans harbor specific clues about the gene mutation status (i.e., to the degree that gene mutations influence tumor appearance on imaging), AI can learn to find those clues and predict tumor gene mutations from the MRI scan alone. Once the AI model has been trained to perform image analysis, it can (ideally) be validated in real-world scenarios for regulatory approval and subsequently used in clinical practice23. In practice, validation data are often vendor-dependent, posing issues when the statistical characteristics of the use-case population differ from the training population. To mitigate these limitations, several large-scale initiatives provide de-identified medical imaging datasets with ground truth labels, such as the Medical Imaging and Data Resource Center24 (MIDRC; University of Chicago), the Emory BrEast imaging Dataset25 (EMBED; Emory University), and the Cancer Imaging Archive26 (TCIA; National Cancer Institute and University of Arkansas).
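The following minimal PyTorch sketch illustrates this supervised training loop in the abstract; the tiny network, synthetic images, and labels are placeholders rather than a clinically meaningful model.

```python
# Conceptual sketch of supervised training (PyTorch). The dataset, labels, and
# architecture are stand-ins; real chest-radiograph pipelines add augmentation,
# validation splits, and careful label curation.
import torch
import torch.nn as nn

model = nn.Sequential(                      # tiny stand-in for a deep CNN
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),                        # two classes, e.g., normal vs. pneumonia
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic batch standing in for labeled radiographs ("ground truth" labels).
images = torch.randn(16, 1, 64, 64)
labels = torch.randint(0, 2, (16,))

for epoch in range(5):
    optimizer.zero_grad()
    logits = model(images)            # forward pass
    loss = loss_fn(logits, labels)    # compare predictions with ground truth
    loss.backward()                   # backpropagate the error
    optimizer.step()                  # adjust weights to reduce future error
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```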
Fig. 2. Machine learning AI in diagnostic imaging.
a Overview of the training process for deep learning architectures. In supervised deep learning, neural networks are trained on input data paired with validated outputs (“ground truth”). Knowing what the correct output should be, the neural network applies an optimization procedure (e.g., backpropagation with gradient descent on the loss function) to adjust its connection weights and thereby how it processes input data, eventually learning to do so in a way that reliably leads to the correct (ground truth) outputs. Chest radiograph examples were provided by the National Institutes of Health Clinical Center and downloaded from https://nihcc.app.box.com/v/ChestXray-NIHCC19. b Deep learning neural networks are the most common machine learning architecture for image-interpreting AI. Information propagates through sequential computational units (“neurons”*). Each unit can implement a mathematical function to transform its input. Crucially, the training period adjusts the processing that occurs in the “hidden layers” of neurons. This processing may include convolutional filters, nonlinear activations, and other operations that map complex, multi-scale image features. *Note: “neuron” here refers to a computational processing unit rather than a biological cell. c Machine learning systems can identify diagnostically relevant imaging features that are difficult for radiologists to see because of their subtle texture, small size, camouflaged colors, spatially complex features, higher-order statistical patterns without an obvious visual correlate, or other reasons. d Comparison between medical image analysis by AI and radiologists.
Most modern AI systems are designed without any explicit instruction on how to analyze images in a pre-set way, i.e., they are not programmed to detect certain features such as lines, edges, or objects. Instead, they are designed to adapt themselves to training data so as to successfully reproduce the ground truth outputs (Fig. 2). As a result, AI has the freedom to diverge from human-recognizable perceptions and, compared to human interpretations, may rely on dissimilar features of the image7. Even if radiologists and AI agree about an imaging study’s diagnosis, they do not necessarily process the pertinent features of the image in the same way. For example, AI does not always discern “objects” in the way humans do27, and when it does segregate discrete objects, they can differ from the ones humans recognize8. There is generally a risk that imaging features considered important by humans might be ignored by AI and vice versa6,7, although there are also reports of AI systems and humans each being sensitive to similar features28,29. Overall, one expects some differences between radiologist and AI-based image analysis, given the differences between human and computer vision6–9,27, and between radiologist and AI learning processes30. Potentially, such differences may benefit radiologic diagnoses overall if AI and radiologists combine complementary strengths to detect clinically important information (see Table 1 for a comparison of AI and radiologist miss rates in key conditions). However, this raises the question of how and to what extent radiologists and other physicians should trust AI analyses based on subvisual or nonvisual information that is difficult to directly verify. Currently, AI may be roughly on par with radiologists for some conditions, but struggles with edge cases and distorted anatomy (see Table 1).
Table 1.
AI and radiologist miss rates in example key conditions
| Diagnosis and imaging modality | Study type and ground truth | AI miss rate | Radiologist miss rate | AI or human advantages |
|---|---|---|---|---|
| Pneumonia, chest radiograph121 | Single study, expert panel with multimodal clinical information | ~25–30% | ~25–30% | Radiologists more specific especially in complex cases; AI more false positives. |
| Breast cancer, mammography122 | Meta-analysis of 8 studies, histopathological diagnoses | 15% | 23% | Radiologists superior in cases with dense breasts or architectural distortions123,124 |
| ICH, head CT125 | Single study, subspecialty neuroradiologist | 8% | 12% | Performance likely similar |
| Hip fracture, radiograph126 | Meta-analysis of 39 studies with various ground truth methods e.g., surgical confirmation, radiologist panel consensus | 11% | ~13%a | Radiologists likely superior with distorted anatomy, overlapping structures |
Drawn from meta-analyses or single studies with an appropriate ground truth standard. Comparisons of this sort are limited in number.
AI artificial intelligence, CT computed tomography, ICH intracranial hemorrhage.
aDerived from an odds ratio (95% CI) of 1.27 (0.76–2.08) relative to the AI miss rate.
Radiologists and AI see differently
Differences in visual learning
Human vision evolved to facilitate survival and reproductively advantageous behaviors, biasing us to see the world in a way that promotes interacting with it31. The brain uses 20–25% of the body’s energy, and visual circuitry is among the most metabolically expensive brain tissue32. Therefore, metabolic constraints on the visual system likely supported the development of specialized neural mechanisms to detect evolutionarily significant features33, including dedicated brain regions for processing faces, body parts, and graspable tools34,35, potential threats like snakes36 and spiders37, and possibly animals overall38. In noisy visual environments, neural mechanisms automatically prioritize and attend to the features essential for the given task and suppress the perception of irrelevant complexities31,39. Goal-directed prioritization can operate across multiple levels, from basic visual properties (shape, contrast, color) to higher-order categorical processing (identifying edible items, defensive structures, or potential tools)40. Similar principles apply when radiologists engage in goal-directed visual searches driven by specific clinical questions, such as searching for radiographic evidence of pneumonia or causes of wrist pain. For both bottom-up and top-down scenarios, training cultivates conscious and nonconscious pattern recognition that distinguishes expert observers from novices41. Although human vision is flexible with training, it is not infinitely so42; certain innate tendencies are retained and influence what radiologists are likely to see or not see within diagnostic images. This is illustrated by the large number of diseases whose radiographic features are likened to naturalistic items with similar shapes (Fig. 3). Many naturalistic and tool-based signs are described, such as the “starry sky liver,” “sabre sheath trachea,” and “dinner fork deformity,” and over 200 signs are derived from animals, examples including the “hummingbird sign”, “goose neck appearance”, “shark fin sign”, “rat tail appearance”, “lobster claw sign”, and “face of the giant panda sign”43,44. The prevalence of such signs suggests an innate human propensity to perceive ecologically significant patterns, and likely reflects the neuroplastic adaptation or “recycling” of the neural circuits that evolved to respond to such patterns45,46.
Fig. 3. Examples of naturalistic imaging signs.
Adapted with permission from doi:10.1002/mdc3.12734. a “Hummingbird sign” seen on an MR brain scan in progressive supranuclear palsy. The hummingbird shape is delineated by midbrain atrophy that is out of proportion to the relatively preserved bulk of the pons. b “Tigroid stripe sign” seen on an MR brain scan in metachromatic leukodystrophy. The stripes arise because sulfatide accumulation causes demyelination (hyperintense radial lesions), whereas the perivascular myelin is relatively spared (dark stripes). c “Face of the giant panda sign” seen on an MR brain scan in Wilson’s disease. The panda face is delineated by copper accumulation causing hyperintense signal in the tegmentum bordering the physiologically hypointense red nuclei (eyes), preserved signal intensity in the substantia nigra pars reticulata forming the ears, and hypointense signal in the superior colliculus forming the chin.
Deep learning, as a prototypical modern AI approach, offers flexibility beyond human vision, as it can theoretically approximate any mathematical function with arbitrary closeness (provided a large amount of labeled training data)47. This means that deep learning neural networks can hypothetically map any statistical relationship that is adequately present in the labeled data, for example, between diagnostic labels and disease-related imaging features, including higher-order patterns that might not have an obvious visual correlate (Fig. 4). However, the amount of data required for an AI system to learn a given relationship varies with task complexity; thus, accurately learning some image–disease relationships may be infeasible in a medical imaging setting. Because neural networks can adjust internal parameters based specifically on the dataset at hand, this extreme flexibility is employed in a targeted fashion for the given task. While this can lead to overfitting and a lack of generalization, especially with smaller sample sizes at single institutions, it can also lead to excellent performance on the task at hand when the training data come from multiple institutions and a large, diverse population. This does not imply that current AI outperforms radiologists in clinical settings, but it does account for how deep learning can reveal things in medical imaging data that are not detectable by human vision.
Fig. 4. Overview of select semantic and radiomic features and human vision.
a First-order features describe the distribution of voxel values in a given region, without encoding spatial relationships. Such features can be modeled with histograms. Second-order features, often called “textures”, describe the statistical interrelationships between voxels in space and require more complex models. Higher-order features are obtained by applying a filter or other transformation to extract statistical relationships, which may be abstract and lack visual correlates. Note, handcrafted radiomic features are typically applied to a region of interest, which can be segmented manually or in a data-driven manner as part of the machine learning pipeline. For features at all levels, if the imaging data were obtained prospectively, radiomic analysis may include access to the raw data (pre-reconstruction), whereas retrospective datasets typically lack the raw image data and must be analyzed in the reconstructed format. *Hand-crafted features are specified in code, whereas deep-learning AI autonomously chooses which features to model as part of its training process. b Key concepts in human visual perception. Information derived from refs. 31–40,55–58,67,68. Brain, neuron, cone cells, and DNA illustrations courtesy of National Institutes of Health “BioArt” (https://bioart.niaid.nih.gov/), other drawings are original. Abbreviations: FD Fractal Dimension, GBOR Gabor Filters, GLCM Gray-Level Co-occurrence Matrix, GLDM Gray-Level Dependence Matrix, GLRLM Gray-Level Run Length Matrix, GLSZM Gray-Level Size Zone Matrix, HOG Histogram of Oriented Gradient, LBP Local Binary Patterns, LoG Laplacian of Gaussian, LPQ Local Phase Quantization, MF Minkowski Functional, NGTDM Neighborhood Gray-Tone Difference Matrix, WT Wavelet Transform.
Differences in signal detection
Human peak visual acuity is limited by the physical distance between retinal photoreceptor cells. To see very small features in a diagnostic image, radiologists must zoom in, temporarily sacrificing the larger-scale context of the image, and/or use window-level functions to enhance the visibility of different aspects of the image (e.g., “bone level”, “soft tissue level”). In contrast, convolutional neural networks can apply multi-scale kernels that evaluate single-voxel statistics and whole-organ geometry in parallel, integrating local detail and spatial context in a hierarchical multidimensional model. This architecture allows AI, in effect, to aggregate multiple weak, spatially dispersed cues that fall below human salience thresholds and do not appear as a localizable object48.
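As a loose illustration of the multi-scale idea (not a specific published architecture), the following PyTorch sketch applies parallel kernels with different receptive-field sizes to the same input and concatenates the results.

```python
# Illustrative multi-scale convolutional block (PyTorch): parallel kernels of
# different sizes are applied to the same input and concatenated, loosely
# analogous to integrating fine local detail with broader spatial context.
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.fine = nn.Conv2d(in_ch, out_ch, kernel_size=1)               # voxel-level detail
        self.mid = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)     # local texture
        self.coarse = nn.Conv2d(in_ch, out_ch, kernel_size=7, padding=3)  # regional context

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(torch.cat([self.fine(x), self.mid(x), self.coarse(x)], dim=1))

block = MultiScaleBlock(1, 8)
ct_slice = torch.randn(1, 1, 128, 128)   # stand-in for a single CT slice
print(block(ct_slice).shape)             # torch.Size([1, 24, 128, 128])
```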
AI-based computer vision also excels at analyzing subtle textural changes in color or grayscale luminance. CT scanners record with enough detail to render images in thousands of shades of gray (e.g., Hounsfield units are typically defined between −1000 and +3096), which are viewed on specialized hospital computer monitors that can typically display 1000 unique shades of gray49. Human contrast sensitivity, although it may be 20-fold lower than the theoretical optical limit50, can discriminate several hundred “just noticeable differences” in grayscale luminance under ideal conditions and is relatively effective for detecting focal variations in grayscale luminance in medical imaging49. However, complex texture analysis is more difficult: while radiologists can qualitatively describe texture (e.g., “ground-glass,” “honeycomb,” “reticular”), AI can quantify complex relationships and variations between voxel intensities at a level of detail extending into the “subvisual” range that cannot be appreciated by humans (Fig. 4). Furthermore, certain textures are higher-order statistical patterns that do not have a visual correlate. Conceptually, textural variations can reveal “imaging-negative” acute ischemic stroke on head CT images51 and myocardial infarction on non-contrast cardiac CT data4, although study results have varied with respect to the experience level of the radiologists and the imaging planes used for visual analysis.
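For concreteness, second-order (GLCM) texture descriptors of the kind referenced here can be computed with standard tools; the following scikit-image sketch uses a synthetic toy patch rather than real CT data.

```python
# Sketch of a second-order (GLCM) texture computation with scikit-image,
# illustrating descriptors such as contrast and homogeneity on a toy patch.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

rng = np.random.default_rng(0)
patch = rng.integers(0, 8, size=(64, 64)).astype(np.uint8)  # toy 8-gray-level patch

# Co-occurrence of gray levels at distance 1 in four directions.
glcm = graycomatrix(patch, distances=[1],
                    angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                    levels=8, symmetric=True, normed=True)

for prop in ("contrast", "homogeneity", "energy", "correlation"):
    print(prop, graycoprops(glcm, prop).mean())
```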
Radiologist visual processing and the value of algorithmic image analysis
In practice, radiologists often have only minutes to interpret medical images, which can make it difficult to identify all of the pertinent information that is present. Compounding this challenge are visual-cognitive biases including anchoring, framing, and satisfaction of search, as well as “inattentional blindness” (a failure to attend to something pertinent)52. Research suggests that in some circumstances radiologists may miss a lesion up to 40% of the time despite gazing directly at it53,54. Looking for one kind of lesion can also hamper detection of other types of lesions55. In reviewing CT scans for lung nodules, 66% of radiologists did not notice breast cancer, and when instructed to look for lung nodules, 30% did not detect lymphadenopathy55. In another study, a missing clavicle (removed by image manipulation) went unnoticed by roughly 60% of radiologists who reviewed chest radiographs obtained for a routine check-up56. Perhaps most surprisingly, 83% of radiologists missed a cartoon gorilla that was inserted into a chest CT scan, despite gazing directly at the gorilla for an average of 5.8 seconds57. While this cartoon example partly reflects radiologists’ training and focus on identifying biologically plausible anomalies, it also speaks to the complexity of visual processing and shows that humans do not automatically and necessarily perceive “exactly what is there” in a universally objective sense58. Outside of the research laboratory, in real clinical environments, it has been estimated that breast cancer lesions in screening mammography examinations go undetected up to 30% of the time59. Similarly, MRI false negatives occur in up to 16–35% of males with biopsy-proven clinically significant prostate cancer60,61. When lesions are missed, diagnosis, work-up, and treatment are delayed, ultimately worsening patient outcomes.
Radiologists might make a mistake with any given scan, but at a deeper level may systematically struggle to detect certain kinds of imaging signals. For example, radiologists preferentially see lesions with sharp, clear borders. A 2002 study inserted simulated lung nodules of different texture/imaging characteristics into chest radiographs and found that the participating radiologists were more likely to report lung nodules with clear borders and sharp contrast from surrounding tissue, while lesions with ill-formed borders were less likely to be reported62. Potentially, this tendency could be clinically detrimental, as clear borders relatively predict benignity for lung nodules, whereas ill-defined and irregular borders are more suggestive of malignancy63. For AI as well as humans, a clear-edged lesion exhibits a high signal-to-noise ratio and is more readily segregated from the background. A 2025 systematic review of 34 lung nodule studies found that AI sensitivities range from 56 to 96%, with good performance for larger solid nodules but difficulty with subtle ground-glass opacities, mirroring the human edge-sharpness bias64,65. However, AI models can be explicitly trained to detect low-contrast, irregular, part-solid nodules to compensate for the innate edge-preference of both humans and neural network feature maps66. Furthermore, AI’s algorithmic feature extraction is not limited to purely visual signal-to-noise ratios. In contrast, the human bias toward clear-edged lesions might be more challenging to overcome: Homo sapiens has thrived evolutionarily because of an ability to manipulate its surroundings to its biological advantage, and an edge specifies the bound of a potentially manipulable object67,68.
There are numerous research examples of algorithmically derived imaging signatures that are subvisual or nonvisual to radiologists yet clinically relevant, as the following studies illustrate (Fig. 5):
Given retinal fundus images, a deep learning model using 22 million parameters learned to accurately predict patients’ age, sex, blood pressure, smoking status, diabetes control, and risk of adverse cardiovascular events69. The prediction of sex in particular was noted in an accompanying editorial to be “surprising and puzzling” 70, as it was unknown at that time how such a prediction could be made from retinal fundus photographs.
In a multicenter study of patients with MRI-confirmed acute ischemic stroke that was occult on initial non-contrast CT, machine learning classifiers using radiomic features achieved an AUC (area under the curve) of up to 0.846 for discriminating stroke compared to normal tissue in those CT scans51. The detection appeared to leverage textural analyses revealing local homogeneity, consistent with degradation of tissue architecture in acute brain ischemia.
In patients undergoing non-contrast cardiac CT scans, radiomic texture analysis in combination with machine learning classifiers detected myocardial infarction with up to 86.2% accuracy despite the scans appearing normal to radiologists4. The detection of myocardial infarction was based on fine-grained, small-scale changes in intensity that could not be visualized by the radiologists in the study, who had 2 and 4 years of experience in cardiac imaging, respectively, and were only provided with axial plane images for visual analysis, which could have negatively impacted the radiologists’ performance.
Using brain 18F-fluorodeoxyglucose PET scans of memory clinic patients, a deep learning model was trained to predict several years in advance who would ultimately receive a diagnosis of Alzheimer’s disease, with 100% sensitivity and 82% specificity71. The model in this study, rather than relying on a particular discrete neuroanatomical region, appeared to utilize the whole brain to inform its predictions, although there was special emphasis on anterior parietotemporal regions, which are known to be implicated in Alzheimer’s disease.
In patients with glioblastoma multiforme, deep learning was trained to classify isocitrate dehydrogenase mutation status from routine T2-weighted MRI with up to 92.8% accuracy72. Grad-CAM activation maps showed the model’s strongest attention weights along marginal infiltrative zones and micro-heterogeneous clusters, regions that might be qualitatively described as “heterogeneous enhancement” but are difficult to stratify visually at a more granular level73.
In children with medulloblastoma (a common childhood brain tumor for which tissue sampling is difficult due to its posterior fossa location), machine learning classifiers using radiomic features extracted from MR brain scans were trained to predict the molecular genomics of the tumor, diagnosing the main four clinically relevant subgroups with up to 88–98% accuracy22. This level of accuracy is superior to what has been shown from models using traditional semantic features from MRI74.
In SARS-CoV-2 patients, a machine learning classifier using radiomic features extracted from CT chest angiograms learned to predict serum transcriptome markers of vascular inflammation75. Multiple classifiers were tested and extreme gradient boosting was the best performing, appearing to leverage textural changes in the perivascular space.
In patients with solid tumors (predominantly lung, gynecological, breast, or head and neck), a radiomics-based machine learning classifier was trained to use CT scans to predict CD8 tumor infiltration (a marker of immune attack against the tumor). Tested in multiple validation sets, this model was able to predict prognosis and response to immunotherapy76.
In a retrospective study of patients with advanced-stage lung adenocarcinoma with or without tumor PD-L1 expression, selected radiomic features of CT chest scans varied significantly between groups despite no significant difference observed on visual analysis by radiologists and no association with qualitative imaging features5. The machine learning classifier in this study predicted PD-L1 status with an AUC of 0.550 without radiomic features, which increased to 0.646 when radiomic features were added.
Fig. 5. Examples of algorithmic image analyses.
a AI-driven texture analysis distinguishes advanced-stage lung adenocarcinomas with and without PD-L1 expression on CT, despite no significant visual difference to radiologists. Images show a PD-L1 lesion from multiple views (top) and undergoing region of interest segmentation for feature extraction. Key discriminative features (GLCM angular second moment and various GLRLM metrics) indicated more homogeneous small-scale high-attenuation patterns in PD-L1-positive lesions, though there was no association with any qualitative imaging feature. Reproduced with permission from 10.1111/1759-7714.13352. b Deep learning AI trained on T1-weighted MRI scans achieves 95% accuracy (AUC = 0.98) in distinguishing Parkinson’s disease from healthy controls. Example control (top) and Parkinson’s disease (bottom) images appear grossly similar. Class activation maps (right) highlight the substantia nigra pars compacta, consistent with known dopaminergic pathology. Reproduced with permission from doi:10.3390/diagnostics10060402. c AI-driven texture analysis detects myocardial infarction (MI) on noncontrast low-dose CT scans, even when no abnormality is visually apparent (top left). The contrast-enhanced scan (top middle) shows hyperdense tissue (white arrow) consistent with MI. The region of interest segmentation covered the left ventricle (top right), and the coronary angiogram indicates coronary artery disease (bottom left). In the noncontrast scans, key radiomic features (GLCM and autoregressive coefficients) indicated fine textural changes (i.e., abundant, small-scale changes in intensity) undetectable by the radiologists in the study but indicative of MI. Bottom right depicts a parametric map of a GLCM textural feature that differed significantly between controls and MI. Reproduced with permission from doi:10.1097/RLI.0000000000000448. d Deep learning AI trained on UK Biobank non-dilated fundus images predicted patient sex with 87% accuracy on internal validation and 79% accuracy on external validation. Saliency maps revealed the model’s reliance on subtle, voxel-level cues not apparent to clinicians. Reproduced with permission from doi:10.1038/s41598-021-89743-x. e AI trained on MR scans (left) paired with histopathology results (right) used radiomics to classify medulloblastomas into four main molecular subgroups with 88–98% accuracy. Feature selection identified shape, first-order intensity, and texture-based descriptors (e.g., GLCM correlation, run length nonuniformity) as key predictors. This approach outperformed the accuracy of conventional MRI-based assessments reported elsewhere74. Reproduced with permission from doi:10.1148/radiol.212137. Abbreviations: GLCM Gray-Level Co-occurrence Matrix, GLRLM Gray-Level Run Length Matrix, PD-L1 programmed death-ligand 1.
These nine illustrative studies demonstrate that AI analysis of medical imaging data can uncover clinically relevant signals that are difficult or impossible for radiologists to see. However, before broad clinical deployment, further validation of these algorithms is imperative to verify that the models perform well and as expected when applied to real-life clinical data, especially images acquired at centers that did not contribute to the training sets, or at the same center but with different scanning protocols.
Benefits of incorporating data that AI can see but radiologists cannot
Historically, the clinical value of medical imaging studies has been capped by visual interpretation. Certain AI approaches can expand this frontier by detecting subtle, subvisual, or nonvisual imaging patterns with clinical relevance. This likely impacts clinical practice in at least three critical ways.
The detection of disease in normal-appearing studies. Up to 30% of breast lesions may go unreported in screening mammograms, but as many as 65% of those missed lesions are identifiable on retrospective review59. This is well in line with literature showing that 70% of missed lesions in mammograms attract gaze, indicating that errors of visual processing, but not detection, caused them to be missed77. In a multicenter German study including data from 461,818 women, AI-assisted double reading increased the number of women in whom breast cancer was detected by 17.6% without increasing false positives, by flagging subtle findings in normal-appearing studies78. Differences in breast density distributions, imaging protocols, and population demographics may all influence algorithm performance, so broader evaluation across heterogeneous populations is essential before large-scale deployment. Whether driven by algorithmically derived imaging features, enhanced contrast sensitivity, or lack of fatigue effects, AI alongside radiologists shows promise for improving diagnostic sensitivity of medical imaging.
“Virtual histopathology”. Even when lesions are visible, AI may be able to extract additional relevant information by decoding hidden subvisual or nonvisual phenotypes that may be linked to molecular or histologic characteristics. In prostate cancer, a machine learning classifier based on radiomics can discriminate Gleason grade 6 from grade 7 with 93% accuracy, and further discriminate two subtypes of grade 7 (4 + 3 versus 3 + 4) with 92% accuracy79. In breast cancer, specific textural features in MRI have been shown to correlate with certain transcriptional pathways80 and, in a small exploratory study, textural features on digital breast tomosynthesis correlated with the percentage of estrogen receptor-positive tumor cells81. In pediatric medulloblastoma, where posterior fossa biopsies carry significant risks, radiomic features analyzed by AI can diagnose the four main genetic subtypes with 88–98% accuracy22. In the future, inferences from algorithmically derived imaging signatures may offer certain insights that have historically depended on invasive biopsy. Lacking tissue sampling bias, algorithmically derived imaging features could hypothetically characterize an entire tumor non-invasively, even tracking treatment responses over time through changes in the imaging phenotype82. However, to date, data supporting such claims are mostly derived from small sample sizes82. Therefore, biopsy remains the gold standard for histopathologic diagnosis, and no current AI algorithm is yet a reliable substitute.
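To indicate what such radiomics-based classification pipelines typically look like computationally, the following is a generic scikit-learn sketch with synthetic features and labels; it is not the pipeline used in any of the cited studies.

```python
# Generic sketch of a radiomics-based classifier with cross-validation
# (scikit-learn). Features and labels are synthetic stand-ins; published
# studies use their own feature sets, selection steps, and validation cohorts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))      # 200 lesions x 50 radiomic features (synthetic)
y = rng.integers(0, 2, size=200)    # binary label, e.g., molecular subtype (synthetic)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc.mean():.2f} +/- {auc.std():.2f}")
```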
Incidental anomalies from routine studies. This includes findings that radiologists might not screen for due to time constraints or lack of a specific clinical indication, and/or due to the subvisual or nonvisual nature of the data in question. For example, based on routine chest radiographs, deep learning can quantitatively estimate numerous metrics, including coronary artery calcification83, pulmonary function84, and bone mineral density85. If sufficiently accurate, such estimates could obviate more expensive or invasive definitive tests in the future and enable opportunistic screening. However, large-scale flagging of incidental anomalies risks significant alert fatigue related to low-value findings, and over-investigation of incidentalomas risks procedural side effects, patient anxiety, and financial cost without clinical benefit. Algorithms themselves also carry costs related to their acquisition, integration, and quality assurance, which must be justified against alternative uses of healthcare resources. Operationally, deploying multiple models that each screen for one abnormality is likely infeasible, whereas a single system that screens for multiple abnormalities would be more practical but also more challenging to develop.
Risks of relying on AI image analysis that cannot be independently verified visually
A major risk of AI is that because it cannot reason based on causal principles it can be fooled in surprising ways86,87. Most AI models used for medical image analysis learn correlations, not causation, meaning that the characteristics of the data that a model is trained on can negatively influence the ability of that model to perform well on unseen data. For instance, it was shown that if trained on chest radiograph data in which the pneumothorax disease label was correlated with the presence of a chest drain in the image, an AI model learned that the presence of the drain could be used as a “shortcut” for accurately predicting pneumothorax88. In another example, an AI system was trained on hospital data in which patients presenting with asthma and pneumonia were often admitted to the intensive care unit, received aggressive care, and rarely died. The AI system learned that asthma itself, rather than aggressive care, was a protective factor against dying from pneumonia86. In these cases, an AI system relying on these correlations will not perform well on new data, which does not contain these “shortcuts” (i.e., pneumothorax cases without chest drains, or pneumonia with comorbid asthma not receiving aggressive care), since the system did not actually learn which features within an image are causally related to the diseases or their trajectory. This illustrates that, unlike humans, most AI systems cannot consider contextual factors or perform causal reasoning, which is often highly important when considering diagnostic/prognostic questions.
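As a concrete illustration of how shortcut reliance can be probed in practice, the following minimal sketch (not drawn from the cited studies) compares model performance on cases with and without a suspected confounder; the arrays are synthetic placeholders for model outputs and metadata.

```python
# One practical check for shortcut learning: compare performance on cases with
# and without a suspected confounder (e.g., chest drains in pneumothorax images).
# Arrays here are hypothetical stand-ins for model outputs and case metadata.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                    # pneumothorax present / absent
y_score = rng.random(size=500)                           # model probabilities (placeholder)
has_drain = rng.integers(0, 2, size=500).astype(bool)    # confounder flag from metadata

for name, mask in [("with chest drain", has_drain), ("without chest drain", ~has_drain)]:
    if len(set(y_true[mask])) == 2:                      # AUC needs both classes present
        print(name, f"AUC = {roc_auc_score(y_true[mask], y_score[mask]):.2f}")
# A large AUC drop in the "without chest drain" subgroup would suggest shortcut reliance.
```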
AI is also trained on data from the past and therefore cannot anticipate future changes or make logical inferences from general world knowledge. This problem applies when real-world data undergo a drift in their statistical characteristics relative to the data that the AI was trained on. Example reasons include changing patient demographics (e.g., due to gentrification or new hospital construction), disease outbreaks changing prevalence and pre-test probabilities, software or hardware updates changing image acquisition parameters (e.g., time to image after contrast injection), and other factors. In the real world, AI models can fail due to unexpectedly small changes in the data’s statistical and texture characteristics89, and AI performance can therefore degrade over time90. Even within the same hospital, AI can perform differently with different scanners or protocols91. Compared to physicians, medical AI systems currently lack common sense, real-world causal knowledge, and adaptability in the face of statistical changes in the data distribution. Therefore, to trust AI interpretation of scans, continual revalidation of the AI system should occur. As of December 2024, the FDA permits certain post-market software updates through Predetermined Change Control Plans (PCCPs), which denote pre-approved plans submitted at initial clearance that authorize specific model modifications without a new submission, provided that the updates remain within the cleared scope and adhere to defined monitoring and control requirements92.
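A simple operational example of monitoring for such drift is to compare the distribution of an image-derived feature between the training cohort and newly acquired scans; the sketch below applies a two-sample Kolmogorov–Smirnov test to synthetic values and is illustrative only.

```python
# Simple data-drift check: compare the distribution of an image-derived feature
# (e.g., mean lung attenuation) between the training cohort and incoming scans
# with a two-sample Kolmogorov-Smirnov test. Values shown are synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train_feature = rng.normal(loc=-850, scale=30, size=1000)    # training distribution
incoming_feature = rng.normal(loc=-830, scale=35, size=200)  # post-deployment scans

stat, p_value = ks_2samp(train_feature, incoming_feature)
print(f"KS statistic = {stat:.3f}, p = {p_value:.4f}")
if p_value < 0.01:
    print("Distribution shift detected: trigger revalidation / recalibration review.")
```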
Lastly, the fact that AI can detect algorithmically derived imaging features that are potentially invisible to the human eye may not always be beneficial. A pertinent example is the recently uncovered ability of AI models to accurately identify demographic attributes, such as self-reported race, of patients from their medical images69,93,94, a task widely considered impossible for human experts93. The implication is that AI models could potentially use sensitive patient characteristics to inform decision-making, which may result in systematic errors between different demographic groups. Crucially, in the case of an AI model that was able to predict racial identity with very high accuracy, it was unclear what image features were used in this task even after a comprehensive investigation93. This brings to light an important consideration about AI image analysis: due to the black-box nature of many deep learning models, these invisible features can be used in ways that may lead to discriminatory prediction outcomes across demographic groups. Evidence of such performance disparities between subsets of medical imaging datasets, whether due to shortcuts, data distribution shifts, or reliance on learned demographic characteristics in decision-making, has been documented extensively in recent years95–98. In parallel, there has been considerable progress toward identifying when and how deep learning models use these “biases” in medical image analysis tasks (e.g., through exploration of learned features99,100 and unsupervised discovery of lower-performing subgroups101), developing methods for mitigating performance differences between subpopulations102–105, and other initiatives that aim to ensure careful development and robust validation of AI systems106,107. Nevertheless, rigorous evaluation and regulation of these systems in a rapidly evolving research and medical device landscape remains a challenge108,109.
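One common research approach for testing whether a model's learned features encode a sensitive attribute is to train a simple “probe” classifier on the frozen embeddings; the sketch below uses synthetic data and is illustrative only, not a reproduction of the cited analyses.

```python
# Sketch of a linear "probe": train a simple classifier to predict a sensitive
# attribute (e.g., self-reported race) from a model's frozen image embeddings.
# High probe accuracy suggests the embeddings encode that attribute. Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
embeddings = rng.normal(size=(1000, 256))    # stand-in for penultimate-layer features
attribute = rng.integers(0, 2, size=1000)    # stand-in for a binary sensitive attribute

probe = LogisticRegression(max_iter=2000)
acc = cross_val_score(probe, embeddings, attribute, cv=5, scoring="accuracy")
print(f"probe accuracy: {acc.mean():.2f}")   # ~0.5 here; far above chance on real data is a flag
```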
Ongoing work is still needed to mitigate hazards associated with the black-box nature of AI systems, and to earn physician and public trust. Current explainability tools, such as saliency maps110 generated using gradient-weighted class activation maps (Grad-CAM)111 or SmoothGrad112, and other feature attribution methods113 provide partial transparency by visualizing areas of focus in deep learning models, but typically cannot account for the biological significance of subvisual or nonvisual imaging features and may not be adequate for ensuring that AI decisions on individual images are trustworthy114. Emerging frameworks, like masked causal flows115, which model causal relationships and facilitate validation of disease mechanisms through counterfactual analysis, present a promising direction in developing AI systems that are inherently interpretable and are less likely to fall victim to harmful spurious correlations. From a clinical standpoint, the overarching goal should be to harness the potential of AI to enhance radiographic diagnoses, while maintaining or improving patient safety. To this end, there is a pressing need for continued interdisciplinary collaboration among radiologists, AI developers, and biologists to forge “bio-literate” AI systems that link algorithmically derived imaging signatures to validated disease markers and mechanisms.
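For readers unfamiliar with how saliency methods such as Grad-CAM are computed, the following toy sketch (assuming PyTorch, with a placeholder model and input) captures the essential steps: pooling the gradients of the predicted class over a convolutional layer's activations and keeping the positive weighted sum.

```python
# Minimal Grad-CAM sketch (PyTorch): capture activations and gradients of a
# convolutional layer, weight the activation maps by pooled gradients, and keep
# the positive part. The model and input are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)
target_layer = model[2]                      # last convolutional layer
store = {}
target_layer.register_forward_hook(lambda m, i, o: store.update(act=o))
target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

x = torch.randn(1, 1, 64, 64)                          # toy input "image"
logits = model(x)
logits[0, logits.argmax()].backward()                  # gradient of the predicted class

weights = store["grad"].mean(dim=(2, 3), keepdim=True) # global-average-pool the gradients
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
print(cam.shape)                                       # saliency map aligned with the input
```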
Future directions
When radiologists and AI do not share the same interpretation of an imaging study, the implications extend beyond diagnostic accuracy to fundamental questions about medical decision-making and patient safety90. One could, for example, imagine a confidence-based triage framework that is deliberately simple and low-burden in such cases (illustrated in the sketch after the following list):
If the AI’s internal confidence is >95% and it can highlight the specific image features supporting that conclusion (e.g., via saliency maps), the case is flagged for a second review by the reader.
If confidence is <70%, the AI output is not displayed.
For an intermediate confidence, the AI report is shown to the radiologist for consideration or displayed alongside the radiologist report without emphasis.
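As a purely illustrative toy implementation (thresholds and behaviors taken from the hypothetical framework above, not from any deployed system):

```python
# Toy implementation of the confidence-based triage logic described above.
# The thresholds (0.95, 0.70) mirror the hypothetical framework and are illustrative only.
def triage_ai_output(confidence: float, has_saliency_map: bool) -> str:
    """Return how an AI finding would be surfaced to the reading radiologist."""
    if confidence > 0.95 and has_saliency_map:
        return "flag for second review with highlighted image features"
    if confidence < 0.70:
        return "suppress AI output"
    return "display alongside radiologist report without emphasis"

for conf, sal in [(0.98, True), (0.85, False), (0.60, True)]:
    print(conf, "->", triage_ai_output(conf, sal))
```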
Within this context, it needs to be highlighted that confidence scores are not safety guarantees: AI systems can be confidently wrong (“confident misclassification”), and when AI is wrong about a scan, it may even cause radiologists to make more mistakes116. Thus, algorithmic confidence should never substitute for clinical judgement. Moreover, implementation into care should proceed under explicit, auditable checkpoints (e.g., SPIRIT-AI for protocol reporting117; DECIDE-AI for early clinical evaluation and human factors118), with ongoing post-market monitoring and recalibration. Even so, key governance questions remain. When, if ever, should physicians defer to AI in contradiction of their own expert interpretation and opinion? How can healthcare leverage AI-driven analyses while safeguarding against the risk of misleading use of shortcuts and harmful biases? Will AI alerts that lack the context of applied medical knowledge inevitably increase rather than decrease physician workload?
Harmonizing AI’s analytical power with physicians’ biologically grounded reasoning is an admirable goal, but when AI analyzes image properties that radiologists cannot see, its deployment may add to cognitive load by introducing a new class of algorithmic uncertainty and bias. Still, a physician’s ethical duty to deliver the best care might, over time, include the use of well-validated AI systems90. Even as AI development progresses, it may not, for the foreseeable future, replicate to a human level many cognitive faculties that are essential for clinical medicine: contextual interpretation, causal reasoning, and the integration of imaging findings with clinical narratives, physiological concepts, and atypical presentations. To date, these human capabilities remain essential safeguards against black-box algorithmic predictions. However, clinical trust in AI subvisual and nonvisual imaging findings could be earned over time, in part by shifting the evidentiary focus from validating AI against human perception to validating against biological ground truth. Hypothesis-driven research connecting algorithmically derived imaging features to histopathology, molecular assays, or other validated biomarkers will be essential for this. In turn, the practice of radiology may evolve towards high-level clinical integration of both visible and invisible imaging signatures119. Continuing to clarify the relative advantages and disadvantages of human and AI image analysis is vital for ensuring that AI complements the biologically grounded reasoning of radiologists, ultimately improving patient health outcomes.
Acknowledgements
This work was conducted without specific funding. Publication costs were supported by a grant from Alberta Innovates.
Author contributions
G.A.M. wrote the main manuscript text and prepared Figs. 2–5. E.A.M.S. wrote part of the main manuscript text, reviewed and edited the manuscript, prepared figure 1, and edited Figs. 2 and 3. T.R. reviewed and edited the manuscript and Fig. 2. N.D.F. reviewed and edited the manuscript text and figures. All authors conceptualized the project. T.R. and N.D.F. supervised the project. All authors have read and approved the manuscript.
Data availability
No datasets were generated or analysed during the current study.
Competing interests
Within the last 36 months, N.D.F. has received grants from the Canadian Institutes of Health Research, Natural Sciences and Engineering Research Council of Canada, and Canada Research Chairs Program for work unrelated to this paper. Publication costs were supported by a grant from Alberta Innovates. G.A.M., E.A.M.S., and T.R. have nothing to declare.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Liu, X. et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit. Health1, e271–e297 (2019). [DOI] [PubMed] [Google Scholar]
- 2.Scapicchio, C. et al. A deep look into radiomics. Radiol. Med.126, 1296–1311 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Gillies, R. J., Kinahan, P. E. & Hricak, H. Radiomics: images are more than pictures, they are data. Radiology278, 563–577 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Mannil, M., von Spiczak, J., Manka, R. & Alkadhi, H. Texture analysis and machine learning for detecting myocardial infarction in noncontrast low-dose computed tomography: unveiling the invisible. Investig. Radiol.53, 338 (2018). [DOI] [PubMed] [Google Scholar]
- 5.Yoon, J. et al. Utility of CT radiomics for prediction of PD-L1 expression in advanced lung adenocarcinomas. Thorac. Cancer11, 993–1004 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Ullman, S., Assif, L., Fetaya, E. & Harari, D. Atoms of recognition in human and computer vision. Proc. Natl. Acad. Sci. USA113, 2744–2749 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Linsley, D., Eberhardt, S., Sharma, T., Gupta, P. & Serre, T. What are the visual features underlying human versus machine vision? in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW) 2706–2714. 10.1109/ICCVW.2017.331 (IEEE, 2017).
- 8.Funke, C. M. et al. Five points to check when comparing visual perception in humans and machines. J. Vis.21, 16 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Makino, T. et al. Differences between human and machine perception in medical diagnosis. Sci. Rep.12, 6877 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci.40, e253 (2017). [DOI] [PubMed] [Google Scholar]
- 11.Souza, R. et al. Identifying biases in a multicenter MRI database for Parkinson’s disease classification: is the disease classifier a secret site classifier? IEEE J. Biomed. Health Inform.28, 2047–2054 (2024). [DOI] [PubMed] [Google Scholar]
- 12.Annis, J. & Palmeri, T. J. Modeling memory dynamics in visual expertise. J. Exp. Psychol. Learn. Mem. Cognit.45, 1599–1618 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Yoon, J.-S. et al. A think-aloud study to inform the design of radiograph interpretation practice. Adv. Health Sci. Educ.25, 877–903 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Waite, S. et al. Analysis of perceptual expertise in radiology – current knowledge and a new perspective. Front. Hum. Neurosci.13, 213 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Raat, E. M., Kyle-Davidson, C. & Evans, K. K. Using global feedback to induce learning of gist of abnormality in mammograms. Cognit. Res. Princ. Implic.8, 3 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Norman, G., Young, M. & Brooks, L. Non-analytical models of clinical reasoning: the role of experience. Med. Educ.41, 1140–1145 (2007). [DOI] [PubMed] [Google Scholar]
- 17.LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature521, 436–444 (2015). [DOI] [PubMed] [Google Scholar]
- 18.Lo Vercio, L. et al. Supervised machine learning tools: a tutorial for clinicians. J. Neural. Eng. 17, 062001 (2020). [DOI] [PubMed]
- 19.Wang, X. et al. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 3462–3471. 10.1109/CVPR.2017.369 (IEEE, 2017).
- 20.Kriegeskorte, N. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annu. Rev. Vis. Sci. 1, 417–446 (2015).
- 21.Richards, B. A. et al. A deep learning framework for neuroscience. Nat. Neurosci. 22, 1761–1770 (2019).
- 22.Zhang, M. et al. MRI radiogenomics of pediatric medulloblastoma: a multicenter study. Radiology 304, 406–416 (2022).
- 23.Winder, A. J., Stanley, E. A., Fiehler, J. & Forkert, N. D. Challenges and potential of artificial intelligence in neuroradiology. Clin. Neuroradiol. 34, 293–305 (2024).
- 24.The Medical Imaging and Data Resource Center (MIDRC). https://www.midrc.org.
- 25.Jeong, J. J. et al. The EMory BrEast imaging Dataset (EMBED): a racially diverse, granular dataset of 3.4 million screening and diagnostic mammographic images. Radiol. Artif. Intell. 5, e220047 (2023).
- 26.Clark, K. et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J. Digit. Imaging 26, 1045–1057 (2013).
- 27.Baker, N., Lu, H., Erlikhman, G. & Kellman, P. J. Deep convolutional networks do not classify based on global object shape. PLOS Comput. Biol. 14, e1006613 (2018).
- 28.Dobs, K., Martinez, J., Kell, A. J. E. & Kanwisher, N. Brain-like functional specialization emerges spontaneously in deep neural networks. Sci. Adv. 8, eabl8913 (2022).
- 29.Cadieu, C. F. et al. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Comput. Biol. 10, e1003963 (2014).
- 30.Solé, R. & Seoane, L. F. Evolution of brains and computers: the roads not taken. Entropy 24, 665 (2022).
- 31.Zhaoping, L. Understanding Vision: Theory, Models, and Data (Oxford Univ. Press, 2019).
- 32.Wong-Riley, M. T. T. Energy metabolism of the visual system. Eye Brain 2, 99–116 (2010).
- 33.Niven, J. E. & Laughlin, S. B. Energy limitation as a selective pressure on the evolution of sensory systems. J. Exp. Biol. 211, 1792–1804 (2008).
- 34.Downing, P. E., Chan, A. W.-Y., Peelen, M. V., Dodds, C. M. & Kanwisher, N. Domain specificity in visual cortex. Cereb. Cortex 16, 1453–1461 (2006).
- 35.Mruczek, R. E. B., von Loga, I. S. & Kastner, S. The representation of tool and non-tool object information in the human intraparietal sulcus. J. Neurophysiol. 109, 2883–2896 (2013).
- 36.Le, Q. V. et al. Pulvinar neurons reveal neurobiological evidence of past selection for rapid detection of snakes. Proc. Natl. Acad. Sci. USA 110, 19000–19005 (2013).
- 37.Rakison, D. H. & Derringer, J. Do infants possess an evolved spider-detection mechanism? Cognition 107, 381–393 (2008).
- 38.New, J., Cosmides, L. & Tooby, J. Category-specific attention for animals reflects ancestral priorities, not expertise. Proc. Natl. Acad. Sci. USA 104, 16598–16603 (2007).
- 39.Triesch, J., Ballard, D. H., Hayhoe, M. M. & Sullivan, B. T. What you see is what you need. J. Vis. 3, 9–9 (2003).
- 40.Barsalou, L. W. Ad hoc categories. Mem. Cognit. 11, 211–227 (1983).
- 41.Drew, T., Evans, K., Võ, M. L.-H., Jacobson, F. L. & Wolfe, J. M. Informatics in radiology: what can you see in a single glance and how might this guide visual search in medical images? Radiographics 33, 263–274 (2013).
- 42.Evans, K. K. et al. Does visual expertise improve visual recognition memory? Atten. Percept. Psychophys. 73, 30–35 (2011).
- 43.Ridley, L. J., Xiang, H., Han, J. & Ridley, W. E. Animal signs in radiology: method of creating a compendium. J. Med. Imaging Radiat. Oncol. 62, 7–11 (2018).
- 44.Mulroy, E. et al. Animals in the brain. Mov. Disord. Clin. Pract. 6, 189–198 (2019).
- 45.Dehaene, S. Evolution of human cortical circuits for reading and arithmetic: the ‘neuronal recycling’ hypothesis. In From Monkey Brain to Human Brain (eds Dehaene, S., Duhamel, J.-R., Hauser, M. D. & Rizzolatti, G.) 133–158 (MIT Press, 2005).
- 46.Bilalic, M., Grottenthaler, T., Nägele, T. & Lindig, T. The faces in radiological images: fusiform face area supports radiological expertise. Cereb. Cortex 26, 1004–1014 (2016).
- 47.Zhou, D.-X. Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 48, 787–794 (2020).
- 48.Skrede, O. J. et al. Deep learning for prediction of colorectal cancer outcome: a discovery and validation study. Lancet 395, 350–360 (2020).
- 49.Kimpe, T. & Tuytschaever, T. Increasing the number of gray shades in medical display systems—how much is enough? J. Digit. Imaging 20, 422–432 (2007).
- 50.Banks, M. S., Geisler, W. S. & Bennett, P. J. The physical limits of grating visibility. Vision Res. 27, 1915–1924 (1987).
- 51.Sun, K. et al. Noninvasive imaging biomarker reveals invisible microscopic variation in acute ischaemic stroke (≤ 24 h): a multicentre retrospective study. Sci. Rep. 15, 3743 (2025).
- 52.Brady, A. P. Error and discrepancy in radiology: inevitable or avoidable? Insights Imaging 8, 171–182 (2016).
- 53.Samuel, S., Kundel, H. L., Nodine, C. F. & Toto, L. C. Mechanism of satisfaction of search: eye position recordings in the reading of chest radiographs. Radiology 194, 895–902 (1995).
- 54.Berbaum, K. S. et al. Cause of satisfaction of search effects in contrast studies of the abdomen. Acad. Radiol. 3, 815–826 (1996).
- 55.Williams, L. H. et al. The invisible breast cancer: experience does not protect against inattentional blindness to clinically-relevant findings in radiology. Psychon. Bull. Rev. 28, 503–511 (2021).
- 56.Potchen, E. J. Measuring observer performance in chest radiology: some experiences. J. Am. Coll. Radiol. 3, 423–432 (2006).
- 57.Drew, T., Võ, M. L. H. & Wolfe, J. M. The invisible gorilla strikes again: sustained inattentional blindness in expert observers. Psychol. Sci. 24, 1848–1853 (2013).
- 58.Hoffman, D. D. The Case Against Reality: Why Evolution Hid the Truth from Our Eyes (W. W. Norton, 2019).
- 59.Tourassi, G., Voisin, S., Paquit, V. & Krupinski, E. Investigating the link between radiologists’ gaze, diagnostic decision, and image content. J. Am. Med. Inform. Assoc. 20, 1067–1075 (2013).
- 60.Filson, C. P. et al. Prostate cancer detection with magnetic resonance-ultrasound fusion biopsy: the role of systematic and targeted biopsies. Cancer 122, 884–892 (2016).
- 61.Artiles Medina, A. et al. Identifying risk factors for MRI-invisible prostate cancer in patients undergoing transperineal saturation biopsy. Res. Rep. Urol. 13, 723–731 (2021).
- 62.Samei, E., Flynn, M. J., Peterson, E. & Eyler, W. R. Subtle lung nodules: influence of local anatomic variations on detection. Radiology 228, 76–84 (2003).
- 63.Choromańska, A. & Macura, K. J. Evaluation of solitary pulmonary nodule detected during computed tomography examination. Pol. J. Radiol. 77, 22–34 (2012).
- 64.Megat Ramli, P. N., Aizuddin, A. N., Ahmad, N., Abdul Hamid, Z. & Ismail, K. I. A systematic review: the role of artificial intelligence in lung cancer screening in detecting lung nodules on chest X-rays. Diagnostics 15, 246 (2025).
- 65.Han, C. et al. Synthesizing diverse lung nodules wherever massively: 3D multi-conditional GAN-based CT image augmentation for object detection. In 2019 International Conference on 3D Vision (3DV) 729–737. 10.1109/3DV.2019.00085 (IEEE, 2019).
- 66.Geirhos, R. et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations (2019).
- 67.Bruner, E., Battaglia-Mayer, A. & Caminiti, R. The parietal lobe evolution and the emergence of material culture in the human genus. Brain Struct. Funct. 228, 145–167 (2023).
- 68.Tettamanti, M., Conca, F., Falini, A. & Perani, D. Unaware processing of tools in the neural system for object-directed action representation. J. Neurosci. 37, 10712–10724 (2017).
- 69.Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2, 158–164 (2018).
- 70.Ting, D. S. W. & Wong, T. Y. Eyeing cardiovascular risk factors. Nat. Biomed. Eng. 2, 140–141 (2018).
- 71.Ding, Y. et al. A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain. Radiology 290, 456–464 (2019).
- 72.Bangalore Yogananda, C. G. et al. MRI-based deep learning method for classification of IDH mutation status. Bioengineering 10, 1045 (2023).
- 73.Kerkhof, M. et al. Interobserver variability in the radiological assessment of magnetic resonance imaging (MRI) including perfusion MRI in glioblastoma multiforme. Eur. J. Neurol. 23, 1528–1533 (2016).
- 74.Perreault, S. et al. MRI surrogates for molecular subgroups of medulloblastoma. AJNR Am. J. Neuroradiol. 35, 1263–1269 (2014).
- 75.Kotanidis, C. P. et al. Constructing custom-made radiotranscriptomic signatures of vascular inflammation from routine CT angiograms: a prospective outcomes validation study in COVID-19. Lancet Digit. Health 4, e705–e716 (2022).
- 76.Sun, R. et al. A radiomics approach to assess tumour-infiltrating CD8 cells and response to anti-PD-1 or anti-PD-L1 immunotherapy: an imaging biomarker, retrospective multicohort study. Lancet Oncol. 19, 1180–1191 (2018).
- 77.Gandomkar, Z. & Mello-Thoms, C. Visual search in breast imaging. Br. J. Radiol. 92, 20190057 (2019).
- 78.Eisemann, N. et al. Nationwide real-world implementation of AI for cancer detection in population-based mammography screening. Nat. Med. 31, 917–924 (2025).
- 79.Fehr, D. et al. Automatic classification of prostate cancer Gleason scores from multiparametric magnetic resonance images. Proc. Natl. Acad. Sci. USA 112, E6265–E6273 (2015).
- 80.Zhu, Y. et al. Deciphering genomic underpinnings of quantitative MRI-based radiomic phenotypes of invasive breast carcinoma. Sci. Rep. 5, 17787 (2015).
- 81.Tagliafico, A. S. et al. An exploratory radiomics analysis on digital breast tomosynthesis in women with mammographically negative dense breasts. Breast Edinb. Scotl. 40, 92–96 (2018).
- 82.Zhou, M. et al. Radiologically defined ecological dynamics and clinical outcomes in glioblastoma multiforme: preliminary results. Transl. Oncol. 7, 5–13 (2014).
- 83.Kamel, P. I., Yi, P. H., Sair, H. I. & Lin, C. T. Prediction of coronary artery calcium and cardiovascular risk on chest radiographs using deep learning. Radiol. Cardiothorac. Imaging 3, e200486 (2021).
- 84.Ueda, D. et al. A deep learning-based model to estimate pulmonary function from chest x-rays: multi-institutional model development and validation study in Japan. Lancet Digit. Health 6, e580–e588 (2024).
- 85.Sato, Y. et al. Deep learning for bone mineral density and T-score prediction from chest X-rays: a multicenter study. Biomedicines 10, 2323 (2022).
- 86.Caruana, R. et al. Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In Proc. 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’15 1721–1730. 10.1145/2783258.2788613 (ACM Press, 2015).
- 87.Su, J., Vargas, D. V. & Sakurai, K. One pixel attack for fooling deep neural networks. IEEE Trans. Evol. Comput. 23, 828–841 (2019).
- 88.Jiménez-Sánchez, A., Juodelyte, D., Chamberlain, B. & Cheplygina, V. Detecting shortcuts in medical images – a case study in chest X-rays. Preprint at http://arxiv.org/abs/2211.04279 (2022).
- 89.Recht, B., Roelofs, R., Schmidt, L. & Shankar, V. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning. Vol. 97, 5389–5400 (PMLR, 2019).
- 90.Froomkin, A. M., Kerr, I. & Pineau, J. When AIs outperform doctors: confronting the challenges of a tort-induced over-reliance on machine learning. Ariz. Law Rev. 61, 33 (2019).
- 91.Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 15, e1002683 (2018).
- 92.Center for Devices and Radiological Health, U.S. Food and Drug Administration. Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence-Enabled Device Software Functions. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/marketing-submission-recommendations-predetermined-change-control-plan-artificial-intelligence (2024).
- 93.Gichoya, J. W. et al. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit. Health 4, e406–e414 (2022).
- 94.Yi, P. H. et al. Radiology “forensics”: determination of age and sex from chest radiographs using deep learning. Emerg. Radiol. 28, 949–954 (2021).
- 95.Stanley, E. A. M., Wilms, M. & Forkert, N. D. Disproportionate subgroup impacts and other challenges of fairness in artificial intelligence for medical image analysis. In Ethical and Philosophical Issues in Medical Imaging, Multimodal Learning and Fusion Across Scales for Clinical Decision Support, and Topological Data Analysis for Biomedical Imaging (eds Baxter, J. S. H. et al.) 14–25 (Springer Nature Switzerland, 2022).
- 96.Seyyed-Kalantari, L., Zhang, H., McDermott, M. B. A., Chen, I. Y. & Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 27, 2176–2182 (2021).
- 97.Lee, T., Puyol-Antón, E., Ruijsink, B., Shi, M. & King, A. P. A systematic study of race and sex bias in CNN-based cardiac MR segmentation. In Statistical Atlases and Computational Models of the Heart. Regular and CMRxMotion Challenge Papers (eds Camara, O. et al.) 233–244 (Springer, 2023).
- 98.Alloula, A., Mustafa, R., McGowan, D. R. & Papież, B. W. On biases in a UK Biobank-based retinal image classification model. In Ethics and Fairness in Medical Imaging (eds Puyol-Antón, E. et al.) 140–150. 10.1007/978-3-031-72787-0_14 (Springer Nature Switzerland, 2025).
- 99.Glocker, B., Jones, C., Bernhardt, M. & Winzeck, S. Algorithmic encoding of protected characteristics in chest X-ray disease detection models. eBioMedicine 89, 104467 (2023).
- 100.Stanley, E. A. M., Souza, R., Wilms, M. & Forkert, N. D. Where, why, and how is bias learned in medical image analysis models? A study of bias encoding within convolutional networks using synthetic data. eBioMedicine 111, 105501 (2025).
- 101.Olesen, V., Weng, N., Feragen, A. & Petersen, E. Slicing through bias: explaining performance gaps in medical image analysis using slice discovery methods. In Ethics and Fairness in Medical Imaging (eds Puyol-Antón, E. et al.) 3–13 (Springer, 2025).
- 102.Boland, C. et al. All you need is a guiding hand: mitigating shortcut bias in deep learning models for medical imaging. In Ethics and Fairness in Medical Imaging (eds Puyol-Antón, E. et al.) 67–77 (Springer, 2025).
- 103.Yao, S. et al. Enhancing the fairness of AI prediction models by Quasi-Pareto improvement among heterogeneous thyroid nodule population. Nat. Commun. 15, 1958 (2024).
- 104.Wang, R. et al. Drop the shortcuts: image augmentation improves fairness and decreases AI detection of race and other demographics from medical images. eBioMedicine 102, 105047 (2024).
- 105.Lin, M. et al. Improving fairness of automated chest radiograph diagnosis by contrastive learning. Radiol. Artif. Intell. 6, e230342 (2024).
- 106.Alderman, J. E. et al. Tackling algorithmic bias and promoting transparency in health datasets: the STANDING Together consensus recommendations. Lancet Digit. Health 7, e64–e88 (2025).
- 107.Lekadir, K. et al. FUTURE-AI: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ 388, e081554 (2025).
- 108.Petrick, N. et al. Regulatory considerations for medical imaging AI/ML devices in the United States: concepts and challenges. J. Med. Imaging 10, 051804 (2023).
- 109.Schmidt, J. et al. Mapping the regulatory landscape for artificial intelligence in health within the European Union. npj Digit. Med. 7, 1–9 (2024).
- 110.Wollek, A. et al. Attention-based saliency maps improve interpretability of pneumothorax classification. Radiol. Artif. Intell. 5, e220187 (2023).
- 111.Mohamed, M. M., Mahesh, T. R., Vinoth, K. & Guluwadi, S. Enhancing brain tumor detection in MRI images through explainable AI using Grad-CAM with Resnet 50. BMC Med. Imaging 24, 107 (2024).
- 112.Smilkov, D., Thorat, N., Kim, B., Viégas, F. & Wattenberg, M. SmoothGrad: removing noise by adding noise. Preprint at https://arxiv.org/abs/1706.03825v1 (2017).
- 113.Hassan, S. U., Abdulkadir, S. J., Zahid, M. S. M. & Al-Selwi, S. M. Local interpretable model-agnostic explanation approach for medical imaging analysis: a systematic literature review. Comput. Biol. Med. 185, 109569 (2025).
- 114.Ghassemi, M., Oakden-Rayner, L. & Beam, A. L. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit. Health 3, e745–e750 (2021).
- 115.Vigneshwaran, V., Ohara, E., Wilms, M. & Forkert, N. MACAW: a causal generative model for medical imaging. Preprint at https://doi.org/10.48550/arXiv.2412.02900 (2024).
- 116.Bernstein, M. H. et al. Can incorrect artificial intelligence (AI) results impact radiologists, and if so, what can we do about it? A multi-reader pilot study of lung cancer detection with chest radiography. Eur. Radiol. 33, 8263–8269 (2023).
- 117.Rivera, S. C. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Digit. Health 2, e549–e560 (2020).
- 118.Vasey, B., Novak, A., Ather, S., Ibrahim, M. & McCulloch, P. DECIDE-AI: a new reporting guideline and its relevance to artificial intelligence studies in radiology. Clin. Radiol. 78, 130–136 (2023).
- 119.Jha, S. & Topol, E. J. Adapting to artificial intelligence: radiologists and pathologists as information specialists. JAMA 316, 2353–2354 (2016).
- 120.Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
- 121.Hofmeister, J. et al. Validating the accuracy of deep learning for the diagnosis of pneumonia on chest x-ray against a robust multimodal reference diagnosis: a post hoc analysis of two prospective studies. Eur. Radiol. Exp. 8, 20 (2024).
- 122.Hashim, H. T. et al. Artificial intelligence versus radiologists in detecting early-stage breast cancer from mammograms: a meta-analysis of paradigm shifts. Pol. J. Radiol. 90, e1–e8 (2025).
- 123.Caldas, F. A. A. et al. Evaluating the performance of artificial intelligence and radiologists accuracy in breast cancer detection in screening mammography across breast densities. Eur. J. Radiol. Artif. Intell. 2, 100013 (2025).
- 124.Woo, O. H. et al. Invasive breast cancers missed by AI screening of mammograms. Radiology 315, e242408 (2025).
- 125.Wang, D. et al. Real world validation of an AI-based CT hemorrhage detection tool. Front. Neurol. 14, 1177723 (2023).
- 126.Lex, J. R. et al. Artificial intelligence for hip fracture detection and outcome prediction. JAMA Netw. Open 6, e233391 (2023).
Data Availability Statement
No datasets were generated or analysed during the current study.




