Abstract.
Purpose
Expert radiologists can detect the “gist of abnormal” in bilateral mammograms even three years prior to the onset of cancer. However, their performance decreases if the two breasts are not from the same woman, suggesting that the ability to detect the abnormality depends in part on a global signal present across the two breasts. We aim to detect this implicitly perceived “symmetry” signal by examining its effect on a pre-trained mammography model.
Approach
A deep neural network (DNN) with four mammogram view inputs was developed to predict whether the mammograms come from one woman, or two different women as the first step in investigating the symmetry signal. Mammograms were balanced by size, age, density, and machine type. We then evaluated a cancer detection DNN’s performance on mammograms from the same and different women. Finally, we used textural analysis methods to further explain the symmetry signal.
Results
The developed DNN can detect whether a set of mammograms comes from the same woman or from two different women with a base accuracy of 61%. Furthermore, when the DNN was shown mammograms in which either the contralateral or the abnormal image had been swapped with a normal mammogram from another woman, its performance decreased. These findings indicate that abnormalities induce a disruption in global mammogram structure, breaking the critical symmetry signal.
Conclusion
The global symmetry signal is a textural signal embedded in the parenchyma of bilateral mammograms, which can be extracted. The presence of abnormalities alters textural similarities between the left and right breasts and contributes to the “medical gist signal.”
Keywords: mammography, gist, global symmetry, deep neural networks, textural analysis, artificial intelligence
1. Introduction
Breast cancer is both the most commonly diagnosed cancer worldwide and the most common cause of cancer-related mortality in women.1 In the developed world, breast cancer mortality rates have fallen, in significant part thanks to early screening and detection2–5 by expert radiologists. However, mammographic screening is not perfectly accurate.6–8 Most damaging are false negatives, in which a cancer later determined to have been present at screening nevertheless went undetected. False negative rates have been found to be as high as 35%.9,10 Considering the potentially life-changing impact of a missed diagnosis, it is natural to wish to maximize the amount of clinically relevant information that can be extracted from a mammogram, and so reduce screening error rates.
One such potential source of information is the “gist of abnormality.” Rooted in the perception literature, “gist” refers to information about the global structure and statistical regularities present in an image. This information can be extracted extremely quickly (around 500 ms or less),11,12 before any localization or deployment of attention takes place. For scene images, this ability allows rapid categorization: for example, knowing that an image glimpsed only briefly showed a beach, despite not having had time to fixate on diagnostic scene elements (such as a deck chair).13 Hence, the “gist of abnormality” refers to information about abnormality that can be rapidly extracted. Indeed, radiologists have been shown to be capable of detecting signals of abnormality in medical images at above-chance levels within sub-second time frames. This ability is not domain-specific and has been observed in chest x-rays,11,14 cervical micrographs,12 prostate multiparametric magnetic resonance imaging,15 skin pathology,16 and, relevant to this work, mammograms.12,17 Gist extraction also appears fairly robust, remaining unaffected by the difficulty of detecting the cancer using more classical screening methodologies.18
This ability does not seem to require localization of a lesion within the mammogram. While presenting medical images for only very short times naturally precludes localization, radiologists have also demonstrated the ability to extract signals of abnormality from images with no visibly actionable lesion present. This includes textural patches drawn from mammograms with abnormality present, as well as the contralateral mammogram.17 There is even evidence that this ability extends to prior images, mammograms taken before localizable lesions develop.19,20 The gist of abnormality does not appear to be related to standard mammographic descriptors such as density and shape asymmetry. Increases in density cause performance degradation,20 which would be unexpected if rapid abnormality detection were based on greater perceived density. In standard screening practice, it is common for radiologists to first view bilateral images, where the left and right mammograms are presented simultaneously. This allows for the comparison and contrasting of potential differences between the left and right breast. In this case, it is natural to assume that the gist signal could arise from symmetry differences between each breast. However, radiologists are still able to extract the signal from single-breast images, and in cases where the symmetry between breasts is disrupted.17 In the latter case, however, gist-based abnormality detection performance is significantly degraded: if bilateral mammograms are presented but are composed of images from different women (for example, replacing the contralateral mammogram with a normal mammogram from a different woman), there is a corresponding decrease in radiologists’ performance in detecting the abnormal gist signal. This decrease does not lead to floor-level performance; the global symmetry signal is only a single, but apparently important, component of the “gist of abnormality.”
While symmetry is not required for the gist signal, damaging the relationship between the two mammograms of a bilateral does lead to worse performance. There appears to be a helpful signal shared between the left and right mammograms of the same woman that, in part, contributes to the gist signal present in bilateral mammograms. To summarize, the gist of abnormality appears to be a global signal embedded throughout the mammogram, based within the textural variation of the parenchyma. In bilaterals, this gist signal is partially supported by an unknown “global symmetry” component that extends across the left and right mammograms. While asymmetry (in volume and computed texture metrics) has been shown to relate to the risk of developing cancer,21,22 its impact on rapid abnormality detection is still not well understood. We hypothesize that the global symmetry signal captures similarities in the global structure and textural elements of the parenchyma across bilateral mammograms. Under this hypothesis, disruptions or modifications to this signal at a subtle structural or textural level (for example, those caused by breast cancer) improve rapid detection of abnormality, whereas extreme disruptions to the signal (such as replacing a mammogram with another woman’s) lead to decreased performance.
In this work, we employ deep learning to investigate this symmetry signal, aiming first to verify its presence computationally, before analyzing its effect on abnormality detection. We do this by first designing and training a deep neural network (DNN) architecture capable of determining whether a set of mammograms come from the same woman, while accounting for obvious symmetry-breaking differences, such as age, breast size, imaging machine, and density. We then evaluate the effect of symmetry differences on the performance of a state-of-the-art cancer detection neural network. Our data indicate that this symmetry signal is detectable and that DNN cancer detection performance is degraded when this signal is broken or not present. Given that the DNNs are likely learning to extract information from the parenchymal texture, we employ textural patch analysis to further explore the symmetry signal and determine whether a significant difference in textural patterns arises between normal and abnormal textures when controlling for lesion presence.
2. Experiment 1 - Detecting the Symmetry Signal
We first set out to verify the existence of the symmetry signal computationally. While there is evidence for the effect of symmetry in human radiologists, there is not yet a computational method that provides evidence for this signal independent of human observations. Under the hypothesis that the “symmetry signal” is in fact textural similarities and differences between the parenchymas of mammograms drawn from the same woman, detecting these variations is not necessarily intractable. It is, however, a highly difficult task: the symmetry signal is not thought to be particularly strong (being a subcomponent of the gist of abnormality, itself a difficult-to-detect signal), and radiologists cannot report which indicative textural differences they are extracting. Hand-engineering a feature extractor for the symmetry signal is therefore unlikely to be reasonably tractable.
To resolve these issues, we instead turn to feature extractors learned by a DNN via stochastic gradient descent. By providing a DNN with an auxiliary task, which involves having to learn to extract the symmetry signal, we can hunt for its existence without having to hand-design a method to capture it.
2.1. Methods
2.1.1. Dataset
For all our experiments, we use a subset of OPTIMAM, a breast imaging dataset gathered from multiple screening sites throughout the United Kingdom. The subset we use consists of 8428 individual “for presentation” mammograms gathered from 1199 individual patients. The dataset was delivered as DICOMs. We first preprocess the dataset, excluding all images that do not show a good, unobstructed view of the parenchyma. Images used for biopsy purposes, spot-compressed images, and magnified images were removed from the dataset. Text DICOM annotations were removed from the images. The DICOMs were converted into 16-bit PNG files.
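As a rough illustration of this conversion step, the sketch below reads a DICOM and writes a rescaled 16-bit PNG. It is a minimal sketch assuming pydicom and imageio are available; the rescaling choice and file paths are illustrative, not the exact preprocessing pipeline used here.

```python
# Minimal sketch: DICOM -> 16-bit PNG conversion (illustrative rescaling).
import numpy as np
import pydicom
import imageio.v2 as imageio

def dicom_to_png16(dicom_path: str, png_path: str) -> None:
    """Read a 'for presentation' DICOM and save it as a 16-bit PNG."""
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float64)
    # Rescale the raw pixel values into the full 16-bit range.
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels /= pixels.max()
    imageio.imwrite(png_path, (pixels * 65535).astype(np.uint16))
```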
In total, this resulted in 5000 usable PNG images. Of these images, 3628 were captured by a Lorad Selenia, 16 by a MammoDiagnost DR, 852 by a Selenia Dimensions, 4 by a Senographe DS, and 500 by a Senographe Essential. These images make up 1250 exams, of which 480 are considered normal, and 770 considered abnormal.
2.1.2. Training data
The training data for our architectures consist of four mammographic images, one per laterality per view (an “exam”). The non-swapped cases consist of the original data from our preprocessed OPTIMAM subset, described above. In the basic swapped case (before considering other variables, such as age or size), either the left or right mammographic images are swapped with those of another woman. Lateralities are preserved: left images are swapped with left images, and right images with right images. The type of case is also considered. If the case is considered normal (i.e., no abnormal mammograms within the four images that make it up), the swap is completed with another normal case. The laterality of the mammogram swapped in this case does not matter.
For the abnormal case, abnormality is preserved. For example, if the pre-swap case consists of an abnormal left and a contralateral right, the swap is conducted such that the post-swap case also consists of an abnormal left and a contralateral right, yet with one side taken from a different woman. We do not perform swaps that create a case type that did not exist for that patient in the first place, for example, swapping a contralateral right for an abnormal right, which would make both lateralities in a case abnormal when the original presented with abnormality on only one side. Equally, a case containing an abnormal mammogram is never swapped so as to create a case consisting only of contralateral images and no abnormals. Finally, we also exclude any cases that include abnormalities in both breasts (14 cases). We train the neural network with equal amounts of swapped and unswapped cases, which avoids an intrinsic bias in the data toward one class.
We control for four variables: age, size, density, and machine type. A score (or cost) is computed for each possible pairing of mammograms based on these variables (for example, two mammograms with highly different sizes would have a high “pairing cost”), and pairings are then chosen to minimize the overall cost of making each swap. This matches together mammograms that have similar values for each of the variables. Patient age comes with the dataset and is already numerical; the score to be minimized for an age pairing is simply the difference in age. Size is computed by counting the number of non-black pixels present in the mammogram, normalized by the size of the image (generally, smaller breasts occupy less of the photographic field). This approach may not function correctly for magnified images. Conveniently, machine type for each image is also provided with the dataset; it is controlled for by dividing the dataset into subsets (each consisting of images from one machine type) and performing swaps within each subset as described in the basic case above.
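The size proxy and a simple per-variable pairing cost could be computed as in the following sketch, which assumes mammograms are loaded as 2D NumPy arrays; the black-pixel threshold and the unweighted cost combination are illustrative assumptions.

```python
# Minimal sketch: breast-size proxy and a per-variable pairing cost.
import numpy as np

def breast_size_score(mammogram: np.ndarray, black_threshold: int = 0) -> float:
    """Fraction of the image field occupied by breast tissue (non-black pixels)."""
    return float(np.count_nonzero(mammogram > black_threshold)) / mammogram.size

def pairing_cost(age_a: float, age_b: float,
                 size_a: float, size_b: float) -> float:
    """Cost of pairing two sides: larger differences in age or size cost more."""
    return abs(age_a - age_b) + abs(size_a - size_b)
```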
Density is more difficult. The OPTIMAM subset we have access to does not come equipped with density ratings. Traditionally, these would be assigned by trained radiologists, though for our 5000 images this is impractical. Instead, we turn to a neural network for density estimation.23 Using this network, we compute density scores between 1 and 4 for each patient in our dataset. These scores correspond to the radiologist density ratings: (1) almost entirely fatty, (2) scattered areas of fibroglandular density, (3) heterogeneously dense, and (4) extremely dense. To balance using density, we can then minimize the difference in density score for each swap. However, in practice, the density estimator predicts only scores of 2 and 3 over our entire image set, never placing an image at either extreme of the density scale. This is unlikely given the scope and size of our dataset, and likely indicates that the density estimator is not sufficiently capable on this data. However, given the lack of openly available computational breast density estimators, these ratings suffice for our purposes and are preferable to excluding density entirely.
Balancing by all of these factors together is slightly more complex. First, the dataset is divided by machine, to ensure all pairings consist only of mammograms captured by the same imaging machine. Second, age, size, and density are normalized onto the same scale and combined into a composite score for each laterality (e.g., the left CC and left MLO scores are combined into a single score for that laterality, a “set of mammograms” for that side). This allows some flexibility in the matching; sufficient similarity in size and age can outweigh a difference in density, and vice-versa, which may be necessary when a perfect pairing is not possible. Each set of mammograms now has a score accounting for all factors, and the goal is to pair that set with another set with (ideally) an identical score. However, greedily taking the best available pairing each time presents a problem: eventually there may be “leftovers” with highly different scores and no remaining alternatives to pair with. It would be better to accept slightly worse pairings earlier on so that later pairings improve. We resolve this via the Hungarian algorithm,24 which finds the globally optimal set of pairings over all mammogram sets in the dataset.
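The pairing step can be sketched as below using SciPy’s implementation of the Hungarian algorithm (linear_sum_assignment). This is a minimal sketch assuming pre-computed composite scores per mammogram set; the min-max normalization and absolute-difference cost are illustrative choices.

```python
# Minimal sketch: optimal pairing of mammogram sets via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def normalize(values: np.ndarray) -> np.ndarray:
    """Rescale one variable (age, size, or density) onto the [0, 1] range."""
    return (values - values.min()) / (values.max() - values.min() + 1e-8)

def optimal_pairs(scores_a: np.ndarray, scores_b: np.ndarray):
    """Pair each set in pool A with a set in pool B, minimizing total score gap.

    scores_a, scores_b: composite (age + size + density) scores for the two
    pools of candidate sets, e.g. within one machine-type subset.
    """
    cost = np.abs(scores_a[:, None] - scores_b[None, :])  # pairwise score gaps
    rows, cols = linear_sum_assignment(cost)               # globally minimal cost
    return list(zip(rows, cols))
```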
The data are divided into train (70% of the data), validation (10%), and test (20%) sets. The neural network is never trained on the test or validation sets. The validation set allows for hyperparameter tuning without leaking information from the test set to the network.
2.1.3. Training and implementation
We train the architecture shown in Fig. 1. This architecture is superficially similar to breast cancer classification networks,25,26 but contains a subnetwork trained to contrast learned features for same/different classification. The network takes four images as input, corresponding to the two standard mammographic projections, craniocaudal and mediolateral oblique, for both the left and right breast. The output of the neural network classifies the four images as coming from one woman or from two different women. The network first learns features relevant to each view independently. These features are then combined into features that describe either the left or right side of the body, before passing into a final set of layers that consider these features and generate the classification. The feature extractor for each view consists of four convolutional layers with 4, 8, 16, and 16 channels, respectively. The laterality feature extractors consist of three layers with 32 channels each, and the decision network consists of three convolutional layers with 64 channels each, followed by global average pooling27 and a sigmoid non-linearity. Batch normalization28 and ReLU activations are used throughout the network.
Fig. 1.
Neural network architecture for detecting mammograms from the same/different women: a proxy for the symmetry signal.
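A minimal PyTorch sketch of this architecture is given below: per-view extractors (4, 8, 16, 16 channels), per-laterality extractors (32 channels), and a decision network (64 channels) ending in global average pooling and a sigmoid. The channel-wise concatenation of views and sides, the kernel sizes and strides, and the final 1×1 projection before pooling are assumptions for illustration, not the exact design.

```python
# Minimal sketch of the Fig. 1 architecture (details are illustrative).
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Convolution + batch normalization + ReLU, as used throughout the network.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class SymmetryNet(nn.Module):
    def __init__(self):
        super().__init__()
        # One feature extractor per view (L-CC, L-MLO, R-CC, R-MLO).
        self.views = nn.ModuleList([
            nn.Sequential(conv_block(1, 4), conv_block(4, 8),
                          conv_block(8, 16), conv_block(16, 16))
            for _ in range(4)])
        # One extractor per laterality, operating on its two concatenated views.
        self.sides = nn.ModuleList([
            nn.Sequential(conv_block(32, 32), conv_block(32, 32),
                          conv_block(32, 32))
            for _ in range(2)])
        # Decision network for same-woman vs. different-women classification.
        self.decision = nn.Sequential(
            conv_block(64, 64), conv_block(64, 64), conv_block(64, 64),
            nn.Conv2d(64, 1, kernel_size=1),   # project to a single logit map
            nn.AdaptiveAvgPool2d(1))           # global average pooling

    def forward(self, lcc, lmlo, rcc, rmlo):
        left = self.sides[0](torch.cat([self.views[0](lcc),
                                        self.views[1](lmlo)], dim=1))
        right = self.sides[1](torch.cat([self.views[2](rcc),
                                         self.views[3](rmlo)], dim=1))
        out = self.decision(torch.cat([left, right], dim=1))
        return torch.sigmoid(out.flatten(1))   # P(images come from two women)
```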
The network is trained using the binary cross-entropy loss function and the RMSProp optimizer, with a learning rate of 0.0001. We find the network fairly robust to the learning rate except in the most difficult case, when considering all balancing factors. The network was trained for 100 epochs with a batch size of 16, where a single datapoint represents one exam consisting of four mammogram images. The network takes ~27 h to train on four NVidia V100 GPUs.
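A minimal training-loop sketch matching this configuration (binary cross-entropy, RMSProp, learning rate 1e-4, 100 epochs, batch size 16) follows; it builds on the SymmetryNet sketch above, and the `train_loader` and label convention (1 = two different women) are assumptions.

```python
# Minimal training-loop sketch for the same/different classifier.
import torch

model = SymmetryNet().cuda()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
criterion = torch.nn.BCELoss()

for epoch in range(100):
    for lcc, lmlo, rcc, rmlo, label in train_loader:   # one exam per datapoint
        optimizer.zero_grad()
        pred = model(lcc.cuda(), lmlo.cuda(), rcc.cuda(), rmlo.cuda())
        loss = criterion(pred, label.float().unsqueeze(1).cuda())
        loss.backward()
        optimizer.step()
```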
2.2. Results
Here, we design a neural network for the purpose of determining whether a set of four mammograms is composed of images entirely from the same woman, or of images from two different women combined across lateralities (e.g., left images from one woman and right images from a different woman). With this problem, we task the neural network with learning to extract features that are indicative of similarities or differences between the two sets of mammograms. If the symmetry signal is present, it should play a part in the features the neural network learns; mammograms from the same woman should present a strong “symmetric” signal, whereas mammograms from two different women would lack this signal. Given that there are other factors the neural network could learn to utilize for this purpose, we control for the most obvious: size (while small size differences are normal, large size differences are indicative that the mammograms come from different women), age (which can cause parenchymal differences), density (given its significance as a cancer risk factor29,30), and critically, imaging machine brand (which causes contrast/luminance differences in the image content).
Results on the test set in each balancing case are shown in Table 1, indicating the accuracy with which the neural network can determine whether the images come from the same or different women. Given the identical amount of data in the same/different categories, if there were no detectable difference between mammograms from the same woman and those from two different women, expected classification accuracy would lie very close to 50%. However, in all cases we find an above-chance classification accuracy. Even in the most difficult case, where the swaps are balanced by all four factors simultaneously, we still find an above-chance accuracy on the test set. Given that this is a prediction task, we can consider prediction accuracy on our identically balanced test set to be evidence that the classifier has learned to extract a signal that enables this classification to be made. The likelihood that there is no signal for the network to leverage, yet the weights in each test case happen to align to give above-chance accuracy, is extremely close to zero.
Table 1.
Detecting whether a set of mammograms comes from the same woman or from two different women, controlling for size, age, density, and machine type.
| Balancing performed | Test accuracy (%) |
|---|---|
| None | 61 |
| Breast size | 61 |
| Age | 68 |
| Age + breast size | 63 |
| Density | 62 |
| Machine type | 56 |
| All factors | 56 |
3. Experiment 2 - Symmetry and Abnormality
Given the existence of the global symmetry signal, we expect the presence or absence of this signal to have some verifiable impact on cancer detection. In prior work, this effect has been shown in human radiologists. We now aim to verify the impact of the global symmetry signal over a larger dataset of mammographic images, which would be impractical to investigate with human observers. Instead, in this experiment we examine the impact of disrupting bilateral symmetry on the detection abilities of a DNN model trained for breast cancer detection.
3.1. Methods
3.1.1. Data
Starting with our preprocessed OPTIMAM subset, as described in experiment 1, we define four different “swap categories,” based on the original conditions shown to human radiologists.17 These swap categories are listed below (a brief construction sketch follows the list). We do not perform the same balancing as with our symmetry-detecting neural network, as here we aim to disrupt the signal. Given variation in the dataset, it is not always possible to swap all the data without repeats except in the within-category case, due to the difference in normal and abnormal case counts. Repeats are not used to make up the difference.
No-swaps: standard and unaltered OPTIMAM data. This represents the base case. 1250 samples (cases/exams).
Within-category swaps: abnormal mammograms swapped with abnormals from different women, and normal mammograms swapped with normals from different women (1250 total samples).
Contra-to-normal: swapping the contralateral of an abnormal exam with a normal image from a different woman’s normal exam, which preserves the breast images containing the abnormality (806 samples).
Abnormal-to-normal: swapping the abnormal images in the exam with normal images from a different woman’s normal case (preserving the original contralateral signal; 804 samples).
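The two lesion-related swap categories could be constructed along the lines of the sketch below, which assumes each exam is represented as a dictionary holding per-laterality images and an `abnormal_side` field; these data structures and names are illustrative only.

```python
# Minimal sketch: constructing contra-to-normal and abnormal-to-normal swaps.
def contra_to_normal(abnormal_exam: dict, normal_exam: dict) -> dict:
    """Replace the contralateral (healthy) side of an abnormal exam with the
    corresponding side of another woman's normal exam."""
    swapped = dict(abnormal_exam)
    contra = "right" if abnormal_exam["abnormal_side"] == "left" else "left"
    swapped[contra] = normal_exam[contra]   # the lesion-bearing side is kept
    return swapped

def abnormal_to_normal(abnormal_exam: dict, normal_exam: dict) -> dict:
    """Replace the abnormal side, preserving the original contralateral images."""
    swapped = dict(abnormal_exam)
    side = abnormal_exam["abnormal_side"]
    swapped[side] = normal_exam[side]
    return swapped
```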
3.1.2. Procedure
For this experiment, we investigate the effect of symmetry breaking on the cancer detection performance of a state-of-the-art DNN, that of Wu et al.:26 the breast cancer classifier (BCC). This network achieved high performance at distinguishing the following cases from breast mammograms: no abnormality, benign abnormality, and malignant abnormality. We obtained a freely available pretrained version of the network, which means we do not have to train the network on our own dataset and can hence treat our data as “evaluation” data. The data on which the BCC network was originally trained consist of mammographic images captured by the same types of machine as in our dataset; this avoids potential differences caused by evaluating on data unlike that with which the network was pretrained. The input to the network consists of four mammograms, two per laterality and one per screening projection. The output of the network is the likelihood that the exam contains cancer. Predictions range from 0 (no likelihood of cancer) to 1.0 (definitely contains cancer). Generating predictions for our dataset takes about 6 hours on a single NVidia V100 GPU.
We use the network to generate predictions for each of our “swap categories.” Accuracy is defined as one minus the error between the ground truth of the data and the likelihood prediction of the neural network for that exam, where the error is computed via mean squared error. We calculate significance values for the accuracy of each swap category by comparing the squared errors of each category with a two-sample t-test, following Modern Mathematical Statistics with Applications, pages 490 to 491.31
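The evaluation metric and significance test can be expressed as in the sketch below, assuming `preds` holds the BCC likelihood outputs and `labels` the 0/1 ground truth for one swap category; the variable names are illustrative.

```python
# Minimal sketch: accuracy as 1 - MSE, and a two-sample t-test on squared errors.
import numpy as np
from scipy import stats

def category_accuracy(preds: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy defined as one minus the mean squared prediction error."""
    return 1.0 - float(np.mean((preds - labels) ** 2))

def compare_categories(preds_a, labels_a, preds_b, labels_b):
    """Two-sample t-test on the per-exam squared errors of two swap categories."""
    err_a = (np.asarray(preds_a) - np.asarray(labels_a)) ** 2
    err_b = (np.asarray(preds_b) - np.asarray(labels_b)) ** 2
    return stats.ttest_ind(err_a, err_b)
```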
3.2. Results
Prior work17 finds evidence that a lack of textural symmetry degrades radiologists’ performance, but the requirement for human observers limits the dataset to 120 mammograms. With a DNN predictor, we can side-step this requirement, with the caveat that the accuracy of the predictions will not be equivalent to those made by expert radiologists. This allows us to test the thousands of images, and hundreds of cases, that make up our dataset for evidence of symmetry-breaking performance degradation (see Fig. 2).
Fig. 2.
Investigating symmetry-breaking and abnormality detection performance in DNNs. The DNN takes four mammographic images, two for each standard mammographic view. These are: the left and right medio-lateral oblique view and the left and right cranio-caudal view.
Results for each type of swap category are shown in Fig. 3. The y-axis is not percentage correct; it instead represents the difference between the predicted likelihood of cancer being present in the images shown and the ground truth. This “accuracy” value is computed as 1 − error, where error is the mean squared error (MSE) between the ground truth and the DNN predictions. This allows for a graph in which higher values indicate better performance, rather than lower values indicating better performance. Random-chance classifier performance is denoted by the red line, computed by calculating the accuracy between random network outputs and the ground truth. All differences between categories are significant, as is the difference between swap categories and the base case of no swaps ([NoSwaps vs. within-category ], [NoSwaps vs. contra-to-normal ], [NoSwaps vs. abnormal-to-normal ]). We also report the standard error numerically, as the large sample size and low variance render it impractical to display on the graph: [NoSwaps: , within-category: , contra-to-normal: , abnormal-to-normal: ]. In Fig. 4 we show this effect in more detail for a selected subset of normal/abnormal mammograms that are included in all swap categories. Here we show the deviation in neural network responses as a percentage of the deviation expected should the neural network output predictions at chance level. The results are similar to those in Fig. 3, except that for abnormal/normal pairs we do not see a significant difference in deviation, as we do not include the normal mammograms in this analysis. This suggests that swapping abnormal with abnormal, or contralateral with contralateral, preserves some element of the symmetry signal. Error bars are not shown because we are measuring the percentage deviation of prediction error in a trained predictor (there is no variability for this specific predictor).
Fig. 3.
DNN results for predictive performance under differing “swap categories.” Higher is better.
Fig. 4.
Difference in neural network performance for a set of abnormal/normal mammogram pairs common to all categories. In this graph, we show the percentage deviation (as proportion of deviation caused by random predictions) from the “no-swap” baseline. Asterisks indicate significance.
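The Fig. 4 measure could be computed as in the following sketch, which assumes `base`, `swapped`, and `random_preds` are per-exam likelihood outputs for the no-swap baseline, a swap category, and a chance-level predictor, respectively; the names and the use of mean absolute deviation are assumptions for illustration.

```python
# Minimal sketch: deviation from the no-swap baseline, as a percentage of the
# deviation produced by chance-level predictions.
import numpy as np

def percentage_deviation(base: np.ndarray, swapped: np.ndarray,
                         random_preds: np.ndarray) -> float:
    swap_dev = np.mean(np.abs(swapped - base))       # effect of the swap
    chance_dev = np.mean(np.abs(random_preds - base))  # effect of pure chance
    return 100.0 * swap_dev / chance_dev
```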
With no swaps, the network performs well at cancer prediction. Within-category swaps slightly degrade performance. However, replacing the contralateral mammogram with a normal mammogram (despite leaving the abnormal mammogram, with lesion present, in place) results in a decrease in cancer detection ability (within-category vs. contra-to-normal ). Interestingly, preserving the contralateral but replacing the abnormal image with a normal image does not result in predictive performance at chance levels (chance vs. abnormal-to-normal ). Instead, the network can still, to some degree, detect cancer from the contralateral alone.
4. Experiment 3 - Textural Analysis
So far, DNN-based analysis has provided evidence for both verifying the existence of the global symmetry signal, as well as its effect on abnormality detection. However, the features these models are learning to pick up on remain unknown. This is not necessarily an unsolvable problem; certain types of neural networks can be analyzed, revealing the image features relevant to that neural network’s final output. For example, if the neural network classifies an image as containing a bird, it might be that the network has learned to activate strongly on areas of the image that contain wings and a beak. These methods either depend upon preexisting libraries of diagnostic parts (e.g., collections of anatomy parts, textures, patterns), or on human review of the “activation maps,” to verify that the highlighted area is, for example, a wing. There is no existing library of textural components of mammograms, nor is it practical for a human to classify individual radiological textural features; especially when such features may be a composite of many different textural elements.
In lieu of the above approaches, and given that it is highly likely that the neural network has learned to operate on the textural features of the mammograms, we can instead directly investigate the mammogram texture for evidence of global symmetry signal. While this approach will be naturally coarse-grained, detecting evidence of the signal directly in the texture presents additional evidence for the role symmetry plays in rapid abnormality detection, while being independent of the previous DNN-based methods.
4.1. Methods
4.1.1. Patch extraction and filtering
Each patch is automatically extracted from the center of each mammogram of the curated OPTIMAM dataset described in experiment 1. The center is defined as the median point of the non-zero pixels of the mammogram. The patch itself is square, placed such that the computed center of the mammogram is also at the center of the patch. Given that our textural patches are automatically extracted (a necessity given the dataset size), we cannot rule out that one patch contains a lesion while the other patch does not, which would cause an obvious textural difference when comparing the two. To ameliorate this issue, we Fourier filter the textural patches (see Fig. 5), considering both high-pass and low-pass filtering. Low-pass filtering excludes all frequencies present in the image above a certain threshold, while allowing frequencies below that threshold to remain. In contrast, high-pass filtering excludes all frequencies below a certain cutoff. In this case, low-pass filtering reduces the impact of lesions while preserving global structure, whereas high-pass filtering preserves high-frequency changes, such as calcifications and obvious lesions. This also allows us to identify which frequencies are most relevant for textural differences. For both filters, the cutoff frequency is set empirically to 20.48 cycles per image. This preserves enough information in the low-pass case to avoid heavily blurring the texture, while ensuring that discontinuities such as lesions are smoothed out and architectural asymmetries are preserved. Likewise, the chosen value is sufficient in the high-pass case to avoid including background details while still capturing well-defined masses, fibroadenomas, and calcifications.
Fig. 5.
Low pass Fourier filtering (left) versus high-pass (right).
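The filtering step can be sketched as below for square patches held as 2D NumPy arrays. The cutoff of 20.48 cycles per image matches the text, but the hard (ideal) circular filter shape is an assumption for illustration.

```python
# Minimal sketch: ideal low-pass / high-pass Fourier filtering of a patch.
import numpy as np

def fourier_filter(patch: np.ndarray, cutoff: float = 20.48,
                   low_pass: bool = True) -> np.ndarray:
    """Keep frequencies below (low-pass) or above (high-pass) the cutoff."""
    spectrum = np.fft.fftshift(np.fft.fft2(patch))
    h, w = patch.shape
    yy, xx = np.mgrid[:h, :w]
    # Radial frequency of each coefficient, in cycles per image.
    radius = np.hypot(yy - h / 2, xx - w / 2)
    mask = radius <= cutoff if low_pass else radius > cutoff
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
```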
4.1.2. Textural distance metrics and procedure
We choose two commonly used metrics for evaluating the similarity of two textures. The first metric is MSE, which provides a per-pixel distance between two textures

$$\mathrm{MSE}(x, y) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - y_i\right)^2, \tag{1}$$

where $x$ represents one texture patch, $y$ the other, and $i$ enumerates each of the $N$ pixels in the texture. The second measure we use is the structural similarity index (SSIM),32 an alternative that considers how perceptually different two images are by accounting for structural information through the inter-dependencies between spatially close pixels. SSIM compares luminance, contrast, and structure, the last of which evaluates the covariance between the textures.
These distance metrics are then used to compute the textural similarity between one patch and all other patches, for every patch in the dataset (Fig. 6), as well as the distance between patches that come from the same woman. Before computing the distance, one of the textural patches is mirrored, which ensures the correct parts of the texture are being compared. These distances are used to determine if there is a difference between textural content from one woman, compared to two different women. To determine if case type has an effect on textural similarity, we compute the average textural distance between the abnormal and contralateral mammograms of all abnormal cases, and compare it to the average textural distance between the normal mammograms of all normal cases.
Fig. 6.
Computing textural distances between one patch and all other patches.
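The two distance computations, including the mirroring step, could look like the sketch below, which assumes scikit-image’s structural_similarity and patches stored as NumPy arrays; horizontal mirroring of one patch is an assumption about how the lateralities are aligned.

```python
# Minimal sketch: MSE and SSIM between a patch and a mirrored partner patch.
import numpy as np
from skimage.metrics import structural_similarity

def texture_distances(patch_a: np.ndarray, patch_b: np.ndarray):
    """Return (MSE, SSIM) between patch_a and a horizontally mirrored patch_b."""
    patch_b = np.fliplr(patch_b)  # mirror so corresponding regions align
    mse = float(np.mean((patch_a.astype(np.float64)
                         - patch_b.astype(np.float64)) ** 2))
    ssim = structural_similarity(patch_a, patch_b,
                                 data_range=float(patch_a.max()) - float(patch_a.min()))
    return mse, ssim
```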
4.2. Results
We first verify whether square textural patches extracted from the mammogram parenchyma differ significantly between the “same woman” case and the “different woman” case, comparing across women. We find that there is indeed a significant difference between these two cases [one-way analysis of variance (ANOVA), MSE]. Parenchymal textures are more similar between bilateral images from the same woman than between images from two different women; this is likely the difference that the DNN model trained in experiment 1 is detecting.
We then compare the differences in texture between normal and abnormal cases. We first compute the textural distance within two groups: those that have abnormality present in one mammogram (of either side) and those that have no abnormality. That is, for each case in each group, we examine the textural similarity between the left and right mammograms, resulting in a single “similarity score” for that case. We then compare these textural similarity scores between the two groups via one-way ANOVA, for scores computed using the raw images, the low-pass filtered images, and the high-pass filtered images. We show our results in Fig. 7 for all cases. Higher values indicate greater similarity. Where the difference is significant () we denote this with an asterisk (*). When using MSE as our texture-difference metric, we find a significant difference in all cases, suggesting a pixelwise difference in textures between normal and abnormal mammograms from the same woman. However, this metric is not necessarily close to human perception. If we instead use SSIM, a more perceptually based algorithm, we find that the only significant difference between abnormal and normal mammograms occurs in the low-pass filtered images, which preserve the global structure of the texture but lose the high-frequency details.
Fig. 7.
Similarity in textures between abnormal and normal cases. Results for both MSE and SSIM are given. Asterisk indicates significant difference in cases. Greater values indicate greater textural similarity.
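The group comparison itself reduces to a one-way ANOVA over per-case similarity scores, as in this minimal sketch; `normal_scores` and `abnormal_scores` (e.g., SSIM values on low-pass filtered patches) and the significance threshold are illustrative assumptions.

```python
# Minimal sketch: one-way ANOVA between normal-case and abnormal-case scores.
from scipy import stats

def compare_groups(normal_scores, abnormal_scores, alpha: float = 0.05):
    f_stat, p_value = stats.f_oneway(normal_scores, abnormal_scores)
    return f_stat, p_value, p_value < alpha  # flag significance for plotting
```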
5. Discussion
It has been shown that radiologists are capable of extracting the “gist of abnormality,” a signal that enables a radiologist to determine the presence of abnormality in mammograms in sub-second time frames.12,20 However, it appears that across bilateral images, when the “symmetry” between said bilaterals is broken by replacing one mammogram with another woman’s, the efficacy of this gist signal decreases.17 This occurs even if the actual abnormal mammogram is left in place. This effect would be consistent with the existence of a perceptual signal that occurs across the bilateral mammograms19 and which contributes to the gist of abnormality in cases where images from both left and right lateralities (for the same screening projection) are viewed side by side. Given that this signal is extractable at gist-processing speeds, it is highly likely to be a global one,33 and hence to arise from similarities and differences (i.e., symmetry) in the textures of both mammograms of the bilateral image. In this work, our goal is to explore this “global symmetry signal” and, as a result, start to investigate the gist signal of mammograms. We first computationally verified the symmetry signal’s existence via DNNs, before analyzing its effect on abnormality detection and then starting to investigate its textural characteristics.
Presently, the gist of abnormality is not employed clinically, despite the potential for rapid and more importantly very early, detection of cancer. One reason for this may be that gist for medical images is both a relatively weak signal and not yet well understood. As such, clinicians may be hesitant to implement gist-based screening steps, given the high impact of a false positive. It would be valuable to “dissect” the gist signal and come to an understanding of its components. Being able to describe the main elements that compose the gist signal may go a significant way towards reassuring radiologists that gist is a suitable addition to their early detection clinical toolboxes. An additional benefit of better understanding medical gist is the potential ability to boost the gist signal present in the mammograms, leading to better overall detection. With an understanding of the individual components that together make up the gist of abnormality, training radiologists to better detect this signal becomes possible, as does the ability to build computational tools that either detect, or increase, the available gist signal. This is of interest considering gist offers the ability to obtain a first-step indication of abnormality at extremely fast rates compared to standard screening practice, as well as being detectable years before the onset of any visibly actionable lesions.
Our results both reveal new information about the symmetry signal and computationally replicate prior results from Evans et al.17 In experiment 1, we find evidence that the symmetry signal can be detected through computational means: specifically, a DNN. The rationale for this experiment is straightforward: if this signal exists, it should be detectable. Given that gist performance falls when radiologists are shown mammograms from two different women,17 there must be some method of determining that the mammograms seen do not “pair up” in some way. The parenchyma is the most informationally dense region of the mammogram, so it makes sense to assume that the symmetry signal, like its parent gist signal, arises from statistical regularities present in the parenchymal texture. Most likely the signal captures structural and textural similarities between the mammograms presented. If so, these same regularities should be capturable by a suitable feature extractor and could hence be used to identify whether a set of mammograms comes from the same woman, due to the presence of the symmetry signal across the mammograms. We use a DNN classifier model to learn this feature extractor and find that we can determine whether a set of mammograms comes from the same woman or from two different women at above-chance rates. We are able to do this even when the mammograms are paired by size, age, density, and machine type, either separately or accounting for all factors together. This suggests that what we are capturing really is shared textural statistics, and not, for example, size differences indicating that the mammograms are from two different women. Given the elusiveness and overall difficulty of detecting this signal (see Fig. 8 for an example), it is not surprising that current classifier performance over all factors does not exceed 60%. However, it is sufficient to show that some signal is present.
Fig. 8.
Mammograms from different women, but balanced by size, age, density, and machine type, can easily appear superficially similar. These images were captured by the same imaging machine. Balancing by machine type allows for balancing for system properties such as luminance as a by-product. Fully normalizing the images may destroy information about the parenchyma (for example, variation in tissue structure).
In experiment 2, we find evidence that symmetry breaking leads to significant decreases in the cancer detection performance of a state-of-the-art DNN, replicating results from Evans et al.,17 including those finding that cancer detection is possible even in contralateral images. However, whereas prior studies were limited to just over 100 images, here we can explore hundreds of cases and thousands of images. Notably, results from the DNN mirror those from human radiologists. This has the interesting implication that the DNN, tasked with abnormality detection, intrinsically learns to draw on the “symmetry signal,” because doing so leads to improved performance. If this were not the case, we would expect identical performance between same-woman cases and different-woman cases where the abnormal mammogram was left in place. What we find is that performing these swaps does indeed cause quantifiable changes in the network’s predictions. Here, we see evidence that the DNN has learned implicitly to take advantage of information that spreads across multiple mammographic views, and that the swaps do cause changes in the DNN output. It appears that being able to consider multiple mammographic views from the same woman adds information that is useful for cancer detection, even in AI models. Interestingly, we also see from Fig. 4 that, when considering only abnormal-normal pairings, within-category swaps do not cause a significant deviation in predictions compared to the no-swap baseline. This may suggest that enough textural similarities remain between an abnormal mammogram and another woman’s contralateral (from an abnormal case) for the symmetry signal to remain present, albeit weakened. Generally, as DNNs do not learn features that lead to no performance improvement (at least on the training set), we can conclude that global symmetry is important enough for abnormality detection that it is leveraged by both human perception and machine vision.
Finally, in experiment 3, we directly analyze the texture of the parenchyma for evidence of both the symmetry signal, and its interaction with abnormality. This serves to independently verify the DNN results from experiments 1 and 2 and narrow down which textural features the DNN may be learning. In experiments 1 and 2, the lowest levels of performance disruption occur when the mammogram swaps made are “within category,” when abnormal mammograms are swapped with abnormals from a different woman. In this scenario, the overall signal of abnormality should remain similar after the swap. This may suggest that the symmetry signal contributes to the gist of abnormality by capturing differences between normal, and abnormal mammograms from the same woman. Hence, we evaluated whether abnormal cases have textures that are less similar than cases without any abnormality in either mammogram. Comparing normal to abnormal texture patches from the same woman shows that these patches are significantly less similar than comparing two normal textures, also from the same woman. The presence of abnormality disrupts the global structure (as captured by low-pass Fourier filtering, which excludes high-frequency information, such as fine detail) of the mammogram and results in perceptible differences between the normal and abnormal mammograms. This textural difference in the structure of the mammogram may correspond to the symmetry signal, which radiologists are detecting when extracting the gist of bilaterally presented mammograms.
This study has several limitations. Our neural network for detecting the presence of the symmetry signal is the first step in this direction, and future improvements to the architecture and training methods are likely to lead to an improvement in accuracy. We also, at this stage, have not ruled out all possible confounding variables. While we have attempted to account for those most likely to affect the results (age, size, density, and machine), there may be other factors worth considering: for example, cancer type. So far, we have examined this effect over a small subset of mammography data. Given datasets of hundreds of thousands of mammographic images exist, it would be worth verifying the existence of the symmetry signal on these much larger datasets. Additionally, the textural analysis only evaluates patches extracted from the center of the mammograms, rather than the entire texture of the parenchyma. This could be improved by exploring more advanced methods of obtaining texture statistics, as could developing mammographic-specific measures of texture difference. Finally, this work only attempts to verify the existence of the symmetry as a computationally extractable signal, and does not attempt to quantify the potential strength of this signal. Determining the strength of this signal across a wide variety of mammographic cases and its variation under swaps would be a valuable future course of work.
However, quantifying the effect of symmetry (for example, numerically) is not straightforward. It is much easier to train a DNN to use textural symmetry features for classification (same/different, as we have done in this work) than to train it to return the strength of the signal present. This is because no training data exist that contain ground-truth, already-quantified symmetry signal strength for bilateral mammograms, which means the DNN cannot be directly trained to extract it. There are methods that may be able to overcome this, including evaluating a distance measure on the neural network features directly, or making use of the direct difference in parenchymal textures.
Secondly, the symmetry signal would need to be separated from the overall gist signal. This would require baseline gist signal data (from human observers) as well as symmetry signal data over the same set of images, and then a method of disentangling the two, which in turn requires determining how the gist and symmetry signals are entangled in the first place. Recent work on rapid abnormality perception in single-breast images suggests the use of “non-symmetry” textural features. However, beyond recent investigations into the frequency bands in which these features lie,34 these textural features remain unknown, and as such we cannot easily “subtract” them from the symmetry signal. Here, computational techniques, such as radiomic signatures or texture quantification methods, offer the promise of clarifying this.
6. Conclusion
Together, our data suggest that the global symmetry signal is a textural signal that can be detected and analyzed via the appropriate methods. This signal contributes to the gist of abnormality by informing the observer of statistical and structural similarities between two mammograms from the same woman. The symmetry signal appears to be useful for rapid abnormality perception because the presence of said abnormality alters the textural similarities between the breasts. Subtle differences seem to imply abnormality, but drastic differences, where there are minimal shared statistical regularities between the mammograms (such as when mammograms from different women are presented), lead to conflicting information and hence the observed performance decrease. It should be noted that this signal is not required for gist perception in medical images; radiologists are still able to detect abnormality above chance even when the symmetry signal is disrupted. It is also not required when viewing single-breast images. However, when viewing bilaterals, the global symmetry signal appears to contribute to abnormality detection. Our textural analysis findings suggest that this signal may afford recognition of subtle textural disruptions caused by abnormality in one breast that do not occur in the contralateral image. Global symmetry appears to contribute significantly to the ability of radiologists to detect abnormality “in the blink of an eye,” at least in the case of bilateral presentation.
Acknowledgements
This project is supported by Cancer Research UK & EPSRC grant EDDCPJT\100027, awarded to KKE.
Biographies
Cameron Kyle-Davidson is a research associate at the University of York, and primarily develops and applies machine learning techniques to explore psychological phenomena. His current focus is the medical imaging field, aiming to understand the perceptual and cognitive mechanisms by which radiologists detect abnormalities in mammograms.
Lyndon L. Rakusen is a research technician in the Complex Cognitive Perception Lab at the University of York. He is currently investigating abnormality detection in medical imaging, gist perception, and visual and verbal memory. He received his MSc in Cognitive Neuroscience at the University of York, UK, and BSc in Psychology with Neuropsychology at Bangor University, UK.
Emma Raat has a background in biology and neuroscience. She was a PhD student at the University of York, where she investigated the perception of the gist of medical abnormality by medical experts and novices, with a specific focus on spatial frequency components and learning processes.
Roisin Bradley is a consultant radiologist with a special interest in breast imaging at York and Scarborough NHS Trust, and Director of Breast Screening for North Yorkshire. She qualified MBChB from Glasgow University in 1995, with further postgraduate qualifications of MRCP (Glasgow, 1999) and FRCR (Edinburgh, 2009). Her interests include reducing overtreatment of good-prognosis breast cancers, AI in breast imaging, and medical imaging perception.
Karla K. Evans is an associate professor and head of the Complex Cognitive Processing Lab at the Psychology Department, University of York, UK. She received her PhD from Princeton University and performed postdoctoral research at MIT and Harvard Medical School BWH in the United States. Using a variety of techniques, her work focuses on understanding how perceptual information within and across senses and memory gets integrated into a unified complex percept of the world and is applied to real world tasks.
Disclosures
The authors have no relevant financial interests in the manuscript and no other potential conflicts of interest to disclose.
Contributor Information
Cameron Kyle-Davidson, Email: ckd505@york.ac.uk.
Lyndon L. Rakusen, Email: lyndon.rakusen@york.ac.uk.
Emma Raat, Email: emma.raat@york.ac.uk.
Roisin Bradley, Email: roisin.bradley@york.nhs.uk.
Karla K. Evans, Email: karla.evans@york.ac.uk.
References
- 1. Wilkinson L., Gathani T., “Understanding breast cancer as a global health concern,” Br. J. Radiol. 95(1130), 20211033 (2022). 10.1259/bjr.20211033
- 2. Wojtyla C., et al., “European trends in breast cancer mortality, 1980–2017 and predictions to 2025,” Eur. J. Cancer 152, 4–17 (2021). 10.1016/j.ejca.2021.04.026
- 3. Morrell S., et al., “Mammography screening and breast cancer mortality in Australia: an aggregate cohort study,” J. Med. Screen. 19(1), 26–34 (2012). 10.1258/jms.2012.011127
- 4. Schünemann H. J., et al., “Breast cancer screening and diagnosis: a synopsis of the European breast guidelines,” Ann. Internal Med. 172(1), 46–56 (2020). 10.7326/M19-2125
- 5. Njor S., et al., “Breast cancer mortality in mammographic screening in Europe: a review of incidence-based mortality studies,” J. Med. Screen. 19(1_suppl), 33–41 (2012). 10.1258/jms.2012.012080
- 6. Théberge I., et al., “Radiologist interpretive volume and breast cancer screening accuracy in a Canadian organized screening program,” J. Natl. Cancer Inst. 106(3), djt461 (2014). 10.1093/jnci/djt461
- 7. Pisano E. D., et al., “Diagnostic performance of digital versus film mammography for breast-cancer screening,” N. Engl. J. Med. 353(17), 1773–1783 (2005). 10.1056/NEJMoa052911
- 8. Trister A. D., Buist D. S., Lee C. I., “Will machine learning tip the balance in breast cancer screening?” JAMA Oncol. 3(11), 1463–1464 (2017). 10.1001/jamaoncol.2017.0473
- 9. Ekpo E. U., Alakhras M., Brennan P., “Errors in mammography cannot be solved through technology alone,” Asian Pac. J. Cancer Prevent. 19(2), 291 (2018). 10.22034/APJCP.2018.19.2.291
- 10. Lamb L. R., et al., “Missed breast cancer: effects of subconscious bias and lesion characteristics,” Radiographics 40(4), 941–960 (2020). 10.1148/rg.2020190090
- 11. Kundel H. L., Nodine C. F., “Interpreting chest radiographs without visual search,” Radiology 116(3), 527–532 (1975). 10.1148/116.3.527
- 12. Evans K. K., et al., “The gist of the abnormal: above-chance medical decision making in the blink of an eye,” Psychonom. Bull. Rev. 20(6), 1170–1175 (2013). 10.3758/s13423-013-0459-3
- 13. Wolfe J. M., et al., “Visual search in scenes involves selective and nonselective pathways,” Trends Cognit. Sci. 15(2), 77–84 (2011). 10.1016/j.tics.2010.12.001
- 14. Carmody D. P., Nodine C. F., Kundel H. L., “Finding lung nodules with and without comparative visual scanning,” Percept. Psychophys. 29(6), 594–598 (1981). 10.3758/BF03207377
- 15. Treviño M., et al., “Rapid perceptual processing in two- and three-dimensional prostate images,” J. Med. Imaging 7(2), 022406 (2020). 10.1117/1.JMI.7.2.022406
- 16. Brunyé T. T., et al., “Melanoma in the blink of an eye: pathologists’ rapid detection, classification, and localization of skin abnormalities,” Vis. Cognit. 29(6), 386–400 (2021). 10.1080/13506285.2021.1943093
- 17. Evans K. K., et al., “A half-second glimpse often lets radiologists identify breast cancer cases even when viewing the mammogram of the opposite breast,” Proc. Natl. Acad. Sci. U. S. A. 113(37), 10292–10297 (2016). 10.1073/pnas.1606187113
- 18. Gandomkar Z., et al., “Does the strength of the gist signal predict the difficulty of breast cancer detection in usual presentation and reporting mechanisms?” Proc. SPIE 10952, 1095203 (2019). 10.1117/12.2513151
- 19. Brennan P. C., et al., “Radiologists can detect the ‘gist’ of breast cancer before any overt signs of cancer appear,” Sci. Rep. 8, 8717 (2018). 10.1038/s41598-018-26100-5
- 20. Evans K. K., Culpan A.-M., Wolfe J. M., “Detecting the ‘gist’ of breast cancer in mammograms three years before localized signs of cancer are visible,” Br. J. Radiol. 92(1099), 20190136 (2019). 10.1259/bjr.20190136
- 21. Sun W., et al., “Prediction of near-term risk of developing breast cancer using computerized features from bilateral mammograms,” Comput. Med. Imaging Graphics 38(5), 348–357 (2014). 10.1016/j.compmedimag.2014.03.001
- 22. Scutt D., Lancaster G. A., Manning J. T., “Breast asymmetry and predisposition to breast cancer,” Breast Cancer Res. 8(2), 1–7 (2006). 10.1186/bcr1388
- 23. Wu N., et al., “Breast density classification with deep convolutional neural networks,” in IEEE Int. Conf. Acoust. Speech and Signal Process., IEEE, pp. 6682–6686 (2018). 10.1109/ICASSP.2018.8462671
- 24. Kuhn H. W., “The Hungarian method for the assignment problem,” Nav. Res. Logist. Q. 2(1-2), 83–97 (1955). 10.1002/nav.3800020109
- 25. Yala A., et al., “Toward robust mammography-based models for breast cancer risk,” Sci. Transl. Med. 13(578), eaba4373 (2021). 10.1126/scitranslmed.aba4373
- 26. Wu N., et al., “Deep neural networks improve radiologists’ performance in breast cancer screening,” IEEE Trans. Med. Imaging 39(4), 1184–1194 (2019). 10.1109/TMI.2019.2945514
- 27. Lin M., Chen Q., Yan S., “Network in network,” arXiv:1312.4400 (2013).
- 28. Ioffe S., Szegedy C., “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in Int. Conf. Mach. Learn., PMLR, pp. 448–456 (2015).
- 29. Boyd N. F., et al., “Mammographic density: a heritable risk factor for breast cancer,” Cancer Epidemiol. 472, 343–360 (2009). 10.1007/978-1-60327-492-0_15
- 30. Vachon C. M., et al., “Mammographic density, breast cancer risk and risk prediction,” Breast Cancer Res. 9(6), 1–9 (2007). 10.1186/bcr1829
- 31. Devore J. L., Berk K. N., Carlton M. A., Modern Mathematical Statistics with Applications, Vol. 285, Springer (2012).
- 32. Wang Z., et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. 13(4), 600–612 (2004). 10.1109/TIP.2003.819861
- 33. Raat E., Kyle-Davidson C., Evans K., “Using global feedback to induce learning of gist of abnormality in mammograms,” Cognit. Res. Princ. Implic. 8(1), 1–22 (2023). 10.1186/s41235-022-00457-8
- 34. Raat E. M., et al., “The effect of spatial frequency on perceiving the gist of abnormality in mammograms,” J. Vis. 21(9), 2294–2294 (2021). 10.1167/jov.21.9.2294