Abstract
Objectives
Artificial intelligence (AI) could facilitate and objectify quality assessment in daily routine. The purpose of this study was to explore the extent to which an AI prototype algorithm is able to replicate the perfect-good-moderate-inadequate (PGMI) image quality classification system.
Materials and methods
From a multicentre case collection, 200 standard mammograms (800 images) were selected. A deep learning-based prototype software was used to rate the images in analogy to the PGMI system. The AI results were compared with a reference standard obtained through consensus reading by three expert radiographers and one expert radiologist, using quadratically weighted Cohen’s kappa with confidence intervals (CIs) and context-based interpretation. The frequency of and reasons for disagreement were evaluated for challenging cases with a discrepancy of two or more grades or a discrepancy in assigning an “inadequate” grade.
Results
For overall PGMI per image, slight agreement between human consensus and AI was observed for CC views (κ = 0.14) and fair agreement for MLO views (κ = 0.25). The highest agreement was observed for the CC category “M. Pectoralis visibility” (substantial, κ = 0.75). Best category in MLO was “Pectoralis angle” (moderate, κ = 0.49). For other categories, fair, slight or poor agreement was observed. The work-up of disagreement gave insight into misinterpretations of anatomical landmarks and causality issues in the categorization.
Conclusion
Transforming the PGMI system into a fully automated AI algorithm is challenging and may differ substantially between subcategories. Further research in computer science and quality assessment methodology is needed to pave the way for AI-based objective quality management in mammography.
Critical relevance statement
Profound evaluation of AI algorithms and their ability to replicate human interpretation, scoring, and classification are the basis and scientific framework toward AI-based objective quality management in mammography.
Key Points
AI has huge potential for automated assessment of diagnostic image quality.
Compared with human readings, substantial disagreement may also be found.
Direct transformation of perfect-good-moderate-inadequate scoring into an AI algorithm is challenging.
Graphical Abstract
Keywords: Mammography, Quality, PGMI, Software, AI
Introduction
Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer death among women worldwide [1]. Mammography is considered the best screening tool for the disease, and about 40 million examinations are performed each year in the United States alone [2]. High-quality images are a prerequisite for high sensitivity and specificity of mammographic screening [3]. Correct positioning of the breast is essential to fully visualize and correctly display all relevant tissue to avoid misinterpretation of features or missed tumors [4, 5]. As radiographers make autonomous decisions throughout the process and in their interaction with women, awareness of their responsibility and reflection on their own performance should be encouraged [6]. Various classification and grading systems have been developed to determine the image quality of a mammogram [7]. The Perfect-Good-Moderate-Inadequate (PGMI) system, originally conceived for mammography screening by the National Health Service in the United Kingdom (NHSBSP), has been used and adapted internationally for decades [3, 7–10]. It corresponds to a catalog of criteria that enables a systematic analysis of each view, craniocaudal (CC) and mediolateral oblique (MLO), resulting in an overall image quality score. Available guidelines from different countries state that ≥ 75% of the screening images should reach a P or G, ≤ 22% M, and ≤ 3% I [3, 11]. However, despite the given rules and recommendations, interpretation remains subjective and is prone to inter- and intrareader variability [7, 12]. Hofvind et al [3] observed that the distributions of given PGMI grades differed significantly between local readers and a superior expert. Hill and Robinson [6] analyzed the vague wording and misleading definitions in the guidelines and concluded that the tool is neither reliable nor valid. Boyce et al [9] also addressed these ambiguities and found poor agreement between results evaluated by assessors from different countries.
Alukic et al [13] found mainly poor agreement among the evaluations of five radiographers, with subjectivity being a major influence despite clearly communicated rules. Furthermore, organized evaluation of image quality and dissemination of results and improvement actions require resources and are time-consuming [7, 14]. Consequently, randomly selected images are usually assessed for each radiographer; these may not be fully representative and may not capture all complex cases. Efforts should thus be made to find a practical, highly efficient method that gives comprehensive support to radiographers and fits into their daily routine [6].
Use of artificial intelligence (AI) generally opens opportunities for facilitating and accelerating working processes and might also help to simplify and automate aspects of image quality assurance [2, 12, 13]. First prototypes of software solutions for quality evaluation of mammograms have been introduced, which may provide live feedback during the examination or report quality assessment and monitoring data retrospectively [12, 14]. These systems might be able to replace subjective human measurements, perform faster, be more comprehensive, and identify training needs distinctly enough to enable targeted interventions [12]. However, there is limited knowledge about the performance of these systems and how well they can replicate human interpretations, scorings, and classifications. The aim of this study was to evaluate whether an AI algorithm matches human reference PGMI scores derived from clinical routine data. Further, we wanted to understand and describe the reasons for major discrepancies.
Materials and methods
The study was approved by the ethical committee (Medical University of Innsbruck, Austria, reference number 1321/2021), which waived the requirement for informed patient consent.
Study population
We received image data from 200 anonymized standard digital mammography examinations from 13 sites in university hospitals and private clinics, 100 from women residing in Switzerland, and 100 from women residing in Austria. The mammograms were performed in the period from June to August 2021 with mammography machines from three vendors (GE n = 26; Hologic n = 52 and Siemens n = 122). All women had four images, two CC and two MLO, resulting in 800 images in total.
Human PGMI reference and AI-based evaluation of image quality
The Austrian version of the PGMI system [11] was used and slightly refined to avoid presumably vague definitions ([6, 9], Appendix 1). The analysis included 16 criteria for CC and 15 criteria for MLO (Appendix 1). In addition, a summary PGMI score was provided for each image. All 800 images were independently classified by three expert radiographers, each with more than 10 years of experience using PGMI in Austria (T.S.), Norway (J.S., T.S.), or Switzerland (S.F., T.S.). During all reading sessions, image parameters were hidden. To establish a human reference standard, the three independent assessments were reviewed and consolidated by a radiographer (T.S.) and a breast radiologist (W.S.), both with over 10 years of experience in breast imaging in Austria, Norway, and Switzerland. For each of the 800 images, T.S. and W.S. manually examined the ratings across all individual positioning criteria for both CC and MLO views. A consensus decision was reached for each criterion, and these decisions were then used to derive a final PGMI score per image. In cases of disagreement or ambiguity among the three original ratings, T.S. and W.S. jointly reviewed the image and resolved differences through consensus-based adjudication. Their intervention was not to judge or overrule the decisions of the three readers, but solely to resolve cases in which no consensus result could be produced. They were blinded to the AI outputs when performing the adjudication. This double-review approach ensured consistent scoring and provided an arbitration mechanism when majority voting alone was insufficient.
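The per-criterion consensus procedure described above (unanimity or majority vote, with adjudication reserved for three-way splits) can be sketched as follows; the function name and grade encoding are hypothetical, for illustration only:

```python
from collections import Counter

def consensus_grade(ratings):
    """Per-criterion consensus from three independent PGMI ratings.

    Returns the unanimous or majority grade; returns None when all three
    readers disagree, signalling that human adjudication is required.
    Hypothetical helper; not the study's actual tooling.
    """
    grade, count = Counter(ratings).most_common(1)[0]
    if count >= 2:  # unanimous (3) or majority (2) decision
        return grade
    return None  # three-way split: adjudication by the two senior reviewers
```

For example, ratings of P, P, G yield P by majority, while P, G, M would be flagged for adjudication.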
The 800 images were classified as either challenging or non-challenging based on predefined criteria for challenging cases. An image was considered challenging if any of the following conditions applied: (1) accurate measurement of the pectoralis nipple line (PNL) and its deviation was not possible due to a short pectoralis muscle on the MLO view combined with its absence on the CC view; (2) the pectoralis muscle and/or nipple appeared frayed or blurred, hindering precise PNL measurement; (3) the nipple was not clearly visible due to anatomical, positioning, or technical factors (e.g., inadequate compression, suboptimal exposure, image noise, or poor contrast); or (4) differentiation between skinfolds and scars was difficult.
The research prototype software (Hera-MI Mammography Technical Evaluation) was based on an existing CE-certified product (Breast SlimView), extended with developments for quality assessment (Mammography Technical Evaluation). The specific algorithm relied on a deep neural network trained in a supervised, multi-task manner on a large set of multivendor data independent of the present study (seven institutions in Europe; training 12,000 images, validation 2400, testing 1200; Fujifilm (26%), Hologic (22%), GE (17%), IMS Giotto (10%), Siemens (10%), Planmed (7%)), based on annotations generated by three professionals with 1–5 years of experience in mammography who were neither the readers of this study nor known to this group, as similarly described in Tardy et al [15]. The software performed the following classification tasks: nipple correctness, IMF correctness, and pectoral muscle presence. The following objects were segmented: pectoral muscle, nipple, IMF, and glandular tissue. The correctness assessments for the nipple and IMF, as well as the glandular tissue result, were taken directly from the network output. Other outputs, such as the angle and length measurements (e.g., PNL, pectoral muscle angle), were calculated from the segmentation results rather than obtained directly by training on PGMI-labeled datasets. This limits the fully AI-driven mode of operation but contributes to the explainability of the algorithm.
Preparation and validation of the algorithm were conducted by the AI provider; no additional training was performed during the study. The evaluation of the algorithm relied on the ground truth generated by skilled professionals as part of the research and development process of the manufacturer. Each model component was validated independently; overall PGMI-mapping performance was evaluated using a separate PGMI-labeled dataset. To verify that angle and length outputs met the accuracy requirements for clinical quality control, mean absolute errors (MAE) were computed. MAE for pectoral angle was below 2 degrees, while MAE of PNL was below 2 mm.
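The accuracy check on the derived measurements amounts to a plain mean absolute error against reference measurements; a minimal sketch, for illustration only (the manufacturer's actual validation pipeline is not described in code in this study):

```python
def mean_absolute_error(predictions, references):
    """MAE between model-derived measurements (e.g., pectoral angle in
    degrees or PNL length in mm) and reference measurements."""
    if len(predictions) != len(references):
        raise ValueError("inputs must have equal length")
    return sum(abs(p - r) for p, r in zip(predictions, references)) / len(predictions)
```

An MAE below 2 (degrees or mm, depending on the measurement) would meet the tolerances reported above.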
A separate, deterministic rule layer maps the information from all measurements to a scale reflecting the degree of deviation from the ideal image. In the software, this was translated into grades on a scale from 0 to 10. The conversion into the PGMI categories required for the study was defined as follows: I = [0, 2], M = [3, 5], G = [6, 9], P = 10. Thus, the AI performs detection and measurement, while the logical layer implements a transparent, rule-based stratification.
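The threshold mapping above can be expressed as a simple deterministic rule; a sketch using the study's bins (I = [0, 2], M = [3, 5], G = [6, 9], P = 10), with a hypothetical function name:

```python
def score_to_pgmi(score):
    """Map the software's 0-10 quality grade to a PGMI category,
    using the bins defined in the study."""
    if not 0 <= score <= 10:
        raise ValueError("grade must lie between 0 and 10")
    if score <= 2:
        return "I"  # inadequate
    if score <= 5:
        return "M"  # moderate
    if score <= 9:
        return "G"  # good
    return "P"      # perfect (only a flawless 10 maps to P)
```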
The visible output of the software would include a report per study with a global grade and a list of deficiencies per image (Appendix 2).
For the study, the PGMI criteria and rules were matched as closely as possible to the scheme used by the human readers.
The software version used in this study focused on positioning criteria. Skinfold and blur/motion detection were not implemented, as these features were outside the scope of the current release and scheduled for future iterations. We discuss implications for benchmarking against expert readers in the “Limitations” section. Appendix 1 provides an overview of which criteria were assessed by the human readers and also processed by the software, as well as which of the scored criteria were included in the statistical analysis of the study.
The software results for the 800 images were entered into the predesigned case report form (CRF) for comparison with the human consensus reading.
Statistical analysis and review
We descriptively presented the distribution of PGMI scores by frequencies and percentages for the human consensus and for the AI system. The results were stratified by view (CC and MLO) and further by non-challenging and challenging cases.
The agreement between the human consensus and the AI system for each PGMI category was summarized using quadratically weighted Cohen’s kappa [16], including the corresponding confidence intervals (CIs). Agreement was also presented for the overall PGMI value in CC and MLO, supplemented by confusion matrices. The strength of agreement was interpreted in a context-dependent manner, as the Landis and Koch thresholds (< 0 poor, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect agreement) [17] strictly apply only to unweighted Cohen’s kappa.
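The study computed this statistic with Stata's kappaetc package; purely for illustration, a self-contained sketch of quadratically weighted Cohen's kappa for two ordinal raters might look like this:

```python
def quadratic_weighted_kappa(rater_a, rater_b, categories):
    """Quadratically weighted Cohen's kappa for two ordinal raters.

    `categories` lists the ordered levels, e.g. ["P", "G", "M", "I"].
    The disagreement weight between levels i and j is (i - j)^2 / (k - 1)^2.
    Illustrative sketch; the study used Stata's kappaetc package.
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    # observed joint proportions over category pairs
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1.0 / n
    # marginal proportions for each rater
    ma = [sum(row) for row in obs]
    mb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    w = lambda i, j: (i - j) ** 2 / (k - 1) ** 2
    observed = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w(i, j) * ma[i] * mb[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected  # expected > 0 unless ratings are constant
```

Identical ratings give κ = 1, and chance-level agreement gives κ ≈ 0; off-by-one disagreements are penalized far less than three-level disagreements, which is why weighted kappa suits ordinal PGMI grades.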
Statistics were calculated in Stata (StataCorp. 2023. Stata Statistical Software: Release 19. College Station, TX: StataCorp LLC) using the kappaetc package (Kolenikov [18]), R Statistical Software (R Core Team v. 3.3.0), and Microsoft Excel Version 2409.
All images in the challenging group, in which the overall PGMI between the human consensus and the AI differed by two or three levels, or in which there was disagreement about inadequacy, were reviewed by an expert radiographer (T.S.). To identify the reason for deviation, the content of the CRFs was manually compared value by value and linked to the respective image. Findings for discrepancy between the human readers and the AI system were described qualitatively (expert radiographer review).
Results
Consensus intervention
The three readers agreed unanimously in 156 (30%) of the non-challenging cases and 90 (32.14%) of the challenging cases. A majority vote produced the decision in 347 (66.73%) and 168 (60%) of the cases, respectively. Adjudication by T.S./W.S. was necessary in 13 (2.5%) non-challenging and 17 (6.07%) challenging cases. In 4 (0.76%) and 5 (1.79%) of the cases, respectively, no reliable interpretation was possible due to unclear landmarks in the images.
Descriptives
Frequencies and percentages of perfect, good, moderate, and inadequate scores for each PGMI category for the 400 CC images and 400 MLO images, stratified by human reference and artificial intelligence, are depicted in Table 1. For CC images, no difference in classification between the human reference and AI was observed in 88% (351/400) for the category “M. Pectoralis visibility,” 57% (226/400) for “Nipple orientation,” and 46% (182/400) for “Nipple in profile” (Fig. 1, Table 2). For “Lateral gland depiction,” 11% (43/400) had a difference of 3 levels between human reference and AI. In MLO, no difference in the PGMI classification between human reference and AI was observed in 70% (278/400) for the category “Pectoralis angle,” followed by 55% (221/400) for “PNL comparison” and 48% (191/400) for “IMF visibility” (Fig. 2, Table 2). A difference of 2 and 3 levels was observed in 18% (71/400) and 10% (39/400) of the category “M. Pectoralis relaxation and length,” respectively, and a difference of 2 levels was observed in 17% (69/400) of the category “Nipple in profile.” For the overall PGMI, a difference of 2 or more levels was observed in 5% (15/400) for CC and 6% (21/400) for MLO, respectively.
Table 1.
Frequencies and percentages of P, G, M, and I (perfect, good, moderate, inadequate) scores for each PGMI category for 400 CC images and 400 MLO images
| | Human readers | | | | | | | | | | Artificial intelligence | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Perfect | | Good | | Moderate | | Inadequate | | Error/NA | | Perfect | | Good | | Moderate | | Inadequate | |
| Category | n | % | n | % | n | % | n | % | n | % | n | % | n | % | n | % | n | % |
| Craniocaudal images (n = 400) | ||||||||||||||||||
| M. Pectoralis visibility | 138 | 35% | 256 | 64% | 0 | 0% | 0 | 0% | 6 | 2% | 119 | 30% | 281 | 70% | 0 | 0% | 0 | 0% |
| Pectoralis-nipple-line comparison | 124 | 31% | 110 | 28% | 117 | 29% | 25 | 6% | 24 | 6% | 205 | 51% | 147 | 37% | 48 | 12% | 0 | 0% |
| Nipple in profile | 291 | 73% | 60 | 15% | 24 | 6% | 17 | 4% | 8 | 2% | 159 | 40% | 164 | 41% | 77 | 19% | 0 | 0% |
| Nipple orientation | 191 | 48% | 114 | 29% | 70 | 18% | 12 | 3% | 13 | 3% | 299 | 75% | 83 | 21% | 18 | 5% | 0 | 0% |
| Medial gland depiction | 149 | 37% | 156 | 39% | 56 | 14% | 16 | 4% | 23 | 6% | 400 | 100% | 0 | 0% | 0 | 0% | 0 | 0% |
| Lateral gland depiction | 91 | 23% | 114 | 29% | 107 | 27% | 62 | 16% | 26 | 7% | 365 | 91% | 0 | 0% | 35 | 9% | 0 | 0% |
| Skinfolds | 165 | 41% | 183 | 46% | 43 | 11% | 2 | 1% | 7 | 2% | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. |
| Overall PGMI | 20 | 5% | 166 | 42% | 159 | 40% | 47 | 12% | 8 | 2% | 38 | 10% | 228 | 57% | 134 | 34% | 0 | 0% |
| Mediolateral oblique images (n = 400) | ||||||||||||||||||
| M. Pectoralis shape and length | 111 | 28% | 151 | 38% | 78 | 20% | 50 | 13% | 10 | 3% | 345 | 86% | 27 | 7% | 28 | 7% | 0 | 0% |
| M. Pectoralis angle | 147 | 37% | 198 | 50% | 38 | 10% | 12 | 3% | 5 | 1% | 102 | 26% | 249 | 62% | 23 | 6% | 26 | 7% |
| Pectoralis-nipple-line comparison | 292 | 73% | 66 | 17% | 18 | 5% | 8 | 2% | 16 | 4% | 235 | 59% | 139 | 35% | 26 | 7% | 0 | 0% |
| Nipple in profile | 230 | 58% | 86 | 22% | 42 | 11% | 27 | 7% | 15 | 4% | 132 | 33% | 106 | 27% | 162 | 41% | 0 | 0% |
| Inframammary fold visibility | 159 | 40% | 118 | 30% | 66 | 17% | 49 | 12% | 8 | 2% | 211 | 53% | 138 | 35% | 51 | 13% | 0 | 0% |
| Skinfolds | 102 | 26% | 209 | 52% | 78 | 20% | 11 | 3% | 0 | 0% | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. | n.a. |
| Overall PGMI | 16 | 4% | 120 | 30% | 203 | 51% | 60 | 15% | 1 | 0% | 13 | 3% | 150 | 38% | 211 | 53% | 26 | 7% |
Results are stratified by human consensus and artificial intelligence (AI). Error/NA referred to cases in which no consensus was possible or there was insufficient information on the image to give a fair grade
Fig. 1.
Differences in the PGMI (Perfect, Good, Moderate, Inadequate) scoring between the human consensus and the AI in CC images. Error/NA referred to no consensus possible, or too little information on the image to give a fair grade
Table 2.
Differences in the PGMI (Perfect, Good, Moderate, Inadequate) scoring between the human consensus and the AI
| | No difference | | Difference = 1 | | Difference = 2 | | Difference = 3 | | Error/NA | |
|---|---|---|---|---|---|---|---|---|---|---|
| Category | n | % | n | % | n | % | n | % | n | % |
| Craniocaudal images (n = 400) | ||||||||||
| M. Pectoralis visibility | 351 | 88% | 43 | 11% | 0 | 0% | 0 | 0% | 6 | 2% |
| PNL comparison | 165 | 41% | 149 | 37% | 59 | 15% | 3 | 1% | 24 | 6% |
| Nipple in profile | 182 | 46% | 165 | 41% | 45 | 11% | 0 | 0% | 8 | 2% |
| Nipple orientation | 226 | 57% | 132 | 33% | 29 | 7% | 0 | 0% | 13 | 3% |
| Medial gland depiction | 149 | 37% | 156 | 39% | 56 | 14% | 16 | 4% | 23 | 6% |
| Lateral gland depiction | 85 | 21% | 133 | 33% | 113 | 28% | 43 | 11% | 26 | 7% |
| Overall PGMI | 185 | 46% | 192 | 48% | 15 | 4% | 0 | 0% | 8 | 2% |
| Mediolateral oblique images (n = 400) | ||||||||||
| M. Pectoralis relaxation and length | 126 | 32% | 154 | 39% | 71 | 18% | 39 | 10% | 10 | 3% |
| Pectoralis angle | 278 | 70% | 102 | 26% | 9 | 2% | 6 | 2% | 5 | 1% |
| PNL comparison | 221 | 55% | 140 | 35% | 21 | 5% | 2 | 1% | 16 | 4% |
| Nipple in profile | 172 | 43% | 144 | 36% | 69 | 17% | 0 | 0% | 15 | 4% |
| IMF visibility | 191 | 48% | 160 | 40% | 39 | 10% | 2 | 1% | 8 | 2% |
| Overall PGMI | 218 | 55% | 160 | 40% | 20 | 5% | 1 | 0% | 1 | 0% |
A difference of 1 could, for example, mean that AI scored an image as P and human consensus as G. If the difference was 3, AI had scored P and human consensus I, or the other way around. Error/NA referred to no consensus possible or too little information on the image to give a fair grade. Results stratified by non-challenging and challenging cases are given in Appendix 3
Fig. 2.
Differences in the PGMI (Perfect, Good, Moderate, Inadequate) scoring between the human consensus and the AI in MLO images. Error/NA referred to no consensus possible, or too little information on the image to give a fair grade
For CC images, the highest agreement between the human reference and the AI system was observed for “M. pectoralis visibility” (κ = 0.75) (Table 3). Moderate agreement was observed for the overall PGMI for CC images (κ = 0.41). For the MLO images, the strongest agreement (κ = 0.57) between human consensus and AI was observed for “Pectoralis angle.” Agreement for the overall PGMI rating was slightly lower in MLO than in CC (κ = 0.38). Across the remaining categories in both CC and MLO views, agreement tended to be lower.
Table 3.
Quadratically weighted Cohen’s kappa and 95% confidence intervals (CI) between human consensus and the AI for craniocaudal (CC) and mediolateral oblique (MLO) images for each PGMI category (Perfect, Good, Moderate, Inadequate), and for all, non-challenging and challenging cases
| | All cases, n = 400 | | Non-challenging, n = 260 | | Challenging, n = 140 | |
|---|---|---|---|---|---|---|
| Craniocaudal images (CC) | n | Kappa (95% CI) | n | Kappa (95% CI) | n | Kappa (95% CI) |
| M. Pectoralis visibility | 394 | 0.75 (0.68–0.82) | 255 | 0.74 (0.65–0.83) | 139 | 0.77 (0.66–0.89) |
| PNL comparison | 376 | 0.33 (0.25–0.41) | 245 | 0.47 (0.38–0.56) | 131 | 0.11 (−0.02 to 0.23) |
| Nipple in profile | 392 | 0.33 (0.25–0.41) | 257 | 0.48 (0.37–0.60) | 135 | 0.08 (−0.01 to 0.17) |
| Nipple orientation | 387 | 0.48 (0.40–0.56) | 252 | 0.46 (0.36–0.56) | 135 | 0.47 (0.35–0.59) |
| Medial gland depiction | 377 | 0.00 (0.00–0.00) | 247 | 0.00 (−0.00 to 0.00) | 130 | −0.00 (−0.00 to −0.00) |
| Lateral gland depiction | 374 | 0.08 (0.02–0.13) | 243 | 0.00 (−0.06 to 0.07) | 131 | 0.13 (0.05–0.20) |
| Overall PGMI | 392 | 0.41 (0.34–0.47) | 256 | 0.37 (0.34–0.50) | 136 | 0.30 (0.18–0.42) |
| Mediolateral oblique images (MLO) | ||||||
| M. Pectoralis relaxation and length | 390 | 0.08 (0.03–0.14) | 253 | 0.09 (0.01–0.17) | 137 | 0.08 (0.02–0.15) |
| Pectoralis angle | 395 | 0.57 (0.47–0.67) | 259 | 0.59 (0.47–0.71) | 136 | 0.53 (0.35–0.71) |
| PNL comparison | 384 | 0.26 (0.15–0.37) | 252 | 0.39 (0.25–0.53) | 132 | 0.06 (−0.12 to 0.23) |
| Nipple in profile | 385 | 0.39 (0.31–0.46) | 250 | 0.63 (0.55–0.71) | 135 | −0.00 (−0.06 to 0.05) |
| IMF visibility | 392 | 0.49 (0.45–0.57) | 260 | 0.52 (0.43–0.62) | 132 | 0.46 (0.36–0.55) |
| Overall PGMI | 399 | 0.38 (0.29–0.47) | 260 | 0.36 (0.25–0.48) | 139 | 0.12 (−0.01 to 0.26) |
The number of images included in the PGMI analysis for each category is also provided
Confusion matrices with the number of images classified as P, G, M and I by AI and human consensus in the overall PGMI can be seen in Table 4.
Table 4.
Confusion matrices showing frequencies classified as P, G, M, and I (Perfect, Good, Moderate, Inadequate) by artificial intelligence (AI) and human consensus for the overall PGMI of craniocaudal (CC) and mediolateral oblique (MLO) images, for all cases, non-challenging cases, and challenging cases
| | | CC images: Human consensus | | | | MLO images: Human consensus | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | P | G | M | I | P | G | M | I |
| All cases (n = 392 for CC; n = 399 for MLO) | | | | | | | | | |
| AI | P | 8 | 28 | 2 | 0 | 3 | 5 | 4 | 1 |
| | G | 11 | 110 | 90 | 12 | 7 | 74 | 64 | 4 |
| | M | 1 | 28 | 67 | 35 | 6 | 35 | 128 | 42 |
| | I | 0 | 0 | 0 | 0 | 0 | 6 | 7 | 13 |
| Non-challenging cases (n = 256 for CC; n = 260 for MLO) | | | | | | | | | |
| AI | P | 8 | 28 | 2 | 0 | 3 | 5 | 4 | 1 |
| | G | 7 | 77 | 65 | 7 | 7 | 72 | 62 | 3 |
| | M | 0 | 11 | 38 | 13 | 2 | 14 | 60 | 13 |
| | I | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 6 |
| Challenging cases (n = 136 for CC; n = 139 for MLO) | | | | | | | | | |
| AI | P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | G | 4 | 33 | 25 | 5 | 0 | 2 | 2 | 1 |
| | M | 1 | 17 | 29 | 22 | 4 | 21 | 68 | 29 |
| | I | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 7 |
For non-challenging images, fair agreement was observed in the overall PGMI in both CC and MLO (κ = 0.37 and κ = 0.36). For challenging images, agreement was reduced to varying degrees, being moderately lower in CC (κ = 0.30) and markedly lower in MLO (κ = 0.12).
For challenging images, agreement was low in the categories “PNL comparison” and “Nipple in profile” in both CC and MLO views. Details can be seen in Table 3.
We found a difference of 3 levels for “Medial gland depiction” (5%, 12/260) and “Lateral gland depiction” (7%, 17/260) for non-challenging CC cases (Appendix 3). It was 3% (4/140) for “Medial gland depiction” and 19% (26/140) for “Lateral gland depiction” for the challenging cases. A difference of 3 levels was found in 6% (15/260) of the “M. Pectoralis relaxation and length” for the non-challenging cases and in 17% (24/140) of the challenging cases.
Reasons for the discrepancy between human reference and AI reading
A total of 75 images (33 CC and 42 MLO) from the challenging group met the conditions for further evaluation to identify reasons for discrepancy. The reasons and frequency of discrepancy between human reference and AI reading are given in Table 5. In 17 (23%) images, the M. Pectoralis was incorrectly identified in MLO (Fig. 3). Such cases occurred throughout, independent of the image contrast and postprocessing of the different vendors. In 19 (25%) images, the AI’s PNL measurement was not readily comprehensible, and the landmarks on which the measurement was based remained unclear (Fig. 4). Uncertainty as to whether the nipple was correctly registered and classified by the AI was a reason for discrepancy between human consensus and AI in the same number of cases. In 12 (16%) CC images, the nipple orientation deviated while the lateral glandular tissue was simultaneously cut off (Fig. 5). Further representative images for each of the reasons for discrepancy listed in Table 5 are given in Appendix 4.
Table 5.
Reasons and frequency of discrepancy between human and AI reading with difference of 2 or more grades or disagreement about inadequateness in 75 images
| Reasons for discrepancy in human versus AI classification | Frequency (n) | Percentage (%) | |
|---|---|---|---|
| 1 | Error in identification of the M. Pectoralis (MLO) | 17 | 23% |
| 2 | Error in PNL measurement (CC and/or MLO) | 19 | 25% |
| 3 | Uncertainty about correct identification of the nipples (CC and/or MLO) | 19 | 25% |
| 4 | Error in recognizing out-of-profile nipple (CC and/or MLO) | 3 | 4% |
| 5 | Error in categorizing nipple orientation when breast is rotated (CC) | 12 | 16% |
| 6 | Error in capturing insufficient breast tissue (MLO) | 6 | 8% |
| 7 | Unable to recognize skinfolds (CC and/or MLO) | 9 | 12% |
| 8 | Dominance of a single category on overall score | 4 | 5% |
| 9 | Unclear rationale of software scoring | 8 | 11% |
Fig. 3.
Example for discrepancy in the identification of the M. pectoralis in MLO view
Fig. 4.
Example for discrepancy in PNL measurements
Fig. 5.
Example for discrepancy in categorizing the nipple orientation when the breast is rotated in CC view
Discussion
Currently, only a few providers offer AI software solutions dedicated to quality assessment of positioning in mammography [12]. Scientific evaluation, especially regarding use under the demands of clinical routine, is limited, based on rather small case collections [7] and single-reader comparisons [2], and tends to include only a fraction of all quality aspects. A study on the improvement of image quality and the reduction of technical recalls after implementation of direct-feedback software showed promising results [14] but lacked an evaluation of the software’s performance on quality assessment criteria. Similar studies in a comprehensive setting, as well as studies investigating software validation, appear to be under development.
The analysis included a fully comprehensive PGMI evaluation with all standard criteria applied and was based on multicentre real-world data in combination with reference from multireader expertise.
When comparing the human reference with the AI algorithm, the highest agreement was found for “M. pectoralis visibility” in CC and for “Pectoralis angle” in MLO. All other PGMI categories demonstrated lower levels of agreement, with some categories showing a pronounced decrease. For the overall PGMI, agreement remained consistently within the moderate spectrum.
There was low agreement on assigning an image as “inadequate.” Most of these cases differed by only one PGMI level. Clear delineation of inadequate images is important because of its impact on the routine screening workflow. A rating of “inadequate” may result in the examination being repeated or, if the woman has already left, in her having to return for it. It also puts a strain on the required average performance level of a team member (the inadequate rate must remain below 3%). Software should therefore aim to reliably identify inadequate images.
For some categories, the results showed better agreement in the non-challenging group than in the challenging group. In complex situations, humans may tend to interpret findings instinctively, connecting all images of a case, which can increase subjectivity. The AI, on the other hand, strictly applies the entered rules and always processes one image at a time without establishing any relationships. Complex cases also appear difficult for the AI to analyze, because many difficulties and deficiencies occur simultaneously, or reliable landmarks disappear (for example, loss of orientation when the nipple is not in profile while the pectoralis is much too short in MLO) or cannot be recognized. For instance, identification of the nipple may be affected by its inconspicuous anatomical dimension, obscuration by dense glandular tissue, or preset image processing that favors ideal representation of other densities. Although the software used full DICOM images, seemingly arbitrary values could arise if the nipple deviated significantly from the expected constellation (dimension, visibility, in profile). Human readers also had to window extensively to recognize subtle nipples in dense breasts. As a solution, making postprocessing available in the graphical output of the software may avoid loss of information in the displayed image and allow use of the full dynamic range of the raw or DICOM images [13]. To the best of our knowledge, none of the vendors is currently able to display such extensive data within the user interface. Enabled windowing would facilitate the visualization of multilayered structures and lay the groundwork for correcting misidentified landmarks in a next step.
In PGMI, a nipple that is not in profile and lies within the tissue is to be classified as “inadequate,” whereas a nipple visualized at the skin boundary may be classified as “moderate.” This distinction worked very well for humans but was not available to the AI system. In CC, when the nipple deviated laterally while the lateral gland body was simultaneously cut off, the breast might not have been sufficiently mobilized forward [3]. If the PGMI system is replicated with AI, such a combination should be rated as “inadequate.” An explanation for the weak agreement on this measurement could be the moderate agreement for “Nipple orientation” and the poor agreement for “Lateral gland depiction.” The erroneous result was therefore a composite of two different aspects.
In a worst case, the AI system may not only process the two categories "Nipple in profile" and "Nipple orientation" (in CC) incorrectly, but also refer to an incorrect landmark when calculating the PNL. When comparing the PNL between CC and MLO, incorrect values would then occur, and the initial misjudgment would affect the estimation of the other projection. In contrast, human readers may instinctively establish connections and gain an understanding of the anatomical situation by considering all four projections of one woman together. We assume that AI systems would benefit from recognizing structures across different projections. In the software studied, the ipsilateral views were co-processed, allowing evaluation of criteria such as the PNL difference. To our knowledge, however, no provider is currently able to include previous exams or other patient-related information. This would be a game changer, especially for lesion detection tools or prognostic models, and is the subject of further research [19].
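The PNL difference check across the co-processed ipsilateral views can be sketched in a few lines. This is an illustration only: the 10 mm tolerance and the function name are assumptions for the example, not the prototype's actual rule, and a misidentified nipple or pectoralis landmark on either view would silently corrupt both inputs:

```python
def pnl_difference_ok(pnl_cc_mm: float, pnl_mlo_mm: float,
                      tolerance_mm: float = 10.0) -> bool:
    """Check whether the pectoralis nipple line (PNL) measured on the
    CC view lies within a tolerance of the MLO measurement. On a
    well-positioned CC view the PNL is typically only slightly shorter
    than on MLO; a much shorter CC PNL suggests posterior tissue was
    missed. Tolerance value is illustrative."""
    return (pnl_mlo_mm - pnl_cc_mm) <= tolerance_mm

within = pnl_difference_ok(92.0, 100.0)   # CC only slightly shorter
missed = pnl_difference_ok(80.0, 100.0)   # CC much shorter than MLO
```

The design point is that this criterion is only as reliable as the two upstream landmark detections feeding it, which is exactly the failure cascade described above.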
In CC, the presence of the M. Pectoralis was well recognized by the AI system. In MLO, however, the AI system struggled to define its outline in length and shape. Causes may include differences in anatomy, in positioning, and in contrast behavior between vendors. This, in turn, can provide an incorrect basis for the PNL measurement. The expert radiographers' review identified cases in which the PNL was inconclusive for comparison, and it was not possible to ascertain which landmarks the AI had used.
The MLO view exhibited a problematic combination that may not be adequately resolved through enhanced detection and rule adjustment alone. AI failure appeared to occur when part of the glandular tissue was not imaged while, at the same time, the M. Pectoralis was shorter than required and the IMF was not visualized. This situation might leave too few landmarks for the AI system to process the image correctly, and measuring the PNL becomes troublesome. One solution could be a new criterion, "Parenchyma depiction in MLO," similar to the corresponding criteria in CC [12]. However, these criteria showed poor agreement between the human reference and the AI in our study.
"IMF visibility" showed intermediate agreement with the human reference. However, it is questionable how the IMF classification can work reliably without a wrinkle detection feature. This is the region where unintended folds and tissue overlaps occur most frequently, due to gravity, abdominal fat, a broad-based IMF, or improper positioning along the detector edge. Since skinfolds can significantly affect image quality, it seems essential to include this criterion in the software [3].
In some cases, AI and the human reference agreed on the overall PGMI score while the individual category values differed completely. When the AI system produces questionable results, it would be beneficial to visually inspect the software's pattern recognition and underlying calculations to better understand its functionality and the rationale behind its outputs. A graphical representation of the registered structures and lines within the interface, together with human adaptation options, seems indispensable [2, 12]. A future goal may be to design a practical platform enabling adjustment and communication between the user and the software, as is already possible with one vendor [20].
The variability between the model output and a manual PGMI assessment can also be attributed, in general, to differences in evaluation methodology. While human reading remains prone to subjective tendencies, the algorithm's evaluation consists of applying fixed rule sets to explicit measurements.
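As described in the methods, agreement between these two methodologies was quantified with quadratically weighted Cohen's kappa. A minimal sketch of that statistic, with the P-G-M-I grades encoded ordinally as 0-3, might look as follows (equivalent in result to `sklearn.metrics.cohen_kappa_score(a, b, weights="quadratic")`):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_cat: int = 4) -> float:
    """Quadratically weighted Cohen's kappa for two raters on an
    ordinal scale with categories 0..n_cat-1 (e.g. P=3, G=2, M=1, I=0).
    Disagreements are penalized by the squared grade distance, so a
    P-vs-G discrepancy costs far less than a P-vs-I discrepancy."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    obs = np.zeros((n_cat, n_cat))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()                                   # observed proportions
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))   # chance expectation
    idx = np.arange(n_cat)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_cat - 1) ** 2
    return 1 - (w * obs).sum() / (w * exp).sum()
```

Identical ratings yield 1.0, chance-level agreement yields 0, and near-miss disagreements on an ordinal scale are penalized less than gross ones, which is why this variant suits graded scores such as PGMI.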
In fact, replicating the traditional PGMI with an AI algorithm is difficult and not yet sufficient for direct transfer into routine practice. Nevertheless, we were able to determine which aspects of the current PGMI are easier to transfer to an artificial model and which are more troublesome. This raises the question of whether a one-to-one transfer of the PGMI, with its human interpretations at the limits of objectivity, to a technically structured AI can succeed despite deep learning and similar approaches, or whether our criteria and methods should first be re-evaluated and discussed in light of the new possibilities. In any case, transparent feedback to programmers and designers of software can make a relevant contribution to finding an appropriate quality assessment for the future.
Limitations
Although the study sample included 800 images from 13 different sites, a larger number of cases would have given the study more power. When selecting the participating institutions, great importance was attached to covering a range of unit sizes, throughput, team constellations, and clientele, to allow a heterogeneity of the data that corresponds to reality. We wanted to ensure that the concept would work equally well in different environments and not only in strictly organized screening scenarios with highly experienced core teams and fast workflows. Precisely because of this, the distribution of P, G, M, and I grades, as well as the distribution of the individual flaws, was not homogeneous in our approach. Further investigation would benefit from a pre-selected and balanced dataset to lay the groundwork for validation at a later stage.
The software recorded fewer PGMI criteria than the human readers in their familiar system (Appendix 1). It did not detect skinfolds or blur/motion artifacts, which may have affected agreement on the overall PGMI score. When assessing the IMF, however, the missing skinfold category may have had less influence on the evaluation, since the presence of overlap effects in this region could already be detected by the software upon specific request to the provider. For a 1:1 comparison, the human readers would have to re-evaluate using a new table or wait for the missing categories to be added.
All readers strictly recorded every single criterion of each image. The overall PGMI score they assigned to each image was, moreover, based on their own subjective assessment, meaning that weighting comes into play in the overall PGMI value. The participants were familiar with this procedure from their respective systems, and since we wanted results that would also be produced in routine, we left this approach unchanged. This subjective component poses a significant challenge for computer software, which is obviously better suited to replicating rules and standards. The AI followed the concept of taking the lowest individual value as the grade for the overall PGMI per image. More balanced approaches, adding more weight to specific PGMI categories, would have allowed a more conclusive comparison with the human reading classifications, but were not available in the prototype of the AI system. Recognizing this discrepancy, our study compared not only the overall PGMI but also the values of each individual criterion.
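The min-rule aggregation the AI followed, contrasted with a hypothetical weighted scheme, can be sketched briefly. The weighted variant (its criterion names and weights included) is purely illustrative and was not part of the prototype:

```python
import math

# Grades ordered from worst to best so that the minimum picks the worst.
GRADE_ORDER = {"I": 0, "M": 1, "G": 2, "P": 3}
GRADE_FROM_ORDER = {v: k for k, v in GRADE_ORDER.items()}

def overall_pgmi_min_rule(criterion_grades: list[str]) -> str:
    """Overall per-image PGMI as applied by the prototype: the single
    worst criterion grade determines the image grade."""
    return min(criterion_grades, key=GRADE_ORDER.__getitem__)

def overall_pgmi_weighted(grades: dict[str, str],
                          weights: dict[str, float]) -> str:
    """Hypothetical alternative: weighted average of the numeric
    grades, rounded down to stay conservative."""
    score = sum(weights[c] * GRADE_ORDER[g] for c, g in grades.items())
    score /= sum(weights.values())
    return GRADE_FROM_ORDER[math.floor(score)]

image_grade = overall_pgmi_min_rule(["P", "G", "I", "G"])  # worst wins
```

Under the min rule a single "inadequate" criterion drags the whole image to "I" regardless of how many criteria were "perfect", whereas a human reader may implicitly weight the criteria, which is one plausible source of the observed overall-score discrepancy.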
The PGMI assessment by the software was primarily based on structural recognition and rule-based categorization, rather than advanced learning from a broad base of human evaluations. The inclusion of additional data and training on a larger scale is considered essential in order to advance the software and ultimately achieve reliable validation and convincing establishment on the market.
The study did not focus on interreader agreement, as data have already been published [3, 6, 9, 13, 21]. We were not able to conduct a large multinational, multireader study; however, the human expert consensus reading in our study may provide a PGMI reference standard.
To avoid any bias in a consensus voting or re-reading process, we recommend using specially designed software with complete blinding capabilities whenever possible.
A larger consensus on requirements across screening programs and organizations may be fruitful to harmonize quality assessment in mammography and ultimately define the framework for AI.
Conclusion
The results of this study show that transforming mammographic image quality assessment with PGMI into a fully automated AI system is challenging. Although moderate agreement was reached for overall PGMI, performance across specific criteria was inconsistent and decreased substantially in challenging images. Elaborating the decision-making process and the criteria for human assessment of mammographic image quality is essential for further work on automated solutions aimed at supporting, and potentially replacing, radiographers with an objective, time- and cost-effective tool.
Abbreviations
- AI
Artificial intelligence
- CC
Craniocaudal
- CI
Confidence interval
- IMF
Inframammary fold
- MAE
Mean absolute error
- MLO
Mediolateral oblique
- PGMI
Perfect-good-moderate-inadequate
- PNL
Pectoralis nipple line
Author contributions
T.S. is the main author of the study, collected the image data, acted as a human reader, and performed the work-up and most of the writing. M.T. provided the software, prepared it, and delivered the results of the AI. J.S., S.F. and W.S. acted as human readers. S.G. was a major contributor to the image data collection and processing. M.G. contributed to the statistical analysis. M.L. was a contributor and advisor in writing the methods and results. J.G. performed all statistical analyses. S.H. was a major advisor and contributor in writing the manuscript. G.W. is the supervisor of T.S. and the corresponding author, and was a major advisor in designing and performing the study and writing the manuscript. All authors read and approved the final manuscript.
Funding
The authors state that this work has not received any funding.
Data availability
All mammography image data were collected in the period from June to August 2021 and are courtesy of the University Hospital of Innsbruck, Austria, or the Hirslanden Group, Switzerland. The CRFs with the evaluated PGMI data are archived by T.S.
Declarations
Ethics approval and consent to participate
The study was approved by the ethical committee (Medical University of Innsbruck, Austria, reference number 1321/2021).
Consent for publication
Waived by IRB.
Competing interests
M.T. is an employee of the vendor. He took charge of compiling the results generated by the algorithm and provided the data related to the algorithm design, training, and evaluation. All other authors are neither employees nor consultants of, and have not received any funding from, the provider or any other party. All communication regarding the software took place exclusively between T.S. (as the organizer of the study) and M.T. (in the role of the programmer of the algorithm). J.G.S. is affiliated with Evidia, S.F. is affiliated with Way to Women Sàrl, W.S. is affiliated with Team Radiologie Plus, and S.G. is affiliated with MSS Medical Software Solutions GmbH.
Footnotes
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1186/s13244-025-02191-3.
References
- 1. IARC (2023) IARC biennial report 2022–2023. Available via https://publications.iarc.who.int/633. Accessed 12 Dec 2024
- 2. Brahim M, Westerkamp K, Hempel L, Lehmann R, Hempel D, Philipp P (2022) Automated assessment of breast positioning quality in screening mammography. Cancers (Basel). 10.3390/cancers14194704
- 3. Hofvind S, Vee B, Sørum R, Hauge M, Ertzaas A (2009) Quality assurance of mammograms in the Norwegian Breast Cancer Screening Program. Eur J Radiography. 10.1016/j.ejradi.2008.11.002
- 4. Taplin SH, Rutter CM, Finder C, Mandelson MT, Houn F, White E (2002) Screening mammography: clinical image quality and the risk of interval breast cancer. AJR Am J Roentgenol. 10.2214/ajr.178.4.1780797
- 5. U.S. Food & Drug Administration (2016) Poor positioning responsible for most clinical image deficiencies, failures. Available via https://www.fda.gov/radiation-emitting-products/mqsa-insights/mqsa-insights-articles. Accessed 12 Dec 2024
- 6. Hill C, Robinson L (2015) Mammography image assessment: validity and reliability of current scheme. Radiography. 10.1016/j.radi.2015.07.005
- 7. Waade GG, Danielsen AS, Holen ÅS et al (2021) Assessment of breast positioning criteria in mammographic screening: agreement between artificial intelligence software and radiographers. J Med Screen. 10.1177/0969141321998718
- 8. Taylor K, Parashar D, Bouverat G et al (2017) Mammographic image quality in relation to positioning of the breast: a multicentre international evaluation of the assessment systems currently used, to provide an evidence base for establishing a standardised method of assessment. Radiography. 10.1016/j.radi.2017.03.004
- 9. Boyce M, Gullien R, Parashar D, Taylor K (2015) Comparing the use and interpretation of PGMI scoring to assess the technical quality of screening mammograms in the UK and Norway. Radiography. 10.1016/j.radi.2015.05.006
- 10. Moreira C, Svoboda K, Poulos A, Taylor R, Page A, Rickard M (2005) Comparison of the validity and reliability of two image classification systems for the assessment of mammogram quality. J Med Screen. 10.1258/0969141053279149
- 11. Hondl M (2014) Bildanalyse der Mammographie nach den PGMI-Bildkriterien. In: Hondl M, Weissensteiner S, Gaisbichler S, Rosenblattl M (eds) Die richtige Einstellung zur Mammographie. Berufsverband der RadiologietechnologInnen Österreich
- 12. Hejduk P, Sexauer R, Ruppert C, Borkowski K, Unkelbach J, Schmidt N (2023) Automatic and standardized quality assurance of digital mammography and tomosynthesis with deep convolutional neural networks. Insights Imaging. 10.1186/s13244-023-01396-8
- 13. Alukic E, Homar K, Pavic M, Zibert J, Mekis N (2022) The impact of subjective image quality evaluation in mammography. Radiography. 10.1016/j.radi.2023.02.025
- 14. Eby PR, Martis LM, Paluch JT, Pak JJ, Chan AHL (2023) Impact of artificial intelligence-driven quality improvement software on mammography technical repeat and recall rates. Radiol Artif Intell. 10.1148/ryai.230038
- 15. Tardy M, Mateus D (2022) Leveraging multi-task learning to cope with poor and missing labels of mammograms. Front Radiol. 10.3389/fradi.2021.796078
- 16. Cohen J (1968) Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull. 10.1037/h0026256
- 17. Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics. 10.2307/2529310
- 18. Kolenikov S (2021) kappaetc: tables of interrater agreement. Stata J 21:1–15
- 19. Taylor CR, Monga N, Johnson C, Hawley JR, Patel M (2023) Artificial intelligence applications in breast imaging: current status and future directions. Diagnostics (Basel). 10.3390/diagnostics13122041
- 20. b-rayz AG (2025) Available via https://b-rayz.spce.com/s/general-showroom-b-rayz-ag. Accessed April 2025
- 21. Santner T, Ruppert C, Gianolini S et al (2025) PGMI assessment in mammography: AI software versus human readers. Radiography. 10.1016/j.radi.2025.103017