Abstract
Background
Emergency/trauma radiology artificial intelligence (AI) is maturing along all stages of technology readiness, with research and development (R&D) ranging from data curation and algorithm development to post-market monitoring and retraining.
Purpose
To develop an expert consensus document on best research practices and methodological priorities for emergency/trauma radiology AI.
Methods
A Delphi consensus exercise was conducted by the ASER AI/ML expert panel between 2022 and 2024. In phase 1, a steering committee (7 panelists) established key themes (curation; validity; human factors; workflow; barriers; future avenues; and ethics) and generated an edited, collated long-list of statements. In phase 2, two Delphi rounds using anonymous RAND/UCLA Likert grading were conducted with web-based data capture (round 1) and a bespoke Excel document with literature hyperlinks (round 2). Between rounds, editing and knowledge synthesis helped maximize consensus. Statements reaching ≥80% agreement were included in the final document.
Results
Delphi rounds 1 and 2 consisted of 81 and 78 items, respectively. Of 21 expert panelists, 18 (86%) responded to round 1, and 15 responded to round 2 (17% drop-out). Consensus was reached for 65 statements. Observations were summarized and contextualized. Statements with unanimous consensus centered around transparent methodologic reporting; testing for generalizability and robustness with external data; and benchmarking performance with appropriate metrics and baselines. A manuscript draft was circulated to panelists for editing and final approval.
Conclusions
The document is meant as a framework to foster best practices and further discussion among researchers working on various aspects of emergency and trauma radiology AI.
Keywords: Radiology, Imaging, Emergency, Trauma, Computer aided detection, Artificial intelligence, Machine learning, Research priorities, Emergency Radiology, Trauma radiology, ASER, Consensus statement, Position paper, Delphi study

Introduction
Artificial intelligence and machine learning (AI/ML)-based automated computer-aided detection and diagnosis (CAD) tools that detect, classify, grade, quantify, risk stratify, or prognosticate emergent traumatic and non-traumatic conditions on imaging exams stand to improve patient outcomes and reduce costs in the high-volume, safety critical emergency setting by improving turnaround times, diagnostic accuracy, and personalized decision support [1–5]. AI-based image augmentation, reconstruction, and enhancement techniques may also be beneficial, and AI-assisted annotation could accelerate data curation and labeling [6–9]. AI tools may relieve some of the burden placed on round-the-clock staff and have a disproportionate effect in locations with fewer radiology resources and expertise while reducing errors and improving agreement between observers [10]. As advances in GPU hardware accelerate, allowing rapid parallel computing, AI algorithms continue to improve in performance and robustness [11]. The introduction of natural image-pretrained deep learning models was an initial watershed moment culminating in the proliferation of radiology CAD tools [12, 13]. More recently, foundational vision transformers trained on massive datasets have been modified and optimized for a variety of medical imaging tasks [14–18]. AI-based commercial technologies are rapidly coming online, with hundreds of radiology-based tools gaining regulatory approval to date [4, 19].
The American Society of Emergency Radiology (ASER) AI/ML Expert Panel was conceived in 2020 and kicked off with a virtual meeting in November of 2021. Among the goals of the panel were a) developing a better understanding of AI R&D in the emergency and trauma radiology domain and b) aligning future clinical and research priorities with the needs of the emergency radiology community [3–5]. Priority-setting work products have so far included a scoping review on the current state of AI CAD trauma tools [4] and an ASER member survey to query current trends in AI utilization, perceived unmet needs, and expectations [5]. Through initial meetings, the panel also identified a need for a position statement on research priorities, recommendations, and guidelines spanning the research pipeline from dataset curation to early foundational and translational proof-of-concept work; productization; validation; regulatory clearance; post-market surveillance; and outcomes research.
While Delphi methods have been used to move toward consensus for prioritization of key research issues and guidelines in a variety of fields, including AI for screening colonoscopy [20], to our knowledge, there has been no such formal systematic process for the emergency/trauma radiology domain. Our aim was therefore to create a Delphi-driven consensus document to establish R&D guidelines, methodological priorities, and challenges that would be useful to emergency/trauma imaging AI researchers [21, 22].
Methods
Justification for use of a Delphi technique
Priority-setting exercises often employ the Delphi method as a qualitative yet systematic group facilitation approach that involves collating recommendations elicited through an iterative adaptive process of cognitive reflection with the opportunity to revise opinions, considering new or previously overlooked information (i.e., consensus building). This can be achieved through group discussion or by drawing from external sources of knowledge [23, 24].
Overview of the Delphi process
Per Conduct and Reporting of Delphi Studies (CREDES) criteria, the process requires methodologic transparency with a summary flow chart (see Fig. 1) and separate descriptions of preparatory phases, anonymous iterative rounds of inquiry (“Delphi rounds”), interim activities to deal with non-consensus, and concluding steps. A priori cut-offs, average group responses, changes between rounds, any new materials or informational input introduced during survey rounds, and pre-set thresholds for terminating the Delphi process are described. Two or three rounds are often used for pragmatic reasons, and consensus is generally not expected for all items [23, 25]. Endorsement by relevant stakeholders within professional bodies is advised as part of the dissemination plan [24].
Fig. 1.

Delphi study flow-chart. Flow-chart of ASER AI/ML expert panel Delphi consensus approach
Phase 1: establishing key themes and generating a long-list of consensus statements
A flow-chart for this Delphi study is provided in Fig. 1. Following an institutional review board (IRB) not human subjects research (NHSR) determination at the University of Maryland, Baltimore, participation in a steering committee tasked with the conception, format, and execution of this Delphi consensus study was solicited among all ASER AI/ML Expert Panel members by e-mail survey. An electronic (eDelphi) format was used to facilitate communication with members at geographically separate US and international institutions. The steering committee comprised 7 volunteer members, including six emergency/trauma radiologists and one computer science PhD. It was tasked with determining the primary purpose, scope, and methods adherent to Delphi technique [24], and subsequently refining, modifying, or removing statements based on feedback when available. Members had expertise and publication records in various aspects of emergency/trauma AI, including algorithm development, collaborative translation, deployment, validation, post-market evaluation, outcomes research, and governance. The steering committee convened via virtual meeting on January 7, 2022, and determined the methodologic approach, with the overall goal of creating an expert-driven list of consensus statements promoting high-quality, ethical research throughout the technology readiness pipeline.
An 80% consensus threshold (Likert scores of 7–9, “agree”) and a two-round stopping point, provided consensus was reached on most items, were determined a priori. The steering committee aimed to remain agnostic to the relative importance of CAD tool R&D for specific acute illnesses or injuries, as insights into these issues were to be gleaned through the subsequently published scoping review and ASER member survey [4, 5].
Seven key themes were established during the brainstorming session: (1) dataset curation, (2) validity, (3) human factors, (4) workflow, (5) barriers to research and implementation, (6) future avenues, and (7) ethics. Following the meeting, each steering committee member drafted 3–4 statements they felt were most relevant to each theme, with instructions to formulate each statement in a way that could be evaluated by the AI Expert Panel members-at-large using the 9-point RAND/UCLA Likert method. Likert grading was used to maximize discrimination within three broad categories (1–3: disagree; 4–6: uncertain; 7–9: agree). Prospective statements were solicited via electronic survey (SurveyMonkey Inc., San Mateo, California, US) between February 3 and March 20, 2022. An advisory team of four steering committee members (DD, PS, GK, and KB) was formed to collectively evaluate, collate, and edit statements before and between eDelphi rounds with the aim of maximizing consensus, circulating drafts, and arriving at a final comprehensive consensus document. The first round was disseminated on June 4th, 2022.
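The grading scheme and a priori threshold described above can be illustrated with a minimal Python sketch. The function names and the example ratings are hypothetical and are not taken from the study data; the sketch only assumes the three Likert bands and the 80% cut-off stated in the text.

```python
from typing import Sequence

AGREEMENT_THRESHOLD = 80.0  # percent; the a priori cut-off stated in the text


def likert_category(score: int) -> str:
    """Map a 9-point RAND/UCLA Likert score to its broad category."""
    if not 1 <= score <= 9:
        raise ValueError(f"score must be between 1 and 9, got {score}")
    if score <= 3:
        return "disagree"
    if score <= 6:
        return "uncertain"
    return "agree"


def percent_agreement(scores: Sequence[int]) -> float:
    """Percentage of respondents who graded a statement 7, 8, or 9."""
    if not scores:
        raise ValueError("at least one score is required")
    n_agree = sum(1 for s in scores if likert_category(s) == "agree")
    return 100.0 * n_agree / len(scores)


def reaches_consensus(scores: Sequence[int]) -> bool:
    """True when agreement meets or exceeds the 80% threshold."""
    return percent_agreement(scores) >= AGREEMENT_THRESHOLD


# Hypothetical ratings from 15 panelists for a single statement.
example_scores = [9, 9, 8, 8, 9, 7, 9, 8, 9, 9, 7, 8, 5, 6, 3]
print(percent_agreement(example_scores), reaches_consensus(example_scores))
```

In this illustration, 12 of 15 hypothetical ratings fall in the 7–9 band, so the statement sits exactly at the inclusion threshold.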
Phase 2: survey administration
Round 1. First consensus-building round, between-round editing, and literature search
Surveys were disseminated via e-mail. The first was administered using an online form (JotForm, San Francisco, California, US) link on August 8, 2022, with responses closed on October 3, 2022 (see Appendix 1). Respondents performed Likert grading on each of the above themes and had the opportunity to enter free-form comments for all statements. Although statements were organized thematically, no effort was made to prioritize statements or themes over one another by order of importance. All data was extracted from the eDelphi surveys and imported into a spreadsheet. Between-survey steering committee activities were completed by advisory group members from November 2022 to June 2023 and included discussion, editing, and literature search.
To maximize the likelihood of building further consensus, the initial level of consensus and anonymous comments from round 1 were considered. A focused literature search was conducted for knowledge synthesis by DD, with collated statements as search criteria using PubMed and Google Scholar. DD thematically and topically analyzed the literature to identify and highlight relevant passages. In total, three members of the advisory team available for this exercise (DD, GK, and PS) approved dissemination of 52 published papers as embedded hyperlinks within the second eDelphi round survey. These included peer reviewed society position statements, systematic reviews, clinical and technical scientific papers, narrative reviews, and expert commentaries in clinical and technical journals. Provision of embedded links to literature full texts was intended as a more objective “wisdom of crowds” approach to shore up consensus, in lieu of group discussion, to maximize anonymity and avoid bandwagon effects [26].
Round 2. Second and final consensus-building round.
The second survey with PDF hyperlinks was administered in spreadsheet format (see Appendix 2a for statements and Appendix 2b for citations of linked articles). Respondents were instructed to complete Round 2 only if they had completed Round 1. To maintain anonymity while preserving a list of participants for final review and approval of the consensus document, the steering committee survey administrator sequestered and preserved identifying information before tabulating results. The second eDelphi round was administered between June 15, 2023, and Dec 12, 2024. At the conclusion of round 2, summary results were tabulated, and mean, median, and mode were calculated. Statements with less than 80% consensus were excluded, while those with ≥80% agreement in the final round were included in the final consensus document. Statements within a given theme were organized post hoc in descending order of agreement and color-coded to indicate unanimous (100%), strong (≥90% to <100%), and moderate (≥80% to <90%) consensus.
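As an illustration of the tabulation step, the sketch below computes the summary statistics named above and assigns a consensus tier using the stated cut-offs. It is an assumption-laden sketch rather than the panel's actual spreadsheet logic: the function names, the dictionary layout, and the example ratings are hypothetical, and all tied modes are returned because the text does not say how ties were handled.

```python
from statistics import mean, median, multimode
from typing import Dict, Optional, Sequence


def consensus_tier(n_agree: int, n_respondents: int) -> Optional[str]:
    """Return the consensus tier, or None if the statement is excluded (<80%)."""
    if n_respondents <= 0 or not 0 <= n_agree <= n_respondents:
        raise ValueError("counts must satisfy 0 <= n_agree <= n_respondents, n_respondents > 0")
    if n_agree == n_respondents:
        return "unanimous"
    # Integer comparisons avoid floating-point edge cases at the cut-offs.
    if 100 * n_agree >= 90 * n_respondents:
        return "strong"
    if 100 * n_agree >= 80 * n_respondents:
        return "moderate"
    return None


def summarize(scores: Sequence[int]) -> Dict[str, object]:
    """Tabulate one statement: agreement count, percentage, mean, median, mode(s), tier."""
    n_agree = sum(1 for s in scores if s >= 7)  # Likert 7-9 counts as "agree"
    return {
        "n_agree": n_agree,
        "pct_agree": round(100 * n_agree / len(scores), 1),
        "mean": round(mean(scores), 1),
        "median": median(scores),
        "modes": multimode(scores),  # every tied mode is kept
        "tier": consensus_tier(n_agree, len(scores)),
    }


# Hypothetical round 2 ratings for two statements (15 respondents each).
ratings = {
    "statement A": [9, 9, 8, 8, 9, 7, 9, 8, 9, 9, 7, 8, 5, 6, 3],
    "statement B": [9] * 10 + [8] * 5,
}
summaries = {name: summarize(scores) for name, scores in ratings.items()}

# Order statements by descending agreement, as done post hoc within each theme.
ranked = sorted(summaries, key=lambda name: summaries[name]["pct_agree"], reverse=True)
print(ranked)
```

With 15 respondents, the tiers correspond to agreement counts of 15 (unanimous), 14 (strong), and 12 or 13 (moderate).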
Results
Delphi study expert participants
The Delphi study expert participants who completed both rounds of the exercise formed an international body from 13 institutions and 3 countries within North America, the Asia-Pacific region, and the Middle East. The group included 10 men and 5 women, with a mean age of 44 (range 35–64). Highest degrees included MD or equivalent (n = 10), MD/MBA (n = 2), PhD (n = 1), MD/PhD (n = 1), and MD/MS (n = 1). Participants had a median of 6 years of experience with AI tools and research (range: 4–10). Four were currently principal investigators on funded studies, with federal funding reported by 3 (two of whom also reported industry and private society/foundation funding), and one reporting departmental funding. Participants have been engaged in various aspects of governance or other leadership/oversight activities, including departmental AI clinical implementation committees, the RSNA informatics committee, and associate or assistant editorship of RSNA scientific journals (n = 5).
Data processing
In phase 1, steering committee expert topic generation resulted in a list of 124 items from raw verbatim responses. After removing duplicates and collating items that addressed similar or overlapping subject matter, 81 statements were developed. In the first eDelphi round, these were distributed to 21 panel members, with 18 panel members returning responses (86% initial response rate). Phase 2, round 2, consisted of 78 statements with 18 items for dataset curation, 17 for validity, 11 for human factors, 7 for workflow, 7 for barriers to research and implementation, 7 for future avenues, and 11 for ethics. Of the 18 ASER AI expert panel respondents, 15 responded to the second eDelphi round (17% drop-out). In the first round, mean scores ranged from 3.1 to 8.5, with round 1 agreement scores (≥7) reached for 52 of 81 statements (64%). In the second round, following between-round editing and knowledge synthesis, mean scores ranged from 5.8 to 8.1, with round 2 agreement reached for 65 of 78 statements (83%). The 65 statements with consensus represent the expert panel’s final work product of research recommendations and guidance, presented in Table 1 and organized from highest to lowest agreement. Unanimous (100%) consensus was established for 8 statements/recommendations, strong consensus (≥90% to <100%) for an additional 16, and moderate consensus (≥80% to <90%) for the remaining 41. A manuscript was drafted and distributed first to the steering committee, and then among panelists participating in both rounds for final feedback and editing.
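The percentages and totals reported in this paragraph can be reproduced with simple arithmetic. The short Python check below uses only counts stated in the text; the variable and function names are illustrative, and rounding to the nearest whole percent is an assumption about how the figures were reported.

```python
def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


invited, round1_respondents, round2_respondents = 21, 18, 15

response_rate = pct(round1_respondents, invited)                      # 18 of 21
dropout = pct(round1_respondents - round2_respondents, round1_respondents)  # 3 of 18
round1_agreement = pct(52, 81)   # statements reaching agreement in round 1
round2_agreement = pct(65, 78)   # statements reaching agreement in round 2

# Round 2 items per theme, as listed in the text.
items_per_theme = {
    "dataset curation": 18,
    "validity": 17,
    "human factors": 11,
    "workflow": 7,
    "barriers to research and implementation": 7,
    "future avenues": 7,
    "ethics": 11,
}

# Final statements per consensus tier, as listed in the text.
statements_per_tier = {"unanimous": 8, "strong": 16, "moderate": 41}

print(response_rate, dropout, round1_agreement, round2_agreement)
print(sum(items_per_theme.values()), sum(statements_per_tier.values()))
```

The theme counts sum to the 78 round 2 statements, and the tier counts sum to the 65 statements in the final document.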
Table 1.
Final consensus document
| ASER AI/ML panel consensus research recommendations and guidelines for trauma/emergency radiology | ||||||
|---|---|---|---|---|---|---|
| Theme: Dataset Curation. Key Issues: Dataset size (training/validation), Selection criteria, Annotation/label quality, Heterogeneous/diverse/representative data and Clinical data | ||||||
| Number | Statement | Total 7, 8, and 9* | % consensus | Median | Mean | Mode |
| 1 | The source of data, acquisition parameters, vendors makes/models, how patients were selected, how many patients and how many images were included, how data was prepared, how data was anonymized, and demographics (at minimum the age and sex distribution of subjects) should be clearly stated in manuscripts. | 15 | 100 | 9 | 8.8 | 9 |
| 2 | Emergency radiology algorithms should be tested to confirm generalizability (in terms of diagnostic accuracy, calibration/model fit, and bias/hidden stratification) using heterogeneous out-of-sample datasets prior to clinical deployment (e.g., data from multiple outside institutions or public repositories, containing variable demographic features, vendors, and protocols). | 15 | 100 | 9 | 8.5 | 9 |
| 3 | To ensure reproducibility and transparency, vendors/researchers must clearly state how data were annotated, by whom, the experience of the labeling/annotation team, and any potential bias they may introduce. | 14 | 93.3 | 9 | 8.4 | 9 |
| 4 | The ASER AI expert panel should actively encourage the curation and dissemination of public datasets that accelerate the R&D of CAD tools and consider sponsoring AI challenges in collaboration with societies such as RSNA, ACR, or MICCAI. | 14 | 93.3 | 9 | 8.2 | 9 |
| 5 | Scarcity of high-quality labeled data is a major obstacle in emergency radiology CAD tool R&D, particularly for cross-sectional imaging tasks. | 14 | 93.3 | 9 | 7.9 | 9 |
| 6 | Applying the same expectations of dataset heterogeneity, size, and multi-reader design to first-of-their-kind emergency radiology CAD tools for novel tasks could stifle early R&D | 14 | 93.3 | 8 | 7.8 | 9 |
| 7 | To help ensure reproducibility of AI algorithm performance when scaling to larger and higher quality datasets, data curation methods should be transparent. Checklists such as CLAIM should be strongly encouraged to this end. | 13 | 86.7 | 8.5 | 8.3 | 9 |
| 8 | Methods that accelerate high-quality human-in-the-loop data-labeling while also minimizing the effects of automation bias on human oracles (i.e., annotators) should be a major research priority for ER/Trauma radiology | 13 | 86.7 | 8 | 8.1 | 9 |
| 9 | Post-proof-of-concept and preliminary translational work, researchers should strive, when possible, for reproducible and uniform annotation with appropriate quality control measures such as supervision and editing by a group of experts with a method to achieve consensus. | 13 | 86.7 | 8 | 7.9 | 9 |
| 10 | Buy-in for multicenter projects should include the opportunity to participate as co-author on abstracts, conference papers, and manuscripts provided that specific expectations and ICMJE criteria are met (1. Substantial contributions to conception, design, or acquisition of data, 2. Drafting or revising critically for important intellectual content, and 3. Final approval of the version to be published). | 13 | 86.7 | 8 | 7.8 | 9 |
| 11 | ASER AI/ML expert panel members should be instrumental in setting priorities for dataset curation by identifying applications expected to have a major impact on the clinical practice of emergency and trauma radiology. | 13 | 86.7 | 8 | 7.7 | 9 |
| 12 | Data-sharing rules vary by institution and may require considerable leg work. Along this line, the resource-intensive process of applying for a multicenter Institutional Review Board (IRB) approval and two-way Data User Agreement (DUA) should be prioritized as early as possible for multi-center data curation efforts. | 12 | 80 | 9 | 8.1 | 9 |
| 13 | Due to the very high variability of appearances of pathology in emergency radiology, achieving the goal of acceptable on-task performance for CAD tools will require training and validation using large heterogeneous multicenter datasets. | 12 | 80 | 8 | 7.7 | 9 |
| 14 | Dataset size, heterogeneity, and rigor of image labeling for emergency radiology-related algorithms should be commensurate with the current state of research for a given pathology and task and updated as technology readiness advances. | 12 | 80 | 8 | 7.7 | 8 |
| 15 | ASER AI/ML Panel experts should collaborate on high-quality data annotation and strive to complete IRBs/DUAs to facilitate deidentified data exchange, where feasible. | 12 | 80 | 8 | 7.5 | 9 |
| 16 | Public challenges and challenge data are important to develop clinically useful models, but challenge data is lacking in many domains of emergency radiology. | 12 | 80 | 8 | 7.5 | 8 |
| Theme: Validity. Key issues: Selection of clinically salient/actionable algorithm outcomes, Performance assessment, Generalizability, scalability, and robustness, Post-approval surveillance/back-end validation | ||||||
| 1 | Following proof-of-concept and preliminary translational work, emergency radiology CAD tool performance should be appropriately benchmarked depending on intended use, e.g., triage/early notification (CADt) should be compared to turn around time without CAD; for concurrent or second reader tools (CADe), reader performance should be assessed with and without AI assistance. | 15 | 100 | 9 | 8.4 | 9 |
| 2 | Non-adversarial robustness of AI models (e.g., robustness under different image conditions/scanners/patient demographics) is also a large concern. | 15 | 100 | 8 | 8.3 | 8 |
| 3 | The reference standard depends on the task at hand. Objective reference standards can be extracted from clinical outcome data, electronic phenotyping (e.g., biopsy or laboratory results), independent reads with arbitration, or segmentation with oversight/quality control. | 15 | 100 | 8 | 8.1 | 9 |
| 4 | For tools far along the R&D TRL pipeline, scalability of algorithms across diverse populations and healthcare settings is an essential element of robustness for pre-market approval and widespread use. | 14 | 93.3 | 9 | 8.5 | 9 |
| 5 | Emergency radiology CAD tools must be reevaluated locally prior to implementation. This may include accuracy metrics (e.g., for detection/classification), or model fit (if output includes probability of disease) | 14 | 93.3 | 9 | 8.3 | 9 |
| 6 | Agreement between experts (e.g., kappa, DSC, ICC, Bland-Altman) in at least a subset of independently labeled patients should ideally be reported for CAD software in the literature | 14 | 93.3 | 7 | 6.6 | 7 |
| 7 | The ASER AI/ML Expert Panel should engage in the exploration of clinical needs assessment to determine areas of high priority for CAD R&D | 13 | 86.7 | 9 | 8.3 | 9 |
| 8 | AI deep learning algorithms may underperform in newly deployed environments. Post-FDA approval surveillance can address drops in performance through retraining, when possible. | 13 | 86.7 | 9 | 8 | 9 |
| 9 | Raw data (results flagged as positive or negative) should be available to end-users upon request to the vendor to assess local algorithm performance; Investigators may also need to determine ground truth on a subset of patients to determine TP, TN, FP, FN. | 13 | 86.7 | 9 | 7.9 | 9 |
| 10 | Standardized guidelines such as those introduced in the Radiology: Artificial Intelligence CLAIM checklist (modified from STARD and STROBE) should be encouraged to standardize performance assessment and help ensure methods are transparent and reproducible. Intended use should be clearly defined. Failure analyses should be reported. | 13 | 86.7 | 8 | 8.1 | 9 |
| 11 | Research on system benevolence and user satisfaction (e.g., usability, efficiency, integration, adaptability, trust) is also not required by the FDA, but user acceptance research should be a major research priority early in the R&D pipeline, in part so that governing structures can make informed decisions that maximize user needs and preferences while minimizing added workload and automation/complacency bias. | 12 | 80 | 8 | 7.7 | 9 |
| 12 | Developers/vendors should provide strong scientific justification for any cut-off values used. | 12 | 80 | 8 | 7.7 | 9 |
| 13 | The FDA requires proof of safety and effectiveness, but does not require outcomes research for 510(k), PMA, or de novo approval. However, reimbursement by CMS, such as through the New Technology Pathway, does require evidence of improved outcomes. Outcomes research for FDA-approved tools should be a major priority for ER/trauma CAD tools. | 12 | 80 | 8 | 7.5 | 9 |
| 14 | Adversarial robustness of AI models in clinical settings is a large concern. | 12 | 80 | 8 | 7.1 | 8 |
| Theme: Human Factors. Key issues: Transparency, Interoperability, Human-centered design, selection of target users. | ||||||
| 1 | There is a need for formalized approaches to achieving high-trust, high-transparency human factors engineering goals for Emergency AI algorithm and software development. | 14 | 93.3 | 8 | 8.1 | 9 |
| 2 | Principles of human-centered design should be employed to determine the level of interoperability/human-in-the-loop functionality desired by appropriate end-users, considering dimensions such as degree of mental support, workload, frustration, trust, and the likelihood of future use. | 14 | 93.3 | 8 | 8.1 | 9 |
| 3 | If end-users include both clinical teams and radiologists, then both should ideally be included in formative user research and user-centered design. | 14 | 93.3 | 8 | 8 | 9 |
| 4 | Following formative user research using human-centered design principles, the same dimensions of user acceptance should be assessed in a simulated deployment environment. | 14 | 93.3 | 8 | 7.9 | 8 |
| 5 | AI deep learning algorithms may underperform in newly deployed environments. Post-FDA approval surveillance can address drops in performance through retraining, when possible. | 13 | 86.7 | 8 | 8.1 | 9 |
| 6 | Leveraging expertise in this domain by computer science faculty, the ASER AI/ML Expert Panel should engage in empirical simulation studies to develop human-centered AI systems for a high-value task determined by group consensus. | 13 | 86.7 | 8 | 7.7 | 8 |
| 7 | Along FAIR principles (findable, accessible, interoperable, reproducible), investigators should be encouraged to make non-proprietary source code and datasets available through GitHub or another repository. | 13 | 86.7 | 7 | 7.6 | 7 |
| 8 | Regulatory agencies such as the FDA should encourage direct input from expert radiologists and other relevant end-users (not only as annotators or readers in performance studies, but throughout the R&D process) to ensure that the software is designed to meet user needs and performs appropriately in clinical settings. | 12 | 80 | 9 | 8.2 | 9 |
| 9 | Transparency of machine learning models is still required even if algorithm performance is properly validated (e.g., through an RCT). | 12 | 80 | 8 | 7.8 | 9 |
| 10 | In general, Emergency radiology CAD tools call for maximum interpretability given the high stakes, fast-paced, and safety-critical nature of clinical decision-making. | 12 | 80 | 8 | 7.6 | 9 |
| Theme: Workflow. Key issues: User acceptance: logical/applicable/optimally usable workflow, Cost-benefit of added trust vs added workload, Integrated with PACS vs separate client - multiplatform/vendor integration, Types of tools - CADe/CADt/CADx | ||||||
| 1 | More research is needed to determine if AI CAD triage/early notification tools decrease report turn-around times and make the radiology workflow faster. | 15 | 100 | 8 | 7.9 | 7 |
| 2 | Once operational, cost-benefit analyses are vital to understanding whether clinically used systems can improve outcomes that may also be translated into financial savings. | 13 | 86.7 | 9 | 8.3 | 9 |
| 3 | Guidelines for institutional governance structures, societies, and regulatory agencies are needed to standardize the process of post-market surveillance of radiology AI CAD tools. | 13 | 86.7 | 8 | 8.1 | 9 |
| 4 | Operationalizing locally developed AI CAD in the clinical workflow for “shadow evaluation” and prospective clinical studies and integrating with PACS can be challenging. Bespoke institutional pipelines require considerable resources and collaboration between radiologists, computer scientists, and software developers. To avoid “reinventing the wheel”, research emphasis on open-source, interoperable, vendor-agnostic software containerized within easily deployable virtual environments is expected to benefit clinical-translational research in this domain. | 13 | 86.7 | 8 | 8 | 9 |
| 5 | The perceived value of AI CAD should be studied from different angles, such as the perspective of governance structures and healthcare administrators, end-users, and patients. | 13 | 86.7 | 8 | 7.9 | 9 |
| 6 | Once a commercial tool is introduced, it should interact with third-party software platforms and integrate with RIS/PACS systems to varying degrees that meet the needs of end-users. This functionality should be carefully researched by all clinical and technical staff with buy-in—particularly those required for continued algorithm support, with oversight by the institution’s governance structure. | 13 | 86.7 | 8 | 7.8 | 9 |
| 7 | AI orchestrators are expected to prioritize interface unification between multiple tools from multiple vendors in a seamless way for the interpreting radiologist. Surveys describing, among other things, progress of orchestrator platform R&D, mergers, and clinical adoption will help governance structures anticipate and consider acquisition of these platforms in the Emergency setting. | 12 | 80 | 8 | 7.4 | 9 |
| Theme: Barriers to research and implementation. Key issues: Scanner/vendor cross-compatibility, Limited resources for software development (GUIs/APIs), Perceived value/importance from an organizational standpoint (e.g., buy-in from IT, operating and maintenance costs), FDA clearance, Funding sources/mentorship | ||||||
| 1 | Successful AI R&D or implementation depends on the involvement of a large multi-disciplinary team of experts that may include IT, radiologists, hospital administrators, and developer/vendor technical support staff. | 14 | 93.3 | 9 | 8.4 | 9 |
| 2 | Annotation of large datasets is among the most challenging hurdles of data curation. | 14 | 93.3 | 9 | 8.3 | 9 |
| 3 | Resistance from radiologists to AI CAD tools in the daily workflow is based on perceived negative effects on efficiency, over- or under-diagnosis, negative impacts on training, lack of published data to support use, financial concerns, ethical concerns, and AI CAD transparency. | 14 | 93.3 | 8 | 8.1 | 9 |
| 4 | Foundational and translational AI research is resource and time-intensive and often requires grant support from funding agencies. Guidance documents and ASER annual meeting workshops could be helpful to clarify potential funding mechanisms for AI in Emergency radiology and improve research engagement. | 13 | 86.7 | 8 | 7.7 | 9 |
| 5 | Dedicated informatics fellowships, radiology informatics tracks, and AI expert panel meeting workshops are potentially beneficial avenues for accelerating AI research in ER/Trauma radiology. | 12 | 80 | 9 | 7.7 | 9 |
| 6 | Providing ER/trauma radiologists with education on entrepreneurship and relevant funding opportunities can accelerate AI growth in the subspeciality. | 12 | 80 | 8 | 7.7 | 9 |
| 7 | ASER AI expert panel white papers, position statements, and meeting workshops are needed to increase clarity regarding best practices to maximize chances of regulatory approval for new tools and obtaining reimbursement codes, such as through CMS. | 12 | 80 | 8 | 7.6 | 8 |
| Theme: Future avenues. Key issues: Needs-based assessment/potential impact on patient care, Governance for research resource prioritization | ||||||
| 1 | Future research should take a multifaceted approach, examining not only effects on the speed of diagnosis/turn-around time and reducing missed injury/pathology, but also direct effects on patient care (e.g., earlier time to treatment). | 15 | 100 | 9 | 8.5 | 9 |
| 2 | Leveraging expertise in this domain by computer science faculty, the ASER AI/ML Expert Panel should engage in empirical simulation studies to develop human-centered AI systems for a high-value task determined by group consensus. | 13 | 86.7 | 8 | 7.7 | 9 |
| 3 | Underserved populations are often over-represented in the emergency room. Research should include the assessment of penetration of emergency radiology CAD tools into underserved population centers. | 13 | 86.7 | 8 | 7.6 | 9 |
| 4 | Overall, AI CAD tools in emergency radiology could increase access to care for underserved populations, for example, by providing diagnostic assistance in under-resourced settings where emergency/trauma expertise is lacking. | 12 | 80 | 8 | 7.7 | 9 |
| Theme: Ethics. Key issues: What research areas could potentially cause harm to subjects, Image-driven versus patient-centric, Bias | ||||||
| 1 | There are potential unforeseen risks associated with black box algorithms or data-heavy “radiomics” algorithms falsely labeling patients as having or not having disease. | 15 | 100 | 9 | 8.4 | 9 |
| 2 | Mechanisms must be established to monitor the performance of AI solutions over time and report any deviations. | 14 | 93.3 | 9 | 8.3 | 9 |
| 3 | As AI algorithms proliferate, there is increased potential for the incorporation of staff lacking radiology expertise in the decision pathway, which may be detrimental. For example, less qualified staff will become more dependent on a potentially flawed ML image interpretation. | 14 | 93.3 | 9 | 8.1 | 9 |
| 4 | The collection of demographic and clinical data elements is important for demonstrating robustness and to account for bias in tools in advanced stages of the technology readiness pipeline, or for the pilot stages of CADx tool development where prognosis may depend on these factors. | 13 | 86.7 | 9 | 8.3 | 9 |
| 5 | Deployment and use of clinical algorithms prior to regulatory approval could pose harm to patients; end-user investigators involved in prospective “shadow mode” evaluation in the clinical environment should, in most circumstances, be included on an IRB protocol following discussion with the Human Research Protection Office as needed. | 13 | 86.7 | 8 | 7.8 | 9 |
| 6 | Radiologist end-users should be considered research subjects if their performance may be audited using commercial AI algorithms, and auditing is not explicitly approved as an intended use of the software. Otherwise, end-users may be incentivized to agree with algorithm results to avoid professional harm leading to over-reliance on a potentially imperfect reference standard. | 13 | 86.7 | 8 | 7.6 | 9 |
| 7 | Generalization of single institutional data to larger human populations poses an ethical risk. | 13 | 86.7 | 8 | 7.6 | 9 |
Green shading with progressive lightness indicates unanimous, strong, and moderate consensus. Column 2 (“total 7, 8, and 9”) indicates the number of respondents with Likert scores in the “agree” range. Statements are organized in descending order by percent consensus, median, mean, and mode.
Discussion
The ultimate goal of AI/ML R&D in emergency and trauma radiology is to develop and validate tools that are safe, effective, and improve patient health. The Food and Drug Administration (FDA) Center for Devices and Radiological Health serves as the regulatory body for clearance through the 510(k) or de novo software as a medical device (SaMD) pathway, ensuring that devices work as intended without major risks [27, 28]. Reimbursement through the Centers for Medicare and Medicaid Services (CMS) New Technology Add-on Payment (NTAP) involves cost-sharing with hospital systems based on national cost reporting for a given tool, provided that the cost is in excess of the bundled Diagnosis-Related Group (DRG) payment. For a tool to receive an NTAP designation, there must be evidence of newness and substantial improvement in clinical outcomes over existing technologies [27, 28]. In this way, CMS intends to incentivize improvement of patient care through innovation. Guidelines are also evolving within these agencies to promote software with high user acceptance, establish mechanisms for post-market monitoring and retraining of AI tools, and develop novel validation criteria for generalist AI tools.
Practitioners of diagnostic emergency and trauma radiology are tasked with diagnosing and characterizing the long tail of diverse and sometimes uncommon or rare pathologies and injuries under time constraints related to the high volume and high acuity in this setting. AI R&D in this domain is both complex and challenging, and although a) commercial tools are now used to detect and, in some cases, characterize the most common pathologies, such as ischemic stroke, intracranial hemorrhage, fractures, pneumonia, and pulmonary embolism; b) publications describing prototype tools have increased steeply since 2016; and c) the majority of radiologists report using at least one AI CAD tool in clinical practice [5], the field remains in a very early stage of maturity.
In our previous survey of ASER members and scoping review of trauma AI CAD tools, the ASER AI/ML panel identified scarcity of public data for many tasks, methodological concerns with ground truth labeling, and a lack of papers exploring elements of user acceptance and user interaction as major limitations. Our survey respondents overwhelmingly expressed the need for explainable and verifiable tools and the need for transparency in the R&D process. Negative perceptions of AI centered around automation bias, over-diagnosis, poor generalizability, and impediments to workflow. In addition to triage and detection tools, respondents reported disease severity or injury grading, quantitative visualization, and autopopulation of reports as high-value tasks [5]. Cross-over multi-reader, multi-case (MRMC) methodology is infrequent in validation papers; there remain few papers exploring the generalizability of clinically approved tools to local data using high-quality reference standards; and very few papers have explored effects on patient outcomes or performed cost-benefit analyses [4].
In this work, we do not set out to determine unmet needs for specific non-traumatic pathologies or traumatic injuries, as gaps in R&D and perceived needs were explored in our previous work products. Rather, the Expert Panel employed the Delphi approach as a structured, systematic, collaborative, and comprehensive means of reaching consensus on research guidance and methodological recommendations for ER/trauma radiology AI tools along various stages of the FDA Center for Devices and Radiological Health (CDRH) Total Product Life Cycle (TPLC), from ideation to post-market surveillance. More specifically, this consensus document (Table 1) is intended to help navigate the challenging process of developing, refining, monitoring, and governing the use of transparent, robust, and ethical AI CAD tools that have high acceptance, improve workflow, are well-validated, improve outcomes, and decrease costs.
A structured summary of our guidance and recommendations is provided below, organized along the seven key themes (1. Dataset curation, 2. Validity, 3. Human factors, 4. Workflow, 5. Barriers to research and implementation, 6. Future avenues, and 7. Ethics) and, within each theme, by the level of consensus (unanimous: 100%; strong: ≥90 to <100%; moderate: ≥80 to <90%). Consensus statements are contextualized using the existing literature and illustrated with examples and hypothetical scenarios where these may be helpful to convey practical implications.
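As a minimal illustration of the consensus bands described above, the following sketch computes percent agreement from a set of 1 to 9 Likert scores (counting 7, 8, and 9 as "agree") and maps the result to a band. The function names and the example scores are hypothetical and are not the panel's data or analysis code.

```python
from statistics import mean, median

def percent_agreement(scores):
    """Percentage of respondents whose Likert score is 7, 8, or 9."""
    agree = sum(1 for s in scores if s >= 7)
    return 100.0 * agree / len(scores)

def consensus_level(pct):
    """Map percent agreement to the bands used in this document."""
    if pct >= 100:
        return "unanimous"
    if pct >= 90:
        return "strong"
    if pct >= 80:
        return "moderate"
    return "no consensus"

# Hypothetical scores from 15 respondents for a single statement.
example_scores = [9, 9, 8, 9, 7, 8, 9, 6, 9, 8, 9, 7, 9, 5, 9]
example_pct = percent_agreement(example_scores)
example_level = consensus_level(example_pct)
example_summary = (round(example_pct, 1), median(example_scores),
                   round(mean(example_scores), 1))
print(example_level, example_summary)
```

With these hypothetical scores, 13 of 15 respondents fall in the "agree" range, so the statement would land in the moderate band.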
Dataset curation
Unanimous recommendations
There was unanimous agreement on statements that pertained to transparency of data curation methods and the need for high-quality curation and annotation of external validation datasets so that robustness and generalizability can be adequately examined (statements 1 and 2). Regarding transparency, key elements of reporting should include how the studies were retrieved and anonymized, patient selection criteria, baseline patient characteristics, imaging hardware makes and models, and the number of patients, studies, and images included in the final dataset. External datasets should be curated using heterogeneous out-of-sample data from multiple institutions and/or public repositories, and should reflect a spectrum of demographic features, vendors, and imaging parameters or protocols to facilitate testing of mature tools for robustness.
The number and size of publicly available datasets with heterogeneous multi-institutional data have been increasing over time [29–33]. Since medical imaging datasets are much smaller than those that brought breakthroughs in natural image processing, and vary in appearances, testing for generalizability will require IRB-approved curation of datasets that are not only sizeable and heterogeneous, but also conform to high methodological standards [34]. Comparatively small dataset sizes in medical imaging require every effort to develop temporal and external validation datasets free of spectrum bias (e.g., by using consecutive selection rather than convenience sampling), so that results from validation datasets reflect, as closely as possible, performance in the clinical setting [35, 36]. Additionally, details on patient characteristics allow investigators to adjust disease prevalence and assess performance on samples representative of various practice settings (e.g., urban or rural, and tertiary academic versus community or teleradiology practice) [36].
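To make the point about prevalence adjustment concrete, the sketch below applies Bayes' theorem to show how positive and negative predictive values change with disease prevalence when sensitivity and specificity are held fixed. The sensitivity, specificity, and prevalence values are illustrative assumptions, not results from any cited study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence using Bayes' theorem."""
    tp = sensitivity * prevalence              # true-positive fraction
    fp = (1 - specificity) * (1 - prevalence)  # false-positive fraction
    tn = specificity * (1 - prevalence)        # true-negative fraction
    fn = (1 - sensitivity) * prevalence        # false-negative fraction
    return tp / (tp + fp), tn / (tn + fn)

# Hypothetical tool: 90% sensitivity, 95% specificity.
# Assumed prevalences for two practice settings (illustrative only).
ppv_low, npv_low = predictive_values(0.90, 0.95, prevalence=0.05)
ppv_high, npv_high = predictive_values(0.90, 0.95, prevalence=0.30)
print(f"prevalence 5%:  PPV={ppv_low:.3f}, NPV={npv_low:.3f}")
print(f"prevalence 30%: PPV={ppv_high:.3f}, NPV={npv_high:.3f}")
```

Under these assumptions, the same tool yields a PPV of roughly 0.49 at 5% prevalence and roughly 0.89 at 30% prevalence, which is why reporting patient characteristics and case mix matters when comparing practice settings.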
Strong and moderate recommendations
Consensus was similarly reached on the need for transparency in reporting of annotation methods with sufficient details on the annotation procedure so that methodologic quality can be ascertained, and labeling can be reproduced (statements 3 and 7) [37]. Use of the updated CLAIM checklist is encouraged to ensure methodologic rigor and transparency (statement 7) [38]. Quality control measures are strongly encouraged. These may include supervision of labeling and label editing by senior personnel (for slice-level, pixelwise, or voxelwise editing), and a valid means of reaching label consensus (e.g., arbitration by a senior expert for patient-level ground truthing) (statement 9) [37]. As the time, effort, and personnel requirements for ensuring quality control for the many classes of pathology encountered in the ER/trauma setting are expected to be increasingly prohibitive when scaling up to large data samples, and given that scarcity of labeled data remains a major obstacle, particularly for cross-sectional imaging tasks (statement 5), human-in-the-loop methods such as active learning that accelerate labeling should be adopted when possible, with procedures in place to minimize automation bias (statement 8) [7, 8]. The many classes of non-traumatic and traumatic pathology that may be encountered and the very high variability in appearances require training and on-task validation of mature tools using large heterogeneous datasets (statement 13) [7, 11]. However, we also recognize that early proof-of-concept studies using novel methods require costly, time- and resource-intensive iterative experimentation with various approaches and modifications. Use of convenience samples and balanced datasets at this stage may be acceptable if there is a premium on algorithm novelty or novel use cases for which data is scarce.
We therefore acknowledge that applying the same expectations of data size, heterogeneity, labeling quality, and multi-reader design to first-of-their-kind ER/trauma CAD tools could stifle early R&D (statement 6), and that methodological versus technical rigor needs to be commensurate with the level of technology readiness (statement 14) [21].
The ASER AI/ML panel could be instrumental in future priority-setting, leveraging collective knowledge of emerging technologies and emergency/trauma tasks identified as top priorities by end-users to actively encourage the curation and dissemination of public datasets that maximize clinical value. This would best be realized through collaboration with other clinical and technical societies in conceptualization, planning, and execution, ranging from acquiring regulatory and legal approval to accelerate multicenter dataset curation; input into optimal curation and labeling strategies; and hands-on engagement in crowd-sourced labeling efforts (statements 4, 11, and 15) [39].
Data sharing rules involving multi-center IRB approval and data use agreements take considerable time to execute, and in our experience, can take longer in the ER/trauma subspeciality due to the limited bandwidth of investigators with challenging 24/7 schedules, a smaller pool of clinical practitioners and trainees, and less grant-protected time in this relatively understudied and under-resourced field. Nevertheless, input from the Panel could facilitate early starts on projects and ensure that data curation and labeling schemes harmonize with a given task, minimizing the risk of misspent effort. To ensure adequate incentives and buy-in for these multicenter projects among frequently overcommitted practitioners, there should be liberal opportunities for co-authorship or contributorship, following standard ICMJE guidelines (statement 10) [40]. Several members of the panel recently served as authors or contributors on the RSNA RATIC abdominal trauma dataset [41], and discussions on further annotation refinement of this dataset with the relevant RSNA organizing committee are ongoing. While there has been notable progress in public dataset curation by the RSNA and other groups for a range of important pathologies and injuries, challenge data is still lacking in many domains in this broad field (statement 16) [32]. Growing inter-institutional consortia and advancements in federated learning are expected to accelerate ER/trauma tool R&D from bench to bedside, reducing obstacles to pre- and post-market privacy-preserving external validation [7, 42].
Validity
Unanimous recommendations
Unanimous consensus was formed around the importance of benchmarking emergency/trauma AI CAD tools based on their intended use following preliminary proof-of-concept and translational work; for example, triage/early notification (CADt) and detection (CADe) tools should be benchmarked by comparing turnaround times or reader performance with and without CAD, respectively (statement 1) [36]. Non-adversarial robustness, the resilience of a model in the face of shifting distributions or noisy data, remains a large concern (statement 2). Validation requires objective and trustworthy reference standards (with arbitrated reads, segmentations with appropriate quality control, or well-established “electronic phenotypes” such as histopathology data) that fit the task at hand (statement 3) [34]. In our scoping review, we observed that clinical-translational work describing detection algorithms best suited as second reader tools rarely compared human performance with and without the use of AI prototype tools [4]. This design, when used in small pilot studies, can determine effect sizes for powering future labor-intensive cross-over multi-reader multi-case (MRMC) clinical trials [43]. The use of the same commercial tools in both the ground truthing and evaluation process is a frequent flaw in study design [43]. For example, a frequent shortcut involves attempting to ground truth a validation sample by scrutinizing only those studies flagged as positive or discrepant by an algorithm. Such an approach fails to ground truth true negatives, is subject to automation bias and data leakage, and presupposes conclusions about algorithm performance.
Researchers should routinely provide the intended use for a given approach along with information on accuracy, and compute times at the study level, since time is critical in the ER/trauma setting, and results may be desirable while the patient is still on the CT table. A CT-based segmentation method using 3D fully convolutional neural networks, vision transformers, or other computationally heavy approaches, may be found to have excellent diagnostic accuracy metrics similar to or improved over an existing lightweight weakly supervised 2D classification approach. But if the output is orders of magnitude slower, it may have little utility as a CADt tool with current hardware. Conversely, a lightweight approach that provides saliency maps may be adequate for CADt purposes but inadequate as a CADe second reader for a complex CT-based fracture grading system [4]. By unambiguously describing intended use and reporting appropriately benchmarked metrics, authors of initial proof-of-concept papers establish a clear justification for subsequent labor-intensive processes of curating and annotating large heterogeneous multicenter datasets and validating performance and robustness at scale.
Strong and moderate recommendations
For tools far along the technology readiness level (TRL) pipeline, demonstration of scalability and robustness across diverse populations and healthcare settings is an essential prerequisite of premarket approval and widespread use (statement 4). However, following regulatory approval, the Total Product Life Cycle of ER/trauma CAD tools should include not only local re-evaluation for accuracy and/or model fit prior to implementation, but also monitoring for drops in performance, and potential periodic or continuous retraining (statement 5, statement 8) [44–46]. The FDA has already approved continuous learning paradigms for several commercial radiology AI tools [46, 47].
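One simple way to operationalize monitoring for drops in performance is to compare sensitivity in successive audit windows against a locally established baseline. The sketch below is a minimal illustration under stated assumptions: the function name, the fixed 0.05 tolerance, and the quarterly counts are all hypothetical, and a production system would more likely use confidence intervals or statistical process control rather than a fixed threshold.

```python
def flag_performance_drops(windows, baseline_sensitivity, tolerance=0.05):
    """Return audit windows whose sensitivity falls below baseline - tolerance.

    windows: iterable of (label, true_positives, false_negatives) tuples,
    where counts come from cases with an adjudicated reference standard.
    """
    flagged = []
    for label, tp, fn in windows:
        positives = tp + fn
        if positives == 0:
            continue  # no reference-positive cases to evaluate in this window
        sensitivity = tp / positives
        if sensitivity < baseline_sensitivity - tolerance:
            flagged.append((label, round(sensitivity, 3)))
    return flagged

# Hypothetical quarterly audit counts for a detection tool.
audit_windows = [
    ("2024-Q1", 46, 4),   # sensitivity 0.92
    ("2024-Q2", 44, 6),   # sensitivity 0.88, within tolerance
    ("2024-Q3", 40, 10),  # sensitivity 0.80, below tolerance
]
flagged_windows = flag_performance_drops(audit_windows, baseline_sensitivity=0.92)
print(flagged_windows)
```

Flagged windows would then be reported to the relevant governance structure for review, and could prompt local re-evaluation or retraining.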
For validation studies to be considered trustworthy by end-users, agreement between expert study participants (with tests such as Cohen’s Kappa, Dice similarity, and intraclass correlation coefficient), and raw performance data should be available in the published literature, or upon request to a given vendor. Developers and vendors should also provide a strong justification for any cut-offs used (statement 6, statement 9, statement 10, statement 12) [5, 36–38, 48, 49]. To illustrate with a few examples, agreement is highly variable for classification systems such as the AAST organ injury scales or pelvic fracture grading systems, and accuracy for some challenging positive or negative diagnoses such as penetrating diaphragmatic or bowel injury may also vary widely among experts [50–53]. Relevant stakeholders such as AI governance committees should therefore be able to compare local agreement or performance of readers and algorithms with the published literature to determine whether a tool performs within the standard of care and improves diagnostic accuracy as expected [45]. Validation of tools should also extend beyond performance to include assessment of system benevolence and user satisfaction (e.g., through user interface and user experience studies) (statement 11) [45, 54–56]. Since evidence of improved outcomes is a criterion for CMS reimbursement and the goal of AI tools is improving patient health, outcomes research should be a major ongoing priority for FDA-approved tools (statement 13) [27, 28, 57].
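For readers less familiar with the agreement measures named above, the following sketch computes unweighted Cohen's kappa for two readers and the Dice similarity coefficient for two segmentation masks, using only the Python standard library. The reader labels and voxel coordinates are hypothetical. For ordinal scales such as injury grades, a weighted kappa is often preferred, which this sketch does not implement.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labeling the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

def dice_coefficient(mask_a, mask_b):
    """Dice similarity for two masks given as collections of voxel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks are treated as perfect agreement
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical binary reads (1 = injury present) from two readers on 10 studies.
reader_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
reader_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
kappa = cohens_kappa(reader_1, reader_2)

# Hypothetical segmentations as sets of (row, column) pixel coordinates.
seg_1 = {(0, 0), (0, 1), (1, 1)}
seg_2 = {(0, 1), (1, 1), (1, 2)}
dice = dice_coefficient(seg_1, seg_2)
print(f"kappa={kappa:.3f}, dice={dice:.3f}")
```

In this example the readers agree on 8 of 10 studies, but kappa is about 0.58 once chance agreement is removed, which illustrates why raw percent agreement alone can overstate reliability.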
In addition to the emphasis placed on non-adversarial robustness, deliberate manipulations of input data, known as adversarial attacks, remain a major IT security priority due to the potential risk of patient harm through misdiagnosis, incorrect treatment, and flawed research (statement 14) [58, 59]. It should be noted that, to date, we are not aware of any publicized real-world incidents of such attacks. Proposed motivations for such attacks range from intentional patient harm to false claims for reimbursement (i.e., insurance fraud) [60]. The potential for high-risk malicious adversarial perturbations of medical imaging AI algorithms has been demonstrated in a growing body of literature, and these algorithms are more vulnerable to perturbations than algorithms trained on natural images due to complex textures, over-parameterization, and in particular, “black box” design where algorithm reasoning cannot be interrogated [60]. Patient harm from adversarial attacks can only be prevented with protective measures, among them, perturbation detectors.
The panel also felt that it should take a role in further exploration of clinical needs to determine high priority areas for AI CAD R&D (statement 7). This could include work such as follow-up ASER member surveys, additional scoping reviews to chart progress in the ER/trauma AI domain, end-user and patient focus groups and interviews, pilot studies of AI tools, and collaborative research with other societies [4, 5, 39].
Human factors
Strong and moderate recommendations
The panel identified a need for formalized approaches to achieve high-trust, high-transparency human factors engineering goals for emergency/trauma AI algorithm and software development (statement 1) [54, 55], and agreed that the panel should leverage computer scientist expertise to develop open-source tools for high-value ER/trauma tasks (statements 6 and 7) [61–63]. We also stress human-centered design principles to determine the level of interoperability and human-in-the-loop functionality desired by end users, considering dimensions such as the degree of mental support, workload and frustration, trust, and likelihood of future use in formative user research and simulated deployment studies (statements 2 and 3) [54, 64–66]. To this end, members of the panel collaborated on a formative user research study with a prototype spleen AAST grading tool that included iterative user feedback and Likert grading along these and other dimensions [56]. The study yielded generalizable insights regarding PACS integration, pop-up and instant messenger alerts, and report autopopulation. Following this feedback, members of the panel developed and shadow tested software with these features for PACS-integrated, DICOM-compatible results viewing, supporting end-to-end combined detection and quantitative visualization [67]. Other members of the panel have independently developed user-friendly orchestrators for AI algorithm deployment [62]. Ideally, future studies should include both clinical teams and radiologists (statement 3) [44]. To date, the FDA has introduced a set of practice expectations (known as GMLP, or “good machine learning practice”), including cursory guidance pertaining to interpretability [47].
Given that errors in emergency and trauma imaging can be life-threatening and errors based on spurious algorithm assumptions could have major adverse consequences, transparency is required and maximum interpretability is needed even when algorithm performance is properly validated (statements 9 and 10) [68–71]. The panel recommends that regulatory agencies engage directly with radiologists and clinical end-users to help ensure that at all stages in the R&D pipeline, AI algorithms and software are designed and developed to meet end-user needs (statement 8) [61].
Workflow
Unanimous recommendations
The panel agreed unanimously that more research is needed to determine if CADt tools decrease report turnaround time and make the radiology workflow faster (statement 1) [4, 72]. Currently, research papers evaluating CADt tools in emergency and trauma radiology are few. Most papers showing a potential benefit focus on turnaround times before and after implementation. Software has been developed for randomized controlled trials (RCTs) that randomly assigns cases to an experimental group with CADt processing and a control group without, and we are aware of one prospective RCT that has used random notification drop in its study design [73].
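The case-level randomization described above can be sketched as follows. This is not the software used in the cited work; it is a hypothetical illustration that swaps in a hash-based assignment so that a given study is always routed to the same arm and the allocation can be audited later. The function name, the salt value, and the accession number format are assumptions.

```python
import hashlib

def assign_arm(accession_number, salt="example-trial-salt", cadt_fraction=0.5):
    """Deterministically assign a study to the CADt or control arm.

    The salted accession number is hashed and mapped to a value in [0, 1];
    studies falling below cadt_fraction are routed to CADt processing.
    """
    key = f"{salt}:{accession_number}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # approximately uniform
    return "CADt" if u < cadt_fraction else "control"

# Hypothetical accession numbers; arms are reproducible across runs.
accessions = [f"ACC{i:06d}" for i in range(2000)]
arms = [assign_arm(acc) for acc in accessions]
cadt_share = arms.count("CADt") / len(arms)
print(f"share assigned to CADt: {cadt_share:.3f}")
```

Keeping the salt concealed from readers is what prevents the allocation from being predicted in advance; a conventional randomization list held by a third party would serve the same purpose.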
Strong and moderate recommendations
Cost-benefit analyses will also be needed to determine if clinical AI tools can improve outcomes that translate into cost savings (statement 2). The FDA has established the National Evaluation System for Healthcare Technologies (NEST) program to work with teams on real world evidence (RWE) and value analysis initiatives [65]. The perceived value of CAD AI tools should be studied from different angles, including through the perspective of governance structures, hospital administrators, and patients (statement 5) [27, 45, 57]. Guidelines for institutional governance structures and societies are necessary to standardize the process of post-market surveillance of AI CAD tools (statement 3) [45]. Operationalizing locally developed algorithms can be challenging since bespoke deployment pipelines require collaboration between radiologists, computer scientists, and software developers with information technology expertise. To avoid “reinventing the wheel”, R&D efforts should be directed toward open-source interoperable software (statement 4) [62, 63, 65, 74], along FAIR principles (“findable, accessible, interoperable, and reusable”). Orchestrators are ultimately expected to integrate enterprise RIS/PACS platforms with multiple tools from multiple vendors for a seamless radiology workflow. Surveys that describe progress in orchestrator R&D, mergers, and clinical adoption will help governance structures with acquisition decisions with respect to these orchestrator platforms in the emergency setting (statement 7) [75]. Scoping reviews to chart the current state of orchestrator R&D could also be beneficial. For “home grown” or non-commercial AI algorithm prototypes, there should be research emphasis on open-source pipelines that lower barriers toward “shadow evaluation” in the clinical workflow and prospective pre-clinical studies (statement 5) [63, 76]. Commercial tools should interact to varying degrees with RIS/PACS systems to meet end-user needs.
RIS/PACS integration should be carefully researched, and this should include staff that would be involved in continued technical support (statement 6) [45].
Barriers to research and implementation
Strong and moderate recommendations
Barriers occur at all stages of the technology readiness pipeline, from dataset curation to post-market surveillance. For mature tools, successful AI R&D and implementation depend on involvement of a large multidisciplinary team that may include IT, radiologists, hospital administrators, developers/vendors and technical support staff (statements 1 and 2) [45, 66]. During the R&D process, researchers should anticipate and attempt to minimize resistance from emergency radiologists, which our ASER member survey previously revealed can be related to perceived negative effects on efficiency, over- and under-diagnosis, negative impact on training, lack of published data to support use, financial concerns, ethical concerns, and concerns about AI CAD tool transparency (statement 3) [5].
Foundational and translational AI research is resource- and time-intensive and often requires grant support from funding agencies. Future ASER AI/ML Expert Panel guidance documents and workshops could be helpful to clarify funding mechanisms, promote entrepreneurship, and provide education on best practices to maximize likelihood of obtaining regulatory approval and reimbursement codes. Additionally, informatics fellowships, radiology informatics tracks, and ASER AI/ML Expert Panel-led workshops could help promote engagement and accelerate AI research in ER/trauma (statements 4, 5, 6, and 7) [4, 77–80].
Future avenues
Unanimous recommendations
Future avenues of research require a multifaceted approach that examines not only effects on diagnostic speed and turnaround and reducing missed pathology or injury, but also direct effects on patient care (statement 1) [81]. A relatively small number of tools have received CMS NTAP recognition. As of this writing, all fall nominally within the umbrella of emergency radiology. These include detection tools for CT intracranial hemorrhage, vertebral fracture, and large vessel occlusion [27]. The evidence used to receive NTAP designation is unfortunately not always available, and while some evidence may be published, unpublished proprietary data may also be submitted. Therefore, it is unclear what thresholds must be met to receive NTAP status, whether these have become more stringent over time, and to what extent CMS judgements of “returns on investment” with respect to population health and cost may vary on a case-by-case basis [82]. Based on the collective knowledge and experience of AI/ML Expert Panel members, it appears that intermediate outcomes or surrogate endpoints are typically sufficient, provided that there is an existing robust body of evidence linking these with hard endpoints such as mortality, quality-adjusted life-years (QALYs), and net monetary benefit. Aside from the need for robust evidence linking time savings or diagnostic accuracy with clinical outcomes in acute illness or injury, generally speaking the more published single or multicenter evidence supporting improvement in surrogate endpoints, the better. While not an explicit panel recommendation, more literature is needed to shed light on the basis for determining initial NTAP approval, length of approval, and future adjustments [27].
Strong and moderate recommendations
Leveraging computer science faculty expertise, the ASER AI/ML Expert Panel should engage in empirical simulation studies to develop human-centered AI systems for high value tasks determined by group consensus (statement 2) [54]. Overall, emergency radiology AI CAD tools could increase access to care for underserved populations by providing assistance in under-resourced settings where emergency/trauma expertise is lacking. Since underserved patients are overrepresented in the emergency room, research on penetration of AI tools into underserved population centers is a high priority (statements 3 and 4) [83].
Ethics
Unanimous recommendations
There are potential unforeseen risks associated with black box algorithms or data-heavy “radiomics” algorithms falsely labeling patients as having or not having a disease (statement 1). This further highlights the need for high-trust, responsible, and transparent tools [70, 71].
Strong and moderate recommendations
Mechanisms must be established to monitor the performance of AI solutions over time and report deviations (statement 2) [45]. The ethical risks of bias, such as from failing to test tools in relevant clinical and demographic substrata or generalizing single institution data to large human populations are emphasized (statements 4 and 7) [36, 37, 84]. Prior to clinical approval, end users involved in investigation of tools deployed for pre-clinical testing in “shadow mode” should in most circumstances be included on IRB protocols [45]. A low threshold for consultations with Human Research Protections staff is advised. For commercial tools, radiologists should be considered research subjects if auditing of radiologists is performed beyond the approved intended use of the software; end-users may otherwise be incentivized to agree with algorithm results and rely too strongly on imperfect reference standards to avoid potential professional harm (statements 5 and 6). While there is little literature on radiologists as potential subjects in greater than minimal risk research, the concept resonated with the panel and the issue should be explored further. Finally, as AI algorithms proliferate, there is a moral hazard of increased dependence of less qualified staff lacking emergency radiology expertise on potentially flawed AI systems (statement 3) [85, 86].
Conclusions
The aim of this comprehensive ASER AI/ML Expert Panel-driven Delphi study was to produce, to our knowledge, the first consensus guidance document on AI R&D in the emergency and trauma radiology subspeciality. The major conclusions of this work product are as follows: Aligning with regulatory pathways set forth by FDA and CMS, the overall objective of AI/ML R&D in emergency and trauma radiology is to create scalable tools that are safe, rapid, accurate, generalizable, and that improve patient outcomes. A premium is also placed on algorithm transparency and explainability, and human factors design that reduces burden in the emergency setting.
The consensus statements also emphasize collaboration between multidisciplinary teams throughout the Total Product Life Cycle to ensure ethical, high-impact R&D aligned with user expectations and patient outcomes; rigor and transparency in data curation and labeling; large-scale, diverse, external datasets including patient characteristics for bias analyses; clear research design and reporting of results consistent with intended use; an emphasis on patient outcomes and cost-benefit research to justify acquisition and reimbursement; and promotion of entrepreneurship or research funding to accelerate R&D.
Limitations and future directions
Some caution is needed, as the Delphi study was limited to two rounds and had a relatively small participant sample size (15 experts following drop-out), although this is within the suggested range of 10–18 participants to achieve reliable consensus with manageable logistics [87]. We opted for a literature search for knowledge synthesis in place of group discussion to avoid band-wagoning. Our literature search was guided by the expert knowledge network of the steering committee but was not systematic; however, information was synthesized from numerous recent systematic reviews, position statements, and topical editorials and commentaries in leading medical image processing and radiology journals. We also acknowledge major developments, since the start of this study, in language, image-based, and combined multi-task vision-language generalist foundation models that can be fine-tuned for emergency and trauma radiology-related AI tasks [88]. As AI is a rapidly advancing technology, the statements could be re-evaluated in future work with more basic and translational scientists and with additional rounds for further refinement. Nevertheless, the consensus document is expected to be a useful framework for AI researchers in the subspeciality and, more broadly, could stimulate discussions and collaborative research in other fields.
Supplementary Material
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s10140-024-02306-1.
Acknowledgements
David Dreizin funding source: NIH R01 GM148987-01 (PI: David Dreizin, MD)
Footnotes
The authors have no conflicts of interest to declare that are relevant to the content of this article.
This original work has not been published or submitted elsewhere for review.
Data availability
All data supporting the findings of this study are available within the paper and its Supplementary material (Appendices 1, 2a, and 2b).
References
1. Cellina M, Cè M, Irmici G, Ascenti V, Caloro E, Bianchi L, Pellegrino G, D’Amico N, Papa S, Carrafiello G (2022) Artificial intelligence in emergency radiology: where are we going? Diagnostics 12(12):3223.
2. Liu J, Varghese B, Taravat F, Eibschutz LS, Gholamrezanezhad A (2022) An extra set of intelligent eyes: application of artificial intelligence in imaging of abdominopelvic pathologies in emergency radiology. Diagnostics 12(6):1351.
3. Dreizin D (2023) The American Society of Emergency Radiology (ASER) AI/ML expert panel: inception, mandate, work products, and goals. Emerg Radiol 30(3):279–283.
4. Dreizin D, Staziaki PV, Khatri GD, Beckmann NM, Feng Z, Liang Y, Delproposto ZS, Klug M, Spann JS, Sarkar N (2023) Artificial intelligence CAD tools in trauma imaging: a scoping review from the American Society of Emergency Radiology (ASER) AI/ML Expert Panel. Emerg Radiol 30(3):251–265.
5. Agrawal A, Khatri GD, Khurana B, Sodickson AD, Liang Y, Dreizin D (2023) A survey of ASER members on artificial intelligence in emergency radiology: trends, perceptions, and expectations. Emerg Radiol 30(3):267–277.
6. Cheng C-T, Ooyang C-H, Kang S-C, Liao C-H (2024) Applications of Deep Learning in Trauma Radiology: A Narrative Review. Biom J:100743.
7. Langlotz CP, Allen B, Erickson BJ, Kalpathy-Cramer J, Bigelow K, Cook TS, Flanders AE, Lungren MP, Mendelson DS, Rudie JD (2019) A roadmap for foundational research on artificial intelligence in medical imaging: from the 2018 NIH/RSNA/ACR/The Academy Workshop. Radiology 291(3):781–791.
8. Dreizin D, Zhang L, Sarkar N, Bodanapally UK, Li G, Hu J, Chen H, Khedr M, Khetan U, Campbell P (2023) Accelerating voxelwise annotation of cross-sectional imaging through AI collaborative labeling with quality assurance and bias mitigation. Front Radiol 3:1202412.
9. Diaz-Pinto A, Alle S, Nath V, Tang Y, Ihsani A, Asad M, Pérez-García F, Mehta P, Li W, Flores M (2024) Monai label: A framework for ai-assisted interactive labeling of 3d medical images. Med Image Anal 95:103207.
10. Agrawal A (2022) Emergency teleradiology-past, present, and, is there a future. Front Radiol 2:866643.
11. Zhou SK, Greenspan H, Davatzikos C, Duncan JS, Van Ginneken B, Madabhushi A, Prince JL, Rueckert D, Summers RM (2021) A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proc IEEE 109(5):820–838.
12. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition 2016; p. 770–778.
13. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5–9, 2015, proceedings, part III 18: Springer, 2015; p. 234–241.
14. Torres-Lopez VM, Rovenolt GE, Olcese AJ, Garcia GE, Chacko SM, Robinson A, Gaiser E, Acosta J, Herman AL, Kuohn LR (2022) Development and validation of a model to identify critical brain injuries using natural language processing of text computed tomography reports. JAMA Netw Open 5(8):e2227109.
15. Le Guellec B, Lefèvre A, Geay C, Shorten L, Bruge C, Hacein-Bey L, Amouyel P, Pruvo J-P, Kuchcinski G, Hamroun A (2024) Performance of an open-source large language model in extracting information from free-text radiology reports. Radiol Artif Intell:e230364.
16. Huang Y, Yang X, Liu L, Zhou H, Chang A, Zhou X, Chen R, Yu J, Chen J, Chen C (2024) Segment anything model for medical images? Med Image Anal 92:103061.
17. Mohsan MM, Akram MU, Rasool G, Alghamdi NS, Baqai MAA, Abbas M (2022) Vision transformer and language model based radiology report generation. IEEE Access 11:1814–1824.
18. Shen Y, Li J, Shao X, Romillo BI, Jindal A, Dreizin D, Unberath M. FastSAM3D: An Efficient Segment Anything Model for 3D Volumetric Medical Images. arXiv preprint arXiv:2403.09827 2024.
19. Hudnal C (2023) ACR eBulletin “Choosing AI”. American College of Radiology Press.
20. Ahmad OF, Mori Y, Misawa M, Kudo S-e, Anderson JT, Bernal J, Berzin TM, Bisschops R, Byrne MF, Chen P-J (2021) Establishing key research questions for the implementation of artificial intelligence in colonoscopy: a modified Delphi method. Endoscopy 53(09):893–901.
21. Lavin A, Gilligan-Lee CM, Visnjic A, Ganju S, Newman D, Ganguly S, Lange D, Baydin AG, Sharma A, Gibson A (2022) Technology readiness levels for machine learning systems. Nat Commun 13(1):6039.
22. Weikert T, Cyriac J, Yang S, Nesic I, Parmar V, Stieltjes B (2020) A practical guide to artificial intelligence–based image analysis in radiology. Investig Radiol 55(1):1–7.
23. Niederberger M, Köberich S, Network D (2021) Coming to consensus: the Delphi technique. Oxford University Press.
24. Jünger S, Payne SA, Brine J, Radbruch L, Brearley SG (2017) Guidance on Conducting and REporting DElphi Studies (CREDES) in palliative care: Recommendations based on a methodological systematic review. Palliat Med 31(8):684–706.
25. Diamond IR, Grant RC, Feldman BM, Pencharz PB, Ling SC, Moore AM, Wales PW (2014) Defining consensus: a systematic review recommends methodologic criteria for reporting of Delphi studies. J Clin Epidemiol 67(4):401–409.
26. Nowack M, Endrikat J, Guenther E (2011) Review of Delphi-based scenario studies: Quality and design considerations. Technol Forecast Soc Chang 78(9):1603–1615.
27. Chen MM, Golding LP, Nicola GN (2021) Who will pay for AI? Radiol Artif Intell 3(3):e210030.
28. Benjamens S, Dhunnoo P, Meskó B (2020) The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. npj Digital Med 3(1):118.
29. Mei X, Liu Z, Robson PM, Marinelli B, Huang M, Doshi A, Jacobi A, Cao C, Link KE, Yang T (2022) RadImageNet: an open radiologic deep learning research dataset for effective transfer learning. Radiol Artif Intell 4(5):e210315.
30. Liu P, Han H, Du Y, Zhu H, Li Y, Gu F, Xiao H, Li J, Zhao C, Xiao L (2021) Deep learning to segment pelvic bones: large-scale CT datasets and baseline models. Int J Comput Assist Radiol Surg 16:749–756.
31. Jin L, Yang J, Kuang K, Ni B, Gao Y, Sun Y, Gao P, Ma W, Tan M, Kang H (2020) Deep-learning-assisted detection and segmentation of rib fractures from CT scans: Development and validation of FracNet. EBioMedicine 62.
32. Kitamura FC, Prevedello LM, Colak E, Halabi SS, Lungren MP, Ball RL, Kalpathy-Cramer J, Kahn CE Jr, Richards T, Talbott JF (2024) Lessons Learned in Building Expertly Annotated Multi-Institution Datasets and Hosting the RSNA AI Challenges. Radiol Artif Intell 6(3):e230227.
33. Rudie JD, Lin H-M, Ball RL, Jalal S, Prevedello LM, Nicolaou S, Marinelli BS, Flanders AE, Magudia K, Shih G (2024) The RSNA Abdominal Traumatic Injury CT (RATIC) Dataset. Radiol Artif Intell 6(6):e240101.
34. Willemink MJ, Koszek WA, Hardell C, Wu J, Fleischmann D, Harvey H, Folio LR, Summers RM, Rubin DL, Lungren MP (2020) Preparing medical imaging data for machine learning. Radiology 295(1):4–15.
35. Varoquaux G, Cheplygina V (2022) Machine learning for medical imaging: methodological failures and recommendations for the future. npj Digital Med 5(1):48.
36. Park SH, Han K (2018) Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology 286(3):800–809.
37. de Hond AA, Leeuwenberg AM, Hooft L, Kant IM, Nijman SW, van Os HJ, Aardoom JJ, Debray TP, Schuit E, van Smeden M (2022) Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. npj Digital Med 5(1):2.
38. Tejani AS, Klontzas ME, Gatti AA, Mongan JT, Moy L, Park SH, Kahn CE Jr, Panel CU (2024) Checklist for Artificial Intelligence in Medical Imaging (CLAIM): 2024 Update. Radiol Artif Intell:e240300.
39. Allen B, Schmidt K, Brink L, Pisano E, Coombs L, Apgar C, Dreyer K, Wald C (2023) Specialty society support for multicenter research in artificial intelligence. Acad Radiol 30(4):640–643.
40. Hwang SS, Song HH, Baik JH, Jung SL, Park SH, Choi KH, Park YH (2003) Researcher contributions and fulfillment of ICMJE authorship criteria: analysis of author contribution lists in research articles with multiple authors published in Radiology. Radiology 226(1):16–23.
41. Rudie JD, Lin HM, Ball RL, Jalal S, Prevedello LM, Nicolaou S, Marinelli BS, Flanders AE, Magudia K, Shih G, Davis MA (2024) The RSNA Abdominal Traumatic Injury CT (RATIC) Dataset. Radiol Artif Intell 6(6):e240101.
42. Linguraru MG, Bakas S, Aboian M, Chang PD, Flanders AE, Kalpathy-Cramer J, Kitamura FC, Lungren MP, Mongan J, Prevedello LM (2024) Clinical, Cultural, Computational, and Regulatory Considerations to Deploy AI in Radiology: Perspectives of RSNA and MICCAI Experts. Radiol Artif Intell:e240225.
43. Obuchowski NA, Bullen J (2022) Multireader diagnostic accuracy imaging studies: fundamentals of design and analysis. Radiology 303(1):26–34.
44. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K (2019) The practical implementation of artificial intelligence technologies in medicine. Nat Med 25(1):30–36.
45. Daye D, Wiggins WF, Lungren MP, Alkasab T, Kottler N, Allen B, Roth CJ, Bizzo BC, Durniak K, Brink JA (2022) Implementation of clinical artificial intelligence in radiology: who decides and how? Radiology 305(3):555–563.
46. Pianykh OS, Langs G, Dewey M, Enzmann DR, Herold CJ, Schoenberg SO, Brink JA (2020) Continuous learning AI in radiology: implementation principles and early applications. Radiology 297(1):6–14.
47. Zhang K, Khosravi B, Vahdati S, Erickson BJ (2024) FDA review of radiologic AI algorithms: process and challenges. Radiology 310(1):e230242.
48. Bankier AA, Levine D, Halpern EF, Kressel HY (2010) Consensus interpretation in imaging research: is there a better way? Radiological Society of North America, Inc., pp 14–17.
49. Benchoufi M, Matzner-Lober E, Molinari N, Jannot A-S, Soyer P (2020) Interobserver agreement issues in radiology. Diagn Interv Imaging 101(10):639–641.
50. Adams-McGavin RC, Tafur M, Vlachou PA, Wu M, Brassil M, Crivellaro P, Lin H-M, Gomez D, Colak E (2024) Interrater agreement of CT grading of blunt splenic injuries: does the AAST grading need to be reimagined? Can Assoc Radiol J 75(1):171–177.
51. Dreizin D, Borja MJ, Danton GH, Kadakia K, Caban K, Rivas LA, Munera F (2013) Penetrating diaphragmatic injury: accuracy of 64-section multidetector CT with trajectography. Radiology 268(3):729–737.
52. Dreizin D, Boscak AR, Anstadt MJ, Tirada N, Chiu WC, Munera F, Bodanapally UK, Hornick M, Stein DM (2016) Penetrating colorectal injuries: diagnostic performance of multidetector CT with trajectography. Radiology 281(3):749–762.
53. Berger-Groch J, Thiesen DM, Grossterlinden LG, Schaewel J, Fensky F, Hartel MJ (2019) The intra- and interobserver reliability of the Tile AO, the Young and Burgess, and FFP classifications in pelvic trauma. Arch Orthop Trauma Surg 139(5):645–650. 10.1007/s00402-019-03123-9.
54. Chen H, Gomez C, Huang C-M, Unberath M (2022) Explainable medical imaging AI needs human-centered design: guidelines and evidence from a systematic review. npj Digital Med 5(1):156.
55. Cai CJ, Reif E, Hegde N, Hipp J, Kim B, Smilkov D, Wattenberg M, Viegas F, Corrado GS, Stumpe MC. Human-centered tools for coping with imperfect algorithms during medical decision-making. Proceedings of the 2019 chi conference on human factors in computing systems 2019; p. 1–14.
56. Sarkar N, Kumagai M, Meyr S, Pothapragada S, Unberath M, Li G, Ahmed SR, Smith EB, Davis MA, Khatri GD (2024) An ASER AI/ML expert panel formative user research study for an interpretable interactive splenic AAST grading graphical user interface prototype. Emerg Radiol 31(2):167–178.
57. Van Leeuwen KG, de Rooij M, Schalekamp S, van Ginneken B, Rutten MJ (2022) How does artificial intelligence in radiology improve efficiency and health outcomes? Pediatr Radiol:1–7.
58. Chu LC, Anandkumar A, Shin HC, Fishman EK (2020) The potential dangers of artificial intelligence for radiology and radiologists. J Am Coll Radiol 17(10):1309.
59. Bortsova G, González-Gonzalo C, Wetstein SC, Dubost F, Katramados I, Hogeweg L, Liefers B, van Ginneken B, Pluim JP, Veta M (2021) Adversarial attack vulnerability of medical image analysis systems: Unexplored factors. Med Image Anal 73:102141.
60. Ma X, Niu Y, Gu L, Wang Y, Zhao Y, Bailey J, Lu F (2021) Understanding adversarial attacks on deep learning based medical image analysis systems. Pattern Recogn 110:107332.
61. Scheek D, Rezazade Mehrizi MH, Ranschaert E (2021) Radiologists in the loop: the roles of radiologists in the development of AI applications. Eur Radiol 31:7960–7968.
62. Cohen RY, Sodickson AD (2023) An orchestration platform that puts radiologists in the driver’s seat of AI innovation: a methodological approach. J Digit Imaging 36(2):700–714.
63. Zhang L, LaBelle W, Unberath M, Chen H, Hu J, Li G, Dreizin D (2023) A vendor-agnostic, PACS integrated, and DICOM-compatible software-server pipeline for testing segmentation algorithms within the clinical radiology workflow. Front Med 10:1241570.
64. Nazar M, Alam MM, Yafi E, Su’ud MM (2021) A systematic review of human–computer interaction and explainable artificial intelligence in healthcare with artificial intelligence techniques. IEEE Access 9:153316–153348.
65. Allen B Jr, Seltzer SE, Langlotz CP, Dreyer KP, Summers RM, Petrick N, Marinac-Dabic D, Cruz M, Alkasab TK, Hanisch RJ (2019) A road map for translational research on artificial intelligence in medical imaging: from the 2018 National Institutes of Health/RSNA/ACR/The Academy Workshop. J Am Coll Radiol 16(9):1179–1189.
66. Martín-Noguerol T, Paulano-Godino F, López-Ortega R, Górriz J, Riascos R, Luna A (2021) Artificial intelligence in radiology: relevance of collaborative work between radiologists and engineers for building a multidisciplinary team. Clin Radiol 76(5):317–324.
67. Dreizin D, LaBelle W, Unberath M, L Z. A PACS-Integrated Platform for Automated Combined Early Notification and Quantitative Visualization Tools with Report Auto-Population. Society for Imaging Informatics in Medicine (SIIM). National Harbor, Maryland: 2024.
68. Ahn D, Almaatouq A, Gulabani M, Hosanagar K. Will we trust what we don’t understand? Impact of model interpretability and outcome feedback on trust in AI. arXiv preprint arXiv:2111.08222 2021.
69. Reyes M, Meier R, Pereira S, Silva CA, Dahlweid F-M, Tengg-Kobligk H, Summers RM, Wiest R (2020) On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiol Artif Intell 2(3):e190043.
70. Arrieta AB, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, García S, Gil-López S, Molina D, Benjamins R (2020) Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf Fusion 58:82–115.
71. DeGrave AJ, Janizek JD, Lee S-I (2021) AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell 3(7):610–619.
72. Davis MA, Rao B, Cedeno PA, Saha A, Zohrabian VM (2022) Machine learning and improved quality metrics in acute intracranial hemorrhage by noncontrast computed tomography. Curr Probl Diagn Radiol 51(4):556–561.
73. Wismüller A, Stockmaster L. A prospective randomized clinical trial for measuring radiology study reporting time on Artificial Intelligence-based detection of intracranial hemorrhage in emergent care head CT. Medical Imaging 2020: Biomedical Applications in Molecular, Structural, and Functional Imaging: SPIE, 2020; p. 144–150.
74. Venkatesh K, Santomartino SM, Sulam J, Yi PH (2022) Code and data sharing practices in the radiology artificial intelligence literature: a meta-research study. Radiol Artif Intell 4(5):e220081.
75. Wiggins WF, Magudia K, Schmidt TMS, O’Connor SD, Carr CD, Kohli MD, Andriole KP (2021) Imaging AI in practice: a demonstration of future workflow using integration standards. Radiol Artif Intell 3(6):e210152.
76. Shen Y, Shao X, Romillo BI, Dreizin D, Unberath M. FastSAM-3DSlicer: A 3D-Slicer Extension for 3D Volumetric Segment Anything Model with Uncertainty Quantification. arXiv preprint arXiv:2407.12658 2024.
77. West E, Mutasa S, Zhu Z, Ha R (2019) Global trend in artificial intelligence–based publications in radiology from 2000 to 2018. Am J Roentgenol 213(6):1204–1206.
78. Balthazar P, Harri P, Prater A, Heilbrun ME, Mullins ME, Safdar N (2022) Development and implementation of an Integrated Imaging Informatics Track for radiology residents: our 3-year experience. Acad Radiol 29:S58–S64.
79. Yu J, Kansagra AP, Thaker A, Colucci A, Sherry SJ, Subramaniam RM (2014) Building for tomorrow today: opportunities and directions in radiology resident research. Acad Radiol 22(1):50–57.
80. Hu R, Rizwan A, Hu Z, Li T, Chung AD, Kwan BY (2023) An artificial intelligence training workshop for diagnostic radiology residents. Radiol Artif Intell 5(2):e220170.
81. Lobig F, Subramanian D, Blankenburg M, Sharma A, Variyar A, Butler O (2023) To pay or not to pay for artificial intelligence applications in radiology. npj Digital Med 6(1):117. 10.1038/s41746-023-00861-4.
82. Hassan AE (2021) New Technology Add-On Payment (NTAP) for Viz LVO: a win for stroke care. J NeuroIntervent Surg 13(5):406–408.
83. Mollura DJ, Culp MP, Pollack E, Battino G, Scheel JR, Mango VL, Elahi A, Schweitzer A, Dako F (2020) Artificial intelligence in low-and middle-income countries: innovating global health radiology. Radiology 297(3):513–520.
84. Shad R, Cunningham JP, Ashley EA, Langlotz CP, Hiesinger W (2021) Designing clinically translatable artificial intelligence systems for high-dimensional medical imaging. Nat Mach Intell 3(11):929–935.
85. Kolossváry M, Raghu VK, Nagurney JT, Hoffmann U, Lu MT (2023) Deep learning analysis of chest radiographs to triage patients with acute chest pain syndrome. Radiology 306(2):e221926.
86. Mazurowski MA (2019) Artificial intelligence may cause a significant disruption to the radiology workforce. J Am Coll Radiol 16(8):1077–1082.
87. Oxley E, Nash HM, Weighall AR (2024) Consensus building using the Delphi method in educational research: a case study with educational professionals. Int J Res Method Educ:1–15.
88. Moor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, Topol EJ, Rajpurkar P (2023) Foundation models for generalist medical artificial intelligence. Nature 616(7956):259–265.
