Abstract
Background
The integration of artificial intelligence (AI) into clinical practice is reshaping the competency requirements for medical trainees. Yet, validated evaluation instruments aligned with outcome-based education (OBE) frameworks remain scarce.
Methods
We conducted a sequential mixed-methods study to develop and preliminarily evaluate an OBE-based competency assessment matrix for clinical medical trainees in China. The framework was derived from national and international competency standards and refined through a three-round Delphi process with 16 medical education experts. Empirical evaluation involved 276 respondents, including residents, postgraduate students, and clinical educators, who completed the finalized 72-item instrument via a digital assessment platform. Reliability and exploratory structural characteristics were examined using Cronbach’s α, exploratory factor analysis (EFA), and inter-item correlation matrices. Subgroup differences were examined descriptively and visualized with radar plots.
Results
The Delphi panel reached consensus on 72 items across three domains (Importance, Feasibility, and Clarity), with progressive convergence (Kendall’s W ranging from 0.65 in Round 1 to 0.74 in Round 3). The resulting scale showed excellent internal consistency (Cronbach’s α = 0.928) and strong sampling adequacy (KMO = 0.884). Bartlett’s test of sphericity was highly significant (χ² = 421.35, df = 28, p < 0.001), confirming the suitability of the data for structural exploration. EFA of aggregated domain scores yielded a three-component pattern that cumulatively explained 74.5% of the variance. The resulting loading profile suggested meaningful contributions of Importance, Feasibility, and Clarity, offering exploratory support for the proposed domain-level structure. Radar plots revealed systematic but role-dependent differences: faculty emphasized Importance, residents prioritized Feasibility, and postgraduates rated Clarity slightly higher.
Conclusion
This study provides a context-sensitive evaluation matrix with encouraging initial psychometric evidence, tailored to the evolving demands of AI-informed clinical education. The framework offers a promising platform for competency assessment and curriculum development in Chinese teaching hospitals and may serve as a reference model for other AI-integrating medical education systems, while highlighting the need for confirmatory factor analysis in independent samples to more definitively establish its dimensional structure.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12909-026-08779-7.
Keywords: Competency-based medical education, Delphi method, Psychometric validation, Artificial intelligence
Introduction
The accelerating complexity of healthcare systems, compounded by rapid advances in digital technologies, has fundamentally reshaped the competency expectations for clinical medical trainees [1]. In particular, the integration of artificial intelligence (AI) into diagnostic reasoning, decision support, and health systems operations has redefined the contours of medical preparedness [2]. Traditional pedagogical models—often centered on factual recall and sequential skill acquisition—are increasingly insufficient in equipping future physicians for environments that demand adaptability, digital literacy, and systems-oriented thinking. In this evolving landscape, competency evaluation must extend beyond conventional clinical proficiencies to encompass emerging skills required for AI-informed care delivery [3].
In China, the nationwide implementation of standardized residency training and expansion of postgraduate medical education have heightened the need for structured and evidence-based evaluation tools [4]. Although competency-based medical education (CBME) has been promoted in policy and institutional reform, assessment practices remain heterogeneous, frequently anchored in subjective judgment, and insufficiently aligned with measurable learning outcomes [5, 6]. National frameworks, including the Ministry of Education’s “Five-in-One” competency system, provide overarching guidance but often lack operational clarity, leading to inconsistent adoption across institutions. Moreover, existing tools seldom capture competencies relevant to AI-era clinical practice—such as AI-assisted diagnostics, human–machine collaboration, or data-driven patient engagement—leaving a critical gap between educational aspirations and assessment practice [7].
Outcome-Based Education (OBE) has emerged as a promising paradigm to address these challenges. By prioritizing demonstrable learning outcomes over instructional processes, OBE facilitates backward curriculum design, transparent evaluation standards, and learner-centered progression based on competence attainment [8]. These features are particularly pertinent to postgraduate clinical training, where assessments must reliably capture complex behaviors in dynamic care environments [9]. Yet despite its theoretical appeal, the practical deployment of OBE-aligned assessment instruments within Chinese medical institutions remains underdeveloped, particularly in relation to the competencies necessitated by AI integration.
Although international frameworks such as ACGME and CanMEDS have advanced the global discourse on medical competencies, they were primarily developed in Western healthcare contexts and do not fully reflect the sociotechnical and cultural realities of Chinese medical education [10]. Furthermore, few existing studies have combined theory-driven model construction with rigorous psychometric validation, resulting in tools that are either conceptually sound but contextually misaligned, or empirically tested but lacking theoretical coherence. This underscores the urgent need for a validated, context-sensitive framework that integrates OBE principles, aligns with AI-driven clinical practice, and addresses the specific educational needs of China’s rapidly evolving training landscape.
Against this backdrop, the present study sought to develop and validate a competency-based evaluation framework tailored to clinical medical trainees in China. The framework was constructed through a sequential mixed-methods design: initial model development based on OBE principles and curricular standards, expert consensus refinement via multi-round Delphi consultation, and empirical validation involving residents, postgraduate students, and clinical educators [11]. By uniting expert consensus with psychometric rigor, this study aims to deliver an adaptable, reliable, and contextually grounded tool for competency assessment in AI-integrated clinical education [12]. The findings may inform curriculum developers, institutional leaders, and policymakers seeking to modernize assessment strategies and foster the development of AI-literate, patient-centered physicians [13].
Methods
Methodological framework and study design
This study adopted a rigorously structured sequential mixed-methods design to construct and validate a competency-based evaluation framework for clinical medical trainees within the context of artificial intelligence (AI)-integrated healthcare. The research was conducted as a single-center investigation at Nanfang Hospital, Southern Medical University (China) between July 2024 and June 2025.
The study proceeded in three distinct but interrelated phases. The first phase involved the formulation of a theoretical framework grounded in outcome-based education (OBE) principles, drawing on existing curricular standards and competency models. The second phase employed a Delphi consensus process, in which domain experts iteratively reviewed and refined the framework across multiple rounds to enhance content validity and contextual appropriateness. The third phase comprised a cross-sectional validation study, in which empirical data were collected from medical educators, residency training supervisors, postgraduate instructors, and frontline trainees to assess the psychometric properties of the finalized instrument.
The rationale for employing this sequential mixed-methods design was twofold. The Delphi procedure provided inductive refinement of the framework through expert judgment, ensuring that latent pedagogical constructs were adequately captured. Subsequently, the quantitative validation phase offered statistical substantiation of the framework’s structural validity and reliability, ensuring its applicability in educational practice.
Methodological triangulation was maintained throughout the study to enhance construct fidelity and mitigate potential bias. By integrating expert-derived consensus with empirical testing among key educational stakeholders, the resulting framework was designed to be both theoretically coherent and practically deployable within competency-based medical education (CBME) systems in China.
Phase I – Initial development of the OBE-based competency matrix
The first phase of this study focused on the conceptual development of an evaluation matrix rooted in the principles of outcome-based education (OBE). OBE emphasizes measurable learning outcomes as the foundation of instructional design, an approach particularly suited to delineating the competencies required of clinical medical trainees in the era of digital and AI-enhanced healthcare.
To establish a robust content base, the research team systematically reviewed authoritative sources including national residency training standards, international competency models such as ACGME and CanMEDS, institutional curricula, and peer-reviewed literature on AI readiness in medical education [14]. Through a structured process of thematic abstraction, recurrent professional capabilities were distilled and adapted to the realities of Chinese clinical training. The initial framework development was led by the Director of the Teaching Development Center of Southern Medical University and undertaken by a multidisciplinary team comprising medical education specialists, experienced clinician-educators from Nanfang Hospital, academic affairs officers, and members of an institutional AI/digital medicine innovation group. Team members collectively covered major clinical disciplines (including internal medicine and surgery) and possessed substantial experience in residency training, postgraduate supervision, and curriculum design, providing a broad-based perspective on competency requirements.
This review informed the creation of a three-tier indicator system, balancing comprehensiveness with operational precision. At the highest level, eight core competency domains were identified to represent essential dimensions of clinical and professional development. These were subdivided into twenty-four second-level categories, which in turn yielded seventy-two third-level indicators. Each indicator was articulated as a behaviorally anchored item, designed to capture observable attributes such as clinical reasoning, ethical conduct, digital adaptability, and teamwork [15].
Drafting of the items emphasized semantic clarity, cultural appropriateness, and evaluative neutrality. Internal reviews by the study team, all of whom had prior expertise in medical education assessment, ensured consistency of terminology and domain fidelity. To mitigate potential biases arising from the composition of the core development team, the indicator pool was grounded in national training standards, international competency frameworks, and a systematic review of relevant literature, rather than relying solely on local opinions. Moreover, the draft matrix was subsequently submitted to an independent Delphi expert panel—whose members did not participate in the initial framework construction—for external review, refinement, and consensus building, thereby providing an additional layer of methodological separation between initial development and formal validation. To further enhance content and face validity, the draft framework was circulated among a small panel of senior clinician-educators, who provided feedback on relevance, clarity, and contextual alignment with frontline clinical practice [16].
In particular, for the digital and AI-related competency domain, item development prioritized operationalization rather than abstract familiarity with technology. The indicator pool was organized around three interrelated subdomains: (1) interaction with AI-enabled clinical decision-support tools, (2) critical appraisal of AI-generated outputs, and (3) risk awareness and governance. Interaction-focused descriptors required trainees to demonstrate the ability to formulate clinically meaningful prompts or queries, specify relevant patient parameters, and interpret AI outputs within real-world clinical workflows. Appraisal-related descriptors emphasized the capacity to cross-check AI recommendations against clinical guidelines, laboratory or imaging results, and patient trajectories, as well as to identify inconsistencies or potentially hallucinated content. Risk-governance descriptors addressed awareness of algorithmic bias, limitations of training datasets, over-reliance on automated suggestions, privacy considerations, and the need to maintain human accountability for all diagnostic and therapeutic decisions. All AI-related items were drafted in a tool-agnostic manner to ensure applicability across different AI platforms, and while no model-specific technical validation was conducted as part of this educational study, item content was aligned with contemporary guidance on safe AI use in healthcare [2, 17, 18].
By the conclusion of this phase, the research team had produced a fully structured draft matrix comprising 8 domains, 24 subdimensions, and 72 indicators. This draft, summarized hierarchically in Table 1 and methodologically contextualized in Fig. 1, served as the conceptual foundation for the Delphi consensus process in Phase II.
Table 1.
Competency-based evaluation matrix for clinical medical trainees in the era of artificial intelligence
| Primary Dimension | Secondary Competency Item | Representative Behavior | Suggested Assessment Method | Contextual Note |
|---|---|---|---|---|
| History Taking and Documentation | General History Taking | Collect general patient history accurately | Case-based discussion; AI-assisted history module | AI enhances history-taking efficiency and completeness [19] |
| History Taking and Documentation | System-Specific Inquiry | Tailor history collection to various organ systems | Case writing and AI analysis of key terms | AI supports system-targeted data capture |
| History Taking and Documentation | Communication and Interview Skills | Use appropriate inquiry and empathy techniques | Simulated patient interviews; AI communication role-play | AI-based role-play tools help improve humanistic care [20] |
| History Taking and Documentation | Medical Ethics in Interviewing | Maintain professionalism and ethical conduct | Case scenario; AI-supported ethics appraisal | AI can support ethics training and detection [21] |
| History Taking and Documentation | Medical Documentation | Complete and standardized medical records | AI-based format checker; teacher feedback | AI assists in format verification and missing content alerts [17] |
| Clinical Examination and Procedural Skills | Standardized Physical Exams | Conduct accurate and complete physical exams | OSCE; AI virtual anatomy feedback | AI-assisted simulation improves motor-skill feedback [22] |
| Clinical Examination and Procedural Skills | Special Exam Techniques | Perform special techniques in system-specific cases | Instructor OSCE; AI scenario augmentation | AI guides learners through less-practiced skills |
| Clinical Examination and Procedural Skills | Sterility and Safety Awareness | Follow infection control and patient safety principles | Practical exam; AI-based procedural checklists | AI offers reminders for sterile technique compliance [23] |
| Clinical Reasoning and Decision Making | Differential Diagnosis Skills | Generate structured differential diagnoses | Case review with AI-based mind mapping | AI enhances pattern recognition and reasoning |
| Clinical Reasoning and Decision Making | Diagnostic Thinking | Link symptoms, signs, and lab results to decision-making | Problem-based learning + AI-supported analysis | AI supports diagnostic correlation and decision trees [24] |
| Clinical Reasoning and Decision Making | Emergency and Priority Judgment | Identify clinical emergencies and prioritize action | Simulated emergencies; AI alerts | AI offers triage insight in simulated environments [25] |
| Humanistic Care and Communication | Empathy and Patient Understanding | Demonstrate empathy and understanding | Reflective writing; peer review | AI reflection analysis identifies empathy markers [26] |
| Humanistic Care and Communication | Team Communication | Collaborate effectively in medical teams | Peer evaluations; AI communication simulation | AI tools assist in evaluating interprofessional skills |
| AI Literacy and Application | AI Tool Proficiency | Understand and use AI tools in clinical tasks | Interactive modules; tool-based testing | Courses provide direct training in clinical AI tools [27] |
| AI Literacy and Application | AI Critical Thinking | Evaluate AI results and avoid over-reliance | Scenario judgment tasks with error traps | AI designed biases prompt critical judgment [28] |
| AI Literacy and Application | Data Privacy and Ethics | Maintain confidentiality in AI-assisted systems | Case discussion + knowledge quiz | AI case simulations address privacy violations [29] |
| Professionalism and Lifelong Learning | Self-reflection and Learning | Engage in reflective learning and personal growth | Portfolio; AI pattern feedback | AI detects learning gaps from reflection logs [30] |
| Professionalism and Lifelong Learning | Time Management and Diligence | Manage responsibilities and patient care tasks efficiently | Supervisor ratings + AI task logs | AI can track workflow and productivity markers [18] |
| Professionalism and Lifelong Learning | Integrity and Accountability | Demonstrate integrity and accept responsibility | Multisource feedback; AI ethics flags | AI assists in identifying unprofessional behavior patterns |
Fig. 1.
Study flow diagram. A total of 21 experts were initially invited, of whom 16 consented and completed both the first and second rounds of Delphi consultation. In the subsequent empirical validation phase, 298 questionnaires were distributed to residents, postgraduate students, and clinical educators. Of these, 288 were returned, and 12 were excluded due to incomplete or invalid responses. The final analysis included 276 valid questionnaires, which were combined with the Delphi data (n = 16) for psychometric testing of the competency-based evaluation framework
Expert panel and Delphi consensus process
To refine and validate the preliminary competency matrix, a Delphi consensus methodology was employed, allowing structured, iterative engagement of domain experts to evaluate the relevance, feasibility, and clarity of each item. This approach was selected for its rigor in educational instrument development and its ability to balance disciplinary expertise with practical applicability in clinical training settings.
A purposive sampling strategy was used to recruit sixteen experts from Nanfang Hospital, Southern Medical University, China, comprising senior clinical educators, postgraduate instructors, and residency training supervisors. Eligibility required a minimum of five years of experience in clinical education and demonstrated familiarity with competency-based instructional models. Written informed consent was obtained, and all responses were anonymized across rounds to mitigate conformity bias.
All experts were affiliated with Nanfang Hospital, a leading tertiary teaching hospital in South China with recognized influence in both clinical medicine and medical education. According to the WHO discipline classification, their specialties covered general surgery, oncology, hepatobiliary surgery, and medical imaging–artificial intelligence (AI). Notably, six experts possessed dual expertise in clinical medicine and AI-related research. The professional seniority of the panel was substantial: 10 were full professors or chief physicians, 4 were associate professors or associate chief physicians, and 2 were attending physicians, thereby meeting the hierarchical requirements for Delphi panel qualification. Beyond their institutional roles, all participants held leadership positions in provincial or higher-level academic societies, and 33.3% served on the editorial boards of core journals. On average, panelists reported 14.6 years of clinical practice experience (range: 8–25 years) and, for interdisciplinary researchers, 5.2 years of cross-disciplinary collaboration experience. Collectively, the panel provided a reasonably diverse perspective within the context of a tertiary surgical teaching hospital, encompassing four surgical subspecialties (general, oncology, hepatobiliary, and colorectal), two technological domains (AI-driven imaging diagnostics and integrative medicine), and three functional dimensions (clinical care, medical education, and technological innovation). However, several major disciplines, such as internal medicine subspecialties, pediatrics, psychiatry, and primary care, were not represented, so the resulting content validity primarily reflects views from surgery- and imaging-oriented specialties. The framework should therefore be regarded as an initial, discipline-informed consensus that will require further testing and refinement in other clinical domains and at different stages of training.
Three rounds of Delphi consultation were conducted between August and December 2024. In each round, participants independently rated the 72-item matrix across the three evaluative dimensions—Importance, Feasibility, and Clarity—using a 5-point Likert scale. For each round, mean scores, standard deviations (SD), coefficients of variation (CV), and Kendall’s W were calculated to quantify central tendency and inter-rater agreement. Feedback, including aggregated statistics and anonymized expert comments, was provided between rounds to facilitate reflective re-evaluation.
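As an illustration of the concordance statistic named above, the following minimal sketch computes Kendall's W with the usual correction for tied ranks, which matters for 5-point Likert data where ties are frequent. The function names and the rater-by-item list layout are assumptions made for this example; the study's own values were computed in SPSS.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W with tie correction; ratings[i][j] is rater i's score for item j.

    Raises ZeroDivisionError if every rater gives all items the same score,
    because W is undefined in that case.
    """
    m, n = len(ratings), len(ratings[0])
    rank_rows = [average_ranks(row) for row in ratings]
    totals = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie term: sum of (t^3 - t) over each group of tied scores, per rater.
    ties = sum(sum(t ** 3 - t for t in Counter(row).values()) for row in ratings)
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)
```

W ranges from 0 (no agreement in how raters order the items) to 1 (identical orderings), so rising values across rounds indicate converging judgments.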
Consensus thresholds were predefined: items with mean ratings ≥ 4.0 and CV < 0.25 were retained, while those with persistently low scores or high dispersion were revised or removed. As reported in Table 2, Round 1 ratings for the importance of AI integration were relatively high (mean 4.42, SD 0.66, CV 0.15) but concordance was limited (Kendall’s W = 0.65). By Round 2, expert agreement had strengthened, with higher ratings (mean 4.85, SD 0.38), reduced variability (CV 0.08), and improved concordance (Kendall’s W = 0.68). The third round expanded to a broader validation survey, with 298 questionnaires distributed and 276 valid responses analyzed, thereby confirming stability and reproducibility of the expert-derived framework.
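The retention rule can be expressed as a short check. This is a minimal sketch: the function name and return layout are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which SD formula was applied.

```python
from statistics import mean, stdev


def item_consensus(ratings, min_mean=4.0, max_cv=0.25):
    """Summarize one item's expert ratings and apply the predefined thresholds."""
    m = mean(ratings)
    sd = stdev(ratings)  # assumption: sample SD (n - 1 denominator)
    cv = sd / m          # coefficient of variation, SD / mean
    return {"mean": m, "sd": sd, "cv": cv, "retain": m >= min_mean and cv < max_cv}
```

For example, `item_consensus([5, 5, 4, 5, 4, 5])` flags the item for retention, whereas an item with a high mean but one strongly dissenting rating, such as `[5, 5, 5, 5, 1, 5]`, fails the CV criterion and would be revisited.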
Table 2.
Summary statistics of expert ratings across Delphi rounds
| Evaluation Item | Round | Mean | SD | CV | Kendall’s W |
|---|---|---|---|---|---|
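| Importance of AI integration in clinical training | R1 | 4.42 | 0.66 | 0.15 | 0.65 |
| Importance of AI integration in clinical training | R2 | 4.85 | 0.38 | 0.08 | 0.68 |
| Importance of AI integration in clinical training | R3 | 4.92 | 0.29 | 0.06 | 0.74 |
| Feasibility of behavioral evaluation | R1 | 3.94 | 0.80 | 0.20 | 0.66 |
| Feasibility of behavioral evaluation | R2 | 4.54 | 0.52 | 0.11 | 0.69 |
| Feasibility of behavioral evaluation | R3 | 4.71 | 0.44 | 0.09 | 0.73 |
| Clarity of competency indicators | R1 | 3.88 | 0.81 | 0.21 | 0.67 |
| Clarity of competency indicators | R2 | 4.47 | 0.49 | 0.11 | 0.70 |
| Clarity of competency indicators | R3 | 4.65 | 0.42 | 0.09 | 0.72 |
| Relevance to clinical tasks | R1 | 4.20 | 0.74 | 0.18 | 0.66 |
| Relevance to clinical tasks | R2 | 4.68 | 0.41 | 0.09 | 0.71 |
| Relevance to clinical tasks | R3 | 4.83 | 0.35 | 0.07 | 0.75 |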
Mean = arithmetic average of ratings (5-point Likert scale)
SD = standard deviation
CV = coefficient of variation (SD/Mean); smaller values indicate higher consensus
Kendall’s W = coefficient of concordance among experts/participants
R1 = Round 1 (n = 16 experts); R2 = Round 2 (n = 16 experts); R3 = Round 3 (n = 276 valid questionnaires)
The convergence of expert judgments across rounds is illustrated in Fig. 2, which depicts progressive improvements in mean ratings, narrowing variability, and enhanced concordance. By the end of this phase, a consensus-based and linguistically refined matrix had been finalized, establishing a solid foundation for subsequent psychometric validation in authentic clinical education contexts.
Fig. 2.
Evolution of expert ratings across Delphi rounds. A Mean ratings of the three evaluation dimensions (Importance, Feasibility, Clarity) across Rounds 1–3. B Standard deviations reflect reduced variability in expert ratings across rounds. C Coefficients of variation (CV) show increasing consensus. D Kendall’s W increases across rounds, indicating improved inter-rater agreement
Empirical validation via field testing
Following the consensus-based refinement of the competency matrix, empirical validation was conducted to evaluate its psychometric performance in an authentic clinical education setting. This phase aimed to assess the framework’s reliability, exploratory structural characteristics, and inter-group comparability among different categories of medical education stakeholders.
Participants were recruited from Nanfang Hospital, Southern Medical University, China, and stratified into three cohorts: clinical residents in standardized residency training programs, postgraduate medical students enrolled in master’s or doctoral tracks, and faculty physicians actively engaged in bedside teaching and formative assessment. Eligibility required direct involvement in clinical training activities within the preceding 12 months. Recruitment was coordinated through institutional announcements, and written informed consent was obtained from all participants.
The 72-item competency evaluation questionnaire used in this study was developed specifically for this research, based on outcome-based education (OBE) principles and refined through a three-round Delphi expert consensus process. It has not been previously published elsewhere. An English-language version of the finalized questionnaire has been provided as Supplementary File 1 to facilitate transparency and replication. To assist international readers, Supplementary File 1 is organized into two sections: Section A contains the original Chinese item set, and Section B provides a full English translation. All AI- and data-related competency descriptors are explicitly labeled in Section B to ensure clarity and accessibility for non-Chinese-speaking audiences.
The finalized 72-item matrix, comprising the three evaluative dimensions—Importance, Feasibility, and Clarity—was administered through the institution’s digital testing platform. Respondents rated each item on a 5-point Likert scale, consistent with the Delphi process. Quality control measures were embedded to ensure completeness and prevent duplicate submissions. A total of 298 questionnaires were distributed, with 276 valid responses retained after screening for missing or inconsistent entries.
Responses were subjected to a comprehensive psychometric evaluation. Descriptive statistics were calculated to summarize central tendencies and variability across items. Internal consistency reliability was examined using Cronbach’s α. Sampling adequacy and factorability were assessed using the Kaiser–Meyer–Olkin (KMO) measure and Bartlett’s test of sphericity. To explore the latent relationship among the three evaluative domains rather than to conduct item-level factor modeling, we first computed respondent-level aggregated scores for Importance, Feasibility, and Clarity by averaging their respective item ratings. Exploratory factor analysis (EFA) with principal component extraction was then performed on these three aggregated domain scores to provide preliminary insight into their underlying structural relationships. The factor solution was interpreted as an exploratory assessment of domain-level structure and evaluated using scree plots and cumulative explained variance. Given the multidimensional nature of the Clarity construct, factor loadings were examined to identify differential contributions of linguistic versus operational aspects embodied within this domain.
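The aggregation step described above, averaging item ratings into one score per dimension for each respondent, might look like the following sketch. The dictionary layout of a response is an assumption made for illustration; the export format of the institutional platform is not described in the text.

```python
from statistics import mean

DIMENSIONS = ("Importance", "Feasibility", "Clarity")


def domain_scores(response):
    """Average one respondent's item ratings within each evaluative dimension.

    `response` maps a dimension name to that respondent's list of item ratings
    (hypothetical layout).
    """
    return {dim: mean(response[dim]) for dim in DIMENSIONS}


def score_columns(responses):
    """Collect the aggregated scores into one column per dimension."""
    rows = [domain_scores(r) for r in responses]
    return {dim: [row[dim] for row in rows] for dim in DIMENSIONS}
```

The three resulting columns, one value per respondent, are the variables that would then enter the domain-level factor analysis.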
Inter-item correlations were analyzed to assess relational coherence across the three dimensions, and results were visualized using heatmaps. In addition, group-based comparisons among residents, postgraduates, and faculty were conducted descriptively to explore potential role-related differences. Comparative visualizations, including radar plots, bar plots, and boxplots, were generated to illustrate scoring trends across groups.
Collectively, this field-testing phase provided a rigorous empirical basis for subsequent evaluation of the framework’s reliability, preliminary structural patterns, and applicability within competency-based medical education.
Statistical analyses
All statistical analyses were conducted to evaluate the psychometric robustness of the competency evaluation framework, with specific focus on internal consistency, exploratory domain-level structural patterns, and intergroup variability. Prior to analysis, datasets were screened for completeness, normality, and outlier influence, and only valid responses (n = 276) were retained for inclusion.
For descriptive evaluation, means, standard deviations (SD), coefficients of variation (CV), and Kendall’s W were calculated across the three Delphi rounds to quantify central tendency, dispersion, and inter-rater agreement (Table 2, Fig. 2).
Internal consistency reliability was assessed using Cronbach’s α, computed for the overall scale and for each of the three dimensions (Importance, Feasibility, and Clarity). Values greater than 0.80 were interpreted as indicative of strong internal coherence, consistent with established psychometric standards.
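For readers who want to reproduce the reliability coefficient outside SPSS, a minimal sketch of the standard Cronbach's α formula is given here. The respondent-by-item list layout and the function name are assumptions for this example.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows[i][j] is respondent i's rating of item j.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Applied to the full response matrix or to the items of a single dimension, the returned value can be compared against the 0.80 benchmark stated above.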
To explore structural relationships among the evaluative dimensions rather than to perform item-level factor modeling, construct validity was examined using exploratory factor analysis (EFA) applied to the aggregated domain scores. Specifically, for each respondent, dimension-level mean scores were computed for Importance, Feasibility, and Clarity, and these three aggregated variables were entered into the EFA with principal component extraction. Sampling adequacy was evaluated using the Kaiser–Meyer–Olkin (KMO) index, and factorability was further assessed using Bartlett’s test of sphericity. Factors with eigenvalues greater than 1.0 were retained and interpreted as reflecting latent structural tendencies among the three evaluative domains. Scree plots were generated to visualize component patterns, and cumulative explained variance was used to assess the overall coherence of the domain-level structure. Domain-level factor loadings, rather than item-level loadings, were examined and are reported in Table 4.
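For reference, the standard textbook definitions of the statistics named in this paragraph are given below. They are general formulas, not a reproduction of the SPSS output, and the symbols are notation introduced here: n is the number of respondents, p the number of variables entered, R their correlation matrix, r_ij its off-diagonal entries, a_ij the corresponding partial correlations, and λ_k the eigenvalues of R.

```latex
% Bartlett's test of sphericity
\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R\rvert,
\qquad \mathrm{df} = \frac{p(p-1)}{2}

% Kaiser-Meyer-Olkin measure of sampling adequacy
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}
                    {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}}

% Eigenvalue retention rule and cumulative explained variance for m retained components
\lambda_k > 1,
\qquad \text{cumulative variance} = \frac{1}{p}\sum_{k=1}^{m} \lambda_k
```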
Table 4.
Factor loadings of competency evaluation items based on exploratory factor analysis
| Evaluation Item | Factor 1 | Factor 2 | Factor 3 |
|---|---|---|---|
| Importance | 0.732 | 0.218 | 0.161 |
| Feasibility | 0.681 | 0.183 | 0.093 |
| Clarity | 0.295 | 0.758 | 0.689 |
Factor loadings were derived from exploratory factor analysis using maximum likelihood extraction with varimax rotation, based on expert-rated responses in the third-round questionnaire (n = 276). Loadings greater than |0.3| were considered meaningful. The results indicate that “Importance” and “Feasibility” cluster primarily within the first component, suggesting a shared latent construct, whereas “Clarity” loads distinctly onto Factors 2 and 3, supporting a multidimensional framework of competence evaluation
To assess relational coherence, inter-domain correlations were analyzed using Pearson’s r. Results were visualized as correlation heatmaps (Fig. 4) to illustrate both convergence among domains and residual independence across dimensions.
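A minimal sketch of the correlation step is shown here, assuming the aggregated domain scores are held as equal-length lists keyed by dimension name (a layout chosen for this example). The resulting nested dictionary holds the values a heatmap would display.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of named score columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}
```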
Fig. 4.
Correlation matrices of evaluation dimensions across Delphi rounds. A Correlation heatmap from the second Delphi round (n = 16 experts), depicting associations among the three evaluation dimensions: Importance, Feasibility, and Clarity. B Correlation heatmap from the third-round validation survey (n = 276 respondents), confirming the stability of these relationships in a large-scale empirical sample. C Difference matrix (Round 3 − Round 2), demonstrating that all pairwise correlations strengthened across rounds, with the most pronounced increase observed for the Importance–Clarity association. This global strengthening of relationships among the three dimensions provides clear evidence of consensus convergence over time and illustrates the growing internal coherence of the evaluation framework, without implying definitive confirmation of its latent structure.
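The three panels of Fig. 4 correspond to two Pearson correlation matrices and their element-wise difference. A minimal sketch of that computation is given below, assuming each round is available as a (raters × domains) array of domain-level mean scores. The arrays here are randomly generated placeholders and the helper name `domain_correlations` is hypothetical, so the printed values will not match the reported correlations.

```python
import numpy as np

DOMAINS = ["Importance", "Feasibility", "Clarity"]

def domain_correlations(domain_scores):
    """Pearson correlation matrix for a (raters x domains) array."""
    return np.corrcoef(domain_scores, rowvar=False)

# Placeholder data (not the study data): domain means per rater.
rng = np.random.default_rng(1)
round2_scores = rng.normal(4.8, 0.3, size=(16, 3))    # 16 Delphi experts
round3_scores = rng.normal(4.9, 0.3, size=(276, 3))   # 276 survey respondents

r_round2 = domain_correlations(round2_scores)
r_round3 = domain_correlations(round3_scores)
difference = r_round3 - r_round2                       # Round 3 minus Round 2

# Report the three unique domain pairs (upper triangle of each matrix).
for i, j in zip(*np.triu_indices(len(DOMAINS), k=1)):
    print(f"{DOMAINS[i]}-{DOMAINS[j]}: "
          f"R2={r_round2[i, j]:+.2f}, R3={r_round3[i, j]:+.2f}, "
          f"diff={difference[i, j]:+.2f}")
```

Each of the three matrices could then be drawn with `seaborn.heatmap(matrix, annot=True)`. With only 16 raters in Round 2, the individual correlations carry wide uncertainty, so the difference matrix is best read descriptively.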
Finally, group-level comparisons were conducted descriptively across residents, postgraduates, and faculty. Comparative visualizations included radar plots, bar plots, and boxplots (Fig. 5) to highlight systematic differences in domain-level ratings.
Fig. 5.
Comparative performance across evaluation dimensions. A Radar chart depicting mean scores of residents, faculty, and postgraduates across three core evaluation dimensions: Importance, Feasibility, and Clarity. Distinct line styles and markers highlight group-level differences. B Bar plots displaying mean ± standard deviation for each group across the three dimensions, illustrating comparative central tendencies and variability. C Boxplots demonstrating score distributions across groups, underscoring inter-group variability and within-group consistency. Collectively, these panels confirm convergent yet distinct perspectives across respondent groups, reinforcing the robustness of the competency-based evaluation framework.
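The descriptive group comparison behind Fig. 5 reduces to a mean and standard deviation per role and domain. The sketch below shows one way to tabulate these values. The role labels and scores are randomly generated placeholders, and the dictionary layout of `summary` is an illustrative choice rather than the study's actual data structure.

```python
import numpy as np

DOMAINS = ["Importance", "Feasibility", "Clarity"]
ROLES = ["Residents", "Postgraduates", "Faculty"]

# Placeholder data (not the study data): one role label and three
# domain-level mean scores per respondent.
rng = np.random.default_rng(3)
roles = rng.choice(ROLES, size=276)
scores = np.clip(rng.normal(4.9, 0.3, size=(276, 3)), 1.0, 5.0)

summary = {}
for role in ROLES:
    group = scores[roles == role]
    summary[role] = {
        "n": len(group),
        "mean": group.mean(axis=0),
        "sd": group.std(axis=0, ddof=1),   # sample standard deviation
    }

for role, stats in summary.items():
    cells = ", ".join(
        f"{d}: {m:.2f} ± {s:.2f}"
        for d, m, s in zip(DOMAINS, stats["mean"], stats["sd"])
    )
    print(f"{role} (n={stats['n']}): {cells}")
```

The per-role mean vectors are the values a radar chart would plot on its three axes, and the mean ± SD pairs correspond to the bar plots.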
Descriptive and psychometric analyses were performed in IBM SPSS Statistics (Version 26.0; IBM Corp., Armonk, NY, USA). Figures were generated in Python using the Matplotlib and Seaborn libraries, with formatting customized for journal publication.
Ethical considerations
This study focused on educational evaluation and psychometric testing of competency indicators and did not involve any patients, biological specimens, or clinical interventions. Because the research used only anonymous questionnaires and expert consultations, it posed no physical, psychological, or social risks to participants.
The study protocol was reviewed by the Ethics Committee of Nanfang Hospital, Southern Medical University (Guangzhou, China), which determined that the requirement for formal ethical approval could be waived, as the research involved only anonymous educational data and posed no potential risk to participants (approval waived; no reference number applicable). All procedures involving human participants were conducted in accordance with the ethical standards of the institutional research committee and the 1964 Declaration of Helsinki and its later amendments.
All expert panelists and survey respondents received written or electronic information sheets describing the study objectives, the voluntary nature of participation, and the confidentiality of their responses. Written or electronic informed consent was obtained from all participants prior to data collection. Anonymity was maintained throughout all Delphi rounds, and feedback was presented only in aggregated and de-identified form. No personally identifiable information was collected, and all data were stored securely on password-protected servers accessible only to the research team.
Data availability
The datasets generated and analysed during the current study are not publicly available because the underlying trainee and faculty evaluation records are classified as confidential institutional educational data. Under the policies of Nanfang Hospital and the approving ethics committee, raw item-level questionnaires from internal educational assessments may not be deposited in open repositories or released outside the institution, as they could potentially be linked back to small subgroups of trainees or educators and thereby affect staff appraisal. Anonymized, aggregated data (e.g., domain-level scores and summary tables) are available from the corresponding author on reasonable request and subject to institutional and ethics approval.
Results
Framework construction and conceptualization of competency dimensions among clinical medical trainees
A competency framework tailored to the realities of contemporary medical training was established through a sequential mixed-method design that integrated outcome-based education (OBE) principles with iterative expert consensus. The conceptualization phase drew upon national guidelines for medical education, established curricular standards, and prior scholarly work on clinical competency development. This synthesis yielded an initial item pool organized into eight first-level domains, twenty-four second-level subdomains, and seventy-two third-level indicators, capturing a comprehensive spectrum of professional abilities.
Three overarching evaluative dimensions—Importance, Feasibility, and Clarity—were applied to ensure that each indicator could be systematically appraised in terms of relevance to clinical training, potential for practical implementation, and linguistic precision. Items were operationalized to reflect concrete, observable behaviors, such as diagnostic reasoning, interdisciplinary collaboration, and responsible application of artificial intelligence in clinical practice.
The refinement process relied on structured Delphi consultations with sixteen senior experts representing residency program directors, postgraduate mentors, and frontline clinical educators from Nanfang Hospital, Southern Medical University, China. Experts independently rated each indicator, and their aggregated judgments informed subsequent revisions. This iterative validation ensured that both the structure and wording of the framework achieved semantic clarity, contextual appropriateness, and alignment with the rapidly evolving demands of AI-augmented healthcare.
The overall sequence of framework development—encompassing item generation, expert evaluation, and data-driven reduction—is depicted in Fig. 1. A hierarchical summary of the finalized evaluation framework is provided in Table 1, which demonstrates the integration of clinical reasoning, communication, ethical awareness, and digital literacy into a coherent and culturally relevant competency model. Together, these elements provide a robust conceptual foundation for the subsequent rounds of expert validation and large-scale empirical testing.
Delphi-based expert evaluation and consensus refinement
To enhance the content validity and internal coherence of the proposed competency framework, a structured Delphi process was conducted with sixteen senior experts, including residency program directors, postgraduate supervisors, and experienced clinical educators from Nanfang Hospital, Southern Medical University. Panelists independently rated each item across the three predefined dimensions (Importance, Feasibility, and Clarity) using a five-point Likert scale.
In Round 1, aggregated ratings demonstrated high perceived relevance with moderate variability. The overall mean score across items was 4.42 (SD 0.66), with a coefficient of variation (CV) of 0.15, indicating moderate dispersion in expert judgments. Concordance, as measured by Kendall’s W, was 0.65, reflecting moderate alignment among experts at this initial stage.
After revisions informed by qualitative feedback, particularly addressing redundant phrasing, insufficient specificity, and contextual misalignment, the framework was resubmitted for Round 2 evaluation. Consensus improved appreciably: the mean score increased to 4.85 (SD 0.38), the CV declined to 0.08, and Kendall’s W rose to 0.68, suggesting moderate to strong agreement across panelists. These findings confirmed that iterative refinement enhanced semantic precision, operational clarity, and educational feasibility of the matrix.
A third round of validation was subsequently conducted through large-scale questionnaire administration, yielding 276 valid responses from residents, postgraduate students, and clinical educators. The results demonstrated further convergence, with an overall mean score of 4.92 (SD 0.29), a CV of 0.06, and Kendall’s W of 0.74, underscoring strengthened consensus in a broader population.
The progressive harmonization of expert and trainee perspectives is illustrated in Fig. 2, which shows rising mean scores and narrowing variability across rounds. Detailed statistics for each dimension and round are provided in Table 2, confirming the trend toward higher consensus and supporting the adequacy of the item set for subsequent psychometric validation.
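The two consensus statistics reported for each round can be illustrated with a short sketch. The coefficient of variation is the standard deviation divided by the mean, and Kendall’s W is computed here from within-rater ranks using the standard tie-corrected formula. The function names and the synthetic `panel` array are assumptions for illustration. Only the means and SDs passed to `coefficient_of_variation` are taken from the text.

```python
import numpy as np

def coefficient_of_variation(mean, sd):
    """CV = SD / mean."""
    return sd / mean

def average_ranks(values):
    """Rank one rater's scores (1 = lowest); tied scores share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def kendalls_w(ratings):
    """Kendall's W with tie correction for a (raters x items) array."""
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.vstack([average_ranks(row) for row in ratings])
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    ties = 0.0
    for row in ratings:
        _, counts = np.unique(row, return_counts=True)
        ties += (counts ** 3 - counts).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n) - m * ties)

# CVs recomputed from the reported round-level means and SDs.
for label, mean, sd in [("Round 1", 4.42, 0.66),
                        ("Round 2", 4.85, 0.38),
                        ("Round 3", 4.92, 0.29)]:
    print(label, "CV =", round(coefficient_of_variation(mean, sd), 2))

# Synthetic panel (not the study data): 16 experts x 72 items, 1-5 Likert.
rng = np.random.default_rng(2)
item_effect = rng.normal(size=72)
panel = np.clip(np.rint(4.4 + 0.5 * item_effect
                        + 0.4 * rng.normal(size=(16, 72))), 1, 5)
w_panel = kendalls_w(panel)
print("Kendall's W (synthetic panel):", round(w_panel, 2))
```

Five-point Likert ratings across 72 items produce many tied ranks, which is why the tie correction in the denominator matters for data of this kind.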
Internal structure validity and underlying dimensionality
Upon completion of the Delphi process, the finalized evaluation matrix underwent psychometric testing to examine its internal structural characteristics and latent dimensionality. Reliability analysis demonstrated excellent internal consistency, with a Cronbach’s α of 0.928, exceeding the threshold widely accepted as evidence of strong reliability in educational measurement. Sampling adequacy was supported by a Kaiser–Meyer–Olkin (KMO) value of 0.884, which falls in the meritorious range. In addition, Bartlett’s test of sphericity (χ2 = 421.35, df = 28, p < 0.001) confirmed that the correlation matrix was appropriate for further structural exploration (Table 3).
Table 3.
Psychometric properties of the competency-based evaluation scale
| Metric | Value | Interpretation |
|---|---|---|
| Cronbach’s α (overall scale) | 0.928 | Excellent internal consistency (α > 0.9) |
| Kaiser–Meyer–Olkin (KMO) | 0.884 | Meritorious sampling adequacy; suitable for factor analysis |
| Bartlett’s test of sphericity | χ2 = 421.35, df = 28, p < 0.001 | Significant correlation matrix; data suitable for factor analysis |
| Explained variance (1st component) | 43.2% | Indicates a dominant latent construct |
| Cumulative variance (first 3 components) | 74.5% | Supports a multidimensional yet coherent evaluation framework |
Cronbach’s α measures internal consistency across the full scale.
KMO (Kaiser–Meyer–Olkin) tests sampling adequacy for factor analysis; values above 0.8 are considered meritorious.
Bartlett’s test of sphericity assesses whether variables are sufficiently correlated to justify factor analysis.
Explained variance is derived from principal component analysis (PCA); a higher percentage in the first component indicates a stronger dominant construct.
Cumulative variance above 70% suggests that the retained components capture most of the shared variance.
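For readers who want to see how two of the statistics in Table 3 are defined, the sketch below implements Cronbach’s α and the Bartlett sphericity statistic from their textbook formulas. It is an illustration on synthetic data, not a reproduction of the SPSS output. The function names are hypothetical, and the demo uses eight synthetic columns purely as an example (for p = 8 variables, the degrees of freedom p(p − 1)/2 equal 28).

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) array."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - sum_item_variances / total_variance)

def bartlett_sphericity(data):
    """Bartlett's chi-square statistic and degrees of freedom for (n x p) data."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)          # log|R|, non-positive for a correlation matrix
    chi_square = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    df = p * (p - 1) // 2
    return chi_square, df

# Synthetic data (not the study data): 276 respondents, 8 correlated variables
# built from one shared factor plus independent noise.
rng = np.random.default_rng(4)
shared = rng.normal(size=(276, 1))
data = 0.7 * shared + 0.7 * rng.normal(size=(276, 8))

alpha = cronbach_alpha(data)
chi_square, df = bartlett_sphericity(data)
print(f"alpha = {alpha:.3f}")
print(f"Bartlett chi-square = {chi_square:.2f}, df = {df}")
```

A p value would follow from comparing `chi_square` with a chi-square distribution on `df` degrees of freedom. That step is omitted here to keep the sketch dependent on NumPy only.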
Exploratory factor analysis (EFA) was conducted on the aggregated domain-level mean scores for Importance, Feasibility, and Clarity, rather than on the 72 item-level ratings, in order to explore the latent relationships among the three evaluative dimensions. Principal component extraction revealed a consistent inflection point at the third component in both the second Delphi round (n = 16 experts) and the field validation cohort (n = 276 respondents) (Fig. 3). This three-component pattern was interpreted as reflecting one dominant factor capturing overall evaluative salience (Factor 1), accompanied by two secondary components (Factors 2 and 3) that together explained complementary aspects of the Clarity dimension. In line with the loading structure, Factor 2 was tentatively interpreted as reflecting predominantly linguistic clarity (e.g., interpretability and semantic transparency), whereas Factor 3 was interpreted as reflecting predominantly operational clarity (e.g., actionability and procedural specificity). The cumulative explained variance (74.5%) therefore provides exploratory insight into the domain-level structural tendencies underlying the matrix.
Fig. 3.
Principal component analysis of the competency-based evaluation framework. A Scree plot from the second Delphi round (n = 16 experts), with eigenvalues plotted against the ordered principal components (X-axis labelled “Principal Component”), providing an initial view of the emergent dimensional structure based on expert ratings. B Scree plot from the third-round validation survey (n = 276 respondents), plotted in the same way, illustrating the consistency of the three-component pattern in a larger empirical sample. C Comparative factor loadings for the three evaluation dimensions (Importance, Feasibility, and Clarity) between Round 2 and Round 3. The patterns suggest convergence across rounds, with Importance and Feasibility primarily clustering on the first component, while Clarity shows distinct contributions to subsequent components, providing exploratory, domain-level insight into a multidimensional yet coherent evaluation structure.
The domain-level loading pattern (Table 4) further illustrated this architecture. Importance showed a strong loading on Factor 1 (0.732), and Feasibility also loaded primarily on this component (0.681), supporting their shared contribution to overarching evaluative salience. In contrast, Clarity exhibited substantial loadings on two distinct components (0.758 and 0.689), a distribution consistent with its theoretically multidimensional nature. Clarity encompasses both linguistic clarity (e.g., interpretability of wording) and operational clarity (e.g., precision and actionability of expectations), which may account for its cross-loading across Factors 2 and 3. This interpretation aligns with the tentative labeling of Factor 2 as a linguistic-clarity component and Factor 3 as an operational-clarity component, further illustrating the internal complexity of the Clarity domain relative to Importance and Feasibility.
Although confirmatory factor analysis (CFA) was not conducted in this initial developmental study, the convergence of theoretical expectations, scree plot patterns, and domain-level loading structures provides preliminary and exploratory support for the coherence of the three-domain framework. These findings should, however, be interpreted as indicative rather than definitive, pending future item-level validation in independent samples.
Taken together, these analyses suggest that the evaluation matrix exhibits an internally coherent and theoretically meaningful domain-level structure, offering a promising basis for competency-based assessment in contemporary clinical training while underscoring the need for subsequent confirmatory work to refine and verify its dimensional architecture.
Scale reliability and inter-domain correlation patterns
Further examination of the framework’s psychometric properties focused on internal reliability and the structural coherence of associations among the three evaluation dimensions. The finalized scale demonstrated excellent reliability, with a Cronbach’s α of 0.928 (Table 3), confirming that items were sufficiently homogeneous to represent a unified construct while avoiding redundancy.
Correlation analyses provided additional insights into construct coherence. The heatmaps of inter-domain correlations (Fig. 4) displayed a systematic and interpretable pattern across rounds. In the second Delphi round, associations between domains were moderate, reflecting the early stage of consensus formation. By the third-round validation survey (n = 276), all three pairwise correlations among Importance, Feasibility, and Clarity had strengthened, with the largest relative increase observed for the Importance–Clarity association. In Round 3, the strongest correlation was between Importance and Feasibility (r = 0.56), followed by Importance and Clarity (r = 0.43), suggesting that competencies considered critical were increasingly viewed as both achievable and linguistically clear. Although the Feasibility–Clarity correlation (r = 0.39) remained lower than the other two, it also increased relative to Round 2, indicating that perceptions of operational practicality and semantic precision became more aligned over the iterative refinement process.
The difference heatmap (Round 3 – Round 2) confirmed this trajectory, highlighting a global strengthening of relationships among the three evaluative dimensions and providing additional evidence of consensus convergence over time. Collectively, these results affirm that while the three dimensions are conceptually interconnected, they retain sufficient distinctiveness to warrant delineation as discrete but synergistic components of competency assessment.
Divergent perceptions of competency dimensions across educational roles
To examine potential variability in how competencies are valued across distinct educational roles, comparative analyses were conducted among residents, faculty members, and postgraduate trainees. Each group independently rated the finalized framework along the three evaluative dimensions—Importance, Feasibility, and Clarity—allowing for profiling of their relative emphases.
As depicted in Fig. 5, clear but nuanced differences emerged across the groups. Faculty members consistently provided the highest ratings for Importance, reflecting their prioritization of conceptual comprehensiveness and the centrality of professional values in training. Residents, by contrast, demonstrated greater emphasis on Feasibility, likely shaped by their direct engagement with the practical constraints of high-intensity clinical environments. Postgraduates tended to provide more balanced ratings across all domains, though with a modest elevation in Clarity, suggesting an inclination toward interpretability and transparency as they transition from theoretical coursework to supervised clinical application.
While mean scores across groups remained broadly aligned, the radar plot revealed divergence in the contours of scoring distributions. Bar plots of mean ± SD highlighted consistent group-level distinctions, and boxplots confirmed that variability within groups did not obscure these systematic tendencies. Collectively, these analyses suggest that competency valuations are not monolithic but rather contingent on the assessor’s role, stage of training, and exposure to clinical or pedagogical constraints.
This heterogeneity reinforces the interpretive value of the framework: it not only serves as a summative assessment tool but also functions as a reflective mechanism to reconcile institutional pedagogical priorities with the lived experiences of trainees across different strata of the medical education continuum.
Discussion
This study developed and preliminarily evaluated a competency-based evaluation framework for clinical medical trainees in China, explicitly grounded in outcome-based education (OBE) principles and responsive to the evolving demands of AI-informed healthcare. Using a three-round Delphi process and empirical validation among residents, postgraduate students, and clinical educators, the framework addresses a critical gap in China’s medical education infrastructure by providing a context-sensitive and evidence-informed instrument. The progressive convergence across Delphi rounds, together with encouraging psychometric findings, underscores both the conceptual clarity and the practical feasibility of the proposed tool.
The instrument showed excellent internal consistency (Cronbach’s α = 0.928) and good sampling adequacy (KMO = 0.884; Bartlett’s test significant), supporting further structural exploration. Exploratory factor analysis of aggregated domain-level mean scores was therefore used to describe broad relationships among Importance, Feasibility, and Clarity rather than to estimate a full latent construct model. The resulting three-component solution—with one dominant component capturing overall evaluative salience and two secondary components reflecting different facets of Clarity—offered preliminary domain-level structural support broadly consistent with the theoretical framework [31]. These findings are necessarily exploratory, and item-level confirmatory factor analysis (CFA) in independent samples will be needed to establish the final dimensional structure of the instrument [32].
When compared to international models such as the ACGME core competencies and CanMEDS, the present framework distinguishes itself by embedding competencies directly relevant to national reform priorities and technological integration. Unlike prior efforts that primarily translated or adapted foreign frameworks, this instrument was built from first principles through expert consensus and empirical testing [33]. This methodological approach enhances contextual legitimacy, ensuring the tool’s practical utility for guiding curriculum reform and competency assessment in China.
Beyond aligning with high-level policy calls for AI readiness, the present framework uniquely operationalizes AI-related competencies in concrete, practice-facing terms. Rather than describing AI literacy in abstract or aspirational language, the indicators explicitly require trainees to (1) formulate clinically meaningful prompts and structured queries when interacting with AI-enabled decision-support systems, (2) critically appraise AI-generated outputs by triangulating them with patient-specific information, laboratory or imaging findings, and current clinical guidelines, and (3) recognize and mitigate risks related to algorithmic bias, hallucinated content, privacy concerns, and over-reliance on automated recommendations. These descriptors were intentionally drafted in a tool-agnostic manner to ensure applicability across diverse AI platforms and evolving technological landscapes. By articulating these operational expectations, the framework moves beyond conceptual digital competency models and provides actionable guidance for cultivating safe, accountable, and contextually appropriate AI-assisted clinical reasoning among trainees in China.
Subgroup comparisons further revealed systematic differences in emphasis across respondents. Faculty prioritized Importance, reflecting their pedagogical focus; residents rated Feasibility higher, likely due to their sensitivity to clinical implementation challenges; and postgraduate students emphasized Clarity, underscoring their need for transparent and interpretable expectations during the transition from theoretical to practical training. Such divergence highlights the value of the framework not only as an evaluative tool but also as a diagnostic mechanism to align the perspectives of diverse educational stakeholders [34].
Factor analysis indicated that the Clarity domain loaded on two components, suggesting that it encompasses both linguistic interpretability and practical actionability. Rather than undermining the framework, this pattern highlights Clarity as a bridging construct between conceptual understanding and implementation, with the two secondary factors tentatively reflecting linguistic and operational clarity, respectively. Although confirmatory factor analysis was not conducted in this initial study, the convergence of theoretical expectations, scree plot inspection, and domain-level loading patterns provides preliminary support for the three-domain structure, which should nevertheless be regarded as provisional pending item-level validation in independent samples.
Strengths and limitations
This study has several notable strengths. First, it is among the few investigations to systematically develop and preliminarily evaluate a competency evaluation framework for clinical medical trainees in China that explicitly integrates outcome-based education principles within the context of AI-enhanced healthcare. The sequential mixed-methods design, combining expert consensus through Delphi consultation with empirical psychometric assessment, provided both theoretical grounding and initial measurement rigor. Second, the framework demonstrated strong internal consistency and coherent domain-level structural patterns, offering encouraging initial support for its use in competency-based medical education. Because exploratory factor analysis was conducted on aggregated domain scores rather than item-level data, these structural findings should be regarded as preliminary and hypothesis-generating. Third, the inclusion of residents, postgraduate students, and clinical educators ensured input from multiple stakeholder groups, enhancing the tool’s practical relevance and sensitivity to role-specific expectations. Fourth, the AI-related competency domain was articulated in concrete behavioral terms—covering interaction with AI-enabled tools, critical appraisal of AI outputs, and explicit attention to risk, governance, and professional responsibility—thereby emphasizing that AI should augment rather than replace human clinical judgment and helping to mitigate the risk that AI-assisted workflows simply amplify existing human errors.
Nonetheless, several limitations should be acknowledged. The study was conducted at a single teaching hospital, which may constrain the generalizability of findings to other regions or institutional contexts. The reliance on self-reported questionnaire ratings introduces the possibility of response bias, despite measures to preserve anonymity and reduce conformity effects. Furthermore, because the structural analysis operated at the domain level, item-level relationships and the full latent structure of the instrument were not examined in this developmental stage. Future studies incorporating confirmatory factor analysis (CFA) in independent samples will be needed to more definitively establish the dimensional architecture. In particular, the observed cross-loadings of the Clarity domain are consistent with its theoretically multidimensional nature—spanning both linguistic interpretability and operational specificity—but also indicate the need for continued refinement of item wording and dimensional allocation. In addition, although the AI-related competencies were formulated in a tool-agnostic manner to ensure broad applicability, the framework does not evaluate performance with specific AI platforms or model versions, and its content will require periodic updating as clinical AI technologies, regulatory guidance, and local implementation practices evolve. The present study also did not directly observe how trainees or educators apply AI tools in real clinical settings, leaving implementation fidelity and residual risks of uncritical AI reliance to be examined in future work. Finally, the current instrument evaluates perceived competencies but does not directly link scores to objective measures of clinical performance, which remains a critical direction for subsequent research.
Taken together, these strengths and limitations indicate that while the framework offers a reliable and context-sensitive approach to competency assessment in the AI era, its structural findings should be viewed as exploratory, and further multi-center validation together with performance-based outcome linkage will be required to maximize its educational utility and scalability.
Conclusion
This study developed and preliminarily evaluated a competency-based framework for assessing Chinese clinical trainees’ readiness for AI-informed practice, grounded in outcome-based education and local training needs. The instrument showed promising reliability and coherent domain-level structure, providing a context-sensitive and practically oriented tool for competency assessment. Further multi-center, item-level validation and linkage to objective educational and clinical outcomes are needed to confirm and refine the framework. Its use may support curriculum refinement, faculty development, and the broader modernization of medical education in China, while also offering a potential reference for other health systems undergoing rapid digital transformation.
Acknowledgements
Not applicable.
Authors’ contributions
FL analyzed and interpreted the data and drafted the manuscript. HZ contributed to manuscript proofreading. QH collected and curated the data. GC and CZ jointly supervised all aspects of the study and served as corresponding authors. All authors read and approved the final manuscript.
Funding
This work was supported by:
The 2024 National and Provincial Undergraduate Innovation and Training Program of Southern Medical University (Grant No. S202412121199: Development of a Competency Framework for Clinical Medical Trainees in the AI Era Based on Outcome-Based Education);
The 2024 Provincial Education and Teaching Reform Project for High-Level Universities (Grant No. G624NF0320: Exploring the Construction of a High-Precision Competency-Based Evaluation System in Medicine Under the OBE Framework); and
The 2025 Higher Education Teaching Reform Project (Grant No. G625NF0365: Construction and Application of a High-Precision Competency-Based Digital Evaluation System in Medical Education Under the OBE Framework).
The funding bodies had no role in the study design, data collection, analysis, interpretation, or manuscript preparation.
Declarations
Ethics approval and consent to participate
This study involved only educational evaluation and psychometric testing of competency indicators and did not include any patients, biological specimens, or clinical interventions. The study protocol was reviewed by the Ethics Committee of Nanfang Hospital, Southern Medical University (Guangzhou, China), which determined that the requirement for formal ethical approval could be waived because the research involved only anonymous educational data and posed no potential risk to participants (approval waived; no reference number applicable).
All Delphi expert panelists and survey respondents were fully informed about the study objectives, the voluntary nature of participation, and the confidentiality of their responses. Written or electronic informed consent was obtained from all participants prior to data collection.
Consent for publication
Not applicable. This study does not include any individual person’s data in any form (including individual details, images, or videos).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Chunhui Zhang, Email: zch@smu.edu.cn.
Geyu Chen, Email: my_2025search@163.com.