Abstract
Objective
Communication skills assessment (CSA) is essential for ensuring competency, guiding educational practices and safeguarding regulatory compliance in health professions education (HPE). However, there appears to be heterogeneity in the reporting of validity evidence from CSA methods across the health professions, which complicates our interpretation of the quality of assessment methods. Our objective was to map reliability and validity evidence from scores of CSA methods that have been reported in HPE.
Design
Scoping review.
Data sources
MEDLINE, Embase, PsycINFO, CINAHL, ERIC, CAB Abstracts and Scopus databases were searched up to March 2024.
Eligibility criteria
We included studies, available in English, that reported validity evidence (content-related, internal structure, relationship with other variables, response processes and consequences) for CSA methods in HPE. There were no restrictions related to date of publication.
Data extraction and synthesis
Two independent reviewers completed data extraction and assessed study quality using the Medical Education Research Study Quality Instrument. Data were reported using descriptive analysis (mean, median, range).
Results
A total of 146 eligible studies were identified, including 98 394 participants. Most studies were conducted in human medicine (124 studies) and participants were mostly undergraduate students (85 studies). Performance-based, simulated, inperson CSA was most prevalent, comprising 115 studies, of which 68 were objective structured clinical examination-based. Other reported methods were workplace-based assessment; asynchronous, video-based assessment; knowledge-based assessment and performance-based, simulated, virtual assessment. Included studies used a diverse range of communication skills frameworks, rating scales and raters. Internal structure was the most reported source of validity evidence (130 studies, 90%), followed by content-related evidence (108 studies, 74%), relationships with other variables (86 studies, 59%), response processes (15 studies, 10%) and consequences (16 studies, 11%).
Conclusions
This scoping review identified gaps, across assessment methods, in the sources of validity evidence that have been used to support the use of CSA methods. These gaps could be addressed by studies explicitly defining the communication skill construct(s) assessed, clarifying the validity source(s) reported and defining the intended purpose and use of the scores (ie, for learning and feedback, or for decision-making purposes). Our review provides a map of where targeted CSA development and support are needed. Limitations of the evidence arise because score interpretation is constrained by the heterogeneity of the definition of communication skills across the health professions and by the reporting quality of the studies.
Keywords: Education, Medical; EDUCATION & TRAINING (see Medical Education & Training); Psychometrics; Review
STRENGTHS AND LIMITATIONS OF THIS STUDY.
This scoping review maps the reliability and validity evidence from scores of communication skills assessment methods across the health professions.
Score interpretation is constrained by the heterogeneity of the definition of communication skills across the health professions and the reporting quality of the studies.
The use of artificial intelligence (AI) in communication skills training and assessment has increased, but this is not yet reflected in our data, as current studies mainly look at the possibilities for using AI in communication teaching interventions and not yet at the psychometrics of AI-supported assessments.
Introduction
Communication skills (CSs) of health professionals affect clinical outcomes including patient safety, satisfaction and client adherence to health provider recommendations.1–4 Therefore, CS training and, by extension, assessment are cornerstones for many health professions spanning the continuum from undergraduate through continuing professional education. Scores from CS assessment methods in health professions education (HPE) are widely used to provide feedback on learning and/or make decisions regarding student advancement and licensure. Many studies have been carried out investigating validity and reliability evidence of CS assessments (CSAs) to support their use in both formative (lower stakes, typically for learning purposes) and summative (higher stakes, for decision making purposes) settings in HPE. However, there appears to be heterogeneity in the reporting of reliability and validity evidence from CSA methods across the health professions that complicates our interpretation of the quality of assessment methods.5–8
Reporting the reliability and validity of assessment scores—specifically for CSs—should be grounded in an evidence-centred design9 and performance assessment framework10 approach. The four-step approach to evidence-centred design includes:
A clear statement of the concept and associated constructs (in this case CSs).
A list of the type of validity evidence being sought and the purpose it serves.
Specification of the assessment methods used to generate the evidence.
Interpretation of the resulting data.
As Braun et al11 state, ‘we must take validity as the touchstone—assessment, design, development and deployment must all be linked to the operational definition of the … construct.’ (p.6, ‘…’ was our inclusion). Without a definition of the construct—here, CSs—it becomes difficult to interpret results within studies or combine coefficients across studies, both of which are necessary to strengthen the evidence base for feedback and decision-making purposes.
Given that the validity of scores relies on a definition of CSs, one aspect of this review focuses on whether a definition was provided. The intent of this review was not to develop a consensus statement on the definition of CSs. Rather, it was to gather whether ‘operational definitions’ were reported, providing context for the presented reliability and validity evidence and the use of the CSA methods. As an example, communication frameworks like the Calgary-Cambridge communication guide (CCG)6 and the SEGUE (Set the stage, Elicit information, Give information, Understand the patient's perspective, and End the encounter) framework7 are used to define CSs, build CSA scales and items and are modified to accommodate various healthcare settings and cultural differences.8 12–15
Methods used to assess CS range from written assessments16 to multiple performance-based assessment methods including simulation and clinical encounters,8 17 with evaluators ranging from patients to faculty.18–20 Systematic and scoping reviews reporting reliability and validity evidence from CSA performed to date typically report on one specific method, such as objective structured clinical examinations (OSCEs),17 21 22 workplace-based assessment8 and written assessment,16 23 or are limited to one profession, such as medicine,24–26 dentistry27 and nursing.28
Validity and reliability evidence from OSCEs was reviewed by Setyonugroho et al,17 who reported that 47% (16/34) of included studies reported evidence of both reliability and validity, while 18% reported only validity evidence and 35% only reliability evidence. Cömert et al21 reported a systematic review focused on rating scales for CSs. Eight scales were identified, which had various degrees of reliability and validity evidence. Most noteworthy was that three of the scales reported were not accompanied by an explicit definition of CSs. In a systematic review of the reliability of OSCE scores, Brannick et al22 reported a mean alpha coefficient of 0.55 for communication scales (for clinical skills the mean alpha was 0.69), suggesting that while an OSCE format provides a standardised format to assess these skills, the reliability of the scores across OSCE stations was <0.60. This hampers the use of these scores for high-stakes decision making.
Workplace-based assessments that included the assessment of CSs were reviewed by McGill et al.8 This study concluded that the reliability of workplace-based assessment scores alone did not provide sufficient evidence to support high-stakes decision making; these scores should be considered part of an assessment programme rather than relying on the CSA of one supervisor alone.
Knowledge-based, written assessment of health profession CSs was reviewed by Perron et al.23 They concluded that reporting of the development and psychometric properties of these instruments is often incomplete. As a follow-up to this review, Kiessling et al16 examined the correlation between written assessments of CS and performance-based assessment and reported low to medium correlations between written and performance-based assessment outcomes. They concluded that reporting the psychometric properties of assessment strategies is essential to improve score interpretation and to establish their predictive value for assessment performance.
As indicated above, reviews in medicine,24–26 dentistry27 and nursing28 report a wide variety of CSA methods and associated definitions of CSs. These studies consistently state that psychometric evidence is often heterogeneous,17 29 which makes comparison of examinee competencies within and across disciplines difficult. A synthesis of existing studies using a validity framework, such as Messick’s30 or Kane’s,31 could provide a robust foundation for the healthcare professions to design, implement and evaluate evidence-based CSA methods in HPE.
The intent of this scoping review was to collate and synthesise what methods of CSA have been used and to what extent validity evidence for those CSA methods exists. Our aim was to map the reliability and validity evidence from scores of CSA methods that have been reported in HPE.
Methods
We reviewed the literature to answer the following questions regarding CSA methods in the health professions:
What are the reported CSA methods used in HPE?
What validity and reliability evidence is reported for these assessment methods?
Study identification
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement was used as the framework for this review.32 We systematically searched the MEDLINE, PsycINFO, Embase, CAB Abstracts, CINAHL, ERIC and Scopus databases. The search strategy was developed in collaboration with a research librarian (first search conducted November 2021, second search conducted July 2023, last search conducted 27 March 2024). Key search concepts included the topic (communication skills assessment), the population (undergraduate, resident, professional) and the outcomes (reliability, validity, measure). Controlled vocabulary headings (eg, Medical Subject Headings (MeSH)) and keywords were searched for each of these concepts. The search strategy was developed in MEDLINE and translated to the other databases. Complete search strategies for each database are provided in online supplemental appendix A. References of previous reviews8 16 17 23–28 33 were hand searched to identify potentially omitted studies.
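The overall structure of such a strategy (three concept blocks combined with AND, and keywords/subject headings within a block combined with OR) is sketched below using placeholder terms for illustration only; these are not the terms used in this review, and the full, database-specific strategies are in online supplemental appendix A.

```python
# Illustrative only: placeholder terms, not the search strategy used in this review
# (see online supplemental appendix A for the complete database-specific searches).
concepts = {
    "topic": ['"communication skill*"', "exp Communication/"],            # keywords + MeSH-style heading
    "population": ["undergraduate*", "resident*", '"health professional*"'],
    "outcomes": ["reliabilit*", "validit*", "psychometric*", "measure*"],
}

# Terms within a concept are combined with OR; the three concepts with AND.
query = " AND ".join("(" + " OR ".join(terms) + ")" for terms in concepts.values())
print(query)
```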
Study eligibility and inclusion
Inclusion and exclusion criteria are summarised in table 1. We made no exclusions based on date of publication.
Table 1. Inclusion and exclusion criteria.
| Inclusion criteria | Exclusion criteria |
|---|---|
| Report on CSA* | Adjacent construct used but communication skill is not specifically reported in the assessment, for example, interpersonal skills. |
| Report on reliability and/or validity evidence of the CSA | Report on teaching intervention, for example, the application of a communication framework in a curriculum without assessment scores of the learners’ communication skills. |
| Objective assessment method used | Solely self-assessment, patient assessment or assessment by peers used. |
| Report on students or professionals in the field of healthcare education | Solely reports on written communication, that is, communication through written means such as patient notes. |
| Available in English | |

*Communication skills assessment.
CSA, communication skills assessment.
For each manuscript, two members of the review team independently screened the title and abstract for inclusion using Covidence (Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia; www.covidence.org). Full texts were then reviewed for inclusion, again independently and in parallel, by two members of the review team. Conflicts were resolved by initial consensus between LD and KH. If LD and KH could not reach consensus, the rest of the review team was involved.
Data extraction
Data from included studies were extracted independently, in parallel, by two members of the review team. Conflicts were resolved by rereading the original study and through discussion between LD and KH. If LD and KH could not reach consensus on how specific data from a study should be interpreted, the rest of the review team was involved. The variables collected included study participants, assessment method, scoring format (Likert-type scale, adjectival scale, dichotomous, etc), rater type (faculty, standardised patient, peer, etc) for performance-based assessment, rater training, statement of rater bias, whether a definition of CSs or a CSs framework was reported, sources of validity evidence and methodological quality. Reliability and validity evidence were coded into the five sources of validity evidence: content-related, response processes, internal structure, relationships with other variables and consequences,30 34 and we recorded whether the study reported use of a validity framework such as Messick’s30 or Kane’s.31 Under content-related evidence, we included a subcategory to capture whether a previously developed CSs framework (ie, the Calgary Cambridge guide, the Maastricht History-taking and Advice Scoring List (MAAS) framework or the Kalamazoo Consensus Statement) was used as a proxy for the definition of CSs and as a framework for CSA. Under internal structure, we categorised the classical test theory coefficients into: internal consistency (coefficients reported across different CSA items), inter-rater reliability (coefficients reported across different raters) and interstation (case) reliability (coefficients across different OSCE stations or CS cases). We also categorised intrarater and intrastation reliability coefficients when reported. Finally, we categorised those studies that used generalisability theory and item response theory coefficients to support claims of reliability and validity. Under relationships with other variables, we further categorised whether those variables were associated with expertise level, with performance or with any other relationship reported outside of performance or expertise. Additionally, reviewers were asked to assess manuscripts for content, process and perceptual components of CSA.35 36 The content component of CS refers to what is communicated, the specific words and sentences employed, for example, the substance of the questions asked, information given and treatments discussed. The process component pertains to how someone communicates, for example, how an examinee goes about discovering information, how they relate to patients and the use of non-verbal communication. The perceptual component refers to the thought processes that inform communication, beyond the explicit content and process of the message, for example, clinical reasoning and problem-solving skills.
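To make the classical test theory categories above concrete, the sketch below shows how two of the most commonly reported coefficients, Cronbach's alpha (internal consistency across CSA items) and Cohen's kappa (chance-corrected agreement between two raters), are typically computed. It is a minimal illustration with made-up scores, not code or data used in this review.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal consistency across items; rows = examinees, columns = CSA items."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def cohen_kappa(rater_a: np.ndarray, rater_b: np.ndarray) -> float:
    """Chance-corrected agreement between two raters on categorical ratings."""
    categories = np.union1d(rater_a, rater_b)
    p_observed = np.mean(rater_a == rater_b)
    # Expected agreement from each rater's marginal category proportions
    p_expected = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical data: 6 examinees scored on 4 communication items (1-5 scale)
items = np.array([[4, 5, 4, 3],
                  [2, 3, 2, 2],
                  [5, 5, 4, 5],
                  [3, 3, 3, 4],
                  [4, 4, 5, 4],
                  [1, 2, 2, 1]])
print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")

# Hypothetical dichotomous checklist ratings from two raters for 8 examinees
rater_a = np.array([1, 0, 1, 1, 0, 1, 0, 1])
rater_b = np.array([1, 0, 1, 0, 0, 1, 0, 1])
print(f"Cohen's kappa: {cohen_kappa(rater_a, rater_b):.2f}")
```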
The Medical Education Research Study Quality Instrument (MERSQI)37 was used to appraise the general methodological quality of the included studies. The MERSQI is scored on 10 items in the domains of study design, sampling, data type, validity, analysis and outcome. The maximum score per domain is three, and four items might be scored as ‘Not applicable’. The maximum obtainable score is 18, with a possible range of 5–18. The total MERSQI score was determined as the percentage of total achievable points (accounting for ‘not applicable’ items) and then adjusted to a standard denominator of 18 to allow for comparison of scores across studies.37
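A minimal sketch of the standardisation described above, using hypothetical numbers rather than scores from any included study:

```python
def standardised_mersqi(raw_score: float, achievable_max: float, standard_max: float = 18.0) -> float:
    """Express a MERSQI total as the percentage of achievable points
    (after removing 'not applicable' items), rescaled to the standard 18-point denominator."""
    return raw_score / achievable_max * standard_max

# Hypothetical example: 'not applicable' items reduce the achievable maximum
# from 18 to 15, so a raw score of 12 standardises to 12/15 x 18 = 14.4.
print(standardised_mersqi(12, 15))  # 14.4
```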
MERSQI assessment was performed by the authors in the same manner as the data extraction: full texts were scored independently, in parallel, by two members of the review team. Conflicts were resolved by initial consensus between LD and KH. If LD and KH could not reach consensus, the rest of the review team was involved.37
Data synthesis
Data were descriptively summarised using counts and percentages, means and medians where appropriate for the following categories:
Simulated, inperson assessment, that is, performance-based assessment using an encounter with a simulated participant (SP). In this category, a further distinction was made between OSCEs and other simulated interactions with SPs. Therefore, as outlined below, we report on further analyses of these two subcategories.
Workplace-based assessment, that is, performance-based assessment where the examinee was assessed in an in-clinic setting with real patients/clients.
Asynchronous, video-based assessment, that is, assessment of the examinee involved evaluating their response to the verbal expressions conveyed by the patient or client depicted in a video.
Knowledge-based assessment, that is, written assessment in which the examinee’s communication knowledge was evaluated through either multiple choice questions or open-ended questions. These could be focused on assessing theoretical knowledge or on skill spotting, in which the examinee responded to questions after watching a video containing an interaction between a healthcare provider and an SP or patient.
Performance-based, simulated, virtual assessment, that is, performance-based assessment using a virtual simulated encounter with a virtual patient/client.
Studies were assigned to multiple categories if they employed multiple assessment methods.
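As a sketch of how such counts and percentages behave when studies fall into more than one category (hypothetical records, not the review's extraction dataset):

```python
import pandas as pd

# Hypothetical extraction records: one row per study, listing every
# assessment-method category that study used (not the review's actual data).
studies = pd.DataFrame({
    "study_id": [1, 2, 3, 4, 5],
    "methods": [["simulated_inperson"],
                ["simulated_inperson", "workplace"],
                ["video"],
                ["workplace"],
                ["simulated_inperson", "knowledge"]],
})

counts = studies.explode("methods")["methods"].value_counts()
percentages = (counts / len(studies) * 100).round()
print(pd.DataFrame({"No.": counts, "%": percentages}))
# Because a study can appear in several categories, the percentages can sum to >100%.
```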
Results
Of the 17 406 unique manuscripts initially identified, 710 full texts were evaluated for eligibility and 137 were included (figure 1). Citation searching from identified manuscripts resulted in adding nine more full-text articles, bringing the total to 146 studies including 98 394 participants. The number of participants per study varied greatly, from 10 to 20 participants in workplace-based assessments to thousands of examinees in licensing examinations. Four studies did not report the exact number of participants. Citations (including citations for tables 2–4 below) are provided in online supplemental appendix B.
Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of the systematic search and selection of manuscripts.
Table 2. Descriptive statistics of the studies reported in the included manuscripts.
| | No. (%) | |
|---|---|---|
| Scoring method | ||
| No. different methods used, mean (min–max)* | 1.4 (1–4) | NA |
| Performance-based, simulated, inperson assessment | 116 (79) | |
| Performance-based, workplace-based assessment | 27 (19) | |
| Asynchronous, video-based assessment | 17 (12) | |
| Knowledge-based assessment | 5 (3) | |
| Performance-based, simulated, virtual assessment | 1 (1) | |
| Scoring format, type of scale** | ||
| Dichotomous | 35 (24) | |
| Likert (type) scale | 47 (32) | |
| Adjectival scale | 75 (52) | |
| Global scale | 8 (5) | |
| Narrative | 3 (2) | |
| Frequency count | 2 (1) | |
| Raters** | ||
| Experts | 44 (30) | |
| Faculty members | 54 (37) | |
| Investigators | 16 (11) | |
| Staff | 12 (8) | |
| Practitioners | 24 (17) | |
| Simulated client/patients | 62 (42) | |
| Real patients/clients | 7 (5) | |
| Peers | 9 (6) | |
| Self | 11 (8) | |
| Computer | 3 (2) | |
| Students | 5 (3) | |
| Lay persons | 2 (1) | |
| Not specified | 3 (2) | |
| Statement of rater bias included†† | 42 (29) | |
| Rater training reported†† | 95 (65) | |
| Component of communication measured** | ||
| Content | 138 (95) | |
| Process | 142 (97) | |
| Perceptual | 21 (14) | |
| Definition of communication skills provided | 26 (18) | |
| Reported theory or theoretical underpinnings | 54 (37) | |
| Discipline | ||
| Human medicine | 125 (86) | |
| Pharmacy | 6 (4) | |
| Nursing | 3 (2) | |
| Dentistry | 4 (3) | |
| Veterinary medicine | 2 (1) | |
| Disciplines from which one study was included: physiotherapy, nutrition, osteopathy, psychotherapy | 4 (3) | |
| Participant type | ||
| Undergraduate students | 86 (59) | |
| Postgraduate trainees | 40 (28) | |
| Professionals in practice | 21 (14) | |
| Country | ||
| USA | 64 (44) | |
| Germany | 17 (12) | |
| UK | 16 (11) | |
| The Netherlands | 13 (9) | |
| Canada | 9 (6) | |
| Brazil | 4 (3) | |
| Iran | 3 (2) | |
| Ireland | 3 (2) | |
| Countries from which two studies were included: Belgium, China, Norway, Portugal, Qatar | 10 (7) | |
| Countries from which one study was included: Australia, Denmark, France, New Zealand, Pakistan, Scotland, Spain, Switzerland, Turkey | 8 (6) | |
Table including the respective references can be found in online supplemental appendix B.
Some studies employed multiple methods, scoring formats or raters or assessed multiple components of CS, resulting in the numbers adding up to more than 100%.
‘Statement of rater bias included’ and ‘rater training reported’ were used as proxy indicators of study rigour, reflecting efforts to ensure assessment quality and reduce bias.
CS, communication skills.
Table 4. Sources of validity for the two subcategories of performance-based, simulated and inperson CSA methods.
| | Objective structured clinical examination | Other performance-based, simulated, inperson assessment |
|---|---|---|
| No. manuscripts | 68 | 48 |
| | No. (%) | No. (%) |
| Content: any description of steps taken to ensure that test content reflects the construct it is intended to measure | 44 (65) | 38 (79) |
| Use of previously developed framework | 20 (29) | 21 (44) |
| Internal structure: evaluations of relations among individual items, raters, cases within an assessment and how they relate to an overarching construct | 61 (90) | 42 (88) |
| Internal consistency – across different items; includes Cronbach’s alpha, split-half reliability, KR20, KR21 or test-retest reliability | 23 (34) | 22 (48) |
| Inter-rater reliability – across different raters; includes Cronbach’s alpha, kappa, intraclass correlation or Pearson’s correlation | 30 (44) | 35 (73) |
| Interstation (case) reliability – across different communication OSCE stations or cases | 6 (9) | 3 (7) |
| Intrarater reliability – within the same rater | 2 (3) | 1 (2) |
| Intrastation (case) reliability | 3 (4) | 0 (0) |
| Generalisability theory | 29 (43) | 9 (20) |
| Item response theory | 8 (12) | 4 (9) |
| Other – includes studies that report Bland-Altman plots or ROC curves | 3 (4) | 2 (4) |
| Relations with other variables: associations between assessment scores and another measure or feature | 47 (69) | 24 (50) |
| Any | 12 (18) | 4 (9) |
| Performance – association of assessment scores with a separate measure of performance | 36 (53) | 20 (42) |
| Expertise – association with level of expertise, such as training level (expert vs novice) or status (trained vs untrained) | 12 (18) | 4 (9) |
| Response process: analyses of how responses align with the intended construct; includes raters’ thoughts and test security/quality control | 5 (7) | 6 (13) |
| Consequences: impact of the assessment itself; includes actions based on assessment scores and standard setting procedures | 9 (13) | 6 (13) |
| Used validity framework | 7 (10) | 5 (10) |
Table including the respective references can be found in online supplemental appendix B.
OSCE, objective structured clinical examinations; ROC, receiver operator curve.
Study context
64 (44%) studies originated from the USA, 17 (12%) from Germany and 16 (11%) from the UK (table 2). Manuscripts identified dated back to 1991. Human medicine contributed the majority of studies (125 studies, 86%). Studies were also conducted in pharmacy, nursing, dentistry, veterinary medicine, physiotherapy, nutrition, osteopathy and psychotherapy (table 2). 86 (59%) studies involved undergraduate students, 40 (28%) involved postgraduate trainees, for example, residents, and 21 (14%) engaged healthcare professionals in practice.
Features of CSA methods
Table 2 summarises the descriptive data collected from the included studies. The mean number of CSA methods used to assess participants across all studies was 1.4, ranging from 1 to 4 methods of assessment, for example, assessing the participant with both a simulated patient and a faculty member or with two different assessment tools.
Performance-based, simulated, inperson CSA was the largest group, comprising 116 (79%) studies, of which 68 (47%) were OSCE-based. 27 (19%) studies assessed examinees using workplace-based assessment; these included scores generated from clinical interactions with real patients (seven studies, 5%), multisource feedback (four studies, 3%) and mini-clinical evaluation exercises (two studies, 1%). Video-based assessments were reported in 17 (12%) studies, including eight (6%) studies specifically indicating that they used an Objective Structured Video Examination format. Three studies (2%) reported skill spotting using videos and two (1%) reported the use of videos in a situational judgement test. Knowledge-based assessment was reported in five studies (3%) and one study (1%) reported using a virtual client in CSA.
Nearly all studies evaluated the content component (138 studies, 95%) and the process component of CS (142 studies, 97%). There was overlap between the studies evaluating content and process, with 137 studies assessing both. The perceptual component of communication was assessed in 21 (14%) studies; these studies also assessed both the content and process dimensions of communication. 26 (18%) of the studies provided a definition of CSs, the construct to be assessed. CSs frameworks (ie, the CCG, the SEGUE, etc) were reported in 54 (37%) of the manuscripts. The use of a framework can serve as a proxy for a CS definition and provides content-related validity evidence (as indicated below).
Response format and scoring
Scales
Studies meeting the inclusion criteria used a diverse range of rating formats and rubrics. An adjectival scale, that is, a series of response options ranging from ‘Poor’ to ‘Excellent’ without a neutral mid-point, was used in 75 studies (52%). 47 (32%) studies used a Likert (type) scale. Seven (5%) studies employed global scales, generally with the SP performing this assessment. Checklists were present in 34 (23%) studies.
Raters
CSA were performed by a wide variety of raters, and some studies employed multiple raters. Most prominent was the use of SPs for scoring (61 studies, 42%), followed by faculty members (53 studies, 37%), that is, faculty at the institution where the study was conducted but not necessarily trained in CS, and experts (43 studies, 30%), that is, CS specialists. Rater training was mentioned in 96 (65%) studies and most often involved familiarising the raters with the assessment method and with the scenario in which they would represent a patient/client. 41 (28%) studies included a statement on rater bias, acknowledging that it might be present in the study and/or describing strategies to mitigate it.
Reliability and validity evidence
Performance-based, simulated, inperson assessment
As mentioned above, given the large number of studies in the category, we provide overarching reliability and validity information in table 3 and then break down this category into studies reporting validity evidence from OSCEs versus other simulated inperson assessment (table 4). Of the 116 studies using performance-based, simulated, inperson CSA, 12 (10%) studies reported a validity framework, most commonly adopting Messick’s framework (six studies, 5%) or Kane’s framework (four studies, 3%) (table 3).
Table 3. Sources of validity.
| | Total | Performance-based, simulated, inperson assessment | Performance-based, workplace-based assessment | Asynchronous, video-based assessment | Knowledge-based assessment | Performance-based, simulated, virtual assessment |
|---|---|---|---|---|---|---|
| No. of manuscripts | 146 | 116 | 27 | 17 | 5 | 1 |
| | No. (%) | No. (%) | No. (%) | No. (%) | No. (%) | No. (%) |
| Content: any description of steps taken to ensure that test content reflects the construct it is intended to measure | 109 (75) | 83 (72) | 24 (89) | 12 (65) | 4 (80) | 0 (0) |
| Use of previously developed framework | 54 (37) | 41 (35) | 9 (33) | 8 (47) | 5 (100) | 1 (100) |
| Internal structure: evaluations of relations among individual items, raters, cases within an assessment and how they relate to an overarching construct | 131 (90) | 103 (89) | 25 (93) | 15 (88) | 5 (100) | 1 (100) |
| Internal consistency – across different items; includes Cronbach’s alpha, split-half reliability, KR20, KR21 or test-retest reliability | 50 (34) | 45 (39) | 5 (19) | 4 (24) | 3 (60) | 1 (100) |
| Inter-rater reliability – across different raters; includes Cronbach’s alpha, kappa, intraclass correlation or Pearson’s correlation | 83 (57) | 64 (55) | 15 (56) | 9 (53) | 4 (80) | 1 (100) |
| Interstation (case) reliability – across different communication OSCE stations or cases | 9 (6) | 9 (8) | 1 (4) | 0 (0) | 0 (0) | 0 (0) |
| Intrarater reliability – within the same rater | 4 (3) | 3 (3) | 1 (4) | 0 (0) | 0 (0) | 0 (0) |
| Intrastation (case) reliability | 3 (2) | 3 (3) | 1 (4) | 0 (0) | 0 (0) | 0 (0) |
| Generalisability theory | 44 (30) | 38 (33) | 6 (22) | 5 (29) | 0 (0) | 0 (0) |
| Item response theory | 13 (9) | 12 (10) | 1 (4) | 1 (6) | 1 (20) | 0 (0) |
| Other – Bland-Altman plots, receiver operator curves | 5 (3) | 5 (4) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| Relations with other variables: associations between assessment scores and another measure or feature | 87 (60) | 72 (62) | 16 (59) | 11 (65) | 4 (80) | 0 (0) |
| Any | 19 (13) | 16 (14) | 2 (7) | 2 (12) | 1 (20) | 0 (0) |
| Performance – association of assessment scores with a separate measure of performance | 69 (47) | 57 (49) | 14 (52) | 9 (53) | 4 (80) | 0 (0) |
| Expertise – association with level of expertise, such as training level (expert vs novice) or status (trained vs untrained) | 20 (14) | 16 (14) | 3 (11) | 3 (18) | 0 (0) | 0 (0) |
| Response process: analyses of how responses align with the intended construct; includes raters’ thoughts and test security/quality control | 15 (10) | 11 (9) | 1 (4) | 3 (18) | 2 (40) | 0 (0) |
| Consequences: impact of the assessment itself; includes actions based on assessment scores and standard setting procedures | 16 (11) | 15 (13) | 1 (4) | 2 (12) | 0 (0) | 0 (0) |
| Used validity framework | 15 (10) | 12 (10) | 0 (0) | 4 (24) | 1 (20) | 0 (0) |
As studies can report on multiple sources of validity evidence (eg, both content and internal structure) and can report on various forms of evidence within a source (eg, internal structure evidence derived from both classical test theory and item response theory), numbers do not add up to 146. Percentages are computed in relation to the number of papers within each specific category. Table including the respective references can be found in the online supplemental appendix B.
OSCE, objective structured clinical examination.
Validity evidence related to content
Content-related validity evidence was reported in 65% (44 studies) of the OSCE-based studies and 79% (38 studies) of the other simulated inperson assessments (table 4). The primary source of this validity evidence comes from reporting the use of a CSs framework to assess CSs. The OSCE-based studies reported using the CCG (seven studies, 10%), the MAAS framework (six studies, 9%), the Kalamazoo Consensus Statement (four studies, 5%) and the Roter Interaction Analysis System (RIAS), the Four Habits coding scheme, the E4 model and the Communication Assessment Tool (CAT) (each one study, 2%), while the other simulated inperson assessment studies reported using the CCG (seven studies, 15%), the Kalamazoo Consensus Statement (five studies, 10%), the MAAS framework (four studies, 8%), the SEGUE framework (three studies, 6%), the CAT (two studies, 4%) and the RIAS and a patient-centred communication framework (each one study, 2%).
Validity evidence related to internal structure
The majority of studies included evidence of internal structure (90% of OSCE-based and 88% of other simulated performance-based assessments). Classical test theory reliability coefficients were more frequently reported than either generalisability theory or item response theory coefficients. Of the classical test theory coefficients, the most frequently reported were those associated with inter-rater reliability, followed by internal consistency, for both OSCE-based and other simulated assessment studies (table 4). However, generalisability and item response theory were reported twice as often in studies employing OSCEs compared with other simulated performance-based CSA (table 4). Other reported sources of internal structure validity evidence included Bland-Altman plots and receiver operator curves.
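For readers less familiar with generalisability theory than with the classical coefficients above, the sketch below estimates variance components and a relative G coefficient for a fully crossed examinee × rater design using hypothetical scores; it is illustrative only and does not reproduce any analysis from the included studies.

```python
import numpy as np

def g_study_person_by_rater(scores: np.ndarray) -> dict:
    """Fully crossed p x r design: rows = examinees (p), columns = raters (r).
    Estimates variance components from mean squares and a relative G coefficient."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    ss_p = n_r * ((person_means - grand) ** 2).sum()
    ss_r = n_p * ((rater_means - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ss_pr = ss_total - ss_p - ss_r                      # p x r interaction confounded with error

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    var_pr = ms_pr
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)

    # Relative G coefficient for a score averaged over n_r raters
    g_rel = var_p / (var_p + var_pr / n_r)
    return {"var_person": var_p, "var_rater": var_r, "var_residual": var_pr, "g_relative": g_rel}

# Hypothetical data: 6 examinees each scored by the same 3 raters
scores = np.array([[4, 5, 4],
                   [2, 3, 3],
                   [5, 5, 4],
                   [3, 3, 2],
                   [4, 4, 5],
                   [1, 2, 2]], dtype=float)
print(g_study_person_by_rater(scores))
```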
Validity evidence related to relationships with other variables
Validity evidence derived from the relationship with other variables was used in more than half of all simulated assessments (table 3), although more in OSCE-based assessments than in others (table 4). Most studies that included this source of validity evidence reported a relationship between CSA scores and level of performance on other assessments.
Validity evidence related to response process and consequences
Finally, validity evidence for response process and consequences was the least prevalent in simulated inperson assessments (table 3). In both OSCE-type assessments and other inperson simulated assessments, each of these sources of validity evidence was reported in fewer than 10 studies (table 4).
Performance-based, workplace-based assessment
No studies in this category reported the use of a specific validity framework. However, all studies reported at least one source of validity evidence.
Validity evidence related to content
Content validity was reported by the majority of the studies in this category (table 3). Within this category, several studies drew content validity evidence from established CSs frameworks, such as the MAAS framework (three studies, 11%), the Four Habits framework, the CCG (two studies, 7%), the CAT and the E4 framework (both one study, 4%).
Validity evidence related to internal structure
Internal structure validity evidence was reported by most of the studies in this category, reporting mainly classical test theory reliability coefficients.
Validity evidence related to relationships with other variables
Relationship to other variables was reported in 16 studies (59%; table 3). Mostly, test scores were compared with other measures of cognitive performance.
Validity evidence related to response process and consequences
In studies on workplace-based assessment, validity evidence derived from response processes or consequences was the least prevalent (one study each, 4%), similar to the simulation-based assessments in the prior category (table 3).
Asynchronous, video-based assessment
In total, 17 studies (12% of all included studies) were included in the category of asynchronous, video-based assessment. Four studies (22%) reported employing a validity framework, one of which used Messick’s framework.
Validity evidence related to content
The majority of studies (12 studies, 67%) reported content validity evidence. CSs frameworks that served as the basis for assessments included the Kalamazoo Consensus Statement (four studies, 22%), the CCG (three studies, 17%), the MAAS framework (two studies, 11%) and the E4 model (one study, 6%).
Validity evidence related to internal structure
Internal structure validity was reported in 16 of the studies (89%), the majority of which reported inter-rater reliability coefficients (table 3).
Validity evidence related to relationships with other variables
Analogous to the performance-based assessments described earlier, validity evidence derived from the relationship with other variables was reported in approximately two-thirds of the studies in this category (11 studies, 65%). Also akin to the performance-based assessments previously described, most of these studies related CSA scores to other cognitive performance variables (table 3).
Validity evidence related to response process and consequences
In studies on asynchronous, video-based assessment, validity evidence derived from response processes or consequences was the least prevalent (table 3). Validity evidence derived from response processes was present in three studies (18%) and from consequences in two studies (12%).
Knowledge-based assessment
In total, five studies fell into this category. One of them (20%) explicitly indicated the use of a validity framework. The other four studies reported distinct sources of validity evidence without integrating them into a validity framework.
Validity evidence related to content
All studies in this category presented evidence of content validity. CSs frameworks forming the foundation for assessments included the Kalamazoo Consensus Statement (four studies, 80%) and the MAAS framework, CCG and patient-centred communication framework (one study each, 20%).
Validity evidence related to internal structure
Additionally, all studies reported classical test theory reliability coefficients as a source of internal structure validity evidence, one of them supplementing these with a reliability coefficient generated from item response theory.
Validity evidence related to relationships with other variables
Furthermore, relations with other variables were reported in most studies, relating assessment scores to cognitive performance on other variables (table 3).
Validity evidence related to response process and consequences
Within these studies, validity sources of response process and consequences were not identified.
Performance-based, simulated, virtual assessment
This category contains only one study on the use of virtual agents for CSA purposes. It derived content validity as a source of validity evidence by being based on the Kalamazoo consensus statement. It reported on the internal structure by using reliability measures from classical test theory. It did not mention validity sources of relations with other variables, response process and consequences.
Study quality
MERSQI scores ranged from 8 to 16 (of 18 possible), with a mean (SD) of 12.7 (1.4) (table 5). ‘Statement of rater bias included’ and ‘rater training reported’ (table 2) were used as proxy indicators of study rigour, reflecting efforts to ensure assessment quality and reduce bias. 42 studies (29%) included a statement on rater bias, and 95 studies (65%) reported on the training that raters received prior to taking part in participant assessment.
Table 5. Scores for the Medical Education Research Study Quality Instrument (MERSQI)37 to appraise the general methodological quality of the included studies.
| Number of items available | MERSQI score* | N |
|---|---|---|
| 10 | 12.5, 9–14.5 (1.3) | 64 |
| 9 | 12.6, 8–15 (1.4) | 74 |
| 8 | 13.0, 11–14 (1.6) | 4 |
| 7 | 14, 14–14 (0) | 2 |
| 6 | 15.5, 15–16 (0.6) | 2 |
| All combined | 12.7, 8–16 (1.4) | 146 |
The MERSQI is scored on 10 items of which four items might be scored as ‘Not applicable’. The maximum obtainable score is 18.
*Mean, range (SD).
Discussion
This scoping review’s aim was to map the reliability and validity evidence from scores of CSA methods reported in HPE. 146 studies met our inclusion criteria. Key findings include the wide range of CSA methods that have been used, of which OSCEs were the majority; the various sources of validity evidence that are reported; and the multiple scoring methods used for CSs. There were gaps in the validity evidence used to support the interpretation and use of CSA scores, namely in the reporting of consequences and response process evidence: internal structure was reported in many studies (90%), followed by content-related validity (74%), relationships with other variables (60%), consequences (11%) and response processes (10%). Only 10% of the studies reported using a validity framework.
Integration of prior work
To our knowledge, this is the first scoping review to map existing validity evidence of various CSA methods across HPE. It complements and extends the work previously performed looking at specific methods such as OSCEs,17 21 22 workplace-based assessment8 and written assessment,16 23 or limited to specific professions, such as medicine,24–26 dentistry27 and nursing.28
Internal structure (reliability) validity evidence was most reported (90%), with many of these studies reporting classical test theory coefficients for inter-rater agreement or internal consistency. However, as Bensing38 commented before, ‘Interassessor reliability is just one and maybe not the most important condition to develop an adequate measurement instrument’.8 38 39 Here we reported on internal structure data that characterised internal consistency coefficients for communication skills assessment items, inter-rater agreement and interstation (case) reliability to further identify reliability metrics that are typically under-reported. Our mapping process identified gaps in reporting other sources of validity evidence, with content-related validity (74%) and relationships with other variables (60%) being more frequently reported, while response processes (10%) and consequences (11%) were infrequently reported. The absence of a theoretical or empirical basis for prioritising sources of validity evidence poses a significant limitation in validation research. Additionally, it is worth noting that the type, rigour and quantity of evidence needed vary depending on the assessment’s purpose. Assessments with greater potential consequences, such as summative examinations or those aimed at more advanced learners, typically require more robust validity evidence. A starting point for designing and implementing these is included in the implications section of this paper.
In our review, 90% of studies using OSCEs reported at least one reliability measure. Other sources of validity were reported in 65% (content validity) and 69% (relations with other variables) of studies. By contrast, response processes and consequences were not consistently reported (7% and 13%, respectively) across the studies reported here. Why these sources are under-reported, or why authors choose not to report them, is an area worth further investigation.
McGill et al8 concluded that the reliability of workplace-based assessment scores alone did not provide sufficient evidence to support high-stakes decision making and that these scores should be considered part of an assessment programme rather than relying on the CSA of one supervisor alone. Further analysis of the 27 workplace-based studies identified in our review reveals trends in the validity evidence that can help determine where and how best to assess CSs within the workplace environment to support learning and decision making for trainees.
Simulated encounters as an assessment method were reported in 12–15% of the studies reviewed by Tan et al24 and 31% of the studies reviewed by Fischer et al.25 This seems low when compared with the prevalence of these assessment methods in our study (79%), which can be explained by this review studying objective CSA specifically, while the others also included studies on teaching methods24 or self-assessment through questionnaires.25 Like ours, each of these publications emphasised both the need for, and the lack of, validity evidence to support reliable measurement of CSs.
Finally, context and case specificity in CSA methods within HPE is pivotal for understanding their applicability and effectiveness across diverse healthcare contexts and learner populations.8 12 40 Our scoping review revealed a comprehensive range of healthcare professions and populations under study which exhibited considerable variability. Therefore, it is not surprising that a breadth of assessment methods was reported in the included studies. Previous reviews in medicine,24–26 dentistry27 and nursing28 also reported a wide variety of CSA methods and predominantly reported on the assessment of undergraduate students.17 27
Implications for future work
This review’s mapping of existing validity evidence demonstrates that most evidence can be found, unsurprisingly, within performance-based assessments, mostly OSCEs.
There are other implications. First, this review reports variations in the use of assessment theory that underpins CSA in HPE. While designing CSA methods can be challenging, we encourage educators to develop assessment methods that are contextually relevant and guided by psychometric best practices using evidence-based design principles. Guidelines on how to design, implement and evaluate objective assessments41 and performance-based simulated assessments40 have been summarised elsewhere. In short, one should consider: (1) the definition of the construct(s) to be measured, (2) the situation in which the assessment will be employed, and thus the level of validity evidence needed, (3) employment of a validation process39 and (4) continuous quality assurance of the assessment.
Second, our work highlighted the heterogeneity of reporting on rater training and the mitigation of rater bias. As the assessment of communication mostly relies on rater or peer judgement, it is susceptible to individual preferences and styles, underscoring the critical importance of mitigating bias to ensure the validity of outcomes. Rater training and measures to prevent rater bias might assist in producing more reliable and valid examinee communication scores.
Third, this review underscored the lack of clarity on the definition of CSs, with only 26 (18%) of the manuscripts mentioning the communication construct being measured and 54 (37%) reporting the communication framework used. A clear definition of the construct of CSs within a study would aid in understanding the assessment method(s) selected and assist with interpretation of the scores for low-stakes/high-stakes situations. Across health professions, these definitions may vary; even within a health profession, there may be greater emphasis on specific aspects of CSs given the practice continuum. Regardless, an explicit statement of the construct of CSs measured would further the conversation on which skills can be reliably and validly assessed, when these skills could be measured, by what assessment methods (possibly as part of a system of assessment) and whether the scores should be used for low-stakes feedback or high-stakes decision making.
Overall, this work has implications for assessment in HPE in general. Our work demonstrated a need to design psychometrically sound assessment strategies and to implement adequate rater training and bias mitigation strategies, both of which should be based on a solid and clear understanding of the construct being assessed.
Limitations and strengths
Like all reviews, our interpretations were constrained by the methodological and reporting quality of the studies that met our inclusion criteria. Inadequate reporting hindered our assessment of some assessment features. This includes, but is not limited to, whether and how CSA was operationalised within a CSs framework, like the Calgary Cambridge Guide or the MAAS framework; whether CSs were defined and how that translated to the items used and the assessment performed; and how validity components were reported and interpreted within the studies. Additionally, in the last few years, the use of artificial intelligence (AI) in CS training and assessment has increased. This is not yet reflected in our data, as current studies mainly look at the possibilities for using AI in teaching interventions and not yet at the psychometrics of AI-supported assessments.
In this study, we used the MERSQI37 to appraise the methodological quality of included studies, which we initially considered a strength of this work. The mean MERSQI score for the 146 included studies was 12.7 (SD 1.4; range 8–16; table 5). However, this high score is more a reflection of our inclusion criteria than of the methodological quality of the studies.42 As our review specifically included studies that objectively assessed communication in examinees, the included studies scored highly on the MERSQI items assessing ‘type of data’, ‘validity evidence for evaluation instrument scores’, ‘data analysis: sophistication’ and ‘data analysis: appropriate’.
Strengths of our scoping review include the inclusion of manuscripts from a wide spectrum of health professions, duplicate review at all stages and iterative development of the data characterisation and extraction rubric.
Conclusions
This scoping review maps the wide array of CSA methods present in HPE, with performance-based, simulated, inperson assessment being most prevalent. Studies were mostly performed in medicine and in the USA and Europe. We identified validity evidence gaps in the following categories: content validity, relationship with other variables, response processes and consequences of assessment. Little to no validity evidence was reported for response processes and consequences across all CSA categories. The findings of this scoping review point to areas for further study needed to strengthen the reliability and validity evidence supporting communication knowledge and skills assessment methods. Focused work in these gaps will ultimately support educators and licensing bodies in their efforts to ensure the production of reliable and valid CSA scores. The results of this review will provide CS experts and researchers with the knowledge of what can be used and the extent of the evidence to support decision-making based on scores from these methods. This will hopefully lead to a more evidence-centred approach for the selection of CSA methods that can be combined into longitudinal programmes of assessment within CSs curricula and/or programme-wide programmes of assessment.
Supplementary material
Footnotes
Funding: International Council for Veterinary Assessment.
Prepublication history and additional supplemental material for this paper are available online. To view these files, please visit the journal online (https://doi.org/10.1136/bmjopen-2024-096799).
Provenance and peer review: Not commissioned; externally peer reviewed.
Patient consent for publication: Not applicable.
Ethics approval: Not applicable.
Patient and public involvement: Patients and/or the public were not involved in the design, conduct, reporting or dissemination plans of this research.
Data availability statement
Data are available upon reasonable request.
References
- 1.Kanji N, Coe JB, Adams CL, et al. Effect of veterinarian-client-patient interactions on client adherence to dentistry and surgery recommendations in companion-animal practice. J Am Vet Med Assoc. 2012;240:427–36. doi: 10.2460/javma.240.4.427
- 2.DiMatteo MR, DiNicola DD. Achieving patient compliance: the psychology of the medical practitioner’s role. 1982.
- 3.Stewart MA. Effective physician-patient communication and health outcomes: a review. CMAJ. 1995;152:1423.
- 4.Ong LML, de Haes J, Hoos AM, et al. Doctor-patient communication: a review of the literature. Soc Sci Med. 1995;40:903–18. doi: 10.1016/0277-9536(94)00155-M
- 5.Gruppen LD, Mangrulkar RS, Kolars JC. The promise of competency-based education in the health professions for improving global health. Hum Resour Health. 2012;10:1–7. doi: 10.1186/1478-4491-10-43
- 6.Kurtz SM, Silverman JD. The Calgary-Cambridge Referenced Observation Guides: an aid to defining the curriculum and organizing the teaching in communication training programmes. Med Educ. 1996;30:83–9. doi: 10.1111/j.1365-2923.1996.tb00724.x
- 7.Makoul G. The SEGUE Framework for teaching and assessing communication skills. Patient Educ Couns. 2001;45:23–34. doi: 10.1016/s0738-3991(01)00136-7
- 8.McGill DA, van der Vleuten CPM, Clarke MJ. Supervisor assessment of clinical and professional competence of medical trainees: a reliability study using workplace data and a focused analytical literature review. Adv Health Sci Educ Theory Pract. 2011;16:405–25. doi: 10.1007/s10459-011-9296-1
- 9.Mislevy RJ, Haertel GD. Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice. 2006;25:6–20. doi: 10.1111/j.1745-3992.2006.00075.x
- 10.Shavelson RJ. On an approach to testing and modeling competence. Educ Psychol. 2013;48:73–86. doi: 10.1080/00461520.2013.779483
- 11.Braun HI, Shavelson RJ, Zlatkin-Troitschanskaia O, et al. Performance assessment of critical thinking: conceptualization, design, and implementation. Front Educ. 2020;5. doi: 10.3389/feduc.2020.00156
- 12.Hecker KG, Adams CL, Coe JB. Assessment of first-year veterinary students’ communication skills using an objective structured clinical examination: the importance of context. J Vet Med Educ. 2012;39:304–10. doi: 10.3138/jvme.0312.022R
- 13.Ritter C, Adams CL, Kelton DF, et al. Clinical communication patterns of veterinary practitioners during dairy herd health and production management farm visits. J Dairy Sci. 2018;101:10337–50. doi: 10.3168/jds.2018-14741
- 14.Hodges B. Medical education and the maintenance of incompetence. Med Teach. 2006;28:690–6. doi: 10.1080/01421590601102964
- 15.Whitehead CR, Kuper A, Hodges B, et al. Conceptual and practical challenges in the assessment of physician competencies. Med Teach. 2015;37:245–51. doi: 10.3109/0142159X.2014.993599
- 16.Kiessling C, Perron NJ, van Nuland M, et al. Does it make sense to use written instruments to assess communication skills? Systematic review on the concurrent and predictive value of written assessment for performance. Patient Educ Couns. 2023;108:107612. doi: 10.1016/j.pec.2022.107612
- 17.Setyonugroho W, Kennedy KM, Kropmans TJB. Reliability and validity of OSCE checklists used to assess the communication skills of undergraduate medical students: a systematic review. Patient Educ Couns. 2015;98:1482–91. doi: 10.1016/j.pec.2015.06.004
- 18.Epstein RM, Hundert EM. Defining and assessing professional competence. JAMA. 2002;287:226–35. doi: 10.1001/jama.287.2.226
- 19.Norcini JJ, McKinley DW. Assessment methods in medical education. Teaching and Teacher Education. 2007;23:239–50. doi: 10.1016/j.tate.2006.12.021
- 20.Whelan GP, Boulet JR, McKinley DW, et al. Scoring standardized patient examinations: lessons learned from the development and administration of the ECFMG Clinical Skills Assessment (CSA). Med Teach. 2005;27:200–6. doi: 10.1080/01421590500126296
- 21.Cömert M, Zill JM, Christalle E, et al. Assessing communication skills of medical students in objective structured clinical examinations (OSCE) – a systematic review of rating scales. PLoS One. 2016;11:e0152717. doi: 10.1371/journal.pone.0152717
- 22.Brannick MT, Erol-Korkmaz HT, Prewett M. A systematic review of the reliability of objective structured clinical examination scores. Med Educ. 2011;45:1181–9. doi: 10.1111/j.1365-2923.2011.04075.x
- 23.Perron NJ, Pype P, van Nuland M, et al. What do we know about written assessment of health professionals’ communication skills? A scoping review. Patient Educ Couns. 2022;105:1188–200. doi: 10.1016/j.pec.2021.09.011
- 24.Tan XH, Foo MA, Lim SLH, et al. Teaching and assessing communication skills in the postgraduate medical setting: a systematic scoping review. BMC Med Educ. 2021;21:483. doi: 10.1186/s12909-021-02892-5
- 25.Fischer F, Helmer S, Rogge A, et al. Outcomes and outcome measures used in evaluation of communication training in oncology – a systematic literature review, an expert workshop, and recommendations for future research. BMC Cancer. 2019;19:808. doi: 10.1186/s12885-019-6022-5
- 26.Duffy FD, Gordon GH, Whelan G, et al. Assessing competence in communication and interpersonal skills: the Kalamazoo II report. Acad Med. 2004;79:495–507. doi: 10.1097/00001888-200406000-00002
- 27.Khalifah AM, Celenza A. Teaching and assessment of dentist-patient communication skills: a systematic review to identify best-evidence methods. J Dent Educ. 2019;83:16–31. doi: 10.21815/JDE.019.003
- 28.Kerr D, Ostaszkiewicz J, Dunning T, et al. The effectiveness of training interventions on nurses’ communication skills: a systematic review. Nurse Educ Today. 2020;89:104405. doi: 10.1016/j.nedt.2020.104405
- 29.Gordon M, Farnan J, Grafton-Clarke C, et al. Non-technical skills assessments in undergraduate medical education: a focused BEME systematic review: BEME Guide No. 54. Med Teach. 2019;41:732–45. doi: 10.1080/0142159X.2018.1562166
- 30.Messick S. Educational measurement. 3rd edn. New York: American Council on Education/Macmillan; 1989. pp. 13–103.
- 31.Kane MT. Validating the interpretations and uses of test scores. J Educational Measurement. 2013;50:1–73. doi: 10.1111/jedm.12000
- 32.Liberati A, Altman DG, Tetzlaff J, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. Ann Intern Med. 2009;151:W65–94. doi: 10.7326/0003-4819-151-4-200908180-00136
- 33.Cusson O, Mercier J, Catelin C, et al. Evaluating communication with parents in paediatric patient encounters: a systematic review protocol. BMJ Open. 2021;11:e049461. doi: 10.1136/bmjopen-2021-049461
- 34.American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for educational and psychological testing. 2014.
- 35.Adams CL, Kurtz SM, Bayly W, et al. Skills for communicating in veterinary medicine. Otmoor Publishing; 2017.
- 36.Denniston C, Molloy E, Nestel D, et al. Learning outcomes for communication skills across the health professions: a systematic literature review and qualitative synthesis. BMJ Open. 2017;7:e014570. doi: 10.1136/bmjopen-2016-014570
- 37.Reed DA, Cook DA, Beckman TJ, et al. Association between funding and quality of published medical education research. JAMA. 2007;298:1002–9. doi: 10.1001/jama.298.9.1002
- 38.Bensing J. Doctor-patient communication and the quality of care. Soc Sci Med. 1991;32:1301–10. doi: 10.1016/0277-9536(91)90047-G
- 39.Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ. 2003;37:830–7. doi: 10.1046/j.1365-2923.2003.01594.x
- 40.Buléon C, Mattatia L, Minehart RD, et al. Simulation-based summative assessment in healthcare: an overview of key principles for practice. Adv Simul (Lond). 2022;7:42. doi: 10.1186/s41077-022-00238-9
- 41.Tavakol M, Dennick R. Post-examination analysis of objective tests. Med Teach. 2011;33:447–58. doi: 10.3109/0142159X.2011.564682
- 42.Cook DA, Reed DA. Appraising the quality of medical education research methods: the Medical Education Research Study Quality Instrument and the Newcastle-Ottawa Scale-Education. Acad Med. 2015;90:1067–76. doi: 10.1097/ACM.0000000000000786

