Abstract
Background
The personal qualities of health workers determine the way health services are provided to clients. Some key personal qualities (also called behavioural competencies) of physicians that contribute to quality healthcare delivery include ethical responsibility, empathy, patient-centeredness, diligence, good judgment, respectful, teamwork, team leadership/conflict management, ability to take correction and tolerance. In this study, we developed and validated clinical scenarios (dilemmas) for assessing priority behavioural competencies for medical practice in Nigeria.
Methods
Drawing on the competencies prioritized through a scoping review and nominal group technique (NGT) exercises in a previous study in this series, faculty members from the University of Nigeria Teaching Hospital were consulted to develop or adapt clinical scenarios for assessing these competencies in physicians. The clinical scenarios and response options were framed as situational judgement tests (SJTs), which were administered to a random sample of 192 undergraduate medical students and 111 postgraduate (resident) doctors in a tertiary hospital in Enugu State. Using Kane’s validity argument framework, we assessed the scoring and generalization inferences of the SJTs based on the developed scenarios.
Results
Scoring inference (difficulty and discrimination indices) shows that most of the SJT items are good test items that can differentiate between high and low performers. The corrected point-biserial correlations are positive for most of the items. Generalization inference shows that the items represent the domains of interest and are internally consistent. However, the few items with poor difficulty or discrimination indices were flagged for re-evaluation and possible elimination.
Conclusion
This study has produced a set of valid clinical scenarios that can be used to evaluate specific behavioural competencies among trainee medical doctors. It demonstrates that SJTs can be used to assess behavioural competencies for medical practice. However, further research is needed to establish the applicability of an SJT beyond the immediate context (such as the medical school) in which it is developed.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12909-024-06298-x.
Keywords: Validation, Medical assessment, Behavioural competencies, Medical practice
Background
Current research makes it increasingly clear that today’s clinician needs a strong foundation in medical knowledge coupled with the behavioural skills for effective communication, teamwork, ethical behaviour and the highest level of professionalism in patient management [1, 2]. These trends raise several questions about the bases for selecting intending physicians with respect to the behavioural competencies required for medical practice [1, 2], and these questions have prompted substantive changes in the approach to medical assessment across the globe, particularly in high-income countries. However, Nigeria and other low- and middle-income countries (LMICs) are yet to adopt a comprehensive method of assessment, one that captures cognitive as well as behavioural competencies.
Assessment for the medical profession in Nigeria focuses mostly on the cognitive aspect; though this is valid for predicting academic performance, it has been found to be insufficient, exhibiting large adverse impact [3]. Leaving the behavioural components entirely unattended therefore results in inadequate demonstration of preparedness for independent practice [4]. Some medical institutions have recognised the need to review medical assessment modalities in order to capture both academic and behavioural competencies and improve graduates’ preparedness for future practice [5–8]. Given the emerging consensus on the potential role of behavioural competencies in optimal medical practice [1], it is important to transition to more comprehensive assessment methods that incorporate the behavioural competencies required for high-quality health service delivery.
The Situational Judgement Test (SJT) presents an innovative approach for measuring behavioural attributes that are considered essential for high-performing physicians [9]. It can complement the academic performance and can be tailored to specific aspects of behaviour and context. It presents a hypothetical work‐related scenario followed by different options from which candidates are expected to judge and appropriately select the best lines of action or rank the responses in order of appropriateness. The response modality of SJT is such that the response instructions can have either a knowledge format (“What is the best answer?”) or a behavioural tendency format (“What are you most likely to do?”). SJT is a relatively new consideration in medical assessment [10]. It measures procedural knowledge in addition to aspects of declarative knowledge and fluid abilities [11–13].
The rising cognizance of the need to assess non-academic attributes has created a promising area of research and policy [3]. Research interest has shifted towards the development and validation of SJTs for assessing behavioural domains of interest. Validity is an indicator of how well an assessment measures what it is designed to measure; it is the degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test [14].
The validation process is crucial and ensures that the two fundamental questions about a test tool are answered: the meaning of the test score and the justification for its use [15]. This is achieved by gathering evidence and determining the extent of the tool’s logical soundness. The literature specifies different modalities for validating test tools [15]. Kane’s framework, unlike the other frameworks, allows determination of the types of validity that are most important and specifies the order in which validity evidence is collected. It involves three sequential steps: (i) determining the use and interpretation of scores from an assessment; (ii) deriving assumptions from the use and interpretation, much like stating hypotheses; and (iii) testing the weakest assumptions of the validity argument by collecting evidence in a stepwise fashion for four areas: (1) scoring, (2) generalization, (3) extrapolation, and (4) implication inferences. Whetzel et al. (2009) highlighted two primary types of evidence related to the validity of SJT scores: evidence related to the constructs measured by SJTs and evidence related to the prediction of job performance. Both forms of validity are captured by Kane’s validity argument framework.
Given the potential of SJTs for incremental validity as demonstrated in recent studies [1, 2], it is crucial to develop and validate tools that move beyond standard tests to a more holistic assessment of candidates, capturing both cognitive and behavioural measures. Such a high-stakes tool for the selection of medical professionals requires a thorough examination of its validity within the context of application. The literature reiterates the need for rigorous validation and testing of behavioural assessment tools for the medical profession [15–18]. However, the validity of the SJT has been assessed mostly in high-income settings, where it has been found useful as a predictor of performance in medical practice [10, 11, 16, 18]. Such a tool has not yet been validated within the Nigerian context.
In the first study in the series, we identified behavioural competencies considered relevant for effective clinical practice in Nigeria by triangulating a scoping review and nominal group technique (NGT) exercises. We explored the differences and similarities in the perspectives of different groups of stakeholders and compared these with findings reported in the global literature to enable a more nuanced prioritization of relevant competencies. In the present study, we drew on the prioritized competencies to develop and validate an SJT tool to measure the behavioural competencies required for competent, high-quality health service delivery in Nigerian medical training institutions.
Methods
Study area and study population
We conducted this study at the University of Nigeria Teaching Hospital, Enugu State, Southeast Nigeria. The Teaching Hospital is a federal institution located in Enugu state where medical professionals at undergraduate and postgraduate levels are trained. The study population included undergraduate clinical level medical students, postgraduate (resident) doctors, and faculty members involved in the medical training and supervision of medical students and resident doctors in the College of Medicine, University of Nigeria.
We conducted this study in two phases. In phase one, we developed clinical scenarios via focus group discussions, and in phase two, we validated the developed clinical scenarios using Kane’s validity argument framework.
Phase I: Development and adaptation of clinical scenarios for SJTs
This phase started with an expert-panel development and adaptation of clinical scenarios based on the prioritized core contextual behavioural competencies generated from the first study in the series [19]. We developed the clinical scenarios via expert panel discussion with subject matter experts (SMEs). We defined SMEs as faculty members who train and supervise medical undergraduate and postgraduate doctors. Subject matter experts who are experienced in teaching and supervision are very useful in deliberation, development and modification of SJT scenarios [20].
Only faculty members who are consultants in their departments and who provide both medical training and clinical supervision to undergraduate and postgraduate (resident) doctors were included in the study; faculty members who provide either training or supervision alone were excluded. The faculty members who met our inclusion criteria were invited from all the clinical departments within the college of medicine (surgery, internal medicine, obstetrics and gynaecology, paediatrics, and community medicine). An invitation email was sent to twenty experts (four from each department) with a brief description of the study objectives and their role as participants, and was followed up with telephone calls to ascertain their availability and willingness to participate in the study.
The expert panel for the development and validation exercise was composed of ten SMEs and the session was facilitated by three renowned experts (two public health physicians and one research fellow) who are well grounded in qualitative research methods. The session began with a consolidation exercise to affirm the contextual competencies generated from the first study [19]. A PowerPoint presentation of the ten prioritized competencies was made; five competencies for doctor-client relationship (Ethical responsibility, Empathy, Patient-centeredness, Diligence, Good judgment), and five competencies for doctor-colleague relationship (Respectful, Teamwork, Team leadership/conflict management, Ability to take correction, Tolerance). Faculty members carefully examined all the competencies and corroborated the findings.
Following the consolidation exercise, the experts were presented with sample SJT tools (applied in a different setting) for the adaptation exercise. The faculty members were randomly assigned to five teams; each team was assigned two of the ten competencies and tasked with producing five scenarios for each assigned competency (ten scenarios per team). The experts carefully examined the sample SJT scenarios, identified and adapted questions that captured their assigned competencies, and developed new SJT scenarios for competencies for which no suitable scenario was found in the given pool of sample SJT questions. At least five scenarios were adapted or developed for each of the ten prioritized competencies.
Each of the five teams produced at least ten SJT scenarios, five for each of its two assigned competencies. Some SJTs were adapted from the sample SJTs, and new ones were developed, so that the set reflected all ten competencies relevant for medical practice in Nigeria [19]: five competencies for the doctor-client relationship (Ethical responsibility, Empathy, Patient-centeredness, Diligence, Good judgment) and five for the doctor-colleague relationship (Respectful, Teamwork, Team leadership/conflict management, Ability to take correction, Tolerance). In total, 35 dilemmas were adapted and 15 new ones were developed based on the competencies required for medical practice within the Nigerian setting.
While our study relied heavily on expert review for content validity, we did not explicitly measure the Content Validity Index (CVI). Content validity was instead established through the rigorous process of scenario development and adaptation by subject matter experts (SMEs). The SMEs, experienced faculty members from various clinical departments, reviewed and agreed upon the scenarios and ranked options before the pilot study. This consensus-based approach helped ensure that the SJT items were relevant to, and representative of, the behavioural competencies being assessed.
Phase II: Validation of clinical scenarios using Kane’s validity argument framework
Following the adaptation and development of the SJT tool, we gathered evidence for validation of the clinical practice scenarios (SJTs) for content and theoretical construct, drawing on Kane’s validity argument framework for the uses and interpretation of test scores. Kane’s framework provides a contemporary approach to validity arguments for assessments in medical education [21]. Unlike the other frameworks, Kane’s captures evidence related to the measured construct as well as evidence related to the prediction of job performance [22]. It also specifies the order for collecting the validity evidence. Kane’s framework consists of three steps, with emphasis on the third (the four key inferences in the validity argument).
The first step is to determine the use and interpretation of scores from an assessment. The second step is to derive assumptions from the use and interpretation; this is analogous to stating hypotheses. The third step is to test the weakest assumptions of the validity argument by collecting evidence in a stepwise fashion for four areas: scoring, generalization, extrapolation, and implication inferences. Kane emphasises these four key inferences in the validity argument: (i) scoring inference involves computing the test result and translating the observed performance into a score to determine the difficulty and discrimination indices; (ii) generalization inference assesses how well the sample of test items represents the domains of interest being measured; (iii) extrapolation inference refers to how well the assessment estimates real-life performance; and (iv) implication inference interprets the scores to inform a decision or action.
In accordance with Kane’s validity argument framework, we first established the use and interpretation of the test scores: the SJTs will be used to assess behavioural competencies for best practices in the medical profession, with high scores reflecting good behavioural competence and low scores reflecting poor behavioural competence. Secondly, we specified the assumptions arising from this use and interpretation: we assumed that trainees with higher SJT scores are more likely to possess the behavioural competencies associated with better-quality healthcare provision and patient satisfaction. Thirdly, we gathered evidence for the scoring and generalization inferences.
Data collection
The SJT tool generated from the expert-panel development and adaptation of clinical scenarios was administered to the study population (undergraduate and postgraduate trainees) to gather evidence for the scoring and generalization inferences. The minimum sample size was determined using the formula for estimating a proportion, adjusted for a population of less than 10,000 using the finite-population correction, with ten percent added to account for attrition. A total of 303 trainees (192 medical students and 111 resident doctors) were randomly sampled. At the undergraduate level, a comprehensive list of the three clinical classes (4th-, 5th- and 6th-year students) served as the sampling frame for simple random selection of 192 students (64 from each class). At the postgraduate level, a comprehensive list of all the resident doctors in the five departments (internal medicine, surgery, paediatrics, community medicine, and obstetrics and gynaecology) served as the sampling frame for proportionate random sampling of 111 residents according to the number of residents in each department.
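The sample-size logic described above can be sketched in Python as follows. This is an illustrative sketch only: the expected proportion `p`, precision `d`, and population size `N` are assumptions, since the exact inputs used are not reported here.

```python
import math

def sjt_sample_size(p=0.5, d=0.05, N=2000, z=1.96, attrition=0.10):
    """Minimum sample size for estimating a proportion, adjusted with the
    finite-population correction (for N < 10,000) and inflated by 10% for
    attrition. All default inputs are illustrative assumptions."""
    n = (z ** 2 * p * (1 - p)) / d ** 2        # formula for estimating a proportion
    n_adj = n / (1 + (n - 1) / N)              # finite-population correction
    return math.ceil(n_adj * (1 + attrition))  # add 10% to account for attrition

print(sjt_sample_size())  # 355 under these assumed inputs
```

With different assumed values of `p`, `d`, or `N`, the same calculation yields the study-specific minimum sample size.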
KoBo Toolbox, an open-source tool for mobile data collection that allows data to be gathered in the field using mobile phones, tablets, or computers, was used to administer the assessment to all participants on electronic tablets. There were fifty SJT scenarios in total, five items for each of the ten prioritized contextual competencies. For each dilemma, respondents were asked to rank the five given options from most appropriate to least appropriate action for the scenario presented (with 1 being the most appropriate and 5 the least appropriate). The evidence gathered from the pool of SJT questions was subjected to scoring and generalization inferences.
Data analysis
Data were analyzed using STATA version 16.1. For the scoring inference, we estimated item difficulty and discrimination. We developed a scoring rubric that awards 25 points if all options are in the correct order, 20 points if the order is off by one item, 15 points if off by two items, 10 points if off by three items, 5 points if off by four items, and zero points if all five items are incorrect. Based on these scores, we estimated item difficulty and item discrimination. For item difficulty, we determined the proportion of respondents who answered each item correctly. A question stem with a difficulty index between 0.3 and 0.8 is considered good, meaning that 30%–80% of respondents answered the item correctly; values below 0.3 suggest that the item is too difficult, and values above 0.8 suggest that it is too easy [20, 23].
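Although the analysis was run in STATA, the scoring rubric and difficulty index can be sketched in Python. Two interpretations in this sketch are our assumptions: "off by k items" is read as k ranking positions out of place, and "answered correctly" as scoring full marks on the item.

```python
def score_item(response, key):
    """Apply the rubric: 25 points for a fully correct ranking, dropping by
    5 points for each option out of place relative to the expert key
    (interpreting 'off by k items' as k misplaced positions), down to 0."""
    misplaced = sum(r != k for r, k in zip(response, key))
    return 25 - 5 * misplaced

def difficulty_index(item_scores, full_marks=25):
    """Proportion of respondents 'answering correctly', here taken to mean
    scoring full marks on the item (an assumed operationalisation)."""
    return sum(s == full_marks for s in item_scores) / len(item_scores)

# Hypothetical expert key and three respondents' rankings for one scenario
key = [1, 2, 3, 4, 5]
responses = [[1, 2, 3, 4, 5], [2, 1, 3, 4, 5], [5, 4, 3, 2, 1]]
scores = [score_item(r, key) for r in responses]  # [25, 15, 5]
print(difficulty_index(scores))                   # 1/3 of respondents fully correct
```

Under this reading, an item answered fully correctly by fewer than 30% of respondents would be flagged as too difficult, and by more than 80% as too easy.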
For item discrimination, we determined the degree to which each item differentiates correctly among respondents on the behaviour it is designed to measure. The correlation between performance on each question and overall performance on the test was examined to identify items that are good at differentiating high and low abilities, using the corrected point-biserial correlation. This measure calculates the relationship between the score for each item and the overall test score after removing the item’s score from the total test score. A minimal-quality item might have a point-biserial of 0.10, a good item about 0.20, and strong items 0.30 or higher [24]. For the generalization inference, we examined how well the assessment items represent the domains or behavioural competencies of interest: items mapped to the same behavioural competency or domain were assessed for internal consistency using Cronbach’s alpha.
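As a rough illustration of the two statistics described in this section, the following sketch computes a corrected item-total correlation (the polytomous analogue of the corrected point-biserial correlation, since SJT item scores here are not dichotomous) and Cronbach's alpha. The data shapes (one score list per item, same respondents throughout) are assumptions for illustration.

```python
import math
import statistics

def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corrected_item_total(item_scores, total_scores):
    """Correlate an item with the total score AFTER removing that item's
    contribution, avoiding the spurious inflation of the uncorrected form."""
    rest = [t - i for i, t in zip(item_scores, total_scores)]
    return pearson(item_scores, rest)

def cronbach_alpha(items):
    """items: one list of scores per item, same respondents in each.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]
    item_var = sum(statistics.pvariance(v) for v in items)
    return k / (k - 1) * (1 - item_var / statistics.pvariance(totals))
```

For example, two perfectly parallel items give `cronbach_alpha([[1, 2, 3], [1, 2, 3]]) == 1.0`, while heterogeneous items pull alpha down, which is the pattern reported for SJT domains later in the paper.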
Results
This version of the SJT was completed by 303 respondents (medical students and medical doctors).
Summary statistics for the SJT scores are presented in Tables 1 and 2. The aggregate statistics in Table 1 show an average total score of 268 with a minimum score of 45 and a maximum score of 545 out of a total possible score of 1,250. Performance is highest in the “diligence” category, while the “good judgement” category reports the lowest performance. When we restrict the sample to only medical students, we get an average total SJT score of 246, while the average total SJT score is 307 when the sample is restricted to medical doctors.
Table 1.
Descriptive Statistics of the SJT (Aggregate sample)
Competency | Mean | Std. Dev | Min | Max |
---|---|---|---|---|
Empathy | 27.09 | 18.59 | 0 | 90 |
Ethical responsibility | 24.15 | 16.76 | 0 | 70 |
Patient’s centeredness | 21.63 | 13.75 | 0 | 60 |
Diligence | 30.72 | 19.28 | 0 | 95 |
Good judgment | 18.54 | 14.73 | 0 | 75 |
Tolerance | 27.87 | 13.87 | 0 | 70 |
Teamwork | 43.81 | 23.91 | 0 | 100 |
Respectful | 26.76 | 15.23 | 0 | 75 |
Team leadership and conflict mgt | 22.85 | 13.25 | 0 | 75 |
Ability to take correction | 24.88 | 15.36 | 0 | 70 |
Total score | 268.34 | 88.68 | 45 | 545 |
N | 303 |
Table 2.
Descriptive Statistics of the SJT (Sub-samples)
Medical students | Mean | Std. Dev | Min | Max |
---|---|---|---|---|
Empathy | 25.70 | 17.91 | 0 | 70 |
Ethical responsibility | 20.57 | 15.00 | 0 | 65 |
Patient’s centeredness | 19.16 | 12.79 | 0 | 55 |
Diligence | 30.36 | 20.56 | 0 | 95 |
Good judgment | 15.75 | 13.01 | 0 | 50 |
Tolerance | 25.18 | 13.23 | 0 | 70 |
Teamwork | 40.02 | 24.61 | 0 | 100 |
Respectful | 24.06 | 14.61 | 0 | 75 |
Team leadership and conflict mgt | 21.11 | 12.16 | 0 | 65 |
Ability to take correction | 23.93 | 15.71 | 0 | 65 |
Total score | 245.88 | 88.06 | 45 | 485 |
Observation | 192 | |||
Resident doctors | ||||
Empathy | 25.50 | 19.56 | 0 | 90 |
Ethical responsibility | 30.36 | 17.88 | 0 | 70 |
Patient’s centeredness | 25.90 | 14.35 | 0 | 60 |
Diligence | 31.35 | 16.92 | 0 | 75 |
Good judgment | 23.37 | 16.28 | 0 | 75 |
Tolerance | 32.52 | 13.76 | 0 | 70 |
Teamwork | 50.36 | 21.19 | 0 | 95 |
Respectful | 31.44 | 15.21 | 0 | 75 |
Team leadership and conflict mgt | 25.85 | 14.52 | 0 | 75 |
Ability to take correction | 26.53 | 14.67 | 0 | 70 |
Total score | 307.20 | 75.65 | 45 | 545 |
N | 111 |
Table 3 shows that the majority of the items (72%) have a difficulty index within the range deemed appropriate for a good question (0.30–0.80). Ten items (20%) have a difficulty index below 0.30 (too difficult) and four items (8%) have a difficulty index above 0.80 (too easy).
Table 3.
Item difficulty
S/n | Competency | Difficulty index | S/n | Competency | Difficulty Index |
---|---|---|---|---|---|
1 | Empathy 1 | 0.56 | 26 | Tolerance 1 | 0.81 |
2 | Empathy 2 | 0.21 | 27 | Tolerance 2 | 0.72 |
3 | Empathy 3 | 0.37 | 28 | Tolerance 3 | 0.84 |
4 | Empathy 4 | 0.53 | 29 | Tolerance 4 | 0.16 |
5 | Empathy 5 | 0.73 | 30 | Tolerance 5 | 0.53 |
6 | Eth Responsibility 1 | 0.32 | 31 | Teamwork 1 | 0.49 |
7 | Eth Responsibility 2 | 0.62 | 32 | Teamwork 2 | 0.41 |
8 | Eth Responsibility 3 | 0.45 | 33 | Teamwork 3 | 0.67 |
9 | Eth Responsibility 4 | 0.43 | 34 | Teamwork 4 | 0.72 |
10 | Eth Responsibility 5 | 0.67 | 35 | Teamwork 5 | 0.11 |
11 | Px Centeredness 1 | 0.07 | 36 | Respectful 1 | 0.85 |
12 | Px Centeredness 2 | 0.67 | 37 | Respectful 2 | 0.66 |
13 | Px Centeredness 3 | 0.25 | 38 | Respectful 3 | 0.48 |
14 | Px Centeredness 4 | 0.55 | 39 | Respectful 4 | 0.61 |
15 | Px Centeredness 5 | 0.54 | 40 | Respectful 5 | 0.27 |
16 | Diligence 1 | 0.43 | 41 | Team leadership 1 | 0.88 |
17 | Diligence 2 | 0.25 | 42 | Team leadership 2 | 0.12 |
18 | Diligence 3 | 0.71 | 43 | Team leadership 3 | 0.45 |
19 | Diligence 4 | 0.38 | 44 | Team leadership 4 | 0.79 |
20 | Diligence 5 | 0.65 | 45 | Team leadership 5 | 0.33 |
21 | Gd judgement 1 | 0.39 | 46 | Take correction 1 | 0.42 |
22 | Gd judgement 2 | 0.73 | 47 | Take correction 2 | 0.56 |
23 | Gd judgement 3 | 0.06 | 48 | Take correction 3 | 0.17 |
24 | Gd judgement 4 | 0.44 | 49 | Take correction 4 | 0.56 |
25 | Gd judgement 5 | 0.47 | 50 | Take correction 5 | 0.64 |
In Table 4, the corrected correlation between performance on each specific item and overall test performance shows that most of the items (90%) have a non-negative correlation, while 72% of the items have a point-biserial correlation of at least 0.10.
Table 4.
Item discrimination- Corrected point-biserial correlation
Items | Point-biserial correlation | Items | Point-biserial correlation |
---|---|---|---|
Item 1 | 0.04 | Item 26 | 0.33 |
Item 2 | 0.02 | Item 27 | 0.30 |
Item 3 | 0.27 | Item 28 | 0.33 |
Item 4 | 0.23 | Item 29 | -0.02 |
Item 5 | 0.41 | Item 30 | 0.18 |
Item 6 | 0.01 | Item 31 | 0.23 |
Item 7 | 0.34 | Item 32 | 0.25 |
Item 8 | 0.24 | Item 33 | 0.36 |
Item 9 | 0.23 | Item 34 | 0.35 |
Item 10 | 0.45 | Item 35 | -0.07 |
Item 11 | -0.11 | Item 36 | 0.19 |
Item 12 | 0.30 | Item 37 | 0.18 |
Item 13 | 0.02 | Item 38 | 0.15 |
Item 14 | 0.22 | Item 39 | 0.24 |
Item 15 | 0.13 | Item 40 | -0.02 |
Item 16 | 0.21 | Item 41 | 0.04 |
Item 17 | 0.08 | Item 42 | 0.00 |
Item 18 | 0.16 | Item 43 | 0.16 |
Item 19 | 0.04 | Item 44 | 0.22 |
Item 20 | 0.25 | Item 45 | 0.24 |
Item 21 | 0.13 | Item 46 | 0.16 |
Item 22 | 0.20 | Item 47 | 0.18 |
Item 23 | -0.24 | Item 48 | 0.00 |
Item 24 | 0.27 | Item 49 | 0.20 |
Item 25 | 0.31 | Item 50 | 0.24 |
Cronbach’s alpha estimation across competencies returns alpha estimates below 0.7, which suggests low internal consistency among the test items within each category or competency. However, low Cronbach’s alpha values for SJTs do not necessarily indicate poor precision of measurement; rather, they reflect the heterogeneity of the domain being measured [25]. In addition, the alpha values reported in Table 5 are anticipated, as the literature shows typically low internal consistency indices in SJT studies [25]. We also estimated Cronbach’s alpha using only the 36 questions identified by the corrected point-biserial correlation as having good discriminating ability; the results are largely similar to those presented in Table 5.
Table 5.
Cronbach’s alpha
Competencies | Mean | Std. Dev | Cronbach’s alpha |
---|---|---|---|
Empathy | 27.09 | 18.59 | 0.23 |
Ethical responsibility | 24.15 | 16.76 | 0.30 |
Patient’s centeredness | 21.63 | 13.75 | 0.20 |
Diligence | 30.72 | 19.28 | 0.30 |
Good judgment | 18.54 | 14.73 | 0.33 |
Tolerance | 27.87 | 13.87 | 0.23 |
Teamwork | 43.81 | 23.91 | 0.30 |
Respectful | 26.76 | 15.23 | 0.21 |
Team leadership and conflict mgt | 22.85 | 13.25 | 0.30 |
Ability to take correction | 24.88 | 15.36 | 0.20 |
N | 303 |
Discussion
Our study is the second in a series engaged in the development and validation of Situational Judgement Tests for assessing the behavioural competencies of physicians in Nigeria. In the first study, we explored contextual behavioural competencies for effective medical practice in Nigeria. In this study, we adapted and developed SJT tools that capture those competencies and, drawing on Kane’s validity argument framework for the uses and interpretation of test scores, validated the instrument. Scoring and generalization inferences drawn from the test scores show that most of the test items have difficulty and discrimination indices appropriate for a meaningful assessment and will be useful for assessing the behavioural competencies required for medical practice in Nigeria. Below we discuss the soundness, strengths, and limitations of the proposed use and interpretation of the SJT scores.
Drawing on the consensus-grounded development of contextual competencies in the first study [19], we began with expert agreement on the scenarios that capture the prioritized behavioural competencies. The final pilot SJT tool captured ten behavioural domains: five focused on the doctor-client relationship (Ethical responsibility, Empathy, Patient-centeredness, Diligence, Good judgment) and five on the doctor-colleague relationship (Respectful, Teamwork, Team leadership/conflict management, Ability to take correction, Tolerance). The final tool comprised fifty questions, five for each of the ten domains, presenting a variety of scenarios that capture the prioritized competencies relevant within the Nigerian setting. Kane’s framework enabled us to follow the routine research practice of stating and testing hypotheses: we stated the use and interpretation argument for the proposed test scores, gathered evidence, and evaluated the plausibility of our claims in a stepwise fashion.
The difficulty index demonstrates that a good number of the items have difficulty values that fall within the range specified as appropriate for a good question [20, 23, 26]. The few questions that were suggested as too difficult or too easy will be reviewed for possible confusing language or removed entirely in subsequent evaluations. Amongst the ten competencies assessed, ethical responsibility stood out with all its items within the range specified as good difficulty index. Respondents performed better in the items under this category (ethical responsibility), and the difficulty index shows that the items were neither too difficult nor too easy. This is not surprising as this is one competency that is formally taught as part of the medical curriculum, both at the undergraduate and postgraduate levels. Incorporating these prioritized competencies into the curriculum is likely to improve behavioural competencies in upcoming physicians [27].
Most of the test items have good discriminating power and should be able to differentiate between high and low abilities. Unlike several validation studies that focused on measuring a single construct [28–30], ours captured ten competencies, so caution is needed when interpreting the results: discrimination coefficients tend to be lower when measuring a wide range of content areas than when measuring a more homogeneous construct [23], which may have depressed our estimates given the range of competencies measured. Furthermore, sampling both residents and students may have increased variability, affecting the difficulty as well as the discrimination indices.
The corrected point-biserial correlation provided a more robust and accurate assessment of the discriminating power of the items. This measure calculates the relationship between the score for each item and the overall test score after removing the index item’s score from the total. It shows that most of the items performed well, discriminating between candidates who possess the desired competence and those who do not. The correction is appropriate because embedding the index item in the total score usually yields a spuriously high relationship [24]. Moreover, it is particularly suitable for a pilot such as this with a small number of items, where removal of one item has a significant effect on the total test score.
Therefore, we can make more certain conclusions with corrected point-biserial correlation. According to Varma, good items typically have a correlation exceeding 0.25; values of 0.15 or higher mean that the item is performing well [31]. Items with a correlation below 0.10 will be re-examined for a possible ambiguity or incorrect key. A few of the items with negative correlation demonstrate that the high-performing respondents are getting the answer wrong, and the low to mid-performing respondents are getting it right. These items will be subject to deletion as recommended by Kaplan & Saccuzzo [32]. In the early stages of developing an instrument, several iterative processes are necessary for revision and correction based on the results of the validation process [32].
Balancing the difficulty index against the corrected point-biserial correlation, as proposed by Hopkins and Antes [20], we identified 33 items as valid. Cronbach's alpha for each category of items was below 0.7. This is expected, as the literature typically reports low internal consistency indices for SJTs. Catano et al. reported a mean of 0.46 in a meta-analysis, noting that low alpha values for SJTs do not necessarily indicate poor precision of measurement but rather reflect the heterogeneity of the domain being measured [25]. Situational judgement tests build on the principles of behavioural consistency and psychological fidelity, which capture a multitude of constructs [33]. The competencies captured in our SJT tool measure a composite skill set that may include knowledge and skills, applied social skills, and basic personality tendencies, as classified by Christian et al. [34]. Cronbach's alpha, however, assumes a homogeneous construct domain [35]; given the heterogeneity of the constructs captured in our tool, it may not be an appropriate measure.
Furthermore, there were only five items per competency category; with fewer than ten items on a scale, it is difficult to obtain a good alpha value. Even though the fifty items together return an alpha estimate of 0.75, suggesting good internal consistency across all test items, we do not consider the items truly consistent on the basis of this estimate alone. An important supplement to internal consistency evidence is defining the level of evidence for each item from the literature, combined with expert agreement [36, 37]. Generation of the competencies began with a literature search, and the opinions of trainees at all clinical levels (undergraduate and postgraduate) across all clinical departments of the college were represented in the process. Development and adaptation of the scenarios capturing these competencies were carried out by subject-matter experts (faculty members), and the scenarios were agreed upon by the faculty members who teach, supervise, observe, and examine undergraduate and postgraduate medical trainees in the Faculty of Medicine.
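The alpha statistic discussed above follows directly from the ratio of summed item variances to the variance of the total score. The sketch below shows the standard formula; it is illustrative only, and the perfectly parallel two-item matrix in the usage example (which yields alpha = 1) is hypothetical.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / var(total))."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the total score
    return k / (k - 1) * (1 - item_var / total_var)

# Two identical (perfectly consistent) items -> alpha of exactly 1.0.
alpha = cronbach_alpha([[1, 1], [0, 0], [1, 1], [0, 0]])
```

Because alpha rises with the number of items and with inter-item homogeneity, a five-item scale measuring a deliberately heterogeneous competency will sit well below the conventional 0.7 threshold even when each item is behaving well.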
The consensual agreement of the team of faculty members on the scenarios and the rank order of the options supports both the scoring and generalization inferences. We assume that a sufficient and diverse sample of faculty members with up to 5 years of experience in teaching and supervision is enough to establish consensus on the scenarios and the correct ranking of options. Schubert et al. agree that experts are very useful in deliberating on and consensually modifying ambiguous scenarios during the development of test tools [38]. The faculty members agreed on the scenarios and the ranked options before the pilot. However, we cannot rule out response bias, as a different set of faculty members might have developed and adapted a different set of scenarios.
The assumptions underlying Cronbach’s alpha are violated by the multidimensional nature of SJTs [39], and several authors have proposed alternative approaches to examining SJT constructs [39, 40]. Kane’s framework has its limitations, but it captures evidence related both to the constructs measured by SJTs and to the prediction of job performance [17]. It enabled us to begin with a clear statement of the proposed use and interpretation of the assessment scores and to progress sequentially from evaluating the scoring evidence to the generalization inference. According to Kane, validation is a cycle of continuous evidence gathering, with the logical soundness of the tool re-examined at each phase.
Therefore, at this point in the evidence-gathering process, it would be premature to conclude that the supporting findings justify the proposed use and interpretation of the scores. The final judgment depends on the extrapolation and implication inferences. Extrapolating the gathered evidence to real-life assessment and corroborating the impact of the assessment on meaningful outcomes are necessary steps that are often neglected; implication evidence of any sort is rarely published [17]. The absence of extrapolation and implication evidence is an important gap in the literature. Our study is further limited in that we did not explicitly measure the Content Validity Index (CVI); while we relied on a rigorous expert review process to establish content validity, future research would benefit from a more formal quantitative assessment of content validity.
Drawing on Kane’s framework, our assumptions about the uses and interpretation of the test scores hold for the majority of the SJT items. The valid items can be employed to assess behavioural competencies for best practice in the medical profession: higher SJT scores indicate better behavioural competency, so respondents with higher scores should be more likely to display the behavioural competencies associated with better-quality healthcare provision and patient satisfaction. The valid items will be used in the next phase of our study series; the remaining items will be re-evaluated to identify ambiguities for restructuring or possible deletion.
Authors’ contributions
AC and CM conceptualized the study; AC, CM, UO, and IA designed the study; CM, UO, and IA collected the data; GE analyzed the data; GE, AC, CM, UO, and IA interpreted the data, drafted the manuscript, and were involved in revising the intellectual content of the manuscript. All authors read and approved the final manuscript.
Funding
Not applicable.
Data availability
The datasets generated and analysed during this study are available in the Kobo Toolbox, https://kf.kobotoolbox.org/#/forms/awveowcLSWxurpgmtVXLEz/data/table with reasonable request for access from the corresponding author.
Declarations
Ethics approval and consent to participate
Ethical clearance was sought and obtained from the Health Research Ethics Committee of University of Nigeria Teaching Hospital (UNTH), Ituku-Ozalla. Written informed consent was obtained from all eligible participants having presented them with the purpose of research, their rights and measures to protect them and their data.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Cullen MJ, Zhang C, Marcus-Blank B, Braman JP, Tiryaki E, Konia M, Hunt MA, Lee MS, Van Heest A, Englander R, Sackett PR, Andrews JS. Improving Our Ability to Predict Resident Applicant Performance: Validity Evidence for a Situational Judgment Test. Teach Learn Med. 2020. 10.1080/10401334.2020.1760104. Epub 2020 May 19. PMID: 32427496. [DOI] [PubMed]
- 2.Petty-Saphon K, Walker KA, Patterson F, Ashworth V, Edwards H. Situational judgment tests reliably measure professional attributes important for clinical practice. Adv Med Educ Pract. 2017. http://www.ncbi.nlm.nih.gov/pubmed/28096705. [DOI] [PMC free article] [PubMed]
- 3.García E. The Need to Address Non-Cognitive Skills in the Education Policy Agenda. 2016: https://www.epi.org/publication/the-need-to-address-noncognitive-skills-in-the-education-policy-agenda/.
- 4.Koczwara A, Patterson F, Zibarras L, Kerrin M, Irish B, Wilkinson M. Evaluating cognitive ability, knowledge tests and situational judgement tests for postgraduate selection. 2012. http://www.ncbi.nlm.nih.gov/pubmed/22429176. [DOI] [PubMed]
- 5.Fischer MR, Bauer D, Mohn K, NKLM-Projektgruppe. Finally finished! National Competence Based Catalogues of Learning Objectives for Undergraduate Medical Education (NKLM) and Dental Education (NKLZ) ready for trial. GMS Z Med Ausbild [Internet]. 2015 [cited 2019 Feb 19];32(3):Doc35. Available from: http://www.ncbi.nlm.nih.gov/pubmed/26677513. [DOI] [PMC free article] [PubMed]
- 6.Harris P, Snell L, Talbot M, Harden RM. Competency-based medical education: implications for undergraduate programs. Med Teach [Internet]. 2010 Aug 27 [cited 2018 Oct 27];32(8):646–50. Available from: http://www.tandfonline.com/doi/full/10.3109/0142159X.2010.500703. [DOI] [PubMed]
- 7.Kiguli-Malwadde E, Omaswa F, Olapade-Olaopa oluwabunmi, Kiguli S, Chen C, Sewankambo N, et al. Competency-based medical education in two Sub-Saharan African medical schools. Adv Med Educ Pract. 2014 Dec 9 [cited 2019 Jul 23];5:483. Available from: http://www.dovepress.com/competency-based-medical-education-in-two-sub-saharan-african-medical--peer-reviewed-article-AMEP. [DOI] [PMC free article] [PubMed]
- 8.Olopade FE, Adaramoye OA, Raji Y, Fasola AO, Olapade-Olaopa EO. Developing a competency-based medical education curriculum for the core basic medical sciences in an African Medical School. Adv Med Educ Pract. 2016 [cited 2019 Jul 23];7:389–98. Available from: http://www.ncbi.nlm.nih.gov/pubmed/27486351. [DOI] [PMC free article] [PubMed]
- 9.Lievens F. Adjusting medical school admission: assessing interpersonal skills using situational judgement tests. 2013. http://www.ncbi.nlm.nih.gov/pubmed/23323657. [DOI] [PubMed]
- 10.Patterson F, Driver R. Situational Judgement Tests (SJTs). Selection and Recruitment in the Healthcare Professions. Cham: Springer International Publishing; 2018. http://link.springer.com/10.1007/978-3-319-94971-0_4.
- 11.Patterson F, Rowett E, Hale R, Grant M, Roberts C, Cousans F, et al. The predictive validity of a situational judgement test and multiple-mini interview for entry into postgraduate training in Australia. 2016. http://www.ncbi.nlm.nih.gov/pubmed/26957002. [DOI] [PMC free article] [PubMed]
- 12.Luschin-Ebengreuth M, Dimai HP, Ithaler D, Neges HM, Reibnegger G. Situational judgment test as an additional tool in a medical admission test: an observational investigation. 2015. http://www.ncbi.nlm.nih.gov/pubmed/25889941. [DOI] [PMC free article] [PubMed]
- 13.Cizek GJ. Validity: an integrated approach to test score meaning and use. Educ Psychol Pract. 2020;37(1):114–114. [Google Scholar]
- 14.Sireci SG. On Validity Theory and Test Validation. Educ Res 2019. http://journals.sagepub.com/doi/10.3102/0013189X07311609.
- 15.Carney PA, Palmer RT, Fuqua Miller M, Thayer EK, Estroff SE, Litzelman DK, et al. Tools to Assess Behavioral and Social Science Competencies in Medical Education: A Systematic Review. 2016. http://www.ncbi.nlm.nih.gov/pubmed/26796091. [DOI] [PMC free article] [PubMed]
- 16.Webster ES, Paton LW, Crampton PE, Tiffin PA. Situational judgement test validity for selection: A systematic review and meta-analysis. https://asmepublications.onlinelibrary.wiley.com/doi/pdf/10.1111/medu.14201. [DOI] [PubMed]
- 17.Sahota G, Fisher V, Patel B, JuJ K, Taggar J. The educational value of situational judgement tests (SJTs) when used during undergraduate medical training: A systematic review and narrative synthesis. Med Teach. 2023. https://www.tandfonline.com/doi/abs/10.1080/0142159X.2023.2168183. [DOI] [PubMed]
- 18.Aylott L.M. E., Finn G. M., Tiffin P. A. Assessing professionalism in mental health clinicians: development and validation of a situational judgement test. Cambridge University Press, 2023. https://www.cambridge.org/core/journals/bjpsych-open/article/assessing-professionalism-in-mental-health-clinicians-development-and-validation-of-a-situational-judgement-test/CE475CAD66630C3697768B7D557A4A5D. [DOI] [PMC free article] [PubMed]
- 19.Chukwuma A, Obi U, Agu I, Mbachu C. Exploring Behavioral Competencies for Effective Medical Practice in Nigeria. J Med Educ Curric Dev. 2020Jan;7:238212052097823. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Hopkins CD, Antes RL. Classroom Measurement and Evaluation. 3rd ed. F.E. Peacock Publishers; 1990.
- 21.Kane MT. Validating the Interpretations and Uses of Test Scores. J Educ Meas. 2013;50(1):1–73.
- 22.Cook DA, Brydges R, Ginsburg S, Hatala R. A contemporary approach to validity arguments: A practical guide to Kane’s framework. Med Educ. 2015 Jun [cited 2018 Dec 25];49(6):560–75. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25989405. [DOI] [PubMed]
- 23.University of Washington. Understanding Item Analyses. University of Washington. 2022. https://www.washington.edu/assessment/scanning-scoring/scoring/reports/item-analysis/.
- 24.Wright BD. Point-biserial correlations and item fits. Rasch Measurement Transactions,. 1992.
- 25.Catano VM, Brochu A, Lamerson CD. Assessing the Reliability of Situational Judgment Tests Used in High-Stakes Situations. Int J Sel Assess. 2012: https://onlinelibrary.wiley.com/doi/10.1111/j.1468-2389.2012.00604.x.
- 26.Shrestha A, Bhandary S, Shrestha S. Situational judgement test: Psychometric analysis of a pilot study for selecting postgraduate medical students into a residency program. 2019. https://www.nepjol.info/index.php/JPAHS/article/view/27237.
- 27.Gaspar F.D.R, Abbad G,D. Development of Situational Judgment Tests in Interprofessional Health Education. Creative education, 2024. https://www.scirp.org/journal/paperinformation?paperid=130729.
- 28.Smith KJ, Flaxman C, Farland MZ, Thomas A, Buring SM, Whalen K, et al. Development and Validation of a Situational Judgement Test to Assess Professionalism. Am J Pharm Educ 2020. http://www.ncbi.nlm.nih.gov/pubmed/32773831. [DOI] [PMC free article] [PubMed]
- 29.Becker TE. Development and Validation of a Situational Judgment Test of Employee Integrity. Int J Sel Assess. 2005. 10.1111/j.1468-2389.2005.00319.x.
- 30.Mumford T V., Van Iddekinge CH, Morgeson FP, Campion MA. The Team Role Test: Development and validation of a team role knowledge situational judgment test. J Appl Psychol. 2008 http://www.ncbi.nlm.nih.gov/pubmed/18361630. [DOI] [PubMed]
- 31.Varma S. Preliminary item statistics using point-biserial correlation and p-values. 2006 Available from: https://jcesom.marshall.edu/media/24104/Item-Stats-Point-Biserial.pdf.
- 32.Kaplan RM, Saccuzzo DP. Psychological Testing: Principles, Applications, and Issues. 2013.
- 33.Lievens F, Peeters H, Schollaert E. Situational judgment tests: A review of recent research. http://ink.library.smu.edu.sg/lkcsb_research/5678.
- 34.Christian MS, Edwards BD, Bradley JC. Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validities. Pers Psychol. 2010. 10.1111/j.1744-6570.2009.01163.x. [DOI]
- 35.Schmidt FL, Hunter JE. Measurement error in psychological research: Lessons from 26 research scenarios. Psychol Methods. 1996.
- 36.Boateng GO, Neilands TB, Frongillo EA, Melgar-Quiñonez HR, Young SL. Best Practices for Developing and Validating Scales for Health, Social, and Behavioral Research: A Primer Vol. 6, Frontiers in Public Health. Frontiers Media SA; 2018. http://www.ncbi.nlm.nih.gov/pubmed/29942800. [DOI] [PMC free article] [PubMed]
- 37.Eden J, Levit L, Berg A, Morton S. Standards for synthesising the body of evidence. In: Find what works in healthcase: standards for systematic reviews. National Academies Press (US); 2011 [cited 2022 Jun 14]. p. 155–95. https://www.ncbi.nlm.nih.gov/books/NBK209522/. [PubMed]
- 38.Schubert S, Ortwein H, Dumitsch A, Schwantes U, Wilhelm O, Kiessling C. A situational judgement test of professional behaviour: development and validation. 2009. 10.1080/01421590801952994. [DOI] [PubMed] [Google Scholar]
- 39.Sorrel MA, Olea J, Abad FJ, de la Torre J, Aguado D, Lievens F. Validity and Reliability of Situational Judgement Test Scores: A New Approach Based on Cognitive Diagnosis Models. Organ Res Methods. 2016. 10.1177/1094428116630065.
- 40.Weekley JA, Hawkes B, Guenole N, Ployhart RE. Low-Fidelity Simulations. Vol. 2, Annual Review of Organizational Psychology and Organizational Behavior. 2015. 10.1146/annurev-orgpsych-032414-11130.