Abstract
Background
Teamwork training has been included in several emergency medicine (EM) curricula; the aim of this study was to compare different scales’ performance in teamwork evaluation during simulation for EM residents.
Methods
In the period October 2013–June 2014, we performed bimonthly high-fidelity simulation sessions, with novice (I–III year, group 1 (G1)) and senior (IV–V year, group 2 (G2)) EM residents; scenarios were designed to simulate management of critical patients. Videos were assessed by three independent raters with the following scales: Emergency Team Dynamics (ETD), Clinical Teamwork Scale (CTS) and Team Emergency Assessment Measure (TEAM). In the period March–June, after each scenario, participants completed the CTS and ETD.
Results
The analysis based on 18 sessions showed good internal consistency and good to fair inter-rater reliability for the three scales (TEAM, CTS, ETD: Cronbach's α 0.954, 0.954, 0.921; Intraclass Correlation Coefficients (ICC), 0.921, 0.917, 0.608). Single CTS items achieved highly significant ICC results, with 12 of the total 13 comparisons achieving ICC results ≥0.70; a similar result was confirmed for 4 of the total 11 TEAM items and 1 of the 8 total ETD items. Spearman's r was 0.585 between ETD and CTS, 0.694 between ETD and TEAM, and 0.634 between TEAM and CTS (scales converted to percentages, all p<0.0001). Participants gave themselves a better evaluation compared with external raters (CTS: 101±9 vs 90±9; ETD: 25±3 vs 20±5, all p<0.0001).
Conclusions
All examined scales demonstrated good internal consistency, with a slightly better inter-rater reliability for CTS compared with the other tools.
Keywords: High-Fidelity Simulation, Teamwork performance, Self-assessment
Introduction
Emergency departments represent a high-risk clinical setting: patient management decisions must be made under conditions of uncertainty, with incomplete information about patients’ medical history and under time pressure. It is well known that, in this environment, errors in patient management are frequent.1 2 To improve patient safety, medical knowledge and skills are essential but not sufficient: up to 70% of all errors can be attributed to human factors, and non-technical skills (NTS) play a key role in preventing them.3 High-fidelity simulation is increasingly used to allow teams to practise and improve their teamwork skills in emergency situations;4 several programmes have been developed to implement simulation in the emergency medicine (EM) residency curriculum.5 6
For simulation to be worthwhile in improving team behaviour, an essential prerequisite is a valid and reliable measure to quantify performance. We identified the following five scales: Emergency Team Dynamics (ETD),7 Clinical Teamwork Scale (CTS),8 Team Emergency Assessment Measure (TEAM),9 Ottawa Crisis Resource Management Global Rating Scale (OTT)10 and Mayo High Performance Teamwork Scale (MAYO)11 (figure 1). ETD was developed in an observational study of the teamwork performance of an Emergency Response Team: the purpose of the study was to determine the relationship between leadership behaviour, team dynamics and task performance. The CTS was developed in an obstetric setting as a brief tool that could be used to objectively evaluate teamwork during short clinical team simulations and in everyday clinical care. The TEAM scale was primarily developed to evaluate teamwork in a simulated cardiac resuscitation setting.
Figure 1.
Flow-chart describing the study design. EM, emergency medicine; CTS, Clinical Teamwork Scale; ETD, Emergency Team Dynamics; MAYO, Mayo High Performance Teamwork Scale; OTT, Ottawa Crisis Resource Management Global Rating Scale; TEAM, Team Emergency Assessment Measure.
The MAYO was developed from 107 participants’ ratings of key crisis resource management (CRM) skills during CRM training in a simulation centre.11 The OTT was used to evaluate simulation-based performance of medical teams managing the critically ill. These scales apparently explore similar teamwork dimensions and would be expected to give similar evaluations. Further studies specifically investigated the psychometric properties of the TEAM scale and confirmed the tool as a valid, reliable and feasible instrument for measuring the NTS of medical emergency teams.12 13 Some limitations persisted: in particular, the scale does not enable evaluation of single team members but only gives an overall assessment.14 Moreover, further testing is required in real clinical settings. Similar re-evaluations were not conducted for the other scales.
The aims of this paper were: (1) to compare the performance of different scales in teamwork evaluation during simulation sessions for EM residents, with a special focus on their inter-rater reliability; (2) to compare the discriminative power of the tools, specifically the ability to discriminate between junior and senior residents in teamwork skills; (3) to compare self-assessment and evaluation by external observers.
Methods
Participants
This was conducted as a prospective study, and participants were members of the University of Florence EM residency, a 5-year training programme; all our residents began their residency shortly after obtaining their academic degree and were in their 20s. They were divided into two main subgroups: group 1 (G1), comprising first-, second- and third-year EM residents, and group 2 (G2), comprising fourth- and fifth-year EM residents. Each group was further divided into teams of 4–5 residents. The groups were split in this manner because fourth- and fifth-year EM residents can work more independently than junior residents; however, formal teamwork training is not a routine part of the residency programme curriculum.
Settings
This project was conducted in the emergency department high-dependency unit (ED-HDU) of the University-Hospital Careggi, in Florence, Italy. In the period between October 2013 and June 2014, we performed bimonthly simulation sessions. Simulation sessions for G1 residents occurred in the simulation centre and only involved doctors. For G2, a simulation environment was created in the ED-HDU; during their simulation, the patient mannequin was positioned on a vacant patient bed, with curtains used to shield real patients. Participants were asked to act and to use equipment and medications as they would in daily clinical practice; a nurse was always present. Scenarios were run in real time and all participant interventions had to be conducted as if in an actual clinical setting; each scenario lasted about 10 min. In G1 scenarios, participants were expected to apply Advanced Life Support (ALS) algorithms to manage cardiac arrest and impending cardiac arrest patients; the G2 scenarios included cardiac arrest and impending cardiac arrest patients, and management of the critically ill, where a broad differential diagnosis was necessary for correct management (coma of unknown origin, acute dyspnoea). G1 teams took part in two simulation sessions and G2 teams engaged in four simulation sessions, each session including three scenarios, within 1 month of each other. Each session involved only one team. The project was designed to expose residents to a series of clinical cases presented in the same order for all the teams in the two groups. Scenario progression was predefined, and no attempt was made to manipulate the degree of difficulty. Before the simulation sessions, there was a 2 h training session on teamwork and Crisis Resource Management (CRM) principles,15 16 which included a lecture and role play.
We used a high-fidelity Laerdal Basic patient simulator (Laerdal, Stavanger, Norway) connected to a SimPad touch screen system. The system consists of a full-body mannequin with a realistic upper airway, chest movement, variable cardiac and breath sounds, and palpable pulses. It can be mask-ventilated, intubated, cannulated, given fluids and medications, and defibrillated. A monitor displays representations of blood pressure, ECG, oxygen saturation and respiratory rate. The console operator can adjust clinical signs and monitor data as the scenario progresses. Full medical documentation, laboratory results and radiology were provided at the beginning of or during the scenario, to increase fidelity.
Debriefing occurred immediately following the simulation. Two of the authors (FI and RP) served as facilitators for each simulation and debriefing. Debriefing included self-assessment and group assessment of performance, regarding both technical skills and NTS. Participants were asked to identify, evaluate and offer solutions to the identified challenges.
Participants gave their informed consent for video recording and analysis. According to the local Ethics Committee rules, an ethics approval was not required.
Measurements
All sessions were digitally recorded and videos were assessed a minimum of 2 weeks after the simulation by three independent raters, who participated in running the simulations but not in the debriefings. All raters were medical doctors who had obtained a 1-year scholarship to implement simulation in the EM residency programme; prior to the beginning of the study, all raters attended a 4 h course, based on three previously recorded videos, to clarify the scale objectives and the teamwork characteristics under evaluation. Each assessor watched every video once and applied all tools to rate each video-taped scenario in a random order. Assessors were kept blinded to each other's ratings throughout video review and rating.
We chose already validated instruments according to the following criteria: (1) scales designed or used in emergency care; (2) scales giving a broad measure of teamwork skills specific to the management of the critically ill; (3) scales not specifically designed for a given disease.
Teamwork performance during scenarios was assessed with the following scales: ETD, CTS, TEAM, OTT and MAYO, (figure 1). From February 2014, after each scenario, and prior to debriefing, participants completed the CTS and ETD scales. All tools were employed in English, as all participants had a B2-level command of English.
Data analysis
Statistical analysis included Cronbach's α coefficient for internal consistency and intraclass correlation coefficients (ICC) for inter-rater reliability, with ICC values of 0.70 or higher indicating adequate agreement in scoring.
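As an illustration of how these two reliability measures can be computed, the following is a minimal sketch in Python with pandas and pingouin, using hypothetical rating data; the actual analysis was performed in SPSS, and the variable names and data below are illustrative only.

```python
# Minimal sketch (hypothetical data): internal consistency and inter-rater reliability.
# The study used SPSS; Python with pandas/pingouin is shown here only to illustrate
# the two measures. Values printed from random data are not meaningful.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)

# Hypothetical wide table: 18 scenarios x 11 items scored 0-4 by one rater.
items = pd.DataFrame(rng.integers(0, 5, size=(18, 11)),
                     columns=[f"item_{i}" for i in range(1, 12)])

# Internal consistency: Cronbach's alpha across the 11 items.
alpha, ci = pg.cronbach_alpha(data=items)
print(f"Cronbach's alpha = {alpha:.3f} (95% CI {ci[0]:.3f} to {ci[1]:.3f})")

# Hypothetical long table: total score per scenario from each of the 3 raters.
ratings = pd.DataFrame({
    "scenario": np.repeat(np.arange(18), 3),
    "rater": np.tile(["op1", "op2", "op3"], 18),
    "score": rng.integers(20, 45, size=54),
})

# Inter-rater reliability: two-way intraclass correlation coefficients;
# ICC values of 0.70 or higher are read as adequate agreement.
icc = pg.intraclass_corr(data=ratings, targets="scenario",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```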
We also carried out non-parametric, repeated-measures Friedman tests to assess whether the average scores allocated by each assessor were significantly different (median and range were reported; non-significant results would indicate the desirable consistency in the scoring between the three assessors). The same test was employed to compare self-assessment and assessment by external raters (mean±SD). To evaluate the concurrent validity of TEAM and CTS scales, we calculated the correlation between scores on global performance and ratings on the individual items by Spearman's r correlation coefficients. The level of agreement between different tools was evaluated by non-parametric Spearman's r correlation coefficients between CTS, ETD and TEAM scores, and Bland-Altman analysis. Given the differences in the structure of the three tools, we computed an average score on each tool and expressed it as a percentage score (%). For the calculation of the TEAM percentage score, we based the calculation on the first 11 questions and excluded the final question that assesses global performance; in the same analysis, for CTS, we excluded the first item, which again represents a global assessment. The Wilcoxon signed rank test was employed to compare evaluation of junior and senior residents by the external observers, in order to evaluate the discrimination ability of the different tools (mean±SD).
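The remaining comparisons rely on standard non-parametric tests; the sketch below shows the same steps with scipy on hypothetical data (again, not the SPSS/GraphPad Prism workflow actually used), including the conversion of raw totals to percentage scores.

```python
# Hypothetical sketch of the non-parametric comparisons described above
# (the study itself used SPSS and GraphPad Prism; data below are simulated).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Per-scenario totals allocated by three assessors (correlated by construction).
op1 = rng.integers(25, 40, size=18)
op2 = op1 + rng.integers(-3, 4, size=18)
op3 = op1 + rng.integers(-3, 4, size=18)

# Friedman test: do the three assessors score systematically differently?
stat, p = stats.friedmanchisquare(op1, op2, op3)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3f}")

# Conversion of raw totals to percentage scores so that tools with different
# structures can be compared (e.g. TEAM: 11 items scored 0-4, maximum 44;
# CTS items here assumed to sum to a maximum of 140).
team_pct = 100 * op1 / 44
cts_pct = 100 * rng.integers(70, 130, size=18) / 140

# Agreement between tools: Spearman's rank correlation on percentage scores.
r, p = stats.spearmanr(team_pct, cts_pct)
print(f"Spearman r = {r:.3f}, p = {p:.4f}")

# Paired non-parametric comparison (Wilcoxon signed rank test), as used for
# the per-item comparisons in the paper.
scores_a = rng.integers(90, 115, size=18)
scores_b = scores_a - rng.integers(1, 15, size=18)
stat, p = stats.wilcoxon(scores_a, scores_b)
print(f"Wilcoxon W = {stat:.1f}, p = {p:.3f}")
```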
Data were analysed with SPSS V.21 software and GraphPad Prism V.5. The level of significance was set at p=0.05.
Results
All 27 residents participated in the study, and we performed 18 simulation sessions overall (figure 1).
Tools selection
Between October 2013 and February 2014, teamwork performance during scenarios was assessed with the following scales: ETD, CTS, TEAM, OTT and MAYO. An interim analysis was performed after the first six sessions. All scales demonstrated good internal consistency (Cronbach's α: CTS 0.919, TEAM 0.879, ETD 0.855, OTT 0.838, MAYO 0.759) and fair to good inter-rater reliability (ICC: TEAM 0.871, CTS 0.862, ETD 0.504, MAYO 0.492, OTT 0.419). In this analysis, MAYO items 9–16 were excluded, since they were not applicable to the majority of our simulation scenarios, making this scale less suitable than the others for our setting. After this analysis, we abandoned the MAYO and OTT; from February 2014, only CTS, ETD and TEAM were employed (figure 1).
Internal consistency and inter-rater reliability
Table 1 provides a direct comparison of the three tools. The final analysis, based on 18 simulation sessions, showed good internal consistency for all three scales (Cronbach's α: TEAM 0.954, CTS 0.954, ETD 0.921). Mean total score was 68% (30/44) for TEAM, 62% (20/32) for ETD and 55% (77/140) for CTS; global rating was 60% (6/10) for both TEAM and CTS. We further evaluated internal consistency for CTS and TEAM using non-parametric correlation analysis. For both scales, single items correlated significantly with the global score (Spearman's r between 0.514 and 0.819 for CTS and between 0.345 and 0.660 for TEAM) (tables 2 and 3); in both scales, communication showed the highest correlation with the overall evaluation. ETD was not suitable for this kind of evaluation because it does not include an overall score distinct from the sum of all the items.
Table 1.
Single TEAM, CTS and ETD items
TEAM | ETD | CTS |
---|---|---|
The team leader let the team know what was expected of them | The leader let the team know what was expected of them | How would you rate teamwork during this delivery/emergency? |
The team leader maintained a global perspective | The team transferred information | Overall Communication Rating |
The team communicated effectively | The team was adaptable | Orient new members (SBAR) |
The team worked together to complete tasks in a timely manner | The team was coordinated | Transparent thinking |
The team acted with composure and control | The team cooperated | Directed communication |
The team morale was positive | The team used initiative | Closed loop communication |
The team adapted to changing situations | The team put effort into its work | Overall Situational Awareness |
The team monitored and reassessed the situation | The team had a positive spirit and morale | Resource allocation |
The team anticipated potential actions | | Target fixation |
The team prioritised tasks | | Overall Decision Making Rating |
The team followed approved standards | | Prioritise |
On a scale of 1–10 give your global rating of team performance | | Overall Role Responsibility (Leader/Helper) |
 | | Role clarity |
 | | Perform as a leader/helper |
 | | Patient friendly |
CTS, Clinical Teamwork Scale; ETD, Emergency Team Dynamics; SBAR, situation, background, assessment, recommendations; TEAM, Team Emergency Assessment Measure.
Table 2.
Descriptive CTS statistics (median/range) across assessors (Friedman test); ICC between raters; non-parametric correlation between single items and global score (Spearman's r); single item comparison between junior and senior residents (Wilcoxon signed rank test)
Item | Op 1 median (range) | Op 2 median (range) | Op 3 median (range) | p Value | ICC | Spearman's r | G1 mean±SD | G2 mean±SD |
---|---|---|---|---|---|---|---|---|
1 | 6 (3–8) | 6.5 (5–8) | 7 (3–9) | 0.001 | 0.742 | 0.819° | 7.5±1.1 | 6.8±1.3° |
2 | NA† | NA | NA | NA | NA | |||
3 | 6 (3–8) | 6.5 (4–9) | 7 (4–9) | 0.005 | 0.779 | 0.790° | 7.5±1.2 | 6.6±1.4° |
4 | 6 (3–8) | 7 (4–8) | 7 (4–9) | <0.001 | 0.734 | 0.722° | 7.5±1.2 | 6.7±1.5° |
5 | 6 (4–8) | 7 (4–9) | 6 (3–9) | 0.323 | 0.683 | 0.665° | 7.4±1.3 | 6.6±1.5° |
6 | 6 (3–8) | 6 (4–9) | 6 (3–8) | 0.281 | 0.722 | 0.680° | 7.3±1.2 | 6.5±1.5° |
7 | 6 (3–9) | 6.5 (4–8) | 7 (3–9) | 0.236 | 0.740 | 0.649° | 7.3±1.1 | 6.5±1.4° |
8 | NA‡ | |||||||
9 | 7 (4–8) | 7 (5–9) | 6.5 (1–9) | 0.423 | 0.735 | 0.773° | 7.5±1.0 | 6.8±1.4° |
10 | 7 (3–8) | 6.5 (4–9) | 6.5 (4–9) | 0.906 | 0.743 | 0.745° | 7.5±1.1 | 6.9±1.2° |
11 | 6 (3–8) | 6 (4–9) | 7 (2–9) | 0.791 | 0.847 | 0.746° | 7.5±1.2 | 6.8±1.6° |
12 | 6 (3–8) | 6 (2–9) | 7 (2–9) | 0.771 | 0.823 | 0.684° | 7.4±1.4 | 6.7±1.8° |
13 | 6.5 (3–8) | 6 (3–9) | 7 (3–9) | 0.434 | 0.816 | 0.722° | 7.5±1.2 | 6.7±1.7° |
14 | 7 (3–8) | 7 (5–8) | 7 (4–10) | 0.090 | 0.777 | 0.514° | 7.4±1.0 | 6.7±1.4° |
*p<0.05; °p<0.0001.
†There were no simulated cases with new member orientation.
‡For this item a yes/no score is expected.
CTS, Clinical Teamwork Scale; G1, group 1; G2, group 2; ICC, Intraclass Correlation Coefficients; NA, not applicable; Op, operator.
Table 3.
Descriptive TEAM statistics (median/range) across assessors (Friedman test); ICC between raters; non-parametric correlation between single items and global score (Spearman's r); single item comparison between junior and senior residents (Wilcoxon signed rank test)
Item | Op 1 median (range) | Op 2 median (range) | Op 3 median (range) | p Value | ICC | Spearman's r | G1 mean±SD | G2 mean±SD |
---|---|---|---|---|---|---|---|---|
1 | 3 (1–4) | 2 (1–4) | 3 (1–4) | 0.041 | 0.713 | 0.647° | 2.7±0.8 | 2.5±0.9 |
2 | 3 (1–4) | 2 (1–4) | 3 (1–4) | 0.045 | 0.647 | 0.612° | 2.7±0.7 | 2.6±0.7 |
3 | 2.5 (0–4) | 3 (1–4) | 3 (1–4) | 0.020 | 0.488 | 0.660° | 2.8±0.7 | 2.5±0.9 |
4 | 3 (0–4) | 3 (1–4) | 3 (1–4) | 0.055 | 0.491 | 0.585° | 2.7±0.6 | 2.7±0.8 |
5 | 3 (1–4) | 3 (1–4) | 3 (1–4) | 0.420 | 0.581 | 0.345° | 2.6±0.8 | 2.7±0.7 |
6 | 3 (0–4) | 3 (1–4) | 3 (1–4) | 0.061 | 0.680 | 0.359° | 2.7±0.7 | 2.7±0.8 |
7 | 3 (1–4) | 3 (2–4) | 3 (1–4) | 0.021 | 0.536 | 0.418° | 2.8±0.7 | 2.9±0.8 |
8 | 3 (1–4) | 3 (2–4) | 3 (2–4) | 0.047 | 0.725 | 0.507° | 2.8±0.7 | 2.8±0.8 |
9 | 2 (0–4) | 3 (2–4) | 2.5 (1–4) | 0.045 | 0.519 | 0.605° | 2.5±0.6 | 2.6±0.8 |
10 | 3 (1–4) | 3 (2–4) | 3 (2–4) | 0.072 | 0.437 | 0.498° | 2.8±0.6 | 2.8±0.7 |
11 | 3 (2–4) | 3 (2–4) | 3 (2–4) | 0.365 | 0.753 | 0.518° | 2.9±0.6 | 2.9±0.6 |
*p<0.05; °p<0.0001.
ICC, Intraclass Correlation Coefficients; Op, operator; TEAM, Team Emergency Assessment Measure.
The inter-rater concordance of the scores for the TEAM and the CTS was excellent (ICC between raters: TEAM=0.921; CTS=0.917), while for the ETD, concordance was moderate (ICC 0.608).
Ratings allocated by the three assessors were compared per item across all tools. Tables 2–4 summarise the descriptive statistics (median and range) of the single items across all assessors. Statistical comparisons (Friedman test) of these ratings revealed that a small proportion of CTS and ETD ratings (3 for CTS and 1 for ETD) achieved significance, suggesting some overall disagreement between assessors; disagreement was evidenced for about half of the TEAM items. These differences were small in absolute terms (means that differed by 0.5–1 in all behaviours) and never led to a difference in the direction of the overall opinion of the assessors (ie, one rating the team as ‘poor’ and the other as ‘good’). CTS items (table 2) achieved highly significant ICC results between raters, with 12 of the total 13 comparisons achieving ICC results ≥0.70, indicating good agreement; only 4 of the total 11 TEAM items (table 3) and only 1 of the 8 total ETD items (table 4) reached an ICC value ≥0.70. In all scales, the best reliability was evidenced for the items regarding leadership, especially the ability to clearly assign roles and to share with the team members what was expected from them. Taken together, these findings show a slightly better overall reliability for CTS compared with TEAM and ETD.
Table 4.
Descriptive ETD statistics (median/range) across assessors (Friedman test); ICC between raters; single item comparison between junior and senior residents (Wilcoxon signed rank test)
Item | Op 1 median (range) | Op 2 median (range) | Op 3 median (range) | p Value | ICC | G1 mean±SD | G2 mean±SD |
---|---|---|---|---|---|---|---|
1 | 3 (1–4) | 2 (0–4) | 2 (0–4) | <0.001 | 0.764 | 2.7±0.9 | 2.5±1.0* |
2 | 2 (1–4) | 2 (1–4) | 2 (1–4) | 0.339 | 0.663 | 2.9±0.8 | 2.5±0.8° |
3 | 2 (1–4) | 3 (1–4) | 2 (1–4) | 0.161 | 0.493 | 2.8±0.9 | 2.6±0.7 |
4 | 2.5 (1–4) | 2 (1–4) | 2 (0–4) | 0.073 | 0.637 | 2.8±0.9 | 2.5±0.8* |
5 | 3 (1–4) | 3 (2–4) | 2 (0–4) | 0.244 | 0.497 | 3.0±0.8 | 2.8±0.8 |
6 | 3 (2–4) | 3 (1–4) | 2 (1–4) | 0.128 | 0.629 | 2.8±0.9 | 2.9±0.7 |
7 | 3 (1–4) | 3 (2–4) | 3 (1–4) | 0.226 | 0.536 | 3.2±0.8 | 3.0±0.8* |
8 | 3 (1–4) | 3 (2–4) | 3 (0–4) | 0.500 | 0.289 | 3.2±0.8 | 2.9±0.8* |
*p<0.05; °p<0.0001.
ETD, Emergency Team Dynamics; ICC, Intraclass Correlation Coefficients; Op, operator.
Content comparison and correlation of scorings
Overall, correlation was moderate for all comparisons: Spearman's r was 0.585 between ETD and CTS, 0.694 between ETD and TEAM, and 0.634 between TEAM and CTS (all p<0.0001; figure 2). We repeated this analysis for each rater and the results were very similar: Spearman's r values were, respectively, 0.674, 0.811 and 0.661 for rater 1; 0.455 (p=0.001), 0.769 and 0.659 for rater 2; and 0.633, 0.659 and 0.563 for rater 3 (all other p<0.0001).
Figure 2.
Spearman's r correlation between CTS, ETD and TEAM scores, converted to percentages. TEAM, Team Emergency Assessment Measure; CTS, Clinical Teamwork Scale; ETD, Emergency Team Dynamics.
The Bland-Altman plots (figure 3) demonstrated good agreement between the different tools, as shown by the relatively small number of points falling outside the 95% limits and a mean difference between 4% and 12%. The best agreement was observed between TEAM and ETD, probably because the two tools have a very similar scoring method; CTS scored performance lower than the other scales, especially compared with TEAM and most notably at higher levels of performance. This trend was substantially confirmed when the analysis was repeated separately for every rater (figure 3A–C).
Figure 3.
Bland-Altman plot of the CTS, ETD and TEAM percentage scores across all raters (first column) and for single raters (A–C). The solid line represents the mean difference and the dashed lines represent 95% limits of agreement. TEAM, Team Emergency Assessment Measure; CTS, Clinical Teamwork Scale; ETD, Emergency Team Dynamics.
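For readers who wish to reproduce this type of agreement analysis, the following is a minimal Bland-Altman sketch in Python with matplotlib, using hypothetical percentage scores; the mean difference and 95% limits of agreement are computed as described for figure 3, but the data, and the assumption that CTS scores slightly lower, are illustrative only (the published figures were produced with GraphPad Prism).

```python
# Minimal Bland-Altman sketch (hypothetical percentage scores).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Hypothetical percentage scores for the same 18 scenarios on two tools.
team_pct = rng.uniform(50, 90, size=18)
cts_pct = team_pct - rng.normal(8, 6, size=18)    # CTS assumed to score a little lower

mean_scores = (team_pct + cts_pct) / 2            # x-axis: average of the two tools
diff_scores = team_pct - cts_pct                  # y-axis: difference between tools

bias = diff_scores.mean()                         # mean difference (bias)
loa = 1.96 * diff_scores.std(ddof=1)              # half-width of the 95% limits of agreement

plt.scatter(mean_scores, diff_scores)
plt.axhline(bias, linestyle="-")                  # solid line: mean difference
plt.axhline(bias + loa, linestyle="--")           # dashed lines: 95% limits of agreement
plt.axhline(bias - loa, linestyle="--")
plt.xlabel("Mean of TEAM and CTS (%)")
plt.ylabel("TEAM - CTS (%)")
plt.title("Bland-Altman plot (hypothetical data)")
plt.show()
```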
Comparison between junior and senior residents: self-assessment versus assessment by external observers
We compared score values between junior and senior residents and found that ETD and TEAM scores were comparable between the two groups (ETD: 23±6 in junior vs 21±5 in senior residents; TEAM: 37±6 vs 36±7, all p=NS), while the CTS score (98±12 in junior vs 88±16 in senior residents, p<0.0001) was significantly higher in novices than in senior residents. We also compared single items across all tools: no TEAM item was significantly different between G1 and G2. The significant difference was confirmed for all CTS items; several ETD items were significantly higher in junior compared with senior residents, especially those exploring leadership, team effort, team morale, coordination and communication.
From February 2014, at the end of each scenario, participants rated their teamwork performance with the ETD and CTS scales: overall, we examined 15 scenarios for G1 and 17 scenarios for G2. We averaged all participants’ and observers’ scores separately for every single scenario and compared these mean score values. We found that participants gave themselves a better evaluation than that given by external observers (CTS: 101±9 for participants vs 90±9 for external observers; ETD: 25±3 vs 20±5, all p<0.0001): these observations were confirmed in G1 (CTS: 104±7 for participants vs 90±8 for external observers; ETD: 26±3 vs 20±5, all p<0.0001). In G2 (CTS: 98±10 for participants vs 89±10 for external observers, p=0.029; ETD: 23±3 vs 21±5, p=NS), ETD scores were comparable between observers and participants.
Discussion
In a series of simulated scenarios managed by EM residents, we applied instruments already validated in clinical and simulation settings, to identify a reliable and feasible tool for teamwork evaluation. Our aim was not to demonstrate the superiority of one tool over the others, but to test their applicability in our simulation setting. TEAM, ETD and CTS demonstrated good internal consistency, but inter-rater reliability was slightly better for CTS compared with the other tools. When comparing percentage scores, correlation between the different scales was moderate, probably because the scales have different structures.
Performance of teamwork scales
Our data confirmed the good internal consistency of the scales: the aim of this paper was not to validate the tools; however, we reported these measures to confirm that, in our simulation setting, internal consistency was maintained and all the explored teamwork aspects contributed to the final score. Our data were comparable with those in previous studies, which explored the validity and reliability of the TEAM and ETD tools.13 17 18 CTS was employed in an observational study that did not specifically explore its psychometric properties but evaluated the effect of a training programme on teamwork and communication during trauma care. The study design included a didactic phase, an in-situ simulation phase and a decay phase. The authors demonstrated that the scores for 11 of 14 measures improved from the baseline to the didactic phase.19 These results confirm that these tools comprehensively evaluate teamwork performance, with all explored dimensions equally contributing to the overall evaluation; they therefore allow team performance rating and feedback, which is likely to have an impact on patient safety.
Inter-rater reliability was very good for CTS and TEAM, and only fair for ETD; again, we can only speculate that, with a more concise tool, even a slight disagreement significantly affected reliability.
When we compared the different tools, we found that the overall rating by CTS and TEAM was identical (overall rating 60% for each; not available for ETD), but Bland-Altman analysis showed that CTS scored lower than ETD and TEAM. This trend was confirmed when the analysis was repeated across single raters. The aim of this study was not to evaluate the accuracy of the tools: we do not have a gold standard for evaluating teamwork; therefore, it is impossible to say which evaluation is more accurate than another. The striking difference is that, while CTS scores five behavioural domains, with up to four subscales and a global rating for each domain, the TEAM and ETD tools rate the entire team over 12 and 8 different aspects, respectively. This means, for example, that where there is one score for ‘Communication’ in the TEAM or ETD tool, this facet is scored four times, with an additional ‘overall score’, in the CTS tool. CTS thus appears to be more detailed in its skills coverage, with a negligible difference in compilation time. It is possible that each scenario's design could influence the correlation between different tools; however, we did not design any scenario with the aim of testing a specific aspect of teamwork. As this was the first structured experience for our residents, our aim was to give them an opportunity to train in all the basic teamwork dimensions.
Previous papers on inter-rater reliability are very limited: in one analysis of pre-recorded videos of resuscitation team events using the TEAM and the OSCAR tools, McKay et al20 demonstrated strong inter-rater reliability, slightly better for the OSCAR score. OSCAR is a very detailed tool that allows rating of the performance of individual subteams within a standard resuscitation team (anaesthetists, physicians and nurses) across six teamwork-related behaviours. This structure enables a more detailed assessment compared with TEAM, which instead gives a global and quick team evaluation. In our study, the correlation between different tools was moderate: we involved three raters, thus increasing variability, but the trend was confirmed for every single rater, each of whom was kept rigorously blind to the evaluations of the other raters. CTS somewhat resembles OSCAR in the detailed scoring it allows. An instrument for teamwork evaluation has to combine the ability to represent all the dimensions of teamwork performance with ease of use: from these results we may assert that CTS meets these requirements and could become a useful tool to evaluate teamwork, allow feedback and identify areas of weakness for future training.
Less than optimal inter-rater reliability has already been reported for the OTT in its validation paper, and a significant proportion of MAYO items was not applicable in our simulation setting; this is why we decided to exclude these two scales. However, we cannot exclude that, in a different context, the MAYO and OTT could perform better than they did in our study.
Comparison between self-assessment and assessment by external observers
We were not able to demonstrate a higher score for senior residents compared with novices: CTS even scored better for novices than for senior residents. One could argue that the tools had low discriminative power; however, all the scales demonstrated at best no difference (TEAM) between the performance of novices and senior residents. Therefore, we cannot exclude that these results genuinely reflect what happened during our simulations. This result can partially be explained by considering that simulation is not a routine component of EM residency training and that this programme was the first structured experience of simulation for all residents. We can speculate that senior residents were more focused on technical skills than juniors, with detrimental consequences on team performance. Of note, the junior residents were exposed to less complicated scenarios, which could have improved their teamwork performance by further reducing the focus on technical aspects.
We compared self-assessment and assessment by external observers with ETD and CTS, two scales already validated for self-assessment. As a whole, the EM residents overrated their own performance compared with the faculty assessment. When we separately examined novice and senior residents, over-rating was fully confirmed for the novice group, while senior residents appeared to overestimate their performance only with CTS; ETD scores were comparable between the two different raters.
Physicians do not seem to accurately self-assess. This finding is independent of level of training, specialty, the domain of self-assessment and manner of comparison. In a systematic review by Davis et al,21 of the 20 comparisons between self and external assessment, 13 demonstrated little, no, or an inverse relationship between self-assessment measures and other indicators; level of training of the individuals studied did not influence their rating ability. Sadosty et al22 observed that the EM residents’ self-assessment agreed well with the faculty assessment. Residents who scored above the 50th centile in a simulated patient encounter were able to self-assess more accurately than those who scored below.
In our study population, self-evaluation was performed immediately after scenario completion, not allowing significant reflection on what happened and what could have been done better. It is possible that experience gives residents better insight into their performance compared with novices; further data on a larger number of scenarios are needed to explore this issue and to understand which teamwork aspects are responsible for this self-overestimation. However, this poor self-assessment further underlines the need for an objective evaluation of learning processes.
This study has several limitations. All tools have only been applied in a simulated environment, although this is most likely where they will be employed most of the time. Individuals and teams may behave differently in a real clinical scenario than in a simulated one, and this may influence ratings; however, we think that teamwork training will most likely be performed with simulation. Therefore, we need reliable tools to evaluate the team in a simulated environment.
Our raters employed all tools to rate the same scenarios at the same time, in random order; this might inevitably inflate the correlations between the tools, but it did not affect inter-rater reliability. This is an inherent problem with any study where multiple assessments of various skills are carried out concurrently, probably because the number of assessors/faculty is often limited. Nevertheless, analysis of agreement showed that some overscoring or underscoring was repeatable between different raters and probably due to the different structure of the tools. Further evaluation of correlation between the tools is thus required, perhaps with assessors scoring only one tool each or scoring each tool a few days apart.
Conclusion
Good teamwork performance is a key factor for safe patient management, especially in critical care, and several learning programmes are including teamwork training in their curricula. A major drawback in this framework is the absence of reliable tools to quantify teamwork performance in order to evaluate baseline characteristics and improvement. In a series of simulation sessions on management of the critically ill by EM residents, we evaluated teamwork performance using several previously validated tools: we found that ETD, CTS and TEAM performed well, with good internal consistency and fair to good inter-rater reliability. When we compared participants and external observers, we found a significant overestimation of teamwork performance by participants: this was especially pronounced among novices, probably reflecting their lower level of experience and lower ability for self-evaluation compared with senior residents.
Further studies are needed to explore specific areas of teamwork behaviour in more depth, in order to identify the strengths and weaknesses of our teams and to tailor adequate training.
Footnotes
Contributors: RP and FI were involved in the concept and design. EA, MS and AA were involved in the data collection. AA, EA and FI were involved in the data analysis and interpretation. FI, MS and EA were involved in the drafting of the article. RP was involved in the critical revision of the article and statistics.
Competing interests: None declared.
Provenance and peer review: Not commissioned; externally peer reviewed.
References
1. Friedman SM, Provan D, Moore S, et al. Errors, near misses and adverse events in the emergency department: what can patients tell us? CJEM 2008;10:421–7.
2. Cosby KS, Roberts R, Palivos L, et al. Characteristics of patient care management problems identified in emergency department morbidity and mortality investigations during 15 years. Ann Emerg Med 2008;51:251–61. doi:10.1016/j.annemergmed.2007.06.483
3. Kohn LT, Corrigan JM, Donaldson MS. To err is human: building a safer health system. Washington: National Academy Press, 1999.
4. Kilner E, Sheppard LA. The role of teamwork and communication in the emergency department: a systematic review. Int Emerg Nurs 2010;18:127–37. doi:10.1016/j.ienj.2009.05.006
5. Binstadt ES, Walls RM, White BA, et al. A comprehensive medical simulation education curriculum for emergency medicine residents. Ann Emerg Med 2007;49:495–504. doi:10.1016/j.annemergmed.2006.08.023
6. Ten Eyck RP, Tews M, Ballester JM. Improved medical student satisfaction and test performance with a simulation-based emergency medicine curriculum: a randomized controlled trial. Ann Emerg Med 2009;54:684–91. doi:10.1016/j.annemergmed.2009.03.025
7. Cooper S, Wakelam A. Leadership of resuscitation teams: ‘Lighthouse Leadership’. Resuscitation 1999;42:27–45. doi:10.1016/S0300-9572(99)00080-5
8. Guise JM, Deering SH, Kanki BG, et al. Validation of a tool to measure and promote clinical teamwork. Simul Healthc 2008;3:217–23. doi:10.1097/SIH.0b013e31816fdd0a
9. Cooper S, Cant R, Porter J, et al. Rating medical emergency teamwork performance: development of the Team Emergency Assessment Measure (TEAM). Resuscitation 2010;81:446–52. doi:10.1016/j.resuscitation.2009.11.027
10. Kim J, Neilipovitz D, Cardinal P, et al. A pilot study using high-fidelity simulation to formally evaluate performance in the resuscitation of critically ill patients: The University of Ottawa Critical Care Medicine, High-Fidelity Simulation, and Crisis Resource Management I Study. Crit Care Med 2006;34:2167–74. doi:10.1097/01.CCM.0000229877.45125.CC
11. Malec JF, Torsher LC, Dunn WF, et al. The Mayo high performance teamwork scale: reliability and validity for evaluating key crew resource management skills. Simul Healthc 2007;2:4–10. doi:10.1097/SIH.0b013e31802b68ee
12. Bogossian F, Cooper S, Cant R, et al. Undergraduate nursing students’ performance in recognising and responding to sudden patient deterioration in high psychological fidelity simulated environments: an Australian multi-centre study. Nurse Educ Today 2014;34:691–6. doi:10.1016/j.nedt.2013.09.015
13. Cooper S, Beauchamp A, Bogossian F, et al. Managing patient deterioration: a protocol for enhancing undergraduate nursing students’ competence through web-based simulation and feedback techniques. BMC Nurs 2012;11:18. doi:10.1186/1472-6955-11-18
14. Cooper SJ, Cant RP. Measuring non-technical skills of medical emergency teams: an update on the validity and reliability of the Team Emergency Assessment Measure (TEAM). Resuscitation 2014;85:31–3. doi:10.1016/j.resuscitation.2013.08.276
15. Carne B, Kennedy M, Gray T. Review article: crisis resource management in emergency medicine. Emerg Med Australas 2012;24:7–13. doi:10.1111/j.1742-6723.2011.01495.x
16. Flin R, Maran N. Identifying and training non-technical skills for teams in acute medicine. Qual Saf Health Care 2004;13(Suppl 1):i80–4. doi:10.1136/qhc.13.suppl_1.i80
17. Cooper S, O'Carroll J, Jenkin A, et al. Collaborative practices in unscheduled emergency care: role and impact of the emergency care practitioner – quantitative findings. Emerg Med J 2007;24:630–3. doi:10.1136/emj.2007.048058
18. Bogossian F, McKenna L, Higgins M, et al. Simulation based learning in Australian midwifery curricula: results of a national electronic survey. Women Birth 2012;25:86–97. doi:10.1016/j.wombi.2011.02.001
19. Miller D, Crandall C, Washington C III, et al. Improving teamwork and communication in trauma care through in situ simulations. Acad Emerg Med 2012;19:608–12. doi:10.1111/j.1553-2712.2012.01354.x
20. McKay A, Walker ST, Brett SJ, et al. Team performance in resuscitation teams: comparison and critique of two recently developed scoring tools. Resuscitation 2012;83:1478–83. doi:10.1016/j.resuscitation.2012.04.015
21. Davis DA, Mazmanian PE, Fordis M, et al. Accuracy of physician self-assessment compared with observed measures of competence: a systematic review. JAMA 2006;296:1094–102. doi:10.1001/jama.296.9.1094
22. Sadosty AT, Bellolio MF, Laack TA, et al. Simulation-based emergency medicine resident self-assessment. J Emerg Med 2011;41:679–85. doi:10.1016/j.jemermed.2011.05.041