Abstract
Objectives
Effective trainee‐led debriefing after critical events in the pediatric emergency department has potential to improve patient care, but debriefing assessments for this context have not been developed. This study gathers preliminary validity and reliability evidence for the Debriefing Assessment for Simulation in Healthcare (DASH) as an assessment of trainee‐led post–critical event debriefing.
Methods
Eight fellows led teams in three simulated critical events, each followed by a video‐recorded discussion of performance mimicking impromptu debriefings occurring after real clinical events. Three raters assessed the recorded debriefings using the DASH, and their feedback was collated. Data were analyzed using generalizability theory, Gwet’s AC2, intraclass correlation coefficient (ICC), and coefficient alpha. Validity was examined using Messick’s framework.
Results
The DASH instrument had relatively low traditional inter‐rater reliability (Gwet’s AC2 = 0.24, single‐rater ICC range = 0.16‐0.35), with 30% fellow, 19% rater, and 23% rater by fellow variance. The DASH generalizability (G) coefficient was 0.72, indicating inadequate reliability for research purposes. Decision (D) study results suggest that the DASH can attain a G coefficient of 0.8 with five or more raters. Coefficient alpha was 0.95 for the DASH. Ninety percent and 40% of items from Elements 1 and 4, respectively, were deemed “not applicable” or left blank.
Conclusions
Our results suggest that the DASH does not have sufficient validity and reliability to rigorously assess debriefing in the post–critical event environment but may be amenable to modification. Further development of the tool will be needed for optimal use in this context.
Effective debriefing after resuscitations can potentially improve future care. 1 , 2 , 3 , 4 The American Heart Association and American Academy of Pediatrics recommend regular postevent debriefings. 3 , 4 , 5 The majority of U.S. pediatric emergency medicine (PEM) fellows, however, do not receive formal training in debriefing, making it difficult to ensure that newly graduated attendings have acquired adequate experience in this area. 1 , 2 One solution is the creation of rigorous post–critical event feedback and debriefing curricula for PEM trainees. A curriculum designed for trainees, rather than attendings or outside facilitators, may best mimic debriefing environments in academic centers, as advanced trainees such as fellows will frequently be in the best position to conduct these events; it is thus important that this skill set be honed early. Determining the effectiveness of such curricula requires assessments with evidence of validity and reliability. While several general debriefing assessments exist, none have been psychometrically assessed in this context.
The Debriefing Assessment for Simulation in Healthcare (DASH) represents one of the most popular debriefing assessments. The DASH was created to evaluate debriefing techniques and quality of information transfer to learners and possesses good psychometric properties among health care simulation educators (inter‐rater reliability by intraclass correlation coefficient [ICC] of 0.74; internal consistency by coefficient alpha of 0.89; statistically significant difference in scores across subjects with different debriefing skills). 6 , 7 , 8
Developers intended the DASH instrument to assess debriefings conducted after educational simulations and performed the above psychometric testing in this context. Post–educational simulation debriefings, however, are lengthier and more extensive than typical post–critical event debriefings, with a reported median duration of 10 minutes. 9 , 10 , 11 Additionally, faculty facilitators typically lead post–educational simulation debriefings, while either the medical team leader or other staff guide post–critical event debriefings. 12 , 13 In the case of trainee‐led events, attending faculty members are often present alongside the trainees, introducing an additional element of hierarchy. Thus, while the underlying content of the DASH seems relevant to this context, these differences must be considered and additional reliability testing is needed. The DASH has, however, been successfully used to evaluate novice debriefers in several published studies, lending additional credence to its value here. 14 , 15
Given the relative infrequency of post–critical event debriefing and the lack of standardization present in real patient care environments, gathering strong psychometric data in this context is relatively difficult. In situ simulation (i.e., simulation conducted in the patient care setting) can be of use here. 16 Educators have successfully utilized in situ simulation to re‐create actual clinical events for quality and safety purposes, and it thus represents a means of generating the standardized clinical events required. 17 This study sought to preliminarily explore the psychometric properties of the DASH as an assessment of PEM trainee‐led post–critical event debriefings using in situ simulations as a proxy for actual clinical events, with the goal of further developing it as a rigorous curriculum development assessment.
METHODS
The Institutional Review Board of the University of Louisville/Norton Children’s Hospital approved this prospective study. The primary purpose of this study was to verify that the constructs underlying the DASH could be applied to trainee‐led post–critical event debriefing and were sufficiently reliable for research/curriculum evaluation purposes. Accordingly, we chose to focus on two key sources of validity evidence described in Messick’s unified framework: content and internal structure. 18 , 19 , 20 , 21 This approach has been used in numerous published studies. 22 , 23
Content
Content evidence reflects how an instrument was developed based on literature review, expert panel evaluation, and best practices. 19 This aspect of validity was first addressed via an iterative series of local expert panels consisting of six individuals with expertise in acute care medicine, simulation, debriefing, and assessment, who reviewed the relevant literature to identify applicable constructs. These panels concluded that the Feedback Assessment for Clinical Education (FACE) and DASH tools both had potential applicability in the study context. 21 , 24 After further consultation with a co‐author of these tools and subsequent feedback obtained via an ALERT (Advanced Look Exploratory Research Template) presentation and panel discussion at INSPIRE@IMSH 2018, it was determined that the DASH construct would best match the trainee‐led post–critical event debriefing environment, as our focus was on the debriefing process rather than the provision of specific feedback (the process for which the FACE was developed). Figure 1 contains a detailed description of the flow, discussion themes, and member qualifications for this process.
Figure 1.

Flow chart of expert panel process and outcomes. FACE = Feedback Assessment for Clinical Education; DASH = Debriefing Assessment for Simulation in Healthcare.
During the ALERT discussion, the decision was made to use an unmodified long version of the DASH tool for the exploratory analysis. This was done as we desired raters to evaluate the applicability of each individual subcomponent as well as the overall value of each element in this novel context. While some aspects of the DASH (most notably Element 1) initially seemed less applicable, it was felt that this approach would provide the most systematic rater feedback possible, potentially assisting us should modifications be needed.
Internal Structure
Internal structure evidence includes key psychometric properties of an instrument and overlaps with reliability. 21 , 25 , 26 Traditional psychometrics examine one type of reliability (e.g., inter‐rater, test–retest) at a time, while generalizability (G) theory permits researchers to disentangle specific factors that may impinge on overall reliability (i.e., “facets”) and isolate the variance each contributes. Through decision (D) studies, G theory also allows researchers to estimate reliability under varying conditions. 27 , 28 We therefore evaluated the tool’s psychometric properties (and hence its internal structure) using both traditional measures (Gwet’s AC2 for inter‐rater reliability and coefficient alpha for internal consistency) and G theory. 27 , 29 , 30
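For readers less familiar with G theory, the decomposition below sketches how a single observed score in a fully crossed design such as ours is partitioned into the variance components a G study estimates. This is a standard textbook formulation rather than output from the present study, and the shorthand p (fellow), r (rater), and c (case) is ours.

```latex
% Observed score for fellow p, rater r, and case c in a fully crossed p x r x c design
X_{prc} = \mu + \nu_{p} + \nu_{r} + \nu_{c} + \nu_{pr} + \nu_{pc} + \nu_{rc} + \nu_{prc,e}

% Corresponding partition of observed-score variance; the final component
% confounds the three-way interaction with unmeasured error
\sigma^{2}(X_{prc}) = \sigma^{2}_{p} + \sigma^{2}_{r} + \sigma^{2}_{c}
  + \sigma^{2}_{pr} + \sigma^{2}_{pc} + \sigma^{2}_{rc} + \sigma^{2}_{prc,e}
```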
Study Design
To re‐create common critical events from the pediatric emergency department (ED) in a standardized environment, study investigators developed a series of three high‐fidelity in situ simulations. These cases were written by members of the expert panel described in Figure 1. Case difficulty was evaluated by the expert panel and adjusted with the goal of providing roughly equivalent scenarios in terms of medical complexity. Each case presented a critically ill child requiring immediate intervention, and the series covered an array of medical crises (including cardiac arrest, neurotrauma, and septic shock) at a consistent level of difficulty. Full scenarios are available in Data Supplement S1 (available as supporting information in the online version of this paper at http://onlinelibrary.wiley.com/doi/10.1002/aet2.10482/full).
Subjects included eight PEM fellows. This study population was chosen due to our desire to utilize (and hence validate) the tool as an assessment of trainee performance after participation in post–critical event debriefing curricula. As the validity and reliability of a given tool are properties of the tool’s relationship with the situation in which it is applied (and do not reside in the tool itself), we felt it necessary to evaluate it in the subject group to which it would next be applied in our program of research. 31 Fellows first completed a brief anonymous survey of prior debriefing experience. Each fellow then led a team composed of PEM attending physicians, nurses, and respiratory therapists through the three scenarios in the Norton Children’s Hospital ED trauma bay. Fellows were not familiar with the cases prior to participation. The PEM attending on each team was asked to act as a content expert during the debriefing if needed, because this best reflected the actual environment of practice we were attempting to re‐create. Investigators provided explicit instructions, however, that the PEM fellows were to lead the team in both simulations and debriefings. Each scenario lasted 15 minutes, after which teams had 10 minutes to debrief in a separate conference room. Fellows did not receive any formal debriefing training as part of the research project. The scenarios and debriefings were video recorded for later review by raters.
Three raters with backgrounds in debriefing techniques assessed the recorded debriefings using the DASH tool. Two raters were subspecialists in pediatric critical care medicine (AC, MS) and one in PEM (DK). DASH rater training consisted of review of the DASH rater’s handbook, which describes the DASH elements in detail and how to properly score subjects on each element. While some prior studies of the DASH have added a component of video‐assisted rating practice, our goal was to apply the DASH to a new environment with different team and debriefing dynamics, and we thus lacked the ability to confidently create standard anchors by which ratings could be calibrated. 32 , 33 We also note that video training is not specified as a prerequisite for use in the DASH rater’s handbook. Raters were instructed to take whatever time was needed to become as familiar as possible with the tool’s contents and were explicitly told to rate only the fellow’s performance. Raters then viewed each case and paired debriefing (replaying as needed), rating each subject’s performance after the debriefing concluded using a standardized electronic platform (SurveyMonkey) that presented the DASH tool elements. Raters were explicitly asked to mark any items that they felt could not be reasonably assessed in this particular context as “not applicable.” Items scored as not applicable were not included in the final numerical analysis. While cases were labeled in ascending order (i.e., random case labels were not used), raters were not given specific instructions on the order in which to rate each case.
Data Analysis
Gwet’s AC2 was used to calculate traditional estimates of inter‐rater reliability, and coefficient alpha was used to calculate internal consistency. Recognizing that the initial DASH paper used ICCs to calculate inter‐rater reliability, these calculations were also performed on overall scores for each case as well as for each of the elements (ICCs cannot be performed on all three cases simultaneously). Both single‐ and mean‐score ICCs were calculated (two‐way mixed model, absolute agreement). Data were also analyzed via a fully crossed G study using fellows (n = 8) as the facet of differentiation (object of measurement) and raters (n = 3) and simulated cases (n = 3) as the facets of generalization. 34 , 35 Case was chosen as a facet to assure that the tool could evaluate debriefing skill across a variety of medical content areas. D studies were performed varying the number of raters and cases. A value of ≥0.8 was used as the standard for “acceptable” reliability for all coefficients. 36 G and D studies were conducted using GENOVA, and other statistical tests were performed using R. The percentages of items scored as not applicable within each DASH element and across the entire tool were also calculated, and written rater feedback was collected.
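To make the traditional reliability calculations concrete, the sketch below shows, in base R, how coefficient alpha and single‐/average‐rater ICCs with absolute agreement (McGraw and Wong’s ICC(A,1) and ICC(A,k)) can be computed from a subjects‐by‐raters score matrix. This is an illustrative reconstruction under our own function names and a toy data matrix, not the authors’ analysis code, and it does not reproduce Gwet’s AC2 or the GENOVA variance‐component estimation.

```r
# Minimal sketch in base R (no packages). Function and variable names,
# and the toy score matrix, are illustrative assumptions only.

coefficient_alpha <- function(items) {
  # items: subjects in rows, items in columns
  items <- na.omit(as.matrix(items))        # listwise deletion of incomplete rows
  k <- ncol(items)
  item_var <- apply(items, 2, var)          # variance of each item
  total_var <- var(rowSums(items))          # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

icc_agreement <- function(x) {
  # x: subjects in rows, raters in columns (one case at a time)
  x <- na.omit(as.matrix(x))
  n <- nrow(x); k <- ncol(x)
  grand   <- mean(x)
  ss_rows <- k * sum((rowMeans(x) - grand)^2)        # between-subject SS
  ss_cols <- n * sum((colMeans(x) - grand)^2)        # between-rater SS
  ss_err  <- sum((x - grand)^2) - ss_rows - ss_cols  # residual SS
  ms_rows <- ss_rows / (n - 1)
  ms_cols <- ss_cols / (k - 1)
  ms_err  <- ss_err / ((n - 1) * (k - 1))
  single  <- (ms_rows - ms_err) /
    (ms_rows + (k - 1) * ms_err + (k / n) * (ms_cols - ms_err))      # ICC(A,1)
  average <- (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n) # ICC(A,k)
  c(ICC_single = single, ICC_average = average)
}

# Illustrative use: 8 fellows (rows) x 3 raters (columns) of overall scores
set.seed(1)
scores <- matrix(round(runif(24, min = 3, max = 7)), nrow = 8, ncol = 3)
icc_agreement(scores)
# In the actual analysis, alpha was computed across DASH item scores; here we
# simply demonstrate the function on the same toy matrix.
coefficient_alpha(scores)
```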
RESULTS
Demographics
Eight PEM fellows were assessed using the DASH instrument during 24 video‐recorded debriefings. Fellows were drawn from all 3 years of training and reported 8 to 20 hours per year of simulation participation. Fifty percent of the fellows reported some formal training in postevent debriefing in a clinical setting. All fellows indicated that they had been debriefed by another provider after an event in an actual clinical setting. Half the fellows reported “sometimes” debriefing critical events with their team in the pediatric ED, while the other half reported “rarely” debriefing. Half of the fellows reported comfort with debriefing others in the medical setting; however, none felt adequately trained to debrief critical events in the pediatric ED.
Validity Evidence
Content Validity
An average of 90% of items in Element 1, 5% of items in Element 2, 0% of items in Element 3, 40% of items in Element 4, 1% of items in Element 5, and 0% of items in Element 6 were answered as not applicable. Of note, the specific behaviors most often deemed not applicable within DASH Element 4 included video review of the simulated scenario and management of the upset participant. Raters stated that the DASH did not seem to apply in its entirety to time‐constrained, more focused debriefings such as these. Raters also noted that, in many cases, fellows appeared passive in terms of leadership, interacting in a more conversational style. Comments further indicated that, for three of the fellows, the attending physician appeared to take control partway through the debriefing; this occurred in six debriefings (near the end in two and halfway through in four) and made it impossible to evaluate the fellow beyond that point. Salient comments are reported in Table 1.
Table 1.
Key Rater Comments
| Types of Comments | Comment Description | Percentage of Comments |
|---|---|---|
| General comments | Key elements of debrief were missing | 45.5% (5/11) |
| General comments | Attending took over debriefing | 18.2% (2/11) |
| DASH‐specific comments | Attending took over debriefing | 30.8% (8/26) |
| DASH‐specific comments | Fellow was not clearly the leader of debriefing | 15.4% (4/26) |
This table displays representative comments given by raters with regard to the instrument. Percentages of each comment type are also provided.
DASH = Debriefing Assessment for Simulation in Healthcare.
Internal Structure
Coefficient alpha for the DASH was 0.95 (95% confidence interval = 0.93‐0.96). Gwet’s AC2 was 0.24 for the DASH, indicating relatively low traditional inter‐rater reliability. Overall ICC values ranged between 0.16 and 0.35 for single raters and between 0.36 and 0.61 for average raters. Table 2 depicts the element‐specific and overall ICC values compared with the values obtained during the original DASH validation study. With respect to the G study, the largest single share of the variance for the DASH was attributed to fellow (30%). Raters accounted for 19% of the variance, and rater by fellow variance was relatively high (23%). Case and rater by case variances were negligible, and fellow by case variance was only 3%. Twenty‐four percent of the variance was attributed to “error,” which in G theory can represent either the highest‐order interaction (i.e., fellow by rater by case) or one or more unmeasured facets. The generalizability coefficient was 0.72. Decision study results indicated that the DASH can attain an estimated G coefficient of 0.80 with five or more raters and three cases. The full G and D study results can be seen in Tables 3 and 4.
Table 2.
Overall and Element‐specific ICC Values as Compared With Original DASH Values
| Element | DASH ICC Values, Traditional Debriefing | DASH ICC Values, Trainee‐led Post–Critical Event Debriefing |
|---|---|---|
| Element 1 | 0.60 | Single rater: could not calculate; Average: could not calculate |
| Element 2 | 0.65 | Single rater: 0.05–0.21; Average: 0.12–0.44 |
| Element 3 | 0.62 | Single rater: 0.10–0.48; Average: 0.26–0.73 |
| Element 4 | 0.68 | Single rater: 0.02–0.37; Average: 0.06–0.64 |
| Element 5 | 0.57 | Single rater: 0.02–0.30; Average: 0.04–0.55 |
| Element 6 | 0.63 | Single rater: 0.27–0.32; Average: 0.53–0.59 |
| Overall | 0.74 | Single rater: 0.16–0.35; Average: 0.37–0.61 |
The table compares overall and element‐specific inter‐rater reliability data (calculated by ICC) from this study with similar data from the initial DASH validation study. Data from this study are presented as ranges due to the need to calculate ICC values for each specific case. Because the initial DASH study did not specify the ICC model used, we have presented both single‐ and average‐rater values from the current data set to allow for a more nuanced comparison. 6
ICC = intraclass correlation coefficient.
Table 3.
DASH G Study Results
| Facets | Variance Component* | Variance (%) |
|---|---|---|
| Fellow (n = 8) | 0.23 | 30 |
| Rater (n = 3) | 0.05 | 19 |
| Case (n = 3) | 0.00 | 0 |
| Fellow × rater | 0.06 | 23 |
| Rater × case | 0.00 | 0 |
| Fellow × case | 0.01 | 3 |
| Error | 0.02 | 24 |
| G coefficient | 0.72 | |
This table displays the generalizability study results for the DASH instrument. Absolute and percent variance of each facet and interfacet relationship are listed along with error variance. The overall reliability of each instrument in this context is expressed as a G coefficient. A G coefficient of 0.80 or greater is typically required for summative feedback or research use.
D study = decision study; G coefficient = generalizability coefficient; ICC = intraclass correlation coefficient.
*Adjusted variance components.
Table 4.
DASH D Study Results for Varying Numbers of Raters and Cases
| Raters | Cases | DASH G Coefficient |
|---|---|---|
| 3* | 3 | 0.72 |
| 3 | 5 | 0.75 |
| 3 | 7 | 0.76 |
| 3 | 9 | 0.77 |
| 4 | 3 | 0.77 |
| 5 | 3 | 0.80 |
| 7 | 3 | 0.84 |
| 9 | 3 | 0.87 |
This table displays the decision study results for the DASH instrument. Rater and case number facets were altered, and generalizability coefficients were computed and compared.
D study = decision study; G coefficient = generalizability coefficient.
*Original study: three raters by three cases.
DISCUSSION
While we initially had high expectations for the DASH’s performance, our data have led us to a different conclusion regarding its current validity and reliability in this context. In keeping with Messick’s framework, we analyze these data below under content and internal structure/reliability subheadings.
Content
In terms of content validity, 25% of DASH questions were rated as not applicable or left blank, indicating that a number of the DASH questions lacked usefulness in the trainee‐led post–critical event context. These questions clustered within DASH Elements 1 (90%) and 4 (40%). Element 1 includes formal introductions, session logistics, and discussion of simulation content that occur prior to the case itself to “establish an engaging learning environment,” while Element 4 involves behaviors such as “provoking engaging discussion,” use of video or recorded data to support analysis, and assistance of participants who became emotionally disturbed during the session. As we structured the in situ cases to resemble “unexpected” real‐world events (in which briefing/orientations are not possible), it is unsurprising that many items within Element 1 lacked relevance.
While most of these teams were familiar with one another and did not do formal introductions at the outset of the debriefing, a brief introduction should be given at the outset of a post–critical event debriefing and should include elements that the DASH describes (i.e., goals and expectations, confidentiality, commitment to respecting participants, and psychological safety). Because these trainees were relative novices to debriefing, they were likely unaware of the content included in an introduction and instead simply began debriefing the case. Courses aimed at enhancing this skill will need to include explicit instruction.
In terms of Element 4, the scenario video was not available at the time of debriefing and no visible emotional disturbance was apparent in any study participant. This likely explains why some raters marked the last two items of Element 4 as “not applicable.” Raters also commented, however, that some items were not applicable due to the relatively time‐constrained nature of the debriefing. Taken together, these findings imply that some portions of the construct underlying the instrument may not apply here. Conversely, these ratings do not by themselves constitute evidence that each of these elements should be removed. For example, the fact that no emotional disturbance was observed during the study does not preclude this from occurring in future post–critical event debriefings, and the item thus remains relevant.
The attending “takeover” phenomenon observed in six of the videos also bears consideration. Given the power dynamics present between fellows and supervising physicians, this phenomenon is unlikely to be isolated to our subjects. Since attendings are likely to be present in many real trainee‐led post–critical event debriefings, it is important that the final tool be able to explicitly account for these power differentials. We thus view this as a key observation that will need to be incorporated into any tool modifications.
Internal Structure/Reliability
The DASH had relatively low traditional measures of inter‐rater reliability (Gwet’s AC2 of 0.24). Both single‐ and average‐rater ICCs were also low. Of note, the average single‐rater ICC value across cases (the statistic that would be expected to correlate most closely with Gwet’s AC2) was 0.23, which corresponds almost exactly to the Gwet’s AC2 value as calculated. All values are substantially lower than the inter‐rater reliability of the DASH (ICC of 0.74) as measured in its original context. Table 2 further demonstrates this with regard to element‐specific scores.
The G study allows us to analyze the reasons behind this lower inter‐rater reliability via examination of the variance components of the measured facets and their interactions. Of these, the greatest contributors were the error term (24%), the rater by fellow interaction (23%), and the rater facet (19%), which together account for the lower traditional measures. Case, case by fellow, and case by rater variances were low for the DASH, suggesting that the specific medical content of the cases did not affect the tool’s ability to assess debriefing skill.
While the G study cannot provide a cause for the rater by fellow variance, the results imply that some bias exists between specific subjects and specific raters that the instrument cannot account for. Possible causes include the effect of the attending commandeering the debriefing, a mismatch between the construct behind the tool and the environment of assessment, and previous relationships between raters and fellows. Because attending takeover occurred with only three of the eight fellows, its effects would have biased rater perceptions of those specific subjects, making it the most plausible source of this variance pattern. Additionally, both this cause and potential differences in the underlying construct would be potentially amenable to tool revision. Previous relationships seem an unlikely source given the infrequent interactions between raters and fellows in practice.
It is also vital to consider the error term. Generalizability theory interprets error variance as due either to one or more untested facets or to the highest‐order interaction (i.e., fellow by rater by case). The attending takeover effect again seems the likely culprit, because it could easily have produced an inordinate fellow by rater by case interaction through incomplete assessment of certain fellows on certain cases. Inadvertent scoring of the attending physician in these cases seems unlikely, as raters frequently specified that their scores were based on the trainee alone. We again note that attending presence is typical in these environments, so the error term does not truly represent an error in the assessment environment; rather, it represents the inability of the tool to reliably account for this interaction.
Other facets that could also have influenced error variance exist, including potential differences in rater experience with (and understanding of) the instrument, the timing of the simulations, their relationship to other educational or clinical activities, and variation in the fellows’ clinical experience. The high internal consistency values, however, strongly imply that the contribution of unmeasured facets to error variance is minimal. 25 , 37 Thus, most of the error variance is likely derived from the rater by fellow by case interaction as described above. We also note that the high value of the error component may be in large part responsible for the wide range in case‐specific ICC scores.
The combined outcome of these sources of variance is an overall G coefficient of 0.72 for the DASH. This coefficient is a measure of how well subject scores can be “generalized” to a universe of PEM fellows and thus provides an index of overall reliability (and hence internal structure). 36 The relative strength of this coefficient may seem counterintuitive given the lower traditional inter‐rater reliability scores, but the tool does possess a relatively strong ability to discriminate between subjects (fellow variance of 30%), and the G coefficient takes this into account. The most reasonable interpretation, then, is that the tool can detect real differences between subjects relatively well, but its ability to do so is impaired by significant psychometric “noise” attributable to the new environment. The D study sheds additional light on this because it demonstrates that the G coefficient can be substantially improved with five or more raters, which provide more data to cut through this noise. This is further supported by the increase in ICC values when average‐rater statistics are used, because these assume multiple raters contributing to a final, mean score. The use of five raters per debriefing, however, is not feasible in practice, and so we conclude that modifications to the DASH are needed for use in this environment.
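As a sketch of the arithmetic behind such a D study projection (a standard random‐model formula from the G theory literature, not a reproduction of the GENOVA output; n_r and n_c below denote hypothetical numbers of raters and cases), the relative generalizability coefficient for the fellow by rater by case design is:

```latex
% Projected (relative) G coefficient for a D study with n_r raters and n_c cases
E\rho^{2} =
  \frac{\sigma^{2}_{\text{fellow}}}
       {\sigma^{2}_{\text{fellow}}
        + \sigma^{2}_{\text{fellow}\times\text{rater}}/n_{r}
        + \sigma^{2}_{\text{fellow}\times\text{case}}/n_{c}
        + \sigma^{2}_{\text{error}}/(n_{r}\,n_{c})}
```

An absolute (criterion‐referenced) coefficient would additionally divide the rater, case, and rater by case components by the same sample sizes; under either form, the rater‐linked denominator terms shrink as the number of raters grows, which is consistent with the projected coefficient reaching 0.80 only at five or more raters in Table 4.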
Based on this, we conclude that the DASH has insufficient validity and reliability in its current form to serve as a rigorous evaluation of post–critical event debriefings. The rater comments provide a clear starting point for modifications and suggest the need to abbreviate or remove certain parts of Elements 1 and 4 (although some items still merit inclusion). Additional modifications will also be needed to accommodate the somewhat ad hoc, time‐constrained nature of post–critical event debriefings as well as the possibility of disruption of the trainee facilitator role by supervisors. Use of a formal modified Delphi process followed by additional psychometric assessment represents the best means of accomplishing this. 38
LIMITATIONS
Perhaps the largest potential confounder was the lack of video‐based rater training. Such training has been used with the DASH in the past, and had we been able to provide it meaningfully, it could have altered the variance profile of the tool. Constructing such training, however, requires a priori knowledge of how a tool is “supposed” to behave in a given environment, knowledge that we did not have. Additionally, unwanted variance from this source would have manifested primarily in the rater facet of the G study, not in the rater by fellow and error terms, because rater training would necessarily have been conducted using debriefings other than those led by the study fellows to avoid bias. Because rater variance was not excessively high for the tool (19%), it seems unlikely that this factor is a major contributor to the high variances noted.
We also acknowledge the literature suggesting that those participating in an event (whether fellow or attending) should not lead the debriefing and that a trained external facilitator should conduct the review instead. 39 While we agree with the rationale behind this advice, such facilitators are, practically speaking, unlikely to be present during many of these events, and other literature suggests that team members most often serve as the post–critical event debriefing facilitator. 2 This supports the overall need for an instrument validated for this population.
Unavoidable variation in team composition existed for several of the simulations/debriefings due to time and scheduling constraints, which could have led to differences in performance and ability to debrief. It seems likely, however, that comparable variation would also exist in actual post–critical event debriefings. Because subjects were assessed anonymously, we are unfortunately unable to assess correlations between simulation experience, level of training, and performance during the debriefing, which could shed further light on this question. The use of video‐recorded debriefings presents an additional potential limitation, because raters could miss subtleties such as the “mood” of the room or nonverbal communication that would be unclear to those not physically present. We also note that two of the three raters (AC, MS) are familiar with the fellows through limited, occasional clinical interactions and that all raters were aware of each fellow’s year of training. This may have contributed to subject by rater bias to a small extent.
In accordance with DASH guidelines, raters were instructed to approach each element as an integrated whole when assessing individual subdomains (i.e., to consider the goal of the entire domain when assessing each item). Raters did not, however, score the domain as a global entity separate from the individual within‐domain items but were instead asked to let their overall sense of that domain influence their item scores. We recognize this as a scoring difference between our use and the primary description of the tool, and it is thus a further limitation. Raters also were asked to watch both the simulation and the debriefing, because we were concerned they would have insufficient context for their assessment if they had not seen the session. While this could have introduced additional bias, we believe this to be minimal because the case and rater by case variances were negligible and fellow by case variance was low at 3%. Finally, we must recognize the very real differences between in situ simulations and actual clinical events. While in situ simulation can accurately represent many aspects of these events and care was taken to create realistic case templates, teams, clinical environments, and debriefing environments, the connection between post–critical event debriefing in this context and after real events remains inferential. Still, given the standardization issues noted, this approach remains the most viable way to obtain initial psychometric data.
CONCLUSION
Our preliminary analyses indicate that the Debriefing Assessment for Simulation in Healthcare does not have sufficient validity and reliability to rigorously assess debriefing in the post–critical event environment. They further suggest, however, that the tool may be effectively modified for this purpose. Further work will be needed to realize this.
The authors acknowledge the valuable assistance of the International Network for Simulation‐based Pediatric Innovation, Research, and Education (INSPIRE) in the design and implementation of this study.
Figure 2.

Percentage of items within the DASH elements deemed not applicable. This figure displays the percentage of items within the DASH elements that were marked not applicable by the raters. DASH = Debriefing Assessment for Simulation in Healthcare.
Supporting information
Data Supplement S1. In Situ Simulation Cases Used in this Study
AEM Education and Training 2021;5:1–10
Funding for gift cards for research participation was provided by the Office of Medical Education of the Department of Pediatrics at the University of Louisville
The authors have no potential conflicts to disclose.
Author contributions: All authors were involved in the conceptual design of the study and all authors edited, read and approved the final manuscript; SZ, AH, and ML developed, coordinated, and led the simulation scenarios; AC, MS, and DK served as debriefing raters; and AC and GG analyzed and interpreted the generalizability and decision study results.
References
- 1. Zinns LE, O'Connell KJ, Mullan PC, Ryan LM, Wratney AT. National survey of pediatric emergency medicine fellows on debriefing after medical resuscitations. Pediatr Emerg Care 2015;31:551–4.
- 2. Sandhu N, Eppich W, Mikrogianakis A, et al. Postresuscitation debriefing in the pediatric emergency department: a national needs assessment. CJEM 2014;16:383–92.
- 3. Maestre JM, Rudolph JW. Theories and styles of debriefing: the good judgment method as a tool for formative assessment in healthcare. Rev Esp Cardiol (Engl Ed) 2015;68:282–5.
- 4. Wolfe H, Zebuhr C, Topjian AA, et al. Interdisciplinary ICU cardiac arrest debriefing improves survival outcomes. Crit Care Med 2014;42:1688–95.
- 5. Zebuhr C, Sutton RM, Morrison W, et al. Evaluation of quantitative debriefing after pediatric cardiac arrest. Resuscitation 2012;83:1124–8.
- 6. Brett‐Fleegler M, Rudolph J, Eppich W, et al. Debriefing assessment for simulation in healthcare: development and psychometric properties. Simul Healthc 2012;7:288–94.
- 7. Rudolph JW, Simon R, Dufresne RL, Raemer DB. There's no such thing as "nonjudgmental" debriefing: a theory and method for debriefing with good judgment. Simul Healthc 2006;1:49–55.
- 8. Durand C, Secheresse T, Leconte M. [The use of the Debriefing Assessment for Simulation in Healthcare (DASH) in a simulation‐based team learning program for newborn resuscitation in the delivery room]. Arch Pediatr 2017;24:1197–204.
- 9. Fanning RM, Gaba DM. The role of debriefing in simulation‐based learning. Simul Healthc 2007;2:115–25.
- 10. Cheng A, Eppich W, Grant V, Sherbino J, Zendejas B, Cook DA. Debriefing for technology‐enhanced simulation: a systematic review and meta‐analysis. Med Educ 2014;48:657–66.
- 11. Mullan PC, Wuestner E, Kerr TD, Christopher DP, Patel B. Implementation of an in situ qualitative debriefing tool for resuscitations. Resuscitation 2013;84:946–51.
- 12. Ireland S, Gilchrist J, Maconochie I. Debriefing after failed paediatric resuscitation: a survey of current UK practice. Emerg Med J 2008;25:328–30.
- 13. Theophilos T, Magyar J, Babl FE; Paediatric Research in Emergency Departments International Collaboration (PREDICT). Debriefing critical incidents in the paediatric emergency department: current practice and perceived needs in Australia and New Zealand. Emerg Med Australas 2009;21:479–83.
- 14. Tanoubi I, Labben I, Guedira S, et al. The impact of a high fidelity simulation‐based debriefing course on the Debriefing Assessment for Simulation in Healthcare (DASH) score of novice instructors. J Adv Med Educ Prof 2019;7:159–64.
- 15. Aponte‐Patel L, Salavitabar A, Fazzio P, Geneslaw AS, Good P, Sen AI. Implementation of a formal debriefing program after pediatric rapid response team activations. J Grad Med Educ 2018;10:203–8.
- 16. Lopreiato JO. Healthcare Simulation Dictionary. Rockville, MD: Agency for Healthcare Research and Quality, 2016.
- 17. Patterson MD, Blike GT, Nadkarni VM. In situ simulation: challenges and results. In: Henriksen K, Battles JB, Keyes MA, Grady ML, editors. Advances in Patient Safety: New Directions and Alternative Approaches. Vol. 3: Performance and Tools. Rockville, MD: Agency for Healthcare Research and Quality, 2008.
- 18. Messick S. Validity. Princeton, NJ: Educational Testing Service, 1987.
- 19. Messick S. Validity. Educ Meas 1989;3:13–103.
- 20. American Educational Research Association, American Psychological Association, National Council on Measurement in Education, Joint Committee on Standards for Educational and Psychological Testing (U.S.). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association, 2014.
- 21. Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ 2003;37:830–7.
- 22. Cook DA, Zendejas B, Hamstra SJ, Hatala R, Brydges R. What counts as validity evidence? Examples and prevalence in a systematic review of simulation‐based assessment. Adv Health Sci Educ Theory Pract 2014;19:233–50.
- 23. Calhoun A. Simulation for high‐stakes assessment in pediatric emergency medicine. Clin Pediatr Emerg Med 2016;17:212–23.
- 24. Onello RRJ, Simon R. Feedback for Clinical Education (FACE) Rater's Handbook. 2015 ed. Boston, MA: Center for Medical Simulation, 2015.
- 25. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951;16:297–334.
- 26. Tavakol M, Dennick R. Making sense of Cronbach's alpha. Int J Med Educ 2011;2:53–5.
- 27. Prion SK, Haerling KA. Generalizability theory: an introduction with application to simulation evaluation. Clin Simul Nurs 2016;12:546–54.
- 28. Bloch R, Norman G. Generalizability theory for the perplexed: a practical introduction and guide: AMEE Guide No. 68. Med Teach 2012;34:960–92.
- 29. Cronbach LJ, Rajaratnam N, Gleser GC. Theory of generalizability: a liberalization of reliability theory. Br J Stat Psychol 1963;16:137–63.
- 30. Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The Dependability of Behavioral Measurements. New York, NY: John Wiley & Sons, 1972.
- 31. Cook DA, Brydges R, Ginsburg S, Hatala R. A contemporary approach to validity arguments: a practical guide to Kane's framework. Med Educ 2015;49:560–75.
- 32. Feldman M, Lazzara EH, Vanderbilt AA, DiazGranados D. Rater training to support high‐stakes simulation‐based assessments. J Contin Educ Health Prof 2012;32:279–86.
- 33. Woehr DJ, Huffcutt AI. Rater training for performance appraisal: a quantitative review. J Occup Organ Psychol 1994;67:189–205.
- 34. Webb NM, Shavelson RJ. Generalizability theory: overview. In: Encyclopedia of Statistics in Behavioral Science. Chichester: John Wiley & Sons, 2005. p. 717–9.
- 35. Brennan RL. Generalizability Theory. New York, NY: Springer‐Verlag, 2001.
- 36. Chapter 3: Understanding test quality: concepts of reliability and validity. 2018. Available at: https://www.hr‐guide.com/Testing_and_Assessment/Reliability_and_Validity.htm. Accessed January 20, 2019.
- 37. Guttman L. A basis for analyzing test‐retest reliability. Psychometrika 1945;10:255–82.
- 38. Humphrey‐Murto S, Varpio L, Wood TJ, et al. The use of the Delphi and other consensus group methods in medical education research: a review. Acad Med 2017;92:1491–8.
- 39. Kessler DO, Cheng A, Mullan PC. Debriefing in the emergency department after clinical events: a practical guide. Ann Emerg Med 2015;65:690–8.