Abstract
Background:
Teaching trainees the knowledge and skills to perform general anesthesia (GA) for cesarean delivery (CD) requires innovative strategies, as they may never manage such a case during training. We used a multistage design process to create a criterion-referenced multiple-choice test to assess first-year anesthesiology residents' (CA1s') knowledge related to this scenario.
Methods:
Three faculty members created 33 questions, categorized as: (1) physiologic changes of pregnancy (PCP), (2) pharmacology (PHA), (3) anesthetic implications of pregnancy (AIP), and (4) crisis resource management principles (CRM). A three-round Delphi process provided content validation. In round 1, experts (n = 15) ranked questions on a 7-point Likert scale. Questions ranked ≥ 5 in importance by ≥ 70% of experts were retained. Five questions were eliminated, several were revised, and 1 was added. In round 2 (n = 14), consensus was reached on all except 7 questions. In round 3 (n = 14), all questions stabilized. A pilot test of the 29-question instrument, evaluating internal consistency reliability, convergent validity, and item performance, was conducted with the July CA1 class at our institution after a lecture on GA for CD (n = 26, “instructed group”), the July CA1 class at another institution with no lecture (n = 26, “uninstructed group”), CA2s (n = 17), and attendings (n = 10).
Results:
Acceptable internal consistency reliability was demonstrated (ρ = 0.67). Convergent validity coefficients between the uninstructed and instructed CA1 groups suggested theoretical meaningfulness of the 4 subscales: PCP correlated at 0.29 with PHA, 0.35 with CRM, and 0.25 with AIP. PHA correlated with CRM and AIP at 0.23 and 0.28, respectively. The correlation between CRM and AIP was 0.29.
Conclusion:
The test produces moderately reliable scores to assess CA1s' knowledge related to GA for urgent CD.
Keywords: Criterion-referenced test, content validation, empirical validation
Introduction
Cesarean delivery (CD) is the most commonly performed surgical procedure in American hospitals, representing over 30% of all births,1 and most cesarean deliveries are performed under spinal anesthesia (80% of elective cases in stratum III hospitals, centers that provide subspecialty care).2 Even in tertiary care centers with high delivery volumes, rates of general anesthesia (GA) have been reported to be as low as 0.5%.3
GA for CD is associated with persistently higher rates of anesthesia-related adverse events than regional anesthesia.4,5 In New York State, despite a significant decrease in the proportion of CDs performed under GA (from 7.5% in 2003 to 6% in 2012), the overall rate of anesthesia-related adverse events among women receiving GA for CD did not decrease. This declining utilization has raised concern among educators that anesthesiology trainees receive insufficient training to manage this high-risk clinical scenario.3,6,7
For an urgent CD, the learner needs to develop the skill of obtaining a quick, focused medical/surgical/obstetric history and airway exam. In devising the anesthesia plan, both fetal and maternal concerns must be taken into account. Finally, the learner must account for the physiological differences in the parturient, such as the effect of aortocaval compression, the increased risk of difficult intubation and pulmonary aspiration of gastric contents, the risk of uterine atony in response to inhaled volatile agents, the depressant effects of medication on the fetus, and the risk of maternal awareness.
In our institution, residents begin obstetric anesthesiology rotations as early as the third month of their first year of residency, and we must prepare them for the possibility of being involved in the management of a patient undergoing GA for emergent CD from day one of the rotation.
We identified the need for a valid and reliable assessment tool to evaluate trainee competency related to this critical scenario. We describe the multistage design process to create and validate a criterion-referenced knowledge test as an assessment tool for anesthesiology trainees with no clinical experience in obstetric anesthesiology.
Methods
The Columbia University (CU) and University of Miami (UM) Institutional Review Boards approved this study.
Instrument Development
Instrument development comprised four phases: (1) purpose and domain specification, (2) development of survey specifications, (3) content validation, and (4) empirical validation, based on Chatterji's process model (Figure 1).8
Figure 1.

Iterative process for designing and validating a knowledge test.
1. Purpose and domain specification
The target population was novice CA1 (first year) anesthesiology trainees, never exposed to obstetric anesthesia cases. The purpose of the instrument was to assess the degree of trainee proficiency for the domain, “GA for urgent CD.” The design was a criterion-referenced test (CRT), the score of which is a measurement of performance against set criteria indicating mastery of the domain.9
2. Development of survey specifications
The essential areas of knowledge were based on a validated checklist.10 A panel of three CU faculty content experts agreed upon the subdomains: (1) physiologic changes of pregnancy (PCP), (2) pharmacology (PHA), (3) anesthetic implications of pregnancy (AIP), and (4) crisis resource management principles (CRM). The competencies tested within each subdomain were listed as keywords (Table 1). Each content expert was asked to submit ≥ 10 multiple-choice questions (each stem with 1 correct answer and 3 distractors) as a representative sample covering the knowledge content of the indicators in the four subdomains at the level of a novice CA1 trainee. Equal numbers of questions per subdomain were not required. Thirty-six questions were submitted: (1) PCP (n = 12), (2) PHA (n = 4), (3) AIP (n = 14), and (4) CRM (n = 6). Three questions were discarded because they were too easy, leaving 33 questions.
Table 1.

3. Content Validation
A Delphi process was conducted in three rounds for content validation of the questions.
Experts were recruited by email from the Society for Obstetric Anesthesia and Perinatology Research and Education Committees. Twenty experts initially agreed to participate. Panel members were asked to anonymously rate the 33 questions on a 7-point Likert scale, where 1 = “I feel this is not important at all” and 7 = “I feel this is extremely important.” Feedback and suggestions for improving individual questions were encouraged.
4. Empirical Validation
Empirical validation was conducted after the three rounds of the Delphi process.
Participants for pilot testing
The knowledge test was administered to three different groups:
Uninstructed group (UG): The July 2016 CA1 (n = 26) class at UM was selected because they resembled the CU CA1s with respect to background characteristics. This group received no training regarding management of GA for CD.
Instructed group (IG): The 2017 CA1 (n = 26) class at CU received a 1-hour didactic lecture (delivered by A.L.) in the third week of July covering the domain, management of GA for CD, as part of the orientation-month core lecture series. Using a case study of umbilical cord prolapse necessitating emergent CD under GA, the lecture covered the subdomains of physiologic and pharmacodynamic changes in pregnancy, their implications for anesthetic management, and the crisis management, teamwork, and communication skills necessary to safely conduct GA for emergent CD.
Expert group (EG): Ten attending anesthesiologists and volunteer CA2 (second-year) residents from the UM (n = 10) and CU (n = 7) classes took the same test (completed July 2016). This expert group was used in a sensitivity analysis to further verify whether the knowledge test validly and reliably assesses the GA for urgent CD knowledge domain.
Frequency polygons of the UG and IG scores were plotted to verify the consistency of the expert-selected cut score. Internal consistency reliability was measured with the Hoyt analysis of variance (ANOVA) method; a coefficient ≥ 0.70 was considered desirable. Item analysis was conducted with methods from classical test theory.11
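The Hoyt ANOVA approach can be sketched briefly. This is a minimal illustration in Python with a synthetic 0/1 response matrix, not the SPSS procedure used in the study; Hoyt's person-by-item ANOVA coefficient is algebraically equivalent to Cronbach's alpha:

```python
import numpy as np

def hoyt_reliability(scores):
    """Hoyt ANOVA estimate of internal consistency reliability:
    r = (MS_persons - MS_residual) / MS_persons, from a two-way
    persons-by-items decomposition of an (n_persons, n_items)
    matrix of 0/1 item responses."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    # Sums of squares for persons, items, and the residual
    ss_persons = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ss_resid = ss_total - ss_persons - ss_items
    ms_persons = ss_persons / (n - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))
    return (ms_persons - ms_resid) / ms_persons
```

A coefficient below the 0.70 cutoff (such as the 0.67 observed here) indicates only moderate score reliability.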
Based on the calculated item discrimination index (D), we used the following guidelines12 to interpret CRT item analysis results:
If D <10%, the item should be removed.
If 10% ≤ D < 20%, the item should be revised.
If D ≥ 20%, the item is functioning well.
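As a concrete illustration, the index and the guideline thresholds above can be sketched in a few lines. This is an assumption-laden sketch (helper names are ours), with D taken as the percentage-point difference in proportion correct between criterion groups, e.g., instructed versus uninstructed:

```python
def discrimination_index(criterion_correct, comparison_correct):
    """Percentage-point difference in the proportion answering an item
    correctly between the criterion group (e.g., instructed) and the
    comparison group (e.g., uninstructed). Inputs are lists of 0/1
    correctness indicators, one per examinee."""
    p_hi = sum(criterion_correct) / len(criterion_correct)
    p_lo = sum(comparison_correct) / len(comparison_correct)
    return 100.0 * (p_hi - p_lo)

def item_disposition(d):
    """Apply the CRT guidelines: remove (<10%), revise (10-20%),
    keep (>=20%)."""
    if d < 10:
        return "remove"
    if d < 20:
        return "revise"
    return "keep"
```

For example, an item answered correctly by 80% of the instructed group but only 50% of the uninstructed group has D = 30% and would be retained.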
Convergent validity was assessed by examining the intercorrelations among the four subdomains (Table 4). Convergent validity coefficients were calculated with the Spearman rank-order correlation. All the aforementioned analyses were performed first with the UG and IG to establish evidence of validity and reliability for the criterion-referenced knowledge test; we then performed a sensitivity analysis with the UG and EG to cross-validate the UG/IG results. Experts were consulted to set a standard, or cut score, for the question bank, the standard being the minimum competency expected of a CA1 resident after training in the clinical scenario. All analyses were performed using SPSS statistical software (version 20.0; IBM Corporation, Armonk, NY). A P value ≤ 0.05 was considered statistically significant.
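The Spearman rank-order coefficient used for these intercorrelations (computed in SPSS in the study) is simply the Pearson correlation of the rank-transformed scores, with ties assigned their average rank. A minimal Python sketch:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson correlation of the
    ranks, with tied values given their average rank."""
    def rank(v):
        v = np.asarray(v, dtype=float)
        order = np.argsort(v, kind="mergesort")
        ranks = np.empty(len(v))
        ranks[order] = np.arange(1, len(v) + 1)
        # Average the ranks over tied values
        for val in np.unique(v):
            mask = v == val
            ranks[mask] = ranks[mask].mean()
        return ranks
    rx, ry = rank(x), rank(y)
    return np.corrcoef(rx, ry)[0, 1]
```

Because it operates on ranks, the coefficient captures any monotonic association between subdomain scores, which suits ordinal test sub-scores better than a Pearson correlation on raw values.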
Table 3.
Item analysis results for the uninstructed group (UG) and instructed group (IG)

Results
Content Validation Results
Fifteen experts participated in Round 1 (completed April 2016). The mean and median rankings for individual questions are shown in Table 2. The criterion for question elimination was established a priori: questions ranked ≥ 5 in importance by ≥ 70% of participants were retained. Questions #6, #11, #15, #19, and #24 were eliminated, having received ratings of 5 to 7 from 66.7%, 53.3%, 66.7%, 60%, and 33.3% of participants, respectively. Several questions were revised, and 1 new question was added based on suggestions by experts.
Table 2.

In Round 2 (completed May 2016), participants rated the 28 remaining questions and the new question in the same manner as Round 1. Consensus was defined as (1) a change of ≤ 10% in the mean score for each item, and (2) after individuals were grouped into quartiles, a change of ≤ 5% in the average of individual total scores across all items by quartile. Fourteen responses were received. Consensus was reached on all questions except #17 and #28, the new question (#34), and 4 questions that had been revised based on Round 1 feedback (#8, #12, #25, and #26).
In Round 3 (completed June 2016), experts rated, in the same manner as Round 1, the 7 items for which consensus had not been reached or which had undergone significant revision. Fourteen responses were received. All questions were found to have stabilized. The final numbers of questions within each category were (1) PCP (n = 8), (2) PHA (n = 3), (3) AIP (n = 11), and (4) CRM (n = 7).
Empirical Validation Results
The overlapping frequency polygons of the UG and IG suggested that an appropriate cut score lay between 20 and 21, where the two distributions first intersected (Figure 2). A panel of three experts nonetheless agreed that a high cut score of at least 25 was desirable to demonstrate mastery of the domain. The 29-item instrument demonstrated acceptable internal consistency reliability (ρ = 0.67). Table 3 shows the item analysis results for the UG and IG. On the item discrimination index, only 3 items met the criterion for a well-functioning item (D ≥ 20%) and clearly warranted retention. The convergent validity coefficients (Table 4) for the UG/IG suggested theoretical meaningfulness of the four subdomains: PCP correlated at 0.29 with PHA, 0.35 with CRM, and 0.25 with AIP. PHA correlated with CRM at 0.23 and with AIP at 0.28. The CRM-AIP correlation was 0.29. The subdomains also demonstrated strong, positive correlations with total scores (0.54 to 0.74). Consistent with theoretical expectations, the positive intercorrelations suggested construct validity of the four measures in assessing knowledge pertinent to the conduct of GA for urgent CD.
Figure 2.

Standard-setting for the criterion-referenced knowledge test.
Table 4.
Convergent validity results for the uninstructed group (UG) (n=25) and instructed group (IG) (n=25)

To cross-validate the UG/IG results, we performed a sensitivity analysis comparing the UG and EG (Tables 5 and 6); reliability and validity were similar in direction and magnitude. Six items were well-functioning and 6 more were borderline on the item discrimination index. The sensitivity analysis further supported the validity and reliability of this CRT for assessing knowledge pertaining to GA for urgent CD in novice CA1s.
Table 5.
Item analysis results for the uninstructed group (UG) and expert group (EG)

Table 6.
Convergent validity results for the uninstructed group (UG) (n=25) and expert group (EG) (n=25)

Discussion
This study describes the stages of development of a valid and reliable instrument to assess CA1 trainees' knowledge related to the conduct of GA for urgent CD. Reasonable internal consistency reliability and good convergent validity were demonstrated, but the instrument currently lacks internal structure evidence. Instrument validation is an iterative process (Figure 1). We believe that while the current test has utility for measuring novice trainee knowledge, revisions are warranted to achieve greater robustness.
The discrimination indices between the instructed and uninstructed groups showed only 3 well-performing questions; however, the comparison between the uninstructed and expert groups showed that 6 questions performed very well and 6 were borderline (D > 15%), yielding 12 acceptable items (highlighted in Table 5). The lack of separation between the uninstructed and instructed groups may reflect the fact that the instructed group were still inexperienced novices, despite having received the lecture; the intent was not to test the effectiveness of the lecture. We also acknowledge the limitation of applying a written test to verify competency in skills such as CRM.
Next steps will include consultation with experts to agree upon the disposition of the worst-performing items. If the underlying knowledge being tested for those items is considered important (as had been indicated by the Delphi process) those questions may need to be rewritten as opposed to being discarded, followed by additional rounds of pilot testing. To improve the ability to discriminate between experts and novices, we will consider weighting individual item scores by level of difficulty—easier items that are still considered to be critical knowledge would be assigned a lower score value.
With the shift towards competency-based milestones in graduate medical education, the development of reliable assessment tools to track training progress is invaluable.13,14 We envision the fully validated instrument serving as a benchmark for trainees, allowing faculty to identify and bridge knowledge gaps related to this infrequently encountered clinical scenario.
Footnotes
Funding: Funded by a Gertie Marx grant from the Society for Obstetric Anesthesia and Perinatology
References
- 1. Hamilton BE, Martin JA, Osterman MJK, Curtin SC. Births: preliminary data for 2014. Natl Vital Stat Rep. 2015;64(6):1–19.
- 2. Bucklin BA, Hawkins JL, Anderson JR, Ullrich FA. Obstetric anesthesia workforce survey: twenty-year update. Anesthesiology. 2005;103(3):645–53. doi: 10.1097/00000542-200509000-00030.
- 3. Palanisamy A, Mitani AA, Tsen LC. General anesthesia for cesarean delivery at a tertiary care hospital from 2000 to 2005: a retrospective analysis and 10-year update. Int J Obstet Anesth. 2011;20(1):10–6. doi: 10.1016/j.ijoa.2010.07.002.
- 4. Guglielminotti J, Wong CA, Landau R, Li G. Temporal trends in anesthesia-related adverse events in cesarean deliveries, New York State, 2003–2012. Anesthesiology. 2015;123(5):1013–23. doi: 10.1097/ALN.0000000000000846.
- 5. Hawkins JL, Chang J, Palmer SK, Gibbs CP, Callaghan WM. Anesthesia-related maternal mortality in the United States: 1979–2002. Obstet Gynecol. 2011;117(1):69–74. doi: 10.1097/AOG.0b013e31820093a9.
- 6. Hawthorne L, Wilson R, Lyons G, Dresner M. Failed intubation revisited: 17-yr experience in a teaching maternity unit. Br J Anaesth. 1996;76(5):680–4. doi: 10.1093/bja/76.5.680.
- 7. Hawkins JL, Gibbs CP. General anesthesia for cesarean section: are we really prepared? Int J Obstet Anesth. 1998;7(3):145–6. doi: 10.1016/s0959-289x(98)80000-9.
- 8. Chatterji M. Designing and Using Tools for Educational Assessment. Boston, MA: Allyn & Bacon/Pearson; 2003. pp. 105–110.
- 9. Chatterji M. Designing and Using Tools for Educational Assessment. Boston, MA: Allyn & Bacon/Pearson; 2003. p. 85.
- 10. Scavone BM, Sproviero MT, McCarthy RJ, et al. Development of an objective scoring system for measurement of resident performance on the human patient simulator. Anesthesiology. 2006;105(2):260–6. doi: 10.1097/00000542-200608000-00008.
- 11. Crocker LM, Algina J. Introduction to Classical and Modern Test Theory. Mason, OH: Wadsworth Group/Thomson Learning; 2006. pp. 311–335.
- 12. Crocker LM, Algina J. Introduction to Classical and Modern Test Theory. Mason, OH: Wadsworth Group/Thomson Learning; 2006. p. 329.
- 13. Boulet JR, Murray D. Review article: assessment in anesthesiology education. Can J Anaesth. 2012;59(2):182–92. doi: 10.1007/s12630-011-9637-9.
- 14. Cook DA, Zendejas B, Hamstra SJ, Hatala R, Brydges R. What counts as validity evidence? Examples and prevalence in a systematic review of simulation-based assessment. Adv Health Sci Educ Theory Pract. 2014;19(2):233–50. doi: 10.1007/s10459-013-9458-4.
