Annals of Surgery Open
2023 Oct 30;4(4):e346. doi: 10.1097/AS9.0000000000000346

Measuring Competency: Improving the validity of your procedural performance assessments

Pamela B Andreatta,* Christopher H Renninger,* Mark W Bowyer,* Jennifer M Gurney†,‡,§
PMCID: PMC10735095  PMID: 38144484

Abstract

Objective:

The objective of the study was to compare the use of ordinal scales and interval scales for capturing surgical competency information for general surgeons performing 3 complex trauma procedures.

Background:

Surgical performance assessment is typically captured using nonparametric data (eg, checklists) that do not support inferential analyses. Interval scales support parametric analyses that are essential for determining competency. We compared assessment outcomes for surgeons performing 3 complex trauma procedures using ordinal and interval scales.

Methods:

All participants were board-certified or eligible general surgeons. Each participant was assessed by an experienced trauma surgeon while performing 3 trauma procedures on cadavers. All assessors completed a rigorous assessment certification process. We calculated descriptive statistics to examine the differences between interval (parametric) and ordinal (nonparametric) outcomes.

Results:

Ordinal scales overestimated competence in up to 100% of the participants and did not identify specific performance gaps. Interval scales provided more granularity and identified specific capability gaps.

Conclusions:

Imprecise instrumentation conveys a false sense of competence and deprives surgeons of opportunities to close capability gaps. Measuring discrete procedural components with interval scales provides a more precise measurement of surgical competency.

Keywords: competency-based assessment, interval scales, ordinal scales, performance-based assessment, surgical competence, surgical performance assessment


Mini abstract: The study compared ordinal and interval scales for assessing 3 complex trauma procedures. Ordinal scales overestimated competence in 37% to 100% of the sample (mean error = 77%). Ordinal scales for competency assessment have lower validity and reliability than interval scales. Interval scales improve assessment validity.

INTRODUCTION

Competency is a critical requirement for surgeons and includes a cluster of highly interrelated performance dimensions. These dimensions give rise to the behaviors needed to effectively perform the work of a surgeon within a specified scope of practice. Surgical competencies facilitate the specification, performance, and assessment of critical work functions, including the measurement criteria for the knowledge, skills, and abilities that underpin efficacious surgical performance.1 Competency-based assessment is the process of collecting evidence and making judgments to determine whether an individual meets the specified performance criteria while performing work within the defined scope of practice; it therefore requires a higher level of both accuracy and fairness than other performance assessment methods because its results document the achievement of specific competency requirements.2–4 But how do we measure performance in a way that allows accurate and fair determination of surgical competencies across the spectrum of professional performance, including all necessary knowledge, skills, and abilities?1 Medicine and surgery are highly complex performance domains, and to accurately, comprehensively, and efficiently measure competency in each aspect requires a deep understanding of measurement science and a willingness to establish evidence-based, fair performance standards.

Performance assessment serves 2 primary purposes: (1) to provide accurate and actionable information to those who are assessed so that they can close any identified performance gaps and (2) to assure that any decisions made about the competence of those being assessed are based on a fair and accurate measurement of capabilities within a defined scope of practice.2,5 In surgery, this is particularly important because the performance environment itself has many uncontrolled variables, especially those related to patient-specific complexities. At best, inaccurate or unfair assessment processes will neither encourage professional mastery nor differentiate between inexpert and expert performers, which is ostensibly the purpose. At worst, an inaccurate assessment of surgical performance could establish false confidence without parallel evidence of competence, which is a dangerous paradigm for surgeons.6,7

At present, performance assessment in surgery is typically captured through the use of nominal scales (eg, checklists) and ordinal scales (eg, global rating scales), neither of which provides parametric data for the types of inferential statistical analyses that are necessary for high-stakes assessment decisions, such as determination of competency.8 Nominal scales (checklists) serve only to distinguish between different categories with no quantitative meaning, and the resulting data may only be examined using nonparametric analyses. Ordinal scales (eg, rating scales) use numbers to distinguish between performance categories and imply a rank order between the categories to reflect relative performance quality.2 However, ordinal measures cannot be assumed to have equal intervals, and therefore the resulting data may only be analyzed using statistical indices such as mode, median, or interquartile range to provide a snapshot of categorical distributions.2,5,8 Competency assessment is a high-stakes examination and necessitates parametric measurement to conduct appropriate inferential statistical analyses of outcomes compared with professional standards.5,9
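To make the distinction concrete, the short Python sketch below (illustrative data only, not study data) summarizes a set of ordinal rankings with the nonparametric indices appropriate to them, and a set of criterion-referenced interval scores with parametric indices:

```python
from collections import Counter
from statistics import mean, median, mode, stdev

# Hypothetical assessor data for one procedural component (not study data).
ordinal_ranks = [2, 3, 3, 2, 4, 3, 1, 3, 2, 3]        # 1 = beginner ... 4 = expert
interval_scores = [0.50, 1.00, 0.25, 0.50, 1.25,       # criterion-referenced
                   1.00, 0.25, 1.00, 0.50, 1.00]       # interval values

# Ordinal (nonparametric) summaries: category membership and rank order only.
print("mode:", mode(ordinal_ranks))
print("median:", median(ordinal_ranks))
print("frequencies:", Counter(ordinal_ranks))

# Interval (parametric) summaries: equal intervals permit mean and SD,
# which in turn support inferential comparisons against a criterion standard.
print("mean:", round(mean(interval_scores), 2))
print("SD:", round(stdev(interval_scores), 2))
```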

Performance measurement instrumentation is derived from a universe of all possible performance observations, from which decision-makers determine the acceptable performance criteria. Generalizability theory provides a framework for evaluating performance measurements by pinpointing the sources of measurement error.10,11 Generalizability theory accommodates 2 types of performance measurements: (1) performance assessed relative to a norm-referenced standard (eg, training year) and (2) performance assessed against an absolute criterion-referenced standard (eg, professional competency standard). Both types of performance assessments are contingent on the precision and accuracy of measurement, each of which is directly related to measurement errors.10–12 Precision refers to the quality of the measurement, and accuracy compares the result to a criterion (professional competency standard). Random errors result from precision limitations of the measurement instrumentation itself (the scale), and a more precise scale will lead to a more accurate measurement by reducing these random errors. For high-stakes assessments, the goal is to reduce as many sources of error as possible by making precise and accurate measurements.10–12 Rank order scales are acceptable for norm-referenced assessment purposes where the relative decision focuses on the rank order of persons. However, for absolute decisions that focus on whether performance meets a criterion-referenced standard, regardless of rank, interval scales reduce the potential for measurement error because they have greater precision and accuracy of measurement.
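A minimal simulation can illustrate the precision argument. The sketch below (an illustrative assumption, not part of the study) measures the same hypothetical continuous performances with a coarse 4-category ordinal instrument and with a finer interval instrument, then compares the resulting random (quantization) error against the underlying values:

```python
import random

random.seed(1)

# Hypothetical "true" continuous performance values on a 0-1 criterion scale.
true_scores = [min(max(random.gauss(0.7, 0.15), 0.0), 1.0) for _ in range(1000)]

def ordinal_4_level(s):
    """Coarse measurement: collapse performance into 4 ranked bins,
    then map each bin back to its midpoint for comparison."""
    edges = [0.25, 0.5, 0.75, 1.0]          # upper edges of the 4 categories
    mids = [0.125, 0.375, 0.625, 0.875]
    for edge, mid in zip(edges, mids):
        if s <= edge:
            return mid
    return mids[-1]

def interval_fine(s):
    """Finer measurement: round to the nearest 0.05 on an interval scale."""
    return round(s / 0.05) * 0.05

def rmse(measured):
    return (sum((m - t) ** 2 for m, t in zip(measured, true_scores))
            / len(true_scores)) ** 0.5

print("RMSE, 4-level ordinal:", round(rmse([ordinal_4_level(s) for s in true_scores]), 3))
print("RMSE, fine interval  :", round(rmse([interval_fine(s) for s in true_scores]), 3))
# The coarser instrument introduces larger random (quantization) error,
# which is the precision limitation described above.
```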

We propose that interval scales used in concert with detailed procedural components provide for a more accurate and fair performance assessment of procedural competencies than ordinal scales. The purpose of the study was to compare procedural assessment outcomes captured by assessors using both interval scales and ordinal scales to score the same individual performing 3 surgical procedures.

METHODS

This research was designated with exempt status by the institutional review board at our university. We followed STROBE guidelines for preparing this article. The study sample included 60 active-duty general surgeons affiliated with the Military Health System, all of whom are required to maintain competencies in the provision of trauma care in addition to their routine surgical practice. All participants were board-certified or eligible general surgeons who had not completed a trauma surgery fellowship. The median postgraduate surgical experience for the sample was 7 years. Data were collected over a 6-month period from fall 2021 through spring 2022.

Assessment Process

Data were collected as part of a trauma surgery course, during which participants performed the specified procedures on fresh cadavers.13 All procedures were time-limited to enhance the authenticity of the assessment context relative to real-life trauma situations. Performance dimensions for each procedure were derived from task deconstruction of all critical steps and subject matter expert consensus. Example dimensions included identification of anatomical structures, procedural sequencing, accurate approach and incisions, technique and instrument use, completeness of required procedural steps, protection of neurovascular structures, avoidance of iatrogenic injury, and the ability to manage complications as necessary. Because of the multifaceted integration of cognitive and kinesthetic components associated with open surgical procedures, we identified 3 complex trauma procedures to anchor the assessment in an authentic performance context: resuscitative thoracotomy, Cattell-Braasch maneuver (right-to-left visceral medial rotation), and supraclavicular exposure of the subclavian artery.14

Each participant was assessed in real time on all 3 procedures in a one-to-one fashion by an experienced trauma surgeon. All procedures were performed in a set sequence and in accordance with the approach prescribed by the American College of Surgeons Committee on Trauma Advanced Surgical Skills for Exposure in Trauma course (https://www.facs.org/quality-programs/trauma/education/asset/). All assessors had completed a rigorous certification training process for using the assessment instrumentation ahead of the course. Assessors scored each procedural component using both interval and ordinal scales, and scored the full procedure holistically using an ordinal scale similar to a global rating scale. Holistic ratings are classified into broad groupings based on the assessor's overall impression.2 To assure fairness given anatomical variances between cadavers, the instrumentation included consensus-derived score weighting, established using the nominal group technique with expert trauma surgeons, to account for differences in case difficulty between cadavers.15 This ensured that participants were not unfairly penalized for issues outside of their control (eg, body habitus or previous surgery).13
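The consensus weights themselves are not published here, and the study does not describe the weighting arithmetic; as one hedged illustration only, a case-difficulty adjustment could be applied multiplicatively to component scores, as in the hypothetical sketch below:

```python
# Hypothetical illustration of consensus-derived case-difficulty weighting;
# the component names, weight value, and multiplicative rule are assumptions,
# not the study's actual instrumentation.
component_scores = {
    "incision": 0.50,
    "anatomic identification": 1.00,
    "procedural sequence": 0.25,
}

# Consensus-derived difficulty multiplier for this cadaver (eg, body habitus,
# prior surgery); 1.0 means average difficulty, >1.0 credits harder cases.
case_difficulty_weight = 1.10

weighted = {step: round(score * case_difficulty_weight, 2)
            for step, score in component_scores.items()}
print(weighted)
```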

The experienced trauma surgeons allowed us to implement criterion referencing, comparing the participants' performance to that of a group composed of individuals with recent, representative, and relevant characteristics relative to those being assessed.2 Normal curves were calculated using data captured from a representative sample of these experienced trauma surgeons (criterion performance), which provided context for interpreting scores relative to the performance criterion. These criterion values provided a way to determine how well the participants performed compared with a trauma surgery standard for each assessed component (eg, incision location, length, and trajectory).2,5 Competency benchmarks were defined as the score corresponding to accurate and independent performance of all procedural components. The scoring rubric for these interval scales is described below.

Ordinal Scale Instrumentation

Ordinal scales use rank order numbers to distinguish between performance categories of increasing performance quality or competency level. We used a common category framework for this study, ranking performance levels from beginner (1) to expert (4), with the competent category (3) reflecting independent and accurate performance. We analyzed the outcomes from this ordinal scale for performance on each procedural component and as a holistic measure of overall procedural performance, the latter of which aligns with the use of global rating scales in surgical performance assessment.1618 The ordinal scale rubric categories and associated rank order values are shown in Table 1.

TABLE 1.

The Ordinal Scale Rank Order Rubric Categories for Procedural Performance Assessment

Category Ranking Description
Beginner 1 Not competent, no independent capabilities
Transitional 2 Not competent, some independent capabilities
Competent 3 Independent, accurate capabilities
Expert 4 Efficient, independent, accurate capabilities

Interval Scale Instrumentation

Criterion referencing relates scores to some predetermined criteria, and quantifying performance categories with interval scores facilitates numeric representation and inferential statistical analyses. The interval scale values for this study were based on normal curve equivalents, with the mean score of 1 indicating independent and accurate performance. Normal curve equivalents are a type of normalized score with equal interval properties and a value range that corresponds exactly to the first and 99th percentile ranks.2,5 Normal curve equivalents are useful because the scores are tied to an established measurement standard and are appropriate for inferential statistical analyses.
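For reference, a conventional normal curve equivalent is obtained by converting a percentile rank to a standard normal deviate and rescaling it so that the 1st, 50th, and 99th percentiles map to 1, 50, and 99; the sketch below shows that standard conversion (the study then rescales these values so that the criterion mean corresponds to a score of 1):

```python
from statistics import NormalDist

def normal_curve_equivalent(percentile_rank: float) -> float:
    """Conventional NCE: an equal-interval rescaling of percentile ranks
    so that the 1st, 50th, and 99th percentiles map to 1, 50, and 99."""
    z = NormalDist().inv_cdf(percentile_rank / 100.0)
    return 50.0 + 21.06 * z          # 21.06 ~= 49 / z at the 99th percentile

for pr in (1, 25, 50, 75, 99):
    print(pr, round(normal_curve_equivalent(pr), 1))
# Unlike raw percentile ranks, these values have equal intervals and can be
# averaged or analyzed with parametric statistics.
```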

To determine the accuracy of a particular measurement, we have to know the ideal value of the measured performance criterion from the universe of possibilities. The ideal value of the criterion is an accepted measured value that is then used to estimate the accuracy of other performance results. If an established ideal value is not known, a theoretical value may be calculated from basic scientific principles and assumed to be the best available ideal value for measurement accuracy. In this study, we used normal curve equivalents based on the standard normal curve to measure the surgical performance of credentialed trauma surgeons performing the procedures using the specified approach. This provided the standard (criterion) determined by credentialed trauma surgeons as the ideal value. We then recalculated the normal curve based on those data to derive normal curve equivalent values for these trauma procedures and used those values for the interval scales, thereby optimizing both accuracy and precision of measurement. We analyzed the outcomes from this criterion-referenced interval scale for performance on each procedural component. The rubric categories and associated scores for each procedural component are shown in Table 2.

TABLE 2.

The Interval Scale Scoring Rubric for Procedural Performance Assessment

Category Score Description
Unsafe practice, iatrogenic error −0.25 Unsafe, inaccurate performance, errors that could/did lead to patient injury/harm
Does not perform (N/A) 0.00 Did not perform the procedural component independently or accurately
Does not meet performance standard 0.25 Performed, but required substantial guidance to complete accurately
Partially meets performance standard 0.50 Performed, but required a few verbal suggestions to complete accurately
Meets performance standard 1.00 Independently performed accurately
Exceeds performance standard 1.25 Independently performed accurately, with efficiency or refined technique
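As a hedged sketch of how such component scores could surface specific gaps, the example below compares hypothetical component scores against the competency benchmark of 1.00 (independent, accurate performance); the component names and the simple decision rule are illustrative assumptions, not the study's instrumentation:

```python
# Illustrative aggregation of component scores against the competency
# benchmark (1.00 = independent, accurate performance on that component).
# Component names and the decision rule are assumptions for illustration.
COMPETENT = 1.00

participant_scores = {
    "exposure and incision": 1.00,
    "anatomic identification": 0.50,   # required a few verbal suggestions
    "procedural sequence": 1.00,
    "hemorrhage control": 0.25,        # required substantial guidance
}

gaps = {step: score for step, score in participant_scores.items()
        if score < COMPETENT}

if gaps:
    print("Not yet competent. Components below standard:")
    for step, score in sorted(gaps.items(), key=lambda kv: kv[1]):
        print(f"  {step}: {score:.2f}")
else:
    print("Meets the competency standard on all assessed components.")
```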

Assessor Protocol

Before each course, all course faculty participated in one-to-one briefings by an expert in performance assessment on how to use the instrumentation and one-to-one mentorship by experienced course instructors to minimize assessor variability and ensure that all assessment protocols were followed. Both performance assessment guidance and oversight by experienced course instructors were available during the course to answer any assessor questions about how to implement the instrumentation. Assessor scoring was compared with that of their mentors (inter-rater reliability), and scores from all assessors were examined for internal consistency by comparing the scores for participants who were also scored by other assessors on different procedures. Assessment outliers were identified and remediated until all assessors were able to meet the assessment requirements.
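The article does not report which agreement statistic was used; one plausible approach, sketched below with hypothetical ratings, is a quadratically weighted Cohen's kappa comparing assessor and mentor scores on the same performances:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired component ratings for the same performances
# (the study does not specify which agreement statistic was used).
assessor_ratings = [3, 2, 3, 4, 2, 3, 1, 3, 2, 4]
mentor_ratings   = [3, 2, 3, 3, 2, 3, 2, 3, 2, 4]

# Quadratic weighting penalizes larger disagreements more heavily,
# which suits ordered rating categories.
kappa = cohen_kappa_score(assessor_ratings, mentor_ratings, weights="quadratic")
print(f"Weighted kappa (assessor vs mentor): {kappa:.2f}")
```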

Statistical Analysis

Because we examined 2 scale forms (ordinal and interval, yielding nonparametric and parametric data, respectively) to quantify procedural performance, we calculated descriptive statistics to examine the differences between the interval scale data and the ordinal scale data for quantifying competency. Ordinal scale outcomes were examined through nonparametric analyses of mode and frequency distributions for each procedural component and overall procedural capabilities. Interval scale outcomes were examined through parametric analyses of group mean and standard deviation for each procedural component and the distribution of individual scores for each procedural component. We compared the number of participants scored as competent for both scale forms at the group and individual level and calculated percentage differences as the error rate between the 2 scales. We then examined the specific information provided to the subjects about which procedural components did not meet the specified competency standard (independent, accurate performance). Descriptive analyses were performed using Microsoft Excel version 16.16.2.

RESULTS

Ordinal Scale for Holistic Procedural Abilities Versus Interval Scales for Procedural Components

The number of participants rated as competent for overall procedural abilities using the ordinal scale was substantially greater than the number scored as competent using the interval scales for each procedural component. The data for the Cattell-Braasch maneuver are shown in Figure 1, with ordinal scale data indicating 48/60 individuals as competent and interval scale data for each procedural component indicating that none of the 60 subjects met the competency standard for independent and accurate performance. The ordinal scale used for the holistic assessment of the Cattell-Braasch maneuver resulted in an 80% error in scoring competence in the subject pool. The ordinal outcomes for resuscitative thoracotomy indicated 42/60 were competent; however, the interval scales indicated 4/60 were competent (63% error). For subclavian artery exposure, the ordinal scale indicated 22/60 were competent and the interval scale indicated none of the subjects met the competency standard (37% error).
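As an arithmetic check, the reported error rates correspond to the difference between the number of participants classified as competent under each scale, divided by the sample size of 60:

```python
# Error rate = (participants deemed competent by the ordinal scale
#               - participants meeting the interval-scale criterion) / sample size.
N = 60
reported = {
    "Cattell-Braasch maneuver": (48, 0),
    "resuscitative thoracotomy": (42, 4),
    "subclavian artery exposure": (22, 0),
}

for procedure, (ordinal_n, interval_n) in reported.items():
    error_rate = (ordinal_n - interval_n) / N * 100
    print(f"{procedure}: {error_rate:.0f}% overestimation")
# Prints 80%, 63%, and 37%, matching the values reported in the text.
```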

FIGURE 1. Comparison of participants meeting competency requirements for the Cattell-Braasch procedure using an ordinal scale for holistic assessment and criterion-referenced interval scale for separate procedural components.

Ordinal Scale for Procedural Components Versus Interval Scales for Procedural Components

Ratings of participant competence for each procedural component using the ordinal scale were also substantially higher than those captured using the interval scales. The data for subclavian artery exposure are shown in Figure 2, with ordinal scale data indicating the group performed competently on 8/10 procedural components and interval scale data for each procedural component indicating that the group did not meet the competency standards for any of the components; again, an 80% error rate. The ordinal outcomes for the Cattell-Braasch maneuver indicated the group was competent for 9/9 procedural components; however, the interval scales indicated that the group did not demonstrate competence for any of the procedural components (100% error). For resuscitative thoracotomy, the ordinal scale indicated the group met the competency standard for 8/8 procedural components, whereas the interval scale indicated none of the subjects met the competency standard (100% error).

FIGURE 2. Comparison of participants meeting competency requirements for subclavian artery exposure procedural components using ordinal scale and criterion-referenced interval scale.

Gap Analyses Comparison of Ordinal Scale Versus Interval Scales for Procedural Components

To examine the gaps in the performance of procedural components between the ordinal and interval scales, we considered the individual outcomes for 2 areas identified across all 3 procedures as being weakest: understanding of relevant anatomy and procedural steps. To illustrate these detailed analyses, we randomly selected 20% of the sample to compare the individual interval scores and ordinal rankings for understanding relevant anatomy and procedural steps for resuscitative thoracotomy. The outcomes shown in Figure 3 for the resuscitative thoracotomy procedure demonstrate the value of the interval scales for identifying specific gap areas and the challenges to both validity and reliability of the ordinal scale for capturing competency data. Participants A–C were ranked as transitional (2) for procedural components with the ordinal scale; however, with the interval scale, participant A was scored as making significant errors in both components and participant B was scored as requiring substantial guidance in understanding the relevant anatomy. Participant C was appropriately scored as requiring minor guidance to perform accurately, which would be considered transitional but not yet competent. Participants D–L were all ranked as competent in understanding both anatomy and procedural steps; however, the details captured with the interval scales suggest a more complicated analysis. The interval scores for participants D and E were identical to those for participant C; however, participants D and E were ranked as competent, not transitional. Participants F and G were scored as competent with the anatomy but made surgical errors with the procedural steps, which would not align with the accurate performance requirement of competence. Participant H was scored as competent with procedural steps but required guidance with anatomy, which would not align with the independent performance requirement of competence. Participant I was scored as demonstrating independence, accuracy, and efficiency for both procedural components with the interval scale but was scored as competent with the ordinal scale. The opposite was true for participant L, who reached the competency benchmarks for both components with the interval scale but was ranked at the expert level with the ordinal scale. Note that the same assessor used both scales, so there is no rater variation underlying these inconsistencies.

FIGURE 3. Comparison of individual performance on 2 procedural components of resuscitative thoracotomy procedure using ordinal and criterion-referenced interval scales.

Similar outcomes were confirmed for the other 2 procedures. The differences between the 2 scale types suggest that the ordinal scale results in data that are not only less valid but also less reliable than data captured with interval scales. It is also evident that the interval scales provide more granularity on each of the procedural dimensions so that individuals can address specific capability gap areas.

DISCUSSION

We examined the differences between ordinal scales and interval scales for assessing procedural performance in an authentic surgical context using cadavers. The same assessor used both scales to assess each participant performing resuscitative thoracotomy, Cattell-Braasch maneuver (right-to-left visceral medial rotation), and supraclavicular exposure of the subclavian artery. The outcomes demonstrate that ordinal scales do not support the assessment requirements for high-stakes applications, such as the determination of procedural competence. The ordinal scales overestimated competence in 37% to 100% of the sample and did not provide information to the individuals being assessed about specific performance gaps that could be closed with additional study or practice. Global (holistic) assessments also lack sufficient performance characterization compared with assessment of each procedural component and result in overestimation of competency.

The differences between the global assessment and the assessment of each procedural component can be explained by the limited variance captured by the ordinal scale at a single assessment point.2,5 Assessment precision and accuracy require that variance within the performance area is fully captured, inclusive of all possible performance outcomes. From a theoretical standpoint, measuring every element of the performance would provide the best data; however, that is neither practical nor feasible for surgery. A practical alternative is to identify the critical performance elements of the procedure and assess those. Critical performance components are those capabilities that, if performed incorrectly or inaccurately, will lead to patient harm or harm to surgical team members, including the surgeon. Focusing on critical performance components assures that the performance assessment will capture the most impactful competency areas in a way that is both practical and feasible.

Scaling is the second way to increase variance and is extremely important in assuring the accuracy and fairness of measurement.2 This requires precise parametric scaling and comprehensive measurement of the critical performance elements. The outcomes using the ordinal scales overestimated competence in up to 100% of the study sample. Determination of competency is a high-stakes proposition that requires a comparison of performance to a statistical model of accepted standards, for example, the normal distribution of performance by credentialed trauma surgeons. Because ordinal scales result in nonparametric data, it is not possible to determine a statistical model of the performance of the population of interest. For high-stakes decision-making—such as verification of competency—it is necessary to both capture the full range of possible performance for the area of interest (eg, trauma surgery) and assure that those who are being assessed are accurately and fairly scored to the accepted performance criteria.

This has significant implications for both surgeons and patients. For surgeons, believing they are competent in areas where they do not meet competency standards deprives them of the opportunity to close capability gaps. It may also lead to a sense of confidence when a focus on continued development and maintenance of acquired abilities would be more appropriate. For patients, a surgeon who is confident but lacks awareness of their performance deficits is a precarious prospect, whereas a surgeon who is aware of their capabilities will operate within the appropriate care paradigm, with constraints or assistance as needed to assure patient safety. Interval scales derived from established criterion standards of performance are optimal for the purpose of valid and reliable competency-based assessment.

The implications of these outcomes for the determination of competency, certification, maintenance of competency and certification, credentialing, licensing, and privileging are substantial. Arguably, the implications of these outcomes for patient safety and quality of care are even more significant. Performance assessments are invaluable for measuring surgical competencies because they measure numerous high-level capabilities within an authentic context that reflects actual surgical care. Surgeons must be able to carry out procedures that directly impact a patient's life. It is not a stretch to consider that most patients would not feel confident in the abilities of surgeons who have not demonstrated competence in the procedures to be performed on them, regardless of how well they may have scored on written or oral examinations of the same content. Arguably, it is more important to assess the performance of authentic procedural tasks because these are the tasks that will form the basis of future performance in applied surgical practice.

During training, ordinal scales may be appropriate for providing feedback and relative-rank information within a cohort (eg, PGY2s) to trainees and those who oversee training programs. However, it is important that surgical educators understand that these ordinal scales provide different information than interval and ratio scales. The respective usefulness of each scale type depends on the purposeful use of the data captured and the degree of precision that is acceptable for decisions made using the data. If the purpose is to capture the competency of the surgeon(s), which has high-stakes implications, an ordinal scale will not provide sufficient information because the performance is not tied to specific criteria that can be measured with sufficient precision to facilitate appropriate inferential analyses.

It would be beneficial for surgeons across the professional continuum to undergo periodic competency-based assessments using valid and reliable performance assessments at all levels, from intern to professional practice.2 The performance areas and criteria may vary with experience, such that surgical trainees have progressively different performance criteria from fellowship-trained, experienced, credentialed surgeons. However, the process of competency-based assessment may also be used to document that graduating residents have demonstrated the abilities necessary to independently and accurately perform procedures within the scope of their training program.

Best practices in performance assessment specify that surgeons should be assessed in an environment that is as authentic to applied surgical care as possible. Practically, compromises are often needed between authenticity and safety, feasibility, and ethics when assessing surgical performance.2,9 Performance assessment in simulated contexts can be accomplished more easily; however, evidence must be captured of the extent to which performance measures in simulated environments correspond to similar measures in applied surgical care, to ensure that outcomes from the simulated context reflect those from the operative context.19 Fortunately, the same assessment instrumentation can be used to assess performance in both contexts and facilitate these analyses.

Limitations

This study examined the differences between ordinal scales (nonparametric data) and interval scales (parametric data) for determining competency in performing trauma procedures on cadaveric specimens. The study was performed using cadavers and not during actual applied surgical care. Additionally, the number of procedural components to be assessed during actual patient care is likely greater, and therefore the differential between the 2 scale forms of procedural skills assessment would likely be even greater.

We appreciate that there are many reports in the literature examining the uses of milestone scales, global rating scales, and checklists, all of which result in nonparametric data. We have simplified the discussion to focus on the overall uses of ordinal and interval scales for competency-based assessments. Ordinal scales that result in nonparametric data for the purposes of providing feedback during training are distinctly different from their use for high-stakes decision-making (determination of competency). Our intention is not to minimize prior work reporting the outcomes from ordinal scales for training purposes. Rather, our intention is to demonstrate that ordinal and interval scales provide very different types of data with distinctly different precision and accuracy, and that surgeons should understand these distinctions to be certain that any competency-based assessment is accurate and fair, especially for potential credentialing considerations.

Conclusions

The use of ordinal scales and a single global assessment of procedural competence overestimated the competence of up to 100% of the surgeons performing resuscitative thoracotomy, Cattell-Braasch maneuver (right-to-left visceral medial rotation), and supraclavicular exposure of the subclavian artery. Ordinal (rating) scales for high-stakes performance assessments do not yield valid and reliable measures of procedural competency. Measuring discrete procedural components with interval scales provides more precise measurement and is no more difficult to implement than assessment with ordinal scales. Just as surgeons require appropriate instruments to achieve optimal outcomes, those who assess their competency deserve no less.

Footnotes

Published online 30 October 2023

Disclosure: The authors declare that they have nothing to disclose.

The study was reviewed by the institutional review board at Uniformed Services University of the Health Sciences.

REFERENCES

  • 1.Andreatta PB, Bowyer MW, Remick K, Knudson MM, Elster EA. Evidence-based surgical competency outcomes from the clinical readiness program. Annals of Surgery. 2021 Dec. [DOI] [PubMed] [Google Scholar]
  • 2.Bandalos DL. Measurement Theory and Applications for the Social Sciences. New York, NY: The Guilford Press; 2018:PP–PP. [Google Scholar]
  • 3.Potgieter TE, Van der Merwe RP. Assessment in the workplace: a competency-based approach. SA J Ind Psychol. 2002;28:60–66. [Google Scholar]
  • 4.McClelland DC. Testing for competence rather than for “intelligence.” Am Psychol. 1973;28:1–14. [DOI] [PubMed] [Google Scholar]
  • 5.Popham WJ. Modern Educational Measurement. 3rd ed. Needham, MA: Allyn & Bacon; 2000. [Google Scholar]
  • 6.Elfenbein DM. Have we created a crisis of confidence for general surgery residents? a systematic review and qualitative discourse analyses. JAMA Surgery. 2016;151:1166–1175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Andreatta P, Lori J. Developing clinical competence and confidence. In Mancini MB, Palaganas J, Ulrich B, eds. Mastering Simulation. 2nd ed. Indianapolis, IN: Sigma Theta Tau International; 2020:23–27. [Google Scholar]
  • 8.DeVellis RF, Thorpe CT. Scale Development: Theory and Applications (Applied Social Research Methods Series). 3rd ed. London: Sage Publications; 2011. [Google Scholar]
  • 9.Andreatta P. Outcome measures and data. In: Nestel D, Hui J, Kunkler K, Calhoun AW, Scerbo MW, eds. Healthcare Simulation Research: A Practical Guide. New York, NY: Springer; 2019:175–182. [Google Scholar]
  • 10.Shavelson RJ, Webb NM. Generalizability theory: 1973–1980. Br J Math Stat Psychol. 1981;34:133–166. [Google Scholar]
  • 11.Shavelson RJ, Webb NM. Generalizability Theory: A Primer. Newbury Park, CA: Sage Publications; 1991. [Google Scholar]
  • 12.Streiner DL, Norman GR. Precision and accuracy: two terms that are neither. J Clin Epidemiol. 2006;59:327–330. [DOI] [PubMed] [Google Scholar]
  • 13.Bowyer MW, Andreatta PB, Armstrong JH, et al. A novel paradigm for surgical skills training and assessment of competency. JAMA Surgery. 2021;156:1103–1109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Andreatta P. Cognitive neuroscience foundations of surgical and procedural expertise: Focus on Theory. In: Nestel D, Reedy G, McKenna L, Gough S, eds. Clinical Education for the Health Professions. Singapore: Springer; 2020. [Google Scholar]
  • 15.McMillan SS, King M, Tully MP. How to use the nominal group and Delphi techniques. Int J Clin Pharm. 2016;38:655–662. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Martin JA, Regehr G, Reznick R, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg. 1997;84:273–278. [DOI] [PubMed] [Google Scholar]
  • 17.Faulkner H, Regehr G, Martin J, et al. Validation of an objective structured assessment of technical skill for surgical residents. Acad Med. 1996;71:1363–1365. [DOI] [PubMed] [Google Scholar]
  • 18.Anderson DD, Long S, Thomas GW, et al. Objective structured assessments of technical skills (OSATS) does not assess the quality of the surgical result effectively. Clin Orthop Relat Res. 2016;474:874–881. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Andreatta PB, Gruppen LD. Conceptualizing and classifying validity evidence for simulation. Med Educ. 2009;43:1028–1035. [DOI] [PubMed] [Google Scholar]
