PLOS ONE. 2021 Jan 14;16(1):e0245439. doi: 10.1371/journal.pone.0245439

Impact of integrating objective structured clinical examination into academic student assessment: Large-scale experience in a French medical school

Alexandre Matet 1,2,3,*,#, Ludovic Fournel 1,4,5,#, François Gaillard 6,#, Laurence Amar 1,7,8, Jean-Benoit Arlet 1,9, Stéphanie Baron 1,6, Anne-Sophie Bats 1,10,11, Celine Buffel du Vaure 12, Caroline Charlier 1,13,14, Victoire De Lastours 1,15,16, Albert Faye 1,17, Eve Jablon 18, Natacha Kadlub 1,19, Julien Leguen 1,20, David Lebeaux 1,21, Alexandre Malmartel 12, Tristan Mirault 1,7,8, Benjamin Planquette 1,22,23, Alexis Régent 1,24,25, Jean-Laurent Thebault 12, Alexy Tran Dinh 1,26,27, Alexandre Nuzzo 1,28, Guillaume Turc 1,29,30, Gérard Friedlander 1,6,31, Philippe Ruszniewski 1,29,32, Cécile Badoual 1,7,33, Brigitte Ranque 1,7,9, Mehdi Oualha 1,34,35,, Marie Courbebaisse 1,6,31,
Editor: Etsuro Ito
PMCID: PMC7808634  PMID: 33444375

Abstract

Purpose

Objective structured clinical examinations (OSCE) evaluate clinical reasoning, communication skills, and interpersonal behavior during medical education. In France, clinical training has long relied on bedside clinical practice in academic hospitals. The need for a simulated teaching environment has recently emerged, due to the increasing number of students admitted to medical schools, and the necessity of objectively evaluating practical skills. This study aimed at investigating the relationships between OSCE grades and current evaluation modalities.

Methods

Three hundred seventy-nine 4th-year students of Université de Paris Medical School participated in the first large-scale OSCE at this institution, consisting of three OSCE stations (OSCE#1–3). OSCE#1 and #2 focused on cardiovascular clinical skills and competence, whereas OSCE#3 focused on relational skills while providing explanations before a planned cholecystectomy. We investigated correlations of OSCE grades with multiple-choice question (MCQ)-based written examinations and with evaluations of clinical skills and behavior (during hospital traineeships); the distribution of OSCE grades; and the impact of integrating OSCE grades into the current evaluation in terms of student ranking.

Results

The grades for the competence-oriented OSCE#1 and OSCE#2 correlated only with MCQ grades (r = 0.19, P<0.001) and traineeship skill grades (r = 0.17, P = 0.001), respectively, and not with traineeship behavior grades (P>0.75). Conversely, the behavior-oriented OSCE#3 grades correlated with traineeship skill and behavior grades (r = 0.19, P<0.001, and r = 0.12, P = 0.032), but not with MCQ grades (P = 0.09). The dispersion of OSCE grades was wider than that of MCQ examination grades (P<0.001). When OSCE grades were integrated into the final fourth-year grade with an incremental 10%, 20% or 40% coefficient, an increasing proportion of the 379 students saw their ranking change by ±50 ranks (P<0.001). This ranking change mainly affected students in the mid-50% of the ranking.

Conclusion

This large-scale French experience showed that OSCEs designed to assess a combination of clinical competence and behavioral skills increase the discriminatory capacity of the current evaluation modalities in French medical schools.

Introduction

Objective structured clinical examination (OSCE) aims to evaluate the performance and skills of medical students, including clinical reasoning, communication skills, and interpersonal behavior [1–4]. OSCE has been proposed as a gold standard for the assessment of medical students’ performance during the ‘clinical’ years of medical school [5, 6] and is used in several countries worldwide [7–10], including the United States and Canada [11–13], which pioneered its integration into medical teaching programs.

The use of OSCE is currently expanding in France, where clinical training has long relied on bedside clinical practice in academic hospitals. To date, in this country, medical knowledge is mainly evaluated using multiple-choice question (MCQ)-based written examinations, whereas the evaluation of clinical skills and behavior relies on subjective, non-standardized assessments at the end of each hospital-based traineeship. Upon completion of the sixth year of medical school, all French students take a final classifying national exam that determines their admission into a residency program. Their admission into a given specialty and a given teaching hospital network is based on their national rank. This national exam currently consists of MCQs only, either isolated, related to progressive clinical cases, or dealing with the critical reading of a peer-reviewed medical article.

However, the need for a simulated teaching environment has recently emerged in French medical schools, due to the increasing number of admitted students and the necessity of objectively evaluating practical skills. In the near future, OSCE will be implemented in the reformed version of the French final classifying national exam, accounting for 40% of the final grade. In this context, medical teachers at Université de Paris Medical School (Paris, France), which comprises the recently merged Paris Nord and Paris Centre sites and ranks among the largest medical schools in France with 400–450 students per study year, designed a large-scale OSCE taken by all fourth-year medical students to assess the impact of such an evaluation on student ranking.

Considering the plurality of evaluation modalities available for medical students, studying the correlations between grades obtained on performance-based tests, such as OSCE, and grades obtained on other academic and non-academic tests is of paramount importance. The aims of this study were (i) to investigate the correlation of OSCE grades with those obtained through current academic evaluation modalities, consisting of written MCQ-based tests and the assessment of clinical skills and behavior during hospital traineeships, (ii) to analyze the distribution of grades obtained during this first large-scale OSCE experience at this institution, and (iii) to simulate the potential impact of integrating OSCE grades into the current evaluation system in terms of student ranking.

Methods

Study population

The 426 medical students completing their fourth year at the Paris Centre site of Université de Paris Medical School (Paris, France) from September 2018 to July 2019 were invited to participate in the large-scale OSCE evaluation organized by the Medical School on May 25, 2019. Students were exempted from the OSCE if they were on night shift the night before or the day of the OSCE, or if they were completing a traineeship abroad at the time of the evaluation (European student exchange program). The education council and review board of Université de Paris approved the observational and retrospective analysis of the grades obtained at the OSCE and at all written and practical evaluations during the 2018–2019 academic year for the fourth-year class. The need for informed consent was waived because all data were anonymized before analysis.

Current evaluation of fourth-year medical students

Hospital-based traineeship evaluation

At the end of each 3-month hospital traineeship, students are evaluated by the supervising physicians in a non-standardized manner in two areas: i) knowledge and clinical skills acquired during the traineeship (50% of the traineeship grade) and ii) behavior, which includes presence, diligence, relationship with the patient, and integration within the care team (50% of the traineeship grade).

Academic evaluation

During the fourth year of medical school, students are divided into three subgroups and enrolled successively in three teaching units (TU), subdivided as follows: TU1 includes cardiology, pneumology, and intensive care; TU2 includes hepato-gastroenterology, endocrinology, and diabetology; and TU3 includes rheumatology, orthopedics, and dermatology. For each subgroup, the evaluation of a given TU takes place at the end of the quarter during which the three specialties of that TU were taught. Thus, the whole class is not evaluated concomitantly for a given TU.

The academic evaluation of each TU lasts 210 minutes. This test comprises three progressive clinical cases including 10 to 15 MCQs, 45 isolated MCQs (15 MCQs per specialty taught in the unit), and 15 MCQs evaluating the critical reading of a scientific article related to one of the specialties taught in the TU.

Calculation of the final grade for each teaching unit

The academic evaluation accounts for 90% of the final grade for a TU, and the knowledge-and-skills grade obtained at the end of the hospital-based traineeship corresponding to this TU accounts for the remaining 10%. The grade obtained for behavior during the hospital traineeship is used to pass the traineeship but is not taken into account in the TU average grade. To pass a given TU, a minimal grade of 50% (≥10/20) must be obtained.
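To make this weighting explicit, the short R sketch below recomputes a TU final grade from hypothetical grade vectors; the variable names and values are illustrative assumptions, not study data.

```r
# Illustrative sketch of the TU final-grade weighting described above,
# using hypothetical grades on the French /20 scale.
academic           <- c(12.5, 14.0, 9.5)   # MCQ-based TU examination grades for three students
traineeship_skills <- c(15.0, 11.0, 13.0)  # knowledge-and-skills grades from the matching traineeship

tu_final <- 0.90 * academic + 0.10 * traineeship_skills
passed   <- tu_final >= 10                 # a TU is passed with a final grade of at least 10/20

data.frame(academic, traineeship_skills, tu_final, passed)
```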

OSCE stations

OSCE scenarios were designed by a committee of 16 medical teachers, according to the guidelines of the Association for Medical Education in Europe [14, 15]. The first OSCE station (OSCE #1) focused on diagnosis (acute dyspnea due to pulmonary embolism secondary to lower-leg deep venous thrombosis), the second (OSCE #2) on prevention (cardiovascular counselling after acute myocardial infarction), and the third (OSCE #3) on relational skills (explaining the indication for cholecystectomy following acute cholecystitis). The OSCE #1, #2 and #3 scenarios and their detailed standardized evaluation grids are presented in S1–S3 Data, respectively. Of note, the first and second stations (OSCE #1 and OSCE #2) dealt with cardiovascular conditions covered in TU1, whereas the third station (OSCE #3) was a hepato-gastroenterology scenario and therefore corresponded to TU2. The items retained in the evaluation grid to assess student performance followed the guidelines of the Association for Medical Education in Europe, which outline four major categories: clinical, cognitive, and psychomotor abilities (grouped and referred to as ‘Competence’), and non-clinical skills and attitudes (grouped and referred to as ‘Behavior’) [14]. This categorization of items shows that OSCE #1, OSCE #2 and OSCE #3 were designed to assess clinical competence and relational skills in different proportions, as displayed in Fig 1.

Fig 1.

Fig 1

Pie charts displaying the proportions of competence-based and behavior-based items in the evaluation grids for OSCE stations #1, #2 and #3 (A, B and C, respectively). Detailed evaluation grids are provided as S1–S3 Data.

Physicians and teachers from all clinical departments at Université de Paris Medical School were recruited to act as standardized patients. The OSCE committee organized several training sessions to explain the script of each OSCE station and to ensure that actions and dialogues were standardized across standardized patients. Moreover, each OSCE scenario was recorded by members of the OSCE committee who had contributed to the scripts, and the videos were made available on a secure online platform for training.

Organization of the OSCE

The test took place on May 25, 2019, concomitantly for all participating students, at three different facilities of Université de Paris, Paris Centre site (Cochin, Necker, and European Georges-Pompidou University Hospitals, Paris, France). The duration of each station was 7 minutes. In each room, two teachers were present: one acted as the standardized patient, and the second evaluated the performance of each student in real time according to a standardized evaluation grid (provided with the OSCE scripts in S1–S3 Data), which was accessed on a tablet connected to the internet. In addition to the 16 members of the OSCE organization committee, 162 teachers of Université de Paris participated as standardized patients or evaluators. To assess quality and inter-standardized-patient reproducibility, OSCE coordinators attended as observers at least one OSCE scenario run by each standardized patient. Homogeneity of training between assessors was maximized by preparatory meetings throughout the academic year preceding the OSCE test, specific training for each OSCE station in small groups by a single coordinating team, and the distribution of video recordings of a standardized patient undergoing each OSCE station. Moreover, homogeneity of motivation between assessors was favored by the fact that all were medical doctors belonging to the same university hospital network, that all were involved to various degrees in medical pedagogy, and that all were participating for the first time in a large-scale pedagogical experiment of an upcoming evaluation and teaching modality.

The proportion of evaluators from the same specialty as the one evaluated in each OSCE station (pneumologists in OSCE1, cardiologists/vascular specialists in OSCE2, and gastroenterologists/digestive surgeons in OSCE3) was ~9.5% across the 3 stations. This proportion was 7%, 12%, and 9% for OSCE 1, 2 and 3, respectively.

Statistical analyses

Descriptive and correlative statistics were computed with GraphPad Prism (version 5.0f, GraphPad Software). Spearman correlation coefficients and Mann-Whitney tests were used where appropriate, owing to the non-normal distribution of grades (ascertained by the density plot shown in Fig 2 and confirmed by the Kolmogorov-Smirnov test, P<0.001 for the distributions of OSCE, MCQ, traineeship skill, and traineeship behavior grades). Categorical distributions were compared using the Chi-square test. Plots were created with the R software (version 3.3.0, R Foundation for Statistical Computing, R Core Team, 2016, http://www.R-project.org/) and the ‘ggplot2’ package. Multivariate analyses were conducted using the R software.
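As an illustration of the rank-based analyses described above, the following R sketch applies a Kolmogorov-Smirnov check, a Spearman correlation, and a Fig 2A-style density plot to simulated grades; the data frame, column names, and distributional parameters are assumptions for demonstration only, not the study data.

```r
# Hypothetical per-student grades (/20), simulated only to make the code runnable;
# the real analyses used the observed fourth-year grades.
set.seed(1)
grades <- data.frame(
  osce_mean = rnorm(379, mean = 13.5, sd = 1.9),
  mcq_mean  = rnorm(379, mean = 12.2, sd = 1.5)
)

# Normality check and rank-based (Spearman) correlation
ks.test(grades$osce_mean, "pnorm", mean(grades$osce_mean), sd(grades$osce_mean))
cor.test(grades$osce_mean, grades$mcq_mean, method = "spearman")

# Density plot in the spirit of Fig 2A
library(ggplot2)
ggplot(grades) +
  geom_density(aes(x = osce_mean), colour = "red") +
  geom_density(aes(x = mcq_mean), colour = "black") +
  labs(x = "Grade (/20)", y = "Density")
```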

Fig 2.

Fig 2

Distribution of mean OSCE grades (red) and mean fourth-year multiple-choice question (MCQ)-based grades (black). (A) Density plot showing the wider dispersion of OSCE grades compared to MCQ grades. (B) Relationship between student rank among the 379-student class, and grades obtained at OSCE and MCQ-based examinations, showing a flatter slope for OSCE and a steeper slope for MCQs, confirming the wider dispersion of OSCE grades than MCQ grades among the fourth-year class.

For certain analyses, competence-oriented items and behavior-oriented items were extracted from OSCEs #1, #2, and #3 and averaged, as previously reported by Smith et al. [16].

To compare the scores obtained at the OSCE with the current evaluation based on MCQ tests, we simulated the potential impact of integrating OSCE scores on the ranking of the fourth-year medical students included in our study. Since the evaluation of teaching units and the national classifying exam both consist of MCQs (isolated, based on progressive clinical cases, or based on the critical reading of a peer-reviewed medical article), we first ranked the fourth-year medical students according to the mean grades obtained in the three teaching units, as if it were the ranking they would obtain on the national classifying exam. To evaluate the potential impact of OSCE on the rankings, we integrated the mean grade for the three OSCE stations into the current evaluation with 10%, 20% and 40% coefficients (based on the planned 40% coefficient for OSCE in the future version of the final classifying exam). We also evaluated the proportion of students who would enter or drop out of the top 20% upon inclusion of OSCE grades with 20% or 40% coefficients.
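The R sketch below illustrates this ranking simulation in principle: students are first ranked on their MCQ-based average, then re-ranked after blending in the mean OSCE grade with 10%, 20% and 40% coefficients, and the number of students shifting by more than 50 ranks is counted. The grade vectors are simulated assumptions; the actual analysis used the observed grades.

```r
# Hypothetical per-student mean grades (/20); the real simulation used observed grades.
set.seed(2)
n    <- 379
mcq  <- rnorm(n, mean = 12.2, sd = 1.5)
osce <- rnorm(n, mean = 13.5, sd = 1.9)

rank_shift <- function(weight) {
  baseline <- rank(-mcq, ties.method = "first")              # ranking on MCQ average only
  blended  <- rank(-((1 - weight) * mcq + weight * osce),
                   ties.method = "first")                     # ranking after OSCE integration
  blended - baseline                                          # per-student change in rank
}

# Students whose rank changes by more than 50 places at each OSCE coefficient
sapply(c(0.10, 0.20, 0.40), function(w) sum(abs(rank_shift(w)) > 50))
```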

Results

Of the 426 students completing their fourth year at the Paris Centre site of Université de Paris Medical School from September 2018 to July 2019, 379 (89%) participated in the first large-scale OSCE test. The descriptive statistics of the average fourth-year MCQ-based grades obtained across the three TUs, the grades for each OSCE station, and the mean OSCE grades are summarized in Table 1. Grades obtained at each OSCE station are provided in S1 Table (https://doi.org/10.6084/m9.figshare.13507224.v1).

Table 1. Descriptive statistics for mean multiple-choice-based examination grades and OSCE grades of the fourth-year class of medical school.

        MCQ-based examinations   Mean OSCE   OSCE #1    OSCE #2    OSCE #3
Mean    12.17                    13.53       12.66      13.33      14.55
SD      1.51                     1.92        3.11       2.74       2.75
Median  12.25                    13.83       13.0       13.50      15.0
Range   6.47–15.71               7.0–18.0    2.0–19.0   3.0–19.0   4.0–20.0

In the French grading system, the maximal grade is 20.

SD = standard deviation; MCQ = multiple-choice question; OSCE = objective structured clinical examination

Correlation between OSCE, MCQ-based grades and hospital-based traineeship grades

Correlations between OSCE grades and MCQ-based grades obtained for each TU, traineeship skills, and traineeship behavior are explored in Fig 3 and Table 2. Positive but weak correlations were identified between the mean OSCE grade and the mean fourth-year MCQ-based examination or traineeship skill grades (r = 0.18, P = 0.001 and r = 0.19, P<0.001, Fig 3A and 3B, respectively). Interestingly, mean OSCE grades did not correlate with traineeship behavior grades (P = 0.28, Fig 3C). A sub-analysis revealed that grades obtained at each OSCE station correlated differently with the other evaluation modalities. OSCE #1 grades correlated with MCQ-based grades (r = 0.19, P<0.001, Fig 3D), but not with traineeship skill or behavior grades (P = 0.32 and P = 0.76, Fig 3E and 3F, respectively). OSCE #2 grades showed a near-significant correlation with MCQ-based grades (P = 0.078, Fig 3G), a correlation with traineeship skill grades (r = 0.17, P = 0.001, Fig 3H), but no correlation with traineeship behavior grades (P = 0.83, Fig 3I). Conversely, OSCE #3 grades correlated with both traineeship skill and behavior grades (r = 0.19, P<0.001 and r = 0.12, P = 0.032, Fig 3K and 3L, respectively), but not with MCQ-based grades (P = 0.09, Fig 3J).

Fig 3. Scatterplots of the relationships between OSCE grades and mean fourth-year teaching unit grades.

Fig 3

(A-C) Mean OSCE grades versus mean fourth-year multiple-choice question (MCQ)-based examination grades, traineeship skill and traineeship behavior grades. (D-F) Mean OSCE #1 grades versus mean fourth-year MCQ-based examination, traineeship skill and traineeship behavior grades. (G-I) Mean OSCE #2 grades versus mean fourth-year MCQ-based examination, traineeship skill and traineeship behavior grades. (J-L) Mean OSCE #3 grades versus mean fourth-year MCQ-based examination, traineeship skill and traineeship behavior grades. To highlight trends, a smoothing regression line was added to each plot using the geom_smooth function (R Software, ggplot2 package). P values and Spearman r coefficients are shown in green for significant correlations and in red for non-significant correlations.

Table 2. Correlation between OSCE grades and mean fourth-year multiple-choice-question-based grades.

                                             Spearman r   P
Mean 4th-year MCQ-based examination grades
  vs. OSCE mean                              0.18         0.001
  vs. OSCE #1                                0.19         0.0003
  vs. OSCE #2                                -            0.078
  vs. OSCE #3                                -            0.094
Mean 4th-year traineeship skills grade
  vs. OSCE mean                              0.19         0.0002
  vs. OSCE #1                                -            0.32
  vs. OSCE #2                                0.17         0.001
  vs. OSCE #3                                0.19         0.0003
Mean 4th-year traineeship behavior grade
  vs. OSCE mean                              -            0.28
  vs. OSCE #1                                -            0.76
  vs. OSCE #2                                -            0.83
  vs. OSCE #3                                0.12         0.032

OSCE = objective structured clinical examination; MCQ = multiple-choice question.

OSCEs #1 and #2 focused on cardiovascular conditions and are predominantly competence-oriented; OSCE #3 focused on hepato-gastroenterology and is predominantly behavior-oriented.

Moreover, of 94 students within the top quarter of the fourth-year class (top 25%) for averaged MCQ-based grades, only 27 students (29%) obtained an averaged OSCE grade (average of OSCE #1–3) within the top quarter. In contrast, 39 (41%) of the 94 students within the top quarter for traineeship skill grades, and 55 (59%) of the 94 students within the top quarter for traineeship behavior grades obtained an averaged OSCE grade within the top quarter (P<0.001, Chi-square test, Fig 4).
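A minimal R sketch of this top-quarter overlap analysis is given below; the grade vectors are simulated stand-ins, and the proportions are compared with a chi-square test on proportions, one plausible implementation of the comparison reported for Fig 4.

```r
# Hypothetical per-student grades (/20); the real analysis used observed fourth-year grades.
set.seed(3)
n        <- 379
mcq      <- rnorm(n, 12.2, 1.5)
skill    <- rnorm(n, 14.0, 1.5)
behavior <- rnorm(n, 15.0, 1.5)
osce     <- rnorm(n, 13.5, 1.9)

top_quarter <- function(x) x >= quantile(x, 0.75)

overlap <- function(reference) {
  idx <- which(top_quarter(reference))    # students in the top 25% of the reference grade
  sum(top_quarter(osce)[idx])             # how many of them are also in the OSCE top 25%
}

counts <- c(MCQ = overlap(mcq), Skill = overlap(skill), Behavior = overlap(behavior))
totals <- sapply(list(mcq, skill, behavior), function(x) sum(top_quarter(x)))

prop.test(counts, totals)                 # chi-square comparison of the three proportions
```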

Fig 4. Proportion of students ranked in the top quarter based on fourth-year teaching unit grades who were ranked within the top quarter of OSCE grades (average of OSCE #1–3).

Fig 4

(A) Multiple-choice question (MCQ)-based examination grades (fourth-year average). (B) Traineeship skill grades (fourth-year average). (C) Traineeship behavior grades. There was a significant difference between the three proportions (P<0.001, Chi-square test).

Table 3 summarizes an additional analysis averaging separately all competence-oriented and all behavior-oriented items from the three OSCE stations. Whereas averaged behavior-oriented items correlated significantly with traineeship skill and behavior grades but not with MCQ-based grades (r = 0.13, P = 0.010; r = 0.11, P = 0.046; and P = 0.079, respectively), averaged competence-oriented items correlated significantly with MCQ-based and traineeship skill grades, but not with traineeship behavior grades (r = 0.15, P = 0.004; r = 0.15, P = 0.003; and P = 0.35, respectively).

Table 3. Correlation between averaged knowledge-oriented and behavior-oriented items composing OSCE grades and mean fourth-year grades.

                                   Spearman r   P
Competence-oriented OSCE items
  vs. overall MCQ-based grade      0.15         0.004
  vs. traineeship skills grade     0.15         0.003
  vs. traineeship behavior grade   -            0.35
Behavior-oriented OSCE items
  vs. overall MCQ-based grade      -            0.079
  vs. traineeship skills grade     0.13         0.010
  vs. traineeship behavior grade   0.11         0.046

OSCE = objective structured clinical examination; MCQ = multiple-choice question

Distribution of grades obtained on the OSCE

As shown in Table 1, grades obtained at the behavior-oriented OSCE #3 station were higher than those obtained at the predominantly competence-oriented OSCE #1 and #2 stations. The dispersion of grades, assessed by the standard deviation, was higher for OSCE than for MCQ-based written examinations (P<0.001), as confirmed graphically in Fig 2.
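The manuscript does not specify which test produced the P<0.001 for this dispersion comparison; one plausible rank-based approach, shown below as an assumption rather than the authors' stated method, compares absolute deviations from the median with a Mann-Whitney test.

```r
# Hypothetical grades matching the reported means and SDs (Table 1), for illustration only.
set.seed(4)
osce <- rnorm(379, mean = 13.53, sd = 1.92)
mcq  <- rnorm(379, mean = 12.17, sd = 1.51)

dev_osce <- abs(osce - median(osce))   # spread of OSCE grades around their median
dev_mcq  <- abs(mcq - median(mcq))     # spread of MCQ grades around their median

wilcox.test(dev_osce, dev_mcq)         # Mann-Whitney test on the absolute deviations
```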

The overall relationship between mean OSCE and MCQ-based grades is displayed in Fig 5. The ratio of OSCE to MCQ-based grades was higher for lower MCQ-based grades; in other words, students with low MCQ-based grades were more likely to obtain a higher grade at the OSCE than at the MCQ-based examination. The regression line shows that this ratio tends towards 1 as MCQ-based grades increase.

Fig 5. Dot plot of the relationship between multiple-choice question (MCQ)-based grades obtained for teaching units and the ratio of OSCE grades to those MCQ-based grades.

Fig 5

This plot highlights graphically that MCQs and OSCE evaluate students differently, since a non-negligible proportion of students obtained better grades at the OSCE than at the MCQs, and more so among students with middle- or low-range MCQ grades. To facilitate reading, the dotted red line indicates an OSCE/MCQ ratio of 1.0. Students with an OSCE/MCQ ratio below 1.0 obtained a lower grade at the OSCE than at the MCQ-based exam.

Cardiovascular and hepato-gastroenterology topics predominated in the OSCE scenarios. Since fourth-year students are divided into three groups that follow the TUs in a rotating order, the quarter during which a student was taught TU1, TU2, or TU3 may have affected OSCE grades. To rule out this potential bias, we computed uni- and multivariable models predicting OSCE grades from the assigned rotating group (TU1/2/3, TU2/3/1 or TU3/1/2 over the three quarters of the academic year) and the examination grades obtained for TU1 (cardiovascular diseases) and TU2 (hepato-gastroenterology). In univariate models, the grades obtained for TU1 (P<0.001) and TU2 (P<0.001), but not the quarter in which the students had received training (P = 0.60), influenced OSCE grades. We also built a multivariable model into which the ‘training quarter’ parameter was forced and found that the only contributing parameter was the TU1 grade (P<0.001) (multivariable model: R2 = 0.040, P<0.001), reflecting the predominance of cardiovascular topics in the OSCE.
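The R sketch below reproduces the structure of these uni- and multivariable linear models on simulated data; the variable names ('tu1', 'tu2', 'group') and values are assumptions introduced for illustration, not the study dataset.

```r
# Hypothetical data: one row per student, with OSCE mean grade, TU1 and TU2 grades (/20),
# and the rotating group defining in which quarter each TU was taught.
set.seed(5)
n   <- 379
dat <- data.frame(
  osce  = rnorm(n, 13.5, 1.9),
  tu1   = rnorm(n, 12.0, 1.6),   # cardiovascular teaching unit grade
  tu2   = rnorm(n, 12.4, 1.6),   # hepato-gastroenterology teaching unit grade
  group = factor(sample(c("TU1/2/3", "TU2/3/1", "TU3/1/2"), n, replace = TRUE))
)

# Univariate models
summary(lm(osce ~ tu1, data = dat))
summary(lm(osce ~ tu2, data = dat))
summary(lm(osce ~ group, data = dat))

# Multivariable model with the 'training quarter' (group) term forced in
summary(lm(osce ~ tu1 + tu2 + group, data = dat))
```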

Impact of integrating OSCE grades into the current evaluation system

We simulated the impact on student ranking of integrating OSCE grades into the fourth-year average grade with an incremental coefficient of 10%, 20% and 40% (Fig 6). As the OSCE coefficient increased, an increasing proportion of the 379 students saw their ranking change by ±50 ranks (n = 2, n = 50 and n = 131 of 379 students, respectively; P<0.001, Chi-square test), as displayed in Fig 6.

Fig 6. Variation in ranking based on the mean fourth-year multiple-choice question (MCQ)-based grades, with incremental percentages of OSCE grade integrated into the final grade.

Fig 6

The upper and lower solid black lines represent thresholds for +50 or -50 rank variation, respectively. Results are displayed for integration of OSCE grade with a 10%, 20% and 40% coefficient.

Moreover, for all coefficients, the rank variation was larger for students in the mid-50% of the ranking than for students in the top or bottom 25%, as shown in Fig 6, and the magnitude of this effect grew as the OSCE coefficient increased. When OSCE grades were integrated with a 10% coefficient, no student in the top or bottom 25%, but 2 students in the mid-50% of the ranking, changed their ranking by ±50 (P = 0.50, Fisher’s exact test). With a 20% coefficient, 7 students in the top or bottom 25%, compared with 46 students in the mid-50% of the ranking, changed their ranking by ±50 (P<0.001, Fisher’s exact test). With a 40% coefficient, 51 students in the top or bottom 25%, compared with 80 students in the mid-50% of the ranking, changed their ranking by ±50 (P = 0.02, Fisher’s exact test).
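As an illustration of one of these comparisons (the 20% coefficient), the sketch below builds the 2x2 table of students who did or did not shift by ±50 ranks, split by ranking stratum, and applies Fisher's exact test; the stratum sizes (roughly 190 students in the top and bottom 25% combined, 189 in the mid-50%) are approximations assumed for the example.

```r
# Counts reported in the text for the 20% coefficient; stratum sizes are approximate.
shifted     <- c(top_bottom_25 = 7,        mid_50 = 46)
not_shifted <- c(top_bottom_25 = 190 - 7,  mid_50 = 189 - 46)

contingency <- rbind(shifted, not_shifted)  # 2 x 2 table: shift status by ranking stratum
fisher.test(contingency)
```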

Regarding the effect of OSCE on the highest-ranking students, integrating the OSCE grade at 10%, 20%, or 40% of the final grade changed the composition of the top 25% of the class (95 students) by 7% (n = 7/95), 15% (n = 14/95), and 40% (n = 38/95 students), respectively (P<0.001, Chi-square test).

Discussion

This study evaluating the impact of a large-scale OSCE on student assessment in a French medical school (i) highlighted weak but statistically significant correlations between OSCE grades and MCQ grades, traineeship skill or traineeship behavior assessments, mainly influenced by the design of the OSCE scenario; (ii) showed a wider dispersion of grades obtained at the OSCE compared with conventional evaluation modalities; and (iii) demonstrated that integrating OSCE marks into the current grading system modified the ranking of students and predominantly affected those in the middle of the ranking.

Previous experiences with OSCE have been reported by several academic institutions worldwide. This OSCE study is among the largest described, with 379 participating students. Major studies from several countries that have assessed the correlation of OSCE with other academic evaluation modalities are summarized in Table 4. It is widely accepted that OSCEs offer the possibility to evaluate different levels and areas of clinical skills [17, 18]. In contrast to conventional MCQs or viva voce examinations, OSCEs are designed to assess student competences and skills rather than sheer knowledge [19], as exemplified throughout the studies listed in Table 4. Yet, there is no precise border between clinical skills and knowledge in a clinical context [16, 18]. The categorization of OSCE items into broad evaluation fields may help extract valuable quantitative parameters reflecting each student’s clinical and behavioral skills, as performed in the present study. The three OSCE stations composing this large-scale test were designed to specifically assess clinical competence and relational skills (referred to as “behavior”) in different proportions. Interestingly, we observed different correlation profiles between the grades at each OSCE station and MCQ, traineeship skill, and traineeship behavior grades. The more competence-oriented OSCE #1 correlated only with MCQ grades; the balanced OSCE #2 correlated near-significantly with MCQ grades and significantly with traineeship skill grades; and the behavior-oriented OSCE #3 correlated with both traineeship skill and behavior grades. These differential profiles confirm the paramount importance of OSCE station design according to its specific pedagogic objectives, as recently pointed out by Daniels and Pugh, who proposed guidelines for OSCE conception [20]. Remarkably, similar correlations have been observed previously in the studies summarized in Table 4 [19, 21–23], which supports the reliability of OSCE as an evaluation tool for medical students [24]. Of note, the weak correlations observed between OSCE grades and the other evaluation modalities in the present study are consistent with the weak correlations reported in the literature (Table 4). They may reflect the fact that OSCEs evaluate skills in a specific manner depending on their design, as compared with conventional assessment methods [19, 22]. Overall, the correlations observed between OSCE grades and classical assessment modalities, and the consistency of these weak correlation levels with those reported in the literature, strongly support the notion that these correlations do not result from chance or from a fluctuation of grades.

Table 4. Previous studies from the literature investigating correlations between OSCE and other academic assessment methods.

Reference Country Students, No. Academic assessment method compared to OSCE Statistical evaluation of the inter-relationship Conclusions
Smith et al (1984) United Kingdom 229 Viva voce examination, in-case assessment (clinical aptitude and written project), MCQ examination, comparable traditional clinical examinations. Significant correlation between OSCE and marks from MCQ (r = 0.34, P<0.001), comparable clinical examination (r = 0.26, P<0.001), and previous in-case assessment (r = 0.24, P<0.001). In contrast to viva-voce examination, OSCE results correlated well with an overall assessment of the student’s ability.
No correlation between OSCE and viva voce examination (r = 0.08, P>0.05). The clinical component of OSCE did not correlate well with MCQ.
Probert et al (2003) United Kingdom 30 Long and short case-based viva voce examinations. Overall performance at traditional finals was correlated with the total OSCE mark (r = 0.5, P = 0.005). Dichotomizing traditional examinations into surgery and medicine assessment resulted in significant correlations between OSCE and surgery marks (r = 0.71, P<0.001) but not between OSCE and medicine marks (r = 0.15, P = 0.4). This was a pilot study for OSCE implementation, and the analyzed sample of students who performed both examination methods was representative of the whole population.
The authors added independent consultant evaluations to assess clinical performance by students. OSCE assesses different clinical domains than do traditional finals and improved the prediction of future clinical performance.
Dennehy et al (2008) USA 62 National Board Dental Examination (NBDE, computerized assessments of theoretical knowledge in part I, and clinical knowledge in part II), and MCQ examinations. NBDE score was statistically associated with OSCE score (P ranging from <0.001 to 0.04). Didactic predictors (NBDE, MCQ examinations) explained around 20% of the variability in OSCE scores.
There was no significant association between OSCE and MCQ scores. OSCE may be a tool that allows educators to assess student capabilities that are not evaluated in typical standardized written examinations.
In multiple regression models none of the didactic predictors were significantly associated with overall OSCE performance.
Sandoval et al (2010) Chile 697 Written examination and daily clinical practice observation guidelines. Positive correlation between percentages of success for all three evaluation methods with OSCE (P<0.001). Pearson’s correlation co-efficient was higher between assessment methods after seven years of OSCE implementation.
These evaluations are complementary.
Kirton et al (2011) United Kingdom 39 per year during a 3-year long evaluation Medicine and pharmacy practice (MPP) written examination combining MCQ and essays expected to relate to clinical practice. Moderate positive correlation between OSCE and MPP (r = 0.6, P<0.001). For 20% of students, experience in OSCE did not increase marks or individual performance.
These two examinations assess different areas of expertise according to Miller’s Pyramid of Competence and both should be performed.
Kamarudin et al (2012) Malaysia 152 Student’s clinical competence component evaluated during the final professional long-case examination. Positive correlation between OSCE and long case for the diagnostic ability (r = 0.165, P = 0.042) and total exam score (r = 0.168, P = 0.039). There is a weak correlation between OSCE and long-case evaluation format. These two assessment methods test different clinical competences.
Tijani et al (2017) Nigeria 612 Long-case examination (end of posting in the 4th and 6th years of medical school), final MCQ, and total written papers (TWP): sum of MCQ examinations and essays. Positive correlation between OSCE and MCQ (r = 0.408), TWP (r = 0.523), and long case (r = 0.374), P<0.001. The total clinical score combining OSCE and long-case marks was a better predictor of student clinical performance than each assessment method analyzed separately.
These two evaluations could be complementary.
Previous experience with OSCE was not taken into account in the analysis.
Lacy et al (2019) Mexico 83 Communication skills evaluated during direct observation of a clinical encounter (DOCE) using the New Mexico Clinical Communication Scale. Students’ matched scores on OSCE and DOCE were not correlated. The discordance between OSCE and DOCE suggests that OSCE may not be an optimal method to identify students requiring additional communication training.
Mean scores were not statistically different between faculty evaluators for individual communication skills (P = 0.2).

No. = number; OSCE = objective structured clinical examination; MCQ = multiple-choice question; NBDE = National Board Dental Examination; MPP = medicine and pharmacy practice; TWP = total written papers; DOCE = direct observation of a clinical encounter.

Importantly, we observed a significantly wider distribution of grades at the OSCE than at the current academic evaluations, which rely essentially on written MCQ tests. This underscores the potential discriminating power of OSCE for student ranking, which matters in the French medical education system and in many other countries where admission into residency programs depends on a single national ranking. Currently, more than 8,000 sixth-year medical students take the French national classifying exam each year. Its outcome has been criticized over the difficulty of accurately ranking such a large number of students based on MCQs only [25].

Finally, this study underlines the potential impact of OSCE on student ranking. OSCE has not been employed in other national settings for the purpose of student ranking, a specificity of the French medical education system, but rather as a tool to improve or evaluate clinical competence. Using a simulation strategy, we observed that the impact of integrating OSCE grades with a 10-to-40% coefficient was greater for students with intermediate ranks, which is important because it suggests that OSCE may contribute to increasing the discriminatory power of the French classifying national exam. This observation follows from the two above-mentioned results: the weak correlation between OSCE and MCQ grades, and the wider distribution of OSCE grades compared with grades from current academic evaluation modalities. At both ends of the distribution of MCQ grades there were fewer students, resulting in larger MCQ grade differences between top- or bottom-ranked students than among middle-ranked students. Therefore, integrating the OSCE grade with a coefficient of up to 40% did not change the composition of the top and bottom ranks. It should be noted, however, that the discriminating ability of OSCE is debated. As pointed out by Konje et al., OSCEs are complementary to other components of medical student examination, such as clinical traineeships, but may not be sufficient to assess all aspects of clinical competence in order to rank students [26]. Moreover, Daniels et al. have demonstrated that the selection of checklist items in the design of OSCE stations has a strong effect on a station’s reliability for assessing clinical competence and, therefore, on its discriminative power [24]. Currently, the French national classifying exam, based on MCQs only, is appropriate for discriminating higher- and lower-level students, but several concerns have been raised over its ability to efficiently discriminate students in the middle of the ranking, where grades are very tight [25, 27]. Moreover, these MCQs assess mainly medical knowledge and have little ability to assess clinical skills [28]. Whether OSCE grades correlate well with real-life medical and behavioral skills could not be assessed in our study, but OSCEs have already proven superior to written examinations for evaluating knowledge, skills, and behavior [19–22]. In addition, the French academic context requires this novel examination modality to possess a high discriminatory power in order to contribute to the national student ranking. Overall, these previous results indicate that OSCE is potentially a relevant and complementary tool for student training and ranking [29, 30].

This study has several limitations. It reports the first experience using OSCE across an entire medical school class at Université de Paris; therefore, students had not been previously trained for this specific evaluation modality. In the future, the impact of integrating OSCE grades may change once French students have trained specifically before taking the final OSCE. Moreover, standardized patients were volunteer teachers from the institution. According to the standards of best practice of the Association of Standardized Patient Educators (ASPE), standardized patients do not have to be professional actors [31]. However, the fact that they were medical teachers may have induced additional stress in students, possibly altering their performance. In addition, contrary to the ASPE guidelines [31], no screening process was applied to the medical educators, who were recruited on a voluntary basis from all clinical departments of our University Hospitals, because 162 educators were required to run all OSCE stations simultaneously. To minimize these biases and homogenize their roles, a well-defined training program was mandatory for teachers who acted as standardized patients. An additional bias may result from inter- or intra-standardized-patient variability in performance over time. We attempted to limit this bias by homogenizing the training of standardized patients during several pre-OSCE meetings, by sharing videos of the expected standard roles, and by having observers from the OSCE committee monitor their performance during the examination. The proportion of evaluators from the same specialty as the one assessed in each OSCE station was <10%, which can be deemed sufficiently low not to bias the evaluation. For future OSCE sessions, the organizing committee of our University should exclude specialists from OSCE stations in their own field. To reduce evaluation bias, care should also be taken to minimize the risk that an evaluator has already evaluated, during a hospital traineeship, one of the students taking his or her OSCE station. Moreover, for practical reasons during this first large-scale session, students were assessed at only three OSCE stations, whereas at least eight stations are usually used for medical school examinations [20, 24]. The ranking of the fourth-year medical students according to the mean of all MCQs of the three TUs will probably not match the ranking these students will obtain two years later on the final national classifying exam. Finally, since teaching programs differ between countries, results from this French study may not be relevant to other education systems.

These results consolidate the current project of expanding the use of OSCE in French medical schools and suggest further developments. Besides increasing the number of stations and diversifying scenarios to cover multiple components of clinical competence, future studies should explore the potential use of OSCE not only as an evaluation tool but also as a learning tool, compared with traditional bedside training. Among other parameters, the impact of OSCE on student grades within a given teaching unit should be investigated. Feedback from students, medical teachers, and standardized patients has been collected and is under analysis to fine-tune the conception and organization of OSCE in France, at both the local and national levels.

In conclusion, this large-scale French experiment showed that OSCE assesses clinical competence and behavioral skills in a manner complementary to conventional assessment methods, as highlighted by the weak correlations observed between OSCE grades and MCQ grades, traineeship skill or behavior assessments. It also demonstrated that OSCE has a valuable discriminatory capacity, as highlighted by the wider distribution of grades obtained at the OSCE compared with grades from current academic evaluation modalities. Finally, it demonstrated the impact of integrating OSCE grades into the current evaluation system on student ranking.

Supporting information

S1 Table. Grades obtained at each OSCE station.

(TXT)

S1 Data. OSCE #1 script and evaluation grid.

(DOCX)

S2 Data. OSCE #2 script and evaluation grid.

(DOCX)

S3 Data. OSCE #3 script and evaluation grid.

(DOCX)

Acknowledgments

The authors thank Mrs. Bintou Fadiga, European Georges-Pompidou Hospital, Necker-Enfants Malades Hospital and Cochin Hospital, AP-HP, Paris, for technical assistance.

Data Availability

The S1 Table file is available from the Figshare repository: https://doi.org/10.6084/m9.figshare.13507224.v1.

Funding Statement

The authors received no specific funding for this work.

References

1. Epstein RM. Assessment in medical education. N Engl J Med. 2007;356: 387–396. doi:10.1056/NEJMra054784
2. O’Sullivan P, Chao S, Russell M, Levine S, Fabiny A. Development and implementation of an objective structured clinical examination to provide formative feedback on communication and interpersonal skills in geriatric training. J Am Geriatr Soc. 2008;56: 1730–1735. doi:10.1111/j.1532-5415.2008.01860.x
3. Casey PM, Goepfert AR, Espey EL, Hammoud MM, Kaczmarczyk JM, Katz NT, et al. To the point: reviews in medical education—the Objective Structured Clinical Examination. Am J Obstet Gynecol. 2009;200: 25–34. doi:10.1016/j.ajog.2008.09.878
4. Brannick MT, Erol-Korkmaz HT, Prewett M. A systematic review of the reliability of objective structured clinical examination scores. Med Educ. 2011;45: 1181–1189. doi:10.1111/j.1365-2923.2011.04075.x
5. Sloan DA, Donnelly MB, Schwartz RW, Strodel WE. The Objective Structured Clinical Examination. The new gold standard for evaluating postgraduate clinical performance. Ann Surg. 1995;222: 735–742. doi:10.1097/00000658-199512000-00007
6. Norman G. Research in medical education: three decades of progress. BMJ. 2002;324: 1560–1562. doi:10.1136/bmj.324.7353.1560
7. Pierre RB, Wierenga A, Barton M, Branday JM, Christie CDC. Student evaluation of an OSCE in paediatrics at the University of the West Indies, Jamaica. BMC Med Educ. 2004;4: 22. doi:10.1186/1472-6920-4-22
8. Nasir AA, Yusuf AS, Abdur-Rahman LO, Babalola OM, Adeyeye AA, Popoola AA, et al. Medical students’ perception of objective structured clinical examination: a feedback for process improvement. J Surg Educ. 2014;71: 701–706. doi:10.1016/j.jsurg.2014.02.010
9. Majumder MAA, Kumar A, Krishnamurthy K, Ojeh N, Adams OP, Sa B. An evaluative study of objective structured clinical examination (OSCE): students and examiners perspectives. Adv Med Educ Pract. 2019;10: 387–397. doi:10.2147/AMEP.S197275
10. Heal C, D’Souza K, Banks J, Malau-Aduli BS, Turner R, Smith J, et al. A snapshot of current Objective Structured Clinical Examination (OSCE) practice at Australian medical schools. Med Teach. 2019;41: 441–447. doi:10.1080/0142159X.2018.1487547
11. Boulet JR, Smee SM, Dillon GF, Gimpel JR. The use of standardized patient assessments for certification and licensure decisions. Simul Healthc J Soc Simul Healthc. 2009;4: 35–42. doi:10.1097/SIH.0b013e318182fc6c
12. Dauphinee WD, Blackmore DE, Smee S, Rothman AI, Reznick R. Using the Judgments of Physician Examiners in Setting the Standards for a National Multi-center High Stakes OSCE. Adv Health Sci Educ Theory Pract. 1997;2: 201–211. doi:10.1023/A:1009768127620
13. Hoole AJ, Kowlowitz V, McGaghie WC, Sloane PD, Colindres RE. Using the objective structured clinical examination at the University of North Carolina Medical School. N C Med J. 1987;48: 463–467.
14. Khan KZ, Ramachandran S, Gaunt K, Pushkar P. The Objective Structured Clinical Examination (OSCE): AMEE Guide No. 81. Part I: an historical and theoretical perspective. Med Teach. 2013;35: e1437–1446. doi:10.3109/0142159X.2013.818634
15. Khan KZ, Gaunt K, Ramachandran S, Pushkar P. The Objective Structured Clinical Examination (OSCE): AMEE Guide No. 81. Part II: organisation & administration. Med Teach. 2013;35: e1447–1463. doi:10.3109/0142159X.2013.818635
16. Smith LJ, Price DA, Houston IB. Objective structured clinical examination compared with other forms of student assessment. Arch Dis Child. 1984;59: 1173–1176. doi:10.1136/adc.59.12.1173
17. Dennehy PC, Susarla SM, Karimbux NY. Relationship between dental students’ performance on standardized multiple-choice examinations and OSCEs. J Dent Educ. 2008;72: 585–592.
18. Probert CS, Cahill DJ, McCann GL, Ben-Shlomo Y. Traditional finals and OSCEs in predicting consultant and self-reported clinical skills of PRHOs: a pilot study. Med Educ. 2003;37: 597–602. doi:10.1046/j.1365-2923.2003.01557.x
19. Tijani KH, Giwa SO, Abiola AO, Adesanya AA, Nwawolo CC, Hassan JO. A comparison of the objective structured clinical examination and the traditional oral clinical examination in a Nigerian university. J West Afr Coll Surg. 2017;7: 59–72.
20. Daniels VJ, Pugh D. Twelve tips for developing an OSCE that measures what you want. Med Teach. 2018;40: 1208–1213. doi:10.1080/0142159X.2017.1390214
21. Kirton SB, Kravitz L. Objective Structured Clinical Examinations (OSCEs) Compared With Traditional Assessment Methods. Am J Pharm Educ. 2011;75. doi:10.5688/ajpe756111
22. Kamarudin MA, Mohamad N, Siraj MNABHH, Yaman MN. The Relationship between Modified Long Case and Objective Structured Clinical Examination (Osce) in Final Professional Examination 2011 Held in UKM Medical Centre. Procedia—Soc Behav Sci. 2012;60: 241–248. doi:10.1016/j.sbspro.2012.09.374
23. Sandoval GE, Valenzuela PM, Monge MM, Toso PA, Triviño XC, Wright AC, et al. Analysis of a learning assessment system for pediatric internship based upon objective structured clinical examination, clinical practice observation and written examination. J Pediatr (Rio J). 2010;86: 131–136. doi:10.2223/JPED.1986
24. Daniels VJ, Bordage G, Gierl MJ, Yudkowsky R. Effect of clinically discriminating, evidence-based checklist items on the reliability of scores from an Internal Medicine residency OSCE. Adv Health Sci Educ Theory Pract. 2014;19: 497–506. doi:10.1007/s10459-013-9482-4
25. Rivière E, Quinton A, Dehail P. [Analysis of the discrimination of the final marks after the first computerized national ranking exam in Medicine in June 2016 in France]. Rev Med Interne. 2019;40: 286–290. doi:10.1016/j.revmed.2018.10.386
26. Konje JC, Abrams KR, Taylor DJ. How discriminatory is the objective structured clinical examination (OSCE) in the assessment of clinical competence of medical students? J Obstet Gynaecol. 2001;21: 223–227. doi:10.1080/01443610120046279
27. Rivière E, Quinton A, Neau D, Constans J, Vignes JR, Dehail P. [Educational assessment of the first computerized national ranking exam in France in 2016: Opportunities for improvement]. Rev Med Interne. 2019;40: 47–51. doi:10.1016/j.revmed.2018.07.006
28. Steichen O, Georgin-Lavialle S, Grateau G, Ranque B. [Assessment of clinical observation skills of last year medical students]. Rev Med Interne. 2015;36: 312–318. doi:10.1016/j.revmed.2014.10.003
29. Pugh D, Bhanji F, Cole G, Dupre J, Hatala R, Humphrey-Murto S, et al. Do OSCE progress test scores predict performance in a national high-stakes examination? Med Educ. 2016;50: 351–358. doi:10.1111/medu.12942
30. Pugh D, Touchie C, Wood TJ, Humphrey-Murto S. Progress testing: is there a role for the OSCE? Med Educ. 2014;48: 623–631. doi:10.1111/medu.12423
31. Lewis KL, Bohnert CA, Gammon WL, Hölzer H, Lyman L, Smith C, et al. The Association of Standardized Patient Educators (ASPE) Standards of Best Practice (SOBP). Adv Simul. 2017;2: 10. doi:10.1186/s41077-017-0043-4

Decision Letter 0

Etsuro Ito

12 Nov 2020

PONE-D-20-30792

Impact of Integrating Objective Structured Clinical Examination into Academic Student Assessment: Large-Scale Experience in a French Medical School

PLOS ONE

Dear Dr. MATET,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 27 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Etsuro Ito

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Additional Editor Comments:

The comments seem very minor. Please revise your MS according to them.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This manuscript reports on a relevant and well conducted study.

Conclusions are meaningful not only for the French system but also for other countries internationally.

I only have minor remarks, already partially addressed by the authors, but for which I would like a bit more clarification and highlighting in the paper.

Methods: Assessment of students using OSCE required a large number of assessors (162+27). How did the investigators ensure the homogeneity of the training and motivation of these assessors?

Results: how do you explain the different correlations between the various OSCE stations (particularly OSCE1 and 2, because OSCE3 is clearly a different exercise) and classical assessment tools? Do they represent real differences with a plausible explanation, or fluctuations of the results compromising their meaning and significance?

Reviewer #2: Matet et al. report a study evaluating the correlations between OSCE and “traditional” French medical student evaluation, which combines hospital-based traineeship evaluation and academic evaluation using MCQs. This study is of importance, as long OSCE will be soon integrated in the French medical school evaluation system and will account for 40% of the final exam grade. Besides, such correlations were so far not described in this setting and give insights into the potential consequences of such a transition, which may take place in other countries.

Among the novel findings, it is reported that only one third of the top 25% MCQ graded students were among the top 25% OSCE graded students. Although some of the MCQ and OSCE grades were statistically correlated, correlations remained poor, despite an interesting sample size.

These discrepancies precisely illustrate the limits of MCQs, which are definitely less able to evaluate the clinical skills of medical students.

Finally, the authors show that OSCE integration may increase the discriminative capacity of the exam, a crucial finding for a ranking exam involving thousands of students…

The paper is overall well written and clear. Statistical analyses are appropriate. The main limitations (absence of student training / limited number of workshops / teachers instead of trained actors) are discussed in detail.

I would have some questions and minor comments:

- Were teachers involved as OSCE patients or evaluators chosen from medical specialties different from the evaluated ones? This should be mentioned, as evaluators should ideally be chosen among other specialties to improve the objectivity of evaluation. Also, there could be an evaluation bias if an evaluator has already evaluated a student in the context of a hospital-based traineeship evaluation.

- I don’t understand the point of figure 5. In the legend, it is written “Students with a ratio OSCE/MCQ lower than 1.0 (dotted red line) have a lower grade on the OSCE than the MCQ-based exam”: isn’t it obvious? Please clarify.

- I would suggest briefly discussing the validity of the traineeship skill and behavior evaluations: their results seem completely disconnected from both MCQ and OSCE grades…

Minor comments:

- Page 6, line 132: “which has two sites that have recently merged, the Paris Nord and Paris Centre sites”: I am not sure this information is relevant for this study.

- Page 6, line 133: “per class” > per year

- Page 7, line 135: “This study was conducted at the Paris Centre site”: this sentence should be in the methods section

- Page 7, line 151: “on duty call”: I don’t understand. Do you mean on night shift?

- Page 10, line 217: “all 379 participating students”: I would suggest removing “379”, since this is part of the results.

Reviewer #3: The study was conducted rigorously. It's an original and very interesting work, especially with the upcoming change of the French final classifying national exam. The sample size, for an OSCE examination evaluation, is large and provides interesting data. The statistical analyses are performed appropriately and the results are clearly expressed, with sufficient detail. The manuscript is written in standard and intelligible English.

I do have a few comments about the standardized patients:

It is mentioned as a limit that the standardized patients are not professional actors but volunteer teachers. In reference to the ASPE SOBP or Howard Barrows' definition, SPs do not have to be professional actors. I would rather specify as a limit the fact that they are teachers, which could create particular stress in students, modifying their performance. Also, nothing is mentioned about the screening process, which is highly recommended for SPs. Finally, nothing is said about SP quality assurance evaluation. How many different SPs played the same patient (risk of inter-SP variability), and how many runs of each scenario did each SP do (intra-SP variability)? Differences may be noted in performances between different SPs or within the performances of the same SP over time. These are also minimal limits or biases that could be reported.

Below you will find a few general comments about what might be typos:

page 11 line 247: the 10% coefficient is not mentioned as in the rest of the manuscript with the 20% and 40% coefficients

page 15 line 328: maybe a mistake in the rotating groups? Isn't the third group TU 3/1/2 rather than 3/2/1?

page 18 line 408: you refer to the impact of integrating OSCE grades with a 5 to 20% coefficient, but in the results it is presented as a 10 to 40% coefficient (same on page 19 line 416: "...up to 20%", isn't it up to 40%?). If it isn't a mistake, it's confusing.

page 19 line 435: typo on the word "school" written "scholl"

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Edouard Louis

Reviewer #2: No

Reviewer #3: Yes: Pr Anne Bellot, Caen University Hospital, Caen Medical School, University of Caen Normandy

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jan 14;16(1):e0245439. doi: 10.1371/journal.pone.0245439.r002

Author response to Decision Letter 0


31 Dec 2020

RESPONSE TO REVIEWERS

We thank the three Reviewers for their constructive comments. We have addressed each of them in the point-by-point response below, which refers to the changes made to the manuscript. Page and line numbers refer to the initial version of the manuscript.

REVIEWER #1: This manuscript reports on a relevant and well conducted study.

Conclusions are meaningful not only for the French system but also for other countries internationally.

I only have minor remarks, already partially addressed by the authors, but for which I would like a bit more clarification and highlighting in the paper.

• Methods: Assessment of students using OSCE required a large number of assessors (162+27). How did the investigators ensure the homogeneity of the training and motivation of these assessors?

We thank Reviewer #1 for raising this issue.

- The homogeneity of the training between assessors was maximized by several means:

o Preparatory meetings throughout the academic year preceding the OSCE test, and specific training for the OSCE in small groups by a single coordinating team for each OSCE station.

o Shared video recordings of a standardized patient undergoing each OSCE station.

o The homogeneity of student evaluation was further optimized by the use of shared evaluation grids (provided in Suppl Material #1-2-3).

- The homogeneity of motivation between assessors is critical, yet it is difficult - if not impossible - to ensure complete homogeneity of motivation. In our view, the major factor ensuring their motivation was that all assessors were medical doctors belonging to the University of Paris hospital network - and therefore all involved to various degrees in medical pedagogy - who participated for the first time in the large-scale pedagogical experiment of an upcoming evaluation and teaching modality.

The following changes were made to the text:

- Methods, p 10 line 225, addition of: “The homogeneity of training between assessors was maximized by preparatory meetings throughout the academic year preceding the OSCE test, specific training for each OSCE station in small groups by a single coordinating team, and the sharing of video recordings of a standardized patient undergoing each OSCE station.”

- Methods, p 10 line 225, addition of: “Moreover, the homogeneity of motivation between assessors was maximized by the fact that all were medical doctors belonging to the same university hospital network, that all were involved to various degrees in medical pedagogy, and that all participated for the first time in a large-scale pedagogical experiment of an upcoming evaluation and teaching modality.”

- We have clarified by removing the expression “27 external supervisors were enrolled” (Methods, p10 line 224), which referred to employees assigned to the organizational, not academic, aspects of the OSCE exam.

• Results: How do you explain the different correlations between the various OSCE stations (particularly OSCE1 and 2, since OSCE3 is clearly a different exercise) and the classical assessment tools? Do they represent real differences with a plausible explanation, or fluctuations of the results that compromise their meaning and significance?

We believe that these discrepancies reflect real differences between the profiles of grades obtained at OSCE1, 2 and 3, and that plausible explanations can account for these different profiles. It is true that the correlation levels are weak, which could suggest that the dispersion of grades is high and that the observed correlations result from data fluctuation. However, several observations from the grades themselves, and similar correlations reported in the literature, support the notion that the observed differences are real and reflect underlying factors.

First, the degrees of correlation observed are consistent with weak correlation levels reported in the literature for similar OSCE grade data.

Second, the correlation profiles observed between the OSCE grades from each station and the conventional modalities were consistent with the content and pedagogical orientation of each OSCE station. OSCE1 and 2 were more competence-oriented, with OSCE1 relying more on clinical knowledge (and correlating only with MCQ grades), and OSCE2 relying on the relational capacity to translate this knowledge to a patient (and correlating only with hospital traineeship skill grades). In contrast, OSCE3 was clearly more behavior-oriented, and correlated with traineeship skill and behavior grades, but not with MCQ grades.

We made the following addition:

Discussion, p18 line 403, addition of: “Overall, the correlations observed between OSCE grades and classical assessment modalities, and the consistency of these weak correlation levels with those reported in the literature, strongly support the notion that these correlations do not result from chance or from a fluctuation of grades.”
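As a purely illustrative aside, the station-by-station correlation check discussed above could be reproduced on a per-student grade table with a few lines of Python. The sketch below is not the authors' analysis code: the file name, the column names ("osce1", "mcq", "traineeship_skill", etc.) and the choice of Pearson correlation are assumptions made for illustration.

```python
# Minimal illustrative sketch, not the authors' analysis code.
# Assumes a hypothetical CSV with one row per student and columns named
# "osce1", "osce2", "osce3", "mcq", "traineeship_skill", "traineeship_behavior".
import pandas as pd
from scipy import stats

grades = pd.read_csv("grades.csv")  # hypothetical per-student grade table

pairs = [
    ("osce1", "mcq"),
    ("osce2", "traineeship_skill"),
    ("osce3", "traineeship_behavior"),
]
for osce_col, conventional_col in pairs:
    # Pearson correlation between one OSCE station and one conventional modality
    r, p = stats.pearsonr(grades[osce_col], grades[conventional_col])
    print(f"{osce_col} vs {conventional_col}: r = {r:.2f}, P = {p:.3g}")
```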

REVIEWER #2: Matet et al. report a study evaluating the correlations between OSCE and “traditional” French medical student evaluation, which combines hospital-based traineeship evaluation and academic evaluation using MCQs. This study is of importance, as OSCE will soon be integrated into the French medical school evaluation system and will account for 40% of the final exam grade. Besides, such correlations had so far not been described in this setting, and they give insights into the potential consequences of such a transition, which may take place in other countries.

Among the novel findings, it is reported that only one third of the top 25% MCQ graded students were among the top 25% OSCE graded students. Although some of the MCQ and OSCE grades were statistically correlated, correlations remained poor, despite an interesting sample size.

These discrepancies precisely illustrate the limits of MCQs, which are definitely less able to evaluate the clinical skills of medical students.

Finally, the authors show that OSCE integration may increase the discriminative capacity of the exam, a crucial finding for a ranking exam involving thousands of students...

The paper is overall well written and clear. Statistical analyses are appropriate. The main limitations (absence of student training / limited number of workshops / teachers instead of trained actors) are discussed in detail.

I would have some questions and minor comments:

• Were teachers involved as OSCE patients or evaluators chosen from medical specialties different from the evaluated ones? This should be mentioned, as evaluators should ideally be chosen among other specialties to improve the objectivity of evaluation. Also, there could be an evaluation bias if an evaluator has already evaluated a student in the context of a hospital-based traineeship evaluation.

- We thank the Reviewer for raising a critical issue in OSCE preparation and practical organization. The selection of evaluators was random among university hospital physicians in clinical departments affiliated with our institution. Based on the Reviewer’s suggestion, we retrospectively assessed the proportion of teachers from the same specialties as the OSCE stations they were assigned to (pulmonologists in OSCE1, cardiologists/vascular specialists in OSCE2, and gastroenterologists/digestive surgeons in OSCE3). This proportion was 9.5% across the 3 stations (7%, 12% and 9% for OSCE 1, 2 and 3, respectively), which can be deemed sufficiently low not to bias the evaluation. For future OSCE sessions, the organizing committee of our University should exclude specialists from OSCE stations of their own field.

- We are also grateful for the second suggestion, which is extremely relevant. The risk that an evaluator had already met one of the students evaluated in his/her OSCE station was very low, since students are assigned to more than a hundred departments for their traineeship positions. Yet it remained possible, and this potential bias should therefore be anticipated in the next OSCE examinations.

We made the following additions to the text:

- Methods, p11 line 230: “The proportion of evaluators from the same specialty as the one evaluated in each OSCE station (pulmonologists in OSCE1, cardiologists/vascular specialists in OSCE2, and gastroenterologists/digestive surgeons in OSCE3) was ~9.5% across the 3 stations. This proportion was 7%, 12%, and 9% for OSCE 1, 2 and 3, respectively.”

- Discussion, p20 line 445, § on study Limitations: “The proportion of evaluators from the same specialty as the one evaluated in each OSCE station was <10%, which can be deemed sufficiently low not to bias the evaluation. For future OSCE sessions, the organizing committee from our University should exclude specialists from OSCE stations of their own field.”

- Discussion, p20 line 447, § on study Limitations: “To reduce evaluation bias, care should also be taken to minimize the risk that an evaluator has already evaluated, during a hospital traineeship, one of the students taking his/her OSCE station.”

• I don’t understand the point of figure 5. In the legend, it is written “Students with a ratio OSCE/MCQ lower than 1.0 (dotted red line) have a lower grade on the OSCE than the MCQ-based exam”: isn’t it obvious? Please clarify.

This part of the legend was explanatory and, indeed, simply described a logical, graphically visible finding.

The main finding displayed in Figure 5 was that a non-negligible proportion of students obtained much better grades at the OSCE than at conventional examination modalities such as MCQs.

In order to stress that our commentary in the legend is explanatory and does not intend to reflect the main finding of this Figure, we have clarified the legend of Figure 5 as follows:

- “Figure 5. Dot plot of the relationship between multiple-choice question (MCQ)-based grades obtained for teaching units and the ratio of the OSCE grades to those MCQ-based grades. This plot highlights graphically that MCQs and OSCE evaluate students differently, since a non-negligible proportion of students obtained better grades at the OSCE than at MCQs, and more so among students with middle- or low-range grades at MCQs.

To facilitate reading, the dotted red line indicates an OSCE/MCQ ratio equal to 1.0. Students with an OSCE/MCQ ratio lower than 1.0 have a lower grade on the OSCE than on the MCQ-based exam.”
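For illustration only, a plot of this kind could be produced along the following lines; the sketch is not the code used to generate the published figure, and the file name and column names ("mcq", "osce") are hypothetical.

```python
# Minimal illustrative sketch of a Figure-5-style dot plot; hypothetical data,
# not the authors' code: MCQ grade on the x-axis, OSCE/MCQ ratio on the y-axis.
import pandas as pd
import matplotlib.pyplot as plt

grades = pd.read_csv("grades.csv")             # hypothetical per-student grades
ratio = grades["osce"] / grades["mcq"]         # OSCE grade / MCQ-based grade

plt.scatter(grades["mcq"], ratio, s=12, color="black")
plt.axhline(1.0, color="red", linestyle=":")   # dotted red line: ratio = 1.0
plt.xlabel("MCQ-based grade")
plt.ylabel("OSCE grade / MCQ grade")
plt.tight_layout()
plt.show()
```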

Minor comments:

- Page 6, line 132: “which has two sites that have recently merged, the Paris Nord and Paris Centre sites”: I am not sure this information is relevant for this study.

The ‘Paris Nord’ site corresponds to the former Paris Diderot University, and the ‘Paris Centre’ site corresponds to the former Paris Descartes University. Both entities merged in 2019, the same year the first large-scale OSCE described in the manuscript was performed. We agree that this information is not essential to the study. However, this study was conducted exclusively on grades from the Paris Centre site, and this is specified in the manuscript (see Comment below on p7, line 151). Another study on this OSCE, which focused on student communication skill training, analyzed all grades from both the Paris Nord and Paris Centre sites (PMID: 32886733). Therefore, we wish to maintain this clarification.

- Page 6, line 133: “per class” > per year.

The term class has been replaced by “study year”.

- Page 7, line 135: “This study was conducted at the Paris Centre site”: this sentence should be in the methods section

We have removed this mention since the “Paris Centre site” is already specified in the “Study Population” paragraph of the Methods section.

- Page 7, line 151: “on duty call”: I don’t understand. Do you mean on night shift?

Yes, we have replaced “duty call” by “night shift”.

- Page 10, line 217: “all 379 participating students”: I would suggest removing “379”, since this is part of the results.

Yes, we have simply rephrased as “all participating students”.

REVIEWER #3: The study was conducted rigorously. It's an original and very interesting work, especially with the upcoming change of the French final classifying national exam. The sample size, for an OSCE examination evaluation, is large and provides interesting data. The statistical analyses are performed appropriately and the results are clearly expressed, with sufficient detail. The manuscript is written in standard and intelligible English.

I do have a few comments about the standardized patients:

• It is mentioned as a limit that the standardized patients are not professional actors but volunteer teachers. In reference to the ASPE SOBP or Howard Barrows' definition, SPs do not have to be professional actors. I would rather specify as a limit the fact that they are teachers, which could create particular stress in students, modifying their performance.

We are grateful to the Reviewer for pointing out the exact definition of a standardized patient according to the ASPE Standards of Best Practice. We fully agree and have modified the manuscript accordingly, as follows.

- Discussion, p20 line 450, removal of: “not professional actors, but…”

- Discussion, p20 line 450, addition of: “According to the standards of best practice from the Association of Standardized Patient Educators, standardized patients do not have to be professional actors [31]. However, the fact that they were medical teachers may have induced additional stress in students, possibly altering their performance. (…) To minimize these biases and homogenize their roles, a well-defined and mandatory training program was implemented for the teachers who acted as standardized patients.”

- Addition of a Reference: Standards of best practice from the Association of Standardized Patient Educators (New reference 31).

• Also, nothing is mentioned about the screening process, which is highly recommended for SPs.

Again, we thank the Reviewer for raising a critical issue regarding the preparation of medical educators contributing to an OSCE. The recruitment of medical educators was made on a voluntary basis: each clinical department of our University Hospitals was contacted to indicate the name of one or several volunteer physicians. However, due to the large number of students taking the OSCE simultaneously, and the fact that two educators per OSCE station were required (one actor and one evaluator), no further screening was performed to refine the selection of educators.

We have indicated this in the Limitations paragraph of the manuscript:

- Discussion, p20 line 450, addition of: “In addition, contrary to the ASPE guidelines, no screening process was applied to medical educators who were recruited on a voluntary basis from all clinical departments in our University Hospitals, because 162 educators were required to run all OSCE stations simultaneously.”

• Finally, nothing is said about SP quality assurance evaluation. How many different SPs played the same patient (risk of inter-SP variability), and how many runs of each scenario did each SP do (intra-SP variability)? Differences may be noted in performances between different SPs or within the performances of the same SP over time. These are also minimal limits or biases that could be reported.

Standardized patient quality assurance is indeed crucial to assess the robustness of an OSCE session. Regarding the OSCE analyzed in the manuscript, 27 different standardized patients played the same patient role for each OSCE scenario. Each scenario was run 12 to 15 times by each standardized patient. OSCE coordinators attended as observers at least one run of each OSCE scenario by each standardized patient. It is not possible to assess accurately the quality of each OSCE actor, but efforts were made to homogenize the acting performance between standardized patients by implementing standardized training, as reported above for the OSCE assessors (changes to the manuscript detailed in our Response to Point 1 from Reviewer #1 above: several training sessions throughout the academic year preceding the OSCE, and videos shared before the OSCE displaying the expected standard performance).

We made the following changes to the manuscript:

- Methods, p10 line 220, addition of: “To assess quality and inter-standardized patient reproducibility, OSCE coordinators attended as observers at least one OSCE scenario run by each standardized patient.”

- Results, p12 line 260, addition of: “Twenty-seven different standardized patients played the same patient role for each OSCE scenario. Each scenario was run 12 to 15 times by each standardized patient.”

- Discussion, p20 line 450, addition of: “An additional bias may result from inter- or intra-standardized-patient variability in performances over time. We attempted to limit this bias by homogenizing the training of standardized patients during several pre-OSCE meetings, by sharing videos of the expected standard roles, and by having observers from the OSCE committee monitor their performance during the examination.”
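As an illustration of how the inter-SP variability raised by the Reviewer could be checked, the following minimal Python sketch performs a one-way ANOVA of station grades grouped by standardized patient. It is not an analysis performed in the study, and it assumes a hypothetical data file in which the identifier of the SP seen by each student ("sp_id") had been recorded alongside the station grade ("osce3").

```python
# Illustrative sketch only: crude check of inter-SP variability, assuming a
# hypothetical table with the grade obtained at one station ("osce3") and the
# identifier of the standardized patient who played the role ("sp_id").
import pandas as pd
from scipy import stats

data = pd.read_csv("osce_station3.csv")        # hypothetical data file
groups = [g["osce3"].to_numpy() for _, g in data.groupby("sp_id")]
f_stat, p_value = stats.f_oneway(*groups)      # one-way ANOVA across SPs
print(f"Inter-SP ANOVA: F = {f_stat:.2f}, P = {p_value:.3g}")
```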

Below you will find a few general comments about what might be typos:

• page 11 line 247: the 10% coefficient is not mentioned as in the rest of the manuscript with the 20% and 40% coefficients

Indeed, these are typos; we are grateful to the Reviewer for pointing them out. We have rephrased as “with 10%, 20% and 40% coefficients”.

• page 15 line 328: maybe a mistake in the rotating groups? Isn't the third group TU 3/1/2 rather than 3/2/1?

Yes, we have rephrased as “TU 3/1/2”.

• page 18 line 408: you refer to the impact of integrating OSCE grades with a 5 to 20% coefficient, but in the results it is presented as a 10 to 40% coefficient (same on page 19 line 416: "...up to 20%", isn't it up to 40%?). If it isn't a mistake, it's confusing.

Yes, we apologize and thank the Reviewer for their understanding. These typos are remnants from the construction of the manuscript, where we had initially considered different coefficients.

- p18 line 408, corrected as “with a 10-to-40% coefficient”

- p19 line 416: corrected as “integrating the OSCE grade with a coefficient up to 40%”

• page 19 line 435: typo on the word "school" written "scholl"

Thank you, this typo was corrected as “medical school”.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Etsuro Ito

2 Jan 2021

Impact of Integrating Objective Structured Clinical Examination into Academic Student Assessment: Large-Scale Experience in a French Medical School

PONE-D-20-30792R1

Dear Dr. MATET,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Etsuro Ito

Academic Editor

PLOS ONE

Acceptance letter

Etsuro Ito

6 Jan 2021

PONE-D-20-30792R1

Impact of Integrating Objective Structured Clinical Examination into Academic Student Assessment: Large-Scale Experience in a French Medical School

Dear Dr. MATET:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Etsuro Ito

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Grades obtained at each OSCE station.

    (TXT)

    S1 Data. OSCE #1 script and evaluation grid.

    (DOCX)

    S2 Data. OSCE #2 script and evaluation grid.

    (DOCX)

    S3 Data. OSCE #3 script and evaluation grid.

    (DOCX)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    The S1 Table file is available from the Figshare repository: https://doi.org/10.6084/m9.figshare.13507224.v1.

