ABSTRACT
The purpose of this study was to compare student performance and question discrimination for multiple-choice questions (MCQs) that followed a standard format (SF) versus those that did not, termed here non-standard format (NSF). Medical physiology exam results from approximately 500 first-year medical students, collected over a five-year period (2020–2024), were used. Classical test theory item analysis indices, including discrimination (D), point-biserial correlation (rpbis), distractor analysis for non-functional distractors (NFDs), and difficulty (p), were determined and compared across MCQ format types. The results presented here are the mean ± standard error of the mean (SEM). The analysis showed that D (0.278 ± 0.008 vs 0.228 ± 0.006) and rpbis (0.291 ± 0.006 vs 0.273 ± 0.006) were significantly higher for NSF questions than for SF questions, indicating that NSF questions provided more discriminatory power. In addition, the percentage of NFDs was lower for NSF items than for SF items (58.3 ± 0.019% vs 70.2 ± 0.015%). The NSF questions were also more difficult than the SF questions (p = 0.741 ± 0.007 for NSF; p = 0.809 ± 0.006 for SF). Thus, the NSF questions discriminated better, had fewer NFDs, and were more difficult than SF questions. These data suggest that using the selected non-standard item formats can enhance the ability of MCQs to discriminate higher performers from lower performers and can provide more rigour for exams.
KEYWORDS: Multiple choice, item writing, first-year medical student, physiology, test development
Introduction
Examinations are ingrained in academic culture and, as most educators will attest, writing quality test questions is a daunting task. An instrument aligned to the needs and level of the examinees can provide useful information. Instructors use test results to ensure learning goals are met, identify students who are struggling, and identify materials, activities, or teaching strategies for revision or restructuring [1,2]. Examinees may use the results to evaluate strengths (and gaps) in content knowledge. For test results to provide meaningful information, the instrument must align the content to be measured, the characteristics of the examinees, and the assessment’s purpose [3].
With respect to medicine, a key element of training is to prepare graduates to be independent thinkers who can make clinical judgements based on evaluation and analysis of information. Although examinations can take many forms, our focus is on objective tests with a multiple-choice question (MCQ) format. This format is prevalent for many different testing situations, from low stakes (e.g. classroom tests) to high stakes tests (e.g. professional licensure examinations). MCQ tests have numerous other advantages, such as the ability to assess many examinees easily, quickly, objectively, and reliably [1]. Additionally, MCQ tests are applied to a wide age range of learners, from preschool children to adults [1].
In the health professions, objective testing with MCQ items plays a key role [4–6]. To aid test development, numerous papers have offered guidelines [1,2,4,5,7–10]. These guidelines provide an excellent starting point and suggest many “dos and don’ts” for MCQ item writing, thus providing resources to create MCQs that assess and foster critical thinking skills. Unfortunately, there is not universal agreement, nor research support, regarding guidelines that test developers should follow. For example, out of a list of 40 item writing rules, only four rules were backed by empirical support [8].
Frey et al.’s [8] review of item writing guidelines found that two of the most popular guidance statements were to avoid using “All of the Above” (AOTA) or “None of the Above” (NOTA) as MCQ options, with 80% of the sources reviewed advising against AOTA and 75% against NOTA. Further, negative wording in the item stem is typically discouraged because of possible misinterpretation (e.g. [9]).
Considering item type, the National Board of Medical Examiners (NBME) reviews different MCQ formats for inclusion (or exclusion) on exams [7]. Item types differ with respect to psychometric properties. The K-type question (Table 1) organises the options into multiple true-false sets, where a combination is correct only if every statement it contains is true. K-type items have been found to be more difficult than MCQs with one correct answer [11,12]; however, K-type items may decrease item discrimination and overall test reliability if test-wiseness skills are used to arrive at the correct answer [7,13]. The C-type question (Table 1) requires the test taker to judge the truth of two options and indicate whether one, both, or neither is correct, thus including NOTA and AOTA within an item. Building on the C-type, the modified K-type (Table 1) adds a third option for evaluation, making this format more complex than the C-type.
Table 1.
Item formats.
| K-type | C-type | Modified K-type |
| --- | --- | --- |
| Item stem | Item stem | Item stem |
| 1. Distractor A | A. Distractor A | A. Distractor A |
| 2. Distractor B | B. Distractor B | B. Distractor B |
| 3. Distractor C | C. BOTH A and B are correct | C. Distractor C |
| 4. Distractor D | D. NEITHER A nor B is correct | D. A and C are BOTH correct |
| Choose one of the following: |  | E. A, B and C are ALL correct |
The NBME provides assistance to medical students and faculty to aid in exam preparation. For example, K-type items are no longer used on NBME exams; however, NBME guidelines state that these items can be useful in classroom testing situations if format strengths and weaknesses are carefully considered so that exam quality is not compromised [7]. Similarly, the C-type and modified K-type items depicted in Table 1 are often referred to as complex MCQ items, and their use is generally not recommended [14]. Psychometric guidelines state that the purpose of the test should align with the intended use of the information. Given that medical professionals may need to systematically consider and eliminate numerous patient signs and symptoms, shouldn’t classroom practice extend to exams that include MCQ items that foster and assess such skills? By including some of these “non-recommended” item types on lower stakes (i.e. classroom) tests, can we promote critical thinking? If so, psychometric indices should differentiate between items to suggest which types work best to foster needed professional skills.
The purpose of this study was to examine five item construction “rule violations” to determine whether there is evidence that such question types confound exam quality. Rule-violating MCQs are referred to as non-standard format (NSF) and items following the guidelines as standard format (SF). Note that this is a retrospective analysis of data; we did not specifically create questions a priori for comparison. For over a decade, faculty have written both NSF and SF questions, and examining a retrospective database helps eliminate bias that may occur when writing items for a defined study.
Methods
Participants and Course Content
The results from first-year medical physiology exams were collected over a five-year period (2020–2024). Each course was taught in the spring semester to roughly 100 students, each of whom took four exams over the semester. Each exam contained 50 MCQs covering the topics indicated below (Table 2). Thus, each year, approximately 100 students answered a total of 200 questions spread over four exams, and for the data analysed in this study, roughly 500 students answered 1000 items. All exams were taken on the students’ laptops using ExamSoft software. We included only this five-year period because use of ExamSoft began in 2020, thus avoiding any possible complications associated with paper versus electronic test-taking during previous administrations.
Table 2.
Medical physiology: four testing situations and material covered.
|  | Topics |
| --- | --- |
| Exam 1 | Biophysics; Muscle; Autonomic nervous system; Basic sensory; Gastrointestinal |
| Exam 2 | Cardiovascular |
| Exam 3 | Renal; Endocrine; Reproductive |
| Exam 4 | Pulmonary; Integrative |
Content covered in the course and by examination is provided in Table 2. For uniformity of course delivery, the sections were largely taught by the same faculty member across all five years studied. One exception was Muscle Physiology (3 hours), which had a different lecturer in 2020, in 2021–2022, and in 2023–2024. As part of the standardisation process, course instructional methods generally did not change over the five-year period. An exception was an active learning session for GI physiology instituted in 2023 and 2024; there was no difference in the mean score on GI questions when comparing these two years to the previous three, so we included the data. The teaching methods for the remaining topics were consistent across years and sections: direct lecture constituted slightly less than 50% of the course content, and the remaining content was delivered in person via laboratories, case studies, small group discussions, self-study, clinical correlations, and active learning sessions (Clickers, Kahoot, etc.). The second half of the course (two exams) delivered in 2020 did require alternative teaching methods (i.e. recordings and online interactions) due to COVID-19 precautions.
To improve validity associated with the test scores, all test components (test instructions, MCQs, scenarios, etc.) were reviewed by other medical school faculty as subject matter experts. They provided feedback on the appropriateness of the content, item wording, and options prior to administration.
Preliminary investigation of item analysis indices showed no difference across years, suggesting that COVID-19 delivery differences did not confound psychometric results. Because exams are kept secure, questions could have been repeated over multiple years of testing; however, none of the item statistics demonstrated values that would suggest students had prior access to any exam questions.
Test Instruction and MCQ Item Format
Test instructions asked examinees to select the single best answer. Every effort was made to ensure that each distractor was either totally correct (could occur) or totally incorrect (not possible or accurate); the NBME refers to these as “true-false” MCQs [7]. Exams consisted of a variety of MCQ formats. Not counting C-type items, the vast majority of MCQ items on the exams contained five response options, although there could be as few as three and as many as eight.
Violating MCQ Guidelines
Questions were divided into standard format (SF) or non-standard format (NSF) based upon previous MCQ item writing guidelines [8,9]. The specific criteria used to denote a question as NSF were: 1) the last option was “all of the above (AOTA) are correct”, 2) the last option was “none of the above (NOTA) is correct”, 3) C-type format (Table 1), 4) modified K-type format (Table 1), and 5) a negatively worded stem (negative stem), e.g. “which of the following is NOT correct?” These five types were chosen because they can be identified easily and objectively. We note that the modified K-type format stayed the same across tests; alternative pairings, such as “A and B are correct” or “B and C are correct”, were not used.
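To make these classification rules concrete, the sketch below shows one way the five NSF criteria could be applied programmatically. It is illustrative only and is not the authors’ actual workflow; the field names (stem, options) and the string-matching heuristics are assumptions.

```python
# Illustrative sketch only (not the authors' actual tooling): tagging an item as SF
# or one of the five NSF subtypes using the criteria described above.
import re

def classify_item(stem: str, options: list[str]) -> str:
    """Return the NSF subtype for an item, or 'SF' if none of the five criteria apply."""
    last = options[-1].lower()
    all_opts = " ".join(options).lower()
    if "all of the above" in last:
        return "AOTA"
    if "none of the above" in last:
        return "NOTA"
    if "both" in all_opts and "neither" in all_opts:
        return "C-type"            # e.g. C. BOTH A and B are correct; D. NEITHER A nor B is correct
    if "all correct" in last:
        return "modified K-type"   # e.g. E. A, B and C are ALL correct
    if re.search(r"\bNOT\b", stem):
        return "negative stem"     # negatively worded stem
    return "SF"

# Hypothetical example item:
print(classify_item(
    "Which of the following statements about cardiac output is correct?",
    ["Distractor A", "Distractor B", "Distractor C",
     "A and C are BOTH correct", "A, B and C are ALL correct"]))
# -> "modified K-type"
```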
Analysis
Items were collected by type across exams. Thirteen questions were eliminated because of poor wording in the stem or in the answer choices; thus, a total of 987 questions comprise the final analysis. Questions classified as NSF comprised about 43% (n = 420 over the 5 years) of the questions, and this percentage was consistent across the four exams. The following psychometric indices were calculated for each item: item difficulty (p-value), item discrimination (D), and the point-biserial correlation (rpbis). In addition, the Kuder-Richardson formula 20 (KR-20) internal consistency reliability was calculated for the overall test. Each index used in the psychometric investigation is defined below, along with guidelines for interpretation based on current recommendations [15].
A test is said to be reliable if it can be repeated across different conditions (e.g. time) and similar results are achieved. Internal consistency reliability is appropriate when one exam is administered to a set of examinees at one timepoint. The KR-20 reliability estimate was calculated; this index is used when items are dichotomously scored, such as correct/incorrect [15]. KR-20 values of roughly 0.60–0.65 and higher have been recommended for the health professions [16].
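As a minimal sketch of how KR-20 can be computed from a dichotomously scored item matrix (rows = examinees, columns = items), consider the following; the simulated data are purely illustrative and are not the study data.

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR-20 for a 0/1 score matrix with rows = examinees and columns = items."""
    k = scores.shape[1]                          # number of items
    p = scores.mean(axis=0)                      # proportion correct per item
    q = 1.0 - p
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Simulated example: 100 examinees, 50 items, scores driven by a common "ability" factor
rng = np.random.default_rng(0)
ability = rng.normal(size=(100, 1))
difficulty = rng.normal(size=(1, 50))
scores = (ability - difficulty + rng.normal(size=(100, 50)) > 0).astype(int)
print(round(kr20(scores), 3))   # high for these strongly correlated simulated items
```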
The item discrimination index (D) is used to distinguish between high and low achievers. D is calculated as the proportion of correct responses in the upper group minus the proportion of correct responses in the lower group. A value of 0 indicates that the item has no discrimination and cannot distinguish between students in the upper and lower groups [15]. Values of 0.25 or higher have been recommended in the health sciences [16].
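A brief sketch of this calculation, assuming the common convention of defining the upper and lower groups as the top and bottom 27% of total scores (the exact split used in the study is not stated):

```python
import numpy as np

def discrimination_index(item: np.ndarray, total: np.ndarray, frac: float = 0.27) -> float:
    """D = proportion correct in the top-scoring group minus proportion correct in the
    bottom-scoring group. `item` holds 0/1 scores for one item; `total` holds total scores."""
    n = len(total)
    g = max(1, int(round(frac * n)))
    order = np.argsort(total)          # examinees sorted by total score, ascending
    lower, upper = order[:g], order[-g:]
    return float(item[upper].mean() - item[lower].mean())
```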
Point-biserial correlation (rpbis), also called the item-total correlation, is another discrimination index and is based on the correlation of item scores with the total test score [15]. An item with a discrimination index less than 0.2 is considered to have insufficient discriminatory ability [17]. Extremely difficult or easy items are sometimes needed to cover course content and learning objectives, but they often have low discrimination values.
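A minimal sketch of the point-biserial calculation; whether the corrected form (item removed from the total) or the uncorrected form was used in the study is not stated, so the flag below is an assumption.

```python
import numpy as np

def point_biserial(item: np.ndarray, total: np.ndarray, corrected: bool = True) -> float:
    """Pearson correlation between 0/1 item scores and the total test score.
    With corrected=True, the item's own points are removed from the total."""
    rest = total - item if corrected else total
    return float(np.corrcoef(item, rest)[0, 1])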
We conducted a distractor analysis to evaluate the performance of each item’s distractor choices. Effective distractors should represent plausible alternatives or common misconceptions or errors. Discrimination indices for distractors should be negative, indicating that the distractor attracts more students in the lower group than in the upper group [15]. A response option is considered a non-functional distractor (NFD) if fewer than 5% of students select it [18].
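The 5% rule can be applied as in the sketch below; the data layout (a list of selected option letters for one item) is an assumption for illustration.

```python
from collections import Counter

def nonfunctional_distractors(responses: list[str], options: list[str], key: str,
                              threshold: float = 0.05) -> list[str]:
    """Return the distractors selected by fewer than `threshold` (5%) of examinees [18]."""
    n = len(responses)
    counts = Counter(responses)
    return [opt for opt in options if opt != key and counts.get(opt, 0) / n < threshold]

# Example: out of 100 responses, options chosen by fewer than 5 students are flagged as NFDs.
responses = ["A"] * 70 + ["B"] * 15 + ["C"] * 8 + ["D"] * 4 + ["E"] * 3
print(nonfunctional_distractors(responses, ["A", "B", "C", "D", "E"], key="A"))  # -> ['D', 'E']
```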
Unpaired t-tests were used to compare SF and NSF questions and to compare SF questions with each NSF subtype. A one-way analysis of variance (ANOVA) was used to compare the NSF subtypes with one another.
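These comparisons could be run as follows; the arrays below are simulated placeholders (sized using the reported counts of 567 SF and 420 NSF items), not the study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Placeholder per-item discrimination (D) values; the real analysis used the study's item statistics.
d_sf = rng.normal(0.23, 0.10, 567)     # 987 total items minus 420 NSF items
d_nsf = rng.normal(0.28, 0.10, 420)

t_stat, p_val = stats.ttest_ind(d_sf, d_nsf)   # unpaired t-test, SF vs NSF
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")

# One-way ANOVA across the five NSF subtypes (splits here are arbitrary, for illustration only)
aota, nota, ctype, modk, negstem = np.array_split(d_nsf, 5)
f_stat, p_anova = stats.f_oneway(aota, nota, ctype, modk, negstem)
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```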
Results
The data presented in the results are the mean ± standard error of the mean (SEM). The mean KR-20 across all exams was estimated as 0.757 ± 0.012, demonstrating acceptable internal consistency.
Table 3 shows the mean discrimination indices, NFD percentages, and difficulty data for SF and NSF questions. Average values for both formats were within recommended guidelines. The mean D and mean rpbis were significantly greater for NSF items than for SF items, the percentage of NFDs was significantly lower for NSF items, and the difficulty was significantly greater for NSF questions. Thus, the NSF questions discriminated better, had fewer NFDs, and were more difficult than the SF questions.
Table 3.
Average psychometric data across formats (mean ± SEM).
|  | SF | NSF |
| --- | --- | --- |
| D | 0.228 ± 0.006 | 0.278 ± 0.008* |
| rpbis | 0.273 ± 0.006 | 0.291 ± 0.006** |
| NFD (%) | 70.5 ± 0.017 | 56.5 ± 0.022** |
| Difficulty (p-value) | 0.809 ± 0.006 | 0.741 ± 0.007* |
*p < 0.001; **p < 0.05; SEM = Standard Error of the mean; SF = Standard Format; NSF = Non-Standard Format; D = Discrimination Index; rpbis = Point-biserial correlation; NFD = Non-functional Distractors.
Table 4 shows average values by NSF question type. Percentages for the NSF subtypes are expressed relative to NSF questions, not total questions; the actual number of questions is given in parentheses. Given the paucity of questions, and thus the lack of statistical power, there were no significant differences when comparing AOTA and negative stem questions to SF items. Numerically, there was a trend towards better discrimination for AOTA, as both D and rpbis were higher and the difficulty was greater. There was some trend towards better discrimination, but not greater difficulty, for the negative stem. In contrast, NOTA, C-type, and modified K-type items showed significant differences for at least one parameter compared to SF questions. Both C-type and modified K-type showed higher D, and modified K-type also showed a greater rpbis. All three types were more difficult.
Table 4.
NSF questions by violation type (mean ± SEM).
|  | AOTA | NOTA | C-type | Mod K-type | Negative stem |
| --- | --- | --- | --- | --- | --- |
| Approximate % and (n) of questions | 4% (16) | 40% (168) | 18% (76) | 34% (144) | 5% (19) |
| D | 0.297 ± 0.048 | 0.256 ± 0.013 | 0.281 ± 0.015* | 0.302 ± 0.015* | 0.240 ± 0.028 |
| rpbis | 0.299 ± 0.027 | 0.274 ± 0.010 | 0.288 ± 0.011 | 0.303 ± 0.011** | 0.295 ± 0.026 |
| NFD (%) | 75.0 ± 0.079 | 60.5 ± 0.037* | 48.7 ± 0.046* | 53.1 ± 0.040* | 71.9 ± 0.074 |
| Difficulty (p-value) | 0.760 ± 0.035 | 0.751 ± 0.013* | 0.733 ± 0.014* | 0.718 ± 0.012* | 0.829 ± 0.024 |
Abbreviations are the same as for Table 3; additional abbreviations: AOTA = all of the above; NOTA = none of the above.
The ANOVA failed to detect a significant difference within the subgroup of NSF items. The power of this analysis is reduced given the paucity of data points for some item types, particularly AOTA and negative stem. Further, this study was not designed to specifically write and compare questions falling under the various subtypes; instead, we examined these subtypes from our question bank because they can be easily and objectively labelled as NSF. Numerically, the modified K-type and AOTA items showed the highest D of the groups. The D value for C-type items was lower than for modified K-type and AOTA, but higher than for NOTA and negative stem; the same held true for rpbis. The C-type questions had the lowest NFD values, followed by modified K-type, then NOTA, AOTA, and finally negative stem. The AOTA and negative stem NFD values were similar to those for the SF questions. The modified K-type items were the most difficult, and the C-type items were the second most difficult.
Discussion
The purpose of this study was to examine five specific types of “rule violations” of MCQ item writing guidelines on first-year medical school examinations. The data provided here are simply a presentation of the results. Based upon experience and the literature, we expected NSF questions to be more difficult [10,13,14,19]; however, we felt it important to provide data to support or refute this expectation.
The results of this study supported our expectation that NSF questions are more difficult. Given that NSF items do not follow published guidelines, one could hypothesise that NSF questions have diminished psychometric properties and are thus flawed [14]. The data presented here do not support this contention; in fact, the contrary was observed. NSF questions had significantly greater discrimination than SF questions. Thus, our findings provide no evidence to suggest that NSF questions are flawed or that they compromise examination results. Given that they are more difficult and enhance discrimination of higher versus lower performing students, one could argue that NSF questions provide more rigour, something that is coveted when assessing a class.
The NBME characterises an MCQ in which the options are either totally true or totally false as a “true-false MCQ” [7]. In other words, students are not asked to differentiate the better answer, only whether each answer choice is true or false. While this may suggest K-type questions are a viable option for assessment [7], we did not test “true” K-type items (Table 1); thus, a thorough discussion is not warranted here. On the other hand, we do use a modified version of K-type MCQs. These, as well as the C-type, are commonly called complex MCQs, and there are arguments for and against such items [7–10,13]. The NBME indicates these format types might be useful for course exams [7], and the results of this study support that suggestion.
Modified K- and C-type items are thought to require a deeper understanding of the material and to reduce guessing; the results of this study support this hypothesis. These items can strengthen the reasoning students need for a differential diagnosis, a process commonly used by clinicians. For example, if a patient presents with symptoms A, B, and C, the professional immediately begins to consider various possible causes (X, Y, and Z); in other words, they begin a differential diagnosis. Using C-type and modified K-type questions, we expect the student to indicate whether more than one of the options is on their differential. These item types may be especially helpful during the first year of medical school because the items do not ask students to determine which single option is correct, but rather, “Have you considered this option?” Stated another way, the option to be selected reflects the process of performing a differential diagnosis. At later stages, a student must then make a judgement as to which of the items on the differential is most likely (i.e. the best answer). This is the rationale indicated by the NBME for use of the one-best-answer format, and we understand and appreciate this rationale.
In terms of limitations, it is possible that some “cluing” could occur for modified K-type questions [13]. In brief, cluing refers to eliminating some options by deciding the truth of only one of the statements. While we cannot rule out the possibility that this occurs, given the strength of discrimination and the high difficulty, cluing is likely a minor concern at best.
The use of NOTA is controversial. Some indicate this format can be used if the writer is careful [9,20]; others argue against its use [6]. The results of this study provide evidence that NOTA can be used effectively. Adding NOTA may make a student stop and reconsider the options; it requires more “certainty” regarding the option they are leaning towards, in other words, a deeper understanding of the material. However, we do side with the notion of being cautious regarding its use. NOTA should appear in multiple items and should sometimes be the correct answer; just “throwing it in” where it is always correct or never correct will be detected by the savvy student.
As observed in Table 4, we rarely use AOTA and negative stem questions, and these types of questions are typically frowned upon [8,9,14]. The results here indicate that they appear to discriminate just as well as other MCQs. However, given the paucity of their use, further generalisations cannot be made from this study.
Well thought-out distractor options are also very important components of effective MCQs. In addition to grammatical aspects, good distractors share a variety of traits: they should be plausible yet wrong, should not inadvertently hint at the correct answer, should be independent of one another, should not overlap or contradict each other, and should be of similar length [4,7,9,21]. A thorough discussion of this matter is beyond the scope of this work; however, the number of distractors may be pertinent. For typical MCQs there are four distractors and one correct answer, for a total of five options (A–E), but there is no consensus regarding the optimal number of distractors. Haladyna et al. [9] suggested writing as many as possible, the rationale being that more options reduce guessing and thus enhance discrimination between high and low achievers. However, several reports indicate this might not be the case. Tarrant et al. [18] and Vyas and Supe [22] indicated that three-choice exams (two distractors and one correct answer) provided the most robust discrimination; in their review, Gierl et al. [21] indicated two distractors are optimal; and Boland et al. [4] recommend three distractors. Writing good distractors is difficult, and the authors cited suggest that faculty have great difficulty coming up with three or more quality distractors. Our results support this contention. We performed a distractor analysis of the test items; a response option is considered an NFD if fewer than 5% of students select it [18]. The SF questions in our analysis had a higher percentage of NFDs than the NSF items. Looking at the subset of NSF items, the percentage of NFDs was lowest for the C- and modified K-type items, providing additional support that these two question types can enhance the quality of exam items. NOTA items had fewer NFDs than SF questions, further bolstering the notion that they can be an effective item type, and they reduce the number of distractors the author must write.
We only examined specific types of NSF test items; thus, we are not advocating against the excellent guidelines that have previously been put forth [1,2,4,5,7–10]. Writing MCQs that test higher-order thinking is a difficult task, and guidelines are helpful and necessary. Instead, our work focuses on a few specific exceptions to the guidelines. We used classical test theory indices to examine item quality [15]. Classical test theory indices have limitations, such as sample-specific estimates and error inherent in measurement at the observed level (e.g. [15,23]); however, these indices are widely used and accepted because item response theory (IRT) models require stringent assumptions and large sample sizes [15]. Future studies may incorporate modern psychometric analyses (e.g. IRT or differential item functioning) to provide additional information about NSF items. Finally, we recognise the need to examine validity associated with the scores. Every attempt was made to increase the validity associated with the testing situations; such details have not been presented in previous research. Future studies may examine how students scoring high (or low) on NSF items perform on other important outcomes in medical school (or on similar relevant outcomes for other advanced test-taking populations).
The results of this study come from assessing undergraduate medical students in a physiology course, a setting in which students are working to learn new information. The results show that NSF items provide more rigour, which requires a deeper understanding of the new information in order to be successful. However, our results may also be applicable to other educational scenarios, such as continuing medical education (CME). With CME, the goal is to maintain, develop, or increase the knowledge, skills, and professional performance of those in the medical profession. Given that NSF questions provide rigour that is fair and equitable, it stands to reason that these items would enhance the acquisition of knowledge and skills if added to CME. Further, by requiring a deeper understanding, health care workers are more likely to maintain this knowledge and these skills for a longer period of time. Quality of care can only be enhanced if health-care professionals have a greater depth of knowledge and sharper skills regarding treatments and procedures.
In summary, our results suggest that some NSF item types can be used for examination purposes, coinciding with recommendations put forth by the NBME. The NSF formats examined in this study are more difficult, have better discrimination, and have better-functioning distractors. Thus, using them may help to provide a rigorous but fair examination, which in turn may enhance the learning process of medical students. The modified K-type showed the highest discrimination and few NFDs, suggesting this question style is something for educators to consider using, at least to some degree. Further, C-type questions clearly enhance discrimination and reduce NFDs, as do NOTA items. Thus, sprinkling in a few of these three formats would most likely strengthen the rigour and discrimination of current MCQ exams, and we encourage educators to consider their use when preparing future assessments. The ability to deeply analyse and interpret information is an important educational goal for future physicians. According to van der Vleuten [24], testing can enhance learning; therefore, including more difficult test items may enhance medical students’ clinical competence, which in turn better equips them to meet their future career requirements. The same holds true for maintaining this competence, which is what CME provides. Thoughtfully “violating” typical MCQ item guidelines by including the NSF items studied here on classroom/CME tests may help to achieve this goal.
Acknowledgments
All authors contributed to this manuscript by: 1) providing substantial contributions to the conception or design of the work and/or the acquisition, analysis, or interpretation of data for the work; 2) drafting the work or reviewing it critically for important intellectual content; 3) approving the submission of the manuscript and agreeing to approve the final version if it is published; and 4) agreeing to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Disclosure Statement
No potential conflict of interest was reported by the author(s).
References
- [1]. Green S, Johnson R. Assessment is essential. New York (NY): McGraw-Hill; 2010.
- [2]. Popham WJ. Classroom assessment: what teachers need to know. 9th ed. Boston (MA): Pearson; 2019.
- [3]. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for educational and psychological testing. Washington (DC): American Educational Research Association; 2014.
- [4]. Boland RJ, Lester NW, Williams, et al. Writing multiple-choice questions. Acad Psychiatry. 2010;34(4):310–317. doi: 10.1176/appi.ap.34.4.310
- [5]. Considine J, Botti M, Thomas S. Design, format, validity and reliability of multiple choice questions for use in nursing research and education. Collegian. 2005;12(1):19–24. doi: 10.1016/S1322-7696(08)60478-3
- [6]. Rudolph MJ, Daugherty KK, Ray ME, et al. Best practices related to examination item construction and post-hoc review. Am J Pharm Educ. 2019;83(7):7204. doi: 10.5688/ajpe7204
- [7]. Billings MS, DeRuchie K, Hussie K, et al. NBME item-writing guide: constructing written test questions for the health sciences. National Board of Medical Examiners; 2021. Available from: https://www.ucns.org/common/Uploaded%20files/Help/NBME%20Item%20Writing%20Guide.pdf
- [8]. Frey BB, Petersen S, Edwards LM, et al. Item-writing rules: collective wisdom. Teach Teach Educ. 2005;21(4):357–364. doi: 10.1016/j.tate.2005.01.008
- [9]. Haladyna TM, Downing SM, Rodriguez MC. A review of multiple-choice item-writing guidelines for classroom assessment. Appl Meas Educ. 2002;15(3):309–333. doi: 10.1207/S15324818AME1503_5
- [10]. Haladyna TM, Downing SM. A taxonomy of multiple-choice item-writing rules. Appl Meas Educ. 1989;2(1):37–50. doi: 10.1207/s15324818ame0201_3
- [11]. Harasym PH, Leong EJ, Violato C, et al. Cuing effect of “all of the above” on the reliability and validity of multiple-choice test items. Eval Health Prof. 1998;21(1):120–133. doi: 10.1177/016327879802100106
- [12]. Nnodim JO. Multiple-choice testing in anatomy. Med Educ. 1992;26(4):301–309. doi: 10.1111/j.1365-2923.1992.tb00173.x
- [13]. Albanese M. Type K and other complex multiple-choice items: an analysis of research and item properties. Educ Meas Issues Pract. 1993;12(1):28–33. doi: 10.1111/j.1745-3992.1993.tb00521.x
- [14]. Downing SM. The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ. 2005;10(2):133–143. doi: 10.1007/s10459-004-4019-5
- [15]. Bandalos DL. Measurement theory and applications for the social sciences. New York (NY): Guilford Press; 2018.
- [16]. Rezigalla AA. Item analysis: concept and application. In: Firstenberg M, Stawicki SP, editors. Medical education for the 21st century. IntechOpen; 2022. p. 105–120.
- [17]. Ali SH, Ruit KG. The impact of item flaws, testing at low cognitive level, and low distractor functioning on multiple-choice question quality. Perspect Med Educ. 2015;4(5):244–251. doi: 10.1007/s40037-015-0212-x
- [18]. Tarrant M, Ware J, Mohammed AM. An assessment of functioning and non-functioning distractors in multiple-choice questions: a descriptive analysis. BMC Med Educ. 2009;9(1):40. doi: 10.1186/1472-6920-9-40
- [19]. Pate S, Caldwell DJ. Effects of multiple-choice item-writing guideline utilization on item and student performance. Curr Pharm Teach Learn. 2014;6(1):130–134. doi: 10.1016/j.cptl.2013.09.003
- [20]. Tamir P. Positive and negative multiple choice items: how different are they? Stud Educ Eval. 1993;19(3):311–325. doi: 10.1016/S0191-491X(05)80013-6
- [21]. Gierl MJ, Bulut O, Guo Q, et al. Developing, analyzing, and using distractors for multiple-choice tests in education: a comprehensive review. Rev Educ Res. 2017;87(6):1082–1116. Available from: https://www.jstor.org/stable/44667687#metadata_info_tab_contents
- [22]. Vyas R, Supe A. Multiple choice questions: a literature review on the optimal number of options. Natl Med J India. 2008;21(3):130–133. Available from: https://www.researchgate.net/publication/23468587_Multiple_choice_questions_A_literature_review_on_the_optimal_number_of_options
- [23]. Crocker L, Algina J. Introduction to classical and modern test theory. New York (NY): Harcourt Publishers; 1986.
- [24]. van der Vleuten CPM. The assessment of professional competence: developments, research and practical implications. Adv Health Sci Educ. 1996;1(1):41–67. doi: 10.1007/BF00596229