
Educator's blueprint: A how‐to guide for developing high‐quality multiple‐choice questions

Michael Gottlieb 1, John Bailitz 2, Megan Fix 3, Eric Shappell 4, Mary Jo Wagner 5
PMCID: PMC9873868  PMID: 36711253

Abstract

Multiple‐choice questions are commonly used for assessing learners' knowledge, as part of educational programs and scholarly endeavors. To ensure that questions accurately assess the learners and provide meaningful data, it is important to understand best practices in multiple‐choice question design. This Educator's Blueprint paper provides 10 strategies for developing high‐quality multiple‐choice questions. These strategies include determining the purpose, objectives, and scope of the question; assembling a writing team; writing succinctly; asking questions that assess knowledge and comprehension rather than test‐taking ability; ensuring consistent and independent answer choices; using plausible foils; avoiding grouped options; selecting the ideal response number and order; writing high‐quality explanations; and gathering validity evidence before and evaluating the questions after use.

INTRODUCTION

Multiple‐choice questions (MCQs) are a common modality used in undergraduate, graduate, and postgraduate medical education for assessing learners' knowledge across a number of applications. Educators create MCQs to assess learners' understanding of concepts and preparedness for higher‐stakes examinations (e.g., United States Medical Licensing Examinations, in‐training examinations, shelf examinations, board certification). Education researchers may use MCQs to assess knowledge before and after an intervention to gauge the intervention's effectiveness. Creators of online resources, including both paid (e.g., Physician's Evaluation and Educational Review, RoshReview) and open‐access (e.g., https://www.aliem.com/), may use MCQs both as tools for teaching (i.e., test‐enhanced learning) and as a means for learners to self‐identify knowledge gaps and refine their studying. Finally, learners may create their own tests to apply their learning or to share with others. Data have demonstrated the value of using MCQs for both knowledge assessment and test‐enhanced learning. 1 , 2 Multiple‐choice questions also offer the benefits of being time‐efficient and easy to score and may allow more objective scoring compared with open‐ended testing methods. However, faculty rarely receive training in how to create a high‐quality MCQ. This can lead to poorly constructed MCQs, which can hinder the ability to assess learning and to guide learners' study efforts. 3 , 4 Moreover, poorly designed MCQs can also impair the ability to assess knowledge attained from an education intervention when used for program evaluation and research reporting. Therefore, there is a need to better understand the key components of creating effective MCQs for novice and experienced MCQ developers alike.

For the purposes of this paper, MCQ will refer to a selected‐response test item with a single best answer as a stand‐alone item (as opposed to multianswer, matching, or item series–based questions). This also differs from constructed response test items (i.e., open‐response items). General MCQ terminology is included in Figure 1. This Educator's Blueprint paper will present 10 practical strategies for developing high‐quality MCQs.

FIGURE 1. Anatomy of a multiple‐choice question

TEN STRATEGIES

Determine the purpose, objectives, and scope of the question

Writing a high‐quality MCQ begins with identifying the intended audience. This should include both the experience and the background of the planned end‐users (e.g., third‐year medical student, senior resident, new faculty member) as well as the overall purpose of the MCQs (e.g., self‐learning, ordinal stratification, mastery learning). For example, a test used for ordinal stratification may be more focused on rank‐ordering learners, whereas a test for mastery learning is focused on helping learners develop mastery of a given topic. After determining the purpose, developers should identify specific objectives. If MCQs are the assessment step of an existing curriculum, educators may utilize the learning objectives of the curriculum to develop the MCQs. In other instances, objectives will need to be developed specifically for the MCQ test, modeling the approach to curriculum development (e.g., problem identification and general needs assessment, targeted needs assessment, and goals and objectives). 5 If testing after a curriculum with specific objectives, questions should match the category or domain of the intended goal. One model that can be used for this is Bloom's revised taxonomy (remember, understand, apply, analyze, evaluate, create). 6 After objectives have been established or identified, create a blueprint for the scope to be tested (also known as a test specification table). This will help ensure the MCQ content maps to the intended objectives and avoid duplication of content or overemphasis of minutiae. An example of a test blueprint is included in Appendix S1.
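As a minimal illustration of how a blueprint crosses content domains with cognitive levels (separate from the example in Appendix S1; the domains, taxonomy levels, and item counts below are hypothetical), a blueprint can be sketched as a simple grid and checked against the planned test length:

# Hypothetical test blueprint: content domains crossed with selected levels of
# Bloom's revised taxonomy; each cell is the number of planned items.
blueprint = {
    "Cardiovascular emergencies": {"remember": 2, "apply": 3, "analyze": 1},
    "Toxicology": {"remember": 1, "apply": 2, "analyze": 1},
    "Trauma": {"remember": 2, "apply": 2, "analyze": 2},
}

# Tally items per domain and overall to confirm coverage matches the objectives
# and the intended test length, and to flag over- or underrepresented domains.
for domain, levels in blueprint.items():
    print(domain, "->", sum(levels.values()), "items")
print("Total items:", sum(sum(levels.values()) for levels in blueprint.values()))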

Assemble a writing team

Creating a valid and reliable MCQ test requires time and effort from a team of at least one item writer and one editor. Writing teams should have expertise in both MCQ development and subject matter (though each individual team member does not require both). Subject matter expertise should include academic knowledge as well as practical clinical application to ensure the questions are accurate and relevant. The level of content area expertise should match the purpose of the test and the training level of the learner. For example, fellowship directors may be better suited to serve as the writers and content editors for postfellowship certification tests, whereas clerkship directors may be a better fit for clerkship completion tests. When possible, having two editors for a question set is preferred to improve detection of errors or points of confusion. A multieditor model also allows a better division of effort, with one editor focusing on medical accuracy while the other reviews structure, format, and syntax. Depending on the number of questions and content areas included, as well as the timeline for MCQ creation, multiple writers and editors may be needed.

Write succinctly

An important tenet of writing MCQs is to keep them as succinct and clear as possible. 7 Cognitive load theory describes the limited capacity of a learner's working memory. 8 Just as one teaches by connecting new ideas to a student's prior knowledge, a well‐constructed MCQ preserves working memory capacity by limiting extraneous information, thereby highlighting the important information. 9 If units are used in the answer choices, they should be defined once, outside of the question set, in standard format (e.g., milligrams [mg], international units [IU]), with abbreviations used thereafter to keep answer choices succinct. Vague terms such as “nearly all” or “somewhat” should be avoided, as they can create confusion and their interpretation may vary between readers. 7 Case scenarios can also add extraneous cognitive load and should only be included if the additional details are necessary to clarify the question. Long question stems increase the effect of reading comprehension on performance, which is rarely the target construct to be tested. This can lead to construct‐irrelevant variance, which threatens the validity of scores and decisions made based on those scores. 10 When writing a case scenario–based question, only information directly needed for the question should be included. An author should resist the urge to use the question stem to teach about a given topic, as this can further add to the extraneous cognitive load and detract from the role of the MCQ in assessment.

Ask questions that assess knowledge and comprehension, not test‐taking ability

Fill‐in‐the‐blank, “best” answer, and “which statement is correct” formats should be avoided. These formats often test disparate objectives and place excess cognitive load on learners, which may ultimately provide a better assessment of test‐taking skills than of the ability to actually apply knowledge to clinical practice. Writers should also avoid the temptation to simply turn declarative statements from core reference texts or landmark articles into multiple‐choice questions, as these only assess a learner's recall of that particular source. Additionally, avoid negative phrasing (e.g., “all of the following except”, “which of these should not be used”), as these types of statements are more cognitively taxing for the learner, who may then be more prone to make an error, which will decrease the validity of the testing process. 7 While some have proposed highlighting the negative word or using all capital letters to help it stand out, this still causes excess cognitive load and should be avoided. 11 Here are some phrases that could be used instead: “what is the contraindication to …?”, “which treatment should a physician avoid?”, “what is the lowest rate of …?”, “which is a safe alternative to …?” Ultimately, the MCQ should address a clear objective and the answers should logically follow the prompt. A well‐prepared learner should be able to mentally answer a well‐written MCQ without reading the answer choices. 7

Ensure consistent and independent answer choices

Creating the answer set for an MCQ is one of the most challenging steps in the process, as the author must provide foils (i.e., incorrect answers) that are similar in style yet mutually exclusive and clearly distinct from the correct answer. Answer choices should be consistent in length, voice, grammar, syntax, and verb tense so that the respondent cannot guess the answer based on how it is written. 7 Similarly, avoiding repetition of words between the question and an answer prevents cueing the test‐taker to the correct response. 7 Since each question should test one concept discretely, the answer set should not require knowledge of several related but distinct concepts. The best questions will have foils that fall within the same physician task (e.g., etiology, diagnosis, history, treatment) or the same category (e.g., cardiac medications, different procedures for traumatic injury). If the answers do include information from different tasks or categories, it is best if each answer is from a separate focus to avoid signaling the correct answer.

Use plausible foils

Choosing incorrect answer choices takes careful thought and consideration. In addition to ensuring consistency between answer choices (see Strategy 5 above), it is important to make sure all choices are plausible. 12 Plausible foils often involve commonly held misconceptions, common mistakes, or frequently confused ideas. Outdated approaches are also good options, but avoid cutting‐edge and very recent changes (e.g., changed within the past year) unless explicitly taught or expected of the test takers. Avoid answer choices that are too obscure or obviously incorrect, as test takers can eliminate these choices to arrive at the correct answer without needing to understand why the correct answer should have been selected. If all answers are plausible, however, then the item will more precisely select for those who truly know the best answer choice. Similarly, it is important to avoid joke responses. These can draw away test takers' attention (reducing their ability to focus on the examination) and are generally identified as obviously wrong and easily eliminated. Additionally, avoid overlapping answers, where one answer can be grouped within another answer's category (e.g., Choice A, fingers; Choice B, hands). Finally, avoid extreme options. Although many of us have seen the phrases “always” or “never” in previous tests, these should be avoided. These keywords draw readers' attention and often can be easily identified as foils, as situations are rarely always or never true in medicine.

Avoid grouped options

Another pitfall is using “all of the above” or “none of the above” for grouped selection. 7 The phrase “none of the above” is particularly problematic as it focuses on detecting wrong answers rather than the correct response and increases cognitive load as described above. 7 The problem with “all of the above” is that it makes it possible to answer the question with only partial information (e.g., if two answers are right, the student can assume all are right; if one is wrong, the student can assume this option is wrong). Consequently, many guidelines recommend against using these options. 12

Select the response number and order

Some test developers may be motivated to include a larger number of foils in an effort to minimize the likelihood that learners will answer an item correctly by guessing. This must be balanced with the significant effort required to develop high‐quality distractors (see Strategies 5–7). 13 In addition, the inclusion of more answer choices increases the amount of time learners need to read and process each item, which will either increase the time required to take the test or decrease the number of items that can be included (which will decrease the test's reliability). 10

Single best‐answer test items commonly have four answer choices: one correct answer and three foils. There is good evidence, however, that the psychometric and feasibility “sweet spot” may actually be three answer choices (i.e., one correct answer and two foils). 14 Research has demonstrated minimal change in item difficulty or item discrimination (i.e., the ability of an item to differentiate among learners based on how well they know the material being tested) when changing from four to three answer choices, while decreasing the burden for item writers and increasing the number of items that can be included on a test with fixed time constraints. 14 Therefore, we suggest limiting responses to three choices.

With regard to ordering of answer choices, it is most common to list options alphabetically or in logical order (e.g., numerical or temporal order). 12 For test items that will be repeated (e.g., mastery learning tests, self‐study formative assessment platforms), randomization of answer choices across test iterations may be preferred to decrease the influence of memory on performance.
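For platforms that support it, randomization can be as simple as shuffling the stored answer choices each time an item is delivered. The sketch below is a minimal illustration (the item content and the Python representation are assumptions, not a specific testing platform's API) that tracks the correct answer by value so that shuffling does not break scoring:

import random

def shuffled_choices(correct, foils, rng=random):
    # Combine the correct answer with its foils, shuffle the order for this
    # test iteration, and return the new index of the correct answer.
    choices = [correct] + list(foils)
    rng.shuffle(choices)
    return choices, choices.index(correct)

# Hypothetical three-option item (one correct answer, two foils)
choices, answer_index = shuffled_choices("Aspirin", ["Clopidogrel", "Heparin"])
print(choices, "correct index:", answer_index)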

Write high‐quality explanations

In test‐enhanced learning environments, inclusion of explanations for learners to review after completing a test provides an additional mechanism to improve understanding of concepts tested. These explanations should cover both why the correct answer is correct and why each foil is incorrect. The goal in writing answer choice explanations is for learners to be able to answer different questions on the same topic correctly in the future. Explanations should therefore be reviewed to ensure that the information included provides insight beyond the narrow scope of the single test item. Inclusion of tables, figures, and/or infographics may help further reinforce concepts for learners. 15 Finally, inclusion of references where learners can review concepts more comprehensively may improve understanding for learners with limited knowledge in a subject area.

Gather validity evidence before and evaluate the questions after use

Similar to other assessment tools, it is important to gather validity evidence prior to implementing the MCQ item. Several frameworks exist; a comprehensive discussion is beyond the scope of this article and has been provided elsewhere. 16 , 17 Using Messick as an example, we will briefly highlight the five components applied to MCQ design. 17 , 18 Content validity refers to the relationship between the test item content and the construct of interest. Evidence for this can be gathered by conducting a search of the literature and relevant resources, using a test blueprint, and discussing items with subject matter experts. Response process validity reflects the degree to which the test item (as interpreted by respondents) aligns with the meaning intended by test developers. This can be assessed with pilot testing and cognitive interviewing (e.g., asking respondents to think aloud, describing what they believe the question is asking). Internal structure assesses the consistency and reliability of responses. This is statistically measured using tools such as Cronbach's alpha, inter‐rater reliability, or factor analysis. Relationship to other variables reflects the degree to which a response correlates with an external measure (e.g., senior residents should perform better on a test of advanced medical knowledge than first‐year medical students). Consequential validity refers to the intended or unintended impacts of the item (e.g., score interpretation and the decisions based on those scores; a disproportionate focus on a rare area of interest to the question writer may cause learners to inappropriately focus their studying on an esoteric, low‐yield area). After the item has been administered, MCQ developers should evaluate it to determine the difficulty index (i.e., the proportion of students who answered the item correctly) and the discrimination index (i.e., how effectively an individual item discriminates between the top and bottom scorers for a given exam). Developers should ensure that the difficulty, discrimination, and content importance are all at acceptable levels given the test's purpose. Finally, it is critical to solicit feedback from end‐users and to continue to iterate and refine questions over time, adapting to feedback, outcomes, advances in the literature, and any changes to the underlying curricula.
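As an illustrative sketch of this post‐administration item analysis (the 0/1 scoring matrix, the use of NumPy, and the common convention of comparing the top and bottom 27% of scorers are assumptions, not requirements of this article), the difficulty and discrimination indices can be computed as follows:

import numpy as np

def item_analysis(scores, group_fraction=0.27):
    # scores: rows = examinees, columns = items, values 1 (correct) or 0 (incorrect).
    scores = np.asarray(scores)

    # Difficulty index: proportion of examinees answering each item correctly.
    difficulty = scores.mean(axis=0)

    # Discrimination index: difference in proportion correct between the
    # top- and bottom-scoring groups on the overall test.
    totals = scores.sum(axis=1)
    order = np.argsort(totals)
    n_group = max(1, int(round(group_fraction * len(totals))))
    bottom, top = order[:n_group], order[-n_group:]
    discrimination = scores[top].mean(axis=0) - scores[bottom].mean(axis=0)
    return difficulty, discrimination

# Hypothetical results: 6 examinees answering 3 items
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
]
difficulty, discrimination = item_analysis(scores)
print("Difficulty index per item:", difficulty)
print("Discrimination index per item:", discrimination)

Items with very high or very low difficulty, or with low or negative discrimination, can then be flagged for revision in line with the guidance above.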

CONCLUSION

In this article, we described 10 strategies for developing high‐quality multiple‐choice questions. Educators can utilize these strategies when writing new questions as well as for revising existing questions to align with best practice recommendations.

CONFLICT OF INTEREST

The authors declare no potential conflict of interest.

Supporting information

Appendix S1

Gottlieb M, Bailitz J, Fix M, Shappell E, Wagner MJ. Educator's blueprint: A how‐to guide for developing high‐quality multiple‐choice questions. AEM Educ Train. 2023;7:e10836. doi: 10.1002/aet2.10836

Supervising Editor: Dr. Teresa Smith

REFERENCES

1. Pham H, Trigg M, Wu S, et al. Choosing medical assessments: does the multiple‐choice question make the grade? Educ Health (Abingdon). 2018;31(2):65‐71.
2. Jud SM, Cupisti S, Frobenius W, et al. Introducing multiple‐choice questions to promote learning for medical students: effect on exam performance in obstetrics and gynecology. Arch Gynecol Obstet. 2020;302(6):1401‐1406.
3. Downing SM. Construct‐irrelevant variance and flawed test questions: do multiple‐choice item‐writing principles make any difference? Acad Med. 2002;77(10 Suppl):S103‐S104.
4. Downing SM. The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ Theory Pract. 2005;10(2):133‐143.
5. Kern D, Thomas PA, Hughes MT. Curriculum Development for Medical Education: A Six‐Step Approach. Johns Hopkins University Press; 2009.
6. Anderson LW, Krathwohl DR. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Longman; 2001.
7. NBME Item‐writing Guide. National Board of Medical Examiners. February 2021. Accessed October 23, 2022. https://www.nbme.org/sites/default/files/2021‐02/NBME_Item%20Writing%20Guide_R_6.pdf
8. Sweller J. Cognitive load during problem solving: effects on learning. Cognit Sci. 1988;12(2):257‐285.
9. Gillmor SC, Poggio J, Embretson S. Effects of reducing the cognitive load of mathematics test items on student performance. Numeracy. 2015;8(1):1‐18.
10. Yudkowsky R, Park YS, Downing SM. Assessment in Health Professions Education. Routledge; 2019.
11. Chiavaroli N. Negatively‐worded multiple choice questions: an avoidable threat to validity. Pract Assess Res Eval. 2017;22(3):1‐14.
12. Haladyna TM, Rodriguez MC. Developing and Validating Test Items. Routledge; 2013.
13. Tarrant M, Ware J, Mohammed AM. An assessment of functioning and non‐functioning distractors in multiple‐choice questions: a descriptive analysis. BMC Med Educ. 2009;9:40.
14. Rodriguez MC. Three options are optimal for multiple‐choice test items: a meta‐analysis of 80 years of research. Educ Meas Issues Pract. 2005;24(2):3‐13.
15. Gottlieb M, Ibrahim AM, Martin L, Yilmaz Y, Chan T. Educator's blueprint: a how‐to guide for creating a high‐quality infographic. AEM Educ Train. 2022;6:e10793. doi: 10.1002/aet2.10793
16. Cook DA, Brydges R, Ginsburg S, Hatala R. A contemporary approach to validity arguments: a practical guide to Kane's framework. Med Educ. 2015;49(6):560‐575.
17. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. American Council on Education and Macmillan; 1989.
18. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. Standards for Educational and Psychological Testing. American Educational Research Association; 2014.
