Abstract
Multiple choice questions play an important role in training and evaluating biomedical science students. However, the resource-intensive nature of question generation limits their open availability, restricting their contribution mainly to evaluation purposes. Although applied-knowledge questions require a complex formulation process, the creation of concrete-knowledge questions (i.e., definitions, associations) could be assisted by informatics methods. We envisioned a novel and simple algorithm that exploits validated knowledge repositories and generates concrete-knowledge questions by leveraging the relationships between concepts. In this manuscript we present the development and validation of a prototype which successfully produced meaningful concrete-knowledge questions, opening new applications for existing knowledge repositories and potentially benefiting students of all biomedical science disciplines.
Introduction
Multiple choice questions (MCQ) have been widely used as an assessment tool in biomedical sciences education, and are currently a major component of recognized tests such as the Graduate Record Examinations (GRE)1, the United States Medical Licensing Examination (USMLE)2, and the American Board of Medical Specialties3. Appropriately constructed, MCQ can be used to objectively assess all levels of learning of Bloom’s taxonomy of cognitive learning4, from concrete-knowledge up to application5.
A MCQ usually consists of a question or a statement to be solved (usually referred to as the “stem”), followed by a list of possible answers to choose from. There is only one correct answer, while the incorrect answers are distractors. Formulating MCQ, especially finding meaningful distractors, is resource intensive and time consuming. This prevents educators from openly sharing question banks with students, usually restricting access to local cohorts6. This behavior limits the contribution of MCQ mainly to evaluation purposes, although it is recognized that they can contribute to the student training process by improving knowledge retention and learning7. Furthermore, it has been reported that exposing students to questions before they receive learning material can have beneficial effects on their learning8.
Thus, in an attempt to facilitate MCQ generation and expand their use beyond evaluation purposes, automatic MCQ generation approaches have been attempted9. Although applied-knowledge questions require a complex formulation process10 (e.g., “what is the best antibiotic for a patient with infection X and allergic to Y?”), automatic MCQ formulation strategies have been able to successfully generate concrete-knowledge questions. Distinct approaches range from entirely automatic generation of true and false statements by leveraging classes and relationships from domain ontologies, and presenting those statements as possible answers for the stem “choose the correct sentence”9, up to complex computer-aided natural language processing (NLP) based methods11.
Expanding on previous efforts in automatic MCQ generation from domain ontologies, we envisioned a novel approach that automatically leverages existing knowledge repositories, aiming at providing open MCQ banks for students to train and learn. We hypothesize that a simple generation strategy based on concepts’ definitions and their hierarchical relationships can be leveraged to generate MCQ with meaningful distractors. In the current manuscript, we describe the development and validation of our novel approach based on a controlled vocabulary thesaurus, and present a working prototype: a new resource for students in biomedical sciences to learn and review concepts.
Methods
Central paradigm
We attempted to create concrete-knowledge questions, specifically definitions, relying on the premise that hierarchical relationships of concepts could be leveraged to retrieve meaningful distractors. Based on a hierarchical tree structure of concepts and definitions (i.e., a taxonomy), we can select a concept, and expose its definition as the “question”. Then, the concept name becomes the correct answer. Since concepts that share a parent concept are similar but at the same time mutually exclusive, they can be considered appropriate distractors. Thus, we then search for siblings in the hierarchy to retrieve distractors and complete the question (see Figure 1).
Figure 1:

Definition-type question generation process. Our idea consists of selecting a random node within the category of interest, exposing its definition as the question, and retrieving the siblings’ names to populate the distractors.
Data source
Ontologies, particularly in the biomedical domain, are well maintained repositories of concepts and relationships, suitable for our aim. We began our search for an appropriate source for question generation in the Unified Medical Language System (UMLS) definitions table (MRDEF.RRF), which encompasses over one million definitions contributed by the National Cancer Institute metathesaurus (50.1%), the Gene Ontology (27.4%), the National Library of Medicine’s Medical Subject Headings (21.2%), the Foundational Model of Anatomy ontology (0.9%), and the Systematized Nomenclature of Medicine - Clinical Terms (0.5%). Out of the three major contributors, we selected the National Library of Medicine’s controlled vocabulary thesaurus (MeSH) due to its broad scope within the biomedical sciences, its rich definitions, its generally good acceptance and recognition among the scientific community, and its category-based hierarchical tree schema, which would allow us to easily navigate distinct disciplines.
Data representation
After a memorandum of understanding is accepted, MeSH provides universal free access to the descriptors (or subject headings, hence the name) in both XML and ASCII format, and to the tree structure in ASCII format. The MeSH 2014 tree has 55,611 nodes, representing 27,983 unique descriptors from 16 categories (any given descriptor may be represented more than once in the hierarchical tree). For example, the descriptor “Respiratory Tract Fistula” is represented twice in the hierarchy: in location C08.702 and location C23.300.575.687. “C” stands for “Diseases”, “C08” for “Respiratory Tract Diseases” and “C08.702” for “Respiratory Tract Fistula”. Likewise, “C23.300” stands for “Anatomical Pathological Conditions”, “C23.300.575” for “Fistula” and “C23.300.575.687” also for “Respiratory Tract Fistula”. Most MeSH descriptors contain a short free-text narrative, the scope note, which gives the scope and meaning of the concept as written by the MeSH team, sometimes referencing specific sources.
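To make the path enumeration format concrete, the following minimal Python sketch derives the parent location, the ancestors, the category letter and the depth from a tree number. The function names are illustrative assumptions and are not part of MeSH or of our prototype; the sketch also assumes that the first character of a tree number is the category letter, as in the example above.

```python
from typing import List, Optional


def parent_tree_number(tree_number: str) -> Optional[str]:
    """Return the parent location, or None for a top-level location such as 'C08'."""
    head, sep, _ = tree_number.rpartition(".")
    return head if sep else None


def ancestors(tree_number: str) -> List[str]:
    """Return all ancestor locations, from the top-level one down to the parent."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def category(tree_number: str) -> str:
    """Return the category letter (assumed to be the first character)."""
    return tree_number[0]


def depth(tree_number: str) -> int:
    """Return the number of dot-separated segments in the tree number."""
    return tree_number.count(".") + 1
```

For instance, under these assumptions the parent of C23.300.575.687 (“Respiratory Tract Fistula”) is C23.300.575 (“Fistula”).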
We transformed the path enumeration format of the MeSH tree and created an adjacency list model and a nodes table. Then, from the concepts table, we extracted each MeSH Heading (the descriptor name), the associated MeSH Scope note and the MeSH tree numbers (the location or locations within the hierarchy), and loaded everything onto a relational database (MySQL).
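The transformation described above can be sketched as follows. This is an illustration only: it uses Python with an in-memory SQLite database instead of the MySQL database of the prototype, and the table names, column names and sample rows (including the placeholder scope notes and the hypothetical sibling) are assumptions made for the example.

```python
import sqlite3

# Illustrative rows: (MeSH Heading, scope note or None, tree numbers).
# Scope notes are placeholders; the last row is a hypothetical sibling.
SAMPLE = [
    ("Respiratory Tract Diseases", "placeholder scope note", ["C08"]),
    ("Respiratory Tract Fistula", "placeholder scope note",
     ["C08.702", "C23.300.575.687"]),
    ("Fistula", "placeholder scope note", ["C23.300.575"]),
    ("Hypothetical sibling", None, ["C23.300.575.999"]),
]


def build_tables(conn, descriptors):
    """Load descriptors into a concepts table and an adjacency-list nodes table."""
    conn.executescript("""
        CREATE TABLE concepts (
            id INTEGER PRIMARY KEY, heading TEXT, scope_note TEXT);
        CREATE TABLE nodes (
            tree_number TEXT PRIMARY KEY,
            parent TEXT,  -- adjacency list: location of the parent node
            concept_id INTEGER REFERENCES concepts(id));
    """)
    for heading, scope_note, tree_numbers in descriptors:
        cur = conn.execute(
            "INSERT INTO concepts (heading, scope_note) VALUES (?, ?)",
            (heading, scope_note))
        concept_id = cur.lastrowid
        for tree_number in tree_numbers:
            # The parent is the tree number without its last segment.
            head, sep, _ = tree_number.rpartition(".")
            conn.execute("INSERT INTO nodes VALUES (?, ?, ?)",
                         (tree_number, head if sep else None, concept_id))
    conn.commit()


def siblings(conn, tree_number):
    """Return the headings of the nodes that share a parent with tree_number."""
    rows = conn.execute("""
        SELECT c.heading
        FROM nodes n JOIN concepts c ON c.id = n.concept_id
        WHERE n.parent = (SELECT parent FROM nodes WHERE tree_number = ?)
          AND n.tree_number <> ?
        ORDER BY c.heading
    """, (tree_number, tree_number))
    return [row[0] for row in rows]
```

With the parent stored in each row, looking up siblings is a single query on the nodes table, with no string matching on tree numbers. In this sketch, top-level locations have a NULL parent and therefore return no siblings.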
Algorithm
Out of the 27,983 concepts contained in MeSH, only 26,144 (93.4%) had a definition (the “scope note”), reducing the number of useful initial nodes from 55,611 to 54,148. We did not delete concepts without definitions since, even if they are not suitable for question generation, they can be used as distractors. We designed the algorithm following the simple approach stated above (Figure 1): it selects a random node from the tree and exposes the definition of the node as the question. One of the choices is the name of the node being displayed (correct answer), while the other alternatives (distractors) are retrieved by looking for siblings within the subtree.
Education-related research has shown that three-option MCQ (2 distractors) provide a test of similar quality to those with 4 or 5 options12,13, which could improve efficiency in question generation. For our prototype, time was not a concern, but the number of available siblings was important. Table 1 shows the number of nodes with at least a given number of siblings. For example, 36,321 nodes (67.1% of the tree) have at least 3 siblings, and thus could produce a question with 4 alternatives. Since the recommendation of using three-option MCQ comes from human-generated questions, we chose to increase the number of alternatives to 4 to reduce the chance of selecting the correct answer by guessing, while still being able to use 67.1% of the tree.
Table 1:
Number of nodes with a definition, by the number of siblings they have.
| # of siblings | n | (%) |
|---|---|---|
| 0 or more | 54,148 | 100% |
| 1 or more | 48,516 | 89.6% |
| 2 or more | 42,045 | 77.6% |
| 3 or more | 36,321 | 67.1% |
| 4 or more | 31,572 | 58.3% |
| 5 or more | 27,899 | 51.5% |
| 6 or more | 24,466 | 45.2% |
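Counts such as those in Table 1 can be derived from the adjacency list by counting, for each node with a definition, the other children of its parent. The following is a minimal Python sketch under stated assumptions: `parent_of` maps every node to a parent key (top-level nodes are assumed to map to their category), `defined` is the set of nodes that have a definition, and both names are hypothetical.

```python
from collections import Counter


def sibling_table(parent_of, defined, max_k=6):
    """Return rows (k, n, percent): nodes with a definition and >= k siblings.

    Siblings without a definition still count, since they can be distractors.
    """
    children_per_parent = Counter(parent_of.values())
    eligible = [node for node in parent_of if node in defined]
    rows = []
    for k in range(max_k + 1):
        n = sum(1 for node in eligible
                if children_per_parent[parent_of[node]] - 1 >= k)
        rows.append((k, n, 100.0 * n / len(eligible)))
    return rows
```

A node with at least k siblings can yield a question with k + 1 alternatives, for which the chance of a correct blind guess is 1/(k + 1): 1/3 with two distractors and 1/4 with three.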
When more than 3 siblings are available, the distractors are selected at random. The choices (distractors plus the correct node name) are sorted alphabetically before being presented to the user.
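The generation steps described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PHP and MySQL code of the prototype: the `Node` class, the `generate_question` function and the toy concepts are hypothetical names introduced for the example.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str                  # concept name (the MeSH Heading)
    definition: Optional[str]  # scope note, or None when absent
    parent: Optional[str]      # key of the parent node


# Toy taxonomy with placeholder concepts, for illustration only.
TOY = {
    "P.1": Node("Concept A", "Definition of concept A", "P"),
    "P.2": Node("Concept B", "Definition of concept B", "P"),
    "P.3": Node("Concept C", None, "P"),  # no definition: distractor only
    "P.4": Node("Concept D", "Definition of concept D", "P"),
    "Q.1": Node("Concept E", "Definition of concept E", "Q"),
    "Q.2": Node("Concept F", "Definition of concept F", "Q"),
}


def generate_question(nodes, rng=random, n_distractors=3):
    """Build one definition-type question from a dict of key -> Node."""
    children = {}
    for key, node in nodes.items():
        children.setdefault(node.parent, []).append(key)
    # Only nodes with a definition and enough siblings can be asked about.
    candidates = [key for key, node in nodes.items()
                  if node.definition
                  and len(children[node.parent]) - 1 >= n_distractors]
    if not candidates:
        raise ValueError("no node has a definition and enough siblings")
    target = rng.choice(candidates)
    sibling_keys = [k for k in children[nodes[target].parent] if k != target]
    # Distractors are drawn at random among the siblings.
    distractors = rng.sample(sibling_keys, n_distractors)
    # Choices are sorted alphabetically before being shown.
    choices = sorted(nodes[k].name for k in [target] + distractors)
    return {"stem": nodes[target].definition,
            "choices": choices,
            "answer": nodes[target].name}
```

In the toy data, “Concept C” has no definition and is therefore never the subject of a question, but it can still appear as a distractor; the two nodes under “Q” are never selected because they have only one sibling each.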
Application development
In our preceding study concerning medical education (MoCK Test, manuscript under preparation), we parsed an existing open question bank, only available as a flat web page, and developed a native mobile application providing mobile-optimized access for students to test their knowledge on the go. We described the wide adoption and usage patterns, evidencing the benefits of mobile-optimized content: it allowed users to study whenever they had a short opening in their busy and interrupted life.
Based on those results, and on our belief that learning tools should be available to all students regardless of their device preference or operating system, we named the new tool “MoCKTest 2.0” and developed it as a web-based application, following the mobile-first paradigm. We used free web technologies, including the Foundation (ZURB)14 framework on the client side, and PHP and MySQL on the server side. The responsive web design we used made our concept available to any internet-capable device, while ensuring a mobile-optimized presentation of the content.
Evaluation and user acceptance
Because biomedical science students were considered the group most likely to benefit from the content of MoCKTest 2.0, we invited medical, nursing and pharmacy students to try the application and participate in the evaluation section of our study. We contacted Ohio State University students from those disciplines via their institutional weekly newsletter and/or via personal referral, while access to the app was open to anyone willing to participate in the study (after accepting the informed consent). A Likert-scale-based usability survey was created to evaluate user acceptance, consisting of 15 questions measuring concepts including ease of use, usefulness, satisfaction, and intention to use. The survey was triggered after users completed 15 questions, and participation was voluntary. In addition, we conducted semi-structured interviews with subject matter experts (SME) in biomedical education, asking them to examine and comment on the quality and meaningfulness of questions generated by MoCKTest 2.0, the potential benefits that this tool could provide to students, and suggested improvements. The Office of Responsible Research Practices (ORRP) at The Ohio State University determined that this research protocol was exempt from IRB review, as it corresponds to a review exemption category established by federal regulations. This protocol was approved as such by the ORRP.
Results
Prototype
Our algorithm successfully generated questions using the proposed approach. The adjacency list model used to represent the tree resulted in good performance for question generation. We designed an appealing and intuitive user interface, and ensured mobile-optimized content with the responsive design provided by Foundation. The final prototype, MoCKTest 2.0, was successfully deployed in our production environment at the OSU Wexner Medical Center IT servers, and can be accessed at http://www.mocktest2.com.
In an effort to minimize barriers to adoption, we implemented social login capabilities using the OAuth 2.0 standard15. This approach eliminates the requirement to create a user and remember a new password. It relies on permissions granted by the users to get their email address from their preferred social account, which is then used for authentication purposes. Users can choose to log in with Facebook, Google or Microsoft accounts.
On application launch, users can select one of the MeSH categories and subcategories and start answering questions concerning the selected topic (see Figure 2). Due to the considerable size of the question bank (36,000+ concepts) and the randomness of question generation, we provide a favorites feature, allowing users to tag questions to practice later. By default, the app also permits students to answer questions they have gotten wrong in previous attempts. After each answer, the user receives immediate feedback which, in case a wrong answer was selected, indicates the correct choice (Figure 3). New question requests happen in the background via an AJAX call without a page refresh, improving user experience and decreasing traffic to and from the server. Users can also assess their performance and get an overview of the number of questions answered per subcategory and the percentage of success in each of them. Stats can be reset by students to begin a new study cycle at any point.
Figure 2:
Desktop view of a question generated from the “Diseases” category, particularly from the “Eye diseases” subsection. The user can see that the topic of the definition corresponds to “Ocular motility disorders”, and is presented with four possible alternatives for the given definition.
Figure 3:

Mobile view of feedback received when a question is answered wrong.
Users’ perception of the tool
Invitations to participate in our study were sent to the students of The Ohio State University through their college-specific newsletters in the second week of February 2014. In three weeks, 325 unique visitors (not exclusively Ohio State University students) viewed the landing page, while only 120 accepted the informed consent and created an account (conversion rate of 36.9%). Of those who used the tool, 50 answered the survey (41.7% response rate). Seventy-eight percent found it useful for biomedical knowledge self-assessment, while 75% agreed that using the app could improve their biomedical knowledge and be useful for their medical, nursing or pharmacy education. Sixty-seven percent believed that using the system could improve their performance in school. Ninety-seven percent of respondents believed that learning to navigate the app was easy. A complete report of the survey results can be found in Appendix 1.
Feedback from educators
We interviewed four subject matter experts (SME) in biomedical education: a Vice Dean for Education and Associate Vice President for Health Sciences Education of a College of Medicine, an Assistant Dean for Prelicensure Programs and Professor of Clinical Nursing of a College of Nursing, and an instructor of licensure review courses including the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the National Council Licensure Examination for Practical Nurses (NCLEX-PN). During 15 minutes of detailed assessment of random questions, the SME evaluated the question generation process and the quality and meaningfulness of the questions presented, based on standard MCQ item-writing guidelines16. Despite an overall satisfaction with the quality of the distractors generated, two scenarios raised concerns with the stems. The first scenario corresponded to high-level nodes (close to the root), which seemed to produce trivial and basic questions (e.g., “Tumors or cancer of the uvea.” Answer: “Uveal neoplasms”). The second scenario corresponded to questions where, despite an appropriate level of complexity, the user was able to “guess” the correct answer. This seemed to happen when a variation of the node name (e.g., a synonym) was present in the definition itself. The application received unanimously very good feedback on ease of use and, most importantly, the experts all agreed on the benefit of providing such a question bank to students for training purposes.
Discussion
Although adding complex post-processing for better articulation of the questions might improve the prototype, the novelty of our proposal lies in the simplicity of the algorithm presented. By relying only on a hierarchical structure and definitions, our approach is discipline agnostic and allows the use of any knowledge representation meeting these requirements.
Due to the nature of the datasets involved and the scope of the questions, our tool might not be used by teachers to generate questions for tests. However, the easy implementation of our approach and the open nature of the content provide an unprecedented platform for students to train, learn and self-assess their knowledge. Despite the two problematic scenarios identified, the tool as is represents a contribution to education for all biomedical students.
Overcoming problematic scenarios
Regarding the first scenario, where “high level nodes” might be too general and produce trivial questions, a potential solution is to limit question retrieval to certain levels of the tree. However, each MeSH category has a different number of branches (subcategories), and each “branch” has a different number of levels of depth (Figure 4). For example, the subcategories of “Publication Characteristics” have only five levels of depth, while one of the subcategories of “Organisms” reaches twelve levels of depth. For that reason we believe that a level 4 question from “Publication Characteristics” might not be comparable in complexity to a level 4 question from “Organisms”. We hypothesize that a meaningful way of overcoming this issue might be to retrieve nodes from up to a certain level away from the leaves (bottom up). For example, a level 8 question from the geography category (deepest level) is as granular as that category can go, and thus might be equivalent in complexity to a level 11 question from the “Chemicals and Drugs” category (deepest level). A complementary approach would be to trim out categories that are expected to produce futile questions, such as “Publication Characteristics”. Future work will focus on validating these hypotheses.
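One possible reading of the bottom-up selection can be sketched as follows. This Python example is a hypothetical illustration that has not been validated in the prototype: it measures, for every node, the distance to the deepest leaf in its own subtree, and keeps only the nodes within a chosen distance. The function names and the `parent_of` mapping (node key to parent key, with `None` for top-level nodes) are assumptions.

```python
from collections import defaultdict


def leaf_distance(parent_of):
    """Return, for every node, the distance to the deepest leaf below it."""
    children = defaultdict(list)
    for node, parent in parent_of.items():
        children[parent].append(node)

    height = {}

    def visit(node):
        # A leaf has height 0; otherwise 1 + the height of its tallest child.
        if node not in height:
            height[node] = 1 + max(
                (visit(child) for child in children.get(node, [])),
                default=-1)
        return height[node]

    for node in parent_of:
        visit(node)
    return height


def nodes_near_leaves(parent_of, max_distance):
    """Return, sorted, the nodes at most max_distance levels above a leaf."""
    height = leaf_distance(parent_of)
    return sorted(node for node, h in height.items() if h <= max_distance)
```

Under this reading, a leaf at level 8 of one category and a leaf at level 11 of another both have distance 0, so they would be treated as equally granular regardless of the depth of their branch.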
Figure 4:
Composition of MeSH. The central pie chart represents the relative contribution of each category to the tree. The orbital graphs represent the composition of each category (distinct number of subcategories, and distinct levels of depth). For example, “Geographicals” has only one subcategory, with 8 levels of depth, while “Organisms” has 5 subcategories, varying from 4 levels down to 11 levels of depth. The size of a node represents the relative number of concepts in that level.
Although the second scenario, where the user could guess the correct answer, might be regarded as a limitation of the tool for evaluation purposes, we believe that it could also be seen as a learning opportunity. The end goal of our tool is to empower the user to learn, not to serve as an evaluation tool. Thus, even if students are able to guess the answer to a question they would otherwise have missed, they are actually learning the concept by reading the definition and associating it with the correct answer. Moreover, the definitions contained in MeSH often provide more detailed information than what would be strictly necessary to answer the question, thus adding content to this learning opportunity.
Ongoing efforts
Based on feedback received during the interviews, we envisioned possible improvements for the prototype. First, we will provide the ability to read the definitions of the distractors when receiving feedback on any given question, thus increasing the learning opportunities for unknown domains. Second, we will provide the user with a module for navigating down the tree and discovering definitions of unknown concepts, akin to the MeSH browser17. Another interesting idea, proposed by the nursing experts, is an alternate question generation process, also based on the hierarchical structure available. The model proposes to create “except” questions by selecting a random parent node, presenting the name of the node as the group name, and listing its children plus a random node retrieved from a distant relative. Thus, the user will be asked to identify the term that doesn’t belong to the group. For example: All of the following correspond to Peroxisomal Disorders EXCEPT: Adrenoleukodystrophy, Mevalonate Kinase Deficiency, Refsum Disease, Fanconi Syndrome and Zellweger Syndrome. Although the idea seems interesting, it requires further testing and tuning, since the meaningfulness of this approach might be strongly affected by the depth of the nodes.
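The proposed “except” question could be sketched as follows. This Python example is a hypothetical illustration, not an implemented feature: the function names and the toy data are assumptions, and the “distant relative” is simplified here to any node that is neither the selected parent nor one of its descendants.

```python
import random


def is_descendant(node, ancestor, parent_of):
    """Return True when ancestor lies on the path from node up to the root."""
    while node is not None:
        node = parent_of.get(node)
        if node == ancestor:
            return True
    return False


def generate_except_question(parent_of, names, rng=random, n_members=4):
    """Build an 'except' question: n_members children plus one intruder."""
    children = {}
    for node, parent in parent_of.items():
        children.setdefault(parent, []).append(node)
    # Parents must be named nodes with enough children to fill the group.
    parents = [p for p, kids in children.items()
               if p in names and len(kids) >= n_members]
    if not parents:
        raise ValueError("no parent node has enough children")
    parent = rng.choice(parents)
    members = rng.sample(children[parent], n_members)
    group_names = {names[k] for k in children[parent]}
    # Intruder: outside the parent's subtree and not sharing a group name.
    outsiders = [k for k in parent_of
                 if k != parent
                 and not is_descendant(k, parent, parent_of)
                 and names[k] not in group_names]
    if not outsiders:
        raise ValueError("no node available outside the selected group")
    intruder = rng.choice(outsiders)
    choices = sorted(names[k] for k in members + [intruder])
    stem = "All of the following correspond to %s EXCEPT:" % names[parent]
    return {"stem": stem, "choices": choices, "answer": names[intruder]}
```

How far away the intruder should be drawn from is precisely the kind of tuning mentioned above: an intruder from a distant branch is likely to be easy to spot, whereas a close relative may produce a harder question.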
Limitations
Although we successfully validated the question generation strategy, the use case as an educational tool for biomedical students might be limited by the data source selected: MeSH might not be the best source of comprehensive medical knowledge, and may have several biases in the concepts included in or excluded from its hierarchy. Our evaluation of the prototype seems to suggest that the strategy might be useful for medical education, although a larger-scale evaluation effort is probably needed. Future efforts will focus on expanding the tool to a larger population to gain feedback. The persistent data storage implemented in our prototype is not the most efficient in this context: a graph database would improve the queries used to look up distractors, and the current approach might become a limitation if the project scales to large populations.
Conclusion
We introduce MoCKTest 2.0: a new asset for students in biomedical sciences to learn and review concepts. Our simple tool leverages existing resources such as controlled vocabulary thesauri, creating concrete-knowledge questions and opening a new educational use for knowledge repositories. Students and educators recognized the contribution of such a question bank to the learning process, extending the contribution of multiple choice questions beyond evaluation purposes. Despite minor limitations, our idea has the potential to contribute to the training and education of scientists and researchers across the biomedical sciences.
Appendix 1
Diverging stacked bar charts of survey results. Thanks to the central baseline, this visualization makes it easy to identify the skew between total positive and negative responses. Each question is represented as a row. The total width of the bar shows the percentage of respondents who have non-neutral feelings towards the statement. The depth of color represents the intensity of feeling.
References
- 1. Graduate Record Examinations. 2013. http://www.ets.org/gre.
- 2. United States Medical Licensing Examination. 2013. http://www.usmle.org/
- 3. American Board of Medical Specialties. 2013. http://www.abms.org/
- 4. Newble D, Cannon RA. A Handbook for Teachers in Universities and Colleges: A Guide to Improving Teaching Methods. 3rd ed. London: Kogan Page; 1991. p. 161. [Accessed December 1, 2013]. http://books.google.com/books/about/A_handbook_for_teachers_in_universities.html?id=EYefAAAAMAAJ&pgis=1.
- 5. Morrison GR, Ross SM, Kemp JE. Designing Effective Instruction. Vol Wiley; 2006. Developing evaluation instruments; p. 464. [Accessed December 1, 2013]. http://www.amazon.com/Designing-Effective-Instruction-Gary-Morrison/dp/0470074264.
- 6. Hammoud MM, Barclay ML. Development of a Web-based question database for students’ self-assessment. Acad Med. 2002;77(9):925. [Accessed December 1, 2013]. http://www.ncbi.nlm.nih.gov/pubmed/12228094.
- 7. Fox JS. The multiple choice tutorial: its use in the reinforcement of fundamentals in medical education. Med Educ. 1983;17(2):90–94. doi: 10.1111/j.1365-2923.1983.tb01106.x. [Accessed December 1, 2013]. http://www.ncbi.nlm.nih.gov/pubmed/6843396.
- 8. Skalban Y, Ha LA, Specia L, Mitkov R. Automatic Question Generation in Multimedia-Based Learning. COLING (Posters). 2012. [Accessed November 19, 2014]. http://scholar.google.com/scholar?hl=en&q=Automatic+Question+Generation+in+multimedia-based+learning.&btnG=&as_sdt=1%2C9&as_sdtp=#0.
- 9. Papasalouros A, Kanaris K, Kotis K. Automatic Generation Of Multiple Choice Questions From Domain Ontologies; IADIS International Conference E-Learning 2008, Amsterdam, The Netherlands, July 22–25, 2008. Proceedings; 2008. pp. 427–434. [Accessed November 19, 2014]. http://www.researchgate.net/publication/220969955_Automatic_Generation_Of_Multiple_Choice_Questions_From_Domain_Ontologies.
- 10. Collins J. Education techniques for lifelong learning: writing multiple-choice questions for continuing medical education activities and self-assessment modules. Radiographics. 26(2):543–551. doi: 10.1148/rg.262055145.
- 11. Mitkov R, Ha LA. Computer-aided generation of multiple-choice tests. In: Edmonton, editor. Proceedings of the HLT/NAACL 2003 Workshop on Building Educational Applications Using Natural Language Processing. 2003. pp. 17–22.
- 12. Vyas R, Supe A. Multiple choice questions: a literature review on the optimal number of options. Natl Med J India. 2008;21(3):130–133. [Accessed December 1, 2013]. http://www.ncbi.nlm.nih.gov/pubmed/19004145.
- 13. Tarrant M, Ware J. A comparison of the psychometric properties of three- and four-option multiple-choice questions in nursing assessments. Nurse Educ Today. 2010;30(6):539–543. doi: 10.1016/j.nedt.2009.11.002.
- 14. Foundation: The Most Advanced Responsive Front-end Framework from ZURB. [Accessed July 31, 2013]. http://foundation.zurb.com/
- 15. Hardt D. The OAuth 2.0 Authorization Framework. 2012. [Accessed February 28, 2014]. http://tools.ietf.org/html/rfc6749.
- 16. Haladyna TM, Downing SM, Rodriguez MC. A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment. Appl Meas Educ. 2002;15(3):309–333. doi: 10.1207/S15324818AME1503_5.
- 17. MeSH Browser. 2014. [Accessed February 28, 2014]. http://www.nlm.nih.gov/mesh/MBrowser.html.



