Abstract
Practitioners in the sciences have used the “flow” of knowledge (post-test score minus pre-test score) to measure learning in the classroom for the past 50 years. Walstad and Wagner, and Smith and Wagner moved this practice forward by disaggregating the flow of knowledge and accounting for student guessing. These estimates are sensitive to misspecification of the probability of guessing correct. This work provides guidance to practitioners and researchers facing this problem. We introduce a transformed measure of true positive learning that, under conditions knowable to the practitioner, performs better when students’ ability to guess correctly is misspecified, and that converges to Hake’s normalized learning gain estimator under certain assumptions. We then use simulations to compare the accuracy of two estimation techniques under various violations of the assumptions of those techniques. Using recursive partitioning trees fitted to our simulation results, we provide the practitioner concrete guidance based on a set of yes/no questions.
Keywords: disaggregated learning, gain measurement, value-added learning, Monte Carlo simulation
Introduction
For the past half-century, educators in the sciences (e.g., economics, physics) have used pre- and post-tests to measure the “flow” of knowledge (Siegfried & Fels, 1979), or aggregate gains in learning over time. Flow-of-knowledge measures play an important role in program/course assessment, in instructor self-improvement, and in estimating the impact of educational treatments, as they have a closer connection to the total learning that occurred during a treatment than stock measures (such as a post-treatment exam). In determining the effectiveness of an intervention or program, the practitioner1 is usually interested in the increase in student knowledge that occurred because of the intervention. As the effect of the treatment is not directly observable, the increase in student knowledge (the flow) from the beginning of the treatment period (e.g., class or program) to the end of the treatment period is often used instead.
Attempting to measure knowledge increases without accounting for prior knowledge can lead to unreliable measures of student learning. Prior student knowledge can vary widely from one section of a course to another (and the sample is only as large as the class or program). Despite this hurdle, it is necessary to accurately measure student knowledge gains in such settings, as these measurements drive education intervention research, changes that individual faculty might make to a course, and the perceived strengths and weaknesses of an academic program. In its most basic formulation, the practitioner or researcher can find the flow of knowledge by subtracting the matched pre-test scores from the post-test scores (students who did not take both the pre- and post-test are removed from the analysis). While differencing the pre- and post-tests is a common approach, other methods include regression analysis with the pre-test as an independent variable and differencing with a normalization factor (for instance, the Hake, 1998, estimator).
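As a concrete illustration, the matched-differencing approach can be sketched in a few lines of Python (the function name and data layout are ours, for illustration only):

```python
def flow_of_knowledge(pre, post):
    """Average post-test minus pre-test score over matched students.

    pre and post map a student identifier to a score in [0, 1];
    students missing from either test are dropped from the analysis.
    """
    matched = pre.keys() & post.keys()
    if not matched:
        raise ValueError("no students took both the pre- and post-test")
    return sum(post[s] - pre[s] for s in matched) / len(matched)
```

For example, `flow_of_knowledge({'a': 0.5, 'b': 0.2, 'c': 0.9}, {'a': 0.7, 'b': 0.6})` drops the unmatched student `c` and averages the two remaining gains (≈ 0.3).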
Walstad and Wagner (2016) improved this practice by suggesting that the practitioner or researcher should disaggregate learning into four types: positive, negative, zero, and retained. Positive learning is said to occur when the student answered incorrectly on the pre-test, then answered correctly on the post-test. Negative learning is said to occur when the student answered correctly on the pre-test but then answered incorrectly on the post-test. Zero learning is said to occur when the student answered incorrectly both times, and retained learning is said to occur when the student answered correctly both times.
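In code, this disaggregation is a simple mapping over matched responses. The sketch below (Python; function names are ours) tallies the four types for a single question:

```python
from collections import Counter

def classify_learning(pre_correct, post_correct):
    """Map one student's matched pre/post answers (True = correct) to the
    four learning types of Walstad and Wagner (2016)."""
    if not pre_correct and post_correct:
        return "positive"
    if pre_correct and not post_correct:
        return "negative"
    if not pre_correct and not post_correct:
        return "zero"
    return "retained"

def raw_learning_shares(pre, post):
    """Proportion of each learning type across a matched class roster."""
    n = len(pre)
    counts = Counter(classify_learning(a, b) for a, b in zip(pre, post))
    return {t: counts[t] / n for t in ("positive", "negative", "zero", "retained")}
```

Each student falls into exactly one category, so the four shares always sum to one.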
While many fields have generated 2 × 2 tables to study specific subsets of populations based on outcomes,2 the insight of Walstad and Wagner (2016) was that these disaggregated learning types represent different outcomes and should not be intermixed; a practitioner should react very differently to an increase in positive versus a decrease in negative learning (an increase in positive learning is likely due to an improved pedagogical technique, while a decrease in negative learning is likely due to the removal of a confusing explanation). Despite this fundamental difference, the flow of knowledge is measured as positive learning minus negative learning. This insight led to rapid adoption of the new procedure in research and in assessment. For instance, Emerson and English (2016) used positive learning to estimate the impact of educational treatments. Happ et al. (2016) measured the relationship between positive learning and characteristics such as native language and grade point average (GPA). To enable the adoption of this technique when assessing many courses, Smith (2018) developed software that produced the disaggregated values directly from Scantron formatted exam output files.
The work by Walstad and Wagner (2016) had one notable shortcoming: It did not account for student guessing. Guessing on a multiple choice exam decreases the accuracy of the instrument (Zimmerman & Williams, 2003), and multiple choice exams are in widespread use, especially in lower level courses. Smith and Wagner (2018) showed that guessing masks the true learning values and creates bias in the expected values of the unadjusted learning types suggested by Walstad and Wagner (2016). However, the article also showed that the unadjusted learning values could be adjusted to account for guessing when the probability of guessing correct is known. The critical adjusted measures are $\hat{\alpha}$ (true positive learning) and $\hat{\beta}$ (true negative learning). When determining whether a pedagogical technique is effective or when measuring the learning in a class, $\hat{\alpha}$ is the value of interest. (Throughout this work, we will use hats to emphasize when we are discussing an estimate of a learning parameter and not the parameter itself.) This modified approach can be applied in educational research. Even more valuable is the improved accuracy it provides to assessment procedures (the software in Smith, 2018, also produces these adjusted measures).
Smith and Wagner (2018) adjusted the results from the national-norming sample for micro- and macroeconomics (Test of Understanding of College Economics [TUCE]; Walstad et al., 2007). The article estimated the probability of guessing correct using a workhorse in the psychometrics literature: the Three-Parameter Logistic (3PL; De Ayala, 2013). The 3PL estimates the probability that a very low ability student answers a given question correctly despite not knowing the correct answer. In the literature, this value is referred to as the pseudoguessing parameter. This method can detect when students are able to remove distractors from consideration; many well-written questions have point estimates for the pseudoguessing parameter statistically indistinguishable from $1/k$, where $k$ is the number of question options. While this is a reasonable method for the TUCE data, the 3PL cannot be applied to most classes.
The 3PL procedure requires a tremendous amount of data to converge; researchers disagree about the precise minimum number, but 1,000 students with 20 questions seems to be the median recommendation (De Ayala, 2013, pp. 130–131). Even when this observational threshold is satisfied, the 3PL estimator will only converge on truth under select conditions.3 This convergence problem is part of the reason Han (2012) suggested fixing the pseudoguessing parameter at $1/k$ to estimate the other parameters in the model.
Relationship to General Educational Measurement Literature
While the consideration of guessing in evaluating learning is a relatively new feature in the Economic Education literature, it has been a core element of Educational Measurement for at least 50 years. Birnbaum (1968) extended existing models by statistically accounting for guessing.4 Later, Maris (1999) elaborated that students could have a partial mastery of question content that changes the probability of answering a question correct. These mental resources may include skills such as educated guessing (partial knowledge), elimination techniques, or other methods of increasing the likelihood of success.
Educated guessing and other techniques to utilize partial knowledge become particularly important where the stakes of the test are not sufficiently high. Wise and DeMars (2006) present a model that is able to accommodate and adjust for low levels of effort. Under low effort, many students will seek to simply answer questions quickly, either due to indifference or fatigue, and will thereby generate inaccurate estimates of learning or understanding. Accounting for low effort is critical when educators seek to use scores from low-stakes problems (pilot questions given without stakes, small homework assignments, etc.) to project scores or results in high-stakes contexts like certification exams or final exams. The model proposed by Wise and DeMars (2006) can outperform the 3PL model in some cases. Another way for educators to overcome the low-stakes problem is simply to raise the stakes. Wise and DeMars (2005) propose several methods to make a low-stakes test high-stakes for both the educator and the test-takers, including providing incentives for good performance and prompt feedback.
Finally, Gönülateş and Kortemeyer (2017) show that item response theory can be adapted to account for a student’s propensity to guess using multidimensional item response theory (MIRT). This approach, like the approach of Wise and DeMars (2006), can be utilized to generate more accurate forecasts from low-stakes settings like homework assignments to higher stakes examinations. Gönülateş and Kortemeyer (2017) emphasize that controlling for more learner traits will improve models, but mostly in low-stakes settings.
In our work below, we assume that questions are presented in a high-stakes setting, and that educators can accurately state how many effective discriminators exist for a given question. In the context of evaluating homework questions or other relatively low-stakes questions, these assumptions may not be true. In those cases, we urge the reader to refer to the works cited in this section for a more complete understanding of how guessing and partial knowledge affect our understanding of learning measurements.
Contribution to the Literature
This article extends the work of Walstad and Wagner (2016) and Smith and Wagner (2018) in a number of ways. First, we suggest a transformation of the learning estimator: $\hat{G} = \hat{\alpha}/(1 - \hat{\theta})$. We show that this transformation (what we call the gain estimator) performs better than the original $\hat{\alpha}$ estimator under conditions knowable to the practitioner or researcher; in some cases, it also has a superior interpretation. Moreover, we extend the Monte Carlo simulations of no learning by Smith and Wagner (2018, pp. 5–9) to include the gain estimator. Therefore, the practitioner or researcher can perform the same statistical test regardless of whether they use the $\hat{\alpha}$ or the gain estimator.
Second, this article suggests that when the probability of guessing correct on a set of exam questions does not converge to $1/k$, assuming true negative learning is zero could be a superior option. Throughout the article, the modified estimators given this assumption are provided. The gain estimator under the assumption that true negative learning is zero results in Hake’s (1998) normalized learning gain, a common learning value used in STEM (science, technology, engineering, and mathematics) fields. Therefore, this article provides a clarified interpretation of previous research that has used the Hake learning gain.
Third, under strong assumptions, we use a large number of Monte Carlo simulations to compare deviations of the estimated $\hat{\alpha}$ and $\hat{G}$ from their true values when assuming the probability of guessing correct equals $1/k$ and when setting estimated true negative learning to zero $(\hat{\beta} = 0)$. Using these simulation results, we make three general observations and fit the results to decision trees (in Online Appendix A). These trees represent the collective guidance of all the Monte Carlo simulations in visual form and are critical to providing practitioner guidance.
Finally, we provide a “practitioner’s guide” to the results in this article. This section walks the practitioner through the process of choosing a learning estimate ($\hat{\alpha}$ or $\hat{G}$), selecting the better method of specifying the probability of guessing correct ($\hat{g} = 1/k$ or the probability implied by $\hat{\beta} = 0$), and performing a statistical test against the null hypothesis of no learning. The practitioner who is only interested in applying the resulting techniques in this article can read the practitioner’s guide without first reading the preceding sections.
This article will proceed as follows: In “A Brief Description of the Estimators in Smith and Wagner (2018)” section, we review the estimators in Smith and Wagner (2018) and provide those estimators when estimated true negative learning is zero. In “The Gain Measurement” section, we will propose a transformed measure of positive learning. In “Comparing the Adjusted Positive Learning and the Gain Estimators’ Sensitivity to Probability Misspecification” section, we will show under what conditions the gain estimator is less sensitive than the original estimator to probability misspecification. In “A Statistical Test of the Gain Measurement Using a Counterfactual Monte Carlo Simulation” section, we will provide a statistical test for the gain estimator equivalent to what was provided by Smith and Wagner (2018) for the original estimator. In “Comparative Monte Carlo Simulations” section, we compare deviations from the true learning values with Monte Carlo simulations. In “Practitioner’s Guide” section, we provide instructions on how to use the results in this article. We then conclude the article.
A Brief Description of the Estimators in Smith and Wagner (2018)
In Smith and Wagner (2018), the authors develop estimators of the true underlying learning parameters using the raw estimates described in Walstad and Wagner (2016). The estimators are as follows:
$\hat{\alpha} = \dfrac{P}{1 - \hat{g}} - \dfrac{\hat{g}Z}{(1 - \hat{g})^{2}}$  (1)

$\hat{\beta} = \dfrac{N}{1 - \hat{g}} - \dfrac{\hat{g}Z}{(1 - \hat{g})^{2}}$  (2)

$\hat{\theta} = 1 - \dfrac{P + Z}{1 - \hat{g}}$  (3)
where $P$ is raw or unadjusted positive learning, $N$ is unadjusted negative learning, $Z$ is unadjusted zero learning, and $R$ is unadjusted retained learning (Figure 1). $\hat{g}$ is the estimated probability of a student who does not know the correct answer answering correct nonetheless (for the estimates to be valid, the probability of guessing correctly cannot equal one).5 $\hat{\alpha}$ is the estimate of true positive learning, $\hat{\beta}$ is the estimate of true negative learning,6 and $\hat{\theta}$ is the estimate of incoming stock knowledge (what proportion of students knew the answer when they took the pre-test). Therefore, the estimate of true retained learning would be $\hat{\theta} - \hat{\beta}$, and the estimate of true zero learning would be $1 - \hat{\theta} - \hat{\alpha}$ (for the reader’s convenience, we provide a reference of our notation in Figure 2). As the underlying learning values must sum to one, the bounds of the learning values depend on each other. Notably, $\alpha$ cannot exceed $1 - \theta$, as a $\theta$ proportion of students already know the question at the time of the pre-test (this bound holds true of the estimated values as well).
Figure 1.

| | Correct on post-test | Incorrect on post-test |
| --- | --- | --- |
| Correct on pre-test | Retained ($R$) | Negative ($N$) |
| Incorrect on pre-test | Positive ($P$) | Zero ($Z$) |

Mapping of responses to raw learning types as described by Walstad and Wagner (2016). This original disaggregation is used in all calculations in Smith and Wagner (2018) and this article. Note that each student falls into exactly one category and that $P + N + Z + R = 1$.
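A direct Python translation of Equations 1 to 3 (the function name is ours, for illustration):

```python
def adjusted_estimates(P, N, Z, g_hat):
    """Equations 1-3: adjust the unadjusted learning shares for guessing.

    P, N, and Z are the unadjusted positive, negative, and zero learning
    shares; g_hat is the assumed probability of guessing correct (< 1).
    Returns (alpha_hat, beta_hat, theta_hat).
    """
    if not 0 <= g_hat < 1:
        raise ValueError("probability of guessing correct must be in [0, 1)")
    alpha_hat = P / (1 - g_hat) - g_hat * Z / (1 - g_hat) ** 2  # Eq. (1)
    beta_hat = N / (1 - g_hat) - g_hat * Z / (1 - g_hat) ** 2   # Eq. (2)
    theta_hat = 1 - (P + Z) / (1 - g_hat)                       # Eq. (3)
    return alpha_hat, beta_hat, theta_hat
```

As a check, raw shares generated in expectation from true values $\alpha = 0.3$, $\beta = 0.05$, $\theta = 0.4$, and $g = 0.25$ (namely $P = 0.28125$, $N = 0.09375$, $Z = 0.16875$) are mapped back exactly to those true values.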
Figure 2.
This figure provides notation used throughout the article for the convenience of the reader: Estimates of an underlying parameter are always presented with a hat (e.g., $\hat{\alpha}$). If we are referring to the underlying parameter itself, it will be presented without a hat (e.g., $\alpha$). (A) Notation for learning types in the article: $P$, $N$, $Z$, and $R$ are the unadjusted positive, negative, zero, and retained learning types; $\alpha$, $\beta$, and $\theta$ are true positive learning, true negative learning, and incoming stock knowledge; and $G$ is the gain measurement. (B) Other notation used in the article: $g$ is the probability of guessing correct, $k$ is the number of question options, $n$ is the number of students, and $\epsilon$ is the probability misspecification error.
Assuming $\hat{g}$ is properly specified, with an infinite number of observations $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\theta}$ converge to the true underlying parameters $\alpha$, $\beta$, and $\theta$. However, there is no guarantee $\hat{g}$ is properly specified. In the authors’ experience, it seems that $g$ is often near $1/k$ on nationally normed exam questions or other questions where the students are unable to remove distractors or otherwise modify the probability of guessing correct beyond pure chance. However, not all exam questions meet this criterion; for brevity, we will refer to questions where the probability of guessing correct is substantively different from $1/k$ as “guessing-probability-deviating questions.”
A reasonable alternative to $\hat{g} = 1/k$ is to set the estimated true negative learning to zero and solve for the implied probability. Intuitively, true negative learning would occur when students forgot material they learned prior to the class/program and the class/program instruction did not resurrect that knowledge. Alternatively, it can occur when the instructor’s explanation of a concept is inaccurate or otherwise confuses a student who previously had a solid understanding of a concept. In practice, true negative learning appears to be rare.
In Smith and Wagner (2018), the estimated $\hat{\beta}$ values were very low when using the TUCE national-norming sample (Walstad et al., 2007). This suggests true negative learning is quite rare in the sample. In addition, as the authors have applied the method in Smith and Wagner (2018) to numerous classes for assessment purposes, $\hat{\beta}$ values are nearly always small and trend toward zero. Therefore, while assuming zero true negative learning is strictly an incorrect assumption, it might be a useful one when $\hat{g} = 1/k$ could be substantially incorrect due to guessing-probability-deviating questions. Furthermore, this assumption has been implicitly made by many authors working in STEM education over the last 20 years (see “The Gain Measurement” section).
Setting Equation 2 equal to zero and solving for $\hat{g}$ reveals the implied probability in Equation 4 (derivations of all new equations presented in this article are in Online Appendix C).
$\hat{g}_{\hat{\beta}=0} = \dfrac{N}{N + Z}$  (4)
Notably, when unadjusted zero learning equals zero $(Z = 0)$, the implied probability equals one (which cannot be true). Conceptually, the only way unadjusted zero learning could equal zero is if students can guess correct with 100% probability. Naturally, the accuracy of this estimate will vary greatly and will at times suggest unrealistic probabilities through randomness alone; for the estimates presented here to be correct, $\hat{g}_{\hat{\beta}=0}$ cannot equal one. This point will be made clear in the “Comparative Monte Carlo Simulations” section of this article.
Assuming adjusted negative learning equals zero $(\hat{\beta} = 0)$, we can substitute Equation 4 into Equation 1 to find our estimator under that assumption.
$\hat{\alpha}_{\hat{\beta}=0} = \dfrac{(P - N)(N + Z)}{Z}$  (5)
One can see the impact of $Z$ on Equation 5: The estimator is undefined when unadjusted zero learning equals zero, as the denominator is zero (and the implied probability of guessing correct, $\hat{g}_{\hat{\beta}=0}$, equals one). Naturally, the accuracy of this measure will depend on the accuracy of the zero-negative-learning assumption. Nonetheless, as we will show in the upcoming simulation, this measure can be preferable when the $\hat{g} = 1/k$ estimate is particularly unreasonable. This would occur when myopic students have a different-than-chance probability of guessing correctly (e.g., the ability to eliminate a distractor, or confusion caused by the wording of a question or distractor).
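The implied-probability route of Equations 4 and 5 is equally mechanical. A Python sketch (names ours) that refuses to produce an estimate when $Z = 0$:

```python
def no_negative_learning_estimates(P, N, Z):
    """Equations 4 and 5: assume true negative learning is zero, back out
    the implied guessing probability, and re-estimate positive learning.

    P, N, and Z are the unadjusted positive, negative, and zero shares.
    Returns (g_implied, alpha_hat).
    """
    if Z == 0:
        # Equation 4 would imply a guessing probability of one.
        raise ValueError("unadjusted zero learning is zero; estimator undefined")
    g_implied = N / (N + Z)            # Eq. (4)
    alpha_hat = (P - N) * (N + Z) / Z  # Eq. (5)
    return g_implied, alpha_hat
```

With raw shares generated in expectation from $\alpha = 0.3$, $\beta = 0$, $\theta = 0.4$, and $g = 0.2$ (so $P = 0.288$, $N = 0.048$, $Z = 0.192$), the function recovers both the true probability and the true positive learning.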
The Gain Measurement
As noted in the previous section, $\hat{\alpha}$ has an upper bound at $1 - \hat{\theta}$. Therefore, the maximum observable value is question and population dependent. This creates an interpretation disadvantage: learning measurements, even for the same question, cannot be compared across student populations if the incoming stock knowledge differs. In this section, we present a simple transformation that resolves this interpretation disadvantage. Furthermore, under some knowable conditions, this measure is less sensitive to probability misspecification. To be clear, this transformation is not always preferable to the $\hat{\alpha}$ estimator. However, given data, it is knowable whether the transformation is preferable.
While $\hat{\alpha}$ is the proportion of all students who learned the material, $\hat{G} = \hat{\alpha}/(1 - \hat{\theta})$ is the proportion of students who learned the material among those who could have possibly learned the material (i.e., those who did not already know it). This measure of positive learning has consistent bounds of $[0, 1]$ regardless of stock knowledge. Using Equations 1 and 3, the gain measurement is revealed to be the following equation:
$\hat{G} = \dfrac{\hat{\alpha}}{1 - \hat{\theta}} = \dfrac{P(1 - \hat{g}) - \hat{g}Z}{(1 - \hat{g})(P + Z)}$  (6)
When $1/k$ might be a particularly poor estimate of the probability of guessing correct, one may set adjusted negative learning to zero and solve for the implied probability. Substituting the probability implied by Equation 4 simplifies the gain estimator to Equation 7.
$\hat{G}_{\hat{\beta}=0} = \dfrac{P - N}{P + Z}$  (7)
Equation 7 is mathematically equivalent to $(\text{post} - \text{pre})/(1 - \text{pre})$, where pre and post are the proportions answering correctly on each test (note that $\text{post} - \text{pre} = P - N$ and $1 - \text{pre} = P + Z$). Hake (1998) refers to this measure as a “normalized learning gain”; this measure has been used in many STEM studies, including Colt et al. (2011), Hamne and Bernhard (2000), and Supasorn (2015). With this work, we show that this measure can be interpreted as the adjusted-for-guessing gain measure under the assumption of zero true negative learning.
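The following Python sketch (names ours) computes Equation 6 and Hake’s normalized gain; feeding the implied probability from Equation 4 into the former reproduces the latter, which is the equivalence claimed above:

```python
def gain(P, Z, g_hat):
    """Equation 6: the gain estimator, alpha_hat / (1 - theta_hat)."""
    return (P * (1 - g_hat) - g_hat * Z) / ((1 - g_hat) * (P + Z))

def hake_gain(pre_share, post_share):
    """Hake's (1998) normalized learning gain: (post - pre) / (1 - pre)."""
    return (post_share - pre_share) / (1 - pre_share)
```

For example, with $P = 0.288$, $N = 0.048$, $Z = 0.192$, and $R = 0.472$, the implied probability is $N/(N+Z) = 0.2$, and `gain(P, Z, 0.2)` equals both $(P-N)/(P+Z)$ and `hake_gain(N + R, P + R)` (all three are 0.5).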
Comparing the Adjusted Positive Learning and the Gain Estimators’ Sensitivity to Probability Misspecification
To determine the sensitivity of the estimators, we will use a common tool in the field of economics: elasticity. Likely, the most familiar elasticity is price elasticity of demand, which measures how sensitive the quantity demanded of a good or service is to changes in its price. Formally, it is the percent change in quantity divided by the percent change in price: $\varepsilon = \frac{\%\Delta Q}{\%\Delta P}$. This basic equation can be rearranged as follows: $\frac{\Delta Q}{\Delta P}$ is the change in quantity divided by the change in price; in the limit, it is simply the derivative of the quantity function with respect to price. Therefore, we can re-express the price elasticity of demand as $\varepsilon = \frac{dQ}{dP} \cdot \frac{P}{Q}$.
In this section, we will develop probability misspecification elasticities, that is, in percent terms, how responsive the estimate is to the probability of guessing correct being misspecified; mathematically, it is the percent change in the estimate divided by the percent change in the probability misspecification. All else equal, a “less elastic” estimator is desirable as it will be closer to the true learning value when the probability of guessing correct is misspecified.
To compare the sensitivity of the two estimators of interest ($\hat{\alpha}$ and $\hat{G}$—Equations 1 and 6), we will make two notational changes. First, we will describe the specified probability as $g + \epsilon$, where $g$ is the true probability of guessing correct and $\epsilon$ is the specification error of the success rate of guessing (which can be positive or negative). Second, to simplify the notation later in this section, we will describe the gain estimator as the function $G(\epsilon)$ (where the partial derivative with respect to $\epsilon$ would be described as $G_{\epsilon}$) and the adjusted positive learning estimator as the function $A(\epsilon)$ (where the partial derivative with respect to $\epsilon$ would be described as $A_{\epsilon}$). With these two notation changes, our estimators can be specified as follows:
$A(\epsilon) = \dfrac{P}{1 - (g + \epsilon)} - \dfrac{(g + \epsilon)Z}{(1 - (g + \epsilon))^{2}}, \qquad G(\epsilon) = \dfrac{P(1 - (g + \epsilon)) - (g + \epsilon)Z}{(1 - (g + \epsilon))(P + Z)}$  (8)
As the bounds of the estimators differ, we cannot compare the impact of $\epsilon$ on the estimators directly; we will calculate elasticities. Here, we will describe $E_G$ as the probability misspecification elasticity of the gain estimator. Similarly, $E_A$ is the probability misspecification elasticity of the $\hat{\alpha}$ estimator. These elasticities are presented in Equation 9:
$E_G = G_{\epsilon} \cdot \dfrac{\epsilon}{G(\epsilon)}, \qquad E_A = A_{\epsilon} \cdot \dfrac{\epsilon}{A(\epsilon)}$  (9)
Dividing the equations presented above results in the ratio of elasticities (Equation 10):
$\dfrac{E_G}{E_A} = \dfrac{G_{\epsilon}\,A(\epsilon)}{A_{\epsilon}\,G(\epsilon)} = \dfrac{Z}{(g + \epsilon)(P + Z) - (P - Z)}$  (10)
When this ratio is between $-1$ and $1$, the gain estimator is less sensitive to a probability misspecification. When the value is outside of that range, the original $\hat{\alpha}$ estimator is less sensitive to probability misspecification. Concrete examples of this behavior are demonstrated in Figure 3. Notably, while $g$ and $\epsilon$ are not individually known, $g + \epsilon$ is exactly the value specified by the practitioner or researcher. $E_G/E_A$ is therefore a knowable value for any given dataset, because $g$ and $\epsilon$ never occur in Equation 10 except as the summed value specified by the practitioner or researcher. If the practitioner or researcher is assuming $g + \epsilon = 1/k$, then this ratio simplifies to:
Figure 3.
By comparing $A(\epsilon)$ and $G(\epsilon)$, we demonstrate the sensitivity of the estimates to probability misspecification at example learning values; along the x-axis of both graphs is the amount that the estimated probability deviates from the true probability ($\epsilon$), and the true probability of guessing correct is held fixed in these graphs. In Panel 3(A), $E_G/E_A$ (when $\epsilon = 0$) falls between $-1$ and $1$ with the specified learning values below the figure. This is the range where the gain estimator is less sensitive to probability misspecification. This can be seen by the steeper slope of $A(\epsilon)$ (solid) in comparison with $G(\epsilon)$ (dashed), indicating the misspecified $\hat{\alpha}$ estimator deviates away from the true value at a higher rate than the misspecified gain estimator. In Panel 3(B), $E_G/E_A$ (when $\epsilon = 0$) falls outside that range with the specified learning values below the figure. In this range, the $\hat{\alpha}$ estimator is less sensitive to probability misspecification. This can be seen by the steeper slope of $G(\epsilon)$ in comparison with $A(\epsilon)$. (A) The gain estimator is less sensitive. (B) The $\hat{\alpha}$ estimator is less sensitive.
$\dfrac{E_G}{E_A} = \dfrac{kZ}{(k + 1)Z - (k - 1)P}$  (11)
Whether Equation 11 is between $-1$ and $1$ depends on the relative size of $Z$ and $P$. On the positive side of the range, if $Z$ is at least $(k - 1)$ times as large as $P$ (e.g., at least twice as large for a three-option question), then the gain measurement will be less sensitive to probability misspecification. Regardless, this ratio, like Equation 10 in general, can be calculated by the practitioner or researcher given data.
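Because Equation 10 depends only on $Z$, $P$, and the specified probability $g + \epsilon$, the diagnostic is computable from any dataset. A Python sketch (names ours):

```python
def elasticity_ratio(P, Z, g_specified):
    """Equation 10: E_G / E_A, where g_specified is the practitioner's
    assumed probability of guessing correct (the sum g + epsilon)."""
    return Z / (g_specified * (P + Z) - (P - Z))

def gain_less_sensitive(P, Z, g_specified):
    """True when the gain estimator is the less sensitive choice."""
    return -1 < elasticity_ratio(P, Z, g_specified) < 1
```

Setting `g_specified = 1 / k` reproduces Equation 11, so the same two functions cover both the general and the $1/k$ special case.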
A Statistical Test of the Gain Measurement Using a Counterfactual Monte Carlo Simulation
Following an identical methodology to Smith and Wagner (2018), we provide a statistical test for the gain measurement by simulating a counterfactual distribution when there is no true positive or negative learning occurring. Smith and Wagner (2018) created a counterfactual distribution of no learning for the positive learning estimator $\hat{\alpha}$. However, as the gain estimator suggested in this article did not yet exist, it was not included in their critical value tables.
This Monte Carlo simulation replicates student responses to a single question asked on both a pre- and post-test. First, $n$ students are randomly pulled from a population with stock knowledge $\theta$, where $n$ is the number of students in the class. After this random pull, each student either knows or does not know the answer to the question (with the proportion of the population knowing the answer equal to $\theta$). The students who know the answer simply answer the question correctly on both the pre- and post-test. The students who do not know the answer guess on both the pre- and post-test with probability $1/k$ of guessing correct. Each student’s guess on the post-test is independent of their guess on the pre-test, given that the student does not know the correct answer to the question.
The unadjusted learning types are then calculated and adjusted to produce estimates for positive learning, negative learning, and stock knowledge ($\hat{\alpha}$, $\hat{\beta}$, and $\hat{\theta}$, respectively). We extend the original simulation by calculating $\hat{G}$ for each simulated class. This process is repeated 10,000 times. The resulting estimated gain measurements are then ordered from lowest to highest, and the 90% and 95% critical values are extracted from the simulated empirical distribution. Similarly, the 95% confidence interval for stock knowledge ($\hat{\theta}$) is extracted.
To use these Monte Carlo simulations, the practitioner or researcher should find the set of rows associated with the number of students in their class $(n)$. Then, using their calculated $\hat{\theta}$ value, they should find the set of simulated distributions where their $\hat{\theta}$ value falls in the confidence interval, using the Online Appendix B table that matches their number of question options. If their calculated gain measurement exceeds the greatest critical value of the relevant set of rows, then the practitioner or researcher can say that their value likely did not occur from randomness alone when there is in fact no learning. Put more simply, the value is statistically different from randomness. We present these tables in Online Appendix B (Tables B1 and B2, one for each value of $k$ considered; a larger set of tables can be found at https://goo.gl/kKAySX and the simulation code is available at https://goo.gl/zOqlDx). For each row in the tables, $\theta$ (stock knowledge), $n$ (number of students), and $k$ (question options) are specified; all other values are derived as part of the simulation. For a more detailed description of these Monte Carlo simulations, see the “Monte Carlo Simulation of the Counterfactual” section of Smith and Wagner (2018).
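For readers who prefer to simulate their own critical values rather than interpolate from the tables, the counterfactual described above can be sketched in Python as follows (parameter names and defaults are ours; the published simulation code at the links above is the authoritative version):

```python
import random

def null_gain_critical_values(theta, n, k, reps=10_000, seed=7):
    """Simulate the no-learning counterfactual for one question.

    Students who know the answer (share theta) answer correctly on both
    tests; the rest guess independently with probability 1/k each time.
    Returns the 90% and 95% critical values of the gain estimator's
    null distribution.
    """
    rng = random.Random(seed)
    g = 1.0 / k
    gains = []
    for _ in range(reps):
        P = N = Z = 0
        for _ in range(n):
            if rng.random() < theta:
                continue  # knows the answer: retained on both tests
            pre = rng.random() < g   # guess on the pre-test
            post = rng.random() < g  # independent guess on the post-test
            if not pre and post:
                P += 1
            elif pre and not post:
                N += 1
            elif not pre and not post:
                Z += 1
        P, N, Z = P / n, N / n, Z / n
        if P + Z == 0:
            continue  # gain undefined for this class; skip it
        gains.append((P * (1 - g) - g * Z) / ((1 - g) * (P + Z)))  # Eq. (6)
    gains.sort()
    return gains[int(0.90 * len(gains))], gains[int(0.95 * len(gains))]
```

A calculated gain measurement above the returned 95% critical value is unlikely to have arisen from guessing randomness alone under no learning.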
Comparative Monte Carlo Simulations
In this section, we perform Monte Carlo simulations to compare the accuracy of assuming the probability of guessing correct equals $1/k$ against the accuracy of assuming that true negative learning equals zero $(\hat{\beta} = 0)$. It is not surprising that setting the probability correctly (by either method) would result in the superior estimate. Therefore, we present the results based on when these assumptions are violated in Online Appendix E. In the case of the $\hat{g} = 1/k$ estimator, we assume the practitioner or researcher is setting the probability to $1/k$ (results are generated for two values of $k$), but the true probability is simulated over a grid of values. Similarly, with the true-negative-learning-equals-zero estimator, we assume the practitioner or researcher is setting $\hat{\beta} = 0$, but the true negative learning ranges over a grid of small positive values. These results have been simulated with all combinations of the remaining seeded learning parameters. We have generated results using three class sizes. In total, we have generated 15,120 distributions: 7,560 assuming $\hat{g} = 1/k$ and 7,560 assuming $\hat{\beta} = 0$. The complete set of results is available here: https://goo.gl/2f8NSa. The chosen parameters are motivated by the maximum and minimum values observed by the authors in the TUCE dataset as well as data collected through departmental assessment procedures. If the reader wishes to simulate a combination of parameters not provided in the article, our simulation code is available here: https://goo.gl/qjCCUj. The Monte Carlo simulation works as follows:
For a given underlying $\theta$ (true stock knowledge), each simulated student’s stock ability on an exam of equally difficult questions is drawn from a Binomial distribution with a success probability set such that the population mean equals the seeded value of $\theta$. From this procedure, each student knows (randomly) some set of questions on the pre-test.
For a given underlying $\alpha$ (true positive learning), each student randomly “learns” some set of the questions they did not know during the pre-test. This value is determined by a draw from a Binomial distribution such that the population mean equals the seeded value of $\alpha$.
For a given underlying $\beta$ (true negative learning), each student randomly “forgets” some set of the questions they knew at the time of the pre-test. This value is determined by a draw from a Binomial distribution such that the population mean equals the seeded value of $\beta$.
Based on the random draws above, each student knows or does not know the answer to each question on each of the exams. For the questions they know, they answer correctly. On the questions they do not know, they guess with the seeded probability of guessing correct. From the resulting simulated exam responses, the raw learning types can be calculated.
From the steps above, the pre- and post-test responses are simulated such that the unadjusted learning types can be calculated. From these, true positive learning and the gain measurement are estimated assuming $\hat{g} = 1/k$ (regardless of the true probability of guessing correct used in the simulation; simulations have been conducted for both values of $k$). Similarly, true positive learning and the gain measurement are calculated under the assumption that true negative learning equals zero (regardless of the actual seeded value of $\beta$).
From the steps above, we have simulated estimates of the true positive learning and the gain measurement as calculated by a practitioner or researcher. However, we also know the true learning values for the simulated class ($\alpha$ and $G$). The simulation calculates the absolute deviation of $\hat{\alpha}$ from the true value of $\alpha$ for the class when assuming $\hat{g} = 1/k$ and when assuming true negative learning equals zero $(\hat{\beta} = 0)$. Similarly, the absolute deviation from the true gain measurement is calculated when assuming $\hat{g} = 1/k$ and when assuming true negative learning equals zero.
This process is repeated for 10,000 classes, and the average of each of the four measures of absolute deviation is reported for a given set of seeded values. If unadjusted zero learning equals zero $(Z = 0)$ in one of the 10,000 simulated classes (which can occur through randomness alone), then the absolute deviation from truth for the $\hat{\beta} = 0$ estimator is infinite. Therefore, assuming true negative learning is equal to zero is not feasible for that specification; this is indicated with a blank cell in the tables.
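The per-class portion of the procedure above can be sketched in Python as follows. For clarity, we draw each question as an independent Bernoulli trial, which pools to the same Binomial behavior described above; names and default values are ours, and the published simulation code is the authoritative version:

```python
import random

def simulate_class(theta, alpha, beta, g_true, n_students, n_questions, seed=0):
    """Simulate one class and return the unadjusted learning shares
    (P, N, Z, R) pooled over all students and questions.

    Assumes 0 < theta < 1 so the conditional probabilities below exist.
    """
    rng = random.Random(seed)
    p_learn = alpha / (1 - theta)  # chance an unknown question is learned
    p_forget = beta / theta        # chance a known question is forgotten
    P = N = Z = R = 0
    for _ in range(n_students):
        for _ in range(n_questions):
            knows_pre = rng.random() < theta
            if knows_pre:
                knows_post = rng.random() >= p_forget
            else:
                knows_post = rng.random() < p_learn
            # Known questions are answered correctly; unknown ones are guessed.
            pre = knows_pre or rng.random() < g_true
            post = knows_post or rng.random() < g_true
            P += (not pre) and post
            N += pre and (not post)
            Z += (not pre) and (not post)
            R += pre and post
    total = n_students * n_questions
    return P / total, N / total, Z / total, R / total
```

Feeding these shares to the estimators of the earlier sections, under each of the two assumptions, and comparing against the seeded $\alpha$ and $G$ yields the four absolute deviations tabulated in Online Appendix E.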
The probabilities used to characterize the Binomial distributions are not specifically defined above because they vary with the specified values of true positive learning, true negative learning, and retained learning. That is, for a given population value of true positive learning to be achieved, the specified probability that characterizes the Binomial distribution must increase with retained learning, as a smaller proportion of the population has the possibility to learn. Similarly, for a given value of true negative learning, the specified probability that characterizes the Binomial distribution decreases with retained learning, as there is a larger group of students who could forget.
Our rationale for the use of the Binomial distribution in the procedure above is that each question on the simulated exam is mapped to a unique bit of knowledge that could be known, learned, or forgotten. If the bits of knowledge are truly independent (and no two questions are mapped to the same bit of knowledge), then the number of questions in each category can be thought of as a draw from a Binomial distribution. By necessity, the distributional assumption is strong. Without assuming a specific distribution (with relatively few parameters), the Monte Carlo simulations are not feasible given that we are simulating all combinations of the input parameters; a targeted sensitivity check of this assumption is presented in Online Appendix D.
In the educational context, the Binomial assumption indicates that we are simulating an exam with equally difficult questions where each question is entirely independent. In many cases, these are unreasonable assumptions. However, many instructors attempt to write exams that contain independent questions; often in the context of program/course assessment for accreditation, the questions are required to be independent. The sensitivity check in Online Appendix D is an attempt to partially address this weakness. In that set of simulations, we structured the process of learning and forgetting to be highly correlated with student ability (more details in the online appendix). The simulations in Online Appendix D show that the simulations presented here are at least somewhat robust to changes in the correlation structure. If the reader wishes to perform a different targeted simulation using a different distribution or correlation structure, they can do so as the pulls from the Binomial distribution occur in three places in the code (which we have made available).
This distributional assumption underscores the intended use of these Monte Carlo simulations. Unlike the simulations provided in the previous section, which can be used more generally, our goal here is to provide guidance to the practitioner or researcher. For our simulations to be feasible, our assumptions are necessarily strong; without strong assumptions, we would not be able to provide guidance to the practitioner at all. Therefore, the values in the table should only be compared with each other, and one should focus on trends instead of any single simulated distribution. We will describe these trends as observations.
In addition to the observations included in this article, we use the full set of simulation results (https://goo.gl/2f8NSa) to generate Figures A1 and A2 in Online Appendix A. These recursive decision trees (Breiman et al., 1984) find the splits in the data that minimize error. In this context, the data (all of the simulations) are recursively split by the parameter (and split value, for continuous variables) that best predicts whether the practitioner should use the 1/n or zero-true-negative-learning probability strategy. The subnodes are then split using the same logic (and so on). Therefore, these recursive decision trees represent the most important breaks in the simulated data and can assist the practitioner in determining the best probability estimation strategy based on the assumptions they are willing to make about their dataset. These generated decision trees are in concordance with the general observations made in the following section but provide a concrete procedure for the practitioner to follow.
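The splitting logic behind these trees can be illustrated with a minimal single-split sketch of the greedy search used by CART-style algorithms. This is a toy version for intuition, not the code used to fit Figures A1 and A2:

```python
import numpy as np

def best_split(X, y):
    """Greedy CART-style search: return the (feature, threshold, error)
    split that minimizes total misclassification when each side of the
    split predicts its majority class. y holds 0/1 strategy labels."""
    best_feature, best_threshold, best_err = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # degenerate split, skip
            # Misclassifications if each side predicts its majority class.
            err = (min(left.sum(), len(left) - left.sum())
                   + min(right.sum(), len(right) - right.sum()))
            if err < best_err:
                best_feature, best_threshold, best_err = j, t, err
    return best_feature, best_threshold, best_err
```

Applied recursively to each resulting subnode, this is the procedure that produces such decision trees. For example, with a single simulated parameter X = [[.00], [.05], [.10], [.15]] and labels y = [1, 1, 0, 0] (1 meaning “use 1/n”), the best split falls at .05.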
Results of the Monte Carlo Simulations
In Online Appendix E, we provide Monte Carlo simulation results for the procedure described above. For each specification, the mean absolute deviation from the true value is provided for the two proposed methods of setting the probability of guessing correct. Furthermore, to visually show any trend, a preference column has the value “T” whenever the 1/n method of specifying the probability of guessing correct results in a lower mean absolute deviation from the true value. In total, we have four tables of results (https://goo.gl/2f8NSa), one for each combination of learning measure (true positive learning or the gain measurement) and believed probability of guessing correct. The true probability of guessing correct is provided in a column in each table.
Observation 1: Under the specified data generating process in our Monte Carlo simulations, when the specified probability of guessing correct using 1/n deviates from the true probability by .05 or less, it is more often preferable to use the 1/n probability estimate.
The mean absolute deviation is smaller using the 1/n estimator in 78% of the 720 simulated distributions of true positive learning in which the true probability deviates by .05 or less from 1/n. This is particularly true as the specified probability approaches the truth: the 1/n estimator produces a smaller mean absolute deviation in 82% of cases when the deviation equals .05 and in 99% of cases when the specified probability is exactly correct. The smaller mean absolute deviation can be due to either a less biased estimator or a narrower distribution; when the probability of guessing correct is overestimated, the distribution produced by assuming true negative learning equals zero is substantially wider than when assuming the probability equals 1/n (see Figure 1 for an example). Put simply, due to the increased width of the other distribution, the 1/n estimator might be a superior option even when the center of the other distribution is closer to the true value.
A similar story can be told with the gain estimator: in 65% of the simulated distributions in which the 1/n probability estimate deviates from truth by .05 or less, the mean absolute deviation is smaller using the 1/n probability estimate. As with the true positive learning estimator, these percentages increase as the specified probability approaches truth: the 1/n estimator is the preferred strategy in 75% of the simulated distributions with a .05 deviation and in 99% of the distributions with no deviation.
Observation 2: Under the specified data generating process in our Monte Carlo simulations, when the true value of true negative learning is less than or equal to .03 and the true probability deviates by .10 from the estimated 1/n probability, it is preferable to assume zero true negative learning when estimating the probability of guessing correct. This is particularly true if the true positive learning and retained learning values are low.
When the 1/n probability estimate is incorrect by .10, assuming true negative learning is equal to zero often results in a smaller mean deviation from truth. In the case of the true positive learning estimator, the zero-true-negative-learning strategy results in a lower mean deviation from truth in 50% of the simulated distributions. However, when only looking at the simulated distributions where true negative learning is less than or equal to .03, setting true negative learning to zero is a preferable strategy 60% of the time.
This conclusion is stronger with the gain measurement: in only 26% of cases when the 1/n probability estimate deviates from truth by .10 is the 1/n estimator the preferable strategy. Moreover, when true negative learning is less than or equal to .03, setting true negative learning to zero results in a lower mean absolute deviation 77% of the time; this increases to 99% when true negative learning is exactly zero.
Furthermore, if true positive learning and retained learning are low, then setting true negative learning to zero might result in a lower mean absolute deviation. As an example, examining only the rows of the online table where true positive learning is high and the true probability deviates by .10 from the estimated probability, in 78% of cases using the 1/n estimator would be preferable to setting true negative learning to zero. However, when only looking at the rows where true positive learning is low (all else equal), in only 15% of cases is the 1/n estimator the preferable strategy. A similar story can be told with retained learning.
Observation 3: Under the specified data generating process in our Monte Carlo simulations, calculating the probability of guessing correct under the assumption of zero true negative learning is more often the best strategy when using the gain measurement than when using true positive learning as the measurement of learning.
As noted earlier, the mean absolute deviation is smaller using the 1/n estimator in 78% of the 720 simulated distributions of true positive learning in which the true probability deviates by .05 or less from 1/n. However, under the same specifications with the gain metric, the 1/n strategy is only preferable in 65% of cases. Moreover, when the probability specified by 1/n is off by .10, true positive learning can be estimated using either method with roughly the same chance of the 1/n strategy having the lower mean absolute deviation; in the case of the gain metric under the same specification, however, the 1/n probability strategy is only preferable 26% of the time. Using the same underlying parameters as Figure 4, Figure 5 compares the width of the gain distributions when assuming true negative learning equals zero and when assuming the probability equals 1/n. The difference in distribution width between the two estimation strategies decreases in comparison with the distributions shown in Figure 4: in both panels of Figure 5, the standard deviations of the two estimator distributions are closer to each other than in Figure 4.
Figure 4.
Kernel density estimates (KDE; for details, see Scott, 2015) of the estimator distributions for a fixed set of underlying learning parameters. Both simulations are of class sizes of 50 students where the practitioner or researcher assumes the probability of guessing correct is equal to 1/n. The true probability of guessing correct differs between panels (A) and (B); this is the only difference in the simulation parameters that generated the figures. Optimal bandwidth is calculated using Silverman’s (1986) method. The mean estimate is plotted as a vertical line.
Figure 5.
Kernel density estimates (KDE; for details, see Scott, 2015) of the gain estimator distributions using the same underlying learning parameters as Figure 4. Both simulations are of class sizes of 50 students where the practitioner or researcher assumes the probability of guessing correct is equal to 1/n. The true probability of guessing correct differs between panels (A) and (B); this is the only difference in the simulation parameters that generated the figures. Optimal bandwidth is calculated using Silverman’s (1986) method. The mean estimate is plotted as a vertical line.
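The density plots described in these captions can be approximated with a generic Gaussian-kernel estimator using Silverman’s rule-of-thumb bandwidth. This is a sketch of the method (Scott, 2015; Silverman, 1986), not the code that generated the figures:

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's (1986) rule-of-thumb bandwidth for a Gaussian kernel,
    using the robust min(sd, IQR/1.34) spread estimate."""
    n = len(x)
    sd = np.std(x, ddof=1)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    return 0.9 * min(sd, iqr / 1.34) * n ** (-1 / 5)

def gaussian_kde(x, grid, h=None):
    """Evaluate a Gaussian-kernel density estimate of the sample x
    at each point of grid."""
    h = silverman_bandwidth(x) if h is None else h
    u = (grid[:, None] - x[None, :]) / h  # standardized distances
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))
```

Feeding the simulated estimator values for one specification into `gaussian_kde` and plotting the result over a fine grid reproduces the style of Figures 4 and 5.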
Practitioner’s Guide
In this section, we describe how to make actionable decisions based on the information in the study. This section is simplified to show only the relevant conclusions and tools; how these conclusions and tools were derived is left to other sections of the article.
Assuming the practitioner wishes to use disaggregated value-added learning to estimate the amount of learning in their classroom, study, or assessment procedure, they are faced with two fundamental questions: (a) Should they measure their results in terms of positive learning adjusted for guessing or the gain estimator introduced in this article? and (b) Should they estimate the probability of guessing correct as 1/n or assume true negative learning equals zero?
These choices could be influenced by many factors. For instance, the practitioner could have a long-standing assessment process calculating a particular measure for a given course. In that situation, the ability to compare across years might override any statistical factors. Similarly, a practitioner facing populations with very different stock levels of knowledge might prefer to use the gain estimator for increased interpretability. However, for purposes of this section, we will assume that the practitioner wishes to maximize the accuracy of their estimate. This set of choices can be visualized in Table 1.
Table 1.
Decision for the Practitioner Using the Tools Presented in This Article.
| | Estimate the probability as 1/n | Assume true negative learning = 0 |
| Adjusted positive learning | Elasticity ratio > 1 & 1/n approach has the lower average absolute error | Elasticity ratio > 1 & zero-negative-learning approach has the lower average absolute error |
| Gain estimator | Elasticity ratio < 1 & 1/n approach has the lower average absolute error | Elasticity ratio < 1 & zero-negative-learning approach has the lower average absolute error |
Note. The elasticity ratio is the ratio of the two estimators’ elasticities (or sensitivities) to probability misspecification, and the error is the average absolute error of a given estimation approach. To determine which cell is the most appropriate, the practitioner must calculate the elasticity ratio, defined in the “Comparing the Adjusted Positive Learning and the Gain Estimators’ Sensitivity to Probability Misspecification” section, and follow the decision trees presented in Online Appendix A. We will walk through each of these steps.
The first step is to calculate the unadjusted positive, negative, zero, and retained learning values. This disaggregation can be performed by hand or by using software that performs it automatically (Smith, 2018). Using these values, the practitioner can calculate both estimators under each probability assumption: once assuming the probability of guessing correct equals 1/n and once assuming true negative learning equals zero.
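As a sketch of this step, the helper functions below compute the two quantities from the raw learning types. The formulas are our reconstruction from properties stated in this article (for instance, the implied probability equals one when raw zero learning equals zero); the authoritative estimators appear in Smith and Wagner (2018):

```python
def implied_guess_prob(nl_raw, zl_raw):
    """Probability of guessing correct implied by assuming zero true
    negative learning. Reconstruction: among students who never knew
    the material, right-then-wrong (NL) occurs with probability
    g(1 - g) and wrong-then-wrong (ZL) with probability (1 - g)**2,
    so NL / (NL + ZL) = g. Equals 1 when ZL = 0, as noted in the text."""
    return nl_raw / (nl_raw + zl_raw)

def true_positive_learning(pl_raw, zl_raw, g):
    """Raw positive learning minus the share expected from lucky
    post-test guesses: wrong-then-right guessing occurs with
    probability (1 - g)g versus (1 - g)**2 for wrong-then-wrong,
    so the guessing-induced share is ZL * g / (1 - g)."""
    return pl_raw - zl_raw * g / (1 - g)
```

For example, raw values NL = .05 and ZL = .15 imply a guessing probability of .25; with raw PL = .30, the adjusted true positive learning estimate would be .30 − .15(.25/.75) = .25.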
The elasticity ratio represents the ratio of the two estimators’ probability misspecification elasticities. As the gain estimator is in the numerator of this fraction, a ratio less than one indicates that the gain estimator is less sensitive (in percentage terms) to misspecification. The two calculated ratios are likely close to each other (both above one or both below); if they are not, the practitioner might need to perform the steps that follow twice. With the estimator selected, the practitioner can move to selecting either 1/n or the zero-true-negative-learning assumption to estimate the probability of guessing correct.
The goal of the practitioner is to choose a strategy that maximizes the accuracy of the chosen estimator: in essence, which strategy would result in less absolute error. However, this absolute error is not known without assumptions. To provide the practitioner at least some guidance on this choice, a large set of Monte Carlo simulations have been conducted under a given set of assumptions, not least of which is the independence of the questions (details of these assumptions are provided in the “Comparative Monte Carlo Simulations” section). These results were then fit to decision trees that represent the collective wisdom of all of these simulations in one diagram.
The practitioner should start with what they know about the exam questions. If the questions have been carefully designed to give a myopic student a 1/n chance of guessing correct, then the probability of guessing correct is likely to converge on 1/n. If that was not a central tenet of question construction, however, this might not be the case. Knowledge of the question guessing probability is the first step to putting the trees in Online Appendix A into action. As an example, suppose that the practitioner has chosen to use the gain estimator and believes that their four-option multiple-choice exam was designed to give students a .25 chance of guessing correct. Following Online Appendix Figure A2(b), the first yes/no choice asks whether the true probability deviates substantially from 1/n. If the practitioner believes that the questions are designed to give the student only a .25 chance of guessing correct, they likely do not believe that there is such a large deviation from the true probability. As noted in the round node, without further questions, the diagram suggests the practitioner use the 1/n probability estimator. However, if they believe they know how much negative learning occurs in their sample, they could further refine their choice. In this case, the tree asks whether true negative learning falls below a very small threshold. The practitioner likely does not believe that is reasonable, answers “no,” and finds the final node (continuing to suggest the 1/n estimator).
The final step is to determine whether the observed learning value is statistically different from no learning. To give a concrete example, if a practitioner or researcher observed a value of 0.6 on the statistic that indexes the table, had a class size of 30, estimated the gain measurement to be 0.4, and was exploring a question with four answer options, then the final five rows of Online Appendix Table B1 corresponding to class sizes of 30 might apply. Because the true value may lie between 0.4 and 0.8, the practitioner or researcher would use the more conservative critical values (at either the 90% or 95% level) to test whether or not the gain was statistically different from 0. With a gain of 0.4, the practitioner or researcher would fail to reject the null hypothesis of no learning at the 95% level but reject the null at the 90% level.
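Critical values like those in Online Appendix Table B1 can be approximated by simulating the learning measure’s distribution under a null of no learning. The sketch below uses the simple post-minus-pre score difference as a stand-in for the article’s gain estimator; the parameter names and default values are illustrative, not the article’s.

```python
import numpy as np

def null_critical_values(class_size=30, n_questions=30, p_know=0.6,
                         p_guess=0.25, reps=2000, seed=1):
    """Simulate the score-gain distribution when no learning occurs
    (knowledge is identical at pre and post; only guesses differ) and
    return the one-sided 90% and 95% critical values."""
    rng = np.random.default_rng(seed)
    gains = np.empty(reps)
    for r in range(reps):
        know = rng.random((class_size, n_questions)) < p_know
        # Under the null, knowledge is unchanged; only guesses re-draw.
        pre = know | (rng.random(know.shape) < p_guess)
        post = know | (rng.random(know.shape) < p_guess)
        gains[r] = post.mean() - pre.mean()
    return np.quantile(gains, [0.90, 0.95])
```

An observed gain exceeding the simulated 95% critical value would lead the practitioner to reject the null hypothesis of no learning at that level.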
Conclusion
This article extends the work of Walstad and Wagner (2016) and Smith and Wagner (2018) by (a) providing a transformed measure of learning that under knowable conditions is less sensitive to probability misspecification, (b) providing a statistical test for the new transformed measure, (c) suggesting an alternative method of determining the probability of guessing correct, and (d) simulating the distribution of outcomes when the probability of guessing correct is assumed to be 1/n (where n is the number of answer options) and when assuming true negative learning is zero and solving for the probability that this implies. These Monte Carlo simulations show when one probability specification results in a smaller mean absolute deviation from the true values (which are known in the simulations). As our assumptions are very strong, we only make general observations about our simulation results.
One reasonable conclusion one could draw from the Monte Carlo simulations is a reiteration of the importance of exam questions with a known probability of guessing correct. When the specified probability of guessing correct using 1/n deviated by .05 or less in either direction from the true value, the simulations indicated it was almost always better to use the 1/n probability estimator. Arguably, exam question options that result in more than a .05 probability deviation from random guessing are an indication of distractors that are not serving their intended purpose. This suggests that carefully written question distractors are not just important for discriminating between high- and low-ability students but are in fact critical for this type of analysis of the data. Widely used exam questions that have been tested, even if a norming sample is not available, might result in a higher quality analysis.
Furthermore, our simulations suggested that the estimators assuming zero true negative learning performed best when true positive and retained learning were comparatively low. This result was hinted at in the “A Brief Description of the Estimators in Smith and Wagner (2018)” section, where, if raw zero learning equaled zero, the implied probability of guessing the correct answer equaled one when assuming true negative learning equals zero. This suggests that datasets with very low zero learning would result in impractical implied probabilities of guessing the correct answer on a question.
While the transformed measure, supplemented by the Monte Carlo simulations, provides the practitioner or researcher guidance, it does not replace judgment. Only the practitioner or researcher can decide what assumptions they are willing to make for their particular dataset. However, this article provides them with tools to make more informed decisions when it comes time to assess learning in a collection of classes or in their own class, or to test the effectiveness of a pedagogical technique; the “Practitioner’s Guide” section can help practitioners make this decision.
Supplemental Material
Supplemental material, sj-pdf-1-apm-10.1177_01466216211013905 for On Guessing: An Alternative Adjusted Positive Learning Estimator and Comparing Probability Misspecification With Monte Carlo Simulations by Ben O. Smith and Dustin R. White in Applied Psychological Measurement
Acknowledgments
The authors thank Michael O’Hara, Ignacio Sarmiento-Barbieri, and Brandon Sheridan for their helpful comments on earlier drafts of this article. They also thank John Donoghue, Brian Habing, and the anonymous referees for their helpful comments during the peer review process.
Throughout this article, we will often use the term “practitioner” to describe an instructor or department interested in measuring learning in one or more classes.
See, for example, disaggregations like those provided in Figure 1 of Swets (1969).
Notably, Orlando and Thissen (2003) show questions of non-monotonic form can be problematic when estimating the monotonic Three-Parameter Logistic (3PL).
In item response theory, questions are labeled items. We will proceed to call them questions to maintain the clarity of the article.
In Smith and Wagner (2018), the authors describe the probability of guessing correct as . As a result, their estimators are filled with ’s instead of ’s. For notational convenience in the next section, we describe the estimators in terms of . The estimators are mathematically identical.
Unadjusted negative learning is a disaggregation of performance that can be generated through multiple mechanisms, including guessing, forgetting, and actively learning incorrect information to replace correct information. True negative learning removes guessing in expectation but still measures negative learning without regard to how it was generated. In practice, people often refer to negative learning as “forgetting” even though that is not completely accurate.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Ben O. Smith, https://orcid.org/0000-0003-1286-0852
Supplemental Material: Supplementary material is available for this article online.
References
- Birnbaum A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord F., Novick M. R. (Eds.), Statistical theories of mental test scores (pp. 397–479). Addison-Wesley.
- Breiman L., Friedman J., Stone C., Olshen R. (1984). Classification and regression trees. Taylor & Francis.
- Colt H. G., Davoudi M., Murgu S., Rohani N. Z. (2011). Measuring learning gain during a one-day introductory bronchoscopy course. Surgical Endoscopy, 25(1), 207–216.
- De Ayala R. J. (2013). The theory and practice of item response theory. Guilford Press.
- Emerson T. L. N., English L. K. (2016). Classroom experiments: Teaching specific topics or promoting the economic way of thinking? The Journal of Economic Education, 47(4), 288–299.
- Gönülateş E., Kortemeyer G. (2017). Modeling unproductive behavior in online homework in terms of latent student traits: An approach based on item response theory. Journal of Science Education and Technology, 26(2), 139–150.
- Hake R. R. (1998). Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses. American Journal of Physics, 66(1), 64–74.
- Hamne P., Bernhard J. (2000). Educating pre-service teachers using hands-on and microcomputer based labs as tools for concept substitution. In Pinto R., Surinach S. (Eds.), Physics teacher education beyond (pp. 663–666). Elsevier.
- Han K. T. (2012). Fixing the c parameter in the three-parameter logistic model. Practical Assessment, Research & Evaluation, 17(1), 1–24.
- Happ R., Zlatkin-Troitschanskaia O., Schmidt S. (2016). An analysis of economic learning among undergraduates in introductory economics courses in Germany. The Journal of Economic Education, 47(4), 300–310.
- Maris E. (1999). Estimating multiple classification latent class models. Psychometrika, 64(2), 187–212.
- Orlando M., Thissen D. (2003). Further investigation of the performance of s-x2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289–298.
- Scott D. W. (2015). Multivariate density estimation: Theory, practice, and visualization. John Wiley.
- Siegfried J. J., Fels R. (1979). Research on teaching college economics: A survey. Journal of Economic Literature, 17(3), 923–969.
- Silverman B. W. (1986). Density estimation for statistics and data analysis (Vol. 26). Chapman & Hall.
- Smith B. O. (2018). Multiplatform software tool to disaggregate and adjust value-added learning scores. The Journal of Economic Education, 49(2), 220–221.
- Smith B. O., Wagner J. (2018). Adjusting for guessing and applying a statistical test to the disaggregation of value-added learning scores. The Journal of Economic Education, 49(4), 307–323.
- Supasorn S. (2015). Grade 12 students’ conceptual understanding and mental models of galvanic cells before and after learning by using small-scale experiments in conjunction with a model kit. Chemistry Education Research and Practice, 16(2), 393–407.
- Swets J. A. (1969). Effectiveness of information retrieval methods. American Documentation, 20(1), 72–89.
- Walstad W. B., Wagner J. (2016). The disaggregation of value-added test scores to assess learning outcomes in economics courses. The Journal of Economic Education, 47(2), 121–131.
- Walstad W. B., Watts M., Rebeck K. (2007). Test of understanding of college economics: Examiner’s manual (4th ed.). Council for Economic Education.
- Wise S. L., DeMars C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(1), 1–17.
- Wise S. L., DeMars C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19–38.
- Zimmerman D. W., Williams R. H. (2003). A new look at the influence of guessing on the reliability of multiple-choice tests. Applied Psychological Measurement, 27(5), 357–371.