Educational and Psychological Measurement. 2019 May 16;79(6):1064–1074. doi: 10.1177/0013164419846234

A Graphical Method for Displaying the Model Fit of Item Response Theory Trace Lines

Steven T. Kalinowski
PMCID: PMC6777066  PMID: 31619840

Abstract

Item response theory (IRT) is a statistical paradigm for developing educational tests and assessing students. IRT, however, currently lacks an established graphical method for examining model fit for the three-parameter logistic model, the most flexible and popular IRT model in educational testing. A method is presented here to do this. The graph, referred to herein as a “bin plot,” is the IRT equivalent of a scatterplot for linear regression. Bin plots display a conventional IRT trace line (with ability on the horizontal axis and probability correct on the vertical axis). Students are binned according to how well they performed on the entire test, and the proportion of students in each bin who answered the focal question correctly is displayed on the graph as a point above or below the trace line. With this arrangement, the difference between each point and the trace line is the residual for the bin. Confidence intervals can be added to the observed proportions to display uncertainty. Computer simulations were used to test four alternative ways of binning students. These simulations showed that binning students according to the number of questions they answered correctly on the entire test works best. Simulations also showed that confidence intervals for bin plots had coverage probabilities close to nominal values for common testing scenarios, but that there are scenarios in which the intervals have inflated error rates.

Keywords: item response theory, instrument development, model fit


Item response theory (IRT) is a widely used statistical paradigm for developing educational tests1 and assessing students (see de Ayala, 2009, for a review). The core concept in IRT is that the probability of a student answering a question correctly is a function of the latent ability of the student and characteristics of the question (including, e.g., the difficulty of the question). In IRT, student abilities are usually modeled as locations on a number line extending from negative infinity to positive infinity. IRT provides numerous mathematical models that specify how this ability affects performance on test questions. The most flexible, widely used, one-dimensional model is the three-parameter logistic (3PL) model. In this model, the probability, P3PL, of a student answering a question correctly is a function of the ability of the student, θ; the difficulty of the question, δ; the discrimination (or slope) of the question, α; and the “guessing rate” for the question, χ:

$$P_{3PL}(\theta \mid \alpha, \delta, \chi) = \chi + (1 - \chi)\,\frac{\exp[\alpha(\theta - \delta)]}{1 + \exp[\alpha(\theta - \delta)]} \qquad (1)$$

A graph of this function (Figure 1a) is called, alternatively, an item characteristic curve, an item response function, an item curve, or an item trace line. Figure 1a depicts an item trace line for a question with difficulty = −0.61, discrimination = 2.15, and guessing rate = 0.08.
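To make Equation 1 concrete, here is a minimal R sketch of the 3PL trace line (R is the environment used for the simulations reported later in this article); the function name p_3pl and its argument names are illustrative, not from any package.

```r
# Three-parameter logistic (3PL) trace line (Equation 1).
# theta: ability; a: discrimination; d: difficulty; g: guessing rate.
p_3pl <- function(theta, a, d, g) {
  g + (1 - g) * plogis(a * (theta - d))  # plogis(x) = exp(x) / (1 + exp(x))
}

# Trace line for the example question in Figure 1a.
curve(p_3pl(x, a = 2.15, d = -0.61, g = 0.08), from = -3, to = 3,
      xlab = "Ability", ylab = "Probability correct")
```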

Figure 1. An example of an item response theory bin plot (c) and graphs that illustrate how it was constructed (a, b, and d).

Note. Panel a shows the trace line for the empirical example discussed in the article (Question 1 on the formal operational reasoning test). The trace line shown in Panel a was estimated by assuming that student abilities had a standard normal distribution (Panel b). Panel c shows a bin plot with seven bins for the example. Panel d shows a standard normal distribution divided into seven quantiles. These quantiles represent the distribution of student abilities in each bin.

Fitting a 3PL IRT item trace line to test data is similar to fitting a regression line to bivariate data. In linear regression, the fit and uncertainty of a regression line can be assessed graphically by examining how the line passes through a scatterplot of the data. The amount of scatter around the line illustrates how much uncertainty there is in the slope of the line, and the distribution of points can reveal nonlinearity in the data or the presence of outliers. There are formal statistical tests for performing these analyses, but viewing a scatterplot around the line is invaluable. Unfortunately, there is no established way to do something like this for IRT trace lines. IRT has an extensive collection of statistical methods for assessing model fit (see Maydeu-Olivares, 2015, for a review) but lacks a simple graphical method for visually inspecting model fit and assessing uncertainty.

The goal of the work presented here was to develop a simple method for graphically displaying the fit of IRT trace lines to the data they were estimated from (see Figure 1c for an example). This research builds on the work of Hambleton, Swaminathan, and Rogers (1991, Chapter 4), who presented graphs like this. The present investigation extends their work in several ways: it points out some of the statistical choices that must be made to construct these graphs, proposes new ways to do some of the calculations, uses computer simulations to evaluate the accuracy of confidence intervals in the graphs, and presents empirical examples that illustrate the usefulness of the graphs. This article assumes that readers are familiar with normal distributions and have a working knowledge of IRT.

Methods and Results

An overview of the graphs and how they are interpreted will make the statistical methods below easier to understand. The graphs compare IRT trace lines with the student responses used to estimate them. A bin plot, therefore, has the same axes as an IRT trace line: student ability on the horizontal axis and probability on the vertical axis. This differentiates these plots from other “empirical” plots that place the number of questions correct on the horizontal axis (Chalmers, 2012; Morris et al., 2006).

Graphically comparing the fit of IRT trace lines to observed data is complicated by the fact that IRT trace lines (Equation 1) show probabilities, but each student’s response to a question is either correct or incorrect. Student responses can be plotted as 1 or 0, but assessing the fit of an IRT trace line to such points is not easy. A reasonable solution to this problem is to bin students according to how well they did on the test and calculate the proportion of students in each bin who answered the focal question correctly. This proportion can then be plotted next to the IRT trace line (Hambleton et al., 1991). A name is needed for these plots. IRT “bin plots” seems appropriate.

Constructing Bin Plots

An empirical example will make the statistical methods presented below easier to understand. Data from the formal operational reasoning test (Kalinowski & Willoughby, 2019) will be used as an example. The test has 20 questions, and the data for this example are from Question 1. The data were collected from an introductory biology course with a sample size of 240 students.

  • Step 1. Fit an IRT model to the data. The first step in making bin plots is to fit an IRT model to data and estimate the parameters for the model. A 3PL model is used here, but any one-dimensional model could be used. There are a variety of methods to estimate IRT coefficients. The most popular is marginal maximum likelihood (Bock & Aitkin, 1981), which assumes students have a standard normal distribution of abilities. In the example being discussed, the estimated coefficients are α̂ = 2.15, δ̂ = −0.61, and χ̂ = 0.08. Figure 1a shows a graph of this trace line.

  • Step 2. Sort the students into bins. The next step is to sort students into bins. This requires making a few decisions. The first is whether the bins should contain equal numbers of students or have equal widths on the ability axis. If the bins have equal widths, bins at the extreme ends of the ability axis will contain very few (or no) students, which will make it difficult to compare the performance of these students with expectations. Therefore, bins containing equal numbers of students will be used here. The next decision is what statistic to use to sort students into bins. Two possibilities seem reasonable: the estimated ability of each student (obtained using standard IRT methods) or the number of questions each student answered correctly on the test. The next decision is whether responses to the focal question will be used to bin students or whether this question will be dropped for the purpose of binning. For this exposition, students will be binned according to the number of correct responses on the entire test (including the focal question), but this issue will be revisited later, when computer simulations are used to see which of the four approaches works best. The last decision is how many bins to use. The more bins that are used, the fewer students will be in each bin. Seven bins were used for the example being presented (Table 1). A short R sketch of this binning step follows.
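A minimal sketch of equal-count binning, assuming a 0/1 response matrix named responses (students in rows, questions in columns); the variable names are illustrative.

```r
# Bin students into n_bins groups of (nearly) equal size, ranked by
# total score; ties are broken at random so bin sizes stay balanced.
n_bins <- 7
total  <- rowSums(responses)                   # questions correct per student
ranks  <- rank(total, ties.method = "random")  # 1 ... number of students
bins   <- cut(ranks, breaks = n_bins, labels = FALSE)
table(bins)                                    # check: roughly equal sizes
```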

  • Step 3. Calculate the fraction of students in each bin who answered the focal question correctly. The proportion of students in each bin who answered the focal question correctly, Pobs, is calculated next. This calculation is straightforward: the number of correct responses in the bin is counted and divided by the number of students in the bin. Confidence intervals for this proportion can be calculated using a variety of standard methods. The 95% intervals shown in Table 1 were calculated using the Clopper–Pearson method (see the sketch following this step).
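In R, the exact Clopper–Pearson interval is returned by binom.test(). A sketch for a single bin follows; the vector name correct (the bin's 0/1 responses to the focal question) is an assumption for the example.

```r
# Observed proportion correct and its 95% Clopper-Pearson interval.
p_obs <- mean(correct)
ci    <- binom.test(sum(correct), length(correct))$conf.int  # exact interval
```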

  • Step 4. Calculate Pexp for each bin. Bin plots compare the observed proportion of students in each bin who answered a question correctly with the proportion expected, Pexp, from the IRT model. There are a few ways to calculate these expected proportions. Three are discussed here.

    Hambleton et al. (1991, p. 60) suggested Pexp could be calculated by using the midpoint of each bin, θmid, as a representative ability for the bin, and using Equation 1 to calculate the proportion of students who would answer the question correctly, that is, Pexp = P3PL(θmid | α, δ, χ). This approach has two flaws. First, if abilities in the student population are normally distributed, there will usually be more students with abilities at one end of the bin than the other (see Figure 1d). Second, this approach does not account for changes in P3PL(θ) across the range of the bin.

    Hambleton et al. (1991, p. 60) suggested a second way to calculate Pexp: from the estimated abilities of the students in the bin. Each student’s estimated ability, θ̂j, can be used to calculate the probability that he or she answers the focal question correctly, and these probabilities can be averaged:

$$P_{\text{exp}} = \frac{1}{N} \sum_{j} P_{3PL}(\hat{\theta}_j \mid \alpha, \delta, \chi),$$

    where N is the number of students in the bin, θ̂j is the estimated ability of the jth student, and the summation is taken over all students in the bin. This approach fixes the two flaws described above but has a disadvantage. In Step 1, marginal maximum likelihood was used to estimate the coefficients of the IRT trace line. That algorithm assumes that student abilities are normally distributed. This assumption is part of the IRT model and therefore should also be used when calculating the expected value for each bin.

    Given this assumption, the abilities of students in each bin have a truncated normal distribution. Endpoints for each bin, a and b, can easily be calculated with the inverse function for the normal distribution (also known as the quantile function). In the example being discussed here, each bin includes one seventh of the students, so the bin with the lowest ability students has an ability range of −∞ to −1.0676 (Figure 1d).

    Now that a distribution of abilities for each bin has been specified, the proportion of students expected to answer the focal question correctly can be calculated. This proportion equals

$$P_{\text{exp}} = \int_a^b P_{3PL}(\theta \mid \alpha, \delta, \chi)\, \phi(\theta)\, d\theta,$$

    where ϕ(θ) is the truncated normal probability density function for the bin and a and b are the endpoints of the bin. The integral can be evaluated numerically (a short R sketch follows this step). Table 1 displays the expected proportions for each bin of the example.
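The endpoints and the integral are easy to compute numerically. Below is a minimal R sketch for equal-probability bins under a standard normal ability distribution; expected_p is an illustrative helper, and p_3pl is the sketch defined earlier.

```r
# Expected proportion correct for bin k of n_bins equal-probability bins.
expected_p <- function(k, n_bins, a, d, g) {
  lo <- qnorm((k - 1) / n_bins)   # lower endpoint (qnorm(0) = -Inf)
  hi <- qnorm(k / n_bins)         # upper endpoint (qnorm(1) = Inf)
  # Integrate P_3PL against the standard normal density over the bin, then
  # divide by the bin's probability mass (1 / n_bins), which turns the
  # standard normal density into the bin's truncated normal density.
  integrate(function(th) p_3pl(th, a, d, g) * dnorm(th), lo, hi)$value * n_bins
}

expected_p(1, 7, a = 2.15, d = -0.61, g = 0.08)  # about 0.21, as in Table 1
```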

  • Step 5. Find locations on the IRT trace line for each value of Pexp. The next step in constructing a bin plot is to plot the observed proportions as points on the graph. To do this, some value on the horizontal (ability) axis must be chosen for each point. Reviewing how the finished graph is supposed to be interpreted suggests a method for finding these values: the observed proportion for each bin should be plotted directly above or below the point on the trace line that corresponds to the expected proportion for the bin. Therefore, each point should have an ability value θ* such that P3PL(θ* | α, δ, χ) = Pexp. Numerical root-finding methods are available to find these values (Press et al., 1992); a sketch follows this step.
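Because the 3PL trace line is strictly increasing in θ, θ* can be found with one-dimensional root finding. A sketch using R's uniroot() follows; theta_star is an illustrative helper built on the p_3pl sketch above.

```r
# Find theta* such that P_3PL(theta*) equals the bin's expected proportion.
theta_star <- function(p_exp, a, d, g) {
  uniroot(function(th) p_3pl(th, a, d, g) - p_exp,
          lower = -10, upper = 10)$root
}

theta_star(0.21, a = 2.15, d = -0.61, g = 0.08)  # about -1.46, as in Table 1
```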

  • Step 6. Construct the graph. All the quantities needed to construct a bin plot have now been calculated. The finished graph (Figure 1c) uses the IRT trace line for the question to depict the expected proportions. The points on the graph depict the observed proportions for each bin; they have horizontal and vertical coordinates θ* and Pobs, respectively. Ninety-five percent confidence intervals for Pobs are also included on the graph. A sketch of how the pieces can be assembled follows.
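A finished bin plot can be drawn with base R graphics. The sketch below assumes the per-bin vectors theta_stars, p_obs_all, ci_lo, and ci_hi have already been computed in the earlier steps; these names are hypothetical.

```r
# Trace line with observed proportions and 95% confidence intervals.
curve(p_3pl(x, a = 2.15, d = -0.61, g = 0.08), from = -3, to = 3,
      ylim = c(0, 1), xlab = "Ability", ylab = "Probability correct")
points(theta_stars, p_obs_all, pch = 19)        # one point per bin
arrows(theta_stars, ci_lo, theta_stars, ci_hi,  # vertical CI whiskers
       angle = 90, code = 3, length = 0.03)
```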

Table 1. Data Used to Construct the Bin Plot Displayed in Figure 1c.

Quantile bin | N  | Observed proportion correct | Expected proportion correct | Bin endpoints (a, b) | θ*
1            | 34 | 0.29 | 0.21 | (−∞, −1.0676)      | −1.4621
2            | 35 | 0.57 | 0.45 | (−1.0676, −0.5659) | −0.7954
3            | 34 | 0.65 | 0.66 | (−0.5659, −0.1800) | −0.3717
4            | 34 | 0.71 | 0.80 | (−0.1800, 0.1800)  | −0.0066
5            | 34 | 0.82 | 0.90 | (0.1800, 0.5659)   | 0.3581
6            | 35 | 0.94 | 0.96 | (0.5659, 1.0676)   | 0.7801
7            | 34 | 1.00 | 0.99 | (1.0676, ∞)        | 1.4377

Note. See section “Constructing Bin Plots” for an explanation.

Examples

Examples from two unpublished tests (S. Kalinowski, unpublished, 2018) illustrate the usefulness of bin plots. The first example is from a test with “well-behaved” questions. The test has 24 questions relating to natural selection and was given to a sample of 230 students; it is the 11th revision of an instrument published by Kalinowski, Leonard, and Taper (2016). Figure 2a shows bin plots for all the questions on the test relating to “evolution” (see Kalinowski et al., 2016, for a description of the types of questions on the test). The trace lines for these questions display several desirable properties: steep slopes, a variety of difficulties, and low guessing rates. The fit of the trace lines to the observed proportions looks excellent. A sample of 230 students is usually considered small for the 3PL model (e.g., de Ayala, 2009), but the confidence limits on the observed proportions appear to constrain the shape of the trace lines fairly tightly. In other words, the coefficients for the questions seem well estimated.

Figure 2. Bin plots for questions on two tests: (a) a test with “well-behaved” questions relating to natural selection and (b) a test with “pathological” questions relating to correlational thinking.

The second example is a test with questions displaying a number of problems. This “pathological” test (Figure 2b) is the first draft of an instrument that was supposed to assess correlational thinking. There were 10 questions on the test, and the sample size for the data was 183 students. Figure 2b shows that several of the questions have low discrimination (including negative values) and poor item fit.

Computer Simulations

Computer simulations were used to explore two questions relating to IRT bin plots. The first question was which of the four binning methods described above (Step 2) works best. The second was how often the confidence intervals in bin plots contain the parametric value they are supposed to include.

Computer simulations were performed in the R statistical computing environment (R Core Team, 2018). Simulated tests had 10, 20, or 40 questions. The difficulties of the questions on each test were evenly spaced from −2 to 2. For example, questions on the 10-question test had difficulties −2.000, −1.556, −1.111, −0.667, −0.222, 0.222, 0.667, 1.111, 1.556, and 2.000. The discrimination of every question was 2.0, and the guessing rate for every question was 0.2. Student abilities were assumed to have a standard normal distribution. Two sample sizes were used: 200 and 1,000 students. Student responses were simulated using these parameters. A 3PL IRT model was then fit to the data using marginal maximum likelihood (Bock & Aitkin, 1981) as implemented in the mirt package (Chalmers, 2012). Student abilities were estimated using the Bayesian expected a posteriori method (Bock & Aitkin, 1981), again as implemented in mirt (Chalmers, 2012). A sketch of this simulation setup follows.
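For readers who wish to replicate the setup, here is a minimal R sketch of the data-generating step under the stated parameters; p_3pl is the sketch defined earlier, and the commented lines show one way the mirt calls might look (the exact options the author used are not reported).

```r
# Simulate 0/1 responses for one replicate (10 questions, 200 students).
set.seed(1)
n_students <- 200
n_items    <- 10
theta <- rnorm(n_students)                 # standard normal abilities
delta <- seq(-2, 2, length.out = n_items)  # evenly spaced difficulties
prob  <- outer(theta, delta,
               function(th, d) p_3pl(th, a = 2, d = d, g = 0.2))
responses <- matrix(rbinom(length(prob), 1, prob), nrow = n_students)

# Fit the 3PL model and estimate abilities with mirt (Chalmers, 2012):
# fit     <- mirt::mirt(as.data.frame(responses), 1, itemtype = "3PL")
# ability <- mirt::fscores(fit, method = "EAP")
```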

Bin plots were constructed using four methods for binning students. The first method binned students by the total number of questions answered correctly, including the focal question being graphed. The second also used the total number of questions answered correctly but excluded the focal question from the total. The third used each student’s estimated ability (estimated from the entire test) to bin students, and the fourth used the estimated ability calculated without the focal question. This last method is computationally more demanding than the others because a separate IRT model must be fit for each question on the test.

After data were simulated and bin plots were constructed, the coverage of all the confidence intervals on the plot was evaluated. This was done by checking to see if each confidence interval contained the true expected value for each bin. These true values were calculated using the parametric values used to simulate the data (i.e., the actual IRT coefficients used to simulate the data). One thousand simulations were performed for each combination of sample size and number of questions, and an average error rate for the bin plot confidence intervals was calculated.

Results from the computer simulations showed that binning students by the total number of questions answered correctly on the entire test generally worked best (Table 2): the error rate for this method was consistently lower than those of the other three methods. Two other patterns are evident in the results: the error rate for bin plots was closer to the nominal value of 0.05 for long tests (40 questions) than for short tests (10 questions), and for small sample sizes (200 students) than for large sample sizes (1,000 students). The confidence intervals for short tests with large sample sizes performed poorly; in the worst scenario examined (10 questions, 1,000 students), the error rate was 0.22.

Table 2. Confidence Interval Coverage Error Rates for Bin Plots Constructed From Simulated Data.

Test length  | Sample size    | Sum, all questions | Sum, without focal question | θ̂, all questions | θ̂, without focal question
10 questions | 200 students   | 0.06 | 0.09 | 0.15 | 0.08
10 questions | 1,000 students | 0.22 | 0.32 | 0.34 | 0.24
20 questions | 200 students   | 0.03 | 0.05 | 0.06 | 0.05
20 questions | 1,000 students | 0.09 | 0.13 | 0.12 | 0.10
40 questions | 200 students   | 0.03 | 0.04 | 0.04 | a
40 questions | 1,000 students | 0.05 | 0.07 | 0.06 | a

Note. The error rates shown in the table represent the proportion of confidence intervals in bin plots that did not contain the parametric proportion of students in the bin expected to answer the question correctly. Error rates are averaged across all questions and all bins. Results are shown for four binning methods and for simulated tests having different numbers of questions and different sample sizes (number of students). Confidence intervals on the bin plots had a nominal 95% coverage probability, so an error rate of 0.05 is desired.
a Results could not be obtained in a reasonable amount of computer time.

Discussion

This article describes a method for graphically displaying the fit of IRT trace lines to test data. The approach is based on a method developed by Hambleton et al. (1991). These “bin plots” display an IRT trace line along with the observed fraction of students in each bin who answered the question correctly. The graphs provide a simple way to view model fit and to get a sense of how much uncertainty there is regarding the shape (difficulty, slope, guessing rate) of an IRT trace line.

Computer simulations showed that confidence intervals in bin plots can have error rates substantially greater than their nominal value. Here, an error means that the confidence interval for a bin did not contain the proportion expected for the bin given the parameters used to simulate the data. The error rate was highest for short tests given to large samples of students. This inflated error rate may be explained by the interaction of two statistical phenomena. The first contributor is probably binning error: sorting students into bins is subject to estimation error, and short tests are expected to sort students more poorly than long tests. This error is likely to bias the observed proportion correct for each bin. The bias can be understood by considering a worst-case scenario: a test with so much binning error that students are randomly divided into bins. In this scenario, every bin would be expected to have the same observed proportion correct, which would bias proportions for low-ability bins upward and proportions for high-ability bins downward. Realistic tests would not have such extreme bias, but their bias would have the same direction. The second contributor to inflated error rates is the effect that large sample sizes have on the width of confidence intervals. For the purpose of illustration, assume binning error has introduced a small amount of bias in the observed proportion for each bin. This bias will not prevent a confidence interval from containing the true proportion if the interval is wide enough. However, if the sample size is large, the confidence interval will be narrow, and the narrower the interval, the less likely it is to contain the true proportion. This may be why short tests (10 questions) given to large samples (1,000 students) had the highest confidence interval coverage error rates.

Even with these coverage errors, bin plots should be useful for assessing the fit of IRT trace lines, provided the graphs are interpreted with some care. For example, there is no reason to suspect that the pathologies depicted in Figure 2b are artifacts of binning error or some related phenomenon, nor is there reason to doubt the excellent fit of the trace lines in Figure 2a, because bin plots for tests with at least 20 questions and small samples (200 students) have confidence interval coverage probabilities close to nominal values. This is fortunate, because IRT practitioners with small samples are probably most in need of ways to assess model fit and understand uncertainty. Many university courses have approximately 200 students, so bin plots should be useful for postsecondary discipline-based education research. IRT practitioners with larger data sets should interpret confidence intervals more cautiously, and users with very large data sets might omit confidence intervals altogether and simply inspect how far each observed proportion falls from the trace line.

1. IRT is equally useful for psychological assessment, but this article will assume an educational context to simplify presentation.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the National Science Foundation (Award 1432577).

ORCID iD: Steven T. Kalinowski https://orcid.org/0000-0001-8504-4923

References

  1. Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
  2. Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29. doi:10.18637/jss.v048.i06
  3. de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford Press.
  4. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory (Vol. 2). Thousand Oaks, CA: Sage.
  5. Kalinowski, S. T., Leonard, M. J., & Taper, M. L. (2016). Development and validation of the Conceptual Assessment of Natural Selection (CANS). CBE-Life Sciences Education, 15, article 64.
  6. Kalinowski, S. T., & Willoughby, S. D. (2019). Development and validation of a scientific (formal) reasoning test for college students. Journal of Research in Science Teaching. Advance online publication. doi:10.1002/tea.21555
  7. Maydeu-Olivares, A. (2015). Evaluating the fit of IRT models. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory (pp. 111-127). New York, NY: Routledge.
  8. Morris, G. A., Branum-Martin, L., Harshman, N., Baker, S. D., Mazur, E., Dutta, S., . . . McCauley, V. (2006). Testing the test: Item response curves and test quality. American Journal of Physics, 74, 449-453.
  9. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge, England: Cambridge University Press.
  10. R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
