Educational and Psychological Measurement. 2019 Dec 6;80(4):695–725. doi: 10.1177/0013164419892049

Improving Measurement Precision in Experimental Psychopathology Using Item Response Theory

Leah M Feuerstahler, Niels Waller, Angus MacDonald III
PMCID: PMC7307489  PMID: 32616955

Abstract

Although item response models have grown in popularity in many areas of educational and psychological assessment, there are relatively few applications of these models in experimental psychopathology. In this article, we explore the use of item response models in the context of a computerized cognitive task designed to assess visual working memory capacity in people with psychosis as well as healthy adults. We begin our discussion by describing how item response theory can be used to evaluate and improve unidimensional cognitive assessment tasks in various examinee populations. We then suggest how computerized adaptive testing can be used to improve the efficiency of cognitive task administration. Finally, we explore how these ideas might be extended to multidimensional item response models that better represent the complex response processes underlying task performance in psychopathological populations.

Keywords: cognitive assessment, item response theory, computerized adaptive testing


In experimental psychopathology and cognitive psychology more broadly, cognitive processes are often measured using computerized tasks (e.g., Pashler, 1988; Rouder et al., 2008). Success on these tasks requires subjects to implement one or more cognitive processes and/or the synthesis of information from one or more sensory systems. Based on their performance, subjects’ proficiencies on a limited set of processes are inferred with various degrees of precision.

In this article, we discuss the application of item response theory (IRT; Embretson & Reise, 2000; Lord, 1980) to cognitive task data. Partly because of the relatively large sample sizes (e.g., N>1000) available, most IRT models were developed in the context of educational assessment, whereas they have been less common in experimental psychology and experimental psychopathology (for notable exceptions, see Miyake, Friedman, Emerson, Witzki, & Howerter, 2000; Patz, Junker, Lerch, & Huguenard, 1996; Tuerlinckx, de Boeck, & Lens, 2002). Instead, this literature has primarily relied on simple sum scores, signal-detection theory (Peterson, Birdsall, & Fox, 1954), or task-specific mathematical models such as those described by Gibson et al. (2011) and Morey (2011) to model the multifaceted nature of cognitive response processes. Unfortunately, such models often lack established methods of evaluating model-data fit, score reliability, or parameter standard errors. IRT models, on the other hand, provide well-established methods for computing these methodological desiderata. Thus, the twin goals of this article are to introduce several state-of-the-art IRT methods to experimental psychopathologists and to demonstrate how IRT can be used effectively to model cognitive task data. In particular, we demonstrate how IRT can be used to (a) check model predictions, (b) evaluate model fit, and (c) design optimally informative and/or adaptive tasks that allow for high-quality measurements with a minimum number of trials and a modest number of subjects, at sample sizes commonly attainable for patient groups. In the remainder of this article, we illustrate these ideas with real and simulated responses to a well-known change detection task that is often used to model individual differences in visual working memory (Cowan, 2000; Pashler, 1988).

Item Response Theory

Core Concepts in IRT

To keep matters simple, we introduce the fundamental ideas of IRT using the one-parameter logistic model (1PL; Birnbaum, 1968), also known as the Rasch (1960/1980) model.1 In our opinion, this model—because of its simplicity and broad applicability—provides an ideal framework with which to introduce core IRT concepts to researchers who may be unfamiliar with modern psychometrics. Moreover, the wide availability of open source software specifically designed for Rasch family models (e.g., Robitzsch, Kiefer, & Wu, 2019) makes this model an attractive choice for experimental psychologists who are interested in fitting IRT models to their data. Finally, the 1PL, with its relatively small number of parameters, should work well with many cognitive and experimental psychopathology data sets due to its modest sample size requirements (Lord, 1983). Although other item response models have been applied successfully to psychopathology and experimental psychology data (Coleman et al., 2002; Feuerstahler & Waller, 2014; Reise & Waller, 2003, 2009; Suzuki, Hoshino, & Shigemasu, 2006; Waller & Feuerstahler, 2017; Waller & Reise, 2010; Waller, Thompson, & Wenk, 2000), all these models build on the core ideas that are present in the 1PL. Thus, the 1PL also serves as a natural gateway to these more complex (i.e., more highly parameterized) models. Later in this article, we discuss ways in which more complex IRT models can be used to better represent the cognitive processes that are required for successful task performance in a change detection task. We conclude by demonstrating how knowledge of the putative cognitive processes that underlie this task may be used to design a highly constrained, multiple-group IRT model that is well-suited for comparative psychopathology research.

Before introducing the fundamental ideas of IRT, it will be prudent to clarify some terminology. Because IRT was developed in the context of educational testing, discussions of IRT often refer to items that belong to tests. In this context, an item refers to a prompt that elicits a scored response from a subject, and a test is the collection of items to which a subject is exposed. Although this terminology is intuitive in educational contexts, it may not be obvious how these terms apply to other testing situations. For example, in the change detection task that we discuss later in this report (Pashler, 1988), a subject responds to a series of trials, each of which consists of a sample array of objects followed by a test array. Next, the subject is asked whether any objects changed between the sample array and the test array. In the change detection task (Pashler, 1988), each trial is designed to elicit a single response. Thus, trials in cognitive tasks may be treated analogously to items in educational tests. Throughout this article, we use the terms item and trial interchangeably, and the terms task and test interchangeably.

In general, IRT refers to a collection of mathematical models that characterize item responses from properties of both examinees and items. The simplest item response model, the 1PL, assumes that item responses can be dichotomously keyed as correct/incorrect (or agree/disagree, true/false, like/dislike). Although it is called the one-parameter model, the 1PL is actually defined by two parameters: (a) a person location (or ability) parameter usually denoted by θ and (b) an item location (or difficulty) parameter that is often denoted by δ. Importantly, in the 1PL both the person (θ) and item (δ) parameters are expressed on the same latent trait scale. In other words, person and item parameters are expressed in the same units on a common latent continuum. More formally, under the 1PL, the probability of a correct (keyed) response is modeled as a nonlinear function of the proximity of person j’s trait value (θj) to item i’s difficulty value (δi). Mathematically, the signed difference between θj and δi defines the expected log odds of a keyed response such that,

\log\left(\frac{P_{ij}}{1 - P_{ij}}\right) = \theta_j - \delta_i, \quad (1)

where Pij denotes the probability of a correct response to item i for person j. Rearranging Equation 1 yields

P_{ij} = \frac{\exp(\theta_j - \delta_i)}{1 + \exp(\theta_j - \delta_i)}, \quad (2)

which gives the probability of a correct response for any {θj, δi} pair. Under the 1PL, an individual has a 50% probability of a keyed response to the item when θj = δi. If θj > δi, the probability is greater than 50%, and if θj < δi, the probability is less than 50%. As a result, items with higher δi values require higher θj values to have equal probabilities of a correct response. In many contexts, δi is interpreted as the difficulty of item i because items with higher δi values should be more difficult for examinees to answer correctly.
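To make Equation 2 concrete, the short base R sketch below evaluates the 1PL item response function; the function name irf_1pl is illustrative rather than taken from any particular package.

irf_1pl <- function(theta, delta) {
  # Equation 2: probability of a keyed response under the 1PL
  exp(theta - delta) / (1 + exp(theta - delta))
}

irf_1pl(theta = 0, delta = 0)    # 0.50 when theta equals delta
irf_1pl(theta = 1, delta = 0)    # about 0.73 when theta exceeds delta
irf_1pl(theta = -1, delta = 0)   # about 0.27 when theta falls below delta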

Equation 2 defines the item response function (IRF). As noted above, this formula describes the probability of a correct response as a nonlinear function of item difficulty and person trait value. Figure 1 shows three example 1PL IRFs with δ1=2, δ2=1, and δ3=0. Note that θ=0 is traditionally set as the midpoint of the latent trait metric. As such, it is generally the case that positive θj values represent above average trait levels, and negative θj values represent below average trait levels. Under many applications of the 1PL (typically those that use the Rasch modeling framework), the variance of the latent variable is also estimated. In other IRT models, the latent variable typically is assumed to have variance equal to 1.

Figure 1.

Example item response and item information functions for the unidimensional one-parameter logistic model (1PL).

As displayed in Panel A of Figure 1, IRFs give the probability of a correct response as a nonlinear function of the latent trait θ. An important function derived from the IRF, called the item information function (IIF), quantifies the usefulness of a specific item for measuring individuals at different points along the latent trait continuum. The IIF for the 1PL equals

I_i(\theta_j) = P_{ij}(1 - P_{ij}), \quad (3)

where Ii(θj) denotes the amount of information provided by item i for individuals with trait level θj and Pij is defined in Equation 2. As shown in Panel B of Figure 1, the IIF is highest at θ values that are near the item’s difficulty value. Specifically, under the 1PL, items that give a 50% probability of a keyed response at a given θ value provide the greatest information at that θ value. Because item information is additive over a test, IIFs are often summed to produce a test information function (TIF). For a given pool of items, the TIF plays an important role in both IRT test development and computerized adaptive testing (CAT; Weiss, 1982) because (a) the precision of an estimated trait value θ^ is indexed by a function of the TIF, known as the standard error of measurement, SEM(θj), which equals the reciprocal of the square root of the TIF at θj, and (b) the TIF shows the θ regions for which a test is informative, that is, the range of θ values for which the test is reliable. More information on these ideas can be found in standard textbooks on IRT (e.g., de Ayala, 2009; Embretson & Reise, 2000).
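Continuing the sketch above, the following lines compute item information (Equation 3), sum the IIFs of three items into a TIF, and convert the TIF to a conditional standard error of measurement; the difficulty values are arbitrary illustration values rather than estimates from this article.

iif_1pl <- function(theta, delta) {
  # Equation 3: item information under the 1PL
  p <- irf_1pl(theta, delta)
  p * (1 - p)
}

theta_grid <- seq(-4, 4, by = 0.1)
deltas <- c(-2, -1, 0)                                              # arbitrary item difficulties
tif <- rowSums(sapply(deltas, function(d) iif_1pl(theta_grid, d)))  # test information function
sem <- 1 / sqrt(tif)                                                # conditional standard error of measurement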

An Application of the 1PL to Cognitive Task Data

Having introduced the reader to some of the more fundamental ideas of item response modeling, we now apply these ideas to data obtained from a change detection task that was originally administered to samples of schizophrenic, schizoaffective, bipolar, and healthy control populations. These data were collected as part of the Cognitive Neuroscience Test Reliability and Clinical applications for Schizophrenia (CNTRaCS) initiative (Gold et al., 2012), and they correspond to a computerized multiple change detection task.

An example trial of the multiple change detection task is displayed in Figure 2. During each trial of the task, the computer first displays a sample array of five objects. The sample array is then removed and, after a short interstitial delay, a test array is displayed. After the test array appears, subjects indicate whether they perceived any changes from the sample array to the test array. For the data set considered in this article (which is also analyzed by Gold et al., 2019), all subjects were administered a task consisting of 60 replications of four trial types for a total of 240 trials. The four trial types are defined by the number of objects that changed between the test array and the sample array: zero, one, two, or five objects. Trials were administered using a block randomization design wherein each of the 4 trial types was administered twice for every 8 trials. Subjects were recruited from five sites across the United States and from each of four diagnostic groups: control, schizophrenia, schizoaffective, and bipolar. In aggregate, data were collected from 227 subjects; 223 subjects had complete responses to the 240 items,2 including 59 controls and 63, 51, and 50 individuals classified as schizophrenic, schizoaffective, and bipolar, respectively. All data analyses reported in this article are based on the responses from the 223 subjects with complete data. On each trial, accurate responses were coded with the number “1,” and inaccurate responses were coded with the number “0.” These binary scores can be used to create preliminary sum scores as measures of task performance. The distributions of sum scores (number correct), conditional on item type and diagnostic group, are shown in Figure 3.

Figure 2.

Example change detection shapes task used in the CNTRaCS project.

Note. CNTRaCS = Cognitive Neuroscience Test Reliability and Clinical applications for Schizophrenia.

Figure 3.

Distribution of number correct scores for the four trial types in the CNTRaCS data.

Note. CNTRaCS = Cognitive Neuroscience Test Reliability and Clinical applications for Schizophrenia; Ctrl = control group, Scz = schizophrenia group, SczAff = schizoaffective group, Bplr = bipolar group.

For data collected using the multiple change detection task, a formal model that describes how trial response behavior relates to visual working memory capacity is sometimes used to estimate an examinee’s visual working memory capacity. Several variants of this model exist (Cowan, 2000; Gibson, Wasserman, & Luck, 2011; Rouder et al., 2008; Rouder, Morey, Morey, & Cowan, 2011). Some of these models do not take attentional capacity into account (Pashler, 1988), whereas others have more fundamental statistical problems when used to estimate person-level parameters (see Feuerstahler, Luck, MacDonald, & Waller, 2019). In the version of this model that we use in this article, three parameters are estimated for each individual (note that for simplicity, we do not include subscripts corresponding to subjects or trials in the following notation). The first parameter represents working memory capacity and is denoted by the letter K. The K parameter can be interpreted as the number of objects held in visual working memory. The second parameter represents attentional capacity and is denoted by the letter A. The A parameter can be interpreted as the probability that attention is paid on any given trial. The third parameter reflects guessing propensity and is denoted by the letter G. This guessing parameter measures the probability that the subject guesses that at least one object changed when the subject did not pay attention or otherwise does not have the object in visual working memory. A trial is defined by the set size S, which denotes the number of displayed objects, and by the number of changed objects C. By definition, C ≤ S.

For the CNTRaCS data, all trials had a set size of S = 5 and C ∈ {0, 1, 2, 5}. In other words, examinees were initially exposed to a five-item display followed by another display in which either zero, one, two, or five objects had changed. We initially measured individual differences in change detection with a model designed for the change detection task (see Rouder et al., 2011). In this model, DC denotes the probability that an object that has changed exists in working memory for a trial in which C objects have changed. In this variant of the model, DC is modeled as a function of K and C such that

\text{if } C = 0, \quad D_0 = 0, \quad (4)

\text{if } C > 0, \quad D_C = \begin{cases} \min\left[1, \; 1 - \prod_{i = S - C + 1}^{S} \dfrac{i - K}{i}\right] & \text{if } C \leq S - K \\ 1 & \text{otherwise.} \end{cases} \quad (5)

In the above equations, parameters K, A, and G are used to model the probability of a correct response to each experimental trial type. For a trial with C changes, we write PC to denote the predicted proportion of “change” responses (i.e., the hit rate). When C = 0, we write P0 to denote the false alarm rate (i.e., the proportion of “change” responses when no changes have occurred). These probabilities (i.e., change proportions) are modeled as functions of the three person-level parameters described above: A, G, and (indirectly via DC) K,

P_C = G + (1 - G) A D_C. \quad (6)
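To make Equations 4 to 6 concrete, the base R sketch below computes the predicted proportion of “change” responses for a single trial; the function name p_change and the example parameter values are purely illustrative.

p_change <- function(K, A, G, S, C) {
  # Equations 4 and 5: probability that a changed object is held in working memory
  if (C == 0) {
    D <- 0
  } else if (C <= S - K) {
    idx <- (S - C + 1):S
    D <- min(1, 1 - prod((idx - K) / idx))
  } else {
    D <- 1
  }
  # Equation 6: predicted proportion of "change" responses
  G + (1 - G) * A * D
}

p_change(K = 2.5, A = 0.9, G = 0.3, S = 5, C = 1)   # hit rate for a one-change trial
p_change(K = 2.5, A = 0.9, G = 0.3, S = 5, C = 0)   # false alarm rate (D0 = 0)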

In some variants on this model, closed-form formulas have been proposed to obtain estimates of visual working memory capacity from the observed proportion correct scores for each trial type (Cowan, 2000; Pashler, 1988). However, previous work has shown that these formulas can lead to inconsistent or severely biased trait estimates, especially for examinees who perform at the floor or ceiling (Morey, 2011). Thus, these formulas are not recommended because they can provide suboptimal estimates of working memory capacity under realistic testing scenarios. To avoid these issues, some investigators have advocated using maximum likelihood procedures to estimate subject-specific K, A, and G parameters (Gibson et al., 2011; Rouder et al., 2008; Rouder et al., 2011). Other researchers have also recommended using a Bayesian hierarchical model to estimate these model parameters (Morey, 2011). However, to the best of our knowledge, none of these methods can be used (in this context) to assess the psychometric properties (e.g., score reliability and empirical model fit) of change detection data. Moreover, some of these methods require specialized task-specific software to fit the model and to estimate individual differences in working memory capacity.

In this article, we argue that IRT provides an alternative and desirable method for analyzing change detection task data that does not suffer from some of the methodological limitations of the aforementioned approaches. For example, item response models can be easily implemented using widely available open-source or commercial software packages (e.g., Cai, Thissen, & du Toit, 2015; Chalmers, 2012; Muthén & Muthén, 1998-2011; Robitzsch et al., 2019; Rizopoulos, 2006), and the statistical and psychometric properties of item response models have been thoroughly studied for over 50 years. For these reasons, we believe that item response models could be profitably used to model and enhance the psychometric properties—and thereby interpretability—of cognitive task data.

Fitting IRT Models to the CNTRaCS Data

A major barrier when applying IRT models to cognitive task data is that data sets in experimental psychopathology are relatively small (compared with the very large data sets that are often available in educational applications of IRT, such as high-stakes testing). Importantly, sample size is a relevant feature to consider when contemplating the use of IRT because some IRT models (e.g., the three-parameter model) require relatively large sample sizes (N ≥ 1,000) to obtain accurate item parameter estimates. In general, the minimum sample size required when fitting an IRT model depends on properties of the data and at least three additional factors: (a) which model is chosen, (b) how model parameters are estimated, and (c) how the model results will be used. In general, simple analyses such as rank-ordering item difficulties or person abilities under the 1PL require relatively small samples. On the other hand, analyses that are aimed at answering more complicated questions, such as determining whether items function differently in different groups, require larger sample sizes to yield reliable results. As such, various minimum sample size recommendations exist in the literature. For exploratory analyses using the simple 1PL, as few as 30 subjects may provide useful results (Linacre, 1994). However, to stably estimate properties of the items and persons, samples of at least 200 (Drasgow, 1989) or 250 (Harwell & Janosky, 1991; Stone, 1992) are recommended. For more complex models such as the multidimensional 1PL discussed later in this article, several hundred (Adams, Wilson, & Wang, 1997) or several thousand examinees (Ackerman, 1994) may be needed for stable item parameter estimation.

We believe that drawing on cognitive theories of trial response processes may be a promising approach when applying IRT to small samples. For example, in the multiple change detection task, cognitive theory suggests that items of a given type are interchangeable. Specifically, for a given individual, all one-change trials with a set size of five predict the same success probability regardless of other trial features. Taking advantage of this cognitive model structure, we can simplify the IRT model that is used to characterize these data by estimating a single difficulty parameter for all trials of a given trial type. For the CNTRaCS data, this means that we need to estimate only 4 difficulties for the 240 trials (i.e., one difficulty for each trial type). Thus, by constraining item parameters to be equal across items of a given type, we can greatly reduce the number of estimated parameters in data sets that contain repeated trials of a given task. Although this approach seems uniquely suited to fitting cognitive data with IRT models, to our knowledge it has not been discussed in the experimental psychopathology literature.

When several 1PL item difficulties are restricted to the same value, the resulting model is an instance of the linear logistic test model (LLTM; Fischer, 1973, 1983). A classic application of the LLTM is in modeling performance on a mathematics test in which each item asks the student to compute the first derivative of a given function (Fischer, 1973). In this context, different items require different combinations of mathematical operations, such as differentiation of a polynomial or the application of the product rule. Applying the LLTM involves modeling the difficulty of each operation rather than the difficulty of each item. As such, the LLTM gives more diagnostic information about the relative difficulty of the operations. We suggest that the same logic can be applied to change detection task conditions or to items in other tasks with homogeneous trial types. Thus, by estimating the difficulty of each trial type, rather than of each trial instance, we can draw general conclusions about trial types that can be applied in new contexts. For example, application of the LLTM to change detection task data would allow for general conclusions to be made about the relative difficulty of different trial types.
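In LLTM terms, each trial's difficulty is a weighted sum of a small number of basic parameters. The sketch below builds the corresponding weight matrix for a 240-trial task in which each trial loads on exactly one of four trial-type parameters; the trial ordering and the basic parameter values are arbitrary illustrations, not the CNTRaCS design or estimates.

# Illustrative LLTM structure: delta = W %*% eta, with one indicator per trial type
trial_type <- factor(rep(c("zero", "one", "two", "five"), times = 60),
                     levels = c("zero", "one", "two", "five"))
W <- model.matrix(~ 0 + trial_type)                          # 240 x 4 weight (design) matrix
eta <- c(zero = -2.0, one = -0.5, two = -1.5, five = -3.0)   # arbitrary basic parameters
delta <- as.vector(W %*% eta)                                # implied difficulty for each trial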

In the analyses described below we assume that there is one general performance trait underlying the change detection task data. Thus, in our first set of analyses we do not model the effects of individual differences in guessing propensities (Pashler, 1988) or in attentional capacity (Rouder et al., 2008). It is worthwhile noting that these assumptions are also made when analyzing the number correct scores for these types of data. In addition, it merits note that most IRT analyses assume that examinees respond to distinct items/trials, whereas other models for analyzing change detection data typically treat trials as exchangeable. More specifically, most IRT analyses require a data set in which examinees are crossed with a fixed set of items, whereas other models for change detection only consider the observed proportion of change responses for each trial type. To make the CNTRaCS data amenable to IRT analysis, we labelled trials of each type according to their order of administration for each individual. Finally, because our combined data set was formed by aggregating examinee groups with different forms of psychopathology (and a group of normal controls), we fit a multiple-group model (Bock & Zimowski, 1997; Muthén & Christoffersson, 1981) to the CNTRaCS data. In the present context, the multiple-group model estimates a separate mean ability and variance for each diagnostic group and allows for a direct test of whether the measurement model is invariant across the different groups (Muthén & Lehman, 1985).

Method and Results

In all IRT analyses that are described in this article, we estimated item and person parameters by marginal maximum likelihood (MML; Bock & Aitkin, 1981) using the TAM function library (Robitzsch et al., 2019) in R (R Core Team, 2019). Under MML, item difficulties are estimated under the assumption that examinee abilities (for each group) follow a normal distribution. For the multiple-group model, we identified the model by fixing the mean of the control group to zero and then freely estimated the other group means. It is worth noting that our assumption that the within-group trait scores are normally distributed is not an overly strict assumption of the model in that reliable parameter estimates can, in many cases, be obtained when the population trait distribution is not normal (Stone, 1992).3 After MML item parameters are estimated, person parameter estimates (i.e., trait scores) can be calculated based on the item parameter estimates. There are several ways to compute IRT trait estimates, each with its own strengths and weaknesses. In this article, we computed weighted likelihood estimates (WLEs; Warm, 1989) because this method produces unbiased estimates of the latent trait.
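A hedged sketch of how such an analysis might be set up with the TAM package is shown below. The objects resp (a 223 × 240 binary response matrix) and grp (a vector of diagnostic group labels) are assumed to exist, and the equality constraints on trial-type difficulties are omitted here; in TAM they would be imposed through the package's design-matrix facilities.

library(TAM)

# Multiple-group 1PL estimated by marginal maximum likelihood
mod <- tam.mml(resp = resp, group = grp)
summary(mod)          # item difficulties plus group means and variances

# Weighted likelihood estimates (Warm, 1989) of the latent trait
wle <- tam.wle(mod)
head(wle$theta)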

We fit the LLTM to the CNTRaCS data by constraining the item difficulties of each trial type to the same value. Trial responses were coded for accuracy: a “change” response to a zero-change trial was coded “0” (it would instead be coded “1” if the data were coded for the observed “change” response), whereas for one-, two-, and five-change trials, a “change” response was always coded “1.”

Results of fitting the 1PL with constrained item difficulties to the CNTRaCS data are shown in Table 1. Recommended practice is to establish that the model fits the data before interpreting its parameter estimates; for pedagogical purposes, however, we discuss the interpretation of the parameter estimates first and evaluate model fit in a later section.

Table 1.

Item Parameter Estimates for the CNTRaCS Data.

Trial type δ^ SE(δ^)
Zero-change −1.95 0.023
One-change −0.46 0.018
Two-change −1.48 0.021
Five-change −2.80 0.031

Note. CNTRaCS = Cognitive Neuroscience Test Reliability and Clinical applications for Schizophrenia. The symbol δ^ denotes the estimated trial difficulty for each trial type. Entries under SE(δ^) are the standard errors of estimated trial difficulties.

For the CNTRaCS data, the estimated item difficulties (δ^) are all negative, indicating that each of these trial types implies a greater than 50% chance of success for the average person in the control group (because the mean person score in the control group equals zero; see Equation 2). For the three trial types in which one or more objects changed, five-change trials are the easiest, followed by two-change trials and then one-change trials. This ordering of trial difficulties is as expected. The zero-change trials have an estimated difficulty equal to −1.95 and appear to be somewhat easier than two-change trials and somewhat harder than five-change trials. Note that all item difficulties were estimated with fairly high precision, as indicated by the relatively small parameter standard errors, SE(δ^), ranging from 0.018 to 0.031 that are reported in Table 1.

As stated previously, to identify our model, the mean location of the person parameters (i.e., the trait scores) for the control group was set to zero. Under this constraint, we found that the estimated group means for the schizophrenia, schizoaffective, and bipolar groups equaled −0.32, −0.47, and −0.37, respectively. Moreover, the estimated group standard deviations for the control, schizophrenia, schizoaffective, and bipolar groups equaled 0.66, 0.58, 0.65, and 0.65, respectively. To make these results more interpretable, the estimated group means can be divided by the estimated group standard deviations to obtain standardized effect sizes (when compared to the control group) of −0.55, −0.73, and −0.57 for the schizophrenia, schizoaffective, and bipolar groups, respectively. If desired, these values could be used to conduct formal hypothesis tests of group differences. These results support and further clarify the idea that, when compared to normal control subjects, the three diagnostic groups have somewhat worse general performance on the change detection task.
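The standardized effect sizes reported above are simply each group's estimated mean divided by that group's estimated standard deviation, as in the short sketch below; small discrepancies from the reported values reflect rounding of the published means and standard deviations.

means <- c(scz = -0.32, sczaff = -0.47, bplr = -0.37)   # estimated group means
sds   <- c(scz =  0.58, sczaff =  0.65, bplr =  0.65)   # estimated group standard deviations
round(means / sds, 2)                                   # approximately -0.55, -0.72, -0.57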

We now turn our attention to the estimation of individual trait scores. When fitting unidimensional IRT models, latent trait estimates are similar to sum scores in that they provide an overall ranking of examinee performance. In fact, when fitting the 1PL there is a direct relationship between sum scores and trait estimates (in technical language, under this model the sum score is a sufficient statistic for the latent trait score, see E. G. Andersen, 1977). The relationship between trait estimates and sum scores for the CNTRaCS data is displayed in Figure 4. This figure not only shows that the relationship between sum scores and θ^ values is one-to-one but it also shows that this relationship is nonlinear, most noticeably at extremes of the latent trait continuum. In particular, the θ scale is stretched at the top end of the scale, with a relatively large range of θ values representing a relatively small range of sum scores. The nonlinearity of this function has an important implication for the interpretation of sum scores on this task. Namely, these results indicate that sum scores do a relatively poor job of discriminating among examinees with elevated trait scores. Alternatively, our results suggest that, relative to the sum score scale, the unidimensional IRT trait estimates are better able to distinguish performance at higher trait levels.

Figure 4.

Scatter plot of IRT WLE θ^ values and CNTRaCS data sum scores.

Note. IRT = item response theory; WLE = weighted likelihood estimate; CNTRaCS = Cognitive Neuroscience Test Reliability and Clinical applications for Schizophrenia.
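The one-to-one but nonlinear mapping shown in Figure 4 can be illustrated directly. Under the 1PL, the maximum likelihood trait estimate for a sum score s solves the equation Σi Pi(θ) = s; the sketch below solves this equation numerically using the irf_1pl() function defined earlier and arbitrary illustrative difficulties (the article's figure uses WLEs, which behave similarly but are not identical to maximum likelihood estimates).

deltas <- rep(c(-2.0, -0.5, -1.5, -3.0), each = 60)   # 240 illustrative trial difficulties

theta_for_sumscore <- function(s, deltas) {
  # Solve sum_i P_i(theta) = s for theta (maximum likelihood estimate under the 1PL)
  uniroot(function(th) sum(irf_1pl(th, deltas)) - s, interval = c(-10, 10))$root
}

scores <- c(60, 120, 180, 230, 238)
sapply(scores, theta_for_sumscore, deltas = deltas)   # note the stretching near the ceiling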

Earlier, we noted that one advantage of applying IRT to experimental psychology data is that the utility of items or trials can be expressed on the latent trait metric via the IIF. We now demonstrate this idea. In Figure 5, the item/trial information functions are plotted for the four trial types, along with a histogram of the WLE trait estimates computed from the CNTRaCS data. Notice in this figure that the IIF for each trial type is highest at the difficulty estimate δ^. By comparing the θ regions for which trial information is high with the histogram of WLEs, we can determine which trial types provide the most information for these examinees. As shown in the figure, one-change trials are highly informative at the mean examinee location. In contrast, relatively few examinees are located at trait levels for which five-change trials are informative. In a later section, we will see that this auxiliary information—which is not available when working with sum scores—can be used to design tasks that are maximally informative for a given test length by excluding trials that provide little information for the individual or subject group.

Figure 5.

Histogram of CNTRaCS WLE trait estimates and trial information functions for the four trial types. The histogram represents the observed distribution of WLEs for the CNTRaCS data (groups combined), and each black line represents the item information function for one of the indicated trial types.

Note. CNTRaCS = Cognitive Neuroscience Test Reliability and Clinical applications for Schizophrenia; WLE = weighted likelihood estimate.

Fit Assessment

As noted previously, a major advantage of IRT analysis is that well-established statistical indices exist for evaluating model-data fit at both the item and person levels. Although these statistics have been studied in a wide variety of contexts, we acknowledge that no single index is a perfect arbiter of model-data fit. Not surprisingly, the assessment of IRT model fit is an active area of research (Maydeu-Olivares, 2015; Maydeu-Olivares & Joe, 2005, 2006). The first step in fit assessment is to assess overall model-data fit. Model fit assessment includes evaluating the overall discrepancy between the model and the data, as well as the extent to which model dimensionality assumptions are met by the data. One way to evaluate the unidimensionality assumption of the 1PL is with Q3 statistics (Yen, 1984), which are the correlations between pairs of item-level residuals. If all item pairs are conditionally independent (i.e., if the test is truly unidimensional), then the average of these residual correlations should be close to zero but slightly negative. For hypothesis testing, an adjustment is made to the Q3 statistics so that their expected average value equals 0. For the 1PL fit to the CNTRaCS data, the mean adjusted absolute Q3 statistic equaled 0.066, with Holm-adjusted p = .0009, indicating that the hypothesis of unidimensionality is not supported for these data. Other overall fit measures include the standardized root mean square residual (SRMR) and the standardized root mean square root of squared residuals (SRMSR; Maydeu-Olivares, 2013). There are no absolute cutoffs for the SRMR and SRMSR measures, but Maydeu-Olivares suggests that values of these indices greater than 0.05 might be indicative of poor overall model fit. For the CNTRaCS data fit to the 1PL, SRMR = 0.069 and SRMSR = 0.087, both of which support the conclusion that the 1PL provides only moderate fit to these data.
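Because the Q3 statistic is simply the correlation between pairs of item-level residuals, it can be sketched in a few lines of base R. The objects resp (person × item binary matrix), theta_hat (trait estimates), and delta_hat (item difficulty estimates) are assumed to exist, and the centering in the last line is one common way of adjusting the Q3 values so that their expected average is 0.

P_hat <- outer(theta_hat, delta_hat, irf_1pl)   # model-implied response probabilities
resid <- resp - P_hat                           # item-level residuals
Q3 <- cor(resid)                                # residual correlations (Yen's Q3)
q3 <- Q3[upper.tri(Q3)]
mean(q3)                                        # near zero but slightly negative if unidimensional
mean(abs(q3 - mean(q3)))                        # mean adjusted absolute Q3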

If overall model fit statistics suggest poor fit, an appropriate next step is to specify a model that better represents the data-generating process. For pedagogical purposes, we will continue to illustrate model evaluation in the context of the 1PL, but will return to the issue of model specification later in this article. Whether or not the overall model fits, it is informative to further probe the fitted model to determine whether certain items or persons are particularly poor-fitting. Even for models that fit well overall, it is important to establish good model–data fit for each smaller piece of the model. Smaller pieces of the model can be analyzed either by looking at the fit of individual items or the fit of individual test-takers. In this article, we will use the root mean squared deviation statistic (RMSD; Yamamoto, Khorramdel, & von Davier, 2013) to assess item fit and the lz statistic (Drasgow, Levine, & Williams, 1985) to assess person fit. These statistics were selected for illustration because R functions are available to compute them from TAM (Robitzsch et al., 2019) output (RMSD is available in the TAM library, and lz is available in the sirt library; Robitzsch, 2019). For the RMSD statistic, Yamamoto et al. recommend flagging items with RMSD > 0.1 as poorly fitting. The lz statistic follows an asymptotic standard normal distribution under the null hypothesis of perfect fit, and so lz statistics greater than 3 in absolute value might be flagged as poorly fitting. However, if lz is calculated using θ estimates such as WLEs instead of true θ values (as must necessarily be done), the variance of lz statistics may be smaller than 1 (van Krimpen-Stoop & Meijer, 1999). Although we use only these fit statistics, we emphasize that a variety of fit statistics have been developed. An overview of item fit statistics is given by Swaminathan, Hambleton, and Rogers (2007), and an overview of person fit statistics is given by Meijer and Sijtsma (2001).
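As an illustration of the person-fit idea, the sketch below computes the standardized log-likelihood statistic lz for a single examinee from that examinee's binary response vector x and the model-implied probabilities p evaluated at the examinee's trait estimate; the function name is illustrative, and in practice one would use the routines available in sirt or similar packages.

lz_stat <- function(x, p) {
  l0   <- sum(x * log(p) + (1 - x) * log(1 - p))     # observed log-likelihood
  e_l0 <- sum(p * log(p) + (1 - p) * log(1 - p))     # expected log-likelihood
  v_l0 <- sum(p * (1 - p) * log(p / (1 - p))^2)      # variance of the log-likelihood
  (l0 - e_l0) / sqrt(v_l0)
}

# e.g., lz_stat(resp[1, ], irf_1pl(theta_hat[1], delta_hat)) for the first examinee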

A summary of the item and person fit statistics of the LLTM applied to the CNTRaCS data is shown in Table 2. In this table, we report the proportion of well-fitting trials by both trial type and diagnostic group using the RMSD < 0.1 criterion, and the proportion of well-fitting persons by diagnostic group using the |lz| < 3 criterion. The data in Table 2 suggest relatively poor item fit overall. This is somewhat unsurprising given that we earlier found deviations from overall model fit, and we have theoretical reasons to believe that trial responses are affected by multiple individual-level variables. Using the lz statistic to assess person fit, we find that between 88% and 95% of examinees in each diagnostic group produced responses that are well-characterized by the 1PL. Although imperfect, these fit statistics are useful for flagging trial types and persons that are not behaving as expected. Later in this article, we will illustrate a plausible alternative model for which these item and person fit statistics are more informative.

Table 2.

Proportion of CNTRaCS Items/Persons That Fit According to RMSD Item Fit and lz Person Fit for Parameters Estimated From the CNTRaCS Data.

Item fit
Diagnostic group Zero-change One-change Two-change Five-change Person fit
Control 0.77 0.23 0.62 0.92 0.95
Schizophrenic 0.48 0.18 0.35 0.52 0.90
Schizoaffective 0.28 0.07 0.27 0.55 0.88
Bipolar 0.05 0.17 0.07 0.17 0.90

Note. CNTRaCS = Cognitive Neuroscience Test Reliability and Clinical applications for Schizophrenia; RMSD = root mean squared deviation statistic. Trials were classified as well-fitting if RMSD < 0.1, and persons were classified as well-fitting if |lz|<3.

Simulation as an Aid in Small Sample IRT Applications

Small samples are inevitable in many research contexts, especially when the target populations are small and difficult to access, as is the case for the schizophrenic, schizoaffective, and bipolar populations in the CNTRaCS data. In this section, we suggest one way in which the strength of cognitive theories might be used to overcome some of the limitations of small samples for IRT parameter estimation. Specifically, cognitive task data are often accompanied by theories about the cognitive processes underlying the task response data. If data can be simulated from a non-IRT model that provides a plausible representation of the data-generating process, then an item response model can be specified that closely follows the theoretical model. Based on the item response model that is built from the simulated data, trait estimates can then be obtained for real examinees, and the extent to which IRT scores provide a more reliable and valid representation of the underlying latent trait(s) can be tested empirically. In this framework, it is possible to identify ways in which empirical data depart from a theoretical model. For instance, as noted above in the section on fit assessment, the IRT framework allows for detection of individuals who do not respond in accordance with model predictions. Moreover, it is possible that theory-based simulation can overcome some of the sample size limitations present in ordinary IRT analyses. In this section, we explore the extent to which meaningful analyses can be performed by first (a) simulating data from a non-IRT model, then (b) fitting these data to an IRT model, and finally (c) scoring data using the fitted IRT model results.

To demonstrate how data simulation may improve the measurement of real data, we simulated a large data set that mimicked the CNTRaCS task design described earlier. Namely, we simulated responses for 240 trials for each individual. All trials had set size S = 5, and there were 60 replications each of 4 different trial types: C ∈ {0, 1, 2, 5}. Under this task design, we considered values of working memory capacity in the range of 0.5 ≤ K ≤ 4, attentional capacity in the range of 0.6 ≤ A ≤ 1, and guessing in the range of 0 ≤ G ≤ 0.5. Next, we randomly generated {K, A, G} combinations for 5,000 subjects by sampling K, A, and G parameters from independent (i.e., uncorrelated) uniform distributions4 and used these parameter values to generate binary trial response data by (a) computing the probability of a “change” response for each individual, (b) generating a random number between 0 and 1, and (c) coding a response as “1” if the probability of a “change” response was greater than the random number, and as “0” otherwise. Finally, we coded the binary trial response data for accuracy (whereas the probabilities given by Equation 6 yield data coded for the observed “change” response).
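A minimal sketch of this simulation, using the p_change() function defined earlier, is shown below. The seed, the trial ordering, and the sampling details are illustrative; the sketch follows the procedure described above but will not reproduce the exact simulated data set used in this article.

set.seed(1)                              # arbitrary seed
n_sim <- 5000
C_vec <- rep(c(0, 1, 2, 5), each = 60)   # 60 replications of each trial type, S = 5

K <- runif(n_sim, 0.5, 4)
A <- runif(n_sim, 0.6, 1)
G <- runif(n_sim, 0.0, 0.5)

sim_resp <- t(sapply(seq_len(n_sim), function(j) {
  p_chg  <- sapply(C_vec, function(C) p_change(K[j], A[j], G[j], S = 5, C = C))
  change <- as.integer(runif(length(C_vec)) < p_chg)   # simulated "change" responses
  ifelse(C_vec == 0, 1L - change, change)              # recode zero-change trials for accuracy
}))
dim(sim_resp)                            # 5,000 x 240 accuracy-coded response matrix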

To obtain IRT trial difficulty estimates, we next fit the 1PL with one estimated difficulty parameter per trial type (an instance of the LLTM, described earlier) to the simulated data using MML and the TAM software. The resulting trial difficulty estimates are reported in Table 3. Notice that these parameter estimates appear to be much different from those that were estimated for the real data and are reported in Table 1. Specifically, the item parameter estimates are systematically higher for the simulated data than for the real data. This discrepancy is not necessarily worrisome. Specifically, the simulated data included a wide range of ability combinations such that performance is, on average, worse for the simulated data than for the real data. When fitting IRT models for both the real CNTRaCS data and the simulated data, we fixed the average θ value to zero to anchor the metric of the latent trait scores. As a result, the average trait score in the real data was somewhat higher than that for the simulated data, and the estimated item difficulties for the real data were systematically lower than those for the simulated data. Finally, note that the standard errors of the item parameter estimates were much lower for the simulated data than for the real data. These differences in the standard errors are partly the result of the different sample sizes of the real and simulated data sets. For the real data, we had only 223 cases and therefore we have less precise parameter estimates than when using the 5,000 simulated cases. Overall model fit indices again indicated somewhat poor fit (mean adjusted Q3 = 0.04, Holm-adjusted p < .0001, SRMR = 0.061, SRMSR = 0.069). All one-, two- and five-change trials fit according to the RMSD < 0.1 criterion, but only 12% of zero-change trials fit according to this criterion. The results of these model and item fit analyses generally support our previous findings that the unidimensional model is a somewhat flawed representation of the response process for these data. For the time being, we will continue to illustrate IRT using the unidimensional model, returning to the issue of model specification in a later section.

Table 3.

Item Parameter Estimates Computed From Simulated Data.

Trial type δ^ SE(δ^)
Zero-change −1.14 0.004
One-change −0.02 0.004
Two-change −0.67 0.004
Five-change −1.49 0.005

Note. SE = standard error.

In a previously described analysis, we used the item parameters estimated from the simulated data to model how the change detection trials operate. The resulting estimated trial difficulties can subsequently be used to estimate the proficiency of persons on the latent trait. Note that we do not expect smaller standard errors for estimated person parameters computed from the simulated item parameters than from the parameters estimated from real data. The standard errors of the estimated person parameters depend primarily on the number of administered trials rather than the number of persons in the item calibration sample. Regarding the robustness of our trait estimates, we found that the WLE trait estimates computed from the estimated item parameters from the simulated data (summarized in Table 3) correlated 0.9999 with the WLEs computed from the CNTRaCS estimated item parameters that are reported in Table 1. Moreover, the WLEs estimated from the simulated data exhibit slightly better person fit than those computed from the real data. According to the |lz| < 3 criterion, 97%, 94%, 92%, and 92% of the control, schizophrenic, schizoaffective, and bipolar groups, respectively, fit (compare these results to those reported in Table 2). It is noteworthy that we also observed a mean shift corresponding to the shift in item parameter estimates when comparing the estimated item difficulties from the real data against those for the simulated data. Specifically, when scoring the CNTRaCS data using the estimated difficulties from simulated data, the estimated group means equal 0.70, 0.43, 0.30, and 0.36 for the control, schizophrenia, schizoaffective, and bipolar groups, respectively. These estimates are, on average, 0.74 points higher than the group means reported earlier. This difference represents a shift in the origin of the latent trait scale. If desired, methods such as those described by Stocking and Lord (1983) can be used to place IRT models from separate calibrations onto a common latent trait metric. Importantly, both sets of estimated group means provide the same ordering of group means, and the magnitudes of the estimated group mean differences are approximately the same for both sets of estimates.

Optimal Task Design

We now turn our attention to the important question of optimal task design. Specifically, in this section we describe how the results from IRT analyses can be used to design cognitive assessment tasks that measure examinees with maximum efficiency (i.e., with a minimal number of trials; see van der Linden & Glas, 2000; Weiss, 1982). In this section we first introduce IRT-based CATs and then discuss how these state-of-the-art assessment methods might be applied to cognitive task data. We then describe one way in which change detection tasks can be adaptively administered by way of randomized trial blocks that ensure that each trial type is administered frequently while also efficiently measuring subjects’ latent trait levels.

IRT can be used to design tests that adapt to each examinee by applying the methods of CAT (Segall, 2005; Wainer, Dorans, Flaugher, Green, & Mislevy, 2000; Weiss, 1982). When administering CATs, each test begins with at least one item that is either chosen randomly or predetermined by the researcher. After one or more items have been administered, an examinee’s trait level is estimated based on the initial trial responses. Based on this provisional trait estimate, subsequent trials are chosen and administered that are maximally informative for that examinee. Adaptive tests that follow this paradigm provide greater precision of trait estimates and require fewer items than conventionally administered tests (Weiss, 1982). In practice, CATs have been particularly successful in large-scale educational contexts for which (a) it is possible to estimate item parameters ahead of time, (b) a large item pool is available from which to select new items, (c) efficiency in testing is desirable, and (d) the facilities are available for tests to be administered via computer. Cognitive tasks share many of these properties: (a) simulation can be used to precalibrate item parameters (as demonstrated in the previous section), (b) it is easy to generate new trials with known properties, (c) efficiency in testing is desirable due to limited resources and to counter fatigue or attention effects, and (d) testing is already routinely done using computers. For these reasons, we believe that computerized cognitive tasks are an ideal avenue in which to apply CATs.

Practical applications of CAT differ from the ideal scenario described above for several reasons. For one, in most item banks there are some items that are more informative than others for the majority of subjects. These highly informative items tend to be selected often by CATs, leading to exposure effects in which many test-takers are given the same item. A related and relevant concern in applying CATs to educational testing is in ensuring that the breadth of test content is adequately represented throughout the test. Several methods have been developed to balance content in CATs so that each of several content areas are well-represented (Leung, Chang, & Hau, 2003; van der Linden & Veldkamp, 2004). In cognitive tasks, content balancing is also needed so that the correct response is not the same for every trial. For instance, in the CNTRaCS data, one-change trials are highly informative at most estimated trait levels, but if the task included no zero-change items, then a “change” response would always yield a correct answer. For cognitive tasks, there are a limited number of available trial types, but trials with the same properties can be administered multiple times. Thus, instead of designing a CAT that selects a trial type after each response (as is usually done in aptitude and achievement testing), a cognitive assessment CAT can be designed to select a block of trial types. A block might consist of a small number of trials but include a variety of trial types.

To illustrate these ideas in the context of the multiple change detection task, imagine that we have constructed blocks of eight trials. Further suppose that to ensure variety in trials, each block includes 2 zero-change trials, 1 one-change trial, 1 two-change trial, and 1 five-change trial. The remaining three trials (of each eight-trial block) are selected to provide maximum information for a given examinee conditional on their current trait estimate. It can be shown that these additional trials will always be of the same trial type if block information is to be maximized, leading to four blocks of eight trials each:

  • Block A: 5 zero-change, 1 one-change, 1 two-change, 1 five-change

  • Block B: 2 zero-change, 4 one-change, 1 two-change, 1 five-change

  • Block C: 2 zero-change, 1 one-change, 4 two-change, 1 five-change

  • Block D: 2 zero-change, 1 one-change, 1 two-change, 4 five-change

By including 2 zero-change trials in each block, we can ensure that at least 25% of the administered trials have “no change” as the correct answer (note also 25% of trials in the CNTRaCS data set were zero-change trials). Moreover, within blocks, the trial administration order can be randomized. In this way, subjects are less likely to become aware of the block-adaptive strategy of test administration.

Based on the parameter estimates from simulated data that are described in the previous section, we computed block information functions for Blocks A to D. These block IIFs are displayed in Figure 6. This figure shows that, for all θ values greater than −0.34, Block B provides more information than the other blocks. Moreover, Block D is the most informative for all θ values below −1.30; Block A is the most informative block for −1.31 < θ < −0.91; and Block C is the most informative block for −0.90 < θ < −0.35. As more items in a CAT are administered, the estimated θ values will become more stable (and have smaller standard errors) and thus the same block will be selected repeatedly for a given subject. However, if trials are randomized within blocks, examinees will likely be unaware of the administration strategy. CATs designed in this fashion can be terminated after a certain number of trials or blocks are administered or after a prespecified standard error of measurement is obtained. We believe that the implementation and refinement of CAT-based cognitive tasks could greatly improve the efficiency and accuracy of these types of measurements.

Figure 6.

Example block information functions for the change detection task.
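The block information functions in Figure 6 can be computed directly from the trial-type IIFs: a block's information is the sum of its trials' information, that is, the count of each trial type multiplied by that type's IIF. The sketch below does this with the simulated-data difficulty estimates from Table 3 and the iif_1pl() function defined earlier, and then selects the most informative block at a provisional trait estimate.

deltas_type <- c(zero = -1.14, one = -0.02, two = -0.67, five = -1.49)   # Table 3 estimates

blocks <- rbind(A = c(zero = 5, one = 1, two = 1, five = 1),
                B = c(zero = 2, one = 4, two = 1, five = 1),
                C = c(zero = 2, one = 1, two = 4, five = 1),
                D = c(zero = 2, one = 1, two = 1, five = 4))

block_info <- function(theta) {
  as.vector(blocks %*% iif_1pl(theta, deltas_type))    # information of Blocks A-D at theta
}

theta_hat <- -0.5                                      # provisional trait estimate
rownames(blocks)[which.max(block_info(theta_hat))]     # selects Block C at this theta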

Multidimensional IRT Models for Improved Validity

In a previous section we applied the LLTM, a variant of the 1PL, to the CNTRaCS data. In those analyses, we chose to illustrate the major concepts of IRT with the 1PL because (a) the 1PL is simple to understand, (b) the sample size of the CNTRaCS data is relatively small, and (c) simple IRT models may lead to more stable person and item parameter estimates in small sample sizes (e.g., N < 200; Lord, 1983). Although the 1PL is valuable for summarizing an overall score, it provides somewhat poor fit to the CNTRaCS data, and we do not believe that it provides the most valid representation of the response processes that examinees employ when responding to cognitive tasks. In this section, we describe a multidimensional IRT model that may better represent response behavior, thus leading to more valid scores.

In an earlier section we showed that data that are simulated from a theoretical data-generating model might be used to further improve the precision of estimated trial difficulties. Specifically, we suggested that data sets can be simulated to represent thousands of examinees that respond to a task in accordance with a theoretical model. These simulated data can then be used to estimate properties of the task, which can then be applied to study individuals or groups of examinees. Because the simulation method is not limited by sample size, researchers are free to consider other item response models that provide a more accurate representation of the data-generating process. For example, the 1PL considers examinee ability on a single dimension. However, it is widely acknowledged that response processes in educational and psychological tests usually reflect multiple abilities (Reckase, 1997; Whitely, 1980). The mathematical model that is commonly used to characterize the CNTRaCS data is no exception—Equation 6 gives the probability of a “change” response as a function of three abilities: (a) working memory capacity, (b) attentional capacity, and (c) guessing propensity. In such situations, multidimensional item response models might provide a more valid account of response behavior on cognitive tasks.

To account for the multidimensional nature of response behavior, the 1PL can be extended so that multiple traits are estimated simultaneously. Specifically, instead of estimating one latent variable θ for each person, it is possible to use a modified 1PL to estimate a vector of D latent trait scores θ = (θ1, θ2, …, θD) for each individual. The Rasch family of IRT models can accommodate complex relationships between items and dimensions (Adams et al., 1997), but a simplified form of the multidimensional 1PL can be written

\log\left(\frac{P_{ij}}{1 - P_{ij}}\right) = \sum_{d=1}^{D} b_{id}\theta_{jd} - \delta_i, \quad (7)

where bid = 1 if latent variable d affects trial i, bid = 0 if latent variable d does not affect trial i, and δi is the multidimensional item difficulty. In the form of the multidimensional 1PL given in Equation 7, examinee j has a 50% chance of responding correctly to item i when δi = Σd bid θjd, that is, when δi equals the sum of the θjd values on the dimensions that affect item i.
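For concreteness, the sketch below evaluates Equation 7 for a single person-item pair; the function name and the example values are illustrative.

irf_m1pl <- function(theta_vec, b_vec, delta) {
  # Equation 7: compensatory multidimensional 1PL with 0/1 loadings b_id
  eta <- sum(b_vec * theta_vec) - delta
  exp(eta) / (1 + exp(eta))
}

# Illustrative trial that loads on dimensions 1 and 3 only
irf_m1pl(theta_vec = c(0.4, -0.2, 0.1), b_vec = c(1, 0, 1), delta = 0.5)   # equals 0.5 here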

In recent psychometric approaches to educational measurement, there has been increased attention on modeling the psychological processes that underlie observed test responses (Junker, 1999; Mislevy & Huang, 2007). Such models are sometimes called structural item response models (Rupp & Mislevy, 2007). In structural item response models, various aspects of the testing situation, such as features of items (e.g., the LLTM; Fischer, 1973) and the cognitive components needed to correctly answer an item (Embretson, 1984; Whitely, 1980), are modeled explicitly. Relatedly, multidimensional item response models specify dimensionality either between or within items. In tests with between-items multidimensionality, tests are composed of unidimensional subtests. As such, the test is sometimes said to have simple structure or independent cluster structure (McDonald, 1999, p. 174). In between-items multidimensionality, responses to each item or trial are treated as a function of only one underlying latent variable. Alternatively, tests positing within-item multidimensionality suppose that one or more items require multiple traits to respond correctly. There are two ways in which the multiple traits might interact. First, it may be posited that ability on all dimensions must be above certain levels to respond correctly. This is called a noncompensatory model because high ability on one dimension cannot compensate for a lack of ability on another dimension. Noncompensatory response processes are modeled by a multiplicative relationship between the traits, and so they cannot be modeled using the multidimensional model in Equation 7. Instead, Equation 7 specifies a compensatory model in which high ability on one ability dimension can overcome lesser abilities on other dimensions to increase the probability of a correct response (Reckase, 1997).

With these different models in mind, we aimed to specify an appropriate multidimensional item response model for the CNTRaCS data. We fit a series of multidimensional models to the set of 5,000 simulated responses that we described in an earlier section. When fitting these multidimensional models, we decided to score the zero-change trials for the observed “change” responses instead of for accuracy. We did this because a greater proclivity to guess (i.e., higher levels of G) should lead to lower accuracy (higher instances of “change” responses) for the zero-change trials. By scoring the zero-change trials for “change” responses rather than accuracy, higher scores on the estimated guessing dimension should correlate positively with raw sum scores and with the other dimensions. Next, recall that according to the data-generating model, there are three individual-level variables that underlie responses to the CNTRaCS data: visual working memory capacity K, attentional capacity A, and guessing propensity G. In theory, all three variables affect responses to the one-, two-, and five-change trials, and both A and G affect responses to the zero-change trials. However, these three variables interact in a noncompensatory manner; no amount of attention paid or proclivity to guess can compensate for a lack of working memory capacity. To complicate matters, the multidimensional 1PL forces a compensatory relationship between latent traits if there is within-item multidimensionality. Rather than fitting a multidimensional model that is explicitly compensatory, we were inspired by Whitely’s (1980) multicomponent latent trait model (see also Embretson, 1984). In this model, if the components (latent traits) are independent, then the model is, in a sense, compensatory (see Adams et al., 1997). By assuming independent components, Whitely’s model specifies between-item multidimensionality. However, interactions among the latent traits are still modeled indirectly in the between-items multidimensional model through the correlations among the latent trait dimensions.

As a next step in identifying an appropriate model for the multiple change detection task, we considered the empirical properties of the simulated data. Specifically, we examined the correlations between the proportion-correct scores on each trial type and the data-generating K, A, and G parameters. These correlations are displayed in Table 4. As the table shows, zero-change trials have essentially zero correlation with every data-generating parameter other than G. Similarly, five-change trials correlate highly with A (attention) and have relatively low correlations with K (capacity) and G (guessing). One- and two-change trials correlate most highly with K but also have moderate correlations with A and G. After several attempts at finding a valid measurement model based on these considerations, we found good recovery of the K, A, and G parameters when using a three-dimensional model with between-items multidimensionality. As in the unidimensional model described earlier in this article, we estimated one item location parameter for each trial type. Moreover, our three-dimensional model specified that different trial types load on different, but correlated, dimensions. Namely, the one- and two-change trials loaded only on the first dimension (visual working memory capacity), the zero-change trials loaded only on the second dimension (guessing), and the five-change trials loaded only on the third dimension (attention). Note that in the data-generating model, K, A, and G affect performance on all trials (except for the zero-change trials, which are not affected by K), a feature we did not include in our multidimensional model. Although we specified K, A, and G to be uncorrelated in our data simulation, we allowed their corresponding dimensions to be correlated in the multidimensional IRT model. In this way, information that is missing from the between-items structure of the multidimensional model might be accounted for by the correlations among dimensions.

Table 4.

Correlations Between Proportion Correct by Trial Type and Data-Generating Parameters for the Simulated Data.

Trial type K A G
Zero-change 0.00 −0.01 0.94
One-change 0.66 0.23 0.56
Two-change 0.71 0.43 0.38
Five-change 0.09 0.84 0.28
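Correlations such as those in Table 4 can be computed with a few lines of R. In the sketch below, resp is the rescored 0/1 response matrix, trial_type is a vector giving each column's trial type, and truth is a data frame holding the data-generating K, A, and G values; all three names are hypothetical placeholders rather than objects from our scripts.

# Per-person proportion correct (or proportion "change" for zero-change trials)
# by trial type, correlated with the data-generating parameters.
prop_by_type <- sapply(c("zero", "one", "two", "five"), function(tt)
  rowMeans(resp[, trial_type == tt, drop = FALSE], na.rm = TRUE))
round(cor(prop_by_type, truth[, c("K", "A", "G")], use = "pairwise"), 2)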

We fit the three-dimensional IRT model to the 5,000 simulated cases described earlier using the tam.mml function in the TAM package (Robitzsch et al., 2019) for R (R Core Team, 2019). We used quasi-Monte-Carlo integration with quadrature nodes at a three-dimensional grid of 15 points uniformly spaced from −3 to +3 on each dimension (previous explorations of these data show that the latent trait variances tend to be far less than 1, which is why we restricted the quadrature nodes to the range −3 to +3). The resulting parameter estimates are shown in Table 5. For the simulated data, model fit was greatly improved over the unidimensional model, with the mean adjusted Q3 statistic equal to 0.01, Holm-adjusted p = .23, SRMR = 0.013, and SRMSR = 0.016. Moreover, all items in all diagnostic groups fit according to the RMSD < 0.1 cutoff (in fact, all trials had RMSD < 0.06). Caution should be taken when interpreting the difficulty estimates reported in Table 5. When we fit the multidimensional model, we centered each dimension d at θd = 0. Because different trial types load on different dimensions, the resulting difficulty estimates are relative to the average trait level on each dimension. As a result, difficulty estimates cannot be directly compared across dimensions. For example, notice that the zero-change trial difficulty estimate equals 1.31. However, because the zero-change trials comprise the totality of the guessing dimension, this estimate means that the probability of a “change” response on zero-change trials equals 0.5 only for subjects who are highly inclined to guess. Because the one- and two-change trials represent the same dimension, these two difficulty estimates can be compared, and as expected, the one-change trials are more difficult than the two-change trials. Finally, the five-change trials have a difficulty estimate of −1.64, indicating that an examinee would have a 50% probability of responding correctly to five-change trials only if their attentional capacity is severely limited.
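A TAM call along the following lines could reproduce this type of analysis. This is a sketch under stated assumptions: the Q-matrix encodes the between-items loading structure described above, the objects resp and trial_type carry over from the earlier sketches, and the control settings merely illustrate how a fixed −3 to +3 grid can be requested; the quasi-Monte-Carlo options (e.g., TAM's snodes and QMC controls) are omitted here and should be set per the TAM documentation.

library(TAM)

# Between-items Q-matrix: one- and two-change trials load on Dimension 1
# (capacity), zero-change trials on Dimension 2 (guessing), and five-change
# trials on Dimension 3 (attention).
Q <- cbind(Dim1 = as.numeric(trial_type %in% c("one", "two")),
           Dim2 = as.numeric(trial_type == "zero"),
           Dim3 = as.numeric(trial_type == "five"))

# Multidimensional 1PL (Rasch-type) fit; quadrature nodes restricted to -3..+3.
fit3d <- tam.mml(resp = resp, Q = Q,
                 control = list(nodes = seq(-3, 3, length.out = 15)))
summary(fit3d)

# Q3-based and SRMR/SRMSR model-fit summaries of the kind reported in the text.
mfit <- tam.modelfit(fit3d)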

Table 5.

Parameter Estimates for the Multidimensional Model (Computed From Simulated Data).

Trial type    Estimate (δ^)    Standard error SE(δ^)
Zero-change   1.31             0.004
One-change    −0.02            0.004
Two-change    −0.71            0.004
Five-change   −1.64            0.005

Covariance/correlation matrix of the latent dimensions
              Dimension 1    Dimension 2    Dimension 3
Dimension 1   0.58           0.49           0.61
Dimension 2   0.36           0.94           0.24
Dimension 3   0.42           0.21           0.80

Note. In the covariance/correlation matrix, variances are shown on the diagonal, covariances are shown under the diagonal, and correlations are shown above the diagonal.

Next, to evaluate the structural validity of our multidimensional model, we computed the correlations between the data-generating parameters and the multidimensional IRT WLE trait estimates. The correlations among the multidimensional WLEs, the data-generating K, A, and G values, and the maximum likelihood K^, A^, and G^ values are shown in Table 6. This table shows that Dimension 2 correlates 0.89 with the data-generating G parameter and 0.93 with the maximum likelihood G^. Dimension 3 correlates 0.83 with the A parameter and 0.89 with the maximum likelihood A^. Finally, Dimension 1 correlates only 0.72 with K and only 0.71 with the maximum likelihood K^. This last pair of correlations may be considered high in some contexts, but it is too low for WLEs on Dimension 1 to be considered a reliable measure of working memory capacity on their own. One possible reason for these relatively low correlations is that, even though the data were generated such that K, A, and G are uncorrelated, Dimension 1 also correlates moderately with the data-generating A and G parameters. In addition, these correlations might be artificially low because of potentially unrealistic {K, A, G} combinations in the simulated data.
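This WLE-based validity check can be scripted as follows. Again, this is a sketch that continues the hypothetical fit3d and truth objects from the earlier blocks; the theta column names follow TAM's multidimensional naming convention as we understand it.

# Weighted likelihood estimates (Warm, 1989) for each person on each dimension,
# correlated with the data-generating K, A, and G values.
wle3d <- tam.wle(fit3d)
theta_wle <- wle3d[, grep("^theta", colnames(wle3d)), drop = FALSE]
round(cor(theta_wle, truth[, c("K", "A", "G")], use = "pairwise"), 2)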

Table 6.

Correlation Between Multidimensional WLEs, Data-Generating K, A, and G Parameters, and Estimated K, A, and G Parameters for the Simulated Data.

WLE Dimension 1 WLE Dimension 2 WLE Dimension 3 K^ A^ G^
K 0.72 −0.01 0.11 0.88 0.13 0.00
A 0.42 0.02 0.83 0.02 0.88 0.02
G 0.48 0.89 0.23 0.01 0.01 0.94
K^ 0.71 −0.03 0.03
A^ 0.44 −0.02 0.89
G^ 0.46 0.93 0.22

Note. WLE = weighted likelihood estimate. Bold values correspond to correlations between measures that represent the same dimension.

The sample size of the CNTRaCS data (N = 223) is too small to obtain reliable trial difficulty estimates for the multidimensional IRT model. However, it is possible to estimate person abilities on the three dimensions for the CNTRaCS data by conditioning on the trial parameter estimates from the simulated data. The results of scoring examinees based on the simulated parameter estimates are shown in Table 7. Relative to the simulated parameter estimates and the 0.7 and 1.3 cutoffs used earlier, all examinees in all groups fit according to the |lz| < 3 criterion. As Table 7 shows, the multidimensional WLE scores correlate highly with the K^, A^, and G^ maximum likelihood estimates. Specifically, K^ correlates 0.86 with WLEs on Dimension 1, A^ correlates 0.80 with WLEs on Dimension 3, and G^ correlates 0.94 with WLEs on Dimension 2.

A byproduct of scoring these groups with TAM is that mean dimension estimates are available for each group; these estimates are reported in the bottom half of Table 7. Note that, for the same reason that trial difficulties cannot be compared across dimensions, group means can be compared within dimensions but not across dimensions. Likewise, Dimension 1 means can only be compared with the one- and two-change trial difficulty estimates, Dimension 2 means with the zero-change difficulty estimate, and Dimension 3 means with the five-change difficulty estimate. On Dimension 1, all groups are located above the difficulties of the one- and two-change trials. Namely, the one- and two-change trials have estimated difficulties of −0.02 and −0.71, respectively, whereas group means range from 0.22 to 0.48 (here, the simulated-data group mean equals 0, so all group means can be estimated). Moreover, the control group mean on Dimension 1 equals 0.48 and is somewhat higher than the other three group means, which range from 0.22 to 0.29. On Dimensions 2 and 3, we see severe mismatch between the estimated trial type difficulties and the mean scores on each dimension. For example, Dimension 3 is supposed to measure attentional capacity, but group means on this dimension range from 0.66 to 1.53, whereas the difficulty estimate for the five-change trials equals −1.64. We see this result not as a flaw of IRT modeling but as reflective of the challenge of measuring attentional capacity reliably: item response modeling exposes the extreme mismatch between the five-change trial difficulties and group-level mean attentional capacities, and it provides the tools to assess how well attentional capacity is measured by alternative task designs. Developing a better task design for the reliable measurement of attentional capacity is beyond the scope of this article. However, the methods set forth in this article, such as the assessment of trial information functions and block adaptive testing, can be applied in a multidimensional context to design tasks that reliably measure the desired constructs.
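Conditioning on the simulated-data trial parameters can be accomplished in TAM by fixing the item parameters when the CNTRaCS responses are scored. The sketch below shows one way to do this under our assumptions: resp_cntracs (the CNTRaCS response matrix, rescored as above) and group (diagnostic group labels) are hypothetical placeholders, xsi.fixed expects a matrix of parameter indices and fixed values, and the group means computed here are simple means of the WLEs, which need not match the model-based group means reported in Table 7.

# Fix trial parameters at the simulated-data estimates, then score and summarize.
xsi_fix <- cbind(seq_len(nrow(fit3d$xsi)), fit3d$xsi$xsi)
fit_cnt <- tam.mml(resp = resp_cntracs, Q = Q, xsi.fixed = xsi_fix,
                   control = list(nodes = seq(-3, 3, length.out = 15)))
wle_cnt <- tam.wle(fit_cnt)
theta_cnt <- wle_cnt[, grep("^theta", colnames(wle_cnt)), drop = FALSE]
aggregate(theta_cnt, by = list(group = group), FUN = mean)  # group means per dimension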

Table 7.

Correlation Between Multidimensional WLEs and Estimated K, A, and G Parameters for the CNTRaCS Data (Upper Part of Table), and Group Means for the Three Dimensions (Lower Part of Table).

WLE Dimension 1 WLE Dimension 2 WLE Dimension 3
K^ 0.86 −0.33 0.46
A^ 0.41 −0.58 0.80
G^ −0.01 0.94 −0.46
Control 0.48 −0.97 1.53
Schizophrenia 0.29 −0.57 1.11
Schizoaffective 0.22 −0.23 0.66
Bipolar 0.23 −0.38 1.23

Note. WLE = weighted likelihood estimate. Bold values correspond to correlations between measures that represent the same dimension.

Conclusions

A primary goal of this article was to introduce the core concepts of IRT to cognitive and experimental psychopathology researchers and to demonstrate how these methods can be used to improve measurement efficiency and quality. To achieve this goal, we introduced the basic ideas of IRT using the 1PL, and we demonstrated how this model can be applied to trial performance on a change detection task administered to groups of healthy controls and individuals with schizophrenia, schizoaffective disorder, and bipolar disorder. In applying IRT to experimental psychology data, we first described how IRT can be used to estimate the properties of different cognitive tasks. This information can be used to evaluate how well a cognitive task measures its target populations and to design adaptive tasks that accurately measure examinees with a minimum number of administered trials. Furthermore, we described how existing mathematical models of cognition can be used to simulate data from which IRT-based measurement models are established. These models can then be used to score real responses to the cognitive task. Through this use of theory-based model fitting, many sophisticated IRT methods, such as multidimensional models and CAT, can be used even with the relatively small samples that are common in experimental psychopathology.

One of the novel ideas discussed in this article is the use of mathematical models to simulate data to which an IRT model is then fit. This use of simulation can resolve some of the problems encountered when applying IRT models to small samples. However, simulation cannot overcome all of the problems associated with small-sample data analysis. Subsequent statistical analyses, such as comparing groups and correlating capacity estimates with other variables, will still have limited power if small samples are used. Moreover, as our example illustrates, fitting an IRT model that is nontrivially different from the data-generating model might lead to systematic errors in estimation. In general, we recommend that, when using these methods, the fitted model be as similar as possible to the data-generating model. Perhaps noncompensatory multidimensional models, which arguably better describe the interactions among the parameters, would better represent the relationships among visual working memory capacity, attentional capacity, and guessing proclivity. Another possibility is to use explanatory item response models (de Boeck & Wilson, 2004) to model features of the cognitive task (e.g., the shape and position of changed objects) that are typically ignored when data are analyzed with sum scores or task-specific models.

IRT has become widespread in educational testing and in some areas of psychological research, and we believe that IRT models have the potential to substantially improve measurement in experimental psychology. Specifically, IRT allows researchers to evaluate the quality of trial types for measuring various populations and to optimally design tasks for efficient measurement, including tasks that adapt to individual examinees. In summary, IRT is an underutilized tool in experimental psychology research, and the methods described in this article, including the use of simulated data and adaptive task design, can be used profitably to improve measurement in this field.

1.

Although the 1PL and the Rasch model are formally equivalent, they were developed within distinct psychometric traditions. Among other differences, the Rasch tradition emphasizes test design and refinement (Wilson, 2005), whereas the tradition reflected in this article treats data-collection procedures as immutable.

2.

Only complete responses were used for this illustration so that we could fairly compare IRT to other scoring approaches. Notably, the MML algorithm, presented later, is generally unaffected by the presence of ignorable missing data. This feature of the IRT approach may be particularly attractive in contexts for which large amounts of missing data are expected.

3.

Although we assumed that the CNTRaCS latent abilities followed a normal distribution, there are ways to relax this assumption. Methods have been developed to flexibly estimate the latent trait distribution when using MML (Woods, 2006; Woods & Thissen, 2006). Additionally, an alternative to MML estimation is conditional maximum likelihood estimation (E. B. Andersen, 1972), which is available for models in the Rasch family and makes no assumptions about the distribution of abilities. In this article, we chose to use MML estimation for didactic reasons, specifically because MML can be used with both Rasch and non-Rasch models.

4.

Some readers may find it strange that we generated {K, A, G} values from uniform and uncorrelated distributions if these variables are likely to be correlated in the population. Theoretically, the latent trait distribution should not affect properties of the fitted model (although it may do so in finite samples). We made this choice to represent the broadest possible range of parameter combinations. Because a psychometric model should be developed with consideration for the entire range of persons to whom it will be applied, we believe that it is more appropriate to simulate data in this manner, although we leave open the possibility that future research may find other advantages of using correlated simulated data (e.g., if the resulting scores correlate more highly with external variables).

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID ID: Leah M. Feuerstahler https://orcid.org/0000-0002-7001-8519

References

1. Ackerman T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Applied Measurement in Education, 7, 255-278.
2. Adams R. J., Wilson M., Wang W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-23.
3. Andersen E. B. (1972). The numerical solution of a set of conditional estimation equations. Journal of the Royal Statistical Society, Series B (Methodological), 34, 42-54.
4. Andersen E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.
5. Birnbaum A. (1968). Estimation of ability. In Lord F. M., Novick M. R. (Eds.), Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.
6. Bock R. D., Aitkin M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
7. Bock R. D., Zimowski M. F. (1997). Multiple group IRT. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 433-448). New York, NY: Springer.
8. Cai L., Thissen D., du Toit S. H. C. (2015). IRTPRO for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.
9. Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1-29.
10. Coleman M. J., Cook S., Matthysse S., Barnard J., Lo Y., Levy D. L., Rubin D. B., Holzman P. S. (2002). Spatial and object working memory impairments in schizophrenia patients: A Bayesian item-response theory analysis. Journal of Abnormal Psychology, 111, 425-435.
11. Cowan N. (2000). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87-185.
12. de Ayala R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford Press.
13. de Boeck P., Wilson M. (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York, NY: Springer.
14. Drasgow F. (1989). An evaluation of marginal maximum likelihood estimation for the two-parameter logistic model. Applied Psychological Measurement, 13, 77-90.
15. Drasgow F., Levine M. V., Williams E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
16. Embretson S. (1984). A general latent trait model for response processes. Psychometrika, 49, 175-186.
17. Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
18. Feuerstahler L. M., Luck S. J., MacDonald A., Waller N. G. (2019). A note on the identification of change detection task models to measure storage capacity and attention in visual working memory. Behavior Research Methods, 51, 1360-1370.
19. Feuerstahler L. M., Waller N. G. (2014). Estimation of the 4-parameter model with marginal maximum likelihood. Multivariate Behavioral Research, 49, 285.
20. Fischer G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.
21. Fischer G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48, 3-26.
22. Gibson B., Wasserman E., Luck S. J. (2011). Qualitative similarities in the visual short-term memory of pigeons and people. Psychonomic Bulletin & Review, 18, 979-984.
23. Gold J. M., Barch D. M., Carter C. S., Dakin S., Luck S. J., MacDonald A. W., . . . Strauss M. (2012). Clinical, functional, and intertask correlations of measures developed by the Cognitive Neuroscience Test Reliability and Clinical Applications for Schizophrenia Consortium. Schizophrenia Bulletin, 38, 144-152.
24. Gold J. M., Barch D. M., Feuerstahler L. M., Carter C. S., MacDonald A. W., Ragland J. D., . . . Luck S. J. (2019). Working memory impairment across psychotic disorders. Schizophrenia Bulletin, 45, 804-812. doi:10.1093/schbul/sby134
25. Harwell M. R., Janosky J. E. (1991). An empirical study of the effects of small datasets and varying prior variances on item parameter estimation in BILOG. Applied Psychological Measurement, 15, 279-291.
26. Junker B. W. (1999). Some statistical models and computational methods that may be useful for cognitively relevant assessment (Prepared for the National Research Council Committee on the Foundations of Assessment). Retrieved from http://www.stat.cmu.edu/~brian/nrc/cfa/documents/final.pdf
27. Leung C.-K., Chang H.-H., Hau K.-T. (2003). Computerized adaptive testing: A comparison of three content balancing methods. Journal of Technology, Learning and Assessment, 2, 1-13. Retrieved from https://ejournals.bc.edu/index.php/jtla/article/view/1665
28. Linacre J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7, 328.
29. Lord F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
30. Lord F. M. (1983). Small N justifies Rasch model. In Weiss D. J. (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing (pp. 51-61). New York, NY: Academic Press.
31. Maydeu-Olivares A. (2013). Goodness-of-fit assessment of item response theory models (with discussion). Measurement: Interdisciplinary Research and Perspectives, 11, 71-137.
32. Maydeu-Olivares A. (2015). Evaluating the fit of IRT models. In Reise S. P., Revicki D. A. (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 111-127). New York, NY: Taylor & Francis.
33. Maydeu-Olivares A., Joe H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100, 1009-1020.
34. Maydeu-Olivares A., Joe H. (2006). Limited information goodness-of-fit testing in multidimensional contingency tables. Psychometrika, 71, 713-732.
35. McDonald R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
36. Meijer R. R., Sijtsma K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
37. Mislevy R. J., Huang C.-W. (2007). Measurement models as narrative structures. In von Davier M., Carstensen C. H. (Eds.), Multivariate and mixture distribution Rasch models: Extensions and applications (pp. 16-35). Berlin, Germany: Springer.
38. Miyake A., Friedman N. P., Emerson M. J., Witzki A. H., Howerter A. (2000). The unity and diversity of executive functions and their contributions to complex “frontal lobe” tasks: A latent variable analysis. Cognitive Psychology, 41, 49-100.
39. Morey R. D. (2011). A Bayesian hierarchical model for the measurement of working memory capacity. Journal of Mathematical Psychology, 55, 8-24.
40. Muthén B., Christoffersson A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46, 407-419.
41. Muthén B., Lehman J. (1985). Multiple group IRT modeling: Applications to item bias analysis. Journal of Educational Statistics, 10, 133-142.
42. Muthén L. K., Muthén B. O. (1998-2011). Mplus user’s guide (6th ed.). Los Angeles, CA: Muthén & Muthén.
43. Pashler H. (1988). Familiarity and visual change detection. Perception & Psychophysics, 44, 369-378.
44. Patz R. J., Junker B. W., Lerch F. J., Huguenard B. R. (1996). Analyzing small psychological experiments with item response models (CMU Statistics Department Technical Report No. 644). Retrieved from http://www.stat.cmu.edu/~brian/bjtrs.html
45. Peterson W. W., Birdsall T. G., Fox W. C. (1954). The theory of signal detectability. Transactions of the IRE Professional Group on Information Theory, 4(4). Retrieved from https://ieeexplore.ieee.org/document/1057460/authors#authors
46. R Core Team. (2019). R: A language and environment for statistical computing [Computer software]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
47. Rasch G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL: University of Chicago Press. (Original work published 1960)
48. Reckase M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-36.
49. Reise S. P., Waller N. G. (2003). How many IRT parameters does it take to model psychopathology items? Psychological Methods, 8, 164-184.
50. Reise S. P., Waller N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.
51. Rizopoulos D. (2006). ltm: An R package for latent variable modelling and item response theory analysis. Journal of Statistical Software, 17, 1-25.
52. Robitzsch A. (2019). sirt: Supplementary item response theory models (R package Version 3.6-21). Retrieved from https://CRAN.R-project.org/package=sirt
53. Robitzsch A., Kiefer T., Wu M. (2019). TAM: Test analysis modules (R package Version 3.3-10). Retrieved from https://CRAN.R-project.org/package=TAM
54. Rouder J. N., Morey R. D., Cowan N., Zwilling C. E., Morey C. C., Pratte M. S. (2008). An assessment of fixed-capacity models of visual working memory. Proceedings of the National Academy of Sciences of the USA, 105, 5975-5979.
55. Rouder J. N., Morey R. D., Morey C. C., Cowan N. (2011). How to measure working memory capacity in the change detection paradigm. Psychonomic Bulletin & Review, 18, 324-330.
56. Rupp A. A., Mislevy R. J. (2007). Cognitive foundations of structured item response theory models. In Leighton J., Gierl M. (Eds.), Cognitive diagnostic assessment in education: Theory and practice (pp. 205-241). Cambridge, England: Cambridge University Press.
57. Segall D. O. (2005). Computerized adaptive testing. In Kempf-Leonard K. (Ed.), Encyclopedia of social measurement (pp. 429-438). San Diego, CA: Academic Press.
58. Stocking M. L., Lord F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
59. Stone C. A. (1992). Recovery of marginal maximum likelihood estimates in the two-parameter logistic response model: An evaluation of MULTILOG. Applied Psychological Measurement, 16, 1-16.
60. Suzuki A., Hoshino T., Shigemasu K. (2006). Measuring individual differences in sensitivities to basic emotions in faces. Cognition, 99, 327-353.
61. Swaminathan H., Hambleton R. K., Rogers H. J. (2007). Assessing the fit of item response theory models. In Rao C. R., Sinharay S. (Eds.), Psychometrics: Vol. 26. Handbook of statistics (pp. 683-718). Amsterdam, Netherlands: Elsevier.
62. Tuerlinckx F., De Boeck P., Lens W. (2002). Measuring needs with the Thematic Apperception Test: A psychometric study. Journal of Personality and Social Psychology, 82, 448-461.
63. van der Linden W. J., Glas C. A. W. (2000). Computerized adaptive testing: Theory and practice. Boston, MA: Kluwer-Nijhoff.
64. van der Linden W. J., Veldkamp B. P. (2004). Constraining item exposure in adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29, 273-291.
65. van Krimpen-Stoop E. M. L. A., Meijer R. R. (1999). Simulating the null distribution of person-fit statistics for conventional and adaptive tests. Applied Psychological Measurement, 23, 327-345.
66. Wainer H., Dorans N. J., Flaugher R., Green B. F., Mislevy R. J. (2000). Computerized adaptive testing: A primer. New York, NY: Routledge.
67. Waller N. G., Feuerstahler L. (2017). Bayesian modal estimation of the four-parameter item response model in real, realistic, and idealized data sets. Multivariate Behavioral Research, 52, 350-370.
68. Waller N. G., Reise S. P. (2010). Measuring psychopathology with non-standard IRT models: Fitting the four-parameter model to the MMPI. In Embretson S., Roberts J. S. (Eds.), Measuring psychological constructs: Advances in model-based approaches (pp. 147-173). Washington, DC: American Psychological Association.
69. Waller N. G., Thompson J., Wenk E. (2000). Black-white differences on the MMPI: Using IRT to separate measurement bias from true group differences on homogeneous and heterogeneous scales. Psychological Methods, 5, 125-146.
70. Warm T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427-450.
71. Weiss D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.
72. Whitely S. E. (1980). Multicomponent latent trait models for ability tests. Psychometrika, 45, 479-494.
73. Wilson M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum.
74. Woods C. M. (2006). Ramsay-curve item response theory (RC-IRT) to detect and correct for nonnormal latent variables. Psychological Methods, 11, 253-270.
75. Woods C. M., Thissen D. (2006). Item response theory with estimation of the latent population distribution using spline-based densities. Psychometrika, 71, 281-301.
76. Yamamoto K., Khorramdel L., von Davier M. (2013). Scaling PIAAC cognitive data. In OECD (Ed.), Technical report of the Survey of Adult Skills (PIAAC) (Ch. 17, pp. 1-33). Paris, France: OECD. Retrieved from https://www.oecd.org/skills/piaac/_Technical%20Report_17OCT13.pdf
77. Yen W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125-145.
