Abstract
The adaptation of experimental cognitive tasks into measures that can be used to quantify neurocognitive outcomes in translational studies and clinical trials has become a key component of the strategy to address psychiatric and neurological disorders. Unfortunately, while most experimental cognitive tests have strong theoretical bases, they can have poor psychometric properties, leaving them vulnerable to measurement challenges that undermine their use in applied settings. Item response theory–based computerized adaptive testing has been proposed as a solution but has been limited in experimental and translational research due to its large sample requirements. We present a generalized latent variable model that, when combined with strong parametric assumptions based on mathematical cognitive models, permits the use of adaptive testing without large samples or the need to precalibrate item parameters. The approach is demonstrated using data from a common measure of working memory—the N-back task—collected across a diverse sample of participants. After evaluating dimensionality and model fit, we conducted a simulation study to compare adaptive versus nonadaptive testing. Computerized adaptive testing either made the task 36% more efficient or score estimates 23% more precise, when compared to nonadaptive testing. This proof-of-concept study demonstrates that latent variable modeling and adaptive testing can be used in experimental cognitive testing even with relatively small samples. Adaptive testing has the potential to improve the impact and replicability of findings from translational studies and clinical trials that use experimental cognitive tasks as outcome measures.
Keywords: computerized adaptive testing, experimental cognitive psychopathology, cognitive psychometrics, cognitive assessment, item response theory, neurocognitive disorders
Item response theory (IRT) and computerized adaptive tests (CATs) are regarded by many as the optimal methodology and technology to support applied measurement in education, clinical practice, and research (Lord, 1980; Thomas, 2011, 2019). However, the large sample requirements of IRT have all but precluded use in many applied settings. We present a methodology designed to bring the benefits of modern psychometric theory and technology to bear on small-sample clinical and cognitive research. In particular, we aim to provide researchers with methods that can improve the reliability and validity of experimental cognitive measures that are increasingly being used to assess cognitive functioning in translational and experimental psychopathology research.
Measurement Challenges in Experimental Psychopathology Research
The experimental cognitive perspective in psychopathology research seeks to understand mental illness through the lens of tests designed to investigate specific, theoretically motivated hypotheses regarding the integrity of basic cognitive functions in impaired populations (Forsyth & Zvolensky, 2001; MacDonald, 2009; MacDonald & Carter, 2002). Within this context, the number of experimental cognitive tests used in mental health research has grown rapidly, leading not only to discoveries but also to challenges. Standardized clinical tests are often the primary outcome measures used in clinical trials. This is because standardization, which requires examiners to administer stimuli from a fixed item pool according to a structured protocol, allows researchers to compare an examinee’s score to normative data, and also because the psychometric properties of scores are known a priori. Yet, standardization implies that these measures cannot be altered for the purpose of testing novel hypotheses. Researchers have argued that standardized tests are ill-suited for basic and even some types of applied science, primarily because the measures tend to lack specificity, often reflecting generalized deficits rather than impairments in circumscribed cognitive abilities and processes (MacDonald & Carter, 2002).
Experimental cognitive tests, on the other hand, are typically unstandardized, due largely to the fact that theories evolve, and thus the life span of a given paradigm—and especially variants of those paradigms—may be short-lived. Examinees’ scores often must be interpreted in toto without reference to normative data and without clear knowledge of, or the ability to improve, their psychometric properties. Moreover, because examinee and item variance are often unwanted sources of variability in experimental studies, tests closely based on experimental methods may poorly discriminate individual differences (Hedge et al., 2018). Measurement error is directly tied to discriminating power, in that as measurement error increases, discrimination decreases. When discriminating power differs between tests and groups, differential deficits can be artificially inflated or attenuated (Chapman & Chapman, 1973; Thomas et al., 2017).
Understanding and controlling the determinants of measurement error leads to an approach for controlling discriminating power, which in turn will lead to more accurate assessments of absolute and differential group differences.
Computerized Adaptive Testing
Modern psychometrics, especially IRT, identifies that measurement error is a function of the match between examinee ability and item difficulty; specifically, measurement precision increases to the extent that item difficulty is closely matched to ability (Lord, 1980).1 This is because items that are too hard or too easy produce little systematic variation in observed test scores (Lord, 1980). In extreme cases, tests may show “floor” or “ceiling” effects, where all examinees within a particular range of the ability distribution receive the same score (Haynes et al., 2011). Moreover, because reliability is a function of measurement error, and because effect size is a function of reliability, ability–difficulty mismatch results in poor reliability and weakened effect size. Thus, experimental investigators who use factorial or parametric manipulations to alter a task’s difficulty will also systematically alter the reliability of a task’s measures of individual differences and in clinical applications will alter the size of the between group effect (Thomas et al., 2017).
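To make the ability–difficulty argument concrete, the following is a minimal R sketch (not taken from the article) showing that, for a two-parameter logistic item, Fisher information a²p(1 − p) peaks when item difficulty b matches ability θ, so poorly matched items contribute little measurement precision.

```r
# Illustrative only (not from the article): for a two-parameter logistic item,
# Fisher information a^2 * p * (1 - p) peaks where difficulty b equals ability
# theta, so mismatched items contribute little measurement precision.
item_information <- function(theta, a, b) {
  p <- plogis(a * (theta - b))  # probability of a correct response
  a^2 * p * (1 - p)             # 2PL item information
}

round(item_information(theta = 0, a = 1, b = c(-2, 0, 2)), 3)
# 0.105 0.250 0.105: the easy (b = -2) and hard (b = 2) items are far less
# informative at theta = 0 than the matched item (b = 0)
```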
Measurement challenges facing experimental cognitive psychopathology research can be addressed by ensuring that measurement error is held invariant over tests, populations, and individuals. This can be accomplished through the use of CATs. CATs use algorithms to adaptively choose which and how many items are administered to examinees (van der Linden & Glas, 2010). The process involves administering a flexible number of tailored items until measurement error meets some predetermined minimum value. To accomplish this minimization, CATs depend on a measurement model, precalibrated item parameters with an associated item pool, and the use of software to adaptively administer the test (Scalise & Allen, 2015).
Unfortunately, calibration of item parameters typically requires hundreds or even thousands of participants. As Strauss (2001) noted, IRT’s large sample requirement implies that the methodology “. . . does not seem practical for testing specific, theoretically based hypotheses” (p. 12). Thus, it would seem that the IRT CAT solution to controlling measurement error is not feasible in experimental cognitive psychopathology research.
Generalized Latent Variable Model
We propose a novel solution to this problem based on the generalized latent variable model (GLVM; Skrondal & Rabe-Hesketh, 2004). The GLVM is a unifying framework for many related statistical approaches used in psychology including linear mixed-effects models, factor analysis, structural equation modeling, and IRT. The flexible nature of this modeling approach, along with strong parametric assumptions, permits the use of adaptive testing with only limited data.
The primary challenge to using IRT and adaptive testing in small-sample experimental research is the inability to accurately estimate item parameters. However, experimental cognitive testing differs from more typical assessment contexts in that items are purposely designed to be homogeneous on dimensions that are not related to the manipulated experimental factor(s). That is, both within and between experimental conditions, items are expected to be highly similar. Examples include memory tasks that use letters, single-syllable pseudowords, or nondescript geometric shapes (e.g., Nelson et al., 2003), sustained attention tasks that use letters or simple visual displays (e.g., Barch et al., 2001), and learning tasks that require the examinee to form abstract associations between shapes and colors (e.g., Kurtz et al., 2004). For these types of tasks, difficulty is not thought to be greatly affected by the inherent properties of the stimuli, but instead by some other manipulated factor (e.g., number of items to be remembered, task length, or feedback). A counterexample is a proverb task (e.g., Delis et al., 2001), where the subject must interpret the abstract meaning of proverbial statements. In such tasks, the items often vary dramatically in difficulty, but differences in difficulty are not explained by an experimentally manipulated factor.
Additionally, experimental cognitive researchers typically derive performance measures using simple unweighted sums of item accuracy scores, or even predetermined measurement weights. Experimental hypotheses are then evaluated at the level of summary scores across experimental conditions using experimental contrasts. We take a similar strategy here but framed within the context of the GLVM. Specifically, we present a hierarchical model with a minimum of three levels defined by variation over items (Level 1), variation over experimental conditions (Level 2), and variation over examinees (Level 3). This structure is shown in Figure 1. If desired, a fourth level (not shown in the figure) can be added to capture variation over and between groups.
Figure 1.
Hierarchical model depicting three nested levels of variability that are relevant to experimental cognitive testing.
Note. Levels are defined by variation over items (Level 1), variation over experimental conditions (Level 2), and variation over examinees (Level 3). Individual item responses are for the ith item within the jth condition for the kth examinee. Examinee parameters vary over all individual examinees and condition parameters vary over all conditions within each examinee. Examinee effects are defined with regard to the experimental manipulation of interest. Thus, the modeling approach separates measures of individual differences among examinees into components that are explained by the experimental manipulation (ω) and components that are not (ζ). The notation y* reflects that items may be related to the linear model via a link function, thereby permitting item responses that do not have a normal distribution (e.g., dichotomous accuracy scores). Subscript notation in the figure is used to denote variation, not matrix dimensions.
To begin, we assume multiple, continuous underlying latent response variables (y*) that are related to the item parameters and examinee variables by the equation
y* = ηΛ′ + ν + ε,    (1)
where η is a K-by-JM matrix of examinee abilities or traits, Λ is an IJ-by-JM matrix of item structure parameters, ν is a K-by-IJ matrix of item intercept parameters, and ε is a K-by-IJ matrix of residuals. Λ′ is the transpose of Λ (similar transpose notations will be used throughout the text for matrix terms). I is the number of items per experimental condition, J is the number of experimental conditions, M is the number of latent ability (or trait) dimensions, and K is the number of examinees. Level 1 residuals ε are assumed to be normally distributed, with a vector of means all equal to 0 and a covariance matrix assumed to be diagonal; for the dichotomous response model, if the diagonal values are fixed to one, the parameterization is consistent with a probit-link model within the generalized linear model framework, and if these are fixed to 1.702, the parameterization is consistent with a logit-link model. Threshold parameters are added to the model to accommodate the observed item responses. Specifically, the observed item response (e.g., accuracy), y, is related to y* by an item threshold parameter (τ), such that y = 1 if y* ≥ τ and y = 0 otherwise.
However, the intercepts and thresholds are not jointly identified, and thus, as is common, τ is fixed to zero for identification. The ν parameters are interpreted as item easiness. The Λ parameters function to allow certain items to be more or less discriminating of individual differences in the latent examinee variables. The latent η variables measure Level 2 examinee effects of the experimental manipulation.
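As a minimal illustration of Equation 1 and the threshold step for dichotomous responses, the following R sketch simulates latent responses and dichotomizes them at τ = 0 under probit scaling; the dimensions and values are illustrative, not taken from the article.

```r
# Minimal simulation sketch of Equation 1 with a probit link and threshold
# tau = 0; all dimensions and values are illustrative.
set.seed(1)
K <- 50; IJ <- 20; JM <- 1                                # examinees, items, dimensions
eta    <- matrix(rnorm(K * JM), K, JM)                    # examinee abilities (K x JM)
Lambda <- matrix(1, IJ, JM)                               # item structure weights (IJ x JM)
nu     <- matrix(rnorm(IJ, 0, 0.5), K, IJ, byrow = TRUE)  # item intercepts (easiness), constant over examinees
eps    <- matrix(rnorm(K * IJ), K, IJ)                    # residuals with variance 1 (probit scaling)

y_star <- eta %*% t(Lambda) + nu + eps                    # latent item responses (K x IJ)
y      <- ifelse(y_star >= 0, 1, 0)                       # observed accuracy, dichotomized at tau = 0
```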
Equation 1 is equivalent to the multidimensional extension of the two-parameter IRT model (Reckase, 2009), but using different notation. The Λ matrix contains IRT discrimination parameters and the ν parameters are negatively related to item difficulty. The two-parameter IRT model is limited in experimental testing contexts in that it only considers variation due to differences in item properties and examinee ability. It does not consider variance due to experimental manipulations of task parameters. Thus, to accommodate this added source of variance, we further define η as
η = ωΓ′ + ηβ′ + γ + ζ,    (2)
where ω is a K-by-MN matrix of Level 3 examinee effects of the experimental manipulation, Γ is a JM-by-MN matrix of experimental structure parameters, β is a JM-by-JM matrix of cross-condition regressors, γ is a K-by-JM matrix of intercepts, and ζ is a K-by-JM matrix of Level 2 residual abilities that are not explained by the general experimental contrasts. N is the number of hypothesized experimental contrasts. Note that the model expressed by Equation 2 allows cross-condition dependencies between the elements of η via β; however, here, as in many experimental testing applications, we assume that β is a null matrix (i.e., that there are no cross-condition causal effects), and thus the product term between β and η is dropped from the model. Moreover, if we combine Equations 1 and 2, multiply through, and fix all of the elements of γ to zero (for identification), this results in the simplified model
y* = ωΓ′Λ′ + ζΛ′ + ν + ε.    (3)
Table 1 provides notation, definitions, and interpretations of all terms in Equation 3.
Table 1.
Summary of Notation, Definition, and Interpretation for Model Variables and Parameters.
| Symbol | Name | Dimensions | Definition | Common interpretation | Vary over |
|---|---|---|---|---|---|
| y* | y star | K-by-IJ | Latent item responses | Item accuracy | Items, conditions, and examinees |
| ω | omega | K-by-MN | Latent examinee variables (Level 3) | Examinee ability or trait explained by the experimental contrast | Examinees |
| Γ | gamma | JM-by-MN | Experimental structure parameters | Experimental contrasts | Conditions |
| Λ | lambda | IJ-by-JM | Item structure parameters | Measurement weights | Items |
| ζ | zeta | K-by-JM | Latent examinee variables (Level 2) | Examinee ability or trait not explained by the experimental contrast | Conditions within examinees |
| ν | nu | K-by-IJ | Item intercept parameters | Item easiness | Items |
| ε | epsilon | K-by-IJ | Residuals | Independent prediction errors | Items, conditions, and examinees |
Note. y* = latent item response that is related to the observed response (y) via a link function. I = number of items per experimental condition; J = number of experimental conditions; K = number of examinees; M = number of latent ability (or trait) dimensions; and N = number of hypothesized experimental contrasts.
The key difference between Equation 3 and the two-parameter multidimensional IRT model is that the ability (or trait) dimensions are split between a general experimental contrast level (Level 3) and a specific condition level (Level 2); that is, measures of individual differences that are explained by the hypothesized effects of the experimental manipulation (ω) and components that are not (ζ). The approach is conceptually similar to the IRT bifactor model in that it distinguishes between general and specific levels of ability (Gibbons & Hedeker, 1992). That is, Level 3, or ω, contains the general factors and Level 2, or ζ, contains the specific factors. In experimental cognitive testing, researchers are generally interested in the effects of the experimental manipulation (Level 3 or general effects), and not all sources of individual differences per se. This approach is also similar to explanatory IRT models (de Boeck & Wilson, 2004), except that in our approach the examinee (not item) parameters are explained by the task manipulation.
Interpretations of the latent ω and ζ variables are governed by Γ and Λ, which are defined a priori by the researcher. The elements of Γ are codes representing the experimental contrasts of interest. These codes can take on any scheme desired (e.g., “dummy” coding, orthogonal polynomial contrasts, Helmert contrasts, etc.). Thus, the latent ω variables—which are related to item responses as a function of the matrix product of Λ and Γ—reflect both the hypothesized measurement structure and the hypothesized experimental structure of a task. The latent ζ variables, on the other hand, are related to response accuracy only by Λ (i.e., they are unrelated to the hypothesized experimental structure). Most elements of Λ are coded zero, and only elements relating items to condition-specific ζ variables are allowed to take nonzero values.
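For instance, a hypothetical Γ for a single ability dimension measured over five load conditions, with intercept and linear contrasts, could be built in R as follows (illustrative codes only; any coding scheme can be substituted):

```r
# Illustrative construction of Gamma for one ability dimension measured over
# J = 5 load conditions with N = 2 contrasts (intercept and linear load effect).
J <- 5
Gamma <- cbind(intercept = rep(1, J),
               linear    = (1:J) - mean(1:J))  # centered linear codes: -2, -1, 0, 1, 2
Gamma
# contr.poly(J) would instead give orthonormal polynomial codes (linear,
# quadratic, etc.) that serve the same purpose on a different scale.
```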
This GLVM framework eliminates the need for calibration of the item parameters (ν and Λ), in limited contexts, by treating item parameters as random effects or by fixing values based on theory. Specifically, the item intercept parameters (ν) do not need to be explicitly estimated, but rather can be treated as random effects at Level 1. The item structure parameters (Λ), on the other hand, are fixed based on strong parametric assumptions (see example below). This leaves only the general examinee effects (ω) and the specific examinee effects (ζ) to be estimated. The approach leads to a substantial reduction in the number of variables and parameters estimated in comparison to other psychometric methods (e.g., IRT), and thus permits latent variable modeling and adaptive testing using small samples.
Adaptive Selection of Task Conditions
Adaptive testing in experimental cognitive research will differ from the more typical scenario in that the goal is not to choose optimal items, but rather optimal conditions—levels of the experimental factor(s)—that maximize precision. Moreover, optimized selection of factor levels depends almost entirely on the latent examinee variables, ω and ζ, rather than on the item parameters. That is, computerized adaptive adjustments are tied to the examinee’s response to the experimental manipulation. For example, in a memory task, the effect of increased memory load can be captured by using orthogonal polynomial contrasts in Γ that would include constant and linear effects. The subject ability parameters contained in ω can then be interpreted as the intercept and slope of the memory load effect. Thus, if an examinee performs increasingly worse as the number of items to be remembered increases, the linear slope parameter in ω will be negative and will inform the algorithm of the examinee’s likely performance at higher or lower levels of memory load. These predictions will then inform the optimal choice of task load conditions in future testing.
Typically, ω for a single examinee will contain one or more intentional parameters (i.e., the parameters for which task selection is intentionally optimized) and the ζ vector will contain nuisance parameters (i.e., parameters that are of no interest to the researcher; cf. Mulder & van der Linden, 2009). See the appendix for further details.
Example Application
As noted above, the GLVM framework proposed requires strong parametric assumptions. In particular, the parameters of the item structure matrix (Λ) are assumed to be known. A simple choice is to fix all nonzero elements of Λ to 1.0. This approach is consistent with Rasch-type models. Fixing all nonzero elements of Λ to 1.0 assumes that all items are equally related to the measured ability. An alternative option is to fix nonzero elements of Λ to weighted values that are based on a priori theory. In particular, within the context of cognitive testing, mathematical cognitive models serve as a useful reference.
A prominent example is the equal variance model from signal detection theory (SDT; Wickens, 2002). SDT is commonly used to score data produced by many types of recognition memory (Kane et al., 2007; Ragland et al., 2002), continuous performance (Conners, 1994), and emotion recognition (Gur et al., 2010) tests, among other paradigms. Thus, psychometric refinements and adaptations of this measurement model have the potential for high impact. The equal variance SDT model assumes that the presentation of repeated or old items—targets—and nonrepeated or new items—foils—during the recognition period of testing evokes familiarity that can be represented by underlying probability distributions. Target and foil items are assumed to follow unimodal, symmetric distributions of familiarity with equal variances but different means. A larger distance between the mean of the target distribution and the mean of the foil distribution implies greater familiarity, and thus a higher probability of accurate responding. This distance is a measure of discriminability known as d′. Familiarity drives recognition; however, because familiarity follows a probability distribution, and because the familiarity distributions of targets and foils often overlap, the SDT model assumes that examinees must establish a criterion, C, representing the level of familiarity beyond which they will classify test items as targets. The criterion centered relative to the midpoint between the foil and target distributions is known as Cc (for C centered). It is interpreted as conservative bias (i.e., a tendency to respond “no”). These parameters are depicted in Figure 2.
Figure 2.
Equal variance, signal detection theory model.
Note. μT = mean of the distribution of familiarity for targets; μF = mean of the distribution of familiarity for foils; d′ = μT minus μF (discrimination); C = criterion; Cc = value of the criterion relative the midpoint between μT and μF (bias).
Previous work has shown how the equal variance SDT model can be expressed as a generalized linear model (DeCarlo, 1998) and, therefore, also as an IRT model (Thomas et al., 2018). Specifically, it has been demonstrated that the log-odds of a correct response can be expressed as −Cc+d′/2 when the item presented is a target, and Cc+d′/2 when the item presented is a foil. This equivalence of certain SDT and IRT models leads to important simplifications: the dimensionality of the measurement model and the values of the Λ matrix are defined a priori. Specifically, the SDT model assumes two dimensions (i.e., d′ and Cc) and requires us to fix the nonzero elements of Λ to 0.5 for the relation between all items and d′ and 1.0 for the relation between foil items and Cc, but −1.0 for the relation between target items and Cc (Thomas et al., 2018).2
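The following sketch expresses these log-odds directly in R under the logit parameterization described above; the parameter values are illustrative.

```r
# Logistic SDT sketch: the log-odds of a correct response is d'/2 - Cc for
# targets and d'/2 + Cc for foils (parameter values are illustrative).
p_correct <- function(d_prime, Cc, is_target) {
  logit <- 0.5 * d_prime + ifelse(is_target, -1, 1) * Cc
  plogis(logit)
}

p_correct(d_prime = 2, Cc = 0.5, is_target = c(TRUE, FALSE))
# approximately 0.62 (hit rate for targets) and 0.82 (correct-rejection rate
# for foils): a conservative criterion lowers hits and raises correct rejections
```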
We demonstrate this approach using the N-back task. The N-back, a measure of recognition and working memory, is one of the most commonly used paradigms in cognitive and clinical neuroscience. During an N-back, examinees are asked to monitor a continuous stream of stimuli and respond each time an item is repeated from N before. The N-back level represents the experimental manipulation of working memory load, with investigators typically administering 1-back, 2-back, and/or 3-back conditions. That is, 1-, 2-, and 3-back all represent specific experimental conditions. The expectation is that as the researcher increases the number of items that must be maintained in working memory (i.e., moving from 1 item, to 2 items, to 3 items), the task becomes systematically more difficult. Stimuli vary between studies and tasks but are often highly homogeneous within tasks. Examples include letters, digits, words, shapes, pictures, faces, locations, auditory tones, and even odors (Owen et al., 2005). In each case, while the modality of memory might change, it is fundamentally assumed that task difficulty is not driven by variation in the properties of the individual stimuli themselves, but rather by the experimental manipulation of N-back memory load.
The N-back task is often administered as a forced choice paradigm. That is, the subject is forced to indicate whether each item is, or is not, repeated from N items back. Typically, the participant presses a button to indicate that the item is repeated, or otherwise does not respond. Given the forced choice accuracy format of the N-back, data are often scored using SDT. To demonstrate how the SDT model can be applied to N-back data within the GLVM framework proposed, we present an expanded version of Equation 3 in Figure 3 assuming (for the sake of demonstration) that a test comprising 1 foil item and 1 target item is administered over 1- and 2-back load conditions (i.e., a toy model). The latent response variables (y*) are regressed onto the latent subject variables representing the subject-level effects of the experimental manipulation (ω) and the residual, condition-level prediction errors (ζ). Γ contains the experimental structure parameters that represent the hypotheses of the investigator (i.e., the experimental contrasts). Here, we use orthogonal polynomial contrasts for the two levels of load. In this quantitative coding scheme, the intercept parameters for both d′ and Cc (i.e., Id′ and ICc) are multiplied by values of 1.0 in Γ and the slope parameters for both d′ and Cc (i.e., Sd′ and SCc) are multiplied by values of −1.0 at 1-back and 1.0 at 2-back in Γ (i.e., predetermined codes for a linear effect of a variable with two levels within the orthogonal polynomial coding scheme; Cohen et al., 2003). In the figure, ω terms have the prefix “I” or “S” to indicate intercept versus slope and ζ terms have the prefix “r” to indicate residual or specific effects (i.e., specific to each N-back load condition). The intercept component is akin to average ability (or average bias) and the slope component is the examinee’s response to the experimental manipulation. The elements of Λ are not estimated, but instead are fixed based on assumptions from the SDT model. Specifically, d′ parameters are weighted by 0.5 for all items and Cc parameters are weighted by −1.0 for target items but 1.0 for foil items. As noted above, the nonzero elements of the covariance matrix for ε are fixed based on the desire to have scaling that is consistent with either probit- or logit-link models. Thus, the only parameters that must be estimated are contained in ω, ζ, and ν (although ν can be treated as a random effect).
Figure 3.
Example of the signal detection theory model applied to N-back data within the generalized latent variable model framework proposed
Note. This toy model is for a single examinee and a test comprising 1 foil item and 1 target item administered over 1- and 2-back load conditions (i.e., 1B, F = 1-back foil; 1B, T = 1-back target; 2B, F = 2-back foil; 2B, T = 2-back target). The latent response variables (y*) are regressed onto the latent subject variables representing the subject-level effects of the experimental manipulation (ω) and the residual, condition-level prediction errors (ζ). Γ contains the experimental structure parameters that represent the hypotheses of the investigator. Here, we use orthogonal polynomial contrasts for the two levels of load. The intercept parameters for both d′ and Cc (Id′ and ICc) are multiplied by 1.0 and the slope parameters for both d′ and Cc (Sd′ and SCc) are multiplied by −1.0 at 1-back and 1.0 at 2-back. The elements of Λ are fixed based on assumptions from the signal detection theory model. Specifically, d′ parameters are weighted by 0.5 for all items and Cc parameters are weighted by −1.0 for target items but 1.0 for foil items.
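As a concrete sketch of the toy model in Figure 3, the fixed Λ and Γ matrices can be written out and combined with illustrative ω and ζ values to produce predicted accuracies; the row and column orderings below are one possible convention, not taken from the article.

```r
# Concrete sketch of the Figure 3 toy model (1 foil + 1 target item at 1- and
# 2-back). Parameter values and row/column orderings are illustrative.

# Item structure (Lambda, IJ x JM): rows = 1B foil, 1B target, 2B foil, 2B target;
# columns = d' and Cc at 1-back, then d' and Cc at 2-back.
Lambda <- rbind(c(0.5,  1, 0.0,  0),
                c(0.5, -1, 0.0,  0),
                c(0.0,  0, 0.5,  1),
                c(0.0,  0, 0.5, -1))

# Experimental structure (Gamma, JM x MN): columns = Id', Sd', ICc, SCc;
# linear load codes are -1 at 1-back and +1 at 2-back.
Gamma <- rbind(c(1, -1, 0,  0),   # d' at 1-back
               c(0,  0, 1, -1),   # Cc at 1-back
               c(1,  1, 0,  0),   # d' at 2-back
               c(0,  0, 1,  1))   # Cc at 2-back

omega <- matrix(c(2, -0.5, 0.5, 0.1), nrow = 1)  # Id', Sd', ICc, SCc
zeta  <- matrix(0, nrow = 1, ncol = 4)           # condition-level residual abilities
nu    <- matrix(0, nrow = 1, ncol = 4)           # item intercepts (easiness)

y_star <- omega %*% t(Gamma) %*% t(Lambda) + zeta %*% t(Lambda) + nu
round(plogis(y_star), 2)  # predicted accuracies under logit scaling
# 0.84 (1B foil), 0.70 (1B target), 0.79 (2B foil), 0.54 (2B target):
# accuracy drops from 1-back to 2-back because the load slope on d' is negative
```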
Aims and Hypotheses
Analyses aimed to examine whether adaptive testing can improve the efficiency and precision of measures used in experimental cognitive psychopathology research. We focused on the N-back task as a prototypical paradigm. Our first aim was to determine whether the proposed GLVM adequately fits N-back data. We hypothesized that the dimensionality of scores would be consistent with a two-dimensional measurement model (i.e., the SDT d′ and Cc parameters) and that the model would adequately explain systematic variance in the data. Our second aim was to determine whether adaptive testing improves efficiency and precision when compared to nonadaptive testing. We hypothesized that adaptive testing would require the administration of fewer items and task conditions to achieve acceptable precision when compared with nonadaptive testing. We further hypothesized that adaptive testing would produce smaller estimation errors. Our third aim was to evaluate the comparability of latent examinee variable estimates produced by adaptive versus nonadaptive testing in terms of their correlations with relevant examinee, cognitive, and neuroimaging variables.
Method
Data were collected from 92 research participants drawn from three populations across two separate studies. We chose to aggregate data across studies in order to maximize sample size and to demonstrate the flexibility of the modeling approach. The first study was a pilot investigation of the N-back task within a population of undergraduate research participants (UNs; n = 38). The second study was designed specifically to compare outpatients diagnosed with schizophrenia or schizoaffective disorder (SZs; n = 28) to matched healthy comparison participants (HCs; n = 26) on the N-back task under investigation. Psychiatric diagnoses, or lack thereof, for patients and controls were verified using structured clinical interviews. Both studies were approved by a university institutional review board. Informed consent was obtained from all participants.
Measures
N-Back Task
Items were generated by creating words from all consonant–vowel–consonant combinations of letters in the English alphabet. We then eliminated offensive words, names, and common abbreviations. Next, we used the online MCWord database (Medler & Binder, 2005) to collect word frequency and orthographic neighborhood statistics. We eliminated all words with a frequency of zero, and then selected words in the top 20% of the orthographic neighborhood statistic. We then created 15 pseudoword lists, or blocks, that would allow us to administer 1- through 5-back conditions up to three times without repeating items. Blocks consisted of 20 pseudowords with four targets (items repeated from N-back; 20%), four lures (items repeated, but not from N-back; 20%), and 12 foils (nonrepeated items; 60%). Item order and placement within the lists were randomized.
Three unique runs of five blocks each were administered in a counterbalanced order to SZs and HCs using PsychoPy (Peirce, 2007) and to UNs using an online application programmed using HTML5 and JavaScript and administered using the Firefox web browser. SZs and HCs were administered the PsychoPy-based version of the task due to the need for precise stimulus timing during neuroimaging (see below). Pseudowords were presented in white font on a black background for 2,500 milliseconds with a 500-millisecond inter-word interval. The timing was constrained so that each block would last exactly 60 seconds. Blocks were separated by 20-second inter-run intervals (with a fixation cross), including one before the first block.
Clinical Symptoms and Medication
Symptoms were assessed using the Scale for the Assessment of Negative Symptoms (SANS; Andreasen, 1984a) and the Scale for the Assessment of Positive Symptoms (SAPS; Andreasen, 1984b). Negative and positive symptoms were operationalized as summed SANS and SAPS global rating scores, respectively. Chlorpromazine equivalent doses (CPZs) were calculated (Woods, 2003) using data on patients’ prescribed medication.
Cognitive Functioning
Additional cognitive measures included the MATRICS (Measurement and Treatment Research to Improve Cognition in Schizophrenia) Consensus Cognitive Battery (MCCB; Nuechterlein et al., 2008) Letter–Number Span test, a measure of working memory, and the Wide Range Achievement Test (WRAT) Reading subtest (Snelbaker et al., 2001).
Neuroimaging Data
SZs and HCs underwent functional magnetic resonance imaging (fMRI; see Buxton, 2009) while completing the N-back task. Detailed methods and results from the neuroimaging experiment are presented in separate work (Thomas & Brown, 2018). Here we present limited results for the purposes of comparing scores.
Analyses
Aim 1: Model Fit
We conducted a parallel analysis with principal axis factoring using the R “psych” package (Revelle, 2011) to test the assumption of a two-dimensional measurement model. In a parallel analysis, dimensionality is inferred from the number of eigenvalues retrieved from the observed data that are larger than corresponding eigenvalues retrieved from simulated data. We focused on summed item scores (i.e., total number correct for targets, lures, and foils within each of the five N-back conditions), instead of item scores themselves. This is because the large number of items, combined with the observation that many items had perfect or near-perfect accuracies, made it impractical to calculate the tetrachoric correlation matrix.
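A minimal sketch of such a parallel analysis, assuming a hypothetical data frame `nback_sums` of summed scores, might look as follows:

```r
# Sketch of the parallel analysis, assuming a hypothetical data frame
# `nback_sums` holding summed scores for targets, lures, and foils within each
# of the five load conditions (column names are not from the article).
library(psych)
fa.parallel(nback_sums, fm = "pa", fa = "fa")
# the number of observed eigenvalues exceeding the simulated eigenvalues
# suggests the number of factors to retain
```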
Next, we fitted a series of GLVM models to the data using the R “lme4” package (Bates et al., 2014) in order to test the assumption that item residuals would be uncorrelated conditional on the model estimates. These models defined fixed effects as experimental contrasts fitted at the group level (i.e., Level 4 per the notation above) and random effects as experimental condition residuals fitted at the examinee-nested-within-condition level (i.e., Level 2 per the notation above). We fitted models that included an intercept and up to four orthogonal polynomial contrasts for the fixed effects.3 This allowed us to compare the fit of models of varying complexity through a quartic term using Akaike information criterion (AIC) values, Bayesian information criterion (BIC) values, and likelihood ratio tests. AIC and BIC both decrease with parsimonious fit, and a significant likelihood ratio test indicates that the more constrained model significantly worsens fit relative to the less constrained model. After determining the best-fitting model, we calculated residual correlations. Specifically, we estimated the latent examinee variables (i.e., the random effects within the context of mixed-effects models), used these to predict item accuracy scores, and then calculated the difference (residual) between the predicted and observed scores. Smaller absolute residual correlations imply better model fit (Christensen et al., 2017; de Ayala, 2009). Parametric bootstrapping was used to further evaluate the residuals. Specifically, over 500 iterations, we simulated data using the model-based parameter and latent variable estimates, refitted the model to the simulated data, and then estimated residuals. We then determined the proportion of simulations in which residual correlations based on the observed data exceeded residual correlations based on the simulated data. Finally, we calculated the number of residual correlation p values for items that fell in the lower 2.5% or the upper 97.5% of the simulated distributions.
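The exact specification is more involved than can be shown here, but a simplified lme4 sketch of this model comparison, using a hypothetical trial-level data frame and the SDT-as-regression coding described above (a constant 0.5 regressor whose coefficient acts as d′ and a ±1 regressor whose coefficient acts as Cc), might look as follows:

```r
# Simplified sketch of the mixed-effects comparison; the data frame and column
# names are hypothetical. `half` is a constant 0.5 regressor (coefficient plays
# the role of d'), `w` is +1 for foils and -1 for targets (coefficient plays the
# role of Cc), `load_lin`/`load_quad` are orthogonal polynomial load codes, and
# `load_f` is the load condition as a factor. Random effects capture condition
# residuals nested within examinees.
library(lme4)

m_linear <- glmer(
  correct ~ 0 + half + w + half:load_lin + w:load_lin +
    (0 + half + w | subject:load_f),
  data = nback_trials, family = binomial(link = "probit")
)
m_quadratic <- update(m_linear, . ~ . + half:load_quad + w:load_quad)

AIC(m_linear, m_quadratic)
BIC(m_linear, m_quadratic)
anova(m_linear, m_quadratic)  # likelihood ratio test
```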
Aim 2: Adaptive Testing Simulation
In a simulation, we compared adaptive versus nonadaptive administration of N-back load conditions using complete data provided by HCs and SZs (N = 46). More specifically, using participants’ actual item responses, we simulated new testing sequences based on either adaptive or nonadaptive selection of N-back load conditions, over multiple iterations, until an optimization criterion was met. We compared two optimization criteria. In the efficiency condition, testing continued until the standard error of the estimated d′ intercept variable met a minimum value corresponding to an approximate reliability of .70 or until the total iteration count exceeded 9 (this number was chosen both because it allows flexibility in the adaptive selection of N-back runs and also because 9 [simulated] minutes of testing on a single task would seem to be a reasonable upper limit within the context of a clinical trial or translational study). In the precision condition, testing always continued for exactly 3 total iterations (i.e., 3 simulated minutes of testing). For adaptive testing, the administered N-back load at each iteration was selected to optimize measurement of the d′ intercept variable (i.e., the chosen intentional parameter). For nonadaptive testing, the administered N-back load at each iteration followed the pattern 111-222-333 in the efficiency condition, and 1-2-3 in the precision condition. Both are comparable to the staircase pattern of increasing difficulty that is common in neuropsychological assessment. Linear mixed-effects models were used to compare efficiency and precision estimates between testing conditions. Efficiency is defined here by total iterations of simulated N-back runs administered and precision is defined by standard error of the estimated d′ intercept values.
To estimate the latent examinee variables, we used the Metropolis–Hastings Robbins–Monro hybrid (MH-RM) algorithm (Cai, 2010a, 2010b; Chalmers & Flora, 2014). Adaptive selection of N-back load conditions at each iteration was based on the approach outlined by Segall (1996, 2010), which seeks to optimize the determinant of the posterior covariance matrix of the latent examinee variables (i.e., D-optimality). We used a multivariate normal prior for ω with means and variances fixed based on estimates derived from the Aim 1 analysis. All algorithms, as well as the simulation, were programmed in R. Examples of these scripts and data can be found online at GitHub as part of the CogIRT repository (https://github.com/mlthom/CogIRT). Additional details are provided in the Appendix.
Aim 3: Validity Analyses
The d′ intercept variable can be interpreted as working memory ability; that is, the examinee’s working memory discriminability score. Over examinees, the variable captures individual differences in response accuracy corrected for bias. We evaluated the comparability of this variable estimated using adaptive versus nonadaptive testing. Specifically, we correlated estimates with a dummy-coded group variable (SZ = 1, HC = 0), MCCB Letter–Number Span T scores, and brain activation (i.e., blood oxygenation–level dependent [BOLD] response) within a single region-of-interest located within the dorsolateral prefrontal cortex comparing all N-back load conditions to a baseline fixation cross (see Glahn et al., 2005). The Steiger method implemented within the R “cocor” package (Diedenhofen & Musch, 2015) was used to compare dependent correlations for significant differences between testing conditions.
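A sketch of this dependent-correlation comparison using the cocor package is shown below; the correlation between the adaptive and nonadaptive estimates (r.kh) is a placeholder value not reported in the text.

```r
# Sketch of the dependent-correlation comparison (Steiger method) with the
# cocor package. The correlation between the adaptive and nonadaptive score
# estimates (r.kh) is a placeholder; it is not reported in the text.
library(cocor)
cocor.dep.groups.overlap(r.jk = -0.47,  # adaptive d' intercept with group
                         r.jh = -0.33,  # nonadaptive d' intercept with group
                         r.kh =  0.80,  # adaptive with nonadaptive (placeholder)
                         n = 46, test = "steiger1980")
```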
Results
Descriptive Statistics
Demographic and clinical characteristics of the three samples are reported in Table 2. Figure 4 plots accuracy over all N-back load conditions for each group. UNs had the highest overall mean accuracy, followed by HCs, and then SZs. Mean accuracy within conditions followed a curvilinear pattern over N-back load, with 1-back accuracy at 93%, 2-back accuracy at 85%, 3-back accuracy at 79%, 4-back accuracy at 77%, and 5-back accuracy at 79%. Accuracy also varied by item type, with foils being the easiest (94% accuracy), followed by lures (72% accuracy), and targets (60% accuracy).
Table 2.
Demographic and Clinical Characteristics
| Characteristic | Undergraduates (UNs) | Healthy controls (HCs) | Schizophrenia patients (SZs) | p HCs versus SZs |
|---|---|---|---|---|
| Sample size | 38 | 26 | 28 | |
| Age, years | 21.18 (2.55) | 41.04 (9.46) | 43.29 (9.41) | .39 |
| Gender: Male | 9 (24%) | 16 (62%) | 18 (64%) | >.999 |
| Hispanic | 6 (16%) | 5 (19%) | 9 (32%) | .36 |
| Race | — | — | — | .51 |
| American Indian or Alaskan Native | 0 (0%) | 0 (0%) | 1 (4%) | — |
| Asian | 20 (53%) | 4 (15%) | 3 (11%) | — |
| Black or African American | 1 (3%) | 2 (8%) | 6 (21%) | — |
| More than one race | 5 (13%) | 5 (19%) | 3 (11%) | — |
| Unknown or unreported | 4 (11%) | 0 (0%) | 0 (0%) | — |
| White | 8 (21%) | 15 (58%) | 15 (54%) | — |
| Education | 14.11 (1.20) | 15.81 (2.02) | 12.75 (2.12) | <.001 |
| Parents’ education | 15.83 (3.69) | 14.08 (3.58) | 12.32 (3.67) | .08 |
| WRAT reading score | 111.03 (7.91) | 104.62 (10.83) | 95.93 (10.67) | <.001 |
| Chlorpromazine equivalents | — | — | 504.17 (464.82) | — |
| SAPS | — | — | 6.43 (4.12) | — |
| SANS | — | — | 9.21 (4.64) | — |
Note. Means and standard deviations are reported for continuous variables. Counts and percentages are reported for discrete variables. Groups were compared using regression for continuous variables and Fisher’s exact test for categorical variables. Education is in years completed. WRAT = Wide Range Achievement Test; SAPS = Scale for the Assessment of Positive Symptoms reported as total global rating scores; SANS = Scale for the Assessment of Negative Symptoms reported as total global rating scores.
Figure 4.
Proportion correct by group, N-back condition, and item type
Aim 1: Results of Model Fit Analyses
Results of the parallel analysis (Supplemental Figure 1) are consistent with the hypothesis of a two-dimensional measurement model. That is, only the first two factors derived from N-back summary scores produced eigenvalues that are larger than those produced by randomly generated data.
The fit statistics for models with varying degrees of orthogonal polynomial contrasts are reported in Supplemental Table 1. The model with intercept, linear, and quadratic effects of N-back load on d′ and Cc was most parsimonious as indicated by the AIC value, but the model with only intercept and linear effects was most parsimonious as indicated by the BIC value. Given that the pattern of the results (Supplemental Figure 2) suggests that load had curvilinear effects on both d′ and Cc, we selected the former. Parameter estimates (fixed effects) for this model are reported in Table 3. As suggested by the descriptive data, the parameter estimates indicate a decrease in d′ and an increase in Cc over load conditions. The median absolute residual correlation for the model was .09; approximately 10% of these fell in the lower 2.5% or the upper 97.5% of the simulated distributions from the parametric bootstrap. This suggests adequate, but not good, model fit. Follow-up analyses revealed that significant residual correlations were most often produced by foils, and were mostly due to overprediction rather than underprediction. That is, foil responses were “noisier” (less determined by parameter estimates) than target responses.
Table 3.
Group-Level Parameter Estimates for the Quadratic Contrast Model.
| Model Parameter | Estimate | SE |
|---|---|---|
| Intercept effect on d′ | 1.63 | 0.17 |
| Linear effect of load on d′ | −0.65 | 0.08 |
| Quadratic effect of load on d′ | 0.28 | 0.06 |
| Intercept effect of load on Cc | 0.68 | 0.08 |
| Linear effect of load on Cc | 0.14 | 0.04 |
| Quadratic effect of load on Cc | −0.05 | 0.03 |
Note. d′ = memory discriminability parameter from signal detection theory model; Cc = conservative bias parameter from signal detection theory model. Note that the effects have been coded so that the intercept is at N-back = 3.
Aim 2: Results of Adaptive Testing Simulation
Supplemental Table 2 (available online) reports the mean number of iterations (runs) at each N-back load administered within simulated adaptive versus nonadaptive testing. As required, nonadaptive testing for the precision condition always administered one 1-back run, one 2-back run, and one 3-back run. Similarly, in the nonadaptive testing for efficiency condition, only 1-, 2-, and 3-back runs were administered; however, the exact proportion of each depended on the number of iterations required to meet the minimum standard error criterion. In both adaptive testing conditions, N-back load was chosen by the adaptive algorithm. Notably, the algorithm chose high numbers of 3- and 4-back runs, fewer 2- and 5-back runs, and no 1-back runs. Results comparing the efficiency and precision of adaptive versus nonadaptive testing are reported in Table 4. Adaptive testing was significantly more efficient than nonadaptive testing, providing a savings of 2.76 iterations on average. Adaptive testing also produced significantly more precise estimates of the latent d′ intercept variable. The reduction in standard error was approximately 0.13. In other words, adaptive testing either made the N-back task 36% more efficient, or score estimates 23% more precise, depending on the optimization criterion chosen.
Table 4.
Adaptive Testing Results
| Condition | Adaptive | Nonadaptive | Difference | p |
|---|---|---|---|---|
| Iterations-efficiency condition | 4.96 | 7.72 | −2.76 | <.001 |
| Standard error-precision condition | 0.45 | 0.58 | −0.13 | <.001 |
Aim 3: Results of Validity Analyses
Results of validity analyses are reported in Table 5. We focused on estimates of the d′ intercept variable produced within the precision condition because these are based on a fixed number of N-back runs. None of the correlations were significantly different between adaptive and nonadaptive testing. However, adaptive testing produced descriptively larger correlations between the d′ intercept variable and the dummy coded group variable (i.e., the patient deficit) as well as brain activation within the dorsolateral prefrontal cortex. Conversely, nonadaptive testing produced a descriptively larger correlation between the d′ intercept variable and MCCB Letter–Number Span T scores.
Table 5.
Validity Correlations
| Validity Variable | Adaptive | Nonadaptive | Difference in magnitude | p |
|---|---|---|---|---|
| Group (SZ) | −0.47 | −0.33 | −0.14 | .097 |
| MCCB Letter–Number Span | 0.61 | 0.64 | −0.03 | .648 |
| BOLD Response in DLPFC | 0.24 | 0.14 | 0.10 | .311 |
Note. MCCB = MATRICS (Measurement and Treatment Research to Improve Cognition in Schizophrenia) Consensus Cognitive Battery; BOLD = blood oxygenation–level dependent; DLPFC = dorsolateral prefrontal cortex.
Discussion
Our goal was to present and to evaluate an approach that permits latent variable modeling and the use of computerized adaptive testing in experimental cognitive psychopathology research without large samples or the need to precalibrate item parameters. We have presented a modeling framework based on the GLVM and provided equations, notation, definitions, and interpretations of relevant variables and parameters (Equation 1 and Table 1). We also provided an example of the approach as applied to working memory assessment in schizophrenia using the N-back task. Results confirmed our hypotheses that the model would adequately fit the data and that adaptive testing would improve both the efficiency and precision of measurement. Adaptive testing also produced larger, though not significantly different, group differences in ability, as well as correlations between ability and brain activation (BOLD response) within an fMRI experiment. In contrast, adaptive testing produced smaller, though not significantly different, correlations between ability and a separate measure of the same construct (MCCB Letter–Number Span).
We evaluated model fit by examining the dimensionality of scores and the magnitudes of residual correlations. A parallel analysis confirmed our hypothesis of a two-dimensional measurement model for the N-back task. Specifically, we assumed two latent examinee dimensions that could be interpreted as memory discriminability (d′) and conservative bias (Cc) within the context of an equal variance SDT model. Although the parallel analysis could not define the meaning of these dimensions, results from linear mixed-effects model analyses provided further justification for our interpretation. Specifically, we compared models with varying degrees of orthogonal polynomial contrasts, concluding ultimately that the best fitting model included both linear and quadratic effects of N-back load on both d′ and Cc. We then calculated residual correlations based on this model. The number of significant residual correlations was higher than expected, but not excessive (i.e., 10% instead of the nominal Type I error rate of 5%). Misfit appeared to be primarily driven by the over-prediction of foil responses. That is, foil responses may be “noisier”—more likely to be influenced by nonsystematic factors—than target and lure responses. A possible solution is to explore use of the unequal variance signal detection model (DeCarlo, 2010), which allows discrimination parameters to vary between targets and foils.
We next compared adaptive versus nonadaptive testing using two different optimization criteria. For the efficiency condition, the goal was to produce precise estimates of the d′ intercept parameter using as few iterations as possible (i.e., few N-back runs). We found that adaptive testing improved efficiency by 36%, resulting, practically, in an average (simulated) savings of 2.76 minutes of testing. This savings, while not as robust as those observed in more typical adaptive testing applications (e.g., Gibbons et al., 2012), could nonetheless be highly valuable in studies of clinical populations. For the precision condition, the goal was to produce optimally precise estimates of the d′ intercept parameter given a fixed number of iterations. We found that adaptive testing reduced standard error by 23%. Removing this error variability from the data is expected to produce a substantial improvement in the reliability of scores and therefore also improve effect size and statistical power.
We correlated estimates of the d′ intercept variable—an index of working memory ability—with three types of variables that might be used to compare and validate scores: a dummy coded group variable, a convergent measure of working memory, and brain activation within a specific region of interest (dorsolateral prefrontal cortex) measured during task performance. Despite the observation that adaptive and nonadaptive testing resulted in very different patterns of N-back load administration, correlations were similar overall. However, while adaptive testing improved the magnitudes of the group difference and brain activation correlations, nonadaptive testing produced a slightly larger correlation with the convergent measure of working memory. Small differences in effect size can have major implications for research. For example, while 80% power requires a sample size of n = 69 for a group difference effect of r = −0.33, the nonadaptive value, only a sample size of n = 32 is required for a group difference effect of r = −0.47, the adaptive value. Why adaptive testing did not also improve the working memory measure correlation is unclear, but this is possibly due to complex span and N-back tasks measuring distinct components of working memory. It should also be noted that this difference in correlation (r = .03) was the smallest of the three examined. Additional research with larger samples is needed to fully understand if, and how, adaptive testing improves the validity of scores in experimental cognitive research.
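For readers who want to reproduce this kind of contrast, a rough check with the pwr package is sketched below; exact sample sizes depend on the power method used (e.g., a correlation test vs. a two-sample comparison), so results will be close to, but not necessarily identical to, the values reported above.

```r
# Rough check of the sample-size contrast with the pwr package; the exact n
# depends on the power method assumed, so values may differ slightly from
# those reported in the text.
library(pwr)
pwr.r.test(r = 0.33, sig.level = 0.05, power = 0.80)  # n is roughly 70
pwr.r.test(r = 0.47, sig.level = 0.05, power = 0.80)  # n is roughly 34
```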
Limitations and Future Directions
Favorable conclusions should be tempered by consideration of the importance of obtaining prior information that can be used to inform Bayesian estimates of the latent examinee variables. In the current study, priors were based on analyses of the complete data, of which a subset would subsequently be used in the adaptive testing simulation. This is an unrealistic scenario, as the goal of adaptive testing is to tailor measurement to yet untested examinees. Nonetheless, the approach presented is conceptually similar to what researchers might achieve in live applications. Researchers can rely on pilot data or published results in the literature to inform the choice of prior parameters. The benefits of adaptive testing are likely to be highly dependent on the choice of priors, and thus we conjecture that the value of adaptive testing in experimental cognitive research is dependent on existing data. Nonetheless, this data requirement, perhaps as few as 50 to 100 examinees depending on the paradigm, is far more feasible than the thousands of examinees that might be needed using typical adaptive testing methods.
The approach also makes several requirements of experimental paradigms that are not always feasible. First, the approach is only applicable if the measurement model for the task can be expressed as a linear model. Thus, unlike SDT, many prominent mathematical cognitive models that cannot be expressed using the generalized linear model form (e.g., Markov learning models) would not be appropriate. Second, the approach requires relatively strong parametric assumptions. In particular, elements of the measurement structure matrix (Λ) are assumed to be known a priori. As noted above, the nonzero elements of the Λ matrix will typically be fixed to 1.0 for simple measurement models; that is, for models that assume that the latent examinee dimension measured by the task is equally relevant to all types of items within the task. With more complex measurement models, such as the SDT example presented here, the nonzero elements of the Λ matrix will take on specific values that match theory. Also, the ν parameters, while accounted for, are nonetheless assumed to have a relatively minor impact on item accuracy. Although this assumption is not strictly required, to the extent that items do vary greatly in their difficulty, this will add error and inefficiency to the adaptive testing approach outlined. Third, the modeling approach assumes that the dimensionality of the measurement model is invariant over experimental conditions. That is, we assume that the same cognitive abilities and traits determine individual differences in item responses over all levels of the experimental manipulation. To the extent that the relevant cognitive abilities change over conditions, the comparability of scores may not be supported. Finally, adaptive manipulation of experimental conditions in the manner outlined here is only plausible if the examinee’s responses to all previous conditions do not affect their responses to future conditions. In the current example, this implies, for instance, that having been previously administered a 5-back condition does not affect the examinee’s responses to a future 1-back condition. Although there are paradigms where this assumption is plausible, for others, where carryover effects are anticipated, adaptive administration of task conditions may not be feasible. In the case of learning, however, learning parameters can be added to the GLVM proposed, which could address this concern.
The strong parametric assumptions required for the approach outlined here should be scrutinized for each and every application; most notably, the a priori assumptions of the model dimensionality and the fixed parameters contained in Λ. These are the elements of what is termed the measurement model. A poorly specified measurement model and violations of assumptions can strongly undermine the benefits of latent variable modeling and adaptive testing. Thus, in instances with large sample size, it may be beneficial to relax assumptions and to allow the data to more strongly inform the measurement model. In the context of small samples, there is a risk that allowing a poorly specified measurement model to inform the adaptive collection of data will lead to a fundamental flaw in the data collected. Thus, we argue that the GLVM proposed should only be used in instances where investigators are both confident in their measurement model and have collected an initial data sample that can be used to test assumptions.
On the other hand, it is important to acknowledge that statistical models fitted to data will ultimately be proven wrong—at least in some, even minor aspects—and thus their value is judged by usefulness in comparison to existing approaches (Box, 1979). The modeling requirements presented here are not absolute, but rather flexible, and should be considered in comparison to existing methods. Fixed administration of items and task conditions is rarely optimal, but other adaptive approaches (e.g., the adaptive staircase method) might be similarly effective and much easier to implement. We expect that the appropriateness and usefulness of adaptive testing will vary depending on the context of assessment. Determining paradigms and experimental manipulations for which the proposed modeling and adaptive testing framework works is an important direction of future research.
The statistical approaches chosen here (i.e., estimation of latent examinee variables and adaptive selection of experimental conditions) may not be optimal. Our purpose was to demonstrate, not necessarily to refine, the approach. We observed improvements of approximately one third in efficiency and one fourth in precision. Researchers may achieve further gains with more refined statistical approaches.
It is important to note that explanatory IRT modeling is not new (e.g., de Boeck & Wilson, 2004), nor is the integration of cognitive models with psychometric models (Batchelder, 2010; van der Maas et al., 2011). Such approaches have been viewed by some as superior to more traditional psychometric modeling because they provide stronger tests of construct representation (Embretson, 2010). A second, and perhaps equally important, benefit is that they can simplify estimation by reducing the number of model parameters. This simplicity, in turn, reduces the required sample size. Indeed, we view simplicity and small sample requirements as the two primary benefits of the approach outlined here. Future work will develop user-friendly scripts and software packages that allow the approach to be widely used in applied research.
Conclusion
In sum, there is growing need for quantitative methods and technology designed specifically to support experimental cognitive testing, particularly in psychopathology research. We have presented an approach based on the GLVM, combined with mathematical cognitive modeling, that permits the use of computerized adaptive testing in experimental research without large samples or the need to precalibrate item parameters. Results suggest that computerized adaptive testing can substantially improve the efficiency and precision of cognitive testing when compared to nonadaptive testing. This proof-of-concept study further shows that adaptive testing can be used in experimental research even with relatively small samples. Adaptive testing has the potential to improve the impact and replicability of findings from experimental psychopathology research.
Supplemental Material
Supplemental material for "Latent Variable Modeling and Adaptive Testing for Experimental Cognitive Psychopathology Research" by Michael L. Thomas, Gregory G. Brown, Virginie M. Patt, and John R. Duffy is available in Educational and Psychological Measurement.
Appendix
Adaptive Selection of Task Conditions
An approach commonly used in adaptive testing seeks to maximize the determinant of the Fisher information matrix (I) with regard to the intentional variables, here specific elements of ω (van der Linden & Glas, 2010). Fisher information reflects the expected contribution of the data to the estimation of the latent variables; greater information implies greater precision. The reciprocal of the square root of information is equal to the standard error of the estimates (Embretson & Reise, 2000). Adaptive testing algorithms that seek to minimize standard error by maximizing information rely on an iterative process: administering items, provisionally estimating the latent examinee variables, determining whether the standard error has reached the threshold required to end testing, and, if it has not, administering additional items that maximize information. Maximizing the determinant of I (or related quantities such as the posterior covariance matrix) accomplishes this aim along multiple dimensions (Mulder & van der Linden, 2009; Segall, 2010).
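To make the selection rule concrete, the following is a minimal sketch of D-optimal condition selection for binary accuracy data, assuming a logistic link; the function names, the `candidates` dictionary, and the loading and easiness values passed to it are illustrative and are not the implementation used in the study.

```python
import numpy as np

def logistic(x):
    """Inverse logit link."""
    return 1.0 / (1.0 + np.exp(-x))

def item_information(lam, nu, omega):
    """Fisher information contributed by one binary item about omega.

    lam   : (k,) fixed row of Lambda for the item
    nu    : scalar item easiness
    omega : (k,) provisional estimate of the latent examinee variables
    """
    p = logistic(nu + lam @ omega)
    return p * (1.0 - p) * np.outer(lam, lam)

def select_condition(candidates, omega_hat, info_so_far):
    """D-optimal selection: pick the candidate condition whose items
    maximize the determinant of the accumulated information matrix.

    candidates : dict mapping a condition label to a list of (lam, nu) pairs
    """
    best_label, best_det = None, -np.inf
    for label, items in candidates.items():
        info = info_so_far.copy()
        for lam, nu in items:
            info += item_information(np.asarray(lam), nu, omega_hat)
        det = np.linalg.det(info)
        if det > best_det:
            best_label, best_det = label, det
    return best_label
```

In such a scheme, testing would alternate this selection step with provisional estimation of ω, ending once the standard errors (square roots of the diagonal of the inverse of the accumulated information matrix) fall below a target value.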
Although the latent variables are defined at the examinee level, they need not all vary over examinees. Often, only the latent intercept variable in ω, which is most comparable to the concept of latent ability, will be assumed to vary over examinees. To the extent that certain elements of ω are constant and known a priori, adaptive testing is expected to be highly efficient. Unfortunately, experimental cognitive researchers may lack sufficient data to confirm that the effect of an experimental manipulation does not vary over examinees or to precisely estimate its size. A more flexible option is to define prior probability distributions for the effects of the experimental manipulation within the context of Bayesian estimation (Gelman et al., 2014). That is, if the researcher can make reasonable assumptions regarding the prior probability distributions of the variables (e.g., normal) and the parameters of those distributions (e.g., mean and variance) based on any available data, this information can be used to improve the estimation of the variables and therefore the efficiency of adaptive testing. This Bayesian approach is not required (and, as described here, is not strictly Bayesian with regard to the interpretation of the variables), but it will likely prove valuable in applied settings and is therefore used in the current demonstration of the methodology.
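Continuing the sketch above, prior information about ω can be folded in simply by initializing the accumulated information at the prior precision and the provisional estimate at the prior mean; the two-dimensional ω and the numerical values below are hypothetical.

```python
import numpy as np

# Hypothetical prior for a two-dimensional omega (latent intercept and the
# effect of the experimental manipulation), chosen only for illustration.
mu_omega = np.array([0.0, -0.5])            # assumed prior means
Sigma_omega = np.diag([1.0, 0.25])          # assumed prior variances
prior_precision = np.linalg.inv(Sigma_omega)

info_so_far = prior_precision.copy()        # prior precision enters the D-optimality criterion
omega_hat = mu_omega.copy()                 # provisional estimate before any responses
# next_label = select_condition(candidates, omega_hat, info_so_far)
```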
Model Assumptions
In the Bayesian estimation approach, model identification is achieved by assuming that ν, ζ, and ω are distributed as follows: ν ~ N(0, σν²), ζ ~ MVN(0, σζ²), and ω ~ MVN(μω, σω²). The assumptions are (1) normality; (2) that the mean of the item easiness distribution is 0; and (3) that the residual ability parameters (ζ) follow a multivariate normal distribution with a mean vector of 0s and a diagonal variance–covariance matrix. These assumptions are both reasonable and implicit in most psychometric treatments of experimental cognitive tasks. Although they are sufficient to achieve identification and to permit adaptive testing, the algorithm is generally inefficient, and adaptive testing is less beneficial, without additional information. The approach allows the investigator to choose strongly informative, weakly informative, or noninformative hyperparameters (i.e., σν², σζ², μω, and σω²).
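For concreteness, the sketch below writes the unnormalized log-posterior of a single examinee's ω under the multivariate normal prior, treating the ν values as known at the scoring step and assuming a logistic link for simplicity; it is an illustration of the assumptions above, not the estimation routine used in the study.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(omega, y, Lam, nu, mu_omega, prec_omega):
    """Negative unnormalized log-posterior of one examinee's omega.

    y          : (J,) binary accuracies
    Lam        : (J, k) fixed measurement-structure rows for the J items
    nu         : (J,) item easiness values (treated as known here)
    mu_omega   : (k,) prior mean of omega
    prec_omega : (k, k) prior precision matrix
    """
    eta = nu + Lam @ omega
    p = 1.0 / (1.0 + np.exp(-eta))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)      # numerical safety
    loglik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    diff = omega - mu_omega
    log_prior = -0.5 * diff @ prec_omega @ diff
    return -(loglik + log_prior)

# Provisional (MAP-style) score between conditions, given responses so far:
# omega_hat = minimize(neg_log_posterior, x0=mu_omega,
#                      args=(y, Lam, nu, mu_omega, prec_omega)).x
```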
Ability, within this context, is a psychometric term defined as a latent variable within a mathematical model that relates observed scores to some function of examinee and item parameters.
The magnitudes of the λ parameters are only important insofar as the investigator wishes the parameter estimates to be scaled in a metric common to SDT. More important are the signs of the λ parameters, which lead to the interpretation that, whereas memory discriminability (d′) is consistently and positively related to item accuracy, a conservative, or “no,” bias (Cc) is positively related to foil accuracy but negatively related to target accuracy.
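For illustration, one coding of the Λ rows consistent with this interpretation (columns ordered as d′ and then Cc) is sketched below; the 0.5 scaling is only needed if d′ estimates should be returned in the conventional SDT metric, and the exact values used in the study may differ.

```python
import numpy as np

# Illustrative Lambda rows for one target item and one foil item under an
# equal-variance Gaussian SDT parameterization (cf. DeCarlo, 1998).
#                          d'    Cc
lam_target = np.array([ 0.5, -1.0])  # d' raises target accuracy; a conservative bias lowers it
lam_foil   = np.array([ 0.5,  1.0])  # d' and a conservative bias both raise foil accuracy
Lambda = np.vstack([lam_target, lam_foil])
```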
Because the model was fitted at the group level, the ξ parameters represent group averages (i.e., Level-4 model) and the ζ parameters are specific to conditions nested within examinees.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this publication was supported, in part, by the National Institute of Mental Health of the National Institutes of Health under award number K23 MH102420. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
ORCID iD: Michael L. Thomas https://orcid.org/0000-0002-3026-7609
Supplemental Material: Supplemental material for this article is available online.
References
- Andreasen N. C. (1984a). Modified Scale for the Assessment of Negative Symptoms (SANS). University of Iowa.
- Andreasen N. C. (1984b). Scale for the Assessment of Positive Symptoms (SAPS). University of Iowa.
- Barch D. M., Carter C. S., Braver T. S., Sabb F. W., MacDonald A., Noll D. C., Cohen J. D. (2001). Selective deficits in prefrontal cortex function in medication-naive patients with schizophrenia. Archives of General Psychiatry, 58(3), 280-288. 10.1001/archpsyc.58.3.280
- Batchelder W. H. (2010). Cognitive psychometrics: Using multinomial processing tree models as measurement tools. In Embretson S. E. (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 71-93). American Psychological Association. 10.1037/12074-004
- Bates D., Maechler M., Bolker B., Walker S. (2014). lme4: Linear mixed-effects models using Eigen and S4 (R package version 1.1-7). Journal of Statistical Software, 67, 1-48. 10.18637/jss.v067.i01
- Box G. E. P. (1979). Robustness in the strategy of scientific model building. In Launer R. L., Wilkinson G. N. (Eds.), Robustness in statistics (pp. 201-236). Academic Press. 10.1016/B978-0-12-438150-6.50018-2
- Buxton R. B. (2009). Introduction to functional magnetic resonance imaging: Principles and techniques. Cambridge University Press. 10.1017/CBO9780511605505
- Cai L. (2010a). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75(1), 33-57. 10.1007/s11336-009-9136-x
- Cai L. (2010b). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35(3), 307-335. 10.3102/1076998609353115
- Chalmers R. P., Flora D. B. (2014). Maximum-likelihood estimation of noncompensatory IRT models with the MH-RM algorithm. Applied Psychological Measurement, 38(5), 339-358. 10.1177/0146621614520958
- Chapman L. J., Chapman J. P. (1973). Problems in the measurement of cognitive deficits. Psychological Bulletin, 79(6), 380-385. 10.1037/h0034541
- Christensen K. B., Makransky G., Horton M. (2017). Critical values for Yen's Q3: Identification of local dependence in the Rasch model using residual correlations. Applied Psychological Measurement, 41(3), 178-194. 10.1177/0146621616677520
- Cohen J., Cohen P., West S. G., Aiken L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. Lawrence Erlbaum.
- Conners C. K. (1994). Conners' continuous performance test computer program 3.0 user's manual. Multi-Health Systems.
- de Ayala R. J. (2009). The theory and practice of item response theory. Guilford Press.
- de Boeck P., Wilson M. (2004). Explanatory item response models: A generalized linear and nonlinear approach. Springer. 10.1007/978-1-4757-3990-9
- DeCarlo L. T. (1998). Signal detection theory and generalized linear models. Psychological Methods, 3(2), 186-205. 10.1037/1082-989X.3.2.186
- DeCarlo L. T. (2010). On the statistical and theoretical basis of signal detection theory and extensions: Unequal variance, random coefficient, and mixture models. Journal of Mathematical Psychology, 54(3), 304-313. 10.1016/j.jmp.2010.01.001
- Delis D. C., Kaplan E., Kramer J. H. (2001). The Delis-Kaplan Executive Function System: Examiner's manual. The Psychological Corporation. 10.1037/t15082-000
- Diedenhofen B., Musch J. (2015). cocor: A comprehensive solution for the statistical comparison of correlations. PLOS ONE, 10(4), Article e0121945. 10.1371/journal.pone.0121945
- Embretson S. E. (2010). Cognitive design systems: A structural modeling approach applied to developing a spatial ability test. In Embretson S. E. (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 71-93). American Psychological Association. 10.1037/12074-011
- Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum.
- Forsyth J. P., Zvolensky M. J. (2001). Experimental psychopathology, clinical science, and practice: An irrelevant or indispensable alliance? Applied and Preventive Psychology, 10(4), 243-264. 10.1016/S0962-1849(01)80002-0
- Gelman A., Carlin J., Stern H., Dunson D., Vehtari A., Rubin D. (2014). Bayesian data analysis (3rd ed.). CRC Press. 10.1201/b16018
- Gibbons R. D., Hedeker D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57(3), 423-436. 10.1007/BF02295430
- Gibbons R. D., Weiss D. J., Pilkonis P. A., Frank E., Moore T., Kim J. B., Kupfer D. J. (2012). Development of a computerized adaptive test for depression. Archives of General Psychiatry, 69(11), 1104-1112. 10.1001/archgenpsychiatry.2012.14
- Glahn D. C., Ragland J. D., Abramoff A., Barrett J., Laird A. R., Bearden C. E., Velligan D. I. (2005). Beyond hypofrontality: A quantitative meta-analysis of functional neuroimaging studies of working memory in schizophrenia. Human Brain Mapping, 25(1), 60-69. 10.1002/hbm.20138
- Gur R. C., Richard J., Hughett P., Calkins M. E., Macy L., Bilker W. B., Brensinger C., Gur R. E. (2010). A cognitive neuroscience-based computerized battery for efficient measurement of individual differences: Standardization and initial construct validation. Journal of Neuroscience Methods, 187(2), 254-262. 10.1016/j.jneumeth.2009.11.017
- Haynes S. N., Smith G., Hunsley J. D. (2011). Scientific foundations of clinical assessment. Routledge. 10.4324/9780203829172
- Hedge C., Powell G., Sumner P. (2018). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods, 50(3), 1166-1186. 10.3758/s13428-017-0935-1
- Kane M. J., Conway A. R. A., Miura T. K., Colflesh G. J. H. (2007). Working memory, attention control, and the N-back task: A question of construct validity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(3), 615-622. 10.1037/0278-7393.33.3.615
- Kurtz M. M., Ragland J. D., Moberg P. J., Gur R. C. (2004). Penn Conditional Exclusion Test: A new measure of executive-function with alternate forms for repeat administration. Archives of Clinical Neuropsychology, 19(2), 191-201. 10.1016/S0887-6177(03)00003-9
- Lord F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum.
- MacDonald A. W. (2009). Is more cognitive experimental psychopathology of schizophrenia really necessary? Challenges and opportunities. In Ritsner M. S. (Ed.), Handbook of neuropsychiatric biomarkers, endophenotypes and genes (Vol. 1, pp. 141-154). Springer. 10.1007/978-1-4020-9464-4_9
- MacDonald A. W., Carter C. S. (2002). Cognitive experimental approaches to investigating impaired cognition in schizophrenia: A paradigm shift. Journal of Clinical and Experimental Neuropsychology, 24(7), 873-882. 10.1076/jcen.24.7.873.8386
- Medler D. A., Binder J. R. (2005). MCWord: An on-line orthographic database of the English language. http://www.neuro.mcw.edu/mcword/
- Mulder J., van der Linden W. J. (2009). Multidimensional adaptive testing with optimal design criteria for item selection. Psychometrika, 74(2), 273-296. 10.1007/s11336-008-9097-5
- Nelson J. K., Reuter-Lorenz P. A., Sylvester C.-Y. C., Jonides J., Smith E. E. (2003). Dissociable neural mechanisms underlying response-based and familiarity-based conflict in working memory. Proceedings of the National Academy of Sciences of the USA, 100(19), 11171-11175. 10.1073/pnas.1334125100
- Nuechterlein K. H., Green M. F., Kern R. S., Baade L. E., Barch D. M., Cohen J. D., Essock S., Fenton W. S., Frese F. J., 3rd, Gold J. M., Goldberg T., Heaton R. K., Keefe R. S., Kraemer H., Mesholam-Gately R., Seidman L. J., Stover E., Weinberger D. R., Young A. S., . . . Marder S. R. (2008). The MATRICS Consensus Cognitive Battery, Part 1: Test selection, reliability, and validity. American Journal of Psychiatry, 165(2), 203-213. 10.1176/appi.ajp.2007.07010042
- Owen A. M., McMillan K. M., Laird A. R., Bullmore E. (2005). N-back working memory paradigm: A meta-analysis of normative functional neuroimaging studies. Human Brain Mapping, 25(1), 46-59. 10.1002/hbm.20131
- Peirce J. W. (2007). PsychoPy-Psychophysics software in Python. Journal of Neuroscience Methods, 162(1-2), 8-13. 10.1016/j.jneumeth.2006.11.017
- Ragland J. D., Turetsky B. I., Gur R. C., Gunning-Dixon F., Turner T., Schroeder L., Chan R., Gur R. E. (2002). Working memory for complex figures: An fMRI comparison of letter and fractal n-back tasks. Neuropsychology, 16(3), 370-379. 10.1037/0894-4105.16.3.370
- Reckase M. (2009). Multidimensional item response theory. Springer. 10.1007/978-0-387-89976-3
- Revelle W. (2011). psych: Procedures for personality and psychological research. http://personality-project.org/r/psych.manual.pdf
- Scalise K., Allen D. D. (2015). Use of open-source software for adaptive measurement: Concerto as an R-based computer adaptive development and delivery platform. British Journal of Mathematical and Statistical Psychology, 68(3), 478-496. 10.1111/bmsp.12057
- Segall D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61(2), 331-354. 10.1007/BF02294343
- Segall D. O. (2010). Principles of multidimensional adaptive testing. In van der Linden W. J., Glas C. A. W. (Eds.), Elements of adaptive testing (pp. 57-75). Springer. 10.1007/978-0-387-85461-8_3
- Skrondal A., Rabe-Hesketh S. (2004). Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. Chapman & Hall/CRC. 10.1201/9780203489437
- Snelbaker A. J., Wilkinson G. S., Robertson G. J., Glutting J. J. (2001). Wide Range Achievement Test 3 (WRAT3). In Dorfman W. I., Hersen M. (Eds.), Understanding psychological assessment: Perspectives on individual differences (pp. 259-274). Springer. 10.1007/978-1-4615-1185-4_13
- Strauss M. E. (2001). Demonstrating specific cognitive deficits: A psychometric perspective. Journal of Abnormal Psychology, 110(1), 6-14. 10.1037/0021-843X.110.1.6
- Thomas M. L. (2011). The value of item response theory in clinical assessment: A review. Assessment, 18(3), 291-307.
- Thomas M. L. (2019). Advances in applications of item response theory to clinical assessment. Psychological Assessment, 31(12), 1442-1455.
- Thomas M. L., Brown G. G. (2018). A psychometric-neuroimaging analysis of load-response curves in schizophrenia. Paper presented at the 126th Annual Convention of the American Psychological Association, San Francisco, CA.
- Thomas M. L., Brown G. G., Gur R. C., Moore T. M., Patt V. M., Risbrough V. B., Baker D. G. (2018). A signal detection-item response theory model for evaluating neuropsychological measures. Journal of Clinical and Experimental Neuropsychology, 40(8), 745-760.
- Thomas M. L., Patt V. M., Bismark A., Sprock J., Tarasenko M., Light G. A., Brown G. G. (2017). Evidence of systematic attenuation in the measurement of cognitive deficits in schizophrenia. Journal of Abnormal Psychology, 126(3), 312-324.
- van der Linden W. J., Glas C. A. W. (2010). Elements of adaptive testing. Springer. 10.1007/978-0-387-85461-8
- van der Maas H. L. J., Molenaar D., Maris G., Kievit R. A., Borsboom D. (2011). Cognitive psychology meets psychometric theory: On the relation between process models for decision making and latent variable models for individual differences. Psychological Review, 118(2), 339-356. 10.1037/a0022749
- Wickens T. D. (2002). Elementary signal detection theory. Oxford University Press. 10.1093/acprof:oso/9780195092509.001.0001
- Woods S. W. (2003). Chlorpromazine equivalent doses for the newer atypical antipsychotics. Journal of Clinical Psychiatry, 64(6), 663-667. 10.4088/JCP.v64n0607