Abstract
Computer-based testing (CBT) is becoming increasingly popular for assessing test-takers' latent abilities and making inferences regarding their cognitive processes. In addition to collecting item responses, an important benefit of CBT is that response times (RTs) can also be recorded and used in subsequent analyses. To better understand the structural relations between multidimensional cognitive attributes and the working speed of test-takers, this research proposes a joint-modeling approach that integrates compensatory multidimensional latent traits and response speediness using item responses and RTs. The joint model is cast as a multilevel model in which the structural relation between working speed and accuracy is captured through their variance–covariance structures. The feasibility of this modeling approach is investigated via a Monte Carlo simulation study using a Bayesian estimation scheme. The results indicate that integrating RTs improved model parameter recovery and precision. In addition, Programme for International Student Assessment (PISA) 2015 mathematics standard unit items are analyzed to further demonstrate the utility of the approach with real data.
Keywords: joint modeling, multidimensional item response theory, response times, accuracy and speed
Recently, the use of computers to administer tests has provided a platform not only for recording examinees' responses to items but also for collecting response process data (RPD). Continuous RPD in the form of digital records, such as response times (RTs) and eye-tracking characteristics, are currently being used to capture the problem-solving processes, strategies, and behaviors of test-takers (Ercikan & Pellegrino, 2017; Man & Harring, 2019). RTs, one of the critical types of RPD, have garnered considerable attention for improving current measurement practices because this type of data is thought to deliver a more comprehensive depiction of the performance and attributes of test-takers beyond what is available from response accuracy (i.e., correct responses) alone (Bolsinova, De Boeck, & Tijmstra, 2017). RTs provide essential information regarding the amount of time test-takers spend on assessment items, indicating, for instance, their level of engagement with the content. RTs have also been utilized to address several measurement issues and challenges. For example, RTs can be used to help select items that maximize Fisher information at the current estimate of a test-taker's latent ability; this type of procedure may be most advantageous for computerized adaptive tests, as the test length could ultimately be shortened (Meyer, 2010; van Rijn & Ali, 2017; Wise & Kong, 2005). In addition, in the context of test security, RTs have been used to help identify aberrant testing behaviors such as preknowledge cheating and lucky guessing, thereby promoting test fairness (Bolt, Cohen, & Wollack, 2002; Man, Harring, Ouyang, & Thomas, 2018; Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014; Thissen, 1983; van der Linden, 2006b; van der Linden & Guo, 2008).
Much of the previous literature on integrating RTs and item responses has centered on the relation between a test-taker's responding speed and responding accuracy within a unidimensional item response theory (IRT) modeling framework (e.g., Bolsinova et al., 2017; De Boeck, Chen, & Davison, 2017; Fox & Marianti, 2016; Meng, Tao, & Chang, 2015; Roskam, 1997; Thissen, 1983). Importantly, among these methods is a two-level hierarchical framework for modeling item responses and RTs proposed by van der Linden (2006a). In this framework, RTs and item responses are modeled at the first level, whereas the dependencies between the lognormal RT model parameters (van der Linden, 2006b) and the IRT model parameters (i.e., those of the two-parameter logistic [2-PL] model) are specified at the second hierarchical level. To properly apply this model, an assumption is made that test-takers respond to items with constant speed. However, many researchers have challenged the tenability of this assumption by modeling within-subject variation in speed across items (e.g., Bolsinova et al., 2017; Fox & Marianti, 2016).
Some extensions of this hierarchical joint modeling approach have been proposed (Bolsinova et al., 2017; Fox & Marianti, 2016; van Rijn & Ali, 2017). However, few studies have examined the functional relation between RTs and the multidimensional cognitive constructs thought to drive item responses. Conventional unidimensional IRT models are inadequate to capture the complexities inherent in these scenarios, whereas applying multidimensional IRT (MIRT) models is theoretically defensible. In recent years, MIRT models have gained prominence in educational and psychological testing due to the increasing need of stakeholders to understand and interpret the multifaceted cognitive processes test-takers use during a test (e.g., Jiao, Kamata, Wang, & Jin, 2012; Reckase, 2009). For example, the Programme for International Student Assessment (PISA) assesses various content domains, including mathematics, reading literacy, and science. Within each domain, several cognitive processing constructs are measured, such as explaining phenomena and evaluating and interpreting data. In addition, an abundance of process data, such as item RTs and action sequences, is gathered as ancillary information. Yet, there currently exists no model that integrates process data like RTs into the MIRT framework for evaluating the relative dependency between test-takers' responding speed and accuracy. It is this gap in the literature that the current study aims to fill.
Inspired by the work of van der Linden (2006a, 2006b), and acknowledging the importance of MIRT, this study proposes a joint modeling approach for multidimensional item responses and RTs that describes the relation between the speediness and accuracy of a person answering items in a multidimensional latent space. The proposed joint model extends the hierarchical modeling framework of van der Linden (2006a) to MIRT models. In this joint modeling approach, a MIRT model and an RT model are specified separately at Level 1, and the variance–covariance structures of the person and item parameters are jointly estimated at Level 2. A Bayesian estimation approach is used to investigate the proposed hierarchical model via a Monte Carlo simulation under a limited number of conditions thought to impact estimation accuracy.
To outline the focus of this manuscript, the hierarchical model for RTs and item responses within a MIRT framework is introduced first. In a subsequent section, a Bayesian approach to estimating the model via a Markov chain Monte Carlo (MCMC) algorithm is discussed, including the specification of prior distributions. A simulation study highlighting contexts where the proposed MIRT-RT model may be advantageous over a MIRT model is then outlined, and its results are articulated. Finally, an empirical example using data from the PISA 2015 mathematics test is provided to underscore the findings from the simulation, and results from the analyses are discussed.
Hierarchical Model Specification
Level 1: Measurement Model for Accuracy
Compensatory MIRT model
MIRT models describe the relation between item responses and two or more latent traits and are categorized as either compensatory or noncompensatory (see, for example, Ackerman, 1989; Adams, Wilson, & Wang, 1997; Bolt & Lall, 2003; Fox, Entink, & Avetisyan, 2014; Wang & Nydick, 2015). Compensatory MIRT models are used when the latent traits compensate for one another in answering an item; in other words, high proficiency on one latent dimension is thought to compensate for low proficiency on other dimension(s). By contrast, in a noncompensatory MIRT model, a deficiency on one latent dimension cannot be offset by adequacy on other latent dimensions. As noncompensatory MIRT models are notoriously difficult to estimate (see, for example, Bolt & Lall, 2003; Wang & Nydick, 2015), this study focuses on jointly modeling a compensatory MIRT model with RTs.
At the first level of the modeling hierarchy, a 2-PL compensatory MIRT model (Reckase, 1985) is specified, which assumes that the probability of correctly answering an item is influenced by a weighted linear combination of abilities and is formulated as
$$P(U_{pi}=1\mid\boldsymbol{\theta}_p)=\frac{\exp(\mathbf{a}_i^{\top}\boldsymbol{\theta}_p+d_i)}{1+\exp(\mathbf{a}_i^{\top}\boldsymbol{\theta}_p+d_i)}, \tag{1}$$

where $P(U_{pi}=1\mid\boldsymbol{\theta}_p)$ is the probability of a correct response to item $i$, $i=1,\ldots,I$, by person $p$, $p=1,\ldots,P$; $\mathbf{a}_i$ is a $K\times 1$ vector of discrimination parameters for item $i$; $d_i$ is the location parameter for item $i$; and $\boldsymbol{\theta}_p$ is a $K\times 1$ vector of latent traits for person $p$.
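To make Equation 1 concrete, the following R snippet (a minimal sketch; the function name and the numeric values are ours, not part of the proposed model) computes the response probability for a two-dimensional case:

```r
# Probability of a correct response under the compensatory 2-PL MIRT model
# (Equation 1); a, theta, and d are illustrative values.
p_correct <- function(a, theta, d) {
  # a: K discrimination parameters; theta: K abilities; d: item location
  plogis(sum(a * theta) + d)
}

p_correct(a = c(1.2, 0.8), theta = c(0.5, -0.3), d = -0.2)  # approx. .54
```

Because the kernel is a weighted sum of the abilities, a high value on one dimension can offset a low value on the other, which is precisely the compensatory property.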
Level 1: Measurement Model for Working Speed
RT modeling
The lognormal RT model proposed by van der Linden (2006b) is utilized for modeling the RTs. Although many RT models assuming a variety of distributions have been proposed (see, for example, Roskam, 1997), the lognormal model assumes that the log-transformed RTs follow a normal distribution, allowing for the joint modeling of item responses within a multivariate normal distributional framework. RTs could be modeled as multidimensional as well. For example, if several items share a common stimulus, their RTs would be related to a certain degree, much the same way items sharing a common reading passage would have a certain association; in this case, it would be necessary to measure an additional RT dimension. However, this study maintains a unidimensional perspective on speed; a multidimensional RT model could be investigated as an extension in a future study.
The lognormal RT model is
$$f(t_{pi};\tau_p,\alpha_i,\beta_i)=\frac{\alpha_i}{t_{pi}\sqrt{2\pi}}\exp\left\{-\frac{1}{2}\left[\alpha_i\left(\ln t_{pi}-(\beta_i-\tau_p)\right)\right]^2\right\}, \tag{2}$$

where the latent parameter $\tau_p$ represents working speed for test-taker $p$. The item parameter $\beta_i$ denotes time intensity, or simply, the amount of time required for answering a specific item. The parameter $\alpha_i$ is an item time discrimination parameter. The mean of the log RTs, $\mu_{pi}$, is parameterized as $\mu_{pi}=\beta_i-\tau_p$.
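As an illustration, the lognormal model can be simulated and evaluated directly in R (a sketch with hypothetical parameter values):

```r
# Density of an RT under Equation 2: log(T) is normal with mean beta - tau
# and standard deviation 1/alpha.
rt_density <- function(t, tau, alpha, beta) {
  dlnorm(t, meanlog = beta - tau, sdlog = 1 / alpha)
}

# One simulated RT (in seconds) for a person with speed tau = 0.4 on an item
# with time intensity beta = 4 and time discrimination alpha = 1.5.
exp(rnorm(1, mean = 4 - 0.4, sd = 1 / 1.5))
```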
Level 2: Modeling Person Parameters
The second-level model incorporates two correlational structures to account for the dependencies among the person parameters and among the item parameters, respectively. The relation between the latent attributes, $\boldsymbol{\theta}_p$, and speediness, $\tau_p$, for the population of test-takers is assumed to follow a multivariate normal distribution such that
$$(\boldsymbol{\theta}_p^{\top}, \tau_p)^{\top} \sim \mathrm{MVN}(\boldsymbol{\mu}_P, \boldsymbol{\Sigma}_P), \tag{3}$$

with mean vector, $\boldsymbol{\mu}_P=(\mu_{\theta_1},\mu_{\theta_2},\mu_{\tau})^{\top}$, and covariance matrix

$$\boldsymbol{\Sigma}_P=\begin{pmatrix}\sigma^2_{\theta_1} & \sigma_{\theta_1\theta_2} & \sigma_{\theta_1\tau}\\ \sigma_{\theta_1\theta_2} & \sigma^2_{\theta_2} & \sigma_{\theta_2\tau}\\ \sigma_{\theta_1\tau} & \sigma_{\theta_2\tau} & \sigma^2_{\tau}\end{pmatrix}. \tag{4}$$
The parameters $\sigma_{\theta_1\tau}$ and $\sigma_{\theta_2\tau}$ represent the linear dependencies between the ability dimensions and the speediness of the test-taker. A negative value indicates that test-takers who work more quickly tend to have lower latent ability on the corresponding dimension (Bolsinova et al., 2017; De Boeck et al., 2017; van der Linden, 2006a). By modeling the multidimensional latent structure jointly with RTs, the structural dependencies between the latent cognitive abilities and speediness can be evaluated directly.
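For instance, person parameters with this Level 2 structure can be generated in R as follows (a sketch; the particular variances and covariances are illustrative):

```r
library(MASS)  # for mvrnorm()

# Covariance matrix of (theta1, theta2, tau): abilities correlate .3 with
# each other and -.3 with speed (illustrative values).
Sigma_P <- matrix(c( 1.0,  0.3, -0.3,
                     0.3,  1.0, -0.3,
                    -0.3, -0.3,  1.0), nrow = 3, byrow = TRUE)
person <- mvrnorm(n = 500, mu = c(0, 0, 0), Sigma = Sigma_P)
theta  <- person[, 1:2]  # latent abilities
tau    <- person[, 3]    # working speed
```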
Level 2: Modeling Item Parameters
To account for item parameter dependencies in this joint modeling approach, a multivariate normal distribution is defined for the item parameters, $\boldsymbol{\psi}_i=(d_i,\beta_i)^{\top}$, such that

$$\boldsymbol{\psi}_i \sim \mathrm{MVN}(\boldsymbol{\mu}_I, \boldsymbol{\Sigma}_I), \tag{5}$$

where the mean vector and symmetric covariance matrix, $\boldsymbol{\mu}_I$ and $\boldsymbol{\Sigma}_I$, are defined, respectively, as $\boldsymbol{\mu}_I=(\mu_d,\mu_\beta)^{\top}$ and

$$\boldsymbol{\Sigma}_I=\begin{pmatrix}\sigma^2_{d} & \sigma_{d\beta}\\ \sigma_{d\beta} & \sigma^2_{\beta}\end{pmatrix}.$$

These moments are a restrictive version of the more general moment structure of the full item parameter vector $(\mathbf{a}_i^{\top}, d_i, \alpha_i, \beta_i)^{\top}$, whose mean vector and covariance matrix would additionally include the slope and time discrimination parameters.
respectively. There are several reasons for placing restrictions on these item parameters such that the only parameters to be estimated will be item location and time intensity. First, this type of reduction is often the convention used when estimating MIRT models within a Bayesian framework (see, for example, Bolt & Lall, 2003; Fox et al., 2014; Wang & Nydick, 2015). In those studies, the correlation between slope parameters of distinct ability dimensions were not specified. A compelling reason why this might be the case is that estimating the correlation of slope parameters provides neither significant information about item quality nor useful information inferring about test-takers’ abilities. Notably, the speed and accuracy trade-off was not addressed by the correlation of item slopes. Moreover, estimating the correlation between slope parameters could potentially lead to model overfitting. The estimation precision of person-side parameters would be reduced due to the lower degrees of freedom induced by needlessly estimating these correlations. Thus, item slopes are assumed not to be correlated. Furthermore, the correlation between the item slopes and item time discrimination is not considered either. Van der Linden (2006a) reported that the correlation between item discrimination and time discrimination was .04 with their real data set, which was not significant as 0 was a plausible value in the interior of the credible interval constructed from the correlation’s posterior density.
The hierarchical modeling framework proposed by van der Linden (2006a) is thus extended in the current study to jointly modeling item responses, via a MIRT model, and RTs. This model is referred to as the MIRT-RT model throughout the remainder of the article. The RTs are used as ancillary information in the estimation of the MIRT model parameters. Figure 1 displays a graphical representation of the MIRT-RT model.
Figure 1.
A hierarchical graphical representation of the MIRT-RT model.
Note. MIRT = multidimensional item response theory; RT = response time; CMIRT = compensatory multidimensional item response theory.
Bayesian Estimation Using MCMC Sampling
An MCMC algorithm for a fully Bayesian specification of the model was used for model parameter estimation in Just Another Gibbs Sampler (JAGS; Plummer, 2015). All data for the simulation were generated in R (R Core Team, 2016). Two chains with a thinning interval of five were executed using at least 15,000 total iterations, and parameter estimates and standard deviations from the posterior densities were computed using the final 2,000 iterations. autojags(), an auto-updating function in the R2jags package (Su & Yajima, 2015), was utilized to monitor convergence; it automatically runs additional iterations if convergence to the posterior densities is not achieved within the specified 15,000 iterations. The potential scale reduction (PSR) factor was used to evaluate convergence for all model parameters (Gelman, Carlin, Stern, & Rubin, 2003), with a PSR value of 1.1 or less for each model parameter used as the arbiter of convergence.
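The estimation workflow might look as follows in R (a sketch; the data objects and the model file name mirt_rt.jags are hypothetical, with possible model-file contents shown at the end of this section):

```r
library(R2jags)

# U: P x I response matrix; logT: P x I log RT matrix; Q: design matrix
jags_data <- list(U = U, logT = logT, Q = Q,
                  P = nrow(U), I = ncol(U), R_I = diag(2))
params <- c("a", "d", "alpha", "beta", "rho12", "l31", "l32", "l33")

fit0 <- jags(data = jags_data, parameters.to.save = params,
             model.file = "mirt_rt.jags", n.chains = 2,
             n.iter = 15000, n.thin = 5)
fit  <- autojags(fit0, Rhat = 1.1)  # keep updating until all PSR <= 1.1
print(fit)
```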
Identification Constraints
Several constraints were set for identifying parameter scales and solving rotational indeterminacy issues related to estimation of MIRT models. To identify parameter scales, the population mean and variances of the latent attributes, are fixed to 0 and 1, respectively. This strategy has been used in past methodological investigations (Bolt & Lall, 2003; Fox et al., 2014). In addition, the population mean of is fixed as 0 as well (van der Linden, 2006a). To solve the dimensional indeterminacy issue, a design matrix,
$$\mathbf{Q}=\begin{pmatrix}1 & 1 & 0 & 0 & 1 & \cdots & 1\\ 0 & 0 & 1 & 1 & 1 & \cdots & 1\end{pmatrix}^{\top} \tag{6}$$
is applied to identify the distinct latent ability dimensions within the MCMC algorithm, in which the first two items load only on the first dimension and the next two items load only on the second dimension. Following Bolt and Lall (2003), the remaining items load on both dimensions.
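In R, such a design matrix can be constructed with a small helper (our own naming):

```r
# Design matrix of Equation 6: items 1-2 load only on dimension 1,
# items 3-4 only on dimension 2, and all remaining items on both.
make_Q <- function(I) {
  Q <- matrix(1, nrow = I, ncol = 2)
  Q[1:2, 2] <- 0
  Q[3:4, 1] <- 0
  Q
}
make_Q(6)
```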
Prior Distributions
The prior distribution of item parameters for the MIRT-RT model is assumed to be bivariate normal such that
$$(d_i, \beta_i)^{\top} \sim \mathrm{MVN}(\boldsymbol{\mu}_I, \boldsymbol{\Sigma}_I). \tag{7}$$
The item slope parameters, $a_{ik}$ for $k=1,\ldots,K$, are assumed to follow a uniform distribution over $[0, 3]$. An inverse gamma distribution is assumed for the time discrimination parameter (i.e., $\alpha_i^2$), which is the inverse of the variance of the log-times on item $i$ ($\sigma_i^2$) under the RT model: $\alpha_i^2 = 1/\sigma_i^2$.
In this fully Bayesian specification, hyperpriors are defined as

$$\boldsymbol{\mu}_I \sim \mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma}_I), \qquad \boldsymbol{\Sigma}_I^{-1} \sim \mathrm{Wishart}(\mathbf{I}, \nu),$$

where $\mathbf{I}$ is an identity matrix and $\nu$ indicates the degrees of freedom, which in this case is equal to the order of $\boldsymbol{\Sigma}_I$. Similarly, the prior distribution for the person parameters of the MIRT-RT model follows a multivariate normal distribution such that
$$(\boldsymbol{\theta}_p^{\top}, \tau_p)^{\top} \sim \mathrm{MVN}(\boldsymbol{\mu}_P, \boldsymbol{\Sigma}_P), \tag{8}$$

where $\boldsymbol{\mu}_P = (0, 0, 0)^{\top}$.
The population means and variances of the latent attributes, $\theta_1$ and $\theta_2$, are fixed to 0 and 1, respectively (Bolt & Lall, 2003; Fox et al., 2014; van der Linden, 2006a). The population mean of $\tau$ is constrained to 0. The model is identified by fixing the scale of these person parameters. The variance of $\tau$ and the covariance between any pair of the latent person parameters are allowed to be freely estimated. To achieve scale identification, $\boldsymbol{\Sigma}_P$ is decomposed as $\boldsymbol{\Sigma}_P=\mathbf{L}\mathbf{L}^{*}$, and the entries of $\mathbf{L}$ are as follows:
$$\mathbf{L}=\begin{pmatrix}1 & 0 & 0\\ \ell_{21} & \ell_{22} & 0\\ \ell_{31} & \ell_{32} & \ell_{33}\end{pmatrix}, \qquad \ell_{21}^{2}+\ell_{22}^{2}=1, \tag{9}$$
where $\mathbf{L}^{*}$ denotes the conjugate transpose of $\mathbf{L}$, and $\mathbf{L}$ is a lower triangular matrix with real and positive diagonal entries. The priors are set on the free entries of $\mathbf{L}$, with the constraints above keeping the variances of $\theta_1$ and $\theta_2$ fixed at 1 while the variance of $\tau$ and all covariances remain freely estimated.
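A quick numerical check of this parameterization (with illustrative entries) confirms that the implied ability variances equal 1 by construction:

```r
# Cholesky-style construction of Sigma_P (Equation 9); entries illustrative.
rho12 <- 0.3                        # l21: correlation between the abilities
l31 <- -0.20; l32 <- -0.15; l33 <- 0.90
L <- rbind(c(1,     0,                 0),
           c(rho12, sqrt(1 - rho12^2), 0),
           c(l31,   l32,               l33))
Sigma_P <- L %*% t(L)
diag(Sigma_P)  # first two entries are exactly 1; third is the free var(tau)
```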
The joint posterior distribution for the proposed MIRT-RT model can be represented as

$$p(\boldsymbol{\theta},\boldsymbol{\tau},\mathbf{a},\mathbf{d},\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\mu}_I,\boldsymbol{\Sigma}_I,\boldsymbol{\Sigma}_P \mid \mathbf{U},\mathbf{T}) \propto p(\mathbf{U}\mid\boldsymbol{\theta},\mathbf{a},\mathbf{d})\, p(\mathbf{T}\mid\boldsymbol{\tau},\boldsymbol{\alpha},\boldsymbol{\beta})\, p(\boldsymbol{\theta},\boldsymbol{\tau}\mid\boldsymbol{\Sigma}_P)\, p(\mathbf{d},\boldsymbol{\beta}\mid\boldsymbol{\mu}_I,\boldsymbol{\Sigma}_I)\, p(\mathbf{a})\, p(\boldsymbol{\alpha})\, p(\boldsymbol{\mu}_I,\boldsymbol{\Sigma}_I)\, p(\boldsymbol{\Sigma}_P).$$
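For completeness, the following JAGS sketch shows how the pieces above might be assembled (possible contents of the hypothetical file mirt_rt.jags referenced earlier; the priors are simplified stand-ins, e.g., a uniform rather than the inverse-gamma-based prior for the time discriminations):

```jags
model {
  for (p in 1:P) {
    # Cholesky construction (Equation 9): unit-variance abilities, free tau
    for (k in 1:3) { z[p, k] ~ dnorm(0, 1) }
    theta[p, 1] <- z[p, 1]
    theta[p, 2] <- rho12 * z[p, 1] + sqrt(1 - pow(rho12, 2)) * z[p, 2]
    tau[p] <- l31 * z[p, 1] + l32 * z[p, 2] + l33 * z[p, 3]
    for (i in 1:I) {
      # Level 1 accuracy: compensatory 2-PL MIRT (Equation 1)
      logit(prob[p, i]) <- inprod(a[i, 1:2], theta[p, 1:2]) + d[i]
      U[p, i] ~ dbern(prob[p, i])
      # Level 1 speed: lognormal RTs (Equation 2); precision = alpha^2
      logT[p, i] ~ dnorm(beta[i] - tau[p], pow(alpha[i], 2))
    }
  }
  for (i in 1:I) {
    # Level 2 items: location and time intensity (Equations 5 and 7)
    psi[i, 1:2] ~ dmnorm(mu_I[1:2], Omega_I[1:2, 1:2])
    d[i]    <- psi[i, 1]
    beta[i] <- psi[i, 2]
    for (k in 1:2) {
      a0[i, k] ~ dunif(0, 3)          # uniform slope prior over [0, 3]
      a[i, k] <- a0[i, k] * Q[i, k]   # zeroed by the design matrix
    }
    alpha[i] ~ dunif(0.1, 3)          # simplified time-discrimination prior
  }
  # Hyperpriors (simplified): vague normals and a Wishart on the precision
  mu_I[1] ~ dnorm(0, 0.01)
  mu_I[2] ~ dnorm(0, 0.01)
  Omega_I[1:2, 1:2] ~ dwish(R_I[1:2, 1:2], 3)
  rho12 ~ dunif(-1, 1)
  l31 ~ dnorm(0, 1)
  l32 ~ dnorm(0, 1)
  l33 ~ dunif(0, 2)
}
```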
Simulation Study
In this study, simulated data were used to evaluate the recovery of both item and person parameters. Specifically, two levels of the number of examinees, (a) 500 and (b) 1,500, were simulated following previous methodological investigations (Bolt & Lall, 2003; Fox et al., 2014; Wang & Nydick, 2015). Two test lengths were considered: (a) 15 and (b) 30 items. Again, these values were chosen to align with previous MIRT research (Bolt & Lall, 2003; Wang & Nydick, 2015) as well as to conform to the number of items found in large-scale assessments. Response data were generated from a two-dimensional MIRT model (see Equation 1) using the design matrix in Equation 6, and RTs were generated from the RT model in Equation 2. Examinees' latent attribute parameters $\theta_{p1}$, $\theta_{p2}$, and $\tau_p$ were generated from a multivariate normal distribution with zero mean vector and a specified correlation structure. The correlation between each latent ability $\theta$ and speed $\tau$ took two different values: .3 and −.3. These two values were chosen based on empirical data and previous studies. For instance, a moderate negative correlation between ability and person speediness was reported for the PISA 2012 mathematics data set (Zhan, Jiao, & Liao, 2017), and similar findings are supported by other methodological studies (Bolsinova et al., 2017; van der Linden, Klein Entink, & Fox, 2010); however, other studies have identified positive correlations between latent ability and speediness (Bolt & Lall, 2003; Meng et al., 2015; van der Linden, 2006a). The correlation between the two latent abilities $\theta_1$ and $\theta_2$ was fixed at .3, following the simulation design proposed by Fox et al. (2014) for evaluating a MIRT model. Item parameters $d_i$ and $\beta_i$ were generated from a bivariate normal distribution: the means of $d_i$ and $\beta_i$ were set to 0 and 4 (on the logit and log-seconds scales, respectively), their corresponding variances were set to 1 and 0.25, and their covariance was fixed at 0 (Bolt & Lall, 2003; Fox & Marianti, 2016; Zhan et al., 2017). The item slopes, $a_{ik}$, for the two dimensions were drawn from uniform distributions with interval endpoints of 0.7 and 1.8; these discrimination values correspond to the design proposed by Wang and Nydick (2015). The time discrimination parameter $\alpha_i$ was also drawn from a uniform distribution, with range from 1 to 2 (Fox & Marianti, 2016; van der Linden, 2006a). Table 1 presents the population values used to generate the item parameters for both the MIRT and RT model components.
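The full generating scheme can be condensed into a short R sketch (constants follow the stated design; all object names are ours, and drawing $d_i$ and $\beta_i$ independently yields the specified zero covariance):

```r
set.seed(1)
P <- 500; I <- 30
Q <- matrix(1, I, 2); Q[1:2, 2] <- 0; Q[3:4, 1] <- 0    # Equation 6
a     <- matrix(runif(I * 2, 0.7, 1.8), I, 2) * Q       # slopes
d     <- rnorm(I, mean = 0, sd = 1)                     # locations
beta  <- rnorm(I, mean = 4, sd = sqrt(0.25))            # time intensities
alpha <- runif(I, 1, 2)                                 # time discriminations

Sigma_P <- matrix(c( 1.0,  0.3, -0.3,
                     0.3,  1.0, -0.3,
                    -0.3, -0.3,  1.0), 3, 3)            # the -.3 condition
person <- MASS::mvrnorm(P, mu = rep(0, 3), Sigma = Sigma_P)

eta  <- person[, 1:2] %*% t(a) + matrix(d, P, I, byrow = TRUE)
U    <- matrix(rbinom(P * I, 1, plogis(eta)), P, I)     # item responses
logT <- matrix(rnorm(P * I, matrix(beta, P, I, byrow = TRUE) - person[, 3],
                     matrix(1 / alpha, P, I, byrow = TRUE)), P, I)
```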
Table 1.
Generated Item Parameters for the Compensatory Multidimensional Logistic Model and RT Model.
| Item | $a_{i1}$ (MIRT) | $a_{i2}$ (MIRT) | $d_i$ (MIRT) | $\beta_i$ (RT) | $\alpha_i$ (RT) |
|---|---|---|---|---|---|
| 1 | 1.01 | 0.00 | 1.42 | 3.85 | 1.26 |
| 2 | 1.42 | 0.00 | −1.54 | 3.83 | 1.45 |
| 3 | 0.00 | 1.56 | 0.30 | 3.77 | 1.46 |
| 4 | 0.00 | 0.77 | 1.35 | 3.95 | 1.17 |
| 5 | 1.17 | 1.07 | 0.08 | 3.71 | 1.60 |
| 6 | 1.57 | 1.73 | 0.90 | 4.28 | 1.38 |
| 7 | 1.53 | 1.31 | −0.31 | 4.05 | 1.90 |
| 8 | 0.90 | 0.98 | −1.00 | 3.79 | 1.37 |
| 9 | 0.77 | 1.74 | −2.74 | 4.25 | 1.02 |
| 10 | 1.37 | 1.46 | 0.21 | 3.71 | 1.75 |
| 11 | 0.79 | 1.15 | 0.50 | 3.99 | 1.54 |
| 12 | 1.20 | 1.78 | 0.04 | 4.35 | 1.66 |
| 13 | 1.32 | 0.81 | −0.67 | 4.27 | 1.65 |
| 14 | 0.93 | 0.76 | −1.08 | 4.09 | 1.82 |
| 15 | 1.11 | 1.65 | −2.53 | 3.70 | 1.06 |
| 16 | 1.21 | 1.00 | −2.34 | 3.97 | 1.94 |
| 17 | 1.50 | 1.58 | −0.74 | 3.77 | 1.61 |
| 18 | 1.59 | 1.59 | 1.42 | 4.10 | 1.44 |
| 19 | 1.33 | 1.74 | −0.19 | 3.82 | 1.49 |
| 20 | 1.25 | 1.56 | −0.47 | 4.54 | 1.65 |
| 21 | 1.14 | 0.74 | 0.44 | 3.79 | 1.29 |
| 22 | 1.38 | 1.11 | 1.06 | 4.06 | 1.57 |
| 23 | 1.41 | 1.50 | −0.64 | 4.23 | 1.74 |
| 24 | 1.76 | 0.89 | 1.05 | 4.09 | 1.08 |
| 25 | 0.94 | 1.29 | 1.58 | 4.00 | 1.95 |
| 26 | 1.26 | 1.62 | 0.01 | 4.17 | 1.88 |
| 27 | 1.60 | 1.51 | 0.23 | 3.73 | 1.33 |
| 28 | 1.56 | 0.80 | −0.86 | 4.58 | 1.95 |
| 29 | 1.24 | 1.13 | −0.75 | 4.19 | 1.18 |
| 30 | 1.75 | 1.26 | 0.83 | 4.18 | 1.61 |
Note. MIRT = multidimensional item response theory; RT = response time.
For the 15-item condition, only the first 15 of the 30 items were used. Twenty-five replications were generated for each combination of test length, correlation, and number of test-takers, yielding $2 \times 2 \times 2 = 8$ fully crossed conditions. Two models, the joint MIRT-RT model and a MIRT model without RTs, were estimated and compared across all simulation conditions. For convenience in reporting the results, the simulated conditions are summarized in Table 2.
Table 2.
Summary of Simulation Conditions.
| Simulation conditions | Number of examinees | Test length | Correlation between $\boldsymbol{\theta}$ and $\tau$ |
|---|---|---|---|
| C1 | 500 | 15 | −.3 |
| C2 | 500 | 15 | .3 |
| C3 | 500 | 30 | −.3 |
| C4 | 500 | 30 | .3 |
| C5 | 1,500 | 15 | −.3 |
| C6 | 1,500 | 15 | .3 |
| C7 | 1,500 | 30 | −.3 |
| C8 | 1,500 | 30 | .3 |
Root mean square error (RMSE) was used to evaluate model parameter recovery. Two types of model parameters are recovered: item parameters and person parameters. RMSEs are calculated separately for an item parameter, $\xi_i$, and a person parameter, $\omega_p$, as

$$\mathrm{RMSE}(\xi_i)=\sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\xi}_i^{(r)}-\xi_i\right)^2}, \qquad \mathrm{RMSE}(\omega_p)=\sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\omega}_p^{(r)}-\omega_p\right)^2},$$

where $\xi_i$ is the true model parameter for item $i$ and $\hat{\xi}_i^{(r)}$ is the estimated model parameter for item $i$ in replicate $r$; $\omega_p$ and $\hat{\omega}_p^{(r)}$ are the population value of the person parameter and its estimate from the $r$th replication for a given sample size of test-takers, respectively.
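A compact R helper for these computations (our own naming) might be:

```r
# RMSE per parameter across R replications.
# est: R x n matrix of estimates (one column per parameter);
# truth: length-n vector of generating values.
rmse <- function(est, truth) {
  sqrt(colMeans(sweep(est, 2, truth)^2))
}
# Example: mean(rmse(d_hat, d)) averages the RMSE over the item locations.
```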
Results
Table 3 shows the RMSE for the item and person parameter estimates of the joint MIRT-RT model, as well as the stand-alone MIRT model. In general, the parameters were recovered well under the joint modeling approach, given that all RMSE values are lower than 0.4. Notably, the RMSEs of the item slope and item location parameters are lower than 0.26 (see Figure 2, provided in the online supplement). Several general trends are observed for item parameter recovery. First, the RMSEs of the item parameters from jointly modeling the MIRT model with RTs are smaller than those from the MIRT model alone in most conditions; in particular, the RMSEs of the item variance components are consistently lower than the MIRT-only values. Second, the RMSEs of the item parameters, mainly the item slopes, are lower when the sample size increases from 500 to 1,500. Third, the RMSEs of the item parameters are lower when the test length increases from 15 to 30 items. Table 3 also indicates that RMSE (efficiency) is relatively smaller for the joint modeling approach than for the single MIRT model. In other words, RTs as ancillary information can improve the precision of item parameter estimation in terms of RMSE.
Table 3.
Parameter Recovery Results: RMSE for the Joint MIRT-RT Model and MIRT Model.
| Parameter | Model | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
|---|---|---|---|---|---|---|---|---|---|
| Item parameters | | | | | | | | | |
| $a_{i1}$ | MIRT-RT | 0.22 | 0.21 | 0.24 | 0.20 | 0.16 | 0.14 | 0.15 | 0.12 |
| | MIRT | 0.25 | 0.23 | 0.24 | 0.22 | 0.15 | 0.15 | 0.14 | 0.14 |
| $a_{i2}$ | MIRT-RT | 0.25 | 0.22 | 0.26 | 0.23 | 0.19 | 0.14 | 0.15 | 0.13 |
| | MIRT | 0.27 | 0.24 | 0.27 | 0.25 | 0.20 | 0.14 | 0.14 | 0.14 |
| $d_i$ | MIRT-RT | 0.14 | 0.14 | 0.13 | 0.13 | 0.09 | 0.08 | 0.08 | 0.08 |
| | MIRT | 0.14 | 0.15 | 0.14 | 0.14 | 0.09 | 0.08 | 0.08 | 0.08 |
| $\beta_i$ | MIRT-RT | 0.03 | 0.03 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 | 0.02 |
| | MIRT | — | — | — | — | — | — | — | — |
| $\alpha_i$ | MIRT-RT | 0.05 | 0.05 | 0.05 | 0.05 | 0.03 | 0.03 | 0.03 | 0.03 |
| | MIRT | — | — | — | — | — | — | — | — |
| Item-domain (Level 2) parameters | | | | | | | | | |
| | MIRT-RT | 0.32 | 0.34 | 0.19 | 0.20 | 0.25 | 0.23 | 0.11 | 0.12 |
| | MIRT | 0.34 | 0.36 | 0.20 | 0.21 | 0.33 | 0.32 | 0.14 | 0.15 |
| | MIRT-RT | 0.11 | 0.12 | 0.15 | 0.15 | 0.32 | 0.31 | 0.26 | 0.26 |
| | MIRT | — | — | — | — | — | — | — | — |
| | MIRT-RT | 0.02 | 0.02 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 |
| | MIRT | — | — | — | — | — | — | — | — |
| | MIRT-RT | 0.05 | 0.05 | 0.02 | 0.03 | 0.04 | 0.03 | 0.02 | 0.02 |
| | MIRT | 0.04 | 0.04 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.04 |
| | MIRT-RT | 0.03 | 0.03 | 0.03 | 0.03 | 0.01 | 0.01 | 0.01 | 0.01 |
| | MIRT | — | — | — | — | — | — | — | — |
| Person-domain (Level 2) parameters | | | | | | | | | |
| $\sigma^2_{\theta_1}$ (fixed at 1) | MIRT-RT | — | — | — | — | — | — | — | — |
| | MIRT | — | — | — | — | — | — | — | — |
| $\sigma^2_{\theta_2}$ (fixed at 1) | MIRT-RT | — | — | — | — | — | — | — | — |
| | MIRT | — | — | — | — | — | — | — | — |
| $\sigma^2_{\tau}$ | MIRT-RT | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| | MIRT | — | — | — | — | — | — | — | — |
| $\sigma_{\theta_1\theta_2}$ | MIRT-RT | 0.14 | 0.12 | 0.06 | 0.07 | 0.08 | 0.07 | 0.10 | 0.09 |
| | MIRT | 0.22 | 0.17 | 0.11 | 0.11 | 0.12 | 0.09 | 0.12 | 0.11 |
| $\sigma_{\theta_1\tau}$ | MIRT-RT | 0.04 | 0.07 | 0.05 | 0.07 | 0.03 | 0.04 | 0.03 | 0.03 |
| | MIRT | — | — | — | — | — | — | — | — |
| $\sigma_{\theta_2\tau}$ | MIRT-RT | 0.04 | 0.04 | 0.04 | 0.06 | 0.05 | 0.02 | 0.04 | 0.03 |
| | MIRT | — | — | — | — | — | — | — | — |
Note. RMSE = root mean square error; MIRT = multidimensional item response theory; RT = response time.
Table 3 also presents the recovery of the Level 2 variance–covariance parameters of the item domain and person domain (see Figure 1). This is of interest because the second-level covariance components between different item parameters reflect the dependencies between responses and RTs. In general, the recovery of the item variance–covariance parameters is quite good for the MIRT-RT model, with RMSEs in the range of 0.01 to 0.34. The recovery of the person-side variance and covariance parameters is also satisfactory, with RMSE values lower than 0.4.
Overall, the obtained results indicate that the joint modeling approach outperformed the single MIRT model. Regarding RMSE under different test lengths and sample sizes, the findings suggest that the joint modeling approach can improve the accuracy and precision of both item and person parameter estimates by incorporating RTs as ancillary information in MIRT model estimation.
Analysis of PISA 2015 Computer-Based Mathematics Data
Data Set Description
The PISA 2015 computer-based mathematics data were used to fit both the MIRT-RT model and the MIRT model. The PISA 2015 data set is well suited for illustration for several reasons: a 2-PL MIRT model was used to scale test-takers' responses, matching the proposed MIRT-RT model, and RTs were collected. After eliminating examinees with missing data, the sample comprises responses to 11 items. The item IDs are 74 Q 01, 155 Q 01, 411 Q 01, 411 Q 02, 442 Q 02, 305 Q 01, 496 Q 01, 496 Q 02, 603 Q 01, 564 Q 01, and 564 Q 02. Based on the published codebook for the PISA 2015 mathematics framework, the 11 selected items measure two dimensions: (a) employing mathematical concepts, facts, procedures, and reasoning and (b) societal-context knowledge. To successfully apply mathematical reasoning to solve the test items, students are also required to have certain societal knowledge, such as of voting systems, political systems, and economics. Thus, the two latent ability dimensions are compensatory with each other in solving the items. The two-dimensional slope parameter-loading pattern is displayed as
| (10) |
The RTs were transformed to a logarithmic scale before fitting the RT model. The Deviance Information Criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002) was calculated to compare overall model fit between the joint MIRT-RT model and the separately estimated MIRT and RT models. The results are displayed in Table 5.
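With R2jags fits, this comparison amounts to extracting and summing the DIC values (a sketch; the fit objects are hypothetical):

```r
# DIC comparison: joint model versus separately estimated MIRT and RT models.
dic_joint    <- fit_mirt_rt$BUGSoutput$DIC
dic_separate <- fit_mirt$BUGSoutput$DIC + fit_rt$BUGSoutput$DIC
dic_joint < dic_separate  # TRUE would favor the joint MIRT-RT model
```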
Table 4 shows the item parameter estimates, and Table 5 the variance–covariance estimates. The item difficulty estimates ranged from −0.78 to 1.36, and the time intensity estimates ranged from 3.75 to 5.05 on the logarithmic scale. The estimated covariance between item difficulty and item time intensity was −0.16 (correlation = −0.375), with a 95% credible interval of [−0.537, 0.085]; because 0 is a plausible value in the interior of this interval, item difficulty was not associated with item time intensity in this data set. The person covariances $\sigma_{\theta_1\tau}$ and $\sigma_{\theta_2\tau}$ were estimated to be −0.07 (95% credible interval: [−0.104, −0.052]) and −0.11 (95% credible interval: [−0.142, −0.083]), respectively. These values indicate a slight negative dependence between the latent abilities and speediness, in line with other studies that have reported negative correlations between latent ability and latent speediness (e.g., Fox & Marianti, 2016; Klein Entink, Kuhn, Hornke, & Fox, 2009; van der Linden & Fox, 2015). This could be a proxy for a lack of motivation among students required to take this low-stakes assessment (Wise & Kong, 2005), although there are other plausible explanations; for example, higher performing examinees may work slowly and deliberately. In terms of overall model fit, the joint MIRT-RT model was supported by the DIC, for which lower values are preferred, as shown in Table 5: the DIC of the MIRT-RT model was 39,814.1, whereas the summed DIC of the separately estimated MIRT and RT models was 43,114.7, suggesting that jointly modeling item responses and RTs fits the data better than modeling them separately.
Table 4.
Item Parameter Estimates: PISA 2015 Computer-Based Mathematical Literacy Standard Unit Items.
| Parameter | MIRT-RT: MIRT component | MIRT-RT: RT component | Separate MIRT | Separate RT |
|---|---|---|---|---|
| Slopes $a_{i1}$ and time discriminations $\alpha_i$ | | | | |
| Item 1 | 0.38 (0.28) | 6.34 (0.31) | 0.41 (0.28) | 5.75 (0.29) |
| Item 2 | 0.77 (0.10) | 1.00 (0.04) | 0.77 (0.10) | 0.99 (0.04) |
| Item 3 | 1.58 (0.16) | 1.61 (0.07) | 1.55 (0.17) | 1.62 (0.08) |
| Item 4 | 0.13 (0.08) | 4.31 (0.20) | 0.13 (0.08) | 4.53 (0.22) |
| Item 5 | 1.66 (0.16) | 8.63 (0.44) | 1.66 (0.17) | 8.78 (0.53) |
| Item 6 | 0.97 (0.35) | 4.87 (0.23) | 1.10 (0.31) | 4.98 (0.25) |
| Item 7 | 0.68 (0.09) | 1.23 (0.05) | 0.70 (0.10) | 1.25 (0.05) |
| Item 8 | 0.63 (0.09) | 1.71 (0.08) | 0.65 (0.10) | 1.69 (0.08) |
| Item 9 | 0.69 (0.10) | 3.73 (0.18) | 0.71 (0.12) | 3.79 (0.19) |
| Item 10 | 0.91 (0.12) | 1.60 (0.07) | 0.96 (0.14) | 1.61 (0.07) |
| Item 11 | 0.91 (0.32) | 1.77 (0.08) | 0.90 (0.35) | 1.78 (0.08) |
| Second-dimension slopes $a_{i2}$ | | | | |
| | 0.10 (0.07) | | 0.11 (0.08) | |
| | 0.43 (0.32) | | 0.30 (0.28) | |
| | 0.70 (0.10) | | 0.72 (0.11) | |
| Locations $d_i$ and time intensities $\beta_i$ | | | | |
| Item 1 | 0.92 (0.07) | 3.75 (0.01) | 0.91 (0.08) | 3.75 (0.05) |
| Item 2 | 1.36 (0.09) | 4.00 (0.03) | 1.34 (0.09) | 4.00 (0.05) |
| Item 3 | 0.53 (0.08) | 4.77 (0.03) | 0.54 (0.08) | 4.78 (0.05) |
| Item 4 | 0.10 (0.07) | 3.96 (0.02) | 0.10 (0.07) | 3.96 (0.05) |
| Item 5 | −0.78 (0.09) | 5.05 (0.01) | −0.77 (0.10) | 5.05 (0.04) |
| Item 6 | −0.19 (0.06) | 4.26 (0.02) | −0.19 (0.06) | 4.26 (0.05) |
| Item 7 | 0.05 (0.09) | 4.24 (0.03) | 0.05 (0.09) | 4.24 (0.05) |
| Item 8 | 1.29 (0.11) | 4.12 (0.02) | 1.30 (0.11) | 4.12 (0.05) |
| Item 9 | −0.44 (0.07) | 4.57 (0.02) | −0.44 (0.07) | 4.57 (0.05) |
| Item 10 | −0.05 (0.07) | 3.87 (0.03) | −0.06 (0.07) | 3.87 (0.05) |
| Item 11 | −0.02 (0.07) | 4.08 (0.02) | −0.03 (0.07) | 4.08 (0.05) |

Note. Entries are posterior averages with posterior standard deviations in parentheses. PISA = Programme for International Student Assessment; MIRT = multidimensional item response theory; RT = response time.
Table 5.
Person and Item Variance and Covariance Estimates for MIRT-RT and MIRT Models: PISA 2015 Computer-Based Mathematical Literacy Standard Unit Items.
| Parameter | MIRT-RT: Average | MIRT-RT: 95% CI | Separate MIRT/RT: Average | Separate MIRT/RT: 95% CI |
|---|---|---|---|---|
| Item variance–covariance parameters | | | | |
| $\sigma^2_{d}$ | 0.65 | [0.268, 1.539] | | |
| $\sigma^2_{\beta}$ | 0.28 | [0.118, 0.653] | | |
| $\sigma_{d\beta}$ | −0.16 | [−0.537, 0.085] | | |
| Person variance–covariance parameters | | | | |
| $\sigma^2_{\theta_1}$ | 1 | — | 1 | — |
| $\sigma_{\theta_1\theta_2}$ | 0.88 | [0.749, 0.951] | 0.85 | [0.707, 0.983] |
| $\sigma^2_{\theta_2}$ | 1 | — | 1 | — |
| $\sigma^2_{\tau}$ | 0.06 | [0.052, 0.066] | | |
| $\sigma_{\theta_1\tau}$ | −0.07 | [−0.104, −0.052] | | |
| $\sigma_{\theta_2\tau}$ | −0.11 | [−0.142, −0.083] | | |
| Information criteria | | | | |
| DIC | 39,814.1 | | 16,854.2 + 26,260.5 = 43,114.7 | |

Note. MIRT = multidimensional item response theory; RT = response time; PISA = Programme for International Student Assessment; CI = credible interval; DIC = Deviance Information Criterion.
Discussion
As is becoming increasingly evident, gaining a more comprehensive understanding of test-takers requires collecting and analyzing information beyond their responses to items. To this end, computer-based testing (CBT) permits the gathering of RPD, like RTs, that can be used to make more accurate inferences regarding item parameters and test-takers' abilities (Fox & Marianti, 2016). Incorporating this type of supplementary information has been shown to improve the estimation of item and person parameters in IRT (van der Linden et al., 2010) while providing insights regarding the behaviors of test-takers that cannot be identified from item response information in isolation. A MIRT-RT model within a hierarchical framework was proposed to jointly model RTs and the multidimensional latent constructs underlying item responses. The latter was accomplished by specifying a 2-PL MIRT model for the item responses. A lognormal model was chosen for the RTs, and both were jointly modeled at Level 1 of the hierarchy. The Level 2 model incorporated the mean vectors and variance–covariance structures for the item and person parameters, respectively. Moreover, the joint modeling approach allows evaluation of the dependencies among the item parameters, reflected in an item-domain model. Estimation was carried out within a Bayesian framework using an MCMC algorithm.
Results from the small simulation indicate that RTs are useful ancillary information for improving the estimation of MIRT model parameters. In general, the MIRT-RT model yields more accurate parameter estimates than modeling item responses and RTs independently. This type of joint specification has additional benefits: both the dependencies among item parameters and those among person parameters can be assessed, which may provide potential avenues of investigation for practitioners and substantive researchers. Of course, one limitation of the simulation study is its scope. Although the conditions and levels used have a theoretical grounding in the methodological literature and correspond to reasonable conditions found in practice, they are not exhaustive, and a more comprehensive investigation is warranted.
Finally, several model extensions could be investigated. A logical next elaboration is to a noncompensatory MIRT model and its functional relation to RTs; noncompensatory IRT models are notoriously more challenging to estimate, so care will need to be taken in specifying the model and in the thoughtful consideration of prior distributions if a Bayesian estimation approach is enacted. A second elaboration is to response models for polytomous or graded responses, which is important for psychological testing where items are Likert-scaled. Third, the current assumption that working speed is constant over the entire test could be relaxed. This may provide localized information regarding test-takers, as focus may center on specific items of an assessment rather than across all items.
Supplemental Material

Supplemental material, Kaiwen_APM_suplimentary, for "Joint Modeling of Compensatory Multidimensional Item Responses and Response Times" by Kaiwen Man, Jeffrey R. Harring, Hong Jiao, and Peida Zhan in Applied Psychological Measurement

Supplemental material, Online_supplimentary_Kaiwen, for "Joint Modeling of Compensatory Multidimensional Item Responses and Response Times" by Kaiwen Man, Jeffrey R. Harring, Hong Jiao, and Peida Zhan in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material: Supplemental material for this article is available online.
ORCID iD: Peida Zhan
https://orcid.org/0000-0002-6890-7691
References
- Ackerman, T. A. (1989). Unidimensional IRT calibration of compensatory and noncompensatory multidimensional items. Applied Psychological Measurement, 13, 113-127.
- Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-23.
- Bolsinova, M., De Boeck, P., & Tijmstra, J. (2017). Modelling conditional dependence between response time and accuracy. Psychometrika, 82, 1126-1148.
- Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39, 331-348.
- Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27, 395-414.
- De Boeck, P., Chen, H., & Davison, M. (2017). Spontaneous and imposed speed of cognitive test responses. British Journal of Mathematical and Statistical Psychology, 70, 225-237.
- Ercikan, K., & Pellegrino, J. W. (2017). Collecting, analyzing and interpreting response time, eye-tracking, and log data. In K. Ercikan & J. W. Pellegrino (Eds.), Validation of score meaning for the next generation of assessments: The use of response processes. New York, NY: Taylor & Francis.
- Fox, J.-P., Entink, R. K., & Avetisyan, M. (2014). Compensatory and noncompensatory multidimensional randomized item response models. British Journal of Mathematical and Statistical Psychology, 67, 133-152.
- Fox, J.-P., & Marianti, S. (2016). Joint modeling of ability and differential speed using responses and response times. Multivariate Behavioral Research, 51, 540-553.
- Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis. New York, NY: Chapman & Hall.
- Jiao, H., Kamata, A., Wang, S., & Jin, Y. (2012). A multilevel testlet model for dual local dependence. Journal of Educational Measurement, 49, 82-100.
- Klein Entink, R. H., Kuhn, J. T., Hornke, L. F., & Fox, J.-P. (2009). Evaluating cognitive theory: A joint modeling approach using responses and response times. Psychological Methods, 14, 54-75.
- Man, K., & Harring, J. R. (2019). Negative binomial models for visual fixation counts on test items. Educational and Psychological Measurement. doi:10.1177/0013164418824148
- Man, K., Harring, J. R., Ouyang, Y., & Thomas, S. L. (2018). Response time based nonparametric Kullback-Leibler divergence measure for detecting aberrant test-taking behavior. International Journal of Testing, 18, 155-177. doi:10.1080/15305058.2018.1429446
- Marianti, S., Fox, J.-P., Avetisyan, M., Veldkamp, B., & Tijmstra, J. (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behavioral Statistics, 39, 426-451.
- Meng, X. B., Tao, J., & Chang, H. H. (2015). A conditional joint modeling approach for locally dependent item responses and response times. Journal of Educational Measurement, 52, 1-27.
- Meyer, J. P. (2010). A mixture Rasch model with item response time components. Applied Psychological Measurement, 34, 521-538.
- Plummer, M. (2015). JAGS: Just Another Gibbs Sampler (Version 4.0.0) [Computer software]. Retrieved from http://mcmc-jags.sourceforge.net/
- R Core Team. (2016). R: A language and environment for statistical computing [Computer software]. Vienna, Austria: R Foundation for Statistical Computing. Available from https://www.R-project.org
- Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401-412.
- Reckase, M. D. (2009). Multidimensional item response theory: Statistics for social and behavioral sciences. New York, NY: Springer.
- van Rijn, P. W., & Ali, U. S. (2017). A comparison of item response models for accuracy and speed of item responses with applications to adaptive testing. British Journal of Mathematical and Statistical Psychology, 70, 317-345.
- Roskam, E. E. (1997). Models for speed and time-limit tests. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 187-208). New York, NY: Springer.
- Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 64, 583-639.
- Su, Y. S., & Yajima, M. (2015). R2jags: Using R to run JAGS (Version 0.5) [Computer software]. Retrieved from https://CRAN.R-project.org/package=R2jags
- Thissen, D. (1983). Timed testing: An approach using item response theory. In D. J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing (pp. 179-203). New York, NY: Academic Press.
- van der Linden, W. J. (2006a). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287-308.
- van der Linden, W. J. (2006b). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181-204.
- van der Linden, W. J., & Fox, J.-P. (2015). Joint hierarchical modeling of responses and response times. In W. J. van der Linden (Ed.), Handbook of item response theory: Vol. 1. Models (pp. 481-501). Boca Raton, FL: Chapman & Hall/CRC Press.
- van der Linden, W. J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73, 365-384.
- van der Linden, W. J., Klein Entink, R. H., & Fox, J.-P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34, 327-347.
- Wang, T., & Nydick, S. W. (2015). Comparing two algorithms for calibrating the restricted non-compensatory multidimensional IRT model. Applied Psychological Measurement, 39, 119-134.
- Wise, S. L., & Kong, X. J. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18, 163-183.
- Zhan, P., Jiao, H., & Liao, D. (2017). Cognitive diagnosis modelling incorporating item response times. British Journal of Mathematical and Statistical Psychology, 71, 262-286.