Abstract
So far, modeling approaches for not-reached items have considered one single underlying process. However, missing values at the end of a test can occur for a variety of reasons. On the one hand, examinees may not reach the end of a test due to time limits and lack of working speed. On the other hand, examinees may not attempt all items and quit responding due to, for example, fatigue or lack of motivation. We use response times retrieved from computerized testing to distinguish missing data due to lack of speed from missingness due to quitting. On the basis of this information, we present a new model that disentangles and simultaneously models different missing data mechanisms underlying not-reached items. The model (a) supports a more fine-grained understanding of the processes underlying not-reached items and (b) makes it possible to separate different sources describing test performance. In a simulation study, we evaluate parameter recovery of the proposed model. In an empirical study, we show what insights can be gained regarding test-taking behavior using this model.
Keywords: response times, not-reached items, missing data, item response theory, survival
In large-scale assessments (LSAs), examinees do not always attempt all items they were assigned to answer. When an examinee fails to attempt a sequence of items presented at the end of a test, the resulting missing responses are referred to as not-reached items (NRIs). NRIs can occur for a variety of reasons. Examinees may not reach the end of a test due to lack of speed when tests are administered with time limits. This is supported by results from experimental research suggesting that increased test-taking time results in lower NRI rates (e.g., Mandinach, Bridgeman, Cahalan-Laitusis, & Trapani, 2005; Wild & Durso, 1979). However, the onset of NRIs does not seem to depend solely on test time. Examinees may quit the assessment prematurely due to, for example, fatigue or lack of motivation. This is particularly the case in low-stakes assessments where low motivation is likely to affect examinee test-taking behavior (Chen, von Davier, Yamamoto, & Kong, 2015; Cosgrove, 2011; Liu, Rios, & Borden, 2015; Wise & DeMars, 2005). Indeed, in LSAs, NRIs are even observed in assessments administered without time constraints, such as in the Programme for the International Assessment of Adult Competencies (PIAAC; Organisation for Economic Co-operation and Development [OECD], 2013).
NRIs occurring due to lack of speed and NRIs occurring due to quitting represent different types of missingness processes that tend to occur under different testing situations, correspond to different test-taking strategies, and might be related differently to ability. Thus, disentangling and modeling different types of NRIs can be beneficial for understanding examinee performance as well as for informing decisions regarding the adequate treatment of missing values due to NRIs. In this context, considering additional data retrieved from computer-based assessment facilitates the understanding of examinee behavior and thus of potential mechanisms underlying NRIs. For instance, cumulative response times (RTs) contain information on the time passed up to the last item attempted before ending the assessment. This makes it possible to distinguish examinees who worked at a slow pace and reached the time limit before reaching the end of the test from those who displayed cumulative RTs far below the time limit without attempting all items administered (Pohl, Ulitzsch, & von Davier, 2019). Based on the information contained in RT data, Pohl et al. (2019) illustrated that, within the same data set, NRIs plausibly occur due to different mechanisms, that is, lack of speed and quitting. In this article, we argue that these are potentially different mechanisms that should be modeled as such. We propose a framework to disentangle and simultaneously model these missingness mechanisms.
Dealing With Not-Reached Items in Large-Scale Assessments
Current practices for handling NRIs in LSAs are rather heterogeneous. While in the majority of LSAs NRIs are either ignored (e.g., in the National Educational Panel Study [NEPS]; Pohl & Carstensen, 2012) or scored as incorrect, mixed approaches exist. For instance, in the Trends in International Mathematics and Science Study (TIMSS) and the Progress in International Reading Literacy Study (PIRLS), NRIs are ignored for item parameter estimation and scored as incorrect for person parameter estimation (Foy, 2017, 2018). In PIAAC, NRIs are ignored if sufficient information about examinee proficiency is available, that is, if there are more than five item responses per domain. Otherwise, the examinees’ self-stated reasons for not completing the assessment are considered when handling missing responses due to NRIs. For examinees quitting the assessment after responding to fewer than five items per domain, all NRIs are treated as incorrect if the self-stated reason for not responding is related to cognitive skills. If examinees give reasons unrelated to competence, NRIs are ignored in the analysis (OECD, 2013). This treatment of NRIs acknowledges that NRIs can occur due to different mechanisms. The approach is, however, rather heuristic in that it (a) relies on self-stated reasons for not reaching the end of the test and (b) distinguishes different types of NRIs only for examinees who responded to fewer than five items.
Scoring NRIs as wrong assumes the probability to solve an NRI to be zero—regardless of the examinee’s ability level (see Lord, 1983; Rose, 2013; Rose, von Davier, & Nagengast, 2017). Ignoring NRIs assumes ignorability of the missingness mechanism. For ignorability to hold, data need to be missing at random; that is, missingness needs to be conditionally independent of the unobserved data given the observed data. Furthermore, the (unobserved) parameters governing the distribution of NRIs need to be distinct from ability (Mislevy & Wu, 1996; Rubin, 1976). There is, however, a substantial body of research suggesting that not reaching the end of a test is indeed related to ability (Debeer, Janssen, & De Boeck, 2017; Glas & Pimentel, 2008; Lawrence, 1993; List, Köller, & Nagy, 2017; Pohl, Gräfe, & Rose, 2014; Rose, von Davier, & Xu, 2010). This indicates that the mechanisms underlying NRIs are nonignorable. Not properly accounting for the mechanisms that produce nonignorable missing data poses a threat to valid inferences and may potentially lead to biased person and item parameter estimates or distort relationships between ability and explanatory variables as well as country rankings (Glas & Pimentel, 2008; Köhler, Pohl, & Carstensen, 2017; Pohl et al., 2014; Rose, 2013). To properly account for nonignorable NRIs, a model for the mechanisms underlying their occurrence is needed.
Model-Based Approaches for Nonignorable Missing Values
In recent years, model-based approaches for handling nonignorable NRIs have been developed. In this class of models, information about NRIs is integrated into item response theory (IRT) models—either employing a latent or manifest variable—and thus accounted for when estimating ability.
For modeling ability, customary IRT models are employed. In the case of a Rasch model, the probability of a correct response on response indicator U_pi, containing person p’s response on item i, can be modeled as a function of person ability θ_p and the item’s difficulty b_i:
P(U_pi = 1 | θ_p, b_i) = exp(θ_p − b_i) / [1 + exp(θ_p − b_i)]   (1)
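As a brief illustration of Equation 1 (a sketch in Python rather than the R/JAGS used later in the article; the function name is ours):

```python
import math

def rasch_prob(theta, b):
    """P(U = 1 | theta, b) = exp(theta - b) / (1 + exp(theta - b)) (Equation 1)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

An examinee whose ability matches the item difficulty has a solution probability of .50; the probability increases with θ_p and decreases with b_i.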
Missing values due to not reaching the end of the test are not coded as wrong but rather treated as missing at random by including terms for missing data in the likelihood function used to estimate parameters.
Rose et al. (2010) suggested employing a manifest missing data model: due to the monotone missing pattern resulting from NRIs, considering the number of reached items R_p is sufficient to account for NRIs. Within manifest NRI approaches, information on the number of reached items is included as a manifest variable in the background model. This can be achieved either by regressing ability on R_p or by applying multigroup IRT models where stratification on R_p serves as the grouping variable (Rose et al., 2010). Manifest approaches are computationally less intensive than latent variable approaches for NRIs and have, since 2015, been considered in the population model of the Programme for International Student Assessment (PISA) for the generation of plausible values (OECD, 2017).
Within latent variable approaches for nonignorable missing values due to NRIs, information on NRIs is included in the form of a second dimension describing the propensity to reach the end of the test (Glas & Pimentel, 2008; List et al., 2017). Missingness indicators D_pi, defined as 1 if response U_pi is observed, 0 if U_pi is the first NRI, and coded as missing otherwise, constitute the measurement model for this propensity. D_pi is then modeled as a function of examinee p’s propensity to reach the end of the test and item i’s response difficulty. Linear restrictions are imposed on the response difficulty parameters, implying a monotonically decreasing probability of observing a response.
In both latent and manifest variable approaches for modeling the onset of NRIs, correlations different from zero between ability and the number of reached items/the propensity to reach the end of the test indicate that the onset of NRIs is related to the construct being measured, and thus nonignorability of the missingness mechanism. Including information about nonignorable NRIs has been shown to yield less biased and more accurate parameter estimates as compared with ignoring or scoring missing values as wrong (Glas & Pimentel, 2008; Rose et al., 2010; Rose et al., 2017).
Using Response Times to Model Not-Reached Items
If NRIs occur due to lack of speed, examinees reach the time limit before managing to reach the end of the test. For this case, it has been shown that the missing data process underlying NRIs can be described by examinee speed (Pohl et al., 2019). With the widespread availability of RT data retrieved from computerized testing, a direct measure of speed becomes available (van der Linden, 2006). Pohl et al. (2019) were the first to suggest utilizing this information on the missingness process by employing van der Linden’s (2007) hierarchical speed–accuracy (SA) framework to model the occurrence of NRIs in low-stakes assessments. They showed that the SA model (a) can successfully model NRIs due to time limits, (b) provides a closer description of the missing data processes than model-based approaches for nonignorable missing values, and (c) can also deal with varying enforcement of time limits—given that NRIs are the result of lack of speed.
In the SA model, first-level models are specified separately for the responses and the associated RTs. For the response indicators U_pi, van der Linden has recommended employing customary IRT models. For the RTs T_pi, denoting the time examinee p required to generate an answer to item i, a lognormal model with separate person and item parameters is chosen. That is, logarithmized RTs are assumed to follow a normal distribution. In the lognormal model, logarithmized RTs are considered a function of the examinee’s speed τ_p and the item’s time intensity β_i:
ln T_pi = β_i − τ_p + ε_pi,  with ε_pi ~ N(0, α⁻²)   (2)
The parameter α represents the inverse of the standard deviation of the logarithmized RTs and can be interpreted as a time discrimination parameter. That is, the larger α, the larger the proportion of the RT variance that stems from differences in speed across examinees. On a second level, joint multivariate normal distributions of person and item parameters are specified.
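The RT model in Equation 2 can be sketched as follows (an illustration in Python with our own function and variable names, not the authors’ code): logarithmized RTs are normal deviates around β_i − τ_p with standard deviation 1/α.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_rts(tau, beta, alpha, rng):
    """ln T_pi = beta_i - tau_p + eps_pi, eps_pi ~ N(0, alpha^-2) (Equation 2)."""
    tau = np.asarray(tau, float)[:, None]    # person speed, one row per examinee
    beta = np.asarray(beta, float)[None, :]  # item time intensity, one column per item
    eps = rng.normal(0.0, 1.0 / alpha, size=(tau.shape[0], beta.shape[1]))
    return np.exp(beta - tau + eps)          # back-transform to the RT scale

rts = simulate_rts(tau=[-2.5, -2.0], beta=[0.0, 0.2, 0.4], alpha=1.75, rng=rng)
```

Lower speed τ_p yields longer expected RTs; a larger α concentrates RTs more tightly around exp(β_i − τ_p).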
Pohl et al. (2019) delineated that approaches that consider the number of NRIs and the SA model are closely related. First, both approaches include an additional variable that represents the missingness mechanism in the model. While the SA model includes a direct measure of speed, model-based approaches for nonignorable missing values include the tendency to (not) reach the end of the test, as measured by the number of (not) reached items. If the number of NRIs is a result of lack of speed under testing conditions with time limits, the propensity to reach the end of the test can be understood as a proxy of working speed. In this case, the SA model presents a better and more fine-grained description of the missingness process (Pohl et al., 2019).
Second, both approaches assume a single mechanism underlying NRIs. Obviously, when the SA model is applied to account for NRIs, it is assumed that NRIs occurred due to lack of speed. Although model-based approaches that consider the number of NRIs do not explicitly rely on this assumption, they still assume the same missingness mechanism for all NRIs. When the mechanisms leading to NRIs differ across examinees, that is, when multiple mechanisms such as lack of speed and quitting underlie NRIs, this assumption is violated. Hence, when missingness at the end of a test occurs not only due to speed but also due to motivational reasons, neither controlling for speed nor for the tendency to (not) reach the end of the test is sufficient to properly model NRIs.
Objective
We propose a new framework that takes into account that multiple mechanisms can underlie NRIs. Doing so requires (a) distinguishing examinees who quit the assessment from those who did not work with sufficient speed and reached the time limit and (b) establishing a model that considers both mechanisms simultaneously.
The remainder of this article is organized as follows: First, we present an approach that distinguishes between and simultaneously models two different types of NRIs. Second, parameter recovery of the proposed model is investigated using a simulation study. Third, the relevance of the model for understanding the missingness processes is illustrated in an empirical example.
Speed–Accuracy + Quitting Model
The proposed SA + quitting (SA + Q) framework, depicted in Figure 1, is an extension of the hierarchical SA model that also accounts for quitting. Following Pohl et al. (2019), NRIs due to lack of speed are modeled by considering examinee speed. For simplicity, we model ability employing a Rasch model as given by Equation 1. This is in accordance with the analysis frameworks of major LSAs (e.g., National Educational Panel Study; Pohl & Carstensen, 2012). Note that the model can be extended to other measurement models (see Ulitzsch, von Davier, & Pohl, 2019). For speed, we employ the lognormal model suggested by van der Linden (2006) as given by Equation 2. We model the quitting process by considering the number of items reached before quitting. RTs are employed to distinguish between examinees displaying NRIs due to lack of speed and those displaying NRIs due to quitting, and the number of items reached before quitting constitutes the measurement model for a newly introduced person variable, test endurance ζ_p. If the end of the test or the time limit was reached, the number of items reached before quitting is set to missing, as no information about quitting is available in this case.
Figure 1.
Hierarchical framework for the joint modeling of speed, accuracy, and test endurance.
Identifying Quitting
In the present study, we assume that examinee p has quit the assessment when (a) he or she did not reach the end of the test (i.e., the observed number of reached items R_p is smaller than the number of items administered I) and (b) his or her total RT falls below the time limit (i.e., Σ_i T_pi is smaller than t_lim). Based on this information, an indicator of observed quitting behavior Q_p can be constructed as follows:
Q_p = 1 if R_p < I and Σ_i T_pi < t_lim;  Q_p = 0 otherwise.   (3)
By construction, Q_p distinguishes between examinees who quit the assessment (Q_p = 1) and those who reached the time limit or finished the test before quitting (Q_p = 0).
Modeling Quitting
To model the quitting process, we utilize the information contained in the number of items reached up to the point where the assessment was quit, R_p^q. When quitting behavior has been observed, that is, when Q_p = 1, R_p^q is given by the observed number of reached items R_p. We suggest employing a Poisson lognormal model for R_p^q. Such models are common for count data on the test or task level (Doebler & Holling, 2016; Jansen, 1994, 1995), such as the number of correct items in a task or—as mentioned in passing by Jansen (1995)—the number of completed items in a test. More specifically, we model the probability that examinee p quits the assessment after attempting r items as a Poisson process with mean λ_p, where λ_p corresponds to the person parameter exp(ζ_p):
P(R_p^q = r | ζ_p) = λ_p^r exp(−λ_p) / r!,  with λ_p = exp(ζ_p)   (4)
ζ_p denotes examinee p’s test endurance and thus governs the item position at which examinee p is most likely to quit the assessment—given that the assessment has not been quit before. As such, ζ_p can be understood as a survival parameter. In the context of NRIs, the onset of NRIs poses the event of interest occurring within a sequence of item positions (List et al., 2017). The survival function derived from the Poisson lognormal model for R_p^q gives the probability that examinee p will continue the assessment beyond c_p items, as follows:
S(c_p | ζ_p) = P(R_p^q > c_p) = 1 − F(c_p; λ_p)   (5)
with F denoting the Poisson cumulative distribution function.
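Equations 4 and 5 can be sketched with a few lines of standard-library Python (an illustration with our own helper names, not part of the paper’s estimation code): the survival probability is one minus the Poisson CDF evaluated at the censoring position, with rate λ_p = exp(ζ_p).

```python
import math

def poisson_cdf(c, lam):
    """F(c; lam): cumulative Poisson probability up to count c."""
    term = math.exp(-lam)   # P(R = 0)
    total = term
    for k in range(1, c + 1):
        term *= lam / k     # pmf recursion: P(R = k) = P(R = k-1) * lam / k
        total += term
    return total

def survival(c, zeta):
    """P(R^q > c) = 1 - F(c; exp(zeta)) (Equation 5)."""
    return 1.0 - poisson_cdf(c, math.exp(zeta))
```

Higher endurance ζ shifts quitting toward later item positions, so S(c) increases with ζ and decreases with c.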
The presented model incorporates the assumption that examinees will continue the assessment only for a definite number of items and thus will quit the assessment at some point. Note, however, that quitting behavior is not fully observable, since some examinees either manage to complete the test or reach the time limit before quitting. Hence, under Q_p = 0, R_p^q is subject to right censoring. That is, the observed number of reached items R_p corresponds to the number of reached items before quitting R_p^q only under Q_p = 1. Otherwise, R_p marks the item position at which R_p^q has been right censored. The relationship between R_p, R_p^q, and the censoring variable C_p is given by
R_p^q = R_p and C_p = NA if Q_p = 1;  R_p^q = NA (censored) and C_p = R_p if Q_p = 0.   (6)
Table 1 illustrates this relationship for three examinees administered a test of length I = 20 with a time limit of 1,800 seconds. Examinee 1 reached the end of the test within the allocated time without showing quitting behavior. Hence, R_1^q is censored due to reaching the end of the test at C_1 = 20. Examinee 2 did not reach the end of the test due to lack of speed. Hence, C_2 = 16 gives the item position at which quitting behavior was censored due to lack of speed. Examinee 3 reached neither the end of the test nor the time limit and is thus assumed to have quit the assessment. Thus, R_3 = 16 corresponds to the item position at which the assessment was quit, R_3^q.
Table 1.
Illustration of the Relationship Between R_p, R_p^q, and C_p

| p | R_p | T_p (s) | Q_p | R_p^q | C_p |
|---|---|---|---|---|---|
| 1 | 20 | 1,450 | 0 | NA | 20 |
| 2 | 16 | 1,800 | 0 | NA | 16 |
| 3 | 16 | 1,450 | 1 | 16 | NA |

Note. p: examinee; R_p: observed number of reached items; T_p: total response time; Q_p: quitting indicator; R_p^q: number of reached items before quitting; C_p: censoring item position; NA: not applicable.
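The classification underlying Table 1 can be sketched as follows (an illustration of Equations 3 and 6 in Python with our own names, not the authors’ implementation; `None` stands in for NA):

```python
def classify(r_obs, total_rt, n_items, t_limit):
    """Apply Equations 3 and 6: return (Q, R^q, C).
    Quitting (Q = 1) is inferred when the examinee neither reached the last
    item nor exhausted the time limit; otherwise R^q is right censored at
    the observed number of reached items."""
    quit_observed = r_obs < n_items and total_rt < t_limit
    if quit_observed:
        return 1, r_obs, None   # quitting observed at position r_obs
    return 0, None, r_obs       # R^q censored at C = r_obs

# The three examinees of Table 1 (I = 20 items, 1,800-second limit):
rows = [classify(20, 1450, 20, 1800),   # finished the test in time
        classify(16, 1800, 20, 1800),   # reached the time limit
        classify(16, 1450, 20, 1800)]   # quit the assessment
```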
Note that usually in LSAs only a few examinees show quitting behavior; that is, there are few observations with Q_p = 1. This results in sparse data on the number of reached items before quitting R_p^q on one hand and a large portion of R_p^q values that must be assumed censored on the other. If speed is related to test endurance and R_p^q is censored due to reaching the time limit, speed is informative with respect to the censoring of R_p^q. Under such conditions, speed is related to both the parameter governing the distribution of R_p^q and the probability that R_p^q is censored. Modeling test endurance and speed jointly accounts for this informative censoring (Baker, Fitzmaurice, Freedman, & Kramer, 2005). Likewise, modeling ability and speed jointly with test endurance accounts for nonignorable missingness due to quitting on response as well as on RT indicators.
Second-Level Models
In analogy to the SA model, on the second level, the joint distributions of the first-level person and item parameters are modeled. Following van der Linden (2007), person parameters are assumed to be multivariate normal with mean vector
μ_P = (μ_θ, μ_τ, μ_ζ)′   (7)
and covariance matrix
Σ_P = [ σ²_θ   σ_θτ   σ_θζ
        σ_θτ   σ²_τ   σ_τζ
        σ_θζ   σ_τζ   σ²_ζ ]   (8)
Assessing the joint distribution of person parameters provides valuable insights into the processes underlying NRIs, their relationship to ability as well as to each other. Nonzero correlations with test endurance indicate nonignorability of the associated missingness process.
For the sake of simplicity, in the measurement model of RTs, the time discrimination parameters α_i are constrained to be equal across items—that is,
α_i = α for i = 1, …, I   (9)
This constraint can be understood as an analogue to the Rasch model in IRT (van der Linden, 2006) and thus mirrors the Rasch parameterization implemented for item responses in major LSAs (see, e.g., Pohl & Carstensen, 2012).
The joint distribution of item parameters is, in accordance with van der Linden (2007), assumed to be multivariate normal with mean vector
μ_I = (μ_b, μ_β)′   (10)
and covariance matrix
Σ_I = [ σ²_b   σ_bβ
        σ_bβ   σ²_β ]   (11)
When a Rasch model is employed for the response indicators, the model can be identified by setting the expectations of ability and time intensity, μ_θ and μ_β, to zero.
Assuming joint distributions for person and item parameters yields the following likelihood:
L = Π_p { [ Π_{i=1}^{R_p} P(U_pi = u_pi | θ_p, b_i) · f(t_pi | τ_p, β_i, α) ] · P(R_p^q = r_p | ζ_p)^{Q_p} · S(c_p | ζ_p)^{1−Q_p} · f_P(θ_p, τ_p, ζ_p | μ_P, Σ_P) } · Π_{i=1}^{I} f_I(b_i, β_i | μ_I, Σ_I)   (12)
The first four terms incorporate the assumption of conditional independence of responses, RTs, and the number of reached items before quitting given the second-order variables in the model. The third and fourth terms take the right censoring of quitting behavior into account: For examinees who quit the assessment (Q_p = 1), the probability that examinee p quits the assessment after attempting r_p items, as given by Equation 4, contributes to the likelihood function. For examinees with unobserved quitting behavior (Q_p = 0), the likelihood function considers the probability that examinee p continues the assessment beyond the censoring position c_p, as given by the survival function in Equation 5. f_P and f_I denote the multivariate normal densities of the person and item parameters, respectively. To facilitate estimation of the SA + Q model, we employ Bayesian estimation techniques.
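The quitting part of the likelihood in Equation 12 can be sketched as follows (an illustration in Python with our own names; the full model is estimated in JAGS in this article). Each examinee contributes the Poisson probability mass if quitting was observed and the survival probability at the censoring position otherwise:

```python
import math

def poisson_logpmf(r, lam):
    """log P(R^q = r) for a Poisson(lam) variable (Equation 4)."""
    return r * math.log(lam) - lam - math.lgamma(r + 1)

def quit_loglik(zeta, q, position):
    """Log-likelihood contribution of one examinee's quitting data.
    position is R^q when q = 1 and the censoring position C when q = 0."""
    lam = math.exp(zeta)
    if q == 1:
        # quitting observed: Poisson pmf at the quitting position
        return poisson_logpmf(position, lam)
    # quitting censored: log survival, log P(R^q > position)
    cdf = sum(math.exp(poisson_logpmf(k, lam)) for k in range(position + 1))
    return math.log(1.0 - cdf)
```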
Parameter Recovery
We conducted a simulation study to investigate whether true parameter values can be satisfactorily recovered under realistic conditions. The SA model and extensions thereof have been shown to yield good parameter recovery under realistic conditions (e.g., Fox & Marianti, 2016; Molenaar, Tuerlinckx, & van der Maas, 2015; Pohl et al., 2019). We therefore focused especially on possible challenges for estimation imposed by censoring of quitting behavior and the resulting sparseness of data on R_p^q.
Data Generation
Data were generated using R Version 3.5.1 (R Development Core Team, 2017). We employed the SA + Q model as the data-generating model. Using the mvrnorm function from the MASS package (Venables & Ripley, 2002), person and item parameters were randomly drawn from multivariate normal distributions with variances and covariances set to values similar to those of the data application reported below. Population values of the data-generating model are reported in Table 2. We employed a Rasch model for the item responses and set the time discrimination parameter for all items to α = 1.75 (see, e.g., van der Linden, 2007).
Table 2.
Population Parameters of the Data-Generating Model
Person parameters (variances on the diagonal, correlations off the diagonal; μ_θ = 0.00):

| | θ | τ | ζ |
|---|---|---|---|
| θ | 1.00 | | |
| τ | −.40 | .10 | |
| ζ | −.15 | .25 | 0.65 |

Item parameters (variances on the diagonal, correlation off the diagonal; μ_b = μ_β = 0.00):

| | b | β |
|---|---|---|
| b | 1.50 | |
| β | .40 | 0.25 |

Missingness mechanisms:

| | | Quitting | | | Speed and quitting | | |
|---|---|---|---|---|---|---|---|
| I | %NR | %Q | μ_τ | μ_ζ | %Q | μ_τ | μ_ζ |
| 20 | 2.5 | 8.15 | −2.50 | 4.15 | 4.28 | −3.75 | 4.40 |
| 20 | 5 | 15.08 | −2.50 | 3.85 | 7.50 | −3.85 | 4.15 |
| 20 | 10 | 26.49 | −2.50 | 3.50 | 12.75 | −4.00 | 3.85 |
| 40 | 2.5 | 7.83 | −2.50 | 4.85 | 5.30 | −3.75 | 5.00 |
| 40 | 5 | 14.50 | −2.50 | 4.55 | 7.20 | −3.85 | 4.85 |
| 40 | 10 | 28.44 | −2.50 | 4.15 | 12.60 | −4.00 | 4.55 |

Note. I: number of items; %NR: overall missingness rate due to not-reached items; %Q: percentage of examinees quitting; θ: ability; τ: speed; ζ: test endurance; b: item difficulty; β: time intensity. The mean speed μ_τ and the mean test endurance μ_ζ are varied in the simulation design to control the amount and type of mechanisms underlying not-reached items.
Missing values were induced based on (a) cumulative RTs across item positions and (b) the number of reached items before quitting. Cumulative RTs give the time passed when the respective item is responded to. The number of reached items before quitting was generated for each examinee according to the Poisson lognormal model for the quitting process. All items with either a cumulative RT exceeding the time limit or whose position exceeded the number of reached items before quitting were assumed to be not reached and coded as missing.
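The missingness-induction step described above can be sketched as follows (an illustrative Python fragment with our own names; the study itself used R): an item counts as reached only if its cumulative RT stays within the time limit and its position does not exceed the generated number of items reached before quitting.

```python
import numpy as np

def induce_nris(rts, r_quit, t_limit):
    """Return a boolean reached-matrix; False marks a not-reached item."""
    cum_rt = np.cumsum(rts, axis=1)                     # time passed at each item
    position = np.arange(1, rts.shape[1] + 1)[None, :]  # item positions 1..I
    return (cum_rt <= t_limit) & (position <= np.asarray(r_quit)[:, None])

# Two examinees, three items, 1,800-second limit:
rts = np.array([[50.0, 60.0, 70.0],      # fast, but quits after item 2
                [900.0, 950.0, 100.0]])  # slow, runs out of time at item 2
reached = induce_nris(rts, r_quit=[2, 3], t_limit=1800.0)
```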
To evaluate the effects of censoring of quitting behavior, we considered multiple censoring mechanisms and varied four factors that are relevant for data sparseness:
(a) The sample size (N = 350; N = 700), representing low and medium sample sizes per item encountered in LSAs with balanced incomplete block designs (see Gonzalez & Rutkowski, 2010)
(b) The test length (I = 20; I = 40)
(c) The rate of NRIs (2.5%, 5%, 10%), reflecting the upper three quarters of a typical range of percentages of NRIs. For instance, in the PISA 2012 computer-based assessment, percentages of NRIs across booklets ranged from 0.42% to 11.19% (OECD, 2014)
(d) The missingness mechanisms underlying NRIs (NRIs caused solely by quitting; half of the NRIs resulting from quitting and half from lack of speed).1 While the former represents a testing condition without a time limit or with a very generous one (e.g., as in PIAAC), the latter represents a more speeded testing situation where some examinees run out of time before reaching the end of the test or quitting (e.g., as in PISA). We included these conditions to (a) assess whether the proposed model yields unbiased and efficient parameter estimates under various testing situations and (b) disentangle possible effects of different censoring mechanisms on estimation accuracy and efficiency. Under the first condition, the number of reached items is censored due to test length only. Under the second condition, censoring occurs due to both test length and lack of speed.
Under this design, conditions with 5% (2.5%) missing values and no censoring due to speed display the same missingness rates due to quitting as conditions with 10% (5%) missing values of which half stems from quitting. This makes it possible to assess whether the model performs differently when—for the same amount of information available on test endurance—the overall missingness rate increases and R_p^q is exposed to different censoring mechanisms. We controlled the amount and types of NRIs by varying the expectations of test endurance and speed (see Table 2). Note that low rates of missingness due to quitting can correspond to relatively high proportions of examinees exhibiting such behavior. This becomes evident in Table 2, which displays the corresponding average proportions of simulated examinees exhibiting quitting behavior across all cells of the simulation design. In total, the simulation design led to 2 × 2 × 3 × 2 = 24 conditions. For each cell of the simulation design, we generated 100 data sets.
Estimation Procedure
We employed Bayesian estimation with Gibbs sampling. All analyses were conducted in JAGS Version 4.3.0 (Plummer, 2003) using the rjags package (Plummer, 2016) for R Version 3.5.1 (R Development Core Team, 2017). Settings for noninformative priors were chosen following recommendations provided by Fox (2010) and Gelman and Hill (2007).
For person and item parameter variances and covariances, we employed inverse Wishart priors with
Σ_P ~ Inverse-Wishart(I_3, 3)   (13)
and
Σ_I ~ Inverse-Wishart(I_2, 2)   (14)
where I_3 and I_2 represent identity matrices of dimension 3 and 2, respectively. These are default settings for inverse Wishart priors implemented in statistical software for Bayesian analyses (van Erp, Mulder, & Oberski, 2017). Note that inverse Wishart priors tend to be informative about variances when these are close to zero and the sample size is small (Alvarez, Niemi, & Simpson, 2014; Schuurman, Grasman, & Hamaker, 2016). Since the number of items is usually small, prior settings have a larger impact on the item parameter variances and covariances.
We set μ_θ and μ_β to zero for model identification. For the remaining item and person parameter means, we chose noninformative normal priors with mean zero and a large variance. A noninformative gamma prior with shape 0.5 and a small rate was employed for the squared time discrimination α². JAGS code for the SA + Q model is provided in the appendix.
Each generated data set was analyzed running three Markov chain Monte Carlo chains with 100,000 iterations each. We employed a thinning factor of 5 and discarded the first 40,000 iterations as burn-in, saving 36,000 iterations as a sample of the posterior distribution. We determined the number of iterations in preanalyses, inspecting potential scale reduction factor (PSRF) values, trace plots, and effective sample sizes. In the case of nonconvergence (i.e., PSRF values higher than 1.10; Gelman & Shirley, 2011), we increased the number of iterations by 50,000 per chain out of which 30,000 were discarded as burn-in. This procedure was repeated up to 250,000 iterations per chain in total.
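For intuition, the PSRF used as the convergence criterion can be sketched as the classic Gelman–Rubin statistic (a simplified single-parameter illustration in Python with our own names; in practice, JAGS companion tools such as coda report it directly):

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor for draws of one parameter.
    chains has shape (m, n): m chains with n retained iterations each."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    w = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    b = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    var_hat = (n - 1) / n * w + b / n           # pooled variance estimate
    return float(np.sqrt(var_hat / w))
```

Values near 1 indicate that the chains mix over the same region of the posterior; values above 1.10 triggered additional iterations in the procedure described above.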
Evaluation Criteria
We evaluated statistical performance in terms of convergence, bias in and efficiency of parameter estimates, as well as coverage of the true parameter values by 95% highest density intervals. Convergence was assessed on the basis of PSRF values, with PSRF values below 1.10 being considered acceptable (Gelman & Rubin, 1992; Gelman & Shirley, 2011). Coverage between .91 and .98 was considered good (Muthén & Muthén, 2002).
Results
Convergence
Table 3 gives the proportions of replications converging after 100,000 up to 250,000 iterations across all cells of the simulation design. Reaching convergence was more challenging under conditions with low quitting rates. This effect was more pronounced in conditions with a larger number of items. Accordingly, conditions with I = 40 and an overall missingness rate of 2.5%, half of which stemmed from quitting, were most challenging with respect to convergence, with up to 58% of the replications not converging after 250,000 iterations. The number of iterations needed to reach convergence decreased rapidly with higher quitting rates, such that under conditions with 10% missingness solely due to quitting, at least 95% of the replications converged after at most 250,000 iterations. No considerable differences concerning convergence could be observed for the same amount of missing values due to quitting whether or not missing values due to speed were present.
Table 3.
Proportion of Replications Converging After 100,000 to 250,000 Iterations
| I | N | %NR | Mechanisms | 100,000 | 150,000 | 200,000 | 250,000 |
|---|---|---|---|---|---|---|---|
| 20 | 350 | 2.5 | Quitting | .13 | .31 | .52 | .64 |
| 20 | 350 | 2.5 | Speed & quitting | .07 | .26 | .38 | .46 |
| 20 | 350 | 5 | Quitting | .28 | .55 | .73 | .85 |
| 20 | 350 | 5 | Speed & quitting | .13 | .31 | .44 | .60 |
| 20 | 350 | 10 | Quitting | .46 | .78 | .91 | .96 |
| 20 | 350 | 10 | Speed & quitting | .25 | .50 | .66 | .81 |
| 20 | 700 | 2.5 | Quitting | .08 | .32 | .53 | .64 |
| 20 | 700 | 2.5 | Speed & quitting | .08 | .22 | .35 | .48 |
| 20 | 700 | 5 | Quitting | .29 | .50 | .67 | .77 |
| 20 | 700 | 5 | Speed & quitting | .12 | .26 | .44 | .51 |
| 20 | 700 | 10 | Quitting | .59 | .87 | .97 | .99 |
| 20 | 700 | 10 | Speed & quitting | .17 | .35 | .54 | .75 |
| 40 | 350 | 2.5 | Quitting | .07 | .23 | .41 | .57 |
| 40 | 350 | 2.5 | Speed & quitting | .03 | .11 | .29 | .42 |
| 40 | 350 | 5 | Quitting | .20 | .43 | .62 | .74 |
| 40 | 350 | 5 | Speed & quitting | .10 | .20 | .41 | .55 |
| 40 | 350 | 10 | Quitting | .55 | .79 | .95 | .98 |
| 40 | 350 | 10 | Speed & quitting | .13 | .28 | .51 | .64 |
| 40 | 700 | 2.5 | Quitting | .05 | .25 | .40 | .57 |
| 40 | 700 | 2.5 | Speed & quitting | .10 | .23 | .38 | .47 |
| 40 | 700 | 5 | Quitting | .19 | .37 | .62 | .77 |
| 40 | 700 | 5 | Speed & quitting | .10 | .24 | .35 | .45 |
| 40 | 700 | 10 | Quitting | .57 | .84 | .92 | .97 |
| 40 | 700 | 10 | Speed & quitting | .14 | .34 | .59 | .71 |

Note. I: number of items; N: number of examinees; %NR: overall missingness rate due to not-reached items. Quitting and Speed & quitting denote conditions under which all not-reached items stem from quitting and under which not-reached items occur due to both lack of speed and quitting, respectively. Column headers give the number of iterations.
The parameters that typically yielded high PSRF values were the test endurance mean and variance estimates. This is due to the censoring of quitting behavior: the distribution of test endurance needs to be extrapolated based on information from examinees assumed to belong to the lower quartiles of the distribution. When quitting behavior is observable for only a few examinees, information on the distribution of test endurance is sparse, which hampers convergence. Replications that did not converge were excluded from further analyses.
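Convergence was monitored with the potential scale reduction factor (PSRF; Gelman & Rubin, 1992), which compares within- and between-chain variability. The following is our own minimal sketch of the basic statistic (without small-sample corrections), applied to purely artificial chains; it is not the code used in the article:

```python
import numpy as np

def psrf(chains):
    """Basic Gelman-Rubin potential scale reduction factor for one parameter.

    chains: 2-D array of shape (n_chains, n_iterations).
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    # Between-chain variability: variance of the chain means (B/n)
    b_over_n = chains.mean(axis=1).var(ddof=1)
    # Within-chain variability: average within-chain variance (W)
    w = chains.var(axis=1, ddof=1).mean()
    # Pooled variance estimate relative to the within-chain variance
    var_plus = (n - 1) / n * w + b_over_n
    return float(np.sqrt(var_plus / w))

rng = np.random.default_rng(1)
mixed = rng.normal(0.0, 1.0, size=(3, 2000))       # chains sampling the same target
stuck = mixed + np.array([[0.0], [3.0], [6.0]])    # chains stuck in different regions
print(psrf(mixed))  # close to 1: converged
print(psrf(stuck))  # far above the usual 1.10 cutoff: not converged
```

A PSRF close to 1 indicates that the chains are indistinguishable; the article uses the common cutoff of 1.10.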
Coverage
Coverage values for all parameter types and conditions are available in the Supplemental Material (available online). For item parameter variances, covariances, and means, coverage was satisfactory across all conditions. Coverage values below .91 occurred rarely, and the lowest coverage value was still as high as .82. For person parameters, when either the number of examinees or the number of items was sufficiently high, coverage fell below .91 only for test endurance variance and mean estimates under conditions with missingness rates due to quitting below 5% (i.e., under conditions with less than 10% NRIs due to speed and quitting as well as under conditions with an overall missingness rate below 5% solely due to quitting). Under these conditions with less than 5% missingness caused by quitting, the lowest observed coverage values were .88 and .67 for the test endurance variance and .76 and .75 for the test endurance mean across the respective sample-size conditions.
Parameter Estimation
To evaluate bias in and efficiency of parameter estimates, we assessed the median along with 90% ranges of the posterior means across replications. Figures 2 and 3 depict results for person parameter variances and covariances as well as for mean test endurance.2 Across all conditions, median person parameter variance and covariance estimates were close to the true data-generating values. The only exceptions were the parameters describing the distribution of test endurance (its mean and variance), which were sensitive to bias under conditions with missingness rates due to quitting below 5% as well as under conditions with few items. Under conditions with 20 items, median estimates of the test endurance variance ranged from 0.53 to 0.71, as compared to the true value of 0.65. Bias in the test endurance variance under conditions with 40 items was less severe. Under conditions with 10% and 1.25% missingness due to quitting, median estimates of mean test endurance of 3.55 and 4.19 were observed, as compared to the respective data-generating values of 3.50 and 4.40. Under conditions with 40 items and a missingness rate due to quitting of at least 5%, median estimates of the test endurance mean and variance were well recovered. Median bias in parameter estimates did not vary largely across the sample sizes under consideration. Nevertheless, variability of parameter estimates decreased with increasing sample size.
Figure 2.
Medians and 90% ranges of person parameter variance and covariance estimates over all 100 replications per condition. The dashed horizontal line indicates the respective true parameter. Note that y-axes differ in scale. NRIs: not-reached items; quitting and speed & quitting denote conditions under which, respectively, all not-reached items are attributable to quitting and not-reached items occurred due to both lack of speed and quitting.
Figure 3.
Medians and 90% ranges of mean test endurance estimates over all 100 replications per condition. The dashed horizontal line indicates the respective true parameter. Plots are organized according to the data-generating values employed to achieve missingness rates due to quitting ranging from 1.25% (speed & quitting, 2.5%) to 10% (quitting, 10%). Note that y-axes differ in scale. % NR: overall missingness rate due to not-reached items; quitting and speed & quitting denote conditions under which, respectively, all not-reached items are attributable to quitting and not-reached items occurred due to both lack of speed and quitting.
Results for bias and efficiency of item parameter means, variances, and covariances are given in the Supplemental Material (available online). Due to the small number of items, these estimates were shrunk toward the prior mean of the inverse-Wishart prior (see Alvarez et al., 2014; Daniels & Kass, 1999).
Subsequent Analyses
In subsequent analyses, we investigated whether the recovery of the test endurance variance under conditions with few items (i.e., 20 items) improves with an increasing amount of NRIs. To do so, we increased the missingness rate under the 20-item condition with missingness solely caused by quitting to 20% by setting mean test endurance to 3. We employed 100 replications and analyzed all generated data sets employing three chains with 250,000 iterations each. No convergence issues occurred. Indeed, the additional condition yielded a median estimate of the test endurance variance of 0.67, as compared to the true value of 0.65, thereby supporting the conclusion of the simulation study that the test endurance variance can be better recovered when data sparseness on quitting behavior is less severe.
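The mapping from the data-generating endurance distribution to quitting-induced missingness can be sketched with a small Monte Carlo simulation of the Poisson lognormal quitting process described in the article. This is our own illustration; the endurance standard deviation of 0.8 and the sample sizes are assumptions for the sketch, not the authors' exact design:

```python
import numpy as np

def quitting_missingness(mu, sigma, n_items, n_persons, rng):
    """Share of responses missing due to quitting under a Poisson lognormal
    survival process: each person quits after reaching D ~ Poisson(exp(eta))
    items, with log endurance eta ~ N(mu, sigma^2); persons reaching the
    last item are right-censored."""
    eta = rng.normal(mu, sigma, n_persons)
    d = rng.poisson(np.exp(eta))      # number of items reached before quitting
    reached = np.minimum(d, n_items)  # censoring at the test length
    return (n_items - reached).sum() / (n_persons * n_items)

rng = np.random.default_rng(7)
for mu in (3.0, 3.5, 4.4):
    rate = quitting_missingness(mu, 0.8, 20, 10_000, rng)
    print(f"mean log endurance {mu}: {rate:.1%} missing due to quitting")
```

Raising the mean log endurance pushes the quitting point beyond the test length for more examinees, so the quitting-induced missingness rate drops, which is how the simulation design varied the quitting rates.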
Empirical Example
We used data from the Spanish sample of the PISA 2015 assessment to illustrate the use of the SA + Q model for understanding the occurrence of NRIs. We analyzed data from examinees who were administered Science Cluster Number 7 at the second position out of four 30-minute blocks. For simplicity, we excluded examinees with item omissions from further analyses. The final sample consisted of 326 examinees responding to 17 items. The data set under consideration displayed a missingness rate of 7.13%, attributable to the 21.47% of examinees who did not reach the end of the cluster. The majority of examinees with NRIs (58.57%) reached all but at most four items.
Total Response Time Distributions
In a first step, to get a better understanding of possible mechanisms, we followed Pohl et al. (2019) and examined distributions of total RT, that is, the time an examinee spent on the test as a whole. Figure 4 displays total RT as a function of the number of NRIs. The time limit of 1,800 seconds is marked with a dashed horizontal line. The results suggest that the time limit was enforced with varying rigor by test administrators, since for some examinees, total RT exceeded the time limit. More importantly, total RT varied largely across examinees with NRIs. Such variation would not be expected if NRIs occurred entirely due to lack of speed, in which case the total RTs of all examinees with NRIs should be close to the time limit. Instead, the fact that total RT was close to the time limit for only some examinees with NRIs, while being considerably below the limit for others, can be understood as evidence that different mechanisms—that is, lack of speed and premature quitting—underlay the NRIs.
Figure 4.
Total response time distributions for PISA (Programme for International Student Assessment) Science Cluster Number 7 administered in Spain, by number of not-reached items. The dashed horizontal line marks the time limit of 1,800 seconds.
Investigating the Occurrence of Not-Reached Items
For simplicity, when classifying examinees as quitters, we ignored that in PISA time limits were enforced with varying rigor. To deal with the fact that RT is not recorded for the last item seen when no response has been generated, we employed a heuristic approach and adjusted the decision boundary for total RTs associated with NRIs by a typical item-level RT. We classified NRIs as due to quitting when total RT fell below the time limit by more than the 90th percentile of RTs across all items and examinees—that is, when total RT was smaller than the time limit minus this typical item-level RT.3 Doing so led to classifying 68 out of the 70 examinees who did not reach the end of the test as quitters. We employed the SA + Q model to analyze the data. With 250,000 iterations per chain, no PSRF values above 1.10 were encountered.
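The classification heuristic can be sketched as follows. This is our own minimal implementation; the per-person response times and totals are invented toy data, not PISA values:

```python
import numpy as np

TIME_LIMIT = 1_800.0  # seconds, as in the PISA cluster analyzed

def classify_quitters(item_rts, totals_with_nris, time_limit=TIME_LIMIT):
    """Classify examinees with NRIs as quitters when their total RT falls
    below the time limit by more than a typical item-level RT, here the
    90th percentile of all observed item-level RTs."""
    typical_rt = np.percentile(np.concatenate(item_rts), 90)
    boundary = time_limit - typical_rt
    return {person: bool(total < boundary)
            for person, total in totals_with_nris.items()}

# Invented toy data: per-person observed item-level RTs (seconds)
item_rts = [np.array([60, 80, 120]), np.array([40, 90, 150]), np.array([70, 100, 110])]
# Total RTs of two examinees with NRIs: one near the limit, one far below it
totals = {"ran_out_of_time": 1_750.0, "quit_early": 900.0}
print(classify_quitters(item_rts, totals))
```

An examinee whose total RT is within one typical item RT of the limit plausibly ran out of time mid-item; anyone who stopped earlier than that is treated as having quit.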
Results are displayed in Table 4. The negative correlations of ability with speed and with test endurance indicate that more able examinees tended to display lower general working speed and lower test endurance. That is, more able examinees were more likely to display NRIs due to both reaching the time limit and quitting. Note, however, that the highest density interval for the correlation between ability and test endurance includes zero, meaning that this correlation is not credibly different from zero. The positive correlation between speed and test endurance indicates that examinees who worked faster tended to generate answers to more items before quitting the assessment. Furthermore, the fact that speed and test endurance are not highly correlated underlines that missingness due to speed and missingness due to quitting should be seen as different processes. Mean test endurance gives the expected mean logarithmized number of items reached before quitting. The value of 3.52 corresponds to 33.78 items.
Table 4.
Variances, Means, and Correlations of Person and Item Parameters

| Person parameters | Ability | Speed | Test endurance | Mean |
|---|---|---|---|---|
| Ability | 1.13 [0.87, 1.39] | | | |
| Speed | −.37 [−.50, −.24] | 0.07 [0.06, 0.09] | | |
| Test endurance | −.16 [−.35, .03] | .30 [.12, .47] | 0.62 [0.34, 0.94] | 3.52 [3.31, 3.74] |

| Item parameters | Difficulty | Time intensity | Mean |
|---|---|---|---|
| Difficulty | 1.41 [0.59, 2.47] | | −0.36 [−0.99, 0.18] |
| Time intensity | .41 [.02, .76] | 0.21 [0.09, 0.37] | 4.14 [3.94, 4.39] |

Note. Highest density intervals are given in square brackets. Diagonal entries give variances, off-diagonal entries give correlations, and the last column gives means.
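Because mean test endurance is reported on the log scale, the back-transformation to the item scale is plain exponentiation. A quick check of the value reported above:

```python
import math

# Mean test endurance is the expected mean logarithmized number of items
# reached before quitting; exponentiating maps it back to the item scale.
mean_log_endurance = 3.52
print(round(math.exp(mean_log_endurance), 2))  # 33.78 items
```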
Discussion
The SA + Q model proposed in this article allows one to disentangle and simultaneously model different missing data mechanisms underlying NRIs. Namely, the SA + Q model distinguishes between NRIs stemming from lack of speed and NRIs due to quitting and thereby allows describing the different processes underlying NRIs in a substantively meaningful way. This is achieved by further integrating research on missing data with research on RTs (see Pohl et al., 2019). The SA + Q model considers RT data to handle NRIs by (a) utilizing the additional information contained in RTs to distinguish between examinees displaying NRIs due to lack of speed and those who quit and (b) extending van der Linden's SA model by a Poisson lognormal survival model describing the quitting process in terms of examinee test endurance. The SA + Q model can be employed to model the occurrence of NRIs under conditions where all NRIs are attributable to quitting as well as under conditions where NRIs occur due to both lack of speed and quitting.
The SA + Q model represents a refined model-based approach for dealing with nonignorable missing data. As delineated above, previously suggested model-based approaches for NRIs (Glas & Pimentel, 2008; Rose et al., 2010; Pohl et al., 2019) rely on the assumption of a single missingness mechanism. The SA + Q model complements model-based approaches for NRIs in that it considers that multiple mechanisms can underlie their occurrence. As such, the SA + Q model overcomes limitations of current state-of-the-art approaches for NRIs.
We employed data from PISA 2015 to illustrate how the approach can provide insights into the processes underlying NRIs. We showed that there is strong evidence that NRIs in LSAs indeed can be attributed to different missingness processes. In this context, the SA + Q model supports a more fine-grained understanding of the occurrence of NRIs by further assessing examinee characteristics associated with test endurance. As such, the SA + Q model can be used to evaluate and inform substantive theories on test-taking behavior and strategies.
The model gives reasonable estimates under conditions with a missingness rate of at least 5% due to quitting (or approximately 15% of examinees exhibiting quitting behavior) and a higher number of items. Since the SA + Q model estimates test endurance based on information on the number of reached items before quitting, the number of examinees quitting the assessment might be of greater importance for retrieving unbiased estimates than the missingness rate due to quitting. Generally, a higher number of iterations is needed when little information on quitting behavior is available. Note that model-based approaches yield ability estimates considerably different from those retrieved when ignoring nonignorable missing values only under high missingness rates (Pohl et al., 2014; Rose, 2013; Rose et al., 2010), such that the application of the SA + Q model might be useful mainly under conditions with higher rates of NRIs.
Due to the model’s complexity, we recommend keeping measurement models as simple as possible—for example, by employing a Rasch model for item responses and/or fixing time discrimination parameters in the measurement model of RTs to be equal across items. More complex measurement models that might better fit the data at hand can be incorporated in the SA + Q framework. For RTs, for instance, alternative parameterizations have been suggested that assume distributions of RTs different from lognormal (Klein-Entink, van der Linden, & Fox, 2009) or that introduce additional parameters reflecting the way an item distinguishes between examinees of different speed levels (Klein-Entink, Kuhn, Hornke, & Fox, 2009). Note, however, that choosing more complex measurement models for responses and/or RTs might further challenge estimation of the SA + Q model and increase the number of items or examinees needed to achieve convergence as well as unbiased and efficient parameter estimation.
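With equal time discriminations, van der Linden's (2006) lognormal measurement model for RTs reduces to log RTs being normal with mean equal to item time intensity minus person speed. The sketch below simulates this reduced model under invented parameter values and recovers the time intensities by simple item-wise means of log RTs (valid because speed is centered at zero); it is an illustration of the simplification, not the authors' estimation code:

```python
import numpy as np

rng = np.random.default_rng(42)
n_persons, n_items = 2_000, 17

# Invented generating parameters (illustration only)
beta = rng.normal(4.1, 0.4, n_items)     # item time intensities (log-seconds)
tau = rng.normal(0.0, 0.27, n_persons)   # person speed, mean 0 for identification
sigma = 0.5                              # residual SD, equal across items

# Lognormal RT model with equal time discriminations:
# log t[p, i] ~ Normal(beta[i] - tau[p], sigma^2)
log_t = beta[None, :] - tau[:, None] + rng.normal(0.0, sigma, (n_persons, n_items))

# With speed centered at zero, item-wise means of log RTs recover beta
beta_hat = log_t.mean(axis=0)
print(np.abs(beta_hat - beta).max())  # small recovery error
```

Freeing the time discriminations would add one slope per item to the mean structure, which is exactly the kind of added complexity the paragraph above cautions against.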
Limitations and Future Research
While the SA + Q model allows one to incorporate different missingness mechanisms, it heavily relies on extrapolation of the distribution of the number of reached items before quitting and of the underlying test endurance variable. When the majority of examinees manages to reach the end of the test, the number of reached items before quitting is strongly affected by right censoring. Thus, as is the case with previously developed latent model-based approaches for NRIs (e.g., Glas & Pimentel, 2008), the distribution of reached items before quitting is extrapolated from observations assumed to belong to the lower quartiles of the distribution. This renders it difficult to assess whether distributional assumptions are met. In the simulation study for evaluating estimability, rather low missingness rates due to quitting were attributable to a relatively high proportion of examinees exhibiting such behavior. For instance, under conditions with a missingness rate of 10% due to quitting, quitting behavior was observable for roughly a quarter of examinees. This, however, need not be the case in real data. It might well be that high NRI rates can be attributed to a few examinees quitting the assessment at early stages. The capability of the SA + Q model to yield an accurate description of the quitting process under such conditions remains to be evaluated. Likewise, it still remains to be assessed whether the statistical performance of the SA + Q model can be improved under conditions with large sample sizes, as often encountered at the country level in LSAs. It might well be that the SA + Q model performs well under conditions with a low proportion of examinees exhibiting quitting behavior when the absolute number of examinees is sufficiently high. Until then, the requirements concerning missingness rates due to quitting established on the basis of the simulation study may serve as lower boundaries.
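The right censoring discussed above enters the likelihood directly: an examinee observed to quit after item d contributes the Poisson probability mass at d, whereas an examinee who reaches the end of the test contributes only the survival probability of reaching at least the censoring position. A stdlib-only sketch of these two contributions for a fixed, hypothetical person-specific Poisson rate (the full model additionally places a lognormal distribution on this rate):

```python
import math

def poisson_pmf(k, lam):
    """P(D = k) for a Poisson-distributed number of reached items."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def likelihood_contribution(lam, d=None, censor_at=None):
    """Observed quitting after item d -> pmf at d;
    test completed (right-censored at censor_at) -> P(D >= censor_at)."""
    if d is not None:
        return poisson_pmf(d, lam)
    # Survival probability: 1 minus the CDF up to the censoring position
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(censor_at))

# Hypothetical person with an expected endurance of 15 items on a 20-item test:
print(likelihood_contribution(15.0, d=12))          # quit after item 12
print(likelihood_contribution(15.0, censor_at=20))  # finished all 20 items
```

When most examinees finish the test, most contributions are of the censored kind, so the shape of the endurance distribution is informed by only the few observed quitters, which is exactly the extrapolation problem described above.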
In addition, for count data on the test level, distributions different from lognormal, such as a gamma distribution, have been suggested for the random, person-specific mean of the Poisson distribution (Doebler & Holling, 2016; Jansen, 1994). Further research and possibly also experimental studies are needed to evaluate the appropriateness of different distributional assumptions for test endurance.
Furthermore, it remains an open question whether the quitting process can be sufficiently described by considering item positions alone. First, it might well be that elapsed test time, rather than the number of attempted items, better measures test endurance. As long as time intensities do not differ largely across items, item positions might serve as a proxy for elapsed test time (Fox & Marianti, 2016). Otherwise, the SA + Q model might not sufficiently capture the quitting process. Second, subpopulations with different quitting strategies might exist. For instance, while some examinees might quit due to lack of motivation, others might quit when they consider the test too difficult or even out of frustration upon noticing that the time remaining is not sufficient to complete the assessment. Likewise, there might be qualitative differences in quitting mechanisms between examinees who quit at earlier and later stages of the assessment. To meaningfully incorporate different forms of quitting behavior into the model, further substantive research is needed to describe and understand these mechanisms.
Moreover, the SA + Q model assumes stationarity of speed. This assumption is reasonable when tests are administered with generous time limits, such that examinees are unlikely to run out of time (van der Linden, 2007). However, the presence of NRIs due to lack of speed indicates that the allocated time might not have been sufficient for all examinees. When examinees perceive their current speed level as insufficient to reach the end of the test, they might try to adjust their pace (Yamamoto & Everson, 1997), rendering the stationarity assumption implausible. It therefore seems necessary for future studies to allow for within-person variation of speed in the SA + Q model (see Fox & Marianti, 2016, for an extended model).
In general, estimation of the SA + Q model was found to be rather challenging. Severe convergence issues were encountered under conditions with less than 10% missingness due to quitting. When convergence was reached, it oftentimes required a high number of iterations. Future research should therefore address facilitating the estimation of the SA + Q model.
In the empirical example, we encountered conditions that pose further challenges to adequate modeling of NRIs and provided heuristic solutions to address these issues. Future research is needed on how to better deal with (a) the fact that oftentimes no RT information is available for the last item seen when no response has been generated and (b) varying enforcement of time limits when applying the SA + Q framework. In addition, for the SA + Q model to be readily applicable to empirical data, it needs to be considered that NRIs are often not the only source of missingness attributable to examinee behavior. In most LSAs, both NRIs and item omissions can be encountered (Pohl et al., 2014). To additionally handle (nonignorable) omission processes, the SA + Q model could be combined with model-based approaches for item omissions (e.g., O’Muircheartaigh & Moustaki, 1999; Rose, 2013; Ulitzsch et al., 2019).
Supplemental Material
Supplemental material, Supplementary for A Multiprocess Item Response Model for Not-Reached Items due to Time Limits and Quitting by Esther Ulitzsch, Matthias von Davier and Steffi Pohl in Educational and Psychological Measurement
Appendix
JAGS Code
Figure A1.
JAGS code for the speed–accuracy + quitting model. Inputs are the number of persons and the number of items; persons-by-items matrices containing the item responses and the associated response times; two vectors of length equal to the number of persons containing the number of reached items before quitting and the censoring item position; an indicator of observed quitting behavior for each examinee, distinguishing (with values 1 and 2) examinees whose quitting was observed from those who were censored; and identity matrices of sizes 3 and 2.
1. Note that in the case of NRIs attributable entirely to lack of speed, there is no need to model quitting behavior, and the SA model is sufficient for modeling the mechanism underlying NRIs.
2. Note that plots for parameter recovery of mean test endurance (Figure 3) are organized according to the data-generating values employed to achieve missingness rates due to quitting ranging from 1.25% (speed & quitting, 2.5%) to 10% (quitting, 10%).
3. Note that other approaches may also be taken for classifying NRIs as due to quitting. Especially when more detailed information is available for each item, it might be possible to classify quitting even better. Since currently publicly available databases containing RT information do not provide RTs for the last item seen when no response has been generated, the heuristic decision rule employed here might serve as a first guideline for applying the SA + Q framework to such data.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Grant PO 1655/3-1 from the German Research Foundation (DFG).
ORCID iD: Esther Ulitzsch https://orcid.org/0000-0002-9267-8542
References
- Alvarez I., Niemi J., Simpson M. (2014). Bayesian inference for a covariance matrix. Retrieved from https://arxiv.org/abs/1408.4050 [Google Scholar]
- Baker S. G., Fitzmaurice G. M., Freedman L. S., Kramer B. S. (2005). Simple adjustments for randomized trials with nonrandomly missing or censored outcomes arising from informative covariates. Biostatistics, 7, 29-40. doi: 10.1093/biostatistics/kxi038 [DOI] [PubMed] [Google Scholar]
- Chen H. H., von Davier M., Yamamoto K., Kong N. (2015). Comparing data treatments on item-level nonresponse and their effects on data analysis of large-scale assessments: 2009 PISA study (ETS Research Report No. RR-15-12). Princeton, NJ: Educational Testing Service. doi: 10.1002/ets2.12059 [DOI] [Google Scholar]
- Cosgrove J. (2011, December). Does student engagement explain performance on PISA? Comparisons of response patterns on the PISA tests across time. Dublin, Ireland: Educational Research Centre; Retrieved from http://www.erc.ie/documents/engagement_and_performance_over_time.pdf [Google Scholar]
- Daniels M. J., Kass R. E. (1999). Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association, 94, 1254-1263. doi: 10.1080/01621459.1999.10473878 [DOI] [Google Scholar]
- Debeer D., Janssen R., Boeck P. (2017). Modeling skipped and not-reached items using IRTrees. Journal of Educational Measurement, 54, 333-363. doi: 10.1111/jedm.12147 [DOI] [Google Scholar]
- Doebler A., Holling H. (2016). A processing speed test based on rule-based item generation: An analysis with the Rasch Poisson counts model. Learning and Individual Differences, 52, 121-128. doi: 10.1016/j.lindif.2015.01.013 [DOI] [Google Scholar]
- Fox J.-P. (2010). Bayesian item response modeling: Theory and applications. New York, NY: Springer. [Google Scholar]
- Fox J.-P., Marianti S. (2016). Joint modeling of ability and differential speed using responses and response times. Multivariate Behavioral Research, 51, 540-553. doi: 10.1080/00273171.2016.1171128 [DOI] [PubMed] [Google Scholar]
- Foy P. (2017). TIMSS 2015 user guide for the international database. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International Association for the Evaluation of Educational Achievement; Retrieved from https://timssandpirls.bc.edu/timss2015/international-database/downloads/T15_UserGuide.pdf [Google Scholar]
- Foy P. (2018). PIRLS 2016 user guide for the international database. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College and International Association for the Evaluation of Educational Achievement; (IEA). Retrieved from https://timssandpirls.bc.edu/pirls2016/international-database/downloads/P16_UserGuide.pdf [Google Scholar]
- Gelman A., Hill J. (2007). Data analysis using regression and multilevel hierarchical models. Cambridge, England: Cambridge University Press. [Google Scholar]
- Gelman A., Rubin D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457-472. doi: 10.1214/ss/1177011136 [DOI] [Google Scholar]
- Gelman A., Shirley K. (2011). Inference from simulations and monitoring convergence. In Brooks S., Gelman A., Jones G., Meng X.-L. (Eds.), Handbook of Markov chain Monte Carlo (pp. 163-174). Boca Raton, FL: Chapman & Hall. [Google Scholar]
- Glas C. A., Pimentel J. L. (2008). Modeling nonignorable missing data in speeded tests. Educational and Psychological Measurement, 68, 907-922. doi: 10.1177/0013164408315262 [DOI] [Google Scholar]
- Gonzalez E., Rutkowski L. (2010). Principles of multiple matrix booklet designs and parameter recovery in large-scale assessments. In von Davier M., Hastedt D. (Eds.), IERI monograph series: Issues and methodologies in large-scale assessments (Vol. 3, pp. 125-156). Hamburg, Germany: IEA-ETS Research Institute. [Google Scholar]
- Jansen M. G. H. (1994). Parameters of the latent distribution in Rasch’s Poisson counts model. In Fischer C., Laming D. (Eds.), Contributions to mathematical psychology, psychometrics, and methodology (pp. 319-326). New York, NY: Springer. [Google Scholar]
- Jansen M. G. H. (1995). The Rasch Poisson counts model for incomplete data: An application of the EM algorithm. Applied Psychological Measurement, 19, 291-302. doi: 10.1177/014662169501900307 [DOI] [Google Scholar]
- Klein-Entink R. H., Kuhn J.-T., Hornke L. F., Fox J.-P. (2009). Evaluating cognitive theory: A joint modeling approach using responses and response times. Psychological Methods, 14, 54-75. doi: 10.1037/a0014877 [DOI] [PubMed] [Google Scholar]
- Klein-Entink R., van der Linden W., Fox J.-P. (2009). A Box-Cox normal model for response times. British Journal of Mathematical and Statistical Psychology, 62, 621-640. doi: 10.1348/000711008X374126 [DOI] [PubMed] [Google Scholar]
- Köhler C., Pohl S., Carstensen C. H. (2017). Dealing with item nonresponse in large-scale cognitive assessments: The impact of missing data methods on estimated explanatory relationships. Journal of Educational Measurement, 54, 397-419. doi: 10.1111/jedm.12154 [DOI] [Google Scholar]
- Lawrence I. M. (1993). The effect of test speededness on subgroup performance (ETS Research Report No. RR-93-49). Princeton, NJ: Educational Testing Service. doi: 10.1002/j.2333-8504.1993.tb01560.x [DOI] [Google Scholar]
- List M. K., Köller O., Nagy G. (2017). A semiparametric approach for modeling not-reached items. Educational and Psychological Measurement. Advance online publication. doi: 10.1177/0013164417749679 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu O. L., Rios J. A., Borden V. (2015). The effects of motivational instruction on college students’ performance on low-stakes assessment. Educational Assessment, 20, 79-94. doi: 10.1080/10627197.2015.1028618 [DOI] [Google Scholar]
- Lord F. M. (1983). Maximum likelihood estimation of item response parameters when some responses are omitted. Psychometrika, 48, 477-482. doi: 10.1007/BF02293689 [DOI] [Google Scholar]
- Mandinach E. B., Bridgeman B., Cahalan-Laitusis C., Trapani C. (2005). The impact of extended time on SAT® test performance (Research Report No. 2005-8). New York, NY: College Board; Retrieved from https://files.eric.ed.gov/fulltext/ED563027.pdf [Google Scholar]
- Mislevy R. J., Wu P.-K. (1996). Missing responses and IRT ability estimation: Omits, choice, time limits, and adaptive testing (ETS Research Report No. RR-96-30-ONR). Princeton, NJ: Educational Testing Service. doi: 10.1002/j.2333-8504.1996.tb01708.x [DOI] [Google Scholar]
- Molenaar D., Tuerlinckx F., van der Maas H. L. (2015). A generalized linear factor model approach to the hierarchical framework for responses and response times. British Journal of Mathematical and Statistical Psychology, 68, 197-219. doi: 10.1111/bmsp.12042 [DOI] [PubMed] [Google Scholar]
- Muthén L. K., Muthén B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9, 599-620. doi: 10.1207/S15328007SEM0904_8 [DOI] [Google Scholar]
- O’Muircheartaigh C., Moustaki I. (1999). Symmetric pattern models: A latent variable approach to item non-response in attitude scales. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162, 177-194. doi: 10.1111/1467-985X.00129 [DOI] [Google Scholar]
- Organisation for Economic Co-operation and Development. (2013). Technical report of the survey of adult skills (PIAAC). Paris, France: Author; Retrieved from https://www.oecd.org/skills/piaac/_Technical%20Report_17OCT13.pdf [Google Scholar]
- Organisation for Economic Co-operation and Development. (2014). PISA 2012 technical report. Paris, France: Author; Retrieved from https://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-report-final.pdf [Google Scholar]
- Organisation for Economic Co-operation and Development. (2017). PISA 2015 technical report. Paris, France: Author; Retrieved from https://www.oecd.org/pisa/sitedocument/PISA-2015-technical-report-final.pdf [Google Scholar]
- Plummer M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Hornik K., Leisch F., Zeileis A. (Eds.), Proceedings of the Third International Workshop on Distributed Statistical Computing (pp. 1-10). Vienna, Austria: Austrian Association for Statistical Computing. [Google Scholar]
- Plummer M. (2016). rjags: Bayesian graphical models using MCMC (R package Version 4-6). Retrieved from https://CRAN.R-project.org/package=rjags
- Pohl S., Carstensen C. H. (2012). NEPS technical report-Scaling the data of the competence tests (NEPS Working Paper No. 14). Bamberg, Germany: Otto-Friedrich-Universität, Nationales Bildungspanel. [Google Scholar]
- Pohl S., Gräfe L., Rose N. (2014). Dealing with omitted and not-reached items in competence tests: Evaluating approaches accounting for missing responses in item response theory models. Educational and Psychological Measurement, 74, 423-452. doi: 10.1177/0013164413504926 [DOI] [Google Scholar]
- Pohl S., Ulitzsch E., von Davier M. (2019). Using response time models to account for not-reached items. Psychometrika, 84(3), 892-920. doi: 10.1007/s11336-019-09669-2 [DOI] [PubMed] [Google Scholar]
- R Development Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; Retrieved from http://www.R-project.org [Google Scholar]
- Rose N. (2013). Item nonresponses in educational and psychological measurement (Unpublished doctoral dissertation). Friedrich-Schiller-Universität Jena, Jena, Thuringia, Germany: Retrieved from https://d-nb.info/1036873145/34 [Google Scholar]
- Rose N., von Davier M., Nagengast B. (2017). Modeling omitted and not-reached items in IRT models. Psychometrika, 82, 795-819. doi: 10.1007/s11336-016-9544-7 [DOI] [PubMed] [Google Scholar]
- Rose N., von Davier M., Xu X. (2010). Modeling nonignorable missing data with item response theory (IRT) (ETS Research Report No. RR-10-11). Princeton, NJ: Educational Testing Service. doi: 10.1002/j.2333-8504.2010.tb02218.x [DOI] [Google Scholar]
- Rubin D. B. (1976). Inference and missing data. Biometrika, 63, 581-592. doi: 10.1093/biomet/63.3.581 [DOI] [Google Scholar]
- Schuurman N., Grasman R., Hamaker E. (2016). A comparison of inverse-Wishart prior specifications for covariance matrices in multilevel autoregressive models. Multivariate Behavioral Research, 51, 185-206. doi: 10.1080/00273171.2015.1065398 [DOI] [PubMed] [Google Scholar]
- Ulitzsch E., von Davier M., Pohl S. (2019). Using response times for joint modeling of response and omission behavior. Multivariate Behavioral Research, 1-29. doi: 10.1080/00273171.2019.1643699 [DOI] [PubMed] [Google Scholar]
- van der Linden W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181-204. doi: 10.3102/10769986031002181 [DOI] [Google Scholar]
- van der Linden W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287-308. doi: 10.1007/s11336-006-1478-z [DOI] [Google Scholar]
- van Erp S., Mulder J., Oberski D. L. (2017). Prior sensitivity analysis in default Bayesian structural equation modeling. Psychological Methods, 23, 363-388. doi: 10.1037/met0000162 [DOI] [PubMed] [Google Scholar]
- Venables W. N., Ripley B. D. (2002). Modern applied statistics with S (4th ed.). New York, NY: Springer; Retrieved from http://www.stats.ox.ac.uk/pub/MASS4 [Google Scholar]
- Wild C., Durso R. (1979, June). Effect of increased test-taking time on test scores by ethnic groups, age, and sex. Princeton, NJ: Educational Testing Service; Retrieved from https://www.ets.org/Media/Research/pdf/GREB-76-06R.pdf [Google Scholar]
- Wise S. L., DeMars C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10, 1-17. doi: 10.1207/s15326977ea1001_1 [DOI] [Google Scholar]
- Yamamoto K., Everson H. (1997). Modeling the effects of test length and test time on parameter estimation using the HYBRID model. In Rost J. (Ed.), Applications of latent trait and latent class models in the social sciences (pp. 89-98). Münster, Germany: Waxmann. [Google Scholar]