Educational and Psychological Measurement
. 2019 Sep 3;80(2):293–311. doi: 10.1177/0013164419871597

A Two-Level Alternating Direction Model for Polytomous Items With Local Dependence

Igor Himelfarb 1, Katerina M Marcoulides 2, Guoliang Fang 3, Bruce L Shotts 1
PMCID: PMC7047261  PMID: 32158023

Abstract

The chiropractic clinical competency examination uses groups of items that are integrated by a common case vignette. The nature of the vignette items violates the assumption of local independence for items nested within a vignette. This study examines via simulation a new algorithmic approach for addressing the local independence violation problem using a two-level alternating direction testlet model. Parameter values for item difficulty, discrimination, test-taker ability, and test-taker secondary abilities associated with a particular testlet are generated, and parameter recovery through Markov Chain Monte Carlo Bayesian methods and generalized maximum likelihood estimation methods is compared. To aid with the complex computational efforts, the TensorFlow platform is used. Both estimation methods provided satisfactory parameter recovery, although the Bayesian methods were found to be somewhat superior in recovering item discrimination parameters. The practical significance of the results is discussed in relation to obtaining accurate estimates of item, test, and ability parameters, as well as measurement reliability information.

Keywords: testlet response theory (TRT), violation of local independence, Bayesian methods, Markov Chain Monte Carlo (MCMC), generalized maximum likelihood estimation (GMLE)

Introduction

The National Board of Chiropractic Examiners (NBCE) is an independent third-party testing agency for the chiropractic profession. It was incorporated in 1963 and began administering its first examinations in 1965. Prior to the formation of the NBCE, each state chiropractic licensing board created and administered its own battery of licensure examinations. As a consequence, licensure testing for the chiropractic profession and standards for entry-level chiropractic licensure varied considerably from state to state. Presently, successful completion of the NBCE examinations, which comprise four separate parts, is required for licensure across all of the United States, thereby providing the chiropractic profession with a single pathway to licensure (for additional details, see online Supplementary Appendix A).

One key component of the chiropractic clinical competency examination is the use of groups of items that are integrated by a common case vignette. The nature of the vignette items, however, violates the defining item response theory (IRT) assumption of conditional independence, or local independence, for items. This assumption basically states that the conditional probability of observing a particular response pattern given a latent trait value equals the product of the conditional probabilities of the items. The violation of this assumption is referred to as local dependence. Stated another way, local independence implies that the only thing causing items to covary is the modeled latent trait(s); under local dependence, items remain related after conditioning on those traits. Because items nested within a vignette share content beyond the latent trait, they are a potential source of local dependence. Not accounting for these dependencies can lead to biased estimates of item, test, and ability parameters and even overestimation of measurement reliability (Sireci, Thissen, & Wainer, 1991; Wainer & Wang, 2000).
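To make the assumption concrete, under local independence the joint probability of a response pattern is simply the product of the item-level conditional probabilities. The sketch below illustrates this with a dichotomous 2PL item model and hypothetical item parameters (the article's items are polytomous, but the principle is the same).

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (illustrative only)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def joint_prob(theta, items, pattern):
    """Under local independence, the conditional probability of a whole
    response pattern factors into the product of item probabilities."""
    prob = 1.0
    for (a, b), y in zip(items, pattern):
        p = p_correct(theta, a, b)
        prob *= p if y == 1 else (1.0 - p)
    return prob

items = [(1.2, -0.5), (0.8, 0.3)]   # hypothetical (a, b) item parameters
p_joint = joint_prob(0.0, items, [1, 1])
```

When vignette items are locally dependent, this factorization no longer holds, which is precisely what the testlet-specific gamma parameters introduced below are designed to absorb.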

The purpose of this article is to introduce a new computational estimation algorithm based on the widely used two-level unidimensional testlet response theory (TRT) model (Wainer, Bradlow, & Wang, 2007). Although past research has suggested that the TRT model is ideally suited for obtaining parameter estimates in settings with local dependence, the estimation process can be computationally challenging due to the strong nonlinear coupling of examinee and item parameters. The newly proposed estimation approach seeks to address this issue and contribute to the literature by introducing a computationally efficient algorithm that can be used in practical measurement settings. To accomplish the purposes of this article, a number of simulations using data based on realistic data scenarios were conducted and scrutinized.

The remainder of the article is organized in the following way. The next section presents an overview of the literature on the estimation of tests containing testlet structure. This is followed by a description of the proposed computational approach. Next, complete details on the simulation design, the analyses, and the procedures used to examine the proposed computational approach are described. This is followed by a description of results obtained from analyses of the simulated data. Finally, a discussion of the implications of the findings is provided.

Overview of the Literature

Recent developments in educational and professional testing have provided support for the use of complex test items that often include groups of items united by a common stimulus (Attali, 2011). Examples of such complex test items include those containing a subset of so-called scaffolded tasks, in which one larger task is broken down into smaller tasks that elicit more responses, as well as situations in which test-takers are provided with additional information to complete an item (Bergner, Choi, & Castellano, 2019; Wolf et al., 2016). Wainer and Kiely (1987) proposed a name for such complex items, calling them "testlets." Testlets are commonly used to boost testing efficiency in situations that examine an individual's ability to understand some sort of stimulus (e.g., a reading passage, an information graph, a musical passage, or a table of numbers; Wainer et al., 2007). Other complicated situations include examinations where local independence between the items of different testlets holds, but the assumption of within-testlet independence is violated due to the presence of within-testlet residual dependency in responses. Although attempts to develop a comprehensive, inclusive model for various types of dependent items date back some 30 years (Fray, 1989), the debate on how best to account for test dependency in a holistic final score continues to date. A main reason for this debate is that when items are united by a single prompt, they inevitably violate the assumption of local independence, which is a fundamental assumption of IRT models (De Boeck & Wilson, 2004).

To account for such potentially complex tests with dependencies in responses, De Boeck and Wilson (2004) suggested three different approaches to modeling the vector of Y responses. The first, the conditional modeling approach, models the conditional probability of a certain response to an item given some or all of the other responses. The second, marginal modeling, uses marginal distributions constructed for each component of Y from the set of predictor variables, with the correlation among the components of Y added on top of the marginal distributions to explain residual dependency. The third, the random-effects modeling approach, introduces a vector of random effects into the model to account for the dependency among the components of Y. Although these approaches can in general prove beneficial, for a test with a large number of items they can become prohibitively computationally demanding (Tuerlinckx & De Boeck, 2004).

A number of other approaches to account for tests with dependencies in responses have also been proposed in the extant literature. For example, Fischer (1989) proposed a generalization of the logistic linear model with relaxed assumptions in the context of violation of stochastic independence for tests when testlets are formed. In the model, called a hybrid model, it is possible to estimate changes even if the responses do not have a common latent trait. Verhelst and Glas (1995) also proposed two dynamic generalizations of models that relax the assumption of local stochastic independence. In the first approach, they proposed a special case of a log-linear model with added parameters based on the Rasch model, while for the second approach they applied a framework from mathematical learning theory (Sternberg, 1963).

Other models proposed to handle settings with possible local dependency include the rating scale model (Andrich, 1978), the partial credit model (Masters, 1982), the generalized partial credit model (Muraki, 1992), and the graded response model (Samejima, 1969). Following this strand of research, Culpepper (2014) presented an overview of different sequential item response models applicable to tests constructed using items with violation of local independence, specifically those allowing multiple attempts of an item. He demonstrated how these models for repeated attempts could be applied within the Rasch modeling framework, introducing attempt-specific parameters as a strategy to account for the differences in the probability of providing a correct response across repeated attempts. Although advantageous, this modeling framework was never extended beyond the Rasch model to other models, such as the 2-parameter and 3-parameter logistic models, that are often used by operational testing programs to explain the responses collected from test-takers. Another limitation of these approaches is that the computational complexity needed to solve the problem can be quite demanding. This is the reason Li, Li, and Wang (2010) indicated that "some relatively simple methods to detect local dependency and measure the magnitude of the testlet effect" (p. 22) need to be urgently developed. Given that to date no broader computationally efficient approach for addressing the problem has been suggested, this article proposes a new computational algorithm that is based on the estimation of parameters via the two-level unidimensional TRT model (Wainer et al., 2007).

Method

Model Notation and Specification

To illustrate the approach, let Y_{ij} represent the score category of examinee j on item i. Assuming that item i belongs to testlet d(i) and has m_i + 1 score categories starting from 0, the probability that examinee j provides an answer in category k is given by the following equation:

$$P_{ijk} = \frac{\exp\left(\sum_{v=0}^{k}\left(a_{i1}\theta_j - t_{iv} + a_{i2}\gamma_{d(i)j}\right)\right)}{\sum_{c=0}^{m_i}\exp\left(\sum_{v=0}^{c}\left(a_{i1}\theta_j - t_{iv} + a_{i2}\gamma_{d(i)j}\right)\right)} \quad (1)$$

where P_{ijk} is the probability that examinee j scores in category k of item i, t_{iv} is the difficulty parameter for score category v, θ_j is the ability of examinee j, γ_{d(i)j} represents the secondary ability of examinee j associated with testlet d(i) (which essentially implies that γ_{d(i)j} accounts for local item dependence within testlet d(i)), and a_{i1} and a_{i2} indicate the item's discriminating power with respect to θ and γ, respectively. For identification purposes, the following assumptions are made:

$$\theta \sim N(0,1); \quad \gamma \sim N(0,1); \quad \mathrm{Cov}(\theta,\gamma) = 0$$
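As a quick check on Equation (1), the category-response probabilities can be computed directly. This is a minimal sketch with hypothetical parameter values; note that any term shared by every cumulative sum cancels in the ratio, so the convention used for the lowest category does not affect the result.

```python
import math

def p_ijk(k, theta_j, gamma_dj, a_i1, a_i2, t_i, m_i):
    """Probability that examinee j scores in category k of item i,
    following the structure of Equation (1): exp of the cumulative sum
    of (a1*theta - t_v + a2*gamma) up to k, normalized over all
    m_i + 1 categories."""
    def cum(c):
        # cumulative "step" sum through category c
        return sum(a_i1 * theta_j - t_i[v] + a_i2 * gamma_dj
                   for v in range(c + 1))
    denom = sum(math.exp(cum(c)) for c in range(m_i + 1))
    return math.exp(cum(k)) / denom

# hypothetical values: one item with 3 score categories (m_i = 2)
t = [0.0, -0.4, 0.6]
probs = [p_ijk(k, 0.5, 0.2, 1.1, 0.7, t, 2) for k in range(3)]
```

By construction the probabilities across the m_i + 1 categories sum to one, which is a useful sanity check on any implementation.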

An alternating direction approach is adopted to decouple the examinee ability parameter θ_j from the item-related parameters a_{i1}, a_{i2}, t_{iv}, and γ_{d(i)}. An initial estimate of θ_j is obtained at the testlet level by grouping related items using the following generalized partial credit model (GPCM):

$$P_{ijk} = \frac{\exp\left(\sum_{v=0}^{k} a'_{i1}\left(\theta_j - t'_{iv}\right)\right)}{\sum_{c=0}^{m_i}\exp\left(\sum_{v=0}^{c} a'_{i1}\left(\theta_j - t'_{iv}\right)\right)} \quad (2)$$

where the primed parameters are the testlet-level counterparts. The testlet item scores are obtained by summing the scores on each testlet's constituent items, and the testlet score categories are obtained in the same fashion.

More generally, during the alternating direction TRT fitting, after the examinee ability parameter θ_j has been obtained, the item parameters are estimated by maximizing the standard likelihood function:

$$L\left(a_{i1}, a_{i2}, t_{iv}, \gamma_{d(i)}\right) = \sum_{j=1}^{N} \log \frac{\exp\left(\sum_{v=0}^{k}\left(a_{i1}\theta_j - t_{iv} + a_{i2}\gamma_{d(i)j}\right)\right)}{\sum_{c=0}^{m_i}\exp\left(\sum_{v=0}^{c}\left(a_{i1}\theta_j - t_{iv} + a_{i2}\gamma_{d(i)j}\right)\right)} \quad (3)$$

with N being the total number of examinees considered.

With successful decoupling, the above equation becomes a relatively straightforward mathematical system that can be solved using standard numerical approaches. Similarly, after the parameters a_{i1}, a_{i2}, t_{iv}, and γ_{d(i)} have been updated, the θ_j values are obtained by maximizing the following equation:

$$L\left(\theta_j\right) = \sum_{i=1}^{M} \log \frac{\exp\left(\sum_{v=0}^{k}\left(a_{i1}\theta_j - t_{iv} + a_{i2}\gamma_{d(i)j}\right)\right)}{\sum_{c=0}^{m_i}\exp\left(\sum_{v=0}^{c}\left(a_{i1}\theta_j - t_{iv} + a_{i2}\gamma_{d(i)j}\right)\right)} \quad (4)$$

with M being the number of items.

Monte Carlo Data Simulation and Analytic Strategy

To systematically evaluate the performance of the proposed approach, data simulated using Monte Carlo techniques were analyzed under a variety of design conditions. The goal was to develop a computational approach, based on the likelihood function, that allows estimation of the parameters in the testlet model without loss of precision or reliability. Two simulated data sets were constructed and examined. The first contained synthetic data for 800 test-takers examined on 6 testlets with 5 items within each testlet, and the second contained synthetic data for 800 test-takers examined on 5 testlets with 20 items within each testlet. The complete itemized structure of the two generated data sets is presented in Tables 1 to 4.

Table 1.

The Structure of Data Set 1, Test-Taker-Related Parameters.

Test-taker   Gammas                                Thetas
1            γ_{1,1}, γ_{1,2}, …, γ_{1,6}          θ_1
2            γ_{2,1}, γ_{2,2}, …, γ_{2,6}          θ_2
…            …                                     …
800          γ_{800,1}, γ_{800,2}, …, γ_{800,6}    θ_800

Table 2.

The Structure of Data Set 1, Item-Related Parameters.

Item   Discrimination         Difficulty
1      a_{1,1}, a_{1,2}       t_{1,1}, t_{1,2}, t_{1,3}
2      a_{2,1}, a_{2,2}       t_{2,1}, t_{2,2}, t_{2,3}
…      …                      …
30     a_{30,1}, a_{30,2}     t_{30,1}, t_{30,2}, t_{30,3}

Table 3.

The Structure of Data Set 2, Test-Taker-Related Parameters.

Test-taker   Gammas                                Thetas
1            γ_{1,1}, γ_{1,2}, …, γ_{1,5}          θ_1
2            γ_{2,1}, γ_{2,2}, …, γ_{2,5}          θ_2
…            …                                     …
800          γ_{800,1}, γ_{800,2}, …, γ_{800,5}    θ_800

Table 4.

The Structure of Data Set 2, Item-Related Parameters.

Item   Discrimination           Difficulty
1      a_{1,1}, a_{1,2}         t_{1,1}, t_{1,2}, t_{1,3}
2      a_{2,1}, a_{2,2}         t_{2,1}, t_{2,2}, t_{2,3}
…      …                        …
100    a_{100,1}, a_{100,2}     t_{100,1}, t_{100,2}, t_{100,3}

After generating the synthetic data sets, the next step was to recover the parameters using Bayesian methods via Markov Chain Monte Carlo (MCMC) estimation and using the generalized maximum likelihood estimation (GMLE) method.

Bayesian Estimation

Considering the vector of Y responses, Bayesian analysis relies on samples drawn from the following posterior probability distribution:

$$P(a, t, \theta, \gamma \mid Y, \ldots) \quad (5)$$

where "…" corresponds to the hyperparameters used in the prior distributions. The prior distributions of a and t are often chosen from conjugate priors to yield a closed-form posterior distribution for algebraic convenience. The MCMC method is the most frequently used algorithm for generating samples from a posterior distribution. This estimation algorithm is summarized below.

Given an old sample (a_old, t_old, θ_old, γ_old), where the term "old" refers to the parameter estimate from the previous step:

  1. a_new is sampled from the distribution P(a | t_old, θ_old, γ_old, Y, …)

  2. t_new is sampled from the distribution P(t | a_new, θ_old, γ_old, Y, …)

  3. θ_new is sampled from the distribution P(θ | a_new, t_new, γ_old, Y, …)

  4. γ_new is sampled from the distribution P(γ | a_new, t_new, θ_new, Y, …)

The fully updated (anew,tnew,θnew,γnew) is then a new sample following the posterior distribution.
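The four conditional-sampling steps above can be sketched as a generic Metropolis-within-Gibbs sweep. This is a minimal illustration, not the authors' implementation: the log-posterior here is a toy stand-in for Equation (5), each block is a single scalar rather than a vector, and Gaussian random-walk proposals are assumed.

```python
import math
import random

def mwg_sweep(state, log_post, scale=0.5, rng=random):
    """One Metropolis-within-Gibbs sweep mirroring steps 1-4: each block
    (a, t, theta, gamma) is updated in turn from its conditional, holding
    the other blocks at their newest values."""
    new = dict(state)
    for name in ("a", "t", "theta", "gamma"):
        prop = dict(new)
        prop[name] = new[name] + rng.gauss(0.0, scale)  # random-walk proposal
        # standard Metropolis accept/reject on the log scale
        if math.log(rng.random() + 1e-300) < log_post(prop) - log_post(new):
            new = prop
    return new

# toy stand-in for the posterior in Equation (5): independent N(0, 1) blocks
def toy_log_post(s):
    return -0.5 * sum(v * v for v in s.values())

state = {"a": 1.0, "t": 0.0, "theta": 0.0, "gamma": 0.0}
for _ in range(200):
    state = mwg_sweep(state, toy_log_post)
```

When the priors are conjugate, as the text notes, the accept/reject step can be replaced by a direct draw from the closed-form full conditional, which is the Gibbs sampler proper.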

Generalized Maximum Likelihood Estimation

Maximum likelihood approaches estimate the model parameters that maximize the likelihood function, given the examinees' overall matrix of responses Y. This can be specified as follows:

$$\prod_{i=1}^{I}\prod_{j=1}^{J} P\left(Y_{ij} \mid a_{i1}, a_{i2}, t_{iv}, \theta_j, \gamma_{d(i)j}\right) \quad (6)$$

or equivalently as,

$$E(a, t, \theta, \gamma \mid Y) = \sum_{i=1}^{I}\sum_{j=1}^{J} \log P\left(Y_{ij} \mid a_{i1}, a_{i2}, t_{iv}, \theta_j, \gamma_{d(i)j}\right) = \sum_{i=1}^{I}\sum_{j=1}^{J} \log \frac{\exp\left(\sum_{v=0}^{k}\left(a_{i1}\theta_j - t_{iv} + a_{i2}\gamma_{d(i)j}\right)\right)}{\sum_{c=0}^{m_i}\exp\left(\sum_{v=0}^{c}\left(a_{i1}\theta_j - t_{iv} + a_{i2}\gamma_{d(i)j}\right)\right)} \quad (7)$$

Numerically, the iterative gradient descent method is usually adopted to find optimal model parameters:

$$\left(a_{\text{new}}, t_{\text{new}}, \theta_{\text{new}}, \gamma_{\text{new}}\right) = \left(a_{\text{old}}, t_{\text{old}}, \theta_{\text{old}}, \gamma_{\text{old}}\right) + r_u\left(\frac{\partial E}{\partial a}, \frac{\partial E}{\partial t}, \frac{\partial E}{\partial \theta}, \frac{\partial E}{\partial \gamma}\right) \quad (8)$$

where r_u is the updating parameter (step size).

Because of the complexity of E, deriving explicit formulas for its derivatives with respect to a, t, θ, and γ is impractical and subject to numerical instability. To ease the computational burden, the TensorFlow platform (described further below) is used to aid in the complex calculation of the derivatives.
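Equation (8) amounts to one gradient step on E. The sketch below is illustrative only: it replaces TensorFlow's automatic differentiation with a central finite-difference gradient, and it uses a toy concave objective (maximum at (1, -2)) in place of the actual log-likelihood. The name r_u follows the text; everything else is hypothetical.

```python
def grad_fd(E, params, eps=1e-6):
    """Central finite-difference gradient of E, standing in here for the
    automatic differentiation that TensorFlow performs in the article."""
    grad = []
    for i in range(len(params)):
        hi, lo = list(params), list(params)
        hi[i] += eps
        lo[i] -= eps
        grad.append((E(hi) - E(lo)) / (2.0 * eps))
    return grad

def update(E, params, r_u=0.1):
    """One application of Equation (8): step every parameter block along
    the gradient of E, scaled by the updating parameter r_u."""
    return [p + r_u * g for p, g in zip(params, grad_fd(E, params))]

# toy concave objective with its maximum at (1, -2)
E = lambda p: -(p[0] - 1.0) ** 2 - (p[1] + 2.0) ** 2
params = [0.0, 0.0]
for _ in range(200):
    params = update(E, params)
```

Because E is being maximized, the update adds the gradient; repeated updates drive the parameters toward the optimum as long as r_u is small enough for the objective's curvature.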

The following steps summarize the algorithm implemented:

  1. Generate θ_old, γ_old from N(0, 1).

  2. Find a_old, t_old maximizing

$$E\left(a, t \mid \theta_{\text{old}}, \gamma_{\text{old}}\right) = \sum_{i=1}^{I}\sum_{j=1}^{J} \log \frac{\exp\left(\sum_{v=0}^{k}\left(a_{i1}\theta_{\text{old},j} - t_{iv} + a_{i2}\gamma_{\text{old},d(i)j}\right)\right)}{\sum_{c=0}^{m_i}\exp\left(\sum_{v=0}^{c}\left(a_{i1}\theta_{\text{old},j} - t_{iv} + a_{i2}\gamma_{\text{old},d(i)j}\right)\right)}$$

using the gradient descent method.

  3. Given a_old, t_old, find θ_new, γ_new maximizing

$$E\left(\theta, \gamma \mid a_{\text{old}}, t_{\text{old}}\right) = \sum_{i=1}^{I}\sum_{j=1}^{J} \log \frac{\exp\left(\sum_{v=0}^{k}\left(a_{\text{old},i1}\theta_j - t_{\text{old},iv} + a_{\text{old},i2}\gamma_{d(i)j}\right)\right)}{\sum_{c=0}^{m_i}\exp\left(\sum_{v=0}^{c}\left(a_{\text{old},i1}\theta_j - t_{\text{old},iv} + a_{\text{old},i2}\gamma_{d(i)j}\right)\right)}$$

  4. Set θ_old = θ_new, γ_old = γ_new.

  5. Repeat steps 2 and 3 until convergence is attained.
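Steps 1 to 5 can be condensed into a small alternating-maximization loop. This is a schematic sketch, not the study's implementation: each "block" is a single scalar, the objective E is a toy stand-in for the likelihoods in Equations (3) and (4) with its joint maximum at (a, theta) = (1.75, -0.75), and the inner maximizer uses finite-difference gradient ascent.

```python
def maximize_1d(E, x0, lr=0.05, iters=300, eps=1e-6):
    """Tiny one-dimensional gradient-ascent helper (finite differences)."""
    x = x0
    for _ in range(iters):
        x += lr * (E(x + eps) - E(x - eps)) / (2.0 * eps)
    return x

# toy coupled objective standing in for Equations (3) and (4)
E = lambda a, th: -(a - 2.0) ** 2 - (th + 1.0) ** 2 - 0.1 * (a - th) ** 2

a, theta = 0.0, 0.0                 # step 1: initialize the person block
for _ in range(25):                 # steps 4-5: alternate until convergence
    a = maximize_1d(lambda x: E(x, theta), a)      # step 2: item block
    theta = maximize_1d(lambda x: E(a, x), theta)  # step 3: person block
```

The cross term couples the two blocks, so neither inner maximization alone finds the joint optimum; it is the alternation that drives both blocks to it, which is the essence of the decoupling strategy described above.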

The TensorFlow Platform

In the fields of mathematics and physics, tensors are geometric objects that describe multilinear relations between vectors, scalars, and other tensors. Accordingly, tensors can be considered generalizations of scalars and vectors: a scalar is a zero-rank tensor, and a vector is a first-rank tensor. Higher-rank tensors are needed when more than one direction is required to describe physical or other properties. In statistics, tensors can be quite helpful when the examined data are multidimensional and require tedious calculations.

TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence organization for the purpose of conducting machine learning and deep neural network research. TensorFlow is an open-source software platform for dataflow programming. The platform provides an interface that can be used to express complex machine learning algorithms and enables the implementation and execution of such algorithms. The system is quite flexible and can be effectively used to express a wide variety of algorithms, particularly those required in machine learning and neural network models (Abadi et al., 2015).

Any computation that can be expressed as a computational flow graph can, in principle, be executed on the TensorFlow platform. For the analyses in this study, TensorFlow was used to calculate the complex derivatives in the likelihood functions. All calculations were conducted in the cloud, which provided additional computational power as needed. The Python source code needed to estimate the testlet models examined in this study using TensorFlow is included in the online Supplementary Appendix B.

Results

The overall parameter recovery by the MCMC and the GMLE estimation methods was in general successful. In terms of estimated item parameters, item difficulties were recovered much better than item discriminations. In terms of test-taker-related parameters, the recovery was outstanding using both methodologies. Detailed results of the parameter recoveries are presented in Tables 5, 6, and 7 for Dataset 1 and in Tables 8, 9, and 10 for Dataset 2. Figures 1 to 4 display graphical visualizations of the parameter recovery results for Dataset 1, while Figures 5 to 8 display similar visualizations for Dataset 2.

Table 5.

Correlation Between Real and Recovered Differentiation Parameters, Dataset 1.

Real differentiation MCMC differentiation GMLE differentiation
Real differentiation 1.00
MCMC differentiation 0.83 1.00
GMLE differentiation 0.34 0.40 1.00

Note. GMLE = generalized maximum likelihood estimation, MCMC = Markov Chain Monte Carlo.

Table 6.

Correlation Between Real and Recovered Difficulty Parameters, Dataset 1.

Real difficulty MCMC difficulty GMLE difficulty
Real difficulty 1.00
MCMC difficulty 0.99 1.00
GMLE difficulty 0.97 0.99 1.00

Note. GMLE = generalized maximum likelihood estimation, MCMC = Markov Chain Monte Carlo.

Table 7.

Correlations Between Real Parameters and Recovered Proficiency, Dataset 1.

Real Theta Real Gamma 1 Real Gamma 2 Real Gamma 3 Real Gamma 4 Real Gamma 5 Real Gamma 6 MCMC Theta MCMC Gamma 1 MCMC Gamma 2 MCMC Gamma 3 MCMC Gamma 4 MCMC Gamma 5 MCMC Gamma 6 GMLE Theta GMLE Gamma 1 GMLE Gamma 2 GMLE Gamma 3 GMLE Gamma 4 GMLE Gamma 5 GMLE Gamma 6
Real Theta 1.00
Real Gamma 1 −0.01 1.00
Real Gamma 2 −0.01 −0.03 1.00
Real Gamma 3 0.00 −0.07 0.02 1.00
Real Gamma 4 −0.05 0.02 0.01 0.04 1.00
Real Gamma 5 0.01 0.05 −0.01 −0.05 −0.07 1.00
Real Gamma 6 0.05 0.00 −0.06 0.04 −0.01 −0.01 1.00
MCMC Theta 0.92 0.11 0.10 0.08 0.06 0.10 0.19 1.00
MCMC Gamma 1 0.13 0.80 −0.12 −0.14 −0.06 −0.03 −0.10 0.14 1.00
MCMC Gamma 2 0.13 −0.09 0.80 −0.05 −0.07 −0.10 −0.13 0.13 −0.11 1.00
MCMC Gamma 3 0.13 −0.16 −0.04 0.81 −0.07 −0.11 −0.03 0.12 −0.17 −0.04 1.00
MCMC Gamma 4 0.13 −0.05 −0.07 −0.03 0.78 −0.15 −0.09 0.14 −0.07 −0.06 −0.07 1.00
MCMC Gamma 5 0.15 −0.07 −0.11 −0.12 −0.14 0.79 −0.07 0.15 −0.08 −0.12 −0.13 −0.17 1.00
MCMC Gamma 6 0.11 −0.12 −0.12 −0.05 −0.12 −0.08 0.80 0.16 −0.17 −0.15 −0.07 −0.13 −0.10 1.00
GMLE Theta 0.89 0.14 0.14 0.13 0.10 0.12 0.21 0.96 0.18 0.19 0.18 0.19 0.17 0.18 1.00
GMLE Gamma 1 0.48 0.71 −0.08 −0.12 −0.04 0.00 −0.02 0.52 0.89 −0.07 −0.13 −0.03 −0.02 −0.10 0.50 1.00
GMLE Gamma 2 0.46 −0.05 0.73 −0.03 −0.05 −0.06 −0.07 0.49 −0.06 0.90 −0.02 −0.03 −0.07 −0.10 0.50 0.14 1.00
GMLE Gamma 3 0.41 −0.12 −0.01 0.74 −0.08 −0.08 0.02 0.42 −0.13 −0.02 0.93 −0.05 −0.08 −0.04 0.44 0.05 0.14 1.00
GMLE Gamma 4 0.50 −0.02 −0.04 −0.01 0.66 −0.11 −0.02 0.53 −0.03 −0.02 −0.03 0.87 −0.10 −0.05 0.53 0.20 0.18 0.13 1.00
GMLE Gamma 5 0.50 −0.03 −0.07 −0.08 −0.12 0.70 0.00 0.53 −0.02 −0.07 −0.07 −0.11 0.89 −0.04 0.51 0.19 0.13 0.10 0.13 1.00
GMLE Gamma 6 0.50 −0.07 −0.08 −0.02 −0.09 −0.02 0.72 0.57 −0.10 −0.10 −0.02 −0.07 −0.02 0.87 0.54 0.15 0.12 0.16 0.20 0.21 1.00

Note. GMLE = generalized maximum likelihood estimation, MCMC = Markov Chain Monte Carlo.

Table 8.

Correlation Between Real and Recovered Differentiation Parameters, Dataset 2.

Real differentiation MCMC differentiation GMLE differentiation
Real differentiation 1.00
MCMC differentiation 0.91 1.00
GMLE differentiation 0.42 0.40 1.00

Note. GMLE = generalized maximum likelihood estimation, MCMC = Markov Chain Monte Carlo.

Table 9.

Correlation Between Real and Recovered Difficulty Parameters, Dataset 2.

Real difficulty MCMC difficulty GMLE difficulty
Real difficulty 1.00
MCMC difficulty 0.99 1.00
GMLE difficulty 0.98 0.99 1.00

Note. GMLE = generalized maximum likelihood estimation, MCMC = Markov Chain Monte Carlo.

Table 10.

Correlations Between Real Parameters and Recovered Proficiency, Dataset 2.

Real Theta Real Gamma 1 Real Gamma 2 Real Gamma 3 Real Gamma 4 Real Gamma 5 MCMC Theta MCMC Gamma 1 MCMC Gamma 2 MCMC Gamma 3 MCMC Gamma 4 MCMC Gamma 5 GMLE Theta GMLE Gamma 1 GMLE Gamma 2 GMLE Gamma 3 GMLE Gamma 4 GMLE Gamma 5
Real Theta 1.00
Real Gamma 1 0.03 1.00
Real Gamma 2 −0.01 0.04 1.00
Real Gamma 3 0.01 0.02 −0.01 1.00
Real Gamma 4 0.00 0.02 −0.02 0.00 1.00
Real Gamma 5 −0.05 −0.03 −0.01 0.03 0.04 1.00
MCMC Theta 0.94 0.10 0.05 0.13 0.11 0.05 1.00
MCMC Gamma 1 0.12 0.91 −0.02 −0.07 −0.04 −0.11 0.11 1.00
MCMC Gamma 2 0.10 −0.02 0.90 −0.08 −0.12 −0.08 0.09 −0.02 1.00
MCMC Gamma 3 0.10 −0.02 −0.08 0.89 −0.07 −0.07 0.12 −0.04 −0.08 1.00
MCMC Gamma 4 0.11 −0.05 −0.08 −0.10 0.90 −0.06 0.11 −0.03 −0.11 −0.09 1.00
MCMC Gamma 5 0.07 −0.08 −0.07 −0.07 −0.05 0.92 0.08 −0.11 −0.06 −0.10 −0.07 1.00
GMLE Theta 0.87 0.22 0.15 0.19 0.19 0.15 0.93 0.23 0.18 0.21 0.20 0.18 1.00
GMLE Gamma 1 0.44 0.84 −0.06 −0.06 −0.04 −0.14 0.45 0.92 −0.05 −0.05 −0.04 −0.13 0.49 1.00
GMLE Gamma 2 0.41 −0.04 0.82 −0.07 −0.12 −0.11 0.42 −0.05 0.92 −0.08 −0.11 −0.08 0.44 0.07 1.00
GMLE Gamma 3 0.50 −0.04 −0.10 0.80 −0.07 −0.09 0.55 −0.06 −0.09 0.88 −0.09 −0.10 0.54 0.12 0.09 1.00
GMLE Gamma 4 0.47 −0.07 −0.11 −0.08 0.81 −0.09 0.50 −0.05 −0.12 −0.08 0.90 −0.09 0.51 0.11 0.04 0.12 1.00
GMLE Gamma 5 0.38 −0.10 −0.10 −0.06 −0.06 0.84 0.41 −0.13 −0.09 −0.10 −0.09 0.93 0.43 −0.01 0.03 0.07 0.06 1.00

Note. GMLE = generalized maximum likelihood estimation, MCMC = Markov Chain Monte Carlo.

Figure 1. Comparison between real differentiation parameters and parameters recovered by Markov Chain Monte Carlo (MCMC) and generalized maximum likelihood estimation (GMLE), Dataset 1.

Figure 4. Comparison between real secondary ability (gamma) parameters and parameters recovered by Markov Chain Monte Carlo (MCMC) and generalized maximum likelihood estimation (GMLE), Dataset 1.

Figure 5. Comparison between real differentiation parameters and parameters recovered by Markov Chain Monte Carlo (MCMC) and generalized maximum likelihood estimation (GMLE), Dataset 2.

Figure 8. Comparison between real secondary ability (gamma) parameters and parameters recovered by Markov Chain Monte Carlo (MCMC) and generalized maximum likelihood estimation (GMLE), Dataset 2.

As can be seen from these results, neither method recovered the item discrimination values especially well (see Figures 1 and 5), although overall the MCMC was somewhat superior to the GMLE in this recovery. The correlations between the sets of real parameters and the sets of parameters recovered by the MCMC were r = .83 for Dataset 1 and r = .91 for Dataset 2. In contrast, the correlations between the real parameters and the GMLE-recovered item discrimination values were rather low, at r = .34 for Dataset 1 and r = .42 for Dataset 2.

Item difficulties were recovered with much better success by both methods (see Figures 2 and 6). Specifically, the correlations between the real parameters and the MCMC-recovered parameters were r = .99 for Dataset 1, and r = .99 for Dataset 2. Similarly, for the GMLE-recovered parameters, the correlations were r = .97 for Dataset 1, and r = .98 for Dataset 2.

Figure 2. Comparison between real difficulty parameters and parameters recovered by Markov Chain Monte Carlo (MCMC) and generalized maximum likelihood estimation (GMLE), Dataset 1.

Figure 6. Comparison between real difficulty parameters and parameters recovered by Markov Chain Monte Carlo (MCMC) and generalized maximum likelihood estimation (GMLE), Dataset 2.

Test-taker-related parameters were also recovered well by both methods. The correlations between the real thetas and the MCMC-estimated thetas were r = .92 for Dataset 1 and r = .94 for Dataset 2, while for the GMLE-estimated parameters the correlations were r = .89 for Dataset 1 and r = .87 for Dataset 2 (see Figures 3 and 7). Finally, for the gamma parameter recovery, the following average correlations were obtained: for MCMC, r̄ = .80 for Dataset 1 and r̄ = .91 for Dataset 2; for GMLE, r̄ = .70 for Dataset 1 and r̄ = .84 for Dataset 2 (see Figures 4 and 8).

Figure 3. Comparison between real overall ability (theta) parameters and parameters recovered by Markov Chain Monte Carlo (MCMC) and generalized maximum likelihood estimation (GMLE), Dataset 1.

Figure 7. Comparison between real overall ability (theta) parameters and parameters recovered by Markov Chain Monte Carlo (MCMC) and generalized maximum likelihood estimation (GMLE), Dataset 2.

Discussion

When the NBCE made the decision to move away from classical test theory scoring and implement IRT-based operational scoring procedures, it was faced with the problem that some of the items in its licensure examination violate model assumptions. Specifically, the problem emerged because key components of the chiropractic clinical competency examination involve groups of items that are integrated by a common case vignette. Because of the nature of the vignette items, the IRT assumption of conditional independence, or local independence, is inevitably violated. In search of a model to account for this consequential violation, a two-level unidimensional TRT model was implemented (Wainer et al., 2007). Because the application of such a model can be computationally demanding as the number of items increases, a new computational estimation algorithm needed to be developed. The newly proposed estimation approach provides the necessary solution to this issue and contributes to the literature by introducing a computationally efficient algorithm that can be used in practical measurement settings. To accomplish the purposes of this article, a number of simulations using data based on realistic data scenarios were conducted and scrutinized.

Although the specific models themselves are not entirely new, the approach to computation and the development of the estimation algorithms are novel. To estimate the various testlet models examined in this study, we used the TensorFlow platform. To the best of our knowledge, this study is the first application of such a powerful platform for psychometric work. Through increased computational power, the platform enabled the analysis of complex likelihood functions that to date have proved to be computationally intensive and extremely time consuming. For example, it took as much as 15 hours for one MCMC chain to converge, while the GMLE analyses took only about 1 hour.
Another shortcoming of the Bayesian method is its inability to parallelize the analysis (i.e., to divide the computation across different computers), because each draw in an MCMC chain depends on the previous one. By gaining computational power through the use of the TensorFlow platform, we were able to compare the quality of estimation between the MCMC and GMLE approaches without much difficulty.

Overall, our results indicated that apart from the estimation of the item discrimination values, the parameter estimates recovered by the two examined methods were very similar. Although the MCMC estimation showed a slight advantage over the GMLE approach, the time needed to arrive at these estimates can be rather excessive. Given the tendency of the MCMC approach to overestimate the abilities of lower-ability test-takers and underestimate those of higher-ability test-takers, using the GMLE, which provided less biased estimates, may ultimately be the better option. Indeed, the parameter estimates obtained from the GMLE approach may even serve as starting values for the MCMC, saving burn-in time and helping the chains converge much faster.

In conclusion, we believe the proposed approach is extremely valuable in helping researchers tackle parameter estimation in complex tests with dependencies in responses with greater power and accuracy than before. At the same time, we acknowledge that more work is needed on improving the efficiency of implementations of these approaches.

Supplemental Material

Online_Appendix – Supplemental material for A Two-Level Alternating Direction Model for Polytomous Items With Local Dependence


Footnotes

Authors’ Note: Guoliang Fang is now affiliated with Colorado State University Global, CO, USA.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD: Igor Himelfarb https://orcid.org/0000-0002-2622-6062

Supplemental Material: Supplemental material for this article is available online.

References

  1. Abadi M., Agrawal A., Barham P., Brevdo E., Chen Z., Citro G., . . . Zheng X. (2015). TensorFlow: Large-scale machine learning on heterogeneous distributed systems (Preliminary White Paper). [Google Scholar]
  2. Andrich D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573. [Google Scholar]
  3. Attali Y. (2011). Immediate feedback and opportunity to revise answers: Application of a graded response IRT model. Applied Psychological Measurement, 35, 472-479. [Google Scholar]
  4. Bergner Y., Choi I., Castellano K. E. (2019). Item response models for multiple attempts. Journal of Educational Measurement, 56, 415-436. [Google Scholar]
  5. Culpepper A. S. (2014). If at first you don’t succeed, try, try again: Applications of sequential IRT models to cognitive assessment. Applied Psychological Measurement, 38, 632-644. [Google Scholar]
  6. De Boeck P., Wilson M. (2004). Explanatory item response models: A generalized nonlinear approach. New York, NY: Springer. [Google Scholar]
  7. Fischer G. H. (1989). An IRT-based model for dichotomous longitudinal data. Psychometrika, 54, 599-624. [Google Scholar]
  8. Fray R. B. (1989). Partial credit scoring methods for multiple choice tests. Applied Measurement in Education, 2, 79-96. [Google Scholar]
  9. Li Y., Li S., Wang L. (2010). Application of a general polytomous testlet model to the reading section of a large-scale English language assessment (ETS RR-10-21). Princeton, NJ: ETS. [Google Scholar]
  10. Masters G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174. [Google Scholar]
  11. Muraki E. (1992). A generalized partial credit model: Application of an E-M algorithm. Applied Psychological Measurement, 16, 159-176. [Google Scholar]
  12. Samejima F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph no. 17). Richmond, VA: Psychometric Society. [Google Scholar]
  13. Sireci S. G., Thissen D., Wainer H. (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247. [Google Scholar]
  14. Sternberg S. H. (1963). Stochastic learning theory. In Luce R. D., Bush R. R., Galanter E. (Eds.), Handbook of mathematical psychology (Vol. 2; pp. 1-120). New York, NY: Wiley. [Google Scholar]
  15. Tuerlinckx F., De Boeck P. (2004). Models for residual dependencies. In De Boeck P., Wilson M. (Eds.), Explanatory item response models: A generalized nonlinear approach (pp. 289-316). New York, NY: Springer. [Google Scholar]
  16. Verhelst N. D., Glas C. A. W. (1995). Dynamic generalizations of the Rasch model. In Fischer G. H., Molenaar I. W. (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 181-201). New York, NY: Springer-Verlag. [Google Scholar]
  17. Wainer H., Bradlow E. T., Wang X. (2007). Testlet response theory and its applications. New York, NY: Cambridge University Press. [Google Scholar]
  18. Wainer H., Kiely G. L. (1987). Item clusters and computerized adaptive testing: A case of testlets. Journal of Educational Measurement, 24, 185-201. [Google Scholar]
  19. Wainer H., Wang X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37, 203-220. [Google Scholar]
  20. Wolf M. K., Guzman-Orth D., Lopez A., Castellano K., Himelfarb I., Tsutagawa F. (2016). Integrating scaffolding strategies into technology-enhanced assessments of English learners: Task types and measurement models. Assessment, 21, 157-175. [Google Scholar]

