Educational and Psychological Measurement
2019 Dec 13;80(4):726–755. doi: 10.1177/0013164419892667

Polytomous Item Explanatory Item Response Theory Models

Jinho Kim1,2, Mark Wilson1
PMCID: PMC7307487  PMID: 32616956

Abstract

This study investigates polytomous item explanatory item response theory models under the multivariate generalized linear mixed modeling framework, using the linear logistic test model approach. Building on the original ideas of the many-facet Rasch model and the linear partial credit model, a polytomous Rasch model is extended to the item location explanatory many-facet Rasch model and the step difficulty explanatory linear partial credit model. To demonstrate the practical differences between the two polytomous item explanatory approaches, two empirical studies examine how item properties explain and predict the overall item difficulties or the step difficulties in the Carbon Cycle assessment data and in the Verbal Aggression data, respectively. The results suggest that the two polytomous item explanatory models are methodologically and practically different in terms of (a) the target difficulty parameters of polytomous items, which are explained by item properties; (b) the types of predictors for the item properties incorporated into the design matrix; and (c) the types of item property effects. The potentials and methodological advantages of item explanatory modeling are discussed as well.

Keywords: explanatory item response models, linear logistic test model, item property, polytomous item, many-facet Rasch model, linear partial credit model


In item response theory (IRT), both person traits and item characteristics are modeled together as predictors in a statistical model with a nonlinear relationship between the item response probability and the predictors. Using explanatory item response models (EIRM; De Boeck & Wilson, 2004), one can explain the person side, the item side, or both sides of the item response data by incorporating person properties, item properties, and their interactions as predictors in the statistical model. A typical and well-known item explanatory IRT model for examining item property effects is the linear logistic test model (LLTM; Fischer, 1973), in which the item difficulty parameters of a Rasch model are decomposed into weighted sums of elementary components related to item properties (Embretson & Reise, 2000). The LLTM accounts for the effects of item properties on the item difficulties by incorporating observable item properties such as item design variables, cognitive operations, item content features, and response formats. It can also be used as a testing tool for hypothesized constructs in item generation. For example, when a specific test construction rationale that is hypothesized for item generation is developed to predict item difficulties, the LLTM can assess the appropriateness of that rationale and help enable automatized item generation (Reif, 2012). Moreover, the LLTM approach can be used for psychometric studies measuring the effects of various testing conditions, such as item presentation position, content-specific learning, speeded item presentation, and item response format, as well as the measurement of change (Kubinger, 2009; Poinstingl, 2009). Thus, this item explanatory approach can provide informative feedback for enhancing test development, item generation, and cognitive assessment.

Despite these diverse uses in educational and psychological measurement practice, the LLTM approach has not been widely adopted. Historically, most early studies using the LLTM appeared in German-speaking countries and were written in German (Kubinger, 2009). In particular, most of its applications are limited to dichotomous data (e.g., Fischer, 1973; Kubinger, 2009; Poinstingl, 2009). However, polytomous data are very common in a wide range of educational, psychological, and sociological applications (De Boeck & Wilson, 2004). In this research, we focus on ordered-category (or ordinal) responses for polytomous data because they are quite common achievement outcomes in measurement and assessment contexts. For instance, scoring rubrics under a learning progression framework often have ordered categorical levels representing qualitatively different learning progression levels (e.g., Jin, Shin, Johnson, Kim, & Anderson, 2015). We also focus on the adjacent-categories logit relationship between the ordered-category item responses and predictors of person ability and item characteristics. Adjacent-categories logit-based item response models make person and item parameters separable (parameter separability) and hence permit specifically objective comparisons of persons and items (specific objectivity; Masters, 1982). They are appropriate where local comparison of response probabilities between ordered adjacent categories is of interest (Masters & Wright, 1997). Moreover, adjacent-categories logit ordinal regression models have a suitable parameterization for subjectively assessed scores in ordered responses (Johnson, 2007).

Our research aim is to investigate how the LLTM approach can be applied to polytomous data, particularly in terms of item explanatory extensions of a polytomous Rasch model (hereafter referred to as polytomous item explanatory IRT models), under a general statistical modeling framework, the multivariate generalized linear mixed model (MGLMM; Hartzel, Agresti, & Caffo, 2001). One might suppose that any item explanatory IRT model for dichotomous data could simply be applied to polytomous data. However, this is not the case conceptually or practically. Applying the LLTM approach to polytomous data is not straightforward, owing to the complications of item parameterization in the statistical model as well as the difficulty of reparameterization incorporating the observed item properties. The seminal study by Fischer (1977) and its follow-up studies by Fischer and Parzer (1991) and Fischer and Ponocny (1994) investigated extensions of polytomous Rasch models using the LLTM approach, based on conditional maximum likelihood estimation rather than the more commonly employed marginal maximum likelihood (MML) estimation. They utilized a normalization constant, basic parameters, and given weights or values of item properties for item parameterization; however, these parameters are complicated and difficult to interpret, and they even make it hard to calculate and interpret the step difficulties. Glas and Verhelst (1989) investigated extensions of a polytomous Rasch model based on MML estimation. Although they introduced a reparameterization to impose linear restrictions on the item parameters, their item parameters are not straightforward to interpret, and a difficult translation back into the original parameterization is required. Masters (1982) pointed out, "To my knowledge, this polytomous generalization of the linear logistic test model (Fischer, 1972) has not been applied" (p. 153). These complications and difficulties in item parameterization and reparameterization may explain why applications of the LLTM approach to polytomous data are considerably less common than applications to dichotomous data. Such problems for polytomous item explanatory IRT models can be resolved using the MGLMM framework, which allows flexible statistical modeling based on MML estimation.

To achieve the aim of this research, given our focus on ordered-category responses and the adjacent-categories logits, a unidimensional polytomous Rasch model is first reviewed as a starting point. Next, we investigate how to develop it into polytomous item explanatory IRT models using the LLTM approach under the MGLMM framework. Two empirical studies, applying these models to the Carbon Cycle assessment data and to the Verbal Aggression data, respectively, are then presented. The empirical studies show how the developed models work in practice and how the observed item properties explain and predict the difficulties of polytomous items.

Item Explanatory Extensions of a Polytomous Rasch Model

Partial Credit Model

Based on ordered-category responses and adjacent-categories logits, a polytomous Rasch model will be extended toward polytomous item explanatory IRT models. To reach those extensions, we start with a basic model for the adjacent-categories logit-based item response models, the partial credit model (PCM; Masters, 1982). The PCM is a straightforward application of the Rasch model to polytomous data (Masters & Wright, 1997), in which the conditional probability that person $p$ with ability $\theta_p$ responds with category score $m$ on item $i$ with step difficulty parameters $\delta_{im}$ is defined as

$$\Pr(y_{pi}=m \mid \theta_p)=\frac{\exp\left(\sum_{l=0}^{m}(\theta_p-\delta_{il})\right)}{\sum_{h=0}^{M_i}\exp\left(\sum_{l=0}^{h}(\theta_p-\delta_{il})\right)},\quad m=0,1,\ldots,M_i, \tag{1}$$

where $\theta_p \sim N(0,\sigma_\theta^2)$, $\delta_{i0}=0$, and $\sum_{l=0}^{0}(\theta_p-\delta_{il})=0$. Note that the mean of person abilities is constrained to zero for model identification,1 and this constraint is consistently used for all IRT models in this research.
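For concreteness, here is a minimal Python sketch (ours, not from the original article) of the category probabilities in Equation 1; the function name and example values are hypothetical.

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Category probabilities for one item under the PCM (Equation 1).

    theta  : person ability (scalar).
    deltas : step difficulties [delta_i1, ..., delta_iMi]; delta_i0 = 0 is implicit.
    """
    # The numerator for category m is exp of the cumulative sum of (theta - delta_il)
    # over steps l = 1..m; the empty sum for m = 0 contributes exp(0) = 1.
    numerators = np.exp(np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas)))))
    return numerators / numerators.sum()

# Hypothetical three-category item (Mi = 2) with step difficulties -0.5 and 1.0
print(pcm_probs(0.3, [-0.5, 1.0]))  # three probabilities summing to 1
```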

For explanatory convenience and intuitive interpretation of item parameters, the PCM can be described in terms of local comparison of the response probabilities for pairs of adjacent categories in a sequence of ordered-category responses (Masters & Wright, 1997). In the local comparison, a linear predictor element of the $m$th adjacent-categories logit (Tuerlinckx & Wang, 2004), from the adjacent category score $m-1$ to $m$, on item $i$ for person $p$, $\eta_{pim}$, is expressed as follows:

$$\eta_{pim}=\ln\frac{\Pr(y_{pi}=m \mid \theta_p)}{\Pr(y_{pi}=m-1 \mid \theta_p)}=\theta_p-\delta_{im},\quad m=1,2,\ldots,M_i, \tag{2}$$

where $\theta_p \sim N(0,\sigma_\theta^2)$, $\delta_{i0}=0$, and $\eta_{pi0}=0$. From this local comparison perspective, the item parameter $\delta_{im}$ is interpreted as a "step" difficulty for scoring $m$ rather than $m-1$ on item $i$ ($i=1,2,\ldots,I$). Since a value of the category score can be regarded as a step of switching over to the next response category, the step difficulty parameter $\delta_{im}$ indicates the relative difficulty of the $m$th step within item $i$ when shifting the category score from $m-1$ to $m$ (Embretson & Reise, 2000; Masters, 1982).

Additionally, using a twofold item parameterization adopted in the ConQuest software (Wu, Adams, Wilson, & Haldane, 2007), the step difficulty parameter $\delta_{im}$ can be split into two item parameters, the item location (aka overall item difficulty) parameter $\beta_i$ and the step deviation parameter $\tau_{im}$ (i.e., $\delta_{im}=\beta_i+\tau_{im}$), as expressed below:

$$\eta_{pim}=\ln\frac{\Pr(y_{pi}=m \mid \theta_p)}{\Pr(y_{pi}=m-1 \mid \theta_p)}=\theta_p-\beta_i-\tau_{im},\quad m=1,2,\ldots,M_i, \tag{3}$$

where $\theta_p \sim N(0,\sigma_\theta^2)$, $\eta_{pi0}=0$, $\tau_{i0}=0$, and $\sum_{m=1}^{M_i}\tau_{im}=0$ so that $\frac{1}{M_i}\sum_{m=1}^{M_i}\delta_{im}=\beta_i$. Here, $\beta_i$ is an item location parameter for item $i$, and $\tau_{im}$ is a step deviation parameter for the $m$th step within item $i$. Due to the model constraint that $\frac{1}{M_i}\sum_{m=1}^{M_i}\delta_{im}=\beta_i$, the item location parameter is interpreted as the overall item difficulty for each polytomous item. The step deviation parameter $\tau_{im}$ differs from the step difficulty parameter $\delta_{im}$ in that the step deviation is a deviation from the item location to the step difficulty ($\tau_{im}=\delta_{im}-\beta_i$). Since the step deviations are constrained to sum to zero within each item ($\sum_{m=1}^{M_i}\tau_{im}=0$) for model identification, $M_i-1$ step deviation parameters are estimated for each item.
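As a quick numerical illustration of the twofold parameterization (our own sketch with hypothetical step difficulties), the item location is the mean of an item's step difficulties and the step deviations are the centered residuals:

```python
import numpy as np

deltas = np.array([-0.8, 0.2, 1.5])      # hypothetical step difficulties (Mi = 3)
beta = deltas.mean()                     # item location: beta_i = mean of delta_im
taus = deltas - beta                     # step deviations: tau_im = delta_im - beta_i
assert np.isclose(taus.sum(), 0.0)       # deviations sum to zero within the item
assert np.allclose(beta + taus, deltas)  # delta_im = beta_i + tau_im
print(beta, taus)
```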

Item Explanatory Extensions With Item Properties

In the adjacent-categories logit-based item response models under the MGLMM framework, there are three types of predictors that can be included in a design matrix for the item side of the response data (Rijmen, Tuerlinckx, De Boeck, & Kuppens, 2003; Tuerlinckx & Wang, 2004): (a) item predictors, (b) step predictors,2 and (c) item-by-step predictors. For an item predictor, the elements of the corresponding column of the design matrix vary across items but are fixed as category scores for the steps within each item, for all persons. For a step predictor, the elements of the corresponding column vary across steps but are constant across items, for all persons. For an item-by-step predictor, the elements of the corresponding column vary across all items and steps but are constant across persons. Simply put, predictors in the design matrix can be indicators or explanatory properties. For the PCM with the original item parameterization, item-by-step predictors accounting for the step difficulties are indicators for each step of each item. For the PCM with the twofold item parameterization, item predictors accounting for the item locations are indicators for each item, and item-by-step predictors accounting for the step deviations are indicators for each step of each item. Thus, the PCM is a traditional measurement approach that describes individual differences in items' difficulties and/or persons' abilities.

For item explanatory extensions of the PCM using the LLTM approach, however, predictors in the design matrix are item properties that have explanatory value for polytomous item characteristics beyond descriptive value. If the observed item properties are hypothesized to determine the item main effects and/or have explanatory value for unique characteristics inherent within items regardless of steps (e.g., item content features, reading passage length), they are incorporated as item predictors into the design matrix. If the observed item properties are hypothesized to account for the step main effects and/or have explanatory value for unique characteristics inherent within steps for all items (e.g., learning progression levels, scoring rubrics), they are incorporated as step predictors into the design matrix. If the observed item properties are assumed to account for the item-by-step interaction effects and/or have explanatory value for unique characteristics inherent within steps of items (e.g., cognitive operations, item exposure time), they are incorporated as item-by-step predictors into the design matrix.

However, the item properties that would be incorporated as step predictors are usually predetermined by test/item developers before item design. For example, the progression levels hypothesized in a learning progression framework are difficult to explain with external explanatory properties, because they are just indicators for the step main effects. Under the MGLMM framework, in particular, the number of adjacent-categories logits is determined by the number of response categories, so the step main effects are unlikely to be explained by item properties unless the number of response categories is reduced. Because of these concerns, in this research we set aside the case in which polytomous item characteristics are accounted for by item properties incorporated as step predictors.
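To make the three predictor types concrete, the following sketch (our own construction, written in per-step rows rather than the accumulated rows that appear later in Equations 6 and 9) shows how the corresponding design-matrix columns differ for a hypothetical two-item, two-step test:

```python
import numpy as np

# Rows: item 1 step 1, item 1 step 2, item 2 step 1, item 2 step 2.
item_prop    = np.array([1, 1, 0, 0])  # item predictor: varies by item, constant over steps
step2        = np.array([0, 1, 0, 1])  # step predictor: varies by step, constant over items
item_by_step = item_prop * step2       # item-by-step predictor: varies over both
print(np.column_stack([item_prop, step2, item_by_step]))
```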

Polytomous Item Explanatory Item Response Theory Models

As discussed in the introduction, item parameterization and reparameterization incorporating the observed item properties for polytomous item explanatory IRT models are complicated and difficult. To avoid such complications and difficulties, it is helpful to clarify the target difficulty parameters of polytomous items that are explained by item properties, as well as the types of predictors by which the observed item properties are incorporated into the design matrix. Building on the PCM, there are two cases of polytomous item explanatory models: (a) the item location (overall item difficulty) parameters $\beta_i$ can be explained by item properties incorporated as item predictors, which represent item property effects, and (b) the step difficulty parameters $\delta_{im}$ can be explained by item properties incorporated as item-by-step predictors, which represent item-by-step property effects (or step-specific item property effects).

By default, we will consider a two-facet3 measurement situation (i.e., person and item) and the same set of item properties for each model. Item properties are regarded as subfacets within an item facet (Wang & Wilson, 2005), since they are unique characteristics inherent in a set of items.

Item Location Explanatory Many-Facet Rasch Model

First, consider the former case based on the PCM. Note that the item location parameter is interpreted as the overall item difficulty for each polytomous item. As was done in the LLTM, one may want to account for the overall item difficulties by incorporating item properties as item predictors into the design matrix. The twofold item parameterization for the PCM enables us to impose linear restrictions focused on the item location parameters and also to estimate the step deviation parameters for each item. Thus, the restricted item location parameters $\beta_i$ will now be decomposed into weighted sums of item property effect parameters $\gamma_k$ as follows:

$$\beta_i=\sum_{k=0}^{K}\gamma_k x_{ik},\quad k=0,\ldots,K, \tag{4}$$

so that

$$\eta_{pim}=\ln\frac{\Pr(y_{pi}=m \mid \theta_p)}{\Pr(y_{pi}=m-1 \mid \theta_p)}=\theta_p-\sum_{k=0}^{K}\gamma_k x_{ik}-\tau_{im}, \tag{5}$$

where $\theta_p \sim N(0,\sigma_\theta^2)$, $\eta_{pi0}=0$, $\tau_{i0}=0$, and $\sum_{m=1}^{M_i}\tau_{im}=0$ ($m=1,\ldots,M_i$). Here, $\gamma_0$ is the item intercept representing the difficulty for items with all $x_{ik}=0$ for $k>0$, $\gamma_k$ is the regression weight or the effect of item property $k$ on the overall item difficulties, $x_{i0}$ is the constant item predictor, taking a value of 1 for all items, $x_{ik}$ is the value of item $i$ on item property $k$, and $\tau_{im}$ is a step deviation parameter for the $m$th step of item $i$. Note that values of the step deviation parameter $\tau_{im}$ will differ from those in the PCM because the constructed item locations $\beta_i$ are calculated as weighted sums of the item property effects rather than directly estimated. In this model, the constructed step difficulties $\delta_{im}$ can be calculated using $\delta_{im}=\beta_i+\tau_{im}$, where $\beta_i$ is the constructed item location for item $i$ from Equation 4.

This polytomous item explanatory model is a variation of the many-facet Rasch model (MFRM; Linacre, 1989). By using the twofold item parameterization, the item location parameters are decomposed into a linear combination of the effects of subfacets (i.e., item properties) within the item facet. The step deviation parameters are additionally estimated for each item, because each item can have a different number of steps across ordered response categories in its scale structure. Building on the original idea of the MFRM approach, this item explanatory extension of the PCM will be called the "item location explanatory MFRM." Although this model is a variation of the polytomous MFRM, the model specification in Equation 5 uses a more general regression notation than the factorial (ANOVA-style) notation commonly used in the MFRM. While the original MFRM approach uses only categorical predictors to represent the effect of each facet (i.e., facet descriptive), this model can incorporate both categorical and continuous explanatory predictors inherent in the item facet to examine the effects of item properties (i.e., item explanatory). For instance, this model is useful when one wants to see the effects of both passage length (continuous) and content (categorical) on the overall item difficulties in a reading comprehension test. Moreover, whereas the step deviation parameters can be estimated in various ways according to the scale structure in the MFRM (see Eckes, 2009), they are estimated for each item in this model.

To give a simple illustration of this item location explanatory MFRM under the MGLMM, suppose we have two polytomous items with three response categories each ($M_1=M_2=2$) and one categorical item property with X and Y item formats (the item property could instead be continuous). Also, suppose that in the observed data Item 1 has the X format and Item 2 has the Y format. After dummy coding the item property (Y as the reference), an intercept ($\gamma_0$) and the effect of item format X ($\gamma_X$) are estimated to account for the overall item difficulties. In addition, two step deviation parameters ($\tau_{11}$, $\tau_{21}$) are estimated for the two items.

First, consider the linear predictor elements of the adjacent-categories logits for Item 1:

  • when $m=0$: $\eta_{p10}=0$,

  • when $m=1$: $\eta_{p10}+\eta_{p11}=0+(\theta_p-\gamma_0-\gamma_X-\tau_{11})=\theta_p-\gamma_0-\gamma_X-\tau_{11}$, and

  • when $m=2$: $\eta_{p10}+\eta_{p11}+\eta_{p12}=0+(\theta_p-\gamma_0-\gamma_X-\tau_{11})+(\theta_p-\gamma_0-\gamma_X-\tau_{12})=2\theta_p-2\gamma_0-2\gamma_X$, because $\tau_{11}+\tau_{12}=0$.

Second, consider the linear predictor elements of the adjacent-categories logits for Item 2:

  • when $m=0$: $\eta_{p20}=0$,

  • when $m=1$: $\eta_{p20}+\eta_{p21}=0+(\theta_p-\gamma_0-\tau_{21})=\theta_p-\gamma_0-\tau_{21}$, and

  • when $m=2$: $\eta_{p20}+\eta_{p21}+\eta_{p22}=0+(\theta_p-\gamma_0-\tau_{21})+(\theta_p-\gamma_0-\tau_{22})=2\theta_p-2\gamma_0$, because $\tau_{21}+\tau_{22}=0$.

Consequently, the linear predictor vector for person p of this MFRM variation corresponding to Equation 5 can be written as

$$\begin{bmatrix}\eta_{p10}\\ \eta_{p10}+\eta_{p11}\\ \eta_{p10}+\eta_{p11}+\eta_{p12}\\ \eta_{p20}\\ \eta_{p20}+\eta_{p21}\\ \eta_{p20}+\eta_{p21}+\eta_{p22}\end{bmatrix}=\theta_p\begin{bmatrix}0\\1\\2\\0\\1\\2\end{bmatrix}-\begin{bmatrix}0&0\\1&1\\2&2\\0&0\\1&0\\2&0\end{bmatrix}\begin{bmatrix}\gamma_0\\ \gamma_X\end{bmatrix}-\begin{bmatrix}0&0\\1&0\\0&0\\0&0\\0&1\\0&0\end{bmatrix}\begin{bmatrix}\tau_{11}\\ \tau_{21}\end{bmatrix}. \tag{6}$$
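Equation 6 can be checked numerically. The sketch below (with hypothetical parameter values of our own choosing) builds the accumulated design matrices and verifies the $m=2$ row for Item 1 derived above:

```python
import numpy as np

theta, gamma0, gammaX = 0.5, 0.2, -0.4      # hypothetical values
tau11, tau21 = 0.3, -0.1

score = np.array([0, 1, 2, 0, 1, 2])        # accumulated category scores
X_item = np.array([[0, 0], [1, 1], [2, 2],  # accumulated item predictors for the
                   [0, 0], [1, 0], [2, 0]]) # intercept and format X (Item 2 is Y)
X_step = np.array([[0, 0], [1, 0], [0, 0],  # accumulated step-deviation indicators;
                   [0, 0], [0, 1], [0, 0]]) # tau_i2 = -tau_i1 cancels at m = 2
eta = theta * score - X_item @ np.array([gamma0, gammaX]) - X_step @ np.array([tau11, tau21])
assert np.isclose(eta[2], 2*theta - 2*gamma0 - 2*gammaX)  # row for Item 1, m = 2
print(eta)
```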

Step Difficulty Explanatory Linear Partial Credit Model

The second case is an item explanatory extension of the PCM, in which the step difficulties are determined by item-by-step property effects. Compared with the previous approach, the effects of item properties are step-specific, that is, their estimates can vary across steps. For model specification, we can impose linear restrictions on the step difficulty parameters by incorporating item properties as item-by-step predictors into the design matrix. Thus, the restricted step difficulty parameters $\delta_{im}$ will now be decomposed into weighted sums of item-by-step property effect parameters $\omega_{km}$ as follows:

$$\delta_{im}=\sum_{k=0}^{K}\omega_{km}x_{ik},\quad k=0,\ldots,K, \tag{7}$$

so that

$$\eta_{pim}=\ln\frac{\Pr(y_{pi}=m \mid \theta_p)}{\Pr(y_{pi}=m-1 \mid \theta_p)}=\theta_p-\sum_{k=0}^{K}\omega_{km}x_{ik}, \tag{8}$$

where $\theta_p \sim N(0,\sigma_\theta^2)$, $\eta_{pi0}=0$, $\omega_{k0}=0$, and $m=1,\ldots,M_i$. Here, $\omega_{0m}$ is the step intercept representing the $m$th-step difficulty for items with all $x_{ik}=0$ for $k>0$, $\omega_{km}$ is the regression weight or the effect of item property $k$ on the $m$th-step difficulties, $x_{i0}$ is the constant item predictor, taking a value of 1 for all items, and $x_{ik}$ is the value of item $i$ on item property $k$. Note that the item property values $x_{ik}$ in the linear predictor elements are expanded over the adjacent-categories logits for each item (Zheng & Rabe-Hesketh, 2007), so that they can be incorporated as item-by-step predictors in the design matrix (see the simple illustration below). According to the configuration of item properties for each item, the constructed step difficulties $\delta_{im}$ can be calculated as weighted sums of the estimated step-specific item property effects from Equation 7.

This polytomous item explanatory model is a variation of Fischer and Ponocny's (1994) linear partial credit model (LPCM). Although the original LPCM employed less interpretable item parameters such as a normalization constant and basic parameters, it imposed linear restrictions on the step difficulty parameters, and the basic parameters were estimated from the values of item-by-step predictors. Building on the original idea of the LPCM, this item explanatory extension of the PCM will be called the "step difficulty explanatory LPCM." Compared with the original parameterization of the LPCM, however, the model specification in Equation 8 employs the item-by-step property effect parameters $\omega_{km}$, which are more interpretable. This model allows the item property effects to be estimated step-specifically, so that we can interpret them as the effects of item properties on the step difficulties for each step. Tuerlinckx and Wang (2004) introduced this step-specific explanatory approach to the PCM, and they also added person predictors. For the item explanatory part of the model, they specified item regression components for each step using the same set of item properties. Building on their work, we combine the item regression components with the linear predictor elements in one formula in Equation 8 under the MGLMM framework, which makes it more tractable to incorporate item properties into the design matrix.

For a simple illustration of this step difficulty explanatory LPCM under the MGLMM, we use the previous example for the MFRM variation, but now consider decomposition of the step difficulty parameters rather than decomposition of the item location parameters. After dummy coding the categorical item property (Y as the reference), an intercept ($\omega_{01}$) and the effect of item format X ($\omega_{X1}$) are estimated to account for the step difficulties of the first step, and a corresponding pair of parameters ($\omega_{02}$, $\omega_{X2}$) is estimated to explain the step difficulties of the second step.

First, consider the linear predictor elements of the adjacent-categories logits for Item 1:

  • when $m=0$: $\eta_{p10}=0$,

  • when $m=1$: $\eta_{p10}+\eta_{p11}=0+(\theta_p-\omega_{01}-\omega_{X1})=\theta_p-\omega_{01}-\omega_{X1}$, and

  • when $m=2$: $\eta_{p10}+\eta_{p11}+\eta_{p12}=0+(\theta_p-\omega_{01}-\omega_{X1})+(\theta_p-\omega_{02}-\omega_{X2})=2\theta_p-\omega_{01}-\omega_{02}-\omega_{X1}-\omega_{X2}$.

Second, consider the linear predictor elements of the adjacent-categories logits for Item 2:

  • when $m=0$: $\eta_{p20}=0$,

  • when $m=1$: $\eta_{p20}+\eta_{p21}=0+(\theta_p-\omega_{01})=\theta_p-\omega_{01}$, and

  • when $m=2$: $\eta_{p20}+\eta_{p21}+\eta_{p22}=0+(\theta_p-\omega_{01})+(\theta_p-\omega_{02})=2\theta_p-\omega_{01}-\omega_{02}$.

Accordingly, the linear predictor vector for person p of this LPCM variation corresponding to Equation 8 can be expressed as

$$\begin{bmatrix}\eta_{p10}\\ \eta_{p10}+\eta_{p11}\\ \eta_{p10}+\eta_{p11}+\eta_{p12}\\ \eta_{p20}\\ \eta_{p20}+\eta_{p21}\\ \eta_{p20}+\eta_{p21}+\eta_{p22}\end{bmatrix}=\theta_p\begin{bmatrix}0\\1\\2\\0\\1\\2\end{bmatrix}-\begin{bmatrix}0&0&0&0\\1&0&1&0\\1&1&1&1\\0&0&0&0\\1&0&0&0\\1&1&0&0\end{bmatrix}\begin{bmatrix}\omega_{01}\\ \omega_{02}\\ \omega_{X1}\\ \omega_{X2}\end{bmatrix}. \tag{9}$$
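Equation 9 admits the same kind of numerical check as Equation 6; the parameter values are again hypothetical:

```python
import numpy as np

theta = 0.5
omega = np.array([0.2, 0.6, -0.4, 0.1])  # [omega_01, omega_02, omega_X1, omega_X2]

score = np.array([0, 1, 2, 0, 1, 2])
X = np.array([[0, 0, 0, 0],   # Item 1 (format X), m = 0
              [1, 0, 1, 0],   # Item 1, m = 1: omega_01 + omega_X1
              [1, 1, 1, 1],   # Item 1, m = 2: all four effects accumulate
              [0, 0, 0, 0],   # Item 2 (format Y), m = 0
              [1, 0, 0, 0],   # Item 2, m = 1: omega_01 only
              [1, 1, 0, 0]])  # Item 2, m = 2: omega_01 + omega_02
eta = theta * score - X @ omega
assert np.isclose(eta[2], 2*theta - omega.sum())  # row for Item 1, m = 2
print(eta)
```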

Data Analysis

Two empirical studies were conducted to show how the proposed polytomous item explanatory models under the MGLMM framework, that is, the item location explanatory MFRM and the step difficulty explanatory LPCM (hereafter referred to as MFRM and LPCM, respectively), work for practical item response data sets. These two item explanatory approaches are conceptually and functionally different in terms of the target difficulty parameters of polytomous items to be explained and the types of predictors for the incorporated item properties as well as the types of item property effects. To see how the two approaches are different in practice, we fit both models to the Carbon Cycle assessment data and to the Verbal Aggression data, respectively. In addition, since the PCM is a saturated model for polytomous Rasch models, it was fitted to each empirical data set to see how the two polytomous item explanatory models perform compared with the saturated model. To fit all the adjacent-categories logit-based item response models including the proposed models and the PCM under the MGLMM framework, we used the gllamm command in Stata, which is implemented with MML estimation (Rabe-Hesketh, Skrondal, & Pickles, 2004).

Regarding the goodness of fit of the models, the likelihood ratio (LR) test was conducted to compare the nested models. The two polytomous item explanatory models are nested within the PCM, and the LPCM is nested within the MFRM due to the restrictions on the item parameters. Three other goodness-of-fit indices were also reported for each model: the deviance (D), the Akaike information criterion (AIC; Akaike, 1974), and the Bayesian information criterion (BIC; Schwarz, 1978). In addition to the goodness-of-fit comparison, a graphical comparison and a correlation analysis were conducted to assess agreement between the estimated and calculated step difficulties across the fitted models (see Fischer, 1973; Poinstingl, 2009). In particular, the step difficulties estimated in the PCM were compared with the constructed step difficulties, calculated as weighted sums of the estimated item property effects from the two polytomous item explanatory models. Last, in each empirical study, the item property effects found in the MFRM and the LPCM were examined to see how the observed item properties explain and predict the difficulties of polytomous items—the overall item difficulties or the step difficulties—differently for the two models.
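For reference, the sketch below (ours, not the authors' estimation code) shows how the LR statistic and the information criteria follow from reported deviances. Note that software packages differ in the sample size used in the BIC penalty, so the n argument is deliberately left to the analyst:

```python
from math import log
from scipy.stats import chi2

def lr_test(dev_restricted, dev_general, df):
    """LR test of a restricted model against the more general model it is nested in."""
    stat = dev_restricted - dev_general
    return stat, chi2.sf(stat, df)

def aic(deviance, q):
    return deviance + 2 * q

def bic(deviance, q, n):
    # Packages differ in the n used here (e.g., persons vs. rows of the expanded
    # data), so a reported BIC may not be reproducible from deviance and q alone.
    return deviance + q * log(n)

# Example with the Study 1 values reported below: MFRM against the PCM
stat, p = lr_test(12990.74, 12223.62, df=27 - 21)
print(round(stat, 2), p < 0.001)  # 767.12 True
```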

Study 1: Application to the Carbon Cycle Assessment Data

Data and Item Properties

A subset of the Carbon Cycle assessment data, specifically pretest responses of the Math and Science Partnership Carbon student assessment data collected during the 2010-2011 academic year, was used in the first empirical study. Participants were middle and high school students from urban, suburban, and rural areas in five states. The Carbon Cycle assessment and items were developed based on a learning progression framework for carbon cycling in socioecological systems in science education (Jin et al., 2015; Mohan, Chen, & Anderson, 2009), by conducting an iterative process of designing, analyzing, and modifying assessments with interviews for secondary school students. The learning progression for carbon cycling in socioecological systems has four ordered levels of achievement corresponding to students’ progress toward more sophisticated reasoning about biogeochemical processes (Mohan et al., 2009), as shown in Table 1. Based on the learning progression framework, items were developed to ask students to answer forced choice questions and explain their choices, and their scoring rubrics were developed to score item responses into the four ordered achievement levels. If students’ answers are not related to the question, or are illegible or nonsensical, they are treated as missing.

Table 1.

The Learning Progression Framework for the Carbon Cycle Assessment.

Level 4: Linking processes with matter and energy as constraints
• Identify the cell as the basic unit and use atomic–molecular ideas and chemical models consistently to explain macroscopic processes by linking carbon-transforming processes with matter and energy as constraints
• Trace matter through hierarchically organized systems to explain changes at different scales systematically

Level 3: Changes of molecules and energy forms with unsuccessful constraints
• Recognize the transformation of matter
• Explain macroscopic processes in terms of changes of molecules and energy forms or chemical changes
• Explanations are restricted due to a lack of understanding of chemical substances

Level 2: Force dynamic accounts with hidden mechanisms
• Explain macroscopic processes in terms of materials changed by hidden mechanisms or unobservable actors
• Focus on actors and results rather than changes

Level 1: Macroscopic force dynamic accounts
• Make explanations limited to macroscopic processes about organisms and objects in terms of the action-result chain
• Use everyday language rather than technical vocabulary or scientific representations

The 13 Carbon Cycle items consist of a combination of three categorical item properties: (a) the Process property has three types of biogeochemical processes in carbon cycling that transform carbon in socioecological systems at multiple scales—cellular respiration (CR), photosynthesis (PS), and digestion/biosynthesis (DB); (b) the Progress property has four learning progress variables related to the carbon cycling process—large-scale systems (LS), micro-scale systems (MS), energy (EN), and mass (MA); and (c) the Format property has two item response formats—multiple choice with explanation (MC) and yes/no choice with explanation (YN). For example, the first item (BODYTEMP) can be identified as a combination of a CR predictor in the Process property, an EN predictor in the Progress property, and an MC predictor in the Format property. The text of the first item is presented in Figure 1.

Figure 1. The text of Carbon Cycle Item 1 (BODYTEMP).

Through item analyses of the data and consideration of the item design, the items can be classified by the predictors of each item property as in Table 2. These predictors of the item properties function as the weights of the elementary components, which are gathered into a Q matrix in the LLTM approach (Kubinger, 2009; Poinstingl, 2009). The three categorical item properties were dummy coded for incorporation in the polytomous item explanatory models; the DB in the Process property, the MA in the Progress property, and the YN in the Format property served as the reference for each item property.
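As an illustration of this dummy coding, here is a small pandas sketch (ours; only the first five items of Table 2 are shown for brevity):

```python
import pandas as pd

props = pd.DataFrame({"Process":  ["CR", "CR", "PS", "DB", "CR"],
                      "Progress": ["EN", "LS", "MS", "MS", "MA"],
                      "Format":   ["MC", "YN", "YN", "YN", "MC"]})
# Drop the reference categories DB, MA, and YN after one-hot coding.
X = pd.get_dummies(props).drop(columns=["Process_DB", "Progress_MA", "Format_YN"])
print(X.astype(int))  # the Q-matrix columns entering the item explanatory models
```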

Table 2.

The Carbon Cycle Items Composed of Three Item Properties.

Process: cellular respiration (CR), photosynthesis (PS), digestion/biosynthesis (DB). Progress: large scale (LS), micro scale (MS), energy (EN), mass (MA). Format: multiple choice (MC), yes/no choice (YN).

| No. | Name | CR | PS | DB | LS | MS | EN | MA | MC | YN |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BODYTEMP | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | CARBCYC2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | CARBPLNT | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | CONTCARB | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 5 | CRKTGRWTH | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 6 | EATAPPLE | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 7 | INFANT | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 8 | LIGHTEN | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 9 | LIGHTEN2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 10 | MAPLEMASS | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 11 | PLNTGSENS | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 12 | PLNTGRWTH | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 13 | THINGTREE | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |

In the initial data, several items had too few or zero responses for the highest level, which would result in poor or infeasible estimation of the third-step difficulty parameters. To avoid this sparseness problem, item responses in the four-level categories (1, 2, 3, 4) were recoded into three category scores (0, 1, 2, 2). We can then interpret the second-step difficulties as the relative difficulties as one goes from Level 2 to the combined Levels 3 and 4. Additionally, cases with fewer than three valid item responses were dropped from the analysis to enhance the quality of estimation. In total, 1,157 students' responses on the 13 polytomous items were included in the analysis.
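A minimal pandas sketch of this recoding and case-filtering step (using a fabricated toy data set, not the actual Carbon Cycle responses) might look as follows:

```python
import numpy as np
import pandas as pd

# Fabricated responses for three hypothetical items; NaN marks missing answers.
df = pd.DataFrame({"item1": [1, 4, np.nan],
                   "item2": [2, 3, np.nan],
                   "item3": [1, np.nan, 2]})
df = df.replace({1: 0, 2: 1, 3: 2, 4: 2})  # collapse Levels 3 and 4 into score 2
df = df[df.notna().sum(axis=1) >= 3]       # keep cases with >= 3 valid responses
print(df)
```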

Empirical Results

Following the data analysis procedure, the PCM, the MFRM, and the LPCM were fit to the Carbon Cycle assessment data. Table 3 shows the results for the fitted models, allowing the performance of the two proposed polytomous item explanatory models to be compared with that of the saturated model, the PCM.

Table 3.

Performance of the Many-Facet Rasch Model and the Linear Partial Credit Model on the Carbon Cycle Assessment Data.

| | Partial credit model | Many-facet Rasch model | Linear partial credit model |
|---|---|---|---|
| Variance σθ² | 1.17 | 0.81 | 0.78 |
| Deviance (D) | 12223.62 | 12990.74 | 13195.59 |
| AIC | 12277.62 | 13032.74 | 13225.59 |
| BIC | 12492.86 | 13200.15 | 13345.17 |
| q | 27 | 21 | 15 |

Note. AIC = Akaike information criterion; BIC = Bayesian information criterion; q = number of estimated parameters.

The estimated person variance (σθ²) was 0.81 for the MFRM and 0.78 for the LPCM, both smaller than the 1.17 in the PCM. This indicates shrinkage of the estimated person variance in the polytomous item explanatory models, as in the LLTM. It can be understood as a scaling effect that comes from the imperfect explanation of the item parameters when linear restrictions based on the item properties are imposed (De Boeck & Wilson, 2004). The shrinkage was slightly greater for the LPCM than for the MFRM, but the effect was very similar in the two models.

In terms of goodness of fit, both the MFRM and the LPCM fit worse than the PCM. The LR test against the PCM was significant for both the MFRM, χ²(6) = 767.12, p < .001, and the LPCM, χ²(12) = 971.97, p < .001. This result was confirmed in that all goodness-of-fit indices of the two models were greater than those of the PCM. This is as expected, because item explanatory models commonly fit worse than the saturated model (Kubinger, 2009; Tuerlinckx & Wang, 2004). The methodological advantage of a smaller number of item parameters in the LLTM approach comes at the cost of statistically lower goodness of fit.

Comparing the two item explanatory models, both the AIC and the BIC indicate that the MFRM showed better goodness of fit than the LPCM for the Carbon Cycle data. The LR test comparing the LPCM with the MFRM was also significant, χ²(6) = 204.85, p < .001. This result makes sense because the step deviation parameters were freely estimated for each item in the MFRM, so it had a larger number of estimated parameters than the LPCM. Nonetheless, this statistical result does not mean that the LPCM is methodologically or practically inferior to the MFRM. The two are conceptually and functionally different models: the target difficulty parameters of polytomous items that are explained, the types of predictors for the incorporated item properties, and the types of item property effects all differ.

In addition to the goodness-of-fit comparison, a graphical comparison is useful for assessing the performance of the two polytomous item explanatory models, by examining agreement between the estimated and calculated step difficulties across the fitted models. The step difficulties δim for each item were directly estimated in the PCM, and the constructed step difficulties were calculated from the estimated item property effects in each item explanatory model. The PCM was used as the reference model for the graphical comparisons. Figure 2 shows that the MFRM had agreement similar to the LPCM, although a few more of its step difficulty points were close to the 45° line, which represents perfect alignment. However, notable disagreement remained for both models. In both models, the first-step difficulty of Item 13 was located farthest from the 45° line.

Figure 2. Graphical comparison of step difficulties between the models fitted to the Carbon Cycle assessment data.

Note. PCM = partial credit model; MFRM = many-facet Rasch model; LPCM = linear partial credit model.

The correlations show the same pattern as the graphical comparison. Correlations with the PCM were similar for the two item explanatory models (ρ = 0.913 for the MFRM and 0.905 for the LPCM). The correlation of the calculated step difficulties between the two item explanatory models (ρ = 0.970) was higher than their correlations with the estimated step difficulties in the PCM, but agreement was not complete. This implies that the MFRM and the LPCM predict the step difficulties in similar but not identical ways. Despite the goodness-of-fit gap between the two item explanatory models, these results show comparable performance in reconstructing the step difficulties from the item property effects.
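A sketch of this kind of graphical and correlational comparison, using fabricated step difficulties rather than the actual estimates, is given below:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
pcm_steps = rng.normal(0, 1.5, 26)               # fabricated estimates (13 items x 2 steps)
mfrm_steps = pcm_steps + rng.normal(0, 0.3, 26)  # fabricated constructed values
lpcm_steps = pcm_steps + rng.normal(0, 0.3, 26)

fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
for ax, steps, label in zip(axes, [mfrm_steps, lpcm_steps], ["MFRM", "LPCM"]):
    ax.scatter(pcm_steps, steps)
    lims = [pcm_steps.min(), pcm_steps.max()]
    ax.plot(lims, lims)                          # 45-degree line: perfect agreement
    ax.set_xlabel("PCM step difficulty")
    ax.set_title(label)
axes[0].set_ylabel("Constructed step difficulty")
print(np.corrcoef(pcm_steps, mfrm_steps)[0, 1])  # agreement as a correlation
plt.show()
```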

To see the practical difference between the two polytomous item explanatory models, the effects of the three item properties on the overall item difficulties or the step difficulties were reported for each model. Table 4 shows the item property effects in the item location explanatory MFRM. Recall that there are three categorical item properties (Process, Progress, and Format) in the design of the Carbon Cycle assessment. The three item properties, incorporated as item predictors, were used to explain and predict the overall item difficulties (item locations). Examining the item property effect parameters γ, the contributions of all the item properties were statistically significant at the 5% significance level (the step deviation parameters τ are not of interest for interpretation).

Table 4.

Item Property Effects on the Carbon Cycle Items in the Many-Facet Rasch Model.

| Predictor | Item parameter | Estimate | Standard error | p |
|---|---|---|---|---|
| Intercept | γ0 | 0.59 | 0.07 | <.001 |
| Process (cellular respiration) | γCR | 0.61 | 0.07 | <.001 |
| Process (photosynthesis) | γPS | −0.22 | 0.06 | <.001 |
| Progress (large-scale systems) | γLS | −1.93 | 0.21 | <.001 |
| Progress (micro-scale systems) | γMS | −0.64 | 0.06 | <.001 |
| Progress (energy) | γEN | −0.71 | 0.07 | <.001 |
| Format (multiple choice) | γMC | 0.26 | 0.08 | <.001 |
| Step deviation (Item 1) | τ11 | −0.77 | 0.10 | <.001 |
| Step deviation (Item 2) | τ21 | −3.33 | 0.20 | <.001 |
| Step deviation (Item 3) | τ31 | −1.58 | 0.07 | <.001 |
| Step deviation (Item 4) | τ41 | −1.60 | 0.10 | <.001 |
| Step deviation (Item 5) | τ51 | −1.75 | 0.11 | <.001 |
| Step deviation (Item 6) | τ61 | −1.46 | 0.10 | <.001 |
| Step deviation (Item 7) | τ71 | −1.29 | 0.10 | <.001 |
| Step deviation (Item 8) | τ81 | −0.68 | 0.10 | <.001 |
| Step deviation (Item 9) | τ91 | −0.32 | 0.09 | .001 |
| Step deviation (Item 10) | τ101 | −0.07 | 0.10 | .474 |
| Step deviation (Item 11) | τ111 | −0.98 | 0.10 | <.001 |
| Step deviation (Item 12) | τ121 | −0.21 | 0.09 | .023 |
| Step deviation (Item 13) | τ131 | −0.81 | 0.09 | <.001 |

For the Process property, holding other properties constant, cellular respiration (γCR = 0.61) made items more difficult overall, and photosynthesis (γPS = −0.22) made them less difficult, than digestion/biosynthesis. For the Progress property, keeping other properties constant, large-scale systems (γLS = −1.93), energy (γEN = −0.71), and micro-scale systems (γMS = −0.64) all made items less difficult overall than mass. For the Format property, holding other properties constant, items in the multiple choice with explanation format (γMC = 0.26) had greater overall item difficulties than items in the yes/no choice with explanation format.

In addition, 13 step deviation parameters were estimated, one for each item. Accordingly, the constructed step difficulties for each observed combination item were calculated using the estimated item property effects and step deviation parameters. For example, the second-step difficulty of Item 1 (BODYTEMP), δ12, was calculated as 0.59 (γ0) + 0.61 (γCR) − 0.71 (γEN) + 0.26 (γMC) + 0.77 (τ12 = −τ11) = 1.52.
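This calculation can be reproduced directly from the Table 4 estimates, as in the short sketch below:

```python
gamma = {"0": 0.59, "CR": 0.61, "EN": -0.71, "MC": 0.26}  # Table 4 estimates for BODYTEMP
tau_11 = -0.77
beta_1 = sum(gamma.values())   # constructed item location for the (CR, EN, MC) item
delta_12 = beta_1 - tau_11     # tau_12 = -tau_11, since step deviations sum to zero
print(round(delta_12, 2))      # 1.52
```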

Table 5 shows the item property effects in the step difficulty explanatory LPCM. The three item properties, incorporated as item-by-step predictors, were used to explain and predict the step difficulties for each step. Based on the estimated item-by-step property effect parameters ω, we can see how the three item properties affect the step difficulties step-specifically in the Carbon Cycle assessment. Recall that the first-step difficulties are interpreted as the relative difficulties as one goes from Level 1 to Level 2, and the second-step difficulties represent the relative difficulties as one goes from Level 2 to the combined Levels 3 and 4. Except for the two predictors MC at the first step (ωMC1 = 0.10) and MS at the second step (ωMS2 = −0.01), all the predictors were statistically significant.

Table 5.

Item Property Effects on the Carbon Cycle Items in the Linear Partial Credit Model.

| Predictor | Item parameter | Estimate | Standard error | p |
|---|---|---|---|---|
| Intercept | ω01 | −0.55 | 0.09 | <.001 |
| | ω02 | 1.59 | 0.10 | <.001 |
| Process (cellular respiration) | ωCR1 | 0.66 | 0.11 | <.001 |
| | ωCR2 | 0.45 | 0.12 | <.001 |
| Process (photosynthesis) | ωPS1 | 0.34 | 0.09 | <.001 |
| | ωPS2 | −0.60 | 0.09 | <.001 |
| Progress (large-scale systems) | ωLS1 | −4.16 | 0.36 | <.001 |
| | ωLS2 | 0.55 | 0.23 | .016 |
| Progress (micro-scale systems) | ωMS1 | −1.12 | 0.08 | <.001 |
| | ωMS2 | −0.01 | 0.09 | .885 |
| Progress (energy) | ωEN1 | −0.36 | 0.09 | <.001 |
| | ωEN2 | −0.84 | 0.13 | <.001 |
| Format (multiple choice) | ωMC1 | 0.10 | 0.10 | .310 |
| | ωMC2 | 0.26 | 0.14 | .057 |

With regard to the effects of the three item properties on the step difficulties of the first step, keeping other properties constant, the cellular respiration process (ωCR1 = 0.66) in the Process property and the mass progress variable in the Progress property made the first step more difficult than the other categories within each item property, as a student moves from Level 1 to Level 2. In contrast, the digestion/biosynthesis process in the Process property and the large-scale systems progress variable (ωLS1 = −4.16) in the Progress property made the first-step difficulties lower than the other categories within each item property.

For the step difficulties of the second step, holding other properties constant, the cellular respiration process (ωCR2 = 0.45) in the Process property, the large-scale systems progress variable (ωLS2 = 0.55) in the Progress property, and the multiple choice with explanation format (ωMC2 = 0.26; borderline significant) in the Format property made it more difficult than the other categories within each item property for a student to achieve either of Levels 3 and 4 from Level 2 in the learning progression for carbon cycling. In contrast, the photosynthesis process (ωPS2 = −0.60) in the Process property, the energy progress variable (ωEN2 = −0.84) in the Progress property, and the yes/no choice with explanation format in the Format property made it less difficult than the other categories within each item property.

Furthermore, we can reconstruct the step difficulties for each item with an observed item property combination using weighted sums of the estimated item property effects. For example, the second-step difficulty of Item 1 (BODYTEMP), δ12, was calculated as 1.59 (ω02) + 0.45 (ωCR2) − 0.84 (ωEN2) + 0.26 (ωMC2) = 1.46. We can also predict the step difficulties for newly developed items with item property combinations unobserved in the Carbon Cycle data. For instance, the first-step difficulty for the (CR, MA, YN) combination item (which was not observed in the current data set) could be predicted as δ(CR,MA,YN)1 = −0.55 (ω01) + 0.66 (ωCR1) = 0.11.
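Both the reconstruction for an observed item and the prediction for the unobserved (CR, MA, YN) combination follow the same weighted-sum logic, as in this sketch based on the Table 5 estimates:

```python
omega = {("0", 1): -0.55, ("0", 2): 1.59,    # Table 5 estimates
         ("CR", 1): 0.66, ("CR", 2): 0.45,
         ("EN", 1): -0.36, ("EN", 2): -0.84,
         ("MC", 1): 0.10, ("MC", 2): 0.26}

# Observed item: second-step difficulty of Item 1 (BODYTEMP) = CR + EN + MC.
delta_12 = omega[("0", 2)] + omega[("CR", 2)] + omega[("EN", 2)] + omega[("MC", 2)]
print(round(delta_12, 2))      # 1.46

# Unobserved (CR, MA, YN) combination: MA and YN are reference categories,
# so only the intercept and the CR effect contribute to the first step.
delta_new_1 = omega[("0", 1)] + omega[("CR", 1)]
print(round(delta_new_1, 2))   # 0.11
```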

Study 2: Application to the Verbal Aggression Data

Data and Item Properties

For the second empirical study, the Verbal Aggression data set (Vansteelandt, 2000) was used. Participants were first-year psychology students at a Dutch-speaking Belgian university. Twenty-four items were presented in Dutch, the native language of all the participants, to ask behavioral questions about verbally aggressive reactions to frustrating situations. A total of 316 persons responded to the 24 items. The item responses were three ordered-category responses (no = 0, perhaps = 1, and yes = 2) in the order of endorsing an item, which were used as polytomous data without dichotomization in this study.

All items were designed to have a stem, which describes a frustrating situation, and a verbal aggression response part, which describes how people could respond to the situation in question. In this item design, there were three experimental design factors:

  • (a) the Behavior Mode factor has two levels of modes—wanting (Want) and doing (Do),

  • (b) the Situation Type factor has two types of situations—situations in which someone else is to blame (Other-to-blame) and situations in which oneself is to blame (Self-to-blame), and

  • (c) the Behavior Type factor has three kinds of verbal aggressive behaviors, which represent the extent to which they ascribe blame and the extent to which they express frustration—cursing (Curse), scolding (Scold), and shouting (Shout).

For the verbal aggression response, one of the two behavioral modes could be combined with one of the three verbal aggressive behaviors. For example, combinations of these two factors generate six responses such as “I would curse” (for doing and cursing) and “I would want to scold” (for wanting and scolding).

For the item stem, four frustrating situations were used:

  • Other-to-blame A: A bus fails to stop for me. (Bus)

  • Other-to-blame B: I miss a train because the clerk gave me faulty information. (Train)

  • Self-to-blame A: The grocery store closes just as I am about to enter. (Store)

  • Self-to-blame B: The operator disconnects me when I used up my last 10 cents for a call. (Call)

These four frustrating situations were nested in the second design factor: two other-to-blame situations and two self-to-blame situations. The full item contains one of the four frustrating situations for the item stem, and one of the six combinations of two behavioral modes and three behavior types for the verbal aggression response to the situation. After considering two specific situations of the same type (A and B) as replications, in total 24 items were written to fit a 2 × 2 × 3 design with two replications within each cell. For example, the first item, “A bus fails to stop for me. I would want to curse” was designed and identified as a combination of Want (Behavior Mode), Other-to-blame (Situation Type), and Curse (Behavior Type).
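The 2 × 2 × 3 design with replications can also be generated programmatically; the following sketch (ours, using the situation labels above) reproduces the 24-item layout of Table 6:

```python
from itertools import product

modes = ["Want", "Do"]
behaviors = ["Curse", "Scold", "Shout"]
situations = [("Bus", "Other-to-blame"), ("Train", "Other-to-blame"),
              ("Store", "Self-to-blame"), ("Call", "Self-to-blame")]

# Crossing mode x situation x behavior yields the 24 items in Table 6's order.
items = [(mode, sit_type, behavior, sit_name)
         for mode, (sit_name, sit_type), behavior in product(modes, situations, behaviors)]
print(len(items))  # 24
print(items[0])    # ('Want', 'Other-to-blame', 'Curse', 'Bus') -> Item 1
```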

The three design factors are regarded as categorical item properties. The Verbal Aggression items are classified by the predictors of each item property as in Table 6, which function as the weights in a Q matrix in the LLTM approach. To incorporate the three categorical item properties into the polytomous item explanatory models, they were dummy coded: the Want in the Behavior Mode property, the Self-to-blame in the Situation Type property, and the Shout in the Behavior Type property served as the reference for each item property.

Table 6.

The Verbal Aggression Items Composed of Three Item Properties.

Situation abbreviations (see above): Bus, Train (Other-to-blame); Store, Call (Self-to-blame). Want/Do = Behavior Mode; Other-to-blame/Self-to-blame = Situation Type; Curse/Scold/Shout = Behavior Type.

| No. | Situation | Want | Do | Other-to-blame | Self-to-blame | Curse | Scold | Shout |
|---|---|---|---|---|---|---|---|---|
| 1 | Bus | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | Bus | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | Bus | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | Train | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 5 | Train | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 6 | Train | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 7 | Store | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 8 | Store | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 9 | Store | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 10 | Call | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 11 | Call | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 12 | Call | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 13 | Bus | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 14 | Bus | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 15 | Bus | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 16 | Train | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 17 | Train | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 18 | Train | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 19 | Store | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 20 | Store | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 21 | Store | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 22 | Call | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 23 | Call | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 24 | Call | 0 | 1 | 0 | 1 | 0 | 0 | 1 |

Empirical Results

The PCM and the two polytomous item explanatory models were fit to the Verbal Aggression data, according to the data analysis procedure. Table 7 shows the results of the fitted models: the MFRM, the LPCM, and the PCM.

Table 7.

Performance of the Many-Facet Rasch Model and the Linear Partial Credit Model on the Verbal Aggression Data.

| | Partial credit model | Many-facet Rasch model | Linear partial credit model |
|---|---|---|---|
| Variance σθ² | 0.93 | 0.88 | 0.88 |
| Deviance (D) | 12639.47 | 12804.61 | 12846.29 |
| AIC | 12737.47 | 12864.61 | 12868.29 |
| BIC | 13131.06 | 13105.58 | 12956.64 |
| q | 49 | 30 | 11 |

Note. AIC = Akaike information criterion; BIC = Bayesian information criterion; q = number of estimated parameters.

The estimated person variance (σθ2) was 0.88 for both polytomous item explanatory models, and it was 0.93 in the PCM. Since standard item explanatory models such as the LLTM cannot explain the item parameters perfectly based on the observed item properties, the person variance should show shrinkage. Although the two models use different item explanatory approaches to polytomous items, the shrinkage effect of the person variance estimate was the same for both.

When comparing the goodness of fit, the LR test (against the PCM) was significant for the MFRM, χ²(19) = 165.142, p < .001, and the LPCM, χ²(38) = 206.818, p < .001, meaning that both the MFRM and the LPCM fit worse than the PCM, as expected. This result was confirmed by the AIC: the values for the two models were greater than that of the PCM (12737.47 for the PCM, 12864.61 for the MFRM, 12868.29 for the LPCM). This makes sense because the smaller number of estimated parameters in the item explanatory models reduced model fit (accuracy), even though it made them more parsimonious.

However, this was not supported by the BIC: the value for the PCM was greater than the others (13131.06 for the PCM, 13105.58 for the MFRM, 12956.64 for the LPCM). Although this conflicting result might seem surprising, it can be understood by considering the difference between the AIC and the BIC. They differ in theoretical motivations, objectives, assumptions, and the meaning of their penalty terms (Kuha, 2004; Wagenmakers & Farrell, 2004). Basically, the BIC penalizes complexity more heavily for the number of freely estimated parameters, whereas the AIC penalizes a model with more parameters much less than the BIC (Kuha, 2004). In other words, the BIC favors parsimonious models to a greater extent than the AIC does. In Table 7, the 48 parameters for item effects in the PCM were reduced to 29 in the MFRM and 10 in the LPCM. In terms of the BIC, we could say that the two polytomous item explanatory models performed better than the PCM in this empirical study. In addition, from the Bayesian perspective, the BIC assumes that the true data-generating model is in the set of candidate models, and it measures the degree of belief that a certain model is the true model that generated the observed data (Wagenmakers & Farrell, 2004). In this view, we could also say that the LPCM, which had the lowest BIC, was the most likely to be the true data-generating model for the Verbal Aggression data. Thus, this result implies that the three-factor design with two replications worked well for item generation and that the three item properties (design factors) had high explanatory value for the Verbal Aggression items.

Comparing the LPCM with the MFRM, the LR test was significant, χ²(19) = 41.68, p = .002, indicating that the MFRM fit better than the LPCM for the Verbal Aggression data. The lower AIC value for the MFRM supports this finding, which makes sense because the larger number of estimated parameters in the MFRM improved model fit (accuracy) relative to the LPCM. However, the LPCM had the lower BIC value, meaning that the LPCM performed better because it was much more parsimonious than the MFRM. In this empirical study, we could not conclude that one model was better than the other in terms of goodness of fit. In fact, the two polytomous item explanatory models are different methodologically as well as practically.

In addition, a graphical comparison was conducted to examine agreement between the estimated and calculated step difficulties across the fitted models. The constructed step difficulties, calculated from the estimated item property effects in each item explanatory model, were compared with the step difficulties δim directly estimated for each item in the PCM. Figure 3 shows a similar graphical agreement for the two item explanatory models. Compared with the LPCM, the MFRM had more step difficulty points close to the 45° line indicating perfect alignment, but a few points were farther from the line. The second-step difficulty of Item 21 was located farthest from the 45° line in both models.

Figure 3. Graphical comparison of step difficulties between the models fitted to the Verbal Aggression data.

Note. PCM = partial credit model; MFRM = many-facet Rasch model; LPCM = linear partial credit model.

The correlations show that the LPCM had a slightly higher correlation with the PCM than the MFRM did (ρ = 0.909 for the MFRM and 0.930 for the LPCM), which means that the LPCM predicted the PCM step difficulties slightly better than the MFRM. The correlation of the calculated step difficulties between the two item explanatory models (ρ = 0.965) was higher than their correlations with the estimated step difficulties in the PCM, but agreement was not perfect. Considering the results from the graphical comparison as well, the two item explanatory models showed comparable performance in reconstructing the step difficulties from the item property effects.

The effects of the three item properties on the overall item difficulties or the step difficulties were reported separately for each model to show the practical difference between the two polytomous item explanatory models. For the MFRM, the three categorical item properties (Behavior Mode, Situation Type, and Behavior Type) from the item design factors were incorporated as item predictors to explain and predict the overall item difficulties (item locations) of endorsing an item at higher levels. Table 8 presents the item property effects in the item location explanatory MFRM; the item property effect parameters γ are the focus of interpretation. All predictors of the item property effects, including the intercept, were statistically significant at the 5% significance level.

Table 8.

Item Property Effects on the Verbal Aggression Items in the Many-Facet Rasch Model.

| Predictor | Item parameter | Estimate | Standard error | p |
|---|---|---|---|---|
| Intercept | γ0 | 1.58 | 0.07 | <.001 |
| Behavior Mode (Do) | γDo | 0.43 | 0.04 | <.001 |
| Situation Type (Other-to-blame) | γOther | −0.81 | 0.04 | <.001 |
| Behavior Type (Curse) | γCurse | −1.28 | 0.05 | <.001 |
| Behavior Type (Scold) | γScold | −0.63 | 0.05 | <.001 |
| Step deviation (Item 1) | τ11 | −0.22 | 0.13 | .085 |
| Step deviation (Item 2) | τ21 | 0.00 | 0.13 | .995 |
| Step deviation (Item 3) | τ31 | −0.35 | 0.13 | .005 |
| Step deviation (Item 4) | τ41 | −0.47 | 0.12 | <.001 |
| Step deviation (Item 5) | τ51 | −0.11 | 0.13 | .377 |
| Step deviation (Item 6) | τ61 | −0.11 | 0.13 | .405 |
| Step deviation (Item 7) | τ71 | −0.52 | 0.12 | <.001 |
| Step deviation (Item 8) | τ81 | −0.29 | 0.13 | .025 |
| Step deviation (Item 9) | τ91 | −0.18 | 0.15 | .242 |
| Step deviation (Item 10) | τ101 | −0.62 | 0.12 | <.001 |
| Step deviation (Item 11) | τ111 | −0.26 | 0.13 | .048 |
| Step deviation (Item 12) | τ121 | −0.20 | 0.15 | .188 |
| Step deviation (Item 13) | τ131 | −0.34 | 0.12 | .006 |
| Step deviation (Item 14) | τ141 | −0.25 | 0.13 | .045 |
| Step deviation (Item 15) | τ151 | −0.03 | 0.14 | .833 |
| Step deviation (Item 16) | τ161 | −0.17 | 0.12 | .165 |
| Step deviation (Item 17) | τ171 | −0.17 | 0.13 | .173 |
| Step deviation (Item 18) | τ181 | 0.29 | 0.16 | .062 |
| Step deviation (Item 19) | τ191 | −0.47 | 0.12 | <.001 |
| Step deviation (Item 20) | τ201 | 0.00 | 0.15 | .991 |
| Step deviation (Item 21) | τ211 | 0.62 | 0.22 | .004 |
| Step deviation (Item 22) | τ221 | −0.60 | 0.12 | <.001 |
| Step deviation (Item 23) | τ231 | −0.56 | 0.13 | <.001 |
| Step deviation (Item 24) | τ241 | −0.02 | 0.17 | .911 |

For the Behavior Mode, holding other properties constant, the Do mode made the overall item difficulty of an item 0.43 logits more difficult to endorse than the Want mode (γDo = 0.43). In the Situation Type, the Other-to-blame situation made the overall item difficulty of an item 0.81 logits easier to endorse than the Self-to-blame situation (γOther = −0.81), after keeping other properties constant. For the Behavior Type, holding other properties constant, compared with Shout, Curse made the overall item difficulty of an item 1.28 logits easier to endorse (γCurse = −1.28), and Scold made the overall item difficulty of an item 0.63 logits easier to endorse (γScold = −0.63).

In addition, 24 step deviation parameters were estimated for each item. By using the estimated item property effects and step deviation parameters, the constructed step difficulties for individual items could be calculated. For example, the first-step difficulty of the first item (“A bus fails to stop for me. I would want to curse.”), δ11 was calculated as 1.58 (Intercept; γ0) + 0.00 (Want; reference) − 0.81 (Other-to-blame; γOther) − 1.28 (Curse; γCurse) − 0.22 (τ11) = −0.73.

Table 9 shows the results of the item property effects in the step difficulty explanatory LPCM. The three item properties were incorporated as item-by-step predictors to explain and predict the step difficulties at each step. The estimated item-by-step property effect parameters ω were examined to see how the three item properties affect the step difficulties in a step-specific way. For the Verbal Aggression items, the step difficulties for the first step represent the relative difficulties of endorsing an item by answering "perhaps" rather than "no," and those for the second step represent the relative difficulties of endorsing an item by answering "yes" rather than "perhaps." All predictors of the item property effects, including the step intercepts, were statistically significant at the 5% level.

Table 9.

Item Property Effects on the Verbal Aggression Items in the Linear Partial Credit Model.

Predictor                         Item parameter   Estimate   Standard error       p
Intercept                         ω01                 1.42        0.09         <.001
                                  ω02                 1.77        0.12         <.001
Behavior Mode (Do)                ωDo1                0.54        0.06         <.001
                                  ωDo2                0.35        0.07         <.001
Situation Type (Other-to-blame)   ωOther1            −0.69        0.06         <.001
                                  ωOther2            −0.97        0.08         <.001
Behavior Type (Curse)             ωCurse1            −1.68        0.08         <.001
                                  ωCurse2            −0.93        0.10         <.001
Behavior Type (Scold)             ωScold1            −0.80        0.07         <.001
                                  ωScold2            −0.50        0.10         <.001

The effects of the three item properties on the first-step difficulties can be interpreted as follows. When a person answered "perhaps" rather than "no," holding the other properties constant, the Do mode made the items 0.54 logits more difficult to endorse than the Want mode (ωDo1 = 0.54); the Other-to-blame situation made the items 0.69 logits easier to endorse than the Self-to-blame situation (ωOther1 = −0.69); and, compared with Shout, Curse made the items 1.68 logits easier to endorse (ωCurse1 = −1.68) and Scold made them 0.80 logits easier (ωScold1 = −0.80). For the second-step difficulties, when a person answered "yes" rather than "perhaps," holding the other properties constant, the Do mode made the items 0.35 logits more difficult to endorse than the Want mode (ωDo2 = 0.35); the Other-to-blame situation made the items 0.97 logits easier than the Self-to-blame situation (ωOther2 = −0.97); and, compared with Shout, Curse made the items 0.93 logits easier (ωCurse2 = −0.93) and Scold made them 0.50 logits easier (ωScold2 = −0.50). These results agree closely with those obtained by Tuerlinckx and Wang (2004), except for the step intercept values, which differ because Tuerlinckx and Wang also included person predictors in their model.

The step difficulties for individual items can likewise be reconstructed as weighted sums of the estimated item property effects. For example, the first-step difficulty of the first item (“A bus fails to stop for me. I would want to curse.”), δ11, was calculated as 1.42 (intercept; ω01) + 0.00 (Want; reference) − 0.69 (Other-to-blame; ωOther1) − 1.68 (Curse; ωCurse1) = −0.95.
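A parallel sketch for the LPCM (effect values from Table 9; the names and dummy coding are again ours) makes the contrast with the MFRM visible: every effect is step-specific, and there is no item-specific step deviation term:

```python
# LPCM reconstruction of a step difficulty: step-specific effects only
# (first-step effects from Table 9), with no step deviation parameter.
omega_step1 = {"intercept": 1.42, "do": 0.54, "other": -0.69,
               "curse": -1.68, "scold": -0.80}

# Same dummy codes for item 1: Want mode (reference), Other-to-blame, Curse.
item1 = {"do": 0, "other": 1, "curse": 1, "scold": 0}

delta_11 = (omega_step1["intercept"]
            + sum(omega_step1[p] * item1[p] for p in item1))
print(round(delta_11, 2))  # -0.95, matching the value reported above
```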

Conclusion and Discussion

We have investigated how to apply the LLTM approach to polytomous data under the MGLMM framework, given ordered-category responses and adjacent-categories logits. The two item explanatory extensions of the PCM—the item location explanatory MFRM and the step difficulty explanatory LPCM—were developed and specified with general statistical formulations. These two polytomous item explanatory IRT models were fit to the Carbon Cycle assessment data and to the Verbal Aggression data, and both demonstrated their explanatory value by showing how the observed item properties affect the overall item difficulties or the step difficulties in the empirical studies.

The first empirical study showed that the MFRM had superior goodness of fit compared with the LPCM in terms of both the AIC and the BIC, whereas in the second empirical study the two goodness-of-fit indices disagreed about the relative performance of the MFRM and the LPCM. In both studies, the two polytomous item explanatory models performed comparably in reconstructing the step difficulties from the estimated item property effects, but they differed in practice in how the item property effects on the polytomous item difficulties are interpreted. The MFRM explains and predicts the overall item difficulties (item locations) from item properties incorporated as item predictors, whereas the LPCM explains and predicts the step difficulties from item properties incorporated as item-by-step predictors. Thus, the explanatory and predictive value of polytomous item explanatory IRT models goes beyond traditional descriptive measurement, providing informative feedback for improving the quality of item development in practice.

In fact, the two polytomous item explanatory models are methodologically and practically different in terms of (a) the target difficulty parameters of polytomous items that are explained by item properties, (b) the types of predictors for the item properties incorporated into the design matrix, and (c) the types of item property effects. Before fitting a model to the data, we recommend clarifying, for the purpose of model selection, which polytomous item difficulties should be explained by item properties, and examining what kinds of predictors should be used for the item properties in light of the types of item property effects. When designing items, it is helpful to examine the underlying hypotheses for the item properties or design factors. If an item property is assumed to affect switching between levels of the construct, as in the learning progression framework (see Table 1), the LPCM is useful for testing that hypothesis. If an item property is assumed to relate to the item as a whole rather than to specific levels, the MFRM is appropriate for assessing its influence on the overall construct. One systematic way of designing items is Wilson's (2005) four-building-blocks approach to constructing measures, which draws on the principles of sound educational and psychological measurement: by iterating through the measurement processes (construct mapping, item design, outcome space, and measurement model), one can develop items while articulating the underlying hypotheses.
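To make this contrast concrete, a schematic form of the two decompositions, written in the notation of the worked examples above (with $X_{ij}$ denoting the dummy code, introduced here for compactness, of item property $j$ for item $i$, and $k$ indexing steps), is

$$\text{MFRM: } \delta_{ik} = \gamma_0 + \sum_{j} \gamma_j X_{ij} + \tau_{ik}, \qquad \text{LPCM: } \delta_{ik} = \omega_{0k} + \sum_{j} \omega_{jk} X_{ij}.$$

That is, the MFRM constrains each item property to a single effect common to all steps, absorbing step-level variation into the deviation parameters τik, whereas the LPCM estimates a separate effect of each property for every step.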

Among the EIRM approaches, we have focused on explaining only the item side of the response data, and there is great potential to extend polytomous item explanatory IRT models. For instance, although we investigated only the main effects of item properties in the empirical studies, interaction effects between the properties can be taken into account as well. Person predictors and/or person-by-item predictors could also be added, yielding person explanatory or doubly explanatory models such as latent regression and differential item functioning analyses. A future study could address a multidimensional extension in which the item property effects vary over persons, to examine interactions between item properties and individual persons; this would apply the random-weights linear logistic test model (Rijmen & De Boeck, 2002) to polytomous data. In addition, when the observed item properties cannot be assumed to explain the item parameters perfectly, item error terms can be added to the polytomous item explanatory models by allowing for random residuals, which can enhance prediction of the item effects; this extends the linear logistic test model with random item error (Janssen, Schepers, & Peres, 2004) to polytomous data (see Kim, 2018). Finally, we have considered only the two-facet situation (person and item). A future study could address other-facet situations (e.g., rater, task, and criteria) with different facet properties (see Eckes, 2009).

This research also sheds light on the methodological advantages of item explanatory modeling. In the polytomous item explanatory models we have specified, the step difficulty parameters or the item location (overall item difficulty) parameters are constrained to be a linear combination of the effects of item properties. These models therefore estimate fewer item parameters than the saturated model while still explaining and predicting the item effects, a methodological advantage that extracts essential and meaningful elementary components from the incorporated item properties. They are useful for understanding how items generated by a given item design or test construction rationale work, as well as for validating the hypothesized constructs underlying that design. The models can also help generate new items: by composing specified combinations of the item property effects, we can predict the item locations or step difficulties of newly developed items in a scientific and systematic rather than intuitive manner. It is a promising prospect that item explanatory models will be widely used with polytomous data to measure the effects of various testing conditions such as raters, cognitive operations, and item exposure time, as has been done with dichotomous data (e.g., Fischer, 1973; Kubinger, 2009; Poinstingl, 2009).

In conclusion, polytomous item explanatory IRT models can substantially strengthen the methodological foundations of educational and psychological measurement, particularly within De Boeck and Wilson's (2004) EIRM framework. Peter Drucker, the famous management thinker, is often credited with saying, "If we can't measure it, we can't manage and improve it." Likewise, if we cannot explain a measurement, we cannot know how to improve it.

Acknowledgments

The authors would like to thank Sophia Rabe-Hesketh for her careful comments on model specification and thank anonymous reviewers for their helpful comments on an earlier draft. The authors would also like to thank Charles W. Anderson and Hui Jin for leading the Carbon Cycle research project and sharing a data set.

1. Instead, a particular item parameter or the mean of item parameters can be constrained to zero, and the mean of person abilities can be estimated (De Boeck & Wilson, 2004).

2. Tuerlinckx and Wang (2004) used the term logit predictor, and Rijmen et al. (2003) used category covariates. Instead, we use the term step predictor, because a step is more intuitive and pertinent in terms of the local comparison approach. See the PCM expressed in terms of the local comparison.

3. Widely defined, a facet can be any factor, variable, condition, or component of the measurement situation that is assumed to influence test scores in a systematic way (Brennan, 2001; Eckes, 2009; Linacre, 1989).

Footnotes

Authors’ Note: Jinho Kim is also affiliated with KU Leuven and ITEC, imec research group at KU Leuven, Kortrijk, Belgium.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Collection of the Carbon Cycle assessment data was supported in part by grants from the National Science Foundation (NSF): Learning Progression on Carbon-Transforming Processes in Socio-Ecological Systems (NSF 0815993), and Targeted Partnership: Culturally relevant ecology, learning progressions and environmental literacy (NSF 0832173). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer-Verlag.
De Boeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York, NY: Springer-Verlag.
Eckes, T. (2009). Section H: Many-facet Rasch measurement. In S. Takala (Ed.), Reference supplement to the manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment. Strasbourg, France: Council of Europe/Language Policy Division. Retrieved from https://rm.coe.int/CoERMPublicCommonSearchServices/DisplayDCTMContent?documentId=0900001680667a23
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Fischer, G. H. (1972). A measurement model for the effect of mass-media. Acta Psychologica, 36, 207-220.
Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.
Fischer, G. H. (1977). Some probabilistic models for the description of attitudinal and behavioral changes under the influence of mass communication. In W. F. Kempf & B. H. Repp (Eds.), Mathematical models for social psychology (pp. 102-151). Vienna, Austria: Hans Huber.
Fischer, G. H., & Parzer, P. (1991). An extension of the rating scale model with an application to the measurement of change. Psychometrika, 56, 637-651.
Fischer, G. H., & Ponocny, I. (1994). An extension of the partial credit model with an application to the measurement of change. Psychometrika, 59, 177-192.
Glas, C. A. W., & Verhelst, N. D. (1989). Extensions of the partial credit model. Psychometrika, 54, 635-659.
Hartzel, J., Agresti, A., & Caffo, B. (2001). Multinomial logit random effects models. Statistical Modelling, 1, 81-102.
Janssen, R., Schepers, J., & Peres, D. (2004). Models with item and item group predictors. In P. De Boeck & M. Wilson (Eds.), Explanatory item response models: A generalized linear and nonlinear approach (pp. 189-212). New York, NY: Springer-Verlag.
Jin, H., Shin, H., Johnson, M. E., Kim, J., & Anderson, C. W. (2015). Developing learning progression-based teacher knowledge measures. Journal of Research in Science Teaching, 52, 1269-1295.
Johnson, T. R. (2007). Discrete choice models for ordinal response variables: A generalization of the stereotype model. Psychometrika, 72, 489-504.
Kim, J. (2018). Extensions and applications of item explanatory models to polytomous data in item response theory (Unpublished doctoral dissertation). University of California, Berkeley.
Kubinger, K. D. (2009). Applications of the linear logistic test model in psychometric research. Educational and Psychological Measurement, 69, 232-244.
Kuha, J. (2004). AIC and BIC: Comparisons of assumptions and performance. Sociological Methods & Research, 33, 188-229.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Masters, G. N., & Wright, B. D. (1997). The partial credit model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 101-121). New York, NY: Springer.
Mohan, L., Chen, J., & Anderson, C. W. (2009). Developing a multi-year learning progression for carbon cycling in socio-ecological systems. Journal of Research in Science Teaching, 46, 675-698.
Poinstingl, H. (2009). The Linear Logistic Test Model (LLTM) as the methodological foundation of item generating rules for a new verbal reasoning test. Psychology Science Quarterly, 51, 123-134.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2004). GLLAMM manual (UC Berkeley Division of Biostatistics Working Paper Series, Paper No. 160). University of California, Berkeley. Retrieved from http://www.bepress.com/ucbbiostat/paper160/
Reif, M. (2012). Applying a construction rational to a rule based designed questionnaire using the Rasch model and LLTM. Psychological Test and Assessment Modeling, 54, 73-89.
Rijmen, F., & De Boeck, P. (2002). The random weights linear logistic test model. Applied Psychological Measurement, 26, 271-285.
Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8, 185-205.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Tuerlinckx, F., & Wang, W. C. (2004). Models for polytomous data. In P. De Boeck & M. Wilson (Eds.), Explanatory item response models: A generalized linear and nonlinear approach (pp. 75-109). New York, NY: Springer-Verlag.
Vansteelandt, K. (2000). Formal models for contextualized personality psychology (Unpublished doctoral dissertation). KU Leuven, Flanders, Belgium.
Wagenmakers, E. J., & Farrell, S. (2004). AIC model selection using Akaike weights. Psychonomic Bulletin & Review, 11, 192-196.
Wang, W. C., & Wilson, M. (2005). Exploring local item dependence using a random-effects facet model. Applied Psychological Measurement, 29, 296-318.
Wilson, M. (2005). Constructing measures: An item response modeling approach. New York, NY: Taylor & Francis.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest 2.0 manual [Computer program]. Hawthorn, Australia: ACER.
Zheng, X., & Rabe-Hesketh, S. (2007). Estimating parameters of dichotomous and ordinal item response models with gllamm. Stata Journal, 7, 313-333.
