Abstract
In educational and psychological research, the logit and probit links are often used to fit binary item response data. The appropriateness and importance of the choice of link within the item response theory (IRT) framework have not yet been investigated. In this paper, we present a family of IRT models with generalized logit links, which includes the traditional logistic and normal ogive models as special cases. This family of models is flexible enough not only to adjust the tail probability of the item characteristic curve through two shape parameters but also to allow the same link or different links to be fitted to different items within the IRT model framework. In addition, the proposed models are implemented in the Stan software to sample from the posterior distributions. Using readily available Stan outputs, four Bayesian model selection criteria are computed to guide the choice of links within the IRT model framework. Extensive simulation studies are conducted to examine the empirical performance of the proposed models and the model fits in terms of “in-sample” and “out-of-sample” predictions based on the deviance. Finally, a detailed analysis of real reading assessment data is carried out to illustrate the proposed methodology.
Keywords: deviance information criterion, leave-one-out cross-validation, logarithm of the pseudomarginal likelihood, Markov chain Monte Carlo, Stan, widely applicable information criterion
Introduction
Item response theory (IRT) models, also called latent trait models, have been extensively used in educational testing and psychological measurement (Baker and Kim, 2004; Embretson and Reise, 2000; Lord and Novick, 1968; Van der Linden and Hambleton, 1997). As a latent variable modeling technique, IRT uses the response probability to model the interaction between an individual’s “ability” and item-level stimuli (difficulty, guessing, etc.), where the focus is on the pattern of responses rather than on composite or total score variables and linear regression theory. Specifically, IRT attempts to model student ability using question-level rather than aggregate test-level performance and focuses on the information each question provides about a student.
To fit binary item response data, the normal ogive or two-parameter probit (2PP) model was proposed by Lord (1952). In parallel with the 2PP model, the Rasch model (Rasch, 1960), or one-parameter logistic (1PL) model, and the two-parameter logistic (2PL) model (Birnbaum, 1957) have also been proposed, studied, and widely used. Moreover, Birnbaum (1968) modified the two-parameter logistic model to include a lower asymptote parameter representing the contribution of guessing to the probability of a correct response, yielding the well-known three-parameter logistic (3PL) model. The logistic model was chosen as an alternative to the normal ogive model (Lord, 1952) because of its more convenient mathematical properties. In the past three decades, the literature on IRT models has focused almost exclusively on the development and comparison of parameter estimation techniques and on how data characteristics (sample size, test length, and distribution of the true abilities) and violations of model assumptions (e.g., local independence) affect the capability of available algorithms to recover the generating parameters. Applied psychologists often do not question whether the mathematical form of the link function (item characteristic curve) can be derived from a psychological theory of performance in objective testing, as opposed to adopting a convenient function that the data are forced to fit (Bazán et al., 2006; García-Pérez, 1999).
In other biological and medical fields, there is a rich literature on the development of links for binary and ordinal response data, including Aranda-Ordaz (1981), Guerrero and Johnson (1982), Morgan (1983), Whittemore (1983), Stukel (1988), Czado and Santner (1992), Chen, Dey and Shao (1999), Kim et al. (2008), Wang and Dey (2010), Wang and Dey (2011), Jiang et al. (2014), and Roy and Dey (2014). In this paper, we focus on the class of generalized logistic models proposed by Stukel (1988). This class of links is governed by two shape parameters, (α1, α2). By varying the values of (α1, α2), the class includes the logit, probit, and complementary log–log links, as well as many other symmetric and skewed links, as special cases. It is flexible enough to allow us to fit the same link or different links to different items within the IRT model framework. Another attractive feature of this class is that it facilitates a convenient implementation of MCMC sampling from the posterior distribution in the recently developed software Stan. Our contributions are threefold: (1) to the best of our knowledge, we are the first to introduce the generalized logit link into IRT models; (2) we fit different links to different items; and (3) we implement this flexible class of links in Stan for the 1P to 3P models and provide the Stan code. Stan allows us to compute the LPML, DIC, WAIC, and LOO criteria from the samples drawn from the posterior distributions, and these four criteria can naturally guide the selection of links as well as the type of IRT model. Our detailed analysis of real reading assessment data empirically demonstrates that the IRT model with different generalized logit links for different items yields a substantial gain in fit over the traditional logistic and normal ogive models according to the LPML, DIC, WAIC, and LOO criteria.
The rest of the article is organized as follows. The Item Response Theory Models with Generalized Logistic Links section introduces Stukel’s generalized logistic models and their extensions in IRT models. The Bayesian Inference section is devoted to the specification of the priors, the implementation of MCMC sampling in the Stan software for the proposed models, and the construction and computation of Bayesian model selection criteria. Extensive simulation studies are conducted to examine the empirical performance of the proposed model and the model fittings in terms of “in-sample” and “out-of-sample” predictions based on the deviance in the Simulation section. In addition, an in-depth analysis of the reading assessment data is carried out in the Analysis of the Reading Assessment Data section. We conclude the article with a brief discussion in the Discussion section.
Item Response Theory Models with Generalized Logistic Links
Generalized Logit Links
Let y denote a dichotomous random variable. Assume that y = 1 with probability μ(η) and y = 0 with probability 1 − μ(η), where η is a linear predictor. The general form of the generalized logistic models (Glogits; Stukel, 1988) can be written as

$$\mu(\eta) = \frac{\exp\{h_{\alpha}(\eta)\}}{1+\exp\{h_{\alpha}(\eta)\}}, \tag{1}$$

where $h_{\alpha}(\eta)$ is a strictly increasing nonlinear function of η indexed by two shape parameters α = (α1, α2)′, which is defined as follows: for η > 0,

$$h_{\alpha}(\eta) = \begin{cases} \alpha_1^{-1}\{\exp(\alpha_1\eta)-1\}, & \alpha_1>0,\\ \eta, & \alpha_1=0,\\ -\alpha_1^{-1}\log(1-\alpha_1\eta), & \alpha_1<0, \end{cases} \tag{2}$$

and for η ≤ 0,

$$h_{\alpha}(\eta) = \begin{cases} -\alpha_2^{-1}\{\exp(-\alpha_2\eta)-1\}, & \alpha_2>0,\\ \eta, & \alpha_2=0,\\ \alpha_2^{-1}\log(1+\alpha_2\eta), & \alpha_2<0. \end{cases} \tag{3}$$
It is easy to see from (2) and (3) that the Glogit reduces to the logit link when α = (0, 0)′. As discussed in Stukel (1988), several important link functions can be approximated by members of this family, such as the probit link (α = (0.165, 0.165)′), log–log link (α = (−0.037, 0.62)′), complementary log–log link (α = (0.62, −0.037)′), and standard Laplace link (α = (−0.077, −0.077)′). Several symmetric and asymmetric link functions in this family are shown in Figures 1 and 2. First, from Figures 1 and 2, we see that the two shape parameters α1 and α2 independently govern the tails of the curve, where α1 controls the upper tail of the curve when η > 0, and α2 controls the lower tail of the curve when η ≤ 0. Second, when α1 > 0 and α2 < 0, the upper tail is thinner than that of the logit link and the lower tail is heavier. Third, when α1 < 0 and α2 > 0, the probability approaches 1 more slowly than under the logit link, while it approaches 0 faster. Fourth, when α1 = α2, the generalized item response models reduce to symmetric link models; in this case, when α1 = α2 < 0, the probability approaches 1 or 0 more slowly than under the logit link. Thus, by varying α1 and α2, we can easily control the skewness as well as the tails of the link.
Figure 1.
Plots of h(η) and μ(η) against η. The solid line is the logistic model α = (0, 0)′ and the dashed lines are two members of the h family, where the short dashed lines and the long dashed lines correspond to α = (−1, −1)′ and α = (0.5, 0.5)′, respectively.
Figure 2.
Plots of h(η) and μ(η) against η. The solid line is an asymmetric generalized logistic model, and the dashed lines are three other members of the h family, where the short dashed, long dashed, and dot-dashed lines correspond to three different values of α.
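The piecewise definitions in equations (2) and (3) are straightforward to compute. As an illustrative sketch (in Python, with our own function names; the paper's implementation is in Stan), the link and the resulting response probability can be written as:

```python
import math

def stukel_h(eta, a1, a2):
    """h_alpha(eta) from equations (2) and (3): strictly increasing in eta,
    equal to eta itself when a1 = a2 = 0 (the logit link)."""
    if eta > 0:                      # upper tail, governed by a1
        if a1 > 0:
            return (math.exp(a1 * eta) - 1.0) / a1
        if a1 < 0:
            return -math.log(1.0 - a1 * eta) / a1
        return eta
    # lower tail (eta <= 0), governed by a2
    if a2 > 0:
        return -(math.exp(-a2 * eta) - 1.0) / a2
    if a2 < 0:
        return math.log(1.0 + a2 * eta) / a2
    return eta

def mu(eta, a1, a2):
    """Response probability from equation (1): logistic cdf applied to h."""
    return 1.0 / (1.0 + math.exp(-stukel_h(eta, a1, a2)))
```

With α = (0, 0)′ this reproduces the logit link exactly, and, for example, with α = (−1, −1)′ one obtains μ(6) = 7/8 = 0.875 and μ(−6) = 1/8 = 0.125, consistent with the heavier tails shown in Figure 1.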
Generalized Item Response Theory Models
Let $y_{ij}$ be the response of the ith examinee to the jth item for i = 1, …, N and j = 1, …, J. Assume that $y_{ij} = 1$ with probability $p_{ij}$ of a correct response and $y_{ij} = 0$ with probability $1 - p_{ij}$. The three-parameter generalized logistic (3PGlogit) IRT model can be written as

$$p_{ij} = c_j + (1-c_j)\,\mu(\eta_{ij}), \tag{4}$$

where

$$\mu(\eta_{ij}) = \frac{\exp\{h_{\alpha_j}(\eta_{ij})\}}{1+\exp\{h_{\alpha_j}(\eta_{ij})\}}, \quad \eta_{ij} = a_j(\theta_i - b_j), \tag{5}$$

for i = 1, …, N and j = 1, …, J. In equations (4) and (5), θi denotes the latent ability of the ith examinee, and aj, bj, and cj are the item discrimination, difficulty, and pseudoguessing (lower asymptote) parameters. In equation (5), $h_{\alpha_j}(\eta_{ij})$ is a strictly increasing nonlinear function of $\eta_{ij}$ indexed by two shape parameters $\alpha_j = (\alpha_{1j}, \alpha_{2j})'$ for the jth item. Specifically, for $\eta_{ij} > 0$,
$$h_{\alpha_j}(\eta_{ij}) = \begin{cases} \alpha_{1j}^{-1}\{\exp(\alpha_{1j}\eta_{ij})-1\}, & \alpha_{1j}>0,\\ \eta_{ij}, & \alpha_{1j}=0,\\ -\alpha_{1j}^{-1}\log(1-\alpha_{1j}\eta_{ij}), & \alpha_{1j}<0, \end{cases} \tag{6}$$

and for $\eta_{ij} \le 0$,

$$h_{\alpha_j}(\eta_{ij}) = \begin{cases} -\alpha_{2j}^{-1}\{\exp(-\alpha_{2j}\eta_{ij})-1\}, & \alpha_{2j}>0,\\ \eta_{ij}, & \alpha_{2j}=0,\\ \alpha_{2j}^{-1}\log(1+\alpha_{2j}\eta_{ij}), & \alpha_{2j}<0. \end{cases} \tag{7}$$
The 3PGlogit model can be reduced to the 2PGlogit model by constraining the lower asymptote parameter cj to be zero, and the 1PGlogit model can be obtained by further constraining aj to be the same across all items. The three-parameter logistic (3PL) model is a special case of the 3PGlogit model when α1j = α2j = 0 for j = 1, …, J.
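To make equations (4) to (7) concrete, the 3PGlogit response probability for one examinee-item pair can be sketched as a single small function (Python, with our own naming; a hedged illustration rather than the paper's Stan implementation):

```python
import math

def p_3pglogit(theta, a, b, c, a1, a2):
    """Equations (4)-(7): response probability for one examinee-item pair.
    eta is the linear predictor from equation (5); h is Stukel's piecewise
    link; the pseudoguessing parameter c lifts the lower asymptote."""
    eta = a * (theta - b)
    if eta > 0:                                   # upper tail, equation (6)
        if a1 > 0:
            h = (math.exp(a1 * eta) - 1.0) / a1
        elif a1 < 0:
            h = -math.log(1.0 - a1 * eta) / a1
        else:
            h = eta
    else:                                         # lower tail, equation (7)
        if a2 > 0:
            h = -(math.exp(-a2 * eta) - 1.0) / a2
        elif a2 < 0:
            h = math.log(1.0 + a2 * eta) / a2
        else:
            h = eta
    return c + (1.0 - c) / (1.0 + math.exp(-h))
```

Setting c = 0 gives the 2PGlogit model; further setting α1j = α2j = 0 recovers the 2PL model.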
Note that the linear predictor $\eta_{ij} = a_j(\theta_i - b_j)$ in the 2PGlogit model is the same as the linear predictor in the traditional 2PL model. We now discuss the meanings and roles of the discrimination parameter, the difficulty parameter, and the two shape parameters in the 2PGlogit model.
The Meaning and Role of the Discrimination Parameter
It is known that the discrimination parameter aj is proportional to the slope of the item characteristic curve (ICC) at the point bj on the ability scale for the 2PL model; it equals the slope of the line tangent to the ICC at the difficulty parameter bj, which is the ICC’s steepest point. Items with steeper slopes are more useful for separating examinees into different ability levels than items with flatter slopes: high values of aj yield ICCs that are very steep, whereas low values of aj yield ICCs that increase gradually as a function of ability. Note that the steepest point does not occur at the tails of the ICC; in other words, the discrimination parameter does not control the rate at which the tail probability of the ICC approaches 1 (or 0). Therefore, the meaning of the discrimination parameter is unchanged in the linear predictor of the 2PGlogit model: its size still reflects the steepness of the ICC.
The Meaning and Role of the Difficulty Parameter
The difficulty parameter is the point on the ability scale where the probability of a correct response is 0.5. It is a location parameter, indicating the position of the ICC in relation to the ability scale. According to equations (6) and (7), the meaning of the difficulty parameter is also unchanged: when the difficulty parameter equals the ability, the probability of a correct response is 0.5. The greater the value of bj, the greater the ability required for an examinee to have a 50% chance of answering the item correctly.
The Meaning and Role of the Two Shape Parameters
Built upon the traditional IRT models, we introduce two shape parameters into the 2PGlogit model. For item j, the two shape parameters α1j and α2j independently govern the tails of the ICC, where α1j controls the upper tail of the ICC when θi − bj > 0, and α2j controls the lower tail of the ICC when θi − bj ≤ 0. When α1j > 0 and α2j < 0, the upper tail of the 2PGlogit model is thinner than that of the 2PL model and the lower tail is heavier. When α1j < 0 and α2j > 0, the probability approaches 1 more slowly than under the 2PL model, while it approaches 0 faster. When α1j = α2j, the 2PGlogit model reduces to a symmetric link model; in this case, when α1j = α2j < 0, the probability approaches 1 or 0 more slowly than under the 2PL model, and when α1j = α2j > 0, it approaches 1 or 0 faster. In effect, the two shape parameters adjust only the tail probabilities of the ICC. More specifically, the shape parameter controlling the lower tail captures the possibility that an examinee of low ability still has a certain probability of guessing the item correctly, while the shape parameter controlling the upper tail captures the possibility that even a high-ability examinee may answer correctly with probability well below 1 for a variety of reasons, including anxiety, carelessness, distraction by poor testing conditions, or misreading the question (Hockemeyer, 2002; Rulison & Loken, 2009). The two shape parameters are thus empirically similar to the upper and lower asymptote parameters in the traditional four-parameter IRT (4PL) model.
From Figure 1, we see that when (α1, α2)′ = (−1, −1)′, μ(6) = 0.875, the value of the upper tail probability under the 2PGlogit model; this implies that the upper asymptote parameter (d) in the 4PL model is approximately 0.875. Similarly, μ(−6) = 0.125 indicates that the lower asymptote parameter (c) in the 4PL model is approximately 0.125.
As discussed in Supplementary Appendix A4, the linear predictor aj(θi − bj) remains approximately the same under both the 2PGlogit and 2PL models according to a Taylor expansion. Thus, the interpretation of aj, bj, and θi under the 2PGlogit model is similar to that under the 2PL model. In summary, the two shape parameters, the discrimination parameter, and the difficulty parameter work together to capture the different characteristics of an item.
Bayesian Inference
The Prior and Posterior Distributions
Based on equations (5), (6), and (7), the likelihood function for the 3PGlogit IRT model is given by
$$f(\mathbf{y}\mid \boldsymbol{\theta}, \mathbf{a}, \mathbf{b}, \mathbf{c}, \boldsymbol{\alpha}) = \prod_{i=1}^{N}\prod_{j=1}^{J} p_{ij}^{\,y_{ij}}\,(1-p_{ij})^{1-y_{ij}}, \tag{8}$$
where y = (y11, …, y1J, …, yN1, …, yNJ)′, θ = (θ1, …, θN)′, a = (a1, …, aJ)′, b = (b1, …, bJ)′, c = (c1, …, cJ)′, and α = (α11, α21, …, α1J, α2J)′.
Since the posterior distribution is simply proportional to the product of the likelihood function (sample information) and the priors (prior information), both have an important influence on the posterior distribution. In our large-scale reading assessment data study (2000 examinees × 50 items), the likelihood plays a dominant role, while the priors do not exert a great influence on the posterior inference. In this paper, relatively noninformative priors are specified. We assume that θ, a, b, c, and α are independent a priori. The prior distribution of the ability parameter θi is assumed to be a standard normal distribution for i = 1, …, N. The prior for the discrimination parameter aj is a uniform distribution, aj ∼ U(0.5, 2.5). The prior for the difficulty parameter bj is a hierarchical prior distribution following Luo and Jiao (2017), where bj follows a normal distribution with mean μb and variance σb², μb ∼ N(0, 5²), and σb is assigned a prior truncated to be positive, where the indicator function 1{A} takes a value of 1 if A is true and a value of 0 if A is false. The prior for the guessing parameter cj is assumed to be a beta distribution, that is, cj ∼ Beta(5, 23). For Stukel’s models, Chen et al. (1999) pointed out that if the two shape parameters are not constrained and an improper uniform prior is chosen for the regression coefficients, the resulting posterior distribution is improper. Chen et al. (2002) proved that if a random variable ξ has cumulative probability distribution F(η) = μ(η), where μ(η) is defined in equation (1), then the first moment of ξ exists if α1 > −1 and α2 > −1. Thus, to ensure that the posterior distribution is proper (Chen et al. (1999), page 1184), the priors for the two shape parameters are assumed to be normal distributions truncated at −1. That is, α1j ∼ N(0, 5²)1{α1j > −1} and α2j ∼ N(0, 5²)1{α2j > −1}. Then, the joint posterior distribution of θ, a, b, c, α, μb, and σb given the observed data y takes the form
$$\pi(\boldsymbol{\theta}, \mathbf{a}, \mathbf{b}, \mathbf{c}, \boldsymbol{\alpha}, \mu_b, \sigma_b \mid \mathbf{y}) \propto f(\mathbf{y}\mid \boldsymbol{\theta}, \mathbf{a}, \mathbf{b}, \mathbf{c}, \boldsymbol{\alpha})\,\pi(\boldsymbol{\theta})\,\pi(\mathbf{a})\,\pi(\mathbf{b}\mid\mu_b,\sigma_b)\,\pi(\mathbf{c})\,\pi(\boldsymbol{\alpha})\,\pi(\mu_b)\,\pi(\sigma_b), \tag{9}$$
where the priors π(θ), π(a), π(b | μb, σb), π(c), π(α), π(μb), and π(σb) are specified above.
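For a given set of parameter values, the log of the likelihood in equation (8) is a double sum of Bernoulli log-densities over examinees and items. A minimal Python sketch (function names are ours; the paper's actual sampler is the Stan code in Supplementary Appendix A1):

```python
import math

def glogit_mu(eta, a1, a2):
    """Glogit curve: logistic cdf of Stukel's h (equations (6) and (7))."""
    if eta > 0:
        if a1 > 0:
            h = (math.exp(a1 * eta) - 1.0) / a1
        elif a1 < 0:
            h = -math.log(1.0 - a1 * eta) / a1
        else:
            h = eta
    else:
        if a2 > 0:
            h = -(math.exp(-a2 * eta) - 1.0) / a2
        elif a2 < 0:
            h = math.log(1.0 + a2 * eta) / a2
        else:
            h = eta
    return 1.0 / (1.0 + math.exp(-h))

def log_likelihood(y, theta, a, b, c, alpha1, alpha2):
    """Equation (8) on the log scale: sum over examinees i and items j of
    y_ij * log(p_ij) + (1 - y_ij) * log(1 - p_ij), with p_ij from equation (4)."""
    total = 0.0
    for i in range(len(theta)):
        for j in range(len(a)):
            p = c[j] + (1.0 - c[j]) * glogit_mu(a[j] * (theta[i] - b[j]),
                                               alpha1[j], alpha2[j])
            total += math.log(p) if y[i][j] == 1 else math.log(1.0 - p)
    return total
```

The per-observation terms inside the double loop are exactly the quantities that the Stan fit exports and that the model selection criteria below reuse.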
Implementation of the IRT Models in Stan
Markov chain Monte Carlo (MCMC) has revolutionized modern Bayesian computation, especially for complex IRT models. Many software programs have been developed to implement MCMC sampling, including WinBUGS (Lunn et al., 2000), OpenBUGS (Spiegelhalter et al., 2012), JAGS (Plummer, 2012), and several R packages (Hadfield, 2010; Geyer & Johnson, 2012; Martin et al., 2011). Stan (Stan Development Team, 2017), a relatively new Bayesian software program using the Hamiltonian Monte Carlo algorithm (HMC; Neal, 2011), is utilized here to implement MCMC sampling. HMC pairs each model parameter with a momentum variable, which determines how HMC explores the target distribution based on the posterior density at the currently drawn parameter values. It is therefore superior to the Gibbs (Geman & Geman, 1984) and Metropolis (Metropolis et al., 1953) algorithms, which can require long computation times to explore the posterior parameter space and reach convergence. Another advantage of Stan is that improper priors are allowed. Although the one- and two-parameter logistic models are introduced in the Stan user manual, this relatively new software has not been widely used in educational and psychological research. Luo and Jiao (2017) filled this gap by providing Stan code for several representative IRT models, including the three-parameter logistic model, polytomous IRT models, and their multidimensional and multilevel extensions. In this paper, we use rstan, an R package that interfaces with Stan in the R computing environment, and we have developed the Stan and R code for the Glogit IRT models. The Stan code for the 2PGlogit IRT model is provided in Supplementary Appendix A1 of the online supplement.
Bayesian Selection of the Links
In this paper, we consider the deviance information criterion (DIC; Spiegelhalter et al., 2002), the logarithm of the pseudomarginal likelihood (LPML; Geisser & Eddy, 1979; Ibrahim et al., 2001), the widely applicable information criterion (WAIC; Watanabe, 2010), and leave-one-out cross-validation (LOO; Vehtari et al., 2017) for comparing the IRT models with different links. All four criteria are based on the log-likelihood evaluated at the posterior samples of the model parameters. Vehtari et al. (2017) developed the R package loo to compute WAIC and LOO from the log-likelihood matrix output by the rstan package; the detailed development and definitions of WAIC and LOO can be found in Watanabe (2010) and Vehtari et al. (2017). A smaller value of WAIC (or LOO) indicates a better-fitting model.
Based on the samples drawn from the posterior distributions by the Stan software, the DIC and LPML can be easily computed. Write φ = (φij, i = 1, …, N, j = 1, …, J), where φij = (θi, aj, bj, cj, α1j, α2j)′ for i = 1, …, N and j = 1, …, J. Let {φ(1), …, φ(M)}, where φ(m) = (φij(m), i = 1, …, N, j = 1, …, J) for m = 1, …, M, denote an MCMC sample from the posterior distribution in equation (9). The logarithm of the likelihood function in equation (8) evaluated at φ(m) is given by
$$\log f(\mathbf{y}\mid \boldsymbol{\varphi}^{(m)}) = \sum_{i=1}^{N}\sum_{j=1}^{J} \log f(y_{ij}\mid \boldsymbol{\varphi}_{ij}^{(m)}), \tag{10}$$

where $\log f(y_{ij}\mid \boldsymbol{\varphi}_{ij}^{(m)}) = y_{ij}\log p_{ij}^{(m)} + (1-y_{ij})\log(1-p_{ij}^{(m)})$ with $p_{ij}^{(m)}$ computed from $\boldsymbol{\varphi}_{ij}^{(m)}$ via equations (4) and (5) for i = 1, …, N and j = 1, …, J. Since the log-likelihoods $\log f(y_{ij}\mid \boldsymbol{\varphi}_{ij}^{(m)})$, i = 1, …, N, j = 1, …, J, are readily available from the Stan output, log f(y | φ(m)) in equation (10) is easy to compute. Now, we calculate DIC as follows:
$$\mathrm{DIC} = \mathrm{Dev}(\bar{\boldsymbol{\varphi}}) + 2 p_D, \tag{11}$$

where

$$\overline{\mathrm{Dev}(\boldsymbol{\varphi})} = \frac{1}{M}\sum_{m=1}^{M}\mathrm{Dev}(\boldsymbol{\varphi}^{(m)}), \quad \bar{\boldsymbol{\varphi}} = \frac{1}{M}\sum_{m=1}^{M}\boldsymbol{\varphi}^{(m)}, \quad p_D = \overline{\mathrm{Dev}(\boldsymbol{\varphi})} - \mathrm{Dev}(\bar{\boldsymbol{\varphi}}).$$

In equation (11), $\overline{\mathrm{Dev}(\boldsymbol{\varphi})}$ is a Monte Carlo estimate of the posterior expectation of the deviance function Dev(φ) = −2 log f(y | φ), $\mathrm{Dev}(\bar{\boldsymbol{\varphi}})$ is the deviance function evaluated at the Monte Carlo estimate $\bar{\boldsymbol{\varphi}}$ of the posterior expectation of φ, and $p_D$ is the effective number of parameters. The model with a smaller DIC value fits the data better.
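Given the per-draw total log-likelihoods log f(y | φ(m)) and the log-likelihood evaluated at the posterior mean, the DIC in equation (11) reduces to a few lines; a hedged Python sketch (our own helper, not the paper's supplementary R code):

```python
def dic(loglik_draws, loglik_at_posterior_mean):
    """Equation (11): DIC = Dev(phi-bar) + 2 * p_D, where
    Dev-bar is the average of -2 * log-likelihood over the M draws and
    p_D = Dev-bar - Dev(phi-bar) is the effective number of parameters."""
    m = len(loglik_draws)
    dev_bar = sum(-2.0 * ll for ll in loglik_draws) / m
    dev_at_mean = -2.0 * loglik_at_posterior_mean
    p_d = dev_bar - dev_at_mean
    return dev_at_mean + 2.0 * p_d
```

Equivalently, DIC = Dev-bar + p_D, since Dev(φ̄) + 2p_D = (Dev(φ̄) + p_D) + p_D.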
Letting $\ell_{ij}^{(m)} = \log f(y_{ij}\mid \boldsymbol{\varphi}_{ij}^{(m)})$ and $\ell_{ij}^{*} = \min_{1\le m\le M}\ell_{ij}^{(m)}$ (so that $-\ell_{ij}^{*} = \max_{m}\{-\ell_{ij}^{(m)}\}$), a Monte Carlo estimate of the conditional predictive ordinate (CPO; Chen et al., 2000; Gelfand et al., 1992) is given by

$$\widehat{\mathrm{CPO}}_{ij} = \left\{\frac{1}{M}\sum_{m=1}^{M}\exp\!\left(-\ell_{ij}^{(m)}\right)\right\}^{-1} = \exp\!\left(\ell_{ij}^{*}\right)\left\{\frac{1}{M}\sum_{m=1}^{M}\exp\!\left(\ell_{ij}^{*}-\ell_{ij}^{(m)}\right)\right\}^{-1}. \tag{12}$$

Note that the maximum-value adjustment by $\ell_{ij}^{*}$ plays an important role in numerically stabilizing the computation of $\widehat{\mathrm{CPO}}_{ij}$ in equation (12). A summary statistic of the $\widehat{\mathrm{CPO}}_{ij}$ is the sum of their logarithms, which is called the LPML and given by

$$\mathrm{LPML} = \sum_{i=1}^{N}\sum_{j=1}^{J}\log \widehat{\mathrm{CPO}}_{ij}. \tag{13}$$
The model with a larger LPML has a better fit to the data.
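Equations (12) and (13) amount to, for each observation, a harmonic mean of the per-draw likelihoods, summed on the log scale; the maximum-value shift keeps the exponentials from overflowing. A Python sketch (function name is ours; the paper's R code is in Supplementary Appendix A2):

```python
import math

def lpml(loglik_matrix):
    """Equations (12) and (13): each row holds the M per-draw log-likelihoods
    of one observation; CPO-hat is the harmonic mean of the likelihoods, and
    LPML sums log(CPO-hat) over observations. The shift by the largest value
    of -loglik (the maximum-value adjustment) prevents overflow in exp()."""
    total = 0.0
    for row in loglik_matrix:
        m = len(row)
        shift = max(-v for v in row)
        s = sum(math.exp(-v - shift) for v in row)
        total += math.log(m) - shift - math.log(s)   # log CPO-hat for this obs
    return total
```

Without the shift, exp(−loglik) can overflow for draws that fit an observation poorly, which is exactly the numerical stabilization role noted above.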
The R code for computing DIC and LPML is developed and given in Supplementary Appendix A2 of the online supplement. From equations (11) and (13), it is easy to see that, using the available Stan outputs $\log f(y_{ij}\mid \boldsymbol{\varphi}_{ij}^{(m)})$, the computations of DIC and LPML are simple, fast, and efficient. In contrast, the R package loo requires a large amount of computer memory. Compared to DIC, LPML, and WAIC, LOO is the most expensive in terms of both computing time and memory space.
Simulation
Simulation 1
This simulation study is performed to validate the model specification (such as the choice of prior distributions) and to evaluate parameter recovery with the Bayesian sampling algorithm.
Simulation Designs
In this simulation study, two factors are considered to create different test conditions. The first factor is the sample size, which is varied at three levels (N = 1000, 2000, 3000). The second factor is the test length, which is varied at two levels (J = 20, 40). For data generation, we consider only the general 2PGlogit model.
True Values and Prior Distributions
The true values of the parameters in the item response model are set as follows: the discrimination parameters aj are generated from a truncated normal distribution restricted to positive values, j = 1, …, J, where the indicator function 1{A} takes a value of 1 if A is true and 0 otherwise. In addition, the difficulty parameters bj are generated from a standard normal distribution. The two shape parameters α1j and α2j are generated from a normal distribution with mean 0 and variance 0.5², truncated at −1; that is, α1j ∼ N(0, 0.5²)1{α1j > −1} and α2j ∼ N(0, 0.5²)1{α2j > −1}. The latent abilities θi are generated from a standard normal distribution for i = 1, …, N. In this simulation, the discrimination parameters aj are assigned a noninformative truncated normal prior distribution restricted to positive values. The prior for the difficulty parameter bj is a standard normal distribution. The priors for the two shape parameters are normal distributions truncated at −1 to ensure that the posterior distribution is proper (Chen et al., 1999, page 1184; Chen et al., 2002); that is, α1j ∼ N(0, 5²)1{α1j > −1} and α2j ∼ N(0, 5²)1{α2j > −1}.
Convergence Diagnosis and Accuracy Evaluation Criteria
In this simulation, Stan, interfaced with R, is used to implement the MCMC sampling; chains of length 3000 are run. Convergence is checked by monitoring the trace plots: if there is no change point or trend in the plot, convergence of the generated sequence is accepted. The trace plots show that all parameter estimates stabilize after 500 iterations and converge quickly thereafter; thus, the first 500 iterations are set as the burn-in period. Because of space limitations, the trace plots of the parameters are not shown here. Another method is to use the Gelman–Rubin diagnostic (Gelman & Rubin, 1992) to check the convergence of the parameters of interest. All of the chains share the same burn-in period (500 iterations). The values of the potential scale reduction factor (PSRF; Brooks & Gelman, 1998) are computed with the rstan package; we find that the PSRFs of the ability and item parameters are 1.00.
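For reference, the basic (unsplit) potential scale reduction factor for a scalar parameter can be computed from several equal-length chains as follows; this is a simplified Python sketch of the Gelman–Rubin statistic (rstan's own monitor uses a refined split-chain version):

```python
def psrf(chains):
    """Basic Gelman-Rubin potential scale reduction factor for one scalar
    parameter, from m chains of equal length n (after burn-in):
    R-hat = sqrt(((n - 1)/n * W + B/n) / W)."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)     # between-chain
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m                 # within-chain
    return (((n - 1) / n * w + b / n) / w) ** 0.5
```

Values near 1.00 for all parameters, as observed here, indicate that the chains have mixed well.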
The accuracy of the parameter estimates is measured by five evaluation criteria: bias, mean squared error (MSE), standard deviation (SD), standard error (SE), and the coverage probability (CP) of the 95% HPD intervals. Let η be the parameter of interest, and assume that M = 100 data sets (100 replications) are generated. Also, let $\hat\eta^{(m)}$ and $\mathrm{sd}^{(m)}(\eta)$ denote the posterior mean and the posterior standard deviation of η obtained from the mth simulated data set for m = 1, …, M.
The bias for parameter η is defined as

$$\mathrm{Bias}(\eta) = \frac{1}{M}\sum_{m=1}^{M}\hat\eta^{(m)} - \eta,$$

and the MSE for parameter η is defined as

$$\mathrm{MSE}(\eta) = \frac{1}{M}\sum_{m=1}^{M}\left(\hat\eta^{(m)} - \eta\right)^{2}.$$

The simulation SE is the square root of the sample variance of the posterior estimates over the simulated data sets,

$$\mathrm{SE}(\eta) = \left\{\frac{1}{M-1}\sum_{m=1}^{M}\left(\hat\eta^{(m)} - \bar{\hat\eta}\right)^{2}\right\}^{1/2}, \quad \bar{\hat\eta} = \frac{1}{M}\sum_{m=1}^{M}\hat\eta^{(m)},$$

and the average posterior standard deviation is defined as

$$\mathrm{SD}(\eta) = \frac{1}{M}\sum_{m=1}^{M}\mathrm{sd}^{(m)}(\eta).$$

The coverage probability is defined as

$$\mathrm{CP}(\eta) = \frac{1}{M}\sum_{m=1}^{M} 1\left\{\eta \in \mathrm{HPD}^{(m)}_{95\%}\right\},$$

where $\mathrm{HPD}^{(m)}_{95\%}$ is the 95% HPD interval for η from the mth data set.
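The five criteria above can be computed directly from the per-replication posterior summaries; a Python sketch (names are ours), where each replication m contributes a posterior mean, a posterior standard deviation, and a 95% HPD interval:

```python
def recovery_stats(true_value, post_means, post_sds, hpd_intervals):
    """Bias, MSE, SE, SD, and CP over M replications, as defined above:
    post_means[m] and post_sds[m] are the posterior mean and sd from
    replication m; hpd_intervals[m] is the (lower, upper) 95% HPD interval."""
    m = len(post_means)
    mean_est = sum(post_means) / m
    bias = mean_est - true_value
    mse = sum((e - true_value) ** 2 for e in post_means) / m
    se = (sum((e - mean_est) ** 2 for e in post_means) / (m - 1)) ** 0.5
    sd = sum(post_sds) / m
    cp = sum(1 for lo, hi in hpd_intervals if lo <= true_value <= hi) / m
    return bias, mse, se, sd, cp
```

Note that SE measures the spread of the posterior means across replications, while SD averages the posterior spread within each replication; comparing the two underlies conclusion (2) below.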
Accuracy Analysis of Parameter Estimation
The average Bias, MSE, SD, SE, and CP for the discrimination, difficulty, shape, and ability parameters under the six simulation conditions are shown in Table 1. The following conclusions can be drawn. (1) Given the test length, when the number of individuals increases from 1000 to 3000, the average MSE, SD, and SE for the discrimination, difficulty, and two shape parameters show a decreasing trend. For example, for a test length of 40 items, when the number of individuals increases from 1000 to 3000, the average MSE of the discrimination parameters decreases from 0.055 to 0.033, the average SE from 0.154 to 0.143, and the average SD from 0.257 to 0.197. For the difficulty parameters, the average MSE decreases from 0.015 to 0.007, the average SE from 0.071 to 0.046, and the average SD from 0.088 to 0.055. For the shape parameters α1•, the average MSE decreases from 0.082 to 0.049, the average SE from 0.175 to 0.158, and the average SD from 0.277 to 0.206. For the shape parameters α2•, the average MSE decreases from 0.069 to 0.048, the average SE from 0.148 to 0.136, and the average SD from 0.281 to 0.222. (2) The average SDs of the item parameters are larger than the average SEs, which indicates that the fluctuation of the posterior means across replications is small compared with the posterior spread within each replication. (3) Under the six simulation conditions, the average CPs of the discrimination, difficulty, shape, and ability parameters are all about 0.950.
(4) When the number of individuals is fixed and the number of items increases from 20 to 40, the average MSE, SD, and SE show that the recovery of the discrimination and difficulty parameters is close to that with a test length of 20, which indicates that the Hamiltonian Monte Carlo algorithm (HMC; Neal, 2011) is stable and does not lose accuracy as the number of items increases. (5) Given the number of individuals, when the test length increases from 20 to 40, the average MSE, SE, and SD for the ability parameters decrease markedly. For example, with the number of individuals fixed at 1000, when the number of items increases from 20 to 40, the average MSE of the ability parameters decreases from 0.137 to 0.082, the average SE from 0.302 to 0.240, and the average SD from 0.343 to 0.261. (6) Given the test length, when the number of individuals increases from 1000 to 3000, the recovery of the ability parameters is essentially the same across the three conditions. This again verifies that the HMC algorithm is stable and accurate even when a large number of ability parameters (3000 abilities) are estimated with the smallest number of items (20 items). In summary, the new Bayesian software program using HMC provides accurate estimates of the item and ability parameters across the various numbers of individuals and items and can therefore be used to guide practice.
Table 1.
Accuracy of the parameter estimates under the six simulation conditions in simulation study 1.
No. of items = 20 (the three column groups correspond to No. of individuals = 1000, 2000, and 3000, each with Bias, MSE, SE, SD, and CP):

| Parameter | Bias | MSE | SE | SD | CP | Bias | MSE | SE | SD | CP | Bias | MSE | SE | SD | CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Discrimination a | 0.059 | 0.043 | 0.148 | 0.274 | 0.966 | 0.066 | 0.034 | 0.144 | 0.241 | 0.976 | 0.072 | 0.030 | 0.140 | 0.220 | 0.980 |
| Difficulty b | 0.014 | 0.019 | 0.068 | 0.087 | 0.942 | 0.016 | 0.014 | 0.051 | 0.065 | 0.938 | 0.003 | 0.010 | 0.044 | 0.055 | 0.939 |
| Shape α1• | −0.102 | 0.086 | 0.171 | 0.297 | 0.960 | −0.067 | 0.061 | 0.162 | 0.255 | 0.971 | −0.060 | 0.057 | 0.159 | 0.229 | 0.964 |
| Shape α2• | 0.049 | 0.058 | 0.140 | 0.300 | 0.991 | 0.034 | 0.051 | 0.137 | 0.268 | 0.994 | 0.022 | 0.048 | 0.135 | 0.249 | 0.994 |
| Ability θ | 0.007 | 0.137 | 0.302 | 0.343 | 0.942 | 0.012 | 0.137 | 0.299 | 0.339 | 0.938 | 0.003 | 0.136 | 0.298 | 0.338 | 0.994 |

No. of items = 40 (the three column groups correspond to No. of individuals = 1000, 2000, and 3000, each with Bias, MSE, SE, SD, and CP):

| Parameter | Bias | MSE | SE | SD | CP | Bias | MSE | SE | SD | CP | Bias | MSE | SE | SD | CP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Discrimination a | 0.028 | 0.055 | 0.154 | 0.257 | 0.945 | 0.037 | 0.041 | 0.151 | 0.219 | 0.946 | 0.044 | 0.033 | 0.143 | 0.197 | 0.950 |
| Difficulty b | 0.012 | 0.015 | 0.071 | 0.088 | 0.946 | 0.015 | 0.010 | 0.055 | 0.066 | 0.943 | 0.003 | 0.007 | 0.046 | 0.055 | 0.939 |
| Shape α1• | −0.057 | 0.082 | 0.175 | 0.277 | 0.935 | −0.029 | 0.058 | 0.167 | 0.233 | 0.936 | −0.029 | 0.049 | 0.158 | 0.206 | 0.942 |
| Shape α2• | 0.072 | 0.069 | 0.148 | 0.281 | 0.964 | 0.048 | 0.054 | 0.139 | 0.243 | 0.961 | 0.038 | 0.048 | 0.136 | 0.222 | 0.959 |
| Ability θ | 0.007 | 0.082 | 0.240 | 0.261 | 0.941 | 0.012 | 0.081 | 0.238 | 0.258 | 0.939 | 0.003 | 0.080 | 0.237 | 0.256 | 0.939 |
Note. Bias, MSE, SE, SD, and CP denote the average Bias, MSE, SE, SD, and CP for the parameters. α1• denotes the collection of shape parameters α1j, j = 1, 2, …, J, and α2• denotes the collection of shape parameters α2j, j = 1, 2, …, J.
Simulation 2
In this simulation study, we use the four Bayesian model assessment criteria to evaluate model fit. Two issues warrant study. The first is whether the four criteria can accurately identify, among the candidate fitted models, the true model that generated the data. The second is to study the over-fitting and under-fitting that occur when the fitted model differs from the true model.
Simulation Designs
In this simulation, the number of individuals is N = 2000 and the test length is fixed at 40. Item responses are generated within the framework of the two-parameter Glogit model. Three item response models are considered: (1) the same fixed, known link for all items, with αj = (0, 0)′ (the 2PL model); (2) the same unknown Glogit link for all items, denoted 2PGlogit(α); and (3) different unknown Glogit links for different items, denoted 2PGlogit(αj), j = 1, …, J. Therefore, we evaluate the model fitting in the following three cases.

Case 1: True model: 2PL vs. fitted models: 2PL, 2PGlogit(α), and 2PGlogit(αj);

Case 2: True model: 2PGlogit(α) vs. fitted models: 2PL, 2PGlogit(α), and 2PGlogit(αj);

Case 3: True model: 2PGlogit(αj) vs. fitted models: 2PL, 2PGlogit(α), and 2PGlogit(αj).
The true values and prior distributions for the parameters are specified in the same way as in Simulation 1. To implement the MCMC sampling algorithm, chains of length 3000 with an initial burn-in period of 500 are chosen. The results of Bayesian model assessment based on 100 replications are shown in Table 2. Note that the reported LPML, DIC, WAIC, and LOO values are averages over the 100 replications. Additionally, boxplots of the four Bayesian model assessment criteria are shown in Figure 3.
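All four criteria are computed from readily available Stan outputs, namely the MCMC draws of the pointwise log-likelihood. As a minimal illustrative sketch (our own, not the authors' code; the function name `waic_lpml` is hypothetical), WAIC and LPML can be obtained from an S × n matrix of log-likelihood draws as follows. DIC additionally requires the deviance evaluated at the posterior means of the parameters, so it is omitted here.

```python
import numpy as np

def waic_lpml(log_lik):
    """log_lik: (S draws) x (n observations) matrix of pointwise log-likelihoods."""
    S = log_lik.shape[0]
    # lppd_i = log of the posterior-mean likelihood (log-sum-exp for stability)
    lppd = np.logaddexp.reduce(log_lik, axis=0) - np.log(S)
    # WAIC penalty: pointwise posterior variance of the log-likelihood
    p_waic = np.var(log_lik, axis=0, ddof=1)
    waic = -2.0 * np.sum(lppd - p_waic)
    # CPO_i is the harmonic mean of the likelihood draws; LPML = sum of log CPO_i
    log_cpo = -(np.logaddexp.reduce(-log_lik, axis=0) - np.log(S))
    lpml = np.sum(log_cpo)
    return waic, lpml
```

In practice the authors use the R package loo on the Stan output; this sketch only mirrors the underlying formulas (lppd minus a variance penalty for WAIC, harmonic-mean CPOs for LPML).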
Table 2.
The results of Bayesian model assessment in simulation study 2.
| True Model | Fitted Model | LPML | DIC | WAIC | LOO |
|---|---|---|---|---|---|
| | 2PL | −37,822.0 (67) | 75,627.9 (49) | 75,632.6 (54) | 75,657.2 (67) |
| 2PL | 2PGlogit(α1, α2) | −37,823.0 (33) | 75,629.1 (43) | 75,633.7 (40) | 75,659.3 (33) |
| | 2PGlogit(α1j, α2j) | −37,855 (0) | 75,649.9 (8) | 75,659.1 (6) | 75,713.5 (0) |
| | 2PL | −36,546 (0) | 73,087.2 (0) | 73,079.8 (0) | 73,106.0 (0) |
| 2PGlogit(α1, α2) | 2PGlogit(α1, α2) | −36,482.6 (100) | 72,926.3 (96) | 72,932.0 (82) | 72,975.3 (97) |
| | 2PGlogit(α1j, α2j) | −36,508 (0) | 72,957.1 (4) | 72,946.6 (18) | 73,019.3 (3) |
| | 2PL | −39,782 (0) | 79,541.1 (0) | 79,551.0 (0) | 79,577.5 (0) |
| 2PGlogit(α1j, α2j) | 2PGlogit(α1, α2) | −39,779.1 (0) | 79,541.5 (0) | 79,546.8 (0) | 79,571.4 (0) |
| | 2PGlogit(α1j, α2j) | −39,552.1 (100) | 79,040.0 (100) | 79,016.9 (100) | 79,143.5 (100) |
Note. The number in parentheses is the percentage of the 100 replications in which the criterion selected the correct model.
Figure 3.
The boxplots of the LPML, DIC, WAIC, and LOO in simulation study 2. In the first row, true model: 2PGlogit(0, 0) (2PL) vs. fitted models: 2PGlogit(0, 0), 2PGlogit(α1, α2), and 2PGlogit(α1j, α2j); in the second row, true model: 2PGlogit(α1, α2) vs. the same three fitted models; in the third row, true model: 2PGlogit(α1j, α2j) vs. the same three fitted models.
From Table 2, we find that when the 2PL model is the true model, the 2PL model is chosen as the best-fitting model by the LPML, DIC, WAIC, and LOO, as expected. The LPML, DIC, WAIC, and LOO are, respectively, −37,822.0; 75,627.9; 75,632.6; and 75,657.2. The second best-fitting model is the 2PGlogit(α1, α2) model; however, the four criteria for the two models are essentially the same. This indicates that the fitted 2PGlogit(α1, α2) model is essentially equivalent to the 2PL model, which may be attributed to the estimated shape parameters being close to zero. The 2PGlogit(α1j, α2j) model fits the data worst: the differences between the 2PL and 2PGlogit(α1j, α2j) models in LPML, DIC, WAIC, and LOO are 33, −22.0, −26.5, and −56.3, respectively. When the 2PGlogit(α1, α2) model is the true model, the LPML, DIC, WAIC, and LOO consistently choose the 2PGlogit(α1, α2) model as the best-fitting model, with values of −36,482.6; 72,926.3; 72,932.0; and 72,975.7, respectively. The second best-fitting model is the most complex 2PGlogit(α1j, α2j) model. The differences between the 2PGlogit(α1, α2) and 2PGlogit(α1j, α2j) models in LPML, DIC, WAIC, and LOO are 25.8, −30.8, −14.6, and −43.6, and the differences between the 2PGlogit(α1, α2) and 2PL models are 63.7, −160.9, −147.8, and −130.3. This shows that when the data are generated from the 2PGlogit(α1, α2) model, the complex 2PGlogit(α1j, α2j) model, which allows different shape parameters for each item, is more flexible than the 2PL model, whose shape is fixed for every item; the 2PGlogit(α1j, α2j) model therefore fits the data more adequately than the under-fitted 2PL model. When the 2PGlogit(α1j, α2j) model is the true model, all four criteria select the 2PGlogit(α1j, α2j) model as the best-fitting model, and the other two models severely under-fit the data.
The differences between the 2PGlogit(α1j, α2j) and 2PL models in LPML, DIC, WAIC, and LOO are 230.1, −501.1, −534.1, and −434, and the differences between the 2PGlogit(α1j, α2j) and 2PGlogit(α1, α2) models in LPML, DIC, WAIC, and LOO are 227, −501.5, −529.9, and −427.9. In summary, we draw the following conclusions from the model assessment criteria. If the data are generated from the simple 2PL model, the differences between the 2PL model and the most complex 2PGlogit(α1j, α2j) model in LPML, DIC, WAIC, and LOO are small compared with those in Cases 2 and 3, so it is also acceptable to fit the 2PGlogit(α1j, α2j) model to the data. If, however, the data come from the complex 2PGlogit(α1j, α2j) model, fitting the simple 2PL model leads to serious under-fitting, so the simple model is inappropriate. Combining these observations, whether the data come from a simple or a complex model, the 2PGlogit(α1j, α2j) model is an appropriate choice for fitting the data.
Simulation 3
In this simulation study, we investigate model fit in terms of “in-sample” and “out-of-sample” predictions based on the deviance under the two-parameter IRT models.
Simulation Designs
The number of individuals N = 2000 is considered, and the test length is fixed at 40 (J = 40). The true values and prior distributions for the parameters are specified in the same way as in Simulation 1. The true models and fitted models are as follows:
Three true models: 2PL model (2PGlogit(0,0) model), 2PGlogit(α1, α2) model, and 2PGlogit(α1j, α2j) model.
Three fitted models: 2PL model (2PGlogit(0,0) model), 2PGlogit(α1, α2) model, and 2PGlogit(α1j, α2j) model.
To implement the MCMC sampling algorithm, chains of length 3000 with an initial burn-in period of 500 are chosen, and 11 replications are considered in this simulation study. To evaluate the model fitting, we consider four indices: the In-Sample Average (ISA), In-Sample Percentage (ISP), Out-Sample Average (OSA), and Out-Sample Percentage (OSP). Their definitions and computational procedures are given in detail in Supplementary Appendix A3 of the online supplement. The results of the ISAs, ISPs, OSAs, and OSPs are reported in Table 3.
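The precise definitions of the ISA, ISP, OSA, and OSP are given in Supplementary Appendix A3 and are not reproduced here. As a hedged sketch of their common ingredient (function names are ours), the Bernoulli deviance of a binary response matrix given model probabilities can be computed as below; the 2PL ICC is shown only as one way of producing those probabilities. An in-sample deviance evaluates the fitted probabilities on the training responses, while an out-of-sample deviance evaluates them on freshly generated responses from the same examinees.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(y=1 | theta) for each person-item pair."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))

def bernoulli_deviance(y, p):
    """Deviance -2 * log-likelihood of 0/1 responses y given probabilities p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))
```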
Table 3.
The ISAs, ISPs, OSAs, and OSPs based on the average deviance.
| True Model | Fitted Model | ISA | ISP | OSA | OSP |
|---|---|---|---|---|---|
| | 2PL | 73,733.0 | 0 | 77,520.0 | 0.91 |
| 2PL | 2PGlogit(α1, α2) | 73,732.7 | 0 | 77,527.9 | 0.09 |
| | 2PGlogit(α1j, α2j) | 73,687.5 | 1 | 77,887.5 | 0 |
| | 2PL | 71,154.4 | 0 | 74,956.5 | 0.28 |
| 2PGlogit(α1, α2) | 2PGlogit(α1, α2) | 70,995.4 | 0 | 74,924.8 | 0.72 |
| | 2PGlogit(α1j, α2j) | 70,959.7 | 1 | 75,289.6 | 0 |
| | 2PL | 73,205.5 | 0 | 77,101.6 | 0.25 |
| 2PGlogit(α1j, α2j) | 2PGlogit(α1, α2) | 73,120.9 | 0 | 77,112.1 | 0.13 |
| | 2PGlogit(α1j, α2j) | 72,564.5 | 1 | 77,053.9 | 0.63 |
Abbreviations: ISA, in-sample average; ISP, in-sample percentage; OSA, out-sample average; OSP, out-sample percentage.
We see from Table 3 that regardless of whether the data are generated from the simple 2PL model or the complex 2PGlogit(α1j, α2j) model, the ISAs and ISPs always choose the most complex 2PGlogit(α1j, α2j) model as the optimal model. However, these two indices are calculated from the average deviance without a penalty term. In Simulation 2, we empirically showed that when both the deviance and the penalty term are considered, the four Bayesian model assessment criteria accurately identify the true model that generated the data. This indicates that the penalty term plays an important role in limiting over-fitting by complex models in the in-sample case. In the out-of-sample case, when the true and fitted models coincide, the OSA attains its minimum and the OSP its maximum. That is, the model that generated the data still has the best out-of-sample performance among the three models, so the true model yields the best cross-validation based on out-of-sample prediction.
Analysis of the Reading Assessment Data
We consider a subset of the real data analyzed in Tao et al. (2012). The test is from a large-scale state reading assessment, which consists of 50 dichotomous items and one polytomous (5-category) item. We focus on the responses of the 2000 subjects to the 50 dichotomous items and exclude the responses to the polytomous item. This state reading assessment data set is referred to as the SRA data hereafter. The proportions correct for items (PCIs) from the SRA data are given in Table 4, and the proportion correct for subjects is shown in Figure 4. From Table 4, we see that the PCIs differ from item to item; the largest and smallest PCIs are 0.8625 and 0.3900, corresponding to items 6 and 39, respectively. Figure 4 indicates a large heterogeneity in the proportion correct among subjects, with a minimum of 0, a maximum of 1, and an interquartile range of (0.46, 0.84). These differences among items as well as among individuals suggest a need for flexible links in fitting the SRA data.
Table 4.
The proportion correct for items (PCIs) in the SRA data.
| Item | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| PCI | 0.7630 | 0.7080 | 0.6000 | 0.7110 | 0.7585 | 0.8625 | 0.8345 | 0.6805 | 0.3950 | 0.5895 |
| Item | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| PCI | 0.7205 | 0.4490 | 0.7050 | 0.6840 | 0.7330 | 0.7355 | 0.4760 | 0.5620 | 0.6295 | 0.5805 |
| Item | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| PCI | 0.5305 | 0.6825 | 0.6835 | 0.7610 | 0.7130 | 0.4880 | 0.6090 | 0.6745 | 0.4785 | 0.5140 |
| Item | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 |
| PCI | 0.6320 | 0.4910 | 0.7770 | 0.6330 | 0.6230 | 0.5880 | 0.6430 | 0.5840 | 0.3900 | 0.5015 |
| Item | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 |
| PCI | 0.4430 | 0.7590 | 0.4875 | 0.6780 | 0.6695 | 0.6705 | 0.7215 | 0.7325 | 0.5955 | 0.4965 |
Figure 4.

Frequency histogram of the proportion correct for 2000 subjects.
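The PCIs in Table 4 and the subject proportions summarized in Figure 4 are simply the column and row means of the 2000 × 50 binary response matrix. A small sketch (function name ours):

```python
import numpy as np

def proportion_correct(responses):
    """responses: (N subjects) x (J items) 0/1 matrix.
    Returns the per-item and per-subject proportion-correct vectors."""
    pci = responses.mean(axis=0)  # proportion correct for each item (PCI)
    pcs = responses.mean(axis=1)  # proportion correct for each subject
    return pci, pcs
```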
To fit the SRA data, we consider generalized logit IRT models with different numbers of parameters, including (a) one parameter (1P), two parameters (2P), and three parameters (3P); and (b) different shape parameter vectors for the links. The following four types of links are considered:
(i) the same fixed known links for all items: Glogit(0, 0) (logit link), Glogit(0.165, 0.165) (probit link), Glogit(−0.037, 0.62) (log–log link), Glogit(0.62, −0.037) (complementary log–log link), and Glogit(−0.077, −0.077) (Laplace link);
(ii) the same unknown symmetric Glogit link Glogit(α, α) for all items;
(iii) the same unknown general Glogit link Glogit(α1, α2) for all items; and
(iv) different unknown general Glogit links Glogit(α1j, α2j), j = 1, …, J, for different items.
For each generalized IRT model, we compute LPML, DIC, WAIC, and LOO; the values for all models under consideration are reported in Table 5. From Table 5, we see that all four criteria consistently choose the 2PGlogit(α1j, α2j) model as the best among all models under consideration; its LPML, DIC, WAIC, and LOO values are, respectively, −50,489.7; 100,905.1; 100,906.1; and 101,001.9. Within the one- and three-parameter sub-models, the four criteria likewise consistently choose the 1P and 3P models with the case (iv) link function over all other 1PGlogit and 3PGlogit models. As for the traditional 2PL and 2PNO models, the 2PGlogit(α1j, α2j) model fits the SRA data substantially better than both: the differences in LPML, DIC, WAIC, and LOO between the 2PL (2PGlogit(0, 0)) and 2PGlogit(α1j, α2j) models are −129.0, 350.3, 323.7, and 235.3, respectively, and the differences between the 2PNO (2PGlogit(0.165, 0.165)) and 2PGlogit(α1j, α2j) models are −168.4, 381.6, 392.0, and 313.5, respectively. Therefore, different Glogit links for different items are much more desirable and appropriate than the traditional logit and probit links for the SRA data.
Table 5.
The values of LPML, DIC, WAIC, and LOO for IRT models with generalized logistic links for the SRA data.
| Link | No. of Parameters | LPML | DIC | WAIC | LOO |
|---|---|---|---|---|---|
| 1PL | 2050 | −51,423.7 | 102,913.8 | 102,845.8 | 102,847.8 |
| 1PGlogit(0.165,0.165) | 2050 | −51,349.8 | 102,714.0 | 102,695.1 | 102,699.7 |
| 1PGlogit(−0.037,0.62) | 2050 | −51,786.8 | 103,563.5 | 103,570.2 | 103,573.9 |
| 1PGlogit(0.62,−0.037) | 2050 | −51,309.6 | 102,537.9 | 102,599.9 | 102,617.8 |
| 1PGlogit(−0.077,−0.077) | 2050 | −51,495.0 | 103,061.2 | 102,989.3 | 102,990.5 |
| 1PGlogit(α, α) | 2051 | −51,343.2 | 102,668.2 | 102,677.4 | 102,685.8 |
| 1PGlogit(α1, α2) | 2052 | −51,275.6 | 102,529.8 | 102,541.2 | 102,550.9 |
| 1PGlogit(α1j, α2j) | 2150 | −50,655.7 | 101,073.0 | 101,199.2 | 101,359.6 |
| 2PL | 2100 | −50,618.7 | 101,255.4 | 101,229.8 | 101,237.2 |
| 2PGlogit(0.165,0.165) | 2100 | −50,658.1 | 101,286.7 | 101,298.6 | 101,315.4 |
| 2PGlogit(−0.037,0.62) | 2100 | −50,659.7 | 101,313.5 | 101,310.5 | 101,318.9 |
| 2PGlogit(0.62,−0.037) | 2100 | −50,781.6 | 101,451.5 | 101,519.6 | 101,560.2 |
| 2PGlogit(−0.077,−0.077) | 2100 | −50,622.1 | 101,282.9 | 101,239.5 | 101,244.3 |
| 2PGlogit(α, α) | 2101 | −50,616.4 | 101,258.1 | 101,225.7 | 101,232.8 |
| 2PGlogit(α1, α2) | 2102 | −50,614.9 | 101,256.6 | 101,222.9 | 101,229.6 |
| 2PGlogit(α1j, α2j) | 2200 | −50,489.7 | 100,905.1 | 100,906.1 | 101,001.9 |
| 3PL | 2150 | −50,757.7 | 101,535.1 | 101,505.9 | 101,515.0 |
| 3PGlogit(0.165,0.165) | 2150 | −50,796.3 | 101,557.8 | 101,571.8 | 101,591.6 |
| 3PGlogit(−0.037,0.62) | 2150 | −50,813.4 | 101,590.3 | 101,617.3 | 101,626.8 |
| 3PGlogit(0.62,−0.037) | 2150 | −50,906.1 | 101,688.4 | 101,756.7 | 101,803.5 |
| 3PGlogit(−0.077,−0.077) | 2150 | −50,759.1 | 101,551.5 | 101,512.4 | 101,518.3 |
| 3PGlogit(α, α) | 2151 | −50,757.5 | 101,532.4 | 101,505.8 | 101,514.7 |
| 3PGlogit(α1, α2) | 2152 | −50,762.7 | 101,541.4 | 101,517.4 | 101,525.2 |
| 3PGlogit(α1j, α2j) | 2250 | −50,666.1 | 101,171.8 | 101,265.5 | 101,324.0 |
Abbreviations: LPML, logarithm of the pseudomarginal likelihood; DIC, deviance information criterion; WAIC, widely applicable information criterion; LOO, leave-one-out cross-validation; IRT, Item response theory.
Figure 5 shows the 95% highest posterior density (HPD) intervals of α1j − α2j for the 50 items under the 1PGlogit(α1j, α2j), 2PGlogit(α1j, α2j), and 3PGlogit(α1j, α2j) models. In Figure 5, 24, 19, and 14 of the 95% HPD intervals of α1j − α2j do not include 0 under the 1PGlogit(α1j, α2j), 2PGlogit(α1j, α2j), and 3PGlogit(α1j, α2j) models, respectively. In addition, the 95% HPD intervals of α1j and α2j for the 50 items under the three models are shown in Figure B1 in Supplementary Appendix B of the online supplement. We see from Figure B1 that the 95% HPD intervals of α1j are quite different from those of α2j for many items. These results indicate that asymmetric links are more desirable for many items in the SRA data, which further explains why all four criteria (LPML, DIC, WAIC, and LOO) consistently select Glogit(α1j, α2j) as the best link within each of the 1P, 2P, and 3P IRT models.
Figure 5.
Plots of 95% HPD intervals of α1j−α2j under the 1PGlogit(α1j, α2j) model (left), the 2PGlogit(α1j, α2j) model (middle), and the 3PGlogit(α1j, α2j) model (right).
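The HPD intervals in Figure 5 were obtained with the R package boa. As a generic illustration (not boa's code; the function name is ours), the shortest interval containing 95% of the MCMC draws of a scalar quantity such as α1j − α2j can be found by scanning all candidate intervals over the sorted draws, which is valid for unimodal posteriors:

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing `prob` mass of the MCMC draws
    (empirical shortest-interval estimate; assumes a unimodal posterior)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    m = int(np.ceil(prob * n))          # number of draws inside the interval
    widths = x[m - 1:] - x[: n - m + 1] # width of every candidate interval
    i = int(np.argmin(widths))          # index of the shortest one
    return x[i], x[i + m - 1]
```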
The box-plots of the MCMC samples of the discrimination and pseudoguessing parameters for the 50 items from the posterior distributions under the 2PGlogit(α1j, α2j) and 3PGlogit(α1j, α2j) models are shown in Figure 6. We see from Figure 6 that there are substantial differences among the discrimination parameters. Therefore, it is not appropriate to set all discrimination parameters equal to one for these data, and the 2PGlogit(α1j, α2j) model should fit the data better than the 1PGlogit(α1j, α2j) model. The posterior distributions of the discrimination parameters thus support the finding, according to LPML, DIC, WAIC, and LOO, that the 2PGlogit(α1j, α2j) model is better than the 1PGlogit(α1j, α2j) model. Moreover, the box-plots of the pseudoguessing parameters indicate that their posterior means are less than 0.1 for most items. Therefore, adding pseudoguessing parameters to the 2PGlogit(α1j, α2j) model may not provide sufficient gain in goodness-of-fit to offset the added complexity of 50 extra parameters. The small posterior estimates of the pseudoguessing parameters provide an empirical justification of why all four criteria consistently select the 2PGlogit(α1j, α2j) model over the 3PGlogit(α1j, α2j) model.
Figure 6.
The box-plots of the MCMC samples of the discrimination and pseudoguessing parameters for 50 items from the posterior distributions under the 2PGlogit(α1j, α2j) and 3PGlogit(α1j, α2j) models, respectively.
In all Bayesian computations, we used 2500 MCMC samples after a burn-in of 500 iterations for each model using RStan. The HPD intervals were computed using the R package boa. For the SRA data, with the 2500 MCMC samples, the object size from RStan under the 2PGlogit(α1j, α2j) model was about 2.2 GB. It took about 1.65 hours under the 2PGlogit(α1j, α2j) model and 0.77 hours under the 2PL model (2PGlogit(0, 0)) to generate 3000 iterations in RStan. We saved the Stan outputs as an RData file; after closing R, we reopened R and ran the R package loo to compute WAIC and LOO. The computation of each LOO value in Table 5 required about 10 GB of RAM. We used Intel(R) Core(TM) i5-2500 CPU @ 3.30 GHz computers with 16 GB of RAM to carry out all computations.
It is worth noting that the discrimination parameter estimates depend on the estimates of the shape parameters of the link function, as observed from Table B2 of the Supplementary Material. Moreover, the difficulty parameter estimates also depend on the estimates of the shape parameters, as observed from Table B3 of the Supplementary Material. Therefore, the discrimination, difficulty, and shape parameters work together to control the shape of the ICC, although the role of each item parameter in governing the shape of the ICC is quite different. The means and standard deviations of the α1js and α2js under the 2PGlogit(α1j, α2j) model are summarized in Table B4 of the Supplementary Material.
Due to the large number of parameters, we report the posterior estimates only for the 50 shape parameter vectors under the 2PGlogit(α1j, α2j) model. Tables B5, B6, and B7 in Supplementary Appendix B of the online supplement give the posterior means, standard deviations, 2.5% and 97.5% posterior percentiles, effective numbers of simulation draws, and potential scale reduction factors (PSRF; Gelman & Rubin, 1992) of α1j and α2j for j = 1, …, 50. From these tables, we see that the largest PSRF value for the shape parameters is 1.01, which is less than 1.05; the PSRF values for all other parameters (not shown here) are also smaller than 1.05. These results indicate that the MCMC sampling algorithm implemented in RStan has practically converged.
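The PSRF of Gelman and Rubin (1992) compares between-chain and within-chain variability for each parameter; values near 1 indicate convergence. A basic (non-split) sketch for a scalar parameter, with the function name ours:

```python
import numpy as np

def psrf(chains):
    """Gelman-Rubin potential scale reduction factor.
    chains: (M chains) x (n draws) array for one scalar parameter."""
    M, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled posterior-variance estimate
    return float(np.sqrt(var_plus / W))
```

Well-mixed chains give a PSRF close to 1, while chains stuck around different modes inflate the between-chain term and push the PSRF well above the usual 1.05 cutoff.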
Based on the estimates of the two shape parameters in Tables B5, B6, and B7 in the Supplementary Material, we can draw the following conclusions. (1) The estimated values of the two shape parameters are, for most items, clearly nonzero, which indicates that the traditional symmetric 2PL model is inappropriate for the SRA data. (2) To compare shape parameters between items, we select two items to illustrate the role of the shape parameters in adjusting the tail probability of the ICCs: the ICC of item 19 is relatively close to that of the 2PL (2PGlogit(0, 0)) model, whereas the ICC of item 21 approaches 1 (0) at a faster rate than that of item 19. (3) To compare shape parameters within an item, we again take item 19 as an example: its estimates imply that the upper tail of the ICC under the 2PGlogit(α1,19, α2,19) link is thinner than that of the 2PL link, while the lower tail is heavier.
Discussion
This article presents a class of generalized logit links within the IRT model framework for analyzing binary item response data. In the analysis of the real data, we showed that the 2PGlogit(α1j, α2j) IRT model fits the data much better than the conventional IRT models with logit and probit links according to all four Bayesian model selection criteria. Furthermore, we empirically demonstrated that the choice of link plays an important role in model fit.
Next, we further demonstrate the advantages of the new Glogit model over the traditional IRT models. More specifically, why is the more complex Glogit model attractive to educational psychologists compared with the classical IRT models (e.g., 2PL, 3PL, and 4PL)? First, why is the two-parameter IRT model inadequate compared with our 2PGlogit model? We see from Figures 1 and 2 that, after introducing two shape parameters, the tail probability of the Glogit link function is more flexible than that of the traditional logit link. Real data are complex and diverse, so it is particularly important to build a model that can automatically reflect the information in the data. For the traditional two-parameter IRT models, it may be inappropriate to impose a symmetric ICC without considering the characteristic structure of the data. Our 2PGlogit model, by contrast, is flexible enough not only to adjust the ICC tail probabilities through the two shape parameters but also to allow different links for different items. Second, what are the disadvantages of the four-parameter IRT models compared with our 2PGlogit model? Relative to the two-parameter models, the four-parameter IRT models add an upper asymptote parameter and a lower asymptote (pseudoguessing) parameter. The lower asymptote parameter, which controls the lower tail of the ICC, captures the fact that an examinee with low ability still has some probability of guessing an item correctly, while the upper asymptote parameter, which controls the upper tail of the ICC, captures the fact that even a high-ability examinee may answer an item correctly with probability well below 1, for the reasons discussed in the Generalized Item Response Theory Models section. However, the traditional four-parameter model cannot give a good answer to the following two questions.
(i) Some items may be guessed correctly by examinees with low ability, and examinees with high ability may make mistakes on some items. Is there any basis for judging the level of ability? (ii) It is well known that any model must satisfy certain assumptions before it can be used. The latent ability in the IRT model effectively ranges between −3 and 3. Although an examinee's ability can in theory be unbounded, within the IRT framework an ability close to 3 is regarded as high, and when the ability exceeds this range the model no longer holds. Similarly, the two asymptote parameters in the 4PL model also need to be restricted, otherwise the inferences will be seriously biased. In practice, for each item, the lower asymptote parameter ranges from 0.05 to 0.25, while the upper asymptote parameter ranges from 0.75 to 0.98 (Loken & Rulison, 2010). The 2PGlogit model divides the ICC into two parts by treating the difference (θ − b) between ability and item difficulty as a “threshold” value (see equations (6) and (7)); the two parts of the ICC are continuous and smooth (see Figures 1 and 2). For an item, if an examinee's ability exceeds the item difficulty, we regard the examinee as having high ability. To the right of the “threshold” value, the ICC can approach 1 as flatly as the ICC of the four-parameter IRT model (the examinee tends to slip) or as steeply as desired (the examinee does not tend to slip); the rate of approach to 1 is adjusted by the shape parameter α1j.
In this way, the restriction that the asymptote parameters impose on the four-parameter IRT model is avoided, and the shape of the ICC is “data-driven.” Third, because the generalized logistic distribution is not a mixture of a discrete point-mass distribution and a continuous logistic distribution, the 2PGlogit model is advantageous in terms of prior specification and posterior computation.
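For reference, the four-parameter logistic ICC discussed above has the standard form P(θ) = c + (d − c)/(1 + exp(−a(θ − b))), where c and d are the lower and upper asymptotes. A small sketch (function name ours):

```python
import numpy as np

def icc_4pl(theta, a, b, c, d):
    """Four-parameter logistic ICC: discrimination a, difficulty b,
    lower asymptote c (guessing), upper asymptote d (slipping)."""
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))
```

As θ → −∞ the probability approaches c (guessing) and as θ → +∞ it approaches d (slipping); these are exactly the two limits that Loken and Rulison (2010) restrict to roughly [0.05, 0.25] and [0.75, 0.98], whereas the 2PGlogit shape parameters control the tail behavior without fixed asymptotes.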
As discussed in the Implementation of the IRT Models in Stan and Simulation sections, one limitation of the Stan software is that a large amount of RAM is required. The Stan and R codes in Supplementary Appendix A may not run on a computer with limited RAM when the number of subjects, the number of items, or the MCMC sample size is substantially increased. Therefore, it would be desirable to develop a stand-alone R package to help educational psychologists conduct educational and psychological assessment. In this article, the prior distributions specified for α1j and α2j are mutually independent non-informative priors. Would other types of prior distributions, such as shrinkage priors or fusion priors (Song & Cheng, 2020), improve the accuracy of parameter estimation and further reduce model over-fitting? These questions require an in-depth investigation and deserve a future research project. In addition, only LPML, DIC, WAIC, and LOO are considered in the current study; other Bayesian model selection criteria, such as marginal likelihoods, may also be quite useful within the IRT framework. Further, the Glogit link can easily be extended to fit item response data with multilevel structures or polytomous responses, and other types of links, such as those developed in Bazán et al. (2006) or Kim et al. (2008), can be considered for fitting item response data. These extensions are beyond the scope of this paper but are currently under investigation.
Supplemental Material
Supplemental material, sj-pdf-1-apm-10.1177_01466216221089343 for Bayesian Item Response Theory Models With Flexible Generalized Logit Links by Jiwei Zhang, Ying-Ying Zhang, Jian Tao and Ming-Hui Chen in Applied Psychological Measurement
Acknowledgments
We would like to thank the Editor, the Associate Editor, and the referee for their constructive suggestions and comments, which have led to an improved version of the manuscript.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Dr J. Zhang’s work was partially supported by the National Natural Science Foundation of China (Grant No. 12001091) and China Postdoctoral Science Foundations (Grant No. 2021M690587 and Grant No. 2021T140108). Dr Y.-Y. Zhang’s research was partially supported by the Ministry of Education (MOE) Project of Humanities and Social Sciences on the West and the Border Area (20XJC910001), and the National Social Science Fund of China (21XTJ001). Dr Chen’s research was partially supported by US National Institute of Health grants #GM70335 and #P01CA142538.
ORCID iDs: Jian Tao
https://orcid.org/0000-0002-0343-1426
Ming-Hui Chen
https://orcid.org/0000-0003-1935-2447
Supplemental Material: Supplemental material for this article is available online.
References
- Aranda-Ordaz F. J. (1981). On two families of transformations to additivity for binary response data. Biometrika, 68(2), 357–363. 10.1093/biomet/68.2.357 [DOI] [Google Scholar]
- Baker F. B., Kim S. H. (2004). Item response theory: Parameter estimation techniques. Marcel Dekker. [Google Scholar]
- Bazán J. L., Branco M. D., Bolfarine H. (2006). A skew item response model. Bayesian Analysis, 1(4), 861–892. 10.1214/06-ba128. [DOI] [Google Scholar]
- Birnbaum A. (1957). Efficient design and use of tests ofa mental ability for various decision-making problems. Series Report No. 58-16. Randolph air force base. USAF School of Aviation Medicine. [Google Scholar]
- Birnbaum A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord F. M., Novick M. R. (Eds.), Statistical theories of mental test scores (pp. 397–479). MIT Press. [Google Scholar]
- Brooks S. P., Gelman A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4), 434–455. 10.1080/10618600.1998.10474787 [DOI] [Google Scholar]
- Chen M.-H., Dey D. K., Shao Q.-M. (1999). A new skewed link model for dichotomous quantal response data. Journal of the American Statistical Association, 94(448), 1172–1186. 10.1080/01621459.1999.10473872 [DOI] [Google Scholar]
- Chen M.-H., Dey D. K., Wu Y. (2002). On robustness of choice of links in binomial regression. Calcutta Statistical Association Bulletin, 53(1–2), 145–164. 10.1177/0008068320020113 [DOI] [Google Scholar]
- Chen M.-H., Ibrahim J. G., Shao Q.-M. (2000). Monte Carlo methods in Bayesian computation. Springer. [Google Scholar]
- Czado C., Santner T. J. (1992). The effect of link misspecification on binary regression inference. Journal of Statistical Planning and Inference, 33(2), 213–231. 10.1016/0378-3758(92)90069-5 [DOI] [Google Scholar]
- Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Erlbaum. [Google Scholar]
- García-Pérez M. A. (1999). Fitting logistic IRT models: small/wonder. The Spanish Journal of Psychology, 2(1), 74–94. 10.1017/s1138741600005473. [DOI] [PubMed] [Google Scholar]
- Geisser S., Eddy W. F. (1979). A predictive approach to model selection. Journal of the American Statistical Association, 74(365), 153–160. 10.1080/01621459.1979.10481632 [DOI] [Google Scholar]
- Gelfand A. E., Dey D. K., Chang H. (1992). Model determinating using predictive distributions with implementation via sampling-based methods (with Discussion). In Bernado J.M., Berger J.O., Dawid A.P., Smith A.F.M. (Eds.), Bayesian statistics 4 (pp. 147–167). Oxford University Press. [Google Scholar]
- Gelman A., Rubin D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472. 10.1214/ss/1177011136 [DOI] [Google Scholar]
- Geman S., Geman D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–741. 10.1109/tpami.1984.4767596 [DOI] [PubMed] [Google Scholar]
- Geyer G. J., Johnson L. T. (2012). MCMC: Markov chain Monte Carlo (R package). Technical Report. http://www.stat.umn.edu/geyer/mcmc/ [Google Scholar]
- Guerrero V. M., Johnson R. A. (1982). Use of the Box-Cox transformation with binary response models. Biometrika, 69(2), 309–314. 10.1093/biomet/69.2.309 [DOI] [Google Scholar]
- Hadfield J. D. (2010). MCMC methods for multi-response generalized linear mixed models: The MCMCglmm R package. Journal of Statistical Software, 33(2), 1–22. http://www.jstatsoft.org/v33/i02/.20808728 [Google Scholar]
- Hockemeyer C. (2002). A comparison of non-deterministic procedures for the adaptive assessment of knowledge. Psychologische Beitrage, 44(4), 495–503. [Google Scholar]
- Ibrahim J. G., Chen M.-H., Sinha D. (2001). Bayesian survival analysis. Springer. [Google Scholar]
- Jiang X., Dey D. K., Prunier R., Wilson A. M., Holsinger K. E. (2014). A new class of flexible link functions with application to species co-occurrence in Cape floristic region. The Annals of Applied Statistics, 7(4), 2180–2204. 10.1214/13-AOAS663 [DOI] [Google Scholar]
- Kim S., Chen M.-H., Dey D. K. (2008). Flexible generalized t-link models for binary response data. Biometrika, 95(1), 93–106. 10.1093/biomet/asm079. [DOI] [Google Scholar]
- Loken E., Rulison K. L. (2010). Estimation of a four-parameter item response theory model. British Journal of Mathematical and Statistical Psychology, 63(3), 509–525. 10.1348/000711009X474502
- Lord F. M. (1952). A theory of mental test scores. Psychometric monograph (No. 7). Psychometric Society.
- Lord F. M., Novick M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
- Lunn D. J., Thomas A., Best N., Spiegelhalter D. (2000). WinBUGS – a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10(4), 325–337. 10.1023/a:1008929526011
- Luo Y., Jiao H. (2017). Using the Stan program for Bayesian item response theory. Educational and Psychological Measurement, 78(3), 384–408. 10.1177/0013164417693666
- Martin A. D., Quinn K. M., Park J. H. (2011). MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical Software, 42(9), 1–21. 10.18637/jss.v042.i09
- Metropolis N., Rosenbluth A. W., Rosenbluth M. N., Teller A. H., Teller E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21(6), 1087–1092. 10.1063/1.1699114
- Morgan B. J. T. (1983). Observations on quantitative analysis. Biometrics, 39(4), 879–886. 10.2307/2531323
- Neal R. M. (2011). MCMC using Hamiltonian dynamics. In Brooks S., Gelman A., Jones G. L., Meng X.-L. (Eds.), Handbook of Markov chain Monte Carlo (pp. 113–162). Chapman & Hall/CRC Press. 10.1201/b10905-6
- Plummer M. (2012). JAGS version 3.2.0 user manual. http://mcmc-jags.sourceforge.net
- Rasch G. (1960). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research.
- Roy V., Dey D. K. (2014). Propriety of posterior distribution arising in categorical and survival models under generalized extreme value distribution. Statistica Sinica, 24(2), 699–722. 10.5705/ss.2012.011
- Rulison K. L., Loken E. (2009). I’ve fallen and I can’t get up: Can high-ability students recover from early mistakes in CAT? Applied Psychological Measurement, 33(2), 83–101. 10.1177/0146621608324023
- Song Q., Cheng G. (2020). Bayesian fusion estimation via t shrinkage. Sankhyā A: The Indian Journal of Statistics, 82(2), 353–385. 10.1007/s13171-019-00177-0
- Spiegelhalter D. J., Best N. G., Carlin B. P., Van Der Linde A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), 583–639. 10.1111/1467-9868.00353
- Spiegelhalter D., Thomas A., Best N., Lunn D. (2012). OpenBUGS user manual. http://www.openbugs.info
- Stan Development Team (2017). Stan modeling language user’s guide and reference manual (version 2.16.0). http://mc-stan.org/documentation/
- Stukel T. A. (1988). Generalized logistic models. Journal of the American Statistical Association, 83(402), 426–431. 10.1080/01621459.1988.10478613
- Tao J., Shi N.-Z., Chang H.-H. (2012). Item-weighted likelihood method for ability estimation in tests composed of both dichotomous and polytomous items. Journal of Educational and Behavioral Statistics, 37(2), 298–315. 10.3102/1076998610393969
- Van der Linden W. J., Hambleton R. K. (Eds.). (1997). Handbook of modern item response theory. Springer.
- Vehtari A., Gelman A., Gabry J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432. 10.1007/s11222-016-9696-4
- Wang X., Dey D. K. (2010). Generalized extreme value regression for binary response data: An application to B2B electronic payments system adoption. The Annals of Applied Statistics, 4(4), 2000–2023. 10.1214/10-aoas354
- Wang X., Dey D. K. (2011). Generalized extreme value regression for ordinal response data. Environmental and Ecological Statistics, 18(4), 619–634. 10.1007/s10651-010-0154-8
- Watanabe S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.
- Whittemore A. S. (1983). Transformations to linearity in binary regression. SIAM Journal on Applied Mathematics, 43(4), 703–710. 10.1137/0143048
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Supplemental material, sj-pdf-1-apm-10.1177_01466216221089343 for Bayesian Item Response Theory Models With Flexible Generalized Logit Links by Jiwei Zhang, Ying-Ying Zhang, Jian Tao and Ming-Hui Chen in Applied Psychological Measurement