We are very grateful to Professors Chae Young Lim, Insu Paek, Jorge Luis Bazán, Irene Klugkist and Mr. Duco Veen for their insightful comments on and discussions of this paper. Their contributions complement our paper from several different perspectives and will hopefully stimulate interest in future research directions on this topic. We address some of the discussants’ comments below, in no particular order.
Professor Klugkist and Mr. Veen shared some insights into why the prior-based method (PMC) mentioned in the paper would not work well for most IRT data. They suggested that the comparison of computation costs might be fairer if the number of iterations for each method were chosen so that the standard errors of the estimated log marginal likelihoods are comparable. We applied the Chib method to the 2PL model with 6000 samples: the estimated log marginal likelihood and the corresponding Monte Carlo standard error are −11 878.3 and 5.61, and the computing time is about 7600 s. They also pointed out that log-normal priors for the aj’s may be better in the real data application. In addition, they commented on the software/programming choices and computed the marginal likelihoods using the bridge sampling technique (Meng & Wong, 1996) via the rstan (Stan Development Team, 2018) and bridgesampling (Gronau, Singmann, & Wagenmakers, 2017) R packages with a log-normal prior for aj. We ran the discussants’ code with the U(0.5, 2.5) prior for aj under the 2PL model setting, and the estimated log marginal likelihood is −11 878.81. It would be interesting in future work to compare the bridge sampling method in detail with the methods mentioned in our paper in terms of accuracy and computation cost.
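To illustrate how a Monte Carlo standard error attaches to an estimated log marginal likelihood, the following minimal Python sketch applies the prior-based estimator to a toy conjugate normal model. It is not the IRT setting of the paper. The model, the data `y`, the hyperparameters, and all function names are illustrative assumptions, and the standard error on the log scale is a delta-method approximation.

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def prior_based_log_marginal(y, num_draws=20000, sigma=1.0, mu0=0.0, tau0=1.0, seed=1):
    """Average the likelihood over prior draws; return (log estimate, approx. SE)."""
    rng = random.Random(seed)
    log_liks = []
    for _ in range(num_draws):
        theta = rng.gauss(mu0, tau0)  # draw from the prior (toy assumption)
        log_liks.append(sum(log_normal_pdf(yi, theta, sigma) for yi in y))
    # Subtract the maximum before exponentiating to avoid underflow.
    shift = max(log_liks)
    weights = [math.exp(v - shift) for v in log_liks]
    mean_w = sum(weights) / num_draws
    var_w = sum((w - mean_w) ** 2 for w in weights) / (num_draws - 1)
    log_ml = shift + math.log(mean_w)
    # Delta method: SE(log m_hat) is approximately SE(m_hat) / m_hat.
    se_log_ml = math.sqrt(var_w / num_draws) / mean_w
    return log_ml, se_log_ml


def exact_log_marginal(y, sigma=1.0, mu0=0.0, tau0=1.0):
    """Closed-form log marginal likelihood for the conjugate toy model."""
    n = len(y)
    post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
    post_mean = post_var * (mu0 / tau0**2 + sum(y) / sigma**2)
    log_lik = sum(log_normal_pdf(yi, post_mean, sigma) for yi in y)
    log_prior = log_normal_pdf(post_mean, mu0, tau0)
    log_post = log_normal_pdf(post_mean, post_mean, math.sqrt(post_var))
    return log_lik + log_prior - log_post


y = [0.3, -0.1, 0.8, 1.2, 0.5]  # hypothetical data
estimate, std_err = prior_based_log_marginal(y)
exact = exact_log_marginal(y)
print(f"estimate = {estimate:.4f}, SE = {std_err:.4f}, exact = {exact:.4f}")
```

With only five observations the prior and posterior overlap well, so the estimator behaves. As the data become more informative, fewer prior draws land where the likelihood is appreciable and the standard error grows, which is consistent with the discussants' remark about IRT data.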
Professor Lim pointed out that the results might be sensitive to the choice of a* and b* when applying the methods introduced by Chen (2005) and Chib (1995). Under the 2PL IRT model, we also computed the marginal likelihood by choosing a* and b* as the first quartile (Q25): the estimated log marginal likelihood and the corresponding Monte Carlo standard error are −11 777.5 and 27.36, which differs from the value of −11 878.81 obtained using the posterior median as a* and b*. The choice of a* and b* does play an important role in computing the marginal likelihood. It seems necessary to try a few different choices to check that the estimated values of the marginal likelihoods do not change substantially.
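The role of the evaluation point can be seen in the identity underlying the Chib (1995) approach: log m(y) = log L(θ*) + log π(θ*) − log π(θ* | y) for any θ*. The minimal Python sketch below checks this on a toy conjugate normal model in which the posterior ordinate is available in closed form, so the identity returns the same value at every θ*. The model, data, and function names are illustrative assumptions and not part of the paper's IRT analysis. In the IRT setting the posterior ordinate has to be estimated from MCMC output, and that estimate is typically less precise at a point in a lower-density region of the posterior, which is one way the choice of a* and b* can affect the result.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def log_marginal_via_identity(y, theta_star, sigma=1.0, mu0=0.0, tau0=1.0):
    """log m(y) = log L(theta*) + log prior(theta*) - log posterior(theta*)."""
    n = len(y)
    # Conjugate update for y_i ~ N(theta, sigma^2), theta ~ N(mu0, tau0^2).
    post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
    post_mean = post_var * (mu0 / tau0**2 + sum(y) / sigma**2)
    log_lik = sum(log_normal_pdf(yi, theta_star, sigma) for yi in y)
    log_prior = log_normal_pdf(theta_star, mu0, tau0)
    log_post = log_normal_pdf(theta_star, post_mean, math.sqrt(post_var))
    return log_lik + log_prior - log_post


y = [0.3, -0.1, 0.8, 1.2, 0.5]  # hypothetical data
at_center = log_marginal_via_identity(y, theta_star=0.45)  # near the posterior mean
at_tail = log_marginal_via_identity(y, theta_star=-1.0)    # far in the tail
print(at_center, at_tail)  # agree up to floating-point rounding
```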
Professor Paek’s comments are very constructive and helpful in understanding the assumptions of the IRT model and the distinction between the Rasch model and the 1PL IRT model. The Rasch model and the 1PL model were developed independently in different countries. They reflect different perspectives and use different parameterizations, though they are equivalent in terms of model-data fit. In the paper, we followed Verhagen, Levy, Millsap, and Fox (2016) in assigning the N(0, 1) prior to the θi. We now consider the following formulation for the 1PL model:
| (1.1) |
and assume a ~ U(0.5, 2.5) and θi ~ N(0, 1) independently, with the same hierarchical prior for bj as in Section 2 of our paper. The values of DIC and WAIC for this model are 23 224.85 and 23 428.63, respectively. Comparing these values with those shown in Table 1 of the paper, we see that this new 1PL model fits the data slightly better than the 1PL model considered in the paper but worse than the 2PL model. The estimated log marginal likelihood and the corresponding Monte Carlo standard error using the Chen method with 60 000 MCMC samples are −12 176.87 and 1.45. Again, the log marginal likelihood favors this new 1PL model over the one considered in the paper. As with DIC and WAIC, this new 1PL model is still worse than the 2PL model according to the log marginal likelihood. We are very thankful to Professor Paek for helping to make the difference between these two concepts clearer to readers.
Professor Bazán raised a few interesting points about the paper. Within the Bayesian framework, all parameters are random. We assumed a hierarchical prior for the item difficulty parameters because a relatively informative prior is required for the marginal likelihood. Regarding different model selection criteria, the marginal likelihoods (or Bayes factors) have certain desirable theoretical properties, such as posterior model selection consistency (Liang, Paulo, Molina, Clyde, & Berger, 2008), and they can be calibrated for model comparison (García-Donato & Chen, 2005), while information-based criteria such as DIC and WAIC allow for non-informative priors and are computationally simple. In this paper, we primarily focused on the “best” implementation of each Monte Carlo method for computing the marginal likelihoods under the IRT models, as marginal likelihoods are not commonly used in the psychometric literature. Since the marginal likelihoods are not analytically tractable, the true values under these IRT models are unknown. Our expectation is that the true value should emerge when several distinct but relatively reliable Monte Carlo methods produce similar estimates. To address Professor Bazán’s fourth comment, we found that the R package BayesianTools implements two of the five Monte Carlo methods reviewed in our paper, namely, the prior-based method and the harmonic mean method. The third method implemented in BayesianTools is the method developed by Chib and Jeliazkov (2001) based on the Metropolis algorithm. In terms of reproducibility, Professor Klugkist and Mr. Veen have reproduced the results shown in our paper. However, we do not have the right to release the data to the public.
Finally, we note that the total number of students should be 468 rather than 452 (typo) in the real data application in the paper.
Acknowledgments
We would like to thank Professors Chae Young Lim, Insu Paek, Jorge Luis Bazán, Irene Klugkist and Mr. Duco Veen again for their helpful comments and insightful discussions. We are also very grateful to Professor Jaeyong Lee and the Editor-in-Chief, Professor Hee-Seok Oh, for organizing this discussion. Dr. Hu’s research was supported by the Dean’s office of the College of Liberal Arts and Sciences at the University of Connecticut. Dr. Chen’s research was partially supported by US NIH grants #GM 70335 and #P01CA142538.
References
- Chen M-H (2005). Computing marginal likelihoods from a single MCMC output. Statistica Neerlandica, 59(1), 16–29.
- Chib S (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432), 1313–1321.
- Chib S, & Jeliazkov I (2001). Marginal likelihood from the Metropolis–Hastings output. Journal of the American Statistical Association, 96(453), 270–281.
- García-Donato G, & Chen M-H (2005). Calibrating Bayes factor under prior predictive distributions. Statistica Sinica, 359–380.
- Gronau QF, Singmann H, & Wagenmakers E-J (2017). bridgesampling: An R package for estimating normalizing constants. arXiv preprint arXiv:1710.08162.
- Liang F, Paulo R, Molina G, Clyde MA, & Berger JO (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103(481), 410–423.
- Meng X-L, & Wong WH (1996). Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 831–860.
- Stan Development Team (2018). RStan: the R interface to Stan. R package version 2.18.2.
- Verhagen J, Levy R, Millsap RE, & Fox J-P (2016). Evaluating evidence for invariant items: A Bayes factor applied to testing measurement invariance in IRT models. Journal of Mathematical Psychology, 72, 171–182.
