Abstract
As a method to derive a “purified” measure along a dimension of interest from response data that are potentially multidimensional in nature, the projective item response theory (PIRT) approach requires first fitting a multidimensional item response theory (MIRT) model to the data before projecting onto a dimension of interest. This study aims to explore how accurate the PIRT results are when the estimated MIRT model is misspecified. Specifically, we focus on using a (potentially misspecified) two-dimensional (2D)-MIRT for projection because of its advantages, including interpretability, identifiability, and computational stability, over higher dimensional models. Two large simulation studies (I and II) were conducted. Both studies examined whether the fitting of a 2D-MIRT is sufficient to recover the PIRT parameters when multiple nuisance dimensions exist in the test items, which were generated, respectively, under compensatory MIRT and bifactor models. Various factors were manipulated, including sample size, test length, latent factor correlation, and number of nuisance dimensions. The results from simulation studies I and II showed that the PIRT was overall robust to a misspecified 2D-MIRT. Smaller third and fourth simulation studies were done to evaluate recovery of the PIRT model parameters when the correctly specified higher dimensional MIRT or bifactor model was fitted with the response data. In addition, a real data set was used to illustrate the robustness of PIRT.
Keywords: multidimensional item response theory, projective item response theory, robustness, misspecification
The framework of item response theory (IRT) carries certain inherent assumptions, one of the main ones being unidimensionality. However, in the context of educational and psychological testing, it is uncommon for a test to measure only one underlying construct rather than a composite of multiple constructs (Humphreys, 1986; Ozer, 2001). From a practical standpoint, it may be more realistic to say that most tests are multidimensional in the sense that they measure a dominant dimension and potentially one or more weaker, or so-called “nuisance,” dimensions that may or may not be the construct of interest that practitioners and researchers are seeking (Hulin et al., 1983). When some dimensions are identified as nuisance factors, a potential solution would be to ignore them and fit a unidimensional IRT model to the potentially multidimensional response data. However, when a unidimensional IRT model is fit to multidimensional data, both the item and ability estimates represent a combination of the dominant and minor dimensions (Luecht & Miller, 1992). Wang (1985) termed this the “reference composite,” the estimated unidimensional scale that lies within the multidimensional latent space. This approach produces item and ability estimates that are not directly comparable across different test forms purported to measure the same underlying construct. Other problems, such as interactions between dimensions, may further complicate the interpretation of the composite dimension. When the fit of a unidimensional IRT model appears to be problematic (e.g., lack of fit), another potential solution would be to directly fit a multidimensional IRT (MIRT; Reckase, 1997) model of dimensionality q to the response data (Ackerman, 1994; Ackerman et al., 2005). One potential problem with MIRT is that the dominant dimension and the minor, or nuisance, dimensions may not be measured with the same level of precision, especially when q is large.
Ip and Chen (2012) presented a solution that allows test forms that contain multiple minor nuisance dimensions to be directly comparable. The procedure involves first fitting an MIRT model to the data and then projecting the MIRT solution onto a “purified” dimension that is presumed to be the dimension of interest. The purpose of the projection is to remove the contamination caused by these nuisance dimensions. Essentially, one could project the high-dimensional latent space onto the dominant dimension’s latent space, thus creating a locally dependent unidimensional IRT model (Ip, 2010). Projective IRT (PIRT) has been used, for example, to solve nonproportional requirements of the primary and secondary dimensions in educational testing (Ip et al., 2019).
Relationship of MIRT and Locally Dependent IRT
The MIRT model is essentially equivalent to a unidimensional IRT model if one is willing to relax the assumption of local independence (Ip, 2010). This framework is grounded in the theory of empirical indistinguishability (EI): two models are empirically indistinguishable when the first two marginal moments of the observed responses match. Within this framework, it has been established that a compensatory MIRT model is EI from a locally dependent unidimensional IRT model. The probability of responding correctly to an item under the compensatory multidimensional extension of the two-parameter logistic model (Reckase, 1997) is given by
$$P(Y_{ij}=1\mid\boldsymbol{\theta}_i)=\frac{\exp(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i+d_j)}{1+\exp(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i+d_j)}, \tag{1}$$
where $\boldsymbol{\theta}_i=(\theta_{i1},\ldots,\theta_{iq})^{\top}$ is a vector of person latent traits, with $\boldsymbol{\theta}_i\sim N_q(\mathbf{0},\boldsymbol{\Sigma})$, where $\sigma_k^2$ is the variance of the $k$th dimension and $\rho_{kl}$ is the correlation between dimensions $k$ and $l$. The $\mathbf{a}_j=(a_{j1},\ldots,a_{jq})^{\top}$ is a vector of item discrimination parameters for the $j$th item, and $d_j$ is the intercept for the $j$th item. The logit in the model can be expanded into the form,
$$\operatorname{logit}\,P(Y_{ij}=1\mid\boldsymbol{\theta}_i)=a_{j1}\theta_{i1}+a_{j2}\theta_{i2}+\cdots+a_{jq}\theta_{iq}+d_j. \tag{2}$$
There is a rotational indeterminacy in the MIRT, which can be resolved by fixing some loadings on dimensions to zero (McDonald, 1997).
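The compensatory model in Equations 1 and 2 can be sketched numerically. The item values below are hypothetical and serve only to illustrate the form of the response function:

```python
import numpy as np

def mirt_prob(a, d, theta):
    """Compensatory MIRT 2PL (Equation 1): the probability of a correct
    response is the logistic function of the linear logit in Equation 2."""
    a = np.asarray(a, dtype=float)
    theta = np.asarray(theta, dtype=float)
    logit = float(a @ theta) + d  # Equation 2: a_j1*theta_1 + ... + a_jq*theta_q + d_j
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical 2D item with discriminations (1.2, 0.4) and intercept -0.5
p_origin = mirt_prob([1.2, 0.4], -0.5, [0.0, 0.0])  # logistic(-0.5), about .378
```

Because the model is compensatory, a high standing on either dimension can offset a low standing on the other in the logit.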
For this section, we focus on a two-dimensional (2D; dominant dimension, $\theta_1$, and nuisance dimension, $\theta_2$) case. The dimension we are interested in projecting onto is the dominant dimension, $\theta_1$. The item response function of the corresponding locally dependent unidimensional IRT model is given by
$$P(Y_{ij}=1\mid\theta_{i1})=\frac{\exp(a_j^{*}\theta_{i1}+d_j^{*})}{1+\exp(a_j^{*}\theta_{i1}+d_j^{*})}, \tag{3}$$
where $a_j^{*}$ and $d_j^{*}$ are the “projected” item parameters of the locally dependent unidimensional IRT model in Equation 3 and can be computed as follows:
$$a_j^{*}=\frac{a_{j1}+\rho\,(\sigma_2/\sigma_1)\,a_{j2}}{\sqrt{1+k^{2}a_{j2}^{2}\sigma_2^{2}(1-\rho^{2})}}, \tag{4}$$

$$d_j^{*}=\frac{d_j}{\sqrt{1+k^{2}a_{j2}^{2}\sigma_2^{2}(1-\rho^{2})}}, \tag{5}$$
where $\sigma_1$ and $\sigma_2$ are the standard deviations of $\theta_1$ and $\theta_2$, respectively; $\rho$ represents the population correlation between $\theta_1$ and $\theta_2$; $a_{j1}$ and $a_{j2}$ are the discrimination parameters for the $j$th item; and $k$ is the scaling constant of the logistic-normal approximation. The $a_j^{*}$ and $d_j^{*}$ are scalars. The mathematical derivation for MIRT with $q>2$ is detailed in Ip (2010). It was argued that the projected unidimensional IRT model in Equation 3 properly captures the dominant dimension when a unidimensional model is used to represent that dimension in the multidimensional response space (Ip & Chen, 2012). Note that the mathematical derivation of the projected unidimensional IRT model in Equations 3–5 does not require empirical data. In practice, the parameters are not known; estimated parameters from the MIRT, including $\hat{a}_{j1}$, $\hat{a}_{j2}$, and $\hat{\rho}$, are used for projection purposes. For the remainder of the study, we will refer to the locally dependent unidimensional IRT model as the PIRT (Ip & Chen, 2012), and the dimension $\theta_1$ will be considered the primary “targeted” dimension.
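The projection in Equations 4 and 5 can be sketched numerically. This is an illustrative sketch, not the authors' code: the item values are hypothetical, and the scaling constant follows the standard logistic-normal approximation ($k=16\sqrt{3}/(15\pi)\approx 0.588$):

```python
import numpy as np

K = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)  # logistic-normal scaling constant, ~0.588

def project_2d(a1, a2, d, rho, sigma1=1.0, sigma2=1.0):
    """Project 2D-MIRT item parameters onto theta_1 (Equations 4 and 5).
    Marginalizing the nuisance dimension theta_2 given theta_1 shrinks the
    discrimination and intercept by a common scalar denominator."""
    denom = np.sqrt(1.0 + K**2 * a2**2 * sigma2**2 * (1.0 - rho**2))
    a_star = (a1 + rho * (sigma2 / sigma1) * a2) / denom
    d_star = d / denom
    return a_star, d_star

# Hypothetical item: a1 = 1.2, a2 = 0.4, d = -0.5, rho = .5
a_star, d_star = project_2d(1.2, 0.4, -0.5, 0.5)
```

A quick sanity check on this sketch: when an item has no nuisance loading ($a_{j2}=0$), the projection returns the original parameters unchanged.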
Purpose of Study
This study aims to explore how robust the PIRT model is when the estimated MIRT model is misspecified. Misspecification can occur when the true dimensionality of the data does not equal the dimensionality of the model fit to the data. In particular, we conjecture that fitting a two-dimensional (2D)-MIRT model is sufficient for situations in which a relatively strong primary dimension of interest exists. There are several reasons why this investigation is important. If the PIRT is found to be sensitive to a misspecified MIRT, then the whole projective approach would require exact methods to obtain (a) the correct dimensionality q and (b) accurately estimated parameters of the underlying MIRT model. However, if it can be demonstrated that the method is robust to misspecification of the number of dimensions of the underlying MIRT model, the approach is justifiable even when an approximate and/or misspecified MIRT model is used. The 2D-MIRT is especially convenient because it is one of the simplest multidimensional 2PL models that is close to a unidimensional model: it contains fewer model parameters than higher dimensional models, which often translates into more stable parameter estimates. In addition, indeterminacy, identifiability, and computational issues are simpler to resolve in a 2D-MIRT than in higher dimensional models (Luecht et al., 2006; Luecht & Miller, 1992; McDonald, 2000). This study expands on the results presented by Ip and Chen (2012), who briefly investigated the robustness of the PIRT method by generating data from a 3D-MIRT, fitting either a 2D-MIRT or a 3D-MIRT, and then projecting onto $\theta_1$. The results of the projection from the misspecified 2D-MIRT were compared to the results of the projection from the correctly specified 3D-MIRT.
In this article, simulation studies I and II compared PIRT estimated values and the true generating values under a misspecified 2D-MIRT, whereas studies III and IV compared PIRT estimated values and true generating values under the correctly specified models. Simulation design and the results for simulation studies III and IV are presented in the supplementary material.
Simulation Studies
Simulation studies were run using the R software environment (R Core Team, 2017). Calibration and scoring of the datasets were done using the package mirt (Chalmers, 2012). The expectation–maximization (EM; Bock & Aitkin, 1981) algorithm was used as the calibration procedure, with a convergence criterion set at .001. Due to software limitations, the number of quadrature points specified for the 2D-MIRT calibrations in simulation studies I and II was 31 points per dimension. The number of quadrature points specified for the 5D-MIRT calibrations in simulation study III was seven points per dimension. For the bifactor calibrations in simulation study IV, a dimension-reduction EM algorithm was used (Cai et al., 2011). There were zero nonconvergence cases throughout the entirety of the simulation runs. Scale identifiability issues were resolved by fixing the means and standard deviations of the latent distributions to 0 and 1, respectively. Rotational indeterminacy in the 2D-MIRT was resolved by fixing the first item’s $a_2$ to 0. The $\theta$s were estimated using an expected a posteriori (EAP; Embretson & Reise, 2000) estimator. The number of quadrature points used for the EAP estimator in simulation studies I and II was 31 points per dimension; in simulation studies III and IV, it was 12 points per dimension. Mean absolute difference (MAD) and correlation ($r$) metrics were used to evaluate recovery between the true and estimated item and ability parameters for the PIRT. Generated item parameters used for both the MIRT and bifactor model simulation studies are located in the online supporting information.
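The two recovery metrics used throughout can be computed straightforwardly; a minimal sketch with arbitrary illustrative values:

```python
import numpy as np

def recovery_metrics(true_vals, est_vals):
    """Mean absolute difference (MAD) and Pearson correlation (r)
    between true and estimated parameter vectors."""
    t = np.asarray(true_vals, dtype=float)
    e = np.asarray(est_vals, dtype=float)
    mad = float(np.mean(np.abs(t - e)))      # average absolute recovery error
    r = float(np.corrcoef(t, e)[0, 1])       # linear agreement, scale-free
    return mad, r

mad, r = recovery_metrics([1.0, 0.5, -0.3], [1.1, 0.4, -0.35])
```

MAD captures absolute recovery error on the parameter's own scale, while $r$ captures rank and linear agreement regardless of scale, which is why the two metrics can diverge (e.g., high $r$ with a biased absolute level).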
Simulation Study I
For the simulation study presented in this section, data were generated from a 2D-, 3D-, 4D-, or 5D-MIRT. Because the 2D-MIRT will be the model of choice for calibration, it is considered a correctly specified model when the generating MIRT was 2D and misspecified when the generating MIRT was 3D, 4D, or 5D. Thus, the PIRT based on the generating 2D-MIRT will be used as the model of reference when assessing parameter recovery. After the estimated item parameters were obtained from the calibration, the projection formulation was implemented using the generating 2D-, 3D-, 4D-, or 5D-MIRT to obtain the true $a^{*}$, $d^{*}$, and $\theta_1$, and then the fitted 2D-MIRT was used to obtain the estimated $\hat{a}^{*}$, $\hat{d}^{*}$, and $\hat{\theta}_1$ (refer to Equations 3–5). MAD and correlation between the estimated and true parameters were used to assess performance. For identification, some a-parameters were fixed at zero in both the generating and estimated models. For example, in the 5D-MIRT, the first item’s $a_2$ through $a_5$ were fixed to 0, the second item’s $a_3$ through $a_5$ were fixed to 0, the third item’s $a_4$ and $a_5$ were fixed to 0, and the fourth item’s $a_5$ was fixed to 0. Note that no estimation was required for the 3D-, 4D-, and 5D-MIRT. Samples at each of three sizes, ranging from 500 to 2,000 examinees, were randomly generated from a multivariate normal distribution with µs set at 0 and a covariance matrix in which the interdimension correlation $\rho$ was fixed at either .3, .5, or .7. A test length of 30 or 60 items was generated for this study. The item parameters $\mathbf{a}$ and $d$ were generated such that the multidimensional discrimination (MDISC) value for each item ranged from 1 to 1.8 and the multidimensional difficulty (MDIFF) ranged from −3 to 3. We also examined the impact on parameter recovery when the percentage of items primarily measuring $\theta_1$ decreased. This means the generated tests consisted of either 70% items primarily measuring $\theta_1$ and 30% items primarily measuring the nuisance dimensions, or a 50%/50% split, where an item’s primary dimension is determined by the angle between its direction of best measurement and the $\theta_1$ axis (see the online supplement). There was a total of $4 \times 3 \times 3 \times 2 \times 2 = 144$ conditions, with each condition replicated 100 times.
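The generation step for one condition of this design can be sketched as follows. The specific dimension count, sample size, and parameter draws here are illustrative placeholders, not the exact study values (in particular, the study constrained MDISC and MDIFF ranges rather than drawing loadings uniformly):

```python
import numpy as np

rng = np.random.default_rng(2024)
n_persons, n_items, q, rho = 500, 30, 3, 0.5  # one illustrative condition

# Latent traits: multivariate normal with zero means, unit variances,
# and a common correlation rho between all pairs of dimensions
cov = np.full((q, q), rho)
np.fill_diagonal(cov, 1.0)
theta = rng.multivariate_normal(np.zeros(q), cov, size=n_persons)

# Illustrative item parameters (placeholder uniform draws)
a = rng.uniform(0.2, 1.2, size=(n_items, q))
d = rng.uniform(-2.0, 2.0, size=n_items)

# Dichotomous responses under the compensatory MIRT of Equation 1
prob = 1.0 / (1.0 + np.exp(-(theta @ a.T + d)))
responses = (rng.uniform(size=prob.shape) < prob).astype(int)
```

In the study itself, the resulting response matrices were calibrated with a 2D-MIRT, and projected parameters from the fitted model were compared against the true projected values.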
A summary of the simulation design is presented in Table S1, and correlation results are located in Figures S2–S4 in the supplementary material. For clarity, some factor levels were excluded from Figures 1–3 and Figures S2–S4 because the results were negligibly different from the conditions shown. The results are based on the average MAD and $r$ with corresponding 95% confidence intervals across the 100 replications per condition.
Figure 1.
Simulation study I: Average MAD results for $\hat{a}^{*}$ with corresponding 95% confidence intervals across different underlying MIRT models.
Note. The estimated model for projective IRT is the 2D-MIRT. MAD = mean absolute difference; 2D-MIRT = two-dimensional multidimensional IRT; IRT = item response theory; $N$ = sample size. The Nuisance factor represents the percentage of items primarily loading on the nuisance dimensions.
Figure 2.
Simulation study I: Average MAD results for $\hat{d}^{*}$ with corresponding 95% confidence intervals across different underlying MIRT models.
Note. The estimated model for projective IRT is the 2D-MIRT. MAD = mean absolute difference; 2D-MIRT = two-dimensional multidimensional IRT; IRT = item response theory; $N$ = sample size. The Nuisance factor represents the percentage of items primarily loading on the nuisance dimensions.
Figure 3.
Simulation study I: Average MAD results for $\hat{\theta}_1$ with corresponding 95% confidence intervals across different underlying MIRT or bifactor models.
Note. The estimated model for projective IRT is the 2D-MIRT. MAD = mean absolute difference; 2D-MIRT = two-dimensional multidimensional IRT; IRT = item response theory; $N$ = sample size. The Nuisance factor represents the percentage of items primarily loading on the nuisance dimensions.
Results from simulation study I
The MAD and correlation results between the true and estimated $a^{*}$ are located in Figure 1 and Figure S2, respectively. Overall, an increase in sample size decreased the MAD results and increased the $r$ results for $\hat{a}^{*}$, while an increase in test length showed minimal impact on both. With a sample size of 500, the MAD results for $\hat{a}^{*}$ decreased and the $r$ results increased as the percentage of items primarily measuring the nuisance dimensions increased. With larger sample sizes, the MAD and $r$ results for $\hat{a}^{*}$ were consistent regardless of the percentage of items primarily measuring the nuisance dimensions. Overall, findings suggest that $a^{*}$ was recovered well under a misspecified 2D-MIRT.
The MAD and correlation results between the true and estimated $d^{*}$ are located in Figure 2 and Figure S3, respectively. An increase in sample size decreased the MAD results for $\hat{d}^{*}$, while increases in test length and in the latent correlation showed minimal impact. The $r$ between true and estimated $d^{*}$ was close to 1 for all simulation conditions. Overall, findings suggest that $d^{*}$ was recovered very well under a misspecified 2D-MIRT.
The MAD and correlation results between the true and estimated $\theta_1$ are located in Figure 3 and Figure S4, respectively. Higher correlation between latent variables, in combination with increased dimensionality of the generating MIRT, slightly improved recovery of $\theta_1$. An increase in sample size showed minimal impact on the MAD and correlation results for $\hat{\theta}_1$, while an increase in test length decreased the MAD results and increased the $r$ results for $\hat{\theta}_1$. The MAD results for $\hat{\theta}_1$ decreased and the $r$ results increased as the percentage of items primarily measuring the nuisance dimensions decreased. This suggests that as that percentage decreased, there was more statistical (or item) information along the primary dimension, $\theta_1$, thus improving recovery of the model parameter. Overall, findings suggest that $\theta_1$ was recovered well under a misspecified 2D-MIRT.
Simulation Study II
For the simulation study presented in this section, data were either generated from a 2D-MIRT or a 3D-, 4D-, or 5D-bifactor model. Because the 2D-MIRT will be the model of choice for calibration, it is considered a correctly specified model when the generating MIRT is 2D and misspecified when the generating bifactor model is 3D, 4D, or 5D. Thus, the PIRT based on the generating 2D-MIRT will be used as the model of reference when assessing parameter recovery.
After the estimated item parameters were obtained from the calibration, the projection formulation was implemented using the generating 2D-MIRT or 3D-, 4D-, or 5D-bifactor model to obtain the true $a^{*}$, $d^{*}$, and $\theta_1$, and then the fitted 2D-MIRT to obtain the estimated $\hat{a}^{*}$, $\hat{d}^{*}$, and $\hat{\theta}_1$ (refer to Equations 3–5). Samples at each of three sizes, ranging from 500 to 2,000 examinees, were randomly generated from a multivariate normal distribution with µs set at 0 and the covariance matrix fixed at an identity matrix. As the bifactor model assumes that the secondary latent traits are orthogonal to the primary latent trait (Reise et al., 2010), $\rho$ was fixed at 0. A test length of 30 or 60 items was generated for this study. The method for generating item parameters $\mathbf{a}$ and $d$ is the same as in simulation study I. As in simulation study I, we also examined the impact on parameter recovery when the percentage of items primarily measuring $\theta_1$ decreased. This means the generated tests consisted of either 70% items primarily measuring $\theta_1$ and 30% items primarily measuring the nuisance dimensions, or a 50%/50% split. There was a total of $4 \times 3 \times 2 \times 2 = 48$ conditions, with each condition replicated 100 times. A summary of the simulation design is presented in Table S2, and correlation results are located in Figures S5–S7 in the online supplementary material. The results are based on the average MAD and $r$ with corresponding 95% confidence intervals across the 100 replications per condition.
Results from simulation study II
The MAD and correlation results between the true and estimated $a^{*}$ are located in Figure 4 and Figure S5, respectively. An increase in sample size decreased the MAD results and increased the $r$ results for $\hat{a}^{*}$, while an increase in test length showed minimal impact on both. When the generating model was a 2D-MIRT, 4D-bifactor, or 5D-bifactor, the MAD results for $\hat{a}^{*}$ decreased, and the $r$ results were minimally impacted, as the number of items primarily measuring the nuisance dimensions decreased. When the generating model was a 3D-bifactor, the MAD results decreased and the $r$ results increased as the number of items primarily measuring the nuisance dimensions increased. Overall, findings suggest that $a^{*}$ was recovered well using a misspecified 2D-MIRT when 50% of the items primarily measured the nuisance dimensions. When 30% of the items primarily measured the nuisance dimensions, data generated by the 3D-bifactor model were more problematic.
Figure 4.
Simulation study II: Average MAD results for $\hat{a}^{*}$ with corresponding 95% confidence intervals across different underlying MIRT models.
Note. The estimated model for projective IRT is the 2D-MIRT. MAD = mean absolute difference; 2D-MIRT = two-dimensional multidimensional IRT; IRT = item response theory; $N$ = sample size. The Nuisance factor represents the percentage of items primarily loading on the nuisance dimensions.
The MAD and correlation results between the true and estimated $d^{*}$ are located in Figure 5 and Figure S6, respectively. An increase in sample size decreased the MAD results for $\hat{d}^{*}$, while an increase in test length showed minimal impact. With a sample size of 500, the MAD results for $\hat{d}^{*}$ improved slightly as the percentage of items primarily measuring the nuisance dimensions increased. When the sample size increased to 2,000, increasing that percentage produced a minimal decrease in the MAD results for $\hat{d}^{*}$. The $r$ between true and estimated $d^{*}$ was close to 1 for all simulation conditions. Overall, findings suggest that $d^{*}$ was recovered well under a misspecified 2D-MIRT.
Figure 5.
Simulation study II: Average MAD results for $\hat{d}^{*}$ with corresponding 95% confidence intervals across different underlying MIRT models.
Note. The estimated model for projective IRT is the 2D-MIRT. MAD = mean absolute difference; 2D-MIRT = two-dimensional multidimensional IRT; IRT = item response theory; $N$ = sample size. The Nuisance factor represents the percentage of items primarily loading on the nuisance dimensions.
The MAD and correlation results between the true and estimated $\theta_1$ are located in Figure 6 and Figure S7, respectively. An increase in sample size showed minimal impact on the MAD and $r$ results for $\hat{\theta}_1$, while an increase in test length decreased the MAD results and increased the $r$ results for $\hat{\theta}_1$. As test length increased to 60 items, the MAD results between the correctly and incorrectly specified 2D-MIRTs became more dissimilar. As the number of items primarily measuring the nuisance dimensions decreased, the MAD results for $\hat{\theta}_1$ decreased and the $r$ results increased. This suggests that as that percentage decreased, there was more statistical (or item) information along the primary dimension, $\theta_1$, thus improving recovery of the model parameter. Overall, findings suggest that $\theta_1$ was recovered well under a misspecified 2D-MIRT. In addition, significance tests and effect sizes for the MAD results are reported in Tables S3–S5 of the supplementary material.
Figure 6.
Simulation study II: Average MAD results for $\hat{\theta}_1$ with corresponding 95% confidence intervals across different underlying MIRT or bifactor models.
Note. The estimated model for projective IRT is the 2D-MIRT. MAD = mean absolute difference; 2D-MIRT = two-dimensional multidimensional IRT; IRT = item response theory; $N$ = sample size. The Nuisance factor represents the percentage of items primarily loading on the nuisance dimensions.
ACT Test Data Example
Responses to 60 dichotomously scored ACT mathematics multiple-choice items from 4,000 students were considered. Confirmatory 2D-MIRT and 3D-MIRT models were both used to illustrate the PIRT. In the 3D-MIRT analysis, 30 items were identified as either “pure math” (i.e., 10 “anchor items” that loaded only on $\theta_1$), “math and verbal” (i.e., 10 that loaded on both $\theta_1$ and $\theta_2$), or “math and spatial” (i.e., 10 that loaded on both $\theta_1$ and $\theta_3$). In the 2D-MIRT analysis, the same set of 10 anchor items identified as “pure math” in the 3D model were also identified as “pure math” (i.e., 10 loaded only on $\theta_1$), and those identified as “math and verbal” and “math and spatial” were instead identified as “math and nuisance” (i.e., 20 loaded on both $\theta_1$ and $\theta_2$). The MAD and $r$ were computed between the PIRT from the 2D-MIRT and that from the 3D-MIRT. Using superscripts 2D and 3D to indicate, respectively, estimates from the projected 2D and 3D models, we found that the MAD and $r$ between $\hat{a}^{*\mathrm{2D}}$ and $\hat{a}^{*\mathrm{3D}}$ were .04 and .99, respectively. These values were .02 and .99, respectively, between $\hat{d}^{*\mathrm{2D}}$ and $\hat{d}^{*\mathrm{3D}}$, and .01 and .99, respectively, between $\hat{\theta}_1^{\mathrm{2D}}$ and $\hat{\theta}_1^{\mathrm{3D}}$. The results from the empirical example show that even when two different multidimensional models are fitted to the response data, the PIRT produces very similar item and ability parameter estimates. This empirical example further illustrates the robustness of the PIRT across different multidimensional models. Detailed item parameter estimates are located in Tables S8 and S9 in the supplementary material.
Discussion
The PIRT approach requires first fitting a MIRT to the data before projecting onto a dimension of interest. This study aimed to explore how robust the PIRT results are when a possibly misspecified 2D-MIRT is applied to data that may contain multiple nuisance dimensions. Four simulation studies were implemented in this research. Results from comparing the 2D projected values with the true projected values (Simulations I and II), as well as comparing the 2D projected values with the estimated projected values from the correctly specified models (Simulations III and IV in the online supplement), are presented. The results from simulation study I showed that the PIRT is generally robust to a misspecified 2D-MIRT used for projection. Study II showed that, with one exception, the PIRT was robust to a misspecified 2D-MIRT when the underlying model is bifactor; the exception was when the underlying model was a 3D-bifactor. However, as the number of items primarily measuring the nuisance dimensions increased to 50%, the results showed improvement. We conjecture this may be because the 3D-bifactor begins to behave more like a 2D-MIRT when the percentage of items primarily measuring the nuisance dimensions increases. While their scope was limited, simulation studies III and IV provided some preliminary evidence that when a 2D model is used for projection, the results are similar to those obtained when the correctly specified model is used for projection.
Some caveats are warranted. The simulation studies appeared to support our original statement that there exist confounding issues of identifiability and computational instability when the underlying generating model is of high dimension. Specifically, identification issues become more complex and computational issues begin to arise as dimensionality increases. We included confirmatory models (Simulation Studies II, III, and IV) to isolate the possible effects of identification versus computational problems. For example, while results from simulations III and IV demonstrated that recovery of the PIRT model parameters was comparable between the misspecified 2D-MIRT and the correctly specified 5D models, computational instability, especially in the case of exploratory MIRT, appeared to affect the accuracy of the a-parameter estimates. For the ability estimates, the absolute value of $\hat{\theta}_1$ tended not to be well recovered, although the correlations between the estimated and true values were extremely high.
In the context of PIRT estimation, it is important to note that accurate recovery of the item parameters $a^{*}$ and $d^{*}$ is a function of the precision of the MIRT model parameters when a correctly specified model is used. MIRT model parameters were not always well recovered: for example, in simulation study II when the underlying model was a 3D-bifactor, and in simulation study III when the correctly specified 5D-MIRT was fitted. Increasing the number of quadrature points per dimension could improve results for simulation study III; however, this would significantly increase computation time. It is also important to note that the mirt package allows only up to a total of 20,000 quadrature points across all dimensions when running calibrations. Thus, we were not able to increase the number of points per dimension, since $7^5 = 16{,}807 < 20{,}000$, whereas an increase to eight points per dimension results in $8^5 = 32{,}768 > 20{,}000$. However, this quadrature point limit is not imposed when using the EAP scoring algorithm. Other algorithms, such as the Metropolis–Hastings Robbins–Monro (Cai, 2010a, 2010b), would be more efficient for higher dimensional models; for consistency, however, we used the same estimation algorithm throughout. Future research could expand the conditions in studies III and IV and explore the robustness of the PIRT when different algorithms are used to estimate both the low- and high-dimensional MIRT models.
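The quadrature-grid arithmetic behind this constraint can be checked directly; the total grid size grows as (points per dimension) raised to the number of dimensions:

```python
# mirt's EM calibration caps the total quadrature grid at 20,000 points;
# for a 5D model the grid size is (points per dimension) ** 5.
limit = 20_000
total_7pts_5d = 7 ** 5   # within the limit
total_8pts_5d = 8 ** 5   # exceeds the limit
```

This exponential growth is exactly why dimension-reduction EM (for bifactor structures) and stochastic algorithms such as Metropolis–Hastings Robbins–Monro become attractive as dimensionality increases.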
One inherent problem with MIRT estimation is that when the sample size is not large, parameter estimates for weak dimensions (i.e., dimensions on which all items have small loadings) are unstable, which can easily create challenges for the computation of parameters and for model interpretation. Our solution was to allow some of the items to measure the nuisance dimensions ($\theta_2, \ldots, \theta_q$) more strongly. However, doing so can impact recovery of the primary dimension (assuming it is $\theta_1$), as a direct result of there being less statistical (or item) information along the primary dimension. The key research question in this article was the sensitivity of the projected model to misspecification of the order (dimensionality) of the underlying MIRT. Another potential way to address the issue of parameter recovery in the MIRT is to include a limited number of simple-structure items in the test to bring stability to estimation (Babcock, 2011; Chalmers & Flora, 2014; Han & Paek, 2014; Yao, 2012; Zhang, 2012).
In summary, the simple 2D-MIRT should be sufficient for projection purposes when a primary dimension exists. For most practical applications, the 2D-based PIRT appears robust to the presence of weaker dimensions, even when they are numerous. In high-dimensional response data, the accuracy of an exploratory calibration could depend on how well the primary dimension is identified. For confirmatory bifactor structures, the 2D-MIRT works well for up to at least five dimensions when the sample size is adequate (e.g., 2,000). Another practical use of the results from this study is to apply a 2D-MIRT-based projection to equate two test forms that both measure the same dimension of interest but may contain mixtures of different nuisance dimensions. If comparison across different forms is desired for a specific target dimension, one can apply the PIRT to multiple forms. Another potential application of the PIRT is the problem of construct shift in vertical scaling, that is, placing tests from a series of different grade levels on a common scale (Martineau, 2006). When multiple dimensions are present across grades, the PIRT could be a useful tool for providing a common unidimensional scale.
Supplemental Material
Supplemental material, supplementary_material for Robustness of Projective IRT to Misspecification of the Underlying Multidimensional Model by Tyler Strachan, Edward Ip, Yanyan Fu, Terry Ackerman, Shyh-Huei Chen and John Willse in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This material is based upon work partially supported by the Institute of Education Sciences (Grant Nos R305D150051; PI: Ip, Ackerman). Any opinions, findings, and conclusions or recommendations are those of the authors and do not necessarily reflect the views of the funding agencies.
ORCID iD: Tyler Strachan
https://orcid.org/0000-0002-3319-6332
Supplemental Material: Supplemental material for this article is available online.
References
- Ackerman T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Applied Measurement in Education, 7, 255–278. [Google Scholar]
- Ackerman T. A., Gierl M. J., Walker C. M. (2005). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22, 37–51. [Google Scholar]
- Babcock B. (2011). Estimating a noncompensatory IRT model using Metropolis within Gibbs sampling. Applied Psychological Measurement, 35, 317–329. [Google Scholar]
- Bock R. D., Aitkin M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. [Google Scholar]
- Cai L. (2010a). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75(1), 33–57.
- Cai L. (2010b). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35(3), 307–335.
- Cai L., Yang J. S., Hansen M. (2011). Generalized full-information item bifactor analysis. Psychological Methods, 16, 221–248.
- Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29.
- Chalmers R. P., Flora D. B. (2014). Maximum-likelihood estimation of noncompensatory IRT models. Applied Psychological Measurement, 38, 339–358.
- Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum.
- Han K. C. T., Paek I. (2014). A review of commercial software packages for multidimensional IRT modeling. Applied Psychological Measurement, 38(6), 486–498.
- Hulin C. L., Drasgow F., Parsons C. K. (1983). Item response theory: Application to psychological measurement. Dow Jones-Irwin.
- Humphreys L. G. (1986). An analysis and evaluation of test and item bias in the predictive context. Journal of Applied Psychology, 71, 327–333.
- Ip E. H. (2010). Empirically indistinguishable multidimensional IRT and locally dependent unidimensional item response models. British Journal of Mathematical and Statistical Psychology, 63, 395–416.
- Ip E. H., Chen S. H. (2012). Projective item response model for test-independent measurement. Applied Psychological Measurement, 36, 581–601.
- Ip E. H., Strachan T., Fu Y., Chen S., Rutkowski L., Lay A., Willse J., Ackerman T. (2019). Bias and bias correction method for non-proportional abilities requirement (NPAR) tests. Journal of Educational Measurement, 56, 147–168.
- Luecht R. M., Gierl M. J., Tan X., Huff K. (2006, April). Scalability and the development of useful diagnostic scales [Paper presentation]. The Annual Meeting of the National Council on Measurement in Education, San Francisco, CA, United States.
- Luecht R. M., Miller T. R. (1992). Unidimensional calibrations and interpretations of composite traits for multidimensional tests. Applied Psychological Measurement, 16, 279–293.
- Martineau J. A. (2006). Distorting value added: The use of longitudinal, vertically scaled student achievement data for growth-based, value-added accountability. Journal of Educational and Behavioral Statistics, 31, 35–62.
- McDonald R. P. (1997). Normal-ogive multidimensional model. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 257–269). Springer.
- McDonald R. P. (2000). A basis for multidimensional item response theory. Applied Psychological Measurement, 24, 99–114.
- Ozer D. (2001). Four principles of personality assessment. In Pervin L. A., John O. P. (Eds.), Handbook of personality: Theory and research (2nd ed., pp. 671–688). Guilford Press.
- Reckase M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 271–286). Springer.
- Reise S. P., Moore T. M., Haviland M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92, 544–559.
- Wang M. M. (1985). Fitting a unidimensional model to multidimensional item response data: The effect of latent trait misspecification on the application of IRT (Research Report MW: 6-24-85). University of Iowa.
- Yao L. (2012). Multidimensional CAT item selection methods for domain scores and composite scores: Theory and applications. Psychometrika, 77, 495–523.
- Zhang J. (2012). Calibration of response data using MIRT models with simple and mixed structures. Applied Psychological Measurement, 36, 375–398.