Abstract
Although item response theory (IRT) models such as the bifactor, two-tier, and between-item-dimensionality IRT models have been devised to confirm complex dimensional structures in educational and psychological data, they can be challenging to use in practice. The reason is that these models are multidimensional IRT (MIRT) models and thus are highly parameterized, making them only suitable for data provided by large samples. Unfortunately, many educational and psychological studies are conducted on a small scale, leaving the researchers without the necessary MIRT models to confirm the hypothesized structures in their data. To address the lack of modeling options for these researchers, we present a general Bayesian MIRT model based on adaptive informative priors. Simulations demonstrated that our MIRT model could be used to confirm a two-tier structure (with two general and six specific dimensions), a bifactor structure (with one general and six specific dimensions), and a between-item six-dimensional structure in rating scale data representing sample sizes as small as 100. Although our goal was to provide a general MIRT model suitable for smaller samples, the simulations further revealed that our model was applicable to larger samples. We also analyzed real data from 121 individuals to illustrate that the findings of our simulations are relevant to real situations.
Keywords: Bayesian IRT, multidimensional IRT, nested-dimensionality structures, nonnested-dimensionality structures
Rating instruments used to gather data in educational and psychological studies are often designed to measure various aspects of latent traits, resulting in complex dimensional structures being represented in the data. Accordingly, item response theory (IRT) models emerged that were designed to confirm these complex structures in data. For instance, the between-item-dimensionality IRT model (Adams, Wilson, & Wang, 1997) was developed to verify whether the data reflect multiple correlated dimensions. The bifactor IRT model (Gibbons & Hedeker, 1992) and the two-tier full-information item-factor model (or the two-tier IRT model; Cai, 2010) were devised to confirm nested-dimensionality structures. More specifically, the bifactor IRT model was designed to verify whether secondary (or specific) dimensions are nested within a single primary (or general) dimension, and the two-tier IRT model was devised to confirm whether secondary dimensions are nested within multiple primary dimensions.
As ideal as these models are for confirming whether the nuanced aspects of latent traits are represented in item response data, they may be impractical in real settings. These models are special cases of the more general multidimensional IRT (MIRT) framework (Reckase, 2009; Yao, 2003) and thus are highly parameterized, making them suitable only for data from large samples. Unfortunately, the sample sizes of many studies conducted in education and psychology are fairly small (e.g., ). The data resulting from these studies, therefore, might not be large enough for the models, leaving researchers unable to confirm whether their hypothesized structures are represented in their data.
These researchers, however, still need MIRT models because of the implications a dimensionality analysis has for the internal structural aspect of validity as outlined in Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). When researchers confirm that their hypothesized structures are represented in the data, they demonstrate that the latent traits were measured as intended, thereby providing evidence of support for the internal structural aspect of validity. Such evidence is crucial because it allows others to judge the appropriateness of any conclusions these researchers make about the dimensions of the measured traits.
In this article, we provide researchers conducting small-scale studies with the means to confirm complex dimensional structures in rating scale data. We present a general MIRT model from which specific types of confirmatory MIRT models suitable for small samples can be obtained; these confirmatory models include the two-tier, bifactor, between-item-dimensionality, and unidimensional IRT models. (For this article, we treat the unidimensional IRT model as a special case of the MIRT model.) The reason for providing a general model that leads to various confirmatory MIRT models is so researchers can perform a more comprehensive dimensionality analysis. Often, when confirming a dimensional structure in data, a set of models is used—one that matches the hypothesized structure and others that test alternative dimensional structures. For example, when performing a dimensionality analysis to confirm a two-tier structure, researchers could use a bifactor, a between-item-dimensionality, and a unidimensional IRT model in addition to a two-tier IRT model to test whether a dimensional structure associated with one of the other models is more strongly represented in the data than a two-tier structure. All the models used in the dimensionality analysis, then, must be appropriate for the data. Our general MIRT model provides the means to obtain a set of confirmatory models, all of which are suitable for data representing small samples.
Before proceeding to the technical specifications of our general MIRT model and an explanation of why we take a Bayesian approach, we provide a real example of when a two-tier structure could be represented in rating scale data based on a small sample. We use this example to give context for the usefulness of our model, to further expand on why more than one confirmatory MIRT model could be used in a dimensionality analysis, and to provide a conceptual overview of the two-tier, bifactor, and between-item-dimensionality IRT models to show why these models could be challenging to use with small samples. For the remainder of the article, we focus on rating scale versions of MIRT models, unless otherwise noted, because our example is based on rating scale data.
Real Example
Our example involves data from a study that investigated adolescents’ reading motivation levels (Neugebauer & Fujimoto, 2019). These data are the responses (on a 4-point scale) to a subset of the items in the Motivation for Reading Questionnaire (MRQ; Wigfield & Guthrie, 1997), and 121 students provided the responses. A two-tier structure is expected for the data related to the 29 items used for this example (see Figure 1a for a visualization of the two-tier structure) because these items measure the primary dimensions of intrinsic and extrinsic reading motivation, which are broader aspects of reading motivation, and the secondary dimensions of challenge, curiosity, involvement, competition in reading, recognition, and grades, which are specific aspects of the measured trait. The primary dimensions, which represent the first tier, and the secondary dimensions, which represent the second tier, have a nested structure in that challenge, curiosity, and involvement are types of intrinsic reading motivation, and competition in reading, recognition, and grades are forms of extrinsic reading motivation. This nested structure is accounted for in a two-tier IRT model by assigning each item to discriminate on one primary dimension and one secondary dimension nested within that primary dimension, with the item discrimination estimates indicating the extent to which the specific and broader aspects of reading motivation are represented in the data. An assumption is made about the relationship among the dimensions in a two-tier IRT model when accounting for the nesting of the dimensions. That is, the secondary dimensions are assumed to be orthogonal to each other and the primary dimensions, although the primary dimensions can be correlated with each other (Cai, 2010). Even with this assumption, the two-tier IRT model as it relates to the MRQ data is still quite complex because there are eight dimensions and each item discriminates on two dimensions.
Figure 1. Visualizations of the two-tier, bifactor, six-dimensional, two-dimensional, and unidimensional structures as they relate to the Motivation for Reading Questionnaire data analyzed for this article.
Note. The circles above the items are primary dimensions (i.e., broader dimensions). In the two-tier and bifactor structures, the circles below the items are secondary dimensions (i.e., specific dimensions). Each straight line with an arrow from a dimension to an item represents an item discrimination. An arced line with arrows connecting a pair of dimensions indicates that the dimensions can covary with each other (in the two-dimensional, six-dimensional, and two-tier structures).
As previously noted, other confirmatory MIRT models could be used in a dimensionality analysis to confirm a two-tier structure. One such model is the bifactor IRT model. Conceptually, this model can be viewed as a two-tier IRT model but with only one dimension (rather than multiple dimensions) in the first tier of the two-tier structure (see Figure 1b for a visualization of a bifactor structure). As it relates to the MRQ data, the bifactor IRT model can be used to test whether intrinsic and extrinsic aspects of reading motivation are better represented as one general dimension (i.e., general reading motivation) given the secondary dimensions. The bifactor IRT model is slightly more parsimonious than the two-tier IRT model because the former has one fewer dimension and treats all dimensions as orthogonal to each other. The bifactor IRT model is, nevertheless, quite complex because it includes seven dimensions that have a nested form, leading to each item discriminating on two dimensions, as with the two-tier IRT model.
Between-item-dimensionality IRT models could also be used to test competing structures. These models are based on nonnested dimensionality. That is, the dimensional structures specified in these models are one-tiered and thus include only primary dimensions. Because the dimensions have a nonnested structure, each item discriminates on only one dimension, meaning that the broad and specific aspects of the latent traits are not separated in these models. As it relates to the MRQ data, a between-item six-dimensional IRT model can be used to test whether the specific aspects of reading motivation are nested within broader dimensions or whether they are better represented as correlated primary dimensions (see Figure 1c for a visualization of a six-dimensional structure). A between-item two-dimensional IRT model (see Figure 1d) can be used to test whether any specific dimensions are nested within the broader dimensions of reading motivation. Last, a unidimensional IRT model (see Figure 1e) can be used to determine whether multiple dimensions are even necessary to represent the latent trait space underlying the MRQ data. Between-item-dimensionality IRT models are less complex than nested-dimensionality IRT models because each item discriminates on only one dimension in the former models. However, between-item-dimensionality IRT models allow all dimensions to correlate with each other, such that these models become fairly complex when specified with a large number of dimensions. For instance, a between-item six-dimensional IRT model includes 15 pairs of correlated dimensions, whereas a between-item two-dimensional IRT model includes only one pair of correlated dimensions.
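The relative complexity of these structures can be made concrete with a small counting exercise. The following sketch tallies, for the 29 MRQ items, the number of estimated item discriminations and interdimensional correlations implied by each structure described above. The function and variable names are illustrative only, and intercept parameters are omitted from the counts.

```python
from math import comb

N_ITEMS = 29  # number of MRQ items used in the example


def structure_size(n_primary, n_secondary, n_items=N_ITEMS):
    """Count estimated discriminations and correlations for one structure.

    Each item discriminates on one primary dimension and, when secondary
    dimensions are present, on one secondary dimension as well. Only the
    primary dimensions are allowed to correlate with each other.
    """
    loadings_per_item = 2 if n_secondary > 0 else 1
    return {
        "dimensions": n_primary + n_secondary,
        "discriminations": n_items * loadings_per_item,
        "correlations": comb(n_primary, 2),  # unique pairs of primary dimensions
    }


structures = {
    "two-tier": structure_size(n_primary=2, n_secondary=6),
    "bifactor": structure_size(n_primary=1, n_secondary=6),
    "six-dimensional": structure_size(n_primary=6, n_secondary=0),
    "two-dimensional": structure_size(n_primary=2, n_secondary=0),
    "unidimensional": structure_size(n_primary=1, n_secondary=0),
}

for name, size in structures.items():
    print(name, size)
```

The counts show where each model's complexity comes from: the nested structures double the number of discriminations, whereas the six-dimensional structure adds its complexity through the 15 correlations.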
Sample Size Requirements for Multidimensional Item Response Theory Models
As the conceptual review of MIRT models shows, the two-tier, bifactor, and between-item six-dimensional IRT models are heavily parameterized, especially compared with the two-dimensional and unidimensional IRT models. This complexity could make the former set of models suitable only for data provided by large samples, although it is unclear how large because of the lack of research on the sample size requirements for these models. Nevertheless, inferences can be made based on studies in which the sample size requirement for a simpler confirmatory MIRT model was investigated. Simulation studies showed that a sample size of 500 was sufficient for a between-item three-dimensional IRT model based on the graded response model, although a sample size of 250 was acceptable in limited situations (Forero & Maydeu-Olivares, 2009; Jiang, Wang, & Weiss, 2016). Based on these findings, a reasonable assumption is that more complex MIRT models, such as the two-tier, bifactor, and between-item-dimensionality IRT models with more than three dimensions, should require a sample size larger than 250 and ideally at least 500. Unfortunately, such a sample size requirement does not bode well for educational and psychological researchers conducting small-scale studies who wish to confirm complex dimensional structures in their data.
In the aforementioned simulation studies, frequentist methods (i.e., full-information maximum likelihood and unweighted least squares) were used to estimate the parameters of the three-dimensional graded response model. These methods do not incorporate any prior information other than specifying a distributional form for the latent trait depending on the estimation method. Frequentist methods, then, rely mainly on the data to estimate the item parameters. One way to decrease the sample size requirement for MIRT models could be to combine outside information with the data, which is possible in Bayesian inference by assigning informative (rather than noninformative) prior distributions to the models’ parameters (Gelman, Carlin, Stern, & Rubin, 2013).
Research has demonstrated the value of using informative priors with Bayesian models when working with small samples. One simulation study showed that a sample size of 250 was sufficient to obtain accurate parameter estimates when informative prior distributions were assigned to the parameters of a Bayesian multilevel trifactor IRT model specified with eight orthogonal dimensions (Fujimoto, 2019). Another simulation study found that a sample size of 200 was enough to obtain accurate parameter estimates when informative priors were used with Bayesian multilevel structural equation models specified with four dimensions (Depaoli & Clifton, 2015; Holtmann, Koch, Lochner, & Eid, 2016). These studies involved models with multilevel components typically used to account for clusters of individuals within a sample (e.g., students nested within schools). Thus, the researchers had to consider the number of clusters and the number of individuals within each cluster when selecting the overall sample sizes for their simulations.
For this article, we focus on situations in which the individuals within the samples do not belong to clusters. When the number of clusters is not a concern, obtaining complex Bayesian MIRT models suitable for sample sizes smaller than those for the aforementioned Bayesian multilevel models should be possible, although this avenue has not yet been explored for confirmatory MIRT models intended to verify complex dimensional structures in rating data of a similar size to the MRQ data (i.e., ). For instance, Bayesian versions of the two-tier IRT model (i.e., the two-tier orthogonal model; Fujimoto, 2018), the bifactor IRT model (Fukuhara & Kamata, 2011; Sheng, 2010), and the between-item-dimensionality IRT model (Kuo & Sheng, 2015) have been presented, but these models were not specified or demonstrated to be appropriate for sample sizes as small as those of interest here.
The Aims of Our Study
In this article, we present a general Bayesian MIRT model for rating scale data. From our model, various confirmatory MIRT models (i.e., submodels) suitable for small samples can be obtained, and these submodels include the two-tier, bifactor, between-item-dimensionality, and unidimensional IRT models. Our model differs from Yao’s (2003) general Bayesian MIRT model in that the latter encompasses exploratory and confirmatory IRT models without any consideration for sample size. We devised our general Bayesian MIRT model to provide others with the means to obtain various confirmatory MIRT models suitable for small samples, although these confirmatory models are also applicable to large samples, as we discuss later. For this article, we focus on submodels used to confirm a two-tier structure in data. However, this set of submodels can also be used to confirm bifactor and between-item-dimensionality structures, although these submodels might not all be used in the dimensionality analysis. For instance, when confirming a bifactor structure is of interest, only the bifactor, between-item-dimensionality, and unidimensional IRT models might be used.
Previously, we noted an advantage to providing a general model from which various submodels can be obtained; that is, a set of submodels can be obtained to perform a more comprehensive dimensionality analysis. Providing a general MIRT model has another benefit in that it simplifies the process for using Bayesian two-tier, bifactor, and between-item-dimensionality IRT models to analyze rating data. As previously noted, Bayesian versions of these submodels have been proposed. Those models, however, have different specifications; the submodels are based on different parameterizations and prior distributions. When these submodels are specified within our general MIRT framework, they are brought together under one set of specifications.
The technical details of our general Bayesian MIRT model are provided in the next section. Thereafter, we report the details of the simulation study that was conducted to evaluate our model. Then, we present the results of the analysis of the MRQ data from 121 individuals, which illustrate that our simulation findings are applicable to real situations. The article ends with a discussion and concluding remarks.
Our Bayesian Formulation of the Multidimensional Item Response Theory Model
In this section, the notation and indices are described before the general Bayesian MIRT model for rating scale data is presented. Then, how to specify confirmatory MIRT models (or submodels) within our framework is discussed; these submodels include the two-tier, bifactor, between-item-dimensionality, and unidimensional IRT models.
Some Notations
Let and represent the number of primary and secondary dimensions, respectively, and let denote the total number of dimensions (i.e., ). Let i index the individuals (where , and N denotes the number of individuals). Let j index the items (where , and J denotes the number of items). Individual i’s observed response to item j is represented using , which is also the realized value for the random variable (where , and represents the highest score category of the block of items to which item j belongs; the reasons for blocking the items are provided below).
Regarding distributions, and are used to represent the univariate normal, truncated normal, lognormal, q-variate normal, and degenerate distributions, respectively. The first three of these distributions are parameterized with means and standard deviations (SDs), and the and of the truncated normal distribution represent the lower and upper bounds, respectively, of the truncation range. The q-variate normal distribution is parameterized with a mean vector and variance–covariance matrix. The degenerate distribution assigns a mass of 1 to (where is a value) and a mass of 0 to all other values; assigning this distribution to a parameter, then, fixes the parameter to .
In terms of the model’s parameters that govern the probability of a response to item j by person i, let represent individual i’s vector of latent trait dimensional positions, let represent item j’s vector of item discriminations, let represent item j’s overall intercept, and let represent the set of relative category intercepts for the block of items to which item j belongs (where and is the number of blocks). These parameters as a group are represented by .
Presentation of Our Bayesian MIRT Model
In the formulation of our Bayesian MIRT model, the data are assumed to follow a multinomial distribution:
where is a set of probabilities given . These probabilities are based on a multidimensional version of the generalized partial credit model (Muraki, 1992), leading to the conditional probability of each observed response being represented by
with
for all j. Next, additional details about the model’s parameters are provided, including the prior distributions assigned to them.
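To illustrate how the latent trait positions, item discriminations, overall intercept, and relative category intercepts combine into category probabilities, the following is a minimal sketch that assumes a common slope-intercept form of the multidimensional generalized partial credit model. The sign and indexing conventions, the function name `mgpcm_probs`, and all numerical values are assumptions made for illustration and may differ from the article's equation.

```python
import numpy as np


def mgpcm_probs(theta, a, beta, tau):
    """Category probabilities for one person and one item.

    Assumed form: the log-odds of moving from category k-1 to k is
    a'theta + beta + tau[k-1], and category 0 has a cumulative logit of 0.

    theta : latent trait positions, one per dimension
    a     : item discriminations, one per dimension (0 where not loaded)
    beta  : overall item intercept
    tau   : relative category intercepts of the item's block (sum to zero)
    """
    theta = np.asarray(theta, dtype=float)
    a = np.asarray(a, dtype=float)
    tau = np.asarray(tau, dtype=float)
    steps = a @ theta + beta + tau                 # one term per step 1..K
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    logits -= logits.max()                         # guard against overflow
    expd = np.exp(logits)
    return expd / expd.sum()


# Hypothetical two-tier item: loads on primary dimension 0 and secondary dimension 2.
theta = np.zeros(8)
theta[0], theta[2] = 0.5, -0.2
a = np.zeros(8)
a[0], a[2] = 1.4, 0.8
tau = np.array([0.9, 0.1, -1.0])                   # 4-point scale, sums to zero
probs = mgpcm_probs(theta, a, beta=0.3, tau=tau)
print(probs, probs.sum())
```

Because the discriminations on all other dimensions are zero, only the two dimensions assigned to the item influence its response probabilities.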
Latent Trait Dimensional Positions
The latent trait dimensional positions are assigned a prior distribution of
(1) |
where
(2) |
To set the location and metric of the latent trait scale, the mean vector in Equation 1 is a vector of 0s (i.e., ), and the elements of the main diagonal of are fixed to 1, respectively. Setting the scale in this manner also makes the variance–covariance matrix a correlation matrix.
The correlations of (for dimensions and , where and because the matrix is symmetric along the main diagonal) are distributed as
(3) |
with the sampled values required to lead to a positive semidefinite matrix. For this article, and , making the prior for the estimated correlations minimally informative.
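The requirement that sampled correlations yield a positive semidefinite matrix can be sketched as follows. This is an illustration of drawing from the prior only, not the authors' sampler. The truncation bounds of -1 and 1, the prior mean of 0, the prior SD of 1, and all function names are assumptions chosen for the example.

```python
import numpy as np


def sample_truncated_normal(rng, mean, sd, lower, upper):
    """Draw from a normal distribution truncated to (lower, upper) by rejection."""
    while True:
        x = rng.normal(mean, sd)
        if lower < x < upper:
            return x


def is_psd(matrix, tol=1e-10):
    """Check positive semidefiniteness through the smallest eigenvalue."""
    return np.linalg.eigvalsh(matrix).min() >= -tol


def sample_correlation_matrix(rng, n_dims, mean=0.0, sd=1.0, max_tries=10_000):
    """Sample each unique correlation, then keep the matrix only if it is PSD."""
    for _ in range(max_tries):
        corr = np.eye(n_dims)                      # unit diagonal fixes the metric
        for d in range(n_dims):
            for e in range(d + 1, n_dims):
                corr[d, e] = corr[e, d] = sample_truncated_normal(
                    rng, mean, sd, -1.0, 1.0
                )
        if is_psd(corr):
            return corr
    raise RuntimeError("no positive semidefinite matrix found")


rng = np.random.default_rng(2020)
print(sample_correlation_matrix(rng, n_dims=3))
```

With two dimensions, any correlation strictly between -1 and 1 gives a valid matrix; the check starts to matter with three or more correlated dimensions, where some combinations of individually valid correlations are jointly impossible.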
Item Discriminations
Each element of is item j’s discrimination on dimension d. These elements are assumed to be distributed as
(4) |
The SD of the lognormal distribution in Equation 4 is set to 0.50 (i.e., for all d), and the mean of the distribution is assigned a hyperprior of
(5) |
with , when d represents a primary dimension, and for nested-dimensionality IRT models, , when d represents a secondary dimension.
The prior assigned to the estimated item discriminations is an adaptive informative prior. The prior is informative because the SD of the lognormal distribution is set to 0.50, which leads to the distribution supporting values falling within a narrow range given that the mean of the distribution is less than approximately 0.50. The informative aspect is helpful with small samples, though only if the prior is appropriate for the item discriminations. Unfortunately, a typical informative prior may be appropriate for one condition but not for another. For instance, a lognormal distribution with a mean of (given that the SD is 0.50) would mainly support values that range from to , which would be more appropriate for items with a weaker discriminatory power, and a lognormal distribution with a mean of would mainly support values that range from to , which would be more appropriate for items with a stronger discriminatory power. In practice, the discriminatory power of the items is not known; thus, the value to which the mean of the lognormal distribution should be set cannot be known in advance. This issue is not a concern for our model due to the hyperprior assigned to the mean of the lognormal distribution. The hyperprior allows the lognormal distribution to adapt such that the distribution automatically adjusts to support the range of values most appropriate for the items’ discriminatory power.
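The adaptive behavior can be illustrated with the conditional distribution of the lognormal mean given a set of item discriminations. The sketch below assumes that the hyperprior is a normal distribution, in which case this conditional distribution is available in closed form. The hyperprior mean of 0 and SD of 1, the function name, and the discrimination values are hypothetical and are not taken from the article.

```python
import math


def lognormal_mean_conditional(discriminations, log_sd=0.50,
                               hyper_mean=0.0, hyper_sd=1.0):
    """Conditional mean and SD of the lognormal mean given the discriminations.

    Assumes log(a_jd) ~ Normal(mu_d, log_sd) and mu_d ~ Normal(hyper_mean, hyper_sd),
    which gives a conjugate normal update for mu_d.
    """
    logs = [math.log(a) for a in discriminations]
    prior_precision = 1.0 / hyper_sd ** 2
    data_precision = len(logs) / log_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (hyper_mean * prior_precision + sum(logs) / log_sd ** 2)
    return post_mean, math.sqrt(post_var)


weak_items = [0.5, 0.6, 0.4, 0.7, 0.5]      # hypothetical weakly discriminating items
strong_items = [1.8, 2.2, 2.0, 1.6, 2.4]    # hypothetical strongly discriminating items

print(lognormal_mean_conditional(weak_items))
print(lognormal_mean_conditional(strong_items))
```

The same hyperprior leads to a negative center for the weak items and a positive center for the strong items, which is the sense in which the prior adapts to the items' discriminatory power.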
Although the hyperprior assigned to the mean of the lognormal distribution allows the distribution to adapt, our specifications for the hyperprior (i.e., and in Equation 5) limit the lognormal distribution to adapt only within a reasonable range, further facilitating the estimation process when working with small samples. Our values for lead to a low probability of sampling values as large as 2.00 for the mean of the lognormal distribution (i.e., ). Such a restriction is reasonable because a mean of 2.00 for the lognormal distribution results in the distribution supporting values that range from to for the item discriminations, and an item discrimination estimate of 12.00 is highly unlikely given that the variances of the latent trait dimensional positions are fixed to 1 in our model (see Equation 2). Our values for not only set bounds to the adaptive aspect of our prior for the estimated item discriminations but also reflect the conditions for which nested-dimensionality IRT models are most appropriate. These models are for data in which the primary dimensions are more strongly represented relative to the secondary dimensions (DeMars, 2013; Reise, 2012; Toland, Sulis, Giambona, Porcu, & Campbell, 2017), and our choice of values reflects this because the mean of the hyperprior related to primary dimensions (i.e., for ) is greater than that related to secondary dimensions (i.e., for ). These specifications serve only as starting points because these values are for the hyperprior, and the mean of the lognormal distribution related to each dimension is estimated.
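The range of item discriminations supported by the lognormal distribution at a given mean can be checked numerically. The sketch below assumes that the mean and SD are on the log scale and that the supported range spans one log-scale SD on either side of the mean. The width of the range is an assumption; with a mean of 2.00, it gives an upper end of about 12.18, which is in line with the value of 12.00 discussed above.

```python
from math import exp


def lognormal_support(log_mean, log_sd=0.50, n_sd=1.0):
    """Range exp(log_mean -/+ n_sd * log_sd) on the discrimination scale.

    n_sd = 1 is an assumed convention for the values "mainly supported".
    """
    return exp(log_mean - n_sd * log_sd), exp(log_mean + n_sd * log_sd)


for log_mean in (0.0, 1.0, 2.0):   # illustrative means on the log scale
    lower, upper = lognormal_support(log_mean)
    print(f"mean {log_mean:.2f}: {lower:.2f} to {upper:.2f}")
```

Widening the range (e.g., `n_sd=2.0`) shows how quickly the upper end grows with the mean, which is why bounding the adaptive range of the mean is useful with small samples.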
Overall Item and Relative Category Intercepts
In our model, the item intercepts are decomposed into overall () and relative category () intercepts so that the items can be blocked to share a set of relative category intercepts (e.g., Muraki, 1992), which may be necessary with rating scale data from small samples because some categories may be infrequently used or not used at all (i.e., null categories). When the sample size is large enough that each item can be parameterized with its own set of relative category intercepts, the current notation can still be used, or the notation can be simplified to . Minimally informative priors can be assigned to the overall and relative category intercepts because these parameters are less sensitive to sample size. The overall item intercepts are assigned the prior of
with the SD set to 5 (i.e., ), and the mean of the distribution is assigned the hyperprior of
with . For the relative category intercepts, only the first of them are assigned a prior distribution because the last intercept of each block is constrained to be the negative of the sum of all preceding intercepts, . The first relative intercepts are assigned the prior of
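The constraint on the last relative category intercept can be sketched as follows. The function name and the numerical values are hypothetical; the example assumes a 4-point scale with three relative category intercepts per block, of which the first two are free.

```python
def complete_relative_intercepts(free_intercepts):
    """Append the constrained last intercept: the negative sum of the free ones."""
    free = list(free_intercepts)
    return free + [-sum(free)]


block_intercepts = complete_relative_intercepts([1.2, -0.3])
print(block_intercepts)
```

Because the relative category intercepts of a block sum to zero, the overall item intercept carries the item's general location, and the relative intercepts describe only how the categories are spaced around it.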
where is a mean vector of 0s (i.e., ), and is a diagonal matrix with variances of 100 (i.e., , with representing an identity matrix).
Obtaining Submodels
Our general Bayesian MIRT model has been presented so that confirmatory MIRT models (or submodels), such as the two-tier, bifactor, between-item-dimensionality, and unidimensional IRT models, can easily be specified within our framework. These submodels are obtained simply by assigning the priors to the elements of the vector of item discriminations and the correlations of the variance–covariance matrix in different ways. For the sake of discussion, the submodels are specified such that the first elements of and correspond to primary dimensions and the remaining elements of these vectors (if ) correspond to secondary dimensions. Next, the characteristics of the submodels and how to specify them within our framework are described in detail (see Supplemental Material A, available online, for a summary of these models’ characteristics).
The Two-Tier IRT Model
As previously indicated, the two-tier IRT model includes multiple dimensions in each of the tiers of the dimensional structure (i.e., and ), with the secondary dimensions treated as orthogonal to each other and to the primary dimensions, but the primary dimensions can be correlated with each other given that there are no model identification issues (Cai, 2010). The relationships among the dimensions within and across tiers can be readily observed by reexpressing the variance–covariance (or correlation) matrix of the prior distribution assigned to the dimensional positions (i.e., in Equation 1) in terms of two variance–covariance submatrices:
(6) |
The variance–covariance submatrix (or correlation submatrix because the main diagonal elements are fixed to 1) for the primary dimensions () is a matrix, and all unique correlations of this submatrix are estimated. The variance–covariance submatrix for the secondary dimensions () is a matrix, with all correlations set to 0, making it an identity matrix. The in the equation indicates that each correlation corresponding to a primary and a secondary dimension is 0.
The full variance–covariance matrix in Equation 2 becomes the reexpressed version in Equation 6 by assigning the priors to the correlations of the full variance–covariance matrix as follows. The truncated normal distribution in Equation 3 is assigned to the correlations related to the primary dimensions (i.e., for all pairs of dimensions and where ; ; and ), and the degenerate distribution with is assigned to all remaining correlations (i.e., for all pairs of and where ; ; and ). Regarding the item discriminations, the lognormal distribution in Equation 4 is assigned to one of the first and one of the remaining elements of , and the degenerate distribution with is assigned to the other elements.
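The two-tier specification can be sketched as a pair of patterns: a block-diagonal correlation matrix and a matrix marking which discriminations are estimated and which are fixed to 0. In the sketch below, the primary correlation of 0.40, the six example items, and the function names are hypothetical; the nesting of the secondary dimensions follows the MRQ example (three secondary dimensions within each of the two primary dimensions).

```python
import numpy as np


def two_tier_correlation(primary_corr, n_secondary):
    """Full correlation matrix: estimated primary block, identity elsewhere."""
    primary_corr = np.asarray(primary_corr, dtype=float)
    n_primary = primary_corr.shape[0]
    full = np.eye(n_primary + n_secondary)     # secondary dimensions orthogonal
    full[:n_primary, :n_primary] = primary_corr
    return full


def discrimination_pattern(primary_of_item, secondary_of_item,
                           n_primary, n_secondary):
    """True where a discrimination is estimated, False where it is fixed to 0."""
    n_items = len(primary_of_item)
    pattern = np.zeros((n_items, n_primary + n_secondary), dtype=bool)
    for j, (p, s) in enumerate(zip(primary_of_item, secondary_of_item)):
        pattern[j, p] = True                   # one primary dimension
        pattern[j, n_primary + s] = True       # one nested secondary dimension
    return pattern


# Hypothetical items, one per secondary dimension.
secondary_of_item = [0, 1, 2, 3, 4, 5]
primary_of_item = [0, 0, 0, 1, 1, 1]           # secondary 0-2 in primary 0, 3-5 in primary 1

corr = two_tier_correlation([[1.0, 0.4], [0.4, 1.0]], n_secondary=6)
pattern = discrimination_pattern(primary_of_item, secondary_of_item, 2, 6)
print(corr)
print(pattern.astype(int))
```

Entries marked False in the pattern, and the zeros outside the primary block of the correlation matrix, correspond to parameters that receive the degenerate distribution.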
The Bifactor IRT Model
Previously, it was noted that the bifactor IRT model can be viewed as a two-tier IRT model but with one primary dimension and multiple secondary dimensions (i.e., and ), with all dimensions orthogonal to each other. The orthogonal assumption of this model can be observed in the reexpressed version of the variance–covariance matrix; in Equation 6, becomes a scalar of 1 (i.e., ) because only one primary dimension is specified in the model, while remains an identity matrix (i.e., ), leading to the full variance–covariance matrix equaling a identity matrix (i.e., ). To specify the bifactor IRT model within our general framework, the degenerate distribution in Equation 3, with , is assigned to all unique correlations of the full variance–covariance matrix in Equation 2. In addition, the lognormal distribution in Equation 4 is assigned to the first element and one of the remaining elements of , and the degenerate distribution with is assigned to the other elements of .
The bifactor IRT model has been presented in its conventional form (e.g., Gibbons & Hedeker, 1992), but an alternative way to represent the model is also provided because the alternative version contextualizes one of the simulation findings reported later. This representation begins with a typical two-tier IRT model, meaning with multiple dimensions in each of the two tiers (i.e., and ). This alternative version of the bifactor IRT model is thus the same as the two-tier IRT model in all ways except that the correlations related to the primary dimensions (i.e., the correlations of in Equation 6) are fixed to 1 rather than being estimated. These correlations are fixed to 1 by assigning the degenerate distribution in Equation 3, with , to them (i.e., to the correlations related to all pairs of dimensions and where ; ; and ). Fixing these correlations to 1 sets the elements of related to the primary dimensions equal to each other (i.e., where ), leading to only one unique primary dimensional position for each individual, as with the conventional representation of the bifactor IRT model.
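The consequence of fixing the primary correlations to 1 can be verified numerically. The sketch below draws latent trait positions from a multivariate normal distribution with two primary dimensions correlated at 1 and six orthogonal secondary dimensions. The function name, the seed, and the number of draws are arbitrary choices for illustration.

```python
import numpy as np


def draw_positions(corr, n_draws, rng):
    """Draw from a zero-mean normal with a possibly singular correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(corr)
    eigvals = np.clip(eigvals, 0.0, None)      # remove tiny negative round-off
    root = eigvecs * np.sqrt(eigvals)          # root @ root.T reproduces corr
    z = rng.standard_normal((n_draws, corr.shape[0]))
    return z @ root.T


n_primary, n_secondary = 2, 6
corr = np.eye(n_primary + n_secondary)
corr[:n_primary, :n_primary] = 1.0             # primary correlations fixed to 1

rng = np.random.default_rng(7)
positions = draw_positions(corr, n_draws=1000, rng=rng)
print(np.abs(positions[:, 0] - positions[:, 1]).max())
```

The two primary positions agree for every draw up to floating-point error, so each individual effectively has a single primary position, as in the conventional bifactor representation.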
Nonnested-Dimensionality IRT Models
Nonnested-dimensionality IRT models, such as a between-item-dimensionality IRT model, can also be specified within our general framework. A between-item-dimensionality model (i.e., and , leading to ) includes only correlated primary dimensions, with each item discriminating on one of the dimensions. This model can be viewed as a one-tier model, as indicated by the reexpressed variance–covariance matrix in Equation 6; the variance–covariance submatrix related to the secondary dimensions (i.e., ) is empty because the model does not include secondary dimensions, leading to the full variance–covariance matrix equaling the variance–covariance submatrix related to the primary dimensions (i.e., ). To specify a between-item-dimensionality IRT model within our framework, the truncated normal distribution in Equation 3 is assigned to the unique correlations of the full variance–covariance matrix in Equation 2. In addition, the lognormal distribution in Equation 4 is assigned only to one of the elements of , and the degenerate distribution with is assigned to the other elements of . A unidimensional IRT submodel (i.e., and , leading to ) is another nonnested-dimensionality submodel that can be specified within our framework. In this submodel, , , and are scalars, and the lognormal distribution in Equation 4 is assigned to for all j.
Modifications to the Submodels
Making modifications to the aforementioned submodels is easy to achieve with our framework. As presented, the two-tier IRT model was specified for the MRQ data, meaning with each item discriminating on one primary and one secondary dimension. In other situations, specifying the model with the items discriminating on more than one primary dimension while also on a secondary dimension may be more appropriate. Such a modification simply requires assigning the lognormal distribution in Equation 4 to more than one of the first elements of and one of the remaining elements. Other modifications are also possible, but a discussion of all of them is beyond the scope of this article.
Bayesian Posterior Inference
In a Bayesian analysis, the posterior distribution of the model’s parameters () is of interest. Formally, , where , and . Bayes’s theorem states that the posterior density of given the data, is
In the equation, is the likelihood of the data; represents the joint prior distribution function,
(7) |
and is the joint cumulative distribution function that corresponds to .
In Equation 7, is an indicator function that returns 1 or 0 when the condition within the parentheses is true or false, respectively. For instance, returns 1 when item j’s discrimination on dimension d is not assigned a degenerate distribution and 0 when it is. In the same equation, represents the first relative category intercepts because the last intercept is constrained, as previously noted. For the unidimensional IRT model, does not include correlations.
The posterior distribution is estimated using Markov chain Monte Carlo (MCMC) methods. Gibbs sampling is used to sample the means of the prior distributions assigned to the items’ overall intercepts and discriminations. An adaptive random-walk Metropolis–Hastings algorithm (Roberts & Rosenthal, 2009) is used to draw values for the remaining parameters, with a check embedded in the algorithm to ensure that the values sampled for the correlations lead to a positive semidefinite matrix (see Supplemental Material B for the specifics about the MCMC algorithm used).
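Two ingredients of such a sampler can be sketched in a few lines. The sketch below is illustrative only and is not the authors' C++ code (see Supplemental Material B for that). The first function follows the batch-wise adaptation rule described by Roberts and Rosenthal (2009), with a target acceptance rate of .44 and an adaptation amount of min(0.01, n^(-1/2)); whether the authors used these exact constants is an assumption. The second function checks a proposed correlation matrix with a Cholesky factorization, which tests strict positive definiteness, a slightly stronger condition than the semidefiniteness mentioned in the text. Both function names are hypothetical.

```python
import math


def adapt_log_sd(log_sd, accept_rate, batch_index, target=0.44):
    """Adjust the log proposal SD after a batch of random-walk Metropolis updates.

    The step size shrinks with the batch index so that adaptation diminishes.
    """
    delta = min(0.01, batch_index ** -0.5)
    if accept_rate > target:
        return log_sd + delta  # accepting too often: widen the proposal
    return log_sd - delta      # accepting too rarely: narrow the proposal


def is_positive_definite(matrix, tol=1e-12):
    """Return True if a symmetric matrix admits a Cholesky factorization."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return False  # a nonpositive pivot means the matrix is not PD
                chol[i][j] = math.sqrt(pivot)
            else:
                chol[i][j] = (matrix[i][j] - partial) / chol[j][j]
    return True


valid = [[1.0, 0.7], [0.7, 1.0]]
invalid = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]
```

In a sampler, a proposed set of correlations that fails the check would simply be rejected before the acceptance ratio is evaluated.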
Simulation Study
We conducted a simulation study with a 6 × 3 design (sample size by dimensional structure) to examine whether our model can be used to confirm two-tier, bifactor, and between-item six-dimensional structures in rating scale data representing sample sizes of 100, 125, 150, 250, 500, and 1,500. The two-tier structure was motivated by the previously described MRQ data and composed of two primary dimensions (Dimensions 1 and 2; ) and six secondary dimensions (Dimensions 3-8; ), with three secondary dimensions nested within each primary dimension. The primary dimensions were correlated at .70, whereas the secondary dimensions were orthogonal to each other as well as to the primary dimensions. Each of the 30 items discriminated on one primary dimension and one secondary dimension nested within that primary dimension. The items were evenly distributed across the secondary dimensions, with five items per secondary dimension, leading to 15 items per primary dimension. The bifactor structure was similar to the two-tier structure, except it included only one primary dimension, leading to seven dimensions in total ( and ). The between-item six-dimensional structure (, and ) comprised each item discriminating on one dimension, with five items per dimension. The correlations among the dimensions ranged from .30 to .70 (see Supplemental Material C for the correlation matrix). Supplemental Material A presents the visualizations of the dimensional structures based on which the data were generated.
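The three structures can be written as item-by-dimension indicator patterns, where 1 marks a discrimination that is estimated and 0 marks one that is fixed. The sketch below is a hypothetical illustration: the function names are invented here, and it assumes that items are ordered consecutively by secondary dimension (Items 1-5 on the first secondary dimension, and so on), which the text does not state.

```python
def two_tier_pattern(n_items=30, n_primary=2, n_secondary=6, items_per_secondary=5):
    """Indicator pattern: each item loads on one primary and one nested secondary dimension."""
    secondary_per_primary = n_secondary // n_primary
    pattern = []
    for item in range(n_items):
        secondary = item // items_per_secondary
        primary = secondary // secondary_per_primary
        row = [0] * (n_primary + n_secondary)
        row[primary] = 1                 # primary dimensions come first
        row[n_primary + secondary] = 1   # secondary dimensions follow
        pattern.append(row)
    return pattern


def between_item_pattern(n_items=30, n_dims=6, items_per_dim=5):
    """Indicator pattern: each item loads on exactly one correlated dimension."""
    pattern = []
    for item in range(n_items):
        row = [0] * n_dims
        row[item // items_per_dim] = 1
        pattern.append(row)
    return pattern


two_tier = two_tier_pattern()               # 30 x 8
bifactor = two_tier_pattern(n_primary=1)    # 30 x 7, one general dimension
six_dim = between_item_pattern()            # 30 x 6

# Items per dimension, obtained by summing each column of the pattern.
two_tier_counts = [sum(col) for col in zip(*two_tier)]
bifactor_counts = [sum(col) for col in zip(*bifactor)]
six_dim_counts = [sum(col) for col in zip(*six_dim)]
```

The column sums reproduce the counts given in the text: 15 items per primary dimension and 5 per secondary dimension in the two-tier structure, and 5 items per dimension in the six-dimensional structure.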
Data Generation
One hundred data sets were generated for each sample size and dimensional structural condition, with each data set resembling 4-point ratings to 30 items. Here, only the characteristics of the data generation process that are directly relevant to the aims of this study are described; for full details thereof, see Supplemental Material C. In short, the latent trait dimensional positions () were randomly drawn from for the two-tier condition, for the bifactor condition, and for the six-dimensional condition; each was appropriately specified for the respective dimensional condition. Regarding the item discriminations, the data generation values for these parameters were randomly drawn from a uniform distribution. Then, they were rescaled for the following reasons: (a) so the items discriminated more strongly on the primary dimensions than on the secondary dimensions for the nested-dimensionality conditions and (b) so the values ranged from to to resemble the values observed when the MRQ data were analyzed.
Analytic Strategy
Models Used to Analyze the Data
A total of eight models were used in the study, but only six models were used in each dimensional condition (for reasons provided below). Supplemental Material A includes a table highlighting the characteristics of these models. Models 1 to 5 were specified within our proposed general Bayesian MIRT model and therefore used the priors described in the previous section. Models 1, 2, and 3 were the two-tier, bifactor, and six-dimensional IRT models, respectively, with the dimensional structures matching those of the respective dimensional conditions. Model 4 was a between-item two-dimensional IRT model based on only the primary dimensions of the dimensional structure of the two-tier condition. Model 5 was a unidimensional IRT model.
Models 6, 7, and 8 were the same as Models 1, 2, and 3, respectively, except for having a noninformative prior assigned to the estimated item discriminations; the noninformative prior was a uniform distribution ranging from 0 to 25, , when item j discriminated on dimension d. By specifying the pairs of models that differed only with regard to the priors assigned to the item discriminations (e.g., Models 1 and 6), any differences in the estimates from these models can be attributed to the priors. Because the models with noninformative priors were included to show the consequences of using noninformative priors with small samples and, in turn, show the benefits of the informative priors, each model with noninformative priors was used to analyze the data only in the dimensional condition for which the model was suited. For example, the two-tier IRT model with noninformative priors (Model 6) was only used in the two-tier condition. In all models, the items were blocked in sets of five. That is, the first five items formed the first block, thus sharing a set of relative category intercepts; the second set of five items formed the next block and so forth.
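The contrast between the two kinds of priors can be made concrete by comparing their log densities. This is a minimal sketch under stated assumptions: the uniform bounds of 0 and 25 come from the text, whereas the lognormal parameters used below (`mu = 0.0`, `sigma = 0.5`) are placeholders chosen only for illustration and are not the values used or estimated in the study.

```python
import math


def log_uniform_prior(a, lower=0.0, upper=25.0):
    """Log density of the noninformative uniform prior on a discrimination."""
    if lower <= a <= upper:
        return -math.log(upper - lower)
    return -math.inf


def log_lognormal_prior(a, mu, sigma):
    """Log density of a lognormal prior; mu and sigma are on the log scale."""
    if a <= 0.0:
        return -math.inf
    z = (math.log(a) - mu) / sigma
    return -math.log(a * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


# The uniform prior gives a moderate and an extreme discrimination equal support.
flat_moderate = log_uniform_prior(1.0)
flat_extreme = log_uniform_prior(20.0)

# A lognormal prior (placeholder parameters) strongly penalizes the extreme value.
informative_moderate = log_lognormal_prior(1.0, mu=0.0, sigma=0.5)
informative_extreme = log_lognormal_prior(20.0, mu=0.0, sigma=0.5)
```

Because the uniform prior does not penalize large discriminations, sparse data from small samples can pull the estimates upward, which is consistent with the positive bias reported for the models with noninformative priors.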
Model Comparisons
The pseudo Bayes factor (BF) was used to determine the extent to which the data supported Model A over Model B. The difference between two models’ log-predicted marginal likelihoods (LPMLs) was used to approximate the BF on a 2 × log scale (Gelfand, 1996),
$$2\log \mathrm{BF}_{AB} \approx 2\,(\mathrm{LPML}_{A} - \mathrm{LPML}_{B}),$$
and the LPMLs were based on the conditional predictive ordinates (Congdon, 2006). Values greater than 10 units suggest that the data very strongly support Model A over Model B (Kass & Raftery, 1995), and this threshold was adopted for our model comparisons. When the data did not very strongly support one model over another, the parsimonious model (i.e., the model with fewer estimated parameters) was favored. All the reported LPMLs and BFs related to a study condition are the average values across the 100 data replicates, and the BFs reported hereafter are assumed to be on a 2 × log scale.
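The comparison procedure can be sketched as follows. The sketch assumes the common harmonic-mean estimator of each conditional predictive ordinate (CPO), computed from the pointwise log-likelihoods of the saved MCMC draws; whether the authors computed the CPOs in exactly this way is an assumption, and the function names are hypothetical. Rows of `loglik_matrix` are taken to be respondents and columns to be saved draws.

```python
import math


def log_cpo(loglik_draws):
    """Log CPO for one respondent: minus the log of the mean inverse likelihood."""
    neg = [-ll for ll in loglik_draws]
    m = max(neg)  # log-sum-exp shift for numerical stability
    log_mean_inverse = m + math.log(sum(math.exp(v - m) for v in neg) / len(neg))
    return -log_mean_inverse


def lpml(loglik_matrix):
    """LPML: sum of the log CPOs across respondents."""
    return sum(log_cpo(row) for row in loglik_matrix)


def two_log_bf(lpml_a, lpml_b):
    """Pseudo Bayes factor for Model A over Model B on the 2 x log scale."""
    return 2.0 * (lpml_a - lpml_b)


def preferred(lpml_a, lpml_b, n_par_a, n_par_b, threshold=10.0):
    """Apply the decision rule: BF > 10 either way, otherwise parsimony."""
    bf_ab = two_log_bf(lpml_a, lpml_b)
    if bf_ab > threshold:
        return "A"
    if -bf_ab > threshold:
        return "B"
    # Neither model is very strongly supported: favor fewer parameters
    # (ties go to Model A in this sketch).
    return "A" if n_par_a <= n_par_b else "B"


# Toy check: three respondents whose likelihood is 0.5 under every draw.
toy = [[math.log(0.5)] * 4 for _ in range(3)]
toy_lpml = lpml(toy)
```

For instance, LPMLs of -100 and -110 give a value of 20 on the 2 × log scale, so Model A would be favored regardless of the parameter counts.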
Recovery of the Parameters
We assessed the extent to which the item parameters were recovered by the two-tier (Models 1 and 6), bifactor (Models 2 and 7), and six-dimensional (Models 3 and 8) IRT models in the dimensional conditions for which they were appropriate. In addition, the recovery of the correlations among the latent trait dimensional positions was evaluated in the two-tier and six-dimensional conditions. To judge the recovery, the bias and the Monte Carlo SD (MCSD) were calculated, where the bias indicated the accuracy of the estimates from the models and the MCSD indicated the stability of the estimation across the generated data sets. The means of the marginal posterior distributions were used as estimates for the parameters to calculate these indices. The discussion of our results mainly focuses on the bias because this index reveals the extent to which the estimates under- or overrepresent the true values and, therefore, shows how the priors affect the accuracy of the estimates.
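The two recovery indices can be computed as in the sketch below. It assumes that the MCSD is the sample standard deviation (with an n - 1 divisor) of the posterior-mean estimates across replicates; the divisor used in the study is not stated in this section, and the function names and example values are hypothetical.

```python
import statistics


def bias(estimates, true_value):
    """Mean of the estimates across replicates minus the data-generating value."""
    return statistics.mean(estimates) - true_value


def mcsd(estimates):
    """Monte Carlo SD: spread of the estimates across replicates (n - 1 divisor)."""
    return statistics.stdev(estimates)


# Hypothetical posterior means of one discrimination from three replicates.
example_estimates = [1.0, 1.2, 1.4]
example_bias = bias(example_estimates, true_value=1.0)
example_mcsd = mcsd(example_estimates)
```

A positive bias indicates that the estimates overrepresent the true value on average, whereas a large MCSD indicates unstable estimation across the generated data sets.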
Technical Details
The MCMC algorithm used to estimate the posterior distributions for all models was written in C++. The algorithm was run for 120,000 iterations. The first 20,000 samples were discarded, and every 10th sample was saved thereafter, leading to a total of 10,000 saved samples. To determine whether the Markov chains stabilized, the batch means, MCMC 95% half-width intervals, and trace plots were inspected (Geyer, 2011). Irregularities in the chains related to the correlations were detected only for the two-tier IRT model with informative priors (Model 1) under the bifactor condition; the details of these irregularities are presented when discussing the results.
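The sampling arithmetic and a batch-means diagnostic can be sketched as follows. The reported totals (120,000 iterations, 20,000 discarded, 10,000 saved) are consistent with keeping every 10th draw after burn-in, which is the thinning interval stated for the real-data analysis. The batch-means function is a generic illustration: the number of batches and the normal quantile of 1.96 are assumptions, not details taken from the study.

```python
import math
import statistics


def n_saved(n_iter, burn_in, thin):
    """Number of draws retained after discarding burn-in and thinning."""
    return (n_iter - burn_in) // thin


def batch_means_half_width(chain, n_batches=20, z=1.96):
    """Approximate 95% half-width of the posterior-mean estimate via batch means."""
    size = len(chain) // n_batches
    means = [
        statistics.mean(chain[b * size:(b + 1) * size]) for b in range(n_batches)
    ]
    # Monte Carlo standard error estimated from the spread of the batch means.
    standard_error = statistics.stdev(means) / math.sqrt(n_batches)
    return z * standard_error


simulation_draws = n_saved(120_000, 20_000, 10)
real_data_draws = n_saved(150_000, 50_000, 10)
```

A half-width that is small relative to the posterior mean suggests that the saved draws estimate that mean with adequate precision; trace plots remain necessary for judging stationarity.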
Simulation Results: Two-Tier Condition
Model Comparisons
The model comparison results from the two-tier condition are summarized in the simulation columns of Table 1. The table includes the LPMLs, BFs, and percentage of times (across the 100 generated data sets) the BF indicated that the data very strongly supported the two-tier IRT model over each of the comparison models. The data very strongly supported Model 1 over the bifactor IRT model (Model 2) and the non–two-tier IRT models (Models 3-5) at least 96% of the time (), as observed for all sample sizes. A comparison of the different versions of the two-tier IRT models (Models 1 and 6) revealed that the version with informative priors (Model 1) was very strongly favored over the version with noninformative priors (Model 6) at least 99% of the time for (). For , the average BF suggests that Model 1 was generally very strongly favored over Model 6 (); however, Model 1 was favored over Model 6 only 67% of the time, suggesting that the benefits of the informative prior assigned to the item discriminations diminished. For , neither model was very strongly favored (); Model 1 was favored only 22% of the time.
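The average BF and the percentage of replicates in which a model is favored can point in different directions, as in the comparison above where the average BF exceeded the threshold but Model 1 was favored in only about two thirds of the replicates. The sketch below shows how both summaries are obtained; the function name and the three BF values are hypothetical and chosen only to reproduce that pattern.

```python
def summarize_bfs(bfs, threshold=10.0):
    """Return the average BF (2 x log scale) and the percentage of replicates above the threshold."""
    mean_bf = sum(bfs) / len(bfs)
    pct_favored = 100.0 * sum(bf > threshold for bf in bfs) / len(bfs)
    return mean_bf, pct_favored


# Hypothetical BFs from three replicates: two large values and one small value.
example_bfs = [30.0, 25.0, 2.0]
example_mean, example_pct = summarize_bfs(example_bfs)
```

Here the mean is 19, well above 10, even though only two of the three replicates exceed the threshold, so the two summaries are best read together.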
Table 1.
The Log-Predicted Marginal Likelihoods (LPMLs), the Bayes Factors (BFs) on a 2 × Log Scale in Parentheses, and the Percentage of Times (%) the Data Generation Model Was Favored Over Each of the Comparison Models, From the Two-Tier Condition of the Simulation Study and the Analysis of the Motivation for Reading Questionnaire (MRQ) Data.
Simulations | Real data | |||||||
---|---|---|---|---|---|---|---|---|
Model | Description | % | % | % | MRQ | |||
1 | Two-tier (informative) | |||||||
2 | Bifactor | |||||||
3 | Six-dimensional | |||||||
4 | Two-dimensional | |||||||
5 | Unidimensional | |||||||
6 | Two-tier (noninformative)a | |||||||
% | % | % | ||||||
1 | Two-tier (informative) | |||||||
2 | Bifactor | |||||||
3 | Six-dimensional | |||||||
4 | Two-dimensional | |||||||
5 | Unidimensional | |||||||
6 | Two-tier (noninformative)a |
Note. Within each sample size condition, the LPML and BF for a model are the averages across the 100 generated data sets. Each BF indicates the extent to which the data supported the two-tier IRT model with informative priors (Model 1) over the model corresponding to the row in which the BF is reported. The “%” column includes the percentage of times (across the 100 data replicates) Model 1 was favored over each of the other models based on the BF (i.e., BF > 10). The bifactor and six-dimensional IRT models with noninformative priors (Models 7 and 8, respectively) were not used in this dimensional condition.
a. A noninformative prior distribution was assigned to all estimated item discriminations in this variation of the two-tier IRT model. The priors assigned to the other parameters of this model were the same as those used in Model 1.
Recovery of the Item Discriminations
The bias plots in Figure 2 show the extent to which the item discriminations related to the primary dimensions were recovered under the two-tier IRT models (Models 1 and 6). The degree of bias in the estimates was greater for the model with noninformative priors (Model 6) than for the model with informative priors (Model 1) in the smaller sample size conditions (i.e., N ≤ 250; see Figure 2a-d). The bias for Model 6 was mostly positive, and the magnitude of the bias was larger for the items with higher discriminations. For instance, the true value for Item 17 was , and the bias related to this item under Model 6 was when . In the two largest sample size conditions, Models 1 and 6 produced minimally biased estimates (see Figure 2e and f).
Figure 2.
Bias plots showing the recovery of the items’ discriminations on the primary dimensions under the two-tier condition of the simulation study.
Note. Models 1 and 6 are the two-tier IRT models with informative and noninformative priors, respectively. In each subplot, the items and bias in the estimates are represented along the horizontal and vertical axes, respectively. The lack of a square marker for an item indicates that the bias for that item was greater than for Model 6. The broken vertical line represents the grouping of the items with regard to their assignment across the primary dimensions.
The bias plots in Figure 3 show the extent to which the item discriminations related to the secondary dimensions were recovered under Models 1 and 6. The recovery pattern of these discriminations was similar to that observed for the item discriminations related to the primary dimensions. In the smaller sample size conditions (i.e., N ≤ 250), the degree of bias was greater for the model with noninformative priors (Model 6; see Figure 3a-d), with the bias consistently positive and more pronounced for the items with higher discriminations. In the two largest sample size conditions, the bias was minimal for both models (see Figure 3e and f).
Figure 3.
Bias plots showing the recovery of the items’ discriminations on the secondary dimensions under the two-tier condition of the simulation study.
Note. Models 1 and 6 are the two-tier IRT models with informative and noninformative priors, respectively. In each subplot, the items and bias in the estimates are represented along the horizontal and vertical axes, respectively. The broken vertical lines represent the grouping of the items based on their assignment across the secondary dimensions.
Regarding the MCSDs, for the smaller samples, the averages of the MCSDs related to the item discriminations (averaged across the items) were greater for Model 6 than those for Model 1. For instance, when , the average of the MCSDs was 0.31 (with a range of 0.19-0.46) for Model 6 and 0.20 (with a range of 0.12-0.30) for Model 1. In contrast, when the sample sizes were larger, the MCSDs were similar for the two models. For example, when , the means and ranges were approximately 0.06 and 0.04 to 0.09, respectively, for the two models.
Recovery of the Item Intercepts
In the smaller sample size conditions, the degree of bias in the estimates of the overall and relative category intercepts for Model 1 was less than that for Model 6. For instance, when , the bias related to these parameters ranged from to (with the MCSDs ranging from to ) for Model 1 and to (with the MCSDs ranging from to ) for Model 6. As the sample size increased, the models produced estimates that were minimally biased and consistent across the data replicates. For instance, when , the bias ranged from approximately to and the MCSDs ranged from to for these models.
Recovery of the Correlation Between the Primary Dimensions
For both versions of the two-tier IRT model, the 95% credible intervals (CIs) related to the correlation between the primary dimensions included the data generation correlational value of .70. However, Model 1 more accurately estimated the correlation in the smaller sample size conditions. For instance, when , the posterior mean was (with a bias of ), 95% CI ranged from to , and MCSD was .08 for Model 1, whereas for Model 6, the posterior mean was .74 (with a bias of ), 95% CI ranged from to , and MCSD was .07. In the larger sample size conditions, the performance of the two models was similar. For example, when , the posterior means were approximately , 95% CIs ranged from to , and MCSDs were .
Simulation Results: Bifactor Condition
Model Comparisons
As previously noted, only Models 1 to 5 and 7 were used in the bifactor condition. The model comparison results are summarized in Table 2; each BF indicates the extent to which the data supported the bifactor IRT model with informative priors (Model 2) over the model corresponding to the row in which the BF is reported. Model 2 was very strongly favored over all nonnested-dimensionality IRT models (Models 3-5) at least 97% of the time (). A comparison of the two versions of the bifactor IRT model (Models 2 and 7) showed that the version with informative priors (Model 2) was very strongly favored over that with noninformative priors (Model 7) for (, with Model 2 favored 100% of the time). When , Model 2 was, on average, favored over Model 7; however, Model 2 was favored only 68% of the time. For , neither model was very strongly favored.
Table 2.
The Log-Predicted Marginal Likelihoods (LPMLs), the Bayes Factors (BFs) on a 2 × Log Scale in Parentheses, and the Percentage of Times (%) the Data Generation Model Was Favored Over Each of the Comparison Models, From the Bifactor Condition of the Simulation Study.
Model | Description | % | % | % | |||
---|---|---|---|---|---|---|---|
1 | Two-tier | ||||||
2 | Bifactor (informative) | ||||||
3 | Six-dimensional | ||||||
4 | Two-dimensional | ||||||
5 | Unidimensional | ||||||
7 | Bifactor (noninformative)a | ||||||
% | % | % | |||||
1 | Two-tier | ||||||
2 | Bifactor (informative) | ||||||
3 | Six-dimensional | ||||||
4 | Two-dimensional | ||||||
5 | Unidimensional | ||||||
7 | Bifactor (noninformative)a | 68 |
Note. Within each sample size condition, the LPML and BF for a model are the averages across the 100 generated data sets. Each BF indicates the extent to which the data supported the bifactor IRT model with informative priors (Model 2) over the model corresponding to the row in which the BF is reported. The “%” column includes the percentage of times (across the 100 data replicates) Model 2 was favored over each of the other models. Model 2 was favored over Model 1 when the former was very strongly supported by the data (i.e., BF > 10) or for reasons of parsimony; Model 2 was favored over the other models when BF > 10. The two-tier and six-dimensional IRT models with noninformative priors (Models 6 and 8, respectively) were not used in this dimensional condition.
a. A noninformative prior distribution was assigned to all estimated item discriminations in this variation of the bifactor IRT model. The priors assigned to the other parameters of this model were the same as those used in Model 2.
Regarding the comparison between the bifactor IRT model with informative priors (Model 2) and the two-tier IRT model (Model 1), the related results should be interpreted with caution. Recall that in this dimensional condition, irregularities were detected in the Markov chains from the two-tier IRT model with informative priors (Model 1), with these chains being related to the correlation between the primary dimensions. The irregular chains indicated that the sampler was often stuck near the value of 1, especially for . If the lack of sufficient mixing of the samples invalidates the results from the two-tier IRT model (Model 1), then the bifactor IRT model with informative priors (Model 2) would automatically be favored over Model 1, which would be the correct conclusion. If one proceeds to interpret the results, Model 2 would be correctly favored over Model 1 at least 98% of the time for based on the BFs being greater than 10 or for reasons of parsimony.
For , Model 2 would be favored over Model 1 76% of the time (based on BF > 10 or for reasons of parsimony), meaning that Model 1 would actually be favored over the bifactor IRT model 24% of the time. However, under Model 1, the average of the correlation between the two primary dimensions was across the 24 data replicates, similar to the value across all 100 data replicates. Whether the average is taken across the 24 or 100 data sets, the correlation is near 1.00. The two-tier IRT model, then, resembled the previously noted alternative representation of the bifactor IRT model, thereby indicating that only one unique primary dimension along with multiple secondary dimensions were represented in the data. Thus, regardless of whether one proceeds to interpret the results from the two-tier IRT model, the correct conclusion is reached.
Recovery of the Item Parameters
The recovery of the item discriminations, overall item intercepts, and relative category intercepts in the bifactor condition is only briefly reviewed, both for space reasons and because the pattern of results was similar to that observed in the two-tier condition. Supplemental Material D includes bias plots that summarize the recovery of these parameters under the bifactor IRT models (Models 2 and 7). In general, the degree of bias for the bifactor IRT model with informative priors (Model 2) was less than that for the version of the model with noninformative priors (Model 7) when the sample sizes were smaller (). Under the larger sample size conditions (), both models produced minimally biased estimates.
Simulation Results: Six-Dimensional Condition
Model Comparisons
Only Models 1 to 5 and 8 were used in the six-dimensional condition. The model comparison results are summarized in Table 3. The six-dimensional IRT model with informative priors (Model 3) was very strongly favored over the other models, except the six-dimensional IRT model with noninformative priors (Model 8), at least 98% of the time (), as observed for all sample sizes. Regarding the two versions of the six-dimensional IRT model, the benefits of the informative priors were not as pronounced in this dimensional condition as they were in the other dimensional conditions. The version with informative priors (Model 3) was favored over that with noninformative priors (Model 8) 86%, 60%, and 35% of the time for the sample sizes of and respectively. For the remaining sample sizes, Model 3 was favored over Model 8 no more than 13% of the time.
Table 3.
The Log-Predicted Marginal Likelihoods (LPMLs), the Bayes Factors (BFs) on a 2 × Log Scale in Parentheses, and the Percentage of Times (%) the Data Generation Model Was Favored Over Each of the Comparison Models, From the Six-Dimensional Condition of the Simulation Study.
Model | Description | % | % | % | |||
---|---|---|---|---|---|---|---|
1 | Two-tier | ||||||
2 | Bifactor | ||||||
3 | Six-dimensional (informative) | ||||||
4 | Two-dimensional | ||||||
5 | Unidimensional | ||||||
8 | Six-dimensional (noninformative)a | ||||||
% | % | % | |||||
1 | Two-tier | ||||||
2 | Bifactor | ||||||
3 | Six-dimensional (informative) | ||||||
4 | Two-dimensional | ||||||
5 | Unidimensional | ||||||
8 | Six-dimensional (noninformative)a |
Note. Within each sample size condition, the LPML and BF for a model are the averages across the 100 generated data sets. Each BF indicates the extent to which the data supported the six-dimensional IRT model with informative priors (Model 3) over the model corresponding to the row in which the BF is reported. The “%” column includes the percentage of times (across the 100 data replicates) Model 3 was favored over each of the other models. Model 3 was favored over Models 1 and 2 when Model 3 was very strongly supported by the data (i.e., BF > 10) or for reasons of parsimony; Model 3 was favored over the other models when BF > 10. The two-tier and bifactor IRT models with noninformative priors (Models 6 and 7, respectively) were not used in this dimensional condition.
a. A noninformative prior distribution was assigned to all estimated item discriminations in this variation of the six-dimensional IRT model. The priors assigned to the other parameters of this model were the same as those used in Model 3.
The recovery of the item discriminations (see Supplemental Material E) also indicates that the benefit of the informative priors was not as pronounced in this dimensional condition as it was in the other dimensional conditions, possibly because the six-dimensional IRT model was less complex than the nested-dimensionality IRT models. Although the six-dimensional IRT model included 15 correlations, each item discriminated on only one dimension. The model therefore still had fewer parameters than the nested-dimensionality IRT models, making it more stable with small samples. Nevertheless, the models with informative priors displayed some advantages over those with noninformative priors for . Overall, the estimates from the model with informative priors (Model 3) were more accurate and consistent across the 100 data replicates than those from the model with noninformative priors (Model 8).
Summary of the Simulations
The simulations demonstrated that the submodels obtained from our general MIRT model can differentiate among two-tier, bifactor, and six-dimensional structures in rating scale data representing sample sizes as small as 100. The simulations also revealed the benefits of our informative priors when working with small samples. For , the models with informative priors (Models 1, 2, and 3) displayed better predictive performance, produced more accurate estimates, and had smaller MCSDs than the corresponding models with noninformative priors (Models 6, 7, and 8, respectively). For , the models with informative priors performed similarly to the corresponding models with noninformative priors, demonstrating that our model can also be used with large samples.
Illustration With Real Data
The previously described MRQ data were analyzed to illustrate that the simulation results from the two-tier dimensional condition related to smaller sample sizes are relevant to real settings. Data were obtained from 121 students (in Grades 6-8); in-depth details of the characteristics of the sample are in Neugebauer and Fujimoto (2019). The two-tier structure hypothesized for the data was composed of primary dimensions of intrinsic and extrinsic motivation and secondary dimensions of challenge (five items), curiosity (four items), involvement (six items), competition in reading (six items), recognition in reading (four items), and grades (four items). Figure 1a presents a visualization of this structure.
The models used in the two-tier condition of the simulation study (i.e., Models 1-6) were also used to analyze the MRQ data. The number of MCMC sampling iterations was increased to 150,000. The first 50,000 samples were discarded, and every 10th sample was saved thereafter to obtain a total of 10,000 saved samples. The process for diagnosing the stability of the Markov chains was the same as that used in the simulations, and no irregularities were detected from any of the models. Supplemental Material F includes the trace plots for a few of the parameters from the two-tier IRT model with informative priors (Model 1). These plots show that the chains became stationary and that the mixing was sufficient during the sampling process.
Results and Related Discussion
The MRQ column of Table 1 includes the LPMLs and BFs from the analysis of the MRQ data. The pattern of these values is similar to the pattern that appeared in the simulations related to the smaller sample sizes () within the two-tier condition. The two-tier IRT model with informative priors (Model 1) was very strongly favored over the models that were not based on a two-tier structure (Models 2-5; ), suggesting that among the dimensional structures tested, a two-tier structure was the most strongly represented structure in the MRQ data. The posterior mean for the correlation between the primary dimensions was .69, with 95% CI [], for Model 1, showing that the model did not reduce to a bifactor IRT model. The pattern of the results for both versions of the two-tier IRT model (Models 1 and 6) was also consistent with that of the simulation findings related to the smaller sample size conditions. Model 1 was very strongly favored over the version with noninformative priors (Model 6; ). Moreover, many of the estimates of the item discriminations from Model 6 were greater than those from Model 1 (see Supplemental Material G for the posterior means and SDs of the item parameters from Model 1). In addition to model comparisons, how well Model 1 fits the data was assessed by performing posterior predictive model checking (PPMC) based on the standardized generalized dimensionality discrepancy measure (Levy, Xu, Yel, & Svetina, 2015). The posterior predictive probability value was .11, indicating that the residual correlations were trivial under this model because the value was greater than .05.
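The posterior predictive probability value mentioned above can be obtained from the saved draws as the proportion of draws in which the discrepancy computed on replicated data is at least as large as the discrepancy computed on the observed data. The sketch below shows only this final step; computing the standardized generalized dimensionality discrepancy measure itself is not shown, and the function name and input values are hypothetical.

```python
def posterior_predictive_p(realized, replicated):
    """Proportion of draws whose replicated discrepancy is >= the realized discrepancy.

    realized[s] and replicated[s] are the discrepancy values for saved draw s,
    computed on the observed data and on data simulated from draw s, respectively.
    """
    if len(realized) != len(replicated):
        raise ValueError("one realized and one replicated value are needed per draw")
    exceed = sum(rep >= obs for obs, rep in zip(realized, replicated))
    return exceed / len(realized)


# Hypothetical discrepancy values from four saved draws.
example_ppp = posterior_predictive_p([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])
```

Values near 0 indicate that the observed discrepancy is larger than what the model typically generates, which is why a value above .05 was read as adequate fit in the analysis above.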
Discussion
The sample size requirements for MIRT models could create a disadvantage for educational and psychological researchers conducting small-scale studies who wish to confirm complex dimensional structures in their data. The general Bayesian MIRT model we presented in this article is intended to offset this disadvantage. As the simulations demonstrated, our model can be used to verify a two-tier structure with eight dimensions (two primary and six secondary dimensions), a bifactor structure with seven dimensions (one primary and six secondary dimensions), and a between-item six-dimensional structure in rating scale data representing sample sizes as small as 100. Additionally, for sample sizes smaller than 250, more accurate estimates can be obtained for the parameters of two-tier, bifactor, and six-dimensional IRT models when these submodels are specified within our framework rather than specifying them with noninformative priors.
Our model can perform these tasks because of the adaptive informative prior assigned to the item discriminations. This prior is similar to other informative priors in that a concentrated set of values is supported for the item discriminations. However, unlike other informative priors, our prior automatically adapts to include the values most appropriate for the item discriminations. In other words, our prior has the benefits of informative priors while maintaining some of the flexibility of noninformative priors. To show that our informative prior can adapt even when the items have a stronger discriminatory power than what was explored in the simulations reported in this article, additional simulations were performed (see Supplemental Material H for full details). These additional simulations also highlight the shortcomings of nonadaptive informative priors. The adaptive informative prior we used was designed to lead to an MIRT model suitable for small samples, but this prior is also applicable to large samples as the simulations verified. The benefit of having one model appropriate for a range of sample sizes is that the modeling process is further simplified because the same set of specifications can be used across submodels and sample sizes.
Another feature of our general MIRT model is that modifications to the submodels are straightforward, as previously noted. However, any submodel obtained from our framework that differs from how we used it should be demonstrated to be appropriate for the sample size with which the submodel is intended to be used because the minimum sample size requirement for these modified submodels could differ from the size we determined to be sufficient here.
Although our focus was to contribute an MIRT model for small samples, other aspects of our article add to the item response modeling literature. The results related to our models with noninformative priors support the findings from prior research based on frequentist estimation methods (i.e., Forero & Maydeu-Olivares, 2009; Jiang et al., 2016). Even though our models with noninformative priors were not the same as models based on full-information maximum likelihood or unweighted least squares, some similarities exist in that our models relied mainly on the data to estimate the parameters that were assigned noninformative priors. By remaining within a Bayesian framework, however, we were able to study the consequences of the informativeness of the priors assigned to just the item discriminations, and the model comparisons were straightforward. The findings showed that, even within a Bayesian setting, using noninformative priors with the item discriminations could result in complex MIRT models requiring sample sizes of 500 to obtain accurate item parameter estimates. In addition to showing the ramifications of using noninformative priors with small samples, we showed that our findings from the simulations apply to real settings by demonstrating that the patterns observed in the simulations related to the smaller sample sizes within the two-tier condition appeared in the analysis of the MRQ data.
One limitation of this study is that the simulations investigated only three types of dimensional structures. Many factors could influence the minimum sample size needed for MIRT models, such as the number of items, the number of dimensions, and the correlations among the dimensions (de Ayala, 1994; Forero & Maydeu-Olivares, 2009; Jiang et al., 2016). One reasonable assumption is that a sample size of 100 is more likely to be appropriate when the two-tier, bifactor, and between-item-dimensionality IRT models are specified with fewer dimensions than the numbers specified for the models in this article because structures with fewer dimensions lower the complexity of the models. However, when these same models are specified with more dimensions, the models become more complex and thus could require sample sizes greater than 100. We also relied on model comparisons because model comparisons often serve as the first step, if not the only step, of a dimensionality analysis. PPMC could also be useful in assessing how well a model fits the data. Although we reported the posterior predictive probability when we discussed the results of our analysis of the MRQ data, we did so only to show the potential of PPMC. (We thank an anonymous reviewer for this suggestion.) Whether PPMC is appropriate with small samples should be established before it is used in practice. Last, we tested a limited set of prior distributions. Other priors might lead to similar, if not better, performances than our adaptive priors. However, our priors were sufficient for our goals, and we provided evidence of such.
Notwithstanding these limitations, we addressed a critical issue in MIRT modeling. Prior to our study, the sample size requirement for MIRT models was an obstacle for many researchers wishing to use these models to analyze their data. Our general Bayesian MIRT model lowers this sample size barrier. Researchers of small-scale studies now have the means to confirm complex dimensional structures in their rating scale data. In other words, these researchers have the ability to furnish evidence for the internal structural aspect of validity when needed.
Supplemental Material
Supplemental material, Online_Supplemental_Materials for A General Bayesian Multidimensional Item Response Theory Model for Small and Large Samples by Ken A. Fujimoto and Sabina R. Neugebauer in Educational and Psychological Measurement
Acknowledgments
The authors would like to thank two anonymous reviewers for their valuable comments. All errors remain the responsibility of the authors.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Ken A. Fujimoto https://orcid.org/0000-0001-5222-3327
Supplemental Material: Supplemental material for this article is available online.
References
- Adams R. J., Wilson M., Wang W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-23. doi: 10.1177/0146621697211001
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
- Cai L. (2010). A two-tier full-information item factor analysis model with applications. Psychometrika, 75, 581-612. doi: 10.1007/s11336-010-9178-0
- Congdon P. (2006). Bayesian statistical modelling (2nd ed.). Chichester, England: Wiley.
- de Ayala R. J. (1994). The influence of multidimensionality on the graded response model. Applied Psychological Measurement, 18, 155-170. doi: 10.1177/014662169401800205
- DeMars C. E. (2013). A tutorial on interpreting bifactor model scores. International Journal of Testing, 13, 354-378. doi: 10.1080/15305058.2013.799067
- Depaoli S., Clifton J. P. (2015). A Bayesian approach to multilevel structural equation modeling with continuous and dichotomous outcomes. Structural Equation Modeling: A Multidisciplinary Journal, 22, 327-351. doi: 10.1080/10705511.2014.937849
- Forero C. G., Maydeu-Olivares A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299. doi: 10.1037/a0015825
- Fujimoto K. A. (2018). Bayesian multilevel multidimensional item response theory model for locally dependent data. British Journal of Mathematical and Statistical Psychology, 71, 536-560. doi: 10.1111/bmsp.12133
- Fujimoto K. A. (2019). The Bayesian multilevel trifactor item response theory model. Educational and Psychological Measurement, 79, 462-494. doi: 10.1177/0013164418806694
- Fukuhara H., Kamata A. (2011). A bifactor multidimensional item response theory model for differential item functioning analysis on testlet-based items. Applied Psychological Measurement, 35, 604-622. doi: 10.1177/0146621611428447
- Gelfand A. E. (1996). Model determination using sampling-based methods. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov Chain Monte Carlo in practice (pp. 145-161). London, England: Chapman & Hall/CRC Press.
- Gelman A., Carlin J. B., Stern H. S., Rubin D. B. (2013). Bayesian data analysis (3rd ed.). Boca Raton, FL: Chapman & Hall/CRC Press.
- Geyer C. (2011). Introduction to MCMC. In S. Brooks, A. Gelman, G. Jones, & X. Meng (Eds.), Handbook of Markov Chain Monte Carlo (pp. 3-48). Boca Raton, FL: Chapman & Hall/CRC Press.
- Gibbons R. D., Hedeker D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423-436. doi: 10.1007/BF02295430
- Holtmann J., Koch T., Lochner K., Eid M. (2016). A comparison of ML, WLSMV, and Bayesian methods for multilevel structural equation models in small samples: A simulation study. Multivariate Behavioral Research, 51, 661-680. doi: 10.1080/00273171.2016.1208074
- Jiang S., Wang C., Weiss D. J. (2016). Sample size requirements for estimation of item parameters in the multidimensional graded response model. Frontiers in Psychology, 7, 109. doi: 10.3389/fpsyg.2016.00109
- Kass R. E., Raftery A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795. doi: 10.1080/01621459.1995.10476572
- Kuo T.-C., Sheng Y. (2015). Bayesian estimation of a multi-unidimensional graded response IRT model. Behaviormetrika, 42(2), 79-94. doi: 10.2333/bhmk.42.79
- Levy R., Xu Y., Yel N., Svetina D. (2015). A standardized generalized dimensionality discrepancy measure and a standardized model-based covariance for dimensionality assessment for multidimensional models. Journal of Educational Measurement, 52, 144-158. doi: 10.1111/jedm.12070
- Muraki E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176. doi: 10.1177/014662169201600206
- Neugebauer S. R., Fujimoto K. A. (2018). Distinct and overlapping dimensions of reading motivation in commonly used measures in schools. Assessment for Effective Intervention. Advance online publication. doi: 10.1177/1534508418819793
- Reckase M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
- Reise S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47, 667-696. doi: 10.1080/00273171.2012.715555
- Roberts G. O., Rosenthal J. S. (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18, 349-367. doi: 10.1198/jcgs.2009.06134
- Sheng Y. (2010). Bayesian estimation of MIRT models with general and specific latent traits in MATLAB. Journal of Statistical Software, 34(3), 1-27. doi: 10.18637/jss.v034.i03
- Toland M. D., Sulis I., Giambona F., Porcu M., Campbell J. M. (2017). Introduction to bifactor polytomous item response theory analysis. Journal of School Psychology, 60, 41-63. doi: 10.1016/j.jsp.2016.11.001
- Wigfield A., Guthrie J. T. (1997). Relations of children’s motivation for reading to the amount and breadth of their reading. Journal of Educational Psychology, 89, 420-432. doi: 10.1037/0022-0663.89.3.420
- Yao L. (2003). BMIRT: Bayesian multivariate item response theory [Computer software]. Monterey, CA: Defense Manpower Data Center.