Abstract
A mixture extension of Samejima’s continuous response model for continuous measurement outcomes, together with its estimation through a heuristic approach based on limited-information factor analysis, is introduced. Using an empirical data set, it is shown that two groups of respondents who differ both qualitatively and quantitatively in their response behavior can be revealed. In addition to the real data application, the effectiveness of the heuristic estimation approach under realistic data analytic conditions was examined through a Monte Carlo simulation study. The results showed that the heuristic estimation approach provided reliable parameter estimates, and the model converged in more than 80% of replications when the sample size was 250 and in more than 90% when the sample size was 500 or 1,000 for most conditions.
Keywords: continuous response model, mixture item response models, mixture models, item response models, item response theory
The assumption that subjects come from a single homogeneous population is very strict and rarely met in practice in social science research. Mixture models relax this assumption and allow for latent subpopulations. In the context of item response theory (IRT), mixture models have attracted particular attention in the past decade. Mixture item response models for dichotomous response outcomes are motivated by the fact that response patterns may differ across subpopulations for a variety of reasons, including different cognitive strategies used in problem solving (e.g., Mislevy, Levy, Kroopnick, & Rutstein, 2008), test speededness (e.g., Bolt, Cohen, & Wollack, 2002), test-taking behavior (e.g., Meyer, 2010), item preknowledge (e.g., C. Wang, Xu, Shang, & Kuncel, 2018), and mastery of certain skills (e.g., van Nijlen & Janssen, 2008). Similarly, mixture item response models for polytomous outcomes have been proposed to identify latent subpopulations in survey research that display certain biases in their use of the response scale (e.g., McIntyre, 2011), differ in their response patterns with regard to social desirability (e.g., Meij, Kelderman, & van der Flier, 2008), differ in their understanding of the middle category (e.g., Hernandez, Drasgow, & Gonzalez-Roma, 2004; Smit, Kelderman, & van der Flier, 2003), or differ in the degree of faking, from honest responding to extreme faking (e.g., Zickar, Gibby, & Robie, 2004).
Although not as common as dichotomous and polytomous response outcomes, continuous response outcomes are encountered more frequently than typically thought in certain educational and behavioral settings. For instance, curriculum-based measurement (CBM) is frequently used to assess students’ reading performance across time and to evaluate instructional effects within the context of special education (Deno, 1985, 1986). CBM procedures are typically continuous performance tasks that result in rate-based outcomes. For example, CBM-Reading assessments prompt students to read one or more passages aloud for 1 minute each, and the outcome is the number of words read correctly per minute. CBM-Maze and CBM-Cloze assessments are tools used to assess reading comprehension. In CBM-Maze passages, the first sentence is left intact; afterward, every nth word is deleted and replaced with three alternative words, one of which is the original correct word and two of which are incorrect. Students are given 2.5 minutes to read the passage and circle the correct word for each blank as it fits the meaning of that piece of the passage. Skipped blanks are counted as incorrect, and the total score is the number of correct replacements (Parker, Hasbrouck, & Tindal, 1992; Wright, 2013). It has been estimated that 1,000,000 or more students are exposed to some form of CBM assessment annually (Christ, Zopluoglu, Monaghen, & Van Norman, 2013). The C-test (Grotjahn, 2014; Klein-Braley & Raatz, 1984), which is similar to cloze tests, is commonly used to measure general language proficiency; students are expected to restore about 100 incomplete words in one or more reading passages, and the outcome is the number of correctly restored words.
In visual analog rating scales, respondents are given a continuous line drawn between two antonyms and asked to mark the point on the line that represents their view on a statement. Although rating scale items with discrete response options are popular, visual analog rating scales are also used in education (e.g., Chafouleas, Christ, Riley-Tillman, Briesch, & Chanese, 2007; Chafouleas, Riley-Tillman, & McDougal, 2002; Christ, Riley-Tillman, Chafouleas, & Boice, 2010), psychology (e.g., Andersson & Yardley, 2000; Brumfitt & Sheeran, 1999; Lineweaver & Hertzog, 1998; Pietrzak, Laird, Stevens, & Thompson, 2002; Studer, 2012), and clinical research (e.g., Heacock, Hertzler, Williams, & Wolf, 2005; Micallef et al., 2001; van den Berg-Emons, Bussman, Balk, & Stam, 2005). One limitation of visual analog rating scales that potentially restricts their use in practical research is the scoring procedure, which is cost-intensive and laborious. However, web-based surveys have become more accessible today, and many online survey websites provide visual analog rating scales as an item type and allow a flexible environment for easy application. Regardless of whether they are more useful than rating scale items with discrete response options, researchers are likely to use them in practice.
When the measurement outcome is continuous and there is latent heterogeneity in the population, researchers have a couple of options for modeling the response data. First, researchers may be tempted to use a traditional linear factor mixture model (LFMM) as a convenient solution. However, a limitation of a linear model is that it ignores the bounded nature of continuous measurement outcomes, which may yield poor fit in the tails of the latent trait distribution. In addition, data complexity not properly captured by a simpler linear model may lead to the extraction of spurious latent classes. Another alternative is to discretize the continuous data and fit mixture polytomous item response models. However, it is not clear what the consequences are when continuous data are discretized and mixture polytomous item response models are fitted. While both options are available to researchers, neither is optimal.
The current study focuses on a mixture extension of Samejima’s continuous response model (MixCRM) as a forgotten but viable model for continuous measurement outcomes. The single-class continuous response model (CRM) was suggested as a limiting form of the graded response model (Samejima, 1973) and is a more useful model in this setting because it takes the bounded nature of continuous measurement outcomes into account. Although there have been a few applications in the context of a homogeneous population (Bejar, 1977; Ferrando, 2002; T. Wang & Zeng, 1998), the CRM has not received as much attention in the literature as dichotomous and polytomous item response models. In this article, a useful framework for estimating the parameters of the MixCRM is introduced using a data set from a previous study (Ferrando, 2002), and it is shown that two classes with both qualitative and quantitative differences can be identified. A follow-up simulation study was run to investigate model convergence and the quality of parameter estimation for the proposed estimation framework.
Mixture Continuous Response Model
The MixCRM is an extension of the reparameterized version of Samejima’s single-class CRM. The purpose of the reparameterization, as presented by T. Wang and Zeng (1998), was “to create an item difficulty parameter on the same scale with the latent trait parameter and to have the original rating scale scores directly entering the model” (p. 335). This reparameterization also aligns better with the parameterization of commonly used dichotomous and polytomous item response models for readers familiar with those models. Suppose that the observed item response data originate from G subpopulations, indexed g = 1, …, G. The probability of the ith examinee in class g obtaining a score of x or higher on the jth item is equal to
$$P\left(X_{ij} \ge x \mid \theta_{gi}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_g} e^{-t^{2}/2}\, dt \qquad (1)$$

and $z_g$ is defined as

$$z_g = a_{gj}\left(\theta_{gi} - b_{gj} - \frac{1}{\alpha_{gj}} \ln \frac{x}{k_j - x}\right) \qquad (2)$$
where $\theta_{gi}$ is the ability level of the ith examinee in class g; $a_{gj}$, $b_{gj}$, and $\alpha_{gj}$ are the class-specific discrimination, difficulty, and scaling parameters, respectively, for item j; and $k_j$ is the maximum possible score on item j.
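To make the model concrete, the following minimal Python sketch evaluates Equations 1 and 2 for a single item. The function name and the parameter values in the example are illustrative assumptions, not taken from the article.

```python
import numpy as np
from scipy.stats import norm

def crm_prob_at_least(x, theta, a, b, alpha, k):
    """P(X >= x | theta) under the reparameterized CRM (Equations 1-2).

    x           : observed continuous score, 0 < x < k
    theta       : latent trait value
    a, b, alpha : discrimination, difficulty, and scaling parameters
    k           : maximum possible score on the item
    """
    z = a * (theta - b - (1.0 / alpha) * np.log(x / (k - x)))
    return norm.cdf(z)  # the standard normal integral in Equation 1

# Example: an average respondent (theta = 0) at the midpoint of a 112-point item
print(crm_prob_at_least(x=56, theta=0.0, a=1.5, b=0.0, alpha=2.0, k=112))  # 0.5
```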
The parameters of the MixCRM can be estimated using a full-information approach by extending the traditional marginal maximum likelihood estimation via the expectation–maximization (EM) algorithm (MML-EM) proposed by T. Wang and Zeng (1998) for the single-class CRM. Alternatively, a Bayesian approach can be used, as has often been done for fitting mixture dichotomous and polytomous IRT models in the literature. While these estimation approaches are potentially available, they are less accessible to applied researchers: there is no available software for fitting this model using a full-information approach, so one would have to write a computer program, which is a daunting task, and fitting a mixture IRT model using a Bayesian approach requires additional expertise and familiarity with Bayesian tools such as WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000) or Stan (Stan Development Team, 2018). On the other hand, the relationship between the parameters of the LFMM and the MixCRM, as previously shown by Ferrando (2002), allows a more convenient estimation approach first proposed and used by Bejar (1977) for the single-class CRM. The availability, flexibility, and popularity of the software Mplus (L. O. Muthén & Muthén, 1998-2012) make this approach viable and accessible to many practitioners for estimating the parameters of the MixCRM. Another significant advantage of Mplus is that multiple sets of start values with random perturbations can be used to monitor whether a global solution is obtained (B. E. Muthén, 2003).
In this approach, instead of modeling the continuous-bounded observed response $X_{ij}$, suppose one models a continuous-unlimited response $X^*_{ij}$ as

$$X^*_{ij} = \zeta_{gj} + \lambda_{gj}\theta_{gi} + \varepsilon_{ij} \qquad (3)$$
where $\lambda_{gj}$ is the factor loading and $\zeta_{gj}$ is the intercept parameter for the jth item within the gth class, and the relationship between $X^*_{ij}$ and $X_{ij}$ is assumed to hold as

$$X^*_{ij} = \ln\!\left(\frac{X_{ij}}{k_j - X_{ij}}\right) \qquad (4)$$
This is simply a logit transformation of the observed continuous scores. Ferrando (2002) showed that the conditional distribution of the transformed scores, $f(x^*_{ij} \mid \theta_{gi})$, is

$$f(x^*_{ij} \mid \theta_{gi}) = \frac{1}{\sqrt{2\pi\,\psi_{jj|g}}}\,\exp\!\left(-\,\frac{\left(x^*_{ij} - \zeta_{gj} - \lambda_{gj}\theta_{gi}\right)^{2}}{2\,\psi_{jj|g}}\right) \qquad (5)$$
where $\psi_{jj|g}$ is the residual variance for the jth item within the gth class, and the conditional distribution in Equation 5 can be rewritten by applying the following parameter transformations:

$$a_{gj} = \frac{\lambda_{gj}}{\sqrt{\psi_{jj|g}}}, \qquad b_{gj} = -\,\frac{\zeta_{gj}}{\lambda_{gj}}, \qquad \alpha_{gj} = \lambda_{gj} \qquad (6)$$
Then, it becomes

$$f(x^*_{ij} \mid \theta_{gi}) = \frac{a_{gj}}{\alpha_{gj}\sqrt{2\pi}}\,\exp\!\left(-\,\frac{a_{gj}^{2}}{2}\left(\theta_{gi} - b_{gj} - \frac{x^*_{ij}}{\alpha_{gj}}\right)^{2}\right) \qquad (7)$$
which is identical to the operating density function given by Samejima (1973).
In this convenient approach, the observed continuous item scores ($X_{ij}$) are first transformed to logits ($X^*_{ij}$) as defined in Equation 4, and a traditional LFMM is fitted to the transformed responses to estimate the model parameters. Then, the parameters of the MixCRM are approximated using Equation 6. This procedure is analogous to approximating the item parameters of the two-parameter dichotomous IRT model by a linear factor analysis of a tetrachoric correlation matrix obtained from dichotomous items (Christoffersson, 1975; B. E. Muthén, 1978).
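A hedged sketch of these two steps, the logit transformation in Equation 4 and the parameter conversion in Equation 6, is given below. The function names, the boundary adjustment `eps`, and the example values are illustrative assumptions rather than part of the original procedure.

```python
import numpy as np

def to_logit(x, k, eps=0.5):
    """Equation 4: logit-transform bounded scores in (0, k).
    Scores at the bounds are shrunk by eps to keep the logit finite
    (an ad hoc adjustment; the article does not specify one)."""
    x = np.clip(np.asarray(x, dtype=float), eps, k - eps)
    return np.log(x / (k - x))

def lfmm_to_crm(lam, zeta, psi):
    """Equation 6: map LFMM estimates (loading, intercept, residual
    variance) to CRM parameters (a, b, alpha) within one class."""
    a = lam / np.sqrt(psi)
    b = -zeta / lam
    alpha = lam
    return a, b, alpha

# Example with hypothetical within-class estimates for one item
print(lfmm_to_crm(lam=2.0, zeta=-1.2, psi=1.0))  # a = 2.0, b = 0.6, alpha = 2.0
```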
An Illustration of Mixture Continuous Response Model
Data Set
The data come from a published study and include five items taken from a Spanish version of the Eysenck Personality Inventory Impulsivity Scale. A group of 1,033 undergraduate students were asked to make a check mark on a 112 mm line segment with two end points (almost never vs. almost always) using their own judgment for five statements such as “I shout back when shouted at” or “I tend to do many things at the same time.” The direct item score was the distance in mm of the check mark from the left end point (Ferrando, 2002). Descriptive item statistics for this data set are reported in Table 1.
Table 1.
Descriptive Item Statistics.
| Item | N | M | SD | Minimum | Maximum | Skewness | Kurtosis | Point-biserial |
|---|---|---|---|---|---|---|---|---|
| 1 | 1,033 | 62.83 | 29.65 | 1 | 111 | −0.20 | −0.96 | 0.312 |
| 2 | 1,033 | 36.77 | 26.65 | 1 | 111 | 0.87 | 0.08 | 0.331 |
| 3 | 1,033 | 54.03 | 35.52 | 1 | 111 | 0.03 | −1.37 | 0.237 |
| 4 | 1,033 | 48.57 | 30.54 | 1 | 111 | 0.29 | −1.00 | 0.381 |
| 5 | 1,033 | 49.70 | 31.43 | 1 | 111 | 0.26 | −1.11 | 0.339 |
Estimating the MixCRM Parameters
The MixCRM item parameters for a single class and multiple classes were estimated using Mplus 7.11 (L. O. Muthén & Muthén, 1998-2012), and the Mplus syntax is provided as an appendix (supplementary material available online) for interested readers. A minimum of 500 sets of random starting values were used for the initial stage estimation, and the 30 best sets were re-iterated until convergence for the final stage estimation. The number of random starting value sets was increased as necessary, up to 30,000, depending on model complexity, until the best log-likelihood was replicated more than once. A model was considered converged if the estimation terminated normally with no warning messages and the best log-likelihood value was replicated more than once at the final stage estimation. Model fitting was stopped either when a model with more classes did not provide better fit or when the best log-likelihood could not be replicated for a model with more classes even after 30,000 random start value sets. Both the Bayesian information criterion (BIC) and the bootstrap log-likelihood ratio test (BLRT) were used to compare the overall fit of models with increasing numbers of latent classes (Leroux, 1992; Li, Cohen, Kim, & Cho, 2009; Nylund, Asparouhov, & Muthén, 2007; Roeder & Wasserman, 1997). The MLR estimator was used in all estimations.
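As a point of reference, BIC here follows the usual definition, −2 log L + p ln N. The following snippet, with an illustrative function name, reproduces the one-class entry reported in Table 2 below.

```python
import math

def bic(loglik, n_params, n):
    """Bayesian information criterion: -2*logL + p*ln(N)."""
    return -2.0 * loglik + n_params * math.log(n)

# Reproduces the 1-class entry of Table 2: logL = -9957.0, 15 parameters, N = 1,033
print(round(bic(-9957.0, 15, 1033), 1))  # 20018.1
```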
One issue in fitting mixture item response models is model identification and scale comparability across classes, as discussed in detail by Paek and Cho (2015). A class-invariant approach suggested by Paek and Cho (2015) was used to ensure model identification and comparable item parameters across classes. The mean and variance of one class were fixed at zero and one, respectively, while the means and variances of the other classes were estimated, allowing the classes to differ both qualitatively and quantitatively. In addition, one of the items was treated as an anchor item, and its parameters (a, b, and α) were fixed across classes to ensure model identification and comparable scales, while the parameters of all other items were allowed to vary across classes, an approach used in previous studies (Cho, Cohen, & Bottge, 2013; Cho, Cohen, Kim, & Bottge, 2010; von Davier & Yamamoto, 2004). To select an anchor item, a statistical approach similar to that of Cho et al. (2010, 2013) was used: after determining the number of classes, a series of models with the same number of classes is fitted, constraining the item parameters across classes for one item at a time, and the likelihood estimates from these models are compared with a reference model in which the item parameters of all items are allowed to vary across classes. In this way, each item is separately tested to determine whether it can be treated as a class-invariant item. Both BIC and the BLRT were used to compare the various models fitted to the data set. For the BLRT, 500 data sets were simulated based on the constrained model, and both the constrained and full models were fitted to the simulated data sets to derive an empirical distribution for the observed difference in deviance between the constrained and full models under the assumption that the true model is the constrained model.
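The BLRT logic described above reduces to comparing the observed deviance difference with its bootstrap distribution. A minimal sketch follows; the add-one correction is one common convention assumed here, not necessarily the exact computation used in the article (with 500 bootstrap samples it gives a smallest attainable p value of about .002, consistent with the "<.002" entries reported below).

```python
import numpy as np

def blrt_pvalue(observed_diff, boot_diffs):
    """Empirical p value for a bootstrap likelihood ratio test.

    observed_diff : deviance difference, 2*(logL_full - logL_constrained),
                    computed from the real data.
    boot_diffs    : deviance differences from data sets simulated under the
                    constrained model (500 in the article).
    """
    boot_diffs = np.asarray(boot_diffs)
    # Add-one correction (an assumption): proportion of bootstrap
    # differences at least as large as the observed difference.
    return (np.sum(boot_diffs >= observed_diff) + 1) / (len(boot_diffs) + 1)
```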
Results
The overall fit of the models with different numbers of classes is presented in Table 2, and the BLRT results are presented in Table 3 and Figure 1. Both BIC and the empirical p values based on the BLRT suggested that the best fitting model was a two-class solution with all item parameters allowed to vary across classes. On the other hand, this solution was not very useful because the scales are not comparable in the absence of an anchor item; theoretically, at least one class-invariant item is needed to ensure scale comparability. Therefore, additional two-class models were fitted by constraining the parameters to be equal across classes for one item at a time. Based on BIC and the empirical p values from the BLRT, none of the constrained two-class models provided better or equal fit compared with the unconstrained two-class model, suggesting no potential anchor item. However, while acknowledging the limitations, a two-class model with the parameters of Item 2 constrained to be equal across classes was chosen for further analysis because its fit was closest to that of the unconstrained two-class model, and at least one anchor item was necessary to obtain meaningful comparisons across classes. The reader should note that this two-class model with Item 2 constrained still provided a much better fit than the one-class model. The mean and variance for the first class (N = 316) were set to 0 and 1, respectively, and the mean and variance for the second class (N = 717) were estimated as 0.19 and 0.06, respectively. Table 4 summarizes the descriptive statistics for the maximum a posteriori (MAP) factor score estimates provided by Mplus and their rescaled versions based on the class means and variances set or estimated in the analysis. Figure 2 shows the distribution of the rescaled MAP factor score estimates for both classes, indicating that the two classes also differ quantitatively.
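The rescaling of the MAP factor scores mentioned above is not spelled out in detail; one plausible reading consistent with the values in Table 4 is to standardize the scores within a class and then rescale them to that class's model-implied mean and variance, as in this hypothetical sketch.

```python
import numpy as np

def rescale_scores(scores, target_mean, target_var):
    """Standardize within-class MAP factor scores, then rescale them to the
    mean and variance set or estimated for that class (assumed procedure)."""
    z = (scores - np.mean(scores)) / np.std(scores)
    return target_mean + np.sqrt(target_var) * z

# For the second class, the estimated mean and variance were 0.19 and 0.06.
```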
Table 2.
Model Fit for Mixture Continuous Response Models.
| Model | Log-likelihood | No. of parameters | BIC | Class proportions | Entropy | Average posterior probability: Class 1 | Average posterior probability: Class 2 |
|---|---|---|---|---|---|---|---|
| 1-class | −9957.0 | 15 | 20018.1 | | | | |
| 2-class^a | −9630.6 | 31 | 19476.4 | [0.66,0.34] | 0.578 | 0.876 | 0.885 |
| 2-class1 | −9655.6 | 30 | 19519.4 | [0.70,0.30] | 0.601 | 0.886 | 0.890 |
| 2-class2 | −9643.5 | 30 | 19495.1 | [0.69,0.31] | 0.595 | 0.896 | 0.881 |
| 2-class3 | −9656.4 | 30 | 19521.0 | [0.71,0.29] | 0.603 | 0.891 | 0.889 |
| 2-class4 | −9644.8 | 30 | 19497.8 | [0.69,0.31] | 0.591 | 0.886 | 0.885 |
| 2-class5 | −9643.9 | 30 | 19495.9 | [0.69,0.31] | 0.592 | 0.888 | 0.884 |
Note. BIC = Bayesian information criterion. 2-classi refers to a model in which all parameters for the ith item were constrained to be the same across classes while the parameters of the other items were allowed to vary across classes. The models beyond two classes did not have the best log-likelihood replicated even after 30,000 starting value sets.
^a This model is fully unconstrained: item parameters were allowed to vary across the two classes for all items. For model identification purposes, the mean and variance of the latent scores were set to zero and one for both classes.
Table 3.
Model Comparison Using the Bootstrap Log-Likelihood Ratio Test (BLRT).
| Constrained model | Full model | Observed difference in deviance | Maximum difference^b | BLRT p value |
|---|---|---|---|---|
| 1-class | 2-class^a | 652.8 | 119.9 | <.002 |
| 2-class1 | 2-class^a | 50.0 | 12.8 | <.002 |
| 2-class2 | 2-class^a | 25.8 | 9.9 | <.002 |
| 2-class3 | 2-class^a | 51.6 | 9.7 | <.002 |
| 2-class4 | 2-class^a | 28.4 | 12.1 | <.002 |
| 2-class5 | 2-class^a | 26.6 | 8.9 | <.002 |
Note. 2-classi refers to a model in which the parameters for the ith item were constrained to be the same across classes while the parameters of the other items were allowed to vary across classes.
^a This model is fully unconstrained: item parameters were allowed to vary across the two classes for all items. For model identification purposes, the mean and variance of the latent scores were set to zero and one for both classes.
^b The maximum difference in deviance observed between the constrained model and the full model across the 500 data sets simulated under the assumption that the true model is the constrained model.
Figure 1.
The empirical distributions of the difference in deviance between the 2-class unconstrained model and each constrained model, based on 500 simulated data sets.
Table 4.
Descriptive Statistics for the Maximum a Posteriori (MAP) Factor Scores Estimated by Mplus and the Factor Scores Rescaled to the Estimated Class Means and Variances for the Mixture Continuous Response Model.
| Class | N | M | SD | Minimum | Maximum | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|
| MAP factor scores | |||||||
| 1 | 316 | −0.01 | 0.95 | −1.74 | 2.68 | 0.51 | 0.02 |
| 2 | 717 | 0.18 | 0.23 | −0.72 | 0.90 | −0.05 | 0.73 |
| Rescaled factor scores | |||||||
| 1 | 316 | 0.00 | 1.00 | −1.82 | 2.84 | 0.51 | 0.02 |
| 2 | 717 | 0.19 | 0.25 | −0.78 | 0.97 | −0.05 | 0.73 |
Figure 2.

The distribution of the rescaled maximum a posteriori factor score estimates for the two-class solution of the mixture continuous response model.
To examine the nature of the qualitative differences between the two classes, the distribution of responses for each item was examined within each class. Figure 3 shows how the distribution of scores on each item differs between the classes. The U-shaped pattern for Class 1 indicates that the respondents in this class tended to use both ends of the scale more frequently, while the respondents in Class 2 tended to use the whole scale in a more balanced way. One can also look at the relationships among item responses within each class, as shown in Figure 4. The correlations among items for Class 1 were much lower, and the points were more spread out, compared with Class 2. The respondents in Class 1 were more likely to respond at either end of the scale, but not necessarily because of their trait level, because their response behavior was not consistent across items. This suggests noisier, more inconsistent response behavior for the respondents in Class 1 compared with Class 2, which is reflected in the discrimination parameters estimated for each latent class. Item parameters estimated for each class are presented in Table 5. There was a large difference in the a parameters between the two classes: items had very poor discrimination for Class 1 and very high discrimination for Class 2, indicating that the responses obtained from Class 1 are not very informative about the construct being measured. The large difference in the α parameters between the classes reflects the difference in the variability of the observed scores, as also shown in Figure 2. In terms of the b parameters, there was a significant difference only for Item 1; the b parameters for the other items were very similar across classes.
Figure 3.
The distribution of item scores within the predicted latent classes.
Figure 4.
Correlations among the items within each predicted latent class.
Table 5.
Item Parameter Estimates for Mixture Continuous Response Model.
| Item | Class 1: a | Class 1: b | Class 1: α | Class 2: a | Class 2: b | Class 2: α |
|---|---|---|---|---|---|---|
| 1 | 0.21 (0.07) | −1.25 (0.52) | 0.45 (0.15) | 2.15 (0.39) | 0.07 (0.07) | 2.23 (0.39) |
| 2 | **2.25 (0.19)** | **0.61 (0.08)** | **1.96 (0.11)** | **2.25 (0.19)** | **0.61 (0.08)** | **1.96 (0.11)** |
| 3 | 0.15 (0.07) | 0.18 (0.41) | 0.41 (0.19) | 1.55 (0.37) | 0.27 (0.08) | 2.32 (0.50) |
| 4 | 0.28 (0.07) | 0.39 (0.22) | 0.64 (0.16) | 3.94 (0.84) | 0.33 (0.09) | 3.09 (0.55) |
| 5 | 0.32 (0.07) | 0.30 (0.19) | 0.75 (0.16) | 3.16 (0.59) | 0.32 (0.09) | 2.86 (0.48) |
Note. The numbers in parentheses are the corresponding standard errors. Class 1: N = 316; Class 2: N = 717. Item 2 (shown in boldface) is the anchor item.
A Simulation Study for Model Convergence and Parameter Recovery
Study Design
With the findings from the empirical data set in mind, a simulation study was designed and run to examine the performance of the heuristic estimation approach, which utilizes limited-information factor analysis, in terms of model convergence and parameter recovery for the MixCRM. The number of items was fixed at five, as in the empirical illustration, and the number of classes was fixed at two. The manipulated conditions were sample size (N) and class mixing proportion (φg). Sample size was manipulated at three levels, 250, 500, and 1,000, to examine the performance of estimation for small, medium, and large samples. Three levels of class mixing proportion were considered: φ = (0.50, 0.50), φ = (0.65, 0.35), and φ = (0.80, 0.20). The combination of manipulated conditions yielded a total of nine cells (3 × 3) in the simulation design. For each condition, 500 replications were generated.
Data Generation
The probability density function of the logit-transformed observed score of the ith person in the gth class on the jth item, $X^*_{ij}$, is a normal density with a mean of $\zeta_{gj} + \lambda_{gj}\theta_{gi}$ and a variance of $\psi_{jj|g}$ (Equation 5). First, ability parameters for a given sample size within each class were randomly generated from a standard normal distribution. Then, given the generated ability parameters and the class-specific item parameters, $X^*_{ij}$ was drawn from a normal distribution with a mean of $\zeta_{gj} + \lambda_{gj}\theta_{gi}$ and a variance of $\psi_{jj|g}$. As in the real data illustration, each item was assumed to have a score scale between 0 and 112, so $X^*_{ij}$ was transformed to the observed score using the equation $X_{ij} = k_j e^{X^*_{ij}} / (1 + e^{X^*_{ij}})$. The item parameters from the two-class unconstrained model were used as generating values.
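The generation steps can be sketched as follows. The helper name and the parameter values shown are illustrative placeholders; the study used the class-specific estimates from the two-class unconstrained model as generating values.

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_class_data(n, a, b, alpha, k=112):
    """Simulate bounded item scores for one latent class under the CRM."""
    theta = rng.normal(0.0, 1.0, size=n)        # abilities ~ N(0, 1)
    lam, zeta = alpha, -alpha * b               # invert Equation 6
    psi = (alpha / a) ** 2                      # residual variances
    mean = zeta + np.outer(theta, lam)          # conditional means (Equation 5)
    x_star = rng.normal(mean, np.sqrt(psi))     # logit-scale responses
    return k * np.exp(x_star) / (1.0 + np.exp(x_star))  # back to the (0, k) scale

# Illustrative item parameters for one class of a five-item test
a = np.array([2.2, 1.5, 3.9, 3.2, 2.1])
b = np.array([0.6, 0.3, 0.3, 0.3, 0.1])
alpha = np.array([2.0, 2.3, 3.1, 2.9, 2.2])
scores = generate_class_data(n=500, a=a, b=b, alpha=alpha)
```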
Model Fitting
The number of classes was assumed to be known and fixed at two, the true number of classes used to generate the data. The mean and variance of the latent scores for each latent class were fixed at zero and one, respectively, the same values used to generate the data. A fully unconstrained model, in which each item has its own set of parameters in each class, was fitted to each simulated data set using Mplus 7.11 (L. O. Muthén & Muthén, 1998-2012). A minimum of 1,000 sets of random starting values were used for the initial stage estimation, and the 10 best sets were re-iterated until convergence for the final stage estimation.
Label Switching
As is well established in the mixture modeling literature, label switching is always an issue to consider in simulation studies of mixture models. In the current study, a post hoc class assignment check using the column maxima switched label detection algorithm proposed by Tueller, Drotar, and Lubke (2011) was applied to determine whether the labels were switched. When label switching was identified for a replication, the final parameter estimates were repositioned in the correct order before computing the outcome measures.
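A simplified stand-in for this relabeling check, not the full column-maxima algorithm of Tueller et al. (2011), is to compare the squared distance between the estimated and true class-specific parameter vectors under the original and swapped labelings:

```python
import numpy as np

def labels_switched(est1, est2, true1, true2):
    """Return True if the two estimated class parameter vectors match the
    true vectors better after swapping labels (simplified heuristic)."""
    d_same = np.sum((est1 - true1) ** 2) + np.sum((est2 - true2) ** 2)
    d_swap = np.sum((est1 - true2) ** 2) + np.sum((est2 - true1) ** 2)
    return d_swap < d_same  # if True, relabel before computing outcomes
```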
Outcome Measures
To evaluate the performance of the heuristic estimation procedure under different settings, the following outcomes were used: convergence rate, average parameter bias, root mean squared error (RMSE), and the accuracy of class prediction. Convergence rate is the proportion of converged replications; a replication was considered converged if the model estimation terminated normally and the best log-likelihood value was replicated more than once at the final stage estimation. The average bias of a given parameter for a replication is the average difference between the estimated parameter values and the corresponding true values across items. The RMSE of a given parameter for a replication is the square root of the average squared difference between the estimated parameter values and the corresponding true values across items. The accuracy of class prediction was measured as the proportion of correctly classified subjects.
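These definitions translate directly into code; a minimal sketch with illustrative function names:

```python
import numpy as np

def average_bias(est, true):
    """Average signed difference across items for one parameter type."""
    return np.mean(np.asarray(est) - np.asarray(true))

def rmse(est, true):
    """Root mean squared error across items for one parameter type."""
    return np.sqrt(np.mean((np.asarray(est) - np.asarray(true)) ** 2))

def class_accuracy(predicted, true):
    """Proportion of respondents assigned to their true latent class."""
    return np.mean(np.asarray(predicted) == np.asarray(true))
```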
Results of Simulation Study
The results of the simulation study are summarized in Table 6. The model was successfully fitted under most conditions, particularly when the sample sizes were 500 and 1,000. The most problematic condition was the combination of the small sample size of 250 and the class mixing proportions of φ = (0.80, 0.20), for which the proportion of converged replications was 0.568. Bias was slightly higher for all parameters when the sample size was 250 compared with the other sample size conditions, with a maximum bias of 0.062 for parameter a, 0.040 for parameter b, and 0.031 for parameter α. The bias in item parameters was negligible when the sample size was 500 or 1,000. The mixing proportion did not appear to influence RMSE; RMSE values were similar across the mixing proportion conditions. As expected, increasing the sample size yielded lower RMSE values. The average accuracy of class prediction was above 85% for all conditions in the study. Sample size did not appear to be an important factor influencing class prediction accuracy, while there was a slight increase in accuracy as the mixing proportions became more unequal. In summary, the model was successfully fitted under most conditions, and reliable parameter estimates were obtained using the heuristic approach.
Table 6.
Results of Simulation Study.
| Mixing proportion | Sample size | Convergence | Bias: a | Bias: b | Bias: α | RMSE: a | RMSE: b | RMSE: α | Class prediction accuracy: Minimum | Class prediction accuracy: Maximum | Class prediction accuracy: Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 250 | 0.810 | 0.038 | 0.015 | 0.005 | 0.240 | 0.449 | 0.247 | 0.73 | 0.92 | 0.85 |
| | 500 | 0.992 | 0.018 | 0.012 | 0.000 | 0.156 | 0.267 | 0.173 | 0.83 | 0.91 | 0.86 |
| | 1,000 | 1.000 | 0.009 | 0.010 | 0.001 | 0.100 | 0.166 | 0.118 | 0.84 | 0.90 | 0.87 |
| 0.65 | 250 | 0.798 | 0.040 | 0.040 | 0.010 | 0.232 | 0.444 | 0.251 | 0.80 | 0.93 | 0.88 |
| | 500 | 0.966 | 0.015 | 0.002 | −0.002 | 0.151 | 0.310 | 0.181 | 0.84 | 0.93 | 0.89 |
| | 1,000 | 1.000 | 0.008 | 0.002 | 0.001 | 0.097 | 0.237 | 0.126 | 0.85 | 0.92 | 0.89 |
| 0.80 | 250 | 0.568 | 0.062 | 0.027 | 0.031 | 0.273 | 0.482 | 0.288 | 0.86 | 0.96 | 0.92 |
| | 500 | 0.876 | 0.021 | 0.013 | 0.003 | 0.169 | 0.353 | 0.211 | 0.88 | 0.95 | 0.92 |
| | 1,000 | 0.976 | 0.007 | 0.012 | −0.006 | 0.111 | 0.270 | 0.154 | 0.90 | 0.95 | 0.92 |
Note. RMSE = root mean squared error. Bias and RMSE are averaged across all items and classes.
Conclusion
When one encounters a set of items or tasks with continuous outcomes, a linear factor model is a natural and popular choice. However, a linear model does not take the bounded nature of continuous response outcomes into account, and there is no reason to make the blind assumption that the relationship between the observed indicators and the latent trait is linear, an assumption one does not easily make for dichotomous and polytomous response outcomes. Samejima’s CRM, an unpopular model left in the shadows, should not be easily discarded as an alternative response model for continuous outcomes.
The current study focused on a mixture extension of Samejima’s single-class CRM as a nonlinear alternative to the more popular LFMM for continuous measurement outcomes. The heuristic estimation approach using limited-information factor analysis to fit the MixCRM is easily accessible to practitioners through Mplus, as demonstrated in the current article. The parameters of the proposed MixCRM could also be estimated using full-information maximum likelihood estimation or a Bayesian approach. The heuristic estimation approach was successful under most conditions included in the simulation study. However, there may be other suboptimal conditions under which the heuristic estimation is not efficient, such as a combination of small sample size and extreme class mixing proportions; future research is needed to evaluate the effectiveness of other estimation approaches under such conditions.
A response model that assumes a single homogeneous population yields inaccurate and misleading inferences about the respondents under investigation when there are subpopulations in the sample. Mixture models relax this assumption and allow for latent subpopulations when modeling response outcomes; when subpopulations exist, these models provide more valid and less biased inferences about the respondents on the construct being measured. For instance, in the current study, a subgroup of examinees with noisier response behavior was identified. By eliminating this subgroup from the sample, more reliable and valid estimates could be obtained for the remaining examinees.
Another advantage of the MixCRM is that it can be used for rating scale items with a large number of response categories (e.g., 10) as an alternative to mixture polytomous IRT models. As the number of response categories increases, fitting mixture polytomous IRT models becomes more challenging because of the large number of parameters in the model, particularly when there are not enough individuals within each category. In such cases, the MixCRM can be a simpler alternative if the rating scale categories are treated as continuous.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Cengiz Zopluoglu, https://orcid.org/0000-0002-9397-0262
Supplemental Material: Supplemental material for this article is available online.
References
- Andersson G., Yardley L. (2000). Time-series analysis of the relationship between dizziness and stress. Scandinavian Journal of Psychology, 41, 49-54.
- Bejar I. I. (1977). An application of the continuous response level model to personality measurement. Applied Psychological Measurement, 1, 509-521.
- Bolt D. M., Cohen A. S., Wollack J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39, 331-348.
- Brumfitt S. M., Sheeran P. (1999). The development and validation of the visual analogue self-esteem scales (VASES). British Journal of Clinical Psychology, 38, 387-400.
- Chafouleas S. M., Christ T. J., Riley-Tillman T. C., Briesch A. M., Chanese J. A. (2007). Generalizability and dependability of daily behavior report cards to measure social behavior of pre-schoolers. School Psychology Review, 36, 63-79.
- Chafouleas S. M., Riley-Tillman T. C., McDougal J. (2002). Good, bad, or in-between: How does the daily behavior report card rate? Psychology in the Schools, 39, 157-169.
- Cho S.-J., Cohen A. S., Bottge B. (2013). Detecting intervention effects using a multilevel latent transition analysis with a mixture IRT model. Psychometrika, 78, 576-600.
- Cho S.-J., Cohen A. S., Kim S.-H., Bottge B. (2010). Latent transition analysis with a mixture item response theory measurement model. Applied Psychological Measurement, 34, 483-504.
- Christ T. J., Riley-Tillman T. C., Chafouleas S. M., Boice C. H. (2010). Direct Behavior Rating (DBR): Generalizability and dependability across raters and observations. Educational and Psychological Measurement, 70, 825-843.
- Christ T. J., Zopluoglu C., Monaghen B. D., Van Norman E. R. (2013). Curriculum-based measurement of oral reading: Multi-study evaluation of schedule, duration, and dataset quality on progress monitoring outcomes. Journal of School Psychology, 51, 19-57.
- Christoffersson A. (1975). Factor analysis of dichotomized variables. Psychometrika, 40, 5-32.
- Deno S. L. (1985). Curriculum-based measurement: The emerging alternative. Exceptional Children, 52, 219-232.
- Deno S. L. (1986). Formative evaluation of individual student programs: A new role for school psychologists. School Psychology Review, 15, 358-374.
- Ferrando P. J. (2002). Theoretical and empirical comparisons between two models for continuous item responses. Multivariate Behavioral Research, 37, 521-542.
- Grotjahn R. (2014). The C-test bibliography: Version January 2014. In Grotjahn R. (Ed.), Der C-Test: Aktuelle Tendenzen/The C-test: Current trends (pp. 137-162). Frankfurt am Main, Germany: Peter Lang.
- Heacock P. M., Hertzler S. R., Williams J. A., Wolf B. W. (2005). Effects of a medical food containing an herbal α-glucosidase inhibitor on postprandial glycemia and insulinemia in healthy adults. Journal of the American Dietetic Association, 105, 65-71.
- Hernandez A., Drasgow F., Gonzalez-Roma V. (2004). Investigating the functioning of a middle category by means of a mixed-measurement model. Journal of Applied Psychology, 89, 687-699.
- Klein-Braley C., Raatz U. (1984). A survey of research on the C-Test. Language Testing, 1, 134-146.
- Leroux B. G. (1992). Consistent estimation of a mixing distribution. Annals of Statistics, 20, 1350-1360.
- Li F., Cohen A. S., Kim S.-H., Cho S.-J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33, 353-373.
- Lineweaver T. T., Hertzog C. (1998). Adults’ efficacy and control beliefs regarding memory and aging: Separating general from personal beliefs. Aging, Neuropsychology, and Cognition, 5, 264-296.
- Lunn D. J., Thomas A., Best N., Spiegelhalter D. (2000). WinBUGS: A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325-337.
- McIntyre H. H. (2011). Investigating response styles in self-report personality data via a joint structural equation mixture modeling of item responses and response times. Personality and Individual Differences, 50, 597-602.
- Meij A. M. M.-d., Kelderman H., van der Flier H. (2008). Fitting a mixture item response theory model to personality questionnaire data: Characterizing latent classes and investigating possibilities for improving prediction. Applied Psychological Measurement, 32, 611-631.
- Meyer J. P. (2010). A mixture Rasch model with item response time components. Applied Psychological Measurement, 34, 521-538.
- Micallef J., Soubrouillard C., Guet F., Le Guern M. E., Alquier C., Bruguerolle B., Blin O. (2001). A double blind parallel group placebo controlled comparison of sedative and amnesic effects of etifoxine and lorazepam in healthy subjects. Fundamental & Clinical Pharmacology, 15, 209-216.
- Mislevy R. J., Levy R., Kroopnick M., Rutstein D. (2008). Evidentiary foundations of mixture item response theory models. In Hancock G. R., Samuelsen K. M. (Eds.), Advances in latent variable mixture models (pp. 149-176). Charlotte, NC: Information Age.
- Muthén B. E. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika, 43, 551-560.
- Muthén B. E. (2003). Perturbation of start values. Retrieved from http://www.statmodel.com/download/Starts.pdf
- Muthén B. E. (2004). Mplus: Statistical analysis with latent variables: Technical appendices. Retrieved from http://www.statmodel.com/download/techappen.pdf
- Muthén B. E. (2007). Finite mixture EFA in Mplus. Retrieved from http://www.statmodel.com/download/MixtureEFA1.pdf
- Muthén L. O., Muthén B. (1998-2012). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén.
- Nylund K. L., Asparouhov T., Muthén B. O. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569. doi:10.1080/10705510701575396
- Paek I., Cho S. J. (2015). A note on parameter estimate comparability across latent classes in mixture IRT modeling. Applied Psychological Measurement, 39, 135-143.
- Parker R., Hasbrouck J. E., Tindal G. (1992). The maze as a classroom-based reading measure: Construction methods, reliability, and validity. Journal of Special Education, 26, 195-218.
- Pietrzak R. H., Laird J. D., Stevens D. A., Thompson N. S. (2002). Sex differences in human jealousy: A coordinated study of forced-choice, continuous rating-scale, and psychological responses on the same subjects. Evolution and Human Behavior, 23, 83-94.
- Roeder K., Wasserman L. (1997). Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, 92, 894-902.
- Samejima F. (1973). Homogeneous case of the continuous response model. Psychometrika, 38, 203-219.
- Smit A., Kelderman H., van der Flier H. (2003). Latent trait latent class analysis of an Eysenck personality questionnaire. Methods of Psychological Research, 8(3), 23-50.
- Stan Development Team. (2018). Stan modeling language users guide and reference manual (Version 2.18.0). Retrieved from http://mc-stan.org
- Studer R. (2012). Does it matter how happiness is measured? Evidence from a randomized controlled experiment. Journal of Economic and Social Measurement, 37, 317-336.
- Tueller S. J., Drotar S., Lubke G. H. (2011). Addressing the problem of switched class labels in latent variable mixture model simulation studies. Structural Equation Modeling, 18, 110-131.
- van den Berg-Emons R. J., Bussman J. B., Balk A. H., Stam H. J. (2005). Factors associated with the level of movement-related everyday activity and quality of life in people with chronic heart failure. Physical Therapy, 85, 1340-1348.
- van Nijlen D., Janssen R. (2008). Mixture IRT models as means of DIF detection: Modeling spelling in different grades of primary schools. Leuven, Belgium: Katholieke Universiteit Leuven. Retrieved from https://core.ac.uk/download/pdf/34401068.pdf
- von Davier M., Yamamoto K. (2004). Partially observed mixtures of IRT models: An extension of the generalized partial-credit model. Applied Psychological Measurement, 28, 389-406.
- Wang C., Xu G., Shang Z., Kuncel N. R. (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43, 469-501. doi:10.3102/1076998618767123
- Wang T., Zeng L. (1998). Item parameter estimation for a continuous response model using an EM algorithm. Applied Psychological Measurement, 22, 333-344.
- Wright J. (2013). How to: Assess reading comprehension with CBM: Maze passages. Retrieved from http://www.jimwrightonline.com/mixed_files/lansing_IL/_Lansing_IL_Aug_2013/3_CBA_Maze_Directions.pdf
- Zickar M. J., Gibby R. E., Robie C. (2004). Uncovering faking samples in applicant, incumbent, and experimental data sets: An application of mixed-model item response theory. Organizational Research Methods, 7, 168-190. doi:10.1177/1094428104263674