Abstract
When scales or tests are used to make decisions about individuals (e.g., to identify which adults should be assessed for psychiatric disorders), it is crucial that these decisions be accurate and consistent. However, it is not obvious how to assess accuracy and consistency when the scale was administered only once to a given sample and the true condition based on the latent variable is unknown. This paper describes a method based on the linear factor model for evaluating the accuracy and consistency of scale-based decisions using data from a single administration of the scale. We illustrate the procedure and provide R code that investigators can use to apply the method in their own data. Finally, in a simulation study, we evaluate how the method performs when applied to discrete (vs. continuous) items, a practice that is common in published literature. The results suggest that the method is generally robust when applied to discrete items.
In the social sciences and medicine, tests and scales are often used for selection, such as to find the best candidates for a job, identify students with a minimum level of achievement, or determine which respondents need further psychological assessment. This article describes methods that apply to any selection scenario, but we frame our discussion using the example of screening measures: short assessments used to identify respondents who may have psychiatric disorders. Typically, scale item responses are summed (e.g., Achenbach & Rescorla, 2003; Allison et al., 2012), and then a decision is made by comparing a respondent’s summed score to a cutpoint (Pepe, 2003). When scores are used to make decisions about individuals, it is critical to ensure that the selection process is accurate and consistent (AERA et al., 2014).
Classification accuracy refers to the probability of assigning a respondent to the correct group. It is measured as the agreement between the decision based on the summed score and a reference class, that is, the true condition of the respondent (e.g., a decision determined by a gold standard or by true scores; Lee, 2010). Accurate classifications support the valid use of tests for decision-making (Lathrop, 2015). For example, suppose that a clinician administers the AQ10 (Autism Spectrum Quotient; Allison et al., 2012), and the respondent is referred for further diagnostic assessment for autism spectrum disorders if their score is above a cutpoint. The AQ10 would have high classification accuracy if it correctly distinguishes individuals who are likely to be diagnosed with autism spectrum disorder from those who are not.
Beyond accuracy, classification consistency refers to the probability that a respondent would receive the same classification across repeated administrations of the measure (Gonzalez et al., 2021; Lee, 2010; Livingston & Lewis, 1995). The concept of classification consistency is similar to the test-retest reliability of a classification (Haertel, 2006; Lathrop, 2015). In the example above, the AQ10 would yield inconsistent classifications if a respondent scores above the cutpoint today but would have scored below the cutpoint had they been assessed tomorrow. Note that classification consistency assumes that there is no change in the respondents’ level of the construct (e.g., via maturation, practice effects, carry-over effects, or treatment effects) between the administrations of the measure, so that measurement error is the only factor that can change the classification (Gonzalez et al., 2021). There are many ways to estimate classification accuracy and consistency from measures (Deng, 2011), but in this paper we focus on estimates based on latent variable models.
Recently, model-based estimates using item response theory (IRT) have been used to provide estimates of classification accuracy and consistency in educational and psychiatric settings (e.g., Gonzalez et al., 2021; Gonzalez & Pelham, 2021; Lathrop & Cheng, 2013; Lee, 2010). However, these IRT-based methods fall short for three reasons. First, many researchers (e.g., clinical psychologists) are more familiar with the linear factor model than with item response theory models. Reise and Waller (2009) note that item response theory models are the exception rather than the norm for analyzing clinical assessments, although this is changing. Second, although most scales in the social sciences have discrete responses, some items may have continuous response scales, so the common item response theory models do not apply. Examples of items with continuous response scales include those whose response format is a continuous line segment, a visual analog scale, or time to complete a task (Mellenbergh, 2017). Also, items that have many response categories (Thissen et al., 1983), such as those found in the European Social Survey (Davidov et al., 2008), may be treated as continuous. Third, for several reasons, investigators may prefer to analyze items with discrete response scales as though they are continuous, using the linear factor model (Beauducel & Herzberg, 2006; Jorgensen & Johnson, 2022; Li, 2016; Muthén & Kaplan, 1985; Rhemtulla et al., 2012). For instance, they may be unfamiliar with IRT, have limited sample sizes for accurate parameter estimation, or prefer a simpler model (e.g., the linear factor model has three parameters per item, whereas the number of parameters per item for item response theory models depends on the number of response categories; Thissen, 2017). In the area of screening, examples of widely used scales that have discrete item responses but are often treated as continuous include the Center for Epidemiologic Studies – Depression (CES-D) scale (e.g., Carleton et al., 2013) and the K6 scales (e.g., Bessaha, 2017). Thus, to fully realize the benefit of IRT-based advances for estimating classification accuracy and consistency, these methods must be extended to the linear factor analysis framework.
There has been limited work on the estimation of classification accuracy in situations in which item responses are treated as continuous. Examples include Millsap and Kwok (2004), who showed how one can translate the parameters of a linear factor model into estimates of sensitivity and specificity to describe how well a measure classifies respondents, and Lai et al. (2017), who facilitated the implementation of these procedures with R code. However, neither of these papers discussed the estimation of classification consistency or the estimation of classification accuracy and consistency at specific levels of the latent construct. Also, Peng and Subkoviak (1980) developed an approach to estimate classification consistency that makes assumptions similar to those of the linear factor model procedure we propose; however, they did not study conditional classification consistency estimates or estimate classification accuracy. As such, there are two main contributions of this paper. First, we extend the IRT-based procedure to estimate classification accuracy and consistency (Gonzalez & Pelham, 2021; Lee, 2010) to handle responses that are treated as continuous. Second, we investigate how treating discrete items as continuous (i.e., analyzing discrete items with the linear factor model) affects the estimation of classification accuracy and consistency. These contributions are important because 1) they would facilitate and promote the estimation of classification accuracy and consistency for users of the linear factor model, and 2) they would help determine whether researchers who routinely fit linear factor models to discrete data can still obtain a rough approximation of classification accuracy and consistency.
Present Study
The purpose of this study is to provide applied researchers with tools, based on the linear factor model, to estimate classification accuracy and consistency when they use scales for decision-making. First, we define four indices of model-based classification accuracy and consistency and explain how these indices are estimated. Then, we illustrate our method by applying it to published data on the K6 screener for psychological distress (Kessler et al., 2003). Finally, we conduct a Monte Carlo simulation study to examine how classification accuracy and consistency are affected when discrete items are analyzed as continuous. In the supplement, we present R functions to conduct the proposed procedure with the linear factor model and a brief tutorial on how to use them.
Indices of Classification Accuracy and Consistency
Consider a hypothetical group of individuals who have identical latent variable scores η. At the item level, random measurement error would result in different observed item responses for these individuals, which in turn results in a distribution of possible summed scores X for each level of η. We refer to this conditional summed score distribution as f(X | η), and some examples are depicted in the right panel of Figure 1. In order to estimate classification accuracy (CA) and consistency (CC), one needs f(X | η), which can be determined from an item response theory model or a linear factor model. In the appendix, we explain how one can use estimates of the factor loadings, intercepts, and residual variances to determine f(X | η). The main takeaway of the appendix is that the linear factor model depends on the assumption that f(X | η) is normally distributed, whereas that distribution can be nonnormal for item response theory models. Therefore, the performance of our procedure depends on meeting this assumption.
Figure 1.

Left panels: summed score distribution conditional on the latent variable (i.e., f(X | θ)) from an item response theory model. Right panels: summed score distribution conditional on the latent variable (i.e., f(X | η)) from a linear factor model. Note that the range of the scores is 0–24.
Using f(X | η), one can define four indices (Gonzalez et al., 2021; Lee, 2010):
Conditional CA (CCA): the probability of making a correct decision based on X at a specific value of η.
Conditional CC (CCC): the probability of making the same decision based on X across two parallel administrations of the measure at a specific value of η.
Marginal CA (MCA): weighted average of CCA estimates across the range of η.
Marginal CC (MCC): weighted average of CCC estimates across the range of η.
These four estimates range from 0 to 1, and higher values are better. Note that the four indices are specific to the cutpoint c on the summed score X, so different cutpoints will have different CA and CC values.
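To make these definitions concrete, one way to write the four indices, consistent with Lee (2010) and with the steps described in the next section, is sketched below. Here P(X ≥ c | η) denotes the proportion of the conditional summed score distribution at or above the cutpoint c, E(X | η) is its conditional mean, φ(·) is the standard normal density, and η_q are the quadrature points used to approximate the integral over η:

$$\mathrm{CCA}(\eta) = \begin{cases} P(X \ge c \mid \eta) & \text{if } E(X \mid \eta) \ge c \\ P(X < c \mid \eta) & \text{otherwise} \end{cases}$$

$$\mathrm{CCC}(\eta) = P(X \ge c \mid \eta)^2 + P(X < c \mid \eta)^2$$

$$\mathrm{MCA} \approx \frac{\sum_q \mathrm{CCA}(\eta_q)\,\phi(\eta_q)}{\sum_q \phi(\eta_q)}, \qquad \mathrm{MCC} \approx \frac{\sum_q \mathrm{CCC}(\eta_q)\,\phi(\eta_q)}{\sum_q \phi(\eta_q)}$$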
Conceptual Sketch of the Method and Relative Advantages
In this section, we provide a brief explanation of the procedure to estimate CA and CC; it is easiest to follow while looking at the top right panel of Figure 1.
1. Estimate f(X | η), as in the top right panel of Figure 1, at one value of η.
2. For a cutpoint1 c, estimate the proportion of f(X | η) at or above c and the proportion below c.
3. Estimate CCA by checking whether the conditional mean E(X | η) is at or above c. If E(X | η) ≥ c, CCA is the proportion of f(X | η) at or above c; otherwise, CCA is the proportion below c. To estimate CCC, add the squares of the two proportions from Step 2.
4. Repeat Steps 1–3 for many more η values. Typically, η is normally distributed, so one can use equally spaced values between −2 and 2. These values are known as quadrature points.
5. Estimate the MCA and MCC by taking a weighted average of the CCA and CCC estimates from the step above. The weights are based on the height of the normal distribution at the quadrature points from Step 4 (i.e., Gaussian quadrature) and are used to approximate the integral over η.
As mentioned above, the respondent’s η is considered fixed, so f(X | η) quantifies the uncertainty in a respondent’s summed score at each level of η due to measurement error. In situations in which η is not expected to change, f(X | η) provides the range of summed scores that we are likely to observe across repeated administrations. If model assumptions are met, this property facilitates the estimation of CC because a single administration of the measure provides hypothetical information on test-retest performance, saving resources and reducing participant burden (Gonzalez et al., 2021; Lee, 2010).
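To make the five steps concrete, the following is a minimal R sketch of the procedure under the linear factor model, using Eq. 3 in the appendix for the conditional mean and (constant) conditional standard deviation of the summed score. The loadings, intercepts, residual variances, and cutpoint below are hypothetical placeholders, not the functions from the supplement or the estimates for any real scale.

```r
# Minimal sketch of Steps 1-5 under the linear factor model.
# All parameter values below are hypothetical placeholders.
lambda    <- c(0.70, 0.80, 0.60, 0.90, 0.75, 0.65)   # factor loadings
nu        <- c(1.2, 1.0, 1.1, 0.9, 1.3, 1.0)         # item intercepts
resid_var <- c(0.51, 0.36, 0.64, 0.19, 0.44, 0.58)   # residual variances
cutpoint  <- 8                                       # cutpoint c on the summed score

eta <- seq(-2, 2, by = 0.05)            # Step 4: quadrature points
w   <- dnorm(eta); w <- w / sum(w)      # normalized normal-density weights

cond_mean <- sum(nu) + sum(lambda) * eta   # E(X | eta), Eq. 3
cond_sd   <- sqrt(sum(resid_var))          # SD(X | eta), constant over eta

p_above <- 1 - pnorm(cutpoint, mean = cond_mean, sd = cond_sd)  # P(X >= c | eta)
p_below <- 1 - p_above                                          # P(X <  c | eta)

cca <- ifelse(cond_mean >= cutpoint, p_above, p_below)  # Step 3: conditional accuracy
ccc <- p_above^2 + p_below^2                            # Step 3: conditional consistency

mca <- sum(w * cca)   # Step 5: marginal accuracy
mcc <- sum(w * ccc)   # Step 5: marginal consistency
round(c(MCA = mca, MCC = mcc), 3)
```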
Illustration: Application of the Method to Published Example
K6 Scale
The K6 is a screener used to study psychological distress in the population (Kessler et al., 2003). Psychological distress is defined by Drapeau et al. (2010) as a combination of depression and anxiety symptoms that describes emotional ill-being. The K6 assesses how frequently an individual experienced six symptoms in the past 30 days: sadness, nervousness, restlessness, hopelessness, worthlessness, and the feeling that everything was an effort. The response scale has five categories (0 – none of the time, 4 – all the time), and an observed score is computed by summing the items. Observed scores have been found to identify participants with moderate psychological distress (Kessler et al., 2003).
For this illustration, we borrow the item parameters for the reference group reported by Sunderland and colleagues (2012, Table 3) for the K6 scale. Data for the Sunderland et al. (2012) study were from the Australian National Survey of Mental Health and Well-being, and the item parameters come from a sample (N = 2,761) in which respondents were between the ages of 16 and 34 and 49.4% were women. We simulated 10,000 responses using the K6 item parameters under the graded response model (GRM; Samejima, 1969) and a normally distributed latent variable score with a mean and variance of 1, N(1,1).2 It is important to note that the item thresholds were asymmetric (i.e., had a positive skew). The data were then analyzed with the linear factor model, the parameters were saved, and we estimated CA and CC using quadrature points between −2 and 2 in steps of .05 with the R functions presented in the supplement. The purpose of the illustration is to demonstrate the estimation of CA and CC under the linear factor model approach and to compare it to CA and CC reference values from an item response theory model, which are based on the data-generating parameters (further described below). Previous findings suggest that when discrete item responses are analyzed with the linear factor model, the factor loadings are attenuated (Beauducel & Herzberg, 2006; Jorgensen & Johnson, 2022; Li, 2016; Rhemtulla et al., 2012). As such, we expected those CA and CC estimates to be smaller than the estimates from the item response theory model because the relation between each item and the latent variable is underestimated.
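As a sketch of this workflow (not the analysis reported here), the code below generates graded responses from a GRM with hypothetical discrimination and location parameters, fits a one-factor linear model with the lavaan package as one possible implementation, and extracts the loadings, intercepts, and residual variances needed by the procedure. The Sunderland et al. (2012) parameters are not reproduced.

```r
# Illustrative workflow with hypothetical GRM parameters (not the published K6 values).
library(lavaan)

set.seed(1)
N <- 10000
a <- c(2.5, 2.8, 2.2, 3.0, 2.6, 2.4)                    # discriminations (hypothetical)
b <- rbind(c(0.4, 1.1, 1.8, 2.6), c(0.3, 1.0, 1.7, 2.5),
           c(0.5, 1.2, 1.9, 2.7), c(0.2, 0.9, 1.6, 2.4),
           c(0.4, 1.0, 1.8, 2.6), c(0.6, 1.3, 2.0, 2.8)) # location parameters (hypothetical)
theta <- rnorm(N, mean = 1, sd = 1)                      # latent scores ~ N(1, 1)

# One uniform draw per person-item: the response is the number of GRM boundaries passed
gen_item <- function(j) {
  p_ge <- sapply(1:4, function(k) plogis(a[j] * (theta - b[j, k])))  # P(X_ij >= k | theta)
  rowSums(p_ge >= runif(N))
}
dat <- as.data.frame(sapply(1:6, gen_item))
names(dat) <- paste0("y", 1:6)

# Fit the linear factor model (items treated as continuous) and save its parameters
fit <- cfa("f =~ y1 + y2 + y3 + y4 + y5 + y6", data = dat,
           meanstructure = TRUE, std.lv = TRUE)
est        <- lavInspect(fit, "est")
loadings   <- as.numeric(est$lambda)   # factor loadings
intercepts <- as.numeric(est$nu)       # item intercepts
resid_var  <- diag(est$theta)          # residual variances
```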
Item response theory estimates of CA and CC as reference values
For our analyses, we use the CA and CC estimates from the item response theory model as reference values because the procedure based on the item response theory model accounts for the discrete nature of the items. We know that the item responses are truly discrete because they were designed that way by the scale constructors. Thus, by definition, a model parameterized to reflect the discrete response options is more accurate (“closer to the truth”) than a simpler model that imposes the assumptions of the linear factor model on the item responses, assumptions that are often untenable (e.g., normally distributed model residuals; Wirth & Edwards, 2007). For this reason, both the illustration and the simulation study treat the values from the item response theory model as the reference case and evaluate the extent to which an approach based on a simpler model (i.e., a linear factor model) produces similar results. As mentioned above, the estimation of CA and CC requires the item parameters, the cutpoint c, and the quadrature points. Sample size plays a role only in the precision of the item parameters, which is affected by sampling error. Therefore, we obtained the reference values from the item response theory model by treating the K6 item parameters as population values and the cutpoint and quadrature points as fixed, and we estimated CA and CC using the cacIRT R package (Lathrop, 2015). Moreover, we simulated many responses (N = 10,000), which were analyzed to obtain linear factor model estimates with small sampling variability (i.e., small standard errors).
Results
The top panel of Figure 2 shows the relation between the latent variable score and the model-implied summed score under the item response theory model and the linear factor model. Although the linear factor model provides a rough approximation to the relation implied by the data-generating (item response theory) model, there are regions in which its model-implied summed score is higher or lower. In the simulated dataset, a K6 cut score of 13 selects roughly 19% of the respondents. Under the linear factor model, the η value that yields a model-implied summed score of 13 is 1.06 (see Footnote 1); the analogous θ value (i.e., the latent variable in the item response theory model) can be read from the test characteristic curve in the top panel. Recall that the K6 parameters from the item response theory model are treated as population values, and the highest standard error for an estimated parameter from the linear factor model was .014. The middle panel of Figure 2 shows the CCA curves; the MCA was .929 for the item response theory model and .930 for the linear factor model. The bottom panel of Figure 2 shows the CCC curves; the MCC was .902 for the item response theory model and .900 for the linear factor model. Furthermore, we also examined the CA and CC estimates for a higher cutpoint that captures roughly 3% of the individuals. The MCA was .980 for the item response theory model and .992 for the linear factor model, and the MCC was .973 for the item response theory model and .988 for the linear factor model. It is likely that the MCA and MCC estimates were similar across approaches for the cutpoint of 13 because of the conditional summed score distributions in the region around that cutpoint: the expected mean relation is shown at the top of Figure 2, the conditional standard deviation of the summed score under the linear factor model has a constant value of 1.97, and the average conditional standard deviation under the item response theory model for θ > 0 is 2.18. For the higher cutpoint, the estimates might be similar because most respondents are ruled out, given that the cutpoint is close to the maximum K6 score of 24, which in turn yields high classification accuracy and consistency regardless of the shape of the conditional summed score distribution. In this example, the CCA and CCC estimates differed, but the linear factor model provided a close approximation to the MCA and MCC estimates from the data-generating model. As such, researchers who fit a linear factor model to K6 data (e.g., Bessaha, 2017) can obtain an approximate estimate of CA and CC even when items are analyzed as continuous.
Figure 2.

Top: Test characteristic curves for the item response theory model and the linear factor model. Middle: Conditional classification accuracy curves at cutpoint 13 for the item response theory model and the linear factor model. Bottom: Conditional classification consistency curves at cutpoint 13 for the item response theory model and the linear factor model. For all plots, solid lines are for the item response theory model and dashed lines are for the linear factor model.
Simulation Study
The goal of the simulation is to determine whether the MCA and MCC estimates from data-generating models with discrete items (Lee, 2010), denoted MCA_D and MCC_D, are approximated at the population level by the MCA and MCC estimates from the linear factor model, denoted MCA_C and MCC_C. Recall that sampling variability does not affect our procedure per se (i.e., taking item parameters, determining the conditional summed score distribution, imposing the cutpoint, and integrating results across η). Sampling error plays a role in the estimation of the item response theory model parameters or the factor model parameters, which takes place prior to conducting the procedure. Implicitly, users of the procedure assume that the item or factor model parameters are estimated precisely. Our simulation helps determine the population-level performance of the procedure when linear factor models are fit to discrete data. To mimic common screening applications, discrete item responses to a unidimensional measure were generated, and IRT-based estimates were treated as the reference values, as discussed above. Consistent with previous findings that factor loadings are underestimated when item responses are discrete with few categories (Beauducel & Herzberg, 2006; DiStefano, 2002; Jorgensen & Johnson, 2022; Li, 2016; Muthén & Kaplan, 1985; Rhemtulla et al., 2012; Thissen, 2017), we expect that as the number of items and the number of response categories increase, the MCA_C and MCC_C estimates will become more similar to MCA_D and MCC_D.
Data Generation
Data were simulated using an ordered-categorical unidimensional factor model, similar to the simulations by Gonzalez and Pelham (2021). There are deterministic relations between the parameters of a categorical factor model and those of the GRM (Wirth & Edwards, 2007), but we chose to simulate data from a categorical factor model because we believe researchers are more familiar with this metric. The factors varied were the number of items (5 to 15), the number of response categories (4, 5, 6, or 7, with the lowest response category scored zero), and the distribution of the thresholds (symmetric or asymmetric). In total, there were 88 conditions, with N = 20,000 simulated cases generated per condition to mitigate the effect of sampling variability. The latent variable score was drawn from a standard normal distribution. The unique scores on each item were multivariate-normally distributed with means of zero and variances of one minus the communality of the item, and they were uncorrelated with each other and with the latent variable. The standardized item factor loadings per condition were equally spaced between 0.30 and 0.90, and the item thresholds were either symmetrically spaced from a standard normal distribution (i.e., evenly divided, with limits of −2.5 and 2.5) or asymmetrically spaced (i.e., the peak of the distribution fell to the left of the mean; moderately asymmetric condition), as investigated by Rhemtulla et al. (2012; see the authors’ supplementary materials). After item responses were generated, a summed score for each respondent was computed, and the observed cutpoint was half the maximum possible summed score in each condition (e.g., in a condition with 7 items and 5 response categories, the maximum score is 28, so the cutscore was 14), selecting the top 50% of respondents, which corresponds roughly to the mean of the latent variable score. In the supplement, we extend our simulation and present results for conditions that select the top 10% and top 25% of the respondents, and we also report marginal classification consistency estimates using the Peng and Subkoviak (1980) procedure.
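For concreteness, the following is a sketch of the data-generation step for a single condition (7 items, 5 response categories, symmetric thresholds). The exact threshold placement is one reading of the design described above and should be treated as illustrative.

```r
# Sketch of one simulation condition: ordered-categorical responses from a
# unidimensional factor model (threshold placement is illustrative).
set.seed(2)
N       <- 20000
n_items <- 7
n_cat   <- 5                                        # response categories scored 0-4
lambda  <- seq(0.30, 0.90, length.out = n_items)    # standardized loadings
thresh  <- seq(-2.5, 2.5, length.out = n_cat - 1)   # symmetric thresholds

eta <- rnorm(N)                                     # latent variable ~ N(0, 1)
# Continuous latent responses: loading * eta + unique score (variance = 1 - loading^2)
y_star <- sapply(seq_len(n_items), function(j)
  lambda[j] * eta + rnorm(N, sd = sqrt(1 - lambda[j]^2)))
# Discretize at the thresholds to obtain responses 0, ..., n_cat - 1
y <- apply(y_star, 2, findInterval, vec = thresh)

sum_score <- rowSums(y)
cutpoint  <- n_items * (n_cat - 1) / 2              # half the maximum summed score (14 here)
mean(sum_score >= cutpoint)                         # proportion selected (roughly 50%)
```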
Data Analysis
The discrete item responses were analyzed with the linear factor model, the parameters were saved, and the functions provided in the supplement were used to estimate MCA_C and MCC_C. We used the cacIRT package with population item parameters to estimate MCA_D and MCC_D. For both approaches, we used quadrature points between −2 and 2 in steps of .05 and normalized quadrature weights. The MCA_C and MCC_C estimates were then compared to MCA_D and MCC_D using the tabled values and by estimating the root mean squared difference (RMSD) and the relative mean difference (RMD), averaged across all conditions (see Tables S5–S10 in the supplement for the raw and relative differences of the estimates per condition). The RMSD was estimated by subtracting MCA_D and MCC_D from MCA_C and MCC_C, respectively, squaring the differences, averaging them across conditions, and taking the square root; it summarizes the typical magnitude of the difference between estimates. The RMD was estimated by subtracting MCA_D and MCC_D from MCA_C and MCC_C, respectively, dividing by MCA_D and MCC_D, and averaging across conditions. A positive RMD value indicates that MCA_C and MCC_C overestimated MCA_D and MCC_D, and a negative RMD value indicates that MCA_C and MCC_C underestimated MCA_D and MCC_D.
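As a small sketch, the two summaries can be computed as follows; the vectors hold marginal accuracy estimates for a handful of conditions (values taken from the first column of Table 1 purely for illustration).

```r
# RMSD and RMD between linear-factor-model ("continuous") and IRT ("discrete") estimates.
mca_d <- c(0.807, 0.824, 0.834, 0.844)   # reference values from the discrete (IRT) model
mca_c <- c(0.813, 0.823, 0.835, 0.844)   # values from the linear factor model

rmsd <- sqrt(mean((mca_c - mca_d)^2))    # root mean squared difference
rmd  <- mean((mca_c - mca_d) / mca_d)    # relative mean difference (positive = overestimation)
round(c(RMSD = rmsd, RMD = rmd), 4)
```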
Results of Simulation
Tables 1 and 2 show the MCA and MCC estimates, respectively, from both approaches as a function of the number of items, the number of response categories, and whether the thresholds were symmetric or asymmetric. For the conditions with symmetric thresholds, the largest MCA_C − MCA_D difference was .008 (in the condition with six items and five response categories), the RMSD was .004, and the RMD was .004. Except for the conditions with four response categories, MCA_C slightly overestimated MCA_D, as reflected by the positive RMD. Furthermore, the largest MCC_C − MCC_D difference was .007 (in the condition with seven items and seven response categories), the RMSD was .004, and the RMD was .002, exhibiting similar patterns.
Table 1.
Classification accuracy estimates for the item response theory model (True Discrete rows) and the linear factor model (Continuous rows) at a cutpoint to select the top 50%
| # items | Approach | Symmetric, 4 cat. | Symmetric, 5 cat. | Symmetric, 6 cat. | Symmetric, 7 cat. | Asymmetric, 4 cat. | Asymmetric, 5 cat. | Asymmetric, 6 cat. | Asymmetric, 7 cat. |
|---|---|---|---|---|---|---|---|---|---|
| 5 | True Discrete | 0.807 | 0.810 | 0.816 | 0.816 | 0.899 | 0.889 | 0.883 | 0.873 |
|  | Continuous | 0.813 | 0.816 | 0.821 | 0.821 | 0.925 | 0.908 | 0.898 | 0.882 |
| 6 | True Discrete | 0.824 | 0.823 | 0.828 | 0.829 | 0.893 | 0.901 | 0.906 | 0.885 |
|  | Continuous | 0.823 | 0.831 | 0.831 | 0.833 | 0.911 | 0.916 | 0.921 | 0.893 |
| 7 | True Discrete | 0.834 | 0.834 | 0.838 | 0.840 | 0.893 | 0.910 | 0.922 | 0.894 |
|  | Continuous | 0.835 | 0.840 | 0.843 | 0.845 | 0.905 | 0.922 | 0.934 | 0.900 |
| 8 | True Discrete | 0.844 | 0.843 | 0.847 | 0.848 | 0.911 | 0.917 | 0.921 | 0.900 |
|  | Continuous | 0.844 | 0.847 | 0.850 | 0.853 | 0.925 | 0.928 | 0.932 | 0.905 |
| 9 | True Discrete | 0.849 | 0.851 | 0.854 | 0.856 | 0.925 | 0.922 | 0.932 | 0.906 |
|  | Continuous | 0.852 | 0.856 | 0.857 | 0.860 | 0.937 | 0.933 | 0.942 | 0.911 |
| 10 | True Discrete | 0.859 | 0.858 | 0.862 | 0.862 | 0.922 | 0.927 | 0.930 | 0.911 |
|  | Continuous | 0.858 | 0.862 | 0.865 | 0.866 | 0.933 | 0.937 | 0.939 | 0.915 |
| 11 | True Discrete | 0.863 | 0.864 | 0.867 | 0.868 | 0.921 | 0.931 | 0.930 | 0.916 |
|  | Continuous | 0.864 | 0.868 | 0.870 | 0.872 | 0.928 | 0.940 | 0.938 | 0.919 |
| 12 | True Discrete | 0.870 | 0.869 | 0.872 | 0.873 | 0.930 | 0.934 | 0.937 | 0.919 |
|  | Continuous | 0.868 | 0.873 | 0.874 | 0.877 | 0.939 | 0.941 | 0.945 | 0.923 |
| 13 | True Discrete | 0.874 | 0.874 | 0.877 | 0.878 | 0.938 | 0.937 | 0.937 | 0.923 |
|  | Continuous | 0.874 | 0.878 | 0.879 | 0.880 | 0.946 | 0.944 | 0.943 | 0.926 |
| 14 | True Discrete | 0.879 | 0.878 | 0.881 | 0.882 | 0.936 | 0.940 | 0.942 | 0.926 |
|  | Continuous | 0.878 | 0.880 | 0.884 | 0.885 | 0.943 | 0.947 | 0.948 | 0.929 |
| 15 | True Discrete | 0.881 | 0.881 | 0.885 | 0.885 | 0.942 | 0.942 | 0.947 | 0.928 |
|  | Continuous | 0.881 | 0.885 | 0.888 | 0.889 | 0.949 | 0.948 | 0.953 | 0.931 |
Table 2.
Classification consistency estimates for the item response theory model (True Discrete rows) and the linear factor model (Continuous rows) at a cutpoint to select the top 50%
| # items | Approach | Symmetric, 4 cat. | Symmetric, 5 cat. | Symmetric, 6 cat. | Symmetric, 7 cat. | Asymmetric, 4 cat. | Asymmetric, 5 cat. | Asymmetric, 6 cat. | Asymmetric, 7 cat. |
|---|---|---|---|---|---|---|---|---|---|
| 5 | True Discrete | 0.744 | 0.746 | 0.754 | 0.750 | 0.860 | 0.845 | 0.836 | 0.822 |
|  | Continuous | 0.743 | 0.747 | 0.754 | 0.753 | 0.892 | 0.868 | 0.855 | 0.833 |
| 6 | True Discrete | 0.762 | 0.762 | 0.765 | 0.766 | 0.853 | 0.861 | 0.868 | 0.838 |
|  | Continuous | 0.756 | 0.766 | 0.766 | 0.770 | 0.873 | 0.881 | 0.887 | 0.848 |
| 7 | True Discrete | 0.780 | 0.775 | 0.777 | 0.779 | 0.850 | 0.873 | 0.890 | 0.850 |
|  | Continuous | 0.772 | 0.778 | 0.782 | 0.785 | 0.866 | 0.889 | 0.906 | 0.858 |
| 8 | True Discrete | 0.787 | 0.786 | 0.789 | 0.790 | 0.876 | 0.883 | 0.888 | 0.860 |
|  | Continuous | 0.783 | 0.788 | 0.791 | 0.795 | 0.893 | 0.898 | 0.903 | 0.866 |
| 9 | True Discrete | 0.796 | 0.796 | 0.799 | 0.800 | 0.895 | 0.890 | 0.903 | 0.869 |
|  | Continuous | 0.794 | 0.800 | 0.801 | 0.805 | 0.911 | 0.905 | 0.918 | 0.874 |
| 10 | True Discrete | 0.805 | 0.804 | 0.808 | 0.809 | 0.891 | 0.897 | 0.902 | 0.876 |
|  | Continuous | 0.803 | 0.808 | 0.812 | 0.814 | 0.904 | 0.910 | 0.913 | 0.880 |
| 11 | True Discrete | 0.814 | 0.812 | 0.816 | 0.816 | 0.889 | 0.902 | 0.901 | 0.882 |
|  | Continuous | 0.811 | 0.815 | 0.819 | 0.821 | 0.898 | 0.914 | 0.911 | 0.886 |
| 12 | True Discrete | 0.820 | 0.818 | 0.822 | 0.823 | 0.902 | 0.907 | 0.911 | 0.887 |
|  | Continuous | 0.816 | 0.823 | 0.824 | 0.829 | 0.913 | 0.917 | 0.922 | 0.891 |
| 13 | True Discrete | 0.826 | 0.824 | 0.829 | 0.829 | 0.913 | 0.911 | 0.911 | 0.892 |
|  | Continuous | 0.823 | 0.829 | 0.831 | 0.832 | 0.923 | 0.920 | 0.919 | 0.895 |
| 14 | True Discrete | 0.832 | 0.830 | 0.834 | 0.835 | 0.910 | 0.915 | 0.919 | 0.896 |
|  | Continuous | 0.829 | 0.832 | 0.837 | 0.838 | 0.919 | 0.924 | 0.926 | 0.900 |
| 15 | True Discrete | 0.837 | 0.835 | 0.838 | 0.839 | 0.919 | 0.918 | 0.925 | 0.900 |
|  | Continuous | 0.834 | 0.839 | 0.843 | 0.845 | 0.927 | 0.925 | 0.933 | 0.903 |
For conditions with asymmetric thresholds, the largest MCA_C − MCA_D difference was .032 (in the condition with five items and four response categories), the RMSD was .010, and the RMD was .010. Across conditions, MCA_C slightly overestimated MCA_D. Furthermore, the largest MCC_C − MCC_D difference was .025 (in the condition with five items and four response categories), the RMSD was .013, and the RMD was .013, and patterns similar to those for MCA were observed. Across all conditions, there was not a clear pattern in the discrepancy between approaches, although this might be explained by the small differences between MCA_C and MCA_D and between MCC_C and MCC_D across the board. Therefore, contrary to our hypotheses, ignoring the discrete nature of the items and fitting a linear factor model led to slightly larger, but similar, MCA_C and MCC_C estimates compared to MCA_D and MCC_D, as reflected by the RMSD and RMD values.
In the supplement, Tables S1–S4 show the MCA and MCC estimates in conditions with cutpoints that select the top 25% and top 10% of respondents. Largely, we see that MCA_C and MCC_C overestimate MCA_D and MCC_D, although the largest difference is less than 4%. Note that the estimates from the two approaches might be similar with these cutpoints because all of the estimates are high regardless of the approach: most respondents are in the same class, so they are likely to be correctly and consistently classified. We expect to observe similar relations with very high or very low cutpoints, regardless of the shape of f(X | η). Moreover, Figure S1 in the supplement shows that MCC_C is higher than the classification consistency estimates from the Peng and Subkoviak (1980) procedure. Across conditions, the correlations between the estimates from the two procedures ranged from r = .95 to .99.
Lastly, we highlight two issues regarding cutpoints on the summed score. First, recall that CA and CC depend on the cutpoint c. In the simulation, the MCA and MCC estimates were higher in the conditions with asymmetric thresholds than in the conditions with symmetric thresholds, but these values cannot be directly compared: the cutpoint was the same across both sets of conditions, but the mass of the summed score distribution shifted to the left because the item thresholds were asymmetric. As such, the cutpoint is located away from the mass of the distribution. Second, recall that the shapes of f(X | η) for the linear factor model and the item response theory model are slightly different (see Figure 1) – the conditional distribution for the linear factor model is normal, smooth, and can take any value, whereas the conditional distribution for the item response theory model can be nonnormal, is not smooth, and takes discrete values. Suppose that the cutpoint is at 13. Under the linear factor model, the next value higher than 13 is, say, ~13.0001, whereas for the item response theory model, the next higher value is 14. The rounding involved in moving from a continuous model-implied summed score to a discrete one could introduce error that affects the precision of CA and CC, which in turn affects the comparisons. The conditional summed score distribution for the item response theory model is expected to become smoother as the number of items and response categories increases.
Discussion
When tests and scales are used for decision-making, it is important to describe the decision process using estimates of classification accuracy and consistency (AERA et al., 2014). This paper introduced an analytical procedure to estimate classification accuracy and consistency for the linear factor model. The proposed extension used the relations presented by Millsap and Kwok (2004) to develop a procedure to estimate CA and CC that is analogous to the analytical procedure of Lee (2010) and to the simulation-based procedure of Gonzalez et al. (2021). Our proposed extension addresses a gap in the literature by enabling researchers who work with continuous item responses, or who treat their item responses as continuous, to estimate model-based CA and CC. In general, our proposed extension facilitates the estimation of CA and CC from a linear factor model, whereas previously such estimation was available only for item response theory models. Moving forward, researchers can use our procedure to report estimates of CA and CC for measures used for screening, selection, and decision-making. Also, we presented an illustration that researchers can replicate using the supplementary materials. The results from both the illustration and the simulation study suggest that when researchers treat discrete items as continuous, the CA and CC estimates from the linear factor model could slightly overestimate the CA and CC from the data-generating model in which items are discrete. This difference would matter most in conditions with few items (e.g., 5 or 6 items) and few response categories (e.g., 4 or 5 categories).
From our simulation results, we can provide general guidance for applied researchers using these methods. Whenever possible, we strongly recommend that researchers use latent variable models that match the response scale of the items. However, there might be instances in which researchers fit linear factor models to discrete items because that is the only latent variable model they know or because they do not have the sample size to estimate a model with more parameters. The simulation results suggest that model-based CA and CC from the linear factor model provide a reasonable approximation to the estimates from the data-generating model with discrete items if there are at least four response categories and threshold skewness is not too extreme. In other words, estimates of CA and CC from a linear factor model fit to discrete items are similar enough to provide a sense of the accuracy and consistency of the decision-making process.
The model-based estimates of CA and CC have several limitations. For example, our proposed procedure assumes that the measure is unidimensional and that the latent variable model fits the measure well. A future direction would be to extend this procedure to handle multiple factors and to study its performance in the presence of model misspecification or misfit (e.g., local dependence; Edwards et al., 2018). Furthermore, like other model-based estimates, the item parameters are treated as fixed for the estimation of CA and CC, but these item parameters are subject to sampling variability. When researchers use small sample sizes, the factor model parameters are not precise, so our findings might not hold. A future direction would be to study how sampling variability and imprecise estimates of the factor model or item response theory model parameters impact CA and CC. Finally, future research includes continuing to explore how CA and CC quantify the effects of violations of measurement invariance on the selection process (Gonzalez et al., 2021; Gonzalez & Pelham, 2021; Lai et al., 2019; Millsap & Kwok, 2004). Hundreds of invariance studies have been published in psychology, but much of this work fails to clarify the extent to which the use of such scales is affected by the items that exhibit bias (e.g., Nye et al., 2019). Furthermore, it is also unclear whether the linear factor model can detect violations of invariance when fit to discrete items (e.g., Meade & Lautenschlager, 2004). As such, it would be important to investigate how the detection rate of noninvariance is reflected in the estimates of model-based CA and CC from linear factor models compared to item response theory models. Overall, we encourage researchers to examine classification accuracy and consistency in their measures using the methodology described in this paper in tandem with other psychometric indices.
Supplementary Material
Acknowledgments
The research was in part supported by NIH funding: DA053137 (A.R.G), AA030197 (W.E.P.), and DA055935 (W.E.P.).
Appendix: Using the Linear Factor Model to Compute f(X | η)
Let X_ij be the observed score on item j for person i. When X_ij is continuous, the relationship between the observed score and the respondent’s standing on the latent variable η_i can be described using the linear factor model,

$$X_{ij} = \nu_j + \lambda_j \eta_i + \epsilon_{ij} \qquad (1)$$

In this case, ν_j is the item intercept, λ_j is the factor loading, and ε_ij is the unique factor score for person i on item j. Recall that in many applied settings a summed score X = Σ_j X_ij is used for decision-making. We can use Eq. 1 to derive two properties of X (Millsap & Kwok, 2004),

$$E(X) = \sum_j \nu_j + \Big(\sum_j \lambda_j\Big)\kappa, \qquad \mathrm{Var}(X) = \Big(\sum_j \lambda_j\Big)^2 \phi + \sum_j \mathrm{Var}(\epsilon_j) \qquad (2)$$

In this case, E(X) is the model-implied mean of X, Var(X) is the model-implied variance of X, κ is the mean of η, and φ is the variance of η.
Using the preceding developments, we can determine f(X | η), the conditional summed score distribution. Two approaches have been previously studied to determine that distribution for item response theory models: analytically, using the approach by Lee (2010), or empirically, using the approach by Gonzalez et al. (2021). For the rest of the paper, we focus on extending the analytical approach by Lee (2010) to the linear factor model, and we discuss the empirical approach by Gonzalez et al. (2021) in the supplement – both approaches tend to produce similar results (Gonzalez et al., 2021). Millsap and Kwok (2004) indicated that X given η is normally distributed, so we can characterize f(X | η) analytically using the conditional mean and the conditional variance of X. Given the relations in Eq. 2, we can determine that

$$E(X \mid \eta) = \sum_j \nu_j + \Big(\sum_j \lambda_j\Big)\eta, \qquad \mathrm{Var}(X \mid \eta) = \sum_j \mathrm{Var}(\epsilon_j) \qquad (3)$$

Note that Var(X | η) does not vary as a function of η: at a specific value of η, the variance contributed by the common factor is zero, and the residual variances are assumed to be homogeneous across levels of η. Note also that Σ_j ν_j and Σ_j λ_j are the intercept and slope of an unbounded line mapping η to E(X | η), and at each value of η, X has a constant variance, Σ_j Var(ε_j). The mapping from η to E(X | η) is similar to the test characteristic curve (TCC) estimated for item response theory models, with three exceptions: the TCC is bounded by the minimum and maximum values of X, can be S-shaped, and does not have a constant variance at each level of the latent variable (Thissen, 2000). The right panel of Figure 1 shows examples of f(X | η) from the linear factor model, which differ from the conditional summed score distributions obtained from an item response theory model (Lee, 2010; left panel of Figure 1) – the former are continuous and normal at all levels of the latent variable, while the latter are discrete and can take nonnormal shapes. Thus, the performance of our procedure depends on how closely f(X | η) adheres to a normal distribution at each level of η.
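As an informal numerical check of Eqs. 2 and 3 (not part of the original derivation), the sketch below simulates continuous item responses from Eq. 1 with hypothetical parameter values and compares the empirical mean and variance of the summed score, and its conditional mean and variance near η = 1, to the analytic expressions.

```r
# Monte Carlo check of Eqs. 2-3 with hypothetical parameter values.
set.seed(3)
N      <- 200000
nu     <- c(1.0, 1.2, 0.8, 1.1)         # intercepts (hypothetical)
lambda <- c(0.7, 0.6, 0.8, 0.5)         # loadings (hypothetical)
resvar <- c(0.40, 0.55, 0.30, 0.65)     # residual variances (hypothetical)
kappa  <- 0; phi <- 1                   # latent mean and variance

eta <- rnorm(N, kappa, sqrt(phi))
X   <- sapply(seq_along(nu), function(j)
  nu[j] + lambda[j] * eta + rnorm(N, sd = sqrt(resvar[j])))   # Eq. 1
sum_score <- rowSums(X)

# Eq. 2: marginal mean and variance of the summed score
c(analytic = sum(nu) + sum(lambda) * kappa, empirical = mean(sum_score))
c(analytic = sum(lambda)^2 * phi + sum(resvar), empirical = var(sum_score))

# Eq. 3: conditional mean and (constant) variance, evaluated near eta = 1
near_1 <- abs(eta - 1) < 0.02
c(analytic = sum(nu) + sum(lambda) * 1, empirical = mean(sum_score[near_1]))
c(analytic = sum(resvar), empirical = var(sum_score[near_1]))
```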
Footnotes
1. The procedure defines the model-implied reference class for classification accuracy estimation as being at or above the η value that yields the model-implied cutpoint c. For the example presented in the illustration section, if researchers use a K6 cutpoint of 13, then the model-implied reference class used by the procedure is defined by η ≥ 1.06, because η = 1.06 yields a model-implied K6 cutpoint of 13 using Eq. 3 in the appendix for the conditional mean. The cutpoint on the summed score is typically determined empirically or by subject-matter experts. If the cutpoint were given in the η metric (e.g., a cutpoint at η = 1.5), it could be transformed to the model-implied summed score cutpoint using Eq. 3.
2. We simulated θ to have a mean of 1 because some items had very extreme b-parameters (above 3). A consequence of this decision is that when we fit the factor model and constrain the θ distribution to have a mean of 0, the b-parameters are rescaled by subtracting 1 from their original values.
References
- Achenbach TM, & Rescorla LA (2000). Manual for the ASEBA Preschool Forms and Profiles: An integrated system of multi-informant assessment. Burlington, VT: University of Vermont Department of Psychiatry.
- Allison C, Auyeung B, & Baron-Cohen S (2012). Toward brief “red flags” for autism screening: the short autism spectrum quotient and the short quantitative checklist in 1,000 cases and 3,000 controls. Journal of the American Academy of Child & Adolescent Psychiatry, 51, 202–212.
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
- Beauducel A, & Herzberg PY (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Structural Equation Modeling, 13, 186–203.
- Bessaha ML (2017). Factor structure of the Kessler psychological distress scale (K6) among emerging adults. Research on Social Work Practice, 27, 616–624.
- Carleton RN, Thibodeau MA, Teale MJ, Welch PG, Abrams MP, Robinson T, & Asmundson GJ (2013). The Center for Epidemiologic Studies – Depression scale: a review with a theoretical and empirical examination of item content and factor structure. PloS one, 8(3), e58067.
- Davidov E, Schmidt P, & Schwartz SH (2008). Bringing values back in: The adequacy of the European Social Survey to measure values in 20 countries. Public Opinion Quarterly, 72, 420–445.
- Deng N (2011). Evaluating IRT- and CTT-based methods of estimating classification consistency and accuracy indices from single administrations (Unpublished doctoral dissertation). University of Massachusetts, Amherst.
- DiStefano C (2002). The impact of categorization with confirmatory factor analysis. Structural Equation Modeling, 9, 327–346.
- Drapeau A, Beaulieu-Prévost D, Marchand A, Boyer R, Préville M, & Kairouz S (2010). A life-course and time perspective on the construct validity of psychological distress in women and men. Measurement invariance of the K6 across gender. BMC Medical Research Methodology, 10, 1–16.
- Edwards MC, Houts CR, & Cai L (2018). A diagnostic procedure to detect departures from local independence in item response theory models. Psychological Methods, 23, 138–149.
- Gonzalez O, Georgeson AR, Pelham WE III, & Fouladi RT (2021). Estimating classification consistency of screening measures and quantifying the impact of measurement bias. Psychological Assessment, 37, 596–609.
- Gonzalez O, & Pelham WE III (2021). When does differential item functioning matter for screening? A method for empirical evaluation. Assessment, 28, 446–456.
- Haertel EH (2006). Reliability. In Brennan RL (Ed.), Educational measurement (4th ed.). Westport, CT: Greenwood.
- Jorgensen TD, & Johnson AR (2022). How to derive expected values of structural equation model parameters when treating discrete data as continuous. Structural Equation Modeling: A Multidisciplinary Journal, 1–12.
- Kessler RC, Barker PR, Colpe LJ, Epstein JF, Gfroerer JC, Hiripi E, Howes MJ, Normand S-LT, Manderscheid RW, Walters EE, & Zaslavsky AM (2003). Screening for serious mental illness in the general population. Archives of General Psychiatry, 60, 184–189.
- Lai MH, Kwok OM, Yoon M, & Hsiao YY (2017). Understanding the impact of partial factorial invariance on selection accuracy: An R script. Structural Equation Modeling: A Multidisciplinary Journal, 24, 783–799.
- Lathrop QN (2015). Practical issues in estimating classification accuracy and consistency with R package cacIRT. Practical Assessment, Research & Evaluation, 20, Article 18.
- Lathrop QN, & Cheng Y (2013). Two approaches to estimation of classification accuracy rate under item response theory. Applied Psychological Measurement, 37, 226–241.
- Lee WC (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47, 1–17.
- Li C-H (2016). Confirmatory factor analysis with ordinal data: Comparing robust maximum likelihood and diagonally weighted least squares. Behavior Research Methods, 48, 936–949.
- Livingston SA, & Lewis C (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32, 179–197.
- Meade AW, & Lautenschlager GJ (2004). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods, 7, 361–388.
- Mellenbergh JG (2017). Models for continuous responses. In van der Linden WJ (Ed.), Handbook of item response theory, volume one: Models (pp. 153–166). London: Chapman and Hall.
- Millsap RE, & Kwok OM (2004). Evaluating the impact of partial factorial invariance on selection in two populations. Psychological Methods, 9, 93–115.
- Nye CD, Bradburn J, Olenick J, Bialko C, & Drasgow F (2019). How big are my effects? Examining the magnitude of effect sizes in studies of measurement equivalence. Organizational Research Methods, 22, 678–709.
- Pepe MS (2003). The statistical evaluation of medical tests for classification and prediction. Oxford, UK: Oxford University Press.
- Peng CYJ, & Subkoviak MJ (1980). A note on Huynh’s normal approximation procedure for estimating criterion-referenced reliability. Journal of Educational Measurement, 359–368.
- Reise SP, & Waller NG (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48.
- Rhemtulla M, Brosseau-Liard PÉ, & Savalei V (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17, 354–373.
- Samejima F (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34, 1–97.
- Sunderland M, Hobbs MJ, Anderson TM, & Andrews G (2012). Psychological distress across the lifespan: examining age-related item bias in the Kessler 6 Psychological Distress Scale. International Psychogeriatrics, 24, 231–242.
- Thissen D (2017). Similar DIFs: Differential item functioning and factorial invariance for scales with seven (“plus or minus two”) response alternatives. Paper presented at the 81st International Meeting of the Psychometric Society, Asheville, NC, USA.
- Thissen D (2000). Reliability and measurement precision. In Wainer H (Ed.), Computerized adaptive testing: A primer (2nd ed., pp. 159–183). Erlbaum.
- Thissen D, Steinberg L, Pyszczynski T, & Greenberg J (1983). An item response theory for personality and attitude scales: Item analysis using restricted factor analysis. Applied Psychological Measurement, 7, 211–226.
- Wirth RJ, & Edwards MC (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12, 58–79.