Abstract
Recently, large-scale testing programs have shown increasing interest in providing examinees with more accurate diagnostic information by reporting overall and domain scores simultaneously. However, few studies have focused on how to report and interpret reliable total scores and domain scores based on bi-factor models. In this study, the authors introduce six methods of reporting overall and domain scores as weighted composites of the general and specific factors in a bi-factor model, and compare their performance with Yao's MIRT (multidimensional item response theory) method using both simulated and empirical data. In the simulation study, four factors were considered: test length, number of dimensions, correlation between dimensions, and sample size. The major findings are as follows. Bifactor-M4 and Bifactor-M6, the methods that use the discrimination parameters of the specific dimensions to compute the weights, provided the most accurate and reliable overall and domain scores in most conditions, especially when the test was long, the correlation between dimensions was high, and the number of dimensions was large; in addition, Bifactor-M4 recovered the relationships among the true ability parameters best of all the proposed methods. In contrast, Bifactor-M2, the method with equal weights, performed poorly on overall score estimation; Bifactor-M3 and Bifactor-M5, the methods whose weights are computed from the discrimination parameters of all the dimensions, performed poorly on domain score estimation; and Bifactor-M1, the original factor method, yielded the worst estimates.
Keywords: overall scores, domain scores, bi-factor model, multidimensional item response theory
In large-scale testing programs, overall scores (or composite scores) have long been used to make important decisions, such as placement and admission. However, more and more studies have found that most tests have a multidimensional structure (Reckase, 2009; Yao, 2010). Instead of focusing solely on overall scores, domain scores (or subscores) can provide a finer diagnosis of examinees' abilities, informing candidates of their strengths and weaknesses in different content areas for future remedial work. States and academic institutions can build a performance profile for their graduates to better evaluate their training and focus on areas that need instructional improvement (Haladyna & Kramer, 2004). Moreover, ignoring the correlational structure of subtests usually leads to suboptimal overall ability estimates in practice. However, because of the small number of items within each domain, the lack of sufficient reliability is the primary impediment to reporting domain scores.
Recently, quite a few methods based on classical test theory (CTT) and item response theory (IRT) have been developed to improve the reliability and optimality of overall and domain scores. By incorporating the correlational structure in the estimation procedure within a multidimensional framework, overall scores can be obtained from multidimensional item response theory (MIRT) models using the maximum information method (Yao, 2010), or from second-order ability estimates in higher-order IRT (HO-IRT) models. Objective performance index (OPI) scoring, augmented scoring, multidimensional scoring, and HO-IRT scoring methods have been documented to improve domain scores, with the latter three regarded as relation-based methods (de la Torre, Song, & Hong, 2011). Previous research has also compared different approaches for reporting overall and domain scores. For example, Dwyer, Boughton, Yao, Steffen, and Lewis (2006) compared four methods: raw subscores, OPI subscores, Wainer augmentation, and MIRT-based subscores. de la Torre et al. (2011) conducted a systematic comparison of MIRT scoring, augmented scoring, HO-IRT scoring, and OPI scoring. Yao (2010) investigated the performance of the unidimensional item response theory (UIRT) method, the HO-IRT method, the MIRT method, and the bi-factor method. On the whole, these studies found that the MIRT-based methods provided the best estimates of overall and domain scores.
Currently, bi-factor models have become increasingly popular among researchers and practitioners in educational and psychological measurement (Armon & Shirom, 2011; Cai, Yang, & Hansen, 2011; Kim, Sherry, Lee, & Kim, 2011; Lee & Lee, 2016). They specify a general factor measured by all the tested items and several specific factors, orthogonal to the general factor, accounting for residual variances shared by subtest items. Bi-factor models are ideally suited for representing the construct-relevant multidimensionality that arises in responses to measures of broad constructs where multiple distinct domains of item content are included to increase content validity (Reise, Moore, & Haviland, 2010). There are several advantages to applying bi-factor models for score reporting. For one thing, factor scores generated by this model possess linear measurement properties, meaning that there is a direct relationship between differences in latent ability and differences in test scores. This linear scaling property is especially desirable in longitudinal applications, as the same amount of score change should indicate the same change in ability regardless of initial ability level (Gavett, Crane, & Dams-O'Connor, 2013). For another, bi-factor models can be easily calibrated by many widely used software packages designed for factor analysis, which is one reason for their increasing popularity.
So far, a few methods have been developed to compute overall and domain scores based on the bi-factor model (DeMars, 2013; Yao, 2010). DeMars (2013) estimated a domain score on the observed score metric as T̂_{is} = Σ_{j∈s} P_j(θ̂_{i0}, θ̂_{is}), where T̂_{is} is the model-predicted observed domain score for examinee i in subscale s, and P_j is the probability of a correct response to item j given the score estimates on the general factor and the specific factor s. However, as the relationship between T̂_{is} and the latent abilities is nonlinear, the linear property of the bi-factor model's scale no longer holds and the units may be compressed or expanded, particularly at the ends of the scale (DeMars, 2013). Another commonly used approach is to represent the overall score and domain scores by the original general factor and specific factors, respectively. The general factor can be interpreted as a pure measure of the general trait controlling for nuisance factors, such as negatively worded items on an attitude survey or testlet effects in a reading test (Yao, 2010). In other contexts, the specific factors due to common content are used to represent domain scores. For example, Betts, Pickart, and Heistad (2011) studied whether the specific factors of literacy and numeracy on a kindergarten assessment could predict later reading and math achievement after controlling for a general factor. However, in Yao's (2010) study, the reliability and parameter recovery of the overall and domain scores based on this approach proved to be poor. According to Gibbons et al. (2007), subscale trait estimates would be underestimated if specific factors were misinterpreted as the overall traits measured by the subscales. The domain score should include more than the specific part. Take a mathematics test as an example: it can be misleading to give a low geometry trait score to a student with a high general factor estimate for mathematics but a low specific factor estimate for geometry. After all, the geometry trait is part of general mathematics proficiency.
Similarly, the composite total score is also underestimated by not including the specific factors. Consider the index omega, which estimates the reliability of the multidimensional overall score: it is computed as the proportion of total score variance that can be attributed to all factors (Reise, 2012; Reise, Bonifay, & Haviland, 2013). Therefore, one alternative would be to estimate the overall score as a composite of the general factor and all the specific factors, and the domain score as a composite of the general factor and the corresponding specific factor (DeMars, 2013; Willoughby, Blanton, & Investigators, 2013). Some researchers have claimed that the weights for the general and specific trait scores should be selected based on the relative contributions of the general and specific factors to the observed responses (DeMars, 2013), but how to compute the weights is yet to be explored.
In the current study, the authors propose several methods to compute overall scores and domain scores based on the statistical features of general and specific factors in the bi-factor model. Taking practical needs into consideration, the complexity of real data structures, and the proposed methods, this research uses both simulated and real data to investigate how to obtain reliable overall and domain scores. The following research questions are formulated:
Research Question 1 (RQ1): How do the proposed methods perform on the recovery of overall scores and domain scores, compared to MIRT methods? The effect of the proposed methods on the precision and accuracy of score estimates are the main focus of the article. As a couple of previous studies have found that MIRT-based methods could provide reliable estimates of overall score and subscores (Yao, 2010), the performance of the proposed methods will be compared with a multidimensional two-parameter logistic model (M-2PL)-based method.
Research Question 2 (RQ2): How do the proposed methods perform on the recovery of correlations between different dimensions? As in multidimensional structural assessments, the subdimensions are usually correlated with each other. The study intended to investigate whether the domain scores obtained by the proposed methods could reflect the real relationship between different content areas, which had not been considered in previous studies yet. The authors conjecture that the accuracy of domain score estimates would increase as the recovery of the relationship among dimensions improved.
Research Question 3 (RQ3): How do the proposed methods perform on selection decisions? It is important to understand the practical effects of different methods for reporting overall and domain scores. Therefore, the authors would like to find out to what extent the sets of selected examinees coincide for different scoring methods in various simulation conditions. The methods that provide more accurate overall scores are expected to perform better on selection decisions, for the estimated scores and the rank ordering of persons are supposed to be highly correlated.
The remaining sections of the article are laid out as follows: The “Different Methods for Computing Overall Scores and Domain Scores” section introduces proposed methods of estimating overall and domain scores; the “Simulation Study” section demonstrates a simulation study evaluating the quality of scores obtained by these methods; the “Real Data Example” section illustrates a real data analysis; and the “Summary and Discussion” presents a brief summary of the study and discussion of its practical implications.
Different Methods for Computing Overall Scores and Domain Scores
Bi-Factor Methods
Bi-factor models typically have all items loading on the general dimension and domain-specific items loading on their domain-specific dimensions. For a two-parameter model, the probability that examinee i answers item j correctly is
P_j(θ_i) = 1 / (1 + exp[−(a_{j0} θ_{i0} + a_{js} θ_{is} + d_j)]),    (1)
where a_j = (a_{j0}, a_{j1}, …, a_{jS}) is the vector of item discrimination parameters and d_j is the item difficulty. For any given item j from subscale s, only the discrimination parameter for the general factor (a_{j0}) and the discrimination parameter for the sth specific factor (a_{js}) are nonzero. The ability vector of each examinee is θ_i = (θ_{i0}, θ_{i1}, …, θ_{iS}), with one ability estimate for the general factor (θ_{i0}) and the others for the S specific factors (θ_{i1}, …, θ_{iS}).
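As a concrete reading of Equation 1, the item response probability can be sketched as follows (a minimal illustration; the function name and example values are ours, not from the original study):

```python
import numpy as np

def bifactor_p(theta, a, d):
    """P(correct) for one item under the bi-factor 2PL (Equation 1).

    theta : (S+1,) ability vector, general factor first.
    a     : (S+1,) discrimination vector; for an item in subscale s,
            only a[0] (general) and a[s] (specific) are nonzero.
    d     : item difficulty (intercept).
    """
    return 1.0 / (1.0 + np.exp(-(a @ theta + d)))

# Item from subscale 1 of a two-subscale test: a = (a_j0, a_j1, a_j2)
a = np.array([0.8, 0.9, 0.0])
p = bifactor_p(np.zeros(3), a, d=0.0)   # logit of 0 -> probability 0.5
```

Note the bi-factor constraint is encoded purely through the zero entries of `a`: each item draws on the general factor plus exactly one specific factor.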
Based on the bi-factor model, the overall score (θ_{O,i}) can be estimated as a weighted composite of the general factor (θ_{i0}) and all the specific factors (θ_{i1}, …, θ_{iS}), whereas the domain score (θ_{D,is}) for the sth dimension can be estimated as a weighted composite of the general factor (θ_{i0}) and the corresponding specific factor (θ_{is}):
θ_{O,i} = w_0 θ_{i0} + Σ_{s=1}^{S} w_s θ_{is},    (2)

θ_{D,is} = v_{0s} θ_{i0} + v_s θ_{is},    (3)
where w_0 and w_s are the weights of the general factor and the specific factors for the overall score, whereas v_{0s} and v_s are the weights of the general factor and the specific factor for the domain score of subscale s. The weights play an essential role in score estimation.
The authors propose the following methods utilizing different ways of defining the weights in Equations 2 and 3 for estimating overall and domain scores.
Bifactor-M1
Bifactor-M1 uses 1 and 0 as the weights, as in Yao’s (2010) study.
w_0 = 1,  w_s = 0  (s = 1, …, S),    (4)

v_{0s} = 0,  v_s = 1.    (5)
By this method, the overall score is represented by the general factor, whereas the domain score is represented by the specific factor.
Bifactor-M2
As the overall and domain scores may be underestimated by Bifactor-M1 in a correlated factors interpretation, one alternative approach is to modify the weights as follows:
w_0 = w_1 = ⋯ = w_S = 1,    (6)

v_{0s} = v_s = 1.    (7)
Bifactor-M3
For this method, the discrimination parameters are selected to represent the relative contributions to the overall scores and domain scores. Therefore,
w_0 = Σ_{j=1}^{J} a_{j0} / D,    (8)

v_{0s} = Σ_{j=1}^{J} a_{j0} / D,    (9)

w_s = Σ_{j∈s} a_{js} / D,    (10)

v_s = Σ_{j∈s} a_{js} / D,  with D = Σ_{j=1}^{J} a_{j0} + Σ_{s′=1}^{S} Σ_{j∈s′} a_{js′},    (11)
where j ∈ s indicates that item j belongs to subscale s, and J is the total number of items. In effect, this method uses the discrimination parameters of all the dimensions to compute the weights.
Bifactor-M4
The weights for generating overall scores are the same as in Equations 8 and 10. As Σ_{j=1}^{J} a_{j0} is always much larger than Σ_{j∈s} a_{js}, v_{0s} is much higher than v_s in Bifactor-M3, which may result in an extremely small weight for the specific factor. Bifactor-M4 therefore uses only the discrimination parameters of the items in subscale s to compute the weights for the domain score:
v_{0s} = Σ_{j∈s} a_{j0} / [Σ_{j∈s} (a_{j0} + a_{js})],    (12)

v_s = Σ_{j∈s} a_{js} / [Σ_{j∈s} (a_{j0} + a_{js})].    (13)
Bifactor-M5
For this method, the squared discrimination parameters of all the dimensions are used to compute the weights.
w_0 = Σ_{j=1}^{J} a_{j0}² / D₂,    (14)

v_{0s} = Σ_{j=1}^{J} a_{j0}² / D₂,    (15)

w_s = Σ_{j∈s} a_{js}² / D₂,    (16)

v_s = Σ_{j∈s} a_{js}² / D₂,  with D₂ = Σ_{j=1}^{J} a_{j0}² + Σ_{s′=1}^{S} Σ_{j∈s′} a_{js′}².    (17)
Bifactor-M6
The weights for generating overall scores are the same as in Equations 14 and 16. For domain scores, Bifactor-M6 uses the squared discriminations within the specific subscale to compute the weights, analogous to Bifactor-M4:
v_{0s} = Σ_{j∈s} a_{j0}² / [Σ_{j∈s} (a_{j0}² + a_{js}²)],    (18)

v_s = Σ_{j∈s} a_{js}² / [Σ_{j∈s} (a_{j0}² + a_{js}²)].    (19)
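As a concrete sketch of the contrast between the weighting schemes (a minimal illustration assuming the normalizations as we read them from the verbal descriptions above; the function names and example values are ours, not the authors' code), the domain-score weights for Bifactor-M3/M5 versus Bifactor-M4/M6 can be computed from an item-by-dimension discrimination matrix:

```python
import numpy as np

def weights_all_dims(A, subscale, s, power=1):
    """Domain weights for Bifactor-M3 (power=1) or Bifactor-M5 (power=2):
    the denominator pools discriminations from ALL dimensions."""
    Ap = A ** power
    g = Ap[:, 0].sum()                                 # general, all items
    spec = [Ap[subscale == k, k].sum() for k in range(1, A.shape[1])]
    total = g + sum(spec)
    return g / total, spec[s - 1] / total              # (v_0s, v_s)

def weights_within_subscale(A, subscale, s, power=1):
    """Domain weights for Bifactor-M4 (power=1) or Bifactor-M6 (power=2):
    only items belonging to subscale s enter the computation."""
    Ap = A ** power
    m = subscale == s
    g, sp = Ap[m, 0].sum(), Ap[m, s].sum()
    return g / (g + sp), sp / (g + sp)

# Toy example: 4 items, 2 subscales; column 0 is the general factor.
A = np.array([[0.8, 0.9, 0.0],
              [0.8, 0.7, 0.0],
              [0.8, 0.0, 0.9],
              [0.8, 0.0, 0.7]])
sub = np.array([1, 1, 2, 2])

v0, v1 = weights_all_dims(A, sub, s=1)         # M3-style weights
u0, u1 = weights_within_subscale(A, sub, s=1)  # M4-style weights
```

With these numbers, the M3-style weights are (0.50, 0.25) while the M4-style weights are (0.50, 0.50), illustrating why pooling over all dimensions can leave the specific factor with a very small weight; the domain score is then v_{0s}·θ̂_0 + v_s·θ̂_s.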
MIRT Method
For an M-2PL model (Reckase, 2009), the ability estimates for each dimension are regarded as domain scores. Meanwhile, the overall score is obtained using the maximum information method as in Yao (2010), which provides the most reliable overall score estimate among all possible values.
Simulation Study
A simulation study was conducted to evaluate the performance of the bi-factor model-based methods under various conditions, compared with the aforementioned MIRT method.
Design and Simulation Setup
Response data were generated based on the between-item M-2PL model (Equation 1; Reckase, 2009) using the program SimuMIRT (Yao, 2015). The distributions for generating item parameters were a_{j0} ~ N(0.8, 0.2²), a_{js} ~ N(0.8, 0.2²), and d_j ~ N(0, 1²). The items were all dichotomous, with equal numbers of items in each subscale. The ability parameters were sampled from a standard multivariate normal distribution. Independent variables in this study were three levels of test length for each dimension (J = 20, 40, and 60), two levels of the number of dimensions (S = 2, 5), four levels of correlation between dimensions (ρ = .0, .3, .5, and .7), and two levels of sample size (I = 1,000, 5,000).
Fully crossed conditions defined by the four design factors resulted in 48 (3 × 2 × 4 × 2) conditions, while the selection ratio (SR) was used only to compute the Jaccard index, which will be discussed later. One hundred replications were simulated for each condition. Every generated data set was analyzed with five methods for reporting overall scores (MIRT, Bifactor-M1, Bifactor-M2, Bifactor-M3&4, and Bifactor-M5&6) and seven methods for reporting domain scores (MIRT, Bifactor-M1, Bifactor-M2, Bifactor-M3, Bifactor-M4, Bifactor-M5, and Bifactor-M6). The MIRT method was regarded as the baseline for comparison.
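The data-generation step can be sketched without SimuMIRT (a minimal stand-in under the stated distributions; the seed and the 2-dimension, 20-item configuration are illustrative choices, not the authors' setup):

```python
import numpy as np

rng = np.random.default_rng(0)           # illustrative seed
I, S, J = 1000, 2, 20                    # examinees, dimensions, items per dimension
rho = 0.5                                # correlation between dimensions

# Abilities: multivariate normal, mean 0, unit variance, common correlation rho
Sigma = np.full((S, S), rho)
np.fill_diagonal(Sigma, 1.0)
theta = rng.multivariate_normal(np.zeros(S), Sigma, size=I)   # (I, S)

# Between-item structure: each item measures exactly one dimension
dim = np.repeat(np.arange(S), J)                              # (S*J,)
a = rng.normal(0.8, 0.2, size=S * J)                          # discriminations
d = rng.normal(0.0, 1.0, size=S * J)                          # difficulties

# M-2PL response probabilities and simulated dichotomous responses
p = 1.0 / (1.0 + np.exp(-(a * theta[:, dim] + d)))
responses = (rng.random(p.shape) < p).astype(int)
```

Each row of `responses` is one examinee's 0/1 response vector across the S × J items.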
Estimation
A Bayesian approach using Markov chain Monte Carlo (MCMC) methods for estimating parameters in MIRT models and bi-factor models was implemented with Bayesian Multidimensional Item Response Theory (BMIRT; Yao, 2015). For each MCMC run, 5,000 iterations were used, with 2,000 as burn-in. Prior distributions were specified for the discrimination, difficulty, and ability parameters. To solve the indeterminacy problem of the models, the population distribution was fixed to be multivariate normal, with means of 0 and variances of 1. After calibration, the overall scores and domain scores were computed.
Evaluation Measures
To investigate the accuracy of the overall scores and domain scores of the methods (RQ1), bias, root mean square error (RMSE), and reliability for the scores were analyzed across conditions.
bias(θ̂) = (1/N) Σ_{n=1}^{N} [(1/I) Σ_{i=1}^{I} (θ̂_{in} − θ_{in})],    (20)

RMSE(θ̂) = (1/N) Σ_{n=1}^{N} sqrt[(1/I) Σ_{i=1}^{I} (θ̂_{in} − θ_{in})²],    (21)

reliability(θ̂) = (1/N) Σ_{n=1}^{N} [corr(θ̂_n, θ_n)]²,    (22)
where θ̂_{in} denotes the overall score or domain score for examinee i reported by an estimation method in replication n, θ_{in} denotes the corresponding true value, I denotes the number of examinees in each test, θ̂_n and θ_n denote the vectors of estimated and true scores in replication n, and N denotes the number of replications under each condition. The true domain scores were the abilities for each dimension, whereas the true overall scores were computed by the maximum information method using the true domain abilities.
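A compact implementation of these measures might look as follows (a sketch; treating reliability as the squared correlation between estimated and true scores is our assumption about Equation 22, not a quotation of it):

```python
import numpy as np

def score_recovery(est, true):
    """Bias, RMSE, and reliability of score estimates (Equations 20-22).

    est, true : (N, I) arrays -- N replications by I examinees.
    """
    err = est - true
    bias = err.mean()                                    # Eq. 20
    rmse = np.sqrt((err ** 2).mean(axis=1)).mean()       # Eq. 21
    rel = np.mean([np.corrcoef(e, t)[0, 1] ** 2
                   for e, t in zip(est, true)])          # Eq. 22 (assumed form)
    return bias, rmse, rel
```

For a method with a constant upward shift of 0.1, for example, both bias and RMSE come out as 0.1 while the squared correlation stays at 1, which is why bias and RMSE are reported together.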
To investigate the recovery of correlations between different dimensions (RQ2), bias and RMSE were computed to compare the estimated correlation between domain scores with the true correlations using Fisher Z-r transformation in Equations 23 and 24.
bias(ρ̂) = (1/N) Σ_{n=1}^{N} [(1/H) Σ_{h=1}^{H} (Z(ρ̂_{hn}) − Z(ρ_h))],    (23)

RMSE(ρ̂) = (1/N) Σ_{n=1}^{N} sqrt[(1/H) Σ_{h=1}^{H} (Z(ρ̂_{hn}) − Z(ρ_h))²],    (24)
where ρ_h denotes the true correlation between a pair of dimensions, ρ̂_{hn} denotes its estimated value in replication n, H denotes the number of distinct correlations (H = S(S − 1)/2), and Z(r) = ½ ln[(1 + r)/(1 − r)] is the Fisher Z-r transformation.
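Equations 23 and 24 amount to applying the inverse hyperbolic tangent (the Fisher Z-r transformation) before averaging, e.g. (a minimal sketch with our own function name):

```python
import numpy as np

def corr_recovery(est_r, true_r):
    """Bias and RMSE of correlation estimates on the Fisher-Z scale
    (Equations 23-24). est_r, true_r : (N, H) arrays, H = S(S-1)/2."""
    z = np.arctanh(est_r) - np.arctanh(true_r)   # Z(r) = 0.5*ln((1+r)/(1-r))
    return z.mean(), np.sqrt((z ** 2).mean(axis=1)).mean()
```

The transformation stretches correlations near ±1 so that averaging and squaring are done on an approximately variance-stabilized scale.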
To investigate the differences in the rank ordering of simulees under different conditions (RQ3), the Jaccard index was selected as a measure of the overlap between pairs of sets (Jaccard, 1912). The Jaccard index for two sets was defined as the ratio of the cardinals of the intersection set to the union set, ranging from 0 through 1.
J(A, B) = |A ∩ B| / |A ∪ B|,    (25)
where A could be regarded as the examinees selected using the true overall scores, and B as the examinees selected using the introduced methods.
In addition, SR was varied across different simulation conditions. SR referred to the proportion of respondents who were selected. As most selection decisions were based on the overall scores in reality, only the rank order based on overall scores was evaluated. In this study, the following SRs were considered: SR = 0.3, 0.5, and 0.8.
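The Jaccard index for a given SR can be sketched as follows (argsort-based top-set selection is our illustration; ties are broken by sort order):

```python
import numpy as np

def jaccard_top(true_scores, est_scores, sr):
    """Jaccard index (Equation 25) between the examinees selected by the
    true overall scores (set A) and by an estimated score (set B), at
    selection ratio sr."""
    n_sel = int(round(sr * len(true_scores)))
    A = set(np.argsort(true_scores)[::-1][:n_sel])   # top-sr by true score
    B = set(np.argsort(est_scores)[::-1][:n_sel])    # top-sr by estimate
    return len(A & B) / len(A | B)
```

Identical rankings give an index of 1; completely disjoint top sets give 0, matching the range noted above.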
Results
Overall score estimates
To investigate all the methods’ overall score recovery (RQ1), the authors computed bias and RMSE to compare different methods across various simulation conditions. Results of bias and RMSE showed the same pattern, with Bifactor-M1 always underestimating the true values. For simplicity, the following analysis was based on the RMSE results. As shown in Figure 1, overall score estimates using Bifactor-M3&4, Bifactor-M5&6, and MIRT were for the most part close to the true value, and the discrepancy became less apparent when test length, number of dimensions, and correlation between dimensions increased. By using Bifactor-M1 and Bifactor-M2, an interaction effect between method and the number of dimensions was observed. For S = 2, RMSE of Bifactor-M2 (0.428) was smaller than Bifactor-M1 (0.550), whereas for S = 5, RMSE of Bifactor-M1 (0.454) was smaller than Bifactor-M2 (0.545). Moreover, Bifactor-M1 produced RMSE that exhibited a different pattern across various correlations between dimensions. RMSE of Bifactor-M1 decreased sharply as correlation between dimensions increased. However, correlation between dimensions had a relatively trivial effect on other methods, especially after it reached 0.5. Sample size exerted almost no effect on the magnitude of RMSE, except for those of Bifactor-M2, with its RMSE decreasing when sample size increased.
Figure 1.
RMSE for overall scores of different methods.
Note. For each plot, RMSE is averaged across all the other tested conditions. RMSE = root mean square error; MIRT = multidimensional item response theory.
To check how the sets of top-selected examinees coincided (RQ3), Jaccard indexes were computed for all the methods across different conditions, as presented in the online supplement, Appendix 1. It shows that, for all the methods, lower Jaccard indexes were obtained when the SR, test length, number of dimensions, and sample size were small, or when multidimensionality was severe. Supplemental Appendix 1 also shows that, in line with the expectation that accurate estimates can result in larger proportion of overlap between sets of top selected examinees, Jaccard indexes for MIRT (0.570-0.948), Bifactor-M3&4 (0.556-0.946), and Bifactor-M5&6 (0.576-0.949) were higher than Bifactor-M1 (0.447-0.939) and Bifactor-M2 (0.451-0.910). When SR increased, the differences between Jaccard indexes using different methods became smaller.
The reliability of different methods under all simulation conditions had the same pattern as the RMSE results. For example, Supplemental Appendix 2 shows the reliability of overall scores for the sample size of 1,000. The table indicates that under some conditions, the reliability of Bifactor-M1 (i.e., J = 20, ρ = .0) and Bifactor-M2 (i.e., S = 5, ρ = .0) were extremely low. In contrast, reliabilities of Bifactor-M3&4, Bifactor-M5&6, and MIRT went from 0.8 to 1.0 as the test length of each dimension increased from 40 to 60.
Domain score estimates
To answer RQ1, the RMSE results were summarized. As shown in Figure 2, the RMSE of Bifactor-M4, Bifactor-M6, and MIRT were highly comparable and lower than the others. For most of the tested conditions, the RMSE of Bifactor-M2 was similar to that of the three better-performing methods, except when ρ = .0. As test length increased, the RMSE of all the methods decreased, but the improvement of Bifactor-M4, Bifactor-M6, and MIRT was greater. Moreover, the performance of Bifactor-M4, Bifactor-M6, and MIRT was stable across different numbers of dimensions and correlation levels. However, Bifactor-M3 and Bifactor-M5 produced poorer estimates when the test length or the correlation between dimensions was small, and Bifactor-M1 had extremely large RMSE when ρ ≥ .5.
Figure 2.
Interaction effects of RMSE for domain scores of different methods.
For each plot, RMSE is averaged across all the other tested conditions. RMSE = root mean square error.
For the recovery of correlations between different dimensions (RQ2), Bifactor-M1 underestimated the correlations across all conditions, whereas Bifactor-M3 and Bifactor-M5 tended to overestimate them systematically. The RMSE of the correlations can be seen in Supplemental Appendix 3. The smaller RMSE of Bifactor-M4 and MIRT indicated that they recovered the correlations better than all the other methods, followed by Bifactor-M2 and Bifactor-M6; the largest RMSE came from Bifactor-M3. Moreover, two interaction effects were observed, between method and correlation level and between method and the number of dimensions: for ρ = .0, Bifactor-M6 had smaller RMSE of correlations than Bifactor-M4 and Bifactor-M2, whereas for ρ = .7, the RMSE of Bifactor-M6 was much larger than that of Bifactor-M4 or Bifactor-M2; and as the number of dimensions increased, the RMSE of correlations using Bifactor-M1 decreased, while those using Bifactor-M3 and Bifactor-M5 increased.
As indicated in Supplemental Appendix 4, the reliabilities of domain scores (0.202-0.891) were smaller than those of overall scores (0.475-0.960). The pattern of domain score reliabilities for the sample size of 1,000 can be summarized as follows. First, the reliabilities for Bifactor-M1 were all below 0.730, with the lowest (0.202, 0.235, and 0.268) obtained when S = 5 and ρ = .7. Second, for all the other methods, the reliabilities of domain scores increased as the test length of each dimension increased; when J ≥ 40, almost all the reliabilities were above 0.7. Third, compared with the overall score reliabilities, the impact of the number of dimensions and the correlation between dimensions was low.
Real Data Example
Description of Data
Real data were used to compare the performance of the different methods in estimating overall and domain scores. Responses from 4,815 examinees on a comprehensive science test in the National College Entrance Examination in China were collected. The test contained 66 items covering three subjects: physics (17 items), chemistry (30 items), and biology (19 items). The correlational structure of the abilities based on the multidimensional scoring analysis by BMIRT (Yao, 2015) is given in Table 1; the correlations across the subjects were high (the average correlation was about .720). The highest correlation was between physics and chemistry (.756), and the lowest between physics and biology (.694).
Table 1.
Correlation Estimates for the Comprehensive Test of Science.
| | Physics | Chemistry | Biology |
|---|---|---|---|
| Physics | — | .756 | .694 |
| Chemistry | — | — | .702 |
| Biology | — | — | — |
Real Data Example Analysis
Seven methods (MIRT, Bifactor-M1, Bifactor-M2, Bifactor-M3, Bifactor-M4, Bifactor-M5, and Bifactor-M6) were used to obtain the overall scores and domain scores. For the real data, overall and domain ability estimates from the MIRT method were used as the gold standard. Model fit was first evaluated for the UIRT model, MIRT model, and bi-factor model. Then the different methods were compared with MIRT method by the RMSE and correlation between estimated scores. Specifically, summary statistics were computed at five percentiles (5th, 25th, 50th, 75th, and 95th).
Real Data Example Results
Model fit statistics are displayed in Table 2. As the true model was unknown, only the relative performance of the models was examined: the bi-factor model fitted the data best, followed by the MIRT model, and the UIRT model did not fit well.
Table 2.
Model Fit Statistics for the Comprehensive Test of Science.
| Model | Log likelihood | Chi-square (vs. UIRT) | df |
|---|---|---|---|
| UIRT | −230,798 | ||
| MIRT | −221,795 | 18,006 | 9,630 |
| Bi-factor | −219,977 | 21,642 | 14,511 |
Note. MIRT = multidimensional item response theory.
Supplemental Appendix 5 compares the overall scores and domain scores from different methods. Bifactor-M2 produced the most different overall ability estimates from MIRT, followed by Bifactor-M1. Bifactor-M3&4 and Bifactor-M5&6 had similar estimates, close to those from MIRT. For domain score estimates, the differences between Bifactor-M1 and MIRT were the largest. The domain scores from Bifactor-M4 for physics and chemistry and from Bifactor-M2 for biology were as accurate as those from MIRT. As shown in the simulation study, the differences between domain abilities from Bifactor-M2 and those from MIRT decreased as the correlation between dimensions increased. As the correlations between subjects were rather high, one would expect that both Bifactor-M2 and Bifactor-M4 would give a better match with MIRT.
The percentile statistics for the overall and domain scores can be found in Supplemental Appendix 6. Generally speaking, compared with MIRT, the proposed methods were more discrepant at the two ends of the scale (i.e., the 5th and 95th percentiles). Specifically, for overall scores, the absolute difference at the 95th percentile between Bifactor-M2 and MIRT reached 0.134. For domain scores, taking chemistry as an example, the absolute differences of Bifactor-M2, Bifactor-M3, Bifactor-M5, and Bifactor-M6 from MIRT at the 5th percentile were larger than 0.1. In general, Bifactor-M4 had the smallest discrepancy from MIRT for both overall and domain scores.
Summary and Discussion
In practice, overall scores emphasize reporting examinees' overall achievement or proficiency on a whole test, whereas domain scores focus on offering diagnostic information for a specific domain. This study compared several methods of estimating overall and domain scores based on the bi-factor model.
DeMars (2013) suggested that overall and domain scores based on a weighted combination of the general and specific factors would be more reliable and have smaller standard errors. Within this framework, the authors explored six different ways of computing the weights of the general and specific factors. As the weights for Bifactor-M4 and Bifactor-M6 were more appropriate than those of the other methods, these two methods yielded overall and domain score estimates similar to the MIRT method in both the simulation study and the real data analysis. For overall scores, when the correlation between dimensions was high, the difference between MIRT and Bifactor-M4 or Bifactor-M6 was negligible even when the test was short, as illustrated in Figure 3. Part A, Part B, and Part C can be regarded as the specific factors, whereas Part X represents the general factor. The overall score can then be computed as w_aA + w_bB + w_cC + w_xX, where the w's are the weights of each part. When unidimensionality was stronger, the general part (Part X) shared by all the dimensions was larger. By weighting unequally, the general factor contributed more to the overall score estimation in Bifactor-M4 and Bifactor-M6 than in Bifactor-M2 (equal weights); therefore, these methods had a smaller discrepancy from the true values. For domain scores, Bifactor-M4 provided not only accurate score estimates but also nearly unbiased estimates of the correlations between dimensions in all conditions.
Figure 3.
Variances of test score.
As mentioned above, Bifactor-M3 and Bifactor-M5 failed to provide reliable domain scores, especially in the case of severe multidimensionality. Regarding the recovery of correlations between different dimensions, Bifactor-M3 and Bifactor-M5 also had large RMSE in the five-dimension condition. These methods assigned much smaller weights to the specific factors; therefore, the common variance between domain scores was large, which led to overestimated correlations between dimensions. In Figure 3, when the correlation between dimensions was small, the purely specific contribution of each dimension (Part A, Part B, and Part C) was large. However, if smaller weights (w_a, w_b, w_c) for the specific factors were chosen, the domain scores (w_aA + w_xX, w_bB + w_xX, w_cC + w_xX) would be severely biased. Therefore, the weights for Bifactor-M4 and Bifactor-M6 were more appropriate than those for Bifactor-M3 and Bifactor-M5.
Bifactor-M1 and Bifactor-M2 had some notable drawbacks. Bifactor-M1 produced inaccurate estimates of overall and domain scores, as well as of the correlations between dimensions. As the correlations and the number of dimensions became smaller, the estimation error of overall scores by Bifactor-M1 increased dramatically. As shown in Figure 3, when the correlations were small, Part X was small, so the relative contribution of the general factor to the overall score (w_aA + w_bB + w_cC + w_xX) was small. If the general factor alone was mistakenly used to represent the overall score, the estimate was inaccurate. The findings were different for domain scores: as the correlations between dimensions increased, the error of domain scores by Bifactor-M1 increased, as Figure 3 also illustrates. When the unidimensionality assumption was not severely violated, the overlap between dimensions (Part X) was large and the purely specific contribution of each dimension (Part A, Part B, and Part C) to the domain scores (w_aA + w_xX, w_bB + w_xX, w_cC + w_xX) was small. If the specific factor was used directly as a domain score, the error was very large. Moreover, because the Bifactor-M1 domain scores were the specific factors of each dimension, neglecting their overlap, the correlations between dimensions were expected to be underestimated. Finally, because the reliabilities for overall and domain scores using Bifactor-M1 were below 0.7 and 0.5, respectively, in most situations, this method is not recommended for reporting scores. For Bifactor-M2, although the domain scores were mostly accurate, the overall scores were inaccurate in all conditions. An interesting finding is that the error of the overall score based on Bifactor-M2 was rather large in the five-dimension condition, probably due to the increasing number of unweighted specific factors.
Therefore, as emphasized by DeMars (2013), a set of carefully chosen weights should be used when reporting overall and domain scores, as Bifactor-M4 and Bifactor-M6 did. Finally, the relationship between the weights for the general and specific factors, together with the performance ranking of the different methods under two conditions, is summarized in Table 3.
Table 3.
Summary of the Weights and Different Methods’ Performance.
| Methods | Weights (overall score) | Overall rank (ρ = .7) | Overall rank (ρ = .0) | Weights (domain score) | Domain rank (ρ = .7) | Domain rank (ρ = .0) |
|---|---|---|---|---|---|---|
| Bifactor-M1 | wx = 1, wa,b,c = 0 | 3 | 4 | wx = 0, wa,b,c = 1 | 6 | 4 |
| Bifactor-M2 | wx = wa,b,c = 1 | 4 | 3 | wx = wa,b,c = 1 | 3 | 3 |
| Bifactor-M3 | wx >> wa,b,c | 1 | 1 | wx >> wa,b,c | 4 | 6 |
| Bifactor-M4 | wx > wa,b,c | 1 | 1 | wx > wa,b,c | 1 | 1 |
| Bifactor-M5 | wx >> wa,b,c | 2 | 2 | wx >> wa,b,c | 5 | 5 |
| Bifactor-M6 | wx > wa,b,c | 2 | 2 | wx > wa,b,c | 2 | 1 |
Note. >> means far larger than. Rank gives the ranking of each method's performance in descending order. wa,b,c indicates wa = wb = wc.
Regarding the performance of the different methods across conditions, similar patterns were found for the RMSE of the overall scores and the Jaccard index. The Jaccard index for Bifactor-M4 and Bifactor-M6 remained high in most conditions. However, when the correlation between dimensions was .0 and the number of items in each dimension was 20, the overlap between the top 30% of examinees ranked by true values and by estimated overall scores was no more than 60% for all methods, which is not desirable for score reporting.
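The top-30% agreement described above can be computed as a Jaccard index (following Jaccard, 1912). The sketch below is illustrative; the function name and the exact tie-handling are assumptions, not taken from the authors' implementation.

```python
import numpy as np

def top_group_jaccard(true_scores, est_scores, prop=0.30):
    """Jaccard index between the top `prop` examinees ranked by
    true scores and the top `prop` ranked by estimated scores."""
    true_scores = np.asarray(true_scores)
    est_scores = np.asarray(est_scores)
    k = int(round(prop * len(true_scores)))
    # Indices of the k highest-scoring examinees under each ranking
    top_true = set(np.argsort(true_scores)[-k:])
    top_est = set(np.argsort(est_scores)[-k:])
    # |intersection| / |union|: 1.0 means identical top groups
    return len(top_true & top_est) / len(top_true | top_est)
```

A value of 1.0 indicates that the estimated overall scores select exactly the same top group as the true abilities; values below roughly .6 correspond to the undesirable overlap noted above.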
Practical Implications
The most important conclusion of this study is that, when applied appropriately, bi-factor model-based methods can yield reliable overall and domain scores. Specifically, the proposed method Bifactor-M4 is well suited to the following conditions: (a) the G-factor scores (i.e., scores on the general factor), the overall scores, and the domain scores are all desired; (b) correlations between dimensions are above the middle level (above .4), where a bi-factor model has been shown to be a better choice than an MIRT model (Reise et al., 2010); (c) criterion-related validity is important, as factor scores based on bi-factor models usually show high criterion-related validity (Gavett et al., 2013); and (d) the number of items in each dimension is large (more than 20). In practice, practitioners and researchers should be cautious when reporting overall and domain scores based on the bi-factor model, especially when the number of dimensions is large, the number of items in each dimension is small, and the correlation between dimensions is low. A more prudent approach is to check the reliability of total scores and subscale scores before interpreting them; procedures for computing reliability indexes such as omega and omegaS can be found in Reise et al. (2013).
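As a rough illustration of the reliability check suggested above, omega for the total score and omegaS for a subscale can be computed directly from a bi-factor solution's loadings. The formulas follow the model-based reliability coefficients discussed by Reise et al. (2013); this is a sketch with hypothetical function names, not the authors' or Reise et al.'s code.

```python
import numpy as np

def omega_total(gen_loadings, spec_loadings, uniquenesses):
    """Omega for the total score from a bi-factor solution.

    gen_loadings  : (n_items,) general-factor loadings
    spec_loadings : list of arrays, each specific factor's loadings
                    on its own items (structural zeros omitted)
    uniquenesses  : (n_items,) unique variances (1 - communality)
    """
    # Common variance: squared sum of general loadings plus the
    # squared sum of each specific factor's loadings
    common = np.sum(gen_loadings) ** 2 + sum(np.sum(s) ** 2 for s in spec_loadings)
    return common / (common + np.sum(uniquenesses))

def omega_sub(gen_loadings_s, spec_loadings_s, uniquenesses_s):
    """OmegaS for one subscale: the same ratio, with all sums
    restricted to that subscale's items and its specific factor."""
    common = np.sum(gen_loadings_s) ** 2 + np.sum(spec_loadings_s) ** 2
    return common / (common + np.sum(uniquenesses_s))
```

For example, four items loading .6 on the general factor and .4 on two specific factors (two items each) give a total-score omega of about .79 and a subscale omegaS of about .68, figures a practitioner could check against the thresholds discussed earlier before reporting scores.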
Limitations and Future Research
The current study is limited in several respects. First, the simulation study considered only dichotomously scored items; investigating the performance of these methods with polytomous items, or a mixture of both item types, would be of great practical value. Second, beyond the accuracy of domain scores investigated here, the linking and comparison of domain scores across subscales could also be considered in the future. Finally, the present simulation shares the limitations of other simulation studies (i.e., limited conditions), and more extensive simulations should be conducted before making solid generalizations.
Supplemental Material
Supplemental material, AppendixAPM for Reporting Valid and Reliable Overall Scores and Domain Scores Using Bi-Factor Model by Yue Liu, Zhen Li and Hongyun Liu in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Supported by the National Natural Science Foundation of China (31571152), the Special Fund for the Beijing Common Construction Project (019-105812), and the National Education Examinations Authority (GJK2017015).
Supplemental Material: Supplemental material for this article is available online.
References
- Armon G., Shirom A. (2011). The across-time associations of the five-factor model of personality with vigor and its facets using the bifactor model. Journal of Personality Assessment, 93, 618-627.
- Betts J., Pickart M., Heistad D. (2011). Investigating early literacy and numeracy: Exploring the utility of the bifactor model. School Psychology Quarterly, 26, 97-107.
- Cai L., Yang J. S., Hansen M. (2011). Generalized item bifactor analysis. Psychological Methods, 16, 221-248.
- de la Torre J., Song H., Hong Y. (2011). A comparison of four methods of IRT subscoring. Applied Psychological Measurement, 35, 296-316.
- DeMars C. E. (2013). A tutorial on interpreting bifactor model scores. International Journal of Testing, 13, 354-378.
- Dwyer A., Boughton K. A., Yao L., Steffen M., Lewis D. (2006, April). A comparison of subscale score augmentation methods using empirical data. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
- Gavett B. E., Crane P. K., Dams-O’Connor K. (2013). Bi-factor analyses of the brief test of adult cognition by telephone. NeuroRehabilitation, 32, 253-265.
- Gibbons R. D., Bock D. R., Hedeker D., Weiss D. J., Segawa E., Bhaumik D. K., . . . Stover A. (2007). Full-information item bi-factor analysis of graded response data. Applied Psychological Measurement, 31, 4-19.
- Haladyna S. J., Kramer G. A. (2004). The validity of subscores for a credentialing test. Evaluation & the Health Professions, 27, 349-368.
- Jaccard P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11, 37-50. doi:10.1111/j.1469-8137.1912.tb05611.x
- Kim S.-H., Sherry A. R., Lee Y. S., Kim C.-D. (2011). Psychometric properties of a translated Korean adult attachment measure. Measurement and Evaluation in Counseling and Development, 44, 135-150.
- Lee G., Lee W. C. (2016). Bi-factor MIRT observed-score equating for mixed-format tests. Applied Measurement in Education, 29, 224-241.
- Reckase M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
- Reise S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47, 667-696.
- Reise S. P., Bonifay W. E., Haviland M. G. (2013). Scoring and modeling psychological measures in the presence of multidimensionality. Journal of Personality Assessment, 95, 129-140.
- Reise S. P., Moore T. M., Haviland M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92, 544-559.
- Willoughby M. T., Blanton Z. E., Investigators F. L. P. (2013). Replication and external validation of a bi-factor parameterization of attention deficit/hyperactivity symptomatology. Journal of Clinical Child and Adolescent Psychology, 44, 68-79.
- Yao L. H. (2010). Reporting valid and reliable overall scores and domain scores. Journal of Educational Measurement, 47, 339-360.
- Yao L. H. (2015). The BMIRT toolkit. Monterey, CA: Defense Manpower Data Center.