Educational and Psychological Measurement. 2019 May 16;80(1):67–90. doi: 10.1177/0013164419846936

Indices of Subscore Utility for Individuals and Subgroups Based on Multivariate Generalizability Theory

Mark R. Raymond and Zhehan Jiang
PMCID: PMC6943993  PMID: 31933493

Abstract

Conventional methods for evaluating the utility of subscores rely on traditional indices of reliability and on correlations among subscores. One limitation of correlational methods is that they do not explicitly consider variation in subtest means. An exception is an index of score profile reliability designated as G, which quantifies the ratio of true score profile variance to observed score profile variance. G has been shown to be more sensitive than correlational methods to group differences in score profile utility. However, it is a group average, representing the expected value over a population of examinees. Just as score reliability varies across individuals and subgroups, one can expect that the reliability of score profiles will vary across examinees. This article proposes two conditional indices of score profile utility grounded in multivariate generalizability theory. The first is based on the ratio of observed profile variance to the profile variance that can be attributed to random error. The second quantifies the proportion of observed variability in a score profile that can be attributed to true score profile variance. The article describes the indices, illustrates their use with two empirical examples, and evaluates their properties with simulated data. The results suggest that the proposed estimators of profile error variance are consistent with the known error in simulated score profiles and that they provide information beyond that provided by traditional measures of subscore utility. The simulation study suggests that artificially large values of the indices could occur for about 5% to 8% of examinees. The article concludes by suggesting possible applications of the indices and discusses avenues for further research.

Keywords: subscores, score profiles, generalizability theory, score reports


Most testing programs in K–12 education, college admissions, and professional certification provide subscores to examinees and other stakeholders. Subscores are valued by examinees, educators, and decision makers to inform test preparation efforts, evaluate instruction, and assist with admission and placement decisions (Huff & Goodman, 2007; Jiang & Raymond, 2018). However, subscores often are provided as an afterthought, without being given deliberate consideration during test design (Brennan, 2011; Haberman & Sinharay, 2010). When subscores or score profiles are provided, testing agencies are obligated to demonstrate that subscores exhibit sufficient reliability and are empirically distinct (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014, p. 27). Correlational methods remain the most common tools for evaluating the properties of subscores and score profiles. One straightforward approach is to compute correlations among subscores or between subscores and the total score; if the disattenuated correlations are in the .90s, then the subscores are not very distinct and probably are not worth reporting (e.g., Haladyna & Kramer, 2004). Factor analytic methods also have been used to investigate the utility of subscores (Sinharay, Haberman, & Puhan, 2007; Stone, Ye, Zhu, & Lane, 2010; Thissen, Wainer, & Wang, 1994). In addition, item response theory (IRT) can be used to determine whether item responses can be adequately modeled by a single examinee trait or whether multiple traits are required.

Haberman (2008) developed a method that incorporates both subscore distinctiveness and subscore reliability into a single decision rule about whether to report subscores. That method is based on the principle that an observed subscore, V, is meaningful only if it can predict the true subscore, VT, more accurately than the true subscore can be predicted from the total score Z, where VT is estimated using Kelley’s equation for regressing observed scores toward the group mean, and where predictive accuracy is expressed as mean-squared error. If the proportion reduction in mean-square-error (PRMSE) based on the prediction of VT from V exceeds the PRMSE based on Z, then the subscore adds value—that is, observed subscores predict true subscores more accurately than total scores predict true subscores. It has been suggested that the two PRMSE quantities be formulated as a ratio—referred to as the value-added ratio (VAR) of a subscore (Feinberg & Wainer, 2014). If VAR exceeds 1, then scores for that subtest are deemed to be useful. Brennan (2011) proposed a utility index, U, which produces the same decisions as PRMSE and VAR but is computationally more straightforward. A consistent finding with operational testing programs is that subscores seldom add value beyond the information available in the total score (Puhan, Sinharay, Haberman, & Larkin, 2010; Sinharay, 2010, 2013; Stone et al., 2010). However, there have been a few instances for which subscores have been found to have utility for demographic groups based on gender, ethnicity, or native language (Reckase & Xu, 2015; Sinharay & Haberman, 2014).

While PRMSE and U are useful frameworks for evaluating subscores, they, like other correlational methods, assume that all subscores have the same mean and variance; consequently, they will not capture all of the variability in subscores when means and variances differ across subtests (Conger & Lipshitz, 1973). This is a notable limitation when such differences represent meaningful variation in task or subtest difficulty (Cronbach & Gleser, 1953). Another instance where correlational methods can be misleading is when subtest means for certain groups of examinees vary even though total-group means are the same across subtests. Such differences have been observed for examinee groups based on gender, English language fluency, curricular differences, and other factors (e.g., Bridgeman & Lewis, 1994; Holtzman, Swanson, Ouyang, Dillon, & Boulet, 2014; Reardon, Kalogrides, Fahle, Podolsky, & Zarate, 2018; Reckase & Xu, 2015; Sinharay & Haberman, 2014). In these instances, exclusive reliance on correlations can overlook important differences in subscores, because even moderately large differences in subscore distributions for different groups (e.g., one fourth to one half standard deviation [SD]) will produce minimal differences in subgroup correlations (Bridgeman, 2016; Bridgeman & Lewis, 1994; Jiang & Raymond, 2018; Livingston, 2015).

Correlation-based methods represent a between-subject approach to accounting for variation in subscores. To fully explain the variability, it is necessary to also evaluate within-subject variation across score profiles (Cronbach & Gleser, 1953; van der Maas, Molenaar, Maris, Kievit, & Borsboom, 2011). The challenge is to differentiate signal from noise in the score profiles.

Conger and Lipshitz (1973) proposed two within-group indices of profile reliability—one based on the Mahalanobis D2 statistic and the other based on the D2 statistic proposed by Cronbach and Gleser (1953). The principal difference between the two indices is that the latter allows subtest means and variances to vary. More recently, Brennan (2001a) introduced an index of score profile utility based on multivariate generalizability theory. That index, designated as G, indicates the proportion of observed score profile variance attributable to universe (or true) score profile variance (Brennan, 2001a, p. 323). The generalizability of score profiles is given by

$$G=\frac{V(\mu_p)}{V(\bar{X}_p)}=\frac{\left[\overline{\sigma_v^2}(p)-\overline{\sigma_{vv'}}(p)\right]+\operatorname{var}(\mu_v)}{\left[\overline{S_v^2}(p)-\overline{S_{vv'}}(p)\right]+\operatorname{var}(\bar{X}_v)}, \tag{1}$$

where V(μp) is the average variance of true score profiles and V(X¯p) is the average variance for observed score profiles. G ranges from 0 to 1 and can be interpreted as a reliability-like index for score profiles (Brennan, 2001a). If one ignores the right-hand terms in the numerator and denominator, it can be seen that G is essentially the average of the subtest reliabilities adjusted downward for the subtest covariances. The inclusion of the right-hand terms in the numerator and denominator serves to increase G as the difference in subtest means increases. See the appendix (available online) for a detailed explanation of Equation (1).
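To make the computation concrete, here is a minimal Python sketch of Equation (1) from summary statistics. It is an illustration, not code from the article: the function name and input layout are hypothetical, and the divisor for the variance of subtest means follows the nv convention used by mGENOVA (discussed later in the comparison with Table 5); ddof=1 would give the unbiased alternative.

```python
import numpy as np

def profile_g(true_vars, true_covs, true_means, obs_vars, obs_covs, obs_means):
    """Sketch of Equation (1): G = V(mu_p) / V(Xbar_p).

    true_vars, obs_vars   : per-subtest universe-score / observed-score variances
    true_covs, obs_covs   : off-diagonal (subtest-pair) covariances
    true_means, obs_means : per-subtest means; their spread is the term that
                            raises G when subtest means differ
    """
    # Numerator: mean true variance, adjusted down by the mean covariance,
    # plus the variance of the true subtest means (nv divisor, per mGENOVA).
    v_mu = (np.mean(true_vars) - np.mean(true_covs)) + np.var(true_means, ddof=0)
    # Denominator: the same structure for the observed quantities.
    v_x = (np.mean(obs_vars) - np.mean(obs_covs)) + np.var(obs_means, ddof=0)
    return v_mu / v_x
```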

G has been the subject of little research. The one published study to date investigated G and PRMSE under numerous experimental conditions that simulated the types of subtest conditions commonly encountered in practice (Jiang & Raymond, 2018). One notable finding was that PRMSE indicated that subscores were worth reporting even when values of G reached only into the high .50s and low .60s. That study also found that G is more sensitive than PRMSE to subgroup differences in subscore profiles; an example based on subscores from a certification test found large differences in G for a reference group and a focal group (.51 vs. .71) even though PRMSEs were similar for the two groups. That study supported the use of G in conjunction with PRMSE to evaluate the utility of subscores (Jiang & Raymond, 2018).

Both G and PRMSE characterize subscores for a population of examinees, not for individuals. However, it is well known that the precision of scores varies by examinee (e.g., American Educational Research Association et al., 2014; Brennan & Feldt, 1989; Feldt, Steffen, & Gupta, 1985; Raju, Price, Oshima, & Nering, 2007). Consequently, the utility of score profiles can be assumed to vary by examinee, as implied by previous research examining subscores for groups based on gender, ethnicity, or native language (Reckase & Xu, 2015; Sinharay & Haberman, 2014).

The purpose of this article is to propose and evaluate indices of score profile utility that are based on multivariate G-theory and that are suitable for individuals and subgroups. One index is based on the ratio of observed profile variance to the error variance expected in the score profile. The second index is based on the ratio of true profile variance to observed profile variance and is a conditional version of the G index of score profile reliability. The next section presents a rationale and equations for the indices and describes variations of the indices that incorporate either absolute or relative error variances. We then demonstrate use of the indices with data from two certification tests and further investigate their performance using simulated data.

Sources of Profile Variability

Observed Score Profile Variance

Tests with subscores can be treated as a multivariate design in which each of p examinees responds to all i items nested within each of v subtests, with the item-level score for each examinee designated as xpiv. A traditional measure of score profile variability is the within-examinee variance (Cronbach & Gleser, 1953; Plake, Reynolds, & Gutkin, 1981):

$$SD_p^2=\frac{\sum_{v=1}^{n_v}(\bar{x}_{pv}-\bar{x}_p)^2}{n_v-1}, \tag{2}$$

where x¯pv is the mean score over i items on subtest v for examinee p, and x¯p is the mean score across nv subtests. The mean profile variance across nv subtests for examinees in subgroup j will be denoted as SDj2¯. This latter value is equivalent to V(X¯p), the observed-score profile variance presented in the denominator of Equation (1).
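Equation (2) amounts to a one-line computation. The sketch below is a minimal illustration (the function name is ours); it takes a matrix of subtest mean scores and returns SDp2 for each examinee along with the group mean SDj2¯.

```python
import numpy as np

def observed_profile_variance(subscores):
    """Equation (2): within-examinee variance across the subscore profile.

    subscores : (n_examinees, n_subtests) array of subtest mean scores
                (proportion correct), one row per examinee.
    Returns (SD_p^2 for each examinee, group mean SD_j^2-bar).
    """
    sd_p2 = np.var(subscores, axis=1, ddof=1)  # ddof=1 gives the n_v - 1 divisor
    return sd_p2, sd_p2.mean()
```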

Error Variance in Score Profiles

The numerator in Equation (1) estimates the true score variance for some population of examinees. Although it is not possible to directly estimate the true score profile variance for a single examinee, one can determine the error variances for each subscore for individual examinees and then aggregate that error over subtests. Generalizability theory (Brennan, 1998, 2001a) provides a flexible framework for accomplishing this. Brennan (1998, 2001a) distinguishes between two types of conditional errors: absolute and relative.

Absolute Error

The conditional absolute error variance for examinee p on subtest v is given by

$$\sigma_v^2(\Delta_p)=\frac{\sum_i\left(x_{piv}-\bar{x}_{pv}\right)^2}{n_i(n_i-1)}, \tag{3}$$

where xpiv is the response by examinee p to item i in subtest v, and x¯pv is the mean score over ni items in subtest v for examinee p. Equation (3) applies to rating scales and dichotomously scored items. Its square root is the conditional absolute standard error of measurement (SEM). Lord (1955) proposed an equivalent index of measurement error suitable for dichotomously scored items based on the binomial error model that requires only the number of items on the test and an estimate of an examinee’s score. Lord’s (1955) person-specific absolute error is given by

$$\sigma_v^2(\Delta_p)=\frac{\bar{x}_{pv}(1-\bar{x}_{pv})}{n_i-1}, \tag{4}$$

where x¯pv is the examinee’s proportion correct score on subtest v. Lord’s binomial error is noteworthy because it can be computed for every possible score on the score scale, with or without actual item response data. The mean of Equation (3) or (4) over np examinees can be used to estimate KR-21 if the total group variance of v is known (Lord, 1955).
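Both conditional absolute error estimators are simple to compute; the sketch below implements Equations (3) and (4) (the function names are ours). For 0/1 items the two formulas agree exactly, which is why Lord’s version needs only the proportion-correct score and the subtest length.

```python
import numpy as np

def conditional_absolute_error(item_scores):
    """Equation (3): conditional absolute error variance for one examinee
    on one subtest; item_scores holds the 0/1 (or rating-scale) responses."""
    x = np.asarray(item_scores, dtype=float)
    n_i = x.size
    return np.sum((x - x.mean()) ** 2) / (n_i * (n_i - 1))

def lord_binomial_error(pct_correct, n_i):
    """Equation (4): Lord's (1955) binomial error, computable for any score
    on the scale, with or without item-level response data."""
    return pct_correct * (1 - pct_correct) / (n_i - 1)
```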

Relative Error

Brennan (2001a) describes two methods for estimating conditional relative errors. One approach assumes that all examinees at a score level will have the same error variance, which simplifies computations and produces a more stable estimate than the more complete method (see Equations 5.36 and 5.38 in Brennan, 2001a). The simplified equation suggested by Brennan (2001a) is

$$\sigma_v^2(\delta_p)=\sigma_v^2(\Delta_p)-\sigma_v^2(I), \tag{5}$$

where σv2(I) is the D study variance component for the item effect for subtest v, obtained by dividing the item effect variance component by the number of items. The square root of Equation 5 gives the conditional relative SEM, while its average corresponds to the SEM based on KR-20 or coefficient alpha. Since σ2(I) is constant across all examinees, σv2(δp) ≤ σv2(Δp).
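A sketch of Equation (5) for one subtest follows. The D study component σv2(I) is estimated here from the usual p × i random-effects decomposition (σ²(i) = (MSi − MSpi)/np, then divided by ni); that estimator is standard G-theory, but the wrapper itself is our illustration, not code from the article.

```python
import numpy as np

def relative_error_variances(item_matrix):
    """Equation (5): conditional relative error = conditional absolute error
    minus the D-study item-effect component sigma^2(I).

    item_matrix : (n_persons, n_items) scored responses for one subtest.
    Returns the per-examinee relative error variances.
    """
    X = np.asarray(item_matrix, dtype=float)
    n_p, n_i = X.shape
    grand = X.mean()
    # p x i ANOVA sums of squares
    ss_p = n_i * np.sum((X.mean(axis=1) - grand) ** 2)
    ss_i = n_p * np.sum((X.mean(axis=0) - grand) ** 2)
    ss_pi = np.sum((X - grand) ** 2) - ss_p - ss_i
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))
    var_i = max((ms_i - ms_pi) / n_p, 0.0)  # G-study sigma^2(i), floored at 0
    sigma2_I = var_i / n_i                  # D-study item effect sigma^2(I)
    xbar = X.mean(axis=1)
    abs_err = np.sum((X - xbar[:, None]) ** 2, axis=1) / (n_i * (n_i - 1))  # Eq. (3)
    return abs_err - sigma2_I
```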

Expected Error Over Subtests

Given the individual errors defined above for each subtest, it is reasonable to estimate the expected variance over subtests due solely to measurement error. The mean absolute error variance over nv subtests for examinee p is given by

$$\overline{\sigma^2}(\Delta_p)=\frac{\sum_{v=1}^{n_v}\sigma_v^2(\Delta_p)}{n_v}, \tag{6}$$

while the mean relative error variance is

$$\overline{\sigma^2}(\delta_p)=\frac{\sum_{v=1}^{n_v}\sigma_v^2(\delta_p)}{n_v}. \tag{7}$$

These error variances can be averaged over individuals to obtain aggregate-level error terms for any subgroup j of examinees. The mean absolute errors and relative errors for group j are designated here as σ2¯(Δj) and σ2¯(δj). To eliminate some redundancy, the generic error term σ2¯(ep) will be substituted for σ2¯(Δp) and σ2¯(δp) in the text that follows.

Ratios of Profile Variances

The preceding section described the following indices: SDp2 and SDj2¯ are the observed profile variance for examinee p and the mean profile variance for examinees in group j; σ2¯(ep) and σ2¯(ej) indicate the expected error variance over v subtests for examinee p, or for examinees in group j. The generic error term e can be replaced with absolute error (Δ) or relative error (δ). These indices serve as the basis for two types of ratios that characterize the amount of profile variability that can be attributed to error variance.

Observed Profile Variance to Error Variance

The ratio of profile variances for examinee p is given by

$$RPV(e_p)=\frac{SD_p^2}{\overline{\sigma^2}(e_p)}. \tag{8}$$

This quantity indicates the extent to which the observed score profile for examinee p provides systematic variability above that due to measurement error. As the ratio of two variances, it approximates an F-like distribution with a value of 1.0 when the two variances are equal. Values substantially larger than 1.0 indicate that the score profile contains variability greater than what is expected due solely to measurement error. The corresponding index for any j group of examinees is

$$RPV(e_j)=\frac{\overline{SD_j^2}}{\overline{\sigma^2}(e_j)}. \tag{9}$$
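Given the pieces above, Equations (8) and (9) reduce to a few lines. The sketch below (hypothetical function name; either absolute or relative errors may be supplied) returns RPV for every examinee and for the group.

```python
import numpy as np

def rpv(subscores, error_vars):
    """Equations (8) and (9): observed profile variance over the mean
    conditional error variance.

    subscores  : (n_examinees, n_subtests) subtest mean scores
    error_vars : (n_examinees, n_subtests) conditional error variances,
                 absolute or relative (Equations 3-5)
    """
    sd_p2 = np.var(subscores, axis=1, ddof=1)   # Equation (2)
    mean_err = np.mean(error_vars, axis=1)      # Equations (6)/(7)
    rpv_p = sd_p2 / mean_err                    # Equation (8), per examinee
    rpv_j = sd_p2.mean() / mean_err.mean()      # Equation (9), group level
    return rpv_p, rpv_j
```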

True Profile Variance to Observed Variance

We propose the following index of profile reliability for group j

$$PR(e_j)=\frac{\overline{SD_j^2}-\overline{\sigma^2}(e_j)}{\overline{SD_j^2}}. \tag{10}$$

Classical test theory posits that the numerator of Equation (10) can be replaced with

$$\sigma^2(\tau_j)=\overline{SD_j^2}-\overline{\sigma^2}(e_j), \tag{11}$$

where σ2(τj) designates the true score variance for group j. Raju et al. (2007) relied on this formulation in their derivation of a reliability index conditioned on person p, although they acknowledged that their index was not interpretable as reliability in the usual sense. The use of σ2(τp) for an individual poses a problem because a person’s true score is unknown and is assumed to be constant for that person. Furthermore, σ2(τp) will be negative for individual examinees whose observed score profiles are flatter than what is predicted from measurement error. Thus, it is suggested that Equation 10 be applied to j groups rather than to individuals. This ratio is conceptually similar to Brennan’s (2001a) index of score profile reliability presented above as Equation (1) and described in the appendix (available online). Whereas Equation (1) estimates group-level true score variance through a between-subjects generalizability analysis, Equation (10) estimates true score variance as the difference between observed variance and error variance for individuals within each of the j groups.
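For completeness, here is a group-level sketch of Equation (10), reusing the same inputs as the RPV sketch above (again, the names are illustrative):

```python
import numpy as np

def profile_reliability(subscores, error_vars):
    """Equation (10): (observed - error) / observed, computed from the group
    means of the per-examinee quantities. Applied to groups rather than
    individuals, because sigma^2(tau_p) can be negative for a person."""
    sd_j2 = np.var(subscores, axis=1, ddof=1).mean()
    err_j = np.mean(error_vars)
    return (sd_j2 - err_j) / sd_j2   # = sigma^2(tau_j) / SD_j^2-bar
```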

Examples Based on Real and Simulated Data

Four sets of analyses follow to illustrate application of the aforementioned indices. The first example (Data Set 1) computes RPV(Δp) for individual examinees and demonstrates its usefulness for comparing examinees at varying levels of proficiency. A resampling study is also conducted to determine the values of RPV(Δp) when all the observed variance in a score profile is engineered to be completely random. The second example (Data Set 2) computes PR(Δj) and PR(δj) for a reference group and a focal group of examinees on four test forms. The examples show how these indices can identify potentially important group differences in subscores. The third analysis compares PR(δj) to Brennan’s G for Data Sets 1 and 2. The fourth analysis is a simulation study designed to evaluate RPV(Δp) when true subscore profiles are constructed to be flat. The simulation also looks at the magnitude of RPV(Δp) and PR(δj) when subscores are known to have merit based on VAR (or PRMSE).

Data Set 1: RPV(Δp) and RPV(Δj) for Different Ability Groups

Data Source

Data were obtained from a multiple-choice test completed by a national sample of 539 health professionals. The test consists of 359 items partitioned into seven subtests (e.g., anatomy, pathology). Table 1 summarizes the statistical properties of each subtest, including its correlation with the total test. All scores are on a percent-correct metric. Based on the PRMSE criteria (Haberman, 2008), none of the subscores added value above that provided by the total score.

Table 1.

Subtest Summary Statistics for Data Set 1.

| Subtest | No. of items | rvv | M | SD | rvt |
|---|---|---|---|---|---|
| A | 48 | .78 | .52 | .14 | .83 |
| B | 47 | .70 | .64 | .11 | .78 |
| C | 42 | .74 | .60 | .13 | .85 |
| D | 55 | .76 | .66 | .12 | .81 |
| E | 53 | .69 | .67 | .10 | .84 |
| F | 63 | .66 | .63 | .09 | .79 |
| G | 51 | .54 | .69 | .08 | .63 |
| Total | 359 | .93 | .64 | .09 | |

Note. N = 539 examinees. rvv refers to coefficient alpha; rvt is the correlation of each subtest with the total score.

Individual Indices

Individual values of SDp2 ranged from 0.0008 to 0.0396, with a mean of .0089. Meanwhile, the mean conditional absolute errors of measurement across subtests, σ2¯(Δp), ranged from 0.0022 to 0.0050, with an overall mean of .0044. The overall magnitude of these values is consistent with the fact that they represent squared deviations on a proportion correct scale. A comparison of the two means (0.0089/0.0044 = 2.0227) indicates that, overall, observed profile variability is about twice that of error variability. Figure 1 depicts the distribution of RPV(Δp) for the 539 examinees. Individual values of RPV(Δp) ranged from 0.21 to 10.28, with M = 2.02 and SD = 1.20. A notable percentage of the examinees have score profiles that are three or four times the variability expected due to error. For example, 10% of examinees had values of RPV(Δp) exceeding 3.56.

Figure 1. Frequency distribution of RPV(Δp) for Data Set 1 for 539 examinees. Two values are not shown because the x axis is truncated at 7.0. RPV = ratio of profile variances.

Group Indices

There is interest in identifying particular subgroups of examinees for which subscores may be useful (Huff & Goodman, 2007). In particular, several studies show that low-scoring examinees have more variable score profiles than high-scoring examinees (Haladyna & Kramer, 2004; Raymond, Swygert, & Kahraman, 2012). Of course, it is important to determine how much of that additional variability can be attributed to error variance. Figure 2 shows error variance and observed profile variance as a function of examinee proficiency, where proficiency is based on total test score. Each data point is based on just over 100 examinees. This graph demonstrates that observed profile variability differs systematically by proficiency level and that the difference between observed and error variance increases as examinee proficiency decreases. RPV(Δj) ranged from 2.60 at the lowest level of proficiency to 1.41 at the highest level.

Figure 2. Observed profile variances (SDj2¯) and mean conditional error variances (σ2¯(Δj)) at five proficiency levels for Data Set 1. The ns in each group ranged from 107 to 109 examinees.

Resampling Study

A reasonable argument is that large values of RPV(Δp) reflect chance variation in SDp2; that is, regardless of the collection of items that make up a subtest, some proportion of examinees will exhibit variable score profiles due to error. Although including an index of measurement error in the denominator of RPV(Δp) is intended to account for this chance variation, it is important to verify this empirically. A resampling study was conducted to document the magnitude of observed profile variance, SDp2, when all of it is known to be random error. We randomly assigned the 359 item responses to the seven subtests (preserving the subtest lengths specified in Table 1), computed subscores for each examinee, and documented SDp2 over these randomly generated subscores. Then, values of RPV̂(Δp) were obtained for each examinee, where the “hat” indicates the ratio of profile variances when subscores contain no known systematic variance (i.e., all variance is random). RPV̂(Δp) can be thought of as the null distribution for this specific test and subtests. This process was replicated 10,000 times for each of the 539 examinees, and RPV̂(Δp) was compared to values of RPV(Δp) obtained from the original, unaltered test.
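The resampling scheme is straightforward to reproduce. The following sketch is our reconstruction (hypothetical function name, with Lord’s Equation (4) as the error term); it randomly reassigns items to subtests and collects the null RPV̂(Δp) values.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

def null_rpv(responses, subtest_lengths, n_reps=10_000):
    """Resampling sketch: reassign items to subtests at random (keeping the
    original subtest lengths), recompute subscores, and collect RPV-hat,
    whose profile variance is random by construction.

    responses : (n_examinees, n_items) 0/1 response matrix
    Returns an (n_reps, n_examinees) array of null RPV values.
    """
    X = np.asarray(responses, dtype=float)
    cuts = np.cumsum(subtest_lengths)[:-1]
    out = np.empty((n_reps, X.shape[0]))
    for r in range(n_reps):
        groups = np.split(rng.permutation(X.shape[1]), cuts)
        # subscores on the randomly formed subtests
        means = np.column_stack([X[:, g].mean(axis=1) for g in groups])
        # Lord's binomial error, Equation (4), for each random subtest
        errs = np.column_stack([m * (1 - m) / (len(g) - 1)
                                for g, m in zip(groups, means.T)])
        out[r] = np.var(means, axis=1, ddof=1) / errs.mean(axis=1)
    return out
```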

We first compared the distributions of RPV̂(Δp) based on the resampled data to the distribution of RPV(Δp) from the original test presented in Figure 1. The simulated distributions were not similar to Figure 1, with RPV̂(Δp) consistently exhibiting a smaller mean and much less variability. For the typical replication, the summary statistics for RPV̂(Δp) were M = 1.007, SD = 0.582, with minimum and maximum values of 0.07 and 3.80. We next evaluated the values of RPV̂(Δp) across all replications for selected examinees. Of particular interest were those examinees with high values of RPV(Δp) based on the original test; the intent was to determine whether there was something about these examinees that resulted in highly variable score profiles regardless of subtest composition. Table 2 summarizes the simulated distributions for the 10 examinees with the largest observed values of RPV(Δp). These 10 examinees had total scores comparable to the entire group (M = .62 vs. .64). As indicated in Table 2, the distributions of simulated values of RPV̂(Δp) are nearly identical across examinees in terms of summary statistics and common percentiles. The bottom row of the table shows the results across all 539 examinees. Overall, the 10 examinees differed from other examinees only in their high values of RPV(Δp) based on the original subscores.

Table 2.

Simulated Values of RPV̂(Δp) Obtained Over 10,000 Replications by Randomly Assigning Items to Subtests: Results for Examinees With the 10 Largest Actual (Observed) Values of RPV(Δp).

| Examinee | Total score | Observed RPV(Δp) | Minimum | Maximum | M | SD | 67th | 90th | 95th | 99th |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | .69 | 10.28 | .01 | 4.93 | 1.004 | .588 | 1.153 | 1.786 | 2.121 | 2.855 |
| 2 | .71 | 7.18 | .02 | 5.33 | 1.007 | .600 | 1.155 | 1.785 | 2.158 | 2.925 |
| 3 | .62 | 6.24 | .04 | 5.15 | 1.001 | .592 | 1.151 | 1.783 | 2.129 | 2.858 |
| 4 | .58 | 5.48 | .03 | 4.27 | 1.003 | .583 | 1.157 | 1.783 | 2.128 | 2.806 |
| 5 | .52 | 5.32 | .01 | 5.84 | 1.015 | .607 | 1.169 | 1.820 | 2.150 | 2.986 |
| 6 | .67 | 5.25 | .02 | 5.09 | 1.004 | .585 | 1.146 | 1.769 | 2.100 | 2.958 |
| 7 | .60 | 5.09 | .03 | 5.79 | 1.007 | .602 | 1.150 | 1.801 | 2.163 | 2.945 |
| 8 | .63 | 5.08 | .02 | 5.70 | 1.002 | .595 | 1.148 | 1.792 | 2.122 | 2.857 |
| 9 | .65 | 5.03 | .01 | 6.15 | 1.003 | .588 | 1.148 | 1.771 | 2.109 | 2.877 |
| 10 | .48 | 5.00 | .04 | 4.34 | 1.003 | .593 | 1.148 | 1.804 | 2.165 | 2.917 |
| All examinees (Ms) | | | .02 | 5.21 | 1.007 | .594 | 1.154 | 1.790 | 2.123 | 2.882 |

Note. Columns Minimum through 99th summarize the simulated distribution of RPV̂(Δp) for each examinee over the 10,000 replications. RPV = ratio of profile variances.

Table 2 can be used to estimate the probability of obtaining a value of RPV̂(Δp) greater than some specified value. As one example, one can see that RPV̂(Δp) exceeded 2.12 for about 5% of replications. These results confirm that if subtests consist of random collections of items, then the observed variance over the score profile approximates the variance expected due to random error. Importantly, the results indicate that σ2¯(Δp) closely approximates the amount of score profile variability due to random error and that, in the absence of systematic variance, SDp2 ≈ σ2¯(Δp) and RPV(Δp) ≈ 1.0. We have confirmed this finding through resampling studies with other sets of real data.

In summary, Data Set 1 indicates that score profiles include systematic variance above what is expected due to error variance and that low-scoring examinees exhibit more subscore variability than high-scoring examinees. A reasonable interpretation of the resampling study is that RPV(Δp) is sensitive to the observed variability in score profiles that cannot be attributed to measurement error.

Data Set 2: Subgroup Differences in PR(Δj) and PR(δj)

Data Source

Item responses were available for four forms of a certification test completed by a national sample of health professionals in medical imaging. Each 200-item test comprises five subtests. The first four subtests consist of technical content (e.g., radiation biology, imaging parameters), while the fifth subtest addresses interpersonal interactions (e.g., communication, professionalism). The total test and subtests are built to tight specifications, with total test means falling between .769 and .772 across the four forms, and reliability coefficients ranging from .934 to .942. The analyses compared two groups. The reference group included 13,676 examinees who recently completed a formal educational program and were seeking initial certification, while the focal group consisted of 395 examinees who completed their education several years earlier, had let their certification lapse for various reasons, and were seeking recertification. The testing agency hypothesized that subscores might be beneficial for the focal group because of the time elapsed since those examinees had completed their formal education.

Data for the four test forms were combined to achieve adequate sample sizes for the focal group. Combining forms also demonstrates the flexibility of working with absolute errors, which can be estimated for any examinee score independent of the particular composition of test items. Table 3 provides the subtest summary statistics for the two groups. The reference group scored about .10 higher than the focal group on Subtests A through D, while the two groups had similar means on Subtest E. The subtest–total test correlations were similar, and the two groups exhibited similar amounts of subtest variability.

Table 3.

Subtest Summary Statistics for Data Set 2.

| Subtest | No. of items | rvv | Reference M | Reference SD | Reference rvt | Focal M | Focal SD | Focal rvt |
|---|---|---|---|---|---|---|---|---|
| A | 45 | .76 | .791 | .117 | .868 | .705 | .134 | .851 |
| B | 22 | .69 | .738 | .156 | .820 | .638 | .164 | .795 |
| C | 45 | .80 | .741 | .137 | .904 | .658 | .138 | .887 |
| D | 58 | .82 | .783 | .121 | .896 | .700 | .126 | .875 |
| E | 30 | .60 | .799 | .111 | .719 | .810 | .111 | .688 |
| Total | 200 | .93 | .773 | .108 | | .701 | .111 | |

Note. Reference group N = 13,676 and focal group N = 395 examinees. rvv refers to coefficient alpha; rvt is the correlation of each subtest with the total score.

Figure 3 plots the indices of variability for the reference and focal groups at each ability level. Instability for the focal group line can be attributed to the small sample sizes for each decile, which ranged from 10 to 81, with the smaller ns occurring at higher deciles. Estimation of SDj2¯ and σ2¯(Δp) across forms was straightforward, as neither statistic is group dependent. However, determining σ2¯(δj) was complicated by its reliance on decision study variance components for the item effect, σv2(I), which is group dependent. Therefore, σv2(I) was estimated within each test form for the reference and focal groups and averaged across forms.

Figure 3. Observed profile variances and mean conditional error variances (absolute and relative) for the reference and focal groups for Data Set 2. Small samples for the focal group affect the stability of each point, particularly at the upper two deciles.

Figure 3 further demonstrates that low-scoring examinees have more variable score profiles than high-scoring examinees. In addition, examinees in the focal group exhibit more variable score profiles than the reference group at most ability levels. Also note that σ2¯(δp) is less than σ2¯(Δp), a finding that is consistent with univariate G-theory, where relative error is usually less than absolute error. Finally, focal examinees had slightly less relative error variance than reference examinees.

Table 4 presents the two indices of profile reliability for the two examinee groups at each ability level. As expected, PR(Δj) was consistently less than PR(δj). By either index, low-ability examinees generally had more reliable score profiles than high-ability examinees, and the focal group had more reliable profiles than the reference group. The indices of reliability are remarkably low for most groups, a finding consistent with research on G (Jiang & Raymond, 2018). Regardless of the overall magnitude of the indices in Table 4, the important point is that the utility of subscore profiles varies as a function of examinee ability and curricular differences.

Table 4.

Profile Reliability (PR) at 10 Ability Levels for Reference Group and Focal Group Examinees.

| Ability level | Reference PR(Δj) | Reference PR(δj) | Focal PR(Δj) | Focal PR(δj) |
|---|---|---|---|---|
| 1 | .472 | .509 | .620 | .658 |
| 2 | .401 | .445 | .614 | .655 |
| 3 | .370 | .420 | .527 | .581 |
| 4 | .322 | .379 | .493 | .556 |
| 5 | .313 | .377 | .411 | .490 |
| 6 | .252 | .328 | .388 | .478 |
| 7 | .195 | .285 | .392 | .492 |
| 8 | .186 | .292 | .351 | .475 |
| 9 | .167 | .297 | .138 | .324 |
| 10 | .143 | .344 | .142 | .391 |

Note. PR(Δj) indicates the profile reliability based on absolute error, while PR(δj) corresponds to relative error. See Equation (10) for details.

PR(δj) and G for Data Sets 1 and 2

The analysis of variance identity suggests that partitioning score variability using the within-subject approach developed here ought to give the same result as partitioning the variability using a between-subjects approach as done with generalizability theory. Specifically, PR(δj) should closely approximate G. Data Sets 1 and 2 were analyzed using mGENOVA (Brennan, 2001b), which, as optional output, provides estimates of G, V(μp), and V(X¯p) based on Equation (1).

Table 5 shows that PR(δj) is very similar to G across all test forms, differing mostly in the third decimal place. The values of PR(δj) are very low except for the focal group on Data Set 2, where values for each form hover around .60. The differences between σ2(τj) and V(μp), and between SDj2¯ and V(X¯p), are more notable, with the indices from mGENOVA being systematically smaller. While some of the difference in estimates can be attributed to rounding, most can be explained by Brennan’s (2001b) use of biased estimators for certain variances (i.e., nv instead of nv − 1) in order to maintain consistency with other aspects of multivariate G-theory. As one example, consider the values of σ2(τj) and V(μp) in the first row of Table 5. The two values can be made equivalent by multiplying σ2(τj) by (nv − 1)/nv, or by 6/7 in the present case (0.00524 × 6/7 = 0.00449). The same adjustment can be done to bring SDj2¯ in line with V(X¯p); the adjustment for Data Set 2 would be 4/5. The results in Table 5 confirm the comparability of the two different frameworks for quantifying the reliability of subscore profiles.

Table 5.

Comparison of Indices Based on Equations (7), (10), and (11) to Indices Based on Equation (1) Computed by mGENOVA (Brennan, 2001a, 2001b).

| Data source | N | PR(δj) (Eq. 10) | σ2(τj) (Eq. 11) | SDj2¯ | G (Eq. 1) | V(μp) | V(X¯p) |
|---|---|---|---|---|---|---|---|
| Data Set 1 | 539 | .58764 | .00524 | .00892 | .58845 | .00450 | .00764 |
| Data Set 2: Reference, Form 1 | 3374 | .39042 | .00267 | .00684 | .39079 | .00214 | .00547 |
| Data Set 2: Reference, Form 2 | 3346 | .33116 | .00207 | .00624 | .33157 | .00166 | .00500 |
| Data Set 2: Reference, Form 3 | 3399 | .44248 | .00332 | .00750 | .44269 | .00266 | .00600 |
| Data Set 2: Reference, Form 4 | 3457 | .40453 | .00286 | .00708 | .40473 | .00229 | .00566 |
| Data Set 2: Focal, Form 1 | 100 | .59026 | .00671 | .01137 | .59435 | .00541 | .00910 |
| Data Set 2: Focal, Form 2 | 108 | .57201 | .00619 | .01082 | .57592 | .00498 | .00865 |
| Data Set 2: Focal, Form 3 | 92 | .65351 | .00910 | .01393 | .65727 | .00732 | .01114 |
| Data Set 2: Focal, Form 4 | 95 | .59158 | .00705 | .01191 | .61085 | .00590 | .00965 |

Simulation Experiment

Results to this point suggest that RPV(ep) and PR(ej) are viable indices for characterizing the amount of observed score profile variability that can be attributed to random error. While the resampling results are encouraging, they will not generalize to tests with different properties. Also, the real data sets represent instances where subscores would not be worth reporting based on conventional methods, and it would be useful to learn more about the ratios under conditions more favorable to subscores. The primary purpose of the simulation is to determine the values of RPV(ep) that are obtained when score profiles are known to be flat and very highly correlated. The simulation also provides the opportunity to evaluate the distribution of RPV(ep) when PRMSE or VAR indicates that subscores are worth reporting.

Study Design

Experimental conditions were created by manipulating subtest reliability, subtest correlation, and variability in subscore means—all of which are known to affect subscore utility. For the baseline, or null, conditions, subtest correlations were near 1.0 and subtest true score means did not differ. The other conditions represented instances in which score profiles based on two subtests varied systematically, in the way that performance on multiple-choice items differs from performance on constructed response items (Bridgeman & Lewis, 1994; Sinharay & Haberman, 2014). Thus, simulated score profiles consisted of two subtests. Previous studies indicate that the number of subtests is not a potent factor in this line of research and that findings for two subtests generalize to multiple subtests (Feinberg & Wainer, 2014; Jiang & Raymond, 2018; Sinharay, 2010). Three factors were manipulated:

  1. Subtest reliability, ρv2. Three levels were targeted, designated as high (ρv12 = .85; ρv22 = .85), moderate (ρv12 = .85; ρv22 = .75), and low (ρv12 = .85; ρv22 = .65). The unequal reliabilities mimic the situation where selected response items are more reliable than constructed response items and are consistent with the meta-analytic findings of Rodriguez (2005). These targeted levels of reliability were achieved by creating subtests with the number of items set at 11, 20, and 35 items. Given that the method of simulation allowed items to vary across replications, actual subtest reliability differed slightly from the targeted reliabilities.

  2. Correlation between subtests, ρvv. Two levels of the correlation between true subtest scores were studied, with values of ρvv set at .99 and .82. A value of .99 represents the baseline condition where subscores for an examinee are the same except for random error. The value of 0.82 is consistent with the correlation between true scores for selected response and constructed response item formats (Rodriguez, 2005).

  3. Difference in the subtest means, Δμ. Four levels were studied, with Δμ set at 0.00, 0.25, 0.50, and 0.75. Although it is common practice to scale subtests to have equal means and SDs in the population, numerous studies have shown subtest mean differences of 0.50 SD or more for examinee subgroups (e.g., Bridgeman & Lewis, 1994; Reardon et al., 2018; Reckase & Xu, 2015; Sinharay & Haberman, 2014).

The three factors were fully crossed, creating 24 conditions. The null conditions are those where RPV(Δp) is expected to be approximately 1.0; these are the conditions where score profiles in the population are flat (Δμ = 0.0) and correlate nearly perfectly (ρvv = .99). While some conditions are not very realistic (e.g., ρvv = .99; Δμ = .75), they should shed light on the behavior of RPV(Δp) across a variety of conditions. The sample size for each replication was 1,000 examinees, with 500 replications per condition.

Subscores were generated using a two-parameter logistic multidimensional IRT model (Reckase, 2007). Let θ = (θ1, θ2, …, θK) correspond to the K-dimensional true ability vector of an examinee, where K = 2 for the present study. The probability of a correct response P to item i from an examinee can be expressed as

$$P=\frac{\exp(a_{1i}\theta_1+a_{2i}\theta_2+\cdots+a_{Ki}\theta_K-b_i)}{1+\exp(a_{1i}\theta_1+a_{2i}\theta_2+\cdots+a_{Ki}\theta_K-b_i)},$$

where bi is a scalar difficulty parameter and ai = (a1i, a2i, …, aKi) is a vector of discrimination parameters for item i. Because each item measures only one subscore, ai can be specified as (0, …, aiV, …, 0), where V identifies the subscore. Each element in θ can be regarded as a subtest, and θk is an examinee’s score for subtest k. Item responses were generated by comparing P with a random draw u from a uniform distribution ranging from 0 to 1: if P ≥ u, then the response xi to item i is 1; otherwise (P < u), xi = 0. Discrimination parameters were generated from a log-normal distribution (M = 0.0, SD = 0.5), while difficulty parameters were normally distributed (M = 0, SD = 1). This method of simulation is similar to that employed in other subscore studies (Sinharay, 2010); a complete description can be found elsewhere (Jiang, 2018; Jiang & Raymond, 2018).
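A compact sketch of the generating model follows. It is our reconstruction, not the authors’ code: the function name and defaults are hypothetical, and inducing Δµ by shifting the second subtest’s difficulties is an assumption, since the article does not spell out that mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

def simulate_responses(n_examinees=1000, items_per_subtest=(20, 20),
                       rho=0.82, delta_mu=0.5):
    """Two-dimensional 2PL with simple structure: each item loads on one
    ability only. rho is the correlation between true subtest abilities;
    delta_mu shifts the second subtest's difficulties to create a mean
    difference (an assumption of this sketch).
    Returns a list of (n_examinees, n_i) 0/1 response matrices.
    """
    cov = np.array([[1.0, rho], [rho, 1.0]])
    theta = rng.multivariate_normal([0.0, 0.0], cov, size=n_examinees)
    responses = []
    for k, n_i in enumerate(items_per_subtest):
        a = rng.lognormal(mean=0.0, sigma=0.5, size=n_i)   # discriminations
        b = rng.normal(loc=delta_mu if k == 1 else 0.0, scale=1.0, size=n_i)
        logit = np.outer(theta[:, k], a) - b               # a_i * theta_k - b_i
        p = 1.0 / (1.0 + np.exp(-logit))
        u = rng.uniform(size=p.shape)
        responses.append((p >= u).astype(int))             # x = 1 if P >= u
    return responses
```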

Three outcome measures were of interest: RPV(Δp), PR(δj), and VAR, where VAR represents two PRMSE values expressed as a ratio (Feinberg & Wainer, 2014). Since VAR is computed for each subtest, each profile will have two VARs. In contrast, RPV(Δp) and PR(δj) are obtained once for the score profile. To simplify comparison, we report VAR only for the less reliable of the two subtests.

Results

The left portion of Table 6 presents mean values of RPV(Δp) across the 24 conditions. Of most interest are the three null conditions where ρvv = .99 and Δµ = 0.0. Mean values of RPV(Δp) are very close to 1.0 for these conditions, which is encouraging. For all other conditions, RPV(Δp) follows the expected trends; that is, for most conditions, means get larger as the correlation between subtests decreases and as subtest reliability and the differences in means increase. The fact that mean values are slightly greater than 1.0 for the null conditions can likely be attributed to the variance in subtest difficulty introduced by the method of simulation, which allowed observed subtest means to differ from Δµ = 0.0 at each replication. More specifically, the subtests at each replication were based on items with discrimination and difficulty parameters that varied across subtests and replications, as described above (e.g., difficulty parameters with M = 0, SD = 1). The consequence was that even with Δµ = 0.0 in the population, the observed subtest means typically differed by about 0.15 SD, with mean differences occasionally reaching 0.50 SD.

Table 6.

Mean RPV(Δp) and Percentage of Examinees for Whom RPV(ep) > 4.0 for 24 Conditions.

| ρvv | ρv2 | Mean RPV: Δµ = 0.00 | Δµ = 0.25 | Δµ = 0.50 | Δµ = 0.75 | % RPV > 4.0: Δµ = 0.00 | Δµ = 0.25 | Δµ = 0.50 | Δµ = 0.75 |
|---|---|---|---|---|---|---|---|---|---|
| .99 | .65 | 1.111 | 1.544 | 1.818 | 2.479 | 7.8 | 9.7 | 12.2 | 17.6 |
| .99 | .75 | 1.117 | 1.252 | 1.645 | 2.334 | 6.3 | 8.2 | 12.7 | 20.0 |
| .99 | .85 | 1.054 | 1.251 | 1.800 | 2.626 | 5.2 | 6.8 | 13.3 | 22.1 |
| .82 | .65 | 1.850 | 1.966 | 2.409 | 3.200 | 12.7 | 13.6 | 17.0 | 23.4 |
| .82 | .75 | 1.841 | 2.015 | 2.347 | 3.134 | 14.7 | 15.8 | 20.0 | 27.0 |
| .82 | .85 | 2.071 | 2.246 | 2.730 | 3.627 | 15.5 | 17.2 | 21.8 | 29.5 |

Note. RPV = ratio of profile variances; VAR = value-added ratio; PRMSE = proportion reduction in mean-square-error. In the published table, shaded cells indicated the conditions for which VAR > 1.0, so that subscores would be regarded as worth reporting by the PRMSE method of Haberman (2008); these are the eight conditions with ρvv = .82 and ρv2 = .75 or .85 (see text).

The right portion of Table 6 indicates the percentage of examinees for whom RPV(Δp) exceeded a threshold of 4.0, a value that might be regarded as sufficiently large to justify reporting subscores for an individual. An RPV(Δp) of 4.0 corresponds to a PR(δj) or G of approximately 0.75, and previous research indicates that when G is in the .70s, VAR almost always exceeds 1.0 (Jiang & Raymond, 2018).1 For the three null conditions where ρvv = .99 and Δµ = 0.0, between 5% and 8% of examinees had values of RPV(Δp) that exceeded 4.0. These values might be regarded as Type I errors, as all examinees in these conditions were expected to have flat score profiles. The results in the right portion of the table follow the expected trend, with the percentage of examinees with RPV(Δp) > 4.0 increasing as the correlation drops from .99 to .82, as reliability increases, and as mean differences increase. The shaded cells indicate the conditions for which VAR consistently exceeded 1.0 and subscores would be considered worth reporting. The percentage of subscore profiles with RPV(Δp) > 4.0 for these eight conditions ranges from a low of 14.7% to a high of 29.5%. That is, even when VAR indicates that subscores are worth reporting, only a minority of examinees have what might be regarded as meaningfully different subscores, and some of these examinees would have chance differences in subscores.

Figure 4 depicts the relationship between mean values of PR(δj) and VAR across all conditions. VAR and PR(δj) do not covary within any particular panel because, as expected, VAR is nearly constant across Δµ. However, the two indices do covary across conditions if one collapses levels of Δµ. Although the relationship between the two indices is difficult to discern in Figure 4, Jiang and Raymond (2018) found that when Δµ = 0, VAR can be estimated from PR(δj) across a broad range of conditions with an R2 of 0.95. It is instructive to inspect the two bottom-right panels in Figure 4: while VAR just exceeded 1.0 for all eight of these conditions, PR(δj) ranged from .50 to .73.

Figure 4. Mean VAR and PR(δj) for 24 conditions based on three levels of reliability (ρv2), two levels of subtest correlation (ρvv), and four differences in subtest means (Δµ). VAR = value-added ratio; PR = profile reliability.

Simulation Summary

The simulations indicate that RPV(Δp) can overidentify examinees with variable profiles even with the moderately stringent threshold value of 4.0. The results for the null conditions in Table 6 may be overestimates because the simulation design allowed item parameters to vary across subtests; in practice, most examination programs construct subtests to statistical specifications rather than selecting items at random. Nonetheless, the simulation results provide a useful caution against overinterpreting values of RPV(Δp). The simulation results also support the findings of Feinberg and Jurich (2017), who opined that VAR = 1.0 may be an overly liberal criterion for determining the value of subscores. One might question whether subscores should be reported at all when only 15% to 30% of examinees have values of RPV(Δp) that exceed 4.0.2 Using RPV(Δp) or PR(δj) in conjunction with VAR can help identify subscore profiles that may or may not be worth reporting.

Discussion

This article introduced methods for evaluating the extent to which the observed variability in score profiles is greater than what can be explained by measurement error alone. The methods are founded on two measures that can be computed with ease at the level of the individual examinee: (1) the amount of observed variability in a score profile and (2) the amount of variability that can be attributed to measurement error. These two measures provide the basis for different indices that can be obtained for individuals, subgroups, or the total group.

Two within-examinee ratios were proposed. The first is based on the ratio of observed profile variability to error variability. RPV(ep) applies to individual examinees, while RPV(ej) can be applied to any group j of examinees; the error term e can be defined as either absolute (Δ) or relative (δ) error variance. The second index, PR(ej), is based on the ratio of observed profile variance minus error variance to the observed variance in the score profile. It also can be specified in terms of absolute or relative errors. PR(ej) can be interpreted as a reliability-like coefficient for any group or subgroup of examinees. Although RPV(ep) can be interpreted for individual examinees, that is not the case for PR(ej), given that true scores for individuals are unknown and do not vary. While these indices do not explicitly include subscore covariance terms, it is evident that covariances will affect the within-examinee variability across subtests (Conger & Lipshitz, 1973).

The analyses of Data Sets 1 and 2 illustrated the basic properties of the indices and demonstrated that PR(δj) produces results comparable to Brennan’s G. The resampling study for Data Set 1 supported the proposition that the error variance expected in an examinee’s score profile can be defined as the average of the conditional errors across subtests for that examinee. The simulation study indicated that when score profiles are known to be flat, RPV(ej) is near 1.0. The simulation experiment also raised a flag of caution by indicating that some percentage of examinees can have large values of RPV(ep) due solely to chance. We suspect that the true results lie somewhere between what is suggested by the resampling study and the simulation experiment: resampling will underestimate the variability to be expected in other samples, while the simulation experiment included sources of variation (i.e., a large item effect within and across subtests) that are not likely to exist for well-constructed tests.

These ratios have several notable features. First, they are based on the simple idea that the information available in a score profile is a direct function of profile variability. Second, the computations are straightforward. This is especially true for absolute errors for dichotomously scored items, which can be estimated via Equation (4) prior to data collection (Lord, 1955). Third, the indices are useful for both individual examinees and subgroups of examinees. The methods proposed here acknowledge that subscores may be useful for some individuals or groups but not for others, even though the subscores may have been deemed useful (or not) for the total group. Fourth, the absolute indices can be obtained without having to estimate correlations or covariances. This can be an advantage when evaluating subgroups for which sample sizes are small or when there is a restricted range of variance, and it is not feasible to calculate PR(δj) or Brennan’s G due to the effects of range restriction on correlations and variance components.

These indices should prove useful when used in conjunction with correlations and PRMSE or VAR to more fully evaluate the properties of subscores. For example, PRMSE can identify which subtests are worthy of inclusion in a score profile, while PR(ej) can be used to temper the interpretation of score profiles once the reportable subtests have been identified. Feinberg and Jurich (2017) cautioned that PRMSE may be too generous, identifying subscores as useful when the differences are really quite small. The simulations presented here detected instances where PRMSE indicated that subscores had utility, but RPV(Δp) found subscores to be meaningful for just a minority of examinees (Table 6). The use of RPV(Δp) or PR(ej) in such instances could alert users to the limitations of score profiles. In contrast, there may be instances where PRMSE underestimates the utility of score profiles because of its insensitivity to variation in subtest means for subgroups (Jiang & Raymond, 2018). The indices proposed here may provide a more meaningful picture of score profile utility for subgroups based on gender, language background, differences in training, level of proficiency, or other contexts where subgroup profiles can legitimately differ (e.g., Bridgeman & Lewis, 1994; Detterman & Daniel, 1989; Fleishman, 1957; Haladyna & Kelley, 2006; Raymond et al., 2012; Sinharay & Haberman, 2014). For example, in Data Set 2, both level of ability and curricular experiences appeared to be sources of systematic variation in score profiles.

The limitations of this article suggest a few issues to address in future work. One challenge is to establish guidelines for RPV(ep) and PR(ej)—values above which a score profile is deemed to have meaning. Even when score profiles are determined to contain meaningful variability, it is still necessary to compute confidence intervals for individual examinees to determine which specific subscores merit attention. Consistent with Brennan (2001a, pp. 317-320), we advocate confidence intervals based on conditional errors of measurement rather than the total-group SEM. Another limitation of the present study is that it was illustrated with two data sets with similar properties—highly correlated subtests with modest reliabilities. Although the simulation study attempted to address this limitation, it still would be useful to evaluate the indices presented here with tests built with subscore interpretation in mind. In addition, further simulation studies are needed to investigate the properties of the indices under a wider array of conditions, using data simulation methods that capture different sources of variability and dimensionality. It should be noted that the methods presented here apply to observed scores on a percentage-correct or number-correct scale; their applicability to scaled scores or a theta metric introduces additional complications (see Klauer & Rettig, 1990, for an IRT-based index of person fit). These limitations notwithstanding, the indices presented here provide a useful adjunct to traditional methods for evaluating subscore utility. They extend the notion of conditional errors from single scores to score profiles and dovetail nicely with certain aspects of multivariate G-theory.

Supplemental Material

Appendix for Indices of Subscore Utility for Individuals and Subgroups Based on Multivariate Generalizability Theory (available online).

1. The previous study evaluated G, but as demonstrated in Table 5, PR(δj) ≈ G.

2. Other values of RPV(Δp) could be chosen. For example, RPV(Δp) values of 3.0 and 5.0 would correspond to PR(δj) or G of approximately 0.67 and 0.80 under classical test theory assumptions, because from Equations (9) and (10), PR = 1 − 1/RPV.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

Supplemental Material: Supplemental material for this article is available online.

References

  1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: Author.
  2. Brennan R. L. (1998). Raw-score conditional standard errors of measurement in generalizability theory. Applied Psychological Measurement, 22, 307-333.
  3. Brennan R. L. (2001a). Generalizability theory. New York, NY: Springer-Verlag.
  4. Brennan R. L. (2001b). Manual for mGENOVA, Version 2.1 (Iowa Testing Programs Occasional Papers, No. 50). Iowa City: University of Iowa.
  5. Brennan R. L. (2011). Utility indices for decisions about subscores (CASMA Research Report No. 33). Iowa City: University of Iowa.
  6. Bridgeman B. (2016). Can a two-question test be reliable and valid for predicting academic outcomes? Educational Measurement: Issues and Practice, 35, 21-24.
  7. Bridgeman B., Lewis C. (1994). The relationship of essay and multiple-choice scores with college courses. Journal of Educational Measurement, 31, 37-50.
  8. Conger A. J., Lipshitz R. (1973). Measures of reliability for profiles and test batteries. Psychometrika, 38, 411-427.
  9. Cronbach L. J., Gleser G. (1953). Assessing similarity between profiles. Psychological Bulletin, 50, 456-473.
  10. Detterman D. K., Daniel M. H. (1989). Correlations of mental tests with each other and with cognitive variables are highest for low-IQ groups. Intelligence, 13, 349-359.
  11. Feinberg R. A., Jurich D. P. (2017). Guidelines for interpreting subscores. Educational Measurement: Issues and Practice, 36, 5-13.
  12. Feinberg R. A., Wainer H. (2014). A simple equation to predict a subscore’s value. Educational Measurement: Issues and Practice, 33, 55-56.
  13. Feldt L. S., Steffen M., Gupta N. C. (1985). A comparison of five methods for estimating the standard error of measurement at specific score levels. Applied Psychological Measurement, 9, 351-361.
  14. Fleishman E. A. (1957). A comparative study of aptitude patterns in unskilled and skilled psychomotor performances. Journal of Applied Psychology, 41, 263-272.
  15. Haberman S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204-229.
  16. Haberman S. J., Sinharay S. (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 209. doi:10.1007/s11336-010-9158-4
  17. Haladyna T. M., Kramer G. A. (2004). The validity of subscores for a credentialing examination. Evaluation & the Health Professions, 27, 349-368.
  18. Holtzman K. Z., Swanson D. B., Ouyang W., Dillon G. F., Boulet J. R. (2014). International variation in performance by clinical discipline and task on the USMLE Step 2 clinical knowledge component. Academic Medicine, 89, 1558-1562.
  19. Huff K., Goodman D. P. (2007). The demand for cognitive diagnostic assessment. In Leighton J. P., Gierl M. J. (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 19-60). Cambridge, England: Cambridge University Press.
  20. Jiang Z. (2018). Using the linear mixed-effect model framework to estimate generalizability variance components in R. Methodology, 14, 133-142.
  21. Jiang Z., Raymond M. R. (2018). The use of multivariate generalizability theory to evaluate the quality of subscores. Applied Psychological Measurement, 42, 595-612.
  22. Klauer K. C., Rettig K. (1990). An approximately standardized person test for assessing consistency with a latent trait model. British Journal of Mathematical and Statistical Psychology, 43, 193-206.
  23. Livingston S. A. (2015). A note on subscores. Educational Measurement: Issues and Practice, 34, 5.
  24. Lord F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 15, 325-336.
  25. Plake B. S., Reynolds C. R., Gutkin T. B. (1981). A technique for the comparison of profile variability between independent groups. Journal of Clinical Psychology, 37, 142-146.
  26. Puhan G., Sinharay S., Haberman S. J., Larkin K. (2010). The utility of augmented subscores in a licensure exam: An evaluation of methods using empirical data. Applied Measurement in Education, 23, 266-285.
  27. Raju N. S., Price L. R., Oshima T. C., Nering M. L. (2007). Standardized conditional SEM: A case for conditional reliability. Applied Psychological Measurement, 31, 169-180.
  28. Raymond M. R., Swygert K. A., Kahraman N. (2012). Psychometric equivalence of ratings for repeat examinees on a performance assessment for physician licensure. Journal of Educational Measurement, 49, 339-361.
  29. Reckase M. D., Xu J. R. (2015). The evidence for a subscore structure in a test of English language competence for English language learners. Educational and Psychological Measurement, 75, 805-825.
  30. Reardon S. F., Kalogrides D., Fahle E. M., Podolsky A., Zarate R. C. (2018). The relationship between test item format and gender achievement gaps on math and ELA tests in fourth and eighth grades. Educational Researcher, 47, 284-294.
  31. Rodriguez M. C. (2005). Three options are optimal for multiple-choice test items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24, 3-13.
  32. Sinharay S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150-174.
  33. Sinharay S. (2013). A note on assessing the added value of subscores. Educational Measurement: Issues and Practice, 32, 38-42.
  34. Sinharay S., Haberman S. J. (2014). An empirical investigation of population invariance in the value of subscores. International Journal of Testing, 14, 22-48.
  35. Sinharay S., Haberman S. J., Puhan G. (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26, 21-28.
  36. Stone C. A., Ye F., Zhu X., Lane S. (2010). Providing subscale scores for diagnostic information: A case study for when the test is essentially unidimensional. Applied Measurement in Education, 23, 63-86.
  37. Thissen D., Wainer H., Wang X. B. (1994). Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement, 31, 113-123.
  38. van der Maas H. L. J., Molenaar D., Maris G., Kievit R. A., Borsboom D. (2011). Cognitive psychology meets psychometric theory: On the relation between process models for decision making and latent variable models for individual differences. Psychological Review, 118, 339-356.
