Educational and Psychological Measurement. 2019 Jun 14;80(1):91–125. doi: 10.1177/0013164419854208

Simple-Structure Multidimensional Item Response Theory Equating for Multidimensional Tests

Stella Y. Kim, Won-Chan Lee, Michael J. Kolen
PMCID: PMC6943987  PMID: 31933494

Abstract

A theoretical and conceptual framework for true-score equating using a simple-structure multidimensional item response theory (SS-MIRT) model is developed. A true-score equating method, referred to as the SS-MIRT true-score equating (SMT) procedure, also is developed. SS-MIRT has several advantages over other complex multidimensional item response theory models including improved efficiency in estimation and straightforward interpretability. The performance of the SMT procedure was examined and evaluated through four studies using different data types. In these studies, results from the SMT procedure were compared with results from four other equating methods to assess the relative benefits of SMT compared with the other procedures. In general, SMT showed more accurate equating results compared with the traditional unidimensional IRT (UIRT) equating when the data were multidimensional. More accurate performance of SMT over UIRT true-score equating was consistently observed across the studies, which supports the benefits of a multidimensional approach in equating for multidimensional data. Also, SMT performed similarly to a SS-MIRT observed score method across all studies.

Keywords: equating, simple structure, multidimensional item response theory, IRT true-score equating


Many large-scale testing programs provide multiple testing opportunities for examinees, which by necessity lead to the use of alternate forms of the test in order to control item exposure and maintain test security. One critical issue in administering multiple test forms is establishing score comparability so that the scores from different test forms can be used interchangeably. To achieve this goal, a statistical adjustment referred to as equating is performed (Kolen & Brennan, 2014).

One of the most widely used equating procedures is unidimensional item response theory (UIRT) equating that requires a set of assumptions about the data structure. In particular, UIRT rests on a unidimensionality assumption, which requires that a test measures only a single ability. However, this assumption is not likely to be fulfilled for many real data situations such as mixed-format tests. Previous studies suggest that the use of multiple item formats can introduce a potential source of multidimensionality (Bridgeman, 1992; Kennedy & Walstad, 1997; Zhang, Kolen, & Lee, 2014). It is possible that different item formats such as multiple-choice (MC) and free-response (FR) items measure abilities that are, albeit quite similar, not exactly the same. Multidimensionality also might be present if a test is composed of several content subdomains.

Multidimensional item response theory (MIRT) has been developed in recognition of the complexity of psychological and educational processes (McKinley & Reckase, 1982; Mulaik, 1972; Reckase, 1972; 2009; Sympson, 1978; Whitely, 1980). A complex conceptualization of psychological/educational constructs demands a more sophisticated measurement model that can capture the relevant skills and knowledge required for correctly answering items. The MIRT models provide a mathematical expression that relates the location of examinees specified by multiple abilities to the probability of getting a particular score on an item (Reckase, 2009).

In the context of equating, when the assumptions of the UIRT models are not met by the data, the accuracy of the resulting equating relationships under the UIRT framework is threatened. Obviously, the use of an equating procedure that leads to unacceptably inaccurate results should be avoided. As an alternative approach, there has been a growing demand for equating procedures that are based on the MIRT framework. The MIRT equating procedures that have appeared in the literature so far include (a) full MIRT observed-score equating (Brossman & Lee, 2013), (b) true-score unidimensional approximation of MIRT equating (Brossman & Lee, 2013), (c) observed-score unidimensional approximation of MIRT equating (Brossman & Lee, 2013), (d) bi-factor MIRT observed-score equating (G. Lee & Lee, 2016), (e) bi-factor MIRT true-score equating (G. Lee et al., 2015), (f) testlet response model MIRT observed-score equating (Tao & Cao, 2016), (g) testlet response model MIRT true-score equating (Tao & Cao, 2016), and (h) simple-structure MIRT observed-score equating (W. Lee & Brossman, 2012).

A new true-score equating procedure based on the simple-structure MIRT (SS-MIRT) model is proposed and examined in the current study. As will be discussed in more detail later, this new equating procedure is different from, and cannot be viewed as a special case of, any of the aforementioned MIRT equating methods. In particular, arbitrary reduction of dimensions in conducting MIRT true-score equating, as was done with the unidimensional approximation of MIRT true-score equating (Brossman & Lee, 2013) and bi-factor true-score equating (G. Lee et al., 2015), is bypassed in the proposed method.

Perfect simple-structure or independent-cluster structure, which is referred to as "simple-structure" in this article, is a term used in factor analysis originally introduced by Thurstone (1935, 1947). It refers to situations in which each item loads on only one factor and no cross-loadings exist on the other factors (McDonald, 2000; Sass & Schmitt, 2010). In other words, factor loadings specified to be zero are forced to be zero (Marsh, Lüdtke, Nagengast, Morin, & von Davier, 2013). Because of its confirmatory nature, imposing simple-structure on data should be based on a predefined factor structure supported by strong evidence, such as a test blueprint. Initially, the concept of simple-structure was introduced to minimize the number of factors needed to explain each variable by performing a rotation. Therefore, if the simple-structure model fits the data at hand reasonably well, then the additional complexities associated with MIRT models can be lessened, which is a potential implication of this study. One example of the simple-structure data structure is provided in Figure 1.

Figure 1. Simple-structure multidimensional item response theory (MIRT) model with two abilities.

Compared with other MIRT approaches, SS-MIRT equating has several compelling features (W. Lee & Brossman, 2012). One promising characteristic of this approach is its calibration efficiency. The calibration process under the SS-MIRT framework is relatively simple and faster than under other MIRT models because each item loads on only one ability, whereas most of the other MIRT models relate multiple abilities to a single item. Also, it allows for straightforward interpretation of the data structure such as the relationships among dimensions and the weights for each dimension. In spite of such useful features, however, only a limited number of studies have dealt with the SS-MIRT approach in the context of equating (W. Lee & Brossman, 2012). In particular, a true-score equating procedure under the SS-MIRT framework has not been developed in the literature. One of the obstacles to developing MIRT true-score equating is that multiple combinations of abilities can be defined for a particular true score. The current study addresses the complex issues associated with MIRT true-score equating and develops a MIRT true-score equating method. In addition, four studies were conducted to evaluate the new MIRT true-score equating method and compare equating results for the new method to those from other equating methods using four different data types.

SS-MIRT True-Score Equating Procedure

The goal of equating is to find the relationship between two forms so that scores from the two forms can be used interchangeably. In traditional UIRT true-score equating, it is assumed that the true score on the new form is equivalent to the true score on the old form for a given ability θ, as long as the item parameters are on the same scale. In such a univariate case, the test characteristic curve for each of the two forms is used to relate IRT ability to true score. The true-score relationship is applied to observed scores to define the final equating relationship.
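As a rough illustration of this univariate step, the following base R sketch inverts a 3PL test characteristic curve with uniroot(); the item parameter objects and the D = 1.7 scaling constant are illustrative assumptions, not taken from the study's software.

```r
# A minimal sketch of UIRT true-score equating for one section under the 3PL model,
# assuming 'new_par' and 'old_par' are data frames of item parameters (columns a, b, c)
# already on a common scale.
tcc <- function(theta, par, D = 1.7) {
  # test characteristic curve: expected number-correct (true) score at theta
  sum(par$c + (1 - par$c) / (1 + exp(-D * par$a * (theta - par$b))))
}

ut_equate <- function(tau_new, new_par, old_par) {
  # invert the new-form TCC to find theta for the target true score, then map that
  # theta through the old-form TCC; only defined for true scores above the sum of
  # the c parameters (see the later discussion of UT)
  theta <- uniroot(function(t) tcc(t, new_par) - tau_new,
                   lower = -10, upper = 10)$root
  tcc(theta, old_par)
}
```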

The SS-MIRT model allows for multiple abilities and, consequently, an examinee's expected number-correct score is conditional on a θ-vector. A test characteristic surface represents the relationship between the expected number-correct scores and the θ-vector. The challenge of conducting true-score equating using multiple ability dimensions comes from the fact that there is no unique combination of abilities that corresponds to a particular true score. An example of an item characteristic surface is given for a bivariate case in Figure 2. Assuming that two abilities (θs) are identified for a test, the dashed curve represents ability combinations for which the probability of a correct response equals .2, implying that an infinite number of θ combinations leading to the .2 probability lie along this curve. To deal with such complexities, a procedure is proposed in this article that does not involve direct use of the multidimensional space. Instead, the multidimensional space is collapsed into a set of unidimensional spaces, in which each composite score is a linear combination of all possible univariate components corresponding to the same true score.

Figure 2. Example of an item characteristic surface.

Suppose there are m dimensions identified for a test. Let $X_i$ denote the observed number-correct score for section $i$ $(i = 1, \ldots, m)$. Furthermore, let $w_i$ be a prespecified weight for section $i$. In practice, either integer or noninteger section weights are predetermined to construct composite scores such that the contribution of each section to the composite score is aligned with the test specifications. Equating is performed on the weighted composite score rounded to the nearest integer value, denoted as $X = \operatorname{int}\left(\sum_{i=1}^{m} w_i X_i\right)$. Note that scores are often rounded to integers for reporting purposes (Kolen & Brennan, 2014). Also, let $eq(x_i)$ represent the typical UIRT true-score equating equivalent for new form score $x_i$ on section $i$.

As an example illustrating the SMT (simple-structure multidimensional item response theory true-score) equating method, consider the following scenario. There are two sections, and each section has a score range of 0 to 2. For the sake of simplicity, section weights of 1:1 (i.e., $w_i = 1.0$) are used for this example. First, items are calibrated using the SS-MIRT model. Second, traditional UIRT true-score equating is performed for each of the two sections separately. The equating results obtained from this step are shown in the first four columns of Table 1. Note that all possible combinations of the section scores are listed in the first and third columns (i.e., (0, 0), (0, 1), . . ., (2, 1), (2, 2)), which lead to different composite scores, x.

Table 1.

Example of SMT Equating.

| Section 1 |  | Section 2 |  | Composite scores |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $x_1$ | $eq(x_1)$ | $x_2$ | $eq(x_2)$ | $x$ | $eq(x \mid x_1, x_2)$ | $f(x_1, x_2)$ | $f(x_1, x_2 \mid x)$ |
| 0 | 0.5 | 0 | 0.3 | 0 | 0.5 + 0.3 = 0.8 | .05 | 1 |
| 0 | 0.5 | 1 | 1.4 | 1 | 0.5 + 1.4 = 1.9 | .10 | .4 |
| 0 | 0.5 | 2 | 2.2 | 2 | 0.5 + 2.2 = 2.7 | .10 | .25 |
| 1 | 1.2 | 0 | 0.3 | 1 | 1.2 + 0.3 = 1.5 | .15 | .6 |
| 1 | 1.2 | 1 | 1.4 | 2 | 1.2 + 1.4 = 2.6 | .15 | .375 |
| 1 | 1.2 | 2 | 2.2 | 3 | 1.2 + 2.2 = 3.4 | .20 | .8 |
| 2 | 2.6 | 0 | 0.3 | 2 | 2.6 + 0.3 = 2.9 | .15 | .375 |
| 2 | 2.6 | 1 | 1.4 | 3 | 2.6 + 1.4 = 4.0 | .05 | .2 |
| 2 | 2.6 | 2 | 2.2 | 4 | 2.6 + 2.2 = 4.8 | .05 | 1 |

Note. SMT = simple-structure multidimensional item response theory true-score.

Third, the equated composite score, denoted as eq(x|x1,x2)=w1eq(x1)+w2eq(x2), is found for each combination of section scores as presented in the sixth column in Table 1—note that because w1=w2=1.0 in this example, the equated composite score is a simple sum of the two equated section scores. Fourth, a multivariate (e.g., bivariate in this example) observed-score distribution is found for the two section scores, which is given in the seventh column (i.e., f(x1,x2)). A model-based multivariate observed-score distribution can be computed based on the SS-MIRT model, which will be discussed in the following section. The next step is to find the frequency weight for each combination of section scores conditioning on a composite score (i.e., f(x1,x2|x)), which is shown in the last column. In this example, three possible combinations of section scores exist that lead to a composite score of 2: (0, 2), (1, 1), and (2, 0). The frequency weights corresponding to these combinations are .25, .375, and .375. The final equated score is defined as a weighted sum of equated composite scores. In this example, the final equated score for x of 2 is (.25×2.7)+(.375×2.6)+(.375×2.9)=2.7375. Note that the sum of the frequency weights should be 1 for each composite score point.
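The combination step in this example can be reproduced with a short base R sketch. The section equivalents and bivariate frequencies below are copied from Table 1; the object names are illustrative.

```r
# Reproduce the Table 1 example: combine section-level UIRT equivalents into
# final SMT equated scores using the bivariate frequency weights.
x1 <- 0:2; eq1 <- c(0.5, 1.2, 2.6)     # Section 1 scores and their equivalents
x2 <- 0:2; eq2 <- c(0.3, 1.4, 2.2)     # Section 2 scores and their equivalents
f12 <- matrix(c(.05, .10, .10,
                .15, .15, .20,
                .15, .05, .05), nrow = 3, byrow = TRUE)   # f(x1, x2); rows index x1

w    <- c(1, 1)                                            # section weights
grid <- expand.grid(i = 1:3, j = 1:3)
comp <- round(w[1] * x1[grid$i] + w[2] * x2[grid$j])       # composite score x
eqc  <- w[1] * eq1[grid$i] + w[2] * eq2[grid$j]            # eq(x | x1, x2)
freq <- f12[cbind(grid$i, grid$j)]                         # f(x1, x2)

# Equation (2): apply conditional frequency weights within each composite score
eq_final <- tapply(freq * eqc, comp, sum) / tapply(freq, comp, sum)
round(eq_final, 4)    # composite score 2 gives 2.7375, as in the text
```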

Taken together, the proposed SMT equating procedure can be summarized as follows.

  1. Calibrate items on each form using a SS-MIRT model.

  2. Conduct standard UIRT true-score equating for each of m sections separately using the item parameter estimates obtained from step (1).

  3. Compute the equated composite score for each combination of section scores, which can be expressed as

$eq(x \mid x_1, \ldots, x_m) = \sum_{i=1}^{m} w_i\, eq(x_i)$. (1)

  4. Estimate a multivariate observed-score distribution for the m section scores, $f(x_1, \ldots, x_m)$, for the new form, which will be described in the following section.

  5. Determine the frequency weight for each combination of section scores conditioning on a composite score, denoted as $f(x_1, \ldots, x_m \mid x)$.

  6. Compute the final equated score, which is a weighted sum of equated composite scores, defined by the following equation:

$eq(x) = \sum_{x_1, \ldots, x_m:\ \operatorname{int}\left(\sum_{i} w_i x_i\right) = x} f(x_1, \ldots, x_m \mid x) \left(\sum_{i=1}^{m} w_i\, eq(x_i)\right)$, (2)

where the first summation is taken over all possible combinations of the m section scores that lead to a particular composite score x.

The rationale for separate univariate equating for each section in SMT is that under the SS-MIRT framework, only one ability is needed to predict the probability of a correct response on an item. Therefore, the equating relationship between two forms can be determined according to a subset of the test that measures the same construct. The separate relationships then can be combined to identify the final equating relationship between new and old forms in terms of the composite score.

The proposed SMT equating procedure is based on the following assumptions: (a) each item measures only a single ability, (b) the underlying multiple abilities are allowed to be correlated with each other, (c) a cluster of items measuring the same ability can be modeled accurately using a UIRT model, (d) each corresponding section in the two forms measures the same construct, and (e) correlations between abilities are captured and reflected in the equating process by applying frequency weights in computing the final equated scores. Because there are multiple combinations of section scores that will produce the same composite score, the goal here is to define a single equated composite score that is most representative of all the combinations. To achieve this goal, the frequency weight of each combination is used, which also reflects the correlations among the multiple section scores.

Estimating Joint Multivariate Score Distributions

Conducting SMT equating requires the use of a multivariate observed-score frequency distribution for the m section scores, $f(x_1, \ldots, x_m)$. The proposed approach is to estimate a model-fitted multivariate distribution using the SS-MIRT model. Based on the assumption of conditional independence with respect to all ability dimensions, the probability of a correct response to an item for a particular examinee is mutually independent of the probabilities for other items, conditioning on the examinee's abilities $\theta_1, \theta_2, \ldots, \theta_m$. Thus, given item parameters, a conditional multivariate score distribution for each combination of section abilities can be determined as the product of the section-level conditional observed-score distributions:

$f(x_1, x_2, \ldots, x_m \mid \theta_1, \theta_2, \ldots, \theta_m) = f(x_1 \mid \theta_1)\, f(x_2 \mid \theta_2) \cdots f(x_m \mid \theta_m)$, (3)

where $f(x_i \mid \theta_i)$ represents the conditional raw-score distribution for section $i$, which can be computed using the Lord and Wingersky (1984) formula for MC items or the modified version of the Lord–Wingersky formula proposed by Hanson (1994) for items with multiple score categories. Note in Equation (3) that the conditional distribution for each section involves only a single ability under the SS-MIRT model.
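For dichotomous items, the Lord–Wingersky recursion can be sketched in a few lines of base R; the function below takes a vector of correct-response probabilities at a fixed θ (an illustrative input) and returns the conditional raw-score distribution.

```r
# Lord-Wingersky (1984) recursion for dichotomous items: build f(x | theta) one item
# at a time from the items' correct-response probabilities at a fixed theta.
lw_recursion <- function(p) {
  f <- 1                                   # score distribution after zero items: P(X = 0) = 1
  for (pj in p) {
    n <- length(f)
    f_new <- numeric(n + 1)
    f_new[1:n]       <- f_new[1:n] + f * (1 - pj)    # item answered incorrectly
    f_new[2:(n + 1)] <- f_new[2:(n + 1)] + f * pj    # item answered correctly
    f <- f_new
  }
  f                                        # probabilities for scores 0, 1, ..., length(p)
}

round(lw_recursion(c(.7, .5, .9)), 4)      # example with three items at some theta
```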

Finally, the multivariate score distribution can be obtained by aggregating the conditional multivariate score distributions over the multivariate latent ability distribution, $g(\theta_1, \ldots, \theta_m)$:

$f(x_1, \ldots, x_m) = \int_{\theta_1} \cdots \int_{\theta_m} f(x_1 \mid \theta_1) \cdots f(x_m \mid \theta_m)\, g(\theta_1, \ldots, \theta_m)\, d\theta_1 \cdots d\theta_m$. (4)

The multivariate distribution can be approximated by replacing integrals in Equation (4) with summations as:

$f(x_1, \ldots, x_m) = \sum_{\theta_1} \cdots \sum_{\theta_m} f(x_1 \mid \theta_1) \cdots f(x_m \mid \theta_m)\, Q(\theta_1, \ldots, \theta_m)$, (5)

where $Q(\theta_1, \ldots, \theta_m)$ is the discrete multivariate quadrature ability distribution. Once the multivariate distribution is defined, the final step is to find the frequency weights for section scores leading to each composite score, which is given by

$f(x_1, \ldots, x_m \mid x) = \dfrac{f(x_1, \ldots, x_m)}{\sum_{x_1, \ldots, x_m:\ \operatorname{int}\left(\sum_{i} w_i x_i\right) = x} f(x_1, \ldots, x_m)}$, (6)

where the summation in the denominator is taken over all possible combinations of section scores that give the same composite score.
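A rough base R sketch of Equations (5) and (6) for a two-section test follows; lw_recursion() is the function sketched earlier, and prob_correct(section, theta) is a hypothetical helper returning that section's item probabilities at a given ability. The quadrature grid and correlation are illustrative.

```r
# Approximate Equation (5): accumulate f(x1, x2) over a discrete bivariate normal grid.
quad <- seq(-6, 6, length.out = 41)                  # quadrature points per dimension
grid <- expand.grid(t1 = quad, t2 = quad)
rho  <- 0.8                                          # assumed ability correlation
dens <- exp(-(grid$t1^2 - 2 * rho * grid$t1 * grid$t2 + grid$t2^2) / (2 * (1 - rho^2)))
Q    <- dens / sum(dens)                             # normalized quadrature weights

f_joint <- 0
for (k in seq_len(nrow(grid))) {
  f1 <- lw_recursion(prob_correct(1, grid$t1[k]))    # f(x1 | theta1), hypothetical helper
  f2 <- lw_recursion(prob_correct(2, grid$t2[k]))    # f(x2 | theta2)
  f_joint <- f_joint + outer(f1, f2) * Q[k]          # f(x1, x2) accumulates over the grid
}

# Equation (6): conditional weights f(x1, x2 | x) within each composite score (w1 = w2 = 1)
comp   <- outer(0:(nrow(f_joint) - 1), 0:(ncol(f_joint) - 1), "+")
f_cond <- f_joint / ave(as.vector(f_joint), as.vector(comp), FUN = sum)
```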

Method

Study 1: Real Data Analysis

Data Description

Two forms of the Advanced Placement (AP) English Language test were used. For illustrative purposes, some aspects of the data were arbitrarily manipulated: (a) summed scores over the MC and FR items were used to reduce the computational complexity; (b) the random groups (RG) design was used although the data were originally collected under the common-item nonequivalent groups (CINEG) design; (c) 6,000 examinees were sampled from the original data for each form; and (d) for research purposes only, normalized scale scores ranging from 0 to 70 were created for the raw-to-scale score conversion. Such modifications imply that the primary aim of this article was to explore the performance of the equating methods, not to examine the psychometric properties of the AP exams used.

In this study, the old form contains 54 MC items scored 0-1 and 3 FR items scored 0-9, and the new form consists of 55 MC items scored 0-1 and 3 FR items scored 0-9. The maximum composite summed raw scores were equal to 81 and 82 for the old and new forms, respectively. Descriptive statistics for the data are presented in the first two columns in Table 2. The mean score for the old form is higher than that for the new form, suggesting that the old form is easier than the new form (under the assumption that the groups taking the two forms are equivalent).

Table 2.

Descriptive Statistics for Data Used in Studies 1, 3, and 4.

| | Study 1 New Form | Study 1 Old Form | Study 3 New Form | Study 3 Old Form | Study 4 New/Old Form |
| --- | --- | --- | --- | --- | --- |
| Raw score scale | 0-82 | 0-81 | 0-37 | 0-37 | 0-64 |
| No. of MC items (no. per content area) | 55 | 54 | 37 (15, 15, 7) | 37 (15, 15, 7) | 64 (25, 26, 13) |
| No. of FR items (maximum score) | 3 (9, 9, 9) | 3 (9, 9, 9) | | | |
| Max no. of CI (no. per content area) | | | 12 (5, 5, 2) | 12 (5, 5, 2) | 19 (7, 8, 4) |
| Mean | 47.090 | 49.025 | 24.251 | 24.370 | 42.584 |
| SD | 12.502 | 12.448 | 5.350 | 5.335 | 8.608 |
| CI mean | | | 7.886 | | 12.186 |
| CI SD | | | 2.186 | | 3.134 |
| N | 6,000 | 6,000 | 3,000 | 3,000 | |

Note. MC = multiple choice; FR = free response; CI = common item; SD = standard deviation.

Studies using real data are limited in that the "true" equating relationship is not available. Without a criterion, it is difficult to evaluate which equating method performs best. Therefore, the primary goal of Study 1 is to examine how (dis)similarly the proposed equating method behaves relative to the existing methods. A substantial discrepancy between the SMT method and the existing methods might imply a serious limitation of the SMT method. The methods used in this article for comparison purposes are described later. In Study 1, multidimensionality was assumed to arise from the use of multiple item formats.

To quantify such discrepancies, “Difference That Matters” (DTM) was used. This index, proposed by Dorans, Holland, Thayer, and Tateneni (2003), serves as a benchmark that reflects practical significance. In a practical setting, what actually matters to the examinees is a reported rounded score. Thus, a difference between equating relationships larger than a half point on the reporting scale (i.e., ±.5 in this study) will result in a change in the final reporting score. When the true equating relationship is unknown, the DTM criterion provides an alternative means to compare equating methods.

Study 2: Simulated Data Analysis

Data Preparation

This simulation study was intended to evaluate the performance of SMT under various study conditions. The estimated item parameters used in Study 1 served here as generating item parameters. Data were generated using the three-parameter logistic (3PL; Birnbaum, 1968) and the graded response (GR: Samejima, 1969) models for MC and FR items, respectively.

Simulation Factors

Three levels of correlation between MC and FR section abilities ($\rho_{\theta_{MC}\theta_{FR}}$) were considered: .5, .8, and .95. A previous study by W. Lee and Brossman (2012) indicates that, based on a DTM criterion, equating results obtained from unidimensional or classical equating procedures would be reasonably acceptable with a disattenuated correlation of .8 or above when the group difference is minimal. Thus, a correlation of .8 was included in this study as a benchmark at which MIRT models are expected to perform better than UIRT models. The impact of substantial multidimensionality was examined with a correlation of .5. When the correlation is .95 or above, the data can be regarded as approximately unidimensional.

Two levels of sample size were explored: 1,000 and 5,000. Previous research indicates that having a sample size of 5,000 or more leads to reasonably accurate equating results (Hanson & Béguin, 2002).

Simulation Procedures

The SS-MIRT model was used as the generating model to produce response data, assuming that a test is designed to assess two different, but somewhat related, abilities. In this study, both forms are assumed to be given to randomly equivalent groups of examinees. The specific simulation process is described below, followed by a code sketch of the generation steps:

  1. Calibrate MC and FR items in the new and old forms using 3PL and GR models, respectively. The resulting estimated item parameters are used as generating item parameters.

  2. Randomly draw pairs of theta values ($\theta_{MC}$, $\theta_{FR}$) for a group of examinees from a given bivariate normal distribution, $BN(0, 0, 1, 1, \rho_{\theta_{MC}\theta_{FR}})$.

  3. Generate item responses for each examinee for each MC item using MC item parameters for the new form and the true MC theta value ($\theta_{MC}$).

  4. Generate item responses for each examinee for each FR item using FR item parameters for the new form and the true FR theta value ($\theta_{FR}$).

  5. Repeat steps (2) through (4) using the old form item parameters.

  6. Conduct equating using the simulated data to find the estimated equating relationships for each equating procedure.

  7. Repeat steps (2) through (6) 100 times.
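A rough sketch of steps 2 and 3 is given below under stated assumptions: hypothetical 3PL item parameters, the common D = 1.7 convention, and omission of the GR step for FR items. The actual generating parameters came from the Study 1 calibration.

```r
# Steps 2-3 in miniature: correlated abilities and 3PL response generation for MC items.
set.seed(1)
n   <- 1000
rho <- .8                                             # one of the studied correlations
z1 <- rnorm(n); z2 <- rnorm(n)
theta_mc <- z1                                        # step 2: bivariate normal abilities
theta_fr <- rho * z1 + sqrt(1 - rho^2) * z2

a <- runif(55, 0.5, 2.0); b <- rnorm(55); g <- runif(55, .10, .25)  # hypothetical 3PL parameters
p <- sapply(seq_along(a), function(j)
  g[j] + (1 - g[j]) / (1 + exp(-1.7 * a[j] * (theta_mc - b[j]))))   # n x 55 probability matrix
resp_mc <- (matrix(runif(n * length(a)), n) < p) * 1                # step 3: 0/1 responses
```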

The number of replications of 100 has been used in many equating studies (W. Lee & Brossman, 2012; W. Lee, He, Hagge, Wang, & Kolen, 2012). In order to check whether 100 replications are enough to produce stable evaluation criteria statistics (presented later), one simulation condition was selected, and results were obtained using 1,000 replications. The comparison suggested that the evaluation criteria statistics did not change greatly with the increased number of replications.

Criterion Equating Relationships

The criterion equating relationships were established based on large-sample single-group equipercentile equating. This single-group equating criterion, suggested by Kim and Lee (2016), has several appealing features relative to other possible criteria, such as computational efficiency and intuitive interpretation. Both forms are assumed to be given to the same large group of examinees. The specific steps are as follows (a sketch of the equipercentile step is given after the list):

  1. Draw a large sample ($N$ = 1,000,000) from the bivariate normal distribution for each level of correlation, ($\theta_{MC}$, $\theta_{FR}$) ~ $BN(0, 0, 1, 1, \rho_{\theta_{MC}\theta_{FR}})$.

  2. Generate item responses for each examinee for both the old and new forms. As a result, each examinee has scores on both forms.

  3. Find equating relationships using traditional equipercentile equating.

  4. Repeat the above steps for each of the three levels of correlation considered.
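A bare-bones sketch of the equipercentile step (step 3), without smoothing, is shown below; the score vectors, score maxima, and function name are illustrative.

```r
# Single-group equipercentile equating on integer scores (no presmoothing).
# x_new and x_old are the same examinees' scores on the new and old forms.
equipercentile <- function(x_new, x_old, max_new, max_old) {
  # percentile rank (as a proportion) of each new-form integer score, midpoint convention
  p_new <- sapply(0:max_new, function(s) mean(x_new < s) + 0.5 * mean(x_new == s))
  # cumulative distribution of old-form integer scores
  F_old <- sapply(0:max_old, function(s) mean(x_old <= s))
  # invert the old-form percentile rank function at each p_new
  sapply(p_new, function(p) {
    if (p <= 0) return(-0.5)                       # guard for a zero percentile rank
    s_u   <- which(F_old >= p)[1] - 1              # smallest old-form score with F >= p
    F_low <- if (s_u == 0) 0 else F_old[s_u]       # F at the score just below s_u
    s_u - 0.5 + (p - F_low) / (F_old[s_u + 1] - F_low)
  })
}
```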

Evaluation Criteria

The estimated equating relationships for each equating procedure based on 100 replications were compared to the criterion equating relationships. To quantify the performance of each equating method, signed bias, standard error (SE), and root mean squared error (RMSE) were computed at each score x, which can be expressed as:

$\mathrm{bias}(x) = \left(\frac{1}{R}\sum_{r=1}^{R}\hat{e}_{xr}\right) - e_{x}$, (7)
$SE(x) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left[\hat{e}_{xr} - \left(\frac{1}{R}\sum_{r=1}^{R}\hat{e}_{xr}\right)\right]^{2}}$, (8)
$RMSE(x) = \sqrt{\mathrm{bias}(x)^{2} + SE(x)^{2}}$, (9)

where $e_x$ is the criterion equated score at score $x$, $\hat{e}_{xr}$ is an estimated equated score at score $x$ on replication $r$, and $R$ (= 100) corresponds to the number of replications. In addition to the conditional statistics, overall statistics were computed to explore the precision of each equating procedure aggregated over score points. The three overall statistics, average root mean squared bias (AB), average SE (ASE), and average RMSE (ARMSE), were computed using the following equations:

$AB = \sqrt{\sum_{x} w(x)\left[\left(\frac{1}{R}\sum_{r=1}^{R}\hat{e}_{xr}\right) - e_{x}\right]^{2}}$, (10)
$ASE = \sqrt{\sum_{x} w(x)\,\frac{1}{R}\sum_{r=1}^{R}\left[\hat{e}_{xr} - \left(\frac{1}{R}\sum_{r=1}^{R}\hat{e}_{xr}\right)\right]^{2}}$, (11)
$ARMSE = \sqrt{AB^{2} + ASE^{2}}$, (12)

where w(x) represents a relative frequency of score point x on the new form based on the population distribution with 1,000,000 examinees.
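A compact base R sketch of Equations (7) through (12) is given below; e_hat (a replications-by-scores matrix of estimated equivalents), e_crit, and w_x are illustrative inputs.

```r
# Conditional and overall equating error statistics (Equations 7-12).
eval_equating <- function(e_hat, e_crit, w_x) {
  e_bar <- colMeans(e_hat)                           # mean estimated equivalent at each x
  bias  <- e_bar - e_crit                            # Equation (7)
  se    <- sqrt(colMeans(sweep(e_hat, 2, e_bar)^2))  # Equation (8)
  rmse  <- sqrt(bias^2 + se^2)                       # Equation (9)
  AB    <- sqrt(sum(w_x * bias^2))                   # Equation (10)
  ASE   <- sqrt(sum(w_x * se^2))                     # Equation (11)
  ARMSE <- sqrt(AB^2 + ASE^2)                        # Equation (12)
  list(bias = bias, SE = se, RMSE = rmse, AB = AB, ASE = ASE, ARMSE = ARMSE)
}
```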

Study 3: Pseudo Forms Data Analysis

In Study 3, a single form was divided into two half-length test forms to create a pair of pseudo forms with examinees' scores on both forms. The advantage of such data manipulation is that it allows a criterion equating relationship to be established using the single-group design (without assuming that an IRT model holds), because both pseudo forms contain scores for each examinee. Unlike the previous two studies, this study treats content specifications as the source of multidimensionality. With a set of common items (CIs), the proposed equating method was evaluated under the CINEG design.

Intact Form Information

A single form of the AP Spanish Literature and Culture test was used for Studies 3 and 4. As with the AP English exam, this test is also a mixed-format test containing both MC and FR items. However, only the MC items were considered, because the primary focus of Studies 3 and 4 is on multidimensionality resulting from content areas, not from item formats. As a result, the modified data had 64 MC items scored 0 or 1. The Spanish Literature test is designed to measure three distinct content areas: (1) comprehension (Section 1), (2) interpretation (Section 2), and (3) recognition of literary technique (Section 3). Descriptive statistics for the data are provided in the middle of Table 2.

Pseudo-Test Forms

Pseudo test forms were created by splitting the intact test form into two similar halves to make the two short forms as similar as possible in terms of content areas and p-values. In addition to items that are unique to each form, 12 CIs were identified that were proportionally representative of the pseudo forms with regard to content area and p-value.

Regarding the common item (CI) length, one conventional rule of thumb is that the CI set should be at least 20% of the total test length (Kolen & Brennan, 2014). To explore the impact of CI proportion (CIp) on equating results, a 10% condition (5 MC) and a 30% condition (12 MC) were considered. For computational simplicity, section weights of 1:1:1 were used for the three sections, which led to a composite score range of 0 to 37 for both the new and old forms.

Group difference has been known to have the potential to impact equating results under the CINEG design (W. Lee et al., 2012; Powers et al., 2011). One way to quantify the degree of group difference is to use the standardized mean difference (Dorans, 2000), or effect size (ES), computed based on CI scores, which is given by

$ES = \dfrac{\bar{c}_N - \bar{c}_O}{\sqrt{\dfrac{(n_N - 1)\hat{\sigma}_N^2 + (n_O - 1)\hat{\sigma}_O^2}{n_N + n_O}}}$, (13)

where $\bar{c}_O$ indicates the mean of CI scores on the old form, $\hat{\sigma}_O^2$ is the variance of CI scores on the old form, and $n_O$ is the number of examinees taking the old form. The same notation was applied to the new form with subscript $N$. Two levels of group difference were considered in this study: .1 and .3. W. Lee et al. (2012) found that the results for all studied equating methods were acceptable when $ES = .05$ or $ES = .1$, implying that $ES = .1$ is small enough to produce satisfactory equating results for most equating methods. By contrast, Kolen and Brennan (2014) noted that $ES = .3$ or more could lead to large differences across equating methods.
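A minimal sketch of Equation (13) follows, with illustrative vectors of CI summed scores for the two groups.

```r
# Standardized mean difference (effect size) on common-item scores, Equation (13).
effect_size <- function(ci_new, ci_old) {
  nN <- length(ci_new); nO <- length(ci_old)
  pooled_var <- ((nN - 1) * var(ci_new) + (nO - 1) * var(ci_old)) / (nN + nO)
  (mean(ci_new) - mean(ci_old)) / sqrt(pooled_var)
}
```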

Two groups were formed using a selection variable, parental education. To achieve a target level of group difference, the examinees were dichotomized into two groups according to their parental education level. The two subsets consisted of 9,927 and 7,617 examinees with higher and lower parental education levels, respectively. The 3,000 examinees for each study condition were randomly sampled from the two subsets with unequal probabilities. For the ES = .1 condition, the old form pseudo group was created by randomly selecting, without replacement, one third of its examinees (1,000) from the high parental education subset and two thirds (2,000) from the low parental education subset; the new form pseudo group was created using two thirds of its examinees from the high parental education subset and one third from the low parental education subset. This process continued until the effect size for the sampled groups fell within ±.01 of the target ES value. A similar process was performed for the ES = .3 condition, but with more extreme proportions: three fourths and one fourth. For each study condition, data were created for 100 replications.
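The sampling loop for the ES = .1 condition might look like the following sketch, where high_pe and low_pe are hypothetical data frames of examinees with a ci_score column, effect_size() is the function sketched above, and any overlap between the two pseudo groups is ignored for simplicity.

```r
# Resample pseudo groups until the common-item effect size is within .01 of the target.
target <- .1
repeat {
  old_grp <- rbind(high_pe[sample(nrow(high_pe), 1000), ],   # 1/3 high parental education
                   low_pe[sample(nrow(low_pe), 2000), ])     # 2/3 low parental education
  new_grp <- rbind(high_pe[sample(nrow(high_pe), 2000), ],   # 2/3 high parental education
                   low_pe[sample(nrow(low_pe), 1000), ])     # 1/3 low parental education
  es <- effect_size(new_grp$ci_score, old_grp$ci_score)
  if (abs(es - target) <= .01) break
}
```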

Scale Linking

One additional step required in the CINEG equating design is to adjust group differences so that (M)IRT parameter estimates are placed on the same scale. Concurrent calibration was implemented to construct a common metric for the new and old forms using flexMIRT (Cai, 2017). A set of CIs served as a link to put the ability and item parameter estimates on the same scale. One advantage of concurrent calibration is that scale linking is carried out at the time of item calibration, making additional scale transformation unnecessary.

As in UIRT linking, concurrent calibration was used for MIRT linking. One requirement for MIRT linking is that at least one of the CIs has to be associated with each dimension to place the estimates for the two forms on the same dimension-specific metric. Because multidimensionality was defined according to content specifications in this study, the CIs associated with each of the content areas were used to link the two forms with regard to the content-specific ability dimension.

Evaluation Criteria

Since two pseudo forms were constructed by arbitrarily splitting a single operational form, each examinee has scores on both pseudo forms, which allows for single group criterion equating. Specifically, a criterion equating relationship was established using the equipercentile equating procedure based on the entire sample of examinees who took the intact form. Similar to Study 2, six evaluation indices, three conditional and three overall statistics, were used to evaluate the equating procedures based on 100 replications.

Study 4: Single Form With Identity Equating Analysis

The use of a single criterion is almost always open to criticism because every criterion has limitations. It is important to examine the extent to which the findings are observed consistently across various designs and criteria (Kolen & Brennan, 2014). Therefore, the same data used in Study 3 were used again in Study 4, but with a different equating design and criterion. As another criterion, “equating a test to itself” (Kolen & Brennan, 2014) was used.

Data Preparation

For this study, the AP Spanish Literature and Culture test was used as an intact form. As a result, there are 64 MC items, which leads to a composite score range of 0-64 after applying section weights of 1:1:1. To maintain the same CI proportion levels specified in Study 3, an additional 7 MC items needed to be included in the CI set because the test becomes longer with the use of the intact form. A procedure similar to that used in Study 3 was conducted in Study 4 to create two nonequivalent groups.

Once the two groups were created, the form was equated to itself using the two sampled groups. It should be noted that under this setting, equating is not necessary in theory, because the two pseudo groups actually took the same single form. One advantage of this design is that identity equating can be used as the criterion equating. Identity equating refers to a situation in which each score on the old form is considered equivalent to the same score on the new form. As with Study 3, the number of replications was 100 with a fixed sample size of 3,000, and concurrent calibration was used for both UIRT and MIRT as a linking procedure.

Evaluation Criteria

Using identity equating as a criterion equating relationship, conditional and overall statistics were computed based on 100 replications, using Equations (7) through (12). The identity equating criterion also has a limitation: Brennan and Kolen (1987a, 1987b) found that equating methods with fewer parameter estimates tend to provide better results than those with more parameter estimates.

Dimensionality Assessment and Model Fit

The use of MIRT models in equating can provide valuable benefits only when data actually demonstrate some degree of multidimensionality. Regarding the choice of an adequate IRT model (i.e., UIRT vs. MIRT), dimensionality assessment can serve as a means to determine which model is more adequate to represent the data.

First, principal components analysis (PCA) on tetrachoric and polychoric correlations was performed using the computer program R (R Core Team, 2014). Based on the PCA results, scree plots of eigenvalues were constructed to visually inspect the number of factors needed to adequately summarize the data. Second, the revised parallel analysis (R-PA) proposed by Green, Levy, Thompson, Lu, and Lo (2012) was conducted. R-PA corrects a bias in traditional parallel analysis, which ignores previously identified factors when constructing the empirical distribution of eigenvalues. The R-PA method has demonstrated more accurate results than the traditional method (Green, Redell, Thompson, & Levy, 2016). The R code provided by Green et al. (2016) was used to perform the analysis. Finally, the disattenuated correlation was computed, which is a simple index that has been widely used in the MIRT equating literature to quantify the degree of multidimensionality (Brossman, 2010; W. Lee & Brossman, 2012; G. Lee & Lee, 2016; Peterson, 2014). The disattenuated correlation between two sections is computed using the following formula:

$\rho_{T_1 T_2} = \dfrac{\rho_{S_1 S_2}}{\sqrt{\rho_{S_1 S_1} \times \rho_{S_2 S_2}}}$, (14)

where $\rho_{S_1 S_2}$ is the Pearson correlation between the two section scores, and $\rho_{S_1 S_1}$ and $\rho_{S_2 S_2}$ are the coefficient alpha reliabilities for sections 1 and 2, respectively.
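Equation (14) can be computed directly from item-level data; the sketch below assumes sec1 and sec2 are illustrative examinee-by-item score matrices for the two sections.

```r
# Coefficient alpha for one section and the disattenuated correlation of Equation (14).
coef_alpha <- function(items) {
  k <- ncol(items)
  k / (k - 1) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))
}

disattenuated_cor <- function(sec1, sec2) {
  cor(rowSums(sec1), rowSums(sec2)) / sqrt(coef_alpha(sec1) * coef_alpha(sec2))
}
```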

In Studies 1 and 2, item format is viewed as a source of multidimensionality. The source of multidimensionality was identified according to the test specifications for Studies 3 and 4.

Equating Procedures

The results of the SMT procedure were compared to the results for (a) traditional equipercentile with log-linear presmoothing with a degree of 6 (EQ), (b) UIRT true-score (UT), (c) UIRT observed-score (UO), and (d) SS-MIRT observed-score methods (SMO).

UIRT and Equipercentile Equating Methods

The equating relationships for the EQ, UT, and UO methods were found using the open-source computer program Equating Recipes (Brennan, Wang, Kim, & Seol, 2009). The traditional equipercentile equating method was included in this study because it does not make an explicit assumption of unidimensionality. Thus, the impact of multidimensionality on EQ is expected to be smaller, which might produce less biased equating estimates even when some degree of multidimensionality is present. To reduce random error in equating, log-linear presmoothing with a polynomial of degree 6 was incorporated in conducting EQ. For Studies 3 and 4, chained equipercentile (CE) equating was carried out instead of EQ because those studies used the CINEG equating design. CE was also performed with Equating Recipes (Brennan et al., 2009), again incorporating univariate log-linear presmoothing with a polynomial of degree 6 to reduce random error.

One concern in UT using the 3PL model is that there is a score range in which true scores cannot be defined (i.e., below the sum of guessing parameters). Linear interpolation or other ad hoc procedures are usually implemented to determine the equating relationships for this undefined score range. Through an examination of the frequency distribution, it was found that few examinees were located in this area so the results for a raw-score range of 0 to 15 were excluded from evaluation.

For UO, to determine the marginal score distribution, 41 evenly spaced quadrature points and their weights were used with an ability range of −6 to +6. Note that a synthetic weight of 1 was given to the new form group wherever applicable across all equating methods in this article.

SS-MIRT Equating Methods

For SMT, item parameters were estimated under the SS-MIRT framework using flexMIRT (Cai, 2017). Then, multiple sets of equating relationships were obtained for each section separately under the UIRT true-score framework using Equating Recipes (Brennan et al., 2009). The 3PL and the GR models were used to estimate item parameters for the MC and FR items under the SS-MIRT framework, respectively.

A program was written in R (R Core Team, 2014) to find model-based multivariate score distributions. Note that the correlation between the section abilities was estimated for each replication, at the time when item parameters for the SS-MIRT model were estimated using flexMIRT (Cai, 2017). Then, the estimated correlation for the new form was regarded as the correlation for the population ability distribution. For SMO, the equating relationships were obtained using the computer program MIRTeq (W. Lee, 2015).

In sum, the current research involved four distinct studies with different data types. A summary of the studies is provided in Table 3.

Table 3.

Summary of Study Structure for Studies 1 Through 4.

Data: Study 1 = real data; Study 2 = simulated data; Study 3 = pseudo forms data; Study 4 = single intact form data.
Source of multidimensionality: item format (Studies 1 and 2); content specification (Studies 3 and 4).
Factors of interest: NA for Study 1; (a) correlation between sections (.5, .8, and .95) and (b) sample size (1,000 and 5,000) for Study 2; (a) CI proportion (10% and 30%) and (b) group difference (ES = .1 and .3) for Studies 3 and 4.
Equating design: RG (Studies 1 and 2); CINEG (Studies 3 and 4).
Dimensionality assessment: principal components analysis (PCA), revised parallel analysis (R-PA), and disattenuated correlation.
Scale linking: NA (Studies 1 and 2); concurrent calibration (Studies 3 and 4).
Equating methods: (a) EQ, (b) UT, (c) UO, (d) SMO, and (e) SMT for Studies 1 and 2; (a) CE, (b) UT, (c) UO, (d) SMO, and (e) SMT for Studies 3 and 4.
Criterion (equating): DTM (Study 1); large-sample single-group equipercentile (Study 2); single-group equipercentile (Study 3); identity (Study 4).
Number of replications: NA for Study 1; 100 for Studies 2 through 4.

Note. NA = not applicable; CI = common item; ES = effect size; CINEG = common item nonequivalent groups design; RG = random group design; EQ = traditional equipercentile with log-linear presmoothing with a degree of 6; UT = unidimensional item response theory true-score; UO = unidimensional item response theory observed-score; SMO = simple-structure multidimensional item response theory observed-score; SMT = simple-structure multidimensional item response theory true-score; CE = chained equipercentile; DTM = difference that matters.

Results

Study 1: Real Data Analysis

Dimensionality Assessment

Prior to presenting equating results, results from the dimensionality assessments are provided. First, PCA was conducted to gain a rough sense of the dimensionality of the data used for Study 1. Results from the PCA are presented in scree plots, as displayed in Figure 3, with the left plot depicting the new form and the right the old form. Note that in Figure 3, the first eigenvalues are not displayed to better visualize the pattern of the remaining eigenvalues. It can be seen that at least a few more dimensions exist after removing the first eigenvalue for both forms. Although an inspection of a scree plot is useful in obtaining a rough picture of the dimensionality structure, it relies heavily on visual inspection, which in turn requires somewhat subjective human judgment. Thus, it would be informative to have an analytic assessment, which was provided by R-PA. Based on the 95% criterion, 14 and 24 factors were identified for the new and old forms, respectively, from the R-PA results, suggesting that the data are multidimensional.

Figure 3. Scree plots for Studies 1, 3, and 4.

Note. The first eigenvalues have been removed from the plots.

Additional evidence of multidimensionality was found from the confirmatory analysis using disattenuated correlations. Note that in Table 4, the upper off-diagonal elements are estimated theta correlations under the IRT framework, whereas the lower off-diagonal elements are estimated true-score correlations under the CTT framework. The estimated theta correlations between the MC and FR sections for Study 1 were .818 and .799 for the new and old forms, respectively; the estimated disattenuated correlations (i.e., correlations between true scores) were .829 and .811 for the two forms, respectively. Consistent with the results from the PCA, the estimated disattenuated correlations indicate that the two sections in each form are likely to measure different constructs.

Table 4.

Estimated Disattenuated Correlations for Data Used in Studies 1, 3, and 4.

Study 1. New form: [1, .818; .829, 1]. Old form: [1, .799; .811, 1].
Study 3. New form: [1, .964, .509; 1.00, 1, .642; .527, .730, 1]. Old form: [1, .930, .495; .950, 1, .695; .486, .710, 1].
Study 4. New/old form: [1, .959, .583; .975, 1, .741; .556, .749, 1].

Note. The upper off-diagonal elements are estimated theta correlations (i.e., $\hat{\rho}_{\theta_1\theta_2}$); the lower off-diagonal elements are estimated true-score correlations (i.e., $\hat{\rho}_{T_1 T_2}$). Study 4 involved only one data set, which served as both the old and new forms.

Equating Results

Equating results for raw scores are displayed in Figure 4 using the identity equating (i.e., the new form score) as a baseline. It appears from the figure that all equating procedures lead to very similar equating results, except at the upper end of the score range, where UT tends to have smaller equivalents compared with the other procedures. The EQ procedure reveals a different pattern than the other equating procedures, especially in the score range of 10 to 15. This is possibly due to the low, or sometimes zero, frequencies associated with this score range. Note that the equating results were reported only for a raw-score range of 11 to 79, since the IRT true scores are not identified for observed scores of 10 or below (i.e., below the sum of guessing parameters), and no frequency was found for raw scores of 80 or above.

Figure 4. Study 1: Raw-to-raw score equivalents for the five equating procedures.

The equating results for the unrounded scale scores are presented using the EQ procedure as a baseline in Figure 5. In Figure 5, each line represents differences in equivalents between EQ and the other equating procedures. In terms of scale scores, an interpretation was made only for the scale-score range of 11 to 62, which corresponds to the raw-score range of 11 to 79. In Figure 5, the two dotted lines represent the DTM criterion, which is conventionally used to determine whether acceptable equating is achieved (Dorans & Feigenbaum, 1994). In general, a similar pattern is found between the scale-score and the raw-score equating results. Some large discrepancies among the procedures are seen at the upper end of the score scale, where SMO and UT slightly deviate from the DTM lines in opposite directions. In sum, the proposed SMT procedure yields results that are comparable, but not identical, to those of the other procedures.

Figure 5. Study 1: Differences between (multidimensional) item response theory ((M)IRT) and equipercentile (EQ) equated unrounded scale scores.

Study 2: Simulated Data Analysis

Conditional Results

The conditional results for Study 2 are presented in Figure 6. For simplicity, the plots are provided only for a condition with ρ=.5 and N = 1,000 as a similar pattern was observed for the other conditions. In the figure, the top plot pertains to the result for SE, the middle for bias, and the bottom for RMSE. The vertical axis indicates the amount of equating error and the horizontal axis represents the new form raw score.

Figure 6. Study 2: Conditional results for ρ = .5 and N = 1,000.

In general, EQ introduces a substantial amount of SE at the extreme ends of the score distribution. Also, a relatively larger SE tends to be seen for UT at the ends of the score scale. Most of the studied procedures generally provide similar SE results. In the middle plot for bias results, a solid horizontal line indicates the zero line where no bias exists at all. In general, all the lines for the five equating procedures are below the zero line for most of the score range, suggesting a negative bias across the score scale. One notable finding is that the behavior of UT is different from that of the other procedures. That is, UT introduces a large amount of negative bias at the upper and lower score ranges. This unique pattern remained consistent across all levels of correlation. The four equating procedures SMT, SMO, EQ, and UO yield a similar pattern across score points. With respect to the RMSE results, SE contributes to the overall error to a greater extent than bias does, because the SE values are larger in magnitude. Consequently, the plots for RMSE closely resemble those for SE. As can be seen in Figure 6, there is a high degree of consistency among most of the procedures, except EQ and UT. The other three procedures (i.e., SMT, SMO, and UO) generally maintain a similar level of RMSE across the score scale.

Overall Results

Overall results are provided in Table 5 for the sample size of 1,000 in the left columns and for the sample size of 5,000 in the right columns. In Table 5, for each correlation level, the first row contains results for ASE, the second for AB, and the last for ARMSE. The results are described for the sample size of 1,000 first, followed by 5,000.

Table 5.

Study 2: Overall Results.

| | N = 1,000 | | | | | N = 5,000 | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | SMT | SMO | EQ | UT | UO | SMT | SMO | EQ | UT | UO |
| ρ = .5: ASE | .59184 | .58606 | .72708 | .62802 | .59830 | .26591 | .25946 | .32020 | .30131 | .27839 |
| ρ = .5: AB | .17164 | .11658 | .15550 | .42162 | .13813 | .11904 | .07211 | .07328 | .40186 | .09930 |
| ρ = .5: ARMSE | .61623 | .59754 | .74352 | .75642 | .61404 | .29134 | .26931 | .32848 | .50227 | .29557 |
| ρ = .8: ASE | .70331 | .71388 | .83045 | .74548 | .72759 | .29139 | .29041 | .34003 | .31016 | .29702 |
| ρ = .8: AB | .14923 | .09311 | .09343 | .31480 | .11095 | .10402 | .04393 | .05273 | .30176 | .06042 |
| ρ = .8: ARMSE | .71896 | .71992 | .83569 | .80922 | .73600 | .30942 | .29373 | .34408 | .43274 | .30310 |
| ρ = .95: ASE | .72427 | .72549 | .84128 | .73601 | .71874 | .32453 | .32499 | .33413 | .29786 | .28665 |
| ρ = .95: AB | .09731 | .10060 | .07113 | .25255 | .09644 | .06512 | .08526 | .07829 | .24765 | .07362 |
| ρ = .95: ARMSE | .73078 | .73243 | .84429 | .77813 | .72518 | .33101 | .33599 | .34318 | .38736 | .29596 |

Note. ASE = average standard error; AB = average root mean squared bias; ARMSE = average root mean squared error; SMT = simple-structure multidimensional item response theory true-score; SMO = simple-structure multidimensional item response theory observed-score; EQ = equipercentile; UT = unidimensional item response theory true-score; UO = unidimensional item response theory observed-score.

In terms of ASE, in general, smaller ASEs are found with the SS-MIRT procedures (i.e., SMT and SMO) than the UIRT procedures (i.e., UT and UO), with the exception of the largest correlation condition (i.e., ρ=.95). Therefore, more consistent equating relationships are found with SMT and SMO than with UO or UT when the unidimensionality assumption is violated. Also, SMT yields ASE values that are comparable to, or sometimes smaller than, SMO. This finding is particularly surprising because the observed-score equating procedure has been recognized as having less variability than the true-score equating procedure, given that UO has been reported to produce smaller standard errors than UT in previous studies (Cho, 2008; Hagge et al., 2011; Tsai, Hanson, Kolen, & Forsyth, 2001).

According to the AB results for the sample size of 1,000, both SMO and UO generally outperform SMT, except for the correlation of .95. This pattern is to be expected given that the criterion equating relationship was established using EQ based on observed scores generated from the SS-MIRT models, which, in theory, corresponds to the SMO equating. One notable finding is that AB for UT becomes substantially larger as the correlation decreases, while AB for SMT remains relatively consistent across the correlation conditions. In addition, using SMT instead of UT seems to reduce AB substantially, even for approximately unidimensional data (i.e., ρ = .95).

With respect to ARMSE for the smaller sample size, the results generally indicate that the two (M)IRT observed-score procedures (i.e., UO and SMO) show the smallest error, followed by SMT, next by UT, and last by EQ. Under a small or moderate correlation (i.e., ρ = .5 and .8), SMO shows the smallest error among the five procedures. When the data are essentially unidimensional (i.e., ρ = .95), UO outperforms the other procedures, which is in line with previous findings (W. Lee & Brossman, 2012). It is also worth mentioning that a consistent pattern is found for the relationship between SMT and UT. That is, SMT always provides more accurate equating results than UT. This relationship holds for all statistics and all correlation levels. In addition, as the correlation becomes smaller, the difference between the two procedures tends to become larger, with more accurate results associated with SMT.

The equating results for the sample size of 5,000 are generally very similar to those for the sample size of 1,000, with only a few exceptions. First, as anticipated, the use of a larger sample size leads to a reduction in variability. As a result, AB contributes more to ARMSE relative to ASE, which makes the patterns of ARMSE roughly mirror those of AB. Second, increasing the sample size results in smaller differences between equating procedures, especially in the ASE values. In terms of AB, however, the general tendency seems unchanged, except for a substantially smaller AB for EQ under a correlation of .5.

Study 3: Pseudo Forms Data Analysis

Dimensionality Assessment and Model Fit

In Figure 3, for both new and old forms, at least two sharp bends emerge in the curve connecting the second through fourth eigenvalues, implying the existence of multiple dimensions in the data. The results from R-PA also suggest that there are 14 factors needed to explain the correlation matrix for both forms. Additional evidence of multidimensionality can be seen in the disattenuated correlations provided in the middle of Table 4. Both the new and old forms appear to be multidimensional, reflected by considerably lower correlations between the last content domain and the others—disattenuated correlations ranging from .486 to .730.

Conditional Results

Conditional results are provided for SE, bias, and RMSE in Figure 7. The top plot pertains to results for SE, the middle for bias, and the bottom for RMSE. Note that the results are only presented for a condition of ES=.3 and CIp=.1, because a very similar trend was also found for the other conditions. In terms of SE, the four UIRT and MIRT procedures tend to reveal a close alignment. UT generally has slightly larger SEs than UO across the score range, which is in line with previous findings (Cho, 2008; Tsai et al., 2001). CE introduces a substantial amount of SE at both ends of the score scale and CE behaves fairly differently from the other procedures.

Figure 7. Study 3: Conditional results for ES = .3 and CIp = .1. ES = effect size; CIp = common item proportion.

The results for conditional bias demonstrate clear relationships among the studied procedures: UT and UO show a similar pattern, and SMT and SMO nearly overlap each other. Specifically, at the lower end, a substantial amount of negative bias is apparent for UT and UO, whereas a relatively smaller amount of either positive or negative bias is seen for SMT and SMO. In general, the two UIRT procedures yield negative bias at the lower part of the score scale but positive bias at the opposite end of the score scale. On the other hand, the patterns of bias for the two SS-MIRT procedures are relatively flat across the entire score scale. For CE, one unique pattern consistently observed is positive bias at the lower end and negative bias at the upper end, which is the reverse of the tendency of the (M)IRT procedures. The distinct behavior of CE compared with the others is understandable, given that it is the only procedure in this study that is not under the IRT framework. A comparison between SMT and SMO demonstrates better performance of SMT over SMO in that SMT generally introduces bias of smaller magnitude across the score range.

Unlike the results from Study 2, the overall patterns of RMSE more closely resemble those of bias rather than those of SE. This finding probably occurs because of the nature of the equating designs used in Studies 2 and 3. Study 2 used the RG design, whereas Study 3 used the CINEG design, which is more likely to introduce bias of larger magnitude than the RG design. Thus, under the CINEG design, an equating procedure that can minimize the amount of bias would be preferable to a procedure that can reduce the amount of SE. In general, SMT and SMO outperform UT and UO at most of the score points. As with the bias results, the two UIRT procedures reveal a similar pattern, and the two SS-MIRT procedures tend to perform similarly.

Overall Results

The overall results in Table 6 suggest that the smallest ASEs tend to be associated with CE when CIp = .1 and with SMO when CIp = .3. A comparison between CIp = .1 and CIp = .3 shows smaller errors for CIp = .3, which is in line with previous research (Kim & Lee, 2016) confirming the rule of thumb that a CI set should be at least 20% of the total test length. UO tends to outperform UT, but in general, the ASE values produced by the two procedures are comparable. A slightly smaller ASE is found for SMT than for SMO with the smaller CIp, but the reverse pattern is observed with the larger CIp. It seems clear that having a larger CI proportion and a smaller group difference helps reduce the magnitude of ASE. In particular, the impact of CI proportion is substantial, making ASE values under CIp = .1 nearly two times as large as those under CIp = .3.

Table 6.

Study 3: Overall Results.

| | CIp = .1 | | | | | CIp = .3 | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | SMT | SMO | CE | UT | UO | SMT | SMO | CE | UT | UO |
| ES = .1: ASE | .20199 | .20801 | .18152 | .21870 | .20263 | .11216 | .10780 | .14516 | .13210 | .11515 |
| ES = .1: AB | .26153 | .29902 | .24727 | .62296 | .57872 | .10183 | .13472 | .06132 | .28738 | .26151 |
| ES = .1: ARMSE | .33045 | .36427 | .30674 | .66023 | .61316 | .15149 | .17254 | .15761 | .31629 | .28574 |
| ES = .3: ASE | .26363 | .27308 | .17510 | .25642 | .24515 | .12062 | .11798 | .14363 | .14353 | .13191 |
| ES = .3: AB | .24112 | .34054 | .57519 | 1.36590 | 1.29856 | .26544 | .32886 | .09680 | .63487 | .57068 |
| ES = .3: ARMSE | .35727 | .43650 | .60125 | 1.38976 | 1.32150 | .29155 | .34940 | .17321 | .65090 | .58572 |

Note. ES = effect size; CIp = common item proportion; SMT = simple-structure multidimensional item response theory true-score; SMO = simple-structure multidimensional item response theory observed-score; CE = chained equipercentile; UT = unidimensional item response theory true-score; UO = unidimensional item response theory observed-score.

In terms of AB, a clear pattern is found between UIRT and SS-MIRT: SS-MIRT always results in smaller bias than UIRT. More specifically, UT leads to the largest bias across all study conditions, closely followed by UO. Also, a comparison between SMT and SMO suggests better performance of SMT over SMO. CE has the smallest bias among the five procedures except for one condition under ES = .3 and CIp = .1. This pattern is reasonably expected given that the criterion equating relationship is established based on the single-group EQ. It is interesting that the absolute magnitudes of AB increase substantially for the UIRT procedures when a small number of CIs is used and the group difference is large. It seems that UT and UO are affected most seriously by a small number of CIs and a large group difference.

According to the ARMSE results, AB tends to contribute more to the overall equating error than ASE. As a result, UT shows the largest ARMSE across all study conditions, followed by UO. CE tends to provide small ARMSEs except for the condition with CIp=.1 and ES=.3. Also, as with AB, SMT tends to outperform SMO because of its smaller AB. In general, the performance of SMT is reasonably satisfactory, with relatively smaller AB and ARMSE than the other procedures.
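To make these summary indices concrete, the sketch below shows how conditional and overall evaluation statistics of this kind are typically computed from replicated equating results. It is a minimal base R sketch under assumed definitions; the objects equated, criterion, and weights are hypothetical, and the exact weighting scheme used in this study may differ.

# Sketch: conditional and overall evaluation statistics for replicated equating results.
# equated:   R x K matrix of equated scores (R replications, K raw-score points)
# criterion: length-K vector of criterion equivalents at each raw-score point
# weights:   length-K vector of raw-score relative frequencies (summing to 1)
overall_stats <- function(equated, criterion, weights) {
  cond_bias <- colMeans(equated) - criterion   # conditional bias
  cond_se   <- apply(equated, 2, sd)           # conditional standard error
  cond_rmse <- sqrt(cond_bias^2 + cond_se^2)   # conditional RMSE
  c(ASE   = sum(weights * cond_se),            # average standard error
    AB    = sum(weights * abs(cond_bias)),     # average (absolute) bias
    ARMSE = sum(weights * cond_rmse))          # average RMSE
}

Under a formulation like this, a procedure with flat, near-zero conditional bias (such as SMT or SMO here) keeps AB and ARMSE small even when its conditional SE is comparable to that of the other procedures.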

Study 4: Single Form With Identity Equating Analysis

For Study 4, the conditional results are not presented because they showed a pattern very similar to that of Study 3, which was described earlier. This similarity in conditional results between Studies 3 and 4 was reasonably expected because the pseudo-form data used in Study 3 were created by splitting the data used in Study 4. Although the Study 3 forms are half the length of the Study 4 form, the data structure is expected to have been roughly preserved.

Dimensionality Assessment and Model Fit

Similar to Studies 1 and 3, the relative adequacy of SS-MIRT over UIRT for the data was evaluated through a visual inspection of the scree plot presented at the bottom of Figure 3. Note again that for Study 4 there is only one form because a single form was equated back to itself. In Figure 3, the line connecting the second through fourth eigenvalues shows at least two sharp bends, and the shape of the scree plot for the Study 4 data closely resembles that for the Study 3 data. As with Study 3, an examination of the scree plot suggests that multiple dimensions are needed to explain the data structure. The results from R-PA were also indicative of multidimensionality, identifying 26 underlying dimensions for the data used in Study 4.
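For readers who wish to conduct a similar inspection, a basic parallel analysis can be sketched in base R as shown below. This is only an illustrative sketch of the traditional permutation-based approach; the R-PA used in this study follows the revised procedure of Green et al. (2012, 2016), which differs, and the object responses and the function name are hypothetical.

# Sketch: traditional parallel analysis of an examinee-by-item matrix of scored responses.
parallel_analysis <- function(responses, n_draws = 100) {
  obs_eigen  <- eigen(cor(responses))$values        # observed eigenvalues (scree values)
  rand_eigen <- replicate(n_draws, {
    permuted <- apply(responses, 2, sample)         # permute each item's responses independently
    eigen(cor(permuted))$values
  })
  reference  <- apply(rand_eigen, 1, quantile, probs = .95)  # 95th percentile reference eigenvalues
  data.frame(component = seq_along(obs_eigen),
             observed  = obs_eigen,
             reference = reference,
             retain    = obs_eigen > reference)
}

Components whose observed eigenvalues exceed the reference values would be retained, and the observed eigenvalues themselves can be plotted against their component numbers to produce the kind of scree plot referred to above.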

The existence of multidimensionality is also suggested by the disattenuated correlations provided at the bottom of Table 4. The first two content domains are approximately unidimensional, with a disattenuated correlation of .975, whereas the third domain contributes most to the multidimensionality of the data, with disattenuated correlations of .556 and .749 with the other two content domains.
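As a point of reference, a disattenuated correlation between two content-domain subscores divides the observed subscore correlation by the square root of the product of the subscore reliabilities. A minimal base R sketch follows, using coefficient alpha as the reliability estimate; the item matrices and function names are hypothetical, and the reliability estimate used for Table 4 may differ.

# Sketch: disattenuated correlation between two content-domain subscores.
# items1, items2: examinee-by-item matrices of scored responses for the two domains.
coefficient_alpha <- function(items) {
  k <- ncol(items)
  (k / (k - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))
}
disattenuated_cor <- function(items1, items2) {
  r_obs <- cor(rowSums(items1), rowSums(items2))   # observed subscore correlation
  r_obs / sqrt(coefficient_alpha(items1) * coefficient_alpha(items2))
}

A value near 1 (such as the .975 between the first two domains) indicates that two domains measure essentially the same construct once measurement error is removed, whereas values well below 1 (such as .556 and .749) point to distinct dimensions.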

Overall Results

As shown in Table 7, the smallest ASEs tend to correspond to SMT when CIp=.1 and to SMO when CIp=.3. In terms of ASE, UO tends to perform better than UT, although the difference in ASE values is small. Also, the results reconfirm the finding from Study 3 that using a smaller proportion of CIs leads to substantially larger ASE regardless of which equating procedure is used.

Table 7.

Study 4: Overall Results.

                         CIp=.1                                           CIp=.3
             SMT      SMO      CE       UT       UO        SMT      SMO      CE       UT       UO
ES=.1  ASE   .30116   .31013   .32334   .33779   .32119    .18770   .18314   .24012   .21237   .19409
       AB    .74526   .71988   .38743  1.02826   .97872    .08562   .06419   .11100   .27559   .24702
       ARMSE .80381   .78384   .50462  1.08232  1.03008    .20630   .19406   .26454   .34791   .31415
ES=.3  ASE   .31008   .31410   .30651   .35946   .34830    .17889   .17326   .22168   .21594   .20418
       AB   1.20106  1.11543   .98530  2.30975  2.24241    .22570   .23898   .26249   .56638   .50873
       ARMSE 1.24044 1.15881  1.03188  2.33755  2.26930    .28801   .29519   .34357   .60615   .54818

Note. ES = effect size; CIp = common item proportion; SMT = simple-structure multidimensional item response theory true-score; SMO = simple-structure multidimensional item response theory observed-score; CE = chained equipercentile; UT = unidimensional item response theory true-score; UO = unidimensional item response theory observed-score.

According to AB, CE performs best under CIp=.1, and either SMO or SMT provides the smallest ABs when CIp=.3. UT always results in the largest AB, followed by UO, and the amounts of AB observed for UT and UO are substantially larger than those for SMT, SMO, and CE. In sum, among the (M)IRT procedures, the multidimensional procedures clearly lead to smaller bias than the unidimensional procedures. This general pattern is also found in the results from Study 3, which strengthens the argument for taking the multidimensionality of the data into account when conducting equating.

According to the ARMSE results, SMT and SMO are always preferred over UT and UO. The smallest ARMSEs are observed for CE when the CI proportion is small, whereas the smallest are found for either SMO or SMT when the CI proportion is large. Also, smaller ARMSEs are generally found for SMO than for SMT, except for one condition (ES=.3 and CIp=.3).

Discussion

The purpose of this study was to develop a theoretical and conceptual framework for true-score equating using SS-MIRT models. Current psychometric practice is moving toward multidimensional approaches because examinees’ responses to test items are likely governed by multiple abilities. The Standards make it clear that the assumptions required by the procedures used for equating should be considered:

When model-based psychometric procedures are used, technical documentation should be provided . . . Such documentation should include the assumptions and procedures that were used to establish comparability . . . (American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing, 2014, p. 106)

If there is clear evidence supporting the multidimensionality of data, the use of a unidimensional approach in equating may not be justifiable. This article aims to shed light on equating that employs the multidimensional framework.

Despite the growing recognition of the prevalence of multidimensionality, there is still a paucity of literature presenting the theoretical development of multidimensional equating methods that faithfully capture the multidimensional structure of the data. Even rarer is IRT true-score equating that uses multidimensional models, given that the way true-score equating is conducted makes it more challenging to incorporate multiple latent dimensions. The need for a sound equating framework for multidimensional data motivated this study.

In addition to SMT, four competitors were included in the analyses to assess the relative benefits of SMT over the other procedures: (a) equipercentile equating with log-linear presmoothing, (b) UIRT true-score equating, (c) UIRT observed-score equating, and (d) SS-MIRT observed-score equating. Four separate studies were carried out using different datasets under either the RG or the CINEG equating design. The performance of SMT was examined with respect to various criteria tailored to each study design and data manipulation procedure. Note that the general conclusions in this section are based on the evidence collected across the four studies. First, the SS-MIRT procedures generally provided more accurate equating results than the other equating procedures, including the UIRT procedures, when the degree of multidimensionality was substantial. This pattern has been consistently observed in previous studies with various MIRT equating methods (E. Lee, Lee, & Brennan, 2014; G. Lee & Lee, 2016; Peterson & Lee, 2014; Tao & Cao, 2016).

Second, in addition to confirming previous findings, one important contribution of this study is the comparison between UIRT and MIRT true-score equating, which has received little attention in the literature. A few studies (E. Lee et al., 2014; G. Lee et al., 2015; Tao & Cao, 2016) have reported results from both MIRT-based (e.g., bifactor) true-score equating and UT, but their focus was not on the comparison between the two. The comparisons made in this study showed that SMT always resulted in more accurate equating results than UT across all the studies conducted in this research, regardless of the degree of multidimensionality. More specifically, UT consistently showed a substantial amount of bias relative to SMT, which led to larger overall equating error. This trend was observed across all studies and became even more noticeable for the data that showed more multidimensionality. Given that IRT true-score equating is used more commonly in practice than IRT observed-score equating (Kolen & Brennan, 2014), this study has important practical implications. Although the findings might support the use of SMT over UT even for unidimensional data, some plausible explanations for this observation are worth further investigation. One possible explanation is that SMT uses a multivariate observed-score distribution. Although SMT ultimately seeks the relationship between true scores on the two forms, the intermediate step involves the use of multivariate observed-score distributions, not true-score distributions. In this sense, SMT might be called a “hybrid” method that mixes observed-score and true-score equating. Given that the criterion used in Study 2 possibly favored observed-score methods, the use of the multivariate observed-score distribution could have been advantageous for SMT over UT, which is one possible explanation for the better results associated with SMT.

Third, another interesting finding concerns the comparison between SMT and SMO. The results from Study 1 revealed similar equating functions for the two methods. SMT tended to have larger ARMSE than SMO in Studies 2 and 4 but provided smaller overall equating error in Study 3. In general, the results from the four studies suggest that the two methods lead to similar equating results. Given that the criterion used in Study 2 possibly favored SMO over SMT, this study supports the benefit of using SMT alongside SMO.

Several study factors were examined through the simulation study and data manipulation methods: (a) sample size, (b) the degree of multidimensionality, (c) common-item proportion, and (d) group difference. In terms of sample size (a), both the UIRT and SS-MIRT procedures tended to have smaller equating error with a larger sample size, primarily because of reduced standard errors of equating. Concerning the degree of multidimensionality (b), it was clear, as discussed earlier, that the SS-MIRT procedures were more accurate for multidimensional data than the UIRT procedures. Under the CINEG design, the UIRT procedures were more sensitive to both common-item proportion (c) and group difference (d), whereas the SS-MIRT procedures seemed relatively robust to a small common-item proportion and a large group difference.

Although the current research successfully presented SMT and provided promising results, it is not without limitations. First, several replications failed to produce an output file from flexMIRT (Cai, 2017) for Studies 3 and 4, which involved three dimensions. Those replications were dropped from the analysis, and the final evaluation statistics were computed without them. Eliminating a subset of replications could have influenced the final equating results and might have introduced some degree of unintended bias. Another limitation related to the estimation process is the considerable amount of time required to calibrate the SS-MIRT models. Although the exact estimation time was not measured, it took about 6 hours on average to calibrate items for a single replication of Study 4; in total, nearly 6 months was devoted to calibrating items for Studies 3 and 4. Last, this article considered two particular data structures: two dimensions arising from item format and three dimensions arising from content specifications. In practice, there are various other data structures with even higher dimensionality. More research on SMT with other types of data structures is needed.

Based on the general conclusions above, several implications can be drawn. First, whenever equating is conducted, careful attention should be paid to the dimensional structure of the data. Many tests are inherently multidimensional, and there are numerous sources of multidimensionality that are not always easily recognizable. It is advisable to obtain a rough picture of the dimensional structure first, through exploratory factor analysis or a confirmatory analysis such as disattenuated correlations. If the test turns out to be multidimensional, MIRT procedures might be preferable to a UIRT procedure. Second, the investigation should also be supported by external evidence such as test content specifications. Peterson (2014) pointed out that results from exploratory factor analysis might have restricted interpretations and recommended identifying latent dimensions according to test blueprints. SMT is inherently confirmatory in the sense that the data structure for SMT is determined by the test specifications established at the test development stage, not by the observed data. Thus, the rationale for using SMT should also be driven by the nature of the test specifications.

There are numerous potential applications of SMT and SMO in practical equating settings. For example, if a test is composed of several subtests such as reading, writing, and math (e.g., SAT; College Board, 2014), equating can be carried out at the subtest level as an intermediate product of SMT, along with the equating results at the composite score level. By using either SMT or SMO, reported scores carry richer information in that both subtest-level and composite-level results are provided, and they gain accuracy in that the data structure is taken into account when equating composite scores. Another application of SMT or SMO is to keep both procedures “in a toolbox” along with other equating procedures such as UT or UO, so that practitioners have more options for equating.

Acknowledgments

The authors thank the College Board for making available the Advanced Placement Examination data used in this article.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing (US). (2014). Standards for educational and psychological testing. Washington, DC: Author.
2. Birnbaum A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord F. M., Novick M. R. (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
3. Brennan R. L., Kolen M. J. (1987a). Some practical issues in equating. Applied Psychological Measurement, 11, 279-290.
4. Brennan R. L., Kolen M. J. (1987b). A reply to Angoff. Applied Psychological Measurement, 11, 301-306.
5. Brennan R. L., Wang T., Kim S., Seol J. (2009). Equating recipes (CASMA Monograph Number 1). Iowa City: CASMA, University of Iowa.
6. Bridgeman B. (1992). A comparison of quantitative questions in open-ended and multiple-choice formats. Journal of Educational Measurement, 29, 253-271.
7. Brossman B. G. (2010). Observed score and true score equating procedures for multidimensional item response theory (Unpublished doctoral dissertation). University of Iowa, Iowa City.
8. Brossman B. G., Lee W. (2013). Observed score and true score equating procedures for multidimensional item response theory. Applied Psychological Measurement, 37, 460-481.
9. Cai L. (2017). flexMIRT (Version 1.88) [Computer program]. Chapel Hill, NC: Vector Psychometric Group.
10. Cho Y. (2008). Comparison of bootstrap standard errors of equating using IRT and equipercentile methods with polytomously-scored items under the common-item nonequivalent-groups design (Unpublished doctoral dissertation). University of Iowa, Iowa City.
11. College Board. (2014). Test specifications for the redesigned SAT. Retrieved from https://www.collegeboard.org/sites/default/files/test_specifications_for_the_redesigned_sat_na3.pdf
12. Dorans N. J. (2000). Distinctions among classes of linkages (College Board Research Note RN-11). New York, NY: College Board.
13. Dorans N. J., Feigenbaum M. D. (1994). Equating issues engendered by changes to the SAT and PSAT/NMSQT. In Lawrence I. M., Dorans N. J., Feigenbaum M. D., Feryok N. J., Schmitt A. P., Wright N. K. (Eds.), Technical issues related to the introduction of the new SAT and PSAT/NMSQT (RM-94-10; pp. 91-122). Princeton, NJ: Educational Testing Service.
14. Dorans N. J., Holland P. W., Thayer D. T., Tateneni K. (2003). Invariance of score linking across gender groups for three advanced placement program exams. In Dorans N. J. (Ed.), Population invariance of score linking: Theory and applications to advanced placement program examinations (Research Report 03-27; pp. 79-118). Princeton, NJ: Educational Testing Service.
15. Green S. B., Levy R., Thompson M. S., Lu M., Lo W. J. (2012). A proposed solution to the problem with using completely random data to assess the number of factors with parallel analysis. Educational and Psychological Measurement, 72, 357-374.
16. Green S. B., Redell N., Thompson M. S., Levy R. (2016). Accuracy of revised and traditional parallel analyses for assessing dimensionality with binary data. Educational and Psychological Measurement, 76, 5-21.
17. Hagge S. L., Liu C., He Y., Powers S. J., Wang W., Kolen M. J. (2011). A comparison of IRT and traditional equipercentile methods in mixed-format equating. In Kolen M. J., Lee W. (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 1; CASMA Monograph No. 2.1; pp. 19-50). Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa.
18. Hanson B. A. (1994). An extension of the Lord-Wingersky algorithm to polytomous items (Unpublished research note). Iowa City, IA: ACT.
19. Hanson B. A., Béguin A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26, 3-24.
20. Kennedy P., Walstad W. B. (1997). Combining multiple-choice and constructed-response test scores: An economist’s view. Applied Measurement in Education, 10, 359-375.
21. Kim S. Y., Lee W. (2016). Composition of common items for equating with mixed-format tests. In Kolen M. J., Lee W.-C. (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 4; CASMA Monograph No. 2.4; pp. 7-46). Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa. Retrieved from https://education.uiowa.edu/sites/education.uiowa.edu/files/documents/centers/casma/publications/casma-monograph-2.4.pdf
22. Kolen M. J., Brennan R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). New York, NY: Springer.
23. Lee E., Lee W., Brennan R. L. (2014). Equating multidimensional tests under a random groups design: A comparison of various equating procedures (CASMA Research Report No. 40). Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa. Retrieved from https://education.uiowa.edu/sites/education.uiowa.edu/files/documents/centers/casma/publications/casma-research-report-40.pdf
24. Lee G., Lee W. (2016). Bi-factor MIRT observed-score equating for mixed-format tests. Applied Measurement in Education, 29, 224-241.
25. Lee G., Lee W., Kolen M. J., Park I.-Y., Kim D.-I., Yang J. S. (2015). Bi-factor MIRT true-score equating for testlet-based tests. Journal of Educational Evaluation, 28, 681-700.
26. Lee W. (2015). MIRTeq: Multidimensional item response theory equating [Computer software]. Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa.
27. Lee W., Brossman B. G. (2012). Observed score equating for mixed-format tests using a simple-structure multidimensional IRT framework. In Kolen M. J., Lee W. (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 2; CASMA Monograph No. 2.2; pp. 115-142). Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa. Retrieved from https://education.uiowa.edu/sites/education.uiowa.edu/files/documents/centers/casma/publications/casma-monograph-2.2.pdf
28. Lee W., He Y., Hagge S., Wang W., Kolen M. J. (2012). Equating mixed-format tests using dichotomous common items. In Kolen M. J., Lee W. (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 2; CASMA Monograph No. 2.2; pp. 13-44). Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa. Retrieved from https://education.uiowa.edu/sites/education.uiowa.edu/files/documents/centers/casma/publications/casma-monograph-2.2.pdf
29. Lord F. M., Wingersky M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings.” Applied Psychological Measurement, 8, 452-461.
30. Marsh H. W., Lüdtke O., Nagengast B., Morin A. J., von Davier M. (2013). Why item parcels are (almost) never appropriate: Two wrongs do not make a right—Camouflaging misspecification with item parcels in CFA models. Psychological Methods, 18, 257-284.
31. McDonald R. P. (2000). A basis for multidimensional item response theory. Applied Psychological Measurement, 24, 99-114.
32. McKinley R. L., Reckase M. D. (1982). The use of the general Rasch model with multidimensional item response data (Research Report ONR 82-1). Iowa City, IA: American College Testing.
33. Mulaik S. A. (1972, March). A mathematical investigation of some multidimensional Rasch models for psychological tests. Paper presented at the annual meeting of the Psychometric Society, Princeton, NJ.
34. Peterson J. (2014). Multidimensional item response theory observed score equating methods for mixed-format tests (Unpublished doctoral dissertation). University of Iowa, Iowa City.
35. Peterson J., Lee W. (2014). Multidimensional item response theory observed score equating methods for mixed-format tests. In Kolen M. J., Lee W. (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 2; CASMA Monograph No. 2.3). Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa. Retrieved from http://www.education.uiowa.edu/casma
36. Powers S. J., Hagge S. L., Wang W., He Y., Liu C., Kolen M. J. (2011). Effects of group differences on mixed-format equating. In Kolen M. J., Lee W. (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 1; CASMA Monograph No. 2.1; pp. 51-74). Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa. Retrieved from https://education.uiowa.edu/sites/education.uiowa.edu/files/documents/centers/casma/publications/casma-monograph-2.1.pdf
37. R Core Team. (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
38. Reckase M. D. (1972). Development and application of a multivariate logistic latent trait model (Unpublished doctoral dissertation). Syracuse University, Syracuse, NY.
39. Reckase M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
40. Samejima F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph No. 17). Richmond, VA: Psychometric Society. Retrieved from https://www.psychometricsociety.org/sites/default/files/pdf/MN17.pdf
41. Sass D. A., Schmitt T. A. (2010). A comparative investigation of rotation criteria within exploratory factor analysis. Multivariate Behavioral Research, 45, 73-103.
42. Sympson J. B. (1978). A model for testing with multidimensional items. In Weiss D. J. (Ed.), Proceedings of the 1977 Computerized Adaptive Testing Conference (pp. 82-98). Minneapolis: University of Minnesota.
43. Tao W., Cao Y. (2016). An extension of IRT-based equating to the dichotomous testlet response theory model. Applied Measurement in Education, 29, 108-121.
44. Thurstone L. L. (1935). The vectors of mind: Multiple-factor analysis for the isolation of primary traits. Chicago, IL: University of Chicago Press.
45. Thurstone L. L. (1947). Multiple factor analysis. Chicago, IL: University of Chicago Press.
46. Tsai T.-H., Hanson B. A., Kolen M. J., Forsyth R. A. (2001). A comparison of bootstrap standard errors of IRT equating methods for the common-item nonequivalent groups design. Applied Measurement in Education, 14, 17-30.
47. Whitely S. E. (1980). Measuring aptitude processes with multicomponent latent trait models (Technical Report No. NIE-80-5). Lawrence: University of Kansas.
48. Zhang M., Kolen M. J., Lee W. (2014). A comparison of test dimensionality assessment approaches for mixed-format tests. In Kolen M. J., Lee W. (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 3; CASMA Monograph No. 2.3; pp. 161-200). Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa. Retrieved from https://education.uiowa.edu/sites/education.uiowa.edu/files/documents/centers/casma/publications/casma-monograph-2.3.pdf
