Abstract
Equating is a statistical procedure used to adjust for differences in form difficulty such that scores on those forms can be used and interpreted comparably. In practice, however, equating methods are often implemented without considering the extent to which two forms differ in difficulty. This study examines the effect of the magnitude of a form difficulty difference on equating results under the random group (RG) and common-item nonequivalent group (CINEG) designs. Specifically, it evaluates the performance of six equating methods under a set of simulation conditions that includes varying levels of form difference. Results revealed that, under the RG design, mean equating was the most accurate method when there was no or small form difference, whereas equipercentile equating was the most accurate method when the difficulty difference was medium or large. Under the CINEG design, Tucker linear was found to be the most accurate method when the difficulty difference was medium or small, and either chained equipercentile or frequency estimation was preferred with a large difficulty difference. This study provides practitioners with research evidence–based guidance in the choice of equating methods under varying levels of form difference. As the condition of no form difficulty difference is also included, this study can inform testing companies of appropriate equating methods when two forms are similar in difficulty level.
Keywords: equating, form difficulty
Introduction
Multiple alternate forms of a test that are built to the same content and statistical specifications are sometimes administered to examinees to provide sufficient testing opportunities while maintaining test security. Although carefully constructed, items on those forms are not necessarily identical in difficulty. These differences in difficulty among test forms can undermine the comparability of score interpretations obtained from these forms, potentially benefiting or penalizing some students unfairly. Equating is a statistical procedure used to adjust for differences in form difficulty such that scores on those forms can be used and interpreted comparably (Kolen & Brennan, 2014). Several equating methods that can be used under the random group (RG) and common-item nonequivalent group (CINEG) designs have appeared in the literature (Kolen & Brennan, 2014). In practice, these equating methods are often implemented without considering the extent to which two forms (Forms X and Y) differ in difficulty; group equivalence is usually the primary consideration in choosing between the RG and CINEG designs. However, the literature suggests that form difficulty differences influence equating results (Heh, 2007; Skaggs, 2005). Heh (2007) found that either identity or mean (MN) equating is preferred when there is no or small form difference, and equipercentile (EQ) is favored as the form difficulty differences increase. However, the existing studies (e.g., Heh, 2007; Skaggs, 2005) focus on small-sample equating with the RG design, and our knowledge is limited regarding the performance of equating methods with medium or large samples, or with the CINEG design, as a function of varying levels of form difficulty differences. As test forms are often built to the same content and statistical specifications, it is likely that two forms are very similar in difficulty. So far, however, not much attention has been paid to examining the impact of form similarity on equating. A few equating studies (e.g., Lee, 2013; Suh et al., 2009) considered no form difference as a simulation condition, but none explicitly investigated equating performance as a function of form difference. This study aims to examine the effect of a form difficulty difference on the equating accuracy of various equating methods under the RG and CINEG designs.
Literature Review
Heh (2007) examined the performance of nine equating methods (i.e., identity, MN, linear [LN], unsmoothed EQ, and EQ with 2–6 moments of polynomial log-linear presmoothing) under an RG design. Data were simulated with sample sizes of 20, 50, 75, 100, and 200 and six levels of form difficulty differences (0, 0.15, 0.30, 0.45, 0.60, and 0.75). The study concluded that either identity or MN equating is preferred when there is no or small form difficulty difference (effect size [ES] of 0 or 0.15), and EQ is favored as the form difficulty differences increase (ES ranging from 0.30 to 0.75). Although it is promising to see that interest in difficulty differences has emerged in the equating context, the design was limited to the RG design, and not much attention was paid to this particular factor, as the primary research purpose was to examine equating methods for small samples.
Aşiret and Sünbül (2016) compared and evaluated the identity, mean, linear, circle-arc, and presmoothed EQ equating methods under varying levels of small sample sizes (ranging from 10 to 200), Form Y guessing parameters (0, 0.1, 0.2, and 0.25), and form differences in difficulty (0.1, 0.4, and 0.7). Results revealed that the performance of the equating methods was associated with form differences. The identity method was the most accurate method with forms of small difficulty difference and the least accurate method with forms of large difficulty difference. The circle-arc method was favored as the form difficulty difference increased. MN equating was the second most accurate method, and the performance of the other methods remained consistent across varying levels of form difference in difficulty. Although this study considered a few levels of form differences in evaluating equating methods, the primary focus was again on small samples, and thus only limited information was available with regard to the role of form differences.
Skaggs (2005) examined the performance of MN, LN, unsmoothed EQ, and EQ equating using two through six moments of log-linear presmoothing with small samples (i.e., 25, 50, 75, 100, 150, and 200) under the RG design. Two sets of simulations were conducted, with one having two identical forms and the other one having forms that differed by a tenth of a standard deviation in difficulty. Results revealed that identity equating yielded a substantial amount of misclassification of examinees with forms that differed. This study, however, did not examine the performance of these equating methods as a function of form difficulty difference.
Form difference in difficulty was also found to influence the performance of equating methods with different equating criteria. For example, Tong and Kolen (2005) evaluated three equating methods (i.e., the presmoothed equating method, the item response theory [IRT] true score method, and the IRT observed-score method) based on three equating criteria, the same distributions property, the first-order equity property, and the second-order equity property. Researchers found that the first- and second-order equity properties are difficult to hold as the form difference increases.
The literature collectively suggests that the performance of equating methods depends on the level of form difference (e.g., Aşiret & Sünbül, 2016; Heh, 2007). However, as evidenced by the small number of relevant studies, not much research has examined to what extent the form difference affects the performance of a specific equating method. Also, the previous studies explored this factor under the RG design with small sample sizes. Thus, our understanding is virtually nonexistent for equating methods under the CINEG design or with varying levels of sample size. It is often taken for granted that equating is necessary when multiple forms are designed and circulated. However, that assumption needs to be challenged by research to better inform practice. This study aims to shed light on the role of form difference in equating, a largely neglected factor in the equating literature.
Equating Design
To collect data for equating, an appropriate design needs to be developed and implemented (Livingston, 2004). The three most commonly used equating designs are the single group (SG) design, the RG design, and the CINEG design (Kolen & Brennan, 2014). The SG design is used relatively infrequently because it requires repeated measurement of a single group, which is often not practical. In the RG design, Form X (often defined as a new form) and Form Y (often defined as an old form) are randomly administered to examinees. Because the two forms are taken by randomly sampled groups that are considered equivalent in their ability distributions, score differences on the two forms are attributable solely to form differences in difficulty. The CINEG design, also called the nonequivalent groups with anchor test (NEAT) design, is employed when the administration of multiple test forms on one test date is not feasible. Under the CINEG design, two test forms with a set of common items are administered to two groups of examinees, and the groups can differ in their ability distributions. Score differences on the two forms can be attributed to either group differences or form differences, and the set of common items is used to separate the former from the latter. The choice of equating design is determined by several factors, such as the feasibility of administering multiple forms on one test date or the practicability of administering multiple forms to the same examinees (Aşiret & Sünbül, 2016). Equating designs are implemented in conjunction with a variety of equating methods, including the MN, linear (LN), and EQ methods.
Equating Methods
MN Equating Using the RG Design
In MN equating, the difference in form difficulty between Form X and Form Y is assumed to be a constant along the score scale (Kolen & Brennan, 2014). The constant is estimated as the mean difference between the two forms. Thus, in MN equating, scores on Form X are equated to the Form Y scale by adding the mean difference to every Form X score. Consequently, only a single parameter (i.e., the mean) is estimated for this method. The mean equating function can be expressed as

$$m_Y(x) = x - \mu(X) + \mu(Y), \quad (1)$$

where $x$ is a score on Form X, $m_Y(x)$ is the equated score under MN equating, $\mu(X)$ is the mean of scores on Form X, and $\mu(Y)$ is the mean of scores on Form Y.
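As a hypothetical illustration (values invented here), suppose the mean is 40.0 on Form X and 42.3 on Form Y. Every Form X score is then shifted up by the mean difference:

$$m_Y(x) = x - 40.0 + 42.3 = x + 2.3,$$

so a Form X score of 30 corresponds to 32.3 on the Form Y scale.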
LN Equating Using the RG Design
Different from MN equating, in LN equating, Form X can differ from Form Y in difficulty by varying amounts along the score scale (Kolen & Brennan, 2014). LN defines a linear relationship between scores on the two forms by setting equal the standardized deviation scores ($z$ scores) on Form X and Form Y. That is, scores on Form X are equated to the scores on Form Y that are the same distance from their respective means in standard deviation units. The LN function can be defined as

$$l_Y(x) = \frac{\sigma(Y)}{\sigma(X)}\bigl[x - \mu(X)\bigr] + \mu(Y), \quad (2)$$

where $l_Y(x)$ is the equated score under LN equating, and $\sigma(X)$ and $\sigma(Y)$ are the standard deviations of scores on Form X and Form Y, respectively. Different from MN, LN uses two parameters (i.e., the mean and the standard deviation) to adjust for the form difference in difficulty.
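Continuing the hypothetical illustration, suppose $\mu(X) = 40.0$, $\mu(Y) = 42.3$, $\sigma(X) = 8$, and $\sigma(Y) = 10$. Then

$$l_Y(x) = \frac{10}{8}(x - 40.0) + 42.3 = 1.25x - 7.7,$$

so the adjustment is no longer a constant: a Form X score of 32 maps to 32.3, whereas a score of 48 maps to 52.3.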
EQ Equating Using the RG Design
EQ allows a nonlinear relationship in difficulty differences between Form X and Form Y. In EQ, scores on Form X equated to the Form Y scale have the same distribution as scores on Form Y (Kolen & Brennan, 2014). EQ is implemented by identifying the score on Form Y that has the same percentile rank as a given score on Form X. The percentile rank of a score $x$ refers to the percentage of examinees in the frequency distribution who earned scores of $x$ and below. EQ assumes that scores are continuous and that examinees with the score $x$ are uniformly distributed over the range $x - 0.5$ to $x + 0.5$, assuming a score unit of 1. When EQ is performed, multiple points along the scale (i.e., as many as the number of score points) are used to adjust for form differences.
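In the notation of Kolen and Brennan (2014), the EQ function can be written compactly as

$$e_Y(x) = Q^{-1}\bigl[P(x)\bigr],$$

where $P$ is the percentile rank function for Form X and $Q^{-1}$ is the inverse of the percentile rank function for Form Y; that is, a Form X score is sent to the Form Y score holding the same percentile rank.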
Tucker Linear Using the CINEG Design
Similar to LN equating under the RG design, under the CINEG design the Tucker linear (TL) equating method is conducted using the following equation:

$$l_{Y_s}(x) = \frac{\sigma_s(Y)}{\sigma_s(X)}\bigl[x - \mu_s(X)\bigr] + \mu_s(Y), \quad (3)$$

where $\mu_s(X)$, $\mu_s(Y)$, $\sigma_s(X)$, and $\sigma_s(Y)$ refer to the means and standard deviations of Form X and Form Y scores for the synthetic population, which is a weighted combination of the groups of examinees from the two populations.
The TL method makes two statistical assumptions, which are used to estimate the unknown parameters. First, TL assumes that the regression of X on the common-item scores (V) is the same in the two populations (Gulliksen, 1950). Second, the conditional variance of X given V is assumed to be identical in the two populations. The same assumptions are made for Form Y. As with LN under the RG design, two parameters, the mean and the standard deviation, are estimated to adjust for the form difference in difficulty.
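Concretely, under these assumptions the synthetic-population moments in Equation 3 can be written in terms of observable quantities (Kolen & Brennan, 2014), using the regression slope $\gamma_1 = \sigma_1(X, V)/\sigma_1^2(V)$ for Population 1 (with $\gamma_2$ defined analogously for Form Y in Population 2):

$$\mu_s(X) = \mu_1(X) - w_2 \gamma_1 \bigl[\mu_1(V) - \mu_2(V)\bigr],$$

$$\sigma_s^2(X) = \sigma_1^2(X) - w_2 \gamma_1^2 \bigl[\sigma_1^2(V) - \sigma_2^2(V)\bigr] + w_1 w_2 \gamma_1^2 \bigl[\mu_1(V) - \mu_2(V)\bigr]^2,$$

with parallel expressions for $\mu_s(Y)$ and $\sigma_s^2(Y)$.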
Chained EQ Using the CINEG Design
Chained equipercentile (CE) equating was first described by Angoff (1971). A score on Form X is equated to the Form Y scale by (a) defining an EQ function, $e_{V1}(x)$, that equates scores on Form X to the common-item scale using examinees from Population 1; (b) defining an EQ function, $e_{Y2}(v)$, that equates scores on the common items to the Form Y scale using examinees from Population 2; and (c) equating scores on Form X to scores on Form Y by chaining the two functions:

$$e_{Y(\mathrm{chain})}(x) = e_{Y2}\bigl[e_{V1}(x)\bigr]. \quad (4)$$
CE assumes that the equating of scores on Form X to the common-item scale is the same for Population 1 and Population 2, and the same assumption holds for the equating of scores on common items to Form Y (Albano, 2016).
Frequency Estimation Using the CINEG Design
Frequency estimation (FE) is another alternative to EQ equating under the CINEG design. Similar to EQ equating under the RG design, the FE method requires that equated Form X scores on the Form Y scale have the same distribution as scores on Form Y. The difference between EQ and FE is that FE uses the synthetic population, denoted by the subscript $s$ in the following equations:

$$f_s(x) = w_1 f_1(x) + w_2 f_2(x), \quad (5)$$

$$g_s(y) = w_1 g_1(y) + w_2 g_2(y), \quad (6)$$

where subscripts 1 and 2 refer to the populations administered Form X and Form Y, respectively; $f$ and $g$ represent the frequency distributions for Form X and Form Y; and $w_1$ and $w_2$ are the weights assigned to the two populations ($w_1 + w_2 = 1$). Although $f_2(x)$ and $g_1(y)$ cannot be estimated directly, they can be estimated under the assumption that the conditional distribution of scores on Form X given each common-item score $v$ is the same in the two populations: $f_1(x \mid v) = f_2(x \mid v)$. The same assumption also holds for Form Y. Based on these two assumptions, the distributions of scores on Form X and Form Y for the synthetic population can be expressed as

$$f_s(x) = w_1 f_1(x) + w_2 \sum_v f_1(x \mid v)\, h_2(v), \quad (7)$$

$$g_s(y) = w_1 \sum_v g_2(y \mid v)\, h_1(v) + w_2 g_2(y), \quad (8)$$

where $h_1(v)$ and $h_2(v)$ are the frequency distributions of common-item scores for Population 1 and Population 2, respectively. Cumulative distributions of scores on Form X and Form Y for the synthetic population can be calculated by accumulating $f_s(x)$ and $g_s(y)$, and percentile ranks can be obtained in a fashion similar to EQ. As both CE and FE are EQ equating methods, multiple parameters (i.e., multiple score points) are estimated to adjust for the form difference in difficulty.
Research Objectives
This study aims to examine the effect of the magnitude of a form difficulty difference on equating results. Specifically, it evaluates the performance of six equating methods (i.e., MN, LN, and EQ under the RG design and TL, CE, and FE under the CINEG design) under a set of simulation conditions including varying levels of form difference, sample size, and group difference. These six methods were evaluated because they are among the most commonly used in testing companies. For example, the LN and EQ methods are used to equate SAT scores (College Board, 2017), and EQ equating is used for the ACT (ACT, 2020). EQ is also common practice at Educational Testing Service (ETS; Livingston, 2004). Previous studies suggested that sample size is a factor influencing equating relationships: random equating error tends to decrease as sample size increases (Kolen & Brennan, 2014; Livingston, 2004). Thus, sample size may interact with form difference and affect equating accuracy; this study therefore examines the performance of equating methods as a function of both form difficulty difference and sample size. The study is guided by the following two research questions:
Research Question 1: Which equating method (MN, LN, EQ, TL, CE, or FE) performs best for forms with different levels of difficulty?
Research Question 2: What is the interaction effect between form difference and sample size on the performance of these methods?
Method
Simulation Factors and Procedures
Item parameters were randomly sampled from an item pool consisting of item parameters obtained by calibrating large-scale assessment data. The data sets came from nine subject-area exams with large samples, ranging from 1,760 to 20,000 examinees depending on the exam. The assessment was designed to measure high-school students in the United States. The form difference was controlled by manipulating the difference in b-parameters, denoted in this study as an effect size (ES). Three possible scenarios were examined: Form X more difficult than Form Y, Form Y more difficult than Form X, and no difference between the two forms. When there was a form difference, three levels were considered, with ES ranges of .05–.2, .2–.5, and .5–1 for small, medium, and large differences, respectively. These values were chosen to be observable and plausible in practice according to the previous literature. For example, Kolen and Brennan (2004) manipulated the ES between two forms from .1 to .6, and Babcock et al. (2012) and Heh (2007) chose ESs ranging from 0 to .75. Each of the seven conditions was replicated 100 times, producing 700 sets of item parameters that were used for data generation under each condition of sample size and group difference.
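For concreteness, one common way to operationalize a difficulty ES of this kind, assumed here purely for illustration, is the standardized mean difference of the difficulty parameters between forms:

$$ES = \frac{\bar{b}_X - \bar{b}_Y}{s_b},$$

where $\bar{b}_X$ and $\bar{b}_Y$ are the mean b-parameters of Forms X and Y, and $s_b$ is a pooled standard deviation of the b-parameters.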
Four levels of sample size (i.e., 100, 500, 1,000, and 5,000) were manipulated in this study to represent small to large samples. The choice of these values was guided by the review of the literature. Specifically, a small-scale test was defined by Jones et al. (2006) as having a sample size smaller than 200. Kim (2018) and von Davier and Kong (2005) employed a sample size of 5,000 to represent a large-scale test in the context of equating. Sample sizes from 100 to 5,000 are also common in operational testing programs: 79 to 2,623 examinees across modules of the GRE verbal measure (Davey & Lee, 2011) and 4,693 examinees on the SAT (The College Board, 2017).
Previous research also noted a relationship between the performance of equating methods and group differences under the CINEG design (e.g., Suh et al., 2009), so two levels of group differences were considered. The simulation factors are summarized as follows.
Seven levels of form difference in difficulty: Form X is more difficult than Form Y (small, medium, and large); Form Y is more difficult than Form X (small, medium, and large); no difference exists between two forms;
Four levels of sample size: 100, 500, 1,000, and 5,000;
Two levels of group difference for the CINEG design with Y being set to N(0, 1): X ~ N(.1, 1), X ~ N(.3, 1). Under the RG design, both groups followed N(0,1);
One level of test length: 60 items (with 12 common items for CINEG design).
Data were simulated under the 3-parameter logistic model (3PL; Birnbaum, 1968; Lord, 1980). The simulation procedures are described as follows:
Form development: A set of item parameters was sampled from an item pool to create Forms X and Y. For each replication, data were chosen to be used only if the form difference represented by ES was within the desired range as specified by the form difference condition. If ES did not meet the criterion, sampled data were discarded and another set of item parameters was drawn from the item pool. This process continued until the target number of replications (100) was achieved for each condition.
Uniform form difference check: After the 700 sets of item parameters were obtained for Forms X and Y, the test information functions for each pair of test forms were calculated and compared across 41 quadrature points along the ability scale (from −4 to 4). This check was performed to avoid a potential interaction effect between form difference and ability, which could have complicated the interpretation of results. As a result, the form difference was uniform for all simulated data.
Ability parameter (theta, $\theta$) generation: Theta values were randomly drawn from a normal ability distribution with a mean of 0 and a standard deviation of 1 for the RG design. Under the CINEG design, theta values were randomly drawn from N(.1, 1) or N(.3, 1) for Form X and from N(0, 1) for Form Y.
Item response generation: Item responses were generated for 100, 500, 1,000, and 5,000 examinees using the item parameters and $\theta$s obtained in Steps 1 and 3, respectively (a minimal sketch of Steps 3 and 4 appears after this list).
Equating relationship estimation: Equating relationships were estimated using MN, LN, and EQ under the RG design and TL, CE, and FE under the CINEG design.
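As a minimal sketch of Steps 3 and 4, the following R code generates dichotomous responses under the 3PL model. The parameter-generating distributions, the sample size, and the 1.7 scaling constant are illustrative assumptions, not the study's exact generating values.

```r
# Illustrative sketch of Steps 3 and 4 (ability and response generation)
# under the 3PL model; all generating values below are placeholders.
set.seed(123)

n_items  <- 60
n_people <- 1000

# Step 1 would sample these from the calibrated item pool; placeholders here
a <- rlnorm(n_items, meanlog = 0, sdlog = 0.3)  # discrimination
b <- rnorm(n_items, mean = 0, sd = 1)           # difficulty
g <- rbeta(n_items, 5, 20)                      # pseudo-guessing (c-parameter)

# Step 3: abilities, e.g., N(0, 1) for the RG design
theta <- rnorm(n_people, mean = 0, sd = 1)

# Step 4: P(correct) = c + (1 - c) * logistic(1.7 * a * (theta - b))
p <- sapply(seq_len(n_items), function(j) {
  g[j] + (1 - g[j]) * plogis(1.7 * a[j] * (theta - b[j]))
})
responses <- matrix(rbinom(length(p), 1, p), nrow = n_people)

total_scores <- rowSums(responses)  # 0-60 number-correct scores
```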
Equating Procedures
Six equating methods were examined: MN, LN, and EQ under the RG design and TL, CE, and FE under the CINEG design. Equating was conducted using the equate function in the R package equate (Albano, 2016). For EQ equating under the RG design, log-linear presmoothing was applied to mimic a typical operational setting; thus, the degree of the log-linear polynomial (C) also needed to be specified. We chose C = 2 because it was the smallest value of C that yielded nonsignificant overall chi-square statistics at the .05 level of significance for some selected data sets. For the CINEG design, the synthetic weights for the two populations were set to be equal (i.e., .5).
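To make this setup concrete, the following R sketch shows how the six equating functions can be obtained with the equate package (Albano, 2016). The score vectors (total_x, anchor_x, etc.) are placeholders assumed to come from the simulation step, and the mapping of the presmoothing degree to C = 2 is our reading of the package interface rather than the study's actual scripts.

```r
library(equate)

# RG design: univariate frequency tables on the 0-60 number-correct scale
fx <- freqtab(total_x, scales = 0:60)
fy <- freqtab(total_y, scales = 0:60)

mn <- equate(fx, fy, type = "mean")
ln <- equate(fx, fy, type = "linear")
eq <- equate(fx, fy, type = "equipercentile",
             smoothmethod = "loglinear", degrees = 2)  # C = 2 presmoothing (assumed mapping)

# CINEG design: bivariate tables of total (0-60) and anchor (0-12) scores
fxv <- freqtab(cbind(total_x, anchor_x), scales = list(0:60, 0:12))
fyv <- freqtab(cbind(total_y, anchor_y), scales = list(0:60, 0:12))

tl <- equate(fxv, fyv, type = "linear", method = "tucker",
             ws = 0.5)  # equal synthetic weights
ce <- equate(fxv, fyv, type = "equipercentile", method = "chained")
fe <- equate(fxv, fyv, type = "equipercentile",
             method = "frequency estimation", ws = 0.5)

# Equated Form X scores on the Form Y scale, e.g., mn$concordance
```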
Criterion Equating Relationships
A set of evaluation criteria was used to assess the accuracy of the equating results. Criterion equating relationships were obtained by equating the two population distributions of Forms X and Y. Specifically, the item parameters used to generate data were used to construct the population score distributions, and the criterion equating relationship was established by conducting EQ equating between these two model-fitted distributions. Given that the focus of this study was to evaluate equating results as a function of form difference, not to compare different equating methods against one another, choosing a particular equating method (e.g., EQ) to establish the criterion equating relationship was assumed not to alter the study conclusions. Also, regardless of the equating design employed, the criterion equating relationship was found using the same population distribution (e.g., N(0, 1)) for both groups so as to capture the form difference only. Under the CINEG design, for instance, if a chosen equating method makes an appropriate score adjustment (i.e., disentangling group difference from form difference), that adjustment has to be consistent with the criterion equating relationship found under the RG design, which involves no group difference.
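One standard way to construct such a model-fitted number-correct distribution from known item parameters is the Lord-Wingersky recursion, averaged over an ability distribution; the sketch below assumes 3PL parameters and N(0, 1) quadrature, and the study's exact procedure may differ.

```r
# Model-fitted number-correct distribution from known 3PL item parameters
# (a, b, g) via the Lord-Wingersky recursion over N(0, 1) quadrature.
score_dist <- function(a, b, g, qpoints = seq(-4, 4, length.out = 41)) {
  qweights <- dnorm(qpoints)
  qweights <- qweights / sum(qweights)
  n_items <- length(a)
  dist <- numeric(n_items + 1)
  for (k in seq_along(qpoints)) {
    p <- g + (1 - g) * plogis(1.7 * a * (qpoints[k] - b))  # 3PL probabilities
    f <- c(1 - p[1], p[1])          # score distribution after the first item
    for (j in 2:n_items) {          # recursion: add one item at a time
      f <- c(f * (1 - p[j]), 0) + c(0, f * p[j])
    }
    dist <- dist + qweights[k] * f  # accumulate over ability quadrature
  }
  dist  # probabilities for scores 0, 1, ..., n_items
}
```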
Each replication had its own criterion equating relationship because a different set of item parameters was used to simulate data for each replication. Using a varying set of item parameters by replication allowed for generalization of results over test forms, not limiting results to a specific test form.
Evaluation Indices
Three indices were used to evaluate the equating results of the six methods: standard error (SE), bias (BIAS), and root mean squared error (RMSE) (Kolen & Brennan, 2014; Wang, 2009). BIAS is the systematic error, defined as the deviation of an estimated equating relationship from the criterion equating relationship (LaFlair et al., 2017). BIAS is introduced by an estimation method (e.g., smoothing techniques) or by the violation of statistical assumptions (Kolen & Brennan, 2014). SE, the random equating error, occurs because of sampling error and is defined as the variability of the estimated equating results across replications. SE can be reduced by increasing the sample size (Kolen & Brennan, 2014; Livingston, 2004). RMSE quantifies the total error; the squared RMSE is (approximately) the sum of the squared SE and the squared BIAS. The three evaluation indices are calculated at each score point $x$ as

$$\mathrm{SE}(x) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\bigl[\hat{e}_r(x) - \bar{e}(x)\bigr]^2}, \quad (9)$$

$$\mathrm{BIAS}(x) = \frac{1}{R}\sum_{r=1}^{R}\bigl[\hat{e}_r(x) - e_r(x)\bigr], \quad (10)$$

$$\mathrm{RMSE}(x) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\bigl[\hat{e}_r(x) - e_r(x)\bigr]^2}, \quad (11)$$

where $R$ is the number of replications ($R = 100$), $\hat{e}_r(x)$ is the estimated equating relationship from the $r$th replication, $\bar{e}(x)$ is the average of the estimates over the 100 replications, and $e_r(x)$ denotes the criterion equating relationship for the $r$th replication.
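A minimal R sketch of Equations 9 through 11 at a single score point follows; the vectors est and crit are hypothetical stand-ins for the estimated and criterion equated scores from the 100 replications.

```r
# Placeholder inputs: in the study these come from the 100 replications
# of the equating procedure at a given Form X score point x.
R <- 100
est  <- rnorm(R, mean = 32.3, sd = 0.5)  # estimated equated scores (hypothetical)
crit <- rnorm(R, mean = 32.3, sd = 0.1)  # criterion equated scores (hypothetical)

se_x   <- sqrt(sum((est - mean(est))^2) / R)  # Equation 9: random error (SE)
bias_x <- sum(est - crit) / R                 # Equation 10: systematic error (BIAS)
rmse_x <- sqrt(sum((est - crit)^2) / R)       # Equation 11: total error (RMSE)
```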
Results
Table 1 presents BIAS, SE, and RMSE values for the three methods under the RG design. The shaded cells in Table 1 represent the equating method that produced the smallest error among the three. MN produces the smallest BIAS values when there is no difficulty difference between the two forms. Interestingly, however, when a difficulty difference occurs, EQ produces the smallest BIAS. This pattern is consistent across the four levels of sample size. In terms of the stability of estimates (i.e., SE), a clear pattern emerged: MN leads to the smallest SE values across all conditions, followed by EQ, whereas LN yields the largest SE.
Table 1.
BIAS, SE, and RMSE for Mean, Linear, and Equipercentile Equating Methods
| Sample size | Difficulty level | BIAS (MN) | BIAS (LN) | BIAS (EQ) | SE (MN) | SE (LN) | SE (EQ) | RMSE (MN) | RMSE (LN) | RMSE (EQ) |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | no | 0.614 | 2.120 | 1.802 | 0.268 | 0.564 | 0.469 | 0.670 | 2.218 | 1.888 |
| | nd_s | 0.536 | 0.554 | 0.406 | 0.694 | 1.008 | 0.928 | 0.897 | 1.157 | 1.017 |
| | nd_m | 1.266 | 1.342 | 0.994 | 0.730 | 1.017 | 0.934 | 1.484 | 1.691 | 1.370 |
| | nd_l | 2.401 | 2.607 | 1.940 | 0.673 | 0.973 | 0.913 | 2.511 | 2.789 | 2.150 |
| | od_s | 0.506 | 0.536 | 0.367 | 0.818 | 1.139 | 1.052 | 0.984 | 1.266 | 1.119 |
| | od_m | 1.281 | 1.463 | 1.036 | 0.924 | 1.274 | 1.140 | 1.615 | 1.957 | 1.545 |
| | od_l | 2.480 | 2.766 | 1.896 | 0.750 | 1.060 | 0.919 | 2.611 | 2.970 | 2.109 |
| 500 | no | 0.262 | 1.056 | 0.926 | 0.119 | 0.274 | 0.249 | 0.288 | 1.098 | 0.966 |
| | nd_s | 0.529 | 0.567 | 0.418 | 0.655 | 0.938 | 0.853 | 0.862 | 1.103 | 0.952 |
| | nd_m | 1.185 | 1.272 | 0.937 | 0.708 | 0.985 | 0.898 | 1.406 | 1.619 | 1.301 |
| | nd_l | 2.321 | 2.531 | 1.878 | 0.635 | 0.904 | 0.836 | 2.423 | 2.695 | 2.060 |
| | od_s | 0.509 | 0.566 | 0.399 | 0.721 | 1.029 | 0.937 | 0.905 | 1.184 | 1.022 |
| | od_m | 1.263 | 1.444 | 1.014 | 0.855 | 1.160 | 1.019 | 1.559 | 1.871 | 1.443 |
| | od_l | 2.346 | 2.714 | 1.840 | 0.673 | 0.981 | 0.837 | 2.461 | 2.897 | 2.024 |
| 1,000 | no | 0.280 | 0.538 | 0.523 | 0.100 | 0.186 | 0.173 | 0.297 | 0.581 | 0.560 |
| | nd_s | 0.510 | 0.533 | 0.388 | 0.643 | 0.922 | 0.834 | 0.842 | 1.072 | 0.922 |
| | nd_m | 1.161 | 1.205 | 0.872 | 0.690 | 0.947 | 0.856 | 1.377 | 1.543 | 1.226 |
| | nd_l | 2.290 | 2.424 | 1.756 | 0.629 | 0.882 | 0.808 | 2.392 | 2.587 | 1.939 |
| | od_s | 0.511 | 0.528 | 0.366 | 0.699 | 0.980 | 0.896 | 0.889 | 1.120 | 0.972 |
| | od_m | 1.245 | 1.368 | 0.957 | 0.824 | 1.089 | 0.965 | 1.527 | 1.768 | 1.364 |
| | od_l | 2.313 | 2.545 | 1.725 | 0.669 | 0.925 | 0.803 | 2.429 | 2.720 | 1.905 |
| 5,000 | no | 0.164 | 0.416 | 0.390 | 0.050 | 0.097 | 0.090 | 0.171 | 0.432 | 0.404 |
| | nd_s | 0.522 | 0.530 | 0.388 | 0.648 | 0.917 | 0.825 | 0.853 | 1.067 | 0.915 |
| | nd_m | 1.157 | 1.189 | 0.866 | 0.686 | 0.935 | 0.840 | 1.372 | 1.523 | 1.211 |
| | nd_l | 2.274 | 2.378 | 1.724 | 0.617 | 0.877 | 0.794 | 2.374 | 2.542 | 1.905 |
| | od_s | 0.508 | 0.521 | 0.366 | 0.713 | 0.982 | 0.892 | 0.898 | 1.119 | 0.969 |
| | od_m | 1.242 | 1.342 | 0.948 | 0.828 | 1.093 | 0.968 | 1.527 | 1.749 | 1.360 |
| | od_l | 2.325 | 2.486 | 1.705 | 0.653 | 0.913 | 0.789 | 2.435 | 2.658 | 1.881 |
Note. no = no difficulty difference; nd = Form X is more difficult than Form Y; od = Form Y is more difficult than Form X; s = small effect size; m = medium effect size; l = large effect size. Highlighted is the smallest value for each condition.
As for RMSE, MN has the smallest values with no or small difficulty difference, whereas EQ has the smallest values when the difficulty difference is medium or large. Based on these findings, it seems clear that the level of form difference affects the relative performance of the three equating methods: MN is preferred when no or small form difference exists, while EQ is favored under medium or large form difference. The sample size and the direction of the form difference (new vs. old) do not appear to be influential factors.
Values of BIAS, SE, and RMSE for the three methods under the CINEG design with small group differences are presented in Table 2. For the CINEG design, the following two findings are noticeable: First, the relative performance of the three studied methods is partly affected by the sample size: TL yields the smallest BIAS/RMSE when the form difficulty difference is small or medium, while this trend becomes less apparent as the sample size increases. Second, the direction of form difference also matters. CE tends to perform better than FE when Form Y is more difficult than Form X, whereas the reverse is true when Form X is more difficult. As for SE, FE produces the smallest value under N = 5,000 and TL produces the smallest SE values in other conditions. It is also notable that the difference among the three methods is nearly negligible in terms of the absolute amount of SE when there is some level of form difference. However, with no form difference, TL remarkably outperforms the other two methods.
Table 2.
BIAS, SE, and RMSE for Tucker Linear, Chained Equipercentile Equating, and Frequency Estimation With Smaller Group Differences
| Sample size | Difficulty level | BIAS (Tucker) | BIAS (Chain) | BIAS (Frequency) | SE (Tucker) | SE (Chain) | SE (Frequency) | RMSE (Tucker) | RMSE (Chain) | RMSE (Frequency) |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | no | 0.247 | 1.633 | 1.686 | 0.781 | 2.008 | 1.660 | 0.822 | 3.034 | 2.738 |
| | nd_s | 0.251 | 1.619 | 1.596 | 1.014 | 1.723 | 1.684 | 1.054 | 2.800 | 2.756 |
| | nd_m | 0.978 | 1.833 | 1.730 | 1.056 | 1.713 | 1.650 | 1.463 | 2.899 | 2.806 |
| | nd_l | 2.240 | 2.201 | 2.108 | 1.015 | 1.739 | 1.674 | 2.473 | 3.100 | 2.990 |
| | od_s | 0.956 | 1.684 | 1.765 | 1.165 | 1.754 | 1.723 | 1.509 | 2.817 | 2.825 |
| | od_m | 1.918 | 2.152 | 2.244 | 1.297 | 1.745 | 1.736 | 2.321 | 3.097 | 3.153 |
| | od_l | 3.231 | 2.506 | 2.592 | 1.033 | 1.527 | 1.508 | 3.398 | 3.241 | 3.311 |
| 500 | no | 0.615 | 1.451 | 1.547 | 0.378 | 0.921 | 0.783 | 0.729 | 1.890 | 1.863 |
| | nd_s | 0.293 | 1.023 | 0.977 | 0.948 | 1.236 | 1.211 | 1.006 | 1.894 | 1.869 |
| | nd_m | 0.971 | 1.375 | 1.305 | 0.995 | 1.197 | 1.177 | 1.417 | 2.033 | 1.984 |
| | nd_l | 2.266 | 1.944 | 1.861 | 0.898 | 1.123 | 1.104 | 2.453 | 2.397 | 2.315 |
| | od_s | 0.892 | 1.338 | 1.434 | 1.040 | 1.300 | 1.259 | 1.374 | 2.073 | 2.100 |
| | od_m | 1.781 | 1.540 | 1.640 | 1.171 | 1.389 | 1.376 | 2.141 | 2.227 | 2.281 |
| | od_l | 3.101 | 2.066 | 2.177 | 1.007 | 1.203 | 1.181 | 3.268 | 2.533 | 2.604 |
| 1,000 | no | 0.211 | 0.745 | 0.812 | 0.246 | 0.802 | 0.687 | 0.329 | 1.253 | 1.181 |
| | nd_s | 0.280 | 0.692 | 0.633 | 0.930 | 1.071 | 1.054 | 0.984 | 1.471 | 1.450 |
| | nd_m | 0.940 | 0.973 | 0.886 | 0.955 | 1.105 | 1.101 | 1.365 | 1.605 | 1.557 |
| | nd_l | 2.183 | 1.546 | 1.476 | 0.885 | 1.048 | 1.029 | 2.370 | 1.999 | 1.923 |
| | od_s | 0.818 | 0.900 | 0.980 | 0.995 | 1.097 | 1.080 | 1.290 | 1.564 | 1.588 |
| | od_m | 1.662 | 1.247 | 1.321 | 1.096 | 1.200 | 1.198 | 2.001 | 1.835 | 1.881 |
| | od_l | 2.876 | 1.757 | 1.849 | 0.946 | 1.067 | 1.039 | 3.035 | 2.156 | 2.217 |
| 5,000 | no | 0.345 | 0.544 | 0.629 | 0.136 | 0.466 | 0.411 | 0.379 | 0.835 | 0.857 |
| | nd_s | 0.274 | 0.594 | 0.526 | 0.916 | 0.873 | 0.869 | 0.969 | 1.241 | 1.225 |
| | nd_m | 0.929 | 0.876 | 0.792 | 0.939 | 0.904 | 0.899 | 1.347 | 1.394 | 1.343 |
| | nd_l | 2.135 | 1.416 | 1.333 | 0.871 | 0.860 | 0.855 | 2.320 | 1.766 | 1.692 |
| | od_s | 0.807 | 0.770 | 0.859 | 0.994 | 0.979 | 0.975 | 1.283 | 1.366 | 1.406 |
| | od_m | 1.640 | 1.118 | 1.209 | 1.094 | 1.008 | 1.006 | 1.981 | 1.610 | 1.670 |
| | od_l | 2.818 | 1.596 | 1.682 | 0.933 | 0.874 | 0.873 | 2.974 | 1.903 | 1.977 |
Note. no = no difficulty difference; nd = Form X is more difficult than Form Y; od = Form Y is more difficult than Form X; s = small effect size; m = medium effect size; l = large effect size. Highlighted is the smallest value for each condition.
The values of BIAS, SE, and RMSE for the TL, CE, and FE equating methods with large group differences are displayed in Table 3. The general pattern of results is similar to that with small group differences presented in Table 2. That is, TL introduces the smallest BIAS/RMSE with no, small, and medium form differences. With a large form difference, however, FE introduces the smallest BIAS/RMSE when Form X is more difficult than Form Y, while CE does so when Form Y is more difficult than Form X, as was also observed under the small group difference condition. CE is generally favored as the sample size increases. As under the small group difference condition, FE produces the smallest SE values under N = 5,000, and TL produces the smallest SE values in the other conditions.
Table 3.
BIAS, SE, and RMSE for Tucker Linear, Chained Equipercentile Equating, and Frequency Estimation With Larger Group Differences
| Sample size | Difficulty level | BIAS (Tucker) | BIAS (Chain) | BIAS (Frequency) | SE (Tucker) | SE (Chain) | SE (Frequency) | RMSE (Tucker) | RMSE (Chain) | RMSE (Frequency) |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | no | 0.808 | 1.921 | 2.125 | 0.783 | 1.957 | 1.692 | 1.140 | 3.233 | 3.122 |
| | nd_s | 0.598 | 1.998 | 2.129 | 1.103 | 1.963 | 1.814 | 1.293 | 3.253 | 3.227 |
| | nd_m | 0.520 | 1.938 | 1.860 | 1.105 | 1.831 | 1.716 | 1.244 | 3.111 | 3.000 |
| | nd_l | 1.571 | 2.236 | 1.991 | 1.051 | 1.903 | 1.855 | 1.960 | 3.329 | 3.129 |
| | od_s | 1.715 | 2.134 | 2.447 | 1.259 | 1.849 | 1.747 | 2.129 | 3.263 | 3.351 |
| | od_m | 2.779 | 2.582 | 2.866 | 1.375 | 1.858 | 1.827 | 3.103 | 3.583 | 3.747 |
| | od_l | 4.071 | 2.881 | 3.230 | 1.139 | 1.669 | 1.600 | 4.230 | 3.686 | 3.936 |
| 500 | no | 1.058 | 1.512 | 1.929 | 0.502 | 0.989 | 0.823 | 1.198 | 2.042 | 2.196 |
| | nd_s | 0.411 | 1.164 | 1.463 | 0.976 | 1.343 | 1.282 | 1.104 | 2.076 | 2.152 |
| | nd_m | 0.552 | 1.261 | 1.257 | 1.026 | 1.287 | 1.230 | 1.193 | 2.113 | 2.063 |
| | nd_l | 1.692 | 1.735 | 1.504 | 0.918 | 1.174 | 1.118 | 1.984 | 2.318 | 2.127 |
| | od_s | 1.629 | 1.703 | 1.978 | 1.098 | 1.393 | 1.333 | 1.966 | 2.379 | 2.528 |
| | od_m | 2.526 | 1.901 | 2.198 | 1.209 | 1.472 | 1.419 | 2.803 | 2.560 | 2.743 |
| | od_l | 3.918 | 2.432 | 2.716 | 1.070 | 1.302 | 1.253 | 4.065 | 2.919 | 3.121 |
| 1,000 | no | 0.687 | 0.917 | 1.235 | 0.343 | 0.839 | 0.719 | 0.774 | 1.411 | 1.523 |
| | nd_s | 0.331 | 0.789 | 1.015 | 0.963 | 1.144 | 1.100 | 1.056 | 1.597 | 1.642 |
| | nd_m | 0.555 | 0.836 | 0.756 | 0.981 | 1.175 | 1.136 | 1.159 | 1.631 | 1.569 |
| | nd_l | 1.692 | 1.396 | 1.208 | 0.895 | 1.100 | 1.058 | 1.965 | 1.933 | 1.747 |
| | od_s | 1.420 | 1.204 | 1.458 | 1.018 | 1.160 | 1.132 | 1.747 | 1.807 | 1.952 |
| | od_m | 2.286 | 1.562 | 1.815 | 1.137 | 1.261 | 1.234 | 2.557 | 2.107 | 2.276 |
| | od_l | 3.563 | 2.063 | 2.333 | 1.011 | 1.121 | 1.073 | 3.707 | 2.457 | 2.661 |
| 5,000 | no | 0.888 | 0.773 | 1.041 | 0.259 | 0.501 | 0.459 | 0.929 | 1.030 | 1.245 |
| | nd_s | 0.328 | 0.641 | 0.910 | 0.936 | 0.901 | 0.894 | 1.029 | 1.315 | 1.397 |
| | nd_m | 0.546 | 0.745 | 0.604 | 0.959 | 0.924 | 0.914 | 1.135 | 1.358 | 1.314 |
| | nd_l | 1.647 | 1.255 | 1.034 | 0.873 | 0.905 | 0.894 | 1.914 | 1.676 | 1.492 |
| | od_s | 1.404 | 1.074 | 1.345 | 1.029 | 0.986 | 0.974 | 1.741 | 1.567 | 1.744 |
| | od_m | 2.270 | 1.425 | 1.691 | 1.123 | 1.041 | 1.032 | 2.536 | 1.854 | 2.058 |
| | od_l | 3.500 | 1.899 | 2.163 | 0.998 | 0.922 | 0.917 | 3.642 | 2.189 | 2.422 |
Note. no = no difficulty difference; nd = Form X is more difficult than Form Y; od = Form Y is more difficult than Form X; s = small effect size; m = medium effect size; l = large effect size. Highlighted is the smallest value for each condition.
To examine the interaction between equating methods and score position, conditional RMSEs were plotted across the score scale under all conditions (Figures 1–3). Note that, in the interest of space, only results for N = 100 are displayed, given that results for the other sample size conditions are similar. Figure 1 displays the conditional RMSE values across the 61 score points for the MN, LN, and EQ equating methods with 100 examinees. When a form difficulty difference occurs, EQ performs better at the lower and upper ends of the score distribution than the other two methods, and this pattern becomes more salient as the form difficulty difference increases. LN introduces the largest RMSE at both ends with small form differences. When the form difference is medium or large, LN introduces the largest RMSE at the lower end, whereas MN introduces the largest RMSE at the upper end of the score distribution.
Figure 1.
Conditional RMSE for Mean, Linear, and Equipercentile Equating With the 100 Examinees
Figure 2.
Conditional RMSE for Tucker Linear Equating, Chained Equipercentile Equating, and Frequency Estimation for Smaller Group Differences With the 100 Examinees
Figure 3.
Conditional RMSE for Tucker Linear Equating, Chained Equipercentile Equating, and Frequency Estimation for Larger Group Differences With the 100 Examinees
The patterns of conditional RMSE values for TL, CE, and FE equating methods with smaller group differences (Figure 2) are similar to those with larger group differences (Figure 3). Both CE and FE introduce substantial RMSE values at the lower end of score distribution, and TL introduces a larger RMSE at the upper end. Overall, CE and FE perform similarly in terms of RMSE across the score scale under all the conditions.
Discussion
This study evaluated the performance of six equating methods with varying degrees of form differences in difficulty. Several conditions were manipulated to simulate data: seven levels of difficulty difference, four levels of sample size, and two levels of group ability difference. A few studies have examined small-sample equating accuracy with form difficulty difference (e.g., Heh, 2007). However, no research had primarily investigated the effect of difficulty difference on the effectiveness of equating methods across varying levels of sample size and directions of form difference (e.g., Form X easier than Form Y). This study provides practitioners with research evidence–based guidance for choosing equating methods under varying levels of form difference. As the condition of no form difficulty difference is also included, this study can inform testing companies of appropriate equating methods when two forms are similar in difficulty level.
The results can be summarized as follows. First, under the RG design, MN equating was found to be the most accurate method when there is no or small form difference, whereas EQ is the most accurate method when the difficulty difference is medium or large. The pattern is consistent across the four levels of sample size. This finding might be attributed to the number of parameters used to adjust for form difference in the different equating methods. As MN uses a single parameter (i.e., the mean) to adjust for form difference, it may perform well when the form difference is small (minimal adjustment). However, as the form difference becomes large, a single parameter is not sufficient, and more parameters are needed to adjust for form differences, which may explain the finding that EQ was favored in the scenario of large form differences. We also found that LN equating was the least accurate method under all conditions. This finding may be related to the simulated data: a quick look at the data revealed small differences in standard deviation between the two forms, suggesting that the standard deviation may not have played a significant role in adjusting score differences. Our findings are consistent with previous studies. For example, Heh (2007) concluded that either identity or MN equating is preferred when there is no or small form difficulty difference (ES of 0 or .15), and EQ is favored as the form difficulty differences increase (ES ranging from .30 to .75). This finding, however, contradicts Aşiret and Sünbül (2016), who noted that MN equating consistently yielded the smallest RMSE compared with LN and EQ equating regardless of form difficulty differences.
Second, under the CINEG design, TL was found to be the most accurate method when the difficulty difference is medium or small, and either CE or FE is preferred with a large difficulty difference. Similar to the RG design, our speculation is that under the CINEG design, when the form difference is larger, more parameters are needed to adjust for the difference. In CE and FE, multiple points along the score scale are used to adjust for form difference, which may explain the finding that either of them was preferred under the large form difference condition. CE tended to perform better when Form Y was more difficult than Form X, whereas FE tended to perform better in the opposite scenario. The reason for this finding is not clear and needs further investigation. The two methods were also favored as the sample size increased. Similar results were found across the two levels of group difference.
Third, interaction effects were noted between score position and the performance of these methods. EQ performed better at the lower and upper ends of the score distribution. CE and FE introduced larger RMSE at the upper end of the score distribution. This may be partly because most simulated data sets had zero-frequency score points at the upper end. Unfortunately, this hypothesis was not testable under the current simulation setting because each replication involved a different set of items, resulting in 700 score distributions for each form; thus, there was no unique score distribution. Instead, we inspected a few of those distributions and noticed that generally only a few examinees were located at the upper and lower ends. Finally, total equating error increased as the form difficulty differences became larger, regardless of the equating design and method. This finding confirms Kolen and Brennan (2004), who contended that equating errors are larger for forms that differ in difficulty than for similar forms.
This study contributes to the current literature by providing evidence for the differential performance of equating methods as a function of form difference. Sample size has been known to be a determining factor in choosing an equating method (Kolen & Brennan, 2014; Livingston, 2004), and it has been widely accepted that mean or linear equating leads to more precise results with small samples. However, findings from this study indicated an interaction effect between sample size and form difference: When the form difference is substantial, EQ equating may be a reasonable option regardless of the sample size. Also, under the CINEG design, the simplest method (i.e., Tucker linear) proved to be the most accurate when the form difference was negligible, while as the form difference became larger, a more complex equating method was a better option depending on the relative difficulty of the old and new forms.
Limitation and Future Research
This study is subject to the following limitations. First, it did not vary test length. As test length is known to affect equating performance (van der Linden, 2006), future research should add test length as a condition and examine the interaction of test length and form difference on the performance of equating methods. Second, we evaluated only six equating methods under the RG and CINEG designs; the study could be replicated under the SG design or with other methods, such as the Levine true score method. Third, data were simulated such that one form was consistently more difficult than the other across the whole score range to avoid interpretational complexity (particularly in explaining the performance of EQ). Future research may consider nonuniform form differences as a study condition to further explore the performance of equating methods under such conditions.
Footnotes
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Stella Yun Kim https://orcid.org/0000-0002-0562-1071
References
- ACT. (2020). ACT technical manual. https://www.act.org/content/dam/act/unsecured/documents/ACT_Technical_Manual.pdf
- Albano A. D. (2016). equate: An R package for observed-score linking and equating. Journal of Statistical Software, 74(1), 1–36.
- Angoff W. H. (1971). Scales, norms, and equivalent scores. In Thorndike R. L. (Ed.), Educational measurement (2nd ed., pp. 508–600). American Council on Education.
- Aşiret S., Sünbül S. Ö. (2016). Investigating test equating methods in small samples through various factors. Educational Sciences: Theory & Practice, 16(2), 647–668.
- Babcock B., Albano A., Raymond M. (2012). Nominal weights mean equating: A method for very small samples. Educational and Psychological Measurement, 72(4), 608–628.
- Birnbaum A. (1968). Some latent trait models and their use in inferring an examinee's ability. In Lord F. M., Novick M. R. (Eds.), Statistical theories of mental test scores (pp. 397–472). Addison Wesley.
- College Board. (2017). SAT suite of assessments technical manual: Characteristics of the SAT. College Board.
- Davey T., Lee Y. H. (2011). Potential impact of context effects on the scoring and equating of the multistage GRE revised General Test. ETS Research Report Series, 2011(2), i–44.
- Gulliksen H. (1950). Theory of mental tests. Wiley.
- Heh V. K. (2007). Equating accuracy using small samples in the random groups design [Doctoral dissertation, Patton College of Education at Ohio University].
- Jones P., Smith R. W., Talley D. (2006). Developing test forms for small-scale achievement testing systems. In Downing S. M. (Ed.), Handbook of test development (pp. 487–525). Lawrence Erlbaum Associates.
- Kim S. Y. (2018). Simple structure MIRT equating for multidimensional tests [Doctoral dissertation, The University of Iowa]. https://doi.org/10.17077/etd.oj5oxcpf
- Kolen M. J., Brennan R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). Springer-Verlag.
- Kolen M. J., Brennan R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer.
- LaFlair G. T., Isbell D., May L. N., Gutierrez Arvizu M. N., Jamieson J. (2017). Equating in small-scale language testing programs. Language Testing, 34(1), 127–144.
- Lee E. (2013). Equating multidimensional tests under a random groups design: A comparison of various equating procedures. The University of Iowa.
- Livingston S. A. (2004). Equating test scores (without IRT). Educational Testing Service.
- Lord F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates.
- Skaggs G. (2005). Accuracy of random groups equating with very small samples. Journal of Educational Measurement, 42(4), 309–330.
- Suh Y., Mroch A. A., Kane M. T., Ripkey D. R. (2009). An empirical comparison of five linear equating methods for the NEAT design. Measurement, 7(3), 143–173.
- The College Board. (2017). SAT suite of assessments technical manual: Characteristics of the SAT. https://collegereadiness.collegeboard.org/pdf/sat-suite-assessmentstechnical-manual.pdf
- Tong Y., Kolen M. J. (2005). Assessing equating results on different equating criteria. Applied Psychological Measurement, 29(6), 418–432.
- van der Linden W. J. (2006). Equating error in observed score equating. Applied Psychological Measurement, 30(5), 355–378. https://doi.org/10.1177/0146621606289948
- von Davier A. A., Kong N. (2005). A unified approach to linear equating for the nonequivalent groups design. Journal of Educational and Behavioral Statistics, 30(3), 313–342.
- Wang T. (2009). Standard errors of equating for the percentile rank-based equipercentile equating with log-linear presmoothing. Journal of Educational and Behavioral Statistics, 34(1), 7–23. https://doi.org/10.3102/1076998607307361