Derivation and Applicability of Asymptotic Results for Multiple Subtests Person-Fit Statistics

Casper J Albers; Rob R Meijer; Jorge N Tendeiro

doi:10.1177/0146621615622832

2016 Jan 6;40(4):274–288. doi: 10.1177/0146621615622832

PMCID: PMC5978505 PMID: 29881053

Abstract

In high-stakes testing, it is important to check the validity of individual test scores. Although a test may, in general, result in valid test scores for most test takers, for some test takers, test scores may not provide a good description of a test taker’s proficiency level. Person-fit statistics have been proposed to check the validity of individual test scores. In this study, the theoretical asymptotic sampling distribution of two person-fit statistics that can be used for tests that consist of multiple subtests is first discussed. Second, a simulation study was conducted to investigate the applicability of this asymptotic theory for tests of finite length, in which the correlation between subtests and the number of items in the subtests were varied. The authors showed that these distributions provide reasonable approximations, even for tests consisting of subtests of only 10 items each. These results have practical value because researchers do not have to rely on extensive simulation studies to simulate sampling distributions.

Keywords: item response theory, person-model fit, validity test scores

In high-stakes testing, individual test scores are used to make important decisions for individual test takers. In these circumstances, it is important to check the validity of the individual test scores. Although test scores may be valid for most persons in a particular population, for some test takers, these scores may not reflect their true proficiency level. For example, Meijer and Tendeiro (2014) showed that for some test takers on a high-stakes test, the proficiency scores did not seem to reflect their true proficiency level. Several methods have been proposed to check the validity of individual test scores. In this study, the authors focus on methods that are sensitive to the fit of individual response patterns to an item response theory (IRT) model. The idea behind this approach is that, when an item score pattern is very unexpected given the estimated proficiency level, this estimated proficiency level might not provide a good estimate of the test taker’s true proficiency level. These methods are often denoted as person-fit methods or person-fit statistics (Meijer & Sijtsma, 2001).

One of the most popular statistics is the standardized log-likelihood statistic, denoted $l_{z}$ , proposed by Drasgow, Levine, and Williams (1985). Based on asymptotic arguments, Snijders (2001; see also Magis, Raîche, & Béland, 2012) suggested an improved version of this statistic, denoting it $l_{z}^{*}$ . In assessing person fit, the person parameter θ is unknown and needs to be estimated by $\hat{θ}$ . This estimation process biases the (asymptotic) behavior of $l_{z}$ and Snijders’s version accounts for this bias.

Both $l_{z}$ and $l_{z}^{*}$ are developed for unidimensional tests. In practice, however, many tests consist of several (correlated) subtests. For example, the Law School Admission Test consists of four subtests whose scores are combined into one total score. For these types of tests, it would be useful to have a person-fit statistic that combines information from the multiple subtests into one person-fit value. Drasgow, Levine, and McLaughlin (1991) proposed a multiple subtest extension of $l_{z}$ , which they denoted $l_{zm}$ . Conijn, Emons, and Sijtsma (2014) compared several approaches based on $l_{z}$ for studying person fit for noncognitive multiple subtests consisting of polytomous items. Because of the advantage of $l_{z}^{*}$ over $l_{z}$ , Tendeiro, Meijer, and Albers (2014) recently studied the performance of Conijn et al.’s approaches applied to $l_{z}^{*}$ rather than $l_{z}$ for multiple subtests settings based on dichotomous items. This study by Tendeiro et al. was performed on the basis of a simulation design. The aim of the current article is to study the distributional properties of multisubtest modifications of the $l_{z}^{*}$ statistic through statistical (asymptotic) theory.

The outline of this report is as follows. In the next section, the $l_{z}$ (Drasgow et al., 1985) and $l_{z}^{*}$ (Snijders, 2001) statistics are introduced. Next, multiple subtests extensions based on $l_{z}$ (Conijn et al., 2014; Drasgow et al., 1991) will be discussed. As explained above, the $l_{z}^{*}$ statistic is a bias-removing improvement upon the $l_{z}$ statistic. In the main theoretical section of this article, we study the theoretical null distribution of the multiple subtests person-fit statistics based on $l_{z}^{*}$ instead of on $l_{z}$ . The distributional theory is based on asymptotic arguments. The length of each subtest and the correlation of the latent traits between subtests are manipulated by means of a simulation study. The goal is to study possible effects of these factors on the quality of the asymptotic approximations. It will be shown that the asymptotic approximations are fairly good for subtest lengths as low as 10 items.

The $l_{z}$ and $l_{z}^{*}$ Statistics

This section briefly describes the $l_{z}$ and $l_{z}^{*}$ statistics. For a more extensive discussion of the $l_{z}$ statistic, see, for example, Armstrong, Stoumbos, Kung, and Shi (2007); Magis et al. (2012); and van Krimpen-Stoop and Meijer (1999).

The $l_{z}$ Statistic

A test taker with trait level θ is administered a univariate test consisting of n items. The random variable X_i equals 1 or 0, depending on whether item i was answered correctly or incorrectly, respectively. The probability of answering correctly, P(X_i = 1 | θ), is denoted by $p_{i} (θ)$ . The three-parameter logistic model (3PLM; see Embretson & Reise, 2000), or its constrained versions known as the two- and one-parameter logistic models, is commonly used in IRT to describe the stochastic relationship between θ and X_i. The 3PLM is given by

p_{i} (θ) = c_{i} + (1 - c_{i}) \frac{e^{a_{i} (θ - b_{i})}}{1 + e^{a_{i} (θ - b_{i})}},

(1)

where a_i, b_i, and c_i denote the discrimination, difficulty, and pseudo-guessing parameters of item i. The two-parameter logistic model (2PLM), which results from constraining c_i to zero in Equation 1, will be used in this simulation study. However, the theory in this article applies to other models as well.
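As a minimal sketch of Equation 1, the response probability can be coded directly. The function name `p_3pl` and the item parameter values below are hypothetical and chosen only for illustration; they do not come from the study.

```python
import math

def p_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PLM (Equation 1).

    a, b, c: discrimination, difficulty, and pseudo-guessing parameters.
    Leaving c at 0 gives the 2PLM used in the simulation study.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Hypothetical item with a = 1.2, b = 0.0, evaluated at theta equal to the difficulty.
print(p_3pl(0.0, 1.2, 0.0, c=0.2))  # 3PLM: c + (1 - c) / 2
print(p_3pl(0.0, 1.2, 0.0))         # 2PLM: 1 / 2
```

When θ equals the item difficulty, the logistic part equals one half, so the 3PLM probability reduces to c_i + (1 - c_i)/2.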

The likelihood function of a response vector $X = (X_{1}, \dots, X_{n})$ is given by

L (θ) = \prod_{i = 1}^{n} {p_{i} (θ)}^{X_{i}} {(1 - p_{i} (θ))}^{1 - X_{i}},

and the maximum likelihood (ML) estimator ${\hat{θ}}_{ML}$ is obtained by maximizing $L (θ)$ or, equivalently, by maximizing the log likelihood:

l_{0} (θ) = \log (L (θ)) = \sum_{i = 1}^{n} \left[ X_{i} \log (p_{i} (θ)) + (1 - X_{i}) \log (1 - p_{i} (θ)) \right] .

(2)

As this function depends on the number of items, it is not directly applicable as a person-fit statistic. To this end, Drasgow et al. (1985) proposed to use the standardized version

l_{z} = \frac{l_{0} - E (l_{0})}{\sqrt{Var (l_{0})}},

as a person-fit statistic. Here, the expectation and variance are given by

E (l_{0}) = \sum_{i = 1}^{n} \left[ p_{i} (θ) \log (p_{i} (θ)) + (1 - p_{i} (θ)) \log (1 - p_{i} (θ)) \right]

(3)

and

Var (l_{0}) = \sum_{i = 1}^{n} {p_{i} (θ) (1 - p_{i} (θ)) {[\log (p_{i} (θ)) - \log (1 - p_{i} (θ))]}^{2}} .

(4)
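The definition of $l_{z}$ with known θ can be sketched in a few lines. This is an illustration under the 2PLM, not the authors' code; the helper names (`p_2pl`, `lz`) and the five (a, b) item pairs are hypothetical.

```python
import math

def p_2pl(theta, a, b):
    """2PLM probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def lz(x, theta, items):
    """l_z for a 0/1 response vector x at a known theta.

    items is a list of (a, b) pairs, one per item.
    """
    l0 = expected = variance = 0.0
    for xi, (a, b) in zip(x, items):
        p = p_2pl(theta, a, b)
        log_p, log_q = math.log(p), math.log(1.0 - p)
        l0 += xi * log_p + (1 - xi) * log_q               # Equation 2
        expected += p * log_p + (1.0 - p) * log_q         # Equation 3
        variance += p * (1.0 - p) * (log_p - log_q) ** 2  # Equation 4
    return (l0 - expected) / math.sqrt(variance)

# Hypothetical items, ordered from easy to hard.
items = [(1.0, -1.5), (1.2, -0.5), (0.8, 0.5), (1.5, 1.0), (1.0, 2.0)]
print(lz([1, 1, 0, 0, 0], 0.0, items))  # easy items right, hard items wrong
print(lz([0, 0, 1, 1, 1], 0.0, items))  # reversed pattern gives a lower value
```

Large negative values indicate response patterns that are unlikely given θ, which is why the lower tail of the distribution is the critical region.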

The $l_{z}^{*}$ Statistic

Snijders (2001) argued that, in practice, it is rarely the case that true trait values θ are known. He showed that the $l_{z}$ statistic is biased when the true θ is replaced by an estimate $\hat{θ}$ . He proposed a correction that is applicable to a wider class of statistics than $l_{z}$ alone. Snijders studied the class of standardized person-fit statistics described through

\frac{W_{n} (θ)}{\sqrt{Var (W_{n} (θ))}},

(5)

where $W_{n} (θ) = \sum_{i = 1}^{n} (X_{i} - p_{i} (θ)) w_{i} (θ)$ , with $w_{i} (θ)$ a particular choice of a weight function. For $w_{i} (θ) = \log (p_{i} (θ)) - \log (1 - p_{i} (θ))$ , the $l_{z}$ statistic is obtained.

Snijders (2001) showed that the bias introduced by replacing θ by its estimate $\hat{θ}$ does not vanish asymptotically, causing, for example, conservative inferences in case of parameter estimation through the 3PLM. A solution to this problem is obtained by modifying the weights $w_{i} (θ)$ via

{\tilde{w}}_{i} (θ) = w_{i} (θ) - c_{n} (θ) r_{i} (θ),

where

c_{n} (θ) = \frac{\sum_{i = 1}^{n} p_{i}^{'} (θ) w_{i} (θ)}{\sum_{i = 1}^{n} p_{i}^{'} (θ) r_{i} (θ)},

(6)

$p_{i}^{'} (θ)$ is the first-order derivative of $p_{i} (θ)$ with respect to θ, and the $r_{i} (θ)$ is chosen such that

r_{0} (\hat{θ}) + \sum_{i = 1}^{n} (X_{i} - p_{i} (\hat{θ})) r_{i} (\hat{θ}) = 0 .

(7)

Various choices of $r_{i} (\hat{θ})$ satisfy this relation. The most common choice is that of the ML estimator, given by $r_{0} (\hat{θ}) = 0$ and $r_{i} (\hat{θ}) = \frac{p_{i}^{'} (\hat{θ})}{p_{i} (\hat{θ}) (1 - p_{i} (\hat{θ}))}$ (i > 0). Snijders (2001) showed that $W_{n} (\hat{θ})$ is asymptotically normally distributed with expected value

E (W_{n} (\hat{θ})) = - c_{n} (\hat{θ}) r_{0} (\hat{θ})

and variance $Var (W_{n} (\hat{θ})) = n τ_{n}^{2} (\hat{θ})$ , with

τ_{n}^{2} (\hat{θ}) = \frac{1}{n} \sum_{i = 1}^{n} {\tilde{w}}_{i}^{2} (\hat{θ}) p_{i} (\hat{θ}) (1 - p_{i} (\hat{θ})) .

(8)

As a consequence, the statistic

l_{z}^{*} = \frac{W_{n} (\hat{θ}) - E (W_{n} (\hat{θ}))}{\sqrt{Var (W_{n} (\hat{θ}))}}

is asymptotically standard normally distributed.

In case the estimation of θ is done through the method of ML, things become slightly easier. In this case, $r_{0} (\hat{θ}) = 0$ , which implies that $E (W_{n} (\hat{θ})) = 0$ . In this article, the authors will only work with ML estimators, but the authors shall continue to use the generalized notation of Snijders (2001).
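The ML case can be sketched as follows for the 2PLM, where $p_{i}^{'} (θ) = a_{i} p_{i} (θ) (1 - p_{i} (θ))$ and hence $r_{i} (θ) = a_{i}$ . This is an illustrative implementation only: the Newton-Raphson routine, its clamping bounds, and the item parameters are assumptions, not the estimation procedure used in the study.

```python
import math

def p_2pl(theta, a, b):
    """2PLM probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_theta(x, items, lo=-6.0, hi=6.0, iters=100):
    """Newton-Raphson ML estimate of theta, clamped to [lo, hi] (illustrative bounds)."""
    theta = 0.0
    for _ in range(iters):
        ps = [p_2pl(theta, a, b) for a, b in items]
        score = sum(a * (xi - p) for xi, p, (a, _) in zip(x, ps, items))
        info = sum(a * a * p * (1.0 - p) for p, (a, _) in zip(ps, items))
        step = score / info
        theta = min(hi, max(lo, theta + step))
        if abs(step) < 1e-10:
            break
    return theta

def modified_weights(theta, items):
    """Return (p_i, w~_i), where w~_i = w_i - c_n * r_i with the ML choice of r_i."""
    ps = [p_2pl(theta, a, b) for a, b in items]
    dps = [a * p * (1.0 - p) for p, (a, _) in zip(ps, items)]  # p_i'(theta), 2PLM
    ws = [math.log(p) - math.log(1.0 - p) for p in ps]         # l_z weights
    rs = [dp / (p * (1.0 - p)) for dp, p in zip(dps, ps)]      # r_i = p_i' / (p_i (1 - p_i))
    c_n = (sum(dp * w for dp, w in zip(dps, ws))
           / sum(dp * r for dp, r in zip(dps, rs)))            # Equation 6
    return ps, [w - c_n * r for w, r in zip(ws, rs)]

def lz_star(x, items):
    """l_z^* at the ML estimate; r_0 = 0, so E(W_n) = 0."""
    theta = ml_theta(x, items)
    ps, wt = modified_weights(theta, items)
    w_n = sum((xi - p) * w for xi, p, w in zip(x, ps, wt))
    var = sum(w * w * p * (1.0 - p) for p, w in zip(ps, wt))   # n * tau_n^2, Equation 8
    return w_n / math.sqrt(var)

items = [(1.0, -1.5), (1.2, -0.5), (0.8, 0.5), (1.5, 1.0), (1.0, 2.0)]  # hypothetical (a, b)
print(lz_star([1, 1, 0, 1, 0], items))
```

Note that the ML estimate does not exist for all-correct or all-incorrect patterns; in this sketch such patterns simply end at the clamping bound, which is a simplification.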

The $l_{zm}^{(D)}$ and $l_{zm}^{(C)}$ Statistics and Proposed Corrections

The multiple subtests statistic developed by Drasgow et al. (1991), here denoted $l_{zm}^{(D)}$ , is based on the sum of the $l_{0}$ statistics for each subtest. For each subtest s $(s = 1, \dots, S)$ , one computes $l_{0 (s)}$ , $E (l_{0 (s)})$ , and $Var (l_{0 (s)})$ as described above. Then,

l_{zm}^{(D)} = \frac{(\sum_{s = 1}^{S} l_{0 (s)}) - (\sum_{s = 1}^{S} E (l_{0 (s)}))}{{[\sum_{s = 1}^{S} Var (l_{0 (s)})]}^{1 / 2}} .

(9)

The lack of covariance in the denominator is a consequence of the IRT assumption of local independence. Note that this assumption implies that the $l_{0 (s)}$ scores across subtests are independent, but it still allows for the latent traits $θ_{i (s)}$ across subtests to be correlated, which often is the case in practice. Assuming that the true trait values θ are known, $l_{zm}^{(D)}$ is asymptotically standard normally distributed.

Conijn et al. (2014) suggested a slightly different approach to compute the multiple subtests statistic. Rather than summing the $l_{0 (s)}$ statistics over the S subtests and then standardizing the sum, Conijn et al. suggested to first standardize each $l_{0 (s)}$ and then to sum the standardized $l_{z (s)}$ statistics:

l_{zm}^{(C)} = \sum_{s = 1}^{S} l_{z (s)} .

This approach is based on the same assumptions as the approach by Drasgow et al. (1991). The simulation study reported below examines which method performs better.
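The difference between the two combination rules is only the order of summing and standardizing, which the following sketch makes explicit. The function names and the per-subtest values of $l_{0 (s)}$ , $E (l_{0 (s)})$ , and $Var (l_{0 (s)})$ are hypothetical.

```python
import math

def lzm_D(parts):
    """Drasgow et al. (1991): sum over subtests first, then standardize (Equation 9).

    parts is a list of (l0, expected, variance) triples, one per subtest.
    """
    numerator = sum(l0 - e for l0, e, _ in parts)
    return numerator / math.sqrt(sum(v for _, _, v in parts))

def lzm_C(parts):
    """Conijn et al. (2014): standardize each subtest first, then sum."""
    return sum((l0 - e) / math.sqrt(v) for l0, e, v in parts)

# Hypothetical values for S = 4 subtests.
parts = [(-5.0, -5.5, 2.0), (-6.2, -5.8, 2.5), (-4.9, -5.1, 1.8), (-7.0, -6.0, 2.2)]
print(lzm_D(parts), lzm_C(parts))
```

In the special case where all S subtest variances coincide, the second statistic is exactly √S times the first, so the two then differ only in scale.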

Just as the $l_{z}$ approach is biased when θ is unknown, so are the multiple subtest extensions $l_{zm}^{(D)}$ and $l_{zm}^{(C)}$ . The solution by Snijders (2001) is actually directly applicable to $l_{zm}^{(D)}$ and $l_{zm}^{(C)}$ . The authors, therefore, propose two new person-fit statistics, which are denoted by $l_{zm}^{* (D)}$ and $l_{zm}^{* (C)}$ , respectively. In the following section, the asymptotic null distribution of each of these statistics is derived.

Asymptotic Null Distribution of $l_{zm}^{* (D)}$ and $l_{zm}^{* (C)}$

In this section, the asymptotic distributions of $l_{zm}^{* (D)}$ and $l_{zm}^{* (C)}$ are derived. In the next section, the applicability of this asymptotic theory for tests of finite length is studied.

Null Distribution of $l_{zm}^{(D)}$ and $l_{zm}^{* (D)}$

Statistic $l_{zm}^{(D)}$ is asymptotically normally distributed if the true θ values are used (Drasgow et al., 1991). However, replacing true with estimated θ values introduces a bias, as explained above. We shall now try to correct this bias by applying Snijders’s approach.

It is possible to rewrite $l_{zm}^{(D)}$ given by Equation 9 in the form of Equation 5:

l_{zm}^{(D)} = \frac{W (θ)}{\sqrt{Var (W (θ))}},

(10)

with

W (θ) = \sum_{s = 1}^{S} \sum_{i = 1}^{n_{s}} [X_{i (s)} - p_{i (s)} (θ_{s})] w_{i (s)} (θ_{s}),

w_{i (s)} (θ_{s}) = \log (p_{i (s)} (θ_{s})) - \log (1 - p_{i (s)} (θ_{s})) .

(11)

Here, θ denotes the vector $(θ_{1}, \dots, θ_{S})$ of latent trait parameters per subtest. That Equation 9 can be written as Equation 10 can be seen as follows. First, for the numerator (recall Equations 2 and 3),

\sum_{s = 1}^{S} l_{0 (s)} = \sum_{s = 1}^{S} \sum_{i = 1}^{n_{s}} \left[ X_{i (s)} \log (p_{i (s)} (θ_{s})) + (1 - X_{i (s)}) \log (1 - p_{i (s)} (θ_{s})) \right]

and

\sum_{s = 1}^{S} E (l_{0 (s)}) = \sum_{s = 1}^{S} \sum_{i = 1}^{n_{s}} \left[ p_{i (s)} (θ_{s}) \log (p_{i (s)} (θ_{s})) + (1 - p_{i (s)} (θ_{s})) \log (1 - p_{i (s)} (θ_{s})) \right],

thus, $\sum_{s = 1}^{S} l_{0 (s)} - \sum_{s = 1}^{S} E (l_{0 (s)}) = W (θ)$ . For the denominator, we have (recall Equation 4)

\sum_{s = 1}^{S} Var (l_{0 (s)}) = \sum_{s = 1}^{S} \sum_{i = 1}^{n_{s}} p_{i (s)} (θ_{s}) (1 - p_{i (s)} (θ_{s})) {[\log (p_{i (s)} (θ_{s})) - \log (1 - p_{i (s)} (θ_{s}))]}^{2}

and

Var (W (θ)) = \sum_{s = 1}^{S} \sum_{i = 1}^{n_{s}} p_{i (s)} (θ_{s}) (1 - p_{i (s)} (θ_{s})) {(w_{i (s)} (θ_{s}))}^{2},

and therefore $\sum_{s = 1}^{S} Var (l_{0 (s)}) = Var (W (θ))$ .

The authors have established that $l_{zm}^{(D)}$ belongs to the family of statistics considered by Snijders. It therefore follows that, for trait estimates ${\hat{θ}}_{(s)}$ satisfying Equation 7, $W (\hat{θ})$ is asymptotically normally distributed with expected value

E (W (\hat{θ})) = - \sum_{s = 1}^{S} c_{n_{s}} ({\hat{θ}}_{(s)}) r_{0 (s)} ({\hat{θ}}_{(s)})

and variance

Var (W (\hat{θ})) = \sum_{s = 1}^{S} n_{s} τ_{n_{s}}^{2} ({\hat{θ}}_{(s)})

where $c_{n_{s}} ({\hat{θ}}_{(s)})$ and $τ_{n_{s}}^{2} ({\hat{θ}}_{(s)})$ are the functions defined by Equations 6 and 8 applied to subtest s.

Thus, in the above, the authors established that, under the assumption of local independence, asymptotically

l_{zm}^{* (D)} = \frac{W (\hat{θ}) - E (W (\hat{θ}))}{\sqrt{Var (W (\hat{θ}))}} \sim N (0, 1) .

Null Distribution of $l_{zm}^{(C)}$ and $l_{zm}^{* (C)}$

Conijn et al.’s (2014) approach is similar to Drasgow et al.’s (1991) approach, but the order of operations is reversed. That is, $l_{z (s)}$ is computed for each subtest s and then all values are added. $l_{z (s)}$ is asymptotically standard normally distributed for true θ values. Furthermore, due to the local independence assumption, the $l_{z (s)}$ statistics are independent. As a consequence, $l_{zm}^{(C)}$ is the sum of S independent standard normally distributed variables and is therefore normally distributed with mean and variance equal to the sums of the means and variances, respectively. Thus, the asymptotic null distribution of $l_{zm}^{(C)}$ , assuming known θ and local independence, is given by

l_{zm}^{(C)} = \sum_{s = 1}^{S} l_{z (s)} \sim N (0, S) .

The asymptotic distribution of $l_{zm}^{* (C)}$ can be derived along the same lines. Because $l_{zm}^{* (C)}$ is the sum of the $l_{z (s)}^{*}$ values, each independent and asymptotically N(0, 1) distributed (Snijders, 2001), we immediately have that

l_{zm}^{* (C)} \sim N (0, S)

for trait estimates ${\hat{θ}}_{(s)}$ satisfying Equation 7.
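Given these asymptotic null distributions, lower-tail critical values follow directly from the normal quantile function. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `critical_value` is hypothetical, and S = 4 mirrors the number of subtests in the simulation study.

```python
import math
from statistics import NormalDist

def critical_value(alpha, n_subtests=1):
    """Lower-tail critical value under N(0, n_subtests)."""
    return NormalDist(0.0, math.sqrt(n_subtests)).inv_cdf(alpha)

S = 4
crit_D = critical_value(0.05)     # statistic distributed as N(0, 1)
crit_C = critical_value(0.05, S)  # statistic distributed as N(0, S)
print(crit_D, crit_C)
```

A response pattern would be flagged as aberrant when its statistic falls below the corresponding critical value; because the standard deviation under N(0, S) is √S, the second critical value is √S times the first.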

Design of the Simulation Study

In practice, to assess whether response patterns are unusual, one can either (a) simulate a large number of response patterns under the null distribution of normal behavior and compare the observed with the simulated response patterns; or (b) compute the critical value on the basis of the asymptotic distribution. Obviously, the first approach is time-consuming but has the benefit of not having to rely on asymptotic theory. The second approach is computationally much more efficient but does rely on asymptotic theory.

The main goal of this simulation study was to study the quality of the asymptotic results discussed in the previous sections. In particular, the authors wanted to verify how well the asymptotic approximations for person-fit statistics $l_{zm}^{* (D)}$ and $l_{zm}^{* (C)}$ hold for relatively short subtest lengths (say, of 10 items). The goal is to understand whether the asymptotic results are accurate enough for most practical purposes. Discussing the univariate $l_{z}^{*}$ statistic, Snijders (2001) expected that n ≥ 15 would be sufficient for the asymptotic approximations to work well (in case of univariate scales). Subtests might be shorter than the 15 items mentioned by Snijders. How much shorter the subtests can be depends on the relation between the subtests. If the latent traits for the subtests correlate perfectly (i.e., a test taker’s θ is the same for each subtest), the test is actually univariate and, according to Snijders, subtests of lengths n_i ≈ 15 / S should suffice. When the correlation between subtest traits is smaller than 1, one may expect to need subtests of longer length. Studying how subtest length and trait correlations relate to the quality of the asymptotic approximations is the main goal of this simulation study.

The simulation study was set up as follows. Item scores of 1,000 test takers on four subtests were generated. Four subtest lengths were considered: 10, 25, 50, and 100. The shorter subtest lengths (10, 25) are of most practical interest, whereas the longer subtest lengths (50, 100) are mostly of theoretical interest. All subtests within the same data set had the same length. The 2PLM was used to generate the item scores, with discrimination parameters uniformly distributed on [0.5, 2.0] and difficulty parameters standard normally distributed (bounded between −2.5 and +2.5). Moreover, four person θ parameters were generated for each simulated test taker, one per subtest. These parameters were randomly drawn from a multivariate normal distribution. Seven between-subtests correlations of θ were considered: 0.4 to 1.0 in steps of 0.1. These item and person parameters resulted in data that were very similar to the empirical data from a number of large-scale high-stakes educational admission tests (see also Rupp, 2013).
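The data-generating step can be sketched as follows. This is not the authors' R code, and it rests on several assumptions that the text does not specify: the θ values have unit variance and a common pairwise correlation ρ (generated here through a shared factor), and the bound on the difficulties is enforced by redrawing values outside [−2.5, 2.5]. All function names are hypothetical.

```python
import math
import random

def draw_items(n_items, rng):
    """Draw 2PLM item parameters (a, b) for one subtest."""
    items = []
    for _ in range(n_items):
        a = rng.uniform(0.5, 2.0)
        b = rng.gauss(0.0, 1.0)
        while abs(b) > 2.5:          # assumption: out-of-range draws are redrawn
            b = rng.gauss(0.0, 1.0)
        items.append((a, b))
    return items

def simulate_person(items_by_subtest, rho, rng):
    """Draw equicorrelated thetas (one per subtest) and 0/1 item scores."""
    common = rng.gauss(0.0, 1.0)
    # Shared-factor construction: unit variance, pairwise correlation rho (0 <= rho <= 1).
    thetas = [math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
              for _ in items_by_subtest]
    responses = []
    for theta, items in zip(thetas, items_by_subtest):
        row = []
        for a, b in items:
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        responses.append(row)
    return thetas, responses

rng = random.Random(2016)                                   # arbitrary seed
items_by_subtest = [draw_items(10, rng) for _ in range(4)]  # four subtests of 10 items
thetas, responses = simulate_person(items_by_subtest, 0.7, rng)
print(thetas)
print(responses)
```

With ρ = 1, the construction gives the same θ on every subtest, which corresponds to the univariate case discussed above.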

The simulation study consisted therefore of a 4 (number of subtest lengths) by 7 (number of between-subtests correlations of θ) completely crossed design, hence 28 experiment conditions in total. One hundred replications were simulated per condition. For each replicated data set, six multiple subtests person-fit statistics were computed. Of these, $l_{zm}^{* (D)}$ and $l_{zm}^{* (C)}$ , which the authors proposed and developed in this report, were of most interest. Furthermore, $l_{zm}^{(D)}$ and $l_{zm}^{(C)}$ were computed to compare these uncorrected statistics with their starred versions. The corrected starred statistics were expected to outperform the uncorrected statistics. Finally, $l_{z}$ and $l_{z}^{*}$ were computed by concatenating the four subtests together, that is, by ignoring the multiple subtests data structure. This approach was expected to work well for large correlation values between the θs but not so well for lower correlation values between the θs.

The simulation was coded in R (R Core Team, 2014). The item parameters were estimated by means of the function est() in the “irtoys” package (Partchev, 2014). The ML person parameters were estimated by means of the function mlebme(), also in the “irtoys” package.

Results of the Simulation Study

Findings are reported in various tables and figures. For $l_{z}^{*}$ , $l_{zm}^{* (C)}$ , and $l_{zm}^{* (D)}$ , Table 1 lists the following values: (a) the mean of the 1,000 statistic values per replication, averaged across the 100 replications; (b) the standard deviation of the 1,000 statistic values per replication, averaged across the 100 replications; (c) the Kolmogorov–Smirnov (KS) distance between the empirical and theoretical (asymptotic) normal cumulative distribution function; and (d) the level of significance when applying critical values from the asymptotic distribution, at α = .05. The KS distance (Smirnov, 1948) is a method to assess whether the empirical results lie close to the asymptotic distribution. This metric is a common method for density comparisons and reports the maximum vertical distance between both cumulative distributions. When both distributions completely agree, this value is zero; when they completely disagree, it is one. For $l_{z}$ , $l_{zm}^{(C)}$ , and $l_{zm}^{(D)}$ , Table 2 lists the means and standard deviations over the replications. (Reporting KS distances and levels of significance for these statistics is undesirable, as the asymptotic distribution only holds if all θ are known.)
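Two of the summary measures described above, the KS distance and α_asymp, can be sketched as follows. The function names are hypothetical, and the sample is drawn from a standard normal distribution purely for illustration; it is not simulation output from the study.

```python
import random
from statistics import NormalDist

def ks_distance(sample, dist):
    """Maximum vertical distance between the empirical CDF and dist's CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs):
        f = dist.cdf(x)
        # The empirical CDF jumps from k/n to (k + 1)/n at x; check both sides.
        d = max(d, abs((k + 1) / n - f), abs(k / n - f))
    return d

def alpha_asymp(sample, dist, alpha=0.05):
    """Proportion of values below the alpha quantile of the asymptotic distribution."""
    cutoff = dist.inv_cdf(alpha)
    return sum(1 for v in sample if v < cutoff) / len(sample)

rng = random.Random(0)  # arbitrary seed
sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]
reference = NormalDist(0.0, 1.0)
print(ks_distance(sample, reference))
print(alpha_asymp(sample, reference))
```

For a statistic with asymptotic distribution N(0, S), the reference distribution would be `NormalDist(0.0, math.sqrt(S))` instead (with `math` imported).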

Table 1.

Results of the Simulation Study for Person-Fit Statistics $l_{z}^{*}$ , $l_{zm}^{* (C)}$ , and $l_{zm}^{* (D)}$ : Mean, Standard Deviation, Kolmogorov–Smirnov Distances, and Empirical Proportion of Statistic Values Scoring Below the 5% Quantile of the Asymptotic Distribution.

		$l_{z}^{*}$				$l_{zm}^{* (C)}$				$l_{zm}^{* (D)}$
ρ	n_i	M	SD	KS	α_asymp	M	SD	KS	α_asymp	M	SD	KS	α_asymp
.4	10	0.083	1.017	0.076	.056	0.642	1.999	0.166	.037	0.336	0.996	0.176	.036
	25	0.053	1.067	0.059	.064	0.415	2.006	0.110	.040	0.212	1.002	0.113	.040
	50	0.045	1.142	0.064	.077	0.312	2.005	0.086	.042	0.155	1.001	0.087	.042
	100	0.047	1.269	0.084	.096	0.264	2.008	0.072	.042	0.129	1.002	0.072	.042
.5	10	0.079	1.009	0.071	.055	0.609	2.016	0.157	.038	0.318	1.003	0.167	.037
	25	0.055	1.054	0.057	.062	0.413	2.012	0.110	.040	0.211	1.005	0.113	.040
	50	0.048	1.108	0.059	.070	0.313	1.999	0.087	.042	0.155	0.998	0.087	.042
	100	0.052	1.220	0.074	.085	0.265	2.003	0.073	.042	0.130	1.000	0.072	.041
.6	10	0.082	1.008	0.072	.055	0.622	2.009	0.160	.037	0.322	1.002	0.169	.038
	25	0.056	1.038	0.056	.058	0.413	2.009	0.111	.040	0.211	1.004	0.114	.041
	50	0.051	1.075	0.053	.063	0.314	2.002	0.086	.041	0.156	0.999	0.086	.042
	100	0.056	1.165	0.065	.076	0.263	1.997	0.071	.042	0.130	0.997	0.071	.041
.7	10	0.083	1.003	0.072	.054	0.620	2.002	0.160	.037	0.326	0.998	0.172	.037
	25	0.058	1.020	0.053	.055	0.411	2.004	0.109	.040	0.210	1.000	0.112	.040
	50	0.054	1.048	0.050	.059	0.315	2.001	0.085	.041	0.156	0.999	0.085	.041
	100	0.061	1.115	0.055	.065	0.262	2.005	0.071	.042	0.130	1.000	0.071	.041
.8	10	0.084	0.999	0.073	.053	0.625	2.010	0.161	.037	0.326	1.000	0.172	.037
	25	0.060	1.010	0.054	.054	0.413	2.008	0.110	.041	0.211	1.003	0.112	.040
	50	0.056	1.030	0.050	.055	0.313	2.014	0.087	.043	0.156	1.006	0.087	.043
	100	0.065	1.074	0.049	.058	0.262	2.008	0.071	.042	0.130	1.003	0.071	.042
.9	10	0.084	0.997	0.073	.053	0.615	2.005	0.160	.038	0.320	0.999	0.168	.037
	25	0.062	0.999	0.055	.052	0.417	2.000	0.111	.040	0.212	1.000	0.113	.040
	50	0.060	1.008	0.048	.050	0.315	2.007	0.086	.042	0.157	1.002	0.086	.042
	100	0.071	1.048	0.046	.051	0.263	2.008	0.073	.042	0.130	1.003	0.073	.043
1.0	10	0.084	1.001	0.075	.054	0.614	2.015	0.161	.038	0.319	1.005	0.169	.038
	25	0.062	1.002	0.054	.053	0.410	2.011	0.110	.041	0.208	1.006	0.111	.042
	50	0.063	1.001	0.048	.049	0.316	2.000	0.085	.041	0.158	0.999	0.085	.041
	100	0.076	1.039	0.047	.048	0.263	2.007	0.072	.042	0.131	1.003	0.072	.042


Note. Results are averaged across replications. KS = Kolmogorov–Smirnov; ρ = correlation between subtest θ values; n_i = subtest length.

Table 2.

Results of the Simulation Study for Person-Fit Statistics $l_{z}$ , $l_{zm}^{(C)}$ , and $l_{zm}^{(D)}$ : Mean and Standard Deviation.

		$l_{z}$		$l_{zm}^{(C)}$		$l_{zm}^{(D)}$
ρ	n_i	M	SD	M	SD	M	SD
.4	10	0.070	0.912	0.543	1.595	0.267	0.780
	25	0.039	0.956	0.353	1.673	0.173	0.818
	50	0.029	1.028	0.260	1.699	0.129	0.832
	100	0.025	1.146	0.217	1.719	0.108	0.844
.5	10	0.065	0.885	0.508	1.582	0.249	0.770
	25	0.040	0.937	0.351	1.679	0.173	0.822
	50	0.032	0.990	0.261	1.697	0.129	0.833
	100	0.029	1.093	0.218	1.720	0.108	0.846
.6	10	0.069	0.882	0.519	1.584	0.255	0.773
	25	0.043	0.914	0.352	1.674	0.173	0.821
	50	0.036	0.952	0.263	1.702	0.130	0.837
	100	0.033	1.030	0.217	1.712	0.108	0.843
.7	10	0.069	0.874	0.524	1.582	0.258	0.773
	25	0.044	0.890	0.349	1.667	0.172	0.818
	50	0.038	0.918	0.261	1.696	0.130	0.835
	100	0.038	0.972	0.217	1.713	0.108	0.846
.8	10	0.071	0.866	0.526	1.587	0.259	0.776
	25	0.048	0.878	0.352	1.675	0.174	0.825
	50	0.042	0.899	0.263	1.712	0.130	0.846
	100	0.043	0.927	0.217	1.723	0.108	0.853
.9	10	0.072	0.856	0.516	1.583	0.254	0.775
	25	0.052	0.864	0.356	1.672	0.177	0.825
	50	0.046	0.873	0.263	1.706	0.131	0.846
	100	0.048	0.890	0.216	1.722	0.108	0.855
1.0	10	0.074	0.851	0.517	1.587	0.255	0.779
	25	0.053	0.857	0.345	1.670	0.172	0.827
	50	0.050	0.861	0.264	1.700	0.131	0.846
	100	0.054	0.874	0.216	1.719	0.108	0.857


Note. Results are averaged across replications. ρ = correlation between subtest θ values; n_i = subtest length.

We first focus on the results in Table 1. The values for the means and standard deviations can be directly compared with the means and standard deviations of the asymptotic distribution. With respect to the means, Table 1 shows that (a) the empirical means are systematically larger than zero across all methods and that (b) the means decrease as n_i increases. Furthermore, the value of the means seems unrelated to the value of the subtest correlations ρ, with the exception of $l_{z}^{*}$ , which seems to have slightly larger mean values for larger ρ (for instance, for n_i = 100, the means range from 0.047 (ρ = .4) to 0.076 (ρ = 1)). With respect to the standard deviations, Table 1 shows that the empirical SDs are very close to their asymptotic values (1 for $l_{z}^{*}$ and $l_{zm}^{* (D)}$ ; 2 for $l_{zm}^{* (C)}$ ) across all methods.

Table 1 shows that $l_{z}^{*}$ is sensitive to ρ: The KS distance increases when ρ decreases, especially for larger subtest lengths. This result is to be expected, as the idea of ignoring the multiple subtests structure is incompatible with increasingly lower values of correlations between the θs. The two multiple subtest methods, $l_{zm}^{* (C)}$ and $l_{zm}^{* (D)}$ , do not show this dependence on ρ. For all methods, it holds that, when n_i increases, the asymptotic approximation lies closer to the empirical distribution. Furthermore, the KS distances decrease when the subtest length increases. This is obvious: faced with more data, better predictions can be made.

The KS distance measures how close the complete empirical distribution is with respect to the asymptotic distribution. This is actually more than what was needed: What happens in the critical region of the distribution (i.e., the lower tail) is what is important when looking for aberrant patterns. We are not (primarily) interested in whether the $l_{z}^{*}$ scores of fitting response patterns are estimated without bias; what matters most is that the scores for misfitting response patterns are measured accurately. Figures 1 and 2, based on the $l_{zm}^{* (D)}$ values for the experiment condition defined by n_i = 25 and ρ = .7, display the empirical and asymptotic density functions (left) and cumulative distribution functions (right), with the 1% and 5% critical values added. The full empirical distribution has a significant misfit compared with the asymptotic standard normal because of its skew to the right. The skewness of this distribution has been noted in practice as well (e.g., Meijer & Tendeiro, 2012, Figure 1). However, the left tail of the empirical distribution is relatively well approximated by the asymptotic distribution, especially at the 1% level. In Table 1, the α_asymp values are reported, which consist of the proportion of empirical data to the left of the 5% quantile of the asymptotic distribution. Thus, α_asymp describes for what proportion of the empirical results the null hypothesis of no aberrant behavior would be rejected (a Type I error), if this decision is made based on asymptotic theory and α = .05. Values close to .05 are indicative of the adequacy of the asymptotic approximation. Figure 3 presents a visualization of the same results. For comparison, Figure 3 also shows the α_asymp values for the statistics without Snijders’s bias correction, where the asymptotic distribution is derived under the additional (and incorrect) assumption of known θ.

Figure 1. Histogram of the 1,000 × 100 = 100,000 computed $l_{zm}^{* (D)}$ values (bars) and the standard normal distribution (curve).

Note. The blue vertical lines correspond to the α = 1% (left) and α = 5% (right) critical values. This figure is based on the experiment condition defined by the parameters n_i = 25 and ρ = .7.

Figure 3. The α_asymp values of the 100 replications of the 1,000 person-fit statistics.

Note. Left panels, from top to bottom: $l_{z}^{*}$ , $l_{zm}^{* (C)}$ , and $l_{zm}^{* (D)}$ . Right panels, from top to bottom: $l_{z}$ , $l_{zm}^{(C)}$ , and $l_{zm}^{(D)}$ . In the online supplementary material, a more detailed version of this figure is provided.

From Figure 3, the following can be concluded. (a) One should not rely on asymptotic theory for the statistics without bias correction (right panels): Even for large subtests (n_i = 50), the α_asymp values can be less than half the nominal value (around 2%). Thus, the critical values based on the asymptotic approximation are too conservative in the case of noncorrected statistics (i.e., there is a lack of power). The problem is also present for the bias-corrected statistics (left panels) but to a lesser extent. (b) $l_{z}^{*}$ works very well for very small subtests (n_i = 10), but for lower correlations and lengthier subtests, the critical values from the asymptotic distribution are too liberal: Too many response patterns are flagged as aberrant. (c) The performance of $l_{zm}^{* (C)}$ and $l_{zm}^{* (D)}$ is comparable, and both methods are unaffected by the value of ρ. (d) Even for subtests of moderate length (n_i = 25), asymptotic theory provides an accurate approximation.

Deviations between the reported α_asymp values and the nominal α = .05 can be due to (a combination of) two reasons: (a) sampling variation (results are based on 100 replications of 1,000 simulated persons) and (b) the approximation is asymptotic and the sample size is clearly finite. However, sampling variation was controlled almost entirely by this experiment design. When sampling 100 × 1,000 = 100,000 values from a normal distribution, the proportion falling below the 5% quantile has standard error $\sqrt{.05 \times .95 / 100{,}000} \approx .0007$ , so in 95% of cases the α_asymp value would lie in approximately (.0486, .0514).
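The size of the sampling variation can be checked with a short calculation. The sketch below assumes a simple binomial model for the number of values below the 5% quantile and a normal approximation for the 95% range; the function name is hypothetical.

```python
import math

def alpha_interval(n, alpha=0.05, z=1.96):
    """Approximate central 95% range of an empirical proportion with true value alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / n)  # binomial standard error
    return alpha - z * se, alpha + z * se

lo, hi = alpha_interval(100 * 1000)
print(round(lo, 4), round(hi, 4))  # 0.0486 0.0514
```

Under these assumptions the range is about .05 ± .0014, so deviations of the size seen in Table 1 (for instance, values near .04) cannot be attributed to sampling variation alone.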

Table 3 uses formal regression models to underpin the conclusions of Tables 1 and 2 and Figure 3. For each of the six types of person-fit statistic, first the regression model $M_{j} = β_{0} + β_{1} (n_{i})_{j} + β_{2} ρ_{j} + β_{3} (n_{i})_{j} ρ_{j} + ε_{j}$ is fitted to the 4 × 7 combinations of subtest length n_i and subtest correlation ρ, and the p values and effect sizes are reported. Next, a similar model, now with α_asymp as the dependent variable, is fitted for the three starred methods. The authors decided to include an interaction term because Tables 1 and 2 and Figure 3 indicate that such an interaction might be present. It has to be noted that this elementary linear model is not perfect; in particular, the subtest lengths seem to have a nonlinear relation with the dependent variable. However, the model seems adequate for a rough indication. The results from the table are clear: For every method, the size of the subtest is a significant factor with large effect sizes. The subtest correlation is only significant and relevant for $l_{z}^{*}$ and $l_{z}$ : The multiple subtests approaches indeed are capable of dealing with correlated subtests without distortions in the mean $l_{z}^{*}$ or $l_{z}$ value or in their corresponding α_asymp value. For the α_asymp values, a clear interaction is present only for $l_{z}^{*}$ ; for the mean values, none of the interactions are significant (at the usual 5% level) nor do they have considerable effect sizes.

Table 3.

Effect Sizes (η²) and p Values for the Regression Models Predicting the Mean Values and α_asymp Values, With Sample Size and Subtest Correlation as Predictors.

		$l_{z}^{*}$		$l_{zm}^{* (C)}$		$l_{zm}^{* (D)}$		$l_{z}$		$l_{zm}^{(C)}$		$l_{zm}^{(D)}$
		p	η²	p	η²	p	η²	p	η²	p	η²	p	η²
M	n_i	.013	.182	.000	.726	.000	.720	.000	.417	.000	.741	.000	.742
	ρ	.021	.152	.933	.000	.939	.000	.003	.171	.933	.000	.985	.000
	n_i×ρ	.121	.065	.938	.000	.839	.000	.145	.036	.927	.000	.947	.000
α_asymp	n_i	.000	.227	.000	.584	.000	.540
	ρ	.000	.476	.792	.001	.593	.006
	n_i×ρ	.000	.270	.962	.000	.792	.001


Discussion

In this article, the authors investigated the theoretical asymptotic distributions of three person-fit statistics for tests that consist of multiple subtests. In both psychological and educational measurement, these types of tests are often used, but thus far, no studies had investigated these asymptotic distributions. A recent study that used the multiple subtest extensions $l_{zm}^{* (C)}$ and $l_{zm}^{* (D)}$ relied on simulation to determine the critical values on the basis of which item score patterns could be classified as normal or aberrant (Tendeiro et al., 2014). A drawback of this approach is that it is time-consuming. In the present study, the authors showed that asymptotic theory can adequately be used for both statistics, even for subtest lengths as low as 10 items, and that, at least at the 95% confidence level, Type I error rates are in agreement with what is expected.

Type I errors are controlled for by the simple univariate $l_{z}^{*}$ statistic when correlations between tests are relatively high (larger than, say, .7 to .8). This is the case for many large-scale educational tests. Drasgow et al. (1991), for example, reported a correlation of r = .73 between the SAT Verbal and Quantitative tests and r = .80 between the enhanced ACT English and Mathematics tests. Thus, in these cases, the theoretical asymptotic distribution of both the $l_{z}^{*}$ statistic and the multiple subtests extensions $l_{zm}^{* (C)}$ and $l_{zm}^{* (D)}$ can be used. As many studies have shown (e.g., Conijn et al., 2014), correlations between test scores for noncognitive instruments are often lower than for cognitive tests. In these cases, the asymptotic distribution of either $l_{zm}^{* (C)}$ or $l_{zm}^{* (D)}$ can be used, at least for α = .05. It should be noted, however, that because the empirical distributions are skewed (see Figures 1 and 2), results may be less favorable at other α levels, for example, .10. However, person-fit statistics are almost always applied with α levels of .05 or lower, so the authors conclude that, in practice, researchers can use the discussed asymptotic distributions to classify item score patterns as normal or aberrant in multiple subtests settings.

There can be benefits in applying bootstrap methods (such as those in Tendeiro et al., 2014) rather than resorting to asymptotic theory. These benefits hold especially in small studies, when the use of asymptotic theory is questionable. However, in this article, the authors show that, even for fairly short subtest lengths, asymptotic results already provide decent approximations. Finally, not having to use the bootstrap distribution saves computing time and, to a lesser degree, makes the procedure less technical: Understanding the bootstrap is quite hard; flagging all $l_{z}^{*}$ scores below −1.65 is extremely simple.
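The simplicity of the asymptotic rule can be made concrete with a short sketch. It assumes a standard normal reference distribution, which is consistent with the −1.65 cutoff mentioned above for a one-sided test at α = .05. The function names and the example scores are hypothetical and serve only as illustration.

```python
from statistics import NormalDist


def asymptotic_cutoff(alpha=0.05):
    """Lower-tail critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(alpha)


def flag_aberrant(scores, alpha=0.05):
    """Flag scores that fall below the one-sided asymptotic cutoff."""
    cutoff = asymptotic_cutoff(alpha)
    return [score < cutoff for score in scores]


# Made-up person-fit scores, for illustration only.
example_scores = [0.42, -1.10, -1.70, -2.35, 1.05]
flags = flag_aberrant(example_scores)

print(round(asymptotic_cutoff(), 3))  # -1.645
print(flags)                          # [False, False, True, True, False]
```

No resampling is involved: The cutoff is a single quantile lookup, in contrast to a bootstrap procedure that would need to regenerate a reference distribution for each setting.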

Supplementary Material

DS_10.1177_0146621615622832.pdf (467.5 KB, PDF)

Footnotes

Authors’ Note: The opinions and conclusions contained in this report are those of the authors and do not necessarily reflect the position or policy of LSAC.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study received funding from the Law School Admission Council (LSAC).

Supplementary Material: The R code used to generate the results, figures, and tables, as well as a detailed version of Figure 2, is provided as online supplementary material.

References

Armstrong R. D., Stoumbos Z. G., Kung M. T., Shi M. (2007). On the performance of the $l_{z}$ person-fit statistic. Practical Assessment, Research & Evaluation, 12(16). Retrieved from http://pareonline.net/getvn.asp?v=12&n=16
Conijn J. M., Emons W. M., Sijtsma K. (2014). Statistic $l_{z}$-based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122-136. doi: 10.1177/0146621613497568
Drasgow F., Levine M. V., McLaughlin M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191. doi: 10.1177/014662169101500207
Drasgow F., Levine M. V., Williams E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86. doi: 10.1111/j.2044-8317.1985.tb00817.x
Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Magis D., Raîche G., Béland S. (2012). A didactic presentation of Snijders’s $l_{z}^{*}$ index of person fit with emphasis on response model selection and ability estimation. Journal of Educational and Behavioral Statistics, 37, 57-81. doi: 10.3102/1076998610396894
Meijer R. R., Sijtsma K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135. doi: 10.1177/01466210122031957
Meijer R. R., Tendeiro J. N. (2012). The use of $l_{z}$ and $l_{z}^{*}$ person-fit statistics and problems derived from model misspecification. Journal of Educational and Behavioral Statistics, 37, 758-766.
Meijer R. R., Tendeiro J. N. (2014). The use of person-fit scores in high-stakes educational testing: How to use them and what they tell us (LSAC Research Report 14-03). Newtown, PA: Law School Admission Council.
Partchev I. (2014). irtoys: Simple interface to the estimation and plotting of IRT models (R package version 0.1.7). Retrieved from http://CRAN.R-project.org/package=irtoys
R Core Team. (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available from http://www.R-project.org/
Rupp A. A. (2013). A systematic review of the methodology for person fit research in Item Response Theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling, 55, 3-38.
Smirnov N. (1948). Table for estimating the goodness of fit of empirical distributions. Annals of Mathematical Statistics, 19, 279-281. doi: 10.1214/aoms/1177730256
Snijders T. B. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66, 331-342. doi: 10.1007/BF02294437
Tendeiro J. N., Meijer R. R., Albers C. J. (2014). Detection of invalid test scores on admission tests: A simulation study using person-fit statistics (LSAC Research Report RR-15-03). Newtown, PA: Law School Admission Council. Available from http://www.lsac.org/lsacresources/research/all/rr/rr-15-03
van Krimpen-Stoop E. A., Meijer R. R. (1999). The null distribution of person-fit statistics for conventional and adaptive tests. Applied Psychological Measurement, 23, 327-345. doi: 10.1177/01466219922031446
