Applied Psychological Measurement. 2016 Jan 6;40(4):274–288. doi: 10.1177/0146621615622832

Derivation and Applicability of Asymptotic Results for Multiple Subtests Person-Fit Statistics

Casper J Albers 1, Rob R Meijer 1, Jorge N Tendeiro 1
PMCID: PMC5978505  PMID: 29881053

Abstract

In high-stakes testing, it is important to check the validity of individual test scores. Although a test may, in general, result in valid test scores for most test takers, for some test takers, test scores may not provide a good description of a test taker’s proficiency level. Person-fit statistics have been proposed to check the validity of individual test scores. In this study, the theoretical asymptotic sampling distribution of two person-fit statistics that can be used for tests that consist of multiple subtests is first discussed. Second, a simulation study was conducted to investigate the applicability of this asymptotic theory for tests of finite length, in which the correlation between subtests and the number of items in the subtests were varied. The authors showed that these distributions provide reasonable approximations, even for tests consisting of subtests of only 10 items each. These results have practical value because researchers do not have to rely on extensive simulation studies to simulate sampling distributions.

Keywords: item response theory, person-model fit, validity test scores


In high-stakes testing, individual test scores are being used to make important decisions for individual test takers. In these circumstances, it is important to check the validity of the individual test scores. Although test scores may be valid for most persons in a particular population, for some test takers, these scores may not reflect their true proficiency level. For example, Meijer and Tendeiro (2014) showed that for some test takers on a high-stakes test, the proficiency scores did not seem to reflect their true proficiency level. Several methods have been proposed to check the validity of individual test scores. In this study, the authors focus on methods that are sensitive to the fit of individual response patterns to an item response theory (IRT) model. The idea behind this approach is that, when an item score pattern is very unexpected given the estimated proficiency level, this estimated proficiency level might not provide a good estimate of their true proficiency level. These methods are often denoted as person-fit methods or person-fit statistics (Meijer & Sijtsma, 2001).

One of the most popular statistics is the standardized log-likelihood statistic, denoted lz, proposed by Drasgow, Levine, and Williams (1985). Based on asymptotic arguments, Snijders (2001; see also Magis, Raîche, & Béland, 2012) suggested an improved version of this statistic, denoting it lz*. In assessing person fit, the person parameter θ is unknown and needs to be estimated by θ^. This estimation process biases the (asymptotic) behavior of lz and Snijders’s version accounts for this bias.

Both lz and lz* are developed for unidimensional tests. In practice, however, many tests consist of several (correlated) subtests. For example, the Law School Admission Test consists of four subtests whose scores are combined into one total score. For these types of tests, it would be useful to have a person-fit statistic that combines information from the multiple subtests into one person-fit value. Drasgow, Levine, and McLaughlin (1991) proposed a multiple subtest extension of lz, which they denoted lzm. Conijn, Emons, and Sijtsma (2014) compared several approaches based on lz for studying person fit for noncognitive multiple subtests consisting of polytomous items. Because of the advantage of lz* over lz, Tendeiro, Meijer, and Albers (2014) recently studied the performance of Conijn et al.’s approaches applied to lz* rather than lz for multiple subtests settings based on dichotomous items. This study by Tendeiro et al. was performed on the basis of a simulation design. The aim of the current article is to study the distributional properties of multisubtest modifications of the lz* statistic through statistical (asymptotic) theory.

The outline of this report is as follows. In the next section, the lz (Drasgow et al., 1985) and lz* (Snijders, 2001) statistics are introduced. Next, multiple subtests extensions based on lz (Conijn et al., 2014; Drasgow et al., 1991) will be discussed. As explained above, the lz* statistic is a bias-removing improvement upon the lz statistic. In the main theoretical section of this article, we study the theoretical null distribution of the multiple subtests person-fit statistics based on lz* instead of on lz. The distributional theory is based on asymptotic arguments. The length of each subtest and the correlation of the latent traits between subtests are manipulated by means of a simulation study. The goal is to study possible effects of these factors on the quality of the asymptotic approximations. It will be shown that the asymptotic approximations are fairly good for subtest lengths as low as 10 items.

The lz and lz* Statistics

This section briefly describes the lz and lz* statistics. For a more extensive discussion of the lz statistic, see, for example, Armstrong, Stoumbos, Kung, and Shi (2007); Magis et al. (2012); and van Krimpen-Stoop and Meijer (1999).

The lz Statistic

A test taker with trait level θ is administered a univariate test consisting of n items. The random variable Xi equals 1 or 0, depending on whether item i was answered correctly or incorrectly, respectively. The probability of answering correctly, P(Xi = 1 | θ), is denoted by pi(θ). The three-parameter logistic model (3PLM; see Embretson & Reise, 2000), or its constrained versions known as the two- and one-parameter logistic models, is commonly used in IRT to describe the stochastic relationship between θ and Xi. The 3PLM is given by

$$p_i(\theta) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}, \tag{1}$$

where ai, bi, and ci denote the discrimination, difficulty, and pseudo-guessing parameters of item i. The two-parameter logistic model (2PLM), which results from constraining ci to zero in Equation 1, will be used in this simulation study. However, the theory in this article applies to other models as well.
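As a minimal illustration of Equation 1, the item response function can be sketched as follows. The function name `irf_3pl` and the parameter values are hypothetical and are not taken from the article; setting `c = 0` gives the 2PLM used in the simulation study.

```python
import math

def irf_3pl(theta, a, b, c=0.0):
    """P(X_i = 1 | theta) under the 3PLM of Equation 1; c = 0 gives the 2PLM."""
    z = a * (theta - b)
    # e^z / (1 + e^z) is written as 1 / (1 + e^(-z)), which is algebraically identical.
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Illustrative values only: an item of average difficulty answered at theta = 0.
p_2pl = irf_3pl(0.0, a=1.2, b=0.0)          # 2PLM: no pseudo-guessing
p_3pl = irf_3pl(0.0, a=1.2, b=0.0, c=0.2)   # 3PLM: pseudo-guessing of 0.2
print(p_2pl, p_3pl)
```

When θ equals the item difficulty, the 2PLM probability is one half, and the pseudo-guessing parameter raises the lower asymptote of the curve.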

The likelihood function of a response vector $\mathbf{X} = (X_1, \ldots, X_n)$ is given by

$$L(\theta) = \prod_{i=1}^{n} p_i(\theta)^{X_i}\,(1 - p_i(\theta))^{1 - X_i},$$

and the maximum likelihood (ML) estimator $\hat{\theta}_{\mathrm{ML}}$ is obtained by maximizing $L(\theta)$ or, equivalently, by maximizing the log likelihood:

$$l_0(\theta) = \log(L(\theta)) = \sum_{i=1}^{n} \{X_i \log(p_i(\theta)) + (1 - X_i)\log(1 - p_i(\theta))\}. \tag{2}$$

As this function depends on the number of items, it is not directly applicable as a person-fit statistic. To this end, Drasgow et al. (1985) proposed to use the standardized version

$$l_z = \frac{l_0 - E(l_0)}{\sqrt{\mathrm{Var}(l_0)}},$$

as a person-fit statistic. Here, the expectation and variance are given by

$$E(l_0) = \sum_{i=1}^{n} \{p_i(\theta)\log(p_i(\theta)) + (1 - p_i(\theta))\log(1 - p_i(\theta))\} \tag{3}$$

and

$$\mathrm{Var}(l_0) = \sum_{i=1}^{n} \{p_i(\theta)(1 - p_i(\theta))[\log(p_i(\theta)) - \log(1 - p_i(\theta))]^2\}. \tag{4}$$
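The computation in Equations 2 through 4 can be sketched as below for a known trait value. This is an illustrative sketch only: the function name `lz_statistic`, the success probabilities, and the two response patterns are hypothetical and not taken from the article.

```python
import math

def lz_statistic(x, p):
    """l_z for item scores x (0/1) and success probabilities p_i(theta), theta known."""
    l0 = sum(xi * math.log(pi) + (1 - xi) * math.log(1 - pi)
             for xi, pi in zip(x, p))                                # Equation 2
    mean = sum(pi * math.log(pi) + (1 - pi) * math.log(1 - pi)
               for pi in p)                                          # Equation 3
    var = sum(pi * (1 - pi) * (math.log(pi) - math.log(1 - pi)) ** 2
              for pi in p)                                           # Equation 4
    return (l0 - mean) / math.sqrt(var)

# Hypothetical success probabilities, ordered from the easiest to the hardest item.
p = [0.9, 0.7, 0.5, 0.3, 0.1]
lz_fit = lz_statistic([1, 1, 1, 0, 0], p)      # easy items correct, hard items incorrect
lz_misfit = lz_statistic([0, 0, 0, 1, 1], p)   # easy items incorrect, hard items correct
print(lz_fit, lz_misfit)
```

The unexpected pattern receives a negative value, which is why aberrant patterns are sought in the lower tail of the distribution.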

The lz* Statistic

Snijders (2001) argued that, in practice, it is rarely the case that true trait values θ are known. He showed that the lz statistic is biased when the true θ is replaced by an estimate θ^. He proposed a correction, actually applicable to a wider range of estimators than lz. Snijders studied the class of standardized person-fit statistics described through

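$$\frac{W_n(\theta)}{\sqrt{\mathrm{Var}(W_n(\theta))}}, \tag{5}$$

where $W_n(\theta) = \sum_{i=1}^{n} (X_i - p_i(\theta))\, w_i(\theta)$, with $w_i(\theta)$ a particular choice of a weight function. For $w_i(\theta) = \log(p_i(\theta)) - \log(1 - p_i(\theta))$, the lz statistic is obtained.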
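That this weight function reproduces the lz statistic can be verified by subtracting Equation 3 from Equation 2 term by term. The following worked step is a sketch that uses only the definitions given above, together with the independence of the item scores given θ:

```latex
\begin{align*}
l_0 - E(l_0)
  &= \sum_{i=1}^{n} \bigl\{ (X_i - p_i(\theta)) \log(p_i(\theta))
     + \bigl[(1 - X_i) - (1 - p_i(\theta))\bigr] \log(1 - p_i(\theta)) \bigr\} \\
  &= \sum_{i=1}^{n} (X_i - p_i(\theta))
     \bigl[ \log(p_i(\theta)) - \log(1 - p_i(\theta)) \bigr]
   = W_n(\theta), \\
\mathrm{Var}(W_n(\theta))
  &= \sum_{i=1}^{n} p_i(\theta)(1 - p_i(\theta))\, w_i(\theta)^2
   = \mathrm{Var}(l_0).
\end{align*}
```

Numerator and denominator of Equation 5 therefore coincide with those of the lz statistic.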

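Snijders (2001) showed that the bias introduced by replacing $\theta$ by its estimate $\hat{\theta}$ does not vanish asymptotically, causing, for example, conservative inferences in case of parameter estimation through the 3PLM. A solution to this problem is obtained by modifying the weights $w_i(\theta)$ via

$$\tilde{w}_i(\theta) = w_i(\theta) - c_n(\theta)\, r_i(\theta),$$

where

$$c_n(\theta) = \frac{\sum_{i=1}^{n} p_i'(\theta)\, w_i(\theta)}{\sum_{i=1}^{n} p_i'(\theta)\, r_i(\theta)}, \tag{6}$$

$p_i'(\theta)$ is the first-order derivative of $p_i(\theta)$ with respect to $\theta$, and the $r_i(\theta)$ are chosen such that

$$r_0(\hat{\theta}) + \sum_{i=1}^{n} (X_i - p_i(\hat{\theta}))\, r_i(\hat{\theta}) = 0. \tag{7}$$

Various choices of $r_i(\hat{\theta})$ satisfy this relation. The most common choice is that of the ML estimates, given by $r_0(\hat{\theta}) = 0$ and $r_i(\hat{\theta}) = p_i'(\hat{\theta}) / [p_i(\hat{\theta})(1 - p_i(\hat{\theta}))]$ ($i > 0$). Snijders (2001) showed that $W_n(\hat{\theta})$ is asymptotically normally distributed with expected value

$$E(W_n(\hat{\theta})) = -c_n(\hat{\theta})\, r_0(\hat{\theta})$$

and variance $\mathrm{Var}(W_n(\hat{\theta})) = n\,\tau_n^2(\hat{\theta})$, with

$$\tau_n^2(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \tilde{w}_i^2(\hat{\theta})\, p_i(\hat{\theta})(1 - p_i(\hat{\theta})). \tag{8}$$

As a consequence, the statistic

$$l_z^* = \frac{W_n(\hat{\theta}) - E(W_n(\hat{\theta}))}{\sqrt{\mathrm{Var}(W_n(\hat{\theta}))}}$$

is asymptotically standard normally distributed.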

In case the estimation of θ is done through the method of ML, things become slightly easier. In this case, r0(θ^)=0, which implies that E(Wn(θ^))=0. In this article, the authors will only work with ML estimators, but the authors shall continue to use the generalized notation of Snijders (2001).
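For the ML case, the steps above can be sketched for the 2PLM as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the bisection search on the interval [-6, 6], and the item parameters are all hypothetical. Under the 2PLM the derivative of pi(θ) equals ai pi(θ)(1 − pi(θ)), so the ML choice of ri reduces to ai.

```python
import math

def p2pl(theta, a, b):
    """2PLM success probability (Equation 1 with c_i = 0)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_theta(x, a, b, lo=-6.0, hi=6.0, iters=60):
    """ML estimate of theta by bisection on the 2PLM score function (assumed bounds)."""
    if sum(x) in (0, len(x)):
        raise ValueError("all-0 and all-1 patterns have no finite ML estimate")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        score = sum(ai * (xi - p2pl(mid, ai, bi)) for xi, ai, bi in zip(x, a, b))
        if score > 0:
            lo = mid   # log likelihood still increasing: the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lz_star(x, a, b):
    """l_z* under the 2PLM with ML trait estimation, so r_0 = 0 and E(W_n) = 0."""
    t = ml_theta(x, a, b)
    p = [p2pl(t, ai, bi) for ai, bi in zip(a, b)]
    w = [math.log(pi) - math.log(1.0 - pi) for pi in p]        # l_z weights
    dp = [ai * pi * (1.0 - pi) for ai, pi in zip(a, p)]        # p_i'(theta), 2PLM
    r = [d / (pi * (1.0 - pi)) for d, pi in zip(dp, p)]        # ML choice of r_i
    c = (sum(d * wi for d, wi in zip(dp, w))
         / sum(d * ri for d, ri in zip(dp, r)))                # Equation 6
    w_mod = [wi - c * ri for wi, ri in zip(w, r)]              # modified weights
    W = sum((xi - pi) * wi for xi, pi, wi in zip(x, p, w))
    var = sum(wm ** 2 * pi * (1.0 - pi) for wm, pi in zip(w_mod, p))  # n * tau_n^2
    return W / math.sqrt(var)

# Hypothetical five-item test with equal discriminations.
a = [1.0] * 5
b = [-2.0, -1.0, 0.0, 1.0, 2.0]
lz_star_fit = lz_star([1, 1, 1, 0, 0], a, b)      # easy items correct
lz_star_misfit = lz_star([0, 0, 1, 1, 1], a, b)   # hard items correct instead
print(lz_star_fit, lz_star_misfit)
```

Both patterns have the same total score and, with equal discriminations, the same trait estimate; only the ordering of correct and incorrect answers differs, and the reversed pattern receives the lower value.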

The lzm(D) and lzm(C) Statistics and Proposed Corrections

The multiple subtests statistic developed by Drasgow et al. (1991), here denoted lzm(D), is based on the sum of the l0 statistics for each subtest. For each subtest $s$ ($s = 1, \ldots, S$), one computes $l_0^{(s)}$, $E(l_0^{(s)})$, and $\mathrm{Var}(l_0^{(s)})$ as described above. Then,

$$l_{zm}^{(D)} = \frac{\sum_{s=1}^{S} l_0^{(s)} - \sum_{s=1}^{S} E(l_0^{(s)})}{\left[\sum_{s=1}^{S} \mathrm{Var}(l_0^{(s)})\right]^{1/2}}. \tag{9}$$

The lack of covariance in the denominator is a consequence of the IRT assumption of local independence. Note that this assumption implies that the l0(s) scores across subtests are independent, but it still allows for the latent traits θi(s) across subtests to be correlated, which often is the case in practice. Assuming that the true scores θ are known, lzm(D) is standard normally distributed.

Conijn et al. (2014) suggested a slightly different approach to compute the multiple subtests statistic. Rather than summing the l0(s) statistics over the S subtests and then standardizing the sum, Conijn et al. suggested to first standardize each l0(s) and then to sum the standardized lz(s) statistics:

$$l_{zm}^{(C)} = \sum_{s=1}^{S} l_z^{(s)}.$$

This approach is based on the same assumptions as the approach by Drasgow et al. (1991). In this simulation study, we shall study which method performs better.
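The difference between the two orders of operation can be made concrete with a small sketch. The function names and the per-subtest values of l0, E(l0), and Var(l0) below are made up for illustration and do not come from the article.

```python
import math

def lzm_drasgow(parts):
    """Equation 9: sum over subtests first, then standardize the sum.

    parts is a list of (l0, mean, var) triples, one per subtest.
    """
    numerator = sum(l0 - mean for l0, mean, _ in parts)
    return numerator / math.sqrt(sum(var for _, _, var in parts))

def lzm_conijn(parts):
    """Standardize each subtest first, then sum the l_z values."""
    return sum((l0 - mean) / math.sqrt(var) for l0, mean, var in parts)

# Made-up values for S = 2 subtests with unequal variances.
parts = [(-5.0, -6.0, 4.0), (-7.5, -6.5, 1.0)]
print(lzm_drasgow(parts), lzm_conijn(parts))
```

With unequal subtest variances the two statistics weight the subtests differently and can disagree; when all subtests have the same variance, lzm(C) equals the square root of S times lzm(D).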

Just as the lz approach is biased when θ is unknown, so are the multiple subtest extensions lzm(D) and lzm(C). The solution by Snijders (2001) is actually directly applicable to lzm(D) and lzm(C). The authors, therefore, propose two new person-fit statistics, which are denoted by lzm*(D) and lzm*(C), respectively. In the following section, the asymptotic null distribution of each of these statistics is derived.

Asymptotic Null Distribution of lzm*(D) and lzm*(C)

In this section, the asymptotic distributions of lzm*(D) and lzm*(C) are derived. In the next section, the applicability of this asymptotic theory for tests of finite length is studied.

Null Distribution of lzm(D) and lzm*(D)

Statistic lzm(D) is asymptotically normally distributed if the true θ values are used (Drasgow et al., 1991). However, replacing true with estimated θ values introduces a bias, as explained above. We shall now try to correct this bias by applying Snijders’s approach.

It is possible to rewrite lzm(D) given by Equation 9 in the form of Equation 5:

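$$l_{zm}^{(D)} = \frac{W(\boldsymbol{\theta})}{\sqrt{\mathrm{Var}(W(\boldsymbol{\theta}))}}, \tag{10}$$

with

$$W(\boldsymbol{\theta}) = \sum_{s=1}^{S} \sum_{i=1}^{n_s} \left[X_i^{(s)} - p_i^{(s)}(\theta_s)\right] w_i^{(s)}(\theta_s), \qquad w_i^{(s)}(\theta_s) = \log(p_i^{(s)}(\theta_s)) - \log(1 - p_i^{(s)}(\theta_s)). \tag{11}$$

Here, $\boldsymbol{\theta}$ denotes the vector $(\theta_1, \ldots, \theta_S)$ of latent trait parameters per subtest. That Equation 9 can be written as Equation 10 can be seen as follows. First, for the numerator (recall Equations 2 and 3),

$$\sum_{s=1}^{S} l_0^{(s)} = \sum_{s=1}^{S} \sum_{i=1}^{n_s} \left\{X_i^{(s)} \log(p_i^{(s)}(\theta_s)) + (1 - X_i^{(s)}) \log(1 - p_i^{(s)}(\theta_s))\right\}$$

and

$$\sum_{s=1}^{S} E(l_0^{(s)}) = \sum_{s=1}^{S} \sum_{i=1}^{n_s} \left\{p_i^{(s)}(\theta_s) \log(p_i^{(s)}(\theta_s)) + (1 - p_i^{(s)}(\theta_s)) \log(1 - p_i^{(s)}(\theta_s))\right\},$$

thus, $\sum_{s=1}^{S} l_0^{(s)} - \sum_{s=1}^{S} E(l_0^{(s)}) = W(\boldsymbol{\theta})$. For the denominator, we have (recall Equation 4)

$$\sum_{s=1}^{S} \mathrm{Var}(l_0^{(s)}) = \sum_{s=1}^{S} \sum_{i=1}^{n_s} p_i^{(s)}(\theta_s)(1 - p_i^{(s)}(\theta_s)) \left[\log(p_i^{(s)}(\theta_s)) - \log(1 - p_i^{(s)}(\theta_s))\right]^2$$

and

$$\mathrm{Var}(W(\boldsymbol{\theta})) = \sum_{s=1}^{S} \sum_{i=1}^{n_s} p_i^{(s)}(\theta_s)(1 - p_i^{(s)}(\theta_s)) \left(w_i^{(s)}(\theta_s)\right)^2,$$

and therefore $\sum_{s=1}^{S} \mathrm{Var}(l_0^{(s)}) = \mathrm{Var}(W(\boldsymbol{\theta}))$.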

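The authors have established that lzm(D) belongs to the family of statistics considered by Snijders. It therefore follows that, for trait estimates $\hat{\theta}^{(s)}$ satisfying Equation 7, $W(\hat{\boldsymbol{\theta}})$ is asymptotically normally distributed with expected value

$$E(W(\hat{\boldsymbol{\theta}})) = -\sum_{s=1}^{S} c_{n_s}(\hat{\theta}^{(s)})\, r_0^{(s)}(\hat{\theta}^{(s)})$$

and variance

$$\mathrm{Var}(W(\hat{\boldsymbol{\theta}})) = \sum_{s=1}^{S} n_s\, \tau_{n_s}^2(\hat{\theta}^{(s)}),$$

where $c_{n_s}(\hat{\theta}^{(s)})$ and $\tau_{n_s}^2(\hat{\theta}^{(s)})$ are the functions defined by Equations 6 and 8 applied to subtest s.

Thus, in the above, the authors established that, under the assumption of local independence, asymptotically

$$l_{zm}^{*(D)} = \frac{W(\hat{\boldsymbol{\theta}}) - E(W(\hat{\boldsymbol{\theta}}))}{\sqrt{\mathrm{Var}(W(\hat{\boldsymbol{\theta}}))}} \sim N(0, 1).$$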
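For the ML estimators used in this article, the expression simplifies. The following worked step is a sketch that relies only on the ML property that the zero-index term of Equation 7 vanishes for every subtest, and on Equation 8 applied per subtest; the modified weights carry a subtest superscript here for clarity.

```latex
\begin{align*}
r_0^{(s)}(\hat{\theta}^{(s)}) = 0 \ \text{for all } s
  \quad &\Longrightarrow \quad E(W(\hat{\boldsymbol{\theta}})) = 0, \\
l_{zm}^{*(D)}
  &= \frac{\sum_{s=1}^{S} \sum_{i=1}^{n_s}
       \bigl[X_i^{(s)} - p_i^{(s)}(\hat{\theta}^{(s)})\bigr]\,
       w_i^{(s)}(\hat{\theta}^{(s)})}
     {\sqrt{\sum_{s=1}^{S} \sum_{i=1}^{n_s}
       \bigl(\tilde{w}_i^{(s)}(\hat{\theta}^{(s)})\bigr)^2\,
       p_i^{(s)}(\hat{\theta}^{(s)})\bigl(1 - p_i^{(s)}(\hat{\theta}^{(s)})\bigr)}}.
\end{align*}
```

In other words, the numerator is the same as for the uncorrected statistic, and the correction enters through the modified weights in the denominator.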

Null Distribution of lzm(C) and lzm*(C)

Conijn et al.’s (2014) approach is similar to Drasgow et al.’s (1991) approach, but the order of operations is reversed. That is, $l_z^{(s)}$ is computed for each subtest s and then all values are added. $l_z^{(s)}$ is asymptotically standard normally distributed for true θ values. Furthermore, due to the local independence assumption, the $l_z^{(s)}$ statistics are independent. As a consequence, lzm(C) is asymptotically the sum of S independent standard normally distributed variables and is therefore normally distributed with mean and variance equal to the sums of the means and variances, respectively. Thus, the asymptotic null distribution of lzm(C), assuming known θ and local independence, is given by

$$l_{zm}^{(C)} = \sum_{s=1}^{S} l_z^{(s)} \sim N(0, S).$$

The asymptotic distribution of lzm*(C) can be derived along the same lines. Because lzm*(C) is the sum of the $l_z^{*(s)}$ values, each independent and asymptotically N(0, 1) distributed (Snijders, 2001), we immediately have that

$$l_{zm}^{*(C)} \sim N(0, S)$$

for trait estimates $\hat{\theta}^{(s)}$ satisfying Equation 7.
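The two null distributions imply different lower-tail critical values, which can be sketched as follows. The helper name `critical_value` is hypothetical; S = 4 matches the number of subtests in the simulation study below, and N(0, S) denotes a normal distribution with variance S, that is, standard deviation equal to the square root of S.

```python
import math
from statistics import NormalDist

def critical_value(alpha, n_subtests=1):
    """Lower-tail critical value of N(0, n_subtests); n_subtests=1 gives N(0, 1)."""
    return NormalDist(0.0, math.sqrt(n_subtests)).inv_cdf(alpha)

crit_d = critical_value(0.05)      # lzm*(D) is referred to N(0, 1)
crit_c = critical_value(0.05, 4)   # lzm*(C) with S = 4 is referred to N(0, 4)
print(crit_d, crit_c)
```

A response pattern is flagged as aberrant when its statistic falls below the corresponding critical value; for S = 4, the cutoff for lzm*(C) is twice that for lzm*(D).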

Design of the Simulation Study

In practice, to assess whether response patterns are unusual, one can either (a) simulate a large number of response patterns under the null distribution of normal behavior and compare the observed with the simulated response patterns; or (b) compute the critical value on the basis of the asymptotic distribution. Obviously, the first approach is time-consuming but has the benefit of not having to rely on asymptotic theory. The second approach is computationally much more efficient but does rely on asymptotic theory.

The main goal of this simulation study was to assess the quality of the asymptotic results discussed in the previous sections. In particular, the authors wanted to verify how well the asymptotic approximations for person-fit statistics lzm*(D) and lzm*(C) hold for relatively short subtest lengths (say, of 10 items). The goal is to understand whether the asymptotic results are accurate enough for most practical purposes. Discussing the univariate lz* statistic, Snijders (2001) expected that n ≥ 15 would be sufficient for the asymptotic approximations to work well (in case of univariate scales). Subtests might be of shorter length than the 15 items mentioned by Snijders. How much shorter the subtests can be depends on the relation between the subtests. If the latent traits for the subtests correlate perfectly (i.e., the test taker’s θ is the same for each subtest), the test is actually univariate and, according to Snijders, subtests of length ni ≈ 15/S should suffice. When the correlation between subtest traits is smaller than 1, one may expect to need subtests of longer length. Studying how subtest length and trait correlations relate to the quality of the asymptotic approximations is the main goal of this simulation study.

The simulation study was set up as follows. Item scores of 1,000 test takers on four subtests were generated. Four subtest lengths were considered: 10, 25, 50, and 100. The shorter subtest lengths (10, 25) are of most practical interest, whereas the longer subtest lengths (50, 100) are mostly of theoretical interest. All subtests within the same data set had the same length. The 2PLM was used to generate the item scores, with discrimination parameters uniformly distributed on [0.5, 2.0] and difficulty parameters standard normally distributed (bounded between −2.5 and +2.5). Moreover, four person θ parameters were generated for each simulated test taker, one per subtest. These parameters were randomly drawn from a multivariate normal distribution. Seven between-subtests correlations of θ were considered: 0.4 to 1.0 in steps of 0.1. These item and person parameters resulted in data that were very similar to the empirical data from a number of large-scale high-stakes educational admission tests (see also Rupp, 2013).
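The data-generating step described above can be sketched as follows. This is not the authors' R code but a minimal Python sketch under several assumptions that the text does not specify: the θ values have standard normal marginals, the common correlation is induced through one shared factor, and the bounds on the difficulties are enforced by rejection sampling. The function name `simulate_dataset` and the seed are hypothetical.

```python
import math
import random

def simulate_dataset(n_persons=1000, n_subtests=4, n_items=10, rho=0.7, seed=1):
    """Generate 2PLM item scores for one data set (illustrative sketch only)."""
    rng = random.Random(seed)

    def draw_difficulty():
        # Assumption: "bounded" is implemented by redrawing values outside the bounds.
        while True:
            b = rng.gauss(0.0, 1.0)
            if -2.5 <= b <= 2.5:
                return b

    # One list of (a, b) item parameter pairs per subtest.
    items = [[(rng.uniform(0.5, 2.0), draw_difficulty()) for _ in range(n_items)]
             for _ in range(n_subtests)]

    thetas, data = [], []
    for _ in range(n_persons):
        shared = rng.gauss(0.0, 1.0)
        # Assumption: unit variances; every pair of subtest traits correlates rho.
        theta = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
                 for _ in range(n_subtests)]
        scores = [[1 if rng.random() < 1.0 / (1.0 + math.exp(-a * (t - b))) else 0
                   for a, b in subtest]
                  for t, subtest in zip(theta, items)]
        thetas.append(theta)
        data.append(scores)
    return items, thetas, data

items, thetas, data = simulate_dataset()
print(len(data), len(data[0]), len(data[0][0]))
```

Setting `rho=1.0` makes the four trait values of a test taker identical, which corresponds to the univariate case discussed above.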

The simulation study consisted therefore of a 4 (number of subtest lengths) by 7 (number of between-subtests correlations of θ) completely crossed design, hence 28 experiment conditions in total. One hundred replications were simulated per condition. For each replicated data set, six multiple subtests person-fit statistics were computed. Of these, lzm*(D) and lzm*(C), which the authors proposed and developed in this report, were of most interest. Furthermore, lzm(D) and lzm(C) were computed to compare these uncorrected statistics with their starred versions. The corrected starred statistics were expected to outperform the uncorrected statistics. Finally, lz and lz* were computed by concatenating the four subtests together, that is, by ignoring the multiple subtests data structure. This approach was expected to work well for large correlation values between the θs but not so well for lower correlation values between the θs.

The simulation was coded in R (R Core Team, 2014). The item parameters were estimated by means of the function est() in the “irtoys” package (Partchev, 2014). The ML person parameters were estimated by means of the function mlebme(), also in the “irtoys” package.

Results of the Simulation Study

Findings are reported in various tables and figures. For lz*, lzm*(C), and lzm*(D), Table 1 lists the following values: (a) the mean of the 1,000 statistic values per replication, averaged across the 100 replications; (b) the standard deviation of the 1,000 statistic values per replication, averaged across the 100 replications; (c) the Kolmogorov–Smirnov (KS) distance between the empirical and theoretical (asymptotic) normal cumulative distribution function; and (d) the level of significance when applying critical values from the asymptotic distribution, at α = .05. The KS distance (Smirnov, 1948) is a method to assess whether the empirical results lie close to the asymptotic distribution. This metric is a common method for density comparisons and reports the maximum vertical distance between both cumulative distributions. When both distributions completely agree, this value is zero; when they completely disagree, it is one. For lz, lzm(C), and lzm(D), Table 2 lists the means and standard deviations over the replications. (Reporting KS distances and levels of significance for these statistics is undesirable, as the asymptotic distribution only holds if all θ are known.)
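Criteria (c) and (d) can be sketched as follows. The function names `ks_distance` and `alpha_asymp` are hypothetical, and the values fed into them here are draws from a standard normal distribution rather than person-fit statistics, so the snippet only illustrates the two computations and does not reproduce any result in Table 1.

```python
import random
from statistics import NormalDist

def ks_distance(values, dist):
    """Maximum vertical distance between the empirical CDF and the CDF of dist."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs):
        f = dist.cdf(x)
        # The empirical CDF jumps from k/n to (k+1)/n at x; check both sides.
        d = max(d, (k + 1) / n - f, f - k / n)
    return d

def alpha_asymp(values, dist, alpha=0.05):
    """Proportion of values below the alpha quantile of the asymptotic distribution."""
    crit = dist.inv_cdf(alpha)
    return sum(v < crit for v in values) / len(values)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # stand-in for statistic values
ks = ks_distance(sample, NormalDist(0.0, 1.0))
rate = alpha_asymp(sample, NormalDist(0.0, 1.0))
print(ks, rate)
```

For lzm*(C) with four subtests, the reference distribution would be `NormalDist(0.0, 2.0)`, in line with the N(0, S) result derived above.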

Table 1.

Results of the Simulation Study for Person-Fit Statistics lz*, lzm*(C), and lzm*(D): Mean, Standard Deviation, Kolmogorov–Smirnov Distances, and Empirical Proportion of Statistic Values Scoring Below the 5% Quantile of the Asymptotic Distribution.

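| ρ | ni | lz*: M | lz*: SD | lz*: KS | lz*: αasymp | lzm*(C): M | lzm*(C): SD | lzm*(C): KS | lzm*(C): αasymp | lzm*(D): M | lzm*(D): SD | lzm*(D): KS | lzm*(D): αasymp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| .4 | 10 | 0.083 | 1.017 | 0.076 | .056 | 0.642 | 1.999 | 0.166 | .037 | 0.336 | 0.996 | 0.176 | .036 |
| .4 | 25 | 0.053 | 1.067 | 0.059 | .064 | 0.415 | 2.006 | 0.110 | .040 | 0.212 | 1.002 | 0.113 | .040 |
| .4 | 50 | 0.045 | 1.142 | 0.064 | .077 | 0.312 | 2.005 | 0.086 | .042 | 0.155 | 1.001 | 0.087 | .042 |
| .4 | 100 | 0.047 | 1.269 | 0.084 | .096 | 0.264 | 2.008 | 0.072 | .042 | 0.129 | 1.002 | 0.072 | .042 |
| .5 | 10 | 0.079 | 1.009 | 0.071 | .055 | 0.609 | 2.016 | 0.157 | .038 | 0.318 | 1.003 | 0.167 | .037 |
| .5 | 25 | 0.055 | 1.054 | 0.057 | .062 | 0.413 | 2.012 | 0.110 | .040 | 0.211 | 1.005 | 0.113 | .040 |
| .5 | 50 | 0.048 | 1.108 | 0.059 | .070 | 0.313 | 1.999 | 0.087 | .042 | 0.155 | 0.998 | 0.087 | .042 |
| .5 | 100 | 0.052 | 1.220 | 0.074 | .085 | 0.265 | 2.003 | 0.073 | .042 | 0.130 | 1.000 | 0.072 | .041 |
| .6 | 10 | 0.082 | 1.008 | 0.072 | .055 | 0.622 | 2.009 | 0.160 | .037 | 0.322 | 1.002 | 0.169 | .038 |
| .6 | 25 | 0.056 | 1.038 | 0.056 | .058 | 0.413 | 2.009 | 0.111 | .040 | 0.211 | 1.004 | 0.114 | .041 |
| .6 | 50 | 0.051 | 1.075 | 0.053 | .063 | 0.314 | 2.002 | 0.086 | .041 | 0.156 | 0.999 | 0.086 | .042 |
| .6 | 100 | 0.056 | 1.165 | 0.065 | .076 | 0.263 | 1.997 | 0.071 | .042 | 0.130 | 0.997 | 0.071 | .041 |
| .7 | 10 | 0.083 | 1.003 | 0.072 | .054 | 0.620 | 2.002 | 0.160 | .037 | 0.326 | 0.998 | 0.172 | .037 |
| .7 | 25 | 0.058 | 1.020 | 0.053 | .055 | 0.411 | 2.004 | 0.109 | .040 | 0.210 | 1.000 | 0.112 | .040 |
| .7 | 50 | 0.054 | 1.048 | 0.050 | .059 | 0.315 | 2.001 | 0.085 | .041 | 0.156 | 0.999 | 0.085 | .041 |
| .7 | 100 | 0.061 | 1.115 | 0.055 | .065 | 0.262 | 2.005 | 0.071 | .042 | 0.130 | 1.000 | 0.071 | .041 |
| .8 | 10 | 0.084 | 0.999 | 0.073 | .053 | 0.625 | 2.010 | 0.161 | .037 | 0.326 | 1.000 | 0.172 | .037 |
| .8 | 25 | 0.060 | 1.010 | 0.054 | .054 | 0.413 | 2.008 | 0.110 | .041 | 0.211 | 1.003 | 0.112 | .040 |
| .8 | 50 | 0.056 | 1.030 | 0.050 | .055 | 0.313 | 2.014 | 0.087 | .043 | 0.156 | 1.006 | 0.087 | .043 |
| .8 | 100 | 0.065 | 1.074 | 0.049 | .058 | 0.262 | 2.008 | 0.071 | .042 | 0.130 | 1.003 | 0.071 | .042 |
| .9 | 10 | 0.084 | 0.997 | 0.073 | .053 | 0.615 | 2.005 | 0.160 | .038 | 0.320 | 0.999 | 0.168 | .037 |
| .9 | 25 | 0.062 | 0.999 | 0.055 | .052 | 0.417 | 2.000 | 0.111 | .040 | 0.212 | 1.000 | 0.113 | .040 |
| .9 | 50 | 0.060 | 1.008 | 0.048 | .050 | 0.315 | 2.007 | 0.086 | .042 | 0.157 | 1.002 | 0.086 | .042 |
| .9 | 100 | 0.071 | 1.048 | 0.046 | .051 | 0.263 | 2.008 | 0.073 | .042 | 0.130 | 1.003 | 0.073 | .043 |
| 1.0 | 10 | 0.084 | 1.001 | 0.075 | .054 | 0.614 | 2.015 | 0.161 | .038 | 0.319 | 1.005 | 0.169 | .038 |
| 1.0 | 25 | 0.062 | 1.002 | 0.054 | .053 | 0.410 | 2.011 | 0.110 | .041 | 0.208 | 1.006 | 0.111 | .042 |
| 1.0 | 50 | 0.063 | 1.001 | 0.048 | .049 | 0.316 | 2.000 | 0.085 | .041 | 0.158 | 0.999 | 0.085 | .041 |
| 1.0 | 100 | 0.076 | 1.039 | 0.047 | .048 | 0.263 | 2.007 | 0.072 | .042 | 0.131 | 1.003 | 0.072 | .042 |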

Note. Results are averaged across replications. KS = Kolmogorov–Smirnov; ρ = correlation between subtest θ values; ni = subtest length.

Table 2.

Results of the Simulation Study for Person-Fit Statistics lz, lzm(C), and lzm(D): Mean and Standard Deviation.

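| ρ | ni | lz: M | lz: SD | lzm(C): M | lzm(C): SD | lzm(D): M | lzm(D): SD |
|---|---|---|---|---|---|---|---|
| .4 | 10 | 0.070 | 0.912 | 0.543 | 1.595 | 0.267 | 0.780 |
| .4 | 25 | 0.039 | 0.956 | 0.353 | 1.673 | 0.173 | 0.818 |
| .4 | 50 | 0.029 | 1.028 | 0.260 | 1.699 | 0.129 | 0.832 |
| .4 | 100 | 0.025 | 1.146 | 0.217 | 1.719 | 0.108 | 0.844 |
| .5 | 10 | 0.065 | 0.885 | 0.508 | 1.582 | 0.249 | 0.770 |
| .5 | 25 | 0.040 | 0.937 | 0.351 | 1.679 | 0.173 | 0.822 |
| .5 | 50 | 0.032 | 0.990 | 0.261 | 1.697 | 0.129 | 0.833 |
| .5 | 100 | 0.029 | 1.093 | 0.218 | 1.720 | 0.108 | 0.846 |
| .6 | 10 | 0.069 | 0.882 | 0.519 | 1.584 | 0.255 | 0.773 |
| .6 | 25 | 0.043 | 0.914 | 0.352 | 1.674 | 0.173 | 0.821 |
| .6 | 50 | 0.036 | 0.952 | 0.263 | 1.702 | 0.130 | 0.837 |
| .6 | 100 | 0.033 | 1.030 | 0.217 | 1.712 | 0.108 | 0.843 |
| .7 | 10 | 0.069 | 0.874 | 0.524 | 1.582 | 0.258 | 0.773 |
| .7 | 25 | 0.044 | 0.890 | 0.349 | 1.667 | 0.172 | 0.818 |
| .7 | 50 | 0.038 | 0.918 | 0.261 | 1.696 | 0.130 | 0.835 |
| .7 | 100 | 0.038 | 0.972 | 0.217 | 1.713 | 0.108 | 0.846 |
| .8 | 10 | 0.071 | 0.866 | 0.526 | 1.587 | 0.259 | 0.776 |
| .8 | 25 | 0.048 | 0.878 | 0.352 | 1.675 | 0.174 | 0.825 |
| .8 | 50 | 0.042 | 0.899 | 0.263 | 1.712 | 0.130 | 0.846 |
| .8 | 100 | 0.043 | 0.927 | 0.217 | 1.723 | 0.108 | 0.853 |
| .9 | 10 | 0.072 | 0.856 | 0.516 | 1.583 | 0.254 | 0.775 |
| .9 | 25 | 0.052 | 0.864 | 0.356 | 1.672 | 0.177 | 0.825 |
| .9 | 50 | 0.046 | 0.873 | 0.263 | 1.706 | 0.131 | 0.846 |
| .9 | 100 | 0.048 | 0.890 | 0.216 | 1.722 | 0.108 | 0.855 |
| 1.0 | 10 | 0.074 | 0.851 | 0.517 | 1.587 | 0.255 | 0.779 |
| 1.0 | 25 | 0.053 | 0.857 | 0.345 | 1.670 | 0.172 | 0.827 |
| 1.0 | 50 | 0.050 | 0.861 | 0.264 | 1.700 | 0.131 | 0.846 |
| 1.0 | 100 | 0.054 | 0.874 | 0.216 | 1.719 | 0.108 | 0.857 |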

Note. Results are averaged across replications. ρ = correlation between subtest θ values; ni = subtest length.

First, we focus on the results in Table 1. The values for the means and standard deviations can be directly compared with the means and standard deviations of the asymptotic distribution. With respect to the means, Table 1 shows that (a) the empirical means are systematically larger than zero across all methods and that (b) the means decrease as ni increases. Furthermore, the values of the means seem unrelated to the value of the subtest correlations ρ, with the exception of lz*, which seems to have slightly larger mean values for larger ρ (for instance, for ni = 100, the means range from 0.047 (ρ = .4) to 0.076 (ρ = 1)). With respect to the standard deviations, Table 1 shows that the empirical SDs are very close to their asymptotic values (1 for lz* and lzm*(D); 2 for lzm*(C)) across all methods.

Table 1 shows that lz* is sensitive to ρ: The KS distance increases when ρ decreases, especially for larger subtest lengths. This result is to be expected, as the idea of ignoring the multiple subtests structure is incompatible with increasingly lower values of correlations between the θs. The two multiple subtest methods, lzm*(C) and lzm*(D), do not show this dependence on ρ. For these two methods, the KS distances decrease when the subtest length increases; that is, the asymptotic approximation lies closer to the empirical distribution. For lz*, this holds only for the larger values of ρ (.8 and above in Table 1). The improvement with subtest length is to be expected: With more items, the asymptotic approximation becomes more accurate.

The KS distance measures how close the complete empirical distribution is with respect to the asymptotic distribution. This is actually more than what was needed: What happens in the critical region of the distribution (i.e., the lower tail) is what is important when looking for aberrant patterns. We are not (primarily) interested in whether the lz* scores of fitting response patterns are estimated without bias; what matters most is that the scores for misfitting response patterns are measured accurately. Figures 1 and 2, based on the lzm*(D) values for the experiment condition defined by ni = 25 and ρ = .7, display the empirical and asymptotic density functions (left) and cumulative distribution functions (right), with the 1% and 5% critical values added. The full empirical distribution has a significant misfit compared with the asymptotic standard normal because of its skew to the right. The skewness of this distribution has been noted in practice as well (e.g., Meijer & Tendeiro, 2012, Figure 1). However, the left tail of the empirical distribution is relatively well approximated by the asymptotic distribution, especially at the 1% level. In Table 1, the αasymp values are reported, which consist of the proportion of empirical data to the left of the 5% quantile of the asymptotic distribution. Thus, αasymp describes for what proportion of the empirical results the null hypothesis of no aberrant behavior would be rejected (a Type I error), if this decision is made based on asymptotic theory and α = .05. Values close to .05 are indicative of the adequacy of the asymptotic approximation. Figure 3 presents a visualization of the same results. For comparison, Figure 3 also shows the αasymp values for the statistics without Snijders’s bias correction, where the asymptotic distribution is derived under the additional (and incorrect) assumption of known θ.

Figure 1.

Histogram of the 1,000 × 100 = 100,000 computed lzm*(D) values (bars) and the standard normal distribution (curve).

Note. The blue vertical lines correspond to the α = 1% (left) and α = 5% (right) critical values. This figure is based on the experiment condition defined by the parameters ni = 25 and ρ = .7.

Figure 2.

The corresponding cumulative distribution functions (solid: empirical; dashed: standard normal).

Note. It can be seen that the maximal vertical distance in the right-hand side display occurs around lzm*(D) ≈ 0. This figure is based on the experiment condition defined by the parameters ni = 25 and ρ = .7.

Figure 3.

The αasymp values of the 100 replications of the 1,000 person-fit statistics.

Note. Left panels, from top to bottom: lz*, lzm*(C), and lzm*(D). Right panels, from top to bottom: lz, lzm(C), and lzm(D). In the online supplementary material, a more detailed version of this figure is provided.

From Figure 3, the following can be concluded. (a) One should not rely on asymptotic theory for the statistics without bias correction (right panels): Even for large subtests (ni = 50), the αasymp values can be less than half the nominal value (around 2%). Thus, the critical values based on the asymptotic approximation are too conservative in the case of noncorrected statistics (i.e., there is a lack of power). The problem is also present for the bias-corrected statistics (left panels) but to a lesser extent. (b) lz* works very well for very small subtests (ni = 10), but for lower correlations and lengthier subtests, the critical values from the asymptotic distribution are too liberal: Too many response patterns are flagged as aberrant. (c) The performance of lzm*(C) and lzm*(D) is comparable, and both methods are unaffected by the value of ρ. (d) Even for subtests of moderate length (ni = 25), asymptotic theory provides an accurate approximation.

Deviations between the reported αasymp values and the nominal α = 0.05 can be due to (a combination of) two reasons: (a) sampling variation (results are based on 100 replications of 1,000 simulated persons) and (b) the approximation is asymptotic and the sample size is clearly finite. However, sampling variation was controlled almost entirely by this experiment design. When sampling 100 × 1,000 = 100,000 values from a normal distribution, then in 95% of cases, the αasymp value would be in (.0499, .05001).

Table 3 uses formal regression models to underpin the conclusions of Tables 1 and 2 and Figure 3. For each of the six types of person-fit statistic, first the regression model $M_j = \beta_0 + \beta_1 (n_i)_j + \beta_2 \rho_j + \beta_3 (n_i \times \rho)_j + \varepsilon_j$ is fitted to the 4 × 7 combinations of subtest length ni and subtest correlation ρ, and the p values and effect sizes are reported. Next, a similar model, now with αasymp as the dependent variable, is fitted for the three starred methods. The authors decided to include an interaction term because Tables 1 and 2 and Figure 3 indicate that such an interaction might be present. It has to be noted that this elementary linear model is not perfect; in particular, the subtest lengths seem to have a nonlinear relation with the dependent variable. However, the model seems adequate for a rough indication. The results from the table are clear: For every method, the size of the subtest is a significant factor with large effect sizes. The subtest correlation is only significant and relevant for lz* and lz: The multiple subtests approaches indeed are capable of dealing with correlated subtests without distortions in the mean value of the statistic or in the corresponding αasymp value. For the αasymp values, a clear interaction is present only for lz*; for the mean values, none of the interactions are significant (at the usual 5% level), nor do they have considerable effect sizes.

Table 3.

Effect Sizes (η2) and p Values for the Regression Models Predicting the Mean Values and αasymp Values, With Sample Size and Subtest Correlation as Predictors.

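| Dependent variable | Predictor | lz*: p | lz*: η2 | lzm*(C): p | lzm*(C): η2 | lzm*(D): p | lzm*(D): η2 | lz: p | lz: η2 | lzm(C): p | lzm(C): η2 | lzm(D): p | lzm(D): η2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M | ni | .013 | .182 | .000 | .726 | .000 | .720 | .000 | .417 | .000 | .741 | .000 | .742 |
| M | ρ | .021 | .152 | .933 | .000 | .939 | .000 | .003 | .171 | .933 | .000 | .985 | .000 |
| M | ni×ρ | .121 | .065 | .938 | .000 | .839 | .000 | .145 | .036 | .927 | .000 | .947 | .000 |
| αasymp | ni | .000 | .227 | .000 | .584 | .000 | .540 | | | | | | |
| αasymp | ρ | .000 | .476 | .792 | .001 | .593 | .006 | | | | | | |
| αasymp | ni×ρ | .000 | .270 | .962 | .000 | .792 | .001 | | | | | | |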

Discussion

In this article, the authors investigated the theoretical asymptotic distributions of three person-fit statistics for tests that consist of multiple subtests. In both psychological and educational measurement, these types of tests are often used, but thus far, there were no studies that investigated these asymptotic distributions. A recent study that used the multiple subtest extensions lzm*(C) and lzm*(D) made use of simulation to determine the critical values on the basis of which item score patterns could be classified as normal or aberrant (Tendeiro et al., 2014). A drawback of this approach is that it is time-consuming. In the present study, the authors showed that asymptotic theory can adequately be used for both statistics even for subtest lengths as low as 10 items. Moreover, at least at a significance level of α = .05, Type I error rates are in agreement with what is expected.

Type I errors are controlled for by the simple univariate lz* statistic when correlations between tests are relatively high (larger than, say, .7 to .8). This is the case for many large-scale educational tests. Drasgow et al. (1991), for example, reported a correlation r = .73 between SAT Verbal and Quantitative tests and r = .80 between the enhanced ACT English and Mathematics test. Thus, in these cases, the theoretical asymptotic distribution of both the lz* statistic and the multiple subtests extensions lzm*(C) and lzm*(D) can be used. As many studies showed (e.g., Conijn et al., 2014), correlations between test scores for noncognitive instruments are often lower than for cognitive tests. In these cases, the asymptotic distribution of either lzm*(C) or lzm*(D) can be used, at least for α = .05. We should note, however, that as shown in Figures 1 and 2, because the empirical distributions are skewed, at an α level of, for example, .10, results may be less optimal. However, almost all person-fit analyses use α levels of .05 or lower, so in practice, the authors conclude that this study showed that researchers can use the discussed asymptotic distributions to classify item score patterns as normal or aberrant for multiple subtests settings.

There can be benefits in applying bootstrap methods (such as those in Tendeiro et al., 2014) rather than resorting to asymptotic theory. These benefits hold especially in small studies, where the use of asymptotic theory is questionable. However, in this article, the authors showed that, even for fairly short subtest lengths, asymptotic results already provide decent approximations. Finally, not having to use the bootstrap distribution saves computing time and, to a lesser degree, makes the procedure less technical: Understanding the bootstrap is quite hard, whereas flagging all lz* scores below −1.65 is extremely simple.
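To show how little is involved in the asymptotic flagging rule, here is a minimal sketch. It assumes the person-fit scores have already been computed and are approximately standard normal under the model; the function name `flag_aberrant` and the example scores are made up for illustration. The exact lower-tail critical value at α = .05 is about −1.645, which the text rounds to −1.65.

```python
from statistics import NormalDist


def flag_aberrant(scores, alpha=0.05):
    """Flag scores below the lower-tail standard normal critical value."""
    critical = NormalDist().inv_cdf(alpha)  # about -1.645 for alpha = .05
    return [score < critical for score in scores]


# Made-up person-fit scores for five test takers (illustrative only).
example_scores = [0.42, -1.10, -1.70, -2.35, 1.05]
flags = flag_aberrant(example_scores)
print(flags)  # [False, False, True, True, False]
```

A bootstrap alternative would replace the single `inv_cdf` call with a resampling loop per test taker, which is where the extra computing time comes from.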

Footnotes

Authors’ Note: The opinions and conclusions contained in this report are those of the authors and do not necessarily reflect the position or policy of LSAC.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study received funding from the Law School Admission Council (LSAC).

Supplementary Material: The R code used to generate the results, figures, and tables, as well as a detailed version of Figure 2, is provided as online supplementary material.

References

  1. Armstrong R. D., Stoumbos Z. G., Kung M. T., Shi M. (2007). On the performance of the lz person-fit statistic. Practical Assessment, Research & Evaluation, 12(16). Retrieved from http://pareonline.net/getvn.asp?v=12&n=16
  2. Conijn J. M., Emons W. M., Sijtsma K. (2014). Statistic lz-based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122-136. doi: 10.1177/0146621613497568
  3. Drasgow F., Levine M. V., McLaughlin M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191. doi: 10.1177/014662169101500207
  4. Drasgow F., Levine M. V., Williams E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86. doi: 10.1111/j.2044-8317.1985.tb00817.x
  5. Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
  6. Magis D., Raîche G., Béland S. (2012). A didactic presentation of Snijders’s lz* index of person fit with emphasis on response model selection and ability estimation. Journal of Educational and Behavioral Statistics, 37, 57-81. doi: 10.3102/1076998610396894
  7. Meijer R. R., Sijtsma K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135. doi: 10.1177/01466210122031957
  8. Meijer R. R., Tendeiro J. N. (2012). The use of lz and lz* person-fit statistics and problems derived from model misspecification. Journal of Educational and Behavioral Statistics, 37, 758-766.
  9. Meijer R. R., Tendeiro J. N. (2014). The use of person-fit scores in high-stakes educational testing: How to use them and what they tell us (LSAC Research Report 14-03). Newtown, PA: Law School Admission Council.
  10. Partchev I. (2014). irtoys: Simple interface to the estimation and plotting of IRT models (R package version 0.1.7). Retrieved from http://CRAN.R-project.org/package=irtoys
  11. R Core Team. (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
  12. Rupp A. A. (2013). A systematic review of the methodology for person fit research in Item Response Theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling, 55, 3-38.
  13. Smirnov N. (1948). Table for estimating the goodness of fit of empirical distributions. Annals of Mathematical Statistics, 19, 279-281. doi: 10.1214/aoms/1177730256
  14. Snijders T. B. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66, 331-342. doi: 10.1007/BF02294437
  15. Tendeiro J. N., Meijer R. R., Albers C. J. (2014). Detection of invalid test scores on admission tests: A simulation study using person-fit statistics (LSAC Research Report RR-15-03). Newtown, PA: Law School Admission Council. Retrieved from http://www.lsac.org/lsacresources/research/all/rr/rr-15-03
  16. van Krimpen-Stoop E. A., Meijer R. R. (1999). The null distribution of person-fit statistics for conventional and adaptive tests. Applied Psychological Measurement, 23, 327-345. doi: 10.1177/01466219922031446
