Published in final edited form as: Comput Stat Data Anal. 2011 Jan 1;55(1):34–44. doi: 10.1016/j.csda.2010.04.022

Some exact tests for manifest properties of latent trait models

Jan G De Gooijer a,b,*, Ao Yuan c

Abstract

Item response theory is one of the modern test theories with applications in educational and psychological testing. Recent developments made it possible to characterize some desired properties in terms of a collection of manifest ones, so that hypothesis tests on these traits can, in principle, be performed. But the existing test methodology is based on asymptotic approximation, which is impractical in most applications since the required sample sizes are often unrealistically huge. To overcome this problem, a class of tests is proposed for making exact statistical inference about four manifest properties: covariances given the sum are non-positive (CSN), manifest monotonicity (MM), conditional association (CA), and vanishing conditional dependence (VCD). One major advantage is that these exact tests do not require large sample sizes. As a result, tests for CSN and MM can be routinely performed in empirical studies. For testing CA and VCD, the exact methods are still impractical in most applications, due to the unusually large number of parameters to be tested. However, exact methods are still derived for them as an exploration toward practicality. Some numerical examples with applications of the exact tests for CSN and MM are provided.

Keywords: Conditional distribution, Exact test, Monte Carlo, Markov chain Monte Carlo

1. Introduction

Item response theory (IRT), as opposed to classical test theory, is a modern theory of standardized tests that are commonly used in educational and psychological measurement settings. In psychometrics, it describes the application of mathematical models to data from questionnaires and tests as a basis for measuring abilities, attitudes, or other variables. Items may be questions that have incorrect and correct responses, statements to indicate level of agreement, patient symptom scores, etc. IRT makes it possible in principle to analyse a collection of test items assigned to many subjects or examinees. The goal is to estimate, using various (non)parametric methods, a property (parameter) such as an examinee's ability, attitude, intelligence or the strength of some trait. These properties are not directly observable. Once the parameter estimates are obtained, statistical tests are usually conducted to assess the extent to which the parameters predict item responses given the model used. Such tests provide information about the psychometric properties of the assessment and the quality of the estimates. The pioneering work in IRT took place during the 1950s and 1960s, including the studies of the Educational Testing Service psychometrician Frederic Lord, the Danish mathematician Georg Rasch, and the Austrian sociologist Paul Lazarsfeld. Although the mathematical groundwork was laid earlier, IRT gained popular application from the late 1970s and 1980s onward, when the advent of computers provided the power for extensive evaluations. Compared to classical test theory, IRT generally has greater flexibility and provides more sophisticated information; it can perform many tasks that cannot be realized using classical test theory. Some basic references to the historical literature in this field include Birnbaum (1968), Lord and Novick (1968), Fischer (1974), Cressie and Holland (1983), Joag-Dev and Proschan (1983), Holland and Rosenbaum (1986), Rosenbaum (1987), Stout (1987), Stout (1990) and van der Linden and Hambleton (1997).

IRT models can be divided into two families: unidimensional and multidimensional. A unidimensional model assumes that the response data are unidimensional in the reference population, i.e. the item response probabilities are a function of a single underlying property. Because of the greatly increased complexity of multidimensional models, the majority of IRT research and applications utilize a unidimensional model. Another commonly used condition is monotonicity, i.e. the item response characteristic curves are nondecreasing functions. In this context the works by Ellis and Junker (1997), Junker (1991, 1993), and Junker and Ellis (1997) are worth mentioning. Their main results include an asymptotic characterization of the monotone unidimensional property for dichotomously scored items in terms of a collection of physically meaningful manifest properties. This is useful because manifest properties are amenable to conventional hypothesis testing. Recently, Yuan and Clarke (2001) developed asymptotic test methods for four manifest properties: covariances given the sum are non-positive (CSN), manifest monotonicity (MM), conditional association (CA), and vanishing conditional dependence (VCD) (see Section 2 for a brief introduction). An IRT model can have none or some of the manifest properties mentioned above. However, since the desired properties are characterized by a (usually large) collection of statistics, asymptotic validity requires unrealistically huge sample sizes. As a rule of thumb, asymptotic methods will be valid if the sample size n ≥ 30m², where m is the number of unknown parameters in the problem. In practice, sample sizes are often much smaller than that. For example, when investigating some of the manifest properties based on the performance of students of a given grade in a given region, the data size is often in the low hundreds or less. We will see later (Sections 2 and 4) that the required sample sizes, using asymptotic methods, are in the thousands, tens of thousands and more for the above four manifest properties, so the asymptotic tests are clearly impractical. The objective of the current paper is to construct exact tests of these properties for datasets with relatively small sample sizes, so that many realistic studies, such as the example mentioned above, can be carried out in practice.

The concept of an exact test, originally proposed by Fisher (1935) for the inference of contingency tables, has received much attention and has since been extended to various settings. Under the null hypothesis the table usually exhibits some form of row and/or column independence, so that, conditional on the sufficient statistics of the parameters of interest, all unknown parameters drop out, and the P-value of a test statistic can be computed under the resulting parameter-free exact distribution. Usually, direct computation of the P-value under the conditional distribution is difficult in practice. Instead, various Monte Carlo sampling methods are used for accurate approximations. Although an enumeration method is possible in some special cases, it is generally computationally infeasible. For large tables, a simple permutation-based Monte Carlo method may also become problematic. In this case, Markov chain Monte Carlo (MCMC) sampling can be employed, which updates only a sub-table at each iteration; hence the computation is not limited by the table size.

In IRT inference, with data typically in the form of a table with binary entries, the null hypotheses are often composite. But for some hypotheses, testing can be performed on simpler hypotheses specified on the boundary of the parameter set. As a result, some form of conditional independence can be achieved, which gives rise to parameter-free exact tests. These simpler hypothesis tests are also tests for the original ones at the same significance level. We elaborate on the form of the exact tests in subsequent sections. In particular, exact tests for CSN and MM can be routinely performed in empirical studies. For CA and VCD, the exact methods are still impractical in most applications, due to the unusually large number of parameters to be tested, but we derive exact computational methods for them as an exploration toward practicality. First, in Section 2, we provide key definitions and notation. Next, in Section 3, we give four exact tests for the four manifest conditions. The finite-sample performance of two exact test statistics is considered in Section 4 by simulation for several unidimensional IRT models. This is followed by a small illustration of the tests on an empirical item response dataset. Finally, we provide some concluding remarks in Section 5.

2. Notation and preliminaries

Let X1, …, Xn be i.i.d. copies of X = (X1, …, XJ), a random vector of length J. Typically, in the educational testing context, it represents an examinee's scores on J test items. We write Xi = (Xi1, …, XiJ), with Xij the binary score (zero for wrong, one for correct) of the ith participant on the jth item. The corresponding observations are denoted by lowercase letters. Let X_+ = Σ_{j=1}^{J} X_j and X_{+(−j)} = X_+ − X_j. For an observed data table t = (x_{ij}), the ith row x_i = (x_{i1}, …, x_{iJ}) contains the scores of the ith participant over all J items. Denote by T the corresponding random table of t. Let x_{i+} = Σ_{j=1}^{J} x_{ij} be the ith row total, x_{j+} = Σ_{i=1}^{n} x_{ij} the jth column total, x_+ = (x_{1+}, …, x_{J+}) the vector of all column totals, and x_{++} = Σ_{i=1}^{n} Σ_{j=1}^{J} x_{ij} the grand total.

In the exact test, conditional on the sufficient statistics S of the parameters of interest, one computes the P-value of some reasonably chosen test statistic h(T) under the parameter-free exact distribution, i.e.

P(h(T) ≥ h(t) | S). (1)

This test can also be derived from the Lehmann-Pearson framework by conditioning on the nuisance parameter, and under some regularity conditions it is Uniformly Most Powerful Unbiased, though not necessarily Uniformly Most Powerful (UMP) (Lehmann, 1986). Usually direct computation of (1) is difficult; instead, various sampling methods can be used. That is, sample t^{(n)} (n = 1, …, N) from the conditional distribution P(T | S), and (1) is approximated by

P̂_N = (1/N) Σ_{n=1}^{N} χ(h(t^{(n)}) ≥ h(t)),

where χ(·) is the indicator function.
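
To make this concrete, a minimal Python sketch of this approximation is given below. The helper draw_from_conditional, assumed to return one table drawn from P(T | S), and the statistic h are illustrative stand-ins for the quantities defined above.

    import numpy as np

    def mc_p_value(t_obs, h, draw_from_conditional, N=10000, seed=None):
        # Approximate P(h(T) >= h(t) | S) by the proportion of sampled
        # tables whose statistic reaches the observed value h(t).
        rng = np.random.default_rng(seed)
        h_obs = h(t_obs)
        hits = sum(h(draw_from_conditional(rng)) >= h_obs for _ in range(N))
        return hits / N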

A general form for the joint probability of X is given by Cox (1972), Fitzmaurice and Laird (1993) and Zhao and Prentice (1990)

P(X) = exp{Ψ′X + Ω′W − A(Ψ, Ω)}, (2)

where Ψ and Ω are parameters, exp{−A(Ψ, Ω)} is the normalizing constant, and W contains all the cross-product terms of X, i.e. all second- and higher-order terms. Computation of the P-value of a test statistic under the observed data is infeasible, since there are too many unknown parameters in the above distribution. However, under the properties (characterized by their corresponding hypotheses) of interest, model (2) often takes a much simpler form. Then, conditioning on a suitable statistic S, we obtain a parameter-free exact distribution, on which the tests are based.
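
As a toy illustration of (2), the sketch below keeps only the second-order cross-product terms in W (higher-order terms would enter analogously) and recovers the normalizing constant exp{A(Ψ, Ω)} by enumerating all 2^J binary vectors, which is feasible only for small J.

    import itertools
    import numpy as np

    def quadratic_exponential_pmf(psi, omega):
        # psi: length-J main-effect parameters; omega: J x J symmetric matrix
        # whose (i, j) entry (i < j) weights the cross-product X_i X_j.
        psi, omega = np.asarray(psi, float), np.asarray(omega, float)
        J = len(psi)
        log_mass = {}
        for x in itertools.product([0, 1], repeat=J):
            xa = np.array(x)
            w = sum(omega[i, j] * xa[i] * xa[j]
                    for i in range(J) for j in range(i + 1, J))
            log_mass[x] = float(psi @ xa + w)
        a = np.log(sum(np.exp(v) for v in log_mass.values()))  # A(psi, omega)
        return {x: np.exp(v - a) for x, v in log_mass.items()}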

Now we state the properties we want to test, and in Section 3 we discuss the corresponding test statistics h(·), the conditioning statistic S, the conditional distributions, and the sampling.

Junker (1993) introduced the notion of covariances given the sum are non-positive (CSN) to characterize the general dependence structure between pairs of test items. To keep the paper self-contained, we restate its definition below.

Definition (CSN)

The covariances given the sum are non-positive if and only if, for any i < j ≤ J, the covariance between items i and j, given the total X_+, is non-positive. That is,

Cov(X_i, X_j | X_+) ≤ 0.

Note that CSN is an intuitive property: for a fixed total, if some component increases, the other components tend to decrease. But the property does not hold automatically for all IRT models, as it requires that all the components vary in a coordinated way; an IRT model must have a special dependence structure for it to be true. Thus in practice we expect some of the commonly used IRT models, or the corresponding data tables, to possess this property.

Also from Junker (1993), we have the following.

Definition (MM)

Manifest monotonicity holds if

E(X_i | X_{+(−i)}) is nondecreasing as a function of X_{+(−i)}

for all i ≤ J and all J.

The following concept, conditional association (CA), is from Holland and Rosenbaum (1986).

Definition (CA)

The components in X are conditionally associated, if and only if for every pair of disjoint, finite response vectors Y and Z in X, and for every pair of coordinatewise nondecreasing functions f(Y) and g(Y), and for every function h(Z), and for every c ∈ range(h) we have that

Cov(f(Y), g(Y) | h(Z) = c) ≥ 0.

Let XJ,k = (XJ+1, …, XJ+k) be a k-vector of future items after X. The following definition of vanishing conditional dependence (VCD) is from Junker and Ellis (1997).

Definition (VCD)

X has vanishing conditional dependence, if and only if for any partition (Y, Z) of the response vector X, and any measurable functions f and g (and any J) we have that

lim_{k→∞} Cov(f(Y), g(Z) | X_{J,k}) = 0

almost surely.

An IRT model can have none, some, or even all of the four manifest properties defined above, and in general having one or some of the properties does not necessarily imply any of the others; see, e.g., Junker (1993) and Junker and Ellis (1997) for a characterization of the relationships among CSN, CA, MM, VCD and some other properties.

From the definitions of these properties, asymptotic methods will be valid for CSN provided the sample size n ≥ 30[J(J − 1)/2]², and for MM if n > 30[J(J − 1)]². Sample sizes for CA and VCD are much larger. In practice, tests are often made up of J > 3 items. If J = 6, the sample size required for valid asymptotic methods is n ≥ 6750 for CSN and n > 27,000 for MM, not to mention the sample sizes for CA and VCD. We see that the sample sizes required for valid asymptotic methods are unrealistic in many applications. To perform exact tests of the above properties, the key is to derive the conditional distributions for each of the properties and the corresponding sampling methods. We will see that we only require the conditional models at the boundary of each hypothesis; this keeps the corresponding models very simple, and otherwise exact methods would be infeasible. In Section 3, we consider these issues one by one.

3. Construction of the tests

The exact tests derived below are based on the fact that the level α test is determined by the boundary condition under which all J items are independent. Without this condition, the conditional distributions and related sampling would be difficult to handle. Testing for CSN, MM, CA and VCD, denoted by the null hypothesis H, is non-standard, and often the number of parameters involved is huge. To overcome this problem, we first simplify the conditions to be tested to conditions specified on the boundary of the parameter set, giving rise to a simpler null hypothesis H0. This is done in such a way that any level α test for H0 is also a level α test for H, although the two hypotheses are not equivalent.

3.1. Test for CSN

Since the data are binary, X_+ can only take the values 0, 1, …, J (the values 0 and J are trivial, implying that all the scores are 0 or all are 1), so we can reformulate CSN as follows:

r(i, j|k) := Cov(X_i, X_j | X_+ = k) ≤ 0, for all i < j ≤ J and 0 ≤ k ≤ J.

Given X+ = k, the joint probability of X can be specified by (2) for each k.

Let t(k) be the n_k × J sub-table of all x_i's in t with x_{i+} = k. A natural estimate of r(i, j|k) is (only for those k with n_k > 0)

r̂(i, j|k) = (1/n_k) Σ_{x_s ∈ t(k)} (x_{si} − x̄_i)(x_{sj} − x̄_j),  (k = 1, …, J − 1), (3)

where x̄_i and x̄_j are the means of the ith and jth item across all subjects in t(k). Then a reasonable choice of test statistic for CSN is given by

h(t) = r̂ := Σ_{k=1}^{J−1} (n_k/n) max_{i,j} r̂(i, j|k). (4)

Note that (4) tends to have small values under H0 and large values under the alternative, which makes h(·) a valid test statistic. Clearly, r̂(i, j|0) = r̂(i, j|J) ≡ 0 for all i, j. Let

Θ = {r(i, j|k) : i < j ≤ J; 1 ≤ k ≤ J − 1}

be the collection of all r(i, j|k)'s. Then the null hypothesis for testing CSN can be written as H : Θ ≤ 0 (with "≤" understood componentwise). The rejection rule of a level α test of CSN has the form h(t) ≥ h_0 for some h_0 satisfying

sup_{Θ ≤ 0} P(h(T) ≥ h_0 | Θ) ≤ α.

Clearly, the supremum above is attained at Θ = 0. Thus, to get a level α test for CSN, we only need to construct a level α test for H0 : Θ = 0 vs. K : sup Θ > 0.

Now we describe the exact test for H0 vs. K. For this we first need the distribution of the data t under H0, and then we condition on a sufficient statistic of the parameters in the distribution to get a parameter-free conditional distribution. Based on the conditional distribution, i.i.d. samples are drawn to evaluate the observed statistic given in (4), and to compute the Monte Carlo P-value under H0. Conditional on x+ we have the following.

Proposition 1

Under H0,

P(t | x_+) = ∏_{j=1}^{J} x_{j+}! / x_{++}!. (5)
Proof

Under H0, for i ≠ j we have

Cov(X_i, X_j) = Σ_{k=0}^{J} Cov(X_i, X_j | Σ_{l=1}^{J} X_l = k) P(Σ_{l=1}^{J} X_l = k) = 0.

Since the Xi’s are binary, we have

0 = Cov(X_i, X_j) = E(X_i X_j) − E(X_i)E(X_j) = P(X_i = 1, X_j = 1) − P(X_i = 1)P(X_j = 1). (6)

By (6) we get

P(X_i = 1, X_j = 0) = P(X_i = 1) − P(X_i = 1, X_j = 1) = P(X_i = 1) − P(X_i = 1)P(X_j = 1) = P(X_i = 1)P(X_j = 0).

Similarly

P(X_i = 0, X_j = 1) = P(X_i = 0)P(X_j = 1),  P(X_i = 0, X_j = 0) = P(X_i = 0)P(X_j = 0).

Thus, under H0, X_i and X_j are independent for all i ≠ j.

Let pj = P(Xj = 1) (j = 1, …, J) and p = (p1, …, pJ). Under H0 the mass function of t is

P(T = t) = ∏_{i=1}^{n} ∏_{j=1}^{J} p_j^{x_{ij}} = ∏_{j=1}^{J} p_j^{x_{j+}}.

Now we show that x+ is a sufficient statistic for p. For this we only need to show that the conditional distribution of t given x+ is free of parameters, and is given by (5). In fact, let X+ be the corresponding random variable for observation x+. Then, under H0, X+ is distributed as the multinomial M(x++, p), so

P(T = t | X_+ = x_+) = P(T = t, X_+ = x_+) / P(X_+ = x_+) = P(T = t) / P(X_+ = x_+) = ∏_{j=1}^{J} p_j^{x_{j+}} / [ (x_{++}! / ∏_{j=1}^{J} x_{j+}!) ∏_{j=1}^{J} p_j^{x_{j+}} ] = ∏_{j=1}^{J} x_{j+}! / x_{++}!.

Proposition 1 tells us how to sample from (5). However, our purpose is to compute the test statistic from (3) and (4) for each new sample. Specifically, the Monte Carlo samples are drawn as follows.

Get the sub-tables t(k) (k = 1, …, J − 1) from the observation t, and compute the r̂(i, j|k)'s by (3). Then compute r_0 = h(t) by (4). To draw the Monte Carlo samples, we first compute the column totals x_+ = (x_{1+}, …, x_{J+}). The Monte Carlo sampling then proceeds as follows: specify an integer M, and let z_1, …, z_M be a sequence to be assigned during the sampling process. For m = 1, …, M do the following steps:

  (i) Draw a sample t^{(m)} from (5), which is realized by a random permutation of the jth column t_j of t, for each j = 1, …, J, independently across columns.

  (ii) For k = 1, …, J − 1, compute t^{(m)}(k), which is composed of all the row vectors in t^{(m)} with row total k; its size n_k^{(m)} is the number of rows in t^{(m)}(k).

  (iii) Compute the r̂^{(m)}(i, j|k)'s by (3) based on t^{(m)}(k), for each k. Then compute r^{(m)} = h(t^{(m)}) from the r̂^{(m)}(i, j|k)'s and n_k^{(m)}'s by (4). If r^{(m)} ≥ r_0, let z_m = 1; otherwise z_m = 0.

The Monte Carlo P-value is α̂ = (1/M) Σ_{m=1}^{M} z_m. Its estimated variance sd² is given by

sd² = [1/(M(M − 1))] Σ_{m=1}^{M} (z_m − z̄)² = [1/(M − 1)] z̄(1 − z̄),

where z̄ = (1/M) Σ_{m=1}^{M} z_m. The corresponding 100(1 − α)% confidence interval is estimated by [z̄ ± Φ^{−1}(1 − α/2) sd] (Mehta et al., 1988), where Φ^{−1}(1 − α/2) is the 100(1 − α/2)th percentile of the standard normal distribution. Since sd² = z̄(1 − z̄)/(M − 1) ≤ 1/(4(M − 1)), to estimate α̂ within accuracy β one should choose M ≥ Φ^{−2}(1 − α/2)/(4β²). For α = 0.05 and β = 0.01, this gives M ≥ (2.576/(2 × 0.01))² ≈ 17,000. If α̂ is smaller than some prespecified level α, then H0, and hence CSN, is rejected.
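
A compact Python sketch of steps (i)-(iii), assuming the observed table t is stored as an n × J array of 0/1 scores (the function and variable names are illustrative only):

    import numpy as np

    def csn_statistic(t):
        # h(t) of (4): sum over k of (n_k / n) times the largest r-hat(i, j | k) of (3).
        n, J = t.shape
        stat = 0.0
        for k in range(1, J):                      # k = 1, ..., J - 1
            tk = t[t.sum(axis=1) == k]             # sub-table t(k)
            nk = len(tk)
            if nk == 0:
                continue
            c = tk - tk.mean(axis=0)               # column-centred scores
            cov = c.T @ c / nk                     # r-hat(i, j | k) for all pairs
            stat += (nk / n) * cov[np.triu_indices(J, 1)].max()
        return stat

    def csn_exact_test(t, M=17000, seed=None):
        # Steps (i)-(iii): independent column permutations realize (5); the
        # Monte Carlo P-value counts statistics at least as large as r_0.
        rng = np.random.default_rng(seed)
        r0 = csn_statistic(t)
        z = np.empty(M)
        for m in range(M):
            tm = np.column_stack([rng.permutation(col) for col in t.T])
            z[m] = csn_statistic(tm) >= r0
        p_hat = z.mean()
        se = np.sqrt(p_hat * (1 - p_hat) / (M - 1))
        return p_hat, se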

Remark

The sampling scheme above is based on permutations of data of size n. It is known that the amount of computation for permutations increases rapidly with n, and it may result in computational overflow. In this case, instead of fully updating the original data table at each step of the sampling process, we only update a sub-table of it. Let n_ℓ be the number of examinees with score ℓ (ℓ = 0, 1, …, J). Then, replace step (i) above by

  • (i′) For each j = 1, …, J, draw an index vector i_j = (i_{j1}, …, i_{j,n_ℓ}) of length n_ℓ from {1, …, n}, uniformly without replacement (so that all components of i_j are different). This can be done as follows: divide [0, 1] into non-overlapping sub-intervals I_1, …, I_n of equal length. Draw u_1 ~ U[0, 1]; if u_1 ∈ I_{s_1}, assign i_{j1} = s_1. Then draw u_2 ~ U[0, 1]; if u_2 ∈ I_{s_2} and s_2 ≠ s_1, assign i_{j2} = s_2; if s_2 = s_1, redraw u_2 ~ U[0, 1] until u_2 ∈ I_{s_2} with s_2 ≠ s_1, and assign i_{j2} = s_2. Continue until all components of i_j are assigned. Given this i_j, let t_j(i_j) be the sub-vector of t_j of length n_ℓ with indices in i_j, and do a permutation within t_j(i_j), for j = 1, …, J. Merge the results into a new table t^{(m)}.

In this case the number of samples M should be much larger to ensure ergodicity of the Monte Carlo samples and convergence of the corresponding P-value. Note that P-values for other properties, expressed in terms of covariances being ≤ 0 (or ≥ 0), can be computed in the same way as above.
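
A minimal sketch of the modified step (i′); the sub-vector length m_sub is a simplification of the n_ℓ in the remark and would be chosen by the user.

    import numpy as np

    def permute_subcolumns(t, m_sub, rng):
        # Step (i'): for each column j, draw m_sub distinct row indices without
        # replacement and permute the column only within those rows.
        tm = t.copy()
        n, J = t.shape
        for j in range(J):
            idx = rng.choice(n, size=m_sub, replace=False)   # index vector i_j
            tm[idx, j] = rng.permutation(tm[idx, j])          # permute t_j(i_j)
        return tm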

3.2. Test for MM

Using the notation of Yuan and Clarke (2001), consider the total score of the ith examinee over the J items, minus the term for the jth item. Denote this by x_{i,+(−j)} = Σ_{r=1, r≠j}^{J} x_{ir}. As a generic random variable this is X_{+(−j)} = Σ_{i=1, i≠j}^{J} X_i, in which j indexes the item. Now, the quantity we use to test MM is Δ_k(−j) := E(X_j | X_{+(−j)} = k + 1) − E(X_j | X_{+(−j)} = k), where k = 0, …, J − 1 and j = 1, …, J. Let Θ = {Δ_k(−j) : k = 0, …, J − 1; j = 1, …, J}. The null hypothesis H of MM is then equivalent to H : Θ ≥ 0, tested against K : Θ < 0. We first obtain natural estimators of the Δ_k(−j)'s, and hence a test statistic for MM. To this end we partition the collection of examinees' binary response vectors based on the values of x_{i,+(−j)}. Let t(k, −j) = {x_i : x_{i,+(−j)} = k} (k = 0, 1, …, J − 1; j = 1, …, J), and let n(k, −j) = |t(k, −j)| be its cardinality. Now, a natural estimate of Δ_k(−j) is

Δ̂_k(−j) = [1/n(k+1, −j)] Σ_{x_i ∈ t(k+1, −j)} x_{i,j} − [1/n(k, −j)] Σ_{x_i ∈ t(k, −j)} x_{i,j}. (7)

In the above we use the convention Σ_{x_i ∈ t(k, −j)} x_{i,j} / n(k, −j) = 0 if n(k, −j) = 0. A reasonable choice for h(·) is

h(t) = Δ̂ := Σ_{0 ≤ k < J−1; 1 ≤ j ≤ J} [(n(k, −j) + n(k+1, −j)) / (2Jn)] Δ̂_k(−j). (8)

When MM is not true, h(·) will tend to be small. By the same argument as for CSN, to get a level α test for H we only need to construct a level α test for H0 : Θ = 0 vs. K : Θ < 0. For two random variables X and Y, X ⊥ Y denotes that X and Y are independent. Let x(k, −j) = (x_{+1}(k, −j), …, x_{+J}(k, −j)) be the vector of the observed column totals in t(k, −j). We have the following.

Proposition 2

Under H0, (5) is still true in this case.

Proof

Under H0, we have

P(X_j = 1 | X_{+(−j)} = 0) = E(X_j | X_{+(−j)} = 0) = E(X_j | X_{+(−j)} = 1) = ⋯ = E(X_j | X_{+(−j)} = J − 1),

or

P(X_j = 1 | X_{+(−j)} = 0) = P(X_j = 1 | X_{+(−j)} = 1) = ⋯ = P(X_j = 1 | X_{+(−j)} = J − 1),

so

P(X_j = 1) = Σ_{k=0}^{J−1} P(X_j = 1 | X_{+(−j)} = k) P(X_{+(−j)} = k) = P(X_j = 1 | X_{+(−j)} = r) Σ_{k=0}^{J−1} P(X_{+(−j)} = k) = P(X_j = 1 | X_{+(−j)} = r),

for any 0 ≤ r ≤ J − 1. Since X_j is binary, this implies that X_j ⊥ X_{+(−j)} (1 ≤ j ≤ J) for all J. In particular, taking j = 1 and J = 2 we have X_1 ⊥ X_2; taking J = 3 we have X_1 ⊥ (X_2 + X_3), which, given the independence between X_1 and X_2, implies that X_1 ⊥ X_3, and, continuing in this way, X_1 ⊥ X_j for all j ≠ 1. Similarly, taking j = 2 and J = 2, 3, …, we have X_2 ⊥ X_j (j ≠ 2), and finally X_1, …, X_J are independent of each other. The rest of the proof is the same as in Proposition 1.

To perform the exact test for H0 vs. K, the Monte Carlo procedure is similar to the one used for testing CSN. In particular, obtain the tables t(k, −j) (k = 0, …, J − 1; j = 1, …, J) from the observed table t, and compute Δ̂^{(0)} by (7) and (8). Next, draw Monte Carlo samples t^{(m)}(k, −j) according to (5), as in the sampling setup for testing CSN, and compute the Δ̂^{(m)}'s by (7) and (8). The Monte Carlo sampling to compute the P-value is also similar to before: specify an integer M and a sequence z_1, …, z_M as for CSN, and for m = 1, …, M do the following: (a) steps (i) and (ii) are as before; (b) if Δ̂^{(m)} ≤ Δ̂^{(0)}, let z_m = 1, otherwise z_m = 0. The Monte Carlo P-value for H0 vs. K is α̂ = (1/M) Σ_{m=1}^{M} z_m; its estimated standard error and confidence interval are the counterparts of the corresponding quantities for CSN. Clearly, the Remark given in Section 3.1 also applies to this case.
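
A Python sketch of the MM test in the same style as the CSN sketch above, again assuming an n × J 0/1 array; the names are illustrative only.

    import numpy as np

    def mm_statistic(t):
        # Delta-hat of (8): weighted sum of the differences Delta-hat_k(-j) of (7),
        # with empty cells contributing zero by convention.
        n, J = t.shape
        row_tot = t.sum(axis=1)
        stat = 0.0
        for j in range(J):
            rest = row_tot - t[:, j]               # x_{i,+(-j)}
            for k in range(J - 1):                 # 0 <= k < J - 1
                g0 = t[rest == k, j]
                g1 = t[rest == k + 1, j]
                m0 = g0.mean() if g0.size else 0.0
                m1 = g1.mean() if g1.size else 0.0
                stat += (g0.size + g1.size) / (2 * J * n) * (m1 - m0)
        return stat

    def mm_exact_test(t, M=17000, seed=None):
        # Small values of the statistic speak against MM, so the P-value counts
        # permuted tables with a statistic no larger than the observed one.
        rng = np.random.default_rng(seed)
        d0 = mm_statistic(t)
        hits = 0
        for _ in range(M):
            tm = np.column_stack([rng.permutation(col) for col in t.T])
            hits += mm_statistic(tm) <= d0
        return hits / M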

3.3. Test for CA

In principle, testing for CA is the same as testing for CSN. In the following we use the notation of, and Proposition 4.4 in, Yuan and Clarke (2001). Under this notation, CA is equivalent to H:

Θ = {Cov(χ_A(X(ω(j))), χ_B(X(ω(j))) | X(ω′(j′)) ∈ D) : (j, j′, ω, ω′, ≺, ≺′, A, B, D)} ≥ 0

vs. K : θ < 0 for some θ ∈ Θ, where (j, j′, ω, ω′, ≺, ≺′, A, B, D) ranges over an index set that is specified by a display rendered only as an image in the source manuscript.

The cardinality of Θ will usually be enormous even for J ≥ 3.

Let Θ0 be the subset of Θ consisting of all the components of Θ for which ω(j) is the (1, …, J)-complement of ω′(j′) and ω(j) = ω1(j1) ⊕ ω2(j2) for some ω1(·), ω2(·) and j1 + j2 = j. As before, for a level α test of H vs. K, we only need to construct a level α test for H0 : Θ0 = 0 vs. K0 : Θ0 > 0. By similar reasoning as before, this corresponds to independence of the pairs χ_A(X(ω1(j1))) and χ_B(X(ω2(j2))), for any A ∈ 𝒜(≺ω1) and B ∈ 𝒜(≺ω2) (where 𝒜(·) denotes the corresponding class of sets in the notation of Yuan and Clarke, 2001), conditional on the event X(ω′(j′)) ∈ D. Now, for each fixed j′ and ω′(j′), let Γ_D = Γ_D(ω′(j′)) be the set of all vectors X with X(ω′(j′)) ∈ D. For fixed j1, j2, A ∈ 𝒜(≺ω1) and B ∈ 𝒜(≺ω2), let y_{AB|D}, y_{ABc|D}, y_{AcB|D} and y_{AcBc|D} be the cell counts of the events AB, AB^c, A^cB and A^cB^c in the set Γ_D. Define y_{A|D} = y_{AB|D} + y_{ABc|D}, y_{B|D} = y_{AB|D} + y_{AcB|D} and y_{++|D} = y_{A|D} + y_{B|D}. Then, under H0, the two-by-two contingency table y_D := (y_{AB|D}, y_{ABc|D}, y_{AcB|D}, y_{AcBc|D}) is columnwise independent, and its conditional distribution given (y_{A|D}, y_{B|D}) is standard (Agresti, 1990):

P(y_D | y_{A|D}, y_{B|D}) = C(y_{A|D}, y_{AB|D}) C(y_{B|D}, y_{A|D} − y_{AB|D}) / C(y_{++|D}, y_{A|D}), (9)

where C(a, b) denotes the binomial coefficient "a choose b".

For given A ∈ 𝒜(≺ω1(j1)), B ∈ 𝒜(≺ω2(j2)) and a conditioning set D, let n_{ABD} be the number of observations satisfying X(ω1(j1)) ∈ A, X(ω2(j2)) ∈ B and X(ω′(j′)) ∈ D. If n_{ABD} > 2, an estimate r̂_{ABD} of r_{ABD} = Cov(χ_A(X(ω1(j1))), χ_B(X(ω2(j2))) | X(ω′(j′)) ∈ D) can be constructed as its empirical version.
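
If (9) is read as a hypergeometric law for y_{AB|D} given the margins (an assumption made here for illustration), the cell can be drawn directly with NumPy; a minimal sketch:

    import numpy as np

    def sample_y_table(y_A, y_B, rng):
        # Draw y_AB from the pmf C(y_A, y_AB) C(y_B, y_A - y_AB) / C(y_A + y_B, y_A),
        # then recover the remaining off-diagonal cells from the margins.
        y_AB = rng.hypergeometric(ngood=y_A, nbad=y_B, nsample=y_A)
        return y_AB, y_A - y_AB, y_B - y_AB      # (y_AB, y_ABc, y_AcB)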

Since the cardinality of Θ is huge, it seems impractical to construct a closed-form test statistic even for H0. Instead, we use a random-scan sampling method as follows.

Let 𝒟* be the collection of all sets D, over ω′(j′) with 1 ≤ j′ ≤ J, that contain the observations x_i(ω′(j′)) for at least two indices i. Define 𝒜*(≺ω1(j1)) and 𝒜*(≺ω2(j2)) similarly. Let 𝒯 be the set of all integer triples (j′, j1, j2) with j′ + j1 + j2 = J for which there exist D ∈ 𝒟*, A ∈ 𝒜*(≺ω1(j1)) and B ∈ 𝒜*(≺ω2(j2)). Let W1, W2 and W′ be the vectors of proportions of the observed A, B and D's. For a collection 𝒞 of sets, denote by U(𝒞) the uniform distribution over 𝒞, and by 𝒲(W, 𝒞) the distribution over 𝒞 weighted by W.

Set a prespecified sample size M, and a sequence z1, …, zM to be specified. For m = 1, …, M, go over the following steps:

  1. Draw (j′, j1, j2) from U(𝒯), A from 𝒲(W1, 𝒜*(≺ω1(j1))), B from 𝒲(W2, 𝒜*(≺ω2(j2))), and D from 𝒲(W′, 𝒟*).

  2. Given the above A, B, D, compute n_{ABD}, y_{AB|D}, y_{A|D}, y_{B|D}, y_{++|D} and r̂_{ABD} from the observed data table.

  3. Sample the y_D's from (9), and compute the estimate r̃_{ABD} of r_{ABD}, using the sampled data, by the same formula as for r̂_{ABD}.

  4. If r̃_{ABD} > r̂_{ABD}, set z_m = 1; else z_m = 0.

The estimated P-value and its estimated standard error are computed in the same way as before.

3.4. Test for VCD

Using the same notation as in the previous subsection, Proposition 5.1 in Yuan and Clarke (2001) says that VCD is equivalent to the condition that for each k there is an ε = ε (k), with ε (k) going to zero, so that

max_{j, ω(j), A, B, D} |Cov(χ_A(X(ω(j))), χ_B(X(ω^c(j))) | X_{J,k} ∈ D)| ≤ ε(k), (10)

in which the operation max_{j, ω(j), A, B, D} denotes the maximum over a range of (j, ω(j), A, B, D) that is specified by a display rendered only as an image in the source manuscript.

Let θ = |Cov(χ_A(X(ω(j))), χ_B(X(ω^c(j))) | X_{J,k} ∈ D)|, Θ = {θ : j, ω(j), J, k, A, B, D}, and θ̄ = max_{θ∈Θ} θ. Then VCD can be formulated as H : θ̄ < ε vs. K : θ̄ ≥ ε, for some ε. As before, to obtain a level α test for H vs. K, if we use as test statistic the empirical counterpart of θ̄, with a rejection rule that it exceeds θ_0, where θ_0 is this statistic evaluated at the observation (x_{ij}), then we only need a level α test for H0 : θ̄ = 0 vs. K. For fixed ω(j), J, k and D, let G = G_D = {i : x_{J,k} ∈ D} and let n_G = |G| be its cardinality. Set Y_{i,1} = χ_A(X_i(ω(j))), Y_{i,2} = χ_B(X_i(ω^c(j))) and Y_{i,3} = χ_{A^cB^c}(X_i(ω(j))). The Y_{i,j}'s are binary, and under H0 they are independent conditional on X_{J,k}; conditioning on their totals eliminates the nuisance parameters. Thus, we have

P(Y | Y_{+1}, Y_{+2}, Y_{+3}) = ∏_{j=1}^{3} Y_{+j}! (n_G − Y_{+j})! / n_G!. (11)

So the test will be similar to that for CA. Also, sampling from (11) parallels sampling of CA given in Section 3.3.

Denote x_i = (x_{i,1}, …, x_{i,J}, x_{i,J+1}, …, x_{i,J+k}), where i = 1, …, n. The averages of the examinees' scores over G are

χ̄_A(D) = (1/n_G) Σ_{i∈G} χ_A(x_i(ω(j)))  and  χ̄_B(D) = (1/n_G) Σ_{i∈G} χ_B(x_i(ω^c(j))).

So,

θ̂ = (1/n_G) | Σ_{i∈G} (χ_A(x_i(ω(j))) − χ̄_A(D)) (χ_B(x_i(ω^c(j))) − χ̄_B(D)) | (12)

is an estimator of θ.
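
A small Python sketch of (11) and (12): drawing from (11) amounts to permuting each binary column independently, which is uniform over arrangements with the column totals fixed, and θ̂ is the absolute empirical covariance. The names are illustrative only.

    import numpy as np

    def theta_hat(y1, y2):
        # theta-hat of (12) for binary arrays y1 = chi_A(...), y2 = chi_B(...).
        nG = len(y1)
        return abs(np.sum((y1 - y1.mean()) * (y2 - y2.mean()))) / nG

    def sample_from_11(y1, y2, y3, rng):
        # One draw from (11): independent column permutations keep the totals
        # Y_{+1}, Y_{+2}, Y_{+3} fixed and are uniform over such arrangements.
        return rng.permutation(y1), rng.permutation(y2), rng.permutation(y3)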

In principle, to test H0 vs. K we would still need to go through all the combinations {j, ω(j), J, k, A, B, D} to find the maximum, which is impractical. Instead, we use a random scan as in the previous section: at each Monte Carlo iteration m, we randomly select a θ ∈ Θ, draw a sample (x_{ij}^{(m)}), and compute θ̂(x^{(m)}) and θ̂(x). Any occurrence of θ̂(x^{(m)}) ≥ θ̂(x) is evidence against H0. Specifically, the sampling is as follows.

Specify a sample size M, a sequence z1, …, zM to be specified, and set m = 0. Then do the following:

  (i) Draw J_0 from {2, …, J − 1}, j from {1, …, J_0 − 1}, k from {J_0 + 1, …, J}, ω(j) from {1, …, J_0}, A from 𝒜(ω(j)), B from 𝒜(ω(j)^c), and D from the corresponding collection of conditioning sets.

  (ii) For the above D, obtain the set G_D from the observation x. If G_D is empty, go back to (i); otherwise increase m by 1. Compute y_{i,1} = χ_A(x_i(ω(j))), y_{i,2} = χ_B(x_i(ω^c(j))), y_{i,3} = χ_{A^cB^c}(x_i(ω(j))) (i = 1, …, n_G), the totals y_{+1}, y_{+2}, y_{+3}, and θ̂(y) by (12).

  (iii) Sample Y^{(m)} from (11) and compute θ̂(Y^{(m)}). If θ̂(Y^{(m)}) ≥ θ̂(y), set z_m = 1, else z_m = 0. If m < M, go to (i); else stop.

The Monte Carlo P-value and its estimated standard error are computed in the same way as before.

4. Finite-sample performance

The tests for CA and VCD above, although more feasible than their theoretical versions, are still not convenient to use: they need unrealistically huge sample sizes to perform the formal tests. This section presents three sets of Monte Carlo experiments illustrating the finite-sample performance of the exact tests for CSN and MM. In all experiments the number of replicates is set at 1000, with M = 30,000. Although this setup allows for meaningful power results, the number of replicates may be considered low; the cost of computing the test statistics for CSN and MM with M = 30,000 was the limiting factor in considering a larger number of replicates.

4.1. First experiment

Two well-known unidimensional parametric IRT models for binary responses are used: the one-parameter logistic model (1PLM, also called the Rasch model) and the two-parameter logistic model (2PLM). The 2PLM, defined via the conditional probability of an item response, is given by

P(X_j = 1 | θ_i) = 1 / (1 + exp(−a_j(θ_i − b_j))),  (i = 1, …, n; j = 1, …, J), (13)

where θ_i represents the ability of examinee i, and a_j and b_j are item parameters: a_j is the item discrimination parameter and b_j the item difficulty parameter. The 1PLM is the special case of (13) with a_j = 1 (j = 1, …, J); see, e.g., Patz and Junker (1999) and van der Linden and Hambleton (1997) for more details on these models.

Using the computer program WinGen2 (Han and Hambleton, 2007), we simulate item and person parameters and item responses for sets of J = 10 and 20 items and n = 25 and 50 examinees. For the 1PLM, b_j is sampled randomly from a U[0.6, 1.9] distribution; this range is selected because estimated parameters for real data often fall within these values. θ_i is sampled randomly from a N(0, 1) distribution. For the 2PLM, the item discrimination parameters a_j are drawn from a log-normal distribution with mean 0 and standard deviation 0.25, and the item difficulty parameters b_j are sampled from a N(0, 1) distribution. These parameter distributions can be considered realistic in practice.
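
For reference, a minimal Python sketch that mimics this data-generating step (it is not WinGen2; treating the log-normal parameters as being on the log scale is an assumption):

    import numpy as np

    def simulate_responses(n, J, rasch=False, seed=None):
        # Generate an n x J 0/1 response table from model (13).
        rng = np.random.default_rng(seed)
        theta = rng.normal(0.0, 1.0, size=n)            # abilities theta_i ~ N(0, 1)
        if rasch:                                        # 1PLM: a_j = 1, b_j ~ U[0.6, 1.9]
            a = np.ones(J)
            b = rng.uniform(0.6, 1.9, size=J)
        else:                                            # 2PLM: a_j log-normal, b_j ~ N(0, 1)
            a = rng.lognormal(mean=0.0, sigma=0.25, size=J)
            b = rng.normal(0.0, 1.0, size=J)
        p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # model (13)
        return (rng.uniform(size=(n, J)) < p).astype(int)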

Table 1 shows the empirical quartiles Q1, Q2 (median), and Q3 of the 1000 computed P-values. It is quite obvious from the values of Q2 that in a large number of cases there is no indication to reject the null hypotheses, i.e. there is no violation of the CSN and MM properties. Moreover, the variability in the P-values, as measured by the sample interquartile range (Q3 − Q1), is low. The last two columns of Table 1 show the number of P-values less than 0.05 out of 1000 replications. Recall from Section 3.1 that the nominal level α is established on the boundary Θ = 0 of the null parameter space. When the actual case is Θ < 0, the observed rejection rates can be considerably smaller than α. So when we observe 0 rejections out of 1000 replications in Table 1, this does not mean the test for MM has level α = 0; rather, it means that the actual case is more likely Θ < 0. It should be noted here that, for any specified α, a size-α critical value h(α) can only be obtained via Monte Carlo sampling under H0, as the (1 − α)th sample quantile of the null distribution; for CSN this means following steps (i)–(iii) in Section 3.1. A size-α test for CSN is then given by the rejection rule: reject the null if the observed value of h exceeds h(α). Hence, the size of the tests is not related to the number of P-values less than 0.05.

Table 1.

Empirical quartiles Q1, Q2, and Q3 of 1000 computed P-values for testing CSN and MM, and number of P-values < 0.05; Experiment 1.

Model   n    J    CSN: Q2 (Q1, Q3)         MM: Q2 (Q1, Q3)          No. P-values < 0.05
                                                                    CSN     MM
1PLM    25   10   0.466 (0.233, 0.683)     0.234 (0.157, 0.342)      59     14
        25   20   0.237 (0.060, 0.512)     0.332 (0.225, 0.471)      76     13
        50   10   0.640 (0.385, 0.830)     0.192 (0.148, 0.249)      23      2
        50   20   0.327 (0.165, 0.530)     0.323 (0.192, 0.403)      76      0
2PLM    25   10   0.437 (0.191, 0.728)     0.358 (0.290, 0.448)      72      2
        25   20   0.360 (0.110, 0.678)     0.391 (0.263, 0.528)      20     17
        50   10   0.300 (0.131, 0.558)     0.259 (0.216, 0.308)     106      0
        50   20   0.789 (0.569, 0.917)     0.366 (0.317, 0.432)       6      1

Given the above results, it seems that CSN and MM are rather general properties of multivariate binary data. In fact, by reviewing the theory underlying monotonicity, Junker and Sijtsma (2000) showed that MM holds for the 1PLM. For the 2PLM these authors construct three theoretical counterexamples in which MM fails. Two counterexamples give rise to a nearly perfect (deterministic) Guttman scale, i.e. the items constitute a unidimensional ordered series such that an answer to a given item predicts the answers to all previous items in the series. Indeed, by constructing such a scale, we are able to reject MM using the sampling process discussed in Section 3.2. But, since the ideal of a Guttman scale is difficult to achieve in real testing, we do not explore this issue here further. Experiment 2 below presents a counterexample in which the CSN property is rejected.

4.2. Second experiment

Let X1, …, Xn be an i.i.d. sample of X = (X1, …, XJ). For each examinee i, let Y_i = (Y_{i1}, …, Y_{iJ}) ~ N(0, Ω), where all the off-diagonal elements of the J × J covariance matrix Ω are positive and equal to r. If Y_{ij} < Φ^{−1}(p_j), set x_{ij} = 1, otherwise x_{ij} = 0 (i = 1, …, n; j = 1, …, J). Given this general setup we consider testing for CSN with n = 25, 50, J = 10, r = 0.5, 0.6, 0.7, and p_j = 0.5. Table 2 shows empirical quantiles of the 1000 computed P-values. We see that, when n = 25 and r = 0.5, the CSN property is rejected in quite a few cases, with 38% of the P-values lying between 0 and 0.05. When n = 25 and r = 0.6, 0.7 these percentages are 61.3% and 78.6%, respectively. Thus, as the correlation increases, the null hypothesis of CSN is rejected more strongly. This result is typical for other sample sizes and values of J.
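
A sketch of this data-generating process, assuming unit variances on the diagonal of Ω (so that r is a correlation) and one latent draw per examinee:

    import numpy as np
    from scipy.stats import norm

    def simulate_threshold_data(n, J, r, p=0.5, seed=None):
        # Latent Y_i ~ N(0, Omega) with equicorrelation r; x_ij = 1 when the
        # latent value falls below the p-quantile of the standard normal.
        rng = np.random.default_rng(seed)
        omega = np.full((J, J), r)
        np.fill_diagonal(omega, 1.0)
        y = rng.multivariate_normal(np.zeros(J), omega, size=n)
        return (y < norm.ppf(p)).astype(int)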

Table 2.

Empirical quantiles of 1000 P-values for testing the CSN property (J = 10); Experiment 2.

n    r     Empirical quantiles
           0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
25   0.5   0.002   0.015   0.033   0.063   0.109   0.165   0.260   0.377   0.565
25   0.6   0.000   0.000   0.005   0.011   0.025   0.051   0.091   0.174   0.344
25   0.7   0.000   0.000   0.000   0.000   0.004   0.011   0.026   0.058   0.146
50   0.5   0.014   0.038   0.082   0.130   0.182   0.256   0.343   0.451   0.597
50   0.6   0.002   0.009   0.018   0.037   0.059   0.091   0.158   0.233   0.383
50   0.7   0.000   0.001   0.002   0.006   0.013   0.024   0.041   0.077   0.144

4.3. Third experiment

For the last experiment, we compute the exact tests for CSN and MM using data taken from the 1992 Trial State Assessment Program in Reading at Grade 4 of the US National Assessment of Educational Progress (NAEP). The dataset under study is a random sub-sample of size n = 3000 drawn from the population of fourth-grade students in the US; see Patz and Junker (1999), Table 1. The responses concern J = 6 items from each student; for each item a response of 1 represents a correct answer and 0 an incorrect one. The questions themselves and the associated reading passage have not been publicly released by NAEP. Patz and Junker (1999) analysed the complete dataset (3000 examinees) using MCMC sampling methods for 2PLM item calibration. Here we treat the dataset as a population, and 1000 random samples of sizes n = 25, 50, 100 and 200 are drawn without replacement from the full dataset. Recall that these sample sizes are far smaller than the minimal sample sizes required for asymptotic methods for CSN (number of parameters = 15, n ≥ 6750) and MM (number of parameters = 30, n > 27,000).

Table 3 shows empirical quartiles Q1, Q2, and Q3 computed on the basis of the 1000 P-values. Clearly, for all values of n the empirical quartiles give no evidence to reject the CSN property, and the evidence against this null hypothesis weakens further as n increases from 25 to 200. Interestingly, the opposite occurs when testing for MM: the evidence to reject the MM property increases with n. A next step would be to fit 2PLMs to the 1000 data subsets for each n; then, following Junker and Sijtsma (2000), estimates of P(X_j = 1 | X_{+(−j)}) may well reveal violations of monotonicity at certain locations of its empirical distribution.

Table 3.

Empirical quartiles Q1, Q2, and Q3 of 1000 P-values for testing the CSN and MM properties; Experiment 3.

n      CSN: Q2 (Q1, Q3)           MM: Q2 (Q1, Q3)
25     0.3236 (0.1419, 0.5852)    0.2571 (0.1742, 0.3863)
50     0.2796 (0.1111, 0.5191)    0.1760 (0.1012, 0.2815)
100    0.3966 (0.2242, 0.5925)    0.0855 (0.0474, 0.1836)
200    0.5742 (0.4037, 0.7458)    0.0288 (0.0173, 0.0499)

5. Some concluding remarks

We have proposed exact hypothesis tests for CSN, MM, CA, and VCD. In particular, the tests for CSN and MM are now computationally feasible and practical, with Monte Carlo P-values computed under H0. Making the tests for CA and VCD practical remains open to further research. Moreover, the Monte Carlo method may extend to further properties. Nevertheless, the tests considered here may not be the best ones in some sense and leave room for improvement. In particular, since they are based on permutations, the amount of computation grows factorially (faster than exponentially) with the size of the data table, so for large tables the simple Monte Carlo method may again become computationally impractical. To address this, an MCMC method that updates only a sub-table per iteration can be used in practice without an effective size limitation. Yuan and Yang (2005) proposed a Markov chain method for exact inference in contingency tables, in which a sub-table of user-specified size is sampled at each iteration. This chain has high sampling efficiency and can be adapted to the present case. For data with very large tables, it can be used to refine our method.

Finally, it is worth mentioning that the null hypotheses of CSN, MM, CA, and VCD considered here are not of the simple Pearson type. Hence tests with some optimality property, such as UMP tests, generally do not exist, and we have only dealt with level α tests for these hypotheses: we find level α tests of the corresponding H0, which are also level α tests of the corresponding H. Under each H0, all the properties CSN, MM, CA and VCD share a common feature, columnwise independence, although under the corresponding H these properties are not the same.

Acknowledgments

The authors thank three anonymous referees for detailed and helpful comments. The work of Ao Yuan is supported in part by the National Center for Research Resources at NIH grant 2G12RR003048.

References

  1. Agresti A. Categorical Data Analysis. Wiley; New York: 1990.
  2. Birnbaum A. Some latent trait models and their use in inferring an examinee's ability (Part 5). In: Lord FM, Novick MR, editors. Statistical Theories of Mental Test Scores. Addison-Wesley; Reading, MA: 1968. pp. 397–479.
  3. Cox DR. The analysis of multivariate binary data. Appl Stat. 1972;21:113–120.
  4. Cressie N, Holland PW. Characterizing the manifest probabilities of latent trait models. Psychometrika. 1983;48:129–141.
  5. Ellis J, Junker BW. Tail measurability in monotone latent variable models. Psychometrika. 1997;62:495–523.
  6. Fisher RA. The logic of inductive inference. J Roy Statist Soc. 1935;98:39–54.
  7. Fischer G. Einführung in die Theorie psychologischer Tests: Grundlagen und Anwendungen. Huber; Bern: 1974.
  8. Fitzmaurice G, Laird NM. A likelihood-based method for analyzing longitudinal binary responses. Biometrika. 1993;80:141–151.
  9. Han KT, Hambleton RK. User's Manual for WinGen: Windows Software that Generates IRT Model Parameters and Item Responses. Center for Educational Assessment Research Report No. 642. University of Massachusetts; 2007.
  10. Holland PW, Rosenbaum PR. Conditional association and unidimensionality in monotone latent trait models. Ann Statist. 1986;14:1523–1543.
  11. Joag-Dev K, Proschan F. Negative association of random variables, with applications. Ann Statist. 1983;10:286–295.
  12. Junker BW. Essential independence and likelihood-based ability estimation for polytomous items. Psychometrika. 1991;56:255–278.
  13. Junker BW. Conditional association, essential independence, and monotone unidimensional item response models. Ann Statist. 1993;21:1359–1378.
  14. Junker BW, Ellis J. A characterization of monotone unidimensional latent variable models. Ann Statist. 1997;25:1327–1343.
  15. Junker BW, Sijtsma K. Latent and manifest monotonicity in item response models. Appl Psychol Meas. 2000;24:65–81.
  16. Lehmann EL. Testing Statistical Hypotheses. 2nd ed. Wiley; New York: 1986.
  17. Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Addison-Wesley; Reading, MA: 1968.
  18. Mehta CR, Patel NR, Senchaudhuri P. Importance sampling for estimating exact probabilities in permutational inference. J Amer Statist Assoc. 1988;83:999–1005.
  19. Patz RJ, Junker BW. A straightforward approach to Markov chain Monte Carlo methods for item response models. J Educ Behav Stat. 1999;24:146–178.
  20. Rosenbaum PR. Comparing item characteristic curves. Psychometrika. 1987;52:217–233.
  21. Stout WF. A nonparametric approach for assessing latent trait unidimensionality. Psychometrika. 1987;52:293–325.
  22. Stout WF. A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika. 1990;55:293–325.
  23. van der Linden WJ, Hambleton RK. Handbook of Modern Item Response Theory. Springer; New York: 1997.
  24. Yuan A, Clarke B. Manifest characterization and testing for certain latent properties. Ann Statist. 2001;29:876–898.
  25. Yuan A, Yang Y. A Markov chain sampler for contingency table exact inference. Comput Statist. 2005;20:63–80.
