Abstract
The goodness-of-fit of the unidimensional monotone latent variable model can be assessed using the empirical conditions of nonnegative correlations (Mokken in A theory and procedure of scale-analysis, Mouton, The Hague, 1971), manifest monotonicity (Junker in Ann Stat 21:1359–1378, 1993), multivariate total positivity of order 2 (Bartolucci and Forcina in Ann Stat 28:1206–1218, 2000), and nonnegative partial correlations (Ellis in Psychometrika 79:303–316, 2014). We show that multidimensional monotone factor models with independent factors also imply these empirical conditions; therefore, the conditions are insensitive to multidimensionality. Conditional association (Rosenbaum in Psychometrika 49(3):425–435, 1984) can detect multidimensionality, but tests of it (De Gooijer and Yuan in Comput Stat Data Anal 55:34–44, 2011) are usually not feasible for realistic numbers of items. The only existing feasible test procedures that can reveal multidimensionality are Rosenbaum’s (Psychometrika 49(3):425–435, 1984) Case 2 and Case 5, which test the covariance of two items or two subtests conditionally on the unweighted sum of the other items. We improve this procedure by conditioning on a weighted sum of the other items. The weights are estimated in a training sample from a linear regression analysis. Simulations show that the Type I error rate is under control and that, for large samples, the power is higher if one dimension is more important than the other or if there is a third dimension. In small samples and with two equally important dimensions, using the unweighted sum yields greater power.
Supplementary Information
The online version contains supplementary material available at 10.1007/s11336-023-09905-w.
Keywords: unidimensional measurement, multidimensional measurement, monotone latent variable model, monotone homogeneity model, conditional association
For binary test data satisfying a monotone item response theory (IRT) model, we develop a statistical test procedure that can detect multidimensionality as opposed to unidimensionality. Investigating the dimensionality of a psychological test is an important step in test development and validation. Establishing unidimensionality can contribute to the construct validity of the test because it renders the interpretation of test performance easier, comparable with measurement in other areas of science. Multidimensional item sets are edited by removing or replacing items that deviate from the target attribute, or by splitting the item set into subsets that represent better interpretable test performance. A case in point is the development of Spearman’s (1904) theory of general intelligence into the current multidimensional Cattell–Horn–Carroll (CHC) theory (Wasserman, 2019), based on psychometric analyses of numerous datasets.
Dimensionality analysis of an item set is usually done using factor analysis or IRT analysis (e.g., Sijtsma & Van der Ark, 2021). These approaches include parametric assumptions such as linearity and normality in factor analysis and logistic, normal-ogive, or step functions in IRT. These assumptions usually have little prior plausibility, which led several authors (Holland, 1981; Mokken, 1971; Rosenbaum, 1984; Stout, 1987) to study measurement using weaker assumptions, for example, replacing logistic and normal-ogive item response functions (IRFs) with monotone IRFs only subjected to order restrictions without choosing a parametric function.
The absence of restrictive parametric functions rendered the development of goodness-of-fit tests complex; instead, the focus shifted to testable conditions that are the hallmark of an underlying quantitative variable. An example of a testable condition is that the inter-item correlations must be nonnegative (Mokken, 1971). The search for testable properties was also inspired by axiomatic measurement theory (Krantz et al., 1971) and probabilistic developments in it, such as the relation between simple scalability and strong stochastic transitivity in choice data (Tversky & Russo, 1969). This article also follows this approach.
We review two classes of monotone nonparametric IRT models and their testable conditions. Critical to this article, we argue that most conditions for which practical test procedures are available cannot distinguish multidimensional from unidimensional monotone IRT models. We target a specific set of covariance inequalities and demonstrate that we can use them to detect multidimensionality in cases that would previously remain undetected. We develop a practical test procedure and explore the Type I error rate and power using simulated data.
Models and Testable Conditions
Monotone Homogeneity and Monotone Factor Models
We discuss the definitions of three nonparametric IRT models that will be used throughout the article. The first model is monotone homogeneity (MH), which contends that the expected value of each observed binary item score variable increases with a single underlying variable, called the common factor, the latent variable, or the latent dimension. The second model, the monotone factor model (MFM), contends basically the same as the MH model with one or more independent factors to which the items are related in a simple structure. The third model is the higher-order monotone one-factor (HOMOF) model, which is like the MFM, but allows the factors to be correlated with the restriction that they depend on a single higher-order factor. The three models share the assumption of conditional (or local) independence or independent errors, which are similar assumptions. Thus, MH describes a general form of unidimensionality, MFM describes a general form of multidimensionality, and HOMOF is somewhere in between. Applied to intelligence, MH formally resembles Spearman’s theory of a single general intelligence factor, MFM is like Thurstone’s initial theory of multiple independent primary mental abilities, and HOMOF parallels the hierarchical factors of CHC, integrating the other two theories.
We assume the item scores are binary manifest variables, denoted X = (X_1, ..., X_J). Variable X_i represents the scores (1 positive, 0 negative) subjects obtained on the i-th item. Factors, latent variables, or dimensions are denoted Θ (possibly vector-valued), with values θ. The rest of this section discusses the formal definitions of the three models, which we need to prove the theorems.
We adopt the following assumptions (Holland & Rosenbaum, 1986; Mokken, 1971; Rosenbaum, 1984):
MH1 (conditional independence): X_1, ..., X_J are conditionally independent given Θ.
MH2 (monotonicity): P(X_i = 1 | Θ = θ) is an increasing function of θ for all i.
MH3 (unidimensionality): Θ is real-valued (a single dimension).
We use the term ‘increasing’ synonymously with ‘monotone nondecreasing’. For readers who want greater precision, Appendix A provides more rigorous formulations of the assumptions. Following Holland and Rosenbaum (1986), we will say that X satisfies a monotone latent variable (MLV) model if there exists a variable Θ such that MH1 and MH2 hold. Following Mokken (1971), Mokken and Lewis (1982), and Ellis and Junker (1997), we will say that X satisfies a unidimensional MLV model or MH model if there exists a variable Θ such that MH1, MH2, and MH3 hold.
Ellis (2015) studied a narrower class of monotone models. Slightly rephrasing Ellis, we will say that X satisfies an MFM if

X = g(A f(Θ) + E),

where f and g are component-wise increasing functions, Θ is a multivariate random vector with independent components, E is a multivariate random vector with components that are independent of each other and of Θ, the components of E have log-concave densities (Appendix A), and A is a nonnegative real matrix with simple structure (i.e., every manifest variable loads positively on one factor and zero on the other factors). As an example of an MFM for binary manifest variables, consider a case where the θ_d and the E_i have standard normal distributions, f is the identity function, and the g_i are step functions with g_i(x) = 1 if x ≥ b_i and g_i(x) = 0 if x < b_i for some real number b_i. Then, P(X_i = 1 | Θ = θ) = Φ(a_i'θ − b_i), where Φ is the standard normal distribution function and a_i' is the i-th row of A. Hence, every multidimensional normal-ogive IRT model with independent factors and nonnegative loadings with a simple structure is an MFM (also, see Takane & De Leeuw, 1987).
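The normal-ogive response probability in this example is easy to evaluate directly. The following sketch (Python; the function names and the particular loadings, abilities, and threshold are our own illustrative choices, not values from the article) computes P(X_i = 1 | Θ = θ) = Φ(a'θ − b):

```python
import math

def phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_ogive_prob(a, theta, b):
    """P(X_i = 1 | Theta = theta) = Phi(a'theta - b) for an item with
    nonnegative loading vector a (simple structure: one nonzero entry)
    and threshold b. A sketch, not the article's implementation."""
    eta = sum(ak * tk for ak, tk in zip(a, theta)) - b
    return phi(eta)
```

Because the loadings are nonnegative, the probability is increasing in each component of θ, which is the monotonicity that the MFM requires.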
Ellis (2015) also studied a more general class of models, in which the components of Θ need not be independent but may themselves be the result of a higher-order MFM, with log-concave disturbances at each level. We call this class of models, with possibly many levels and one factor at the highest level, a HOMOF model. In this class of models, the factor loadings at the lowest level (i.e., in A) do not necessarily have simple structure.
Testable Conditions of the Models
In this section, we review statistical inequalities that have been used to test whether MH holds for a given set of manifest variables. These inequalities can be expressed as covariances that must be nonnegative. The general result implied by MH is conditional association (CA; Rosenbaum, 1984). Below, we argue that CA is the hallmark of MH. Coincidentally, CA accords well with Spearman’s (1904) idea that intelligence tests have positive correlations and together measure a single general intelligence factor, and with Guttman’s ‘first law of intelligence’, which states that any two intelligence items have a nonnegative correlation in any population that is “not artificially selected” (Guttman & Levy, 1991), thus suggesting the items should have nonnegative correlations in any subgroup defined by the other items. CA is hard to test fully because it involves many restrictions even for small item sets (De Gooijer & Yuan, 2011; Ligtvoet, 2022; Yuan & Clarke, 2001). Therefore, we will also discuss conditions that are easier to test, such as the condition that the expected item score increases with the rest score, called manifest monotonicity (MM; Junker, 1993). These conditions can be viewed as incomplete tests of CA (Ligtvoet, 2022). We will now continue this section with the formal definitions.
Following Rosenbaum (1984), we say that X is CA if, for every partition X = (Y, Z), every function h, and all increasing functions f_1 and f_2,

Cov(f_1(Y), f_2(Y) | h(Z)) ≥ 0.
Rosenbaum (1984) provides examples. Rosenbaum’s result that

MH ⇒ CA
is key to this article, in which we develop a practically feasible test for CA. Holland and Rosenbaum (1986) generalized Rosenbaum’s (1984) work to non-binary variables. Ellis and Junker (1997; also Junker & Ellis, 1997) furthermore suggested that CA is sufficient to detect multidimensionality in a finite number of items. They studied infinite item sequences and used the condition of vanishing conditional dependence (VCD), which means that certain conditional covariances vanish as the number of items J → ∞. They showed that CA and VCD are necessary and sufficient for a unidimensional monotone latent variable model in which the latent variable can be estimated consistently. Since VCD is defined only in the limit as J → ∞, one would expect that if any condition can detect multidimensionality in a finite item set, it will be CA.
In addition to the practical infeasibility of testing CA due to the large number of restrictions one must test even for small item sets, a complete test of CA is also impossible because of the sparseness of the data: most item-score patterns never occur in the most commonly available sample sizes of 100 to 10,000 subjects. Perhaps for this reason, many authors have studied weaker restrictions than CA, possibly believing they still capture the gist of CA. Straat et al. (2016) acknowledged this limitation and proposed an incomplete strategy. Ligtvoet (2022) gives an excellent review of weaker conditions, which he describes as “incomplete tests of conditional association”.
An important condition discussed by Ligtvoet (2022) is multivariate total positivity of order 2 (MTP2). The ordinary formulation of this condition is given in Appendix A, but for the present purpose it suffices to note that X being MTP2 implies that

Cov(X_i, X_j | X_k = x_k for all k ≠ i, j) ≥ 0.

(Note the omission of h(·) here.) Therefore, X being MTP2 means that the CA inequality holds for some, but not all, functions h (Ellis, 2015, pp. 264–265). For binary variables, the difference between CA and MTP2 thus lies in the kind of events on which one may condition: MTP2 involves conditioning on the finest partition of subgroups that can be made with the remaining items, whereas CA also involves conditioning on combinations of such groups. Holland and Rosenbaum (1986) show that CA implies these conditional covariance inequalities. Therefore, a test of these inequalities can be viewed as an incomplete test of CA (Ligtvoet, 2022).
For tests consisting of realistic numbers of items, however, MTP2 still involves a large number of restrictions (Bartolucci & Forcina, 2005, p. 35; Ligtvoet, 2022). Therefore, one may want to reduce the number of restrictions to be tested even further by considering properties derived from MTP2, which include nonnegative partial correlations (NPC; Ellis, 2015; Brusco et al., 2015) and nonnegative covariances (NNC; Mokken, 1971).
The important testable property of manifest monotonicity (MM; Junker, 1993; Junker & Sijtsma, 2000) means that each item’s regression on the sum of the other items (known as the rest score) is increasing (Appendix A). Junker (1993, p. 1372) and Junker and Sijtsma (2000) showed that

MH ⇒ MM.
Ligtvoet (2022) showed that CA ⇒ MM, so MM may be viewed as another incomplete test of CA. MM is an important property because it is conceptually similar to the idea of a monotone IRF, and for the reader who is unfamiliar with these concepts it may be hard to see how one can have MM without MH. Ellis (2014) gave an example where MM holds while NPC fails, and therefore, MH must fail too.
Limitations of Partial Tests of Conditional Association
Ligtvoet (2022) concluded that existing incomplete tests of CA perform poorly in detecting violations of CA. In the data structures he generated, CA was often violated while MTP2 and MM held. Thus, a researcher who tests MTP2 and/or MM instead of CA misses the violation of CA. We agree, but we note that MTP2 and MM are also insensitive to violations of unidimensionality, which necessitates discussing the problem from a more theoretical perspective. This discussion is partially inspired by Van den Wollenberg’s (1982) proof that existing test statistics for the Rasch model were insensitive to violations of unidimensionality.
Ellis (2015; also Proposition A1 of Appendix A) showed that for MTP2 the problem is that

MFM ⇒ MTP2.
Consequently, any test of MH based on MTP2 (Bartolucci & Forcina, 2000, 2005) logically cannot distinguish MH from MFM. That is, if the test suggests that MTP2 holds, it is also possible that an MFM that violates MH generated the data. The same conclusion holds for any property derived from MTP2, which includes NPC (Ellis, 2015; Brusco et al., 2015) and NNC (Mokken, 1971). Thus, testing MTP2, NPC, and NNC cannot distinguish unidimensional from multidimensional monotone factor models.
MM is implied by CA but not by MTP2, and therefore, we need to discuss it separately. For MM, the problem is that

MFM ⇒ MM.
This follows from Theorem 1 in Appendix B. Consequently, no test of MM (Douglas & Cohen, 2001; Junker & Sijtsma, 2000; Molenaar & Sijtsma, 2000; Tijmstra et al., 2013; Tijmstra & Bolsinova, 2019) can distinguish MH from MFM. That is, if the test suggests that MM holds, it is also possible that an MFM that violates MH generated the data. Thus, testing MM cannot distinguish unidimensional from multidimensional monotone factor models.
To summarize, based on data it is impossible to distinguish between MH and an MFM if one tests CA only partially with MTP2, NNC, NPC, and MM. For example, assume that J = 10, with the first five items satisfying the Rasch model with latent variable Θ_1, the last five items satisfying the Rasch model with latent variable Θ_2, and Θ_1 and Θ_2 independent. This is an MFM, thus satisfying MM and MTP2, which implies NNC and NPC. Hence, it would be impossible to reject MH if only these conditions are tested. This puts the MH model at a serious disadvantage in comparison with parametric IRT models such as the two-parameter logistic model, where one can easily identify this situation via the M2-test (Maydeu-Olivares & Joe, 2006).
The argument given includes the somewhat artificial case of independent factors, which implies that some items have correlation zero. One might argue that such cases would be excluded by Mokken’s (1971; Mokken & Lewis, 1982) scalability criterion. However, Ellis (2015) also showed that

HOMOF ⇒ MTP2.
Although HOMOF models involve a single higher-order factor, there will generally be multiple first-order factors, which may be positively correlated to any degree. Consequently, it is possible that such item sets also satisfy Mokken’s criterion.
We conclude that to distinguish unidimensional MFMs from multidimensional MFMs, the testable conditions MTP, NPC, NNC, and MM are logically insufficient. Therefore, we will target specific aspects of CA, which are covariance inequalities that will likely discriminate between MH and multidimensional MFMs. The next section discusses candidate covariance inequalities.
Conditioning on Added Regression Predictions (CARP) Inequalities
CARP Inequalities, Definition
Assume that all response probabilities of X are known; sample statistics will be discussed later. For a given pair (i, j), denote by X_{-ij} the variables of X except X_i and X_j. Our proposal is a generalization of Rosenbaum’s (1984, p. 427) “Case 2” method to test the covariance of each item pair conditionally on the rest score of the pair. For these covariances, CA implies that

Cov(X_i, X_j | Σ_{k ≠ i,j} X_k) ≥ 0.
We call this the conditioning on rest scores (CRS) inequality, and we call a significance test for the CRS inequality a CRS test. A limitation of the CRS inequality for testing the dimensionality of a set of items is that the rest score used for conditioning is not adapted to the possibly multidimensional structure if the item set does not satisfy MH. To obtain greater adaptation, we propose to use weighted rest scores. We use two linear regression analyses, in which X_i and X_j serve as dependent variables and the other items are independent variables. Denote the regression coefficient of X_k in the prediction of X_i as b_ik, and denote the resulting predicted score as X̂_i; that is,

X̂_i = b_i0 + Σ_k b_ik X_k,

where the b_ik are such that they minimize E(X_i − X̂_i)^2, with b_ik = 0 if k = i or k = j. Similarly, X̂_j = b_j0 + Σ_k b_jk X_k, where the b_jk are such that they minimize E(X_j − X̂_j)^2, with b_jk = 0 if k = i or k = j. In other words, X̂_i is the prediction of X_i if X_j is excluded as a predictor, and X̂_j is the prediction of X_j if X_i is excluded as a predictor. As the basis for conditioning, we propose the variable

Y_{-ij} = X̂_i + X̂_j.
This is the sum of the predicted scores of X_i and X_j, where both X_i and X_j are excluded from the predictors. It may be noted that we are not assuming linearity or normality here; we are merely using the least squares solution as a heuristic tool, without claiming that it produces a good model.
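To make the construction concrete, the following sketch (Python with NumPy; the function name is ours, and ordinary least squares via `lstsq` is one of several reasonable ways to fit the two regressions) computes the weighted conditioning variable, the sum of the two predicted scores, for a binary data matrix, with both focal items removed from the predictors:

```python
import numpy as np

def carp_conditioning_variable(X, i, j):
    """Sum of the least-squares predictions of items i and j, each
    regressed on all OTHER items (both focal items excluded as
    predictors). A sketch, not the article's implementation."""
    n, J = X.shape
    others = [k for k in range(J) if k not in (i, j)]
    Z = np.column_stack([np.ones(n), X[:, others]])   # intercept + other items
    beta_i, *_ = np.linalg.lstsq(Z, X[:, i], rcond=None)
    beta_j, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
    return Z @ beta_i + Z @ beta_j                    # weighted rest score
```

Excluding columns i and j from `Z` enforces the zero weights on the focal pair that the definition requires.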
The variable Y_{-ij} can attain many different values, producing small conditioning groups. Therefore, we will use deciles or other quantiles of this variable. This is like Rosenbaum’s (1984, p. 428) “Case 5”, which tests the covariance of two subtests conditionally on deciles of the rest score. If Q_m is an operator that divides any variable into m groups of approximately equal size, that is, Q_m(Y_{-ij}) ∈ {1, ..., m}, then we obtain

Cov(X_i, X_j | Q_m(Y_{-ij})) ≥ 0.
We refer to covariance inequalities of this form (with or without grouping by Q_m) as conditioning on added regression predictions (CARP) inequalities. Similarly, we refer to the conditional covariances involved as CARP covariances, and to the corresponding correlations as CARP correlations. We call the property that X satisfies all CARP inequalities simply CARP.
CARP is the special case of CA with (in the notation used in the definition of CA) Y = (X_i, X_j), f_1(Y) = X_i and f_2(Y) = X_j, Z = X_{-ij}, and h(Z) = Q_m(Y_{-ij}). The latter weighted sum is a function of X_{-ij} because X_i and X_j both have weight zero. We assume in this section that all response probabilities of X are known, and therefore, the regression coefficients are parameters, not sample statistics.
Hence, MH implies CARP. Furthermore, MFM does not imply CARP, which is demonstrated in later simulations, and which can also be seen in theoretical computations of some special cases. Therefore, testing CARP inequalities can reveal some violations of MH that testing MTP or MM cannot reveal.
Let us now briefly explain why CARP inequalities may be useful in the assessment of multidimensionality. Suppose that X_i and X_j load on different independent latent variables, say, Θ_1 and Θ_2, and that the other items load on either Θ_1 or Θ_2. After a suitable transformation of Θ_1 and Θ_2, we may say that X̂_i estimates Θ_1 and X̂_j estimates Θ_2, so conditioning on Y_{-ij} = X̂_i + X̂_j tends to create groups in which Θ_1 + Θ_2 is approximately equal, which induces a negative correlation between Θ_1 and Θ_2 in these groups (in groups where Θ_1 + Θ_2 is constant, Θ_2 is a decreasing function of Θ_1), which in turn tends to create a negative correlation between X_i and X_j. Theorem 2 of the Appendix states more formally that in this situation the mean conditional covariance given the unweighted rest scores will be negative or zero, and Theorem 3 of the Appendix states that this will also be true for the mean conditional covariance given the weighted rest score, provided that the regressions of X_i and X_j on their respective weighted subtest scores are both increasing (i.e., the items have MM with respect to the partial weighted sum score of their respective subtest). In the standard two-dimensional case (defined in section 5.1) with ten items, we computed these correlations using numerical integration, and the outcomes supported our expectation that such correlations tend to be negative or zero. The simulations to assess the power of the CARP tests, reported later in this study, also support this result.
When we developed the test, we initially created a slightly different method, which can produce smaller correlations than the CARP correlations (the computation of the next example can be found in the Supplementary Material). For example, take two uncorrelated standard normal dimensions, each with five Rasch items of equal difficulty. Then, using numerical integration, one can obtain a correlation of −.204 in the union of the two most extreme vigintile groups of Y_{-ij}, that is, conditionally on Y_{-ij} falling in its lowest or highest vigintile. This is not a CARP correlation, because we condition on a union of two quantile groups instead of on a single quantile group. Although this correlation is smaller than the CARP correlations we obtained, a statistical test based on this conditional correlation turns out to be less powerful because the 90% of observations with intermediate values of Y_{-ij} are discarded. Simulations showed that a CARP test has greater power in this case. The next section focuses on a test statistic that can be used to test CARP.
Our approach is almost the opposite of the DETECT and DIMTEST procedures for investigating an item set’s dimensionality (Stout et al., 1996; Zhang & Stout, 1999a, b). DETECT and DIMTEST look for large conditional covariances, averaged over item pairs, as a sign that unidimensionality is violated. Unlike CARP, these approaches do not use rigorously established inequalities for the conditional covariances, but rather assume that they are approximately equal to certain theoretical conditional covariances given Θ.
A Statistical Test of CARP for a Single Focal Pair
We develop a significance test that we can use to check whether the CARP inequality holds for a single pair (i, j), called the focal pair. We discuss computation in six steps. The algorithm will become available in the R-package mokken (Van der Ark, 2007, 2012). Analyzing 100,000 samples took less than 6 min in total.
Step 1: Select a focal item pair. We propose four strategies:
1. If the researcher suspects that different items measure different attributes, pick one item representative of one attribute and another item representative of another attribute. For example, some arithmetic items, including item i, may also measure a verbal attribute, and others, including item j, a nonverbal attribute. Pick item i and item j.
2. If data are available from previous research, an explorative analysis may be done using factor analysis or a parametric multidimensional IRT model. If different dimensions appear, again pick two items, each representing another attribute. For example, in a factor solution, not necessarily from a well-fitting model, items can be selected that load high on one factor and close to zero on another factor, thus providing a heuristic tool.
3. The CARP procedure involves splitting the sample into training and test samples. The training sample can be used to select the focal pair in the same way as in Strategy 2.
4. Let (i, j) run over all possible pairs and apply the test to each pair. A later section discusses methods to combine multiple item pairs.
Step 2: Select a training sample. Split the total sample of N subjects randomly into a training sample of L subjects and a test sample of M subjects (L + M = N). The proportion of subjects in the training sample is p = L/N. We use the training sample to estimate the regression coefficients and use these in the test sample to compute the test statistic. Based on simulation work reported later, for small samples (N ≈ 500) we recommend p = .5, and for larger samples p = .4 or p = .3.
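A minimal sketch of this random split (Python with NumPy; the function name and seed handling are our own illustrative choices) could look as follows:

```python
import numpy as np

def split_sample(X, p_train, seed=0):
    """Randomly split the rows of data matrix X into a training sample
    of L = round(p_train * N) subjects and a test sample of the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))        # random order of subjects
    L = int(round(p_train * len(X)))
    return X[idx[:L]], X[idx[L:]]        # training sample, test sample
```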
Step 3: Estimate the regression coefficients. Linear regression analysis on the training sample yields estimates of the coefficients b_ik and b_jk, denoted b̂_ik and b̂_jk (with b̂_ik = b̂_jk = 0 if k = i or k = j).
Step 4: Estimate quantiles of the predicted scores. Using only the training sample, compute the estimated predicted scores

X̂_i = b̂_i0 + Σ_k b̂_ik X_k,

and similarly for X̂_j. Next, determine the empirical distribution function of Ŷ_{-ij} = X̂_i + X̂_j in the training sample. The distribution is used to define m quantile groups. Here, we propose using deciles (m = 10). Thus, the outcome of Step 4 is a list of real numbers c_1 ≤ ... ≤ c_{m−1} such that

P(c_{t−1} < Ŷ_{-ij} ≤ c_t) ≈ 1/m

for t = 1, ..., m, where we write c_0 = −∞ and c_m = +∞. The precise algorithm used in the simulations is provided in the Supplementary Material.
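Step 4 can be sketched as follows (Python with NumPy; using `np.quantile` for the separators and `np.searchsorted` for the grouping is one of several reasonable implementations, and the names are ours):

```python
import numpy as np

def quantile_separators(y_train, m=10):
    """Empirical cut points c_1 <= ... <= c_{m-1} that split the
    training-sample predicted scores into m roughly equal groups
    (deciles for m = 10). A sketch, not the article's exact algorithm."""
    probs = np.arange(1, m) / m
    return np.quantile(y_train, probs)

def assign_groups(y, separators):
    """Group label 0..m-1 for each value of the conditioning variable;
    implicitly uses c_0 = -inf and c_m = +inf at the boundaries."""
    return np.searchsorted(separators, y, side="right")
```

Values in the test sample that fall below c_1 or above c_{m−1} simply land in the first or last group, so no observation is lost.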
Step 5: Compute the conditioning variable in the test sample. Using the estimated regression coefficients b̂_ik and b̂_jk and the quantile separators c_1, ..., c_{m−1} estimated in the training sample, we extend the computation of Ŷ_{-ij} to the test sample. Next, we compute the conditioning variable Ĝ in the test data by

Ĝ = t if c_{t−1} < Ŷ_{-ij} ≤ c_t, for t = 1, ..., m.
Step 6: Compute the one-sided Mantel–Haenszel Z. Using the test sample, test the null hypothesis that Cov(X_i, X_j | Ĝ = t) ≥ 0 for t = 1, ..., m by means of a one-sided version of the Mantel–Haenszel statistic. We will use the test proposed by Rosenbaum (1984, p. 429; see Kuritz et al., 1988, for a discussion of different versions of the Mantel–Haenszel method). The following description is copied almost verbatim from Rosenbaum: Denote the number of subjects in the test sample having X_i = x, X_j = y, and Ĝ = t as n_xyt for x, y ∈ {0, 1}, and denote the marginal totals as n_x+t, n_+yt, n_++t, etc. Compute

E_t = n_1+t n_+1t / n_++t and V_t = n_1+t n_0+t n_+1t n_+0t / [n_++t^2 (n_++t − 1)],

and then the test statistic

Z_MH = [Σ_t n_11t − Σ_t E_t + 1/2] / [Σ_t V_t]^(1/2).

The p-value is computed as p = Φ(Z_MH), where Φ is the standard normal distribution function. See Rosenbaum (1984, p. 429) for more details and the rationale of Z_MH. The sample covariance of X_i and X_j in the layer with Ĝ = t is given by

C_t = (n_11t − E_t) / n_++t,

and therefore,

Σ_t n_11t − Σ_t E_t = Σ_t n_++t C_t.

The numerator of Z_MH is therefore a weighted sum of the conditional covariances of X_i and X_j, given the grouped weighted rest scores, with a continuity correction. Rosenbaum noticed that the quantities E_t and V_t are the expectation and variance of n_11t in the least favorable case of the null hypothesis, which is the case where Cov(X_i, X_j | Ĝ = t) = 0 for t = 1, ..., m. If the null hypothesis is true, then in this least favorable case E(n_11t) = E_t and Var(n_11t) = V_t, so that Z_MH has an asymptotic standard normal distribution.
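Assuming the 2 × 2 table per quantile group has already been tallied, the statistic and p-value can be sketched as follows (Python; the sign of the continuity correction and the skipping of degenerate layers are our own assumptions for this one-sided setup, not a verbatim transcription of the article's code):

```python
import math

def carp_mh_test(tables):
    """One-sided Mantel-Haenszel test over K stratified 2x2 tables,
    each given as ((n11, n10), (n01, n00)) for one quantile group.
    Negative conditional covariances pull Z down and make p small."""
    A = E = V = 0.0
    for (n11, n10), (n01, n00) in tables:
        n = n11 + n10 + n01 + n00
        if n < 2:
            continue                       # degenerate layer: no information
        r1, c1 = n11 + n10, n11 + n01      # margins for Xi = 1 and Xj = 1
        A += n11
        E += r1 * c1 / n
        V += r1 * (n - r1) * c1 * (n - c1) / (n * n * (n - 1))
    z = (A - E + 0.5) / math.sqrt(V)       # continuity-corrected Z
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # p = Phi(z)
    return z, p
```

Layers with strongly positive association yield a large Z (p near 1), while layers with negative association, as expected under multidimensionality, drive Z below zero.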
The optimal number m of quantile groups in Step 4 is rather arbitrary in the sense that there is no definitive number. Rosenbaum (1984) suggested deciles (m = 10) in his “Case 5”, but also used the raw rest score, which has J − 1 levels. We did simulations with both linear and logistic regression and with various numbers of groups. The differences in power between these versions were small, but linear regression with m = 10 had slightly higher power than the other options. Therefore, we use linear regression with m = 10 in all simulations below.
Asymptotic Type 1 Error Rate
We provide a formal proof that the Type 1 error rate is under control as the test sample grows infinitely large. Note that in Step 6 we suggested a one-sided version of a Mantel–Haenszel test, but multiple versions of the Mantel–Haenszel test exist (Kuritz et al., 1988), and more versions may be developed in the future. We want a result that is valid for all these versions, and rather than delving into the details of each possible version, we will make the general assumption that in Step 6 one applies a test with the following property: If the test is applied to the data of a 2 × 2 × K table to test the null hypothesis that the covariance is nonnegative in each of the K layers, then the asymptotic Type 1 error rate is under control in the sense that the p-values stochastically dominate a standard uniform random variable as the sample size grows to infinity. Now, the question is whether this remains true in our case, where the layers are partially based on the regression estimated from a training sample rather than on a fixed variable in the test sample.
Proposition 1
If subjects are drawn randomly and independently and if the test sample grows infinitely large while the training sample remains fixed, then the asymptotic Type 1 error rate of the CARP test is under control.
Proof
Denote the data of the L subjects of the training sample by T, and denote the data of the M subjects of the test sample by X^(1), ..., X^(M). Subjects are drawn randomly and independently; therefore, we consider these random vectors as independent and identically distributed (iid) copies of X. The k-th item score of the n-th test subject is denoted X_k^(n). Define g(·) to be the function such that, for any coefficient vector b with b_i = b_j = 0 and any vector of separators c with c_1 ≤ ... ≤ c_{m−1}, if we write Ŷ = b_0 + Σ_k b_k X_k and c_0 = −∞, c_m = +∞, then

g(X; b, c) = t if c_{t−1} < Ŷ ≤ c_t,

for t = 1, ..., m. This definition parallels the definition of the conditioning variable Ĝ given in the description of Step 5. We denote the vectors of estimated regression coefficients and quantile separators by b̂ and ĉ, respectively, so that we can write the conditioning variable for the n-th subject in the test sample as

Ĝ^(n) = g(X^(n); b̂, ĉ).

Now, consider conditional covariances of the form Cov(X_i, X_j | g(X; b, c) = t). For the n-th subject in the test sample, the corresponding covariance is Cov(X_i^(n), X_j^(n) | Ĝ^(n) = t). Consider the latter covariance conditionally on T. Given T, X^(n) is conditionally associated (because X is conditionally associated and X^(n) is a copy of X that is independent of T). Furthermore, given T, Ĝ^(n) depends only on X^(n) with the two variables X_i^(n) and X_j^(n) excluded (since we required b̂_i = b̂_j = 0), and therefore, Cov(X_i^(n), X_j^(n) | Ĝ^(n) = t, T) is implied to be nonnegative by conditional association of X^(n). Nonnegativity holds for t = 1, ..., m. It can be concluded that, given T, the data of the test sample can be considered as M independent draws from a population in which the null hypothesis of Step 6 holds. Therefore, the asymptotic conditional distribution of the p-value dominates the uniform (0, 1) distribution in the sense that, for every α, lim sup_{M→∞} P(p ≤ α | T) ≤ α. The decision rule “reject the null hypothesis if p ≤ α” thus leads to an asymptotic Type 1 error rate of at most α given T, and with the reverse Fatou lemma we have that this is also true unconditionally, because lim sup_{M→∞} P(p ≤ α) ≤ E[lim sup_{M→∞} P(p ≤ α | T)] ≤ α. □
Proposition 1 holds no matter how poor the estimates b̂ and ĉ are or how far off-target the heuristic tool is. All that is needed to control the Type 1 error rate is that the subjects are drawn iid, that the estimates are based on the training sample, that the training sample is independent of the test sample, and that the weights of the focal variables X_i and X_j are fixed to zero in the computation of Ŷ_{-ij}.
Proposition 1 assumes that the size of the training sample remains fixed while the size of the test sample increases. If, however, L and M increase simultaneously, we presumably also need that the estimates b̂ and ĉ converge as L → ∞, because the conditioning variable depends on them and, therefore, on L. If b̂ and ĉ converge, then the conditioning groups stabilize, and we expect that the proof can be modified to establish that the Type 1 error rate is under control in this situation as well. However, we see no point in dwelling on cases with L → ∞, because increasing the training sample has almost no benefits once the standard errors of b̂ and ĉ become very small. For all practical purposes, we can therefore add to our procedure the prescription to cap L when the estimated standard errors of b̂ and ĉ are below a certain small threshold. This happens almost surely for large L if these estimates are obtained by linear regression and the empirical distribution function, as we discussed in the previous section. Then, the proof suffices to establish asymptotic Type 1 error rate control.
We explored whether the test can be modified such that the training sample and the test sample both include the whole sample, but simulations showed that this modification causes the Type I error rate to exceed the nominal significance level in some cases. Hence, we recommend cross-validation.
Simulation Studies
Method
General Set-Up
We used J items and a logistic model,

P(X_i = 1 | Θ = θ) = exp(Σ_d a_id θ_d + δ_i) / [1 + exp(Σ_d a_id θ_d + δ_i)],

where Θ = (Θ_1, Θ_2, Θ_3) has a trivariate standard normal distribution with correlations 0. Denote the number of items that load on dimensions 1, 2, and 3 as J_1, J_2, and J_3, respectively, so that J_1 + J_2 + J_3 = J. We distinguish the standard two-dimensional case as a special case with a_id equal to a common positive value if item i loads on dimension d and a_id = 0 otherwise, with δ_i = 0 for all i, and with J_1 = J_2 = J/2 and J_3 = 0. We call this the ‘standard’ case, but it represents a failure if the goal was to create a unidimensional test that satisfies MH.
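A sketch of this data-generating model for the standard two-dimensional case (Python with NumPy; the common loading value of 1.0 and the function name are our own illustrative assumptions, not necessarily the values used in the article) is:

```python
import numpy as np

def simulate_mfm(n, J=10, loading=1.0, seed=0):
    """Binary data from a two-dimensional logistic MFM with simple
    structure: the first J//2 items load on theta_1, the rest on
    theta_2; intercepts are 0. A sketch for illustration only."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((n, 2))     # independent standard normal factors
    A = np.zeros((J, 2))
    A[: J // 2, 0] = loading                # simple structure
    A[J // 2 :, 1] = loading
    eta = theta @ A.T                       # n x J linear predictor
    p = 1.0 / (1.0 + np.exp(-eta))          # logistic IRFs
    return (rng.random((n, J)) < p).astype(int)
```

Items within the same half are positively correlated, while correlations across the two halves are (in expectation) zero, which is exactly the structure the CARP test is designed to detect.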
Optimum Size of Training Sample
Training samples in cross-validation often contain at least 70% of the observations of the whole sample. We did some simulations to find out whether we must maintain that percentage here. We used the standard two-dimensional case with J = 12 or 24 and N = 500, 1000, 2000, or 5000, using 1000 samples per cell. For each combination of J and N, we fitted the power of the CARP method found in the simulation by a quadratic regression on the training proportion. From the estimated regression coefficients, we computed the value of the training proportion for which the quadratic curve has its maximum. Table 1 shows that for small samples (N = 500), the estimated optimum was close to .5, and for large samples the optimum was rather .2 or .3.
Table 1.
Estimated optimum values of training proportion for varying sample size N and test length J.
| N | J = 12 | J = 24 |
|---|---|---|
| 500 | .45 | .48 |
| 1000 | .38 | .43 |
| 2000 | .31 | .36 |
| 5000 | .19 | .31 |
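The optimum-finding step can be sketched as follows: regress the simulated power on a quadratic in the training proportion p and take the vertex of the fitted parabola. The power values below are made-up illustrations, not output of the reported simulations.

```python
import numpy as np

# Hypothetical (training proportion, simulated power) pairs.
p = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
power = np.array([0.55, 0.68, 0.74, 0.73, 0.66, 0.52])

# Quadratic regression: power ~ c2*p^2 + c1*p + c0.
c2, c1, c0 = np.polyfit(p, power, deg=2)

# The fitted curve peaks at the vertex -c1 / (2*c2) (requires c2 < 0).
p_opt = -c1 / (2 * c2)
```
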
Results
Type I Error Rate
For large samples, the Type I error rate is under control because of the asymptotic properties of the Mantel–Haenszel test. It suffices to study the error rate for small samples, and we focused first on with for this purpose. In unidimensional cases, we chose for . We focused on cases with , which we refer to as zero-dimensional. (One can also describe this case as J-dimensional, but the number of common dimensions would still be 0.) These cases are interesting because all CARP covariances are zero, whereas they are positive in the unidimensional case with positive loadings. Consequently, the rejection rates were generally higher in zero-dimensional cases than in other unidimensional cases. Parameters that were not fixed to 0 were randomly drawn from the following distributions: and . We studied the effect of the number of items, J, varying between 10 and 50. For each J, we simulated 100 parameter sets S, each consisting of , . Next, for each of the 100 parameter sets we simulated 1000 samples of N subjects responding to the J items and applied the CARP test procedure to this sample with nominal significance level . Thus, for each J we have 100 parameter sets, and for each parameter set, we obtained a rejection rate based on 1000 samples.
Table 2 shows the quartiles of the rejection rates for some selected values of J. The maximum rejection rate over all 4100 zero-dimensional cases (41 values of J, each with 100 cases of 1000 samples) was .065, which is not significantly larger than .05 according to a binomial test with multiple testing correction (for a single case of 1000 samples, the p-value would be small, but not for the maximum of 100 cases). The mean rejection rate was .038. Figure 1 shows the cumulative percentages of the rejection rates along with the expected cumulative percentages derived from a binomial distribution with success probability .05. The expected distribution clearly dominates the distribution of rejection rates. Therefore, we conclude that the Type I error rate of the CARP test is under control in these cases.
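The binomial check for the maximum rejection rate can be sketched as follows; treating the 100 cases as independent when correcting for taking the maximum is a simplifying assumption, and the numbers mirror those reported above.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_samples, alpha = 1000, 0.05
max_rate = 0.065                      # largest observed rejection rate
k = round(max_rate * n_samples)       # 65 rejections out of 1000 samples

p_single = binom_sf(k, n_samples, alpha)   # p-value for a single case
p_max = 1 - (1 - p_single) ** 100          # corrected for the max of 100 cases
```

The corrected p-value is far from significant, which matches the conclusion that the Type I error rate is under control.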
Table 2.
Type I error rates in zero-dimensional cases.
| Quartile | J = 10 | J = 20 | J = 30 | J = 40 | J = 50 |
|---|---|---|---|---|---|
| 0 (minimum) | .018 | .024 | .019 | .025 | .022 |
| 1 | .032 | .034 | .034 | .032 | .033 |
| 2 | .038 | .038 | .038 | .037 | .038 |
| 3 | .041 | .040 | .043 | .041 | .041 |
| 4 (maximum) | .052 | .052 | .055 | .051 | .051 |
Each column is based on 100 cases with 1000 samples each, with and .
Fig. 1.
Cumulative Percentages of Type I Error Rates in Zero-Dimensional Cases.
We also did simulations where , and J were randomly drawn from a uniform distribution, with , , and . We simulated zero-dimensional parameter cases with 1000 samples each. The maximum rejection rate was .062, which is not significant according to a binomial test with multiple testing correction . The mean rejection rate was .040. Figure 1 shows the cumulative frequencies of the rejection rates along with the expected cumulative frequencies derived from a binomial distribution with success probability .05. The binomial distribution clearly dominates the distribution of the rejection rates. Therefore, we conclude that the Type I error rate is under control. The simulations with random , and J were also conducted for unidimensional cases with . These rejection rates were all well below .05.
Finally, we did some simulations with and small J. Both zero-dimensional and unidimensional cases were simulated with , , and with 10 parameter cases per cell and 1000 samples per parameter case. The rejection rates were again around .05 in the zero-dimensional cases, and close to 0 in the unidimensional cases.
Power: Single Focal Pair, Effect of Dimensionality, and Item Distribution
We chose a common positive loading if item i loads on dimension d, and zero otherwise. We used N = 5000. Consider the standard two-dimensional case where n1 = n2 and n3 = 0. For large N, if linear regression is used, the weighted rest score converges to a linear transform of the rest score. Therefore, the power of the CARP test will approach that of a CRS test. However, for finite N, the power of the CARP test will remain below that of the CRS test, because part of the sample is used for training and not for testing, and because the weighted rest score is not yet exactly equal to the rest score. This was indeed what we found in simulations. Therefore, one may consider the standard two-dimensional case as the ideal case for Rosenbaum’s CRS test. Next, we compare the power of the CARP test and the CRS test under various deviations from this standard case. The first kind of deviation is that n1 ≠ n2. The second kind of deviation is that n3 > 0, introducing items that load on a third dimension that we assume uncorrelated with the other two dimensions. These simulations were conducted with all J between 9 and 39 that are multiples of 3, but we report results in detail only for J = 12 and J = 24.
Table 3 shows the power of the CARP and CRS tests with J = 12 and J = 24 for two-dimensional cases with n3 = 0 and varying n1. The CARP test had significantly greater power than the CRS test in the seven cases with n1/J ≤ .25, and in some of these cases, the power of the CARP test was considerably larger. In the other nine cases, the CRS test had more power, but the power of the CARP test was rather close to it. The results for other values of J were similar: for n1/J below .27 or above .73, the CARP test had significantly greater power than the CRS test. For n1/J between .30 and .70, the CARP test had significantly smaller power than the CRS test. For values of n1/J between .27 and .30, or between .70 and .73, the difference in power between the CARP and CRS tests was usually not significant.
Table 3.
Rejection Rates of CARP and CRS Tests in Two-Dimensional Cases with N = 5000.
| J | n1 | n2 | n3 | CARP | CRS |
|---|---|---|---|---|---|
| 12 | 2 | 10 | 0 | .363 | .231 |
| 12 | 3 | 9 | 0 | .601 | .533 |
| 12 | 4 | 8 | 0 | .727 | .797 |
| 12 | 5 | 7 | 0 | .792 | .929 |
| 12 | 6 | 6 | 0 | .791 | .949 |
| 24 | 2 | 22 | 0 | .320 | .118 |
| 24 | 3 | 21 | 0 | .627 | .234 |
| 24 | 4 | 20 | 0 | .753 | .450 |
| 24 | 5 | 19 | 0 | .836 | .680 |
| 24 | 6 | 18 | 0 | .889 | .839 |
| 24 | 7 | 17 | 0 | .892 | .913 |
| 24 | 8 | 16 | 0 | .937 | .970 |
| 24 | 9 | 15 | 0 | .929 | .989 |
| 24 | 10 | 14 | 0 | .950 | .995 |
| 24 | 11 | 13 | 0 | .945 | .995 |
| 24 | 12 | 12 | 0 | .954 | .998 |
Values in bold are the largest power in the row
Each row is based on 1000 samples.
Table 4 shows the power with J = 12 and J = 24 for three-dimensional cases with n3 > 0. The CARP test had greater power than the CRS test in the ten cases with n3 ≥ J/3, and in some of these cases the power of the CARP test was considerably greater. In the other four cases, the CRS test had greater power, but the power of the CARP test was still substantial. Results for other values of J were similar. For small N, the CARP test lost power compared to the CRS test, because the training sample was excluded from the test. Thus, the results for smaller N were more favorable for the CRS test.
Table 4.
Rejection rates of CARP and CRS tests in three-dimensional cases with N = 5000.
| J | n1 | n2 | n3 | CARP | CRS |
|---|---|---|---|---|---|
| 12 | 2 | 2 | 8 | .154 | .064 |
| 12 | 3 | 3 | 6 | .373 | .148 |
| 12 | 4 | 4 | 4 | .541 | .398 |
| 12 | 5 | 5 | 2 | .684 | .767 |
| 24 | 2 | 2 | 20 | .130 | .061 |
| 24 | 3 | 3 | 18 | .285 | .060 |
| 24 | 4 | 4 | 16 | .476 | .092 |
| 24 | 5 | 5 | 14 | .622 | .171 |
| 24 | 6 | 6 | 12 | .732 | .240 |
| 24 | 7 | 7 | 10 | .801 | .493 |
| 24 | 8 | 8 | 8 | .855 | .729 |
| 24 | 9 | 9 | 6 | .899 | .915 |
| 24 | 10 | 10 | 4 | .914 | .982 |
| 24 | 11 | 11 | 2 | .945 | .995 |
Values in bold are the largest power in the row
Each row is based on 1000 samples.
Power: Single Focal Pair, Effect of Item Parameters and Sample Size
We studied two-dimensional tests with and , with , , and or . We chose for and for . Parameters that were not fixed to 0 were randomly drawn from the following distributions: and . For , we simulated 100 parameter sets S, each consisting of , . Next, for each of the 100 parameter sets we simulated 1000 samples of N subjects responding to the J items and applied the CARP test procedure to this sample with nominal significance level . Figure 2 shows modified boxplots of the rejection rates. As expected, the power increased with N, but variation was large due to the parameter sets. For , most parameter sets would have had power greater than .80.
Fig. 2.
Rejection rates in two-dimensional cases with different item parameters.
Additional Results for the CARP Tests
Aggregation of CARP Tests across Item Pairs
We discuss how one can combine tests of multiple item pairs without prior selection. If one applies a CARP test to all item pairs of a scale, this produces a sequence of p-values. To keep the family-wise Type I error rate (FWER) under control, a multiple testing correction may be in order. Alternatively, one could choose to control the false discovery rate (FDR), which generally leads to tests with higher power (Benjamini & Hochberg, 1995; Benjamini & Yekutieli, 2001). However, applying either correction method to all p-values results in unnecessary loss of power if many of the involved covariances are positive. Instead, one may consider only the item pairs that have a negative conditional covariance in the training sample and test their conditional covariances in the test sample. We collect the item pairs with a negative conditional covariance in the training sample in a set, and let S denote the size of this set. Since this statistic is independent of the data in the test sample, one may apply the Bonferroni correction with this number; that is, reject the null hypothesis for pair (i, j) iff the pair is in the set and its p-value is at most α/S. To check that this correction with S controls the FWER, assume that under the null hypothesis the distribution of each p-value dominates the Uniform(0, 1) distribution, in the sense that the probability that the p-value is at most u is itself at most u, for all u. This condition is called supra-uniformity by Ellis et al. (2020). The condition is satisfied if the p-value dominates the Uniform(0, 1) distribution in likelihood ratio order (Whitt, 1980), which is equivalent to the p-value having a density that is increasing on the interval (0, 1). This is true for the test as conducted here. Let the event that the null hypothesis is rejected for pair (i, j) be given; then, for a fixed pair (i, j), its probability is at most α/S, and summing over the at most S tested pairs shows that the FWER is at most α.
(In and , is treated as a random variable with outcomes that enumerate the power set of .)
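The selection-then-Bonferroni scheme can be sketched as follows; the function `carp_pvalue`, standing in for the CARP test of a single pair, and the use of the raw training-sample covariance as the selection statistic are simplifying assumptions.

```python
import numpy as np

def select_and_test(train, test, carp_pvalue, alpha=0.05):
    """Select pairs with negative covariance in the training sample,
    then Bonferroni-test only those pairs in the test sample."""
    J = train.shape[1]
    # Selection step: pairs with a negative covariance in the training
    # sample (the article conditions on the weighted rest score instead).
    cov = np.cov(train, rowvar=False)
    selected = [(i, j) for i in range(J) for j in range(i + 1, J)
                if cov[i, j] < 0]
    S = len(selected)  # Bonferroni divisor, independent of the test sample
    rejected = [(i, j) for (i, j) in selected
                if carp_pvalue(test, i, j) <= alpha / max(S, 1)]
    return selected, rejected
```
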
Alternatively, one could try to compound the Z-statistics in the formula
where the sum runs over some further restricted subset of item pairs, and v is an estimate of the variance of the numerator. The Z-statistics are correlated, and to obtain v one should somehow estimate their average correlation; see Efron (2010), who discusses methods for this purpose. An advantage is that compounding can increase the power, as Straat et al. (2016) concluded for the CA tests of Rosenbaum (1984). We encourage future researchers to develop improved compounding rules.
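A compound statistic of this form can be sketched as follows; the average-correlation estimate `rho_bar` is assumed to come from a separate estimation step, such as the methods Efron (2010) discusses.

```python
import math

def compound_z(z_values, rho_bar):
    """Compound correlated Z-statistics into one overall statistic.

    If each Z has unit variance and the average pairwise correlation is
    rho_bar, then Var(sum of Z) = k + k*(k - 1)*rho_bar for k statistics.
    """
    k = len(z_values)
    v = k + k * (k - 1) * rho_bar   # estimated variance of the numerator
    return sum(z_values) / math.sqrt(v)
```

With uncorrelated statistics the compound grows with the square root of the number of pairs; with perfectly correlated statistics it reduces to a single Z.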
Comparison with Other Methods
So far, we have compared CARP mathematically with other incomplete tests of CA, such as testing MTP, and we have compared it in simulations with Rosenbaum’s (1984) CRS test, which also assesses CA. Next, we briefly touch upon parametric goodness-of-fit tests and discuss the difference between the CARP method and existing approaches to dimensionality assessment in nonparametric IRT. We note that the CARP test has been developed for a single focal pair, so that aggregation over multiple item pairs is a topic for future research. The alternatives that we discuss typically aggregate over all item pairs. This is true for the chi-square and RMSEA in factor analysis, the Q1 (Van den Wollenberg, 1982), R1 (Glas, 1988), and M2 (Maydeu-Olivares & Joe, 2006) statistics in logistic IRT models, and DETECT (Zhang & Stout, 1999b) in nonparametric IRT. Our discussion therefore cannot include a quantitative comparison with respect to statistical power.
Comparison with Parametric Goodness-of-Fit Tests
Two reviewers suggested comparing the CARP procedure with a method in which a parametric multidimensional model is fitted first and then a parametric unidimensional model is compared with it, using a goodness-of-fit test such as a likelihood-ratio test. We will call this method the parametric goodness-of-fit comparison (PGC). Examples of this are: (1) fitting multiple linear factor models, assuming normal distributions, and comparing their chi-square statistics, RMSEAs, or eigenvalues. We mention this possibility because of its popularity in psychology; (2) the same approach using logistic IRT models instead of linear factor models (e.g., Bartolucci, 2007; Christensen et al., 2002); (3) testing different latent class models (e.g., Bartolucci et al., 2017; Ligtvoet & Vermunt, 2012; Van Onna, 2002; Vermunt, 2001). The idea is that a latent class model can approximate nonparametric multidimensional and unidimensional models if the number of latent classes is large enough; and (4) testing monotonic polynomial models (Falk & Cai, 2016). These models can approximate multidimensional and unidimensional models if the number of polynomial terms is large enough, and therefore a similar strategy can be used.
We find this approach interesting, but we are not yet convinced that in the long run it is more helpful than our approach, which is focused on critical data patterns, such as negative covariances, rather than on comparing goodness-of-fit statistics. Our reservations concerning the PGC methods are the following. First, it is generally difficult to know how many dimensions the multidimensional model should have and how this choice influences the decision on the unidimensional model. Second, it is unclear to what extent the auxiliary assumptions (linearity, logistic response function, normality, number of latent classes, number of polynomial terms) influence the goodness-of-fit of the unidimensional model. Third, if a goodness-of-fit test indicates that the unidimensional model is wrong, it might not be clear which items are causing the problem. For some models, item-fit statistics have been proposed (e.g., Sijtsma & Van der Ark, 2021) that must be used in combination with statistics assessing the fit of sets of items. Another variant is that an alternative unidimensional model must be chosen, but then the large array of possibilities poses a new choice problem (which model is the most obvious choice?) and a corresponding analysis problem (how to avoid endless trial and error?). We conclude that the application of methodologies other than the one we study in this article comes with complexities that hinder their straightforward use as well as a simple comparison with our CARP methodology.
Comparison with Item Selection Procedures in Nonparametric IRT
In the context of nonparametric IRT, several procedures have been proposed to assess the dimensionality of an item set. The automated item selection procedure (AISP; Mokken, 1971; Sijtsma & Molenaar, 2002) uses a bottom-up algorithm to select items in unidimensional subsets based on a definition of a scale that uses nonnegative inter-item covariances and positive scalability coefficients. Straat et al. (2013) proposed a genetic algorithm to replace the AISP and remedy some of its peculiarities. The goal of both procedures is to have as many items as possible in the first scale, as many of the remaining items (if available) in the second scale, and so on. Zhang and Stout (1999b, p. 239) defined the “bias-corrected estimator for the theoretical DETECT index” as a weighted average of conditional covariances of item pairs, given the sum score and given the rest score, where the DETECT weights are such that the pair (i, j) contributes positively if and only if both items are in the same cluster. Next, Zhang and Stout try to find the partition that maximizes this index using a heuristic procedure. Roussos, Stout, and Marden (1998) proposed an agglomerative hierarchical cluster analysis for finding subsets of items, using the software package HCA/CCPROX. The procedure provides a choice among different statistics for assessing the relationship between items, including covariances conditional on rest scores that exclude items known to be in already formed clusters, and among different agglomerative hierarchical clustering methods. They did not use a formal criterion for identifying a final solution but rather left this to the researcher to decide, for example, based on theoretical expectations of the item set’s dimensionality. The DIMTEST procedure assesses the hypothesized unidimensionality of a user-specified item set (Nandakumar & Stout, 1993; Stout, 1987).
Thus, unlike the other procedures, DIMTEST is confirmatory and cannot directly be used to partition items in different clusters in an exploratory analysis. Several variations on the original procedure have been proposed; see Stout et al. (2001) and Kieftenbeld & Nandakumar (2015). Van Abswoude, Van der Ark, and Sijtsma (2004) systematically compared the methods.
The CARP procedure is different from these and other item selection procedures proposed in the nonparametric IRT context (e.g., Brusco, Köhn, & Steinley, 2015). It shares with several of these procedures a certain open-endedness caused by the complexities typical of a fine-grained analysis of the data: involving many item pairs or item subsets, subdividing the sample into score groups, dealing with finite sample sizes and empty or near-empty cells in contingency tables, and combining many detailed results into one useful conclusion about the dimensionality of an item set. Because so many arbitrary researcher decisions are needed to obtain a result, not only for the CARP procedure but also for the other procedures, many precautions are needed to compare them thoroughly. This is a project requiring a separate study.
Discussion
We developed the CARP test, which often distinguishes data generated by a two-dimensional model from data generated by a unidimensional monotone model, even if the data are MTP and have MM. The test uses CA and can be viewed as a generalization of Rosenbaum’s (1984) proposal to test the covariance of each item pair conditionally on their unweighted rest score (the CRS test). The CARP test conditions on a weighted rest score, where the weights are based on regression analyses in a training sample consisting of a fraction of the total sample. Each of the two items in a focal pair (i, j) is used as the dependent variable in a linear regression analysis that predicts it from the remaining items. The sum of the two predicted scores is computed in the test sample and is used as the weighted rest score. The weighted rest score divides the test sample into decile groups, and a directional Mantel–Haenszel test assesses whether the covariance of (i, j) is nonnegative in each decile group.
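The procedure just summarized can be sketched as follows; the Mantel–Haenszel step is simplified to a normal approximation of a stratified covariance statistic, so this is an illustrative outline rather than the exact implementation.

```python
import numpy as np

def carp_test(X, i, j, train_frac=0.5, n_groups=10, rng=None):
    """Sketch of the CARP test for focal pair (i, j) on binary data X (N x J).

    Returns a directional Z statistic; large negative values indicate a
    negative conditional covariance, i.e., evidence against unidimensionality.
    """
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    idx = rng.permutation(N)
    cut = int(train_frac * N)
    train, test = X[idx[:cut]], X[idx[cut:]]

    rest = [k for k in range(X.shape[1]) if k not in (i, j)]

    def ols_weights(sample, target):
        # Regress the focal item on the remaining items in the training sample.
        A = np.column_stack([np.ones(len(sample)), sample[:, rest]])
        beta, *_ = np.linalg.lstsq(A, sample[:, target], rcond=None)
        return beta

    bi, bj = ols_weights(train, i), ols_weights(train, j)

    # Weighted rest score in the test sample: sum of the two predicted scores.
    A_test = np.column_stack([np.ones(len(test)), test[:, rest]])
    w = A_test @ bi + A_test @ bj

    # Divide the test sample into decile groups of the weighted rest score
    # and accumulate within-group covariance contributions.
    edges = np.quantile(w, np.linspace(0, 1, n_groups + 1))
    groups = np.clip(np.searchsorted(edges[1:-1], w), 0, n_groups - 1)
    num, var = 0.0, 0.0
    for g in range(n_groups):
        xi, xj = test[groups == g, i], test[groups == g, j]
        n = len(xi)
        if n < 2:
            continue
        num += np.sum((xi - xi.mean()) * (xj - xj.mean()))
        var += n * xi.var() * xj.var()   # crude variance term (assumption)
    return num / np.sqrt(var) if var > 0 else 0.0
```
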
Data generated by means of unidimensional logistic models showed that the Type I error rate is under control, even if the overall inter-item correlations are 0. Simulations with two-dimensional logistic models showed that the power of the CARP test exceeds the power of the CRS test if one dimension has three times more items than the other dimension. Simulations with three-dimensional logistic models showed that the power of the CARP test exceeds the power of the CRS test if the third dimension has at least a third of the items. In the extreme two-dimensional case, where both dimensions have the same number of items with equal loadings and difficulty parameters for all items, the CARP test converges to the CRS test as the sample size increases. Thus, in comparison with Rosenbaum’s (1984) CRS test, our CARP test gains power in a variety of multidimensional cases at the cost of losing some power in extreme two-dimensional cases with equally important dimensions. Because tests are usually aimed at being unidimensional, with most of the items indeed targeting this dimension, the results for the CARP test are positive.
We noted that, for multiple focal pairs, compounding the test statistics can increase the power. The CARP method looks promising, but as with any newly developed method it also raises questions for future research. First, what are the optimal values of L (the size of the training sample) and m (the number of groups in conditioning), and how do these values depend on N and J? Second, in the cross-validation, rather than drawing one training sample, one might repeat drawing and then aggregate the results over draws, thus reducing the variability of the outcomes. Which aggregation rules are suitable? Third, how can one compound test results for multiple item pairs? Fourth, a more elaborate study could be done of the dependence of the power on the number of items, the number of dimensions, the shape of the response function (logistic or other), and the item parameters. Fifth, the CARP inequalities also hold for polytomous items. Which test procedures are most useful? Rosenbaum (1984, p. 429) provides suggestions. Sixth, how does the power profile of the CARP test compare to the semiparametric methods of Bartolucci (2007) and Falk and Cai (2016)?
In our analysis, we assumed a priori that conditional independence holds, which is consistent with the fact that for a finite number of binary items, without other restrictions, conditional independence is a “vacuous assumption” (Holland & Rosenbaum, 1986, p. 1525). Moreover, assuming monotonicity, we developed the CARP test as a test of unidimensionality versus multidimensionality. However, if the CARP test points to a violation of MH, this cannot be attributed to a single assumption. An alternative model may thus assume local dependence or correlated errors instead of multidimensionality.
The CARP method can be a useful addition to the existing methods for testing MH and detecting multidimensionality in monotone models. It may help answer a fundamental empirical question without relying on features of parametric models that are irrelevant to the research question. We already mentioned the present CHC intelligence representation using multiple factors (Wasserman, 2019) that is based on parametric—mostly linear—models. This choice is mathematically convenient but may be irrelevant for distinguishing the factors and damaging when it dominates the data analysis. A significant negative covariance obtained in a CARP test would demonstrate that the distinction between intelligence factors is not an artifact of the parametric assumptions, and it would rule out every unidimensional monotone model for intelligence. This is another topic for future research.
Appendix A
In this appendix, we state the precise definitions of the various models and conditions that are relevant here. Consider a vector of binary manifest variables, X = (X_1, ..., X_J). Variable X_i represents the scores (1 correct, 0 incorrect) subjects obtained on the i-th item. Suppose that in the probability space of X there is some random vector, Θ, which represents the latent variables. We will use the following conditions (Holland & Rosenbaum, 1986; Mokken, 1971; Rosenbaum, 1984), adapted to binary manifest variables:
(conditional independence). X is conditionally independent given Θ if X_1, ..., X_J are mutually independent given Θ.
(monotonicity). X is monotone with Θ if P(X_i = 1 | Θ = θ) is monotonically increasing in each coordinate of θ for all i.
(unidimensionality). Θ is unidimensional if Θ is a single real-valued variable.
Definition 1
(monotone latent variable model; Holland & Rosenbaum, 1986). (X, Θ) is a monotone latent variable (MLV) model if X is conditionally independent given Θ and X is monotone with Θ. If, additionally, Θ is unidimensional, then (X, Θ) is a unidimensional monotone latent variable model.
Mokken (1971), Mokken and Lewis (1982) introduced the unidimensional monotone latent variable model for binary items as the monotone homogeneity (MH) model. Ellis and Junker (1997) reformulated Definition 1 such that it is a property of X rather than (X, Θ):
Definition 2
(monotone homogeneity). X satisfies a unidimensional monotone latent variable model or monotone homogeneity (MH) if there exists a unidimensional variable Θ such that X is conditionally independent given Θ and X is monotone with Θ.
Ellis (2015) studied a narrower formulation of monotone models that will be called ‘monotone factor models’ here. His assumptions can be rephrased as follows:
MF1. Each item score X_i is obtained by applying an increasing function (henceforth called a response function) to an item-specific latent variable, the latent response.
MF2. The common parts of the latent responses equal a real loading matrix times an increasing function of a multivariate vector with independent components.
MF3. The unique factors or error variables are independent of each other and independent of the common factors.
MF4. The error variables have densities that are Pólya frequency functions of order 2 (PF2; Efron, 1965).
MF5. The loading matrix has a simple structure where every manifest variable loads positively on one factor and zero on the other factors.
Assumption MF1 means that the item responses are a dichotomization of underlying item-specific latent variables that one may view as latent responses. This has been discussed earlier in the context of the normal ogive model (Takane & de Leeuw, 1987), but normality is not assumed here. Assumption MF2 specifies that the common parts of the latent responses have a monotone relationship with the same underlying, more fundamental factors in . These underlying factors should be independent. MF3 states that the unique factors or error variables are independent, which is comparable to conditional independence. Assumption MF4 is equivalent to the assumption that the distributions of are log-concave (e.g., Saumard & Wellner, proposition 2.3) or strongly unimodal (Walther, 2009, p. 320). This assumption is satisfied in many models, as this includes normal, uniform, gamma, beta, and logistic densities. Assumption MF5 requires a simple structure of the factor loadings.
Definition 3
(monotone factor model). X satisfies a monotone factor model (MFM) if there exist latent variables and a loading matrix such that MF1 – MF5 hold.
In other words, an MFM has factors that are independent, nonnegative loadings with a simple structure, strongly unimodal latent errors, and increasing response functions.
We will now define the concept of multivariate total positivity of order 2 (MTP) and related concepts. Let 𝒳 be a product lattice in Euclidean space. Let the lattice operators ∧ and ∨ be defined coordinatewise: for all x, y ∈ 𝒳, (x ∧ y)_i = min(x_i, y_i) and (x ∨ y)_i = max(x_i, y_i).
Definition 4
(MTP; Karlin and Rinott (1980)). A random vector on product lattice 𝒳 is MTP if it has density f and, for all x, y ∈ 𝒳, f(x)f(y) ≤ f(x ∧ y)f(x ∨ y).
MTP generalizes the idea of a positive correlation and is also known as supermodularity, the FKG condition (Denuit et al., 2005) or affiliation (Milgrom & Weber, 1982). Denote the correlation between variables X and Y by ρ(X, Y).
Definition 5
(nonnegative partial correlations, NPC; Ellis (2014)). A random vector has nonnegative partial correlations (NPC) if for every triplet (X, Y, Z) of its variables, the partial correlation of X and Y given Z is nonnegative.
Definition 6
(nonnegative covariances, NNC; Mokken (1971)). A random vector has nonnegative covariances (NNC) if Cov(X, Y) ≥ 0 for every pair (X, Y) of its variables.
For any item i, we define the rest score as R(i) = X_1 + ... + X_J − X_i; that is, the sum score with the score of item i omitted.
Definition 7
(manifest monotonicity, MM; Junker (1993)). X has manifest monotonicity (MM) if E[X_i | R(i) = r] is increasing in r for each item i.
The following proposition is mostly implied by Corollary 3 of Ellis (2015), except that we do not require to be PF2 here.
Proposition A1
If satisfies a MFM, then is MTP.
Proof
This can be derived from three elementary facts about MTP and PF2 variables: (1) independent variables are MTP (Karlin & Rinott, 1980, proposition 3.5), (2) MTP is preserved by increasing transformations (Karlin & Rinott, 1980, proposition 3.6; Ellis, 2015, proposition 8), and (3) the sum of MTP and independent PF2 variables is MTP (Karlin & Rinott, 1980, proposition 3.7). Consequently, assuming MF1-MF5, we obtain that is MTP because it has independent components, is MTP because it is an increasing function of , and is MTP because it is a sum of MTP and independent PF2 variables. is MTP because it is an increasing transformation of .
Appendix B
In this appendix, we will prove Theorem 1, which implies that MFMs satisfy MM. For ease of notation, we first extend the definition of conditional expectations such as E[X | R = r]. Suppose R is a bounded nonnegative integer-valued variable and X is a binary variable. If P(R = r) = 0, then E[X | R = r] is not uniquely defined, and we extend the definition of E[X | R = r] to all r, including cases with P(R = r) = 0, such that it remains increasing in r, provided that it is increasing to begin with. The precise values are not relevant. In other words, if E[X | R = r] is undefined for some value of r, then one possibility is to set it equal to 0 if r lies below the support of R, equal to 1 if r lies above the support of R, and equal to the average of the two surrounding defined values otherwise. Similarly, if S is another random variable, we can define E[X | R = r, S = s] in cases with probability zero, and E[X | R + S = t] can be defined accordingly.
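One such extension can be sketched as follows, reading the rule as: 0 below the support, 1 above it, and the average of the two neighboring defined values in interior gaps.

```python
def extend_cond_mean(defined, r_min, r_max):
    """Extend E[X | R = r] to all integers r in [r_min, r_max].

    defined: dict mapping r to E[X | R = r] for the r with P(R = r) > 0.
    Undefined values are set to 0 below the support, 1 above it, and the
    average of the two surrounding defined values in interior gaps, so the
    extension stays increasing whenever the defined part is increasing.
    """
    support = sorted(defined)
    out = {}
    for r in range(r_min, r_max + 1):
        if r in defined:
            out[r] = defined[r]
        elif r < support[0]:
            out[r] = 0.0          # below the support of R
        elif r > support[-1]:
            out[r] = 1.0          # above the support of R
        else:
            below = max(s for s in support if s < r)
            above = min(s for s in support if s > r)
            out[r] = (defined[below] + defined[above]) / 2
    return out
```
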
Lemma 1
Let R and S be bounded nonnegative integer-valued variables and let X be a variable with finite expectation. If E[X | R = r] is increasing in r and (X, R) is independent of S, then E[X | R + S = t] is increasing in t.
Proof
Since S is independent of X and R, we have for every ,
Therefore,
Theorem 1
Suppose that the set of test items can be divided into disjoint subtests such that different subtests are independent while items within the same subtest satisfy MH. Then, MM holds for the entire set of test items; that is, E[X_i | R(i) = t] is increasing in t for each item i.
Proof
Consider an arbitrary item for which MM should be established. Without loss of generality, we may assume that this item is X_1 and that the first subtest is (X_1, ..., X_k). With R = X_2 + ... + X_k and S = X_{k+1} + ... + X_J we can write R(1) = R + S, where X_1 and R are independent of S. Since (X_1, ..., X_k) satisfies MH, it must also satisfy MM; that is, E[X_1 | R = r] is increasing in r. Using Lemma 1, we obtain that E[X_1 | R(1) = t] is increasing in t.
Theorem 2
Suppose that the set of test items can be divided into two disjoint subtests such that different subtests are independent while items within the same subtest satisfy MH. If two items X_i and X_j belong to different subtests, then E[Cov(X_i, X_j | T)] ≤ 0, where T denotes the sum score of the remaining J − 2 items.
Proof
Without loss of generality, we can assume, for notational convenience, that the first subtest is (X_1, ..., X_k) and the second subtest is (X_{k+1}, ..., X_J), and that i = 1 and j = J. As a consequence of MM within the first subtest, E[X_1 | X_2 + ... + X_k = r] is increasing in r. Now, apply Lemma 1 with R = X_2 + ... + X_k and S = X_{k+1} + ... + X_{J−1}. It follows that E[X_1 | T = t] is increasing in t. Similarly, E[X_J | T = t] is increasing in t too. Therefore, Cov(E[X_1 | T], E[X_J | T]) ≥ 0. Since Cov(X_1, X_J) = 0 and, by the law of total covariance,
Cov(X_1, X_J) = E[Cov(X_1, X_J | T)] + Cov(E[X_1 | T], E[X_J | T]),
it follows that E[Cov(X_1, X_J | T)] ≤ 0.
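The mechanism behind Theorem 2 can be illustrated numerically: for two items from independent dimensions, the average covariance conditional on the rest score of the pair is nonpositive. The toy model below (two independent dimensions, logistic items with zero difficulty) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
th1, th2 = rng.standard_normal(N), rng.standard_normal(N)  # independent dims

def item(theta):
    # Logistic (Rasch-type) item with difficulty 0.
    return (rng.random(N) < 1 / (1 + np.exp(-theta))).astype(int)

# X1 measures dimension 1, X6 measures dimension 2; the rest score T of the
# pair mixes two further items from each dimension.
X1, X6 = item(th1), item(th2)
T = sum(item(th1) for _ in range(2)) + sum(item(th2) for _ in range(2))

# Average conditional covariance E[Cov(X1, X6 | T)].
avg_cond_cov = 0.0
for t in range(int(T.max()) + 1):
    m = T == t
    if m.sum() > 1:
        avg_cond_cov += m.mean() * np.cov(X1[m], X6[m])[0, 1]
# Unconditionally X1 and X6 are independent, so by the law of total
# covariance the average conditional covariance must be <= 0.
```
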
This can be generalized to weighted sum scores, provided that MM still holds with respect to the weighted sum scores of the subtests, as stated in the following lemma and theorem.
Lemma 2
Let R and S be real-valued random variables with finite range and let X be a variable with finite expectation. If E[X | R = r] is increasing in r and (X, R) is independent of S, then E[X | R + S = t] is increasing in t.
Proof
Extend the definition of and to all , just as we did prior to Lemma 1. Since S is independent of X and R, we have for every , :
Therefore,
Theorem 3
Suppose that the set of test items can be divided into two disjoint subtests that are independent of each other, with X_i in the first subtest and X_j in the second. Let W_1 and W_2 be weighted sum scores of the first and second subtests, respectively, omitting X_i and X_j, with weights such that both E[X_i | W_1 = t] and E[X_j | W_2 = t] are increasing in t. Then, E[Cov(X_i, X_j | W_1 + W_2)] ≤ 0.
Proof
By Lemma 2, E[X_i | W_1 + W_2 = t] and E[X_j | W_1 + W_2 = t] are both increasing in t. Therefore, their covariance is nonnegative, i.e., Cov(E[X_i | W_1 + W_2], E[X_j | W_1 + W_2]) ≥ 0. The hypothesis of the theorem implies that Cov(X_i, X_j) = 0, and the conclusion follows by the law of total covariance.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Bartolucci F. A class of multidimensional IRT models for testing unidimensionality and clustering items. Psychometrika. 2007;72:141–157. doi: 10.1007/s11336-005-1376-9. [DOI] [Google Scholar]
- Bartolucci F, Farcomeni A, Scaccia L. A nonparametric multidimensional latent class IRT model in a Bayesian framework. Psychometrika. 2017;82:952–978. doi: 10.1007/s11336-017-9576-7. [DOI] [PubMed] [Google Scholar]
- Bartolucci F, Forcina A. A likelihood ratio test for MTP2 within binary variables. The Annals of Statistics. 2000;28:1206–1218. doi: 10.1214/aos/1015956713. [DOI] [Google Scholar]
- Bartolucci F, Forcina A. Likelihood inference on the underlying structure of IRT models. Psychometrika. 2005;70:31–43. doi: 10.1007/s11336-001-0934-z. [DOI] [Google Scholar]
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B. 1995;57:289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x. [DOI] [Google Scholar]
- Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics. 2001;29:1165–1188. doi: 10.1214/aos/1013699998. [DOI] [Google Scholar]
- Brusco MJ, Köhn HF, Steinley D. An exact method for partitioning dichotomous items within the framework of the monotone homogeneity model. Psychometrika. 2015;80:949–967. doi: 10.1007/s11336-015-9459-8. [DOI] [PubMed] [Google Scholar]
- Clarke B, Yuan A. Manifest characterization and testing for certain latent properties. Annals of Statistics. 2001;29:876–898. doi: 10.1214/aos/1009210693. [DOI] [Google Scholar]
- Christensen KB, Bjorner JB, Kreiner S, Petersen JH. Testing unidimensionality in polytomous Rasch models. Psychometrika. 2002;67:563–574. doi: 10.1007/BF02295131. [DOI] [Google Scholar]
- De Gooijer JG, Yuan A. Some exact tests for manifest properties of latent trait models. Computational Statistics and Data Analysis. 2011;55:34–44. doi: 10.1016/j.csda.2010.04.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Denuit M, Dhaene J, Goovaerts M, Kaas R. Actuarial theory for dependent risks. Wiley; 2005. [Google Scholar]
- Douglas J, Cohen A. Nonparametric item response function estimation for assessing parametric model fit. Applied Psychological Measurement. 2001;25:234–243. doi: 10.1177/01466210122032046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Efron B. Correlated z-values and the accuracy of large-scale statistical estimates. Journal of the American Statistical Association. 2010;105(491):1042–1055. doi: 10.1198/jasa.2010.tm09129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Efron B. Increasing properties of Pólya frequency functions. The Annals of Mathematical Statistics. 1965;36(1):272–279. doi: 10.1214/aoms/1177700288. [DOI] [Google Scholar]
- Ellis JL. An inequality for correlations in unidimensional monotone latent variable models for binary variables. Psychometrika. 2014;79:303–316. doi: 10.1007/s11336-013-9341-5. [DOI] [PubMed] [Google Scholar]
- Ellis, J. L. (2015). MTP2 and partial correlations in monotone higher-order factor models. In Roger E. Millsap, Daniel M. Bolt, L. Andries van der Ark, and Wen-Chung Wang (Eds.), Quantitative Psychology Research. The 78th Annual Meeting of the Psychometric Society (pp. 261-272). Springer. 10.1007/978-3-319-07503-7_16
- Ellis JL, Junker BW. Tail-measurability in monotone latent variable models. Psychometrika. 1997;62:495–523. doi: 10.1007/bf02294640. [DOI] [Google Scholar]
- Ellis JL, Pecanka J, Goeman JJ. Gaining power in multiple testing of interval hypotheses via conditionalization. Biostatistics. 2020;21(2):e65–e79. doi: 10.1093/biostatistics/kxy042. [DOI] [PubMed] [Google Scholar]
- Falk CF, Cai L. Maximum marginal likelihood estimation of a monotonic polynomial generalized partial credit model with applications to multiple group analysis. Psychometrika. 2016;81:434–460. doi: 10.1007/s11336-014-9428-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fortuin CM, Kasteleyn PW, Ginibre J. Correlation inequalities on some partially ordered sets. Communications in Mathematical Physics. 1971;22(2):89–103. doi: 10.1007/bf01651330. [DOI] [Google Scholar]
- Glas CAW. The derivation of some tests for the Rasch model from the multinomial distribution. Psychometrika. 1988;53:525–546. doi: 10.1007/BF02294405. [DOI] [Google Scholar]
- Guttman L, Levy S. Two structural laws for intelligence tests. Intelligence. 1991;15(1):79–103. doi: 10.1016/0160-2896(91)90023-7. [DOI] [Google Scholar]
- Hambleton RK, Swaminathan H. Item response theory. Principles and applications. Kluwer Nijhoff Publishing; 1985. [Google Scholar]
- Holland PW. When are item response models consistent with observed data? Psychometrika. 1981;46(1):79–92. doi: 10.1007/bf02293920. [DOI] [Google Scholar]
- Holland PW, Rosenbaum PR. Conditional association and unidimensionality in monotone latent variable models. The Annals of Statistics. 1986;14:1523–1543. doi: 10.1214/aos/1176350174. [DOI] [Google Scholar]
- Junker BW. Conditional association, essential independence and monotone unidimensional item response models. The Annals of Statistics. 1993;21:1359–1378. doi: 10.1214/aos/1176349262. [DOI] [Google Scholar]
- Junker BW, Ellis JL. A characterization of monotone unidimensional latent variable models. The Annals of Statistics. 1997 doi: 10.1214/aos/1069362751. [DOI] [Google Scholar]
- Junker BW, Sijtsma K. Latent and manifest monotonicity in item response models. Applied Psychological Measurement. 2000;24:65–81. doi: 10.1177/01466216000241004. [DOI] [Google Scholar]
- Karlin S, Rinott Y. Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. Journal of Multivariate Analysis. 1980;10(4):467–498. doi: 10.1016/0047-259x(80)90065-2. [DOI] [Google Scholar]
- Kieftenbeld V, Nandakumar R. Alternative hypothesis testing procedures for DIMTEST. Applied Psychological Measurement. 2015;39(6):480–493. doi: 10.1177/0146621615577618. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Krantz DH, Luce RD, Suppes P, Tversky A. Foundations of measurement, Vol. I: Additive and polynomial representations. Academic Press; 1971. [Google Scholar]
- Kuritz SJ, Landis JR, Koch GG. A general overview of Mantel-Haenszel methods: Applications and recent developments. Annual Review of Public Health. 1988;9(1):123–160. doi: 10.1146/annurev.pu.09.050188.001011. [DOI] [PubMed] [Google Scholar]
- Ligtvoet R. Incomplete tests of conditional association for the assessment of model assumptions. Psychometrika. 2022 doi: 10.1007/s11336-022-09841-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ligtvoet R, Vermunt JK. Latent class models for testing monotonicity and invariant item ordering for polytomous items. British Journal of Mathematical and Statistical Psychology. 2012;65:237–250. doi: 10.1111/j.2044-8317.2011.02019.x. [DOI] [PubMed] [Google Scholar]
- Maydeu-Olivares A, Joe H. Limited information goodness-of-fit testing in multidimensional contingency tables. Psychometrika. 2006;71:713–732. doi: 10.1007/s11336-005-1295-9. [DOI] [Google Scholar]
- McDonald RP. Test theory: A unified treatment. Lawrence Erlbaum; 1999. [Google Scholar]
- Milgrom PR, Weber RJ. A theory of auctions and competitive bidding. Econometrica. 1982;50(5):1089. doi: 10.2307/1911865. [DOI] [Google Scholar]
- Mokken RJ. A theory and procedure of scale-analysis. Mouton; 1971. [Google Scholar]
- Mokken RJ, Lewis C. A nonparametric approach to the analysis of dichotomous item responses. Applied Psychological Measurement. 1982;6:417–430. doi: 10.1177/014662168200600404. [DOI] [Google Scholar]
- Mokken RJ, Lewis C, Sijtsma K. Rejoinder to “The Mokken scale: A critical discussion”. Applied Psychological Measurement. 1986;10:279–285. doi: 10.1177/014662168601000306. [DOI] [Google Scholar]
- Molenaar, I. W., & Sijtsma, K. (2000). MSP5 for Windows. A program for Mokken scale analysis for polytomous items. Groningen, The Netherlands: iec ProGAMMA.
- Rinott Y, Scarsini M. Total positivity order and the normal distribution. Journal of Multivariate Analysis. 2006;97:1251–1261. doi: 10.1016/j.jmva.2005.07.008. [DOI] [Google Scholar]
- Rosenbaum PR. Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika. 1984;49(3):425–435. doi: 10.1007/bf02306030. [DOI] [Google Scholar]
- Roussos LA, Stout WF, Marden JI. Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement. 1998;35:1–30. doi: 10.1111/j.1745-3984.1998.tb00525.x. [DOI] [Google Scholar]
- Sarkar, T. K. (1969). Some lower bounds of reliability. Tech. Report, No. 124, Dept. of Operations Research and Statistics, Stanford University.
- Saumard A, Wellner JA. Log-concavity and strong log-concavity: A review. Statistics Surveys. 2014;8:45–113. doi: 10.1214/14-SS107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sijtsma K, Molenaar IW. Introduction to nonparametric item response theory. Sage; 2002. [Google Scholar]
- Sijtsma K, Van der Ark LA. Measurement models for psychological attributes. Chapman and Hall/CRC; 2021. [Google Scholar]
- Spearman C. ‘General intelligence’, objectively determined and measured. The American Journal of Psychology. 1904;15(2):201–292. doi: 10.2307/1412107. [DOI] [Google Scholar]
- Stout W. A nonparametric approach for assessing latent trait unidimensionality. Psychometrika. 1987;52(4):589–617. doi: 10.1007/bf02294821. [DOI] [Google Scholar]
- Stout WF, Habing B, Douglas J, Kim HR, Roussos L, Zhang J. Conditional covariance-based nonparametric multidimensionality assessment. Applied Psychological Measurement. 1996;19:331–354. doi: 10.1177/014662169602000403. [DOI] [Google Scholar]
- Stout, W., Froelich, A.G., Gao, F. (2001). Using resampling methods to produce an improved DIMTEST procedure. In: Boomsma, A., van Duijn, M.A.J., Snijders, T.A.B. (eds) Essays on Item Response Theory. Lecture Notes in Statistics, vol 157. Springer, New York, NY. 10.1007/978-1-4613-0169-1_19
- Straat JH, Van der Ark LA, Sijtsma K. Comparing optimization algorithms for item selection in Mokken scale analysis. Journal of Classification. 2013;30:75–99. doi: 10.1007/s00357-013-9122-y. [DOI] [Google Scholar]
- Straat JH, Van der Ark LA, Sijtsma K. Using conditional association to identify locally independent item sets. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences. 2016;12:117–123. doi: 10.1027/1614-2241/a000115. [DOI] [Google Scholar]
- Takane Y, de Leeuw J. On the relationship between item response theory and factor analysis of discretized variables. Psychometrika. 1987;52:393–408. doi: 10.1007/bf02294363. [DOI] [Google Scholar]
- Tijmstra J, Bolsinova M. Bayes factors for evaluating latent monotonicity in polytomous item response theory models. Psychometrika. 2019;84:846–869. doi: 10.1007/s11336-019-09661-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tijmstra J, Hessen DJ, van der Heijden PGM, Sijtsma K. Testing manifest monotonicity using order-constrained statistical inference. Psychometrika. 2013;78:83–97. doi: 10.1007/s11336-012-9297-x. [DOI] [PubMed] [Google Scholar]
- Tversky A, Russo JE. Substitutability and similarity in binary choices. Journal of Mathematical Psychology. 1969;6(1):1–12. doi: 10.1016/0022-2496(69)90027-3. [DOI] [Google Scholar]
- Van Abswoude AAH, Van der Ark LA, Sijtsma K. A comparative study of test data dimensionality assessment procedures under nonparametric IRT models. Applied Psychological Measurement. 2004;28:3–24. doi: 10.1177/0146621603259277. [DOI] [Google Scholar]
- Van den Wollenberg AL. Two new test statistics for the Rasch model. Psychometrika. 1982;47:123–140. doi: 10.1007/BF02296270. [DOI] [Google Scholar]
- Van der Ark LA. Mokken scale analysis in R. Journal of Statistical Software. 2007;20:1–19. doi: 10.18637/jss.v020.i11. [DOI] [Google Scholar]
- Van der Ark LA. New developments in Mokken scale analysis in R. Journal of Statistical Software. 2012;48:1–27. doi: 10.18637/jss.v048.i05. [DOI] [Google Scholar]
- Van Onna MJH. Bayesian estimation and model selection in ordered latent class models for polytomous items. Psychometrika. 2002;67:519–538. doi: 10.1007/BF02295129. [DOI] [Google Scholar]
- Vermunt JK. The use of restricted latent class models for defining and testing nonparametric and parametric item response theory models. Applied Psychological Measurement. 2001;25:283–294. doi: 10.1177/01466210122032082. [DOI] [Google Scholar]
- Walther G. Inference and modeling with log-concave distributions. Statistical Science. 2009 doi: 10.1214/09-sts303. [DOI] [Google Scholar]
- Wasserman JD. Deconstructing CHC. Applied Measurement in Education. 2019;32:249–268. doi: 10.1080/08957347.2019.1619563. [DOI] [Google Scholar]
- Whitt W. Uniform conditional stochastic order. Journal of Applied Probability. 1980;17:112–123. doi: 10.2307/3212929. [DOI] [Google Scholar]
- Zhang J. Conditional covariance theory and DETECT for polytomous items. Psychometrika. 2007;72(1):69–91. doi: 10.1007/s11336-004-1257-7. [DOI] [Google Scholar]
- Zhang J, Stout W. Conditional covariance structure of generalized compensatory multidimensional items. Psychometrika. 1999;64(2):129–152. doi: 10.1007/bf02294532. [DOI] [Google Scholar]
- Zhang J, Stout W. The theoretical detect index of dimensionality and its application to approximate simple structure. Psychometrika. 1999;64:213–249. doi: 10.1007/BF02294536. [DOI] [Google Scholar]