Abstract
New measures of test information, termed global information, quantify test information relative to the entire range of the trait being assessed. Estimating global information relative to a non-informative prior distribution results in a measure of how much information could be gained by administering the test to an unspecified examinee. Currently, such measures have been developed only for unidimensional tests. This study introduces measures of multidimensional global test information and validates them in simulated data. Then, the utility of global test information is tested in neuropsychological data collected as part of Rush University’s Memory and Aging Project. These measures allow for direct comparison of complex tests calibrated in different samples, facilitating test development and selection.
Keywords: achievement testing, assessment, Bayesian, dimensionality, information function, psychometrics, reliability, testlets, latent variable models, test construction
Introduction
Many psychological tests are multidimensional, either intentionally or due to the presence of one or more nuisance traits. This complicates the quantification of test information, especially if the test is misspecified as unidimensional. In such cases, item parameter estimates are biased (Ansley and Forsyth, 1985; Way et al., 1988). Bias in item parameter estimates has cascading effects, affecting the accuracy of test information, and subsequently the efficiency of item selection and trait estimation (Folk and Green, 1989).
To address this problem, methods for quantifying test information for multidimensional tests have been developed, all based on multidimensional IRT. These include D-optimality methods, which maximize the Fisher information matrix (Segall, 2001), methods maximizing Kullback–Leibler divergence between sequential trait estimates (Chang and Ying, 1996; Veldkamp and van der Linden, 2002; Wang et al., 2010), and methods maximizing mutual information between sequential estimates (Mulder and van der Linden, 2009). For a review and comparison of these methods, many of which are theoretically linked, see Wang and Chang (2011).
However, these measures are still limited, because even though they are sensitive to multiple traits, they require an a priori estimate of the sample or individual’s abilities. Defining this prior can be challenging, especially when little is known about the sample or examinee’s abilities. Indeed, testing is only necessary because these abilities are unknown. To the extent that the prior is inaccurate, so too is the estimate of test information. Test information quantified relative to a point estimate or narrow prior distribution is sometimes termed “local” test information. It would be useful to quantify “global” test information, that is, information calculated relative to a prior distribution that spans the entire trait of interest. Optimally, such a prior would be non-informative, so that the resulting metric would reflect the potential informativeness of the test for an unspecified person or sample. This measure could therefore be used when abilities are unknown, such as at the beginning of a computer adaptive test. Chang and Ying (1996) developed a measure of global information by averaging test information over a reasonable range of the traits being assessed. This is equivalent to calculating information relative to a uniform prior. Here, test information is calculated relative to the reference prior, which is the minimally informative prior for a given test. Reference priors have previously been used to quantify global information for unidimensional tests (Markon, 2013). Here, the principle is extended to more complicated test structures.
In sum, it is desirable to estimate test information in a way that (1) is sensitive to test structure, and (2) reflects a lack of information about sample abilities. The following analyses demonstrate how multidimensional reference priors can be used to quantify global test information for multidimensional tests, and to decompose multidimensional global test information into information about component traits.
Criterion Information Utility
As stated above, it is desirable to estimate a test’s potential to update the estimate of an individual’s ability. This can be conceptualized as the difference between the prior and posterior estimates of the latent trait. Kullback–Leibler divergence, or information utility, is a measure of the difference between two probability distributions (Kullback and Leibler, 1951). When those two probability distributions are the a priori estimate of θ and the a posteriori estimate, the divergence is referred to as Lindley information (Lindley, 1956)
$$\iota_L(\vec{x}) = \int \pi(\theta \mid \vec{x}) \, \log \frac{\pi(\theta \mid \vec{x})}{\pi(\theta)} \, d\theta \tag{1}$$
where π(θ) is the probability distribution of the trait estimate prior to administering the test, and $\pi(\theta \mid \vec{x})$ is the same distribution after administering the test and collecting the response data, $\vec{x}$. In the multidimensional case, Lindley information is a function of the volume between two multidimensional probability distributions, $\pi(\vec{\theta})$ and $\pi(\vec{\theta} \mid \vec{x})$.
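To make equation (1) concrete, it can be approximated on a discrete grid of θ values. The following R sketch is illustrative only: the four-item test, the N(0, 1) prior, and all object names are assumptions for the example, not values from the reported analyses.

```r
# Lindley information for one observed response pattern, approximated on a grid.
lindley_info <- function(theta, prior, likelihood) {
  prior <- prior / sum(prior)            # normalize the prior over the grid
  posterior <- likelihood * prior        # Bayes' theorem (up to a constant)
  posterior <- posterior / sum(posterior)
  keep <- posterior > 0
  # Kullback-Leibler divergence of the posterior from the prior, in nats
  sum(posterior[keep] * log(posterior[keep] / prior[keep]))
}

# Example: a hypothetical 4-item 2PL test with an all-correct response pattern
theta <- seq(-6, 6, length.out = 241)
a <- c(1.2, 0.9, 1.5, 1.1); b <- c(-0.5, 0.0, 0.5, 1.0)
p_correct <- sapply(seq_along(a), function(j) plogis(a[j] * (theta - b[j])))
lik <- apply(p_correct, 1, prod)         # likelihood of the all-correct pattern
lindley_info(theta, dnorm(theta), lik)   # information gained over a N(0, 1) prior
```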
Lindley information, like other measures of test information, depends on θ as well as the observed response pattern $\vec{x}$. To minimize dependence on an a priori estimate of the examinee's ability, one can define the prior, π(θ), as the distribution that reflects a complete lack of information about the trait. This prior can be conceptualized as the distribution that minimizes the informativeness of the prior or, conversely, maximizes the expected difference between the prior and posterior trait distributions. This prior is called the reference prior, π_r(θ) (Berger et al., 2009; Bernardo, 1979a, 1979b; Clarke and Barron, 1994). Note that the reference prior is a function of test parameters. Therefore, even though Lindley information is a function of π(θ), through the reference prior's dependence on item parameters, Lindley information becomes purely a function of item parameters. Methods for calculating reference priors are discussed in Appendix A.
To calculate Lindley information independent of a single observed response pattern $\vec{x}$, Lindley information is calculated for all possible response patterns. If items are dichotomous, there are $2^j$ possible response patterns, where j is the number of items. Here I focus on dichotomous items, but the relevant expressions can all be extended to other response formats. Lindley information for each response pattern is weighted by the probability of observing that pattern and summed over all patterns. This quantity is called expected information utility (Bernardo, 1979a). When the prior is specified as the reference prior for the trait of interest, π_r(θ), the quantity is unidimensional criterion information utility (ud-ι_c) (Markon, 2013)
$$\text{ud-}\iota_c = \sum_{\vec{x}} p(\vec{x}) \int \pi_r(\theta \mid \vec{x}) \, \log \frac{\pi_r(\theta \mid \vec{x})}{\pi_r(\theta)} \, d\theta \tag{2}$$
Kullback–Leibler divergence, and therefore criterion information utility, ranges from 0 to infinity. Both are measured in nats if the logarithm in equation (1) is base e, and in bits if the logarithm is base 2. Note that, although not shown, both terms in this expression are also a function of item parameter estimates, which are fixed. Monte Carlo approximation of ι_c is described in Appendix B.
Normalized Minimum Reduction in Uncertainty
Criterion information utility has a natural upper and lower bound. The upper bound, ι_u, is defined by the entropy of the prior, because the test cannot convey more information than what is unknown about the trait (Clarke and Barron, 1994)
$$\iota_u = -\int \pi_r(\theta) \, \log \pi_r(\theta) \, d\theta \tag{3}$$
Notably, because the reference prior maximizes missing information about a given trait, it also maximizes entropy, and the upper limit of criterion information utility. By maximizing entropy, test information is scaled relative to the sample in which it performs best, and is therefore an optimistic estimate of test information. This is akin to the “drunkard’s search,” in which a drunk man who has lost his keys looks for them under the streetlight, because that is where the light is (Kaplan, 1973). While this is not a good way to find keys, it is an excellent method for evaluating the strength of the streetlight. Test information calculated relative to the reference prior is a way of evaluating the test, which indirectly facilitates trait estimation, rather than maximizing information for a specific person or sample.
The lower bound of criterion information utility, ι_l, is an approximation of the Kullback–Leibler divergence between the reference prior and posterior trait distribution, first shown in Clarke and Barron (1990) and expanded upon in Clarke and Barron (1994)
$$\iota_l = \frac{1}{2} \log \frac{1}{2 \pi e} + \int \pi_r(\theta) \, \log \frac{\sqrt{\mathcal{I}(\theta)}}{\pi_r(\theta)} \, d\theta \tag{4}$$
where $\mathcal{I}(\theta)$ represents the Fisher information and π is the mathematical constant. The first term approximates the posterior density of θ using the normal distribution. In the second term, Fisher information is inversely related to the expected standard error of measurement and directly related to the precision of the trait estimate, meaning that the lower bound is a measure of the test's precision over the range of the trait.
The lower bound divided by the upper bound results in a scaled measure of global information called the normalized minimum reduction in uncertainty (NMRU) (Markon, 2013)
$$\text{NMRU} = \frac{\iota_l}{\iota_u} \tag{5}$$
Normalized minimum reduction in uncertainty ranges from 0 to 1. It quantifies the precision of a test relative to the entropy of the trait it measures. Since NMRU is a normalized measure of criterion information utility, it can also be interpreted as the proportion of information about a trait that can be gained by administering the test. When NMRU is near 0, administering the test will not significantly update the trait estimate from what is described by the reference prior. To the extent that NMRU nears 1, administering the test will cause the posterior distribution to become more peaked and narrow, as the test has conveyed more information about a person’s trait standing.
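As an illustration, the bounds in equations (3) and (4) and their ratio can be approximated on a grid. The sketch below is a hypothetical example rather than the study code: it assumes the lower-bound form shown in equation (4), a 16-item 2PL test with evenly spaced difficulties, and a wide normal density standing in for the reference prior.

```r
# Upper bound (entropy of the prior), approximate lower bound, and NMRU on a grid.
nmru_grid <- function(theta, prior, test_info) {
  dt <- theta[2] - theta[1]
  prior <- prior / sum(prior * dt)                  # normalize as a density
  iota_u <- -sum(prior * log(prior) * dt)           # entropy of the prior, equation (3)
  iota_l <- 0.5 * log(1 / (2 * pi * exp(1))) +      # normal-approximation constant
    sum(prior * log(sqrt(test_info) / prior) * dt)  # precision relative to the prior
  c(lower = iota_l, upper = iota_u, NMRU = iota_l / iota_u)
}

# Hypothetical 16-item 2PL test: Fisher information I(theta) = sum_j a_j^2 P_j Q_j
theta <- seq(-10, 10, length.out = 241)
a <- rep(1.28, 16); b <- seq(-2, 2, length.out = 16)
info <- rowSums(sapply(seq_along(a), function(j) {
  p <- plogis(a[j] * (theta - b[j])); a[j]^2 * p * (1 - p)
}))
nmru_grid(theta, dnorm(theta, 0, 3), info)   # normal density as a stand-in prior
```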
Multidimensional Criterion Information Utility (md-ι_c)
Unidimensional criterion information utility (equation (2); from here forward called ud-ι_c to distinguish it from multidimensional and marginal criterion information) is a function of two components: Lindley information and the probability of the data given the reference prior. Both terms can be extended to the multidimensional case. Multidimensional Lindley information is a function of the volume between two probability densities, given by
$$\iota_L(\vec{x}) = \int \pi_r(\vec{\theta} \mid \vec{x}) \, \log \frac{\pi_r(\vec{\theta} \mid \vec{x})}{\pi_r(\vec{\theta})} \, d\vec{\theta} \tag{6}$$
where $\pi_r(\vec{\theta})$ is the multidimensional reference prior for a vector of traits $\vec{\theta}$. By extension, multidimensional criterion information is
$$\text{md-}\iota_c = \sum_{\vec{x}} p(\vec{x}) \int \pi_r(\vec{\theta} \mid \vec{x}) \, \log \frac{\pi_r(\vec{\theta} \mid \vec{x})}{\pi_r(\vec{\theta})} \, d\vec{\theta} \tag{7}$$
Multidimensional NMRU (md-NMRU)
The terms of unidimensional NMRU (ud-NMRU), equations (3) and (4), can also be extended to the multidimensional case. The upper bound of global information is the entropy of the multidimensional reference prior
$$\iota_u = -\int \pi_r(\vec{\theta}) \, \log \pi_r(\vec{\theta}) \, d\vec{\theta} \tag{8}$$
And the lower bound is (Bodnar and Elster, 2014)
$$\iota_l = \frac{d}{2} \log \frac{1}{2 \pi e} + \int \pi_r(\vec{\theta}) \, \log \frac{\sqrt{\det \mathcal{I}(\vec{\theta})}}{\pi_r(\vec{\theta})} \, d\vec{\theta} \tag{9}$$
where d is the number of dimensions in $\vec{\theta}$ and $\mathcal{I}(\vec{\theta})$ is the Fisher information matrix. As in the unidimensional case, multidimensional NMRU (md-NMRU) equals the lower bound divided by the upper bound.
Marginal Criterion Information Utility
To calculate test information with regard to one of multiple traits, nuisance traits are integrated out of md-ι_c. The reference prior is defined with respect to the trait of interest, θ_k, by integrating over the remaining traits $\vec{\theta}_{-k}$ (where the subscript −k indicates all traits except k) and multiplying by the marginal likelihood of the data given θ_k. This allows for the calculation of marginal Lindley information (m-ι_L)
$$\text{m-}\iota_L(\vec{x}) = \int \pi_r(\theta_k \mid \vec{x}) \, \log \frac{\pi_r(\theta_k \mid \vec{x})}{\pi_r(\theta_k)} \, d\theta_k \tag{10}$$
which quantifies the extent to which the test data change the estimate of θ_k. Marginal criterion information utility is then
$$\text{m-}\iota_c = \sum_{\vec{x}} p(\vec{x}) \int \pi_r(\theta_k \mid \vec{x}) \, \log \frac{\pi_r(\theta_k \mid \vec{x})}{\pi_r(\theta_k)} \, d\theta_k \tag{11}$$
When the traits are uncorrelated, the sum of the marginal informations, m-ι_c, will approximately equal md-ι_c.
Marginal NMRU (m-NMRU)
The upper bound of equation (11) is the entropy of the reference prior for the trait of interest, θ_k
$$\text{m-}\iota_u = -\int \pi_r(\theta_k) \, \log \pi_r(\theta_k) \, d\theta_k \tag{12}$$
And the lower bound is (Clarke and Barron, 1994)
$$\text{m-}\iota_l = \frac{1}{2} \log \frac{1}{2 \pi e} + \int \pi_r(\theta_k) \, \log \frac{\sqrt{\mathcal{I}(\theta_k)}}{\pi_r(\theta_k)} \, d\theta_k \tag{13}$$
where $\mathcal{I}(\theta_k)$ is the Fisher information marginalized with respect to θ_k. Marginal NMRU is the ratio of m-ι_l to m-ι_u. As with marginal criterion information, when traits are uncorrelated, the sum of the marginal NMRUs equals the multidimensional NMRU. Note that because information is defined with regard to the marginal rather than the conditional trait distribution, marginal information for two correlated traits will itself be correlated.
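The marginalization can be illustrated with a short sketch: a bivariate prior tabulated on a grid is integrated over θ2 to give the marginal prior for θ1, whose entropy is the upper bound in equation (12). The bivariate normal below is only a placeholder for a numerically estimated reference prior, and all names are illustrative.

```r
# Marginal prior and marginal entropy bound from a tabulated bivariate prior.
theta1 <- seq(-10, 10, length.out = 60)
theta2 <- seq(-10, 10, length.out = 60)
d1 <- theta1[2] - theta1[1]; d2 <- theta2[2] - theta2[1]

# Placeholder joint prior; in practice, the estimated pi_r(theta1, theta2)
joint <- outer(dnorm(theta1, 0, 3), dnorm(theta2, 0, 3))
joint <- joint / sum(joint * d1 * d2)

marginal <- rowSums(joint) * d2                   # integrate out theta_2
m_iota_u <- -sum(marginal * log(marginal) * d1)   # entropy bound, equation (12)
```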
Since criterion information utility is a function of Fisher information relative to a given trait, only tests of the same trait can be compared in terms of ι_c. Since NMRU is normalized by the entropy of the reference prior, it is more general, and tests of different traits can be compared in terms of NMRU.
The Present Study
The following analyses validate measures of unidimensional, multidimensional, and marginal global test information in simulated data. A summary of the four studies is reported in Table 1. Study 1 compares existing IRT-based measures of marginal reliability to global test information in cases of test-trait mismatch, and shows that where test-trait mismatch attenuates marginal reliability, global information is constant. Study 2 shows the invariance of global test information to model misspecification, and the sensitivity of these measures in tests of uncorrelated traits, as depicted in Figure 1(b). Study 3 demonstrates the sensitivity of marginal and multidimensional global test information in tests with multiple, correlated traits, as depicted in Figure 1(c). Finally, an empirical example calculates marginal and multidimensional global test information for a number of common neuropsychological tests. Global test information is evaluated as a proxy for neuropsychological tests' criterion validity in predicting a diagnosis of probable Alzheimer's disease.
Table 1.
Summary of Study Designs.
| | Objective | Conditions | Outcome Measures |
|---|---|---|---|
| Study 1 | Assess robustness of global information to sample abilities | Test difficulty × sample ability | Reliability, E(SEM), ud-NMRU, ud-ι_c |
| Study 2 | Assess sensitivity of marginal and multidimensional test information to uncorrelated, multidimensional test structure, and model misspecification | No. of cross-loadings on nuisance trait × magnitude of cross-loadings | S1-NMRU, S2-NMRU, md-NMRU, ud-NMRU, S1-ι_c, S2-ι_c, md-ι_c, ud-ι_c |
| Study 3 | Assess sensitivity of marginal and multidimensional test information to correlated multidimensional test structure | Correlation between traits × no. of cross-loadings on nuisance trait × magnitude of cross-loadings | S1-NMRU, S2-NMRU, G-NMRU, md-NMRU, S1-ι_c, S2-ι_c, G-ι_c, md-ι_c |
| Empirical example | Evaluate association between test information and criterion validity | Nine commonly used neuropsychological exams, assessed in a population-representative sample of older Americans evaluated for dementia | Reliability, E(SEM), m-NMRU, m-ι_c, correlation with Alzheimer's disease |
Note. Reliability is marginal reliability, defined in equation (18); E(SEM) = expected standard error of measurement; NMRU = normalized minimum reduction of uncertainty; ι_c = criterion information utility; S1 = primary trait; S2 = nuisance trait; md = multidimensional; ud = unidimensional; and G = general trait.
Figure 1.
(a) Study 1, test structure with one latent trait (S1). (b) Study 2, multidimensional, uncorrelated trait structure with two latent traits (S1 and S2). (c) Study 3, correlated traits, second-order test structure. Item subscripts index the test and item, respectively. λ represents item and trait loadings. Subscripts of λ index the trait and item. Item loadings are only shown for traits and items that cross-load, to reduce clutter. Disturbances are omitted for the same reason.
Study 1: Test-Trait Mismatch
Test-trait mismatch occurs when a sample's abilities are much lower or higher than the range assessed by the test. This can occur when, for example, a group of individuals with personality disorders is administered a test of normative personality function. Marginal reliability and the expected standard error of measurement are functions of the ability of the sample being assessed. As a result, they can be biased by test-trait mismatch. Because global test information is solely a function of item parameters, it is hypothesized that a mismatch between test difficulty and sample ability will not affect global test information. Study 1 tests this hypothesis by crossing test difficulty with sample ability.
Study 1 Methods
Analyses were performed in R (R Development Core Team, 2010). A 16-item, unidimensional test was simulated. The test structure is depicted in Figure 1(a). The length of the simulated test was designed to be similar to that of commonly used neuropsychological tests, such as those in the empirical data. Response data for a sample of 500 individuals was simulated according to the 2-parameter logistic model (Birnbaum, 1968)
$$P(x_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-a_j(\theta_i - b_j)]} \tag{14}$$
where θ_i is the ability of person i, b_j reflects the difficulty of item j, and a_j is the discrimination of item j. Respondents' abilities were randomly drawn from a normal distribution with a variance of 0.5 and a mean of −1.0, 0.0, or 1.0, depending on condition. Item discriminations were generated from a truncated normal distribution with a mean of 1.28, a variance of 0.50, and a floor of 0.0. Item difficulties were drawn from a normal distribution with a variance of 0.5 and a mean of −1.0, 0.0, or 1.0, depending on condition.
In all four studies, estimation was conducted in the IRT framework, yet many of the results in this paper are presented in terms of factor loadings, cross-loadings, and related features of factor analysis. This is because IRT models are closely related to factor analytic models. Specifically, the a_j and b_j parameters can be transformed into item loadings (c_j) and thresholds (r_j) by the equations (Wirth and Edwards, 2007)
$$c_j = \frac{a_j / D}{\sqrt{1 + (a_j / D)^2}} \tag{15}$$

$$r_j = \frac{b_j \, (a_j / D)}{\sqrt{1 + (a_j / D)^2}} \tag{16}$$
where D is a scaling constant equal to 1.7 (Camilli, 1994). The simulated discriminations were equivalent to a mean factor loading of 0.6, and the difficulties to threshold means of −0.78, 0, and 0.78. Crossing test difficulty with sample ability resulted in a 3 × 3 design. Each condition was replicated 20 times.
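For concreteness, the Study 1 data-generating process can be sketched as follows. The parameter values follow the text, but the code itself, including censoring at zero as a simple stand-in for the truncated normal, is an illustration rather than the original simulation script.

```r
# Simulate one replication of the Study 1 design (matched difficulty and ability).
set.seed(1)
n_persons <- 500; n_items <- 16; D <- 1.7

theta <- rnorm(n_persons, mean = 0, sd = sqrt(0.5))        # sample abilities
a <- pmax(rnorm(n_items, mean = 1.28, sd = sqrt(0.5)), 0)  # discriminations, floor 0
b <- rnorm(n_items, mean = 0, sd = sqrt(0.5))              # difficulties

# 2PL response probabilities (equation (14)) and binary responses
p <- plogis(outer(theta, b, "-") * rep(a, each = n_persons))
x <- matrix(rbinom(length(p), 1, p), nrow = n_persons)

# IRT-to-factor-analytic conversion (equations (15) and (16))
loading   <- (a / D) / sqrt(1 + (a / D)^2)
threshold <- b * (a / D) / sqrt(1 + (a / D)^2)
```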
A unidimensional confirmatory model was fit to each simulated dataset, using EM estimation in the R package “mirt” (Chalmers, 2012). The precision and accuracy of recovered item parameters are reported in Table A1, Appendix C. The recovered parameters were used to estimate the expected standard error of measurement for a subsample of 100 respondents, with expected standard error defined as
$$E(\text{SEM}) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sqrt{\mathcal{I}(\hat{\theta}_i)}} \tag{17}$$
where N is the number of examinees, indexed by i, and $\mathcal{I}(\hat{\theta}_i)$ is the Fisher information at examinee i's trait estimate. Marginal reliability was estimated as (Bechger et al., 2003; Raju et al., 2007)
$$\bar{\rho} = 1 - \frac{E(\text{SEM}^2)}{\sigma^2_{\hat{\theta}}} \tag{18}$$
That is, a function of the ratio of error variance to total score variance.
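A minimal sketch of equations (17) and (18) for a 2PL test follows, assuming the forms reconstructed above; the function and argument names are illustrative, and equation (18) is implemented in one common formulation (one minus the mean error variance over the total score variance).

```r
# Expected standard error of measurement and marginal reliability, 2PL test.
e_sem_and_reliability <- function(a, b, theta_hat) {
  # Test (Fisher) information at each trait estimate: I(theta) = sum_j a_j^2 P_j Q_j
  info <- sapply(theta_hat, function(t) {
    p <- plogis(a * (t - b)); sum(a^2 * p * (1 - p))
  })
  sem <- 1 / sqrt(info)                       # standard error of measurement
  e_sem <- mean(sem)                          # equation (17)
  rel <- 1 - mean(sem^2) / var(theta_hat)     # equation (18): 1 - error/total variance
  c(E_SEM = e_sem, reliability = rel)
}
```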
Recovered item parameters were used to calculate a reference prior for each test, as described in Appendix A, from which ud-NMRU and ud-ι_c were calculated. Differences in marginal reliability, E(SEM), ud-NMRU, and ud-ι_c between conditions were assessed using t-tests.
Since both NMRU and ι_c are functions of parameter estimates, variability due to estimation affects measures of global information. Global information calculated from the simulated parameters was compared to global information calculated from parameter estimates. The root mean squared error (RMSE) of global information is plotted as a function of the RMSE of item discrimination parameters in Figure A1, Appendix C. The accuracy of NMRU is strongly correlated with the accuracy of parameter estimates, whereas the accuracy of ι_c is not. This may be because NMRU is a deterministic function of item parameters, whereas ι_c is estimated using a stochastic process (the Monte Carlo approximation method described in Appendix B).
Study 1 Results
The effects of test-sample mismatch on marginal reliability and the expected standard error of measurement are reported in the upper half of Table 2. Marginal reliability was highest when the sample’s abilities were matched to the difficulty of the test, and decreased as the gap between difficulty and ability increased. In every case, t-tests between conditions indicated the change was statistically significant. The inverse pattern was found for expected standard error of measurement. For a given sample, E (SEM) was lowest when test difficulty matched the sample’s ability, and grew larger as the mismatch grew. All differences between conditions were statistically significant.
Table 2.
Test-Trait Mismatch.
| | | Test Difficulty | | |
|---|---|---|---|---|
| | Sample Ability | N (−1.0, 0.5) | N (0.0, 0.5) | N (1.0, 0.5) |
| Reliability (SD) | N (−1.0, 0.5) | 0.76a (0.04) | 0.72a (0.04) | 0.60a (0.05) |
| N (0.0, 0.5) | 0.74a (0.04) | 0.77a (0.03) | 0.73a (0.04) | |
| N (1.0, 0.5) | 0.61a (0.06) | 0.71a (0.03) | 0.76a (0.03) | |
| E (SEM) (SD) | N (−1.0, 0.5) | 0.62b (0.06) | 0.81b (0.08) | 1.19b (0.13) |
| N (0.0, 0.5) | 0.56b (0.06) | 0.51b (0.03) | 0.56b (0.06) | |
| N (1.0, 0.5) | 1.17b (0.07) | 0.81b (0.07) | 0.63b (0.05) | |
| ud-NMRU (SD) | N (−1.0, 0.5) | 0.32 (0.01) | 0.32 (0.01) | 0.32 (0.01) |
| N (0.0, 0.5) | 0.32 (0.01) | 0.31 (0.01) | 0.32 (0.01) | |
| N (1.0, 0.5) | 0.32 (0.01) | 0.31 (0.01) | 0.31 (0.01) | |
| ud-ι_c (SD) | N (−1.0, 0.5) | 1.31 (0.02) | 1.31 (0.02) | 1.29 (0.02) |
| N (0.0, 0.5) | 1.29 (0.02) | 1.30 (0.02) | 1.30 (0.02) | |
| N (1.0, 0.5) | 1.29 (0.02) | 1.30 (0.02) | 1.30 (0.02) | |
Note. Test difficulty increases from left to right across columns. Sample ability increases from top to bottom within each information index. Reliability is marginal reliability, defined in equation (18); E(SEM) = expected standard error of measurement, defined in equation (17); ud-NMRU = unidimensional normalized minimum reduction of uncertainty; ud-ι_c = unidimensional criterion information utility. Differences between adjacent cells marked with the same superscript are significant at p < 0.05.
The effect of test-sample mismatch on global information is reported in the lower half of Table 2. Global information was stable, regardless of whether sample abilities matched test difficulty. None of the differences between conditions were statistically significant. Normalized minimum reduction in uncertainty ranged from 0.31 to 0.32, indicating that administering the test has the potential to reduce uncertainty about ability estimates by about a third.
Study 2: Uncorrelated Trait Structure
Assuming a test is unidimensional can bias test information in cases where secondary or nuisance traits affect responses (Folk and Green, 1989). This simulation tested the effect of dimensionality on global test information by simulating data from a multidimensional uncorrelated trait structure, and fitting both the true multidimensional model and a misspecified unidimensional model to the data. Global test information was compared between the true and the misspecified model.
Study 2 Methods
Response data was simulated from two 16-item tests measuring two uncorrelated traits (S1 and S2). The test structure is depicted in Figure 1(b). S1 was considered the primary trait and S2 the nuisance trait; S2 could represent, for example, a single reading passage in a test of reading comprehension with multiple passages, or variance attributable to social desirability in an assessment of personality (Paulhus, 1981). All items from the first test loaded onto S1, and all items in the second test loaded onto S2. The purpose of the second test was primarily to fix the scale and location of the nuisance trait, so that the stability of S2 would not affect the accuracy or precision of parameter estimates for test one. Statistics for the second test are not reported.
Across conditions, 0, 4, or 8 items from the first test cross-loaded onto S2. The average magnitude of these cross-loadings also varied systematically, taking values of 0.54, 1.28, or 2.27 (0.3, 0.6, and 0.8 in the factor-analytic parameterization; Wirth and Edwards, 2007), representing weak, moderate, and strong effects of the nuisance trait, respectively. Respondents' abilities were randomly drawn from a multivariate normal distribution with means of 0.0, variances of 1.0, and a covariance of 0.0. Response data for a sample of 500 individuals was simulated according to a compensatory multidimensional 2-parameter logistic IRT model (Bonifay, 2019; Reckase, 1997)
$$P(x_{ij} = 1 \mid \vec{\theta}_i) = \frac{1}{1 + \exp[-(\vec{a}_j^{\,\prime} \vec{\theta}_i + d_j)]} \tag{19}$$
where $\vec{a}_j^{\,\prime}$ indicates the transpose of a vector of item discrimination parameters for item j, $\vec{\theta}_i$ indicates a vector of trait parameters for individual i, and $d_j$ is a single intercept per item. Crossing the number of cross-loadings with their magnitude resulted in a 3 × 3 design. As in Study 1, each condition was replicated 20 times.
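The following sketch illustrates equation (19) for one hypothetical Study 2 condition (four moderate cross-loadings on S2); the specific loading pattern, intercept distribution, and all names are assumptions for the example.

```r
# Simulate responses from a compensatory multidimensional 2PL (equation (19)).
set.seed(2)
n_persons <- 500; n_items <- 16

theta <- matrix(rnorm(n_persons * 2), ncol = 2)   # uncorrelated traits S1 and S2
A <- cbind(rep(1.28, n_items),                    # every item loads on S1
           c(rep(1.28, 4), rep(0, 12)))           # items 1-4 cross-load on S2
d <- rnorm(n_items, 0, 0.7)                       # item intercepts

# P(x_ij = 1 | theta_i) = logistic(a_j' theta_i + d_j)
p <- plogis(theta %*% t(A) + matrix(d, n_persons, n_items, byrow = TRUE))
x <- matrix(rbinom(length(p), 1, p), nrow = n_persons)
```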
Confirmatory models were fit using Metropolis-Hastings Robbins-Monro (MH-RM) estimation in the R package "mirt" (Chalmers, 2012). MH-RM estimation is a stochastic alternative to quadrature in a process that is otherwise equivalent to EM estimation, and is not a fully Bayesian estimator. In models with four or more latent traits, MH-RM estimation is significantly faster than EM estimation (Cai, 2010). Since the empirical example required estimating models with up to 10 factors, MH-RM was used for all multidimensional models (Studies 2 and 3, and the empirical example) to maximize comparability across studies.
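A minimal confirmatory "mirt" call consistent with this description might look as follows, assuming the simulated response matrix x from the sketch above; in this specification the S1-S2 covariance is left at its default value of zero.

```r
library(mirt)

# Confirmatory two-dimensional 2PL: items 1-16 on S1, items 1-4 also on S2
spec <- mirt.model("
  S1 = 1-16
  S2 = 1-4
")
fit <- mirt(data = as.data.frame(x), model = spec,
            itemtype = "2PL", method = "MHRM")   # Metropolis-Hastings Robbins-Monro
coef(fit, simplify = TRUE)                       # estimated slopes and intercepts
```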
To test the effect of model misspecification, two models were fit to the simulated response data: the true data-generating model (a two-dimensional IRT model), and an incorrect unidimensional model (a misspecification in conditions with 4 and 8 cross-loadings). The precision and accuracy of item parameter estimates from the two-dimensional model are reported in Table A2, Appendix C.
Item parameters estimated from the correct, two-dimensional model were used to calculate a multidimensional reference prior, as described in Appendix A. This prior was used to calculate multidimensional global information, as well as marginal global information for test one with regard to S1 and S2. Item parameters from the misspecified, unidimensional model were used to calculate a unidimensional reference prior, which was used to calculate unidimensional global information for test one. Differences in measures of test information between conditions were assessed using t-tests.
Study 2 Results
Table 3 reports the effect of nuisance traits and model misspecification on global information. Marginal global information for S1 was invariant to test dimensionality, and remained consistent as the number and magnitude of cross-loadings on S2 changed. Marginal global information for S2 approximately doubled as the number of cross-loadings doubled, and increased as the magnitude of those cross-loadings increased. As the number and magnitude of cross-loadings increased, so did multidimensional global information, reflecting the increase in total information that was gained as more was learned about S2.
Table 3.
Uncorrelated Traits, First-Order Structure.
| | No. of Cross-Loadings on S2 | | | Magnitude of Cross-Loadings | | |
|---|---|---|---|---|---|---|
| | 0 | 4 | 8 | 0.3 | 0.6 | 0.8 |
| S1-NMRU (SD) | 0.33 (0.01) | 0.33 (0.01) | 0.34 (0.01) | 0.33 (0.01) | 0.33 (0.01) | 0.34 (0.01) |
| S2-NMRU (SD) | NA | 0.12a (0.04) | 0.23a (0.03) | 0.14b (0.06) | 0.18b (0.06) | 0.20 (0.06) |
| md-NMRU (SD) | NA | 0.69c (0.04) | 0.72c (0.04) | 0.67d (0.03) | 0.70d (0.02) | 0.74d (0.03) |
| ud-NMRU (SD) | 0.33 (0.01) | 0.33e (0.01) | 0.34e (0.01) | 0.33 (0.01) | 0.33 (0.01) | 0.34 (0.01) |
| S1-ι_c (SD) | 2.38 (0.02) | 2.38 (0.02) | 2.38 (0.02) | 2.38 (0.03) | 2.38 (0.02) | 2.38 (0.02) |
| S2-ι_c (SD) | NA | 1.79f (0.02) | 2.06f (0.02) | 1.86g (0.02) | 1.94g (0.02) | 1.98g (0.02) |
| md-ι_c (SD) | NA | 4.01h (0.03) | 4.15h (0.03) | 4.02i (0.03) | 4.09i (0.03) | 4.13i (0.03) |
| ud-ι_c (SD) | 2.40 (0.02) | 2.40j (0.02) | 2.42j (0.02) | 2.40 (0.02) | 2.40 (0.02) | 2.41 (0.02) |
Note. S1 = primary trait; S2 = nuisance trait; md = multidimensional; ud = unidimensional; NMRU = normalized minimum reduction of uncertainty; ι_c = criterion information utility. Differences between cell values marked with the same superscript are significant at p < 0.05.
If S2 was ignored and the test was misspecified as unidimensional, global information trended lower and changed very little across conditions. This is likely because, when the model is misspecified, the estimated trait is a weighted average of the two traits the test actually measures, attenuating the effect of cross-loadings on test information.
Study 3: Correlated Trait Structure
Quantifying test information for multidimensional tests is complicated by correlations among traits, as assessment of any one trait yields information about the others. This simulation assessed the sensitivity of multidimensional and marginal global test information to a correlated trait structure by simulating tests in which the salience of the nuisance trait varied, as did the correlation between the primary and nuisance traits.
Study 3 Methods
In the second-order model, correlations among first-order traits can be represented by a second-order, general trait, onto which the first-order traits load. The test structure is depicted in Figure 1(c). When there are only two first-order traits, loadings on the general trait must be constrained to be equal to identify the model. Therefore, in order to systematically vary the loading of only one first-order trait on the general trait, three first-order traits were simulated (S1, S2, and S3), each measured by a 16-item test. As in Study 2, the dimensionality of test one varied across conditions, with 0, 4, or 8 item cross-loadings on S2. The magnitude of these cross-loadings also varied: they were drawn from a truncated normal distribution with a mean of 0.54, 1.28, or 2.27 depending on condition, a variance of 0.1, and a floor of 0.0. Item loadings for the second and third tests were also drawn from a truncated normal distribution, with a mean of 1.28, a variance of 0.5, and a floor of 0.0. The loading of S1 on G took values of 0.54, 1.28, or 2.27 across conditions, drawn from a truncated normal distribution with a variance of 0.5 and a floor of 0.0. Loadings of S2 and S3 on G were drawn from a truncated normal distribution with a mean of 1.28, a variance of 0.5, and a floor of 0.0. Respondents' abilities were randomly drawn from a multivariate normal distribution with means of 0.0 and variances and covariances defined by
$$\Sigma = \vec{\lambda} \vec{\lambda}^{\,\prime} + \Psi \tag{20}$$
where $\vec{\lambda}$ represents the vector of first-order factor loadings on G, and Ψ is a diagonal matrix with diagonal elements equal to 1 minus the squared factor loadings. In other words, the covariance matrix of respondent abilities matched the covariances among the first-order traits. The Σ matrices for the three conditions can be found in Table A3, Appendix C. In total, this yielded a 3 × 3 × 3 design. Response data for a sample of 500 individuals was simulated according to the compensatory multidimensional 2-parameter logistic IRT model defined in equation (19). As in Study 1, each condition was replicated 20 times.
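Equation (20) can be made concrete with a short sketch that builds Σ for the condition in which all first-order loadings on G equal 0.6 (matching the middle column of Table A3) and draws respondent abilities with that covariance structure; the code is illustrative, not the original simulation script.

```r
library(MASS)  # for mvrnorm

lambda <- c(S1 = 0.6, S2 = 0.6, S3 = 0.6)   # first-order loadings on G
Psi <- diag(1 - lambda^2)                   # residual trait variances
Sigma <- tcrossprod(lambda) + Psi           # equation (20): Sigma = lambda lambda' + Psi

set.seed(3)
theta <- mvrnorm(n = 500, mu = rep(0, 3), Sigma = Sigma)  # respondent abilities
round(Sigma, 2)  # off-diagonals of 0.36 match Table A3 for loadings of 0.6
```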
Confirmatory models were fit to the simulated data using Metropolis-Hastings Robbins-Monro estimation in the R package "mirt" (Chalmers, 2012). The precision and accuracy of recovered item parameters are reported in Tables A4 and A5, Appendix C. Recovered item parameters for test one were used to calculate a multidimensional reference prior, as described in Appendix A. This prior was used to calculate multidimensional global information, as well as marginal information for S1, S2, and G. Differences in test information between conditions were assessed using t-tests.
Study 3 Results
Table 4 reports global information for S1, S2, and G. Global information for S1 was constant across conditions, as anticipated. Global information for the nuisance trait, S2, increased significantly as both the number and magnitude of loadings on that trait increased, but not as the loading of S1 on G increased. Global information for G increased as the number of cross-loadings increased, reflecting the additional information about G provided via better estimates of S2. In addition, global information for G increased as the magnitude of the loading of S1 on G increased, reflecting the closer relationship between the two traits, though these increases were not significant. Changes in information were likely attenuated by the second-order test structure, leading to increases in the standard errors of estimates of global information (see row 3 of Table 4).
Table 4.
Correlated Traits, Second-Order Structure.
| | No. of Cross-Loadings on S2 | | | Magnitude of Cross-Loadings | | | Loading of S1 on G | | |
|---|---|---|---|---|---|---|---|---|---|
| | 0 | 4 | 8 | 0.3 | 0.6 | 0.8 | 0.3 | 0.6 | 0.8 |
| S1-NMRU (SD) | 0.34 (0.01) | 0.34 (0.01) | 0.34 (0.01) | 0.34 (0.01) | 0.34 (0.01) | 0.34 (0.01) | 0.34 (0.01) | 0.34 (0.01) | 0.34 (0.01) |
| S2-NMRU (SD) | NA | 0.11a (0.02) | 0.21a (0.02) | 0.14b (0.03) | 0.17b (0.02) | 0.18 (0.01) | 0.17 (0.02) | 0.16 (0.02) | 0.16 (0.02) |
| G-NMRU (SD) | 0.34 (0.09) | 0.37 (0.07) | 0.40 (0.06) | 0.37 (0.06) | 0.36 (0.09) | 0.37 (0.07) | 0.34 (0.08) | 0.37 (0.07) | 0.40 (0.06) |
| md-NMRU (SD) | NA | 0.66c (0.02) | 0.68c (0.02) | 0.55d (0.02) | 0.56d (0.01) | 0.57d (0.02) | 0.56 (0.01) | 0.56 (0.02) | 0.56 (0.02) |
| S1-ι_c (SD) | 2.40 (0.02) | 2.40 (0.02) | 2.40 (0.02) | 2.40 (0.02) | 2.39 (0.02) | 2.40 (0.02) | 2.40 (0.02) | 2.41 (0.02) | 2.40 (0.02) |
| S2-ι_c (SD) | NA | 1.87e (0.02) | 2.09e (0.02) | 1.91f (0.02) | 2.00f (0.02) | 2.03f (0.02) | 1.99 (0.02) | 1.98 (0.02) | 1.97 (0.02) |
| G-ι_c (SD) | 2.34g (0.02) | 2.40g (0.02) | 2.41g (0.02) | 2.41h (0.02) | 2.36h (0.02) | 2.38h (0.02) | 2.35i (0.02) | 2.39i (0.02) | 2.41i (0.02) |
| md-ι_c (SD) | NA | 4.14j (0.03) | 4.24j (0.03) | 4.12j (0.03) | 4.21k (0.03) | 4.25k (0.03) | 4.19 (0.03) | 4.17 (0.03) | 4.17 (0.03) |
Note. S1 = primary trait; S2 = nuisance trait; G = general trait; md = multidimensional; ud = unidimensional; NMRU = normalized minimum reduction of uncertainty; and ι_c = criterion information utility. Differences between cell values marked with the same superscript are significant at p < 0.05.
Multidimensional global test information increased with the number and magnitude of cross-loadings, reflecting the general increase in information available about both S1 and S2. As the loading of S1 on G increased, equivalent to an increase in the covariance of S1 and S2, multidimensional information was stable. This is to be expected, as the total amount of information is not a function of the covariance between traits. This can be understood intuitively by imagining a limit case in which S1 and S2 are perfectly correlated. Since information is additive, the result is a test in which all items and cross-loadings measure the same trait, equivalent to a longer, more informative test.
Empirical Example
The aim of these analyses was to evaluate the utility of marginal and multidimensional global test information in an applied setting. The structure of neuropsychological tests is of significant interest to both test developers and administrators, as omnibus, multidimensional tests are often useful for screening or preliminary stages of assessment, while unidimensional tests become important for assessing specific abilities. By quantifying global and marginal global test information for a battery of neuropsychological exams, these analyses aimed to evaluate the performance and utility of these metrics in applied settings.
Empirical Example Methods
The Memory and Aging Project (MAP) is a longitudinal study funded by the National Institute on Aging and approved by the Institutional Review Board at Rush University Medical Center (Bennett et al., 2018). The study aims to identify factors that predict dementia in those over the age of 65. These analyses use data from the baseline assessment (n = 1,489). Of participants at this time point, 73.1% were female and 87.8% were non-Hispanic white. The average age of participants was 80.1 years, and the average years of education was 14.4. At the baseline evaluation, 5.4% of the sample was considered to have some form of dementia.
The neuropsychological assessment comprised the Mini Mental Status Examination (MMSE), East Boston Memory Test (EBMT; immediate and delayed recall), Logical Memory (one story, immediate and delayed recall), the Brief Smell Identification Test, Word List Memory (three immediate recall trials, delayed recall, and delayed recognition), Complex Ideational Material, Boston Naming Test (BNT; short form), Category Fluency (fruits and animals), National Adult Reading Test (NART), Digit Span Forward, Digit Span Backward, Digit Sequencing, Symbol Digit Modalities Test (SDMT), Number Comparison, Judgment of Line Orientation (JOLO; abbreviated 15-item version), and Standard Progressive Matrices. Descriptive statistics for each test are reported in Appendix D.
Item-level response data were available for 13 of the 19 tests. These responses were scored as correct/incorrect, and modeled by the 2-parameter logistic model. For five tests, only summary scores were available. These tests were Logical Memory, the East Boston Memory Test, Verbal Fluency, Symbol Digit Modalities, and Number Comparison. These tests were included in factor analyses, allowing cognitive abilities described by these tests (namely, processing speed and episodic memory) to be reflected in the structural model. However, global information is not reported for these tests.
Exploratory multidimensional item response theory models with 1–10 traits were estimated using the R package "mirt" (Chalmers, 2012). Model fit was assessed via the Bayesian Information Criterion (BIC), which indicated that a five-factor model best fit the data. Model fit is reported in Appendix E. Four rotations of the five-factor model were estimated: a first-order uncorrelated-traits rotation, a first-order correlated-traits rotation, a second-order rotation, and a bifactor rotation. Since BIC does not differ as a function of rotation, the five-factor uncorrelated-traits model was selected on the basis of correlations between factor scores and diagnoses, the amount of variance accounted for by the general factor, and considerations related to the speed and accuracy with which the reference prior could be estimated. Figure 2 is a simplified depiction of the final model's latent structure (tests with no primary loadings are not shown). The neuropsychological battery was shown to reflect five traits: verbal ability, working memory, processing speed, vocabulary, and attention.
Figure 2.
EBMT = East Boston memory test; MMSE = mini mental status exam; Number Comp. = number comparison; Complex Id. = complex ideation; Progress. Matrices = standard progressive matrices; NART = national adult reading test; JOLO = judgment of line orientation. Dashed lines represent tests in which fewer than one-third of items load on a factor, or in the case of single-score tests, a loading less than 0.3.
Of the five factors, latent trait scores on factor 1 were most closely correlated with a diagnosis of probable Alzheimer’s. Therefore, item loadings on factor 1 were used to estimate marginal global test information for that factor.
Empirical Example Results
Table 5 reports marginal reliability (as defined in equation (18)), expected standard error of measurement (as defined in equation (17)), m-NMRU, and m-ι_c for each test. By existing metrics, Word List Learning, Judgment of Line Orientation, and Digit Span were all highly reliable (reliability > 0.75). Expected standard error of measurement was universally low. Marginal NMRU identifies the Mini Mental Status Exam and Word List Learning (immediate recall condition) as the best tests of verbal abilities. Marginal criterion information utility is also high for the Mini Mental Status Exam, as well as Word List Learning (delayed recall condition). The last column of Table 5 reports the criterion validity of these tests, as reflected in Kendall's rank correlations with diagnoses of Alzheimer's. The Mini Mental Status Exam and various trials of Word List Learning were most closely associated with Alzheimer's.
Table 5.
Reliability, Expected Standard Error of Measurement, and Kendall’s Rank Correlations of Test Scores With Diagnosis of Alzheimer’s.
| Test | Reliability | E(SEM) | m-NMRU | m-ι_c | τ(Alz.) |
|---|---|---|---|---|---|
| Mini mental status exam | 0.67 | 0.01 | 0.39 | 3.82 | 0.43* |
| Boston naming test | 0.54 | 0.02 | 0.28 | 2.18 | 0.30* |
| Word list immediate recall | 0.79 | 0.01 | 0.35 | 2.39 | 0.44* |
| Word list delayed recall | 0.71 | 0.01 | 0.24 | 3.81 | 0.52* |
| Word list recognition | 0.49 | 0.02 | 0.26 | 3.33 | 0.54* |
| Judgment of line orientation | 0.83 | 0.01 | 0.08 | 3.51 | 0.14* |
| Digit span forward | 0.80 | 0.01 | 0.14 | 2.84 | 0.18* |
| Digit span backward | 0.78 | 0.01 | 0.21 | 3.15 | 0.28* |
| Digit span sequencing | 0.74 | 0.01 | 0.30 | 3.24 | 0.29* |
| Complex ideation | 0.23 | 0.05 | 0.11 | 2.89 | 0.20* |
| Progressive matrices | 0.63 | 0.02 | 0.00 | 2.73 | 0.25* |
| National adult reading test | 0.73 | 0.01 | 0.02 | 2.18 | 0.13* |
| Smell test | 0.65 | 0.02 | 0.25 | 2.31 | 0.28* |
Note. Reliability is marginal reliability, defined in equation (18); E(SEM) = expected standard error of measurement, defined in equation (17); m-NMRU = marginal normalized minimum reduction of uncertainty, calculated relative to Factor 1 of Figure 2; m-ι_c = marginal criterion information utility, calculated relative to Factor 1 of Figure 2; τ(Alz.) = Kendall's rank correlation of test score with Alzheimer's; Alzheimer's is coded as 1 = highly probable, 2 = probable, 3 = possible, and 4 = not present. * denotes correlations significantly different from 0.00 at p < 0.05.
Reliability limits, but is not the sole determinant of, validity. Consistent with this, the tests that were most closely correlated with Alzheimer's were not particularly reliable. Word List Recognition was one of the least reliable tests included in the battery (reliability = 0.49), but had the highest rank-order correlation with diagnosis. Figure 3 illustrates why this may be so. The reference priors for the immediate and delayed recall portions of the test show that these trials assess the normative range of the trait, around a latent trait estimate of zero. A majority of the individuals in this sample have trait estimates in this range, leading to high estimates of reliability. However, among those with a diagnosis of probable Alzheimer's disease, the trait distribution is lower. For these individuals, the delayed recognition test is more appropriate. The disjunction between the reference prior for delayed recognition and the observed trait distribution suppresses the estimate of reliability. However, because measures of global test information are a function of the reference prior itself, they are not affected by this mismatch. For this reason, global test information may be a better metric for test selection in cases where there is little a priori information about abilities. Indeed, in this sample, global test information was more closely associated with criterion validity than were reliability or expected standard error of measurement. Figure 4 presents four scatterplots showing criterion validity as a function of (a) marginal reliability, (b) expected standard error of measurement, and (c and d) marginal global test information. Marginal NMRU explains 50% of the variability in correlations between test scores and probable Alzheimer's.
Figure 3.
The empirical density of verbal memory among those with probable Alzheimer’s dementia (solid line), and reference priors for the immediate recall (dotted line), delayed recall (dashes and dots), and delayed recognition (dashed line) subtests of Word List Learning.
Figure 4.
The relationship between criterion validity (operationalized as rank-order correlations with a diagnosis of probable Alzheimer's) and (a) marginal reliability, (b) expected standard error of measurement, (c) NMRU calculated with respect to verbal abilities, and (d) ι_c calculated with respect to verbal abilities. The association is expressed as R², or variance in criterion validity explained by test information.
Discussion
The multidimensional nature of most psychological tests complicates the estimation of test information. These results show that multidimensional measures of global test information are sensitive to test structure. Because global test information is calculated relative to the reference prior, global information requires no a priori information or hypotheses about the ability of the individual.
Existing measures of test information require a prior estimate or hypothesis about the abilities of the sample. Global test information, in contrast, is a function of the reference prior, a distribution that is itself a function of item parameters only. To the extent that the observed trait distribution differs from the reference prior, marginal reliability decreases. As Study 1 shows, global test information is not affected by this mismatch (see the lower half of Table 2). Using the reference prior results in an estimate of test information that reflects only the characteristics of the test. However, one could theoretically use other priors, in order to quantify the amount of information that could be gained by administering a test to a group with some known or hypothetical trait distribution.
Multidimensional extensions of global test information, including both marginal and multidimensional measures, can be used to dissect test information into component parts. Multidimensional global test information reflects the overall informativeness of a measure. This metric can be used to estimate an omnibus measure of test information for measures of multiple, correlated traits (see rows 4 and 8 of Table 4). For instance, multidimensional test information could quantify the informativeness of the Wechsler Adult Intelligence Scales with regard to all cognitive traits, or the informativeness of each test with regard to G (see rows 3 and 7 of Table 4). Alternatively, marginal test information can be used to quantify information gained about specific traits (see rows 1, 2, 5, and 6 of Table 4). This could be used to quantify the extent to which an arithmetic test based on word problems measures arithmetic versus verbal skills.
The analysis of observed neuropsychological test data demonstrates how these measures might be used in test selection. The neuropsychological battery was shown to reflect five traits: verbal ability, working memory, processing speed, vocabulary, and effort/attention. Verbal ability was most closely correlated with probable Alzheimer's. Marginal test information vis-à-vis verbal ability was estimated for each test in the battery. Scores on tests that were most informative with regard to verbal ability tended to be most closely correlated with a diagnosis of probable Alzheimer's (see columns 3, 4, and 5 of Table 5). In the context of screening, or at the outset of testing an examinee whose abilities are unknown, basing test selection on global information for the trait of interest may improve prediction, although further research on this topic is required.
Limitations
The analyses of neuropsychological test data were limited by computational constraints, which required an uncorrelated trait model. This is almost certainly an incorrect model, although the high correlations between corresponding traits from different rotations (all r > 0.90) suggest that the effects of rotation on the results are minimal.
These analyses did not systematically investigate the extent to which global information varies due to error in parameter estimates, although see Appendix C, Figure A1 for a simple examination of this issue. Parameter sampling error depends on the form of the item response model (two-, three-, or four-parameter), the size of the calibration sample, and the length of the test (Hulin et al., 1982). Future research may reveal how parameter variability translates to variation in global test information.
Finally, like all simulation studies, these results cover only a fixed range of simulation conditions. For example, reference priors were calculated at 60 points spanning −10 to 10 standard deviations on the latent trait, in order to capture the extreme range measured by those tests while remaining computationally tractable. However, one could reasonably argue for either a narrower or a wider range.
Conclusions
Marginal and multidimensional global information quantify how well a multidimensional test can inform estimates of one or more traits. The development of these metrics may facilitate test development, in that items can be selected to maximize information about one or more traits of interest and to minimize the influence of nuisance traits. Applications to the assessment of probable dementia have been demonstrated, but the findings also apply to achievement testing and to the assessment of personality and psychopathology. Especially in contexts where prior information is vague, global information can identify the items or tests that are most likely to be the best measures of the trait of interest.
Acknowledgments
The author gratefully acknowledges Dr. Kristian Markon, Dr. Teresa Treat, and Dr. Michael O’Hara for their helpful comments.
Appendix A. Calculating Reference Priors
The algorithm for estimating the reference prior begins from an arbitrary starting distribution of θ, from which a string of simulated test data, $\vec{x}$, is generated. The probability $p(\theta \mid \vec{x})$ is calculated using Bayes' theorem. Over many simulated samples, mimicking an infinitely long test, the average of $p(\theta \mid \vec{x})$ approximates the reference prior, π_r(θ). For uncorrelated traits, the process is as follows (Berger et al., 2009):
1. Define starting values. This includes choosing the number of items to simulate, which is usually a multiple, k, of the number of items in the test. The multiple k is intended to approximate an infinitely long test. Berger et al. (2009) simulate 500 items. In these simulations, k was set to 50, approximating an 800-item test. The same k was used in the analysis of the Memory and Aging Project (MAP) data; because the number of items varied between tests in the MAP dataset, simulated test length ranged from 400 items upward. Second, the number of samples to be simulated, m, is chosen. Here, m was set to 1,000, which has been effective in a number of applications. Finally, a starting prior distribution was selected (a uniform prior over the range of the latent trait, p(θ) = 1). The initial prior is arbitrary, as the repeated sampling of this procedure asymptotically reaches the reference prior, but it should be defined over the desired range of the reference prior. For the simulations, π_r(θ) was calculated at 60 points spanning −10 to 10. Since tests in the MAP dataset measured a wider range of abilities, π_r(θ) for those tests was calculated at 120 points between −20 and 20.
2. For each sample m, simulate response data for k replications of the test at a given trait value θ. The likelihood of the data given θ, $p(\vec{x} \mid \theta)$, is calculated and divided by $p(\vec{x})$, the likelihood of the data integrated over the range of θ.
3. The probability p(θ) is the ratio in step 2 averaged over the m simulated samples.
4. Repeat steps 2 and 3 for all desired values of θ in order to obtain the reference prior, π_r(θ). A code sketch of these steps follows.
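A compact R sketch of steps 1 through 4 for a unidimensional 2PL test is given below. It follows the steps as written, with a uniform starting prior over the grid; the b-spline smoothing described later in this appendix is omitted, and at the reported settings (k = 50, m = 1,000) the loop is computationally heavy. All names are illustrative.

```r
# Numerical approximation of the unidimensional reference prior.
reference_prior <- function(a, b, k = 50, m = 1000,
                            theta_grid = seq(-10, 10, length.out = 60)) {
  a_long <- rep(a, k); b_long <- rep(b, k)   # step 1: a k-fold "long test"
  G <- length(theta_grid)
  # Correct-response probabilities for every grid value x long-test item
  P <- plogis(outer(theta_grid, b_long, "-") * rep(a_long, each = G))
  LP <- log(P); LQ <- log1p(-P)
  ratios <- matrix(NA_real_, m, G)
  for (s in seq_len(m)) {
    for (g in seq_len(G)) {
      x <- rbinom(ncol(P), 1, P[g, ])               # step 2: simulate at theta_g
      ll <- as.vector(LP %*% x + LQ %*% (1 - x))    # log p(x | theta) over the grid
      # ratio p(x | theta_g) / p(x), with p(x) averaged over the uniform start prior
      lse <- max(ll) + log(mean(exp(ll - max(ll))))
      ratios[s, g] <- exp(ll[g] - lse)
    }
  }
  prior <- colMeans(ratios)        # step 3: average over the m simulated samples
  prior / sum(prior)               # step 4: the (discretized) reference prior
}
```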
When data depend on more than one correlated trait, calculation of the reference prior proceeds by calculating conditional reference priors and then integrating over the dimensions one by one. For example, in a second-order model in which items load onto two correlated traits, θ1 and θ2, the process is as follows (Bernardo, 2005):
1. The single-parameter algorithm above is used to calculate the conditional reference prior π_r(θ2 | θ1).
2. The conditional reference prior calculated in step 1 is used to integrate out the nuisance parameter, resulting in a 1-parameter model:
$$p(\vec{x} \mid \theta_1) = \int p(\vec{x} \mid \theta_1, \theta_2) \, \pi_r(\theta_2 \mid \theta_1) \, d\theta_2 \tag{21}$$
3. The single-parameter algorithm is applied to the model above to calculate the marginal reference prior π_r(θ1).
4. The multidimensional reference prior π_r(θ1, θ2) is equal to π_r(θ2 | θ1) (from step 1) multiplied by π_r(θ1) (from step 3).
5. The other marginal reference prior, π_r(θ2), is then
$$\pi_r(\theta_2) = \int \pi_r(\theta_2 \mid \theta_1) \, \pi_r(\theta_1) \, d\theta_1 \tag{22}$$
The order of conditioning can be switched (i.e., calculating π_r(θ1 | θ2) in step 1 and integrating over θ1 in step 2) without changing the form of the resulting prior. In the second-order simulations, the mean squared difference between priors calculated in opposite directions (from specific factor 1 to specific factor 2, and vice versa) was 1.50 × 10⁻⁶, which is small compared to the range of observed probabilities (p(θ) = 0.00–0.17).
When calculating reference priors, it is possible to simulate response strings consisting entirely of 1s or entirely of 0s. For these limit cases, the most likely value of θ calculated in step 3 will by default be the upper or lower boundary of the grid set in step 1. For example, in these simulations, a response string of k × j 1s is most likely at the upper limit of θ, which was set to 10. Over many replications, this causes the tails of the probability distribution to tilt upward. To counter this artifact of the Monte Carlo process, prior distributions were smoothed using b-splines (implemented in the R package "cobs"), such that the tails of the distribution were forced to asymptotically approach 0 beyond the outermost inflection points.
Appendix B. Monte Carlo Approximation of Criterion Information Utility
Criterion information utility can be transformed so as to allow Monte Carlo approximation. The resulting approximation is (Markon, 2013)
$$\hat{\iota}_c = \frac{1}{M} \sum_{m=1}^{M} \log \left[ \frac{p(\vec{x}_m \mid \theta_m)}{p(\vec{x}_m)} \right] \tag{23}$$
where probabilities are calculated with respect to the reference prior, and m indexes the iterations of the Monte Carlo procedure, which is:
1. Choose the number of iterations, M.
2. Randomly generate M values of θ, denoted θ_m, from the reference prior.
3. For each θ_m, randomly generate a response pattern, $\vec{x}_m$.
4. It can be shown via Bayes' theorem that the quantity inside the brackets in equation (23) is equivalent to
$$\frac{p(\vec{x}_m \mid \theta_m)}{p(\vec{x}_m)} = \frac{\pi_r(\theta_m \mid \vec{x}_m)}{\pi_r(\theta_m)} \tag{24}$$
The approximation to ι_c is then obtained by computing the quantity in equation (24) for each replication and averaging its logarithm over the M replications. The standard error of $\hat{\iota}_c$ is
$$SE(\hat{\iota}_c) = \sqrt{\frac{\sum_{m=1}^{M} \left( \log \left[ p(\vec{x}_m \mid \theta_m) / p(\vec{x}_m) \right] - \hat{\iota}_c \right)^2}{M(M-1)}} \tag{25}$$
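The procedure above can be sketched in R as follows, assuming a discretized reference prior over a grid of θ values (for example, the output of the Appendix A sketch) and a 2PL test; all names are illustrative.

```r
# Monte Carlo approximation of criterion information utility (equations (23)-(25)).
approx_iota_c <- function(a, b, theta_grid, prior, M = 5000) {
  prior <- prior / sum(prior)
  th <- sample(theta_grid, M, replace = TRUE, prob = prior)   # step 2
  log_terms <- vapply(th, function(t) {
    x <- rbinom(length(a), 1, plogis(a * (t - b)))            # step 3
    # log p(x | theta) across the grid, and log p(x) under the reference prior
    ll <- vapply(theta_grid, function(tt) {
      sum(dbinom(x, 1, plogis(a * (tt - b)), log = TRUE))
    }, numeric(1))
    lpx <- max(ll) + log(sum(prior * exp(ll - max(ll))))
    sum(dbinom(x, 1, plogis(a * (t - b)), log = TRUE)) - lpx  # log ratio in (23)
  }, numeric(1))
  c(iota_c = mean(log_terms),       # equation (23)
    se = sd(log_terms) / sqrt(M))   # equation (25)
}
```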
Appendix C. Parameter Recovery
Table A1.
Study 1: Item Parameter Recovery.
| | Test Difficulty | a | b |
|---|---|---|---|
| RMSE | N (−1.0, 0.5) | 0.30 | 0.14 |
| N (0.0, 0.5) | 0.25 | 0.11 | |
| N (1.0, 0.5) | 0.27 | 0.14 | |
| Bias | N (−1.0, 0.5) | 0.02 | 0.01 |
| N (0.0, 0.5) | 0.02 | 0.01 | |
| N (1.0, 0.5) | 0.02 | 0.00 |
Note. RMSE = root mean squared error; a = item discrimination; b = item difficulty.
Figure A1.
The relationship between the accuracy of parameter estimates, expressed as the root mean squared error (RMSE) of item discriminations, a, and (a) the accuracy of NMRU, (b) the accuracy of ι_c. Note. NMRU = normalized minimum reduction in uncertainty; RMSE = root mean squared error.
Table A2.
Study 2: RMSE and Bias of Test Parameter Estimates.
| | a | | | b | | |
|---|---|---|---|---|---|---|
| N cross-loadings | λ = 0.3 | λ = 0.6 | λ = 0.8 | λ = 0.3 | λ = 0.6 | λ = 0.8 |
| RMSE | ||||||
| 0 | 0.19 | 0.19 | 0.19 | 0.14 | 0.12 | 0.14 |
| 4 | 0.59 | 0.51 | 0.68 | 0.12 | 0.14 | 0.14 |
| 8 | 0.52 | 0.49 | 0.70 | 0.13 | 0.15 | 0.17 |
| Bias | ||||||
| 0 | 0.01 | 0.01 | 0.02 | −0.01 | −0.01 | −0.01 |
| 4 | −0.26 | −0.05 | 0.33 | −0.01 | 0.00 | 0.01 |
| 8 | −0.16 | 0.06 | 0.37 | −0.01 | −0.03 | −0.03 |
Note. RMSE = root mean squared error; a = item discrimination; b = item intercept; λ = magnitude of cross-loadings.
Table A3.
Study 3: Trait Correlations and Residuals.
| | S1 Loading | | |
|---|---|---|---|
| | 0.3 | 0.6 | 0.8 |
| Σ | |||
| σS1,S2 | 0.18 | 0.36 | 0.48 |
| σS1,S3 | 0.18 | 0.36 | 0.48 |
| σS2,S3 | 0.36 | 0.36 | 0.36 |
| Ψ | |||
| ψS1 | 0.91 | 0.64 | 0.36 |
| ψS2 | 0.64 | 0.64 | 0.64 |
| ψS3 | 0.64 | 0.64 | 0.64 |
Note. σ = correlation; ψ = residual variance.
Table A4.
Study 3: RMSE and Bias of Item Parameter Estimates.
| | a | | | b | | |
|---|---|---|---|---|---|---|
| N cross-loadings | λ = 0.3 | λ = 0.6 | λ = 0.8 | λ = 0.3 | λ = 0.6 | λ = 0.8 |
| RMSE | ||||||
| 0 | 0.21 | 0.20 | 0.20 | 0.15 | 0.15 | 0.15 |
| 4 | 0.20 | 0.22 | 0.21 | 0.15 | 0.16 | 0.15 |
| 8 | 0.21 | 0.23 | 0.23 | 0.15 | 0.16 | 0.17 |
| Bias | ||||||
| 0 | 0.03 | 0.02 | 0.03 | 0.01 | 0.02 | 0.02 |
| 4 | 0.03 | 0.03 | 0.02 | 0.03 | 0.03 | 0.02 |
| 8 | 0.02 | 0.04 | 0.02 | 0.02 | 0.03 | 0.01 |
Note. RMSE = root mean squared error; a = item discrimination; b = item intercept; λ = magnitude of cross-loadings.
Table A5.
Study 3: Trait Correlation Recovery.
| | Trait Loading | | |
|---|---|---|---|
| | 0.3 | 0.6 | 0.8 |
| RMSE | 0.04 | 0.04 | 0.04 |
| Bias | 0.00 | 0.00 | 0.00 |
Note. RMSE = root mean squared error.
Appendix D. Descriptive Statistics for Neuropsychological Tests
| Test | Mean | SD | Median | Skew | Kurtosis | Range |
|---|---|---|---|---|---|---|
| Number comparison | 23.72 | 7.77 | 24 | −0.29 | 3.17 | 0–46 |
| Verbal fluency, animals | 16.04 | 5.43 | 16 | 0.27 | 3.32 | 0–40 |
| Verbal fluency, fruits | 16.71 | 5.36 | 17 | −0.12 | 3.17 | 0–40 |
| Symbol digit modalities | 36.65 | 11.83 | 38 | −0.52 | 3.34 | 0–70 |
| Logical memory, immediate recall | 10.48 | 4.59 | 11 | −0.18 | 2.55 | 0–23 |
| Logical memory, delayed recall | 8.70 | 4.70 | 9 | −0.01 | 2.36 | 0–23 |
| East Boston memory test, immediate recall | 9.33 | 2.14 | 10 | −1.00 | 4.85 | 0–12 |
| East Boston memory test, delayed recall | 8.70 | 2.69 | 9 | −1.52 | 5.76 | 0–12 |
| Mini mental status exam | 25.38 | 3.38 | 26 | −3.36 | 19.62 | 0–28 |
| Boston naming test | 13.49 | 2.29 | 14 | −3.69 | 20.64 | 0–15 |
| Word list immediate recall, trial 1 | 3.75 | 1.76 | 4 | 0.13 | 2.93 | 0–10 |
| Word list immediate recall, trial 2 | 5.94 | 1.86 | 6 | −0.36 | 3.27 | 0–10 |
| Word list immediate recall, trial 3 | 6.84 | 1.90 | 7 | −0.70 | 3.81 | 0–10 |
| Word list delayed recall | 4.94 | 2.59 | 5 | −0.33 | 2.38 | 0–10 |
| Word list delayed recognition | 9.07 | 2.13 | 10 | −3.01 | 11.87 | 0–10 |
| Judgment of line orientation | 19.45 | 6.21 | 20 | −0.96 | 4.14 | 0–30 |
| Digit span forward | 8.13 | 2.11 | 8 | −0.35 | 3.29 | 0–12 |
| Digit span backward | 6.00 | 2.11 | 6 | 0.17 | 3.23 | 0–12 |
| Digit span sequencing | 6.94 | 1.91 | 7 | −0.93 | 5.50 | 0–13 |
| Complex ideation | 3.67 | 0.63 | 4 | −2.45 | 10.94 | 0–4 |
| Progressive matrices | 7.14 | 2.10 | 8 | −1.55 | 5.27 | 0–9 |
| National adult reading test | 7.56 | 2.68 | 8 | −1.22 | 3.67 | 0–10 |
| Smell test | 7.36 | 3.92 | 9 | −0.87 | 2.41 | 0–12 |
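The kurtosis values above cluster near 3 rather than 0, consistent with the non-excess (Pearson) definition, under which a normal distribution has kurtosis 3. A minimal sketch of these summaries, assuming a numeric vector scores and the moments package (neither named in the original text):

```r
# Descriptive summaries in the conventions of the table above; "scores" is
# a hypothetical vector of test scores. moments::kurtosis() returns Pearson
# (non-excess) kurtosis, so a normal variable yields values near 3.
library(moments)

describe_test <- function(scores) {
  c(Mean = mean(scores), SD = sd(scores), Median = median(scores),
    Skew = skewness(scores), Kurtosis = kurtosis(scores))
}
```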
Appendix E. Exploratory Factor Analyses of Neuropsychological Data
| No. Traits | BIC |
|---|---|
| 1 | 353,227.1 |
| 2 | 351,879.9 |
| 3 | 349,763.0 |
| 4 | 349,248.0 |
| 5 | **348,976.6** |
| 6 | 349,019.5 |
| 7 | 349,218.9 |
| 8 | 349,761.2 |
| 9 | 350,766.2 |
| 10 | 352,040.2 |
Note. BIC = Bayesian information criterion. The value in bold indicates the optimal model according to the fit index.
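As a hedged sketch of the workflow behind this table, the exploratory models could be fit with the mirt package (Chalmers, 2012) using the Metropolis-Hastings Robbins-Monro algorithm (Cai, 2010) for high-dimensional estimation; resp is a hypothetical item-response matrix, and the settings of the original analysis may differ.

```r
# Hedged sketch of an exploratory model-comparison loop over 1-10 traits.
# "resp" is a hypothetical response matrix; estimation settings in the
# original analysis may differ.
library(mirt)

bics <- sapply(1:10, function(d) {
  fit <- mirt(resp, model = d, method = "MHRM", verbose = FALSE)
  extract.mirt(fit, "BIC")
})
data.frame(traits = 1:10, BIC = bics)
which.min(bics)  # trait count minimizing BIC (5 in the table above)
```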
Footnotes
Author’s Note: Parts of these analyses were presented at the 2016 International Meeting of the Psychometric Society.
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The Memory and Aging Project data collection is supported by NIA grant R01AG17917.
ORCID iD
Katherine G. Jonas https://orcid.org/0000-0002-1910-223X
References
- Ansley T. N., Forsyth R. A. (1985). An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data. Applied Psychological Measurement, 9(1), 37–48. 10.1177/014662168500900104
- Bechger T. M., Maris G., Verstralen H. H. F. M., Béguin A. A. (2003). Using classical test theory in combination with item response theory. Applied Psychological Measurement, 27(5), 319–334. 10.1177/0146621603257518
- Bennett D. A., Buchman A. S., Boyle P. A., Barnes L. L., Wilson R. S., Schneider J. A. (2018). Religious Orders Study and Rush Memory and Aging Project. Journal of Alzheimer’s Disease, 64(s1), S161–S189. 10.3233/JAD-179939
- Berger J. O., Bernardo J. M., Sun D. (2009). The formal definition of reference priors. The Annals of Statistics, 37(2), 905–938. 10.1214/07-AOS587
- Bernardo J. M. (1979a). Expected information as expected utility. The Annals of Statistics, 7(3), 686–690. 10.1214/aos/1176344689
- Bernardo J. M. (1979b). Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological), 41(2), 113–128. 10.1111/j.2517-6161.1979.tb01066.x
- Bernardo J. M. (2005). Reference analysis. Handbook of Statistics, 25, 17–90. 10.1016/S0169-7161(05)25002-2
- Birnbaum A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In Lord F. M., Novick M. R. (Eds.), Statistical theories of mental test scores. Addison-Wesley.
- Bodnar O., Elster C. (2014). Analytical derivation of the reference prior by sequential maximization of Shannon’s mutual information in the multi-group parameter case. Journal of Statistical Planning and Inference, 147, 106–116. 10.1016/j.jspi.2013.11.003
- Bonifay W. (2019). Multidimensional item response theory (Vol. 183). SAGE Publications.
- Cai L. (2010). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75(1), 33–57. 10.1007/s11336-009-9136-x
- Camilli G. (1994). Teacher’s corner: Origin of the scaling constant d = 1.7 in item response theory. Journal of Educational Statistics, 19(3), 293–295. 10.3102/10769986019003293
- Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. 10.18637/jss.v048.i06
- Chang H.-H., Ying Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20(3), 213–229. 10.1177/014662169602000303
- Clarke B. S., Barron A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36(3), 453–471. 10.1109/18.54897
- Clarke B. S., Barron A. R. (1994). Jeffreys’ prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41(1), 37–60. 10.1016/0378-3758(94)90153-8
- Folk V. G., Green B. F. (1989). Adaptive estimation when the unidimensionality assumption of IRT is violated. Applied Psychological Measurement, 13(4), 373–390. 10.1177/014662168901300404
- Hulin C. L., Lissak R. I., Drasgow F. (1982). Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study. Applied Psychological Measurement, 6(3), 249–260. 10.1177/014662168200600301
- Kaplan A. (1973). The conduct of inquiry. Transaction Publishers.
- Kullback S., Leibler R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86. 10.1214/aoms/1177729694
- Lindley D. V. (1956). On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4), 986–1005. 10.1214/aoms/1177728069
- Markon K. E. (2013). Information utility: Quantifying the total psychometric information provided by a measure. Psychological Methods, 18(1), 15–35. 10.1037/a0030638
- Mulder J., van der Linden W. J. (2009). Multidimensional adaptive testing with optimal design criteria for item selection. Psychometrika, 74(2), 273–296. 10.1007/s11336-008-9097-5
- Paulhus D. L. (1981). Control of social desirability in personality inventories: Principal-factor deletion. Journal of Research in Personality, 15(3), 383–388. 10.1016/0092-6566(81)90035-0
- R Development Core Team (2010). R: A language and environment for statistical computing. http://www.R-project.org
- Raju N. S., Price L. R., Oshima T. C., Nering M. L. (2007). Standardized conditional SEM: A case for conditional reliability. Applied Psychological Measurement, 31(3), 169–180. 10.1177/0146621606291569
- Reckase M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21(1), 25–36. 10.1177/0146621697211002
- Segall D. O. (2001). General ability measurement: An application of multidimensional item response theory. Psychometrika, 66(1), 79–97. 10.1007/BF02295734
- Veldkamp B. P., van der Linden W. J. (2002). Multidimensional adaptive testing with constraints on test content. Psychometrika, 67(4), 575–588. 10.1007/BF02295132
- Wang C., Chang H.-H. (2011). Item selection in multidimensional computerized adaptive testing: Gaining information from different angles. Psychometrika, 76(3), 363–384. 10.1007/s11336-011-9215-7
- Wang C., Chang H.-H., Boughton K. A. (2010). Kullback-Leibler information and its applications in multi-dimensional adaptive testing. Psychometrika, 76(1), 13–39. 10.1007/s11336-010-9186-0
- Way W. D., Ansley T. N., Forsyth R. A. (1988). The comparative effects of compensatory and noncompensatory two-dimensional data on unidimensional IRT estimates. Applied Psychological Measurement, 12(3), 239–252. 10.1177/014662168801200303
- Wirth R. J., Edwards M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12(1), 58–79. 10.1037/1082-989X.12.1.58