Abstract
Equating of psychometric scales and tests is frequently required and conducted in educational, behavioral, and clinical research. Construct comparability or equivalence between measuring instruments is a necessary condition for making decisions about linking and equating their resulting scores. This article is concerned with a widely applicable method for examining whether two scales or tests cannot be equated. A latent variable modeling method is discussed that can be used to evaluate whether the tests, or parts thereof, measure latent constructs that are distinct from each other. The approach can be routinely used before an equating procedure is undertaken, in order to assess whether equating could be meaningfully carried out to begin with. The procedure is readily applicable in empirical research using popular software. The method is illustrated with data from dementia screening test batteries administered as part of two studies designed to evaluate a wide range of biomarkers across the progression from normal aging to dementia or Alzheimer’s disease.
Keywords: dementia screening test, equating, latent construct, latent structure, latent variable modeling, necessary condition, scale, sufficient condition, test battery
Equating of psychometric scales, tests, composites, or test batteries is frequently needed and carried out in educational, behavioral, and clinical research (Kolen & Brennan, 1995). As a logical prerequisite for equating, however, it must first be demonstrated that equating can be meaningfully carried out, in which case the scores on the composites of concern can be used interchangeably after the equating is conducted. A necessary condition for the application of any equating methodology is that each of the instruments involved is unidimensional to begin with (see Lord, 1980). Moreover, if two tests measure distinct constructs or contain multiple domain components, they obviously cannot be meaningfully equated.1
The aim of this article is to discuss a procedure for examining whether equating can be meaningfully carried out. Using the methodological framework provided in Raykov, Marcoulides, and Tong (2016), we present a readily applicable latent variable modeling (LVM) approach that can be employed to investigate the hypothesis that two tests, scales, or test batteries under consideration, or parts thereof, evaluate the same latent construct. Rejection of this hypothesis would suffice in an empirical setting to declare as unjustifiable (a) the equating of two composite measures and/or (b) their subsequent use as interchangeable. We illustrate the procedure with empirical data from widely used dementia screening test batteries, which serve to assess cognitive status and to detect, stage, and track disease progression, administered as part of two ongoing studies designed to investigate biomarkers associated with the progression from normal aging to dementia or Alzheimer’s disease (AD).
Notation, Background, and Assumptions
We assume in the rest of this article that two sets of measures are given, which consist of p and q components denoted as Y1, . . . , Yp and Z1, . . . , Zq, respectively (p > 1, q > 1; see below). The two sets may or may not contain common measures. We posit that these sets correspondingly represent two scales, tests, composites, or test batteries (generically referred to as “tests” below), which are being considered for equating with respect to the scores obtainable from them. We assume that each test is unidimensional, with p and q being such that the single-factor model is identified for either test. That is, we stipulate the following models are valid for the respective tests:
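y = µ1 + Λ1η1 + ε1,  z = µ2 + Λ2η2 + ε2. (1)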
In Equation (1), y and z denote correspondingly the p × 1 and q × 1 column vectors of Y1, . . . , Yp and Z1, . . . , Zq, with their means stacked in the p × 1 and q × 1 vectors µ1 and µ2, respectively. Furthermore, Λ1 and Λ2 are the corresponding p × 1 and q × 1 factor loading vectors (matrices), with the pertinent common factors per test denoted η1 and η2 and assumed to have zero means (e.g., Raykov & Marcoulides, 2008). In addition, ε1 and ε2 symbolize the respective p × 1 and q × 1 vectors of residual terms with zero means, which are assumed uncorrelated among themselves as well as with each other and with the factors. Last, we assume that both tests are administered to a representative, large sample from a studied population consisting of independent subjects (i.e., no nesting or clustering effects are at work at the population level; see also the conclusion section; Raykov, Marcoulides, Dimitrov, & Li, 2016).
A Latent Variable Modeling Procedure for Examining a Necessary Condition for Equating Two Tests
As indicated above, an essential prerequisite for equating two tests or composites is that they measure the same latent construct (see Note 1; Lord, 1980). That is, if x denotes the (p+q) × 1 vector of the observed variables Y1, . . . , Yp and Z1, . . . , Zq taken together (stacked on each other), then according to this necessary condition,
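x = µ + Λη + ε (2)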
must hold, where Λ is the pertinent (p+q) × 1 factor loading vector (matrix), η the underlying common factor, µ the associated (p+q) × 1 mean (intercept) vector, and ε the associated (p+q) × 1 vector of unique factors, with the usual distributional assumptions (see preceding discussion; e.g., Raykov & Marcoulides, 2008).
Implications of Equation (2) for the joint distribution of the measures in x are testable using the popular LVM methodology (Muthén, 2002). In particular, the asymptotically distribution-free (or weighted least squares, WLS) estimation method can generally be used to test the hypothesis of unidimensionality (homogeneity) embedded in Equation (2) (e.g., Bollen, 1989). It is important to point out in this connection, however, that not rejecting this unidimensionality hypothesis for the measures in x does not mean that it is true. Specifically, when the homogeneity hypothesis is tested empirically, true latent structures distinct from unidimensionality can be associated with tenable overall goodness of fit indices for the single-factor model, such as the chi-square value, root mean square error of approximation (RMSEA), and descriptive goodness of fit indices (e.g., the comparative fit index and Tucker–Lewis index; Muthén & Muthén, 2017). One may argue that this possibility may be even more likely in cases where the number p+q of observed variables in the vector x in Equation (2) is large, owing to what may be considered an increased likelihood of “local” violations of homogeneity that are offset by tenable overall fit indices when this unidimensionality hypothesis is tested. Part of the intended contribution of the present article is in fact the demonstration of such a circumstance using empirical data, that is, a situation where the hypothesis of homogeneity for a given set of measures is plausible while subsets of that manifest variable set assess distinct constructs (see next section).
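As a minimal sketch, the unidimensionality hypothesis in Equation (2) could be tested with the WLS-based estimator using Mplus input along the following lines; the data file and variable names (and the choice of p = 5 and q = 6 measures) are hypothetical, and the complete syntax used in applications can be found in the cited sources (e.g., Raykov & Marcoulides, 2006; Muthén & Muthén, 2017).

```
TITLE:    Single-factor (homogeneity) model for the stacked measures in x
DATA:     FILE = tests.dat;            ! hypothetical data file containing Y1-Yp and Z1-Zq
VARIABLE: NAMES = y1-y5 z1-z6;         ! hypothetical variable names (p = 5, q = 6)
          CATEGORICAL = y1-y5 z1-z6;   ! declare the items as binary/ordinal
ANALYSIS: ESTIMATOR = WLSMV;           ! robust weighted least squares (WLS-based) estimation
MODEL:    eta BY y1-y5 z1-z6;          ! one common factor underlying all measures
OUTPUT:   STDYX;                       ! standardized solution
```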
This discussion, and especially the indicated possibility of complex latent structures being missed by the conventional and routinely used omnibus unidimensionality test, has important implications when one is considering the equating of two tests. In particular, as we demonstrate in the next section, it is possible that (a) each test is unidimensional, as judged by the overall goodness of fit indices (e.g., those mentioned above); (b) appropriate subsets of measures participating in each of the tests, when pooled across tests, are also unidimensional as evaluated by these indices; while (c) these subsets of measures from either test actually evaluate two distinct constructs, as judged by a statistical test based on the same overall fit indices. More specifically, using the notation introduced in the preceding section, it is possible that the following three circumstances hold. One, the single-factor models
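u = µu + K1ξ1 + δ1,  v = µv + K2ξ2 + δ2 (3)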
are plausible for a subset u of Y measures and a subset v of Z measures with r and s elements, respectively, as judged by the above overall goodness of fit indices, with ξ1 and ξ2 being their associated single common factors per subset, µu and µv the corresponding mean vectors, K1 and K2 the pertinent loading vectors, and δ1 and δ2 the associated error term vectors of size r × 1 and s × 1, respectively, with the usual distributional assumptions (see above; 0 < r ≤ p, 0 < s ≤ q). In addition,
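w = µw + Kξ + δ (4)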
is also plausible, where w is the (r+s) × 1 stacked vector of the u and v measures with a common factor ξ, µw its mean vector, δ the associated (r+s) × 1 vector of error terms, and K the (r+s) × 1 factor loading vector (with the usual distributional assumptions mentioned earlier). At the same time, however,
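Corr(ξ1, ξ2) < 1, that is, ξ1 ≠ ξ2, (5)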
is consistent with the empirical data, as judged by a statistical test based on the above-mentioned goodness of fit indices.
We emphasize that when Inequality (5) holds, the equating of the two tests consisting correspondingly of the Y and Z measures is not meaningful or justifiable. Hence, Inequality (5) represents a sufficient condition for the two resulting test scores (from the Y and Z measures, respectively) not to be used as interchangeable. (For the distinction between, and relevance of, necessary and sufficient conditions in empirical and theoretical research, see, e.g., Raykov & Marcoulides, 2006.) We demonstrate in the next section an empirical case consistent with the described circumstance of plausibility of the unidimensionality hypothesis for a given set of measures while two subsets of it evaluate distinct constructs.
Can the Mini-Mental State Test Battery Be Equated to the Montreal Cognitive Assessment Test Battery for Older Adults?
In this section, we apply the discussed procedure to data from concurrent administrations of two cognitive test batteries, the popular Mini-Mental State Examination (MMSE; Folstein, Folstein, & McHugh, 1975) and the more recently developed Montreal Cognitive Assessment (MoCA; Nasreddine et al., 2005), to a sample of older adults participating in two major studies of aging and AD. We show that, strictly speaking, equating the two test batteries need not be possible in general, that is, that their meaningful equating need not be feasible, and hence that they need not be usable interchangeably (see also Note 5). Sample characteristics associated with the data set of older adults used are presented next.
Sample
We employ next combined data from the Phase 2 Alzheimer’s Disease Neuroimaging Initiative Grand Opportunities (ADNI) study (adni.loni.usc.edu; Dowling, Raykov, & Marcoulides, 2018) and the Wisconsin Alzheimer’s Disease Research Center (Wisconsin ADRC). The primary goal of ADNI has been to test whether imaging and other biological markers, as well as clinical and neuropsychological assessments, can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. The Wisconsin ADRC is one of the 32 research centers in the nation funded by the National Institutes of Health and the National Institute on Aging to contribute clinical data and provide new knowledge to improve the diagnosis and care of people with AD. To date, ADNI protocols have recruited more than 1,500 adults, aged 54 to 95 years, who are clinically diagnosed as cognitively normal (CN), MCI, or AD. (For up-to-date information on ADNI protocols, see www.adni-info.org.) The Wisconsin ADRC also recruits individuals who are diagnosed as CN, MCI, or AD, and its neuropsychological data collection protocols are very close to those implemented in the ADNI studies. (For up-to-date information on national study protocols and local recruitment, see http://www.adrc.wisc.edu.)
The analytical sample in this illustration consisted of 1,683 participants ranging in age from 53 to 94 years (standard deviation [SD] = 8.5) who were clinically diagnosed at study entry as CN (n1 = 560), MCI (n2 = 655), or AD (n3 = 468). The sample was predominantly Caucasian (92%), had an average education level of 15.4 years (SD = 3.08), and was 62.8% male.
Measures
The MMSE is the test most widely used by clinicians for grading cognitive status. Its administration takes approximately 7 minutes, and it includes 30 measures. The MoCA test battery was developed as an alternative tool to screen individuals with MCI who would most likely perform within the “normal” range on the MMSE (Koski, Xie, & Finch, 2009; Koski, Xie, & Konsztowicz, 2011). Like the MMSE, the MoCA contains 30 items, and it takes about 10 minutes to administer. An important point to raise here is that while the two test batteries contain a number of common measures, they also include measures that are specific to each battery. Table 1 lists the cognitive domain measured by each item category in the two tests.
Table 1.
Side-by-Side Comparison of the Screening Test Batteries.
| Item group | Domain | MoCA | MMSE |
|---|---|---|---|
| Items unique to MoCA | Visuospatial | Mini trails | None |
| | Executive function | Draw clock | None |
| | Attention | Digits forward/backward | None |
| | Attention | Tapping with each A | None |
| | Naming | Name F words | None |
| | Abstraction | Similarities | None |
| Items unique to MMSE | Language and praxis | None | Read sentence |
| | Language and praxis | None | Write sentence |
| | Language and praxis | None | Three-stage command |
| Items common to both tests | Language and praxis | Naming (lion, rhino, camel) | Naming (pen, wristwatch) |
| | Language and praxis | Copying drawing (a cube) | Copying drawings (2 pentagons) |
| | Attention/Concentration | Serial 7s | Serial 7s or WORLD backwards |
| | Language and praxis | Repeat sentence | Repeat sentence (no ifs, ands, or buts) |
| | Memory | Delayed recall (5 words) | Delayed recall (3 words) |
| | Orientation: Temporal | Time (4 items) | Time (5 items) |
| | Orientation: Spatial | Place (2 items) | Place (5 items) |
| | Immediate memory/Attention | Two trials (5 words each; not scored) | Up to six trials (3 words; 0-3 score) |
Note. MoCA = Montreal Cognitive Assessment; MMSE = Mini-Mental State Examination.
As seen from Table 1, both the MoCA and MMSE test batteries assess temporal and spatial orientation, but the MoCA battery includes additional memory items and emphasizes visuospatial, executive function, and naming (language) tasks. These cognitive functions are generally affected in what is referred to as the prodromal phases of AD (Diniz, Yassuda, Nunes, Radanovic, & Forlenza, 2007; Smith, Gildeh, & Holmes, 2007).
Examining Test Battery Equating
Based on the preceding sections of this article, one readily realizes that if the two test batteries under consideration are to be meaningfully equated, they must both measure the same common construct, as indicated earlier. As a consequence, the sets of unique measures in the MMSE and in the MoCA test batteries must themselves measure a single, common construct. Therefore, in order to show that the MMSE and MoCA batteries cannot be equated (in general; see Note 5), it suffices to demonstrate that the unique measures in these two batteries in fact evaluate two distinct constructs.
To this end, we begin by testing the unidimensionality of the unique measures in the MMSE battery and in the MoCA battery. (For software syntax pertaining to testing this hypothesis, see, e.g., Raykov & Marcoulides, 2006.) For the r = 3 unique measures in the MMSE battery, the pertinent unidimensional (one-factor) model is saturated and hence associated with perfect fit (when using the above-mentioned WLS method of model fitting and parameter estimation, owing to the distinct discrete individual measure distributions; Muthén & Muthén, 2017; see also Note 2). Similarly, for the s = 6 unique measures in the MoCA test battery, this model is associated with what can be interpreted as tenable goodness of fit indices: chi-square (χ2) = 51.745, degrees of freedom (df) = 9, p < .001, RMSEA = 0.053 with a 90% confidence interval of (0.040, 0.068; e.g., Browne & Cudeck, 1993).2 As a next step, we fit the single-factor model to the set of all 9 unique measures from the two test batteries together, which is found to be tenable as well: χ2 = 89.540, df = 27, p < .001, RMSEA = 0.037 (0.029, 0.046).
At the same time, however, the two-factor model, with one factor loading only on the unique MMSE measures and the other only on the unique MoCA measures, is also tenable: χ2 = 79.522, df = 26, p < .001, RMSEA = 0.035 (0.026, 0.044). (For software syntax pertaining to testing this type of model, see, e.g., Raykov, Marcoulides, Dimitrov, et al., 2016.) As a next step, in line with the preceding discussion, we wish to examine specifically the possible congruence, or lack thereof, of the constructs underlying the unique MMSE measures and the unique MoCA measures, respectively. To this end, we carry out the test of the restriction of unit correlation between the two factors, which results in a single-factor model for the set of 9 measures under consideration that is nested in the last fitted two-factor model (for details and a demonstration of this relationship and relevant software syntax, see Raykov, Marcoulides, & Tong, 2016, as well as www.statmodel.com for WLS-based chi-square difference testing). This test is found to be associated with a significant result: chi-square for difference testing = 9.898, df = 1, p = .002.3
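As a minimal sketch of the two Mplus runs underlying this difference test, the following input fragments may be used; the data file and variable names are hypothetical (u1-u3 stand for the 3 unique MMSE measures and v1-v6 for the 6 unique MoCA measures), and the actual syntax employed can be found in the sources just cited.

```
! --- Input file 1: two-factor model for the unique MMSE and unique MoCA measures ---
DATA:     FILE = unique_items.dat;     ! hypothetical data file
VARIABLE: NAMES = u1-u3 v1-v6;         ! hypothetical names for the 3 + 6 unique items
          CATEGORICAL = u1-u3 v1-v6;
ANALYSIS: ESTIMATOR = WLSMV;
MODEL:    f1 BY u1* u2-u3;             ! unique MMSE factor, all loadings free
          f2 BY v1* v2-v6;             ! unique MoCA factor, all loadings free
          f1@1; f2@1;                  ! unit factor variances, so f1 WITH f2 is their correlation
SAVEDATA: DIFFTEST = deriv.dat;        ! save information for the chi-square difference test

! --- Input file 2: nested model with the factor correlation fixed at 1 ---
DATA:     FILE = unique_items.dat;
VARIABLE: NAMES = u1-u3 v1-v6;
          CATEGORICAL = u1-u3 v1-v6;
ANALYSIS: ESTIMATOR = WLSMV;
          DIFFTEST = deriv.dat;        ! compare this run with input file 1
MODEL:    f1 BY u1* u2-u3;
          f2 BY v1* v2-v6;
          f1@1; f2@1;
          f1 WITH f2@1;                ! restriction of unit correlation between the factors
```

With the WLS-based (WLSMV) estimator, the chi-square difference between the two runs is obtained via the DIFFTEST option rather than by subtracting the two chi-square values directly.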
These results suggest rejection of the null hypothesis of unit correlation between the two factors in the above-mentioned two-factor model. This is interpretable as evidence against the single-factor model for the presently considered set of 9 measures, since a unit factor correlation reduces the two-factor model to the one-factor model, that is, renders the latter nested in the former (e.g., Raykov, Marcoulides, & Tong, 2016). Hence, the reported analyses provide support, based on the analyzed sample, for the MMSE and MoCA test batteries measuring distinct constructs. For this reason, as discussed earlier, equating the MMSE and MoCA batteries is, strictly speaking, not to be generally recommended.4,5
Conclusion
The purpose of this article was to discuss a readily applicable LVM procedure for examining whether two psychometric scales, tests, composites, or test batteries can be meaningfully equated. The procedure is widely applicable in empirical educational, behavioral, or clinical research using the popular LVM methodology. The method rests on the logical prerequisite of each test being unidimensional and the tests measuring the same latent construct. In particular, in case the tests contain common measures, a necessary condition for meaningful test equating is that their unique measures evaluate the same rather than distinct constructs. Therefore, a sufficient condition for the equating of two given tests not to be meaningful is the finding that their unique measures evaluate distinct latent constructs.
We emphasize that an application of the discussed procedure is tied to a particular sample of data obtained from a studied population in which both tests are considered for use. Hence, it is possible that two tests may be meaningfully equated for a given (sub)population but not for another (sub)population where they may also be relevant as well as informative. In addition, it is possible that a finding of lack of meaningful equating (in the sense discussed in this article) is associated with (a) a sample not being representative of a population of concern or intended to be studied or (b) an overly large sample (e.g., well into the thousands). We note that circumstance (a) is to be resolved by using a correct scheme and process of sampling and needs to be addressed at the study design stage, before data collection. At the same time, findings of not rejecting the homogeneity hypothesis (a) for each test, (b) for unique subsets of their measures, and (c) for the union of these subsets (as in the illustration section) may be consistent with the sample size not being excessively large in a given empirical study.
We would like to point out several limitations of the LVM procedure for examining test equating discussed in this article (Raykov, Marcoulides, & Tong, 2016). As indicated earlier, one is the requirement of a large number of studied subjects (a large sample) from an examined population, since the method rests on an application of the LVM methodology that is based on asymptotic statistical theory. We encourage future research addressing the important matter of when this theory may obtain practical relevance, as related to particular empirical circumstances. Similarly, the method assumes that the studied subjects are not nested or clustered in higher order units (e.g., Level 2 units), or alternatively that the clustering effect is negligible. In our view, the method may be expected to be robust to some relatively minor violations of this classical independence assumption, particularly under normality of the observed measures (see also Raykov, Marcoulides, & Tong, 2018, for the case of nonnormality and nesting effects).
In conclusion, the present article adds a widely applicable LVM procedure to the arsenal of methods available to educational, behavioral, and clinical scientists. The method helps them assess whether equating of psychometric tests, scales, composites, or test batteries can be meaningfully carried out in an empirical setting before engaging in the actual process of test equating.
Notes
1. While equating could formally be carried out numerically on any sets of scores resulting from the use of distinct measures (tests, scales, composites, or test batteries), it is essential that the results of this procedure be meaningfully interpreted in substantive terms pertinent to the studied phenomenon. A major logical principle on which this article is based, therefore, is that in order for such interpretation to be possible it is necessary that the scores result from evaluation of the same latent construct using distinct measurement procedures (see Lord, 1980).
2. If a test battery (test, scale, composite) is unidimensional, then evidently the pertinent single-factor model must be tenable regardless of whether WLS or maximum likelihood (ML) is used as the method of model fitting. When ML is used to fit this model to the unique MMSE measures, the likelihood ratio chi-square is 22.175, for df = 7, p = .002, indicating some potential violations of unidimensionality (see also the introductory section). (The Pearson chi-square value is 24.926, i.e., rather close, which is consistent with the asymptotic theory underlying these tests having likely obtained practical relevance at the used sample size.) Similarly, when the single-factor model is fitted to the unique MoCA measures with ML, the likelihood ratio chi-square is 290.875, for df = 271, p = .194, indicating no considerable violations of unidimensionality. (The Pearson chi-square value is 314.702, i.e., relatively close, which is consistent with the asymptotic theory underlying these tests having largely obtained practical relevance at the used sample size; see also Note 3.) For the illustrative purposes of the current section, we treat these fit results for the unique MMSE measures as not sufficient to warrant rejection of the pertinent unidimensionality hypothesis, but we note that the results are consistent with the conclusion at the end of this section that the two considered test batteries cannot in general be meaningfully equated.
3. We emphasize that rejection of the null hypothesis of unit correlation, as found in this section, is sufficient to reject the claim that the two test batteries of initial concern can be equated in general. We stress that this conclusion is of course to be interpreted within the limits of statistical inference, that is, assuming no Type I error is committed with the used statistical test of the two nested models involved (see also the conclusion section and Note 5).
4. In the two-factor model, the factor correlation is estimated at 0.831, with a standard error of 0.050 and a 95% confidence interval (CI) of (0.704, 0.906; Raykov & Marcoulides, 2011). We do not interpret this finding as indicative of (practical) collapsibility of the two factors into a single one in this model (Raykov, Marcoulides, & Tong, 2016), and hence do not interpret it as support for the single-factor model vis-à-vis the two-factor model. On the contrary, because the lower limit of this CI is approximately 0.70, that is, consistent with less than half of the variance in either factor being explainable in terms of that in the other factor (0.704² ≈ .496, through an assumed linear relationship), we find that these correlation estimation results cannot be taken as indicative of congruence in practical terms of the two factors of concern.
5. As indicated earlier, we stress that the purpose of this article is not to demonstrate that the MMSE and MoCA test batteries cannot be equated for any population for which they would be of substantive relevance and applicability. Rather, the goal is merely (a) to discuss a procedure that can be used for examining test equating in the general case and (b) to illustrate this procedure by employing data obtained with these batteries from a study of older adults. In particular, this article does not preclude the two batteries from being equatable for another (e.g., a more specialized) older adult (sub)population (as also pointed out in the conclusion section).
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
- Bollen K. A. (1989). Structural equations with latent variables. New York, NY: Wiley.
- Browne M. W., Cudeck R. (1993). Alternative ways of assessing model fit. In Bollen K. A., Long J. S. (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.
- Diniz B., Yassuda M., Nunes P., Radanovic M., Forlenza O. V. (2007). Mini-Mental State Examination performance in mild cognitive impairment subtypes. International Psychogeriatrics, 19, 647-656.
- Dowling N. M., Raykov T., Marcoulides G. A. (2018). Examining population differences in within-person variability in longitudinal designs using latent variable modeling: An application to the study of cognitive functioning of older adults. Educational and Psychological Measurement, 79(3), 598-609. doi:10.1177/0013164418758834
- Folstein M. F., Folstein S., McHugh P. R. (1975). Mini-Mental State: A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12, 189-198.
- Kolen M. J., Brennan R. L. (1995). Test equating: Methods and practices. New York, NY: Springer.
- Koski L., Xie H. Q., Finch L. (2009). Measuring cognition in a geriatric outpatient clinic: Rasch analysis of the Montreal Cognitive Assessment. Journal of Geriatric Psychiatry and Neurology, 22, 151-160.
- Koski L., Xie H., Konsztowicz S. (2011). Improving precision in the quantification of cognition using the Montreal Cognitive Assessment and the Mini-Mental State Examination. International Psychogeriatrics, 29, 1107-1115.
- Lord F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Erlbaum.
- Muthén B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29, 81-117.
- Muthén L. K., Muthén B. O. (2017). Mplus user’s guide (8th ed.). Los Angeles, CA: Muthén & Muthén.
- Nasreddine Z. S., Phillips N. A., Bedirian V., Charbonneau S., Whitehead V., Collin I., . . . Chertkow H. (2005). The Montreal Cognitive Assessment, MoCA: A brief screening tool for mild cognitive impairment. Journal of the American Geriatrics Society, 53, 695-699.
- Raykov T., Marcoulides G. A. (2006). A first course in structural equation modeling. Mahwah, NJ: Erlbaum.
- Raykov T., Marcoulides G. A. (2008). An introduction to applied multivariate analysis. New York, NY: Taylor & Francis.
- Raykov T., Marcoulides G. A. (2011). Classical item analysis using latent variable modeling: A note on a direct evaluation procedure. Structural Equation Modeling, 18, 316-325.
- Raykov T., Marcoulides G. A., Dimitrov D. M., Li T. (2016). Examining construct congruence for psychometric tests: A note on an extension to binary items and nesting effects. Educational and Psychological Measurement, 78, 167-174.
- Raykov T., Marcoulides G. A., Tong B. (2016). Do two or more multi-component instruments measure the same construct? Testing construct congruence using latent variable modeling. Educational and Psychological Measurement, 76, 873-884.
- Smith T., Gildeh N., Holmes C. (2007). The Montreal Cognitive Assessment: Validity and utility in a memory clinic setting. Canadian Journal of Psychiatry, 52, 329-332.
