Abstract
Objective:
A variety of factors affect list learning performance, yet relatively few studies have examined the impact of word selection on these tests. This study examines the effect of both language and memory processing of individual words on list learning.
Methods:
Item response data from 1219 participants (Mage=74.41 [SD=7.13], Medu=13.30 [SD=2.72]) in the Harmonized Cognitive Assessment Protocol were used. A Bayesian generalized (non-)linear multilevel modeling framework was used to specify the measurement and explanatory IRT models. Explanatory effects on items due to learning over trials, serial position of words, and six word properties obtained through the English Lexicon Project were modeled.
Results:
A two-parameter logistic model with trial-specific learning effects produced the best measurement fit. Evidence of the serial position effect on word learning was observed. Robust positive effects on word learning were observed for body-object integration, while robust negative effects were observed for word frequency, concreteness, and semantic diversity. A weak negative effect of average age of acquisition and a weak positive effect for the number of phonemes in the word were also observed.
Conclusions:
Results demonstrate that list learning performance depends on factors beyond the repetition of words. Identification of item factors that predict learning could extend to a range of test development problems including translation, form equating, item revision, and item bias. In data harmonization efforts, these methods can also be used to help link tests via shared item features and testing of whether these features are equally explanatory across samples.
Keywords: Memory, Semantics, Psychometrics, Memory and Learning Tests
Word list learning tests are commonly used to assess verbal episodic memory (Rabin et al., 2016). These tests typically involve the auditory presentation of a set list of words over repeated trials to evaluate verbal learning and recall. Despite differences between test versions and forms, such measures share similar administration formats and have proven to be useful tools in detecting hippocampal dysfunction and dementing disorders (e.g., Gavett et al., 2009; Pozueta et al., 2011; Ribeiro et al., 2007; Silva et al., 2012; Weissberger et al., 2017). The California Verbal Learning Test (Delis et al., 2017) is one of the most popular among clinicians for use in adults (Rabin et al., 2016), while the briefer Consortium to Establish a Registry for Alzheimer’s Disease’s (CERAD) word list (Morris et al., 1989) has been used in multiple longitudinal, international research studies (Fillenbaum et al., 2008; Prince et al., 2003).
Despite their popularity and utility, common word list learning tests have been criticized because most were created decades ago or iterate on the same general format (Rabin et al., 2016) and thus often fall behind the pace of the cognitive neuroscience literature (Bilder, 2011; Bilder & Reise, 2019; Perry, 2009; Silverberg et al., 2011). One method to overcome these limitations is to apply updated theories and analytic methods to existing tests, and one such method is to combine the process approach with the psychometric framework of item response theory (IRT; Bilder & Reise, 2019).
Applying the Boston process approach to existing tests, as described by Perry (2009), involves three aspects: deconstructing existing measures, creating new scoring systems, and developing new tasks using new methods. While all three aspects can be addressed using IRT, this study focuses on the first two aspects. As a model-based approach to describing a test, IRT allows explicit modeling of both measurement and performance features. One extension of model-based IRT is explanatory IRT in which item- and person-level covariates are included in the model to explain variance in item and person parameters (De Boeck & Wilson, 2004). Existing neuropsychology research already commonly examines person-level covariates in the form of educational attainment, premorbid ability, and age as variables affecting cognitive abilities. Item covariates function similarly but apply to the measurement properties of the items (i.e., the item parameters), which allows examination of what item features contribute to (or explain) the observed item behavior. Tests can thus be deconstructed at the item-level by examining item characteristics’ effect on their IRT parameters, and the resulting model serves as a scoring method, making it a cohesive solution to the first two process aspects described by Perry (2009).
Prior studies of list learning tests have examined cognitively relevant variables like serial position, learning slope, and word organization strategies (e.g., Blumenfeld & Ranganath, 2007; Delis et al., 2017; Griffin et al., 2017; Sternberg & Tulving, 1977; Williams et al., 2019) or IRT parameters (e.g., Gavett & Horwitz, 2012; Thiruselvam & Hoelzle, 2020), but to our knowledge, the two approaches have not been combined. Combining information about cognitive processes related to memory with psychometric study of the test facilitates identification of potential item covariates for study and quantifying these item covariates has three specific advantages. First, it deepens understanding of the measure itself. For example, if an aspect of language familiarity affects an item’s difficulty, then knowing the effect size (i.e., probability) of correctly answering the item informs clinical judgement about whether the item is appropriate for certain populations. Second, these covariates can inform how different items would affect test functioning. A new or alternative item can be designed and its item parameters predicted from previously identified item covariates to decide about inclusion of that item in a test before it is ever administered. Finally, in cases where tests utilize similar but different items (e.g., word reading tests, list learning tests), knowing the item covariates that explain item-level properties facilitates comparison across tests. Explanatory IRT methods treat the items as being from one test but with multiple forms, and the items’ parameters are all impacted by the same covariates. Knowing these covariates allows prediction of the parameters, which is then used to make predictions about the tests’ properties. Such an approach transitions data harmonization from a score-based problem, where the aim is to equate groups on their scores, to an item-property problem, where the aim is to equate tests by their items.
The primary goals of this study were to develop a Bayesian IRT model of the CERAD List Learning test that incorporates item covariates informed by language and memory processing theory to demonstrate how potential item features of a test can be derived, explored, and tested within explanatory IRT. In the case of a list learning test, this means considering the ways in which the items are learned and what may affect this learning process. Cognitive research on semantic network activation to words as predictors of recall has demonstrated the importance of prior semantic knowledge to learning (Nelson et al., 2013), so several semantic and linguistic features of the CERAD words were used for this study. These features are described in detail in the appendix. Hypotheses for this study are split into two modeling endpoints: the measurement model and the explanatory model.
The measurement model includes hypotheses for how the CERAD List Learning test is structured and its measurement properties:
1. A two-parameter model will be preferred to a Rasch model for the CERAD,
2. An item-unique local dependency effect (corresponding roughly to inter-trial recall consistency) will be more appropriate for the model than other methods of accounting for local dependency, and
3. Item parameters (i.e., difficulty and discrimination) will differ across trials with a general trend of items becoming easier after more repetitions.
The explanatory model includes hypotheses for what item features drive the measurement characteristics defined by the item parameters:
4. There will be evidence of a serial position effect in the model (items at the beginning and end of the list being easier than middle items), and
5. The selected linguistic traits of the CERAD items will predict item easiness, with all covariates except age of acquisition and number of phonemes hypothesized to have a positive relationship with item easiness.
These two endpoints emphasize different applications of the explanatory IRT framework. Measurement models correspond to traditional IRT models where the structure and measurement qualities of the test are identified. In the case of list learning tests, these models formally test whether items convey differing levels of information about memory (i.e., whether models including the item discrimination parameter improve fit) and how to account for the effects of both learning and repeated item presentation. Explanatory models extend the IRT framework by treating item parameters as a product of certain item features. As highlighted in this study, additional cognitive factors (e.g., the serial position effect) and the mediating role of language can be explicitly tested in these models as factors affecting the measurement itself, and knowing these factors can aid in test translation, adaptation, and application problems.
Methods
Sample
Data Sources
Data for this study came from the Harmonized Cognitive Assessment Protocol (HCAP; grant number NIA R01 AG051142), an international research collaboration to understand and measure dementia risk. In the United States, HCAP data were obtained through the Health and Retirement Study (HRS; grant number NIA U01 AG009740), which is conducted by the University of Michigan. HRS has collected longitudinal, cohort data through its survey of older adults and their spouses (aged 50+) every two years since 1992 (see Heeringa & Connor, 1995, for details of the HRS sampling). The HRS and HCAP data sources are among the few large-scale databases to have recorded item-level responses to neuropsychological tests, and this level of variable recording is needed for IRT modeling.
HCAP data collection began in 2016 and involved HRS participants who (a) were 65 years or older, (b) already completed the 2016 HRS interview, and (c) had an appropriate informant available for interview. Of those potentially eligible HRS participants, a random sample of half of uncoupled respondents in addition to a random selection of one respondent from each coupled household were pre-selected for HCAP interviews (see Weir et al., 2016 for additional details). The RAND HRS longitudinal file (2021) was also used to summarize demographic, socioeconomic, employment, and health variables across study waves.
Exclusion Criteria
Participants were excluded if they did not complete (a) the study in English or (b) all learning trials of the CERAD List Learning test. Participants with informant-reported history of stroke, Parkinson’s disease, Alzheimer’s disease, or “memory problems” were also excluded. To reduce the risk of including individuals with undiagnosed neurocognitive disorders, inclusion in the sample was further based on both functional and cognitive impairment criteria (Petersen et al., 2014). Functional impairment was based on whether informants reported cognitive-associated functional impairment on any item of the Blessed Dementia Rating Scale (BDRS). As only raw scores across tests are reported in HCAP, cognitive impairment was determined via a latent class analysis on dichotomized residuals from a multivariate regression predicting raw scores on the neuropsychological tests in the HCAP using sociodemographic predictors. The latent class corresponding to normal performances across the dichotomized test scores was used to define unimpaired individuals. The supplementary materials include a detailed discussion of these measures, methods, and results.
Measures
The CERAD List Learning test consists of 10 items repeated over three trials. The words are presented one at a time on cards for two seconds each, and the respondent reads the word from the card out loud. After each trial, the respondent is asked to recall as many words as possible from memory. The order of words differs with each trial.
Covariates
All covariates examined, except for item position, were obtained using the English Lexicon Project (ELP; Balota et al., 2007). The ELP is a multi-university effort to develop and maintain a database of descriptive and behavioral data on English-language words. Not all recorded variables in the ELP were available for all ten words on the CERAD word list, so selection of covariates was restricted to variables of interest available for all words. Covariates included word frequency, concreteness, semantic diversity, age of acquisition, body-object integration, phonemes, and item order. An Appendix is provided to define each of these variables for readers.
Data Analyses
All data analyses for this study utilized the R statistical language and environment (version 4.1.0; R Core Team, 2021). Model fitting and evaluation utilized the brms package (Bürkner, 2017), which itself calls the rstan package (Stan Development Team, 2020a) to compile models in the Stan programming language (Carpenter et al., 2017; Stan Development Team, 2020b). Summary of model results also utilized the bayestestR package (Makowski et al., 2019b).
Each model was estimated across four chains running 3,000 iterations each (1,000 being warmup), resulting in 8,000 post-warmup posterior samples. Estimation of models utilized a QR decomposition to reduce the influence of correlated covariates on posterior sampling (Stan Development Team, 2020b). The study was preregistered: https://osf.io/pyd63. A GitHub repository is available for all supplementary materials referenced, including R script files to reproduce the analyses: https://github.com/w-goette/eIRT-CERAD.
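As a minimal sketch of these estimation settings (assuming a hypothetical long-format data frame cerad_long with columns response, item, and person; the Rasch-style formula here is only a placeholder for the models specified in the next section):

```r
library(brms)

# Hypothetical long-format data: one row per person x word x trial with a
# dichotomous 'response' (1 = recalled, 0 = not recalled)
fit <- brm(
  bf(response ~ 1 + (1 | item) + (1 | person)),
  data   = cerad_long,
  family = brmsfamily("bernoulli", link = "logit"),
  chains = 4,
  iter   = 3000,      # 1,000 warmup + 2,000 post-warmup draws per chain = 8,000 total
  warmup = 1000,
  cores  = 4,
  seed   = 1234
)
# For formulas with several population-level covariates (as in the explanatory
# models), recent brms versions allow bf(..., decomp = "QR") to apply the QR
# decomposition described above.
```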
Model Specification
The measurement and explanatory IRT models utilized a generalized (non-)linear mixed model framework (De Boeck et al., 2011; De Boeck & Wilson, 2004). Embretson and Reise (2000) report two key IRT model assumptions: appropriately specified form of the item characteristic curves and local independence. The former is assessed by comparing the fit of different IRT model types. Per hypothesis one, we examined a Rasch (items vary only in their easiness) and a 2PL model (items vary in both their easiness and discrimination). Local independence occurs when, conditional on the latent trait, the probability of correctly answering an item is independent of responses to other items. This assumption can be violated in a verbal-learning test because someone is likely to recall a previously learned word on a later trial even after conditioning on the latent memory trait.
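Using the general brms IRT syntax described by Bürkner (2020b), the two candidate measurement models can be sketched as follows; variable names are hypothetical and priors are omitted for brevity, so this is an illustrative sketch rather than the study's exact code (which is available in the linked repository).

```r
library(brms)

# Rasch-style model: items vary only in easiness (random item intercepts)
formula_rasch <- bf(response ~ 1 + (1 | item) + (1 | person))

# 2PL-style model: items also vary in discrimination, modeled on the log scale
# so that exp(logalpha) remains positive; |i| correlates the item-level
# easiness and discrimination effects
formula_2pl <- bf(
  response ~ exp(logalpha) * eta,
  eta      ~ 1 + (1 |i| item) + (1 | person),
  logalpha ~ 1 + (1 |i| item),
  nl = TRUE
)
```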
There are several potential methods to account for this residual dependence. One option is to specify multiple rather than a single latent factor (De Boeck et al., 2011; Embretson & Reise, 2000). This is a potentially clinically useful model as each trial of the word list is fit as a separate latent trait, making it analogous to interpreting raw scores on each learning trial. It was assumed that these multidimensional factors are correlated. This correlated multidimensional model can be extended to a change model, which has the advantage of modeling improvement from trial-to-trial rather than just correlation from trial-to-trial (Cho et al., 2013). A final multidimensional model that could be used is a growth model where learning between trials is modeled, which is similar to estimating a learning curve. Path diagrams for these models are shown in Figure 1.
Figure 1. Path Diagrams and brms R Code for Multidimensional Models.
The above path diagrams contextualize the models that use multidimensionality to account for violation of the local independence assumption, along with the brms formula code used to fit each model for readers interested in parsing the formulas. Panel (a) shows the base unidimensional model that assumes no violation. Panel (b) depicts the simple multidimensional model with each trial loading on a separate but correlated factor. Panel (c) corresponds to the latent change model. Finally, panel (d) is the growth model with loadings for all items within a trial shown.
An alternative solution to local dependence that does not require transitioning to multidimensional models is to define a dependency matrix among the items that depend on one another (De Boeck et al., 2011; Meulders & Xie, 2004; Tuerlinckx & De Boeck, 2004). For a learning test where items are repeated multiple times, the dependency covariate is a recursive item-level effect where parameters for later items differ depending on responses to previous items. Per hypothesis two, each of these approaches was examined.
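One simple way to approximate this dependency structure, sketched here under the assumption of a long-format data frame with hypothetical column names (not the study's exact code), is to derive lagged-recall indicators and let them shift the item parameters on later trials:

```r
library(dplyr)

cerad_long <- cerad_long %>%
  arrange(person, word, trial) %>%
  group_by(person, word) %>%
  # prev_recall: was this same word recalled on the immediately preceding trial?
  mutate(prev_recall = dplyr::lag(response, default = 0)) %>%
  ungroup() %>%
  mutate(dep_t2 = as.integer(trial == 2 & prev_recall == 1),   # trial 1 -> 2 dependency
         dep_t3 = as.integer(trial == 3 & prev_recall == 1))   # trial 2 -> 3 dependency

# dep_t2 and dep_t3 can then be added as predictors of the easiness (and, for
# the trial-specific model, the discrimination) parameters in the brms formula.
```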
Prior Specification
Priors were specified based on the recommendations in Bürkner (2020a, 2020b). These priors follow general recommendations for skeptical, weakly informative priors (Gelman et al., 2013; Gill, 2015; McElreath, 2016). All continuous variables were z-scaled to ensure that all predictors were on similar scales that would be consistent with the specified priors. Prior predictive checks are available in the supplementary materials.
Model Evaluation and Selection
Evaluation of fitted models began with inspection of whether chains mixed, both visually via trace plots and quantitatively via the potential scale reduction factor (R̂). Bulk and tail effective sample sizes (ESS) were examined to screen for problems in model estimation or specification. These diagnostic plots for all models are available through the supplementary materials. To compare models, the leave-one-out cross-validation information criterion (LOOIC; Gelman et al., 2013; McElreath, 2016) and model stacking (Yao et al., 2018) were utilized.
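In brms, this comparison can be sketched as follows for the Rasch and 2PL fits (object names are hypothetical); loo_compare() reports differences on the expected log predictive density (elpd) scale, where LOOIC = −2 × elpd.

```r
library(brms)

fit_rasch <- add_criterion(fit_rasch, "loo")
fit_2pl   <- add_criterion(fit_2pl, "loo")

loo_compare(fit_rasch, fit_2pl)                             # elpd differences and their SEs
loo_model_weights(fit_rasch, fit_2pl, method = "stacking")  # model stacking weights
```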
Results Interpretation
Parameter results are reported in a hierarchical manner (Makowski et al., 2019a). First, parameters were inspected for evidence of an effect using the probability of direction (pd) and maximum a posteriori (MAP) p-value. Second, the magnitude of the effect was operationalized via a region of practical equivalence (ROPE) around an effect of zero. The ROPE in this study ranged from −0.18 to 0.18, which on the logistic scale would correspond to a negligible effect size (i.e., equivalent to Cohen’s d between −0.10 and 0.10). Two ROPE metrics are provided: first, the proportion of the effect’s 95% credible interval within the ROPE and second, the proportion of the effect’s entire posterior that falls within the ROPE.
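These summaries can be obtained with bayestestR; a minimal sketch, assuming a fitted brms model object named fit_final:

```r
library(bayestestR)

describe_posterior(
  fit_final,
  centrality = "median",
  ci         = 0.95,
  ci_method  = "hdi",
  test       = c("p_direction", "p_map", "rope"),
  rope_range = c(-0.18, 0.18),  # negligible effect on the logit scale (roughly |d| < 0.10)
  rope_ci    = 0.95             # rope_ci = 1 gives the share of the full posterior in the ROPE
)
```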
Finally, posterior predictive checks (PPCs) were used to visualize the model’s fit with the data (Gelman et al., 2013; Gill, 2015; McElreath, 2016). Item and person fit statistics were also calculated for the measurement model (Bürkner, 2020a).
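For dichotomous responses, bar-style posterior predictive checks like those in Figures 2–4 can be generated directly from the fitted model; a sketch assuming a recent brms version and the hypothetical fit_final object:

```r
library(brms)

# Overall counts of incorrect (0) and correct (1) responses (cf. Figure 2)
pp_check(fit_final, type = "bars", ndraws = 100)

# The same check grouped by item (cf. Figure 3)
pp_check(fit_final, type = "bars_grouped", group = "item", ndraws = 100)
```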
Results
Sample Descriptives
Of the original 3,496 participants in HCAP, 576 (16%) were excluded due to informant-reported history of an excluded medical condition, another 1,408 (40%) were excluded due to informant-reported functional impairment on the BDRS, and an additional 293 (8%) were excluded due to cognitive impairment on testing. The final sample consisted of 1,219 unimpaired HCAP participants. Complete summary statistics for the continuous variables are shown in Table 1, and categorical variables are summarized in Table 2. On average, the sample is older (M = 74.4 years, SD = 7.1) with some college education (M = 13.3 years, SD = 2.7). The sample is also primarily White (78%), non-Hispanic (94%), female (64%), above the federal poverty threshold (91%), and living in urban areas (50%).
Table 1.
Summary of Continuous Demographic Variables
  | Mean | Median | SD | Minimum | Maximum | Missing |
---|---|---|---|---|---|---
Age (years) | 74.4 | 74 | 7.1 | 64 | 101 | 0 / 0% |
Education (years) | 13.3 | 13 | 2.7 | 0 | 17 | 1 / < 1.0% |
Maternal Education (years) | 10.2 | 12 | 3.4 | 0 | 17 | 95 / 8% |
Paternal Education (years) | 9.9 | 10 | 3.8 | 0 | 17 | 160 / 13% |
Years Worked | 37.7 | 41 | 15.2 | 0 | 71 | 0 / 0% |
2016 Total Income (US dollars) | 72,000 | 40,848 | 113,000 | 0 | 1,510,342 | 0 / 0% |
CESD (raw score) | 1.5 | 1 | 2.1 | 0 | 11 | 1 / < 1.0% |
Note: All educational attainment variables are reported as years of completed education up to 17 years with any additional years still being recorded as 17. Values in the Missing column are structured as number missing / percentage missing.
SD = standard deviation, US = United States, CESD = Center for Epidemiological Studies Depression scale
Table 2.
Summary of Categorical Demographic Variables
  | N | % |
---|---|---
Race | ||
White | 956 | 78 |
Black | 220 | 18 |
Other | 43 | 4 |
Ethnicity | ||
Hispanic | 73 | 6 |
Non-Hispanic | 1145 | 94 |
Missing | 1 | < 1 |
Sex | ||
Male | 434 | 36 |
Female | 785 | 64 |
Federal Poverty Threshold | ||
Above | 1110 | 91 |
Below | 99 | 8 |
Missing | 10 | 1 |
Psychiatric Diagnosis | ||
Yes | 178 | 15 |
No | 1041 | 85 |
Subjective Health | ||
Poor | 54 | 4 |
Fair | 216 | 18 |
Good | 439 | 36 |
Very Good | 401 | 33 |
Excellent | 109 | 9 |
Rural-Urban Code | ||
Urban | 611 | 50 |
Suburban | 271 | 22 |
Exurban | 316 | 26 |
Missing | 21 | 2 |
Subjective Memory | ||
Poor | 39 | 3 |
Fair | 285 | 23 |
Very Good | 298 | 24 |
Excellent | 45 | 4 |
Missing | 6 | < 1.0 |
Note: Unless otherwise indicated, there are no missing values for variables. Urban-Rural code classification is based on the 2013 Rural-Urban Continuum Codes from the Economic Research Service (2020). The Health and Retirement Study condenses many continuum classifications as “Exurban” to avoid the risk of very small cells that, with other data, could lead to identification of participants.
N = count of cell size, % = percent of total sample in the cell
Measurement Model Selection
Complete details about models’ fit are provided in the supplementary materials. For hypothesis one, the LOOIC and model stacking comparisons for the Rasch and 2PL models are shown in Table 3. The 2PL was preferred to the Rasch specification.
Table 3.
Comparison of Model Specifications for the Measurement Model
  | ΔLOOIC | SE of ΔLOOIC | Stacking Weight |
---|---|---|---
Comparison of Rasch to 2PL Models | |||
2PL | 0.0 | 0.0 | 0.89 |
Rasch | −280.4 | 26.7 | 0.11 |
Comparison of No Local Dependence Violation Model to Multidimensional Models | |||
Simple 2PL Model | 0.0 | 0.0 | 0.88 |
Growth Model | −1.7 | 0.7 | 0.12 |
Change Model | −2.1 | 0.6 | 9.4×10⁻⁴ |
Multidimensional Model | −5.5 | 0.7 | 7.8×10⁻⁵ |
Comparison of No Local Dependence Violation to Local Dependence Model | |||
Local Dependence | 0.0 | 0.0 | 0.76 |
Simple 2PL | −71.1 | 16.7 | 0.24 |
Comparison of Uniform Local Dependence to Item-Specific Dependence | |||
Uniform Local Dependence | 0.0 | 0.0 | 0.69 |
Item-Specific Dependence | −4.4 | 4.8 | 0.31 |
Comparison of Uniform Local Dependence to Trial-Specific Dependence | |||
Trial-Specific Dependence | 0.0 | 0.0 | 0.91 |
Uniform Local Dependence | −100.2 | 15.5 | 0.09 |
Note: ΔLOOIC refers to the difference in the LOOIC with the next column (SE of ΔLOOIC) reflecting the standard error of this difference. Differences greater than ±1.96(SE of ΔLOOIC) may be considered statistically significant. Model stacking weight reflects the relative weight given to each model to produce the most accurate predictions (two equally performing models would have weights of 0.5 each).
LOOIC = leave-one-out cross-validation information criterion, SE = standard error
For hypothesis two, multiple latent traits, change model, and growth curve specifications were added to the basic model to account for possible residual dependency. To ensure appropriate comparisons, a 2PL model assuming no assumption violation was included in the comparisons. The comparisons for these models are shown in Table 3 and demonstrate that adding dimensionality to the model does not improve its performance.
The local dependency model was then fit and compared to the 2PL model. This model demonstrated superior performance to the simple 2PL model (see Table 3). The dependency matrix was further decomposed into item-specific and trial-specific effects to test the assumption of whether learning effects were consistent across items or trials, respectively. Item-specific dependency effects did not definitively improve model fit, but trial-specific effects did (see Table 3). The results indicate the CERAD is unidimensional but has significant residual method effects between items. The local dependency model is a recursive model wherein the probability of success on an item depends on prior responses to specific items (Tuerlinckx & De Boeck, 2004). The effect of this learning seems to be greater in the second trial than the third trial (see first rows of Table 6 for the coefficients of these dependency effects).
Table 6.
Effects in the Measurement and Explanatory Models for the CERAD List Learning Test
  | Estimate (mdn) | 95% HDI LB | 95% HDI UB | pd | MAP p-value | % HDI in ROPE | % Total in ROPE | R̂ | ESS |
---|---|---|---|---|---|---|---|---|---
Dependency Effects in the Measurement Model | |||||||||
Easiness | |||||||||
Trial 1 → 2 | 1.03 | 0.93 | 1.13 | > 0.99 | < 0.01 | < 1.0 | < 1.0 | 1.00 | 5459.19 |
Trial 2 → 3 | 0.12 | 0.003 | 0.24 | 0.98 | 0.13 | 84 | 83 | 1.00 | 2682.04 |
Discrimination | |||||||||
Trial 1 → 2 | 0.57 | 0.40 | 0.79 | > 0.99 | < 0.01 | < 1.0 | < 1.0 | 1.00 | 5783.77 |
Trial 2 → 3 | 1.32 | 1.07 | 1.63 | > 0.99 | 0.04 | 16 | 17 | 1.00 | 3250.12 |
Item Covariate Effects in the Explanatory Model | |||||||||
Easiness | |||||||||
Intercept | −0.04 | −0.09 | 0.01 | 0.97 | 0.21 | > 99 | > 99 | 1.00 | 6289.50 |
Trial 2 | 0.97 | 0.92 | 1.03 | > 0.99 | < 0.01 | < 1.0 | < 1.0 | 1.00 | 12591.77 |
Trial 3 | 1.34 | 1.29 | 1.40 | > 0.99 | < 0.01 | < 1.0 | < 1.0 | 1.00 | 13705.80 |
Item Position | 12.5 | 9.4 | 15.6 | > 0.99 | < 0.01 | < 1.0 | < 1.0 | 1.00 | 23172.45 |
(Item Pos.)2 | 29.9 | 26.8 | 32.9 | > 0.99 | < 0.01 | < 1.0 | < 1.0 | 1.00 | 11293.41 |
STX Freq. | −0.35 | −0.38 | −0.31 | > 0.99 | < 0.01 | < 1.0 | < 1.0 | 1.00 | 8607.72 |
Concrete | −0.40 | −0.44 | −0.35 | > 0.99 | < 0.01 | < 1.0 | < 1.0 | 1.00 | 8702.64 |
Diversity | −0.21 | −0.27 | −0.17 | > 0.99 | < 0.01 | 6 | 9 | 1.00 | 13978.29 |
AoA | −0.17 | −0.21 | −0.12 | > 0.99 | < 0.01 | 78 | 76 | 1.00 | 7548.05 |
BOI | 0.37 | 0.32 | 0.41 | > 0.99 | < 0.01 | < 1.0 | < 1.0 | 1.00 | 10286.81 |
Phonemes | 0.11 | 0.08 | 0.15 | > 0.99 | < 0.01 | > 99 | > 99 | 1.00 | 10204.04 |
Discrimination | |||||||||
Intercept | −0.29 | −1.49 | 1.13 | 0.65 | 0.82 | 18 | 18 | 1.00 | 10204.04 |
Trial 2 | 1.25 | 1.11 | 1.42 | > 0.99 | < 0.01 | 24 | 26 | 1.00 | 9263.37 |
Trial 3 | 1.23 | 1.08 | 1.40 | > 0.99 | 0.01 | 34 | 34 | 1.00 | 8992.94 |
Item Position | 0.13 | 0.02 | 0.86 | 0.98 | 0.12 | < 1.0 | 2 | 1.00 | 20544.07 |
(Item Pos.)2 | 0.39 | 0.05 | 2.60 | 0.83 | 0.66 | 10 | 10 | 1.00 | 18617.02 |
STX Freq. | 1.17 | 1.08 | 1.27 | > 0.99 | < 0.01 | 723 | 71 | 1.00 | 8843.09 |
Concrete | 1.06 | 0.94 | 1.18 | 0.85 | 0.58 | > 99 | 98 | 1.00 | 7859.73 |
Diversity | 0.87 | 0.78 | 0.98 | 0.99 | 0.06 | 76 | 75 | 1.00 | 7400.67 |
AoA | 1.10 | 0.99 | 1.22 | 0.97 | 0.18 | 97 | 94 | 1.00 | 10317.84 |
BOI | 1.21 | 1.09 | 1.33 | > 0.99 | < 0.01 | 45 | 45 | 1.00 | 7234.42 |
Phonemes | 0.78 | 0.71 | 0.85 | > 0.99 | < 0.01 | 4 | 6 | 1.00 | 7118.53 |
Note: The first three columns (Estimate and 95% lower- and upper-bound of the highest density interval) capture the estimated effect and its uncertainty. The next two columns (pd and MAP p-value) summarize the probability of the effect being non-zero. As a note, pd is directly related to typical p-values such that a pd = 0.975 corresponds to a two-sided p-value of 0.05. The next two columns (the percentages of the highest density interval and total posterior within the region of practical equivalence) summarize the probability of the effect being practically equivalent to zero (i.e., of having a negligible effect size). The final two columns provide the R̂ and effective sample size diagnostics that help inform whether the model converged.
Mdn = median, HDI = highest density interval, LB = lower bound, UB = upper bound, pd = probability of direction, MAP = maximum a posteriori, ROPE = region of practical equivalence, ESS = effective sample size, STX = SUBTLEXUS, Freq. = frequency, Item Pos. = item position, AoA = age of acquisition, BOI = body-object integration
Measurement Model Summary
The final model demonstrated excellent posterior predictive checks, suggesting that the model is well specified (see Figures 2–4). Items in the second trial tended to demonstrate misfit to the measurement model (see Table 4). Only 51 participants (4%) showed problematic person fit. Plots of the item characteristic curves (ordered within rows to match the order on the word list) are shown in Figures 5, 6, and 7 for each of the three trials, respectively. These plots also show the effect of having previously recalled a given word. Additional plots summarizing the test’s performance are available in the supplementary materials. The model’s estimates of latent trait values for the sample correlated strongly with raw scores (Pearson’s r = 0.93), but the two scores do rank individuals differently (Kendall’s τ = 0.79). Figure 8 displays a scatter plot with marginal densities and line of best fit for the two scores.
Figure 2. Posterior Predictive Check of the Final Model and All Responses (incorrect [0] vs. correct [1]).
Shown above are the observed raw data (solid grey boxes; y in the legend) and the model’s predicted observations (black dot with error lines; yrep in the legend) for all of the items and participants. The y-axis shows the number of times the observation was made. Since the responses are dichotomous, the plots show the total number of incorrect (0) and correct (1) responses.
Figure 4. Posterior Predictive Check of the Final Model and Responses for a Random Sample of 6 Participants (incorrect [0] vs. correct [1]).
Shown above are the observed raw data (solid grey boxes; y in the legend) and the model’s predicted observations (black dot with error lines; yrep in the legend) grouped by a random subset of participants for all items. The y-axis shows the number of times the observation was made. Since the responses are dichotomous, the plots show the total number of incorrect (0) and correct (1) responses across the test (meaning that the height of the right column is the raw sum score).
Table 4.
Item Fit Statistic Results From the Final Model
Item (Trial One) | Bayesian p-value | Item (Trial Two) | Bayesian p-value | Item (Trial Three) | Bayesian p-value
---|---|---|---|---|---
Butter | 0.50 | Ticket | 0.61 | Queen | 0.56 |
Arm | 0.48 | Cabin | 0.57 | Grass | 0.51 |
Shore | 0.47 | Butter | 0.34 | Arm | 0.49 |
Letter | 0.51 | Shore | 0.05 | Cabin | 0.54 |
Queen | 0.51 | Engine | 0.50 | Pole | 0.50 |
Cabin | 0.51 | Arm | 0.48 | Shore | 0.55 |
Pole | 0.54 | Queen | 0.57 | Butter | 0.49 |
Ticket | 0.52 | Letter | 0.55 | Engine | 0.51 |
Grass | 0.56 | Pole | 0.61 | Ticket | 0.50 |
Engine | 0.55 | Grass | 0.81 | Letter | 0.52 |
Note: The Bayesian p-value here refers to the posterior predictive p-value and reflects a comparison of the log-likelihoods of observed and predicted responses from the model. Values close to 0.5 are ideal, with values closer to 0 or 1 reflecting misfit.
Figure 5. Item Characteristic Curves of Words From Trial One of the List.
Item order from left-to-right (top row first and then bottom row) reflects the order of words as presented on the word list. Item characteristic curves show the probability of a correct recall (y-axis) as a function of the latent trait on a z-score metric (x-axis). Item difficulty corresponds to the latent trait value when the probability of a correct response is 0.50. Item discrimination corresponds to the slope of the logistic curve.
Figure 6. Item Characteristic Curves of Words From Trial Two of the List.
Item order from left-to-right (top row first and then bottom row) reflects the order of words as presented on the word list. Item characteristic curves show the probability of a correct recall (y-axis) as a function of the latent trait on a z-score metric (x-axis). Item difficulty corresponds to the latent trait value when the probability of a correct response is 0.50. Item discrimination corresponds to the slope of the logistic curve. The dotted line plots the item characteristic curve in the case where that word had been recalled in trial one.
Figure 7. Item Characteristic Curves of Words From Trial Three of the List.
Item order from left-to-right (top row first and then bottom row) reflects the order of words as presented on the word list. Item characteristic curves show the probability of a correct recall (y-axis) as a function of the latent trait on a z-score metric (x-axis). Item difficulty corresponds to the latent trait value when the probability of a correct response is 0.50. Item discrimination corresponds to the slope of the logistic curve. The dotted line plots the item characteristic curve in the case where that word had been recalled in trial two.
Figure 8. Scatterplot of CERAD Raw and Latent Trait Scores with Marginal Histograms and Densities.
Scatterplot of the latent trait (theta) scores on the y-axis against raw scores (total number of words recalled on all three trials) on the x-axis. Marginal density and histogram plots are provided on top (raw scores) and right (theta scores) to visualize the distribution of scores. A line of best fit has been added over the jittered points.
Explanatory Model Summary
Descriptive statistics of the item covariates, including intercorrelations, are reported in Table 5. All item covariates of interest were included together in the model, and indicators for the trial number were also included. The results from this model are shown in Table 6. All item covariates examined had effects that were unlikely to be zero, but effects for the item covariates on the discrimination parameter were less robust, with many having a fair probability of being practically no different from zero. Broadly, words tended to be easier to learn on later trials, in later presentation order, and as body-object integration and number of phonemes increased. In contrast, words with greater frequency, concreteness, semantic diversity, and age of acquisition tended to be harder to learn.
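For readers interested in the structure of this model, the following is a minimal brms-style sketch of how the trial indicators, serial position terms, and word covariates might enter the easiness (eta) and log-discrimination (logalpha) predictors; the column names are hypothetical stand-ins for z-scaled predictors, and the exact formulas used in the study are available in the linked GitHub repository.

```r
library(brms)

formula_explanatory <- bf(
  response ~ exp(logalpha) * eta,
  eta      ~ 1 + trial + pos_linear + pos_quadratic + stx_freq + concrete +
                 diversity + aoa + boi + phonemes +
                 (1 | item) + (1 | person),
  logalpha ~ 1 + trial + pos_linear + pos_quadratic + stx_freq + concrete +
                 diversity + aoa + boi + phonemes +
                 (1 | item),
  nl = TRUE
)
# The residual item intercepts absorb item variance not explained by the
# covariates; the trial-specific dependency indicators from the measurement
# model (e.g., dep_t2, dep_t3) would be carried over as well.
```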
Table 5.
Summary of Word-trait Item Covariates
  | STX Freq. | Concrete | Diversity | AoA | BOI | Phonemes |
---|---|---|---|---|---|---
STX Freq. | — | −0.18 | 0.11 | −0.33 | <0.01 | 0.08 |
Concrete | −0.12 | — | −0.31 | −0.04 | 0.52 | <0.01 |
Diversity | 0.14 | −0.33 | — | −0.33 | 0.09 | −0.34 |
AoA | −0.42 | −0.11 | −0.58 | — | −0.31 | 0.23 |
BOI | 0.04 | 0.65 | 0.19 | −0.40 | — | −0.08 |
Phonemes | 0.09 | 0.02 | −0.40 | 0.22 | −0.05 | — |
Mean (SD) | 3.2 (0.3) | 4.8 (0.2) | 1.5 (0.2) | 5.3 (1.2) | 5.2 (0.8) | 4.0 (0.8) |
Range | 2.8 – 3.6 | 4.5 – 5.0 | 1.3 – 1.7 | 3.3 – 6.9 | 4.1 – 6.5 | 3.0 – 5.0 |
Note: Values in the upper triangle (above diagonal) are Kendall’s τ correlations and reflect the rate of concordant ranked pairs of variables. Given the small sample size (i.e., n = 10), the non-parametric Kendall’s τ is recommended as it reflects whether, when ranked, two variables tend to place items in the same order. Since Kendall’s τ is less familiar, the lower triangle (below diagonal) reports Spearman’s ρ. The mean, standard deviation, and range of each variable (by column) are shown in the bottom two rows.
STX = SUBTLEXUS, Freq. = word frequency, diversity = semantic diversity, AoA = age of acquisition, BOI = body-object integration, SD = standard deviation
Discussion
This study examined an IRT model of the CERAD List Learning test utilizing novel methods to understand the factors affecting test functioning and the cognitive processes underlying this memory measure. The results generally highlight that test functioning is impacted by a range of factors beyond just repetition of the same words over several trials. Of particular note, results demonstrate how explanatory item models can be built to incorporate and test knowledge about both cognitive and measurement characteristics of a test. Identifying salient cognitive constructs of a task, as was done in our measurement model, and explanatory factors that affect item performance, as was demonstrated in the explanatory model, allows for multiple opportunities to integrate cognitive science into the scoring of existing tests. Even in cases where the resulting model’s scores do not differ from traditional raw scores, identifying measurement traits can help uncover possible sociodemographic factors that may result in unintended item biases. For example, given that several English-language word qualities affected the probability that this English-speaking sample would learn these items, use of such tests with individuals with lower levels of English experience may yield markedly different measurement qualities on the same test for the same words.
Starting with the model-based hypotheses, the results generally support the study’s hypotheses. The 2PL model was preferred to the simpler Rasch model (H1), and a model including a local dependency effect was preferred to other methods for accounting for assumption violations (H2). The local dependency effect is notable as it quantifies the relative impact of having previously learned a word. Examination of the effect and the item characteristic curves reveals that recalling a word on trial one primarily affects the easiness of recalling the word on trial two. In contrast, recalling a word in trial two primarily affects the discrimination of the word on trial three, resulting in a less pronounced shift in the item characteristic curves in that trial. This finding has implications for consideration of inter-trial variability in that recall of words over trials is not expected to be equally probable. Item parameters did vary over trials (H3), though this effect was not uniform across all items (see Figures 5–7). In the explanatory model, item easiness tended to increase over later trials, which was consistent with our hypothesis (H3). In short, items variably differ across trials in their difficulty and discrimination parameters, and understanding test-level functioning requires incorporation of a person’s learning for items over time.
With regard to the item-based hypotheses, the results provided mixed support for the hypotheses. There was an observed serial position effect for item difficulty (H4), and all hypothesized item covariates emerged as predictive of item parameters (H5). Of those item covariates, however, only body-object integration and age of acquisition were in the hypothesized direction. Examining these effects more closely, as the frequency of the word in English increased, it was seemingly harder to learn. It is possible that this represents a novelty element wherein words that are more novel, or rarer in English language, are easier to recall. The word’s concreteness and degree of body-object integration both predicted item difficulty. Words with greater concreteness were harder to recall while those with greater body-object integration were easier. This separation of concreteness and body-object integration suggests that imagery and physical experience with the word is a significant factor beyond just its level of “realness” versus abstraction. If this is true, then it has implications for word selection in alternative cultural and socioeconomic settings where the level of interaction with certain concrete objects might be different. The semantic diversity of words also was related to the probability of recalling the word with less diversity corresponding to easier recall. It is possible that this reflects a kind of interference effect such that words with less semantic diversity are encoded more reliably whereas memory errors are more likely when there is greater related semantic knowledge around that word. It would be of interest to examine whether intrusions in recall commonly include words in the same semantic neighborhood as the target word. The age at which a word is typically learned had a negative effect on recall easiness, suggesting that older and perhaps more overlearned words were easier to remember; however, this effect has a high probability of being essentially zero (76% of its posterior distribution falls within the ROPE). Similarly, greater numbers of phonemes in a word corresponded to easier recall, perhaps because of the additional processing needed to pronounce these words out loud, but this effect is also most likely to be no different than zero (>99% of its posterior falls within the ROPE).
The presence of a serial position effect on item parameters is consistent with the extensive documentation of this effect (e.g., Bayley et al., 2000; Crockett et al., 1992; Foldi et al., 2003; La Rue et al., 2008). The serial position effect can be seen in the item characteristic plots, with the most prominent effects in the discrimination of the later items of each trial (see Figures 5–7). Whereas earlier items measure the latent trait with greater accuracy, the later items (particularly in the first trial) have nearly linear curves, suggesting that they provide relatively limited information about someone’s memory abilities. This observation is consistent with the IRT analysis of a different list learning test by Gavett and Horwitz (2012). They found that the last two items of the test had negative discriminations, indicating that individuals with lower memory abilities were more likely to recall these items than those with higher memory abilities (i.e., a recency effect). Unfortunately, the current models had to restrict the discrimination parameter to be positive to remain identifiable (Bürkner, 2020b). In the explanatory model, the linear and quadratic terms are strongly supported for predicting item easiness; however, there is relatively weaker evidence for the quadratic relative to the linear term on item discrimination. In other words, there is stronger evidence that items tend to become less discriminating as their presentation order increases than there is evidence that this effect is quadratic.
IRT analyses of the Rey Auditory Verbal Learning Test (Gavett & Horwitz, 2012) and California Verbal Learning Test-II (Thiruselvam & Hoelzle, 2020) found evidence for multidimensionality, though both studies report univocality of a dominant memory factor. Consistent with this latter possibility, the current results strongly support unidimensionality in the CERAD List Learning test as well (e.g., correlations between the trials were 0.99 in the multidimensional model). Evidence of multidimensionality may reflect the requirement of attention, encoding, language, and free recall skills in order to perform a primarily memory task. Rather than model these additional cognitive skills as extra latent factors measured by a list learning test, the item covariates may help to incorporate knowledge about these additional skills. For example, the results reported in this study strongly support that prior semantic and language knowledge affect learning of individual words. Under this paradigm, the use of other cognitive and experiential skills to approach a test and respond to items within the desired cognitive domain (i.e., memory in this case) can be modeled as a feature of the items rather than a by-product of the test. Recall that, in IRT, test functioning can be summarized as the sum of item functioning. When item features associated with other cognitive domains can be used to explain how items on a test function, the influence of other cognitive skills is incorporated into understanding test measurement; this method is likely more flexible than adding dimensionality and is more explicit about modeling how other cognitive skills affect test performance.
The potential for Bayesian explanatory IRT to inform test harmonization is three-fold. First, by specifying item parameter predictions from a previous explanatory IRT model as priors for a new sample, it is possible to approximate existing forms of IRT linking. Consider the case of CERAD lists with only partial overlap of words. Using the current results to predict item parameters for the non-overlapping words allows for a more informative prior specification for those parameters, meaning that models could be fit on smaller samples (McNeish, 2016). By comparing the informative priors to the new sample’s posteriors, it would be possible to test whether the new group is demonstrating potential differential item functioning or whether the group’s parameters are practically equivalent to those observed in the original model. The fixed item parameters and score transformation approach to IRT linking can be conducted using the overlapping items as anchors if a more traditional method is desired (Lee & Ban, 2007). Second, knowledge of item features can inform whether certain samples should be compared at all. The current results, for example, highlight the importance of linguistic traits in English-speaking individuals. Using these results to link CERAD scores for other individuals is inappropriate unless the same item covariate data can be known for the other language, and even then, it is reasonable to expect that sociocultural factors (particularly education) could impact variables like concreteness or body-object integration. Finally, the methods elucidate considerations for research planning. In most cases, tests developed in one sociocultural context will require modification when used in other populations, so knowing the relative importance and effect of various item features can help inform more systematic modifications. The ideal use case would be such complete knowledge of item features for a given test that alternative items could be selected a priori that satisfy specific adaptation needs while still producing an alternative form of the test in which the psychometrics (e.g., reliability, information function, expected score function) are parallel to the original version.
It may be illustrative to examine a real-world scenario in which this approach might be useful. The assessment of increasingly diverse populations necessitates consideration of both cultural and linguistic factors (Silverberg et al., 2011). In the international 10/66 Dementia Research Group project, the CERAD word list was modified to replace four words (pole, shore, cabin, and engine) with alternative words that were believed to be more culturally appropriate (corner, stone, book, and stick) in addition to standard translation and back-translation of the word list (Prince et al., 2003). Utilization of such cultural and linguistic considerations contributes to a more appropriate test for a given population, but when scores from multiple translated and adapted versions of the same test need to be examined together, there still remains the need to identify a method for such comparison. Unfortunately, the current study only had access to item covariates related to English-language words, so we can only extrapolate to item parameters for English words. In this case, say a researcher believes that these CERAD word substitutions as used in the 10/66 battery would also be better suited for some English-speaking individuals within the United States. Using the explanatory item covariates reported in this study, one could make predictions for what the item parameters of “corner,” “stone,” “book,” and “stick” will be even before item-level data for these word substitutions are collected. In a Bayesian framework, these predictions can be used as informative priors that would allow IRT modeling on smaller samples than commonly needed for IRT.
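As a rough illustration of that prediction step (not something performed in the current study), the fixed-effect summaries from an explanatory fit could be combined with a candidate word's standardized ELP covariate values; all object names, coefficient names, and numbers below are hypothetical.

```r
library(brms)

# Hypothetical z-scaled ELP covariate values for one candidate replacement word
new_word <- c(stx_freq = 0.5, concrete = 1.0, diversity = -0.3,
              aoa = -1.2, boi = 0.8, phonemes = -1.1)

fe  <- fixef(fit_explanatory)             # posterior summaries of the covariate effects
idx <- paste0("eta_", names(new_word))    # assumes nlpar = "eta" and matching covariate names

# Point prediction of the new word's easiness; full posterior draws (e.g., via
# as_draws_df) would yield a distribution that could serve as an informative prior
pred_easiness <- fe["eta_Intercept", "Estimate"] + sum(fe[idx, "Estimate"] * new_word)
pred_easiness
```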
There are several limitations to the current study that must be addressed. Due to the limited variability of some item covariates in this study, caution is warranted regarding how readily out-of-sample predictions can be made as well as the sensitivity of the current results to range restriction. This being said, most list learning tests try to find concrete, frequently used words, so this limited variability is likely a feature of most word lists. Similarly, it is possible that some interactions of variables are relevant. For example, dual-code theory predicts concreteness will impact memory only when adequate time to process is given, meaning that naming speed, the length of time it typically takes to recognize the word as a word, may be a useful proxy for processing time within the two-second word exposure. These interactions were not examined.
Another limitation of the current study is that the clinical implications are still not clear. While there are theoretical reasons to believe that these results could prove useful in improving diagnostic accuracy of a list learning test, selecting more appropriate word substitutes across linguistic and cultural settings, and equating performance across a variety of list learning tests, these possibilities remain untested in the current study. Similarly, it is unclear whether the parameter estimates generalize to other word lists because the CERAD is unique in that the participant sees and reads the words aloud, so the observed effects may be specific to the CERAD administration methods. Some important follow-ups to these initial findings are already planned, such as examination of differential item functioning across demographic and diagnostic groups (which is needed to understand whether different models are needed for different populations) and analysis of predictors of the latent memory ability.
Constraints on Generality
A primary limitation applies directly to the generalizability of the normative references used for item covariates. The ELP normative studies rarely provided a complete demographic composition of the study sample, so it is unclear to what extent these lexical item qualities are representative of diverse English-speaking populations. As a result, the extent to which these item covariates reflect any one individual’s lexical familiarity is unclear and is likely a limitation to the generalizability of the identified covariates. For example, it is not expected that word frequency is equally distributed across all dialects and vernaculars in English, and as such, these results should be treated with appropriate caution when a sample’s English-language use may diverge from those represented in the ELP norms. Similarly, this study’s sample is primarily White and non-Hispanic, so item covariates and parameters identified in this study may not generalize to other populations. An additional benefit of explanatory IRT is the ability to explain differential item functioning (DIF) by testing whether explanatory factors differ between two groups exhibiting DIF; thus, while an immediate limitation of the current study, this question of generalizability speaks directly to broader questions of which items, or kinds of items, are appropriate for which populations.
Conclusion
The current study highlights the complexity of item parameters within verbal learning measures where items are repeated multiple times, and it demonstrates a modeling approach that can inform test development and interpretation through the incorporation of explanatory item covariates. This information is likely most important for individuals whose sociodemographic and/or linguistic backgrounds may be dissimilar to those included in the lexical normative studies used for item covariates here. At the same time, these results may be useful for informing test translations and modifications by clarifying the specific properties on some item covariate that a word replacement or translation would need in order to provide similar item parameters to an original version of the test. Likewise, the methods can be used to carefully study why items and tests may function differently between populations, allowing for greater insights about item and test selection within diverse populations. Such approaches have implications for study design and post-hoc harmonization in large-scale studies. The explanatory IRT modeling framework allows explicit modeling of both salient and subtle design, measurement, and cognitive aspects of neuropsychological testing, and as such, the approach has significant potential for informing, interpreting, and harmonizing a wide range of assessments.
Figure 3. Posterior Predictive Check of the Final Model and Responses By Item (incorrect [0] vs. correct [1]).
Shown above are the observed raw data (solid grey boxes; y in the legend) and the model’s predicted observations (black dot with error lines; yrep in the legend) grouped by items for all participants. The y-axis shows the number of times the observation was made. Since the responses are dichotomous, the plots show the total number of incorrect (0) and correct (1) responses for each item.
Key Points.
Question:
What aspects of learning and language affect how list learning tests measure memory?
Findings:
Words on list learning tests are easier or harder to encode because of prior English-language exposure and associations with the words, but once a word is learned and incorporated into one’s lexicon, it is easier to recall later within the context of a memory test.
Importance:
The methods and approach described can be extended to a wide variety of tests to integrate cognitive science, deepening the understanding of common neuropsychological tests and why some groups may perform differently on them.
Next Steps:
The methods need to be applied to other tests and applied directly to the question of whether certain item properties can explain things like item bias or help design tests that are better suited for certain cross-cultural research.
Acknowledgments
This study has been pre-registered on the Open Science Foundation (OSF) through the AsPredicted template: https://osf.io/pyd63. All R code is provided at https://github.com/w-goette/eIRT-CERAD. Data used in this study come from the Health and Retirement Study (U01 AG009740), Harmonized Cognitive Assessment Protocol (R01 AG051142), and RAND Center for the Study of Aging. Preliminary data results were presented as a poster at the annual conference of the International Neuropsychological Society in 2021.
Appendix
Definitions for the Item Covariates Examined
Word Frequency
Frequency of words in the English language was derived from the SUBTLEXUS corpus of words (Brysbaert & New, 2009). The corpus was compiled from English-language subtitles for U.S. films and television series, resulting in a total of over 65 million words from approximately 8,400 movies and television episodes. The frequency index is log-scaled and reflects word frequency per one million words.
Concreteness
The dual-code theory (Paivio, 1991, 2013) predicts that concrete words will be easier to recall than abstract ones but only if adequate time is provided for associated perceptual memory cues to be activated from the perception of words. The concreteness of a word refers to the strength to which a word’s concept is associated to a perceptible object or experience, which makes it similar to embodied cognition and language research (Fischer & Zwaan, 2008). Norms for the concreteness of English words are provided by Brysbaert et al. (2014) who presented over 60,000 words from several different word corpuses to about 4,000 U.S. residents recruited from Amazon’s Mechanical Turk (a platform for crowdsourcing data collection) who rated the words’ level of concreteness.
Semantic Diversity
Semantic diversity quantifies how varied the meaning of a word is when it appears in different contexts. Semantic diversity may be considered a measure of the word’s ambiguity, particularly if the word were presented on its own and out of any context to clarify its semantic meaning. Norms for the semantic diversity of over 30,000 English words are provided in Hoffman et al. (2013).
Age of Acquisition
Age of acquisition (AoA) for a word refers to the typical age at which a speaker of that language learns a particular word. The normative reference for age of acquisition was obtained by Kuperman et al. (2012) who used the Amazon Mechanical Turk to recruit U.S. residents to estimate the age at which they learned various English words.
Body-Object Integration
A word’s body-object integration captures the degree to which raters can connect human body interaction with the object signified by the word (Pexman et al., 2019). Utilizing similar methods to those for age of acquisition and concreteness, Pexman et al. (2019) collected ratings of over 9,000 English words’ level of body-object integration.
Phonemes
The ELP provides the number of phonemes in each word; a phoneme is the smallest unit of speech sound that distinguishes one word from another. A greater number of phonemes may correspond to greater cognitive processing demand, as participants are asked to read the words aloud, and the density of information to be recalled may be greater when a word contains more phonemes.
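To make the role of these word properties concrete, the following R sketch shows one way norms such as these might be merged onto item-level response data before modeling; all column names and values here are hypothetical and are not taken from the study's actual code (which is available at the GitHub repository listed in the Acknowledgments).

```r
# Hypothetical example: attach word-property norms to item-level responses.
# 'responses' has one row per person x word; 'norms' has one row per word.
responses <- data.frame(
  id    = rep(1:2, each = 3),
  trial = 1,
  word  = rep(c("butter", "engine", "ticket"), times = 2),
  y     = c(1, 0, 1, 1, 1, 0)           # 1 = word recalled, 0 = not recalled
)
norms <- data.frame(
  word         = c("butter", "engine", "ticket"),
  position     = c(1, 5, 8),             # illustrative serial positions on the list
  concreteness = c(4.9, 4.6, 4.5),       # illustrative values, not the published norms
  n_phonemes   = c(4, 5, 5)
)
model_data <- merge(responses, norms, by = "word")
head(model_data)
```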
Item Position
Another important aspect of immediate memory measures is the serial position effect, wherein words at the beginning (primacy effect) and end (recency effect) of a list are more frequently remembered (Weitzner & Calamia, 2020). To examine the serial position effect on the word list, the order in which each word was presented was also included in the model. This effect was specified as a quadratic term using orthogonal polynomials, following the methods described by Debeer and Janssen (2013) for the linear logistic test model.
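A minimal sketch of this kind of specification, continuing the toy model_data from the previous sketch, is given below. It is a simplified Rasch-style approximation with hypothetical variable names, not the exact model reported in the paper.

```r
library(brms)

# Orthogonal linear and quadratic contrasts for serial position (in the spirit
# of Debeer & Janssen, 2013; simplified for illustration).
pos_poly <- poly(model_data$position, degree = 2)
model_data$pos1 <- pos_poly[, 1]
model_data$pos2 <- pos_poly[, 2]

# Rasch-style explanatory model: fixed serial-position effects plus random
# person and item (word) intercepts.
fit <- brm(
  y ~ 1 + pos1 + pos2 + (1 | id) + (1 | word),
  data   = model_data,
  family = bernoulli(link = "logit")
)
summary(fit)
```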
Footnotes
We have no conflicts of interest to disclose.
References
- Balota DA, Yap MJ, Cortese MJ, Hutchison KA, Kessler B, Loftis B, Neely JH, Nelson DL, Simpson GB, & Treiman R (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459. 10.3758/BF03193014
- Bayley PJ, Salmon DP, Bondi MW, Bui BK, Olichney J, Delis DC, Thomas RG, & Thal LJ (2000). Comparison of the serial position effect in very mild Alzheimer’s disease, mild Alzheimer’s disease, and amnesia associated with electroconvulsive therapy. Journal of the International Neuropsychological Society, 6(3), 290–298. 10.1017/S1355617700633040
- Bilder RM (2011). Neuropsychology 3.0: Evidence-based science and practice. Journal of the International Neuropsychological Society, 17(1), 7–13. 10.1017/S1355617710001396
- Bilder RM, & Reise SP (2019). Neuropsychological tests of the future: How do we get there from here? Clinical Neuropsychologist, 33(2), 220–245. 10.1080/13854046.2018.1521993
- Blumenfeld RS, & Ranganath C (2007). Prefrontal cortex and long-term memory encoding: An integrative review of findings from neuropsychology and neuroimaging. Neuroscientist, 13(3), 280–291. 10.1177/1073858407299290
- Brysbaert M, & New B (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. 10.3758/BRM.41.4.977
- Brysbaert M, Warriner AB, & Kuperman V (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911. 10.3758/s13428-013-0403-5
- Bürkner P-C (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. 10.18637/jss.v080.i01
- Bürkner P-C (2020a). Analysing Standard Progressive Matrices (SPM-LS) with Bayesian item response models. Journal of Intelligence, 8(1), 5. 10.3390/jintelligence8010005
- Bürkner P-C (2020b). Bayesian item response modeling in R with brms and Stan. arXiv. https://arxiv.org/pdf/1905.09501.pdf
- Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, & Riddell A (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1). 10.18637/jss.v076.i01
- Cho S-J, Athay M, & Preacher KJ (2013). Measuring change for a multidimensional test using a generalized explanatory longitudinal item response model. British Journal of Mathematical and Statistical Psychology, 66(2), 353–381. 10.1111/j.2044-8317.2012.02058.x
- Crockett DJ, Hadjistavropoulos T, & Hurwitz T (1992). Primacy and recency effects in the assessment of memory using the Rey Auditory Verbal Learning Test. Archives of Clinical Neuropsychology, 7(1), 97–107. 10.1016/0887-6177(92)90022-F
- De Boeck P, Bakker M, Zwitser R, Nivard M, Hofman A, Tuerlinckx F, & Partchev I (2011). The estimation of item response models with the lmer function from the lme4 package in R. Journal of Statistical Software, 39(12), 1–28. 10.18637/jss.v039.i12
- De Boeck P, & Wilson M (2004). Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. Springer.
- Debeer D, & Janssen R (2013). Modeling item-position effects within an IRT framework. Journal of Educational Measurement, 50(2), 164–185. 10.1111/jedm.12009
- Delis DC, Kramer JH, Kaplan E, & Ober BA (2017). CVLT-3: California Verbal Learning Test, Third Edition. Pearson.
- Economic Research Service. (2020). Rural-Urban Continuum Codes. U.S. Department of Agriculture. https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/documentation
- Embretson SE, & Reise SP (2000). Item response theory as model-based measurement. In Item Response Theory for Psychologists (pp. 40–61). Lawrence Erlbaum Associates.
- Fillenbaum GG, van Belle G, Morris JC, Mohs RC, Mirra SS, Davis PC, Tariot PN, Silverman JM, Clark CM, Welsh-Bohmer KA, & Heyman A (2008). CERAD (Consortium to Establish a Registry for Alzheimer’s Disease): The first 20 years. Alzheimer’s & Dementia, 4(2), 96–109. 10.1016/j.jalz.2007.08.005
- Fischer MH, & Zwaan RA (2008). Embodied language: A review of the role of the motor system in language comprehension. Quarterly Journal of Experimental Psychology, 61(6), 825–850. 10.1080/17470210701623605
- Foldi NS, Brickman AM, Schaefer LA, & Knutelska ME (2003). Distinct serial position profiles and neuropsychological measures differentiate late life depression from normal aging and Alzheimer’s disease. Psychiatry Research, 120(1), 71–84. 10.1016/S0165-1781(03)00163-X
- Gavett BE, & Horwitz JE (2012). Immediate list recall as a measure of short-term episodic memory: Insights from the serial position effect and item response theory. Archives of Clinical Neuropsychology, 27(2), 125–135. 10.1093/arclin/acr104
- Gavett BE, Poon SJ, Ozonoff A, Jefferson AL, Nair AK, Green RC, & Stern RA (2009). Diagnostic utility of the NAB List Learning Test in Alzheimer’s disease and amnestic mild cognitive impairment. Journal of the International Neuropsychological Society, 15(1), 121–129. 10.1017/S1355617708090176
- Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, & Rubin DB (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
- Gill J (2015). Bayesian Methods: A Social and Behavioral Sciences Approach (3rd ed.). CRC Press.
- Griffin JW, John SE, Adams JW, Bussell CA, Saurman JL, & Gavett BE (2017). The effects of age on the learning and forgetting of primacy, middle, and recency components of a multi-trial word list. Journal of Clinical and Experimental Neuropsychology, 39(9), 900–912. 10.1080/13803395.2017.1278746
- Health and Retirement Study. Produced and distributed by the University of Michigan with funding from the National Institute on Aging (grant number U01AG009740). Ann Arbor, MI.
- Heeringa SG, & Connor JH (1995). Technical description of the Health and Retirement Survey sample design [White paper]. Institute for Social Research at the University of Michigan. https://hrs.isr.umich.edu/sites/default/files/biblio/HRSSAMP.pdf
- Hoffman P, Ralph MAL, & Rogers TT (2013). Semantic diversity: A measure of semantic ambiguity based on variability in the contextual usage of words. Behavior Research Methods, 45(3), 718–730. 10.3758/s13428-012-0278-x
- Kuperman V, Stadthagen-Gonzalez H, & Brysbaert M (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978–990. 10.3758/s13428-012-0210-4
- La Rue A, Hermann B, Jones JE, Johnson S, Asthana S, & Sager MA (2008). Effect of parental family history of Alzheimer’s disease on serial position profiles. Alzheimer’s & Dementia, 4(4), 285–290. 10.1016/j.jalz.2008.03.009
- Lee W-C, & Ban J-C (2007). Comparison of three IRT linking procedures in the random groups equating design (Report No. 23). https://education.uiowa.edu/sites/education.uiowa.edu/files/documents/centers/casma/publications/casma-research-report-23.pdf
- Makowski D, Ben-Shachar MS, Chen SHA, & Lüdecke D (2019a). Indices of effect existence and significance in the Bayesian framework. Frontiers in Psychology, 10. 10.3389/fpsyg.2019.02767
- Makowski D, Ben-Shachar MS, & Lüdecke D (2019b). bayestestR: Describing effects and their uncertainty, existence and significance within the Bayesian framework. Journal of Open Source Software, 4(40), 1541. 10.21105/joss.01541
- McElreath R (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
- McNeish D (2016). On using Bayesian methods to address small sample problems. Structural Equation Modeling: A Multidisciplinary Journal, 23(5), 750–773. 10.1080/10705511.2016.1186549
- Meulders M, & Xie Y (2004). Person-by-item predictors. In De Boeck P & Wilson M (Eds.), Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. Springer.
- Morris JC, Heyman A, Mohs RC, Hughes JP, van Belle G, Fillenbaum G, Mellits ED, & Clark C (1989). The Consortium to Establish a Registry for Alzheimer’s Disease (CERAD). Part I. Clinical and neuropsychological assessment of Alzheimer’s disease. Neurology, 39(9), 1159–1165. 10.1212/wnl.39.9.1159
- Nelson DL, Kitto K, Galea D, McEvoy CL, & Bruza PD (2013). How activation, entanglement, and searching a semantic network contribute to event memory. Memory & Cognition, 41(6), 797–819. 10.3758/s13421-013-0312-y
- Paivio A (1991). Dual coding theory: Retrospect and current status. Canadian Journal of Psychology, 45(3), 255–287. 10.1037/h0084295
- Paivio A (2013). Dual coding theory, word abstractness, and emotion: A critical review of Kousta et al. (2011). Journal of Experimental Psychology: General, 142(1), 282–287. 10.1037/a0027004
- Perry W (2009). Beyond the numbers: Expanding the boundaries of neuropsychology. Archives of Clinical Neuropsychology, 24(1), 21–29. 10.1093/arclin/acp001
- Petersen RC, Caracciolo B, Brayne C, Gauthier S, Jelic V, & Fratiglioni L (2014). Mild cognitive impairment: A concept in evolution. Journal of Internal Medicine, 275(3), 214–228. 10.1111/joim.12190
- Pexman PM, Muraki E, Sidhu DM, Siakaluk PD, & Yap MJ (2019). Quantifying sensorimotor experience: Body-object interaction ratings for more than 9,000 English words. Behavior Research Methods, 51(2), 453–466. 10.3758/s13428-018-1171-z
- Pozueta A, Rodríguez-Rodríguez E, Vazquez-Higuera JL, Mateo I, Sánchez-Juan P, González-Perez S, Berciano J, & Combarros O (2011). Detection of early Alzheimer’s disease in MCI patients by the combination of MMSE and an episodic memory test. BMC Neurology, 11, 78. 10.1186/1471-2377-11-78
- Prince M, Acosta D, Chiu H, Scazufca M, & Varghese M (2003). Dementia diagnosis in developing countries: A cross-cultural validation study. The Lancet, 361(9361), 909–917. 10.1016/s0140-6736(03)12772-9
- R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org
- Rabin LA, Paolillo E, & Barr WB (2016). Stability in test-usage practices of clinical neuropsychologists in the United States and Canada over a 10-year period: A follow-up survey of INS and NAN members. Archives of Clinical Neuropsychology, 31(3), 206–230. 10.1093/arclin/acw007
- RAND HRS Longitudinal File 2018 (2021). Produced by the RAND Center for the Study of Aging, with funding from the National Institute on Aging and the Social Security Administration. Santa Monica, CA.
- Ribeiro F, Guerreiro M, & de Mendonça A (2007). Verbal learning and memory deficits in mild cognitive impairment. Journal of Clinical and Experimental Neuropsychology, 29(2), 187–197. 10.1080/13803390600629775
- Silva D, Guerreiro M, Maroco J, Santana I, Rodrigues A, Marques JB, & de Mendonça A (2012). Comparison of four verbal memory tests for the diagnosis and predictive value of mild cognitive impairment. Dementia and Geriatric Cognitive Disorders Extra, 2(1), 120–131. 10.1159/000336224
- Silverberg NB, Ryan LM, Carrillo MC, Sperling R, Petersen RC, Posner HB, Snyder PJ, Hilsabeck R, Gallagher M, Raber J, Rizzo A, Possin K, King J, Kaye J, Ott BR, Albert MS, Wagster MV, Schinka JA, Cullum CM, … Ferman TJ (2011). Assessment of cognition in early dementia. Alzheimer’s & Dementia, 7(3), e60–e70. 10.1016/j.jalz.2011.05.001
- Stan Development Team. (2020a). RStan: The R interface to Stan (Version 2.21.2). https://mc-stan.org
- Stan Development Team. (2020b). Stan Modeling Language Users Guide and Reference Manual (Version 2.26). https://mc-stan.org
- Sternberg RJ, & Tulving E (1977). The measurement of subjective organization in free recall. Psychological Bulletin, 84(3), 539–556. 10.1037/0033-2909.84.3.539
- Thiruselvam I, & Hoelzle JB (2020). Refined measurement of verbal learning and memory: Application of item response theory to California Verbal Learning Test – Second Edition (CVLT-II) learning trials. Archives of Clinical Neuropsychology, 35(1), 90–104. 10.1093/arclin/acy097
- Tuerlinckx F, & De Boeck P (2004). Models for residual dependencies. In De Boeck P & Wilson M (Eds.), Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. Springer.
- Weir DR, Langa KM, & Ryan LH (2016). Harmonized Cognitive Assessment Protocol (HCAP): Study Protocol Summary. https://hrs.isr.umich.edu/sites/default/files/biblio/HRS%202016%20HCAP%20Protocol%20Summary_011619_rev.pdf
- Weissberger GH, Strong JV, Stefanidis KB, Summers MJ, Bondi MW, & Stricker NH (2017). Diagnostic accuracy of memory measures in Alzheimer’s dementia and mild cognitive impairment: A systematic review and meta-analysis. Neuropsychology Review, 27(4), 354–388. 10.1007/s11065-017-9360-6
- Weitzner DS, & Calamia M (2020). Serial position effects on list learning tasks in mild cognitive impairment and Alzheimer’s disease. Neuropsychology, 34(4), 467–478. 10.1037/neu0000620
- Williams DR, Zimprich DR, & Rast PA (2019). A Bayesian nonlinear mixed-effects location scale model for learning. Behavior Research Methods, 51(5), 1968–1986. 10.3758/s13428-019-01255-9
- Yao Y, Vehtari A, Simpson D, & Gelman A (2018). Using stacking to average Bayesian predictive distributions. Bayesian Analysis, 13(3), 917–1007. 10.1214/17-BA1091