Test–Retest Reliability and Reliable Change on the NIH Toolbox Cognition Battery

Justin E Karr; Eric O Ingram; Cristina N Pinheiro; Sheliza Ali; Grant L Iverson

doi:10.1093/arclin/acae011

2024 Feb 23;39(6):702–713. doi: 10.1093/arclin/acae011

PMCID: PMC11345114 PMID: 38402512

Abstract

Objective

Researchers and practitioners can detect cognitive improvement or decline within a single examinee by applying a reliable change methodology. This study examined reliable change through test–retest data from the English-language National Institutes of Health Toolbox Cognition Battery (NIHTB-CB) normative sample.

Method

Participants included adults (n = 138; age: M ± SD = 54.8 ± 20.0, range: 18–85; 51.4% men; 68.1% White) who completed test–retest assessments about a week apart on five fluid cognition tests, providing raw scores, age-adjusted standard scores (SS), and demographic-adjusted T-scores (T).

Results

The Fluid Cognition Composite had good test–retest reliability (SS: ICC = 0.87; T-score: ICC = 0.84), and the five fluid cognition tests had moderate to good test–retest reliability (SS: ICC range = 0.66–0.85; T-score: ICC range = 0.64–0.86). The lower and upper bounds of 70%, 80%, and 90% confidence intervals (CIs) were calculated around change scores; these bounds serve as cutoffs for determining reliable change. Using T-scores, a 90% CI, and an adjustment for practice effects, 32.3% declined on one or more tests, 9.7% declined on two or more tests, 36.6% improved on one or more tests, and 5.4% improved on two or more tests.

Conclusions

It was common for participants to show reliable change on at least one test score, but not two or more test scores. Per an 80% CI, test–retest difference scores beyond these cutoffs would indicate reliable change: Dimensional Change Card Sort (SS ≥ 14/T ≥ 10), Flanker (SS ≥ 12/T ≥ 8), List Sorting (SS ≥ 14/T ≥ 10), Picture Sequence Memory (SS ≥ 19/T ≥ 13), Pattern Comparison (SS ≥ 11/T ≥ 8), and Fluid Cognition Composite (SS ≥ 10/T ≥ 7). The reliable change cutoffs could be applied in research or practice to detect within-person change in fluid cognition at the individual level.

Keywords: Cognition, Neuropsychological tests, Psychometrics

INTRODUCTION

The National Institutes of Health Toolbox Cognition Battery (NIHTB-CB) is a tablet-assisted battery of crystallized and fluid cognitive tests designed for brief, repeated administration, with strong psychometric properties (Carlozzi et al., 2014; Dikmen et al., 2014; Heaton et al., 2014; Tulsky et al., 2014; Zelazo et al., 2014) and prior applications in clinical research settings (Fox et al., 2022). The NIHTB-CB provides norm-referenced scores based on the U.S. population (Beaumont et al., 2013), along with demographic-adjusted norms that consider age, sex, race/ethnicity, and education in their calculation (Casaletto et al., 2015). Because the battery was designed for repeated assessment, clinicians and researchers would benefit from methods for detecting reliable cognitive change with it. Cutoffs for determining reliable change have been developed for the Spanish-language battery (Karr et al., 2021), but not the English-language battery. The NIHTB-CB includes measures of both crystallized and fluid cognition (Mungas et al., 2014), with fluid cognition scores more sensitive to change due to brain injury (Tulsky et al., 2017), a neurodegenerative process (Snitz et al., 2020), or intervention (Parsey et al., 2021).

Neuropsychologists commonly engage in reevaluations of cognitive functioning, which allows them to monitor decline in the context of a neurodegenerative condition (e.g., dementia) or improvement in the context of recovery or rehabilitation (e.g., following an acquired or traumatic brain injury; Duff, 2012). As such, neuropsychologists would benefit from familiarity with approaches to interpreting reliable change across serial assessments (Heilbronner et al., 2010), but clinicians and researchers often lack criteria for determining if true change is present within an individual (Rabin et al., 2016). Another concern pertains to the normal frequency of change at retest, with researchers finding that, even among healthy children and adults, it is common to have decline or improvement on one or more cognitive tests (Brooks et al., 2016, 2017). The current study evaluated reliable change on NIHTB-CB fluid tests in the test–retest sample from the English-language normative data, aiming to (a) provide estimates of test–retest reliability, (b) develop cutoffs for detecting reliable change in research and practice, and (c) prepare multivariate base rates of reliable change, determining how commonly examinees present with change on one or more NIHTB-CB tests.

METHOD

Participants

Participants were initially recruited through the NIHTB-CB norming study, which involved planned recruitment of community-dwelling adults from the U.S. population (Beaumont et al., 2013). This recruitment effort ultimately resulted in the participation of 1,038 adults on whom the normative score calculations are based (Casaletto et al., 2015). A subsample of participants from the normative sample (n = 138) completed a retest assessment about a week after their initial evaluation (M = 8.7 days, SD = 2.7). These participants did not undergo any intervention within the test–retest interval. The normative dataset is in the public domain and available online for download (Gershon, 2016). The participants had a mean age of 54.8 years (SD = 20.0, range: 18–85), with roughly equal representation of men (n = 71, 51.4%) and women (n = 67, 48.6%). The sample had a mean education of 13.6 years (SD = 2.6, range: 8–20) and was majority White (n = 94, 68.1%). Full educational and racial/ethnic breakdown of the sample is provided in Table 1.

Table 1.

Sample demographics

Demographic characteristic	Descriptive statistic
Age (years), M (SD), range	54.8 (20.0), 18–85
Gender, n (%)	—
Men	71 (51.4)
Women	67 (48.6)
Education (years), M (SD), range	13.6 (2.6), 8–20
Education (years), n (%)	—
<12 years	19 (13.7)
12 years	48 (34.8)
13–15 years	25 (18.1)
≥16 years	46 (33.3)
Race/ethnicity, n (%)	—
White	94 (68.1)
Black or African American	14 (10.1)
Hispanic	17 (12.3)
Asian/Pacific Islander	4 (2.9)
Native American	2 (1.4)
Multiracial	2 (1.4)
Not provided	5 (3.6)

Notes: N = 138.

Materials

The NIHTB-CB includes five tests measuring fluid abilities: the List Sorting test, measuring working memory (Tulsky et al., 2014); the Picture Sequence Memory test, measuring visual episodic memory (Dikmen et al., 2014); the Pattern Comparison test, measuring processing speed (Carlozzi et al., 2014); and the Flanker and Dimensional Change Card Sort tests, measuring executive functions (Zelazo et al., 2014). The average of the norm-referenced scores from these tests is used to calculate a Fluid Cognition Composite score (Heaton et al., 2014). The raw scores for each of these tests can be converted to age-adjusted standard scores (M = 100, SD = 15) or demographic-adjusted T-scores (M = 50, SD = 10), which consider age, gender, race/ethnicity, and education in their calculation (Casaletto et al., 2015).

Statistical Analyses

Test–retest reliability was estimated with Pearson’s r correlation and the intraclass correlation coefficient (ICC), using an absolute agreement definition for average scores based on a two-way mixed effects model (Field, 2005; Koo & Li, 2016). Interpretation of test–retest reliability was based on the ICC: poor (<0.50), moderate (0.50 to 0.75), good (0.75 to 0.90), and excellent (>0.90) (Koo & Li, 2016). The reliable change methodology involved a modification of prior approaches to calculating reliable change indices (Iverson, 2001; Jacobson & Truax, 1991). The standard error of measurement (SEM) was calculated at test ($\mathrm{SEM}_1 = SD_1\sqrt{1 - r_{12}}$) and retest ($\mathrm{SEM}_2 = SD_2\sqrt{1 - r_{12}}$), where $r_{12}$ is the test–retest reliability coefficient, and the two values were used to calculate $SE_{\mathrm{diff}} = \sqrt{\mathrm{SEM}_1^2 + \mathrm{SEM}_2^2}$. The SE_diff is multiplied by a z-score (i.e., ±1.04 for 70% CI, ±1.28 for 80% CI, and ±1.64 for 90% CI) to calculate a confidence interval (CI) in the units of the test. For the NIHTB-CB, these units will be based on the raw scores, age-adjusted standard scores, or demographic-adjusted T-scores.
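The calculation of the SEM, the SE_diff, and the CI half-width can be illustrated with a minimal Python sketch. It assumes the conventional formulas SEM = SD × √(1 − reliability) and SE_diff = √(SEM₁² + SEM₂²); the function names are hypothetical. It also assumes the ICC served as the reliability estimate, which is an inference from the SEM values in Table 2 rather than something stated in the text. The example uses the Dimensional Change Card Sort age-adjusted standard scores from Table 2.

```python
import math

# z-scores quoted in the text for the 70%, 80%, and 90% confidence intervals.
Z_BY_CI = {70: 1.04, 80: 1.28, 90: 1.64}


def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def se_diff(sem_1: float, sem_2: float) -> float:
    """Standard error of the difference from the baseline and retest SEMs."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)


def reliable_change_cutoff(se_difference: float, ci: int) -> float:
    """Half-width of the CI around a change score, in the units of the test."""
    return Z_BY_CI[ci] * se_difference


# DCCS age-adjusted standard scores (Table 2): SD1 = 15.1, SD2 = 16.5.
# Assumption: the reliability estimate is the ICC (0.78), inferred from Table 2.
sem_1 = sem(15.1, 0.78)
sem_2 = sem(16.5, 0.78)
sed = se_diff(sem_1, sem_2)
cutoffs = {ci: reliable_change_cutoff(sed, ci) for ci in Z_BY_CI}

print(f"SEM1 = {sem_1:.1f}, SEM2 = {sem_2:.1f}, SE_diff = {sed:.2f}")
for ci, value in cutoffs.items():
    print(f"{ci}% CI cutoff: +/-{value:.2f}")
```

Because the inputs here are the rounded values printed in Table 2, the results can differ from the tabled SE_diff and Table 3 cutoffs in the second decimal place.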

The average practice effect can be calculated from the test–retest data by subtracting the mean test score at baseline from the mean test score at retest (i.e., M₂ – M₁). This practice effect can be added to the lower and upper bounds of the CI as a method to adjust for practice effects, if desired by the clinician or researcher interpreting change (Chelune et al., 1993). The calculated values serve as cutoffs that indicate reliable change, either improvement if an increase in performance exceeds the positive cutoff, decline if a reduction in performance exceeds the negative cutoff, or stability if change does not exceed a cutoff and falls within the CI. These cutoffs were prepared for the five NIHTB-CB tests of fluid cognition and the Fluid Cognition Composite score.
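The practice-effect adjustment and the three-way classification described above can be sketched in Python as follows. The function names are hypothetical, and the strict inequalities reflect the wording "exceeds the cutoff"; a user who prefers "at or beyond the cutoff" would change the comparisons accordingly. The example uses the Picture Sequence Memory age-adjusted standard scores (Table 2: M_diff = 8.4, SE_diff = 14.39) with a 90% CI.

```python
from typing import Tuple


def adjusted_bounds(mean_practice_effect: float,
                    se_difference: float,
                    z: float) -> Tuple[float, float]:
    """Shift the symmetric CI by the mean practice effect (M2 - M1)."""
    half_width = z * se_difference
    return (mean_practice_effect - half_width,
            mean_practice_effect + half_width)


def classify_change(change: float, lower: float, upper: float) -> str:
    """Label a retest-minus-baseline change score against the cutoffs."""
    if change > upper:
        return "improvement"
    if change < lower:
        return "decline"
    return "stable"


# Picture Sequence Memory, age-adjusted standard scores, 90% CI (z = 1.64).
lower, upper = adjusted_bounds(8.4, 14.39, 1.64)
print(f"Adjusted bounds: {lower:.2f} to {upper:.2f}")

# Hypothetical change scores for three examinees.
for change in (-16.0, 20.0, 35.0):
    print(change, classify_change(change, lower, upper))
```

Setting `mean_practice_effect` to zero recovers the unadjusted, symmetric cutoffs.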

These CIs can be interpreted as encompassing 70%, 80%, or 90% of change scores within a normal distribution, meaning that a percentage of change scores fall beyond the upper and lower bound of the CI. For the 70% CI, approximately 15% would fall below the lower bound threshold and 15% would fall above the upper bound threshold. For example, a score above the upper bound of the 70% CI would indicate greater improvement than 85% of the distribution of change scores, and a score below the lower bound of the 70% CI would indicate greater decline than 85% of the distribution of change scores. As such, scores at or above the cutoffs for the 80% and 90% CIs would indicate greater improvement or decline than 90% and 95% of the distribution of change scores, respectively.
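The tail percentages quoted above follow from the standard normal distribution and can be checked with a short Python sketch. It assumes change scores are normally distributed, as the paragraph does; the function name is hypothetical.

```python
from statistics import NormalDist


def percent_beyond_each_bound(z: float) -> float:
    """Percentage of a normal distribution falling beyond one bound at +/- z."""
    return 100.0 * NormalDist().cdf(-abs(z))


for ci, z in ((70, 1.04), (80, 1.28), (90, 1.64)):
    tail = percent_beyond_each_bound(z)
    print(f"{ci}% CI (z = {z}): about {tail:.1f}% beyond each bound, "
          f"so a score past a bound exceeds about {100 - tail:.1f}% of change scores")
```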

Additional analyses examined the frequencies at which participants presented with change from test to retest on one or more of the five fluid cognition test scores. These analyses were conducted defining reliable change based on cutoffs using 80% and 90% CIs with and without an adjustment for practice effects. The analyses only used cases with complete data for all five fluid cognition tests and were conducted separately for age-adjusted standard scores (n = 100) and demographic-adjusted T-scores (n = 93). The frequencies at which participants declined, improved, or changed in either direction (i.e., improved or declined) on one or more tests were compared across demographic stratifications of gender (i.e., men vs. women), education (i.e., ≤12 vs. >12 years), and race/ethnicity (i.e., White vs. other identities) using χ² tests. Power analysis indicated that a sample of 93 had adequate power (1-β = 0.80) to detect a medium effect size (w = 0.29) at p < .05 (Erdfelder et al., 2009).
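The multivariate base-rate analysis can be sketched in Python as a count of reliably changed scores per participant, followed by cumulative percentages ("1 or more," "2 or more," and so on). The function names and the four participants' change scores are hypothetical and purely illustrative. The bounds are the practice-adjusted 90% CI cutoffs for demographic-adjusted T-scores from Table 4, and only complete cases are assumed, as in the analyses described above.

```python
from typing import Dict, List, Sequence, Tuple

Bounds = Tuple[float, float]  # (lower, upper) cutoff for one test


def count_reliable_changes(changes: Sequence[float],
                           bounds: Sequence[Bounds]) -> Tuple[int, int]:
    """Return (number of declines, number of improvements) across tests."""
    declines = sum(1 for c, (lo, hi) in zip(changes, bounds) if c < lo)
    improvements = sum(1 for c, (lo, hi) in zip(changes, bounds) if c > hi)
    return declines, improvements


def cumulative_percentages(counts: Sequence[int],
                           n_tests: int = 5) -> Dict[int, float]:
    """Percentage of participants with k or more changed scores, k = 1..n_tests."""
    n = len(counts)
    return {k: 100.0 * sum(1 for c in counts if c >= k) / n
            for k in range(1, n_tests + 1)}


# Table 4, T-scores, 90% CI, practice-adjusted; order: DCCS, Flanker,
# List Sorting, Picture Sequence Memory, Pattern Comparison.
T_BOUNDS_90: List[Bounds] = [(-9.50, 14.48), (-7.78, 12.84), (-10.90, 13.58),
                             (-10.83, 22.01), (-6.45, 14.24)]

# Hypothetical change scores (retest minus baseline) for four participants.
hypothetical_changes = [
    [0.0, 1.0, -2.0, 5.0, 3.0],
    [-10.0, 2.0, 0.0, 4.0, 1.0],
    [15.0, 13.0, 0.0, 6.0, 2.0],
    [-12.0, -8.0, 14.0, 0.0, 0.0],
]

per_person = [count_reliable_changes(row, T_BOUNDS_90)
              for row in hypothetical_changes]
decline_rates = cumulative_percentages([d for d, _ in per_person])
improve_rates = cumulative_percentages([i for _, i in per_person])
either_rates = cumulative_percentages([d + i for d, i in per_person])
print(decline_rates, improve_rates, either_rates, sep="\n")
```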

RESULTS

Descriptive statistics for test performances at baseline and retest, SEM values, SE_diff values, test–retest reliability estimates, and mean practice effects are provided in Table 2. For individual tests, test–retest reliability was good to excellent for raw scores (ICC range: 0.80–0.92) and moderate to good for age-adjusted standard scores (ICC range: 0.66–0.85) and demographic-adjusted T-scores (ICC range: 0.64–0.86). The Fluid Cognition Composite had good test–retest reliability for both the age-adjusted standard score (ICC = 0.87) and the demographic-adjusted T-score (ICC = 0.84). Practice effects on individual tests were small to medium for raw scores (d range: 0.09–0.52), age-adjusted standard scores (d range: 0.11–0.51), and demographic-adjusted T-scores (d range: 0.14–0.51). The lowest practice effect was observed for List Sorting (raw: d = 0.09; standard score: d = 0.11; T-score: d = 0.14) and the largest practice effects for an individual test were for Picture Sequence Memory (raw: d = 0.52; standard score: d = 0.51) and Pattern Comparison (T-score: d = 0.51). The Fluid Cognition Composite had the largest practice effect for both the age-adjusted standard scores (d = 0.75) and the demographic-adjusted T-scores (d = 0.73).

Table 2.

Descriptive statistics and test–retest correlations for the National Institutes of Health Toolbox Cognition Battery: Raw scores, age-adjusted standard scores, and demographic-adjusted T-scores

	Baseline				Retest
	n	M	SD	SEM₁	M	SD	SEM₂	r [95% CI]	ICC [95% CI]	M _diff	SE _diff	d [95% CI]
Raw scores	—	—	—	—	—	—	—	—	—	—	—	—
DCCS	137	7.7	1.3	0.5	8.0	1.1	0.4	0.74 [0.66, 0.81]	0.83 [0.75, 0.88]	0.3	0.68	0.31 [0.14, 0.48]
Flanker	137	8.4	0.9	0.3	8.6	0.9	0.3	0.79 [0.72, 0.85]	0.88 [0.82, 0.91]	0.2	0.43	0.27 [0.10, 0.44]
List Sorting	138	16.3	3.8	1.6	16.6	3.8	1.6	0.69 [0.59, 0.77]	0.82 [0.75, 0.87]	0.3	2.29	0.09 [−0.08, 0.26]
PSM	102	473.9	104.8	46.9	517.7	120.2	53.8	0.72 [0.61, 0.80]	0.80 [0.62, 0.88]	43.9	71.33	0.52 [0.31, 0.72]
Pattern Comparison	137	54.5	11.4	3.3	57.6	13.0	3.7	0.88 [0.84, 0.91]	0.92 [0.84, 0.95]	3.1	4.98	0.50 [0.33, 0.68]
Age-adjusted SS	—	—	—	—	—	—	—	—	—	—	—	—
DCCS	137	101.8	15.1	7.1	105.4	16.5	7.7	0.66 [0.55, 0.74]	0.78 [0.69, 0.85]	3.6	10.47	0.27 [0.10, 0.44]
Flanker	137	101.7	15.2	6.2	105.3	15.7	6.5	0.73 [0.64, 0.80]	0.83 [0.75, 0.89]	3.5	8.96	0.31 [0.14, 0.48]
List Sorting	138	100.6	16.1	7.7	102.2	16.6	8.0	0.63 [0.51, 0.72]	0.77 [0.68, 0.84]	1.6	11.10	0.11 [−0.05, 0.28]
PSM	102	98.9	16.3	9.4	107.3	18.8	10.9	0.56 [0.41, 0.68]	0.66 [0.43, 0.79]	8.4	14.39	0.51 [0.30, 0.71]
Pattern Comparison	137	102.1	15.3	6.0	107.4	16.5	6.4	0.78 [0.70, 0.84]	0.85 [0.72, 0.91]	5.3	8.78	0.50 [0.30, 0.67]
Fluid Cognition Composite	100	102.8	14.5	5.3	109.5	16.5	6.0	0.84 [0.77, 0.89]	0.87 [0.58, 0.94]	6.7	8.04	0.75 [0.53, 0.98]
Demo. T-scores	—	—	—	—	—	—	—	—	—	—	—	—
DCCS	126	51.1	9.8	4.8	53.6	11.2	5.5	0.63 [0.51, 0.73]	0.76 [0.65, 0.83]	2.5	7.31	0.27 [0.10, 0.45]
Flanker	126	50.8	9.6	4.3	53.3	10.2	4.6	0.69 [0.58, 0.77]	0.80 [0.70, 0.86]	2.5	6.28	0.32 [0.14, 0.50]
List Sorting	127	50.2	10.4	5.2	51.5	10.7	5.4	0.60 [0.48, 0.70]	0.75 [0.65, 0.82]	1.3	7.46	0.14 [−0.03, 0.32]
PSM	95	49.9	10.9	6.6	55.5	12.5	7.6	0.52 [0.36, 0.65]	0.64 [0.39, 0.77]	5.6	10.01	0.49 [0.27, 0.70]
Pattern Comparison	126	45.4	11.5	4.3	49.3	12.1	4.6	0.79 [0.72, 0.85]	0.86 [0.73, 0.92]	3.9	6.31	0.51 [0.32, 0.69]
Fluid Cognition Composite	93	50.4	9.5	3.8	55.0	10.4	4.2	0.81 [0.72, 0.87]	0.84 [0.55, 0.93]	4.6	5.61	0.73 [0.50, 0.95]

Notes: CI = confidence interval; DCCS = Dimensional Change Card Sort; Demo. = Demographic-adjusted; ICC = intraclass correlation coefficient; M = mean; M_diff = mean difference score; PSM = Picture Sequence Memory; r₁₂ = test–retest correlation; SD = standard deviation; SE_diff = standard error of difference; SEM = standard error of measurement; SS = standard scores. Fluid cognition tests include the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, List Sorting Working Memory Test, Picture Sequence Memory Test, and Pattern Comparison Processing Speed Test.
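For readers who wish to reproduce the ICC estimates on their own paired data, the absolute-agreement, average-measures ICC described in the Statistical Analyses section can be sketched in plain Python. The sketch assumes the standard two-way ANOVA formulation, often written ICC(A,k), with k = 2 occasions; it is not the authors' code, and the function name and example scores are hypothetical.

```python
from typing import Sequence


def icc_absolute_average(test: Sequence[float], retest: Sequence[float]) -> float:
    """ICC(A,k): two-way model, absolute agreement, average of k = 2 occasions."""
    n, k = len(test), 2
    if n != len(retest) or n < 2:
        raise ValueError("test and retest need the same length (at least 2)")

    rows = list(zip(test, retest))
    grand_mean = (sum(test) + sum(retest)) / (n * k)
    row_means = [(a + b) / k for a, b in rows]          # per-participant means
    col_means = [sum(test) / n, sum(retest) / n]        # per-occasion means

    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = sum((x - row_means[i] - col_means[j] + grand_mean) ** 2
                   for i, row in enumerate(rows) for j, x in enumerate(row))

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement penalizes systematic shifts between occasions (ms_cols).
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Hypothetical scores: every participant gains exactly one point at retest.
baseline = [10.0, 12.0, 14.0, 16.0]
retest = [11.0, 13.0, 15.0, 17.0]
print(round(icc_absolute_average(baseline, retest), 3))
```

Because the definition is one of absolute agreement, a uniform practice effect lowers the ICC even when the rank order of participants is perfectly preserved.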

Critical values for determining reliable change without an adjustment for practice effects are provided in Table 3 based on 70%, 80%, and 90% CIs around change scores, and critical values for determining reliable change with an adjustment for practice effects are provided in Table 4. The practice effect adjustment involved adding the mean difference score from baseline to retest to the lower and upper bounds of each CI. These values serve as cutoffs for detecting reliable change, with a change score exceeding a cutoff indicating greater change than would be expected based on normal variability in test performance and measurement error. The cutoffs for detecting reliable change were smallest for the Fluid Cognition Composite and largest for Picture Sequence Memory. Using the 70% CI without adjusting for practice effects, a change of about half of a standard deviation (SD) indicated reliable change for the Fluid Cognition Composite (i.e., an absolute change of at least 8.36 SS points or 5.83 T-score points), whereas about an SD of change indicated reliable change for Picture Sequence Memory (i.e., an absolute change of at least 14.97 SS points or 10.41 T-score points). Adjustments for practice effects shifted both bounds upward, so that a smaller drop in performance was sufficient to detect decline and a larger gain was required to detect improvement. For some of the tests, practice effects were large enough to shift the cutoff values substantially. For example, using the 90% CI, an improvement of greater than 2 SDs is required to detect reliable improvement on Picture Sequence Memory (i.e., SS ≥ 32.04 or T ≥ 22.01), whereas a decline of approximately 1 SD is required to detect reliable decline on Picture Sequence Memory (i.e., SS ≤ −15.17 or T ≤ −10.83).
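The SD-unit comparisons in this paragraph can be verified by dividing each cutoff by the SD of its score metric (15 for standard scores, 10 for T-scores). The short Python sketch below does so for the values quoted above; the function name is hypothetical.

```python
# Population SDs of the two normed metrics, as given in the Materials section.
SCALE_SD = {"SS": 15.0, "T": 10.0}


def in_sd_units(cutoff: float, metric: str) -> float:
    """Express a change-score cutoff in SD units of its score metric."""
    return cutoff / SCALE_SD[metric]


examples = [
    ("Fluid Composite, 70% CI, unadjusted", 8.36, "SS"),
    ("Fluid Composite, 70% CI, unadjusted", 5.83, "T"),
    ("Picture Sequence Memory, 70% CI, unadjusted", 14.97, "SS"),
    ("Picture Sequence Memory, 90% CI, adjusted, improve", 32.04, "SS"),
    ("Picture Sequence Memory, 90% CI, adjusted, decline", -15.17, "SS"),
]
for label, cutoff, metric in examples:
    print(f"{label}: {cutoff} {metric} = {in_sd_units(cutoff, metric):.2f} SD")
```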

Table 3.

Reliable change confidence intervals for the National Institutes of Health Toolbox Cognition Battery (without adjustment for practice effects): Raw scores, age-adjusted standard scores, and demographic-adjusted T-scores

	70% CI	80% CI	90% CI
Raw scores
Dimensional Change Card Sort	0.71	0.88	1.12
Flanker	0.45	0.55	0.71
List Sorting	2.38	2.93	3.75
Picture Sequence Memory	74.18	91.30	116.98
Pattern Comparison	5.18	6.38	8.17
Age-adjusted standard scores
Dimensional Change Card Sort	10.89	13.40	17.17
Flanker	9.32	11.47	14.69
List Sorting	11.55	14.21	18.21
Picture Sequence Memory	14.97	18.42	23.61
Pattern Comparison	9.13	11.24	14.40
Fluid Cognition Composite	8.36	10.29	13.18
Demographic-adjusted T-scores
Dimensional Change Card Sort	7.60	9.36	11.99
Flanker	6.54	8.04	10.31
List Sorting	7.76	9.55	12.24
Picture Sequence Memory	10.41	12.81	16.42
Pattern Comparison	6.56	8.08	10.35
Fluid Cognition Composite	5.83	7.18	9.20

Notes: CI = confidence interval. Fluid cognition tests include the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, List Sorting Working Memory Test, Picture Sequence Memory Test, and Pattern Comparison Processing Speed Test. This table provides the CI around the mean test–retest difference score for each measure (test–retest interval: M = 8.4 days, SD = 2.4, range: 5–14), without an adjustment for test–retest practice effects. If a retest score changes by that amount or more (either positive or negative), that score is indicative of worsening or improvement.

Table 4.

Reliable change confidence intervals for the National Institutes of Health Toolbox Cognition Battery (with adjustment for practice effects): Raw scores, age-adjusted standard scores, and demographic-adjusted T-scores

	70% CI		80% CI		90% CI
	Decline	Improve	Decline	Improve	Decline	Improve
Raw scores
Dimensional Change Card Sort	−0.44	0.98	−0.60	1.15	−0.85	1.39
Flanker	−0.30	0.60	−0.40	0.71	−0.56	0.86
List Sorting	−2.11	2.65	−2.66	3.20	−3.48	4.02
Picture Sequence Memory	−30.30	118.06	−47.42	135.18	−73.10	160.86
Pattern Comparison	−2.07	8.29	−3.27	9.49	−5.06	11.28
Age-adjusted standard scores
Dimensional Change Card Sort	−7.31	14.46	−9.82	16.98	−13.59	20.75
Flanker	−5.77	12.86	−7.92	15.01	−11.14	18.24
List Sorting	−9.93	13.16	−12.59	15.83	−16.59	19.82
Picture Sequence Memory	−6.54	23.40	−9.99	26.86	−15.17	32.04
Pattern Comparison	−3.83	14.43	−5.94	16.54	−9.10	19.70
Fluid Cognition Composite	−1.62	15.10	−3.55	17.03	−6.44	19.92
Demographic-adjusted T-scores
Dimensional Change Card Sort	−5.11	10.10	−6.87	11.85	−9.50	14.48
Flanker	−4.00	9.07	−5.51	10.58	−7.78	12.84
List Sorting	−6.42	9.10	−8.21	10.89	−10.90	13.58
Picture Sequence Memory	−4.82	16.00	−7.22	18.40	−10.83	22.01
Pattern Comparison	−2.66	10.46	−4.18	11.97	−6.45	14.24
Fluid Cognition Composite	−1.27	10.39	−2.62	11.74	−4.64	13.75

Notes: CI = confidence interval. Fluid cognition tests include the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, List Sorting Working Memory Test, Picture Sequence Memory Test, and Pattern Comparison Processing Speed Test. This table provides the upper and lower bounds of the CI around the mean test–retest difference score for each measure (test–retest interval: M = 8.4 days, SD = 2.4, range: 5–14), with an adjustment for test–retest practice effects (i.e., the mean practice effect for each test was added to the lower and upper bounds of the CI). If a score is below the lower bound, that score is indicative of decline, and if a score is above the upper bound, that score is indicative of improvement.

Multivariate base rates of change scores are presented in Tables 5 and 6 using age-adjusted standard scores and Tables 7 and 8 using demographic-adjusted T-scores, with and without an adjustment for practice effects based on 80% and 90% CIs. Per the 90% CI without an adjustment for practice effects, a minority of the sample declined on one or more tests (i.e., SS = 16.0%; T = 16.1%) and about half of the sample improved on one or more tests (i.e., SS = 53.0%; T = 51.6%). With an adjustment for practice effects, about a third declined (i.e., SS = 33.0%; T = 32.3%) and about a third improved on one or more tests (i.e., SS = 33.0%; T = 36.6%). Fewer participants showed reliable change on multiple test scores. Per the 90% CI with adjustment for practice effects, very few participants declined on two or more tests (i.e., SS = 7.0%; T = 9.7%) and decline on three or more tests was extremely uncommon (i.e., SS = 1.0%; T = 2.2%).

Table 5.

Cumulative percentages of reliably different National Institutes of Health Toolbox Cognition Battery fluid test scores at retest: Age-adjusted standard scores, 80% confidence interval for change

		Gender		Education		Race/Ethnicity
	Total sample (n = 100)	Women (n = 42)	Men (n = 58)	≤12 (n = 51)	>12 (n = 49)	White (n = 66)	Other identities (n = 32)
Adjusted for practice effects
Reliably decline
No change	51.0	54.8	48.3	47.1	55.1	54.5	43.8
1 or more	49.0	45.2	51.7	52.9	44.9	45.5	56.2
2 or more	15.0	14.3	15.5	9.8	20.4	15.2	15.6
3 or more	4.0	2.4	5.2	2.0	6.1	0	12.5
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve
No change	48.0	40.5	53.4	54.9	40.8	50.0	43.8
1 or more	52.0	59.5	46.6	45.1	59.2	50.0	56.2
2 or more	17.0	19.0	15.5	11.8	22.4	18.2	12.5
3 or more	2.0	2.4	1.7	0	4.1	3.0	0
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve or decline
No change	25.0	21.4	27.6	29.4	20.4	30.3	15.6
1 or more	75.0	78.6	72.4	70.6	79.6	69.7	84.4
2 or more	48.0	50.0	46.6	43.1	53.1	45.5	53.1
3 or more	15.0	11.9	17.2	7.8	22.4	16.7	12.5
4 or more	1.0	2.4	0	0	2.0	0	3.1
5 scores	0	0	0	0	0	0	0
Unadjusted for practice effects
Reliably decline
No change	68.0	69.0	67.2	68.6	67.3	66.7	68.8
1 or more	32.0	31.0	32.8	31.4	32.7	33.3	31.2
2 or more	5.0	2.4	6.9	2.0	8.2	3.0	9.4
3 or more	1.0	0	1.7	2.0	0	0	3.1
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve
No change	28.0	19.0	34.5	31.4	24.5	25.8	31.2
1 or more	72.0	81.0	65.5	68.6	75.5	74.2	68.8
2 or more	26.0	28.6	24.1	19.6	32.7	27.3	21.9
3 or more	9.0	9.5	8.6	9.8	8.2	9.1	6.3
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve or decline
No change	17.0	11.9	20.7	19.6	14.3	16.7	15.6
1 or more	83.0	88.1	79.3	80.4	85.7	83.3	84.4
2 or more	45.0	47.6	43.1	35.3	55.1	45.5	43.8
3 or more	15.0	16.7	13.8	13.7	16.3	15.2	12.5
4 or more	2.0	0	3.4	3.9	0	3.0	0
5 scores	0	0	0	0	0	0	0

Notes: Values represent the cumulative percentage of the sample presenting with improvement or decline at retest, based on an 80% confidence interval with and without adjusting for practice effects. Two participants were missing on race/ethnicity. Fluid cognition tests include the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, List Sorting Working Memory Test, Picture Sequence Memory Test, and Pattern Comparison Processing Speed Test.

Table 6.

Cumulative percentages of reliably different National Institutes of Health Toolbox Cognition Battery fluid test scores at retest: Age-adjusted standard scores, 90% confidence interval for change

		Gender		Education		Race/Ethnicity
	Total sample (n = 100)	Women (n = 42)	Men (n = 58)	≤12 (n = 51)	>12 (n = 49)	White (n = 66)	Other identities (n = 32)
Adjusted for practice effects
Reliably decline
No change	67.0	69.0	65.5	66.7	67.3	66.7	68.8
1 or more	33.0	31.0	34.5	33.3	32.7	33.3	31.2
2 or more	7.0	4.8	8.6	5.9	8.2	4.5	12.5
3 or more	1.0	0	1.7	2.0	0	0	3.1
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve
No change	67.0	59.5	72.4	78.4	55.1	62.1	75.0
1 or more	33.0	40.5	27.6	21.6	44.9	37.9	25.0
2 or more	7.0	7.1	6.9	3.9	10.2	10.6	0
3 or more	0	0	0	0	0	0	0
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve or decline
No change	44.0	38.1	48.3	52.9	34.7	42.4	46.9
1 or more	56.0	61.9	51.7	47.1	65.3	57.6	53.1
2 or more	19.0	14.3	22.4	13.7	24.5	22.7	12.5
3 or more	6.0	7.1	5.2	5.9	6.1	6.1	6.3
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Unadjusted for practice effects
Reliably decline
No change	84.0	85.7	82.8	86.3	81.6	86.4	78.1
1 or more	16.0	14.3	17.2	13.7	18.4	13.6	21.9
2 or more	3.0	2.4	3.4	2.0	4.1	0	9.4
3 or more	1.0	0	1.7	2.0	0	0	3.1
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve
No change	47.0	40.5	51.7	54.9	38.8	50.0	40.6
1 or more	53.0	59.5	48.3	45.1	61.2	50.0	59.4
2 or more	16.0	16.7	15.5	9.8	22.4	16.7	12.5
3 or more	0	0	0	0	0	0	0
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve or decline
No change	36.0	35.7	36.2	45.1	26.5	40.9	25.0
1 or more	64.0	64.3	63.8	54.9	73.5	59.1	75.0
2 or more	23.0	26.2	20.7	15.7	30.6	21.2	25.0
3 or more	2.0	2.4	1.7	2.0	2.0	0	6.3
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0

Notes: Values represent the cumulative percentage of the sample presenting with improvement or decline at retest, based on a 90% confidence interval with and without adjusting for practice effects. Two participants were missing on race/ethnicity. Fluid cognition tests include the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, List Sorting Working Memory Test, Picture Sequence Memory Test, and Pattern Comparison Processing Speed Test.

Table 7.

Cumulative percentages of reliably different National Institutes of Health Toolbox Cognition Battery fluid test scores at retest: Demographic-adjusted T-scores, 80% confidence interval for change

		Gender		Education		Race/Ethnicity
	Total sample (n = 93)	Women (n = 40)	Men (n = 53)	≤12 (n = 46)	>12 (n = 47)	White (n = 66)	Other identities (n = 27)
Adjusted for practice effects
Reliably decline
No change	54.8	55.0	54.7	54.3	55.3	57.6	48.1
1 or more	45.2	45.0	45.3	45.7	44.7	42.4	51.9
2 or more	14.0	15.0	13.2	10.9	17.0	12.1	18.5
3 or more	4.3	2.5	5.7	2.2	6.4	0	14.8
4 or more	1.1	2.5	0	0	2.1	0	3.7
5 scores	0	0	0	0	0	0	0
Reliably improve
No change	48.4	40.0	54.7	56.5	40.4	48.5	48.1
1 or more	51.6	60.0	45.3	43.5	59.6	51.5	51.9
2 or more	9.7	7.5	11.3	10.9	8.5	13.6	0
3 or more	2.2	2.5	1.9	0	4.3	3.0	0
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve or decline
No change	28.0	22.5	32.1	32.6	23.4	30.3	22.2
1 or more	72.0	77.5	67.9	67.4	76.6	69.7	77.8
2 or more	40.9	40.0	41.5	37.0	44.7	40.9	40.7
3 or more	12.9	12.5	13.2	8.7	17.0	12.1	14.8
4 or more	1.1	2.5	0	0	2.1	0	3.7
5 scores	1.1	2.5	0	0	2.1	0	3.7
Unadjusted for practice effects
Reliably decline
No change	68.8	65.0	71.7	67.4	70.2	72.7	59.3
1 or more	31.2	35.0	28.3	32.6	29.8	27.3	40.7
2 or more	7.5	7.5	7.5	8.7	6.4	4.5	14.8
3 or more	0	0	0	0	0	0	0
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve
No change	29.0	15.0	39.6	28.3	29.8	30.3	25.9
1 or more	71.0	85.0	60.4	71.7	70.2	69.7	74.1
2 or more	24.7	27.5	22.6	21.7	27.7	27.3	18.5
3 or more	4.3	7.5	1.9	4.3	4.3	6.1	0
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve or decline
No change	20.4	7.5	30.2	21.7	19.1	22.7	14.8
1 or more	79.6	92.5	69.8	78.3	80.9	77.3	85.2
2 or more	43.0	47.5	39.6	43.5	42.6	40.9	48.1
3 or more	15.1	20.0	11.3	15.2	14.9	15.2	14.8
4 or more	1.1	2.5	0	2.2	0	1.5	0
5 scores	0	0	0	0	0	0	0

Notes: Values represent the cumulative percentage of the sample presenting with improvement or decline at retest, based on an 80% confidence interval with and without adjusting for practice effects. Fluid cognition tests include the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, List Sorting Working Memory Test, Picture Sequence Memory Test, and Pattern Comparison Processing Speed Test.

Table 8.

Cumulative percentages of reliably different National Institutes of Health Toolbox Cognition Battery fluid test scores at retest: Demographic-adjusted T-scores, 90% confidence interval for change

		Gender		Education		Race/Ethnicity
	Total sample (n = 93)	Women (n = 40)	Men (n = 53)	≤12 (n = 46)	>12 (n = 47)	White (n = 66)	Other identities (n = 27)
Adjusted for practice effects
Reliably decline
No change	67.7	65.0	69.8	69.6	66.0	68.2	66.7
1 or more	32.3	35.0	30.2	30.4	34.0	31.8	33.3
2 or more	9.7	7.5	11.3	8.7	10.6	7.6	14.8
3 or more	2.2	2.5	1.9	0	4.3	0	7.4
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve
No change	63.4	57.5	67.9	73.9	53.2	60.6	70.4
1 or more	36.6	42.5	32.1	26.1	46.8	39.4	29.6
2 or more	5.4	5.0	5.7	4.3	6.4	7.6	0
3 or more	0	0	0	0	0	0	0
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve or decline
No change	43.0	35.0	49.1	54.3	31.9	40.9	48.1
1 or more	57.0	65.0	50.9	45.7	68.1	59.1	51.9
2 or more	22.6	20.0	24.5	19.6	25.5	21.2	25.9
3 or more	6.5	7.5	5.7	4.3	8.5	6.1	7.4
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Unadjusted for practice effects
Reliably decline
No change	83.9	85.0	83.0	84.8	83.0	87.9	74.1
1 or more	16.1	15.0	17.0	15.2	17.0	12.1	25.9
2 or more	2.2	0	3.8	2.2	2.1	0	7.4
3 or more	0	0	0	0	0	0	0
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve
No change	48.4	40.0	54.7	54.3	42.6	48.5	48.1
1 or more	51.6	60.0	45.3	45.7	57.4	51.5	51.9
2 or more	11.8	12.5	11.3	10.9	12.8	15.2	3.7
3 or more	0	0	0	0	0	0	0
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0
Reliably improve or decline
No change	40.9	35.0	45.3	47.8	34.0	40.9	40.7
1 or more	59.1	65.0	54.7	52.2	66.0	59.1	59.3
2 or more	20.4	22.5	18.9	17.4	23.4	18.2	25.9
3 or more	2.2	0	3.8	4.3	0	1.5	3.7
4 or more	0	0	0	0	0	0	0
5 scores	0	0	0	0	0	0	0

Notes: Values represent the cumulative percentage of the sample presenting with improvement or decline at retest, based on a 90% confidence interval with and without adjusting for practice effects. Fluid cognition tests include the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, List Sorting Working Memory Test, Picture Sequence Memory Test, and Pattern Comparison Processing Speed Test.

When comparing frequencies of decline, improvement, or change in either direction based on demographic stratifications, there were no significant group differences based on an 80% CI for change, but there were some group differences by education based on a 90% CI for change. Results are summarized in Table 9. Using age-adjusted standard scores, there was a modest group difference in rates of change based on education, with more years of education associated with greater likelihood of improving on one or more test scores (χ² = 6.15, p = .013, OR = 2.96 [1.24, 7.09]). Using demographic-adjusted T-scores, this same group difference was present (χ² = 4.30, p = .038, OR = 2.49 [1.04, 5.97]), which contributed to a significant group difference in frequencies of change in either direction with more years of education (χ² = 4.77, p = .029, OR = 2.54 [1.09, 5.91]).

Table 9.

Comparisons of frequencies of change with adjustments for practice effects by demographic characteristics

	Gender			Education			Race/Ethnicity
	χ²	p	OR [95% CI]	χ²	p	OR [95% CI]	χ²	p	OR [95% CI]
80% CI
Age-adjusted standard scores
Reliably decline	0.41	.522	1.30 [0.59, 2.88]	0.65	.421	0.72 [0.33, 1.59]	1.01	.316	0.65 [0.28, 1.52]
Reliably improve	1.64	.200	0.59 [0.27, 1.32]	1.99	.159	1.77 [0.80, 3.90]	0.34	.561	0.78 [0.33, 1.82]
Reliably improve or decline	0.49	.483	0.72 [0.28, 1.82]	1.08	.299	1.63 [0.65, 4.08]	2.44	.118	0.43 [0.14, 1.27]
Demo-adjusted T-scores
Reliably decline	0.00	.978	1.01 [0.44, 2.31]	0.01	.925	0.96 [0.43, 2.18]	0.69	.407	0.68 [0.28, 1.68]
Reliably improve	1.98	.160	0.55 [0.24, 1.27]	2.41	.120	1.92 [0.84, 4.37]	0.00	.976	0.99 [0.40, 2.42]
Reliably improve or decline	1.04	.308	0.62 [0.24, 1.57]	0.98	.323	1.58 [0.64, 3.95]	0.62	.431	0.66 [0.23, 1.88]
90% CI
Age-adjusted standard scores
Reliably decline	0.14	.711	1.17 [0.50, 2.75]	0.01	.942	0.97 [0.42, 2.23]	0.04	.837	1.10 [0.45, 2.72]
Reliably improve	1.83	.176	0.56 [0.24, 1.30]	6.15	.013	2.96 [1.24, 7.09]	1.60	.206	1.83 [0.71, 4.69]
Reliably improve or decline	1.03	.311	0.66 [0.29, 1.48]	3.38	.066	2.12 [0.95, 4.74]	0.17	.677	1.20 [0.51, 2.80]
Demo-adjusted T-scores
Reliably decline	0.24	.623	0.80 [0.34, 1.93]	0.14	.710	1.18 [0.49, 2.82]	0.02	.887	0.93 [0.36, 2.42]
Reliably improve	1.07	.301	0.64 [0.27, 1.50]	4.30	.038	2.49 [1.04, 5.97]	0.79	.375	1.54 [0.59, 4.04]
Reliably improve or decline	1.84	.175	0.56 [0.24, 1.30]	4.77	.029	2.54 [1.09, 5.91]	0.41	.522	1.34 [0.55, 3.30]

Notes: CI = confidence interval; OR = odds ratio. n = 100 for age-adjusted standard score analyses; n = 93 for demographic-adjusted T-score analyses. Frequencies of change were based on 80% and 90% CIs with adjustments for practice effects. χ² tests are based on 2 × 2 designs, comparing the frequencies of participants with one or more reliably changed scores by gender (i.e., men vs. women), education (i.e., ≤12 vs. >12 years), and race/ethnicity (i.e., White vs. other identities). Fluid cognition tests include the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, List Sorting Working Memory Test, Picture Sequence Memory Test, and Pattern Comparison Processing Speed Test.
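As a rough illustration of how statistics like those in Table 9 are typically obtained from a 2 × 2 design, the sketch below computes a Pearson χ² (df = 1, no continuity correction), its p value, and an odds ratio with a Wald 95% CI. The cell counts are hypothetical because the article does not report them, and the uncorrected χ² and Wald interval are assumptions rather than the authors' documented procedure.

```python
import math


def two_by_two_stats(a, b, c, d):
    """Pearson chi-square (df = 1), p value, and odds ratio with Wald 95% CI.

    a, b = group 1 with / without a reliably changed score
    c, d = group 2 with / without a reliably changed score
    """
    n = a + b + c + d
    # Shortcut formula for the Pearson chi-square of a 2 x 2 table.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With df = 1, chi2 is a squared standard normal deviate,
    # so the upper-tail p value equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    # Wald interval: normal approximation on the log odds ratio.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = 1.959964  # standard normal quantile for a two-sided 95% interval
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return chi2, p, odds_ratio, (lower, upper)


# Hypothetical counts for illustration only (not data from the study).
chi2, p, odds_ratio, ci = two_by_two_stats(18, 22, 12, 28)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}, "
      f"OR = {odds_ratio:.2f} [{ci[0]:.2f}, {ci[1]:.2f}]")
```

An interval that spans 1.0, as in most cells of Table 9, corresponds to a nonsignificant group difference.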

DISCUSSION

This study examined test–retest data from the normative sample for the English-language NIHTB-CB, evaluating test–retest reliability, developing cutoffs for determining reliable change, and assessing the normal frequency of change from test to retest. The test–retest reliability of the raw scores ranged from good to excellent, and the test–retest reliability of the normed scores (i.e., both age- and demographic-adjusted) ranged from moderate to good. The cutoffs developed for determining reliable change are provided in Table 3 without an adjustment for practice effects and in Table 4 with an adjustment for practice effects. Because of the short test–retest interval for the sample (i.e., M = 8.4 days, SD = 2.4, range: 5–14), the observed practice effects would likely not apply to most domains of clinical practice or research, in which test–retest intervals typically last months to years. As such, Table 10 provides rounded values for determining reliable change without an adjustment for practice effects. This table serves as a quick reference guide for inferring reliable change. A change score equal to or greater than these values, in either direction, would indicate reliable improvement or decline at retest.

Table 10.

Quick cutoff values for determining reliable change

	70% CI	80% CI	90% CI
Age-adjusted standard scores
Dimensional Change Card Sort	11	14	17
Flanker Inhibitory Control and Attention Test	9	12	15
List Sorting Working Memory Test	12	14	18
Picture Sequence Memory Test	15	19	24
Pattern Comparison Processing Speed Test	9	11	15
Fluid Cognition Composite	9	10	13
Demographic-adjusted T-scores
Dimensional Change Card Sort	8	10	12
Flanker Inhibitory Control and Attention Test	7	8	10
List Sorting Working Memory Test	8	10	12
Picture Sequence Memory Test	11	13	17
Pattern Comparison Processing Speed Test	7	8	11
Fluid Cognition Composite	6	7	9

Notes: These values indicate cutoffs for determining reliable change on NIH Toolbox Cognition Battery fluid cognition tests and Fluid Cognition Composite score. Values were drawn from Table 3 and rounded. Decimal places of 0.33 and lower were rounded down to the nearest whole number, whereas all other values were rounded up to the nearest whole number.
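The rounding rule and the use of the quick cutoffs can be sketched as follows. The demographic-adjusted T-score values are copied from Table 10; the function names, the dictionary layout, and the small floating-point tolerance are illustrative assumptions, and the unrounded inputs in the usage lines are hypothetical rather than values from Table 3.

```python
import math

# Demographic-adjusted T-score cutoffs from Table 10, keyed by CI level (%).
T_SCORE_CUTOFFS = {
    "Dimensional Change Card Sort": {70: 8, 80: 10, 90: 12},
    "Flanker Inhibitory Control and Attention Test": {70: 7, 80: 8, 90: 10},
    "List Sorting Working Memory Test": {70: 8, 80: 10, 90: 12},
    "Picture Sequence Memory Test": {70: 11, 80: 13, 90: 17},
    "Pattern Comparison Processing Speed Test": {70: 7, 80: 8, 90: 11},
    "Fluid Cognition Composite": {70: 6, 80: 7, 90: 9},
}


def round_cutoff(value):
    """Round down when the decimal part is 0.33 or lower, otherwise round up."""
    whole = math.floor(value)
    fraction = value - whole
    # Small tolerance so that inputs such as 9.33 are not pushed over 0.33
    # by floating-point representation error (an implementation assumption).
    return whole if fraction <= 0.33 + 1e-9 else whole + 1


def classify_change(test, baseline_t, retest_t, ci_level=80):
    """Label a retest difference using the Table 10 T-score cutoffs."""
    cutoff = T_SCORE_CUTOFFS[test][ci_level]
    change = retest_t - baseline_t
    if change >= cutoff:
        return "reliable improvement"
    if change <= -cutoff:
        return "reliable decline"
    return "no reliable change"


# Hypothetical unrounded values, used only to show the rounding rule.
print(round_cutoff(9.33), round_cutoff(9.34))
# A 6-point drop on the composite, judged against two CI levels.
print(classify_change("Fluid Cognition Composite", 40, 34, ci_level=70))
print(classify_change("Fluid Cognition Composite", 40, 34, ci_level=80))
```

The same 6-point drop counts as reliable decline under the 70% CI but not under the 80% CI, which is why the choice of interval matters in practice.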

Neuropsychologists can choose among 70%, 80%, and 90% CIs when selecting a cutoff for defining reliable change. The selection of a cutoff likely depends on the clinical setting and the degree of confidence desired when determining whether change has occurred. A neuropsychologist may have interest in seeing if fluid cognition improved through intervention following a traumatic brain injury. A cutoff using the 70% CI would be the most lenient for determining if change has occurred and may be chosen because the risk of saying change did not occur (i.e., patient feeling demoralized and reducing engagement in rehabilitation) may exceed the risk of saying change has occurred. A test–retest improvement exceeding the 70% cutoff would indicate improvement on that test or composite score beyond 85% of the test–retest sample. In contrast, a neuropsychologist may desire greater confidence when evaluating a patient for cognitive decline, because the consequences of inferring the presence of change (i.e., a potential neurodegenerative condition) may be greater than the consequences of inferring stability (i.e., watchful waiting for more substantial change upon future reassessment). As such, a neuropsychologist may choose cutoffs based on the 90% CI. A change score exceeding this cutoff would indicate greater decline than 95% of the test–retest sample.
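The correspondence between a two-sided CI and the percentages quoted above (70% CI and 85%, 90% CI and 95%) follows from splitting the excluded area equally between the two tails. A minimal sketch, assuming normally distributed difference scores; the function names are illustrative, and the z multipliers are standard normal quantiles rather than values quoted from the article:

```python
from statistics import NormalDist


def tail_percentile(ci_level):
    """Percentage of the sample falling below the upper bound of a two-sided CI."""
    # Half of the area outside the interval lies in each tail.
    return 100 - (100 - ci_level) / 2


def z_multiplier(ci_level):
    """Standard normal quantile that bounds a two-sided CI of the given level."""
    return NormalDist().inv_cdf(tail_percentile(ci_level) / 100)


for level in (70, 80, 90):
    print(level, tail_percentile(level), round(z_multiplier(level), 3))
```

Under this assumption, an improvement beyond the upper bound of the 70% CI exceeds the change shown by 85% of the sample, and a decline beyond the lower bound of the 90% CI exceeds the decline shown by 95% of the sample.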

The cutoffs allow for a statistical inference as to whether change is potentially attributable to normal variability and measurement error versus true change in the measured construct, but such change may or may not correspond to any meaningful difference in an individual’s everyday life. For example, a T-score change of 6 on the Fluid Cognition Composite would meet the 70% CI cutoff but reflects a difference of only 0.6 SD. For a patient with a baseline fluid cognition score of T = 40 (16th percentile), a drop to T = 34 (5th percentile) at a 1-year follow-up assessment may reflect a reduction in fluid cognition that corresponds with problems in everyday functioning (e.g., financial or medication management). In contrast, a reduction from a baseline performance of T = 53 (62nd percentile) to T = 47 (38th percentile) would also indicate reliable change, but both the baseline and follow-up performances fall within the average range. This change may not meaningfully relate to the everyday functioning of the examinee but may be worth monitoring in future assessments.
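The percentile ranks in this example follow from the T-score metric (M = 50, SD = 10) under a normal distribution. A minimal sketch of the conversion, with illustrative function names:

```python
from statistics import NormalDist

# T-scores are scaled to a mean of 50 and a standard deviation of 10.
T_DIST = NormalDist(mu=50, sigma=10)


def t_to_percentile(t_score):
    """Percentile rank of a T-score, rounded to the nearest whole number."""
    return round(T_DIST.cdf(t_score) * 100)


def change_in_sd_units(baseline_t, retest_t):
    """Express a T-score difference in standard deviation units."""
    return (retest_t - baseline_t) / T_DIST.stdev


for t in (40, 34, 53, 47):
    print(t, t_to_percentile(t))
print(change_in_sd_units(53, 47))
```

Both pairs of scores differ by the same 0.6 SD, yet they occupy very different parts of the distribution, which is the basis for interpreting the two cases differently.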

Few studies have assessed the normal frequency of change across a battery of tests (Brooks et al., 2016, 2017). A prior study examined multivariate base rates of reliable change based on the test–retest data from two memory test batteries: the Neuropsychological Assessment Battery (NAB) memory tests and the Wechsler Memory Scale, Fourth Edition (WMS-IV) (Brooks et al., 2016). The base rates of participants with one or more reliably changed scores on the NAB memory tests (declined: 31.6%; improved: 33.7%), WMS-IV adult battery subtests (declined: 33.6%; improved: 39.1%), and WMS-IV older adult battery subtests (declined: 13.4%; improved: 43.3%) aligned with the base rates using the NIHTB-CB fluid cognition tests (i.e., per 90% CI for age-adjusted scores with practice effect adjustment, in alignment with the methodology of Brooks et al., 2016, 2017), which showed similar rates of declined scores (i.e., 33.0%) and improved scores (i.e., 33.0%). A prior study also examined reliable change base rates in children and adolescents using the Child and Adolescent Memory Profile (ChAMP) (declined: 42.9%; improved: 40.8%) and the Children’s Memory Scale (CMS) (declined: 40.2%; improved: 48.4%), finding higher rates of decline and improvement (Brooks et al., 2017).

This study had limitations. The analyses were based on a small sample, although the sample size is consistent with most test–retest sample sizes for common neuropsychological tests (Delis et al., 2001; Wechsler, 2008; Wechsler et al., 2009). Moreover, the retest sample was collected during the original standardization and norming of the NIHTB-CB, on laptops, whereas the two current versions of the battery are administered on iPads. The test–retest interval was only about 1–2 weeks, which is far briefer than typical intervals used in research and practice. The reliable change calculation used herein does not involve standardized regression-based approaches, which have become more commonly applied in prior research (Cysique et al., 2011; Duff et al., 2010; Rinehardt et al., 2010). These approaches can adjust for practice effects and demographic characteristics in their calculation; however, they are more burdensome to calculate and yield results comparable to those of more traditional approaches (Duff, 2012). This study also did not involve any external validity measure of clinical outcome, such as a measure of functional ability, or involve clinical populations, which limits the applicability of the findings in clinical settings. Future research would benefit from gauging whether improvement on the NIHTB-CB corresponds with any meaningful change in a clinical condition or functioning in everyday life. Although limitations exist, these results provide some empirical guidance for assessing change using an increasingly common cognitive battery in research and practice (Fox et al., 2022).

The results of the current study are most applicable to past studies that have used the original version of the NIHTB-CB, which was administered via desktop and used to collect the normative data that were used for the iPad-assisted NIHTB Version 2. A scoping review identified 225 studies that have used the NIHTB-CB or components of it through June of 2020 (Fox et al., 2022). In the future, as researchers conduct secondary analyses of data collected with the NIHTB-CB, such as the Human Connectome Project (Van Essen et al., 2012) and many studies focused on clinical conditions (Fox et al., 2022), having reliable change estimates for the individual tests and for the five fluid tests in combination could be useful. There is a need for researchers to engage in test–retest research with the NIHTB-CB Version 2, which involves iPad-assisted administration. At minimum, test–retest intervals of 3 months and 1 year would be of value, because these intervals have more practical relevance for treatment studies and long-term outcome studies. Version 2 of the NIHTB-CB is scheduled for decommissioning in 2025, as Version 3, with updated normative data, became available in 2023. However, having higher quality reliable change data will be useful to future researchers conducting secondary analyses on data collected with the iPad-assisted NIHTB-CB Version 2.

Neuropsychologists report difficulty with detecting change over repeated test administrations as a top-four challenge in practice (Rabin et al., 2016). The NIHTB-CB is a widely used test battery with growing use in clinical research (Fox et al., 2022), which may lead to its broader translation into clinical practice. Version 3 of the NIHTB-CB includes not only the original seven tests but also additional tests of learning and memory (e.g., a face–name association test and a modified Rey Auditory Verbal Learning Test), a traditional test of attention and processing speed (Oral Symbol Digits Test), and a measure of visual reasoning. This expanded battery, including more cognitive domains, can be administered in approximately 1 hr. It has tremendous potential for use in some clinical settings. Reliable change research is needed for children, adults, and older adults at 3-month, 1-year, and 2-year intervals for this battery. Analyses are needed for individual tests, but also for multivariate base rates of change scores for partial batteries and the full battery.

Contributor Information

Justin E Karr, Department of Psychology, College of Arts and Sciences, University of Kentucky, Lexington, KY, USA.

Eric O Ingram, Department of Psychology, College of Arts and Sciences, University of Kentucky, Lexington, KY, USA.

Cristina N Pinheiro, Department of Psychology, College of Arts and Sciences, University of Kentucky, Lexington, KY, USA.

Sheliza Ali, Department of Neurology, College of Medicine, University of Kentucky, Lexington, KY, USA.

Grant L Iverson, Department of Physical Medicine and Rehabilitation, Harvard Medical School, Boston, MA, USA; Department of Physical Medicine and Rehabilitation, Spaulding Rehabilitation Hospital, Charlestown, MA, USA; Department of Physical Medicine and Rehabilitation, Schoen Adams Research Institute at Spaulding Rehabilitation, Charlestown, MA, USA; Home Base, A Red Sox Foundation and Massachusetts General Hospital Program, Charlestown, MA, USA.

FUNDING

This work was supported, in part, by a Building Interdisciplinary Research Careers in Women's Health (BIRCWH) grant (#K12-DA035150) from the National Institute on Drug Abuse (NIDA) of the National Institutes of Health (NIH). G.L.I., Ph.D. serves as a scientific advisor for NanoDX®, Sway Operations, LLC, and Highmark, Inc. He has received past research support or funding from several test publishing companies, including ImPACT Applications, Inc., CNS Vital Signs, and Psychological Assessment Resources (PAR, Inc.). He receives royalties from the sales of one neuropsychological test (WCST-64). He acknowledges unrestricted philanthropic support from ImPACT Applications, Inc., the Mooney-Reed Charitable Foundation, the National Rugby League, Boston Bolts, and the Schoen Adams Research Institute at Spaulding Rehabilitation.

CONFLICT OF INTEREST

None declared.

AUTHOR CONTRIBUTIONS

Justin Karr (Conceptualization, Formal analysis, Methodology, Project administration, Supervision, Writing—original draft, Writing—review & editing), Eric Ingram (Formal analysis, Writing—original draft, Writing—review & editing), Cristina Pinheiro (Formal analysis, Writing—review & editing), Sheliza Ali (Writing—original draft, Writing—review & editing), and Grant Iverson (Conceptualization, Writing—review & editing)

References

Beaumont, J. L., Havlik, R., Cook, K. F., Hays, R. D., Wallner-Allen, K., Korper, S. P., et al. (2013). Norming plans for the NIH Toolbox. Neurology, 80(11_supplement_3), S87–S92. 10.1212/wnl.0b013e3182872e70.
Brooks, B. L., Holdnack, J. A., & Iverson, G. L. (2016). To Change is human: “Abnormal” reliable change memory scores are common in healthy adults and older adults. Archives of Clinical Neuropsychology, 31(8), 1026–1036. 10.1093/arclin/acw079.
Brooks, B. L., Holdnack, J. A., & Iverson, G. L. (2017). Reliable change on memory tests is common in healthy children and adolescents. Archives of Clinical Neuropsychology, 32(8), 1001–1009. 10.1093/arclin/acx028.
Carlozzi, N. E., Tulsky, D. S., Chiaravalloti, N. D., Beaumont, J. L., Weintraub, S., Conway, K., et al. (2014). NIH toolbox cognitive battery (NIHTB-CB): The NIHTB pattern comparison processing speed test. Journal of the International Neuropsychological Society, 20(6), 630–641. 10.1017/S1355617714000319.
Casaletto, K. B., Umlauf, A., Beaumont, J., Gershon, R., Slotkin, J., Akshoomoff, N., et al. (2015). Demographically corrected normative standards for the English version of the NIH Toolbox Cognition Battery. Journal of the International Neuropsychological Society, 21(5), 378–391. 10.1017/S1355617715000351.
Chelune, G. J., Naugle, R. I., Lüders, H., Sedlak, J., & Awad, I. A. (1993). Individual change after epilepsy surgery: Practice effects and base-rate information. Neuropsychology, 7(1), 41–52. 10.1037/0894-4105.7.1.41.
Cysique, L. A., Franklin, D., Abramson, I., Ellis, R. J., Letendre, S., Collier, A., et al. (2011). Normative data and validation of a regression based summary score for assessing meaningful neuropsychological change. Journal of Clinical and Experimental Neuropsychology, 33(5), 505–522. 10.1080/13803395.2010.535504.
Delis, D. C., Kaplan, E., & Kramer, J. H. (2001). The Delis-Kaplan Executive Function System: Technical manual. The Psychological Corporation, San Antonio, Texas, USA.
Dikmen, S. S., Bauer, P. J., Weintraub, S., Mungas, D., Slotkin, J., Beaumont, J. L., et al. (2014). Measuring episodic memory across the lifespan: NIH toolbox picture sequence memory test. Journal of the International Neuropsychological Society, 20(6), 611–619. 10.1017/S1355617714000460.
Duff, K. (2012). Current topics in science and practice evidence-based indicators of neuropsychological change in the individual patient: Relevant concepts and methods. Archives of Clinical Neuropsychology, 27(3), 248–261. 10.1093/arclin/acr120.
Duff, K., Beglinger, L. J., Moser, D. J., Paulsen, J. S., Schultz, S. K., & Arndt, S. (2010). Predicting cognitive change in older adults: The relative contribution of practice effects. Archives of Clinical Neuropsychology, 25(2), 81–88. 10.1093/arclin/acp105.
Erdfelder, E., Faul, F., Buchner, A., & Lang, A. G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41(4), 1149–1160. 10.3758/BRM.41.4.1149.
Field, A. P. (2005). Intraclass correlation. In Everitt, B. & Howell, D. (Eds.), Encyclopedia of Statistics in Behavioral Science (pp. 948–954). John Wiley & Sons, Ltd., Hoboken, New Jersey, USA. 10.1002/0470013192.bsa313
Fox, R. S., Zhang, M., Amagai, S., Bassard, A., Dworak, E. M., Han, Y. C., et al. (2022). Uses of the NIH Toolbox® in clinical samples: A scoping review. Neurology: Clinical Practice, 12(4), 307–319. 10.1212/CPJ.0000000000200060.
Gershon, R. C. (2016). NIH toolbox norming study. Harvard Dataverse, V4, UNF:6:bOqMnZEEG/rBz6SQyN4t2g== [fileUNF]. 10.7910/DVN/FF4DI7.
Heaton, R. K., Akshoomoff, N., Tulsky, D., Mungas, D., Weintraub, S., Dikmen, S., et al. (2014). Reliability and validity of composite scores from the NIH toolbox cognition battery in adults. Journal of the International Neuropsychological Society, 20(6), 588–598. 10.1017/S1355617714000241.
Heilbronner, R. L., Sweet, J. J., Attix, D. K., Krull, K. R., Henry, G. K., & Hart, R. P. (2010). Official position of the American Academy of Clinical Neuropsychology on serial neuropsychological assessments: The utility and challenges of repeat test administrations in clinical and forensic contexts. The Clinical Neuropsychologist, 24(8), 1267–1278. 10.1080/13854046.2010.526785.
Iverson, G. L. (2001). Interpreting change on the WAIS-III/WMS-III in clinical samples. Archives of Clinical Neuropsychology, 16(2), 183–191. 10.1016/S0887-6177(00)00060-3.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59(1), 12–19. 10.1037/0022-006X.59.1.12.
Karr, J. E., Rivera Mindt, M., & Iverson, G. L. (2021). Interpreting reliable change on the Spanish-language NIH toolbox cognition battery. Applied Neuropsychology: Adult, 1–9, 1–9. 10.1080/23279095.2021.2011726.
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. 10.1016/j.jcm.2016.02.012.
Mungas, D., Heaton, R., Tulsky, D., Zelazo, P. D., Slotkin, J., Blitz, D., et al. (2014). Factor structure, convergent validity, and discriminant validity of the NIH toolbox cognitive health battery (NIHTB-CHB) in adults. Journal of the International Neuropsychological Society, 20(6), 579–587. 10.1017/S1355617714000307.
Parsey, C. M., Bagger, J. E., Trittschuh, E. H., & Hanson, A. J. (2021). Utility of the iPad NIH Toolbox Cognition Battery in a clinical trial of older adults. Journal of the American Geriatrics Society, 69(12), 3519–3528. 10.1111/jgs.17382.
Rabin, L. A., Paolillo, E., & Barr, W. B. (2016). Stability in test-usage practices of clinical neuropsychologists in the United States and Canada over a 10-Year Period: A follow-up survey of INS and NAN members. Archives of Clinical Neuropsychology, 31(3), 206–230. 10.1093/arclin/acw007.
Rinehardt, E., Duff, K., Schoenberg, M., Mattingly, M., Bharucha, K., & Scott, J. (2010). Cognitive change on the repeatable battery of neuropsychological status (RBANS) in parkinson’s disease with and without bilateral subthalamic nucleus deep brain stimulation surgery. The Clinical Neuropsychologist, 24(8), 1339–1354. 10.1080/13854046.2010.521770.
Snitz, B. E., Tudorascu, D. L., Yu, Z., Campbell, E., Lopresti, B. J., Laymon, C. M., et al. (2020). Associations between NIH Toolbox Cognition Battery and in vivo brain amyloid and tau pathology in non-demented older adults. Alzheimer’s and Dementia: Diagnosis, Assessment and Disease Monitoring, 12(1), e12018. 10.1002/dad2.12018.
Tulsky, D. S., Carlozzi, N., Chiaravalloti, N. D., Beaumont, J. L., Kisala, P. A., Mungas, D., et al. (2014). NIH Toolbox Cognition Battery (NIHTB-CB): List sorting test to measure working memory. Journal of the International Neuropsychological Society, 20(6), 599–610. 10.1017/S135561771400040X.
Tulsky, D. S., Carlozzi, N. E., Holdnack, J., Heaton, R. K., Wong, A., Goldsmith, A., et al. (2017). Using the NIH toolbox cognition battery (NIHTB-CB) in individuals with traumatic brain injury. Rehabilitation Psychology, 62(4), 413–424. 10.1037/rep0000174.
Van Essen, D. C., Ugurbil, K., Auerbach, E., Barch, D., Behrens, T. E. J., Bucholz, R., et al. (2012). The Human Connectome Project: A data acquisition perspective. NeuroImage, 62(4), 2222–2231. 10.1016/j.neuroimage.2012.02.018.
Wechsler, D. (2008). Wechsler Adult Intelligence Scale (4th ed.). Pearson, Inc., San Antonio, Texas, USA.
Wechsler, D., Holdnack, J. A., & Drozdick, L. W. (2009). Wechsler Memory Scale, Fourth edition: Technical and interpretive manual. In Encyclopedia of psychology, Vol. 8. NCS Pearson, Inc., San Antonio, Texas, USA.
Zelazo, P. D., Anderson, J. E., Richler, J., Wallner-Allen, K., Beaumont, J. L., Conway, K. P., et al. (2014). NIH toolbox cognition battery (CB): Validation of executive function measures in adults. Journal of the International Neuropsychological Society, 20(6), 620–629. 10.1017/S1355617714000472.

