Abstract
The NIH Toolbox® for Assessment of Neurological and Behavioral Function Cognition Battery (NIHTB-CB) is a brief neuropsychological assessment tool created as part of the NIH Neuroscience Blueprint. Since its inception, NIHTB-CB has been widely used in a variety of clinical and research settings, including large-scale epidemiological studies. The NIHTB-CB was recently updated and re-normed to Version 3 (V3). We describe the approach to establish normative reference values. The NIHTB-CB tests were administered to a large English-speaking sample of n = 3,904 (average age = 25.7 years, 52.1% female) individuals from the U.S. population, stratified by age, race and ethnicity, sex assigned at birth, and education level within the four U.S. census regions. Normative data were raked via iterative proportional fitting (e.g., by sex assigned at birth, race, ethnicity, and educational attainment nested within geographic region) to derive sampling weights that match the demographic proportions from the U.S. Census targets. Through regression-based continuous norming and bootstrap techniques, age-adjusted and age-and-education-adjusted normative scores were created for individual measure-level and composite scores.
Keywords: cognition, continuous norming, lifespan, NIH Toolbox, psychometrics, sample weighting, neuropsychology
Introduction
The NIH Toolbox® for Assessment of Neurological and Behavioral Function (NIHTB) was originally created in 2004 under the NIH Blueprint for Neuroscience Research and was conceptualized as a common currency for cross-study comparisons in large-scale research and clinical studies (Fox et al., 2022; Gershon et al., 2013). The NIHTB was designed to capture subtle changes in performance across the lifespan and contains measures encompassing four domains: cognition, emotion, motor, and sensation. The NIH Toolbox Cognition Battery (NIHTB-CB) offers a widely accessible and easy-to-administer neurocognitive assessment that was originally designed to be completed in under 30 minutes on a web-based platform (Gershon et al., 2013; Weintraub et al., 2013) and was subsequently adapted for iPad-based administration. The NIHTB-CB was developed to measure executive functions, episodic memory, processing speed, working memory, and language in individuals 3–85+ years old (Mather et al., 2024; Weintraub et al., 2013).
A major goal of the NIH Toolbox Version 3 (NIHTB V3) was to create new reference, or normative, values based on updated census data. Norm-referenced cognitive assessments are typically updated approximately every 10 years to (a) develop and validate normative performance indices relative to updated census data; (b) mitigate the Flynn effect (Flynn, 1987, 1998, 2000) whereby the average ability scores of a population increase over time (Dickinson & Hiscock, 2011; Pietschnig & Voracek, 2015; Trahan et al., 2014); (c) incorporate new scoring, scaling, or norming methods to increase score precision and interpretive validity (American Educational Research Association et al., 2014; International Test Commission, 2001); and (d) develop and validate new measures that are relevant to contemporary theory and practice. Here, we describe the development of norms for the NIH Toolbox Cognition Battery Version 3 (NIHTB-CB V3).
NIH Toolbox Cognition Battery Version 3 Overview of Revisions
Planning for the revision of the NIHTB-CB began in 2018. Informed by a combination of user feedback, usage of the original NIHTB, and its implementation in large-scale studies (e.g., Victorson et al., 2013), the scientific team worked with human-centered design experts to redesign the user interface and measure workflows, providing a more consistent and standardized workflow across tests. In particular, Section 508 guidelines were adhered to in the redesign. Existing tests were revised by (a) simplifying test instructions to be more consistent across tests and participant age groups; (b) updating test items and, where appropriate, adding new test items; (c) updating the workflows for multistage testing routes and administrative logic; and (d) recalibrating the item pools. New scoring models were developed for some existing tests (e.g., see Shono et al., 2024), and five new, normed tests were added to improve the construct coverage of the Cognition Battery (see Ho et al., 2025; LaForte et al., 2024 for more details). Further details regarding changes to the NIHTB-CB user interface, measures, and scoring can be found in the NIHTB V3 Technical Manual (LaForte et al., 2024).
Norms Development.
While the sampling and norms development processes were generally similar to those used in the development of the original NIHTB (Beaumont et al., 2013), there were some methodological differences for V3. The original NIHTB normative reference values were developed using multifractional polynomials (Casaletto et al., 2015), separately across two age groups: one group for children ages 3–17, and one group for adults ages 18–85. The raw test scores were scaled to have a mean of 10 and a standard deviation of 3 across everyone within each of the two age groups. The expected scale score (from the multifractional polynomial modeling) was treated as the normative reference value for all individuals within the age group (Casaletto et al., 2015). Due to the rapid growth of cognitive abilities during childhood, the wide norming age band for generating the interim scaled scores resulted in difficult-to-interpret NIHTB-CB normative scores for children, insofar as it confounded test performance with developmental age.
To ameliorate this concern, a continuous, regression-based approach to norming was utilized for the NIHTB-CB V3. Regression-based norming procedures have been shown to be more statistically efficient than nonregression-based approaches (Kiselica et al., 2024; Lenhard & Lenhard, 2020; Oosterhuis et al., 2016; Timmerman et al., 2021). Compared with conventional norming approaches, Lenhard and Lenhard (2020) showed that continuous, regression-based approaches minimized model fit error and required smaller sample sizes to achieve similar levels of measurement precision and reliability. In particular, the NIHTB-CB V3 norming study used generalized additive models (Wood et al., 2016), a semiparametric approach, extensions of which have also been used in the norming of several other psychological measures (Timmerman et al., 2021). Although the norming sample was recruited at 1-year age intervals, continuous norming procedures allow normative reference values to be derived for more granular intervals. Normative reference values for V3 measures were derived for 3-month intervals from ages 3 to 10, 6-month intervals from 10 to 18, and 1-year age intervals from ages 18 to 90. As one of the primary goals for the study was to develop age-based reference values, the sample was not chosen to represent the U.S. population proportions for age. In fact, child participants were oversampled to allow more precise estimation of reference values at the ages where cognitive abilities show rapid neurodevelopmental growth and large variation, as well as to minimize any possible stepwise discontinuities in the resulting normative growth curve. This ultimately facilitates ease of interpretation across measures and increases measurement precision (more details in LaForte et al., 2024).
Another psychometric innovation implemented for the NIHTB-CB V3 update was the calibration of all measures onto an item response theory (IRT)–based ability-score metric (Embretson & Reise, 2013), referred to as change sensitive scores (CSSs). CSSs are ability scores calculated from a linear transformation of the IRT-based or raw score from performance on an individual test and have been used in other large-scale assessments such as the Stanford-Binet Intelligence Scales, Fifth Edition (Roid & Pomplun, 2012), and Woodcock-Johnson V (LaForte et al., 2025). Finally, the sampling population was weighted to represent the 2019 American Community Survey and 2020 Decennial Census targets with respect to geographical region, sex assigned at birth, race/ethnicity, and education level (see LaForte et al., 2024).
Method
Study Sample
There were several goals for the V3 norming study. One was to update the Cognition Battery norms to reflect the current U.S. population with regard to region, race/ethnicity, sex assigned at birth, and education level. Thus, the target sample for the NIHTB-CB V3 included English-speaking individuals in the United States between the ages of 3 and 90+. Participants were community-dwelling or noninstitutionalized individuals who were capable of following test instructions in English, as determined from prior studies in which they had participated, and who were able to give informed consent in English. In the case of children under 18 years of age, consent was given by a caregiver or guardian, and the child provided assent. If a recruited individual did not fit into any of the sampling cells or if the sampling cell the individual would fit into was full at the time of recruitment, the individual was excluded from participation. Study participants were recruited from Sago’s national database via phone calls and emails. Sago’s national database consisted of individuals across the United States who were members of Sago’s market research panel, provided their demographic information, and agreed to participate in future research studies. Participants were recruited for the norming study from this panel if their demographics matched the norming sampling plan, and additional participants were recruited for the norming study if the panel did not contain enough participants. As the intent was to obtain a representative sample of community-dwelling children and adults, there was no overt exclusion for participants living with disabilities or neurological or psychiatric disorders.
It should be noted that completing the study required being sufficiently able-bodied to independently complete the neurocognitive assessment, including being able to travel to the study site (compensation for taxis was provided if needed), schedule appointments, correspond through phone calls, and obtain compensation. These may have been implicit inclusion criteria. Those who met the demographic criteria for the sampling plan were enrolled in participation. All testing took place in Sago offices. Sampling targets were monitored daily to ensure that the study demographic targets were met. Compensation and parking fees were provided to adult participants and to families of child participants who completed testing for each session. The estimated time of NIHTB-CB completion was 45 minutes for children ages 3 to 5, and 1 hour for participants ages 6 and older.
A subsample of n = 190 participants was also recruited to complete the battery between 1 and 14 days after the first session to assess test–retest reliability of the measures. In addition, a subsample of participants completed other gold-standard cognitive assessments, including the Wechsler Adult Intelligence Scale, 4th Edition (WAIS-IV; Wechsler, 2008; n = 180), the Wechsler Intelligence Scales for Children, 5th Edition (WISC-V; Wechsler, 2014; n = 50), the Wechsler Preschool and Primary Scales of Intelligence, 4th Edition (WPPSI-IV; Wechsler, 2012; n = 43), the California Verbal Learning Test, 3rd Edition (CVLT3; Delis et al., 2017; n = 51), and the Wechsler Memory Scale, 4th Edition (WMS-IV; Wechsler, 2009; n = 102), to assess convergent validity between scores on these assessments and the NIHTB-CB V3 (more details can be found in LaForte et al., 2024).
Sampling Plan
The V3 norming study sampling plan was developed based on the 1-year estimates from the 2017 American Community Survey (ACS; U.S. Census Bureau, 2017a) and Current Population Survey (CPS; U.S. Census Bureau, 2017b) in consultation with a survey epidemiologist familiar with the U.S. Census Bureau catalogue of surveys and products. Post-data collection, the norming sample was ultimately weighted to match more recent Census figures (i.e., 2019 American Community Survey and 2020 Decennial Census), but as these were not available when the plan was developed, the original sampling targets and the final sample weights varied slightly. Based on expert opinion, a 10% margin around the 2017 ACS and CPS proportions was considered acceptable for the purposes of sampling, with tighter margins enforced for poststratification sample weighting.
Sampling Age Intervals.
One-year sampling cells were utilized for child participants ages 3 to 17 years, and 2-year sampling cells were used for young adults ages 18 to 21 years. A single sampling cell was used for young adults ages 22 to 29 years, and 10-year sampling cells were utilized for participants ages 30 to 79 years. Separate cells were used for participants ages 80 to 85 and ages over 85. Age was based on the last birthday; for example, the cell for age 3 included individuals who were 3 years, 0 days old to 3 years, 364 days old.
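The age cells above can be expressed as a small lookup, which may help when checking recruitment quotas. The following is a minimal sketch, not the study's actual tooling: the function name and cell labels are hypothetical, and the split of ages 18 to 21 into 18-19 and 20-21 is an assumption about how the 2-year cells were drawn.

```python
def sampling_cell(age_years: int) -> str:
    """Map age at last birthday (whole years) to an illustrative sampling-cell label."""
    if age_years < 3:
        raise ValueError("the norming sample starts at age 3")
    if age_years <= 17:
        return str(age_years)  # one-year cells for ages 3-17
    if age_years <= 21:
        # two-year cells for ages 18-21 (assumed split: 18-19 and 20-21)
        return "18-19" if age_years <= 19 else "20-21"
    if age_years <= 29:
        return "22-29"  # single cell for ages 22-29
    if age_years <= 79:
        decade = (age_years // 10) * 10  # ten-year cells for ages 30-79
        return f"{decade}-{decade + 9}"
    if age_years <= 85:
        return "80-85"
    return "86+"  # ages over 85


for age in (3, 19, 25, 30, 79, 85, 90):
    print(age, "->", sampling_cell(age))
```

Because age is based on the last birthday, a child who is 3 years and 364 days old still maps to the cell labeled "3".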
Within each census region (Northeast, Midwest, South, and West), marginal proportions matched the census proportions on ethnicity/race, sex assigned at birth (male or female), and education level. Ethnicity/race categories were Hispanic (regardless of race), non-Hispanic White, non-Hispanic Black, non-Hispanic Asian, or non-Hispanic Other. Education was categorized into less than high school education, high school education or GED, some college (including technical or trade school and 2-year associate’s degree), bachelor’s degree, or graduate or professional degree (including a master’s or doctoral degree, medical degree, or other professional certification). The child sample was stratified on parental education, which used the same education categories as the adult sample except that the bachelor’s degree and higher levels were collapsed into a single category.
Data Collection
Data collection was conducted from June 2021 to October 2021 using a panel company with access to a demographically representative sample across all census regions (Sago). The NIHTB-CB was administered to n = 3,904 individuals in English, of which n = 2,248 were children aged 3 to 17 years old, and n = 1,656 were adults 18 years old or older. Between 25 and 100 individuals were recruited in each targeted sampling cell. Working with members of the authorship team (EHH, AK, EL, RCG), Sago hired examiners who had at least a bachelor’s degree, recruited participants, and conducted assessments in their offices at 12 sites across the United States, including Appleton, WI; Atlanta, GA; Baltimore, MD; Boston, MA; Chicago, IL; Columbus, OH; Dallas, TX; Iselin, NJ; Los Angeles, CA; Nashville, TN; Phoenix, AZ; and St. Louis, MO. All census regions and divisions were represented during the data collection.
This study utilized a “train the trainer” model. Sago trainers and project staff were trained during a 5-day, in-person session led by the Northwestern University NIHTB training team, which consisted of a project manager and multiple faculty members with extensive experience in assessment, including two board-certified neuropsychologists, one of them specializing in pediatric neuropsychology. After returning to their respective sites, Sago trainers practiced administration with participants of various ages and video-recorded their practice cases. These videos were submitted to Northwestern University’s training team for review and certification, and Sago trainers received feedback about any administration errors made during the certification process. Post-certification, Sago trainers administered the same training protocol to their respective site staff, ranging from 10 to 12 examiners per site. Any additional site staff were trained by those already trained and certified by Northwestern, and Northwestern staff also certified these additional site staff. Furthermore, project staff and scientists conducted site visits within the first few weeks of data collection across all sites, so that Northwestern could observe site staff administering the NIHTB. When applicable and as needed, additional corrective feedback and further training were provided.
Measures
Participants were administered all NIHTB-CB V3 tests in a fixed sequence based on age and education level (for more details, see Online Appendix A). Given the large number of age and age by education brackets required to obtain a fully representative sample across multiple sites and within a relatively short time frame, assessment counterbalancing was not pragmatically feasible, though for some tests, participants were randomly assigned to parallel forms (see Table A1). Existing tests from the original NIHTB-CB that were included in the norming study were Dimensional Change Card Sort (DCCS), Flanker Inhibitory Control and Attention (Flanker), List Sorting Working Memory (LSWM), Oral Reading Recognition (ORR), Picture Sequence Memory (PSM), Picture Vocabulary (PV), and Pattern Comparison Processing Speed (PC); further details provided by Weintraub et al., 2013. In addition, several new tests (or existing but not previously normed supplemental tests) were added and normed as part of V3 to improve the overall construct coverage of the battery. These include the Face Name Associative Memory Exam (FNAME; assesses recall and matching of faces and names after timed delay; Rentz et al., 2011), Oral Symbol Digit (OSD; assesses processing speed orally), Rey Auditory Verbal Learning (RAVLT; assesses immediate and delayed recall), Speeded Matching (SM; a match-to-sample measure of processing speed appropriate for young children), and Visual Reasoning (VR; assesses nonverbal and abstract thinking). Details regarding updates to the existing NIHTB-CB tests, including the development and scoring of new tests, can be found in the NIHTB-CB V3 Technical Manual and other manuscripts (Ho et al., manuscript in preparation; LaForte et al., 2024).
Scores
All NIHTB-CB V3 tests, except DCCS and Flanker (see Shono et al., 2024), were calibrated to an IRT-based logit scale. For DCCS and Flanker, rate-corrected scores (ratio of sum of total correct to sum of reaction times across all trials) were transformed onto the same scale so that all NIHTB-CB V3 tests would share a common metric. All scores and their corresponding standard errors were then converted to Change Sensitive Scores (CSS), which are ability scores calculated from the linear transformation of the IRT-based or raw score of individual measures, following other established neurocognitive and psychoeducational assessments (Farmer et al., 2020; McGrew et al., 2014; LaForte et al., 2025; Roid & Pomplun, 2012). The CSSs indicate performance based on an average score at a given age and are on an equal interval scale. Following convention from other large-scale psychoeducational assessments, we centered the CSSs for NIHTB such that a score of 500 is interpreted as the median ability for a 10-year-old in the normative sample. Age-Adjusted Scores (SS; Standard Scores) are norm-referenced scores that take into account the variance explained by age so that such scores can be interpreted as relative to the mean performance for the participant’s age; they are centered at 100 with an SD of 15. SSs were derived for individuals ages 3 to 90. Age-and-Education-Adjusted scores (T-Scores) were developed for ages 22 to 90 with four discrete education categories (High School or Less, Some College, College, and Graduate Degree or Higher); they are centered at 50 with an SD of 10. Age-and-education-adjusted norms were not created for children or young adults insofar as developmental change associated with age was a much stronger predictor of scores than education, whether parental education for children or one’s own education for adults aged 18 to 22 years.
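The rate-corrected score for DCCS and Flanker is described as a simple ratio, which the sketch below illustrates. The function name is hypothetical, reaction times are assumed to be in seconds, and the later transformation onto the common IRT-based scale is not shown because its details are not given here.

```python
def rate_corrected_score(correct, reaction_times):
    """Ratio of the number of correct trials to the summed reaction time over all trials."""
    if len(correct) != len(reaction_times):
        raise ValueError("one reaction time is required per trial")
    total_time = sum(reaction_times)
    if total_time <= 0:
        raise ValueError("total reaction time must be positive")
    return sum(bool(c) for c in correct) / total_time


# Four trials, three of them correct, 3.0 seconds of total reaction time (assumed unit)
score = rate_corrected_score([True, True, False, True], [0.5, 0.75, 1.0, 0.75])
print(score)  # 3 correct / 3.0 s = 1.0 correct responses per second
```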
Analysis
Weighting.
Weighting is a commonly used statistical method to ensure the sample matches the population of inference (in this case, the U.S. population) as closely as possible (Kalton & Flores-Cervantes, 2003). Iterative proportional fitting was used to derive individual probability weights (DeBell & Krosnick, 2009) any time a single marginal proportion deviated from the target demographic proportion by more than 5%. For each norming study participant, a weight was derived to better approximate the updated demographic proportions from the 2020 Decennial Census (U.S. Census Bureau, 2020), augmented with the 2019 ACS 5-year estimates as needed (ACS; U.S. Census Bureau, 2019). Per the sampling plan, weighting factors included sex-by-census region, race and ethnicity categories, and education attainment (using parental education level for children ages 3 to 17 years). Weights were truncated to a maximum value of 4, which was approximately five times the median weight of .79 across the entire sample (Figure 1). Individual sample weights were then applied to every participant in the sample. Bootstrap resamples were drawn with replacement, with each individual weighted by their sample weight (e.g., a person with a sample weight of 2 was twice as likely as average to be selected in a bootstrap resample, and a person with a sample weight of 0.5 was half as likely). Within each bootstrap, one plausible value for an expected score was drawn for each examinee based on their obtained CSS and the associated CSS standard error (Figure 2). These plausible values were regressed on examinee age in months using generalized additive models to provide the bootstrapped sampling distribution of potential reference values by age.
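The weighting and resampling steps described above can be sketched in a few lines of Python with NumPy. This is an illustrative simplification rather than the study's code: the function names and toy data are hypothetical, the margins are raked independently instead of being nested within census region, every category is assumed to be observed in the sample, and plausible values are assumed to be drawn from a normal distribution centered on the obtained CSS with the CSS standard error as its spread.

```python
import numpy as np


def rake(categories, targets, max_weight=4.0, n_iter=200, tol=1e-8):
    """Iterative proportional fitting to marginal targets, then truncation.

    categories: dict of name -> integer category codes (one per participant)
    targets:    dict of name -> target population proportions (summing to 1)
    """
    n = len(next(iter(categories.values())))
    w = np.ones(n)
    for _ in range(n_iter):
        w_prev = w.copy()
        for name, codes in categories.items():
            target = np.asarray(targets[name], dtype=float)
            # current weighted proportion of each category
            current = np.bincount(codes, weights=w, minlength=len(target)) / w.sum()
            w = w * (target / current)[codes]
        w = w * n / w.sum()  # keep the mean weight at 1
        if np.max(np.abs(w - w_prev)) < tol:
            break
    return np.minimum(w, max_weight)  # truncate large weights (4 in the text)


def bootstrap_plausible_values(css, css_se, weights, rng):
    """Draw one weighted bootstrap resample and one plausible value per draw."""
    n = len(css)
    idx = rng.choice(n, size=n, replace=True, p=weights / weights.sum())
    # assumption: plausible values are normal around the obtained CSS
    return idx, rng.normal(loc=css[idx], scale=css_se[idx])


# Toy sample of four participants with two raking margins
sex = np.array([0, 0, 0, 1])
edu = np.array([0, 1, 0, 1])
weights = rake({"sex": sex, "edu": edu}, {"sex": [0.5, 0.5], "edu": [0.4, 0.6]})

css = np.array([480.0, 495.0, 505.0, 520.0])
css_se = np.array([4.0, 3.0, 3.5, 5.0])
idx, plausible = bootstrap_plausible_values(css, css_se, weights, np.random.default_rng(0))
print(weights, idx, plausible)
```

In this toy sample the fourth participant is the only member of an underrepresented group and therefore receives the largest weight, which in turn makes that participant more likely to appear in each bootstrap resample.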
Figure 1. Distribution of Sampling Weights in the NIHTB Norming Study.

Figure 2. Reference Values for Individual NIHTB-CB V3 Measures.

Note. (A) shows age in years on a typical scale and (B) shows age in years on a logged scale. The solid black line indicates the reference value for that measure and age, and the dotted lines above and below the median are the reference value ±1 SD, respectively. Note that the Speeded Matching (SM) measure ranges from 3 to 6 years of age and thus is only placed on the logged scale for visualization purposes. DCCS = Dimensional Change Card Sort, Flanker = Flanker Inhibitory Control and Attention, FNAME Delay = Face Name Association Memory Exam Delay, LSWM = List Sorting Working Memory, ORR = Oral Reading Recognition, OSD = Oral Symbol Digit, PC = Pattern Comparison Processing Speed, PSM = Picture Sequence Memory, PV = Picture Vocabulary, RAVLT = Rey Auditory Verbal Learning Test Immediate, RAVLT Delay = Rey Auditory Verbal Learning Test Delay, SM = Speeded Matching, VR = Visual Reasoning. For the same figure with differing y-axis limits for more detailed rendering for each test/composite, see Figure A1.
Computing Normative Values
After extensive internal research, in part leveraging data from the original NIH Toolbox norming study (Gershon, 2016), a continuous norming approach (Oosterhuis et al., 2016; Zachary & Gorsuch, 1985) was determined to be the most statistically efficient method to derive normed scores. In recent years, continuous norming has been demonstrated to minimize, and in some cases entirely avoid, score gaps that can arise when age is discretized in sometimes arbitrary categories, which may obscure or belie the normative trajectory of neurocognitive performance across the lifespan (Lenhard & Lenhard, 2020). Instead, the continuous norming approach assumes that the distribution of abilities varies by age. Continuous norming has several benefits: it allows smaller sample sizes, accommodates more flexible (e.g., nonparametric) statistical models that may more accurately capture neurocognitive trajectories, and allows the inclusion of covariates that may be of interest to a specific clinical population (e.g., education level; Kiselica et al., 2024). Figure 3 includes an overview of the norm scores derivation process for the NIHTB-CB. Following Oosterhuis et al. (2016), bootstrap resamples were regressed on chronological age in months to develop age-adjusted norms for each test. Obtained scores were standardized based on deviation from age-based reference values using the age-based expected score variability. From these models, the median and SD of the deviance residuals were computed, and a z-score transformation was applied such that the SS norms were centered at M = 100 with SD = 15; SSs greater than 160 or less than 40 were winsorized to 161 or 39, respectively (i.e., M ± 4 SD ± 1). The process was repeated in a similar fashion with age-within-education strata for T-score norms. Age-and-education-adjusted norms were distributed at M = 50 and SD = 10, with scores greater than 80 or less than 20 winsorized to 81 or 19, respectively (i.e., M ± 3 SD ± 1). This allowed for performance to be accurately estimated at both the extremely high and low ability ranges (Figure 4).
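The final standardization and winsorizing step can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, the age-based reference value and SD are taken as given inputs (the generalized additive model that produces them is not reproduced), and no rounding is applied because the rounding rule is not stated here.

```python
import numpy as np


def to_normed_score(css, expected, sd, mean=100.0, scale=15.0, max_sd=4.0):
    """Convert CSSs to normed scores and winsorize the extremes.

    expected, sd: age-based reference value and SD (assumed to come from the
                  fitted norming model, which is not shown here)
    """
    z = (np.asarray(css, dtype=float) - expected) / sd
    score = mean + scale * z
    lower, upper = mean - max_sd * scale, mean + max_sd * scale
    # scores beyond M +/- max_sd SD are set one point outside the bound
    return np.where(score > upper, upper + 1, np.where(score < lower, lower - 1, score))


# Age-adjusted standard scores: M = 100, SD = 15, winsorized to 39 and 161
ss = to_normed_score([500.0, 530.0, 400.0], expected=500.0, sd=10.0)

# Age-and-education-adjusted T-scores: M = 50, SD = 10, winsorized to 19 and 81
t = to_normed_score([500.0, 540.0], expected=500.0, sd=10.0, mean=50.0, scale=10.0, max_sd=3.0)
print(ss, t)
```

With these toy inputs, a CSS three SDs above the reference value maps to a standard score of 145, while a CSS ten SDs below it is winsorized to 39.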
Figure 3. (A) Norming Approach for Individual Test Scores. (B) Norming Approach for Composite Scores.

Figure 4. Growth Curves of Change Sensitive Scores Ages 6+, Reproduced With Permission From LaForte et al. (2024).

Note. Figure 4 is optimally viewed in the online version of the publication. DCCS = Dimensional Change Card Sort; Flanker = Flanker Inhibitory Control and Attention.
Results
Sampling
Derivation of Norm-Referenced Measure-Level Scores.
The procured norming sample closely matched the demographic targets of the 2020 Decennial Census and 2019 ACS 5-year estimates. The mean sample weight was 1.0 (SD = .56), the overall median sample weight was .89, and the interquartile range of the sample weight was .30. Depending on age group, the median sample weight ranged from .7 to .93. Reassuringly, many individuals’ weights were 1 or very close to 1, indicating that their demographic groups were represented in the sample in close concordance with the demographic targets. The largest sample weights were assigned to individuals from demographic groups that were underrepresented in the norming sample, particularly those with less than high school education or children of parents with less than high school education. See Online Appendix B for sampling tables.
Derivation of Norm-Referenced Composite Scores.
CSSs for the NIHTB-CB V3 composites were computed by averaging the CSSs from the requisite tests, in line with the factor structure validated in the original NIHTB-CB (Mungas et al., 2013, 2014). The crystallized composite comprised the PV and ORR measures, while the fluid composite comprised the DCCS, Flanker, LSWM, PC, and PSM measures. The total cognition composite score was the average of the crystallized and fluid composite scores.
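The composite definitions above amount to simple averages, as the following sketch shows. The function and constant names are hypothetical, the input CSSs are made-up values, and the handling of missing tests is not specified in the text, so this version simply raises a KeyError if a required test is absent.

```python
from statistics import fmean

CRYSTALLIZED_TESTS = ("PV", "ORR")
FLUID_TESTS = ("DCCS", "Flanker", "LSWM", "PC", "PSM")


def composite_css(scores):
    """Average measure-level CSSs into crystallized, fluid, and total composites."""
    crystallized = fmean(scores[test] for test in CRYSTALLIZED_TESTS)
    fluid = fmean(scores[test] for test in FLUID_TESTS)
    # the total composite averages the two composites, not the seven tests
    total = fmean([crystallized, fluid])
    return {"crystallized": crystallized, "fluid": fluid, "total": total}


example = composite_css({
    "PV": 520.0, "ORR": 500.0,
    "DCCS": 500.0, "Flanker": 510.0, "LSWM": 490.0, "PC": 505.0, "PSM": 495.0,
})
print(example)  # crystallized 510.0, fluid 500.0, total 505.0
```

Averaging the two composites gives the crystallized and fluid domains equal weight in the total, even though the fluid composite draws on more tests.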
Due to the increasing usage of NIHTB-CB in large-scale epidemiological studies, including children at younger ages (Price et al., 2023), two additional composite scores were developed with the intent to be highly sensitive to neurodevelopmental changes. First, the early childhood composite score was developed for children 4 to 8.5 years old and comprised PV, VR, and SM measure scores. Second, an abbreviated early childhood crystallized-fluid-processing speed (CFPS) composite was developed for 3- to 6-year-old children and comprised PV, VR, and SM measure scores (see Figure 5 for the composites’ sensitivity to age-related change). More details are found in Ho et al. (manuscript in preparation, OSF Preprint: https://osf.io/2w83f_v1).
Figure 5. Growth Curves of Change Sensitive Scores for the Early Childhood (EC) and Early Childhood Crystallized-Fluid-Processing Speed (EC-CFPS) Composites Ages 3–85.

The standard error of the mean CSS for composite scores was based on the empirical test–retest reliability, estimated separately for children (6–17 years of age) and adults (18–90 years of age). Composites thus use a traditional standard error based on the overall reliability of the test, not the reliability of individual scores.
Procedures to produce normed scores followed those used for measure-level scoring. This included drawing a plausible value for each individual within the bootstrap resampling process to create reference values (Figure 6). This allowed for performance to be accurately estimated at both the extremely high and low ability ranges (Figure 5). The crystallized, fluid, and total composite CSSs were regressed on age to obtain age-adjusted norm scores or on age within education strata to obtain age-and-education-adjusted norm scores. The early childhood and EC-CFPS composite CSSs were regressed on age to obtain the age-adjusted norm scores, with increased resolution at lower age ranges to account for rapid neurodevelopment.
Figure 6. Reference Values for NIHTB-CB V3 Composites.

Note. (A) shows age in years on a typical scale and (B) shows age in years on a logged scale. The solid black line indicates the reference value for that measure and age, and the dotted lines above and below the median are the reference value ±1 SD, respectively. Note that the early childhood composites EC and ECCFPS range from 3 to 8.5 and 3 to 6, respectively, and thus are only placed on the logged scale for visualization purposes. EC = early childhood composite; ECCFPS = early childhood crystallized-fluid-processing speed composite. For the same figure with differing y-axis limits for more detailed rendering for each test/composite, see Figure A2.
Discussion
Here, we described the development of norm scores for the NIHTB-CB V3, which involved a large-scale study of over 3,000 healthy children and adults who completed the battery in English. We created age-adjusted and age-and-education-adjusted norm scores weighted to U.S. Census proportions, applying regression-based norming approaches to obtain norms for each NIHTB-CB V3 measure and composite scores. According to a recent systematic review of normative data estimation approaches (delCacho-Tena et al., 2024), regression-based approaches have increased in popularity, likely due to more interpretive capability and more precise estimation of ability.
Since the original release of the NIHTB-CB, the battery has been deployed in various large-scale NIH-supported initiatives (Price et al., 2023), serving as a central outcome measure and realizing the original vision of the NIH Neuroscience Blueprint to be the “common currency” of neurocognition and other health outcomes.
Though not designed as a comprehensive neuropsychological or neurodiagnostic tool, the NIHTB-CB V3 may have potential use as a brief screener to help identify individuals for further comprehensive clinical assessment. The NIHTB, and the cognition battery in particular, has recently seen increased use in clinical research across a wide range of conditions (Fox et al., 2022; Wei et al., 2025). Future studies should continue to evaluate the clinical utility of the NIHTB-CB V3. Likewise, further validation of the NIHTB-CB V3 normative scores in clinical populations is necessary to improve their clinical interpretability.
Consistent with the norms for the original NIHTB (Beaumont et al., 2013; Casaletto et al., 2015), the reference sample from which the norm scores were derived included community-dwelling individuals aged 3 to 102 years who were able to provide informed consent or assent where necessary and had adequate visual, auditory, and motor functioning to complete the measures. Thus, caution is advised when interpreting norms in clinical samples with different functional profiles. Though the sampling plan was designed to maximize representativeness, the estimated norm scores and their associated standard errors may differ from the true scores. Despite this, the availability of normed cognitive measures updated to the 2020 U.S. demographics provides an important tool for addressing both clinical and research endeavors in neurological and behavioral health.
Score Use and Interpretability
The scores described in this study are implemented in the NIHTB V3 iPad-based application and are automatically calculated and produced after test administration. For assessing individual change, we recommend using CSSs to track improvements or deterioration on an individual level (i.e., without making group comparisons). Norm-referenced scores are necessary for accurate evaluation of neuropsychological functioning relative to a well-specified and representative population. For V3, age-adjusted, and, for adults 22 and older, age-and-education-adjusted scores are automatically calculated and produced within the NIHTB app and are recommended for use when comparing an individual’s performance against either the univariate or joint effect of these two demographic variables, which are most often of interest. Users who are interested in examining the impact of additional variables on scores, such as gender, may consider entering these variables as covariates when modeling the normed score of choice.
Future Directions
To situate assessment results and resulting interpretations in their appropriate contexts, research and clinical users of neuropsychological tests have shown growing interest in incorporating socioeconomic status (SES) or other socio-cultural indices that more fully characterize the cumulative influences of the social determinants of brain health (Kiselica et al., 2024) into normed scores. Rather than relying on proxy variables that may categorize individuals with radically different lived experiences as part of one artificial monolith (e.g., Manly, 2005; Possin et al., 2021), SES-based norms may adjust scores for a comprehensive, multifaceted dimension that has the potential to play an important role in predicting neurodevelopmental trajectories across the lifespan. In doing so, the clinical utility and equity of assessment may be improved. Indeed, SES has been shown to be a protective factor for neurocognition in both clinical pediatric populations (neuro-oncology; Torres et al., 2021) and aging populations (risk of incident dementia; Wang et al., 2024). More empirical nuance, careful identification strategies, and an equitable measurement lens would be required to precisely operationalize the components that could be incorporated into SES-based norms, including accounting for changes in SES over the life course (e.g., due to changes in educational attainment or financial capital; Marden et al., 2017).
Supplementary Material
Supplemental material for this article is available online.
Acknowledgments
The authors gratefully acknowledge the contributions of the numerous individuals who participated in data collection efforts for the NIHTB V3 norming study, including the site staff across the 13 data collection sites, the trained examiners, the project management staff, statistical analysts, the technical team at Northwestern and Helium Foot, and the many scientists who advised on the norming and analytic strategy. In particular, they would like to thank several colleagues for their contributions, including Maria Varela Diaz, Julie Hook, Amy Giella, Anyelo Almonte Correa, Shalini Patel, Shaili Ganatra, Zutima Tuladhar, Vitali Ustsinovich, and Hubert Adam. The scores described in this study are implemented in the NIHTB V3 iPad-based application and are automatically calculated and produced after test administration. For scoring V3 item-level data outside of the app, please refer to https://nihtoolbox.org/support/.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded in part by the Environmental influences on Child Health Outcomes PRO Research Resource (U24OD023319; PI Gershon & Cella), the ARMADA: Advancing Reliable Measurement in Alzheimer’s Disease and cognitive Aging study (U2CAG057441; PI Gershon & Weintraub), and the original Mobile Toolbox study (U2CAG060426; Co-PIs Gershon, Kaat, Rentz, Kellen, & Weiner).
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical Approval and Informed Consent Statements
This study was approved by Northwestern University’s Institutional Review Board (STU00214814) as non-human subjects research.
Data Availability Statement
The data will be made available on a public data repository 12 months after publication of this article.
References
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (Eds.). (2014). Standards for educational and psychological testing. American Educational Research Association. https://www.aera.net/publications/books/standards-for-educational-psychological-testing-2014-edition
- Beaumont JL, Havlik R, Cook KF, Hays RD, Wallner-Allen K, Korper SP, Lai JS, Nord C, Zill N, Choi S, Yost KJ, Ustsinovich V, Brouwers P, Hoffman HJ, & Gershon R. (2013). Norming plans for the NIH Toolbox. Neurology, 80(11 Suppl. 3), S87–S92. 10.1212/WNL.0b013e3182872e70
- Casaletto KB, Umlauf A, Beaumont J, Gershon R, Slotkin J, Akshoomoff N, & Heaton RK (2015). Demographically corrected normative standards for the English version of the NIH toolbox cognition battery. Journal of the International Neuropsychological Society, 21(5), 378–391. 10.1017/S1355617715000351
- DeBell M, & Krosnick JA (2009). Computing weights for American national election study survey data (ANES Technical Report series no. nes012427). American National Election Studies.
- delCacho-Tena A, Christ BR, Arango-Lasprilla JC, Perrin PB, Rivera D, & Olabarrieta-Landa L. (2024). Normative data estimation in neuropsychological tests: A systematic review. Archives of Clinical Neuropsychology, 39(3), 383–398. 10.1093/arclin/acad084
- Delis DC, Kramer JH, Kaplan E, & Ober BA (2017). California Verbal Learning Test, Third Edition (CVLT-3). The Psychological Corporation.
- Dickinson MD, & Hiscock M. (2011). The Flynn effect in neuropsychological assessment. Applied Neuropsychology, 18(2), 136–142. 10.1080/09084282.2010.547785
- Embretson SE, & Reise SP (2013). Item response theory. Psychology Press.
- Farmer RL, McGill RJ, Dombrowski SC, McClain MB, Harris B, Lockwood AB, Powell SL, Pynn C, Smith-Kellen S, Loethen E, Benson NF, & Stinnett TA (2020). Teleassessment with children and adolescents during the coronavirus (COVID-19) pandemic and beyond: Practice and policy implications. Professional Psychology: Research and Practice, 51(5), 477–487. 10.1037/pro0000349
- Flynn JR (1987). Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101(2), 171–191. 10.1037/0033-2909.101.2.171
- Flynn JR (1998). WAIS-III and WISC-III gains in the United States from 1972 to 1995: How to compensate for obsolete norms. Perceptual and Motor Skills, 86(3, Pt 2), 1231–1239. 10.2466/pms.1998.86.3c.1231
- Flynn JR (2000). The hidden history of IQ and special education: Can the problems be solved? Psychology, Public Policy, and Law, 6(1), 191–198. 10.1037/1076-8971.6.1.191
- Fox RS, Zhang M, Amagai S, Bassard A, Dworak EM, Han YC, Kassanits J, Miller CH, Nowinski CJ, Giella AK, Stoeger JN, Swantek K, Hook JN, & Gershon RC (2022). Uses of the NIH Toolbox® in clinical samples: A scoping review. Neurology Clinical Practice, 12(4), 307–319. 10.1212/CPJ.0000000000200060
- Gershon RC (2016). NIH Toolbox norming study (Harvard Dataverse, V.4). 10.7910/DVN/FF4DI7
- Gershon RC, Wagster MV, Hendrie HC, Fox NA, Cook KF, & Nowinski CJ (2013). NIH toolbox for assessment of neurological and behavioral function. Neurology, 80(11 Suppl 3), S2–S6. 10.1212/WNL.0b013e3182872e5f
- Ho EH, Han YC, LaForte E, Kaat AJ, Ece B, Dworak EM, Olsen JB, Hook J, & Gershon R. (2025). Development and validation of the NIH Toolbox version 3.0 for children [Manuscript under submission, OSF Preprint: https://osf.io/2w83f_v1]. Feinberg School of Medicine, Northwestern University.
- International Test Commission. (2001). International guidelines for test use. International Journal of Testing, 1(2), 93–114.
- Kalton G, & Flores-Cervantes I. (2003). Weighting methods. Journal of Official Statistics, 19, 81–97.
- Kiselica AM, Karr JE, Mikula CM, Ranum RM, Benge JF, Medina LD, & Woods SP (2024). Recent advances in neuropsychological test interpretation for clinical practice. Neuropsychology Review, 34(2), 637–667. 10.1007/s11065-023-09596-1
- LaForte EM, Dailey D, & McGrew KS (2025). Woodcock-Johnson V technical manual. Riverside Assessments, LLC.
- LaForte EM, Hook JN, & Giella AK (2024). National Institutes of Health (NIH) Toolbox® version 3 technical manual. https://nihtoolbox.org/app/uploads/2024/04/NIHTB-V3-Technical-Manual_040524.pdf
- Lenhard W, & Lenhard A. (2020). Improvement of norm score quality via regression-based continuous norming. Educational and Psychological Measurement, 81(2), 229–261. 10.1177/0013164420928457
- Manly JJ (2005). Advantages and disadvantages of separate norms for African Americans. The Clinical Neuropsychologist, 19(2), 270–275. 10.1080/13854040590945346
- Marden JR, Tchetgen Tchetgen EJ, Kawachi I, & Glymour MM (2017). Contribution of socioeconomic status at 3 life-course periods to late-life memory function and decline: Early and late predictors of dementia risk. American Journal of Epidemiology, 186(7), 805–814. 10.1093/aje/kwx155
- Mather MA, Ho EH, Bedjeti K, Karpouzian-Rogers T, Rogalski EJ, Gershon R, & Weintraub S. (2024). Measuring multidimensional aspects of health in the oldest old using the NIH Toolbox: Results from the ARMADA study. Archives of Clinical Neuropsychology, 39(5), 535–546. 10.1093/arclin/acad105
- McGrew KS, LaForte EM, & Schrank FA (2014). Woodcock-Johnson IV technical manual. Riverside Assessments, LLC.
- Mungas D, Heaton R, Tulsky D, Zelazo PD, Slotkin J, Blitz D, Lai JS, & Gershon R. (2014). Factor structure, convergent validity, and discriminant validity of the NIH Toolbox Cognitive Health Battery (NIHTB-CHB) in adults. Journal of the International Neuropsychological Society, 20(6), 579–587. 10.1017/S1355617714000307
- Mungas D, Widaman K, Zelazo PD, Tulsky D, Heaton RK, Slotkin J, Blitz DL, & Gershon RC (2013). National Institutes of Health Toolbox Cognition Battery (NIH Toolbox CB): Validation for children between 3 and 15 years: VII. NIH Toolbox Cognition Battery (CB): Factor structure for 3 to 15 year olds. Monographs of the Society for Research in Child Development, 78(4), 103–118. 10.1111/mono.12037
- Oosterhuis HE, van der Ark LA, & Sijtsma K. (2016). Sample size requirements for traditional and regression-based norms. Assessment, 23(2), 191–202. 10.1177/1073191115580638
- Pietschnig J, & Voracek M. (2015). One century of global IQ gains: A formal meta-analysis of the Flynn effect (1909-2013). Perspectives on Psychological Science, 10(3), 282–306. 10.1177/1745691615577701
- Possin KL, Tsoy E, & Windon CC (2021). Perils of race-based norms in cognitive testing: The case of former NFL players. JAMA Neurology, 78(4), 377–378. 10.1001/jamaneurol.2020.4763
- Price JC, Lee JJ, Saraiya N, Lei S, & Mintz CD (2023). An update on NIH programs relevant to child brain health research: ECHO, ABCD, HBCD, and MIRA. Journal of Neurosurgical Anesthesiology, 35(1), 119–123. 10.1097/ANA.0000000000000875
- Rentz DM, Amariglio RE, Becker JA, Frey M, Olson LE, Frishe K, Carmasin J, Maye JE, Johnson KA, & Sperling RA (2011). Face-name associative memory performance is related to amyloid burden in normal elderly. Neuropsychologia, 49(9), 2776–2783. 10.1016/j.neuropsychologia.2011.06.006
- Roid GH, & Pomplun M. (2012). The Stanford-Binet Intelligence Scales, fifth edition. In Flanagan DP & Harrison PL (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (3rd ed., pp. 249–268). The Guilford Press.
- Shono Y, Ece B, Ho EH, Kaat AJ, La Forte EM, Ayturk E, & Gershon R. (2024). A comparison of scoring algorithms for the NIH Toolbox Executive Function Tasks in a U.S. norming sample. Psychological Assessment, 36(12), 760–771. 10.1037/pas0001350
- Timmerman ME, Voncken L, & Albers CJ (2021). A tutorial on regression-based norming of psychological tests with GAMLSS. Psychological Methods, 26(3), 357–373. 10.1037/met0000348
- Torres VA, Ashford JM, Wright E, Xu J, Zhang H, Merchant TE, & Conklin HM (2021). The impact of socioeconomic status (SES) on cognitive outcomes following radiotherapy for pediatric brain tumors: A prospective, longitudinal trial. Neuro-Oncology, 23(7), 1173–1182. 10.1093/neuonc/noab018
- Trahan LH, Stuebing KK, Fletcher JM, & Hiscock M. (2014). The Flynn effect: A meta-analysis. Psychological Bulletin, 140(5), 1332–1360. 10.1037/a0037173
- U.S. Census Bureau. (2017a). 2017 American Community Survey 1-year estimates. https://www.census.gov/programs-surveys/acs/technical-documentation/table-and-geography-changes/2017/1-year.html
- U.S. Census Bureau. (2017b). Current population survey (CPS). https://www.census.gov/programs-surveys/cps.html
- U.S. Census Bureau. (2019). American Community Survey 5-year estimates. https://www.census.gov/data/developers/data-sets/acs-5year.html
- U.S. Census Bureau. (2020). 2020 Census Results. https://www.census.gov/programs-surveys/decennial-census/decade/2020/2020-census-results.html
- Victorson D, Manly J, Wallner-Allen K, Fox N, Purnell C, Hendrie H, Havlik R, Harniss M, Magasi S, Correia H, & Gershon R. (2013). Using the NIH Toolbox in special populations: Considerations for assessment of pediatric, geriatric, culturally diverse, non–English-speaking, and disabled individuals. Neurology, 80(11 Suppl 3), S13–S19. 10.1212/WNL.0b013e3182872e26
- Wang K, Fang Y, Zheng R, Zhao X, Wang S, Lu J, Wang W, Ning G, Xu Y, & Bi Y. (2024). Associations of socioeconomic status and healthy lifestyle with incident dementia and cognitive decline: Two prospective cohort studies. eClinicalMedicine, 76, 102831. 10.1016/j.eclinm.2024.102831
- Wei X, McKinlay JD, Harding JE, Wouldes TA, Rogers J, Brown TL, & Franke N. (2025). NIH toolbox for assessment of neurocognitive, motor and emotional-behavioral function in childhood: A systematic review. Child Neuropsychology, 31(6), 948–983. 10.1080/09297049.2024.2447444
- Weintraub S, Dikmen SS, Heaton RK, Tulsky DS, Zelazo PD, Bauer PJ, Carlozzi NE, Slotkin J, Blitz D, Wallner-Allen K, Fox NA, Beaumont JL, Mungas D, Nowinski CJ, Richler J, Deocamp JA, Anderson JE, Manly JJ, Borosh B, … Gershon RC (2013). Cognition assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S54–S64. 10.1212/WNL.0b013e3182872ded
- Wechsler D. (2008). Wechsler Adult Intelligence Scale--Fourth Edition (WAIS-IV). APA PsycTests. 10.1037/t15169-000
- Wechsler D. (2009). Wechsler Memory Scale – Fourth Edition (WMS-IV). Pearson.
- Wechsler D. (2012). Wechsler Preschool and Primary Scale of Intelligence (4th ed.). Pearson.
- Wechsler D. (2014). Wechsler Intelligence Scale for Children—Fifth Edition (WISC-V). Pearson.
- Wood SN, Pya N, & Säfken B. (2016). Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association, 111(516), 1548–1563. 10.1080/01621459.2016.1180986
- Zachary RA, & Gorsuch RL (1985). Continuous norming: Implications for the WAIS-R. Journal of Clinical Psychology, 41(1), 86–94.
