Author manuscript; available in PMC: 2022 Mar 1.
Published in final edited form as: Assessment. 2020 Apr 8;28(2):457–471. doi: 10.1177/1073191120913943

Reliability and Validity of the Spanish-language Version of the NIH Toolbox

Rina S Fox 1, Jennifer J Manly 2, Jerry Slotkin 3, John Devin Peipert 1, Richard C Gershon 1
PMCID: PMC7541574  NIHMSID: NIHMS1607150  PMID: 32264689

Abstract

The psychometric properties of the English-language NIH Toolbox for Assessment of Neurological and Behavioral Function (NIH Toolbox) have been examined in numerous populations. This study evaluated the reliability and validity of the Spanish-language NIH Toolbox. Participants were children ages 3–7 and adults ages 18–85 who took part in the NIH Toolbox norming study in Spanish. Results supported the internal consistency reliability of included measures. Test-retest reliability was strong for most tests, though it was weaker for the test of olfaction among children and the test of locomotion among adults. Spearman’s correlations and general linear models showed Spanish tests were often associated with age, sex, and education. Convergent validity for the two language measures that underwent more intensive development, evaluated via Spearman’s correlations with legacy measures, was strong. Results support using the Spanish-language NIH Toolbox to measure neurological and behavioral functioning among Spanish-speaking individuals in the United States.

Keywords: NIH Toolbox, Reliability, Validity, Spanish, Latinx


Over the previous half-century, the Latinx population in the United States has increased ninefold, currently representing approximately 18% of the overall population (United States Census Bureau, 2018a). Future projections indicate that by 2060, Latinxs will represent 27.5% of the population, reflecting a near doubling of the current number of Latinxs living in the United States (United States Census Bureau, 2018b). While English is the dominant language of the United States, as of 2011 only approximately half of Latinxs living in the United States reported being able to speak English “very well” (Ryan, 2013). Moreover, of the more than 60 million individuals age five and older living in the United States who reported speaking a language other than English at home, nearly two-thirds (62%) spoke Spanish (Ryan, 2013). Given these increases, consideration of the linguistic needs of the Latinx population in the United States is of particular importance for clinicians and researchers who wish to evaluate different components of neurological and behavioral functioning in multicultural contexts.

In recognition of this, the NIH Toolbox for Assessment of Neurological and Behavioral Function® (NIH Toolbox®) was developed as a collection of neuro-behavioral measures designed to quickly assess sensory, motor, emotional, and cognitive functioning across the lifespan (ages 3 to 85 years) in both English and Spanish. It was created to be appropriate for use in longitudinal studies by measuring the same constructs at different stages of human development, and can be used to track individual performance over time. Additionally, it was designed to provide a “common currency” in epidemiologic research, enabling the pooling and sharing of large datasets across studies. The full NIH Toolbox measurement system can be administered in two hours or less and, although it was not originally developed for clinical use, a growing body of evidence demonstrates its applicability for evaluating functioning in various normative and clinical populations (e.g., cancer (Sinha, Wong, Kallogjeri, & Piccirillo, 2018); neurologic conditions (Tulsky & Heinemann, 2017); social anxiety (Troller-Renfree, Barker, Pine, & Fox, 2015)).

The psychometric properties of the English-language version of the NIH Toolbox have been examined in the general United States population and numerous clinical populations. For example, prior studies have demonstrated support for the reliability and validity of the majority of the NIH Toolbox measures in healthy samples in English (Coldwell et al., 2013; Dalton et al., 2013; Reuben et al., 2013; Rine et al., 2013; Salsman et al., 2013; Varma, McKean-Cowdin, Vitale, Slotkin, & Hays, 2013; Weintraub et al., 2013). Additionally, the clinical utility and psychometric strength of the tests of cognitive functioning have been demonstrated among individuals with disabilities (Tulsky & Heinemann, 2017), and among children ages 3 to 15 (Bauer & Zelazo, 2013). Although normative standards for the Spanish-language tests of cognitive (Casaletto et al., 2016) and emotional (Babakhanyan, McKenna, Casaletto, Nowinski, & Heaton, 2018) functioning have been published, a comprehensive psychometric evaluation of the Spanish NIH Toolbox measures has not. In the present study, we explore the reliability and validity of the Spanish-language version of the NIH Toolbox for children and adults across four domains of functioning: sensory, motor, emotional, and cognitive.

Methods

Participants and Procedures

Data from the NIH Toolbox norming study were used. These data are publicly available and can be accessed via the HealthMeasures Dataverse data repository (https://dataverse.harvard.edu/dataverse/HealthMeasures). Participants in the NIH Toolbox norming study were community-dwelling children and adults who were neurologically healthy and capable of following instructions. A sampling strategy stratified by age, sex, and primary language was followed in accordance with the published NIH Toolbox norming plans (Beaumont et al., 2013). Participants were recruited by the market research company Delve, Inc. (now known as Focus Pointe Global) from 10 locations throughout the United States (Atlanta, Chicago – Oak Brook, Cincinnati, Columbus, Dallas, Los Angeles, Minneapolis, Philadelphia, Phoenix, St. Louis) that were selected, among other reasons, to maximize access to subjects from varied Spanish-speaking communities. To be eligible for inclusion, potential participants had to have adequate visual, auditory, vestibular, and motor functioning, either independently or with support from assistive devices, to enable completion of all items included in the NIH Toolbox testing battery. The present analysis included those participants ages 3 to 7 (n = 496) and ages 18 to 85 (n = 408) who elected to participate in the NIH Toolbox norming study in Spanish. Spanish-speaking children between the ages of 8 and 17 were not included in the NIH Toolbox norming study, as census data indicated that less than 2% of children in this age range living in the United States at the time of study enrollment used Spanish as their primary language (Beaumont et al., 2013).

Potential participants met with trained research personnel who administered structured interviews and standardized questionnaires to ensure eligibility prior to enrollment. Informed consent was obtained from all adult participants. Parental informed consent was obtained for children ages 3 to 7, and assent was obtained from children age 7. A subset of the individuals who participated in the NIH Toolbox norming study repeated the NIH Toolbox measures five to 14 days following initial test administration to enable evaluation of test-retest reliability. The NIH Toolbox norming study was approved by the institutional review board at Northwestern University through a protocol that covered all testing sites and was completed in accordance with the Helsinki Declaration.

Measures

The NIH Toolbox for Assessment of Neurological and Behavioral Function (Gershon et al., 2013).

As outlined above, the NIH Toolbox is comprised of assessments targeting four primary domains and contributing to four batteries: sensation, motor, emotion, and cognition. The full battery includes all assessments, while an early childhood battery includes a subset thereof, as described below. Most measures were originally developed in English and then translated into Spanish using the Functional Assessment of Chronic Illness Therapy translation methodology (Bonomi et al., 1996; Cella et al., 1998; Eremenco, Cella, & Arnold, 2005; Lent, Hahn, Eremenco, Webster, & Cella, 1999), although the Cognition Battery language measures were effectively developed from scratch in Spanish. Detailed information regarding the cultural adaptation, linguistic translation, and overall development of the Spanish-language version of the NIH Toolbox is available elsewhere (Gershon et al., 2019).

NIH Toolbox Sensation Battery (Coldwell et al., 2013; Cook et al., 2013; Dalton et al., 2013; Varma et al., 2013; Zecker et al., 2013).

The Sensation Battery is comprised of six measures assessing five domains: 1) audition, 2) gustation, 3) vision, 4) olfaction, and 5) pain. Audition is assessed using the NIH Toolbox Words-in-Noise Test (Zecker et al., 2013), which tests hearing in a noisy environment and is appropriate for individuals ages 6 to 85. Assessment of speech perception in noise is considered an ecologically valid measure of auditory functioning, as real-world communication often occurs in noisy environments. A Spanish-language version of the Words-in-Noise test was developed separately and was adapted for use in the NIH Toolbox (McArdle, Carlo, & Wilson, 2009). Scores on this measure are presented as decibels of signal-to-noise ratio (dB S/N), reflecting the quietest signal correctly perceived amidst noise. Lower scores reflect better hearing. Gustation is assessed with the NIH Toolbox Regional Taste Intensity Test (Coldwell et al., 2013), which measures the perceived intensity of quinine (i.e., bitter taste) and salt as administered in liquid. This test is appropriate for individuals ages 12 to 85, and scores are presented on a generalized labeled magnitude scale. Higher scores reflect greater perceived taste intensity. Vision is assessed with the NIH Toolbox Visual Acuity Test (Varma et al., 2013), which measures distance vision at three meters and is appropriate for individuals ages 3 to 85. Scores are presented as LogMAR units and in Snellen format; higher scores reflect worse vision. Olfaction is assessed with the NIH Toolbox Odor Identification Test (Dalton et al., 2013), which assesses a person’s ability to identify different odors using scratch-and-sniff cards, and is appropriate for individuals ages 3 to 85. Nine odorants are presented to individuals age 10+, and five are presented to individuals ages 3 to 9. Scores reflect the number of odorants correctly identified, with higher scores reflecting better smell. 
Pain is assessed with the Pain Intensity Survey and the Pain Interference Survey (Cook et al., 2013), which are appropriate for individuals ages 18 to 85. The Pain Intensity Survey is a single-item numerical rating scale assessing pain severity over the prior week; higher scores indicate more severe pain. The Pain Interference Survey is administered as a computer adaptive test and assesses the degree to which pain interferes with engagement in normal activities. Scores are presented as T-scores, with higher scores indicating more pain interference.

NIH Toolbox Motor Battery (Reuben et al., 2013).

The Motor Battery is comprised of five measures assessing five domains: 1) dexterity, 2) strength, 3) balance, 4) locomotion, and 5) endurance. The NIH Toolbox Motor Battery for individuals ages 7 to 85 includes all five of these measures. The NIH Toolbox Early Childhood Motor Battery for individuals ages 3 to 6 includes all of these measures except the test of locomotion. Dexterity is assessed using the 9-hole Pegboard Test. Scores reflect the number of seconds needed to accurately place and remove nine plastic pegs into a plastic pegboard using the dominant hand. Higher scores reflect worse manual dexterity. Strength is assessed using the Grip Strength Test, which evaluates the number of pounds of force generated using the dominant hand on a hand dynamometer. Higher scores indicate greater strength. Balance is assessed using the Standing Balance Test, which evaluates anterior-posterior postural sway. Raw normalized path length scores are calculated based on time and acceleration, and are then converted to theta values using an item response theory (IRT) model. Higher scores reflect better balance. Locomotion is assessed using the 4-Meter Walk Gait Speed Test. Scores are based on the number of seconds it takes to walk four meters at one’s usual pace, using the better of two trials, which are then converted to meters/second. Higher scores are indicative of better locomotion. Endurance is assessed using the 2-Minute Walk Test. Scores reflect the number of feet walked in two minutes, with higher scores reflecting better endurance.
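The conversion described for the 4-Meter Walk Gait Speed Test can be illustrated with a minimal sketch. It assumes that the "better" trial is the one with the shorter time; the function name and the example times are hypothetical and are not taken from the NIH Toolbox software.

```python
WALK_DISTANCE_M = 4.0  # course length stated in the text


def gait_speed(trial_seconds):
    """Return gait speed in meters/second from the better (shorter) trial.

    Assumption: "better of two trials" means the trial with the lower time.
    """
    if not trial_seconds or any(t <= 0 for t in trial_seconds):
        raise ValueError("trial times must be positive")
    best_time = min(trial_seconds)  # shorter time = faster walk
    return WALK_DISTANCE_M / best_time


# Hypothetical trial times of 4.0 s and 5.0 s
print(gait_speed([4.0, 5.0]))  # 1.0
```

Expressing the result as a speed rather than a time is what makes higher scores correspond to better locomotion.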

NIH Toolbox Emotion Battery (Salsman et al., 2013).

The Emotion Battery broadly assesses four domains: 1) negative affect, 2) social relationships, 3) psychological well-being, and 4) stress and self-efficacy. The Spanish-language battery includes 17 self-report scales for adults ages 18 to 85, and 10 parent-report scales for children ages 3 to 7. In addition to these 27 measures, the English-language battery also includes 15 self-report scales for children ages 8 to 17 and 11 parent-report scales for children ages 8 to 12. Spanish translations of these additional measures are available; however, no norms exist given that children ages 8 to 17 were not included in the Spanish-language norming sample. Higher scores indicate more of the construct being assessed across all measures included in the Emotion Battery. The negative affect subdomain includes six assessments for adults (Anger-Affect, Anger-Hostility, Anger-Physical Aggression, Sadness, Fear-Affect, and Fear-Somatic Arousal) and four assessments for children (parent-reported assessments of Anger, Fear-Over Anxious, Fear-Separation Anxiety, and Sadness). The psychological well-being subdomain includes three assessments for adults (Positive Affect, Life Satisfaction, and Meaning) and two assessments for children (parent-reported assessments of Positive Affect and Life Satisfaction). The parent-reported assessment of Positive Affect was not administered in Spanish during the norming study and was therefore not included in the present analysis; however, a Spanish-language version of this measure is available without norms. The social relationships subdomain includes six assessments for adults (Friendship, Loneliness, Emotional Support, Perceived Hostility, Instrumental Support, and Perceived Rejection) and four assessments for children (parent-reported assessments of Social Withdrawal, Positive Peer Interactions, Peer Rejection, and Empathic Behaviors). The stress and self-efficacy subdomain includes two assessments for adults (Perceived Stress and Self-Efficacy). 
There are no assessments for children included in the stress and self-efficacy subdomain. All measures excepting the parent-reported assessments of Life Satisfaction (five items), Social Withdrawal (four items), Positive Peer Interactions (four items), and Peer Rejection (nine items) are scored using IRT methods to yield a theta value. For these four parent-reported measures, scores reflect a raw sum of the included items.

NIH Toolbox Cognition Battery (Weintraub et al., 2013).

The Cognition Battery is comprised of seven measures assessing five domains: 1) executive function and attention, 2) episodic memory, 3) working memory, 4) processing speed, and 5) language. Executive function and attention are assessed using the Dimensional Change Card Sort Test and the Flanker Inhibitory Control and Attention Test, which are appropriate for individuals ages 3 to 85. Scores on these measures reflect a combination of accuracy and reaction time, where each of these components receives a score between 0 and 5. These scores are then summed to yield a computed score ranging from 0 to 10, with higher scores reflecting better performance. Episodic memory is assessed using the Picture Sequence Memory Test, which is appropriate for ages 3 to 85. Scores are computed as IRT-estimated thetas, and reflect the ability to accurately recall an increasingly lengthy series of illustrations. Higher scores reflect better episodic memory. Working memory is assessed using the List Sort Memory Test, which is appropriate for ages 7 to 85, with a parallel supplemental test available for children ages 3 to 6. Scores are computed as the total number of foods and animals correctly recalled and reordered from smallest to largest, by category. Higher scores reflect better working memory. Processing speed is assessed using the Pattern Comparison Processing Speed Test, which is appropriate for ages 7 to 85. Scores reflect the number of times in an 85-second window that an individual can accurately determine if two side-by-side images are identical, with higher scores indicating better processing speed. Language ability is assessed using the Picture Vocabulary Test, which is appropriate for individuals ages 3 to 85, and the Oral Reading Recognition Test, which is appropriate for individuals ages 7 to 85. Both language measures are administered as computer adaptive tests and scored using IRT methodology to yield a theta score. 
The score for the Picture Vocabulary Test reflects an individual’s ability to correctly select one of four images to match the meaning of an audio-recorded word, and the score for the Oral Reading Recognition Test reflects the ability to read and correctly pronounce letters and words, shown one at a time on a screen. Higher scores reflect better language abilities. The NIH Toolbox Cognition Battery for individuals ages 7 to 85 includes all seven core measures. The NIH Toolbox Early Childhood Cognition Battery for individuals ages 3 to 6 includes only the Dimensional Change Card Sort Test, Flanker Inhibitory Control and Attention Test, Picture Sequence Memory Test, and Picture Vocabulary Test.

Sociodemographic variables.

In addition to completing the NIH Toolbox measures, participants self-reported sociodemographic information including age, sex, and years of education, as well as race/ethnicity and Latinx status as measured by the 2010 Census (Humes, Jones, & Ramirez, 2011).

Validity measures.

Convergent validity was not evaluated for most of the Spanish-language versions of the NIH Toolbox measures. However, the two language measures included in the Cognition Battery involved significantly more development than the other measures. Therefore, convergent validity for these two measures was evaluated by correlating scores on these measures with scores on two well-established legacy Spanish-language instruments, as described below.

The Batería-III Woodcock-Muñoz Vocabulario sobre dibujos test (Muñoz-Sandoval, Woodcock, McGrew, & Mather, 2005).

To enable assessment of the convergent validity of the Picture Vocabulary Test, the Batería-III Woodcock-Muñoz Vocabulario Sobre Dibujos test, the parallel Spanish-language version of the Woodcock-Johnson® III Picture Vocabulary test (Woodcock, McGrew, & Mather, 2001), was administered to a subset of participants. The Vocabulario Sobre Dibujos test measures oral language development and word knowledge.

The Word Accentuation Test.

To enable assessment of the convergent validity of the Oral Reading Recognition Test, a 48-item version of the Word Accentuation Test, a Spanish-language word recognition task, was administered to a subset of participants. The Word Accentuation Test requires respondents to read words aloud, which are then scored as correct or incorrect based on the accentuation of the word. The 48-item version used in this analysis was created by supplementing the 40 items included in the Word Accentuation Test – Chicago (Krueger, Lam, & Wilson, 2006), which was developed for use with Spanish speakers in the United States, with eight additional items included in the original Word Accentuation Test (Del Ser, Gonzalez-Montalvo, Martinez-Espinosa, Delgado-Villapalos, & Bermejo, 1997), which was developed in Madrid, that are not in the Chicago version.

Analytic Plan

For the present analysis, z-scores were first computed for all measures based on the mean and standard deviation for the full sample (i.e., children and adults combined). This placed scores on a common metric, thus facilitating analysis and interpretation of results. Normality of the data was evaluated via a combination of visual inspection and statistical evaluation (i.e., Kolmogorov-Smirnov and Shapiro-Wilk tests). Non-parametric analytic approaches were implemented as numerous variables were non-normally distributed.
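As an illustration of the standardization step, the sketch below computes z-scores against the mean and standard deviation of the pooled child and adult scores. The function name and data are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption; the text does not state which estimator was used.

```python
from statistics import mean, stdev


def z_scores_combined(child_scores, adult_scores):
    """Standardize both groups against the pooled (children + adults) sample."""
    combined = list(child_scores) + list(adult_scores)
    pooled_mean = mean(combined)
    pooled_sd = stdev(combined)  # assumption: sample SD with n - 1 denominator

    def standardize(scores):
        return [(x - pooled_mean) / pooled_sd for x in scores]

    return standardize(child_scores), standardize(adult_scores)


# Hypothetical raw scores on a single measure
child_z, adult_z = z_scores_combined([1, 2, 3], [4, 5, 6])
```

Because both groups share one mean and one standard deviation, a given z-score means the same thing for a child and an adult on that measure.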

Reliability.

Different types of reliability were calculated for different measures within the NIH Toolbox, as appropriate. Cronbach’s coefficient alphas were calculated to assess internal consistency reliability for those self-report and proxy-report measures within the Emotion Battery that were administered as fixed forms. Marginal reliability coefficients were estimated with graded response models (IRT) for measures within the Emotion Battery that were administered as computer adaptive tests. Spearman’s correlations with associated p-values were calculated to evaluate test-retest reliability for measures within the Sensation, Motor, and Cognition Batteries. Intraclass correlations (ICCs) were initially calculated as well; however, results were unreliable due to low power and non-normality of the data. The strength of Spearman’s correlations was evaluated with cutoffs of 0.10, 0.30, and 0.50 indicating small, medium, and large effects, respectively (Cohen, 1992). Because most of the NIH Toolbox Emotion Battery measures are structured to assess a seven-day recall period, the average time span between administrations for the Emotion Battery was greater than eight days, and emotions would be expected to fluctuate during that interval, test-retest reliability was not evaluated for this battery. Additionally, to reduce overall burden, measures from the Motor Battery assessing dexterity, strength, and balance were not re-administered to the Spanish-speaking participants. Accordingly, the test-retest reliability of these measures was not evaluated. Within the Sensation Battery, the gustation and pain assessments were only administered to Spanish-speaking participants ages 18 and older, per NIH Toolbox administration guidelines, and only two participants ages 3 to 7 repeated the Words-in-Noise audition measure. Therefore, test-retest reliability was not evaluated for these tests in children.
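For readers less familiar with the internal consistency statistic, the following is a minimal sketch of Cronbach's coefficient alpha for a fixed-form scale, using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The response matrix is hypothetical, and this is not the code used in the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for rows of respondents, each a list of k item scores."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    item_columns = list(zip(*responses))  # transpose to one tuple per item
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical responses: 4 respondents x 2 items
alpha = cronbach_alpha([[2, 3], [4, 3], [3, 5], [5, 5]])
```

Alpha approaches 1 when the items rise and fall together across respondents, which is the sense in which it indexes internal consistency.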

Validity.

Convergent Validity – Language Measures.

Scores on the gold-standard language measures were converted to z-scores using the same approach outlined above, so scores on the validity measures and the NIH Toolbox language measures would be on the same metric. Spearman’s correlations between these gold-standard measures and the NIH Toolbox language measures were then calculated to evaluate convergent validity. These comparisons were cross-sectional.
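The sketch below shows one way Spearman's correlation could be computed (Pearson's correlation on tie-averaged ranks) and then labeled with the Cohen (1992) cutoffs of 0.10, 0.30, and 0.50 cited above. The function names, the "negligible" label for values under 0.10, and the choice to treat each cutoff as an inclusive lower bound are assumptions made for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denominator = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return numerator / denominator


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def effect_label(rho):
    """Label |rho| with the 0.10 / 0.30 / 0.50 cutoffs (inclusive, by assumption)."""
    size = abs(rho)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"  # label not used in the source
```

Under these assumptions, `effect_label(0.65)` returns "large" and `effect_label(0.26)` returns "small".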

Comparison of Scores to Demographic Characteristics.

In addition to the convergent validity analyses conducted for the two NIH Toolbox language measures, the validity of scores on the Spanish-language versions of all measures was evaluated with Spearman’s correlations and general linear models relating scores to demographic subgroup membership, controlling for age and sex as appropriate. These comparisons were cross-sectional. For Cognition Battery measures, these analyses also controlled for education level for adults and mother’s education level for children. For general linear models, effect sizes were reported as Cohen’s ds, and cutoffs of 0.20, 0.50, and 0.80 indicated small, medium, and large effects, respectively (Cohen, 1992). To account for multiple testing, a Bonferroni correction (α = 0.001) was used in this analysis.
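To make the effect size and significance conventions concrete, the sketch below computes an unadjusted two-group Cohen's d with a pooled standard deviation and applies the stated cutoffs and α = 0.001 threshold. It is a simplification: the analysis described above derived effect sizes from general linear models with covariates, which this example does not reproduce. All names and data are hypothetical.

```python
from statistics import mean, variance

ALPHA = 0.001  # Bonferroni-corrected threshold stated in the text


def cohens_d(group_a, group_b):
    """Unadjusted Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def d_label(d):
    """Label |d| with the 0.20 / 0.50 / 0.80 cutoffs (inclusive, by assumption)."""
    size = abs(d)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "negligible"  # label not used in the source


def is_significant(p_value, alpha=ALPHA):
    """Compare a p-value against the corrected significance threshold."""
    return p_value < alpha


# Hypothetical z-scored outcomes for two demographic groups
d = cohens_d([2, 4, 6], [1, 3, 5])
```

With a threshold this strict, an effect can be medium or large in size and still fail to reach significance in a small subgroup.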

Results

Participant Characteristics

The NIH Toolbox Spanish-language norming sample was comprised of 496 children ages 3 to 7 and 408 adults ages 18 to 85. Details regarding participant characteristics are presented in Table 1. Slightly more than 20% of the adult sample was born in the United States, including 2.9% reporting that they were born in Puerto Rico. The majority of adult respondents (77.5%) were born outside of the United States, similar to the sample included in the Pew Research Center’s 2013 National Survey of Latinos and Religion, in which 88.5% of Spanish-dominant respondents reported having been born outside of the United States (Pew Research Center, 2013). Foreign-born participants in the present sample reported having been born in Mexico (43.4%), Central America (i.e., Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua, Panama; 16.4%), the Caribbean (i.e., Cuba, Dominican Republic; 4.2%), South America (i.e., Brazil, Chile, Colombia, Ecuador, Peru, Uruguay, Venezuela; 12.7%), and Europe (i.e., Spain; 0.2%). Information regarding country of origin was not available for the 3.0% of child respondents who were not born in the United States.

Table 1.

Sample Characteristics

| Characteristic | Ages 3–7 (n = 496) | Ages 18–85 (n = 408) |
|---|---|---|
| Age^a | 4.92 (1.42) | 44.10 (16.72) |
| Education/Mother’s education^a | 9.73 (3.91) | 10.71 (4.33) |
| Sex – Female^b | 52.4% (260) | 65.0% (265) |
| Latinx^b | 99.8% (495) | 100% (408) |
| Background Group^b,c | | |
| Mexican | 76.2% (378) | 55.1% (225) |
| Puerto Rican | 1.8% (9) | 4.7% (19) |
| Cuban | --- | 2.0% (8) |
| Other | 20.6% (102) | 37.3% (152) |
| Race^b | | |
| American Indian/Alaskan Native | 13.1% (65) | 9.6% (39) |
| Asian | --- | 0.5% (2) |
| Black | 1.0% (5) | 5.4% (22) |
| Native Hawaiian/Pacific Islander | --- | 0.5% (2) |
| White | 78.6% (390) | 77.0% (314) |
| Multiracial | 1.6% (8) | 1.7% (7) |
| Other/Not specified | 5.6% (28) | 5.4% (22) |
| Country of birth^b | | |
| United States | 95.2% (472) | 21.3% (87) |
| Foreign-born | 3.0% (15) | 77.5% (316) |

Note. ^a Mean (Standard Deviation); ^b % (n); ^c Response options reflective of question regarding Latinx origin as asked in the 2010 Census.

Reliability

Internal Consistency Reliability.

As shown in Table 2, Cronbach’s alphas demonstrated acceptable to strong internal consistency reliability for all measures within the Emotion Battery that were administered as fixed forms among both adults (αs ranged from 0.77 to 0.95) and children (αs ranged from 0.66 to 0.82).

Table 2.

Internal consistency and IRT reliability for NIH Toolbox Emotion Battery scores

| Measure | Ages 3–7 (n = 496): No. of items | Ages 3–7: α | Ages 3–7: IRT reliability | Ages 18–85 (n = 408): No. of items | Ages 18–85: α | Ages 18–85: IRT reliability |
|---|---|---|---|---|---|---|
| Negative Affect | | | | | | |
| Anger-Affect | --- | --- | --- | CAT | --- | 0.95 |
| Anger-Hostility | --- | --- | --- | 5 | 0.84 | --- |
| Anger-Physical Aggression | --- | --- | --- | 5 | 0.79 | --- |
| Sadness | --- | --- | --- | CAT | --- | 0.94 |
| Fear-Affect | --- | --- | --- | CAT | --- | 0.95 |
| Fear-Somatic Arousal | --- | --- | --- | 6 | 0.77 | --- |
| Anger, PR | 9 | 0.82 | --- | --- | --- | --- |
| Fear-Over Anxious, PR | 6 | 0.66 | --- | --- | --- | --- |
| Fear-Separation Anxiety, PR | 7 | 0.71 | --- | --- | --- | --- |
| Sadness, PR | 7 | 0.70 | --- | --- | --- | --- |
| Psychological Well-Being | | | | | | |
| Positive Affect | --- | --- | --- | CAT | --- | 0.97 |
| Life Satisfaction | --- | --- | --- | CAT | --- | 0.95 |
| Meaning | --- | --- | --- | CAT | --- | 0.87 |
| Life Satisfaction, PR | 5 | 0.69 | --- | --- | --- | --- |
| Social Relationships | | | | | | |
| Friendship | --- | --- | --- | 8 | 0.92 | --- |
| Loneliness | --- | --- | --- | 5 | 0.88 | --- |
| Emotional Support | --- | --- | --- | 8 | 0.95 | --- |
| Perceived Hostility | --- | --- | --- | 8 | 0.90 | --- |
| Instrumental Support | --- | --- | --- | 8 | 0.94 | --- |
| Perceived Rejection | --- | --- | --- | 8 | 0.92 | --- |
| Social Withdrawal, PR | 4 | 0.76 | --- | --- | --- | --- |
| Positive Peer Interactions, PR | 4 | 0.79 | --- | --- | --- | --- |
| Peer Rejection, PR | 9 | 0.81 | --- | --- | --- | --- |
| Empathic Behaviors, PR | CAT | --- | 0.90 | --- | --- | --- |
| Stress & Self-Efficacy | | | | | | |
| Perceived Stress | --- | --- | --- | 10 | 0.78 | --- |
| Self-Efficacy | --- | --- | --- | CAT | --- | 0.91 |

Note. IRT = item response theory; CAT = computer adaptive test; PR = Parent-report.

IRT Reliability.

All Emotion Battery measures administered as computer adaptive tests had excellent reliability of 0.90 or higher, with the exception of the Psychological Well-Being – Meaning measure, which had an estimated reliability of 0.87.

Test-Retest Reliability.

Of the 904 individuals who participated in the NIH Toolbox norming study and completed questionnaires in Spanish, 73 to 86 participants, depending upon the domain, repeated these measures five to 14 days following initial test administration (sensation domain: n = 85, M = 8.12 days, SD = 2.63; motor domain: n = 85, M = 8.12 days, SD = 2.63; emotion domain: n = 73, M = 8.45 days, SD = 2.37; cognition domain: n = 86, M = 8.08 days, SD = 2.64). Of note, not all participants completed all measures included in each NIH Toolbox domain, as indicated in Tables 3 to 5.

Table 3.

Spearman’s test-retest correlations for NIH Toolbox Sensation Battery measures

| Measure | Ages 3–7 (n = 38): n | Ages 3–7: ρ (p-value) | Ages 18–85 (n = 47): n | Ages 18–85: ρ (p-value) |
|---|---|---|---|---|
| Audition | | | | |
| WIN Left Ear | --- | --- | 9 | 0.49 (0.181) |
| WIN Right Ear | --- | --- | 10 | 0.55 (0.100) |
| Gustation | | | | |
| Quinine, tongue | --- | --- | 38 | 0.49 (0.002) |
| Quinine, mouth | --- | --- | 38 | 0.48 (0.003) |
| Salt, tongue | --- | --- | 38 | 0.53 (0.001) |
| Salt, mouth | --- | --- | 38 | 0.53 (0.001) |
| Vision | | | | |
| Visual Acuity | 31 | 0.69 (< 0.001) | 37 | 0.87 (< 0.001) |
| Olfaction | | | | |
| Odor ID | 38 | 0.20 (0.223) | 41 | 0.52 (0.001) |
| Pain | | | | |
| Pain Intensity | --- | --- | 44 | 0.78 (< 0.001) |
| Pain Interference | --- | --- | 47 | 0.63 (< 0.001) |

Note. WIN = Words-In-Noise Test; Odor ID = Odor Identification; tongue = tip of tongue; mouth = whole mouth. ns reflect number of participants included in test-retest analysis for each NIH Toolbox measure.

Table 5.

Spearman’s test-retest correlations for NIH Toolbox Cognition Battery measures

| Measure | Ages 3–7 (n = 38): n | Ages 3–7: ρ (p-value) | Ages 18–85 (n = 48): n | Ages 18–85: ρ (p-value) |
|---|---|---|---|---|
| Executive Function and Attention | | | | |
| DCCS | 27 | 0.59 (0.001) | 47 | 0.63 (< 0.001) |
| Flanker | 30 | 0.69 (< 0.001) | 46 | 0.65 (< 0.001) |
| Memory | | | | |
| List Sort | --- | --- | 47 | 0.69 (< 0.001) |
| PSM | 14 | 0.73 (0.003) | 20 | 0.77 (< 0.001) |
| Processing Speed | | | | |
| Pattern Comp | --- | --- | 47 | 0.75 (< 0.001) |
| Language | | | | |
| PVT | 38 | 0.75 (< 0.001) | 48 | 0.87 (< 0.001) |
| ORRT | --- | --- | 47 | 0.88 (< 0.001) |

Note. DCCS = Dimensional Change Card Sort Test; Flanker = Flanker Inhibitory Control and Attention Test; List Sort = List Sorting Working Memory Test; PSM = Picture Sequence Memory Test; Pattern Comp = Pattern Comparison Processing Speed Test; PVT = Picture Vocabulary Test; ORRT = Oral Reading Recognition Test. ns reflect number of participants included in test-retest analysis for each NIH Toolbox measure.

Sensation Battery.

Among adults, analyses supported test-retest reliability based on large effects for the assessments of visual acuity and pain intensity (ρs ranged from 0.78 to 0.87). Reliability was lower for assessments of audition, gustation, olfaction, and pain interference, though effects were still medium to large (ρs ranged from 0.48 to 0.63). Among children, test-retest reliability was supported for vision based on a large effect (ρ = 0.69), but was poor for olfaction based on a small effect (ρ = 0.20; see Table 3).

Motor Battery.

Among both children and adults, test-retest reliability was supported for the assessment of endurance based on large effects (ρs ranged from 0.62 to 0.71), although it was poor for the assessment of locomotion among adults based on a small effect (ρ = 0.26; see Table 4).

Table 4.

Spearman’s test-retest correlations for NIH Toolbox Motor Battery measures

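| Measure | Ages 3–7 (n = 38): n | Ages 3–7: ρ (p-value) | Ages 18–85 (n = 47): n | Ages 18–85: ρ (p-value) |
|---|---|---|---|---|
| Locomotion | | | | |
| 4-meter walk time | --- | --- | 40 | 0.26 (0.104) |
| Endurance | | | | |
| 2-min walk distance | 37 | 0.62 (< 0.001) | 37 | 0.71 (< 0.001) |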

Note. Assessments of strength, dexterity, and balance were only administered to Spanish-speaking participants once. ns reflect number of participants included in test-retest analysis for each NIH Toolbox measure.

Cognition Battery.

Although all seven measures are appropriate for individuals age seven and up, only one seven-year-old respondent repeated the Cognition Battery. Therefore, test-retest reliability was not evaluated among children for the measures not included in the Early Childhood Battery (i.e., List Sorting Working Memory Test, Pattern Comparison Processing Speed Test, and Oral Reading Recognition Test). Among adults, test-retest reliability was supported by large effects for all tests (ρs ranged from 0.63 to 0.88). Similarly, among children, test-retest reliability was demonstrated with large effects for tests of executive function and attention, language, and episodic memory (ρs ranged from 0.59 to 0.75; see Table 5).

Convergent Validity – Language Measures

Of the 904 Spanish-language respondents, 385 (ages 3 to 7: n = 263; ages 18 to 85: n = 122) completed the Batería-III Woodcock-Muñoz Vocabulario Sobre Dibujos test (Muñoz-Sandoval et al., 2005), the legacy measure for the Spanish Picture Vocabulary Test, and 195 (ages 3 to 7: n = 103, although only those age 7 were included in the analysis [n = 56]; ages 18 to 85: n = 92) completed the 48-item version of the Word Accentuation Test (Del Ser et al., 1997; Krueger et al., 2006), the legacy measure for the Spanish Oral Reading Recognition Test. Among adults, Spearman’s correlations demonstrated good convergent validity based on large effects between the NIH Toolbox Spanish language measures and legacy measures for both the Picture Vocabulary Test (ρ = 0.76, p < 0.001) and the Oral Reading Recognition Test (ρ = 0.65, p < 0.001). Among children, good convergent validity based on a large effect was found for the Picture Vocabulary Test (ρ = 0.60, p < 0.001), although convergent validity was lower, with a small effect, for the Oral Reading Recognition Test (ρ = 0.26, p = 0.053).

Comparison of Scores to Demographic Characteristics

Sensation Battery.

Among adults, significant but small to medium age effects (ρs ranged from −0.22 to 0.44) were found for the Words-in-Noise Test, Visual Acuity Test, Odor Identification Test, and Pain Intensity Survey after controlling for sex. As expected, older age was associated with worse hearing, worse vision, worse olfaction, and greater pain. Among children, after controlling for sex, older age was related to better olfaction based on a medium effect (ρ = 0.37), and better vision based on a large effect (ρ = −0.52; see Table 6). No sex effects were found among adults or children. Although the Words-in-Noise Test of audition is appropriate for respondents age six and older, analyses were not conducted for this test among children due to low sample size in the appropriate age range (n = 10).

Table 6.

Effect sizes for comparisons of NIH Toolbox Sensation Battery among demographic groups

Ages 3–7 (n = 496) Ages 18–85 (n = 408)
n Ageᵃ (ρ) Sexᵇ (d) n Ageᵃ (ρ) Sexᵇ (d)
Audition
 WIN Left Ear --- --- --- 63 0.38 0.25
 WIN Right Ear --- --- --- 63 0.44* 0.32
Gustation
 Quinine, tongue --- --- --- 340 −0.17 0.13
 Quinine, mouth --- --- --- 340 −0.15 0.25
 Salt, tongue --- --- --- 341 −0.10 0.04
 Salt, mouth --- --- --- 335 −0.03 0.10
Vision
 Visual Acuity 454 −0.52* 0.01 381 0.33* 0.25
Olfaction
 Odor ID 483 0.37* 0.11 399 −0.22* 0.40
Pain
 Pain Intensity --- --- --- 388 0.23* 0.17
 Pain Interference --- --- --- 408 0.17 0.12

Note. WIN = Words-In-Noise Test; Odor ID = Odor Identification; tongue = tip of tongue; mouth = whole mouth. ns reflect number of participants included in demographic comparison analyses for each NIH Toolbox measure.
ᵃ Adjusted for sex.
ᵇ Adjusted for age.
* p < 0.001.
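
The age effects in these tables are correlations adjusted for sex. The text does not spell out the adjustment procedure, so the sketch below shows only one common option: a first-order partial Spearman correlation, which ranks all three variables and removes the covariate's linear association with each ranked variable. Treat it as an assumption-laden illustration rather than the authors' method; the function names and data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partial_spearman(x, y, covariate):
    """Spearman correlation of x and y with one covariate partialled out."""
    rx, ry, rz = (average_ranks(v) for v in (x, y, covariate))
    r_xy = pearson(rx, ry)
    r_xz = pearson(rx, rz)
    r_yz = pearson(ry, rz)
    # First-order partial correlation formula, applied to the ranks.
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


# Hypothetical adult data: age in years, a test score, and sex coded 0/1.
age = [19, 25, 34, 42, 51, 60, 68, 77]
score = [98, 101, 95, 92, 90, 85, 88, 80]
sex = [0, 1, 0, 1, 1, 0, 1, 0]

adjusted_rho = partial_spearman(age, score, sex)
print(round(adjusted_rho, 2))
```

When the covariate is unrelated to both variables, the adjusted coefficient equals the ordinary Spearman correlation, which is a quick sanity check on any implementation.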

Motor Battery.

Among adults, both small to medium age effects (ρs ranged from 0.32 to −0.42) and small to large sex effects (ds ranged from 0.24 to 0.99) were found for nearly all motor domain measures. Younger and male participants performed better. Among children, medium to large age effects (ρs ranged from 0.48 to −0.77) were found for all measures, with older children performing better. Additionally, one small sex effect was found (d = 0.23), with female children demonstrating better dexterity than male children (see Table 7).

Table 7.

Effect sizes for comparisons of NIH Toolbox Motor Battery among demographic groups

Ages 3–7 (n = 496) Ages 18–85 (n = 408)
n Ageᵃ (ρ) Sexᵇ (d) n Ageᵃ (ρ) Sexᵇ (d)
Dexterity
 9-hole pegboard, D 465 −0.76* 0.23* (M > F) 386 0.34* 0.41
 9-hole pegboard, ND 464 −0.77* 0.23 385 0.32* 0.32
Strength
 Grip, D 461 0.68* 0.11 391 −0.41* 0.91* (M > F)
 Grip, ND 461 0.71* 0.14 390 −0.42* 0.99* (M > F)
Balance
 Standing balance test 375 0.48* 0.21 251 −0.33* 0.24* (M > F)
Locomotion
 4-meter walk time --- --- --- 376 −0.18 0.15
Endurance
 2-min walk distance 473 0.66* 0.08 372 −0.35* 0.42* (M > F)

Note. D = dominant hand; ND = non-dominant hand; M = male; F = female. ns reflect number of participants included in demographic comparison analyses for each NIH Toolbox measure.
ᵃ Adjusted for sex.
ᵇ Adjusted for age.
* p < 0.001.
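
The sex effects in these tables are expressed as Cohen's d, a standardized difference between two group means. The sketch below computes the basic pooled-standard-deviation form. It is a simplified illustration under stated assumptions: the reported values are adjusted for age, whereas this version is unadjusted; the function names and grip-strength values are hypothetical; and the 0.20/0.50/0.80 cutoffs are Cohen's (1992) conventional benchmarks for d.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Unadjusted Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def d_label(d):
    """Label |d| with Cohen's conventional benchmarks (an assumption here)."""
    size = abs(d)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "negligible"


# Hypothetical grip-strength scores for two small groups.
men = [40, 44, 48]
women = [38, 42, 46]

d = cohens_d(men, women)
print(d, d_label(d))
```

For these made-up values the group means differ by 2 and the pooled standard deviation is 4, so d = 0.5. Applied to the values in Table 7, the same benchmarks label d = 0.23 as small and d = 0.99 as large.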

Emotion Battery.

No significant age or sex effects were found among adults. Significantly higher scores were found for older children as compared to younger children on the parent-report measure of Fear – Over Anxious and parent-report measures reflecting positive social interactions based on small effects (ρs ranged from 0.16 to 0.22). No other age or sex effects were found among children (see Table 8).

Table 8.

Effect sizes for comparisons of NIH Toolbox Emotion Battery scores among demographic groups

Ages 3–7 (n = 496) Ages 18–85 (n = 408)
n Ageᵃ (ρ) Sexᵇ (d) n Ageᵃ (ρ) Sexᵇ (d)
Negative Affect
 Anger-Affect --- --- --- 386 −0.04 0.21
 Anger-Hostility --- --- --- 386 −0.04 0.10
 Anger-Physical Aggression --- --- --- 385 −0.02 0.35
 Sadness --- --- --- 388 0.13 0.14
 Fear-Affect --- --- --- 386 0.12 0.11
 Fear-Somatic Arousal --- --- --- 386 0.13 < 0.01
 Anger, PR 493 −0.06 0.11 --- --- ---
 Fear-Over Anxious, PR 491 0.22* 0.12 --- --- ---
 Fear-Separation Anxiety, PR 491 0.10 0.01 --- --- ---
 Sadness, PR 493 0.02 0.02 --- --- ---
Psychological Well-Being
 Positive Affect --- --- --- 387 −0.03 0.02
 Life Satisfaction --- --- --- 388 0.03 0.02
 Meaning --- --- --- 278 −0.03 0.10
 Life Satisfaction, PR 478 −0.09 0.12 --- --- ---
Social Relationships
 Friendship --- --- --- 388 −0.08 0.13
 Loneliness --- --- --- 387 0.03 0.05
 Emotional Support --- --- --- 387 −0.11 0.18
 Perceived Hostility --- --- --- 386 0.09 0.34
 Instrumental Support --- --- --- 387 −0.02 0.05
 Perceived Rejection --- --- --- 386 0.16 0.30
 Social Withdrawal, PR 484 −0.04 0.09 --- --- ---
 Positive Peer Interactions, PR 483 0.16* 0.01 --- --- ---
 Peer Rejection, PR 487 −0.02 0.05 --- --- ---
 Empathic Behaviors, PR 491 0.16* 0.24 --- --- ---
Stress & Self-Efficacy
 Perceived Stress --- --- --- 384 0.17 0.18
 Self-Efficacy --- --- --- 383 −0.16 0.02

Note. PR = Parent-report. ns reflect number of participants included in demographic comparison analyses for each NIH Toolbox measure.
ᵃ Adjusted for sex.
ᵇ Adjusted for age.
* p < 0.001.

Cognition Battery.

Among adults, significant small to medium age effects (ρs ranged from −0.36 to −0.47) were observed on all measures except the language tests, on which no significant relationships were found. Additionally, significant medium to large associations (ρs ranged from 0.30 to 0.59) were found between all cognition measures and education after controlling for age and sex. Younger age and greater educational attainment were associated with better performance. No significant sex effects were found among adults after adjusting for age and education. Among children, significant and large age effects (ρs ranged from 0.55 to 0.72) were observed for all Cognition Battery measures after controlling for sex and mother’s education, with older children performing better. Additionally, children demonstrated a positive, although weaker, small adjusted relationship (ρ = 0.24) between scores on the Dimensional Change Card Sort Test and mother’s education after controlling for age and sex. Children of mothers with greater educational attainment performed better. Finally, females performed better than males on the Picture Vocabulary Test based on a small effect (d = 0.28; see Table 9).

Table 9.

Effect sizes for comparisons of NIH Toolbox Cognition Battery scores among demographic groups

Ages 3–7 (n = 496) Ages 18–85 (n = 408)
n Ageᵃ (ρ) Mother’s Educᵇ (ρ) Sexᶜ (d) n Ageᵈ (ρ) Educᵇ (ρ) Sexᵉ (d)
Executive Function and Attention
 DCCS 355 0.55* 0.24* 0.13 337 −0.45* 0.59* 0.04
 Flanker 358 0.68* 0.15 0.05 339 −0.47* 0.59* 0.09
Memory
 List Sort --- --- --- --- 333 −0.36* 0.37* 0.19
 PSM 290 0.72* 0.18 0.16 264 −0.36* 0.37* 0.30
Processing Speed
 Pattern Comp --- --- --- --- 338 −0.44* 0.37* 0.06
Language
 PVT 409 0.64* 0.09 0.28* (F > M) 338 −0.01 0.38* 0.03
 ORRT --- --- --- --- 337 −0.18 0.30* 0.09

Note. Educ = Education; DCCS = Dimensional Change Card Sort Test; Flanker = Flanker Inhibitory Control and Attention Test; List Sort = List Sorting Working Memory Test; PSM = Picture Sequence Memory Test; Pattern Comp = Pattern Comparison Processing Speed Test; PVT = Picture Vocabulary Test; ORRT = Oral Reading Recognition Test; F = female; M = male. ns reflect number of participants included in demographic comparison analyses for each NIH Toolbox measure.
* p < 0.001.
ᵃ Adjusted for sex and mother’s education.
ᵇ Adjusted for age and sex.
ᶜ Adjusted for age and mother’s education.
ᵈ Adjusted for sex and education.
ᵉ Adjusted for age and education.

Discussion

This study evaluated the reliability and validity of the Spanish-language version of the NIH Toolbox among children ages 3 to 7 and adults ages 18 to 85 who participated in the initial norming study. Such information is crucial for researchers who work with and interpret NIH Toolbox scores in this population. Compared with English-language tests, few well-validated Spanish-language testing batteries are available. Given the ongoing and rapid expansion of the Latinx population in the United States (United States Census Bureau, 2018a; 2018b), the Spanish-language version of the NIH Toolbox is poised to be a particularly valuable tool for assessing neurological and behavioral functioning among primary Spanish speakers.

Sensation Battery

For the Sensation Battery, test-retest reliability was strong for vision among both children and adults, and moderate for both olfaction and taste among adults. This is consistent with the English-language versions of the measures (see Supplemental Table 1; Dalton et al., 2013; Rawal, Hoffman, Honda, Huedo-Medina, & Duffy, 2015; Varma et al., 2013). English-language normative data are not available for the NIH Toolbox assessment of pain intensity because this measure was derived from the Patient-Reported Outcomes Measurement Information System® (PROMIS®), which underwent its own separate norming process. However, the test-retest reliability for this measure was somewhat lower for the present Spanish-language sample as compared to a general population sample of 100 individuals who completed the measure in English (Broderick, Schneider, Junghaenel, Schwartz, & Stone, 2013). This may be because the single-item pain assessment is structured to assess a seven-day recall period, and the average time span between administrations for the Sensation Battery was greater than eight days. Differences in reliability were also found for the Words-in-Noise Test, with stronger support having been found for the English-language version of the measure (Wilson & Burks, 2005), although this may be a function of low sample size in the present study. In addition, measures in the Sensation Battery generally related to age and sex in a manner consistent with prior effects found in English-speaking samples. For example, Friedman et al. (2004) found that older age was associated with increased prevalence of age-related macular degeneration in the United States population, consistent with the present finding that performance on the Visual Acuity Test decreased with age. Similarly, Kaneda and colleagues (2000) found that odor discrimination abilities decreased with age, and both Kaneda et al. and Mojet, Christ-Hazelhof, and Heidema (2001) found that taste decreased with age.
Wilson (2001) found an inverse relationship between age and performance on the Words-in-Noise Test in English. Finally, reported pain has also been shown to increase with age (Johannes, Le, Zhou, Johnston, & Dworkin, 2010). Thus, the present results provide support for the research-related use of the Sensation Battery to assess sensory functioning among Spanish-speaking adults residing in the United States. Additional research with larger sample sizes is needed to effectively evaluate the utility of this battery among Spanish-speaking children.

Motor Battery

For the Motor Battery, test-retest reliability was poor for the 4-meter walk test of locomotion. This measure also performed the worst of all measures in this domain among English-speaking participants in the normative sample (see Supplemental Table 1; Reuben et al., 2013). A recently published analysis of this test conducted with the combined language adult sample from the NIH Toolbox norming study also identified disconcertingly low test-retest reliability for this measure (ICC = 0.41; Bohannon & Wang, 2019). The poor reliability may be at least in part a function of human error, as the test is hand timed by an administrator and each trial is generally brief. Automated timing may improve these results, although it is generally more burdensome and less practical than hand timing. Additionally, as was observed with the Sensation Battery, measures in the Motor Battery related to age and sex in a manner consistent with prior effects found in other samples. Similar to the present findings, published work conducted with children has reported greater grip strength and dexterity with increased age, and greater grip strength but worse dexterity in boys as compared to girls (Ervin, Fryar, Wang, Miller, & Ogden, 2014; Omar, Alghadir, Zafar, & Al Baker, 2018). Greater grip strength has also been reported in adult men as opposed to adult women (Yorke, Curtis, Shoemaker, & Vangsnes, 2015). Also in adults, decreased dexterity has been found with increasing age (Ruff & Parker, 1993). Deficits in balance have also been observed for adult women relative to men (Wolfson, Whipple, Derby, Amerman, & Nashner, 1994), and have been found to worsen with increasing age in adults (Kalisch, Kattenstroth, Noth, Tegenthoff, & Dinse, 2011) and improve with age in children (Butz, Sweeney, Roberts, & Rauh, 2015).
Finally, endurance has been shown to be greater for men as compared to women, and to decrease with age among adults (Gibbons, Fruchter, Sloan, & Levy, 2001) but increase with age among children (Bohannon, Wang, Bubela, & Gershon, 2018). Interestingly, Öberg, Karsznia, & Öberg (1993) reported decreases in gait speed and step length with increasing age and for women as opposed to men; however, no relationships between sociodemographic variables and four-meter walk time were found in the present study. These associations with demographic variables provide preliminary support for the validity of the Motor Battery for use in research with Spanish-speaking children and adults living in the United States; however, additional research with larger samples is needed to more confidently assess the psychometric strength of this battery for use with this population.

Emotion Battery

The internal consistency reliability of the fixed form measures included in the Emotion Battery was supported by both Cronbach’s alpha values and IRT reliability statistics. The computed values were similar to the Cronbach’s alpha values reported for the English-language version of the NIH Toolbox Emotion Battery, for which alpha values ranged from 0.83 to 0.97 among adults and 0.73 to 0.92 among children (Salsman et al., 2013). Among adults, no relationships between measures in the Emotion Battery and sociodemographic variables were found. This is generally consistent with Babakhanyan et al. (2018), who evaluated the Emotion Battery data from the NIH Toolbox norming study in both English and Spanish. These authors explored the relationships of scores to the combined effects of numerous sociodemographic variables (i.e., age, education, gender, ethnicity, household income), and found few significant relationships with only small effect sizes in both languages (English R² ranged from 0.005 to 0.048, Spanish R² ranged from 0.017 to 0.033). Of note, these authors used a less stringent alpha value of 0.01, while the present analysis imposed a limit of 0.001, which may explain the presence of significant relationships in their results but not the present results. In the present study, only three measures in the Emotion Battery were related to age and none to sex among children. However, notably more associations with age and sex were found among children in the English-language normative sample (Paolillo et al., 2018). This may be a function of larger sample size in the English study, thus yielding greater power to identify relationships. Additionally, Paolillo et al. did not impose any corrections to control for family-wise error, and they included a broader age range for many of the parent-report measures included in the Emotion Battery (i.e., ages 3 to 12; Paolillo et al., 2018).
That is to say, scores of children ages 8 to 12, who were only represented in the English study, may have driven the relationships of scores on measures within the Emotion Battery to sociodemographic variables.

In aggregate, although the Spanish-language Emotion Battery self-report and parent-report measures do not appear to be robust to commonly encountered challenges in the assessment of emotion (e.g., reliable parent-report assessment), these results provide preliminary support for the use of the Emotion Battery in research with Spanish-speaking individuals in the United States.

Cognition Battery

For the Cognition Battery, test-retest reliability was strong across both adults and children, as was observed with the English language subset of the normative sample (see Supplemental Table 1; Akshoomoff et al., 2013; Bauer & Zelazo, 2013; Heaton et al., 2014; Weintraub et al., 2013). Convergent validity of the language measures included in the NIH Toolbox Cognition Battery was strong among adults, further supporting the utility of these measures in Spanish-language adult respondents. Among children, good convergent validity was found for the Picture Vocabulary Test, although it was lower for the Oral Reading Recognition Test. This may be a function of lower sample size, as the Oral Reading Recognition Test is only appropriate for respondents age 7 and older. Moreover, as was observed among English speakers, and consistent with expectations, performance on the Cognition Battery was generally associated with younger age and higher education among adults (Heaton et al., 2014; Weintraub et al., 2013), and with older age among children (Akshoomoff et al., 2013; Bauer & Zelazo, 2013). Unlike in the English-language comparison sample, mother’s education level was generally not significantly related to performance on the Cognition Battery among Spanish-speaking children; however, this may be because children ages 8 to 15 were included in the analysis in English (Akshoomoff et al., 2013). Additionally, in the English-language normative sample of children, sex was not significantly related to scores on the Picture Vocabulary Test (Akshoomoff et al., 2013), while it was significantly related, albeit weakly, to scores on this measure in Spanish. Although the causes of sociodemographic differences in cognitive test performance remain poorly understood (Heaton et al., 2014), it is important to note that these relationships are not equivalent for the English- and Spanish-language versions of these tests.
This highlights the importance of considering language groups differently when establishing normative standards and evaluating changes in cognitive functioning over time, as the impact of these variables on scoring is likely to differ across language groups. Despite these differences, in total, the present results provide support for the use of the Cognition Battery in research among both children and adults living in the United States who speak Spanish as their primary language.

Limitations and Future Directions

The present study has limitations. Non-normality of the data required non-parametric analytic approaches, which complicates comparison of findings to prior published studies. Additionally, while the present sample is representative of the areas in the United States where recruitment facilities were located, it is not necessarily representative of all Spanish-speaking individuals living in the United States.

Despite these limitations, the present results provide important support for the use of the Spanish-language version of the NIH Toolbox in research to measure sensory, motor, emotional, and cognitive functioning among Spanish-speaking individuals living in the United States. While the original goal of the NIH Toolbox was to develop measures for use in research studies, there are now numerous published articles providing clinical validation evidence for the English-language version. Most commercially available tests would use this type of evidence to also impute the clinical use of their tests in translations. It is likely that the clinical appropriateness of the NIH Toolbox measures extends to the Spanish version as well. Individual clinicians and clinical researchers are encouraged to determine the appropriateness of the Spanish-language version of the NIH Toolbox for their own needs, and where possible to conduct research to directly clinically validate the Spanish-language measures. Pending appropriate identification of such support, the Spanish-language version of the NIH Toolbox is likely to be useful in tracking outcomes in clinical and epidemiological research across the lifespan, and in diverse samples including mixed-language participants.

Acknowledgments:

This study is funded in whole or in part with Federal funds from the Blueprint for Neuroscience Research, NIH, under contract number HHS-N-260-2006-00007-C, and by the Environmental Influences on Child Health Outcomes (ECHO) program, Office of the Director, NIH, under award number U24OD023319. The authors report no relevant conflicts of interest to disclose. The authors would like to thank Jennifer Beaumont for providing additional details to facilitate the development of this manuscript. Additionally, the authors would like to thank the participants of the NIH Toolbox norming study for their important contributions. PROMIS, Patient-Reported Outcomes Measurement Information System, and NIH Toolbox for Assessment of Neurological and Behavioral Function are marks owned by the U.S. Department of Health and Human Services.

References

  1. Akshoomoff N, Beaumont JL, Bauer PJ, Dikmen SS, Gershon RC, Mungas D, … Heaton RK (2013). VIII. NIH Toolbox Cognition Battery (CB): Composite scores of crystallized, fluid, and overall cognition. Monographs of the Society for Research in Child Development, 78(4), 119–132. doi: 10.1111/mono.12038
  2. Babakhanyan I, McKenna BS, Casaletto KB, Nowinski CJ, & Heaton RK (2018). National Institutes of Health Toolbox Emotion Battery for English- and Spanish-speaking adults: Normative data and factor-based summary scores. Patient Related Outcome Measures, 9, 115–127. doi: 10.2147/prom.S151658
  3. Bauer PJ, & Zelazo PD (2013). IX. NIH Toolbox Cognition Battery (CB): Summary, conclusions, and implications for cognitive development. Monographs of the Society for Research in Child Development, 78(4), 133–146. doi: 10.1111/mono.12039
  4. Beaumont JL, Havlik R, Cook KF, Hays RD, Wallner-Allen K, Korper SP, … Gershon R (2013). Norming plans for the NIH Toolbox. Neurology, 80(11 Suppl 3), S87–92. doi: 10.1212/WNL.0b013e3182872e70
  5. Bohannon RW, & Wang YC (2019). Four-meter gait speed: Normative values and reliability determined for adults participating in the NIH Toolbox study. Archives of Physical Medicine and Rehabilitation, 100(3), 509–513. doi: 10.1016/j.apmr.2018.06.031
  6. Bohannon RW, Wang YC, Bubela D, & Gershon RC (2018). Normative two-minute walk test distances for boys and girls 3 to 17 years of age. Physical & Occupational Therapy in Pediatrics, 38(1), 39–45. doi: 10.1080/01942638.2016.1261981
  7. Bonomi AE, Cella DF, Hahn EA, Bjordal K, Sperner-Unterweger B, Gangeri L, … Zittoun R (1996). Multilingual translation of the Functional Assessment of Cancer Therapy (FACT) quality of life measurement system. Quality of Life Research, 5(3), 309–320. doi: 10.1007/BF00433915
  8. Broderick JE, Schneider S, Junghaenel DU, Schwartz JE, & Stone AA (2013). Validity and reliability of Patient-Reported Outcomes Measurement Information System instruments in osteoarthritis. Arthritis Care & Research, 65(10), 1625–1633. doi: 10.1002/acr.22025
  9. Butz SM, Sweeney JK, Roberts PL, & Rauh MJ (2015). Relationships among age, gender, anthropometric characteristics, and dynamic balance in children 5 to 12 years old. Pediatric Physical Therapy, 27(2), 126–133.
  10. Casaletto KB, Umlauf A, Marquine M, Beaumont JL, Mungas D, Gershon R, … Heaton RK (2016). Demographically corrected normative standards for the Spanish language version of the NIH Toolbox Cognition Battery. Journal of the International Neuropsychological Society, 22(3), 364–374. doi: 10.1017/s135561771500137x
  11. Cella D, Hernandez L, Bonomi AE, Corona M, Vaquero M, Shiomoto G, & Baez L (1998). Spanish language translation and initial validation of the Functional Assessment of Cancer Therapy quality-of-life instrument. Medical Care, 36(9), 1407–1418. doi: 10.1097/00005650-199809000-00012
  12. Cohen J (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
  13. Coldwell SE, Mennella JA, Duffy VB, Pelchat ML, Griffith JW, Smutzer G, … Hoffman HJ (2013). Gustation assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S20–24. doi: 10.1212/WNL.0b013e3182872e38
  14. Cook KF, Dunn W, Griffith JW, Morrison MT, Tanquary J, Sabata D, … Gershon RC (2013). Pain assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S49–53. doi: 10.1212/WNL.0b013e3182872e80
  15. Dalton P, Doty RL, Murphy C, Frank R, Hoffman HJ, Maute C, … Slotkin J (2013). Olfactory assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S32–36. doi: 10.1212/WNL.0b013e3182872eb4
  16. Del Ser T, Gonzalez-Montalvo JI, Martinez-Espinosa S, Delgado-Villapalos C, & Bermejo F (1997). Estimation of premorbid intelligence in Spanish people with the Word Accentuation Test and its application to the diagnosis of dementia. Brain and Cognition, 33(3), 343–356. doi: 10.1006/brcg.1997.0877
  17. Eremenco SL, Cella D, & Arnold BJ (2005). A comprehensive method for the translation and cross-cultural validation of health status questionnaires. Evaluation & the Health Professions, 28(2), 212–232. doi: 10.1177/0163278705275342
  18. Ervin RB, Fryar CD, Wang C-Y, Miller IM, & Ogden CL (2014). Strength and body weight in US children and adolescents. Pediatrics, 134(3), e782–789. doi: 10.1542/peds.2014-0794
  19. Friedman DS, O’Colmain BJ, Munoz B, Tomany SC, McCarty C, De Jong P, … Kempen J (2004). Prevalence of age-related macular degeneration in the United States. Archives of Ophthalmology, 122(4), 564–572. doi: 10.1001/archopht.122.4.564
  20. Gershon RC, Fox RS, Manly JJ, Mungas DM, Nowinski CJ, Roney EM, & Slotkin J (2020). The NIH Toolbox: Overview of development for use with Hispanic populations. Journal of the International Neuropsychological Society, 26(6), 567–575. doi: 10.1017/S1355617720000028
  21. Gershon RC, Wagster MV, Hendrie HC, Fox NA, Cook KF, & Nowinski CJ (2013). NIH Toolbox for Assessment of Neurological and Behavioral Function. Neurology, 80(11 Suppl 3), S2–6. doi: 10.1212/WNL.0b013e3182872e5f
  22. Gibbons WJ, Fruchter N, Sloan S, & Levy RD (2001). Reference values for a multiple repetition 6-minute walk test in healthy adults older than 20 years. Journal of Cardiopulmonary Rehabilitation and Prevention, 21(2), 87–93.
  23. Heaton RK, Akshoomoff N, Tulsky D, Mungas D, Weintraub S, Dikmen S, … Gershon R (2014). Reliability and validity of composite scores from the NIH Toolbox Cognition Battery in adults. Journal of the International Neuropsychological Society, 20(6), 588–598. doi: 10.1017/s1355617714000241
  24. Humes KR, Jones NA, & Ramirez RR (2011). Overview of race and Hispanic origin: 2010. United States Census Bureau.
  25. Johannes CB, Le TK, Zhou X, Johnston JA, & Dworkin RH (2010). The prevalence of chronic pain in United States adults: Results of an Internet-based survey. The Journal of Pain, 11(11), 1230–1239. doi: 10.1016/j.jpain.2010.07.002
  26. Kalisch T, Kattenstroth J-C, Noth S, Tegenthoff M, & Dinse HR (2011). Rapid assessment of age-related differences in standing balance. Journal of Aging Research, 2011, 160490. doi: 10.4061/2011/160490
  27. Kaneda H, Maeshima K, Goto N, Kobayakawa T, Ayabe-Kanamura S, & Saito S (2000). Decline in taste and odor discrimination abilities with age, and relationship between gustation and olfaction. Chemical Senses, 25(3), 331–337. doi: 10.1093/chemse/25.3.331
  28. Krueger KR, Lam CS, & Wilson RS (2006). The Word Accentuation Test – Chicago. Journal of Clinical and Experimental Neuropsychology, 28(7), 1201–1207. doi: 10.1080/13803390500346603
  29. Lent L, Hahn E, Eremenco S, Webster K, & Cella D (1999). Using cross-cultural input to adapt the Functional Assessment of Chronic Illness Therapy (FACIT) scales. Acta Oncologica, 38(6), 695–702.
  30. McArdle R, Carlo M, & Wilson R (2009). Words-in-Noise-Test: English and Spanish. Proceedings from the American Speech-Language-Hearing Association. Retrieved from: https://www.asha.org/Events/convention/handouts/2009/0150_McArdle_Rachel/
  31. Mojet J, Christ-Hazelhof E, & Heidema J (2001). Taste perception with age: Generic or specific losses in threshold sensitivity to the five basic tastes? Chemical Senses, 26(7), 845–860. doi: 10.1093/chemse/26.7.845 [DOI] [PubMed] [Google Scholar]
  32. Muñoz-Sandoval AF, Woodcock RW, McGrew KS, & Mather N (2005). Batería III Woodcock-Muñoz. Itasca, IL: Riverside Publishing. [Google Scholar]
  33. Öberg T, Karsznia A, & Öberg K (1993). Basic gait parameters: Reference data for normal subjects, 10–79 years of age. Journal of Rehabilitation Research and Development, 30, 210–223. [PubMed] [Google Scholar]
  34. Omar MT, Alghadir AH, Zafar H, & Al Baker S (2018). Hand grip strength and dexterity function in children aged 6–12 years: A cross-sectional study. Journal of Hand Therapy, 31(1), 93–101. doi: 10.1016/j.jht.2017.02.004 [DOI] [PubMed] [Google Scholar]
  35. Paolillo EW, McKenna BS, Nowinski CJ, Thomas ML, Malcarne VL, & Heaton RK (2018). NIH Toolbox® emotion batteries for children: Factor-based composites and norms. Assessment, 10.1177/1073191118766396 [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Pew Research Center. (2013). National Survey of Latinos and Religion. Retrieved from: https://www.pewhispanic.org/datasets/
  37. Rawal S, Hoffman HJ, Honda M, Huedo-Medina TB, & Duffy VB (2015). The taste and smell protocol in the 2011–2014 US National Health and Nutrition Examination Survey (NHANES): Test–retest reliability and validity testing. Chemosensory Perception, 8(3), 138–148. doi: 10.1007/s12078-015-9194-7
  38. Reuben DB, Magasi S, McCreath HE, Bohannon RW, Wang YC, Bubela DJ, … Gershon RC (2013). Motor assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S65–75. doi: 10.1212/WNL.0b013e3182872e01
  39. Rine RM, Schubert MC, Whitney SL, Roberts D, Redfern MS, Musolino MC, … Slotkin J (2013). Vestibular function assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S25–31. doi: 10.1212/WNL.0b013e3182872c6a
  40. Ruff RM, & Parker SB (1993). Gender- and age-specific changes in motor speed and eye-hand coordination in adults: Normative values for the Finger Tapping and Grooved Pegboard tests. Perceptual and Motor Skills, 76(3 Pt 2), 1219–1230. doi: 10.2466/pms.1993.76.3c.1219
  41. Ryan C (2013). Language use in the United States: 2011. American Community Survey Reports. Retrieved from https://www2.census.gov/library/publications/2013/acs/acs-22/acs-22.pdf
  42. Salsman JM, Butt Z, Pilkonis PA, Cyranowski JM, Zill N, Hendrie HC, … Cella D (2013). Emotion assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S76–86. doi: 10.1212/WNL.0b013e3182872e11
  43. Sinha P, Wong AWK, Kallogjeri D, & Piccirillo JF (2018). Baseline cognition assessment among patients with oropharyngeal cancer using PROMIS and NIH Toolbox. JAMA Otolaryngology – Head & Neck Surgery, 144(11), 978–987. doi: 10.1001/jamaoto.2018.0283
  44. Troller-Renfree SV, Barker TV, Pine DS, & Fox NA (2015). Cognitive functioning in socially anxious adults: Insights from the NIH Toolbox Cognition Battery. Frontiers in Psychology, 6, 764. doi: 10.3389/fpsyg.2015.00764
  45. Tulsky DS, & Heinemann AW (2017). The clinical utility and construct validity of the NIH Toolbox Cognition Battery (NIHTB-CB) in individuals with disabilities. Rehabilitation Psychology, 62(4), 409–412. doi: 10.1037/rep0000201
  46. United States Census Bureau. (2018a). American Fact Finder. Retrieved from https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
  47. United States Census Bureau. (2018b). Projected Race and Hispanic Origin: Main Projections Series for the United States, 2017–2060. Retrieved from https://www.census.gov/data/tables/2017/demo/popproj/2017-summary-tables.html
  48. Varma R, McKean-Cowdin R, Vitale S, Slotkin J, & Hays RD (2013). Vision assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S37–40. doi: 10.1212/WNL.0b013e3182876e0a
  49. Weintraub S, Dikmen SS, Heaton RK, Tulsky DS, Zelazo PD, Bauer PJ, … Gershon RC (2013). Cognition assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S54–64. doi: 10.1212/WNL.0b013e3182872ded
  50. Wilson RH (2011). Clinical experience with the words-in-noise test on 3430 veterans: Comparisons with pure-tone thresholds and word recognition in quiet. Journal of the American Academy of Audiology, 22(7), 405–423. doi: 10.3766/jaaa.22.7.3
  51. Wilson RH, & Burks CA (2005). Use of 35 words for evaluation of hearing loss in signal-to-babble ratio: A clinic protocol. Journal of Rehabilitation Research and Development, 42(6), 839–852. doi: 10.1682/JRRD.2005.01.0009
  52. Wolfson L, Whipple R, Derby CA, Amerman P, & Nashner L (1994). Gender differences in the balance of healthy elderly as demonstrated by dynamic posturography. Journal of Gerontology, 49(4), M160–M167.
  53. Woodcock RW, McGrew KS, & Mather N (2001). Woodcock-Johnson III. Itasca, IL: Riverside Publishing.
  54. Yorke AM, Curtis AB, Shoemaker M, & Vangsnes E (2015). Grip strength values stratified by age, gender, and chronic disease status in adults aged 50 years and older. Journal of Geriatric Physical Therapy, 38(3), 115–121. doi: 10.1519/jpt.0000000000000037
  55. Zecker SG, Hoffman HJ, Frisina R, Dubno JR, Dhar S, Wallhagen M, … Wilson RH (2013). Audition assessment using the NIH Toolbox. Neurology, 80(11 Suppl 3), S45–48. doi: 10.1212/WNL.0b013e3182872dd2
