Abstract
Research has shown that executive function (EF) skills are associated with resilience in preschoolers experiencing risk and adversity, but these studies have typically relied on large batteries of tasks to measure children’s EF skills. There is a need for brief, reliable EF assessments that can be used in the field with diverse young children. The current study assessed the validity and test-retest reliability of two tablet-based EF tasks from the NIH Toolbox: The Dimensional Change Card Sort (DCCS) and the Flanker Inhibitory Control and Attention Test, each with a developmental extension (Dext) that is triggered when a child struggles with the standardized versions. Dext versions include easier levels intended to improve task accessibility for younger and disadvantaged children. Eighty-six preschoolers residing in emergency housing participated in two study sessions about one week apart, completing tablet-based DCCS-Dext and Flanker-Dext tasks, along with a table-top EF task (Peg Tapping) and measures of vocabulary and numeracy. The majority of participants triggered the Dext portion of the DCCS and almost half triggered the Dext portion of the Flanker, underscoring the need for extensions of the Toolbox EF tasks to lower the floor of these measures. The Dext EF measures were positively associated with Peg-Tapping, after controlling for age and vocabulary, indicating construct validity. They were also correlated with math achievement, suggesting criterion validity. DCCS-Dext and Flanker-Dext showed moderate test-retest reliability after one week. Together, these findings demonstrate the value of developmental extensions for assessing EF skills among children experiencing risk and adversity.
Keywords: Early childhood, executive function measures, risk and adversity, test-retest reliability, validity
Executive function (EF) skills support adaptive functioning in children exposed to poverty-related stress and adversity, including those who are homeless (Zelazo, 2020). These skills develop rapidly in early childhood, and are linked to academic and social competence in children experiencing homelessness (Herbers et al., 2011; Masten et al., 2012; Obradović, 2010). However, these studies have relied on large batteries of EF measures (typically between four and six tasks) and some measures are too difficult for those with less well-developed EF skills. There is a need for brief, developmentally appropriate EF measures to use with diverse groups of children. With a sample of preschoolers residing in an emergency shelter, the present study investigated the validity and test-retest reliability of developmental extension (Dext) versions of two NIH Toolbox® measures of EF (Zelazo et al., 2013)—the Dimensional Change Card Sort (DCCS) and the Flanker Inhibitory Control and Attention Test. The Dext versions were originally developed for the National Children’s Study to lower the floor of these tasks for younger children from diverse and disadvantaged backgrounds (Carlson et al., 2011; Masten et al., 2011), but to date, there is limited research on their validity and reliability. Shields et al. (2020) included children and young adults with intellectual disability and found that the NIH Toolbox DCCS and Flanker tasks had strong psychometric properties for participants with mental ages of 5 years and older, but the results with the Dext versions were mixed. Kalstabakken (2017) incorporated the Toolbox EF tasks with Dext into routine early childhood screening in a diverse urban school district. Both DCCS-Dext and Flanker-Dext showed promising validity, but reliability data were not available.
EF skills include working memory, inhibitory control, and cognitive flexibility and are foundational for goal-directed behavior (Blair & Raver, 2012; Zelazo, 2020). Multiple studies have shown that EF development is particularly marked during the preschool period (Carlson, 2005; Zelazo et al., 2003), suggesting that early childhood may be an important window of opportunity for interventions to promote EF development. These intervention efforts are especially critical for children experiencing sociodemographic risk and adversity, who tend to have less well-developed EF skills upon school entry (Manfra, 2019). EF skills have been found to partially account for the relation between SES and school success (better than IQ or language; e.g., Lawson & Farah, 2017; Nesbitt et al., 2013); the link between EF and achievement appears to be most robust for math compared to other content areas (Allan et al., 2014; Schmitt et al., 2017). Measures that accurately assess EF skills are essential to better understand and promote EF development, as well as reliably track growth over time.
The challenges of measuring preschoolers’ EF skills are amplified when conducting research in the field, particularly at a homeless shelter. Families are under a great deal of stress and often have limited availability (Bradley et al., 2018). To decrease participant burden, study sessions need to be brief and easily accessible. Further, existing EF measures may be too difficult for children who struggle with EF skills, and they have typically been validated with more socioeconomically advantaged children (e.g., Beck et al., 2011; Zelazo et al., 2013).
Recent versions of the DCCS-Dext and Flanker-Dext are administered through the NIH Toolbox App and include additional levels below the starting point of the standard version. These Dext levels are triggered automatically when a participant fails to meet performance criteria on the standard versions. Dext measures are appealing for use with preschoolers in an emergency shelter setting because easier levels may help to attenuate floor effects; tablet-based measures tend to engage children; and each task takes under 10 minutes to complete.
In the current study, we investigated the psychometric properties of the tablet-based NIH Toolbox DCCS-Dext and Flanker-Dext among young children residing in emergency housing. The aims of the study were to assess the extent to which the Dext levels were needed within this population and to establish initial validity and test-retest reliability. We hypothesized that (1) a significant portion of participants would need the Dext; (2) performance on the Dext measures would be positively correlated with age, given that EF skills develop rapidly during the preschool period (Carlson, 2005); (3) after controlling for age, the Dext measures would be positively associated with performance on a third EF task (convergent validity), weakly related to vocabulary (divergent validity), and correlated with math achievement (criterion validity); and (4) the measures would demonstrate good test-retest reliability.
Method
Eighty-nine preschoolers residing in an emergency homeless shelter with their families participated during the summers of 2015 and 2016. Data from three children who participated in 2016 were excluded because they had already completed the study in 2015. The final sample included 86 children (57% female) between the ages of 36 and 72 months (M = 55.64, SD = 9.81). Participants were 66% Black/African American, 16% multiracial, 6% White, 3% American Indian or Alaska Native, and 9% other or not reported. To participate, children needed to be between the ages of 36 and 72 months, have a sufficient understanding of English, have no parent-reported developmental delay, and have resided in shelter for at least 3 days (to allow for some acclimation to the shelter environment). The participation rate was calculated from all eligible families staying in shelter during data collection; it was unavailable for 2015 and was 54% for the summer of 2016.
Data collection occurred in a designated research room at the shelter over two study sessions, approximately one week apart (M = 8.38 days), with 84% of participants returning for their second session. Time between the two sessions ranged from 3 to 23 days, although 98% of participants returned within 14 days. Graduate and undergraduate students conducted the study sessions. Parents gave informed consent and children provided verbal assent. Children were given table-top (i.e., non-digital) and tablet-based assessments during a one-hour study session. Upon completion, children and parents received a toy and gift card, respectively. The University of Minnesota Institutional Review Board approved all study procedures.
The NIH Toolbox DCCS and Flanker tasks with Dext (Anderson et al., 2015; Carlson et al., 2015) were administered through the NIH Toolbox App v.2.0 on an iPad Air 2. In the DCCS-Dext, participants sorted virtual cards by size, shape, and color into two boxes on the screen. If a child failed to sort enough cards correctly on practice (fewer than 3 of 4 cards), pre-switch (fewer than 4 of 5 cards) or post-switch (fewer than 4 of 5 cards), the Dext levels were triggered accordingly. The Flanker-Dext task required children to “feed the fish” in the middle of the screen by touching the button matching the way the middle fish was “swimming” and ignore flanking fish on either side; in the incongruent trials, the flanking fish point in the opposite direction. The Dext levels were triggered if a child failed the practice trials (fewer than 3 of 4 correct) or the first set of test trials (fewer than 6 of 7 incongruent trials correct). Both DCCS-Dext and Flanker-Dext were scored using the dextScore function in the ICDtools package (Desjardins, 2018) in R (R Core Team, 2020), with possible scores ranging from −5.00 to 10.00.
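The trigger criteria described above can be written out as a short sketch. This is an illustration only, not the NIH Toolbox App's own logic: the function names and returned labels are hypothetical, and the sketch assumes the phases are checked in order so that the first unmet criterion is the one reported.

```python
from typing import Optional


def dccs_dext_trigger(practice_correct: int,
                      pre_switch_correct: int,
                      post_switch_correct: int) -> Optional[str]:
    """Return the first DCCS phase whose criterion was not met, or None.

    Thresholds follow the text: fewer than 3 of 4 practice cards,
    fewer than 4 of 5 pre-switch cards, or fewer than 4 of 5
    post-switch cards. Phase labels are hypothetical.
    """
    if practice_correct < 3:
        return "practice"
    if pre_switch_correct < 4:
        return "pre-switch"
    if post_switch_correct < 4:
        return "post-switch"
    return None


def flanker_dext_trigger(practice_correct: int,
                         incongruent_correct: int) -> Optional[str]:
    """Return the first Flanker phase whose criterion was not met, or None.

    Thresholds follow the text: fewer than 3 of 4 practice trials, or
    fewer than 6 of 7 incongruent test trials. Phase labels are hypothetical.
    """
    if practice_correct < 3:
        return "practice"
    if incongruent_correct < 6:
        return "test"
    return None
```

For example, a child who passes practice and pre-switch but sorts only 3 of 5 post-switch cards correctly would be routed to the Dext levels at the post-switch point under this sketch.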
In addition, children completed Peg-Tapping (Diamond & Taylor, 1996), a widely used EF measure with preschool-aged children that has previously been administered in shelters (Masten et al., 2012). In this table-top measure, children were instructed to tap a surface one time if the examiner tapped two times and two times if the examiner tapped one time. Children were also given the NIH Toolbox Picture Vocabulary Test (PVT; Gershon et al., 2013), a tablet-based measure of receptive vocabulary. Lastly, children completed the Woodcock-Johnson III Applied Problems subtest (Woodcock et al., 2001) to assess early numeracy. Early math skills have reliably been found to correlate positively with EF skills (e.g., Bull & Lee, 2014), and a meta-analysis has shown that EF and math are more strongly related than EF and literacy in early childhood (Allan et al., 2014). There is also considerable overlap among the networks of neural regions underlying EF skills and mathematical problem solving (e.g., Amalric & Dehaene, 2017; Menon, 2016). Given that EF skills reliably predict math performance, we included Applied Problems to assess criterion validity of the Dext measures. Peg-Tapping, PVT, and Applied Problems were administered only at Time 1.
Missing data ranged from 2% (Applied Problems at Time 1) to 24% (DCCS-Dext at Time 2). Rates of missingness were higher at Time 2 because 16% of the sample did not return for their second session. We examined whether there were differences on study variables at Time 1 between those who returned and those who did not. A series of independent samples t-tests revealed no significant differences on any of the variables of interest. To address missingness, we multiply imputed 25 datasets using a fully conditional specification method in Statistical Package for the Social Sciences (SPSS) v.24. The imputed data were used in the main analyses investigating validity of the Dext measures.
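The pooling arithmetic is not spelled out in the text. A common way to combine an estimate across multiply imputed datasets is Rubin's rules, sketched below under that assumption. The function name and the example numbers are hypothetical, and only three imputations are shown for brevity where the study used 25.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the point estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)         # pooled point estimate
    within = statistics.mean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between     # total variance
    return q_bar, math.sqrt(total)


# Hypothetical regression coefficients and squared standard errors
# from three imputed datasets (illustrative values, not study data).
coef, se = pool_rubin([0.17, 0.18, 0.20], [0.0049, 0.0050, 0.0047])
print(round(coef, 3), round(se, 3))
```

The between-imputation term is what makes pooled standard errors larger than the average per-dataset standard error when the imputations disagree.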
Results and Discussion
Table 1 displays descriptive statistics and correlations among study variables. Our first aim was to examine the extent to which the Dext measures were needed in this sample (i.e., triggered by performance on the standard versions). As indicated in Table 1, the range of scores on both DCCS-Dext and Flanker-Dext extended well below the lowest score possible (i.e., zero) on the standard versions. At Time 1, 70% of participants triggered the DCCS-Dext and 45% triggered the Flanker-Dext. The Dext measures almost entirely eliminated floor effects: Just one participant earned the lowest possible score on Flanker-Dext at Time 1 and two participants earned the lowest possible score on DCCS-Dext at Time 2. Figure 1 displays the scores on both Dext measures by child age. Participants of all ages triggered the Dext levels, including every participant 42 months and younger.
Table 1.
Descriptive statistics and correlations among study variables
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) |
|---|---|---|---|---|---|---|---|
| (1) DCCS T1 | 1 | .42** | −.12 | .07 | .33** | .13 | .30* |
| (2) DCCS T2 | .55** | 1 | .27* | .56** | .51** | .29** | .49** |
| (3) Flanker T1 | .31** | .46** | 1 | .56** | .22* | .31** | .38** |
| (4) Flanker T2 | .35** | .65** | .71** | 1 | .37** | .29** | .41** |
| (5) Peg-Tapping T1 | .55** | .62** | .51** | .56** | 1 | .22* | .48** |
| (6) PVT T1 | .45** | .47** | .61** | .53** | .50** | 1 | .38** |
| (7) Applied Problems T1 | .59** | .61** | .68** | .63** | .68** | .67** | 1 |
| (8) Child Age | .57** | .41** | .67** | .54** | .56** | .64** | .74** |
| n | 76 | 65 | 83 | 69 | 78 | 77 | 84 |
| Mean | −.90 | −.43 | .67 | 1.47 | 6.64 | 55.01 | 8.87 |
| Standard Deviation | 2.98 | 3.74 | 2.91 | 3.58 | 5.12 | 7.42 | 5.44 |
| Min–Max | −4.88–6.44 | −5.00–7.30 | −5.00–6.56 | −4.83–7.43 | 0–16 | 33.70–76.80 | 0–20 |
Note. Partial correlations are displayed above the diagonal and control for child age in months.
Bivariate and partial correlations were pooled from 25 multiply imputed datasets (n = 86).
T1 = Time 1. T2 = Time 2. PVT = Picture Vocabulary Test.
*p < .05. **p < .01.
Figure 1.
Range of scores on the DCCS-Dext and Flanker-Dext at Time 1.
Note. The points below the line represent participants who used the Dext levels.
Next, we examined the validity of the Dext tasks. At the bivariate level, DCCS-Dext and Flanker-Dext were positively associated with Peg-Tapping, PVT, Applied Problems, and child age (Table 1). After controlling for child age, most correlations remained statistically significant. The exceptions were the correlations of DCCS-Dext at Time 1 with PVT and with Flanker-Dext at both time points. However, the Time 2 scores for DCCS-Dext and Flanker-Dext were strongly correlated, and this association remained after controlling for child age and PVT (r = .52, p < .01). This finding suggests that some familiarity with the tablet tasks and study session format might be helpful.
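To make the age adjustment concrete, a first-order partial correlation can be computed from three bivariate correlations. The sketch below applies the standard formula to rounded values read from Table 1; the function name is hypothetical. Because the published partial correlations were pooled across imputed datasets, this calculation is only expected to approximate them.

```python
import math


def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y after partialing out z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Rounded bivariate correlations from Table 1.
r_dccs_peg = 0.55  # DCCS T1 with Peg-Tapping T1
r_dccs_age = 0.57  # DCCS T1 with child age
r_peg_age = 0.56   # Peg-Tapping T1 with child age

adjusted = partial_corr(r_dccs_peg, r_dccs_age, r_peg_age)
print(round(adjusted, 2))
```

With these rounded inputs the result is about .34, close to the pooled partial correlation of .33 reported in Table 1 for DCCS-Dext at Time 1 and Peg-Tapping.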
DCCS-Dext and Flanker-Dext at both time points remained significantly correlated with Applied Problems, even after partialing out child age and PVT (r = .28, p < .05 for DCCS-Dext at Time 1; r = .43, p < .01 for DCCS-Dext at Time 2; r = .30, p < .01 for Flanker-Dext at Time 1; r = .34, p < .01 for Flanker-Dext at Time 2), suggesting criterion validity. For a conservative test of convergent and divergent validity, we regressed the Dext measures on child age, PVT, and Peg-Tapping to account for the shared variation among these related constructs. Results are shown in Table 2. Evidence for validity was particularly strong for both Dext measures at Time 2. Peg-Tapping was significantly related to both DCCS-Dext and Flanker-Dext after controlling for child age and PVT (convergent validity). Further, PVT was only weakly related to the Dext measures after accounting for variation explained by child age and Peg-Tapping (divergent validity). We also conducted sub-group analyses with only those participants who triggered the Dext levels at Time 1 (Supplemental Tables 1 and 2). The magnitudes of the correlations were similar to those in the full sample, indicating that validity was not driven solely by the standard levels of the tasks.
Table 2.
Dext measures at each time point regressed on child age, PVT, and Peg-Tapping
| | DCCS T1 | DCCS T2 | Flanker T1 | Flanker T2 |
|---|---|---|---|---|
| Child Age | .11 (.04)* | −.01 (.06) | .12 (.04)** | .07 (.05) |
| PVT T1 | .03 (.05) | .11 (.07) | .10 (.04)* | .12 (.06) |
| Peg-Tapping T1 | .18 (.07)** | .37 (.09)** | .08 (.06) | .22 (.08)** |
Note. Unstandardized betas and standard errors were pooled from 25 multiply imputed datasets (n = 86).
Standard errors are displayed in the parentheses.
T1 = Time 1. T2 = Time 2. PVT = Picture Vocabulary Test.
*p < .05. **p < .01.
Test-retest reliability was examined with single-measure intraclass correlation coefficients (ICC) using the raw, unimputed data; the DCCS-Dext ICC was .56 and the Flanker-Dext ICC was .68. According to Koo and Li (2016), single-measure ICCs between .50 and .75 represent moderate reliability. The odds ratio of needing the Dext levels at Time 2 if they were used at Time 1 was 8.86 for DCCS-Dext and 4.70 for Flanker-Dext, demonstrating that many of the same children triggered the Dext levels again at Time 2. In the full sample, there was some evidence of practice effects on the Flanker-Dext but not the DCCS-Dext. A repeated measures ANOVA indicated significant improvement on the Flanker-Dext from Time 1 to Time 2 (F(1, 67) = 6.08, p < .05). We also found a marginal time by group (i.e., triggered Flanker-Dext levels or not) interaction (F(2, 66) = 2.99, p = .09), suggesting that the practice effects are likely driven by those needing the Dext levels. Of the children who participated in both study sessions, 34% required the Flanker-Dext levels at Time 2 (compared to 47% at Time 1). Practice with Flanker-Dext, in particular, seemed to serve a training role.
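As an illustration of the reliability calculation, the sketch below computes a single-measure ICC from paired scores and maps it onto the Koo and Li (2016) bands. The text does not state which ICC model was used, so the two-way consistency form here is an assumption. The function names and the paired scores are hypothetical, and the handling of values exactly on a band boundary is a choice made for this sketch.

```python
def icc_consistency(scores):
    """Single-measure, two-way consistency ICC.

    scores: list of rows, one per child, each row holding that child's
    score at every session (here, Time 1 and Time 2).
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # between children
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # between sessions
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols                   # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def koo_li_band(icc: float) -> str:
    """Label an ICC using the Koo and Li (2016) guideline bands."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


# Hypothetical (Time 1, Time 2) Dext scores for five children.
toy_scores = [[-2.0, -1.5], [0.5, 1.0], [3.0, 2.0], [-4.0, -1.0], [1.5, 3.5]]
print(koo_li_band(icc_consistency(toy_scores)))

# The ICCs reported in the text both fall in the moderate band.
print(koo_li_band(0.56), koo_li_band(0.68))
```

A consistency ICC ignores a uniform shift between sessions, so a practice effect that raises every child's score by the same amount would not lower it; an absolute-agreement ICC would be more sensitive to that kind of change.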
Our findings indicate that Dext versions of the NIH Toolbox EF tasks improve the utility of the Toolbox measures of EF for young children experiencing poverty-related risks and adversity. The standard versions of the DCCS and Flanker were too challenging for the majority of participants in this sample and the Dext levels revealed meaningful variability at the low end that likely would have been lost with the standard measure. Evidence for validity of the Dext measures included significant positive correlations with Peg-Tapping and Applied Problems, which remained after controlling for age, and relatively weak associations with PVT.
Strengths of the study include the high percentage of participants who returned for their second session and the use of these measures in the field. One limitation is that only one additional EF measure, Peg-Tapping, was included. Future research could continue to examine associations between the Dext measures and other EF measures. Additionally, participants had limited practice time with the tablet before the assessments began. Including a brief practice session might have enhanced test-retest reliability, given the evidence for improved validity for both measures at Time 2. Future studies might include a warm-up task to explore the effects of familiarity with tablet use. Even with these limitations, the addition of Dext levels appears to improve the utility of the NIH Toolbox measures of EF skills with diverse preschoolers.
Acknowledgments:
The authors are grateful to all of the families who participated and our community collaborators. The research reported here was supported by the National Institutes of Health (NIH) Environmental Influences on Child Health Outcomes (ECHO) Program, Grant U24 OD-023319-01 and the Irving B. Harris Professorship (Masten). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Footnotes
Disclosure of interest: The authors declare no conflict of interest.
References
- Allan NP, Hume LE, Allan DM, Farrington AL, & Lonigan CJ (2014). Relations between inhibitory control and the development of academic skills in preschool and kindergarten: A meta-analysis. Developmental Psychology, 50, 2368–2379. 10.1037/a0037493
- Amalric M, & Dehaene S (2017). Cortical circuits for mathematical knowledge: Evidence for a major subdivision within the brain’s semantic networks. Philosophical Transactions of the Royal Society B: Biological Sciences, 373, 20160515. 10.1098/rstb.2016.0515
- Anderson JE, Zelazo PD, Carlson SM, Kalstabakken AW, & Masten AS (2015). Technical Report for the Flanker-Developmental Extension.
- Beck DM, Schaefer C, Pang K, & Carlson SM (2011). Executive function in preschool children: Test-retest reliability. Journal of Cognition and Development, 2, 169–193. 10.1080/15248372.2011.563485
- Blair C, & Raver CC (2012). Child development in the context of adversity: Experiential canalization of brain and behavior. American Psychologist, 67, 309–318. 10.1037/a0027493
- Bradley C, McGowan J, & Michelson D (2018). How does homelessness affect parenting behaviour? A systematic critical review and thematic synthesis of qualitative research. Clinical Child and Family Psychology Review, 21, 94–108. 10.1007/s10567-017-0244-3
- Bull R, & Lee K (2014). Executive functioning and mathematics achievement. Child Development Perspectives, 8, 36–41. 10.1111/cdep.12059
- Carlson SM (2005). Developmentally sensitive measures of executive function in preschool children. Developmental Neuropsychology, 28, 595–616. 10.1207/s15326942dn2802_3
- Carlson SM, Masten AS, Zelazo PD, Wenzel A, Anderson JE, Buckner M, & McGovern P (2011). Assessment of executive function for the National Children’s Study. Presentation at NCS Research Day, Washington, DC.
- Carlson SM, Zelazo PD, Anderson JE, Kalstabakken AW, & Masten AS (2015). Technical Report for the Dimensional Change Card Sort - Developmental Extension.
- Desjardins CD (2018). ICDtools: R tools for the Institute of Child Development. R package version 0.1. Retrieved from https://rdrr.io/github/cddesja/ICDtools/
- Diamond A, & Taylor C (1996). Development of an aspect of executive control: Development of the abilities to remember what I said and to “do as I say, not as I do.” Developmental Psychobiology, 29, 315–334.
- Gershon RC, Slotkin J, Manly JJ, Blitz DL, Beaumont JL, Schnipke D, Wallner-Allen K, Golinkoff RM, Gleason JB, Hirsh-Pasek K, Adams MJ, & Weintraub S (2013). NIH toolbox cognition battery (CB): Measuring language (vocabulary comprehension and reading decoding). Monographs of the Society for Research in Child Development, 78, 49–69. 10.1111/mono.12034
- Herbers JE, Cutuli JJ, Lafavor TL, Vrieze D, Leibel C, Obradović J, & Masten AS (2011). Direct and indirect effects of parenting on the academic functioning of young homeless children. Early Education and Development, 22, 77–104. 10.1080/10409280903507261
- Kalstabakken AW (2017). Executive function measures in early childhood screening: Concurrent and predictive validity [Unpublished doctoral dissertation]. University of Minnesota.
- Koo TK, & Li MY (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15, 155–163. 10.1016/j.jcm.2016.02.012
- Lawson GM, & Farah MJ (2017). Executive function as a mediator between SES and academic achievement throughout childhood. International Journal of Behavioral Development, 41, 94–104. 10.1177/0165025415603489
- Manfra L (2019). Impact of homelessness on school readiness skills and early academic achievement: A systematic review of the literature. Early Childhood Education Journal, 47, 239–249. 10.1007/s10643-018-0918-6
- Masten AS, Carlson SM, Zelazo PD, Wenzel AJ, Anderson JE, Buckner M, & McGovern P (2011). Assessment of executive function for the National Children’s Study. Poster presented at NCS Research Day, Washington, DC.
- Masten AS, Herbers JE, Desjardins CD, Cutuli JJ, McCormick CM, Sapienza JK, Long JD, & Zelazo PD (2012). Executive function skills and school success in young children experiencing homelessness. Educational Researcher, 41, 375–384. 10.3102/0013189X12459883
- Menon V (2016). Memory and cognitive control circuits in mathematical cognition and learning. Progress in Brain Research, 227, 159–186. 10.1016/bs.pbr.2016.04.026
- Nesbitt KT, Baker-Ward L, & Willoughby MT (2013). Executive function mediates socio-economic and racial differences in early academic achievement. Early Childhood Research Quarterly, 28, 774–783. 10.1016/j.ecresq.2013.07.005
- Obradović J (2010). Effortful control and adaptive functioning of homeless children: Variable-focused and person-focused analyses. Journal of Applied Developmental Psychology, 31, 109–117. 10.1016/j.appdev.2009.09.004
- R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from https://www.R-project.org/
- Schmitt SA, Geldhof GJ, Purpura DJ, Duncan R, & McClelland MM (2017). Examining the relations between executive function, math, and literacy during the transition to kindergarten: A multi-analytic approach. Journal of Educational Psychology, 109, 1120–1140. 10.1037/edu0000193
- Shields RH, Kaat AJ, McKenzie FJ, Drayton A, Sansone SM, Coleman J, Michalak C, Riley K, Berry-Kravis E, Gershon RC, Widaman KF, & Hessl D (2020). Validation of the NIH Toolbox Cognitive Battery in intellectual disability. Neurology, 94, e1229–e1240. 10.1212/WNL.0000000000009131
- Woodcock RW, McGrew K, Mather N, & Schrank F (2001). Woodcock-Johnson Tests of Achievement. Itasca, IL: Riverside Publishing.
- Zelazo PD (2020). Executive function and psychopathology: A neurodevelopmental perspective. Annual Review of Clinical Psychology, 16, 431–454. 10.1146/annurev-clinpsy-072319-024242
- Zelazo PD, Anderson JE, Richler J, Wallner-Allen K, Beaumont JL, & Weintraub S (2013). NIH Toolbox Cognition Battery (CB): Measuring executive function and attention. Monographs of the Society for Research in Child Development, 78, 16–33. 10.1111/mono.12032
- Zelazo PD, Müller U, Frye D, & Marcovitch S (2003). The development of executive function in early childhood. Monographs of the Society for Research in Child Development, 68, 11–27. 10.1111/j.0037-976X.2003.00261.x
