Paediatrics & Child Health. 2018 Apr 9;23(8):e163–e169. doi: 10.1093/pch/pxy038

Establishing Bayley-III cut-off scores at 21 months for predicting low IQ scores at 3 years of age in a preterm cohort

Dianne E Creighton 1,2, Selphee Tang 2, Jill Newman 2, Leonora Hendson 1,2, Reg Sauve 3
PMCID: PMC6242003  PMID: 30842698

Abstract

Objective

To evaluate predictive validity and establish cut-off scores on the Bayley-III at age 21 months that best predict low Intelligence Quotient (IQ) scores (<70 or <80) at 3 years in a high-risk preterm cohort.

Method

Bayley-III evaluations at 21 months corrected age and intellectual assessments, primarily with the WPPSI-III, at 3 years corrected age were conducted with 520 infants born less than 29 weeks gestational age or less than 1250 g birth weight. Receiver Operator Characteristic (ROC) curves were used to establish Bayley-III Cognitive Composite cut-off scores that maximized Sensitivity and Specificity in predicting low IQ. Similar analyses were performed using the Language Composite, and a research derived mean Cognitive-Language Composite.

Results

A regression model for the association between 21-month Bayley-III Cognitive Composite and 3-year IQ scores was significant (P<0.0001, Adjusted R²=0.36). The ROC Area under the Curve was 0.90 for the Cognitive Composite predicting IQ<70. The cut-off score that maximized Sensitivity and Specificity for predicting 3-year IQ<70 was a Cognitive Composite of <80. The ROC Area under the Curve was 0.80 for the Cognitive Composite predicting IQ<80, and a Cognitive Composite cut-off score of <90 maximized Sensitivity and Specificity.

Conclusion

In this high-risk preterm cohort, there was a strong association between the Bayley-III Cognitive Composite at 21 months and IQ at 3 years. A Cognitive Composite cut-off score of <80 optimized classification of IQ<70 at 3 years, and a Cognitive Composite cut-off score of <90 optimized classification of IQ<80.

Keywords: Bayley-III, Development, Preterm infants

INTRODUCTION

The Bayley Scales of Infant Development have been used for many years as a standard assessment measure for infants and young children enrolled in neonatal follow-up programs. The scales are used for the clinical assessment and referral of individual infants for treatment, for outcomes research on cohorts of children and in clinical trials to evaluate outcomes of neonatal treatments. The earlier version, the Bayley Scales of Infant Development-Second Edition (BSID-II) (1) had two developmental scales, yielding a Mental Development Index and a Psychomotor Development Index. The current version, the Bayley Scales of Infant and Toddler Development-Third Edition (Bayley-III) (2), has three administered scales, yielding Cognitive, Language and Motor Composites. One of the major changes in the third edition was to remove language items from the Mental Development scale, and place them on a separate Language scale. The purpose was to be able to differentiate children with language delays from those with cognitive impairments. The scales were re-normed on a more up to date population of children reflecting more current demographics in the USA. As well, 10% of the normative sample for the Bayley-III comprised children with pre-existing conditions associated with risk of developmental impairment.

The separation of language and cognitive abilities onto separate scales has the potential advantage of improving our understanding of the development of preterm infants. However, clinicians and researchers in high-risk infant follow-up programs using the Bayley-III have questions about whether the Bayley-III gives similar results to the older versions. When both versions are administered to the same individuals, the Bayley-III yields higher scores than the previous edition (3–5). Clinicians find that there is a larger proportion of children receiving scores in the ‘normal’ range, leading to a potential failure to detect developmental delay and to make earlier referral for intervention (4,6). Further, researchers outside the USA have documented that the Bayley-III ‘overestimates’ development, raising questions about the generalizability of the US norms to other populations (6–8).

Developmental delay or neurodevelopmental impairment (NDI) is often defined as an obtained score on standardized assessment that is more than 2 standard deviations (SD) below the mean of the general or normative population. The use of the usual cut-off score of <70 (>2 SD below the mean) to define delay or NDI on the Bayley-III may under-identify children requiring developmental support and may under-represent developmental impairment rates in follow-up cohorts. Clinicians and researchers need to know what cut-off scores on the Bayley-III will best identify infants likely to show developmental impairments.

To attempt to address these issues some researchers have compared BSID-II to Bayley-III findings. Vohr et al. compared NDI rates (scores <70) in extremely low birth weight infants detected by each test at 18 to 22 months in two time periods (9). The cognitive NDI rate was lower with the Bayley-III than with its predecessor (13% versus 43%, P<0.0001). This raises the question of whether the rates of impairment were overestimated in the past or whether rates are being underestimated currently (10). Some investigators have administered the BSID-II and Bayley-III concurrently to the same group of infants (4,11), again finding a lower NDI rate with the Bayley-III, but still not answering the question of which is the more accurate measure. Others have developed mathematical equations to convert Bayley-III cognitive and language composite scores into BSID-II Mental Development Index scores (5). This is interesting as a research solution, but does not provide a pragmatic strategy for the clinician who is trying to identify delay in a child, counsel parents, and make recommendations for intervention. Other strategies have been to compare the Bayley-III scores of the clinical group, usually preterm infants, with those of a control group, typically full-term infants (7), or with locally generated norms (10,12). To fund the recruitment and assessment of a control group for comparative purposes is very costly and to acquire enough infant assessments to generate local norms is an undertaking well beyond the scope of most follow-up programs.

The approach used in the present study is to determine what cut-off scores on the Bayley-III Cognitive and Language scales administered at 21 months corrected age (CA) in a preterm cohort best predict low IQ scores at 3 years CA. Similar approaches have been used in groups with different clinical characteristics and assessed at differing ages (13), but none with the cohort size of the current study.

METHOD

As part of a regional Neonatal Follow-up Program, the development of infants in southern Alberta born at ≤1250 g and/or less than 29 weeks’ gestation was monitored as a standard of care. Data from the cohort born April 2005 to December 2010 were used. Parents provided informed consent for their infants to participate in the assessments, and the study was approved by the Conjoint Research Ethics Board. Infants were assessed by trained, experienced psychologists, psychometrists and speech language pathologists using the Bayley-III at 21 months CA. Infants returned for an intellectual assessment at 36 months CA. The 3-year testing was administered by a psychologist or psychometrist. For most children (n=369, 71%), the Wechsler Preschool and Primary Scale of Intelligence-Third Edition (WPPSI-III) was used and Full Scale IQ scores were obtained (14). Data from other intellectual assessments administered to 3-year-olds over the 6-year study period included the WPPSI-IV (15) (n=88, 17%); the Leiter International Performance Scale-Revised (16) (n=23, 4%) or the Leiter International Performance Scale-3 (17) (n=2, 0.4%), nonverbal measures of intelligence administered to some non-English-speaking children; and the Bayley-III Cognitive scale (n=34, 7%), administered to children whose delays left them unable to achieve a basal level of success on the WPPSI-III. NDI was also inferred without test scores for children diagnosed by a paediatrician as having global developmental delay (n=4, 1%).

To assess predictive validity, Bayley-III Cognitive and Language Composite scores at 21 months CA were compared to intellectual test results at 36 months CA. A derived Cognitive-Language Composite score, the mean of the Cognitive and Language Composites, similar to the Mental Development Index of the BSID-II, was generated for each child and compared to the 36-month outcomes. From these data, Receiver Operator Characteristic (ROC) curves and the Area under the Curve for each were used to evaluate the accuracy of the Bayley-III scores in predicting whether an infant is likely to show low IQ scores at age 3. Bayley-III cut-off scores were identified as the scores that maximized both sensitivity and specificity. Extremely Low IQ was defined as a Full Scale IQ or equivalent cognitive test score <70 (14) or diagnosis by a paediatrician of global developmental delay. Secondary analyses were conducted to determine the optimal Bayley-III Cognitive Composite cut-off score to predict an intellectual outcome at age 3 in the Borderline range or lower, defined as IQ<80 (14). Age correction for prematurity was used throughout, as the difference between scores for corrected versus chronological age continues to be clinically important at age 3 (18). Consistent use of age correction for prematurity in longitudinal studies reduces the risk of misinterpreting the reason for any change in cognitive test scores (18).
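The cut-off search described above can be sketched in a few lines. This is a minimal illustration, not the authors' SAS analysis: it assumes that "maximized both sensitivity and specificity" means maximizing the Youden index (sensitivity + specificity - 1), which the text does not state, and the scores and outcomes are hypothetical values invented for the example.

```python
from typing import Iterable, List, Tuple


def sens_spec(scores: List[float], low_iq: List[bool], cutoff: float) -> Tuple[float, float]:
    """Sensitivity and specificity of the rule 'score < cutoff' for detecting low IQ."""
    pairs = list(zip(scores, low_iq))
    tp = sum(1 for s, y in pairs if y and s < cutoff)       # low IQ, flagged
    fn = sum(1 for s, y in pairs if y and s >= cutoff)      # low IQ, missed
    tn = sum(1 for s, y in pairs if not y and s >= cutoff)  # not low IQ, not flagged
    fp = sum(1 for s, y in pairs if not y and s < cutoff)   # not low IQ, flagged
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(scores: List[float], low_iq: List[bool], candidates: Iterable[float]):
    """Return (cutoff, J, sensitivity, specificity) for the largest Youden index J.

    Assumption: J = sensitivity + specificity - 1 is the optimality criterion.
    Ties are resolved in favour of the lowest candidate.
    """
    best = None
    for c in candidates:
        sens, spec = sens_spec(scores, low_iq, c)
        j = sens + spec - 1
        if best is None or j > best[1]:
            best = (c, j, sens, spec)
    return best


# Hypothetical 21-month composite scores and 3-year outcomes (True = IQ<70).
scores = [60, 70, 75, 75, 85, 80, 85, 90, 95, 100, 105, 110]
low_iq = [True, True, True, True, True, False, False, False, False, False, False, False]

cutoff, j, sens, spec = best_cutoff(scores, low_iq, range(60, 125, 5))
print(f"best rule: score < {cutoff} (sensitivity {sens:.2f}, specificity {spec:.2f})")
```

Candidate cut-offs are stepped by 5 because Bayley-III composites are reported in five-point increments, so intermediate thresholds would classify children identically.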

Linear regression modeling was conducted to show the degree of linear association along the distribution of 21 and 36 month scores from lower to higher values. Traditional diagnostic/screening test statistics, Sensitivity (Co-Positivity), Specificity (Co-Negativity), Positive Predictive Value and Negative Predictive Value, were reported along with the associated 95% confidence intervals for binomial proportions for the optimal Bayley-III cut-off scores. T-tests and Chi-square tests were used to compare baseline characteristics of participants (data included in the analysis) and non-participants (data not available), and t-tests were used to compare Bayley-III composite scores to population norms. The level of significance was set at P<0.05 (two-sided) throughout, and SAS v9.3 (SAS Institute Inc., Cary, NC, USA) was used for analysis.
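The four screening statistics named above all come from one 2x2 table of predicted versus observed classification. The sketch below shows the arithmetic under stated assumptions: the counts are hypothetical (not taken from this study), and the normal-approximation (Wald) interval is an assumed choice, since the text does not say which binomial interval was used.

```python
import math


def proportion_ci(successes: int, total: int, z: float = 1.96):
    """Proportion with a Wald (normal-approximation) 95% interval, clipped to [0, 1]."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def diagnostic_stats(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, PPV and NPV, each as (estimate, lower, upper)."""
    return {
        "sensitivity": proportion_ci(tp, tp + fn),  # flagged among those with low IQ
        "specificity": proportion_ci(tn, tn + fp),  # not flagged among those without
        "ppv": proportion_ci(tp, tp + fp),          # low IQ among those flagged
        "npv": proportion_ci(tn, tn + fn),          # no low IQ among those not flagged
    }


# Hypothetical counts for illustration only.
stats = diagnostic_stats(tp=30, fp=20, fn=10, tn=140)
for name, (est, lo, hi) in stats.items():
    print(f"{name}: {est:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```

Sensitivity and specificity are conditioned on the true outcome, whereas PPV and NPV are conditioned on the test result, so the latter two shift with the proportion of children in the cohort who have low IQ.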

RESULTS

The cohort included 817 children with birth weight less than 1250 g or gestational age (GA) less than 29 weeks who survived to 36 months and were eligible to attend follow-up. Of these, 520 had both Bayley-III Cognitive Composite scores at 21 months and intellectual outcomes at 36 months and are included in the analyses. Ninety-four per cent were very preterm (<32 weeks GA) and all but one were very low birth weight (<1500 g) (Table 1). Survivors not included in the analyses (n=297) were either lost to follow-up at both the 21- and 36-month visits (n=89) or attended follow-up clinic but did not have cognitive test results available for both assessment points (n=208). Characteristics of participants were compared to those of nonparticipants (Table 2). Participants had lower mean birth weight (P<0.001), lower gestational age at birth (P<0.001), more frequently had bronchopulmonary dysplasia (P=0.016) and more often had severe retinopathy of prematurity (P=0.019).

Table 1.

Birth characteristics of the cohort (n=520)

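| Gestational age | Frequency | Percent |
|---|---|---|
| 23–25 weeks | 126 | 24% |
| 26–28 weeks | 260 | 50% |
| 29–31 weeks | 103 | 20% |
| 32–34 weeks | 31 | 6% |
| Total | 520 | 100% |

| Birth weight | Frequency | Percent |
|---|---|---|
| 480–499 g | 2 | <1% |
| 500–749 g | 101 | 19% |
| 750–999 g | 197 | 38% |
| 1000–1249 g | 188 | 36% |
| 1250–1560 g | 32 | 6% |
| Total | 520 | 100% |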

Table 2.

Characteristics of survivors to age 3 (n=817) comparing participants vs. nonparticipants

| Characteristic | Participants (n=520) | Nonparticipants (n=297) | P-value* |
|---|---|---|---|
| Mean birth weight (g) | 939 (SD 210) | 994 (SD 205) | <0.001 |
| Mean gestational age (weeks) | 27 (SD 2) | 28 (SD 2) | <0.001 |
| Male | 285 (55%) | 144 (48%) | 0.082 |
| Multiple birth | 160 (31%) | 85 (29%) | 0.497 |
| Maternal race Caucasian | 355/513 (69%) | 191/279 (68%) | 0.829 |
| Maternal education, some postsecondary or more | 328/498 (66%) | 139/233 (60%) | 0.104 |
| Intraventricular hemorrhage Grade 3 or 4 | 38/517 (7%) | 14/296 (5%) | 0.142 |
| Bronchopulmonary dysplasia† | 253/464 (55%) | 116/257 (45%) | 0.016 |
| Severe retinopathy of prematurity‡ | 90/413 (22%) | 27/196 (14%) | 0.019 |
| Confirmed sepsis | 97/518 (19%) | 47/295 (16%) | 0.316 |
| English spoken as second language or bilingual home | 151 (29%) | 74/237 (31%) | 0.466 |

*t-test or Chi-square test.

†Supplemental oxygen required at 36 weeks gestational age.

‡Stage ≥ 3, plus disease, or laser therapy.

Regression models for the linear association of 21-month Bayley-III Composite scores to 3-year IQ scores were generated. The model for the Cognitive Composite yielded an R² of 0.36 (P<0.001, Figure 1). The Language Composite showed an R² of 0.37 (P<0.001, Figure 1), and the derived Cognitive-Language Composite yielded an R² of 0.44 (P<0.001).
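For readers less familiar with the statistic, the following sketch shows how the coefficient of determination is obtained from a one-predictor least-squares fit, and why the adjusted value reported in the Abstract is essentially the same as the unadjusted one in a sample of this size. The function names are illustrative, the four-point data set is made up, and n=520 is taken from the Cognitive Composite sample.

```python
from typing import List, Tuple


def simple_linear_regression(x: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1 - ss_res / syy


def adjusted_r2(r2: float, n: int, predictors: int = 1) -> float:
    """Adjusted R^2 penalises R^2 for the number of predictors relative to sample size."""
    return 1 - (1 - r2) * (n - 1) / (n - predictors - 1)


# Made-up, perfectly linear data: the fit recovers slope 2, intercept 1, R^2 = 1.
slope, intercept, r2 = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])

# With one predictor and n=520, the adjustment changes R^2 = 0.36 only in the third decimal.
print(round(adjusted_r2(0.36, 520), 4))
```

An R² of 0.36 means that about a third of the variance in 3-year IQ is accounted for by the 21-month score in a linear model.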

Figure 1.

Bayley-III Cognitive and Language Scores at 21 months corrected age versus IQ at 3 years corrected age.

Bayley-III Cognitive Composite scores at 21 months CA had a mean of 91.0 (SD 12.1, range 55 to 120), which is significantly lower (P<0.001) than the normative mean of 100 (SD 15). Bayley-III Language Composite scores (mean 88.5, SD 16.1, range 47 to 147) were also significantly lower than published norms (P<0.001).
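The comparison with the normative mean can be checked from the summary statistics alone. The sketch below assumes a one-sample t-test against a fixed reference mean of 100 and n=520 for the Cognitive Composite; the exact form of the test run by the authors is not described, so this is an illustration of the size of the difference rather than a reproduction of their output.

```python
import math


def one_sample_t(sample_mean: float, sample_sd: float, n: int, reference_mean: float) -> float:
    """t statistic for H0: population mean equals reference_mean, from summary statistics."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - reference_mean) / standard_error


# Cognitive Composite: mean 91.0, SD 12.1, n=520, normative mean 100.
t_cognitive = one_sample_t(91.0, 12.1, 520, 100.0)
print(round(t_cognitive, 2))
```

The statistic is roughly -17 on 519 degrees of freedom, far beyond the threshold for P<0.001, which is consistent with the reported result.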

To examine the utility of various Bayley-III Composite scores at 21 months that could be used as cut-off scores to delineate likely IQ<70 at age 3, ROC curves were generated. The ROC Area under the Curve value was 0.90 using the Cognitive Composite (Figure 2). The optimal cut-off Cognitive Composite score was determined to be <80, which showed Sensitivity of 77% and Specificity of 90% for predicting IQ<70 (Table 3). Similar ROC analyses were performed with the Language Composite, with Area under the Curve 0.83 and best cut-off <80 (Figure 2), and with the derived Cognitive-Language Composite, with Area under the Curve 0.88 and best cut-off <83 (not shown). The Cognitive Composite cut-off score at 21 months that best predicted later IQ<80 was determined to be <90. This score showed Sensitivity of 68% and Specificity of 69% (Table 3).
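One way to read an Area under the Curve value is as the probability that a randomly chosen child with the outcome scored lower at 21 months than a randomly chosen child without it. The sketch below computes the area through that pairwise comparison, counting ties as one half; this is a standard equivalent of the area under the empirical ROC curve, not necessarily the computation used in the study software, and the scores are hypothetical.

```python
from typing import List


def auc_lower_score_predicts(scores_with_outcome: List[float],
                             scores_without_outcome: List[float]) -> float:
    """Area under the ROC curve for the rule 'lower score predicts the outcome'.

    Computed as the fraction of (with, without) pairs in which the child with
    the outcome has the lower score; tied pairs count as one half.
    """
    concordant = 0.0
    for s_with in scores_with_outcome:
        for s_without in scores_without_outcome:
            if s_with < s_without:
                concordant += 1.0
            elif s_with == s_without:
                concordant += 0.5
    return concordant / (len(scores_with_outcome) * len(scores_without_outcome))


# Hypothetical composite scores, for illustration only.
with_low_iq = [60, 70, 75, 85]
without_low_iq = [80, 85, 90, 100]
auc = auc_lower_score_predicts(with_low_iq, without_low_iq)
print(auc)
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two groups.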

Figure 2.

ROC curves for Sensitivity and 1-Specificity of Bayley-III cognitive and language scores predicting IQ<70. ROC, Receiver operator characteristic.

Table 3.

Sensitivity, specificity, PPV and NPV of Bayley-III at 21 months corrected age for predicting IQ<70 and IQ<80 at 3 years corrected age

| Cut-off score* | Sensitivity (95% CI) | Specificity (95% CI) | PPV (95% CI) | NPV (95% CI) |
|---|---|---|---|---|
| Predicting IQ<70 | | | | |
| Cognitive Composite <80 | 77% (61–93%) | 90% (88–93%) | 30% (19–42%) | 99% (98–100%) |
| Language Composite <80 | 76% (58–94%) | 72% (67–76%) | 12% (6–17%) | 98% (97–100%) |
| Cognitive-Language Composite (derived) <83 | 76% (58–94%) | 78% (74–82%) | 15% (8–21%) | 99% (97–100%) |
| Predicting IQ<80 | | | | |
| Cognitive Composite <90 | 68% (56–79%) | 69% (65–73%) | 23% (17–29%) | 94% (92–97%) |

*Cut-off scores found to maximize sensitivity and specificity.

CI, Confidence interval; NPV, Negative predictive value; PPV, Positive predictive value.

DISCUSSION

Our results using the Bayley-III Cognitive Composite, Language Composite and the derived Cognitive-Language Composite at 21 months CA indicate that these scores are strong predictors of intellectual outcome at 3 years CA. The Cognitive Composite was the strongest, based on the ROC analyses. There was a slightly reduced sample size when the Bayley-III Language scale data were used (n=444). Our experience is that the preterm children seen at 21 months have difficulty remaining fully engaged in the Bayley-III assessment as it progresses from the Cognitive scale to the Language scale. Whether this drop off in their engagement is due to their ability to regulate their attention and behaviour for longer periods, or whether it has to do with the challenge of the language based items, cannot be answered by the present study. The order of administration of the Cognitive and Language scales was not randomized. The association of the Bayley-III Language Composite to a standardized speech and language outcome at age 3 was not evaluated, but is an area for future research.

The present study, using ROC analyses to determine the optimal cut-off score to indicate delay, yielded similar results to those of Bode et al. (13), who used a local control group to set the reference standards for normal, mild to moderately delayed, or severely delayed development. The Cognitive Composite cut-off scores that were established by Bode et al. were <77 for severe delay (2 SD below the mean of the control group) and 77 to 86 for mild to moderate delay (between 2 and 1 SD below the mean of the control group). The cut-off scores of 77 and 86 correspond closely to those of the present study. Since Cognitive Composite scores on the Bayley-III are in five-point increments, there is no Cognitive Composite score of 77, and 80 is the closest attainable cut-off. Similarly, there is no Cognitive Composite score of 86, so 90 is the cut-off that corresponds in practice. The present study differs from that of Bode et al. in the ages of assessment: 21 months and 3 years in our study, 24 months and 4 years in Bode’s. It also differs in geographical location (western Canada versus northeastern United States). Despite these differences in methodology, ages of assessment and geographical location, the comparability of the results supports the generalizability of the findings.
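The correspondence between the two sets of cut-offs follows from the five-point scoring grid: any threshold that falls between two attainable scores classifies children exactly as the next attainable score above it does. The short sketch below makes that explicit. The helper name is hypothetical, and the 40 to 160 range used for the check is an assumption about the span of attainable composites.

```python
import math


def equivalent_grid_cutoff(threshold: float, step: int = 5) -> int:
    """Smallest multiple of `step` that is >= threshold.

    When scores only take values that are multiples of `step`, the rules
    'score < threshold' and 'score < equivalent_grid_cutoff(threshold)'
    select exactly the same scores.
    """
    return math.ceil(threshold / step) * step


print(equivalent_grid_cutoff(77))  # 80
print(equivalent_grid_cutoff(86))  # 90

# Check the equivalence on an assumed range of attainable composite scores.
attainable = range(40, 165, 5)
same_77_80 = all((s < 77) == (s < 80) for s in attainable)
same_86_90 = all((s < 86) == (s < 90) for s in attainable)
print(same_77_80, same_86_90)
```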

Potential limitations to the current study need to be taken into account. There was no comparison group of full-term children for the establishment of local norms. Children with physical disabilities were not excluded if they were able to participate in the Bayley-III assessment and obtain a valid score, as judged by the psychologist or speech language pathologist. There was a high rate of missing data, which reduced the sample size for analyses. Reasons for the absence of cognitive results at either 21 months or 3 years included the child being reluctant to engage, fatigue during the clinic day and scheduling difficulties. Our information on those who were not included due to missing data indicates that as a group their neonatal difficulties were somewhat less severe. Not all children at age 3 were administered the same intellectual assessment. The majority (71%) completed the WPPSI-III. The WPPSI-IV was used when it became available, consistent with clinical practice guidelines for the use of up-to-date tests. The correlation between Full Scale IQ scores for the WPPSI-III and WPPSI-IV is high (corrected r=0.86) and test means are similar (Standard Difference = 0.25) (15). For clinical reasons, other standardized cognitive assessments appropriate to the needs of the remaining children were administered. A very small number of children (1%) were categorized as having likely IQ<70 based on paediatric assessment of global developmental delay. This reflects the clinical reality of neonatal follow-up, and thus may enhance the usefulness of the current findings regarding the Bayley-III. Our findings should not be taken to mean that Bayley-III scores at 21 months are sufficient for evaluating outcomes of Neonatal Intensive Care Unit intervention trials. Had we followed the children to age 5, when intellectual test findings are likely more stable, and had we included a measure of adaptive behaviour, then data on intellectual disability, an important outcome of prematurity, could have been obtained.

CONCLUSION

Findings of this study, with a large sample size of 520 children, indicate that neonatal follow-up clinic assessments of preterm infants using the Bayley-III can provide useful information for predicting 3-year intellectual outcomes from the 21-month assessment. The validity of the 21-month cognitive assessment is supported by the strength of the association with the 3-year classification of IQ<70, with an Area under the Curve statistic of 0.90 (considered to be excellent). By applying pragmatic decision rules that categorize Cognitive Composite scores <80 as indicating likely 3-year IQ scores in the Extremely Low range (<70), and Cognitive Composite scores <90 as indicating likely IQ scores in the Borderline range or lower (<80), the clinician is in a position to counsel parents and recommend interventions and further follow-up for those at greatest risk.
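The two decision rules can be written down as a simple lookup, which may be convenient for a clinic database or audit script. This is a sketch only: the function name and the label strings are hypothetical, and a flag indicates elevated risk to be followed up rather than a diagnosis.

```python
def flag_from_cognitive_composite(score: int) -> str:
    """Apply the two Cognitive Composite cut-offs to a 21-month score.

    The labels are illustrative wording, not terms defined by the study.
    """
    if score < 80:
        return "flag: risk of 3-year IQ<70 (Extremely Low range)"
    if score < 90:
        return "flag: risk of 3-year IQ<80 (Borderline range or lower)"
    return "no flag"


for example_score in (75, 85, 95):
    print(example_score, "->", flag_from_cognitive_composite(example_score))
```

A score below 80 also satisfies the <90 rule; the function reports only the more severe category so that each child receives a single flag.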

Institution where work originated: Alberta Children’s Hospital

Ethics Board: Conjoint Health Research Ethics Board, University of Calgary

References

  • 1. Bayley N. Bayley Scales of Infant Development™ - Second Edition. San Antonio, TX: Harcourt Assessment, Inc, 1993.
  • 2. Bayley N. Bayley Scales of Infant and Toddler Development™ - Third Edition. San Antonio, TX: Harcourt Assessment, Inc, 2006.
  • 3. Jary S, Whitelaw A, Walløe L, Thoresen M. Comparison of Bayley-2 and Bayley-3 scores at 18 months in term infants following neonatal encephalopathy and therapeutic hypothermia. Dev Med Child Neurol 2013;55:1053–9.
  • 4. Acton BV, Biggs WS, Creighton DE, et al. Overestimating neurodevelopment using the Bayley-III after early complex cardiac surgery. Pediatrics 2011;128:e794–800.
  • 5. Moore T, Johnson S, Haider S, Hennessy E, Marlow N. Relationship between test scores using the second and third editions of the Bayley scales in extremely preterm children. J Pediatr 2012;160:553–8.
  • 6. Chinta S, Walker K, Halliday R, Loughran-Fowlds A, Badawi N. A comparison of the performance of healthy Australian 3-year-olds with the standardised norms of the Bayley scales of infant and toddler development (version-III). Arch Dis Child 2014;99:621–4.
  • 7. Anderson PJ, De Luca CR, Hutchinson E, Roberts G, Doyle LW; Victorian Infant Collaborative Group. Underestimation of developmental delay by the new Bayley-III scale. Arch Pediatr Adolesc Med 2010;164:352–6.
  • 8. Yu Y, Hsieh W, Hsu C, et al. A psychometric study of the Bayley scales of infant and toddler development-3rd edition for term and preterm Taiwanese infants. Res Dev Disabil 2013;34:3875–83.
  • 9. Vohr BR, Stephens BE, Higgins RD, et al.; Eunice Kennedy Shriver National Institute of Child Health and Human Development Neonatal Research Network. Are outcomes of extremely preterm infants improving? Impact of Bayley assessment on outcomes. J Pediatr 2012;161:222–8.e3.
  • 10. Johnson S, Moore T, Marlow N. Using the Bayley-III to assess neurodevelopmental delay: which cut-off should be used? Pediatr Res 2014;75:670–4.
  • 11. Lowe JR, Erickson SJ, Schrader R, Duncan AF. Comparison of the Bayley II mental developmental index and the Bayley III cognitive scale: Are we measuring the same thing? Acta Paediatr 2012;101:e55–8.
  • 12. Cromwell EA, Dube Q, Cole SR, et al. Validity of US norms for the Bayley scales of infant development-III in Malawian children. Eur J Paediatr Neurol 2014;18:223–30.
  • 13. Bode MM, D’Eugenio DB, Mettleman BB, Gross SJ. Predictive validity of the Bayley, third edition at 2 years for intelligence quotient at 4 years in preterm infants. J Dev Behav Pediatr 2014;35:570–5.
  • 14. Wechsler D. Wechsler Preschool and Primary Scale of Intelligence - Third Edition. San Antonio, TX: The Psychological Corporation, 2002.
  • 15. Wechsler D. Wechsler Preschool and Primary Scale of Intelligence - Fourth Edition. San Antonio, TX: The Psychological Corporation, 2012.
  • 16. Roid GH, Miller LJ. Leiter International Performance Scale-Revised. Wood Dale, IL: Stoelting, 1997.
  • 17. Roid GH, Miller LJ, Pomplun M, Koch C. Leiter International Performance Scale-3rd Edition. Wood Dale, IL: Stoelting, 2013.
  • 18. Wilson-Ching M, Pascoe L, Doyle LW, Anderson PJ. Effects of correcting for prematurity on cognitive test scores in childhood. J Paediatr Child Health 2014;50:182–8.
