Developmental Science. 2025 Sep 5;28(6):e70072. doi: 10.1111/desc.70072

Priorities for New Data Collection

Brian MacWhinney 1, Catherine Snow 2
PMCID: PMC12415499  PMID: 40910415

Schaff, Loukatou, Cristia, and Havron (SLC&H) have contributed a fascinating and important analysis of the demographic characteristics of the child language data currently available in the CHILDES database. They were able to supplement information already on the web by soliciting further specifics from many of the original data contributors. They have identified biases in the representation of urbanization, family structure, SES, languages studied, countries represented, and multilingualism. These biases in the availability of data from rural, non‐Western, low‐education participants speaking non‐Indo‐European languages raise concerns when drawing conclusions about universality of phenomena, echoing widespread worries within psychology, sociology, and education about the dominance in research studies of data gathered only from WEIRD (Western, educated, industrialized, rich, and democratic) populations (Henrich et al. 2010).

Child language data had an even more extreme bias in the 1970s, when the bulk of our transcript data came from typically developing children of English‐speaking academics, often in the northeastern United States. Since then, the coverage has broadened greatly to include data from 48 languages, variations in SES, and a rich collection of types of multilingualism. Despite this growth in coverage, the database can never be truly representative of all the patterns of variation in the 2.2 billion children on the planet, because the practical obstacles to fully representative coverage are too great. Even with improvements in recording technology (LENA), automatic speech recognition (Liu et al. 2023), natural language processing (Liu and MacWhinney 2024), GenAI (Warstadt and Bowman 2022), and corpus linguistics (Baayen 2010), the collection and analysis of child language samples remain a daunting task. Barriers to data collection include privacy restrictions, researchers who are unwilling to share their data, restrictive IRB policies, lack of recognition for corpus work, logistical problems in rural areas, the need to rely on translators, and scarcity of research support. Given these limitations, the goal of eliminating the gaps so as to produce a fully balanced representation seems unattainable, at least in the near term.

Fortunately, we can make productive use of the gaps and biases identified by SLC&H to guide our research. We can do this by focusing on the contrasts between universals and variation in language acquisition. This line of research begins by first proposing some universal and then collecting data that could falsify the universal. For example, SLC&H point to studies evaluating the universality of the noun bias, late passive acquisition, reduced parental input in rural communities, variations in gesture typology, or the effects of early bilingualism. In each of these areas, a universal is proposed based on evidence from current corpora, and then further data is collected that either confirms or falsifies the universal.

Consider the case of the noun bias described by Gentner (2006). Studies based on samples such as the three children in Brown (1973) do indeed show an early noun bias for the English of children of educated parents in the Boston area when sampled during interviews recorded by graduate students. However, as shown by Sugárné (1970) for Hungarian, the use of verbs increases markedly and surpasses nouns when children are recorded on the playground. Moreover, as Ninio and Snow (1988) have shown, early vocabulary is rich in socially mediated terms that lie outside the noun‐verb contrast. When we turn to languages outside of Indo‐European, such as Chinese, Korean, or Mayan, we can see a reversal of the noun bias. Thus, both activity and language impact this feature of early vocabularies, suggesting that it may be important to explore the further effects of activity types as well as urbanization, SES, and birth order on this pattern.

To cite another example, Gleason and colleagues (Gleason and Ely 1997; Gleason and Greif 1983), using data in CHILDES, compared the lexicon used in interactions with mothers, with fathers, and over the dinner table and found a great amount of non‐overlap between these situations. Lexical non‐overlap has also been documented for children learning two languages (Yip and Matthews 2007) that are used in very different settings. Although not included in this survey, language disabilities also have enormous and varied impacts on both the overall course and the details of language acquisition (Bishop 1997; Guendouzi et al. 2011).
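As a rough illustration of what lexical non‐overlap between situations means in practice, the following minimal Python sketch computes the share of word types that occur in only one of two samples. The function name `lexical_non_overlap`, the measure itself (symmetric difference over union of word types), and the toy utterances are assumptions made for this illustration; they are not the procedure used in the cited studies and are not drawn from any CHILDES corpus.

```python
def lexical_non_overlap(tokens_a, tokens_b):
    """Share of word types that occur in only one of the two samples.

    Hypothetical helper for illustration: 0.0 means the two samples use
    exactly the same word types, 1.0 means they share none.
    """
    types_a = {t.lower() for t in tokens_a}
    types_b = {t.lower() for t in tokens_b}
    union = types_a | types_b
    if not union:
        return 0.0
    # Types found in exactly one sample, relative to all types observed.
    return len(types_a ^ types_b) / len(union)


# Invented toy utterances standing in for two recording situations.
mother_talk = "do you want the ball put the ball in the box".split()
dinner_talk = "do you want more peas pass the bread please".split()

print(round(lexical_non_overlap(mother_talk, dinner_talk), 2))
```

A type‐based measure of this kind ignores word frequency and sample size, so any real comparison across situations would need samples of comparable length or a suitable normalization.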

We can also propose and test universals regarding language teaching methods. WEIRD parents rely on elaborations and recasts to promote children's learning (Sokolov 1993). However, Schieffelin (1985) found that Kaluli mothers relied instead on asking children to repeat phrases after them. Studies of non‐Western and rural cultures have shown that they can vary markedly in their use of praise, teasing, emotion terms, honorifics, and other routines. Even more extreme differences in parental output have been documented for groups such as the Navajo or Maya, in which direct parental input to young children is often minimal (Scollon 1976).

Examples of this type could be multiplied dozens of times. However, what is missing in these reports are the detailed transcriptions of real‐life interactions that would allow us to understand these patterns in greater detail. We have no shared transcript data from Kaluli, Mayan, Navajo, or Samoan that would allow us to track the effects of these variations in input. However, there are areas where such data does exist. For example, Gleason's recordings of mother, father, and dinner table talk are in CHILDES, and her published results on lexical non‐overlap can be traced in further detail, as can the Yip and Matthews recordings of their bilingual subjects. For SES and ethnic group contrasts, one can look at the transcripts and audio from the Harvard HSLLD (Home‐School Study of Language and Literacy Development) and a series of 12 papers analyzing these patterns. This gives us a rich picture of these contrasts in the Boston area, and we can then ask what the results of a similar study conducted in Marseille, Manchester, Mombasa, Mumbai, or Mannheim would be. Data from rural populations and special areas could be particularly informative. On this front, the representation of American, English‐speaking children growing up in rural families who are eligible by family income for Head Start will increase with the imminent release of transcripts from the Early Head Start Project (Pan et al. 2005). We can also study alternative patterns of language loss and maintenance as indigenous communities become increasingly linked to the global economy.

To maximize our ability to understand these patterns of variation or universality, we need to create language sampling protocols that allow for cross‐linguistic comparison. An example of such an effort is the Global Tales (https://talkbank.org/childes/access/GlobalTales/) project that asks children in the age range between 3 and 6 to tell stories about times when they were either happy, confused, angry, or proud, or when they had to deal with a situation that was either problematic or important. These same questions are being asked by researchers working with children from 25 countries and languages. The results so far demonstrate both variation and universality in the nature of the stories children tell. Most of the data collected so far is from middle‐class children in urban settings, and adding data from rural populations and across SES levels is an important goal. Other projects working on cross‐cultural and cross‐linguistic comparisons include Acquisition Sketch, LITMUS, LaCoLa, Frog Stories, and PLAY.

We can study universals and variation using comparisons across demographic variables. However, we also need to consider the role of individual variation in patterns of acquisition. For example, Peters (1977) contrasted children with precise articulation and those with “mush mouth”. Nelson (1973) contrasted referential and expressive children—a contrast that was then echoed in Bloom et al. (2001). Nelson (1981) further notes that children may shift from one acquisitional strategy to another across time. To examine strategies and processes in detail, Lieven, Tomasello, and colleagues collected densely sampled corpora for English, Finnish, and German. Using such data, they were able to show that, even on the level of argument structures for the English articles, acquisition is highly lexically specific, rather than driven by universal featural structures (Lieven et al. 1997).

Looking back across the 50‐plus years since the publication of Brown (1973), we can marvel at the growth in the availability of data on child language acquisition: from a set of transcripts from three children produced on mimeographed sheets to a world with data on thousands of children across 48 languages linked to terabytes of media. Of course, every glass in science is always half empty, and we are always striving for a fuller understanding, but it is heartening to know how much progress has been made. The careful work by SLC&H advances us still further by serving as a guide for new comparisons and by suggesting priorities for new data collection.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The authors have nothing to report.

Data Availability Statement

The data that support the findings of this study are openly available in CHILDES at https://childes.talkbank.org.

References

  1. Baayen, H. 2010. “Demythologizing the Word Frequency Effect.” The Mental Lexicon 5: 436–461.
  2. Bishop, D. 1997. Uncommon Understanding. Psychology Press.
  3. Bloom, L., Tinker E., and Scholnick E. K. 2001. “The Intentionality Model and Language Acquisition: Engagement, Effort, and the Essential Tension in Development.” Monographs of the Society for Research in Child Development 66, no. 4: i–101.
  4. Brown, R. 1973. A First Language: The Early Stages. Harvard University Press. 10.4159/harvard.9780674732469.
  5. Gentner, D. 2006. “Why Verbs Are Hard to Learn.” In Action Meets Word: How Children Learn Verbs, edited by Hirsh‐Pasek K. and Golinkoff R., 544–564. Cambridge University Press.
  6. Gleason, J. B., and Ely R. 1997. “Input and the Acquisition of Vocabulary: Examining the Parental Lexicon.” In Problem of Meaning: Behavioral and Cognitive Perspectives, edited by Mandell C. and McCabe A., 221–260. Elsevier.
  7. Gleason, J. B., and Greif E. 1983. “Men's Speech to Young Children.” In Language, Gender and Society, edited by Thorne B., Kramerae C., and Henley N.
  8. Guendouzi, J., Loncke F., and Williams M. J. 2011. The Handbook of Psycholinguistic and Cognitive Processes: Perspectives in Communication Disorders. Psychology Press.
  9. Henrich, J., Heine S., and Norenzayan A. 2010. “The Weirdest People in the World?” Behavioral and Brain Sciences 33: 61–135.
  10. Lieven, E. V., Pine J. M., and Baldwin G. 1997. “Lexically‐Based Learning and Early Grammatical Development.” Journal of Child Language 24, no. 1: 187–219.
  11. Liu, H., and MacWhinney B. 2024. “Morphosyntactic Analysis for CHILDES.” Language Development Research 4, no. 1: 233–258. 10.34842/j97r-n823.
  12. Liu, H., MacWhinney B., Fromm D., and Lanzi A. 2023. “Automation of Language Sample Analysis.” Journal of Speech, Language, and Hearing Research: JSLHR 66, no. 7: 2421–2433. 10.1044/2023_JSLHR-22-00642.
  13. Nelson, K. 1973. “Structure and Strategy in Learning How to Talk.” Monographs of the Society for Research in Child Development 38: 1–2.
  14. Nelson, K. 1981. “Individual Differences in Language Development: Implications for Development and Language.” Developmental Psychology 17: 170–187.
  15. Ninio, A., and Snow C. 1988. “Language Acquisition Through Language Use: The Functional Sources of Children's Early Utterances.” In Categories and Processes in Language Acquisition, edited by Levy Y., Schlesinger I., and Braine M., 11–30. Lawrence Erlbaum.
  16. Pan, B., Rowe M., Singer J., and Snow C. 2005. “Maternal Correlates of Growth in Toddler Vocabulary Production in Low‐Income Families.” Child Development 76: 763–782.
  17. Peters, A. M. 1977. “Language Learning Strategies: Does the Whole Equal the Sum of the Parts?” Language 53: 560–573.
  18. Schieffelin, B. 1985. “The Acquisition of Kaluli.” In The Crosslinguistic Study of Language Acquisition. Volume 1: The Data, edited by Slobin D. Lawrence Erlbaum Associates.
  19. Scollon, R. 1976. Conversations With a One Year Old: A Case Study of the Developmental Foundation of Syntax. University Press of Hawaii.
  20. Sokolov, J. L. 1993. “A Local Contingency Analysis of the Fine‐Tuning Hypothesis.” Developmental Psychology 29: 1008–1023.
  21. Sugárné, K. J. 1970. “A Szokincs és a Szófajok Gyakoriságának Alakulása 3‐6 Éves Gyermekek Beszédének Feladatmegoldás, Illetöleg Kommunikáció Során.” Altalános Nyelvészeti Tanulmányok 7: 149–159.
  22. Warstadt, A., and Bowman S. R. 2022. “What Artificial Neural Networks Can Tell Us About Human Language Acquisition.” In Algebraic Structures in Natural Language, 17–60. CRC Press.
  23. Yip, V., and Matthews S. 2007. The Bilingual Child: Early Development and Language Contact. Cambridge University Press. 10.1017/cbo9780511620744.

Articles from Developmental Science are provided here courtesy of Wiley
