Science Advances. 2025 Jul 2;11(27):eads6821. doi: 10.1126/sciadv.ads6821

Fast and robust visual object recognition in young children

Vladislav Ayzenberg 1,2,*, Sukran Bahar Sener 3, Kylee Novick 4, Stella F Lourenco 4
PMCID: PMC12219494  PMID: 40601725

Abstract

By adulthood, humans rapidly identify objects from sparse visual displays and across large disruptions to their appearance. What are the minimal conditions needed to achieve robust recognition abilities and when might these abilities develop? To answer these questions, we investigated the upper limits of children’s object recognition abilities. We found that children as young as 3 years successfully identified objects at speeds of 100 milliseconds (both forward and backward masked) under sparse and disrupted viewing conditions. By contrast, a range of computational models implemented with biologically informed properties or optimized for visual recognition did not reach child-level performance. Models only matched children if they received more object examples than children are capable of experiencing. These findings highlight the robustness of the human visual system in the absence of extensive experience and identify important developmental constraints for building biologically plausible machines.


The visual recognition abilities of preschool children rival those of state-of-the-art artificial intelligence models.

INTRODUCTION

Humans extract meaning rapidly from sparse, and often incomplete, visual information. By adulthood, participants identify objects presented as quickly as 100 ms (1) and they do so across large variations in the visible appearance of objects, such as different orientations (2) or partial occlusion (3). Adult participants also need very little information to infer an object’s identity. For instance, they readily recognize objects when properties such as color, texture, or other internal features are removed, leaving only an object outline (4). Moreover, adults maintain high accuracy even when the contours of the object outline are distorted or deleted (5). Thus, with little visual information, the adult visual system is able to quickly and accurately identify objects. What mechanisms support such robust recognition abilities and when do these abilities develop?

Uncovering the mechanisms needed to quickly recognize objects in challenging contexts is difficult when studying only an adult sample. When encountering objects under degraded conditions, adults may rely on dedicated neural processes that can compensate for missing object information, or they may simply draw on preexisting knowledge gained from years of visual experience with similar degraded scenarios. For instance, recurrent neural circuits in the ventral visual pathway—the primary pathway underlying visual recognition (6)—have been found to support recognition when objects are partially occluded or information is missing (7–9). In this context, the missing portions of an object are perceptually completed via feedback from higher-level areas or lateral input from other portions of the same visual area (7, 9). Alternatively, given their breadth of experience, adults may have simply encountered the degraded object under similar conditions previously. Extensive perceptual learning literature shows that prior exposure to similar objects or contexts improves recognition performance for objects presented under otherwise challenging conditions (10). Repeated visual exposure may even alter the response profile of neurons in the ventral pathway, leading to selectivity for commonly encountered objects or stimulus features (11, 12). In this view, processes such as recurrence may not be needed because observers have seen similar objects previously and recognition can be accomplished using only a feedforward pass through the ventral pathway.

One approach to help identify the minimal conditions necessary to accomplish object recognition is to study developmental populations (13–15). Specifically, young children provide an ideal subject pool for testing mechanistic questions about object recognition because they have limited visual experience (16, 17) and immature visual processing pathways (18, 19). This means that researchers can present stimuli in contexts that children are unlikely to have seen before and track children’s performance in relation to the maturity of their visual system. Together, these factors may allow researchers to isolate the processes that are necessary and sufficient for object recognition even under challenging conditions. Nevertheless, little is known about when, and how, such fast and robust recognition abilities develop in childhood.

The use of developmental populations, however, can be challenging for several reasons. One challenge is that children are rarely tested under the same conditions as adults (13, 20), making direct comparisons between populations difficult. Specifically, object recognition in human adults, as well as adult nonhuman primates, is typically tested by presenting objects in a speeded task (100- to 300-ms presentation) and visually masking the stimuli (6, 21–23). By requiring participants to respond quickly and interfering with subsequent visual processing using a mask, this design reduces participants’ ability to draw on high-level inferential processes and, therefore, better isolates the visual processes related to the most recently seen stimulus (6). The development of this standardized approach has been crucial in enabling comparisons between humans and other model organisms (both biological and synthetic), thereby providing an important method by which to explore the mechanisms underlying visual perception (21, 24). However, because children’s capacity to follow instructions or stay attentive for long experimental sessions is limited, they are rarely tested under such challenging constraints. The lack of a standardized approach has resulted in unresolved debates regarding the robustness of children’s visual recognition abilities (25). For instance, research with infants suggests that the processes underlying visual recognition abilities, such as global form perception (26, 27) and perceptual completion (28), arise early in development [for review, see (29)]. In contrast, studies with older children suggest that these same processes may not develop until adolescence (30, 31). However, because these studies have largely relied on either indirect measures (e.g., looking time) or especially artificial experimental procedures (e.g., aligning Gabor wavelets), they are difficult to compare across populations and may not provide a direct or accurate measure of children’s abilities.

In the current study, we sought to determine the upper bound of young children’s abilities using a challenging object recognition task similar to the standardized approach typically used with adults (see Fig. 1). We did this by requiring 3- to 5-year-old participants to identify rapidly presented (100 to 300 ms) two-dimensional outlines of common objects (both forward and backward masked) under different stimulus manipulations. Specifically, we included objects that had complete, undisrupted contours, as well as objects with perturbed or deleted contours. These stimulus conditions were carefully selected to test the presence of different mechanisms in children. Complete contour objects are those that children of this age likely have extensive familiarity with via picture books and that adults can identify via a feedforward pass through the ventral pathway (32). This condition has the added benefit of providing a baseline for children’s performance in a challenging task. By contrast, perturbed contour objects disrupt the appearance of familiar local visual features, require global form perception, and are challenging for even sophisticated computer vision models (33, 34). Similarly, deleted contour objects are likely to be unfamiliar to children and, in adults, require perceptual completion as well as recurrent processing within the ventral pathway (32). As mentioned previously, these two processes—global form perception and perceptual completion—have historically been thought to develop late in childhood (30, 31, 35), but this claim has primarily been tested via indirect methods.

Fig. 1. Stimuli and human testing procedure.


(A) Children and adults were tested with object outlines that had either complete, perturbed, or deleted contours. (B) On each trial, participants were presented with an object image rapidly (100- to 300-ms duration), which was both forward and backward masked. In the prompt phase, child participants were asked to verbally indicate which object they saw among two possibilities (read by an experimenter). Adult participants responded by pressing an arrow key that corresponded to each object label.

To further explore what kinds of processes might be necessary to accomplish robust object recognition in childhood, we also compared children to a range of deep neural network models (DNNs; see Table 1). DNNs are computational models that can be trained to accomplish various visual perception tasks and provide researchers with a well-controlled test bed by which to evaluate theories of visual processing (36) and their development (37, 38). For instance, DNNs with recurrent architectures show stronger performance on challenging visual tasks compared to feedforward ones (39), and they have internal representations that are closely aligned with the adult primate ventral pathway (40–42). Other work shows that DNNs trained with naturalistic visual experience and learning objectives are sufficient to recapitulate many human-like visual biases (43–45), and they show human-like performance on a range of object recognition tasks (46, 47). As with human adults, increasing the variability of DNNs’ visual experience to include degraded stimuli also improves their recognition performance for objects presented under challenging conditions (43, 48).

Table 1. Model description.

All models tested in the current study. See Materials and Methods for a detailed description of each model.

Name | Architecture | Learning objective | Training data | Dataset description
VoneNet_FF-EcoSet | V1 + Feedforward CORNet_S | Image classification | EcoSet | Ecologically valid images in 565 categories
VoneNet_FF-EcoSet-Style | V1 + Feedforward CORNet_S | Image classification | Stylized EcoSet | Stylized ecologically valid images in 565 categories
VoneNet_R-EcoSet | V1 + Recurrent CORNet_S | Image classification | EcoSet | Ecologically valid images in 565 categories
VoneNet_R-EcoSet-Style | V1 + Recurrent CORNet_S | Image classification | Stylized EcoSet | Stylized ecologically valid images in 565 categories
ResNext50-SAY | ResNext50_32x4d | Temporal self-supervision | SAYCam videos | Videos from child head-mounted cameras
CVCL | ResNext50_32x4d | Vision-language learning | SAYCam videos + text | Videos and language from child head-mounted cameras
ConvNext-ImageNet1k | ConvNext_Large | Image classification | ImageNet1k | 1.3 million curated photographs in 1000 classes
ViT-ImageNet1k | ViT_B_16 | Image classification | ImageNet1k | 1.3 million curated photographs in 1000 classes
ViT-DINOv2 | ViT_B_14 | Unsupervised global image prediction | LVD-142M | 142 million images from the internet and other datasets
ViT-CLIP | ViT_B_16 | Vision-language learning | CLIP data | 400 million image-text pairs from the internet
ResNet50-ImageNet1k | ResNet50 | Image classification | ImageNet1k | 1.3 million curated images in 1000 classes
ResNet50-ImageNet-Sketch | ResNet50 | Image classification | ImageNet1k + line drawings of ImageNet objects | 1.35 million curated images and line drawings in 1000 classes
ResNet50-ImageNet21k | ResNet50 | Image classification | ImageNet21k | 14 million curated images in 21,000 classes
ResNet50-DINO | ResNet50 | Unsupervised global image prediction | ImageNet1k | 1.3 million curated images in 1000 classes
ResNet50-YFCC15m | ResNet50 | Vision-language learning | YFCC15m | 15 million image-text pairs from the internet
ResNet50-CLIP | ResNet50 | Vision-language learning | CLIP data | 400 million image-text pairs from the internet

In the current study, we compared children to a series of biologically inspired DNNs that allowed us to explore what mechanisms support fast and robust visual recognition early in development. Specifically, we included feedforward and recurrent models with ventral-like architectures (VoneNet_FF and VoneNet_R) because they allowed us to test predictions related to object speed, neural development, and viewing condition. For instance, if rapidly presented (and masked) objects do not afford time for recurrent processing (32), then we may find that performance on fast stimulus durations is best matched by feedforward models, but performance on slower stimulus durations is best matched by recurrent models. Moreover, if young children have immature recurrent circuits (25, 49), then their recognition performance may be best matched by a DNN with a feedforward architecture across multiple durations, particularly under the deleted contour condition that was previously shown to require recurrence in adults (9, 39). We also included models with different training diets to explore what kinds of visual experiences may support robust recognition. Specifically, we included DNNs trained with realistic, but “curated,” images (EcoSet and ImageNet) (46), images manipulated to increase their variability (Stylized EcoSet and ImageNet-Sketch) (43), and naturalistic videos recorded from children (6 to 32 months of age) wearing head-mounted cameras (SAYCam) (45). If performance under perturbed and deleted stimulus conditions depends on visual experience with similarly degraded stimuli, then models explicitly trained with objects whose contours vary (e.g., Stylized EcoSet) or models that received such experience during natural viewing (e.g., SAYCam) may perform well under these conditions. We also included models trained with an unsupervised learning objective (e.g., ResNext50-SAY and DINOv2) or ones that emphasize vision-language alignment [e.g., CVCL (Child’s View for Contrastive Learning) and CLIP]. These models were included because prominent developmental hypotheses suggest that object learning occurs in an unsupervised manner in infancy (50, 51) and is further bootstrapped by language learning (47, 52). Last, to explore the performance gaps between children and current DNNs, we included models whose performance has been optimized for visual recognition via supervised [e.g., ConvNext and ViT (Vision Transformer)] or unsupervised (e.g., DINOv2) approaches. Collectively, these model comparisons were included to support an initial exploration of a broad range of developmental hypotheses. However, it is important to note that our hypotheses are neither exhaustive nor mutually exclusive. Yet, by comparing children to DNNs, researchers can better understand the gaps between humans and computational models (17, 20, 53, 54), as well as help identify models that can be used to inform developmental theory (55, 56).

RESULTS

Child performance

What are the upper limits of children’s visual recognition abilities? We addressed this question by presenting 3-, 4-, and 5-year-old children (N = 128) with a challenging recognition task where they were required to identify rapidly presented objects (100 to 300 ms) that were both forward and backward masked (see Fig. 1 and movie S1). Each child’s fastest stimulus duration was identified using a titration procedure, wherein stimulus durations decreased by 50 ms (to a minimum of 100 ms) for every three consecutively correct trials (see fig. S1 for survival curve analyses). Moreover, objects could be presented with either complete, perturbed, or deleted contours in a between-subjects design. See Materials and Methods for full design details.

Overall, we found that children’s performance was above chance (0.50) under all conditions, even when stimuli were presented at the fastest speed of 100 ms (all P < 0.001, all d > 0.65; see Fig. 2), suggesting that object recognition is fast and robust from a young age.

Fig. 2. Children’s performance for each condition.


Across age, children performed above chance for each condition at each duration. Error bars depict 95% confidence intervals. The dotted black line indicates chance performance (0.50).

Next, we examined whether there were differences in performance as a function of stimulus condition (complete, perturbed, or deleted), presentation time (durations: 100 to 300 ms), and child’s age. A repeated-measures analysis of covariance, with age as a covariate, revealed main effects of condition (F(2,122) = 16.62, P < 0.001, ηp² = 0.21), duration (F(4,488) = 2.64, P = 0.033, ηp² = 0.02), and age (F(1,122) = 28.25, P < 0.001, ηp² = 0.19). There were no significant two- or three-way interactions between the factors (all P > 0.314). Overall, participants’ performance was higher at slower durations than faster ones. Moreover, post hoc comparisons (Holm-Bonferroni corrected) revealed that performance was worse for the deleted condition than either complete or perturbed conditions (all P < 0.001, all d > 0.66; see Fig. 2). There was no difference between complete and perturbed conditions (P = 0.502).

When age is converted into a categorical variable (3, 4, and 5 years), we found that 3 year olds performed worse than 4 and 5 year olds (all P < 0.001; all d > 0.70), but there was no difference between 4 and 5 year olds’ performance (P = 0.136, d = 0.20; see Fig. 3). An additional comparison to adults revealed that adults performed better than children of every age and under each condition (all P < 0.001; all d > 0.78; see Fig. 3).

Fig. 3. Performance under each condition by age group.


(A) Under the complete condition, participants of all ages performed above chance, even at the fastest speeds. (B) Under the perturbed condition, 4 and 5 year olds performed above chance at all speeds, whereas 3 year olds were only above chance when durations were 200 ms and slower. (C) Under the deleted condition, 4 and 5 year olds performed above chance at all speeds, whereas 3 year olds only performed above chance at the slowest speeds (250 and 300 ms). Error bars depict 95% confidence intervals. The dotted black line indicates chance performance (0.50).

To provide a more detailed understanding of the development of object recognition, we examined performance separately for each age group and condition (see Fig. 3). We found that like adults, 4 and 5 year olds performed above chance under every condition and at all durations (all P < 0.038, all d > 0.63). By contrast, 3 year olds performed above chance at all durations of the complete condition (all P < 0.007, all d > 1.00) but only at speeds of 200 ms or slower under the perturbed condition (all P < 0.001, all d > 1.27) and 250 ms or slower under the deleted condition (all P < 0.007, all d > 0.98). The fact that children, especially in the youngest age group, performed worse under the deleted condition suggests that perceptual completion and recurrent processing in the ventral pathway may still be immature in preschool-aged children. Furthermore, 3 year olds’ poor performance at the fastest durations under the perturbed condition suggests that although young children have the capacity for global form perception, this capacity may not be fully mature at this age (30).

Together, these findings suggest that by 4 years of age, children rapidly extract meaning from sparse visual displays, including when there is missing information. Even 3-year-old children performed above chance under most conditions despite stimuli being presented as fast as 100 ms with both forward and backward masking. Nevertheless, children’s object recognition abilities are not fully mature, as even 5-year-old children performed worse than adults.

In a final analysis, we examined whether children’s impressive performance in this task could be explained by low-level shape properties, such as curvature statistics (i.e., curvilinear versus rectilinear features) or shape envelope similarity (i.e., elongated versus square-like). We computed the similarity of curvature statistics and shape envelope for each object pair and median split the data into high and low similarity trials (see Materials and Methods). To maximize statistical power, child performance was grouped into fast (mean of 100- and 150-ms durations) and slow (mean of 200- and 250-ms durations) stimulus durations.
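As an illustration of this analysis, the minimal Python sketch below median-splits trials by a low-level similarity score and compares accuracy between the two halves. The per-trial arrays `similarity` and `correct` are hypothetical stand-ins, not the study's data or analysis code.

```python
# Hedged sketch of the median-split analysis: pairwise similarity scores for a
# low-level shape property (e.g., curvature statistics) are split at the median
# into "high" and "low" similarity trials, and accuracy is compared between halves.
import numpy as np

rng = np.random.default_rng(0)
similarity = rng.random(300)              # hypothetical per-trial similarity scores
correct = rng.integers(0, 2, size=300)    # hypothetical 1 = correct, 0 = incorrect

high = similarity >= np.median(similarity)
acc_high = correct[high].mean()
acc_low = correct[~high].mean()
print(f"High-similarity accuracy: {acc_high:.2f}; low-similarity accuracy: {acc_low:.2f}")
```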

We did not find significant main effects of curvature or shape envelope similarity (all P > 0.139, ηp² = 0.04), nor any two-way interactions between low-level shape similarity (curvature or shape envelope) and age group (all P > 0.085, ηp² < 0.09), stimulus condition (all P > 0.338, ηp² = 0.04), or stimulus duration (all P > 0.057, ηp² = 0.06). However, because these comparisons exhibit small-to-medium effect sizes, it is possible that they would reach the threshold for statistical significance with greater power (see Fig. 4). Thus, although low-level shape properties may have contributed to children’s performance, they are unable to fully account for children’s accuracy across stimulus durations or conditions.

Fig. 4. Influence of low-level shape features.


Performance separated by (A and B) curvature and (C and D) shape envelope similarity across different [(A) and (C)] stimulus durations and [(B) and (D)] age groups. The black dotted line indicates chance performance (0.50). Error bars depict 95% confidence intervals.

Model comparisons

How does children’s performance compare to biologically inspired and performance-optimized DNNs? To answer this question, we compared children (and adults) to DNN models with different types of architectures (e.g., feedforward versus recurrent), visual experiences (e.g., curated versus variable naturalistic visual experience), and learning objectives (e.g., classification versus vision-language alignment), as well as DNNs optimized for visual recognition tasks (see Table 1). Participants’ performance in each age group (3, 4, and 5 years and adult) was split into fast (mean of 100 and 150 ms) and slow (mean of 200 and 250 ms) durations.

Overall, models performed above chance under every condition, regardless of their architecture, training experience, or learning objective (see Fig. 5 and tables S2 to S4). Furthermore, an analysis of model performance by layer revealed that performance was generally above chance by middle layers and best in the top layers of the models (see fig. S2).

Fig. 5. Model and human performance under each condition.


Performance of models and humans under the (top) complete, (middle) perturbed, and (bottom) deleted contour conditions. Human data for each age (red dotted lines: children; gray dotted lines: adults) were aggregated into fast (100 and 150 ms) and slow (200 and 250 ms) stimulus durations. Humans were compared to (A to C) biologically inspired (blue: ventral-like architecture; green: trained on child experience) and performance-optimized (orange: classification objective; violet: unsupervised and vision-language objective) models and (D to F) models selected to disambiguate between the contributions of training type, scale, and learning objective (yellow: classification objective; purple: vision-language objective). The y axis indicates classification accuracy. The black dotted line indicates chance performance (0.5). Error bars depict 95% confidence intervals for models. See fig. S3 and tables S2 to S4 for variability estimates and confidence intervals for human data.

Like humans, these models generally performed best under the complete condition and worst under the deleted condition (see fig. S3). However, whereas humans did not show a significant difference between complete and perturbed conditions, models generally performed worse under the perturbed condition (see fig. S3). Among biologically inspired models, recurrent models (VoneNet_R) generally outperformed feedforward models (VoneNet_FF), which is consistent with the hypothesis that recurrence may be crucial for robust object recognition. Training with variable visual experience (i.e., Stylized EcoSet) showed a benefit but only under the deleted condition. Models trained with videos from child head-mounted cameras (i.e., ResNext50-SAY) generally outperformed feedforward models but not recurrent models. Interestingly, VoneNet_R performance was as strong as, or stronger than, ConvNext-ImageNet1k and ViT-ImageNet1k, even though VoneNet models have fewer parameters (VoneNet_R: 55 million parameters; ConvNext: 198 million; ViT: 86 million). Furthermore, ResNext50-SAY, a model trained with videos from children’s perspective, matched the performance of ConvNext and ViT under the deleted condition and even outperformed these models under the perturbed condition. These findings suggest that biologically inspired architectures and experiences may lead to improvements in the performance and efficiency of DNNs.

Next, we compared the performance of biologically inspired models to children on the basis of overlapping confidence intervals (see fig. S3 for individual human variability). Under the complete condition, recurrent models (VoneNet_R) performed as well as 3- and 4-year-old children at fast stimulus durations, but not at slow stimulus durations, and they did not perform as well as 5-year-old children at any duration (see Fig. 5A and table S2). Under the perturbed condition, all biologically inspired models matched the performance of 3 year olds at the fast durations, but they did not match 3 year olds at the slow durations or any other age group (see Fig. 5B and table S3). Under the deleted condition, all biologically inspired models, except CVCL, performed as well as children at the fast durations (see Fig. 5C and table S4). With stylized training, VoneNet_R even surpassed 3-year-old children at the slow durations. However, these models performed worse than 4 and 5 year olds at slow durations (Fig. 5C). Nevertheless, the fact that 3-year-old performance under the deleted condition was matched by a feedforward model (VoneNet_FF), regardless of stimulus duration, suggests that recurrent connections may still be immature at this age.

Overall, however, neither the biologically inspired models nor the performance-optimized classification models (ConvNext-ImageNet1k and ViT-ImageNet1k) consistently matched children’s performance across conditions and stimulus durations. The only models to consistently match or surpass children were ViT-DINOv2, a performance-optimized vision model trained using an unsupervised learning objective, and ViT-CLIP, a vision-language model trained to associate images and text via a contrastive learning objective. These models even matched the performance of adults in some cases (Fig. 5, A to C).

Learning objective or scale?

What properties of ViT-DINOv2 and ViT-CLIP account for their strong visual recognition performance? One possibility is that an unsupervised learning objective or one that pairs linguistic information with images results in more robust visual representations. Unsupervised learning is often hypothesized to be critical for aligning the visual representations of primates and DNNs (50, 57). Moreover, classic developmental work has suggested that language bootstraps object recognition by guiding children’s attention to diagnostic object properties (58). However, the mediocre performance of ResNext50-SAY, an unsupervised model, and CVCL, a model trained to associate naturalistic video with linguistic information from children, suggests that neither unsupervised learning nor a vision-language objective on its own is sufficient for DNNs to match child performance.

An alternative possibility is that the success of these models is attributed to their extensive object experience. In contrast to the classification models, which were trained with ImageNet (1.3 million images) or EcoSet (1.5 million images), ViT-DINOv2 and ViT-CLIP were trained with datasets of 142 and 400 million images, respectively (see Table 1). Across multiple epochs of training, these models have been exposed to over a billion images each (ViT-DINOv2: 1.28 billion; ViT-CLIP: 12.8 billion). For comparison, if, from the moment of birth, a child were to see a new object every second of their life for six consecutive years, they would have seen only 189 million objects—less than half the number of object images in the CLIP dataset. Even if this estimate was increased 10-fold (i.e., one image every 100 ms without sleep), the oldest children in our sample would still have less than one-sixth of the total object experience of CLIP-trained models. Although this estimate is not intended to be a realistic accounting of human visual experience, it serves to illustrate the scale of model training when projected onto the timeline of the human lifespan. This back-of-the-envelope estimate ignores the fact that infants conservatively sleep an average of 14+ hours per day in the first year of life and 3 to 5 year olds sleep at least 10 hours per day (59).
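The arithmetic behind this illustration is simple enough to spell out; the short Python sketch below reproduces the numbers quoted above (one object per second over 6 years, the 10-fold version, and the comparison with total CLIP exposure). It is a deliberately crude illustration, not a model of human visual experience.

```python
# Back-of-the-envelope comparison: one new object per second of life vs. the total
# number of images seen by large-scale models across training (dataset size x epochs).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

child_6yr = 6 * SECONDS_PER_YEAR        # ~189 million "objects" at 1 per second, no sleep
child_6yr_10x = child_6yr * 10          # 1 per 100 ms, still no sleep

clip_total = 12.8e9                     # ViT-CLIP total training exposure (images)
dinov2_total = 1.28e9                   # ViT-DINOv2 total training exposure (images)

print(f"Child, 6 years @ 1 object/s:  {child_6yr / 1e6:.0f} million")
print(f"Child, 6 years @ 10 objects/s: {child_6yr_10x / 1e9:.2f} billion")
print(f"Fraction of CLIP exposure:     {child_6yr_10x / clip_total:.2f}")  # < 1/6
```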

To test whether the scale of object experience or the learning objective contributed to model performance, we tested six additional models with the same architecture. Specifically, we tested ResNet50 models that were pretrained on image datasets of comparable size (ImageNet21k: 14 million images; YFCC15m: 15 million images), with one using a standard object classification objective and dataset (ImageNet21k) and the other using a vision-language objective to align image and text embeddings (YFCC15m). We also tested a ResNet50 model trained on standard ImageNet1k using the unsupervised DINO procedure (ResNet50-DINO; 1.3 million images) (60). As baselines, we also included a ResNet50 model trained to classify images from the standard ImageNet1k dataset (1.3 million images) and one trained with the same procedure and dataset as CLIP (ResNet50-CLIP; 400 million images). Last, to explore whether specific object experience with line drawings contributed to the performance of large-scale models, we also tested a ResNet50 model fine-tuned on ImageNet-Sketch (ResNet50-ImageNet-Sketch), a dataset composed of 50,000 line drawings (see Materials and Methods).

Consistent with the prior analysis, we found that ResNet50-CLIP showed the strongest performance of the models tested, and it matched (or surpassed) children under almost every condition (see Fig. 5, D to F). The next best model across all conditions was ResNet50-ImageNet21k, which showed a large improvement over a base ResNet50 model trained on ImageNet1k. Moreover, ResNet50-ImageNet21k outperformed both DINO- and YFCC15m-trained ResNet50 models (see Fig. 5, D to F), suggesting that the training scale, and not the learning objective, led to performance improvements. Last, we found that a ResNet50 model trained on ImageNet-Sketch showed large improvements over a base ResNet50, particularly under the complete contour condition, where it even outperformed children in some cases. However, its performance was still lower than a model trained on ImageNet21k under perturbed and deleted conditions, as well as most children under the perturbed condition.

Last, to understand the importance of scale, we examined the relation between object experience and recognition performance across all models tested (Fig. 6). This analysis revealed strong Spearman rank-order correlations between a model’s classification accuracy and the total number of images it was exposed to (log-transformed) across training for every condition (complete: ρ = 0.83; perturbed: ρ = 0.78; deleted: ρ = 0.70), which is consistent with prior work on scaling laws in DNNs (61). To compare this scaling pattern to humans, we plotted data from children and adults using the conservative estimate described previously (one object every second of life without sleep). Human data were generally above and to the left of the model trend line, suggesting that humans develop robust visual recognition abilities with far less object experience than models. Thus, although certain kinds of experiences, such as experience with line drawings, can improve visual recognition abilities, our results suggest that scale remains the strongest driver of performance for DNNs but not humans.
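A minimal sketch of this scaling analysis is shown below. The accuracy and image-count values are placeholders rather than the study's data; the correlation is computed with SciPy's spearmanr on log-transformed exposure counts, as in the text.

```python
# Sketch: Spearman rank correlation between each model's total training-image
# exposure (log-transformed) and its classification accuracy on the task.
import numpy as np
from scipy.stats import spearmanr

total_images = np.array([1.3e6, 1.5e6, 14e6, 15e6, 142e6, 400e6])  # hypothetical exposures
accuracy = np.array([0.62, 0.64, 0.71, 0.68, 0.80, 0.85])          # hypothetical accuracies

rho, p = spearmanr(np.log10(total_images), accuracy)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```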

Fig. 6. Recognition performance as a function of experience.


Scatter plots showing the relation between classification accuracy on the y axis for (A) complete, (B) perturbed, and (C) deleted contour conditions and the total number of images each model was trained with on the x axis (log scale). Human estimates are plotted as stars, and their experience is conservatively estimated as seeing one object every second of their life without sleep.

DISCUSSION

In the current study, we sought to understand the development of robust object recognition in young children. Our results demonstrate that young children succeed at identifying objects from sparse visual displays at speeds as fast as 100 ms and even when the contours are disrupted. Direct comparisons to a range of computational models revealed that children generally outperformed both biologically inspired and performance-optimized DNNs. Only by exponentially increasing a model’s training data, beyond what children can realistically experience, did models match or surpass children’s performance. Together, these findings suggest that object recognition is fast and robust early in development, but gaps remain in our ability to approximate these processes with current computational models.

Visual recognition in young children

We found that by 4 years of age, children identified objects whose local visual features were disrupted via contour perturbations or deletions, even when presented as rapidly as 100 ms and when stimuli were forward and backward masked. Even 3-year-old children performed above chance, although they required somewhat slower presentation times when object contours were disrupted. These findings suggest that young children exhibit robust recognition in which they readily extract global form across variations in local features.

Our results stand in stark contrast to a collection of studies in which the findings suggested a protracted development of object recognition abilities (25). Specifically, these studies found that the ability to ignore local features, so as to represent global form, or to accomplish perceptual completion across disconnected contours remained difficult for children as old as 10 years of age (30, 31, 35). These findings have been mixed, however. Other studies with similar methods and age groups found the opposite, namely, that children prioritized global form over local features (62).

Our findings are instead consistent with a rich literature on younger children that suggests an earlier development of object recognition abilities. For instance, 2 year olds preferentially use shape information to categorize objects across changes in color and texture (63), and with linguistic scaffolding, young children generalize object identities to sparse caricature versions of common objects (52, 64). Furthermore, research with infants has shown that by 3 months of age, they identify novel objects from orientations not previously seen (65, 66), and older infants are capable of categorizing novel objects across variations in contours (26, 67). By the end of the first year of life, children can infer an object’s category from only its shape silhouette (68), suggesting that they do not rely on color, texture, or internal local features to recognize objects. Last, other work has shown that infants are capable of inferring the complete shape of an object even when it is partially occluded (69, 70). However, studies with infants cannot measure the speed of processing and they generally rely on indirect looking time measures to infer processes like recognition, which can be difficult to compare across different populations. Our work shows that by at least preschool age, children are capable of identifying rapidly presented objects from sparse and disrupted visual displays.

We would not suggest, however, that children’s visual recognition abilities are fully mature at preschool age. We generally found that children’s performance under all conditions and at all stimulus durations improved with age, with children performing worse than adults. The effect of age was most pronounced when contours were disrupted. Specifically, under the perturbed and deleted conditions, 3 year olds performed at chance at the fastest speeds, whereas by 4 years of age, children performed above chance at all speeds. This pattern cannot be explained by 3 year olds’ inability to conduct the task given that their performance did not differ from other children under the complete contour condition, and it was above chance at slower speeds under all conditions. Instead, given that the recognition of contour-deleted objects may require perceptual completion (30, 31) and recurrent processing (9, 39), the developmental change we observed may indicate underdeveloped recurrent connections in 3-year-old children. Moreover, 3 year olds’ performance in the fastest durations of the perturbed condition suggests that their global form perception abilities are still maturing. Yet, the fact that children generally performed worse than adults under all conditions and durations suggests that there may be other developmental factors not directly related to visual recognition, such as attention or executive control, that affect children’s performance. Thus, one interpretation of our results, which is consistent with the existing literature from both infants and older children, is that children have the capacity for robust object recognition early in development, but this ability does not become fully adult-like until adolescence (35).

Using children as a benchmark

A major goal of the current study was to compare children to computational models to explore what mechanisms might be needed for the development of visual recognition abilities and to assess the gaps between humans and machines. Such comparisons are crucial if DNNs are to be used as tools to explore the cognitive and neural mechanisms underlying visual recognition in humans (36, 71, 72). Although we found that factors like a biologically inspired recurrent architecture (VoneNet_R) and naturalistic or variable experience (Stylized EcoSet, SAYCam) allowed models to match or even outperform much larger DNNs optimized for image classification (e.g., ConvNext), these models did not consistently match children’s performance. Few of the hypotheses we described in the Introduction relating performance to different stimulus durations or conditions were borne out. The only exception was that the performance of 3-year-old children across all durations was well matched to a feedforward model, which provides computational evidence that recurrent circuits in humans may be immature at a young age.

The only models to consistently match, or surpass, children’s performance were those trained on especially large datasets (e.g., ImageNet21k and CLIP). Could children’s performance in the current study reflect prior experience with many objects? Our analyses, as well as existing research with children, suggest that this possibility is unlikely. First, as described in Results, even a conservative estimate of children’s object exposure reveals that they are unable to receive as much object experience as is available in the CLIP dataset until they are well into adolescence. Even if images were presented every 100 ms without sleep—a time frame shorter than the typical length of a fixation (150 to 300 ms) (73) or the latency of a saccade (~200 ms) (74)—it would be difficult for an 18-year-old adult to see as many images as a fully trained CLIP model. Moreover, models trained using the SAYCam dataset (ResNext50-SAY and CVCL), a dataset composed of head-mounted camera videos from children ages 6 months to 3 years (45), did not consistently match child performance across conditions even for the youngest ages.

However, as we discussed in Results, these estimates of visual experience are not intended to be realistic. Human vision does not have a clear frame rate, and humans often have multiple objects in view at once. Instead, these estimates are intended to contextualize model experience on the scale of the human lifespan. Whereas most model training regimens emphasize a large number of distinct object examples, the human lifespan does not afford such varied experience. Studies analyzing children’s visual experience from head-mounted cameras show that children are exposed to relatively few distinct objects. For instance, in the first year of life, just 10 common objects occupy approximately one-third of infants’ visual experience (16), with another large portion accounted for by three faces (75), hands (76), and visually simple architectural features of a scene (e.g., bright ceiling light) (77). Rather, children’s experience with these few objects is extensive such that they densely sample many views of the same few objects (78). Furthermore, it is unlikely that children’s performance in the current study can be explained by extensive experience with drawing or coloring books because children do not show the capacity to draw and color, beyond simple scribbles, until ~3 years of age (79). Thus, at best, these activities occupy only a small portion of children’s visual experience at the ages we tested. Moreover, our results showed that models trained using video data from children’s perspective (ResNext50-SAY and CVCL) (45) did not result in a consistent match to child performance across conditions even for the youngest ages, further suggesting that specific kinds of prior experience did not contribute to children’s performance. Like human children, SAYCam-trained models are presented with a continuous stream of visual information, which includes multiple objects.

In contrast to children, our results suggest that DNNs are far more reliant on specific visual experiences in their training data to succeed. For instance, fine-tuning models on ImageNet-Sketch, a dataset composed of 50,000 line drawings (80)—likely more than children could have experienced at this age—led to large improvements in model performance. However, these improvements were primarily under the complete contour condition where stimuli were most similar to the training data. Even models trained on Stylized EcoSet, a training regimen that disrupts local visual features and object contours (43), were largely surpassed by children under both the complete and perturbed conditions. Only models trained on especially large but opaque datasets (e.g., CLIP) matched or surpassed children, suggesting that their strong performance on our task was likely due to the presence of similar kinds of images in their training data. Indeed, CLIP-trained models show catastrophic recognition failures if small aspects of their training data are lesioned (81). These findings add to a body of work showing that current DNNs, regardless of architecture, training, or learning objective, show substantial deficits in generalizing to examples that are outside the specific distribution of their training experience (82, 83). Together, our findings contribute to the growing literature that illustrates how children develop sophisticated perceptual (17, 26, 53), linguistic (20), and other high-level cognitive abilities (84) from far less data than state-of-the-art machine learning models currently require.

Mechanisms of visual development

What, then, might account for the discrepancy between children and DNNs? One possibility is that we only tested recurrent models that implement local, within-layer, recurrence where information is processed by the same layer multiple times. By contrast, recurrent processing in the human ventral pathway occurs both within and between cortical areas, a process also known as top-down feedback (85, 86). Top-down feedback may be particularly important for perceptual completion because it exploits the larger-receptive fields of higher-level visual areas (9, 85). Moreover, it provides a mechanism by which prior knowledge from areas such as the prefrontal cortex can be used to make predictions about the content of ambiguous displays (87). Thus, top-down feedback offers both a perceptual mechanism by which to fill in missing information and a process by which to efficiently apply prior experience to the current scenario.

Incorporating continuous recurrent dynamics may also improve models’ capacity to approximate the temporal dynamics of human vision. Although we used feedforward and recurrent models to examine the mechanisms needed to recognize objects at different stimulus durations, this approach is limited because VoneNet_R uses a fixed number of recurrent loops and this type of recurrence cannot be used for the other models without altering their architecture and retraining them. Moreover, although researchers have identified methods to model temporal dynamics related to decision-making (e.g., reaction time) (88), there is currently no approach (to the best of our knowledge) for approximating stimulus duration in computational models without altering their architectures. This gap in temporal processing abilities is particularly glaring given that the spatiotemporal continuity of objects (i.e., motion) has long been identified as an important constraint for developing robust object representations (28, 69, 89, 90).

Another possibility is that the current DNNs only approximate the ventral visual pathway, which, on its own, may not be sufficient for robust object recognition. Specifically, accumulating evidence suggests that the ventral pathway, like DNNs, is most sensitive to local visual features (34, 91, 92), and it has difficulty representing a complete object form without input from the dorsal visual pathway and prefrontal cortex (86, 93–95). Dorsal visual areas develop early (19, 96, 97) and support the formation of global shape percepts by representing the spatial arrangement, rather than the appearance, of object parts (94, 98). In addition, as mentioned previously, prefrontal areas may be crucial for applying prior knowledge to solve visual recognition tasks (86), although these areas may develop later [(99); but see (100)]. Nevertheless, models that implement the processes of dorsal and prefrontal cortices may better approximate human recognition abilities across development.

A final possibility is that the kind of visual experience children receive, rather than the quantity, is crucial for robust object recognition. As mentioned previously, children are exposed to relatively few objects in the first year of life, but their experiences with these objects are far more variable and extensive (78). DNNs trained on children’s views of objects from head-cam videos perform better than those trained on the same objects using adult viewpoints (101), suggesting that child experience provides better training data than that of adults. Other work shows that building in constraints of the newborn visual system, such as infants’ initially blurry and low-saturation vision (44, 102), improves the robustness of DNNs. In addition, as mentioned previously, experience with smoothly moving objects is crucial for forming robust object representations (28, 69, 89, 90). These constraints, combined with potentially innate capacities to represent shape properties like three-dimensional structure (14, 65), symmetry (103), and other Gestalt cues (28), suggest that the human visual system is well optimized to form robust object representations. Although our data suggest that naturalistic visual experience alone is insufficient for DNNs to match child performance, integrating both biologically plausible architectures and naturalistic experiences may help close the gaps between humans and machines.

Toward developmentally inspired machine learning

In addition to incorporating biological and experiential constraints, a key condition for fostering a field of developmentally inspired machine learning is the collection of large high-quality child datasets (17, 53, 104). In the current study, we sought to provide a child benchmark by which to explore the developmental mechanisms needed to recognize objects at different speeds and viewing conditions. However, in addition to comparisons of overall performance, it is also important for datasets to support multivariate analyses that can elucidate whether children and DNNs solve tasks using similar approaches. For instance, a common measure of alignment between artificial and biological organisms is representational similarity analysis (RSA), in which the correspondence between two organisms is measured as the degree to which they show correlated patterns of similarity judgments across the object pairs in a set (105). In the context of the current study, RSA would have allowed us to examine whether children and DNNs exhibit similar error patterns for each object target-distractor pair (e.g., confusability between airplane and truck versus airplane and apple). However, because our study incorporated many experimental factors (5 durations × 3 conditions × 3 age groups) and children could only conduct a small number of trials per session (30 trials), there were insufficient data for every object pair across experimental factors to allow for RSA. Thus, although the approach used in the current study can shed light on what mechanisms are sufficient to achieve child-level performance, multivariate analyses can provide a more precise measure for exploring whether DNNs accomplish object recognition in a human-like way.
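For readers unfamiliar with the approach, the sketch below illustrates the core RSA computation under simplified assumptions: two hypothetical 30 × 30 representational dissimilarity matrices (RDMs), one per system, are compared by correlating their off-diagonal entries. The matrices here are random placeholders, not data from this study.

```python
# Illustrative RSA sketch: correlate the upper triangles of two RDMs (Spearman).
import numpy as np
from scipy.stats import spearmanr

def rdm_correlation(rdm_a: np.ndarray, rdm_b: np.ndarray) -> float:
    iu = np.triu_indices_from(rdm_a, k=1)     # off-diagonal upper-triangle entries
    rho, _ = spearmanr(rdm_a[iu], rdm_b[iu])
    return rho

# Hypothetical symmetric 30 x 30 RDMs over the 30 stimulus objects
rng = np.random.default_rng(1)
child_rdm = rng.random((30, 30)); child_rdm = (child_rdm + child_rdm.T) / 2
dnn_rdm = rng.random((30, 30)); dnn_rdm = (dnn_rdm + dnn_rdm.T) / 2
print(f"RDM correlation: {rdm_correlation(child_rdm, dnn_rdm):.2f}")
```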

Beyond behavior, a crucial source of insight into the mechanisms of visual recognition is neuroimaging data from children. Neural data can directly illuminate the mature and immature processes of the child visual system (18, 106, 107). Moreover, by combining child imaging data with DNN models, we can shed light on the underlying visual representations of developing brain areas (107–110). However, imaging awake children, particularly infants, is notoriously difficult (111). Computational modeling offers a powerful method by which to test initial hypotheses of neural development (38). These models can then be used to guide the design of otherwise costly and time-consuming neuroimaging experiments with children. In this view, DNNs can be “control reared,” much like biological organisms (38), to examine which architectures, experiences, or learning objectives lead to adult-like neural organization. Thus, we imagine a developmentally inspired field of machine learning where insights from children are used to improve the efficiency and performance of DNNs, and inversely, DNNs are used as tools to explore difficult-to-test questions in children.

MATERIALS AND METHODS

Participants

For human children, sample sizes and testing procedures were preregistered (https://aspredicted.org/Z6B_PF4) on the basis of pilot testing. We sought to test 135 children ages 3 through 5 years. Five additional children were included because they were scheduled before the target sample size was met. An additional 12 children were excluded from the analyses for not following instructions (n = 3), for failing to complete the task (n = 5), because of technical issues (n = 3), or because of a developmental disability (n = 1). The five additional children were thus retained to compensate for those who were excluded. In total, data from 128 children (mean age = 4.62 years, range = 3.05 to 5.95; 64 girls and 64 boys) were analyzed. Children were randomly assigned to one of three stimulus conditions (between-subjects: complete, perturbed, or deleted), with an equal number of children (n = 42) per condition. For comparison, we also recruited 30 adult participants (10 per condition) online. One participant did not finish the task, and thus, 29 adult participants were analyzed. Age and sex were not collected for adults. All participants were tested online through Children Helping Science (children) or Prolific (adults). All protocols (children: IRB no. 03571; adults: IRB no. 03388) were approved by the Institutional Review Board (IRB) of Emory University. The parent or legal guardian of each child provided written consent, and children provided verbal assent before participation. All adult participants provided informed consent.

Models

For computational models, we selected a combination of models on the basis of biologically inspired properties or performance on visual recognition tasks. See Table 1 for all models.

To examine the possible contributions of feedforward and recurrent neural circuits, we implemented shallow feedforward and recurrent VoneNet architectures (112). VoneNet is a convolutional DNN whose architecture is based on the primate ventral visual pathway, with distinct layers corresponding to V1, V2, V4, and IT. What further distinguishes VoneNet from other DNNs is that layer V1 directly simulates the organization of the primary visual cortex by implementing a fixed-weight front end composed of a Gabor filter bank. The inclusion of a biologically plausible front end was shown to improve the adversarial robustness of DNNs, as well as VoneNet’s match to the monkey ventral pathway (112). Feedforward or recurrent processing was implemented in each VoneNet architecture by manipulating the number of times a stimulus was passed through each layer on a single presentation (i.e., recurrent steps). In the feedforward architecture, each stimulus was processed only once by each layer. For the recurrent architecture, the number of recurrent steps for each layer (V2: 2; V4: 4; IT: 2) was selected on the basis of prior work (112).
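To make the notion of recurrent steps concrete, here is a conceptual PyTorch sketch, not the actual VoneNet/CORnet code: a block reprocesses its own output a fixed number of times per stimulus, with one step corresponding to a purely feedforward pass. Channel counts and layer contents are illustrative assumptions.

```python
# Conceptual sketch of within-layer recurrence controlled by a "steps" parameter.
import torch
import torch.nn as nn

class RecurrentBlock(nn.Module):
    def __init__(self, channels: int, steps: int = 1):
        super().__init__()
        self.steps = steps                               # steps = 1 -> feedforward
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = x
        for _ in range(self.steps):                      # repeated within-layer passes
            out = self.relu(self.conv(out) + x)          # skip connection to the input
        return out

# Recurrent steps per area reported above for the recurrent variant: V2 = 2, V4 = 4, IT = 2.
v2, v4, it = RecurrentBlock(64, 2), RecurrentBlock(64, 4), RecurrentBlock(64, 2)
x = torch.randn(1, 64, 56, 56)                           # dummy feature map
print(it(v4(v2(x))).shape)
```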

In addition, to explore the contributions of different kinds of visual experience, each VoneNet architecture was trained on either EcoSet, an ecologically valid stimulus set composed of 565 basic-level categories (46), or Stylized EcoSet, which uses style transfer techniques to randomly vary the color and texture of each EcoSet image (43). We specifically included Stylized EcoSet because prior work found that conventional DNNs are biased to use texture information, but training on style transfer images induces a more human-like shape bias in the models (43). All models were trained in a supervised manner to classify images into 565 categories. Models were trained in PyTorch using the procedure described by Dapello et al. (112). Thus, four biologically inspired models were trained for this study: feedforward models trained on either EcoSet or Stylized EcoSet and recurrent models trained on the same two datasets. The use of these models allowed us to explore two possible mechanisms underlying visual recognition: recurrent processing and variable visual experience.

We also tested two other models with biologically plausible properties. The first was a ResNext50 model trained with the temporal self-supervision method on the SAYCam dataset (45), which consists of naturalistic head-cam videos from three children (“S,” “A,” and “Y”); we refer to this model as ResNext50-SAY. Furthermore, to examine whether language may bootstrap visual learning, we tested the CVCL model, a contrastive ResNext50 model that was trained to associate linguistic utterances with co-occurring visual information from the head-cam videos of a single child from the SAYCam dataset (47). CVCL was included because it successfully uses biologically plausible training data and a learning objective to recapitulate the visual abilities of a young child.

To estimate the gaps (or lack thereof) between child and machine perception, we also tested several performance-optimized object classification models (as of March 2024). Specifically, we tested a ViT, which was trained on ImageNet using a transformer architecture (113), and ConvNext, which is a state-of-the-art convolutional DNN that was optimized using the best practices from the past decade of machine learning to maximize visual recognition performance on ImageNet (114). We also tested a high-performing model trained using an unsupervised learning regimen, namely ViT-DINOv2 (115), and ViT-CLIP, a vision-language model that uses the ViT architecture and learns to associate 400 million image-text pairs from the CLIP dataset using a contrastive learning objective (116).

Last, to examine the roles of learning objective and training set size in performance, we also tested six versions of ResNet50. Specifically, we tested a ResNet50 trained to classify images from standard ImageNet (1000 classes, 1.3 million images) (117) or ImageNet21k (21,000 classes, 14 million images) (118), as well as ResNet50 architectures trained to associate text-image pairs from YFCC15m (15 million images) or the CLIP training set (400 million images) (119), and a ResNet50 model trained on ImageNet1k using the unsupervised DINO training procedure (60). In addition, we tested a ResNet50 model fine-tuned on ImageNet-Sketch, a dataset of 50,000 line drawings (80). Fine-tuning was accomplished by taking a standard ResNet50 model pretrained on ImageNet1k and retraining the entire model for 30 epochs using the ImageNet-Sketch dataset. To prevent catastrophic forgetting of naturalistic stimulus features, the training dataset was augmented with 50,000 randomly selected images from ImageNet1k (equal number per category).
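A hedged sketch of this fine-tuning setup is given below. The dataset paths, optimizer, batch size, and preprocessing are illustrative assumptions; only the broad recipe (ImageNet1k-pretrained ResNet50, 30 epochs, ImageNet-Sketch mixed with 50,000 ImageNet1k images) follows the description above.

```python
# Sketch: fine-tune an ImageNet1k-pretrained ResNet50 on ImageNet-Sketch plus a
# 50,000-image ImageNet1k subset to limit catastrophic forgetting.
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
])

# Hypothetical folder layouts; both are assumed to use the same 1000-class structure.
sketch = datasets.ImageFolder("data/imagenet-sketch", transform=tfm)
natural = datasets.ImageFolder("data/imagenet1k-50k-subset", transform=tfm)
loader = DataLoader(ConcatDataset([sketch, natural]), batch_size=256, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(30):                      # 30 epochs, entire network retrained
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```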

Stimuli

Thirty unique objects (15 animate and 15 inanimate) were selected for the object recognition task created for this study. Objects were selected from the Snodgrass and Vanderwart (120) image set by sampling age-appropriate nouns from the Peabody Picture Vocabulary Test (PPVT-4); see table S1 for the list of objects used. Each object image was transformed into an outline and adapted into three stimulus conditions: complete, perturbed, and deleted contours. Stimuli with complete contours were adopted as is from Snodgrass and Vanderwart (120). Perturbed stimuli were created by applying a ripple distortion (amount: 200%; size: medium) to the complete-contour stimuli using image editing software (Photopea; www.photopea.com/). The deleted-contour stimuli were created by removing 50% of the contours from each complete stimulus using salience scores based on the medial axis of the object (121). To minimize visual discomfort during the testing session, stimulus contours were presented as white lines on a gray background (see Fig. 1).
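As an illustration of the deleted-contour manipulation, the sketch below removes half of the contour pixels of a white-on-gray outline image given a precomputed per-pixel salience map (the medial-axis-based scores of ref. 121 are assumed to be computed separately). Which half of the contour is retained is an illustrative assumption, not the authors' stimulus code.

```python
import numpy as np
from PIL import Image

def delete_half_contour(outline_path, salience, keep_fraction=0.5):
    """Delete 50% of contour pixels from a white-on-gray outline image.

    `salience` is assumed to be a float array (same shape as the image) of
    medial-axis-based salience scores computed elsewhere. Retaining the more
    salient half is an illustrative choice, not the authors' specification.
    """
    img = np.array(Image.open(outline_path).convert("L"), dtype=np.float32)
    contour = img > 200                                     # white contour pixels
    scores = np.where(contour, salience, -np.inf).ravel()

    n_keep = int(contour.sum() * keep_fraction)
    keep_idx = np.argpartition(scores, -n_keep)[-n_keep:]   # most salient half

    out = np.full(img.shape, 128, dtype=np.uint8)           # gray background
    keep_mask = np.zeros(scores.shape, dtype=bool)
    keep_mask[keep_idx] = True
    out[keep_mask.reshape(img.shape)] = 255                 # surviving contour
    return Image.fromarray(out)
```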

Human testing procedure

Each child was tested individually by an experienced experimenter over Zoom using a two-alternative forced-choice procedure. Trials began with a colorful fixation stimulus, which remained onscreen until children attended to it (as determined by the experimenter). The stimulus display was preceded by a forward mask and followed by a backward mask, after which the response prompt appeared (Fig. 1B). Masks were created by randomly overlaying object outlines across the image frame and then box scrambling the result.

During presentation of the response display, the experimenter asked the child whether they saw the target stimulus or a distractor (e.g., “Did you see an airplane or a truck?”; order randomized across trials). The distractor was always another object from the stimulus set with the same animacy (e.g., animate target paired with an animate distractor; randomly selected), ensuring that children could not rely on low-level stimulus features to identify the objects. The response phase was followed by feedback indicating whether the child responded correctly (green check) or incorrectly (gray square).

The duration of the stimulus display varied between 100 and 300 ms and was determined by a titrated procedure. At the beginning of each session, objects were presented for 300 ms. If children correctly identified objects on three consecutive trials, the duration decreased by 50 ms (to a minimum of 100 ms); if they incorrectly identified objects on three consecutive trials, the duration increased by 50 ms (to a maximum of 300 ms).
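The titration rule can be summarized in a few lines. The sketch below is a minimal illustration of the staircase described above, not the actual task software; the assumption that the streak counter resets after each adjustment is ours.

```python
class DurationTitration:
    """Staircase for stimulus duration: three consecutive correct trials ->
    50 ms faster (floor 100 ms); three consecutive incorrect trials ->
    50 ms slower (ceiling 300 ms). Sessions start at 300 ms."""

    def __init__(self, start=300, step=50, floor=100, ceiling=300):
        self.duration = start
        self.step, self.floor, self.ceiling = step, floor, ceiling
        self.correct_streak = 0
        self.error_streak = 0

    def update(self, correct):
        if correct:
            self.correct_streak += 1
            self.error_streak = 0
        else:
            self.error_streak += 1
            self.correct_streak = 0
        if self.correct_streak == 3:
            self.duration = max(self.floor, self.duration - self.step)
            self.correct_streak = 0   # assumption: streak resets after a change
        elif self.error_streak == 3:
            self.duration = min(self.ceiling, self.duration + self.step)
            self.error_streak = 0
        return self.duration
```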

Before testing, children were introduced to the task in a practice phase where they were required to identify slowly presented (1 s) colored photographs (e.g., shoes) using the same instructions as above. Adults were tested using an identical procedure, except that testing was conducted asynchronously without an experimenter present. Adults made their responses by pressing the arrow key that corresponded to either the target or distractor label.

Model testing procedures

We tested model recognition accuracy using the same stimulus set presented to children and a comparable two-alternative forced-choice procedure. In the object recognition task, children were required to determine which of two semantic labels corresponded to the target stimulus (e.g., “airplane” or “truck”). To accomplish this task, children must match the stored object representations associated with each label (e.g., examples of previously seen airplanes) to the stimulus image presented on that trial.

To approximate this process, we trained machine learning classifiers using the feature activations from the penultimate layer of each model on naturalistic images of each object and then tested them on each stimulus display in a pairwise fashion (e.g., train on photographs of airplanes and trucks; test on the perturbed airplane stimulus). For each stimulus category, we provided 500 images for training (15,000 images total). Training images were randomly selected from the EcoSet (46), ImageNet21k (118), and Open Images datasets. Visual inspection of the images showed that they comprised both photographic (i.e., real) and stylized (e.g., cartoons) examples, making them comparable to a child’s viewing experience. See fig. S4 for classification performance on the images used for training.
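To make the pairwise decoding procedure concrete, the sketch below extracts penultimate-layer features from a stand-in model (a ResNet50 with its classification head removed) and trains a linear classifier on photographs of two categories before testing on a single degraded stimulus. The file paths and the choice of logistic regression here are illustrative assumptions; the study applied the same logic to every model and classifier tested.

```python
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()])

# Stand-in backbone: ResNet50 with the readout removed so that the forward
# pass returns penultimate-layer (2048-d) feature activations.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def features(image_paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])
    return backbone(batch).numpy()

def pairwise_decode(train_a, train_b, test_stimulus):
    """Train on naturalistic photos of two categories (e.g., airplanes vs.
    trucks) and classify one degraded stimulus display (e.g., a perturbed airplane)."""
    X = np.vstack([features(train_a), features(train_b)])
    y = np.array([0] * len(train_a) + [1] * len(train_b))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.predict(features([test_stimulus]))[0]   # 0 = category A, 1 = category B
```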

Because a model’s performance can vary substantially depending on which classifier is used and how many examples are used to train it, we tested each model using six common classifiers (support vector machine, logistic regression, ridge regression, naïve Bayes, K-nearest neighbors, and nearest centroid). We specifically included the K-nearest neighbors and nearest centroid classifiers because these methods may approximate the use of exemplar and prototype representations in humans, respectively. Furthermore, we parametrically varied the number of images used for classifier training (5, 10, 25, 50, 100, 150, 200, 250, and 300). For each stimulus, training and testing were conducted using a 20-fold cross-validation regime (stratified k-fold). For each model, final comparisons with children were made by selecting the best-performing version of the model across all classifiers and training quantities.
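A sketch of the sweep over classifiers and training-set sizes is shown below, assuming `X`/`y` hold pooled penultimate-layer features and labels for a pair of categories and `X_test`/`y_test` hold features and labels for the corresponding degraded stimuli. The cross-validation bookkeeping is simplified (folds are capped at the number of available images per category), so treat it as illustrative rather than as the authors' analysis code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.svm import LinearSVC

CLASSIFIERS = {
    "svm": LinearSVC(),
    "logistic": LogisticRegression(max_iter=1000),
    "ridge": RidgeClassifier(),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "centroid": NearestCentroid(),
}
TRAIN_SIZES = [5, 10, 25, 50, 100, 150, 200, 250, 300]

def best_decoding_accuracy(X, y, X_test, y_test, n_splits=20, seed=0):
    """Return the best mean accuracy across classifiers and training-set sizes."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for n in TRAIN_SIZES:
        # Subsample n training images per category.
        idx = np.hstack([rng.choice(np.where(y == c)[0], n, replace=False)
                         for c in np.unique(y)])
        cv = StratifiedKFold(n_splits=min(n_splits, n), shuffle=True, random_state=seed)
        for clf in CLASSIFIERS.values():
            fold_accs = [
                clone(clf).fit(X[idx][tr], y[idx][tr]).score(X_test, y_test)
                for tr, _ in cv.split(X[idx], y[idx])
            ]
            best = max(best, float(np.mean(fold_accs)))
    return best
```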

For layer-wise analyses, feature activations were extracted from the final rectified linear unit (ReLU) or linear readout operation of each layer block of a model. For computational efficiency, layer-wise analyses were conducted using only 150 training images and the K-nearest neighbors (KNN) classifier.
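Layer-wise activations can be collected with forward hooks; the sketch below, again with a stand-in ResNet50, records the output of each residual block (each ends in a ReLU) for a placeholder image. The block names and dummy input are assumptions for illustration, not the readout points used for every model in the study.

```python
import torch
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
activations = {}

def save_to(name):
    def hook(module, inputs, output):
        # Flatten spatial maps into one feature vector per image.
        activations[name] = output.flatten(start_dim=1).detach()
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(backbone, name).register_forward_hook(save_to(name))

with torch.no_grad():
    backbone(torch.randn(1, 3, 224, 224))   # placeholder image

# activations["layer1"] ... activations["layer4"] can now be fed to the
# 150-image KNN decoding analysis described above.
```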

Statistical analyses

Preprocessing of child data

In the current task, we used a titrated procedure to determine the fastest speed at which children could identify images. However, this procedure meant that not all children contributed trials at every stimulus duration. Survival analyses revealed that 3-year-olds had a lower probability of reaching the fastest speeds in the perturbed and deleted conditions than 4- and 5-year-olds (but not in the complete condition; see fig. S1). However, including only data from children who reached each speed would likely overestimate children’s performance (particularly at the fastest speeds) because it would exclude participants who might otherwise have performed at chance. To address this issue, we imputed data for each missing trial using a conservative estimate that assumes children’s performance on those trials would have been below chance. For each participant with missing data, this value was computed by subtracting the standard error of the mean of their performance on existing trials from 0.50.

We chose to impute below-chance values because imputing 0.50 (i.e., chance) would inflate the test statistic such that, statistically, child accuracy could never fall below chance levels. That is, if the lower bound of children’s performance on any trial type were 0.50, then the lower bound of the confidence interval could never cross the chance line, and a one-sample t test comparing performance to chance would always yield a positive t value because no value would ever fall below 0.50. Thus, by imputing a value below chance, we strongly penalized the overall group estimate of performance for those trials, ensuring that we did not overestimate the strength of children’s performance. Nonetheless, qualitatively similar results were obtained when imputed values were set to 0.50.
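The imputation rule amounts to one line of arithmetic per child; the sketch below assumes trial-level accuracies coded as 0/1 (whether the standard error was computed over trial-level or condition-level accuracies is our assumption).

```python
import numpy as np

def imputed_value(existing_accuracies):
    """Conservative below-chance imputation for a child's missing trials:
    0.50 minus the standard error of the mean of their existing trials."""
    acc = np.asarray(existing_accuracies, dtype=float)
    sem = acc.std(ddof=1) / np.sqrt(len(acc))
    return 0.50 - sem

# Example: existing trial accuracies of [1, 0, 1, 1, 0, 1] give an SEM of ~0.21,
# so missing trials for this child would be imputed at ~0.29.
```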

Low-level shape similarity

To examine whether child performance could be explained by low-level shape features, such as curvature statistics or a coarse similarity between shape envelopes, we computed child recognition accuracy as a function of curvature and shape-envelope similarity. The curvature statistics of every object were computed using the normalized contour curvature model (122), and similarity was measured as the Pearson correlation between curvature values for each object pair. Shape envelopes were computed by calculating an elongation ratio for each object, namely the ratio of the object’s longest to shortest axis, and similarity was measured as the difference between elongation ratios for each object pair. Trials were then median split on the basis of curvature or shape-envelope similarity.
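The two similarity measures and the median split can be expressed as below. The curvature vectors are assumed to be the per-object outputs of the normalized contour curvature model (122), computed separately, and treating the elongation-ratio difference as an absolute value is our assumption.

```python
import numpy as np
from scipy.stats import pearsonr

def curvature_similarity(curv_a, curv_b):
    """Pearson correlation between two objects' contour curvature statistics."""
    return pearsonr(curv_a, curv_b)[0]

def envelope_similarity(elongation_a, elongation_b):
    """Shape-envelope (dis)similarity: difference in elongation ratios, where
    elongation = longest axis / shortest axis of the object."""
    return abs(elongation_a - elongation_b)

def median_split(trials, similarity_scores):
    """Split trials into low- and high-similarity halves at the median score."""
    scores = np.asarray(similarity_scores, dtype=float)
    cutoff = np.median(scores)
    low = [t for t, s in zip(trials, scores) if s <= cutoff]
    high = [t for t, s in zip(trials, scores) if s > cutoff]
    return low, high
```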

Model comparison

Humans and models were compared on the basis of overlapping 95% confidence intervals. Human data were split into four age groups (3-, 4-, and 5-year-olds and adults) and two speeds (fast: mean accuracy at 100 and 150 ms; slow: mean accuracy at 200 and 250 ms). Model performance was computed as the mean decoding accuracy across all pairwise object comparisons, using the best-performing version of each model across all classifiers and quantities of training examples. Confidence intervals for humans and models were computed using a bootstrapping procedure (10,000 resamples).
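A percentile bootstrap with 10,000 resamples yields the 95% confidence intervals used for these comparisons; a minimal sketch is shown below (the resampling unit, i.e., participant-level or pairwise accuracies, is our assumption).

```python
import numpy as np

def bootstrap_ci(accuracies, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean accuracy."""
    rng = np.random.default_rng(seed)
    acc = np.asarray(accuracies, dtype=float)
    means = np.array([
        rng.choice(acc, size=len(acc), replace=True).mean()
        for _ in range(n_resamples)
    ])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)
```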

Acknowledgments

Funding: This work was supported, in part, by University of Pennsylvania MindCORE and Data Driven Discovery Initiative (DDDI) fellowships awarded to V.A., as well as funding from the Emory University Research Council awarded to S.F.L.

Author contributions: V.A. and S.F.L. conceptualized the study, wrote and revised the original draft, acquired resources and funding for the study, and supervised the project. V.A. conducted the analyses and validation checks and created the visualizations. V.A. and S.B.S. programmed the task software. V.A., S.B.S., and K.N. collected data and conducted the investigation process.

Competing interests: The authors declare that they have no competing interests.

Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. The data, analysis code, and stimuli can be found online at https://doi.org/10.5281/zenodo.15513992 and at https://github.com/vayzenb/kornet.

Supplementary Materials

The PDF file includes:

Figs. S1 to S4

Tables S1 to S4

Legend for movie S1

sciadv.ads6821_sm.pdf (900.6KB, pdf)

Other Supplementary Material for this manuscript includes the following:

Movie S1

REFERENCES AND NOTES

  • 1.Grill-Spector K., Kanwisher N., Visual recognition: As soon as you know it is there, you know what it is. Psychol. Sci. 16, 152–160 (2005). [DOI] [PubMed] [Google Scholar]
  • 2.Biederman I., Bar M., One-shot viewpoint invariance in matching novel objects. Vision Res. 39, 2885–2899 (1999). [DOI] [PubMed] [Google Scholar]
  • 3.Murray R. F., Sekuler A. B., Bennett P. J., Time course of amodal completion revealed by a shape discrimination task. Psychon. Bull. Rev. 8, 713–720 (2001). [DOI] [PubMed] [Google Scholar]
  • 4.Wagemans J., De Winter J., de Beeck H. O., Ploeger A., Beckers T., Vanroose P., Identification of everyday objects on the basis of silhouette and outline versions. Perception 37, 207–244 (2008). [DOI] [PubMed] [Google Scholar]
  • 5.Biederman I., Cooper E. E., Priming contour-deleted images: Evidence for intermediate representations in visual object recognition. Cog. Psychol. 23, 393–419 (1991). [DOI] [PubMed] [Google Scholar]
  • 6.DiCarlo J. J., Zoccolan D., Rust N. C., How does the brain solve visual object recognition? Neuron 73, 415–434 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Lamme V. A. F., Roelfsema P. R., The distinct modes of vision offered by feedforward and recurrent processing. Trends Neurosci. 23, 571–579 (2000). [DOI] [PubMed] [Google Scholar]
  • 8.Kok P., Bains L. J., van Mourik T., Norris D. G., de Lange F. P., Selective activation of the deep layers of the human primary visual cortex by top-down feedback. Curr. Biol. 26, 371–376 (2016). [DOI] [PubMed] [Google Scholar]
  • 9.Wokke M. E., Vandenbroucke A. R., Scholte H. S., Lamme V. A., Confuse your illusion: Feedback to early visual cortex contributes to perceptual completion. Psychol. Sci. 24, 63–71 (2013). [DOI] [PubMed] [Google Scholar]
  • 10.Goldstone R. L., Perceptual learning. Annu. Rev. Psychol. 49, 585–612 (1998). [DOI] [PubMed] [Google Scholar]
  • 11.Arcaro M. J., Schade P. F., Livingstone M. S., Universal mechanisms and the development of the face network: What you see is what you get. Annu. Rev. Vision Sci. 5, 341–372 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Logothetis N. K., Pauls J., Poggio T., Shape representation in the inferior temporal cortex of monkeys. Curr. Biol. 5, 552–563 (1995). [DOI] [PubMed] [Google Scholar]
  • 13.Frank M. C., Baby steps in evaluating the capacities of large language models. Nat. Rev. Psychol. 2, 451–452 (2023). [Google Scholar]
  • 14.Wood J. N., Newborn chickens generate invariant object representations at the onset of visual object experience. Proc. Natl. Acad. Sci. U.S.A. 110, 14000–14005 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Benton D. T., The elusive “Developmental Mechanism”: What they are and how to study and test them. Dev. Rev. 65, 101034 (2022). [Google Scholar]
  • 16.Clerkin E. M., Hart E., Rehg J. M., Yu C., Smith L. B., Real-world visual statistics and infants’ first-learned object names. Philos. Trans. R. Soc. B. Biol. Sci. 372, 20160055 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Huber L. S., Geirhos R., Wichmann F. A., The developmental trajectory of object recognition robustness: Children are like small adults but unlike big deep neural networks. J. Vision 23, 1–30 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Nishimura M., Scherf K. S., Zachariou V., Tarr M. J., Behrmann M., Size precedes view: Developmental emergence of invariant object representations in lateral occipital complex. J. Cogn. Neurosci. 27, 474–491 (2015). [DOI] [PubMed] [Google Scholar]
  • 19.Ayzenberg V., Granovetter M. C., Robert S., Patterson C., Behrmann M., Differential functional reorganization of ventral and dorsal visual pathways following childhood hemispherectomy. J. Vis. 24, 416 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.A. W. M. Tan, S. Yu, B. Long, W. A. Ma, T. Murray, R. D. Silverman, J. D. Yeatman, M. C. Frank, DevBench: A multimodal developmental benchmark for language learning. arXiv:2406.10215 (2024). 10.48550/arXiv.2406.10215. [DOI]
  • 21.Rajalingham R., Issa E. B., Bashivan P., Kar K., Schmidt K., DiCarlo J. J., Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. J. Neurosci. 38, 7255–7269 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Wichmann F. A., Janssen D. H., Geirhos R., Aguilar G., Schütt H. H., Maertens M., Bethge M., Methods and measurements to compare men against machines. Electron. Image 2017, 36–45 (2017). [Google Scholar]
  • 23.Biederman I., Ju G., Surface versus edge-based determinants of visual recognition. Cog. Psychol. 20, 38–64 (1988). [DOI] [PubMed] [Google Scholar]
  • 24.Tarr M. J., Bülthoff H. H., Image-based object recognition in man, monkey and machine. Cognition 67, 1–20 (1998). [DOI] [PubMed] [Google Scholar]
  • 25.Nishimura M., Scherf S., Behrmann M., Development of object recognition in humans. F1000 Biol. Rep. 1, 56 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Ayzenberg V., Lourenco S., Perception of an object’s global shape is best described by a model of skeletal structure in human infants. eLife 11, e74943 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Rakison D. H., Butterworth G. E., Infants’ attention to object structure in early categorization. Dev. Psychol. 34, 1310–1325 (1998). [DOI] [PubMed] [Google Scholar]
  • 28.Valenza E., Leo I., Gava L., Simion F., Perceptual completion in newborn human infants. Child Dev. 77, 1810–1821 (2006). [DOI] [PubMed] [Google Scholar]
  • 29.Ayzenberg V., Behrmann M., Development of visual object recognition. Nat. Rev. Psychol. 3, 73–90 (2024). [Google Scholar]
  • 30.Scherf K. S., Behrmann M., Kimchi R., Luna B., Emergence of global shape processing continues through adolescence. Child Dev. 80, 162–177 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Kovács I., Kozma P., Fehér Á., Benedek G., Late maturation of visual spatial integration in humans. Proc. Natl. Acad. Sci. U.S.A. 96, 12204–12209 (1999). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Drewes J., Goren G., Zhu W., Elder J. H., Recurrent processing in the formation of shape percepts. J. Neurosci. 36, 185–192 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Baker N., Lu H., Erlikhman G., Kellman P. J., Deep convolutional networks do not classify based on global object shape. PLoS Comp. Biol. 14, e1006613 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Ayzenberg V., Behrmann M., Does the brain’s ventral visual pathway compute object shape? Trends Cog. Sci. 26, 1119–1132 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Kovács I., Human development of perceptual organization. Vision Res. 40, 1301–1310 (2000). [DOI] [PubMed] [Google Scholar]
  • 36.Doerig A., Sommers R. P., Seeliger K., Richards B., Ismael J., Lindsay G. W., Kording K. P., Konkle T., van Gerven M. A. J., Kriegeskorte N., Kietzmann T. C., The neuroconnectionist research programme. Nat. Rev. Neurosci. 24, 431–450 (2023). [DOI] [PubMed] [Google Scholar]
  • 37.Yermolayeva Y., Rakison D. H., Connectionist modeling of developmental changes in infancy: Approaches, challenges, and contributions. Psychol. Bull. 140, 224–255 (2014). [DOI] [PubMed] [Google Scholar]
  • 38.Wood J. N., Pandey L., Wood S. M., Digital twin studies for reverse engineering the origins of visual intelligence. Annu. Rev. Vision Sci. 10, 145–170 (2024). [DOI] [PubMed] [Google Scholar]
  • 39.Tang H., Schrimpf M., Lotter W., Moerman C., Paredes A., Ortega Caro J., Hardesty W., Cox D., Kreiman G., Recurrent computations for visual pattern completion. Proc. Natl. Acad. Sci. U.S.A. 115, 8835–8840 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Kar K., Kubilius J., Schmidt K., Issa E. B., DiCarlo J. J., Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior. Nat. Neurosci. 22, 974–983 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Kietzmann T. C., Spoerer C. J., Sörensen L. K., Cichy R. M., Hauk O., Kriegeskorte N., Recurrence is required to capture the representational dynamics of the human visual system. Proc. Natl. Acad. Sci. U.S.A. 116, 21854–21863 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.S. Thorat, G. Aldegheri, T. C. Kietzmann, Category-orthogonal object features guide information processing in recurrent neural networks trained for object categorization. arXiv:2111.07898 (2021). 10.48550/arXiv.2111.07898. [DOI]
  • 43.R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, W. Brendel, ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv:1811.12231 (2018). 10.48550/arXiv.1811.12231. [DOI]
  • 44.Vogelsang L., Gilad-Gutnick S., Ehrenberg E., Yonas A., Diamond S., Held R., Sinha P., Potential downside of high initial visual acuity. Proc. Natl. Acad. Sci. U.S.A. 115, 11333–11338 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Sullivan J., Gilad-Gutnick S., Mei M., Perfors A., Wojcik E. H., Frank M. C., Sinha P., SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind (Camb.) 5, 20–29 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Mehrer J., Spoerer C. J., Jones E. C., Kriegeskorte N., Kietzmann T. C., An ecologically motivated image dataset for deep learning yields better models of human vision. Proc. Natl. Acad. Sci. U.S.A. 118, e2011417118 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Vong W. K., Wang W., Orhan A. E., Lake B. M., Grounded language acquisition through the eyes and ears of a single child. Science 383, 504–511 (2024). [DOI] [PubMed] [Google Scholar]
  • 48.Jang H., Tong F., Improved modeling of human vision by incorporating robustness to blur in convolutional neural networks. Nat. Commun. 15, 1989 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Burkhalter A., Development of forward and feedback connections between areas V1 and V2 of human visual cortex. Cereb. Cortex 3, 476–487 (1993). [DOI] [PubMed] [Google Scholar]
  • 50.Zhuang C., Yan S., Nayebi A., Schrimpf M., Frank M. C., DiCarlo J. J., Yamins D. L., Unsupervised neural network models of the ventral visual stream. Proc. Natl. Acad. Sci. U.S.A. 118, e2014196118 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.O’Reilly R. C., Russin J. L., Zolfaghar M., Rohrlich J., Deep predictive learning in neocortex and pulvinar. J. Cogn. Neurosci. 33, 1158–1196 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Smith L. B., Learning to recognize objects. Psychol. Sci. 14, 244–250 (2003). [DOI] [PubMed] [Google Scholar]
  • 53.S. Sheybani, L. B. Smith, Z. Tiganj, S. S. Maini, A. Dendukuri, “ModelVsBaby: A developmentally motivated benchmark of out-of-distribution object recognition” (OSF, 2024).
  • 54.Zaadnoordijk L., Besold T. R., Cusack R., Lessons from infant learning for unsupervised machine learning. Nat. Mach. Intell. 4, 510–520 (2022). [Google Scholar]
  • 55.Smith L. B., Slone L. K., A developmental approach to machine learning? Front. Psychol. 8, 2124 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Cusack R., Ranzato M. A., Charvet C. J., Helpless infants are learning a foundation model. Trends Cog. Sci. 28, 726–738 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Konkle T., Alvarez G. A., A self-supervised domain-general learning framework for human ventral stream representation. Nat. Commun. 13, 491 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Smith L. B., Jones S. S., Landau B., Gershkoff-Stowe L., Samuelson L., Object name learning provides on-the-job training for attention. Psychol. Sci. 13, 13–19 (2002). [DOI] [PubMed] [Google Scholar]
  • 59.Paruthi S., Brooks L. J., D’Ambrosio C., Hall W. A., Kotagal S., Lloyd R. M., Malow B. A., Maski K., Nichols C., Quan S. F., Rosen C. L., Troester M. M., Wise M. S., Recommended amount of sleep for pediatric populations: A consensus statement of the American Academy of Sleep Medicine. J. Clin. Sleep Med. 12, 785–786 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (IEEE, 2021), pp. 9650–9660. [Google Scholar]
  • 61.Bahri Y., Dyer E., Kaplan J., Lee J., Sharma U., Explaining neural scaling laws. Proc. Natl. Acad. Sci. U.S.A. 121, e2311878121 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Wakui E., Jüttner M., Petters D., Kaur S., Hummel J. E., Davidoff J., Earlier development of analytical than holistic object recognition in adolescence. PLOS ONE 8, e61041 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.S. O. Abdelrahim, M. C. Frank, “Examining the robustness and generalizability of the shape bias: A meta-analysis,” in Proceedings of the Annual Meeting of the Cognitive Science Society (UC Merced, 2024), vol. 46.
  • 64.Smith L. B., From fragments to geometric shape: Changes in visual object recognition between 18 and 24 months. Curr. Dir. Psychol. Sci. 18, 290–294 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Slater A., Morison V., Shape constancy and slant perception at birth. Perception 14, 337–344 (1985). [DOI] [PubMed] [Google Scholar]
  • 66.Mash C., Arterberry M. E., Bornstein M. H., Mechanisms of visual object recognition in infancy: Five-month-olds generalize beyond the interpolation of familiar views. Infancy 12, 31–43 (2007). [DOI] [PubMed] [Google Scholar]
  • 67.Mareschal D., Quinn P. C., Categorization in infancy. Trends Cog. Sci. 5, 443–450 (2001). [DOI] [PubMed] [Google Scholar]
  • 68.Quinn P. C., Eimas P. D., Tarr M. J., Perceptual categorization of cat and dog silhouettes by 3-to 4-month-old infants. J. Exp. Child Psychol. 79, 78–94 (2001). [DOI] [PubMed] [Google Scholar]
  • 69.Kellman P. J., Spelke E. S., Perception of partly occluded objects in infancy. Cog. Psychol. 15, 483–524 (1983). [DOI] [PubMed] [Google Scholar]
  • 70.Johnson S. P., Aslin R. N., Perception of object unity in 2-month-old infants. Dev. Psychol. 31, 739–745 (1995). [Google Scholar]
  • 71.Schrimpf M., Kubilius J., Lee M. J., Ratan Murty N. A., Ajemian R., DiCarlo J. J., Integrative benchmarking to advance neurally mechanistic models of human intelligence. Neuron 108, 413–423 (2020). [DOI] [PubMed] [Google Scholar]
  • 72.Bowers J. S., Malhotra G., Dujmović M., Montero M. L., Tsvetkov C., Biscione V., Puebla G., Adolfi F., Hummel J. E., Heaton R. F., Deep problems with neural network models of human vision. Behav. Brain Sci. 46, e385 (2023). [DOI] [PubMed] [Google Scholar]
  • 73.T. Tullis, B. Albert, in Measuring the User Experience (Second Edition), T. Tullis, B. Albert, Eds. (Morgan Kaufmann, 2013), pp. 163–186.
  • 74.R. H. Carpenter, Movements of the Eyes, 2nd Rev. (Pion Limited, 1988).
  • 75.Jayaraman S., Smith L. B., Faces in early visual environments are persistent not just frequent. Vision Res. 157, 213–221 (2019). [DOI] [PubMed] [Google Scholar]
  • 76.Long B. L., Kachergis G., Agrawal K., Frank M. C., A longitudinal analysis of the social information in infants’ naturalistic visual experience using automated detections. Dev. Psychol. 58, 2211–2229 (2022). [DOI] [PubMed] [Google Scholar]
  • 77.Anderson E. M., Candy T. R., Gold J. M., Smith L. B., An edge-simplicity bias in the visual input to young infants. Sci. Adv. 10, eadj8571 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Smith L. B., Jayaraman S., Clerkin E., Yu C., The developing infant creates a curriculum for statistical learning. Trends Cog. Sci. 22, 325–336 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.C. J. Dunst, E. Gorman, “Development of infant and toddler mark making and scribbling” in Centre for Early Learning Literacy Review, vol. 2 (Cultural Learning Alliance, 2009).
  • 80.Wang H., Ge S., Lipton Z., Xing E. P., Learning robust global representations by penalizing local predictive power. Adv. Neural Inf. Process. Syst. 32, 7834–7844 (2019). [Google Scholar]
  • 81.P. Mayilvahanan, R. S. Zimmermann, T. Wiedemer, E. Rusak, A. Juhos, M. Bethge, W. Brendel, In search of forgotten domain generalization. arXiv:2410.08258 (2024). 10.48550/arXiv.2410.08258. [DOI]
  • 82.S. Madan, T. Sasaki, T.-M. Li, X. Boix, H. Pfister, Small in-distribution changes in 3D perspective and lighting fool both CNNs and transformers. arXiv:2106.16198 (2021).
  • 83.Geirhos R., Jacobsen J.-H., Michaelis C., Zemel R., Brendel W., Bethge M., Wichmann F. A., Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673 (2020). [Google Scholar]
  • 84.E. Yiu, M. Qraitem, C. Wong, A. N. Majhi, Y. Bai, S. Ginosar, A. Gopnik, K. Saenko, KiVA: Kid-inspired visual analogies for testing large multimodal models. arXiv:2407.17773 (2024). 10.48550/arXiv.2407.17773. [DOI]
  • 85.Lee T. S., Mumford D., Romero R., Lamme V. A., The role of the primary visual cortex in higher level vision. Vision Res. 38, 2429–2454 (1998). [DOI] [PubMed] [Google Scholar]
  • 86.Bar M., Kassam K. S., Ghuman A. S., Boshyan J., Schmid A. M., Dale A. M., Hämäläinen M. S., Marinkovic K., Schacter D. L., Rosen B. R., Halgren E., Top-down facilitation of visual recognition. Proc. Natl. Acad. Sci. U.S.A. 103, 449–454 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Hardstone R., Zhu M., Flinker A., Melloni L., Devore S., Friedman D., Dugan P., Doyle W. K., Devinsky O., He B. J., Long-term priors influence visual perception through recruitment of long-range feedback. Nat. Commun. 12, 6288 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.Y.-A. Cheng, I. F. Rodriguez, S. Chen, K. Kar, T. Watanabe, T. Serre, RTify: Aligning deep neural networks with human behavioral decisions. arXiv:2411.03630 (2024). 10.48550/arXiv.2411.03630. [DOI]
  • 89.Wood J. N., Wood S. M., The development of invariant object recognition requires visual experience with temporally smooth objects. Cog. Sci. 42, 1391–1406 (2018). [DOI] [PubMed] [Google Scholar]
  • 90.Kellman P. J., Short K. R., Development of three-dimensional form perception. J. Exp. Psychol. Hum. Perception Performance 13, 545–557 (1987). [DOI] [PubMed] [Google Scholar]
  • 91.Jagadeesh A. V., Gardner J., Texture-like representation of objects in human visual cortex. Proc. Natl. Acad. Sci. U.S.A. 119, e2115302119 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Wang B., Ponce C. R., Tuning landscapes of the ventral stream. Cell Rep. 41, 111595 (2022). [DOI] [PubMed] [Google Scholar]
  • 93.Ayzenberg V., Simmons C., Behrmann M., Temporal asymmetries and interactions between dorsal and ventral visual pathways during object recognition. Cereb. Cortex Comms. 4, tgad003 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Romei V., Driver J., Schyns P. G., Thut G., Rhythmic TMS over parietal cortex links distinct brain frequencies to global versus local visual processing. Curr. Biol. 21, 334–337 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Freud E., Behrmann M., Snow J. C., What does dorsal cortex contribute to perception? Open Mind 4, 40–56 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Bourne J. A., Rosa M. G., Hierarchical development of the primate visual cortex, as revealed by neurofilament immunoreactivity: Early maturation of the middle temporal area (MT). Cereb. Cortex 16, 405–414 (2006). [DOI] [PubMed] [Google Scholar]
  • 97.Ayzenberg V., Song C., Arcaro M. J., An intrinsic hierarchical, retinotopic organization of visual pulvinar connectivity in the human neonate. Curr. Biol. 35, 300–314.e5 (2025). [DOI] [PubMed] [Google Scholar]
  • 98.Ayzenberg V., Behrmann M., The dorsal visual pathway represents object-centered spatial relations for object recognition. J. Neurosci. 42, 4693–4710 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Kolk S. M., Rakic P., Development of prefrontal cortex. Neuropsychopharmacology 47, 41–57 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Ellis C. T., Skalaban L. J., Yates T. S., Turk-Browne N. B., Attention recruits frontal cortex in human infants. Proc. Natl. Acad. Sci. U.S.A. 118, e2021474118 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.S. Bambach, D. J. Crandall, L. B. Smith, C. Yu, “An egocentric perspective on active vision and visual object learning in toddlers,” in 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob) (IEEE, 2017), pp. 290–295. [Google Scholar]
  • 102.Vogelsang M., Vogelsang L., Gupta P., Gandhi T. K., Shah P., Swami P., Gilad-Gutnick S., Ben-Ami S., Diamond S., Ganesh S., Sinha P., Impact of early visual experience on later usage of color cues. Science 384, 907–912 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Cassia V. M., Valenza E., Simion F., Leo I., Congruency as a nonspecific perceptual property contributing to newborns’ face preference. Child Dev. 79, 807–820 (2008). [DOI] [PubMed] [Google Scholar]
  • 104.Kiat J. E., Luck S. J., Beckner A. G., Hayes T. R., Pomaranski K. I., Henderson J. M., Oakes L. M., Linking patterns of infant eye movements to a neural network model of the ventral stream using representational similarity analysis. Dev. Sci. 25, e13155 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.Kriegeskorte N., Mur M., Bandettini P., Representational similarity analysis – Connecting the branches of systems neuroscience. Front. Sys. Neuro. 2, 4 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106.Dekker T., Mareschal D., Sereno M. I., Johnson M. H., Dorsal and ventral stream activation and object recognition performance in school-age children. Neuroimage 57, 659–670 (2011). [DOI] [PubMed] [Google Scholar]
  • 107.Grill-Spector K., Golarai G., Gabrieli J., Developmental neuroimaging of the human ventral visual cortex. Trends Cog. Sci. 12, 152–162 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Bayet L., Zinszer B. D., Reilly E., Cataldo J. K., Pruitt Z., Cichy R. M., Nelson C. A., Aslin R. N., Temporal dynamics of visual representations in the infant brain. Dev. Cog. Neurosci. 45, 100860 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Xie S., Hoehl S., Moeskops M., Kayhan E., Kliesch C., Turtleton B., Köster M., Cichy R. M., Visual category representations in the infant brain. Curr. Biol. 32, 5422–5432.e6 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Yates T. S., Ellis C. T., Turk-Browne N. B., Emergence and organization of adult brain function throughout child development. Neuroimage 226, 117606 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 111.Ellis C. T., Skalaban L. J., Yates T. S., Bejjanki V. R., Córdova N. I., Turk-Browne N. B., Re-imagining fMRI for awake behaving infants. Nat. Commun. 11, 4523 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112.Dapello J., Marques T., Schrimpf M., Geiger F., Cox D., DiCarlo J. J., Simulating a primary visual cortex at the front of CNNs improves robustness to image perturbations. Adv. Neural Inf. Process. Syst. 33, 13073–13087 (2020). [Google Scholar]
  • 113.A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929 (2020). 10.48550/arXiv.2010.11929. [DOI]
  • 114.Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, “A ConvNet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2022), pp. 11976–11986. [Google Scholar]
  • 115.M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, Dinov2: Learning robust visual features without supervision. arXiv:2304.07193 (2023). 10.48550/arXiv.2304.07193. [DOI]
  • 116.A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (PMLR, 2021), pp. 8748–8763. [Google Scholar]
  • 117.K. He, X. Zhang, S. Ren, J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), pp. 770–778. [Google Scholar]
  • 118.T. Ridnik, E. Ben-Baruch, A. Noy, L. Zelnik-Manor, Imagenet-21k pretraining for the masses. arXiv:2104.10972 (2021). 10.48550/arXiv.2104.10972. [DOI]
  • 119.M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2023), pp. 2818–2829. [Google Scholar]
  • 120.Snodgrass J. G., Vanderwart M., A standardized set of 260 pictures: Norms for name agreement, image agreement, familiarity, and visual complexity. J. Exp. Psychol. Hum. Learning Memory 6, 174–215 (1980). [DOI] [PubMed] [Google Scholar]
  • 121.M. Rezanejad, G. Downs, J. Wilder, D. B. Walther, A. Jepson, S. Dickinson, K. Siddiqi, “Scene categorization from contours: Medial axis based salience measures,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2019), pp. 4116–4124. [Google Scholar]
  • 122.Marantan A., Tolkova I., Mahadevan L., Image cognition using contour curvature statistics. Proc. R. Soc. A 479, 20220662 (2023). [Google Scholar]
