Abstract
Rapid progress in the fields of learning and object recognition has been achieved by successfully developing and using methods for learning from a large number of labeled image examples. However, such current methods cannot explain infants’ learning of new concepts based on their visual experience, in particular, the ability to learn complex concepts without external guidance, as well as the natural order in which related concepts are acquired. A remarkable example of early visual learning is the category of 'containers' and the notion of ‘containment’. Surprisingly, this is one of the earliest spatial relations to be learned, starting already around 3 months of age, and preceding other common relations (e.g., ‘support’, ‘in-between’). In this work we present a model that explains infants’ capacity for learning ‘containment’ and related concepts by ‘just looking’, together with their empirical developmental trajectory. Learning in the model is fast and occurs without external guidance, relying only on perceptual processes that are present in the first months of life. Instead of labeled training examples, the system provides its own internal supervision to guide the learning process. We show how the detection of so-called ‘paradoxical occlusion’ provides natural internal supervision, which guides the system to gradually acquire a range of useful containment-related concepts. Similar mechanisms of using implicit internal supervision can have broad application in other cognitive domains as well as in artificial intelligence systems, because they alleviate the need for supplying extensive external supervision, and because they can guide the learning process to extract concepts that are meaningful to the observer, even if they are not by themselves obvious or salient in the input.
Keywords: Containment relation, Spatial relations learning, Infants’ perceptual learning, Developmental trajectory, Unsupervised learning, Computational model
1. Introduction
In the first months of life infants acquire significant knowledge about the world, which allows them to form expectations about their environment, and guide their interactions with their surroundings. A major aspect of this knowledge is recognizing objects and their interactions (Spelke & Kinzler, 2007). An essential and extensively studied component of this capability is forming categories of spatial relations between objects, such as occlusion (A is behind B), support (A is on B) and containment (A is inside B) (Aguiar and Baillargeon, 1999, Baillargeon, 2004, Casasola and Cohen, 2002, Casasola et al., 2003, Hespos and Baillargeon, 2001b, Luo and Baillargeon, 2005, Needham and Baillargeon, 1993, Wilcox et al., 1996).
In the current work we describe a computational model that learns about ‘containment’, one of the earliest spatial relations to be learned (Casasola et al., 2003, Hespos and Baillargeon, 2001a, Hespos and Baillargeon, 2001b, Hespos and Spelke, 2004, Piaget and Inhelder, 1967, Wang et al., 2005), and a range of related notions, such as ‘support’ (Casasola & Cohen, 2002) and ‘cover’ (Hespos and Baillargeon, 2001a, Wang et al., 2005). This learning is obtained in the model fast and without supervision, relying only on perceptual processes that are present in the first months of life.
As summarized below (Section 2.1), studies of infants’ ability to recognize and categorize spatial relations between objects have shown that selective responses to containment events emerge as early as about 2.5 months of age, and continue to develop through a characteristic sequence of stages. The two main goals of the model are to provide an explanation for infants’ ability to learn complex concepts such as containment early and without guidance, as well as for the natural order of concept acquisition. The ability to learn complex concepts visually in an unguided manner goes beyond current highly successful computational models, which learn in a supervised manner, using large data sets of supplied labeled examples (Krizhevsky et al., 2012, LeCun et al., 2015). In contrast, the current model is able to acquire the visual concept of ‘containment’ and related relations by ‘merely looking’, and it naturally goes through stages in the observed developmental trajectory. It first recognizes dynamic occlusion events, and then generalizes to static images (Fig. 1). It distinguishes between ‘behind’, ‘in-front’ and ‘inside’ relations, and can tell apart ‘tight’ and ‘loose’ fit (Casasola et al., 2009, Casasola and Cohen, 2002, Hespos and Spelke, 2004). Learning ‘support’ relations (Baillargeon et al., 1992, Casasola and Cohen, 2002, Hespos and Baillargeon, 2001a) is more difficult in the model and emerges only later (Figs. 1F and 4F). The model deals with related concepts (e.g. ‘cover’ (Hespos and Baillargeon, 2001a, Wang et al., 2005)) and predicts developmental steps that can be tested empirically (Section 3.3).
Fig. 1.
‘Containment’ developmental stages. (A) Dynamic input. Short temporal sequences depicting (top to bottom): ‘in-front’, ‘behind’ and ‘inside’ events. (B) Static input. Single-frame images of (top to bottom): ‘in-front’, ‘behind’ and ‘inside’ relations. (C) ‘Tight’ (top) and ‘loose’ (middle, bottom) fit. (D) High-angle view (contained object is not occluded by the container). (E) ‘Cover’ relations. (F) ‘Support’ relations.
Fig. 4.
Capacities of the model. Each capacity is added incrementally to all previous ones. (A) Figure-ground segmentation of moving regions (motion direction indicated by the green arrow). The model constructs and stores a simple representation of the region (in blue), based on the region’s low-level features. (B) Detection of familiar regions in a static scene. Based on the representation obtained in (A), the model can detect a familiar region (in blue) and separate it from its background in a static scene. (C) Detection of boundaries and their ownership at motion discontinuities. The model detects object boundaries of a moving region (motion indicated by the green arrow) by identifying motion discontinuities (red contours), and determines ‘ownership’ direction (red arrows) of the boundary (which side belongs to the object). The model extends the representation of the moving region in (A) to include the ownership along its boundaries. (D) Detection of internal boundaries of a moving region (motion indicated by the green arrow). Such internal boundaries are typically produced at a container’s rim. The model represents the internal boundary as a part of the object (boundary ownership direction indicated by red arrows). (E) Detection of familiar boundaries in a static scene. Based on the representation in (C, D), external and internal boundaries of familiar objects can be detected in a static scene. (F) Detection of familiar internal regions in a static scene. In a container, based on (B-E), the model can discriminate between the ‘front’ (blue) and ‘back’ (red) sides separated by the internal boundary. The ‘front’ side is the region owning the internal boundary. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The focus of the model is on learning from visual experience, which normally plays a major role in the first months of life. As purposeful manipulation and interaction with the environment develops, haptic perception and motor manipulation can also contribute to the acquisition of ‘containment’ relations, and can lead to the acquisition of spatial relation concepts even in the absence of vision (Bedny and Saxe, 2012, Landau and Gleitman, 1985). These multimodal aspects will not be considered here, but they should be included in a full model of learning spatial relations. The model also focuses on the perceptual ability to recognize and categorize aspects of containment, such as distinguishing between ‘containment’ and other spatial relations such as ‘behind’, or responding differently to container and non-container objects. In addition to such perceptual categorizations, infants exhibit from an early age capacities to perform some forms of physical reasoning, for example, expecting objects to continue to exist when they become hidden from view in an 'inside' relation, or expecting an object to be transported with the container when the container is moved to a new location. With additional experience, physical reasoning extends to include predictions that are more complex, for example, expecting that an object will not be successfully inserted into a container when its width is too large compared with the container's opening. Detailed knowledge and physical reasoning about containment continue to develop with experience, and adults eventually form a rich conceptual representation of containment-related notions, which play a role in their understanding of physical and biological properties and processes (e.g., how the cell-membrane functions as a ‘container’ (Collins and Forbus, 1987, Davis et al., 2013)).
Perceptual processes and physical reasoning are often inter-dependent, but the current focus is on the perceptual abilities; physical reasoning processes and cognitive aspects are for the most part beyond the scope of the current model.
The rest of the paper is organized as follows. A survey of related studies, both psychological and computational, is presented in Section 2. In Section 3 we present our model that learns about ‘containment’ and related concepts, fast and without supervision, relying only on perceptual processes that are present in the first months of life. Details of the model's algorithmic implementation are described in Section 4. The data used for experimental evaluation of the model and evaluation results are described in Section 5. Finally, we discuss the results and implications of this study in Section 6.
2. Related work
2.1. Behavioral studies
Behavioral studies have accumulated evidence that infants begin to detect and respond specifically to containers and containment events already at about 2.5 months of age (Hespos & Baillargeon, 2001b), which is about as early as occlusion has been demonstrated (Aguiar and Baillargeon, 1999, Baillargeon, 2004, Luo and Baillargeon, 2005). Infants at this age distinguish ‘inside’ from ‘in-front’ and ‘behind’, and containers from simple non-container objects (Hespos & Baillargeon, 2001b).
Additional aspects of containment continue to develop over time: by 4 months, infants develop sensitivity to width information, showing surprise when a wide object becomes fully hidden inside a narrow container; between 4 and 6 months, they distinguish between loose and tight containment (Casasola, 2008, Casasola and Cohen, 2002, Casasola et al., 2003, Sitskoorn and Smitsman, 1995, Wang et al., 2004); and between 6.5 and 7.5 months they develop sensitivity to the relative height of the container and an inserted object (Hespos and Baillargeon, 2001a, Hespos and Baillargeon, 2006), which is later than showing sensitivity to relative height for occlusion events (Baillargeon & DeVos, 1991). Interestingly, the support relation is generalized to novel objects later than containment (Casasola, 2005, Casasola and Cohen, 2002), as is ‘in between’ (Quinn, Adams, Kennedy, Shettler, & Wasnik, 2003).
Studies of the earliest stages of selective responses to containment (up to about 6 months of age) have focused on a dynamic setting, showing an object being gradually inserted into a container. The reliance on visual motion in the early development of containment is consistent with a broad range of studies showing that the motion of objects in the image plays an important role in early object perception. Infants show an ability to track moving objects from birth (Johnson, Dziurawiec, Ellis, & Morton, 1991), and initially use motion as a basic cue for grouping (Spelke, 1990). At the age of about two months, infants use common motion, as well as accretion and deletion cues at object boundaries, to group visual stimuli into distinct coherent objects and to separate them from their background (Granrud et al., 1984, Johnson and Mason, 2002, Kaufmann-Hayoz et al., 1986). At around 5 months, accretion and deletion cues also provide information for the estimation of relative depth between surfaces in dynamic scenes (Granrud et al., 1984, von Hofsten and Spelke, 1985). Selective responses to containment in dynamic scenes are soon followed by selective responses to containers and containment configurations in static scenes (Aguiar and Baillargeon, 1998, Hespos and Baillargeon, 2006, Wang et al., 2004).
The role of depth cues in the early acquisition of containment relations has not been studied in detail. Effective binocular depth perception emerges around 4–5 months of age, and static pictorial depth cues emerge later (Kavšek et al., 2012, Yonas et al., 2002). Depth cues could provide useful information, but do not appear crucial for the early learning stage. The model presented below does not depend directly on depth perception, but depth information could be naturally incorporated into the model (e.g., in the delineation of object boundaries).
Additional aspects of containment continue to develop over the following several months. Starting around 5 months, infants are sensitive to tight-fit versus loose-fit containment relations (Fig. 1C), and appear to understand the motion restriction of a contained object inside a container as imposed by the width differences between the contained object and the container’s cavity (Casasola et al., 2009, Casasola and Cohen, 2002, Hespos and Spelke, 2004). At about 6 months infants begin to generalize a dynamic containment event seen from a low viewing angle to an event seen from a high viewing angle, where the contained object is not occluded by the container (Casasola et al., 2003) (Fig. 1D). The acquisition of the ‘support’ relation between two objects develops later (Aguiar and Baillargeon, 2000, Aguiar and Baillargeon, 2003, Casasola and Cohen, 2002). Finally, it has also been observed that categorical containment relations (e.g. containers as an object category, or inside vs. behind) are first learned between familiar objects, and only later generalize to novel objects (Casasola and Cohen, 2002, Casasola et al., 2003).
2.2. Computational studies
A number of computational studies have addressed the problem of recognizing complex spatial relations including containment. In general, state-of-the-art computer vision methods, including deep neural network models, find the problem of identifying containers and containment events challenging, due to the high variability in the appearance of containers and contained objects and materials, even when annotated labels are provided for the training examples.
Several studies have addressed the problem of recognizing spatial relations, including support and adjacency, but without addressing concepts related to containment. For example, models by Rosman and Ramamoorthy (2011) and Silberman et al. (2012) have used detailed depth information, extracted from binocular vision or provided by depth sensors, to identify and analyze contact regions of object surfaces, from which the relations are then inferred.
A more recent study has addressed containment relations using simulations of 3D objects, including both containers and non-containers (Liang, Zhao, Zhu, & Zhu, 2015). The model uses physical simulations of dynamic events, such as random collisions between two rigid objects, to estimate the probability of containment relations between the two. Physical simulations using objects with known 3D shape were also used by Wang and Liang (2017) to consider a planning problem: given an object and a set of containers, select the best container to use for transferring the object from one location to another. These models provide means for explaining the physical meaning of containment concepts, but they are not used to account for the early development of containment concepts.
Containment relations in a 3D scene were also examined in a model by Liang, Zhao, Zhu, and Zhu (2016), in the context of videos depicting people interacting with objects and moving them from one location (e.g. a box) to another (e.g. a refrigerator). Containment relations are inferred primarily by tracking agents and their interaction with an object over time (e.g. moving an object from the box to the refrigerator), rather than by analysis of the spatial relations between the container and the inserted object. The model provides a representation that can keep track of both visible and contained objects over time, and is not related directly to the early acquisition of containment concepts.
Computational models have also shown how non-visual modalities can contribute to the learning of containers and containment. For example, a study with a robotic system has shown how interactions with objects can use manipulation and acoustic signals, combined with vision, to divide the objects into containers and non-containers, and learn to recognize such categories visually (Griffith, Sinapov, Sukhoy, & Stoytchev, 2012).
3. The model
3.1. Overview
In this section, we describe the model and how it learns about containment and containers fast and without supervision, relying on perceptual processes that are already present in the first months of life. The model goes naturally through stages, which appear in infant learning, recognizing first dynamic occlusion events, and then generalizing to static images (Fig. 1). It distinguishes between ‘behind’, ‘in-front’ and ‘inside’ relations, and can tell apart ‘tight’ and ‘loose’ fit.
Learning in the model extends to related spatial relations, in particular ‘cover’ (Wang et al., 2005) and ‘support’ (Aguiar and Baillargeon, 2000, Casasola, 2005, Casasola and Cohen, 2002). The model explains why these relations present additional difficulties to the learning process, and makes predictions about developmental steps that can be tested empirically.
The focus of the model is on learning from visual experience, which normally plays a major role in the first months of life. During learning, the model is exposed to videos and images, and it acquires on its own and in a natural order an early version of a set of concepts related to containment.
3.1.1. Paradoxical occlusion as a teaching cue
How can relatively complex and abstract concepts related to containment be learned without guidance, by being visually exposed to relevant containment events?
We suggest that the learning of containment and related relations is guided internally by teaching signals, which are present already at the onset of the learning process. As shown by the model, in both dynamic and static visual input, containment can be identified as an instance of 'paradoxical occlusion', defined as a situation where an object O, which occludes a second object C, is at the same time also occluded by C (Fig. 2B and 2C). Typically, an occlusion relation between two objects goes in one direction: one object either occludes, or is occluded by, the second object (Fig. 2A). This simple ordering is violated in a paradoxical occlusion situation.
Fig. 2.
Occlusion, paradoxical occlusion and containment. (A) Simple occlusions. (Top) An object (1) is in front of, and occludes a second object (2), or (bottom) is behind, being occluded by (2). (B) Dynamic containment occurs when a switch in boundary ownership (orange arrow) between (1) and (2) signals a dynamic ‘paradoxical occlusion’, where the boundary switches from occluding (2) to being occluded by (2). (C) Static containment at a low view is detected as a static 'paradoxical occlusion', where (1) occludes (2) and (2) occludes (1) along different parts of the common boundary between them. (D) Static containment at a high view: the common boundary is owned by (1), separating it from the ‘back’ region of (2). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
It is known that during early visual experience infants develop the ability to segregate a scene into distinct objects, and determine occlusion relations between them (Kaufmann-Hayoz et al., 1986, Kellman and Spelke, 1983, Needham, 2001). Our model suggests that the occurrence of an unusual paradoxical occlusion event is noted by the system and serves as an early internal signal for containment configurations, which then get elaborated by additional learning. The model demonstrates that paradoxical occlusion serves as efficient and reliable internal guidance, which leads to the learning of containment and related notions in a human-like manner, and based on similar initial capacities. In Section 3.2, we briefly describe the visual capacities included in the model prior to the onset of learning about containment. Section 3.3 then describes the sequence of stages that the model goes through, guided by paradoxical occlusion, and using the pre-existing capacities. The full pipeline of the model, including all the required computations, is illustrated in Fig. 3. A detailed account of the model and its training, together with the technical details of its algorithmic implementation, is given in Section 4.
Fig. 3.
Schematic summary of all the computations included in the model. Red frames indicate dynamic (video) rather than static scenes. Top: the background capacities (C1-C6) applied to the familiarization videos (detailed illustration of the capacities is shown in Fig. 4). Optical flow is computed for the input, and then capacities C1-C6 are applied to create a representation of the input. The capacities are shown on the right, later capacities added on top of earlier ones. Bottom: recognition processes at three stages – dynamic, static at low view, and static at high view (detailed illustration of the recognition process is shown in Figs. 5A–D, 7). Dynamic: the algorithm detects a switch in boundary ownership. Static: the algorithm detects mixed boundary ownership. High-view: the algorithm detects that all boundaries are within the container’s back region. Right column: Detected boundary ownership in the output is indicated in blue (object) or in red (container). In high view, the container’s detected front region is indicated in red, and the back region in orange. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
3.2. Capacities of the model
The model includes a number of perceptual capacities, which are assumed to develop over time in the visual system at an early age, and support the acquisition of containment concepts. These capacities are listed as C1-C6 below, to make explicit the assumptions built into the model. As can be seen, they are all processes dealing with object segregation and boundary detection, starting from moving objects and generalizing to static scenes. The different capacities are listed briefly, together with the empirical evidence supporting their role in the visual system at an early age, when containment concepts are acquired. More details about their specific algorithmic implementation in the model are described separately in Section 4.1. A visual summary of the capacities appears in Fig. 4.
Capacity C1: Figure-ground segmentation of moving regions. When a region in the image moves against a stationary background, it is separated from the background based on its motion, and the model constructs and stores a simple representation of the region, based on low-level features included in the region (Fig. 4A, Section 4.1.1; (Johnson et al., 1991, Spelke, 1990)). A model lacking this capacity will not be able to segregate moving objects from the background and effectively construct object representations.
Capacity C2: Detection of familiar regions in a static scene. Based on the representation obtained in capacity C1, the model can detect a familiar region and separate it from its background in a static scene (Fig. 4B, Section 4.1.2; (Needham, 2001, Needham and Baillargeon, 1998, Needham and Modi, 1999, Spelke, 1990, Spelke et al., 1989)). Lacking this capacity, the model will not be able to segregate object regions from the background or another object under occluding conditions in a static scene.
Capacity C3: Detection of boundaries and their ownership at motion discontinuities. The model detects object boundaries of a moving region by identifying motion discontinuities, and determines ‘ownership’ direction of the boundary (which side belongs to the object; (Granrud et al., 1984, Johnson and Mason, 2002, Yonas et al., 1987)). The model extends the representation of the moving region to include the ownership along its boundaries (Fig. 4C, Section 4.1.3). Lacking this capacity, the model will not be able to assign correctly the ownership to the boundaries of moving objects, and therefore will have limited ability to distinguish between occluding and occluded objects.
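As an illustration of how boundary ownership could be read off motion signals, the following Python sketch uses one common heuristic: the occluding surface moves together with its boundary, so the side whose velocity matches the boundary's velocity is taken as the owner. This is a minimal sketch under stated assumptions and not the model's actual implementation (which is described in Section 4.1.3); the function name `boundary_owner`, the velocity tuples and the tolerance `tol` are all hypothetical.

```python
from math import hypot


def boundary_owner(boundary_v, side_a_v, side_b_v, tol=0.5):
    """Return 'a', 'b', or None for the side assumed to own a boundary.

    Assumption (not taken from the paper): the occluding surface moves
    together with its boundary, so the owner is the side whose velocity
    best matches the boundary velocity.
    Velocities are (vx, vy) tuples, e.g. in pixels per frame.
    """
    def dist(u, v):
        return hypot(u[0] - v[0], u[1] - v[1])

    # Without a motion discontinuity across the boundary,
    # ownership cannot be determined from motion alone.
    if dist(side_a_v, side_b_v) < tol:
        return None

    da = dist(boundary_v, side_a_v)
    db = dist(boundary_v, side_b_v)
    return 'a' if da < db else 'b'


# A region moving right (side a) whose edge moves with it owns the edge.
print(boundary_owner((2, 0), (2, 0), (0, 0)))
# A stationary edge next to a moving region belongs to the stationary side.
print(boundary_owner((0, 0), (2, 0), (0, 0)))
```

In the second call the moving region passes behind a stationary occluder, so the boundary stays fixed and is assigned to the stationary side.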
Capacity C4: Detection of internal boundaries of a moving region. The boundaries detected in capacity C3 are the bounding contours, separating the moving object from the background. This is extended next to the detection of motion discontinuities within the region of an object, as internal boundaries. Such internal boundaries are typically produced at a container’s rim. The motion signal is weaker at these internal boundaries compared with the external ones (because the motion difference across the boundary is small; Section 4.1.4), and therefore the model includes them as a later stage. The model represents the internal boundary as a part of the object (Fig. 4D, Section 4.1.4; (Arterberry and Yonas, 1988, Arterberry and Yonas, 2000)). Lacking this capacity, the model will not be able to extract internal object boundaries along the cavity of a moving container, and therefore will have limited ability to distinguish between containers and non-containers, which do not have a cavity.
Capacity C5: Detection of familiar boundaries in a static scene. The detection of external and internal object boundaries is based originally on motion discontinuities. Based on the representation in capacities C3, C4, external and internal boundaries of objects can subsequently be detected in a static scene (Fig. 4E, Section 4.1.5; (Kestenbaum et al., 1987, Spelke et al., 1989)), starting with familiar objects, or familiar simple shapes, and gradually generalizing to novel ones (Needham and Baillargeon, 1998, Needham et al., 2005, Needham and Modi, 1999). The extraction of boundaries, as well as segmentation and occlusion in general, appear earlier in development for familiar objects, and later generalize to novel objects. In terms of the containment model, we focus on the stage of familiar objects, but when the generalizations above take place, they will also naturally allow the containment model to generalize to novel objects. Lacking this capacity, the model will not be able to detect and analyze object boundaries in a static scene, including common borders between objects under occlusion, and therefore will have limited ability to distinguish between different occlusion conditions, such as ‘behind’ and ‘inside’.
Capacity C6: Detection of familiar internal regions in a static scene. In a container, based on capacities C2-C5, the model can discriminate between the ‘front’ and ‘back’ sides of the internal boundary. The external bounding boundary of an object, detected in capacity C3 above, separates the object as a whole from its background. In a similar manner, an internal boundary divides the object region into two sub-regions, separated by the boundary (Fig. 4F, Section 4.1.6; (Kestenbaum et al., 1987)). The ‘front’ side is the region owning the internal boundary. Lacking this capacity, the model will not be able to distinguish between the front and back regions of a container, and therefore will confuse high-containment and in-front relations.
3.3. Stages of recognition
In this section, we describe how the model develops to gradually acquire notions related to containment and containers. The description is divided into a sequence of stages, which make explicit the cues and capacities used by the model to acquire the relevant notions without requiring external guidance. The description focuses on the main principles driving the process, and a more detailed description of the algorithmic implementation follows in Section 4.2. The succession of stages is summarized visually in Fig. 1.
3.3.1. Recognizing dynamic containment
The model’s initial and simplest recognition stage of containment relations between objects (as well as ‘in-front’, ‘behind’), relies on motion information in dynamic displays, presented during training and testing. These dynamic displays are similar to dynamic scenarios used in studies of the earliest specific responses to containment relations in infants (Baillargeon, 2004, Hespos and Baillargeon, 2001a, Wang et al., 2005). Under these conditions, the model is first presented with brief familiarization video sequences of moving objects (both containers and non-containers; see Supplementary Data S3, S4 for sample videos), from which it learns to detect the objects’ regions in subsequent test videos, even when one of them (e.g., the container) is stationary (Section 3.2, capacities C1, C2). Following familiarization, the model is presented with dynamic containment and occlusion events: an object is being moved by hand, to be placed in-front, behind, or inside a container. In such a dynamic sequence, when an object O is inserted into a container C, at the moment when a ‘containment event’ takes place (entering a cavity in the container C), object O turns from progressively occluding C to become partly occluded by C, signaling a paradoxical occlusion, which is specific to containment relations (Fig. 2B). This event is detected by a switch in boundary ownership (Section 3.2, capacity C3), from being owned by the moving object to the stationary container. When this switch occurs inside region C rather than at its boundary, it identifies unambiguously the container C and the contained object O (Fig. 5A). In a ‘behind’ relation, the occlusion relation does not switch – C consistently occludes O (the boundary between them is owned by C). Similarly, in an ‘in-front’ relation, the occlusion relation also does not switch, and the boundary is consistently owned by the moving object O.
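The decision rule described above can be sketched in a few lines of Python. The sketch assumes that an earlier stage has already produced, for each frame, a clean label saying whether the common boundary is owned by the moving object O or by the stationary container C; real input would be noisy and would need smoothing. The function name `classify_dynamic_event` and the flag `switch_inside_container` are hypothetical and are not taken from the model's implementation (Section 4.2).

```python
def classify_dynamic_event(owners, switch_inside_container=True):
    """Classify a dynamic event from per-frame boundary ownership.

    owners: list of 'O' or 'C', one label per frame in which a common
        boundary between object O and container C is visible.
    switch_inside_container: hypothetical flag, True if the ownership
        switch occurs inside C's region rather than at its outer boundary.
    """
    if not owners:
        return 'no-contact'

    # Collapse consecutive repeats, e.g. O, O, C, C -> O, C.
    runs = [owners[0]]
    for label in owners[1:]:
        if label != runs[-1]:
            runs.append(label)

    if runs == ['O']:
        return 'in-front'   # boundary consistently owned by O
    if runs == ['C']:
        return 'behind'     # boundary consistently owned by C
    if runs == ['O', 'C'] and switch_inside_container:
        return 'inside'     # single switch O -> C: paradoxical occlusion
    return 'unclassified'


print(classify_dynamic_event(['O', 'O', 'C', 'C']))
```

Only a single switch from O to C that occurs inside the container's region is treated as containment; any other ownership pattern is deliberately left unclassified in this sketch.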
Fig. 5.
Schematic illustration of the computations used by the model. (A-D) Learning to identify ‘containment’ relations (bottom row), ‘in-front’ (top), and ‘behind’ (middle). Between the object and the container, boundaries marked in blue are owned by the object and in red by a container. (A) Identifying containment in dynamic input. Each row represents a dynamic event, depicting an object placed (top to bottom) ‘in-front’, ‘behind’, or ‘inside’ a stationary object. The model segments the objects (colored regions), detects the motion boundary between them, and detects the switch from ‘blue in-front of red’ to ‘red in-front of blue’. (B) Detecting ‘containment’ in static images (two examples, bottom). ‘Paradoxical occlusion’ is detected along the common border, where the object is in front of the container (at the blue boundary) but behind it at the rim (red boundary). (C) ‘Loose’ vs. ‘tight’ fit is measured by the fraction of the detected boundary (solid-red) relative to the full length of the internal boundary (dotted-red). (D) High-angle view: The container’s region is segregated into ‘front’ (red) and ‘back’ (orange) regions, separated by the internal boundary. Detection of ‘containment’ is extended to include occlusion (blue boundary) confined to the ‘back’ region. (E) ‘Cover’ relation (when the internal part, i.e. the back region, is invisible) and ‘support’ (in F) are related to containment, but in the model they require additional learning and are predicted to appear at later stages. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
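The fit measure in panel (C) is a ratio of two boundary lengths, which can be sketched as follows. The sketch assumes that both the detected part of the boundary and the full internal boundary are available as polylines of (x, y) points; the function names `polyline_length` and `fit_fraction` are hypothetical. How the resulting fraction maps onto ‘tight’ versus ‘loose’ (for instance through a threshold) is not specified in this section and is therefore left out.

```python
from math import hypot


def polyline_length(points):
    """Total length of a polyline given as a list of (x, y) points."""
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def fit_fraction(detected_segment, internal_boundary):
    """Length of the detected boundary divided by the full length
    of the container's internal boundary."""
    full = polyline_length(internal_boundary)
    if full == 0:
        raise ValueError("internal boundary has zero length")
    return polyline_length(detected_segment) / full


# A detected part spanning 6 units of a 10-unit internal boundary.
fraction = fit_fraction([(2, 0), (8, 0)], [(0, 0), (10, 0)])
print(round(fraction, 3))
```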
3.3.2. Recognizing static containment
The next stage in the model is the ability to recognize containment relations (as well as ‘in-front’ and ‘behind’) not only in dynamic events, but also in static scenes (Aguiar and Baillargeon, 1998, Hespos and Baillargeon, 2006, Wang et al., 2004). The extension that makes this possible is the ability to recognize external and internal object boundaries in stationary scenes.
Object boundaries are learned originally from motion discontinuities. They include the external object boundaries, and, for a container, also an internal boundary γ (at the container’s rim), which is a characteristic part of a container (Figs. 2C, 4B; Section 3.2, capacities C3, C4). The main addition to the model, listed as capacity C5 above, is the ability to recognize, in a static image, the boundaries of objects observed during a familiarization stage, together with their boundary ownership (the boundary side belonging to the object). The boundary between objects specifies their occlusion relations: the object that owns the boundary occludes the object on the other side (Dorfman et al., 2013, Granrud et al., 1984, Yonas et al., 1987). In a static scene, paradoxical occlusion is signaled not by a dynamic switch in boundary ownership, but by conflicting occlusion relations between the two objects. The conflicting cues arise along different parts of the common boundary between the container C and the inserted object O (Fig. 5B). A part of the common boundary is owned by the object O, and another part, along C’s internal boundary, is owned by the container C. Similar to the dynamic case, the container C both occludes and is being occluded by O, signaling a ‘containment’ relation. Similar to the model, infants are sensitive to the internal boundary at the rim of a container. When the back of a box is removed, transforming the internal boundary between the front and back of the box from an internal to external boundary, infants no longer interpret the box as a container, but rather as an occluder (Mou & Luo, 2017).
The relation of in-front/behind between image regions induces at this stage three types of relations between an object and a container: object in-front, object behind, and paradoxical occlusion. The empirical evidence shows that at about the same time, these classes of object relations become associated with predictions about an expected object location, in the following way. When an object is placed behind a container, a motion of the container will reveal the occluded object. In contrast, when the relation is a paradoxical occlusion, a motion of the container will transport the object with it. Infants were shown to look longer at ‘surprising’ events, when these predictions are violated; for example, when an object is inserted into a container, but then becomes revealed when the container moves, rather than being transported with it (Hespos and Baillargeon, 2001a, Wang et al., 2005). This use of the different spatial relations to predict outcomes such as ‘reveal’ or ‘transport’ caused by the container motion, is an early example of associating meaning with the newly formed categories of spatial relations, similar to the notion of ‘Quinian bootstrapping’ in concept learning (Carey, 2009).
3.3.3. ‘Tight’ versus ‘loose’ fit
A distinction that follows in the model is between ‘tight’ and ‘loose’ containment relations, which infants are sensitive to (Casasola et al., 2009, Casasola and Cohen, 2002, Hespos and Spelke, 2004, Sitskoorn and Smitsman, 1995). In a containment event, an object O can either fit tightly inside a container C, or it may occupy only a part of C’s cavity, and may be free to move within it (Fig. 1C). Discrimination between tight and loose static containment is based in the model on the internal boundary of the container (Section 3.2, capacities C4, C5) and the object regions on its two sides (Section 3.2, capacity C2), extracted automatically during the familiarization stage. The model produces a measure of the containment ‘tightness’ based on the proportion between the total length of C’s internal boundary, and the length of the common part of the internal boundary, shared between C and the inserted object O: containment is tight if β, the boundary between the object and container, and the rim γ are similar in size, and loose if β is significantly smaller (Fig. 5C). This stage depends, therefore, on the reliable detection of C’s internal boundary through most of its length. In analogy with the model, infants develop sensitivity to the relative width of the inserted object compared with the container cavity width at about 4 months, and it is higher at this age than their sensitivity to height comparisons (Hespos and Baillargeon, 2001a, Hespos and Baillargeon, 2006). Interestingly, the difference between width and height judgments appears to persist in some form into adulthood, and can be detected under challenging conditions, where occluded or contained objects slightly change their width or height, while invisible during a dynamic occlusion or containment event (Strickland & Scholl, 2015).
3.3.4. Recognizing static containment from a high view angle
High-angle containment (Casasola, 2008, Casasola et al., 2003) is more difficult in the model than a low-view configuration, because the object O is no longer adjacent to the internal boundary γ, and is not occluded by the container (Figs. 1D and 2D). Therefore, an additional capacity is required for this configuration. In the model, high-view containment is identified by using the internal boundary to divide the container region into sub-regions (Section 3.2, capacity C6, Figs. 4F and 5D). In the early stage, an object in the image is represented by a single region (Section 3.2, capacities C1-C4). Subsequently, internal object boundaries are detected (Section 3.2, capacities C4, C5). The addition of an internal boundary naturally breaks the single object region into two regions joined along the internal boundary. For dealing with a high-view angle, the model uses the natural representation of the two sub-regions, ‘front’ and ‘back’ of a container, discussed in Section 3.2, capacity C6 above (simple objects, without an internal discontinuity, are still represented by a single region). High-view containment is detected when all the common borders between O and C are owned by O, and separate O from the ‘back’ region of C.
In the refined representation, instead of a single region, the container is now composed of two regions, separated by the internal boundary. In this representation, the object can be in-front of the container in different ways: it can be in front of the container’s Front region only, Back region only, or both. One of these, when the object is in front of the Back region only, naturally corresponds to high-view containment. In the model, the relation of high-view containment will be confused initially with in-front relation (which is in fact correct as well, since the object lies in front of the container's regions behind it). Connecting this new high-view configuration to ‘containment’ requires in the model an additional learning stage, which can be supported in two ways. First, the high-view and low-view containment configurations are similar in terms of boundary ownerships along the object-container common border, and one can transform to the other with a small change in the observer’s view direction. Second, similar to low-view, a motion of the container in high-view will transport the object with it. Learning to predict the object motion, which infants are sensitive to (Section 3.3.2), will therefore result in treating low-view and high-view containment in a similar manner. The model’s prediction that high-view containment comes at a later stage than low-view is consistent with the known data (Casasola et al., 2003). However, since high-view containment has not been tested in the past at ages as early as low-view, the prediction remains to be tested more fully in future studies.
3.3.5. Extensions to related concepts: Support and cover relations
In the model, the ‘support’ (‘on-top’) relation (Figs. 1F and 4F; (Baillargeon et al., 1992, Casasola and Cohen, 2002, Needham and Ormsbee, 2003)) is more difficult than containment since the discontinuity boundary γ, present in containers, is replaced in this case by a convex object edge. Our simulations show that an extended capacity is required for detecting this boundary. Briefly, the surface boundaries of a supporting object have no depth discontinuity (only a discontinuity in surface orientation), making them significantly more difficult to detect by motion discontinuities (Arterberry and Yonas, 1988, Marr, 1982). The model suggests that the acquisition of ‘support’ relation is delayed relative to ‘containment’, because its learning depends on the reliable detection of the convex internal boundary.
A cover relation (Hespos and Baillargeon, 2001b, Wang et al., 2005) (Figs. 1E and 4E) is similar to containment (in both, object O is partially inserted into a cavity in C), and can therefore be learned in a similar manner. However, this learning in the model will depend crucially on whether the internal discontinuity γ at the covering object’s opening rim is made visible during familiarization (Fig. 6), since this discontinuity is required for perceiving the hollow cavity of the covering object. The model predicts that low-view ‘containment’, high-view, and ‘support’ will be acquired in this order, and that ‘cover’ will be learned spontaneously, provided that the rim γ is visible during familiarization, but will not be learned otherwise at this stage. The prediction can be tested empirically, by a small but crucial modification to experiments conducted by Wang et al. (2005), which showed that infants are able to recognize covering events already at 2.5 months. We predict that a similar experiment, but without a dynamic presentation showing the rim of the covering object at the beginning of the trial, will fail to demonstrate a consistent distinction between ‘cover’ and ‘in-front’ relations.
Fig. 6.
Figure-ground segmentation. Analysis steps for a simple, non-container object (top) and a container (bottom). (A) Two consecutive frames from a familiarization video sequence. (B) Computed optical flow between the two frames using (Sun et al., 2013). Direction and magnitude are represented by hue and saturation respectively. The container both translates down and wiggles (see arrows). (C) Motion discontinuities are computed from local gradients of the optical flow. (D) Figure-ground segmentation for the two types of objects: (top) a simple object is separated from the background by an external boundary; (bottom) a container has in addition two sub-regions separated by an internal boundary at the container’s rim. Later, the hand is separated from the object using an additional image of the object at rest without the hand holding it.
4. Model implementation
This section is divided into two parts: the first describes the algorithmic implementation of the capacities incorporated in the model. The second describes the implementation of how the capacities are used by the model at different stages, to make increasingly complex judgements about containment relationships. The model implementation is designed to allow some robustness to variations in object and background appearance. Nevertheless, limitations at this age in view-direction and illumination invariance can have an effect on testing results when there are large differences in viewing directions and illumination conditions between familiarization and test conditions. Additional limitations are unreliable computations of optical flow on smooth or reflecting object surfaces, and noisy computations of motion discontinuities along fuzzy or ragged object boundaries. Source code of the model’s implementation is included as Supplementary Data S1, S2.
4.1. Algorithmic implementation of the model's capacities
Described below are the algorithms implementing the model’s capacities, which deal with motion computations, segmentation and regions representation, at an increasing level of detail. The implementation uses primarily existing computer vision schemes, which perform similar basic computations (see Appendix A for details), although their specific implementation may be different from the biological one.
4.1.1. Figure-ground segmentation of moving regions
The model applies a standard optical flow computation (Black and Anandan, 1996, Sun et al., 2013) between each pair of successive video frames in the input sequences, to evaluate the motion in the scene. Moving regions are then separated from the background (Burt et al., 1991, Horn and Weldon, 1988, Irani et al., 1994, Peleg and Rom, 1990), and their boundaries are identified as loci of discontinuity (or sharp transition) in image motion (Dorfman et al., 2013, Ogale et al., 2005, Sargin et al., 2009, Stein and Hebert, 2009, Sundberg et al., 2011; Fig. 6). Local motion discontinuities are detected using directional gradients of the optical flow (Beck et al., 2008, Verri et al., 1989). Internal boundaries between figure sub-regions do not participate in the segmentation process at this stage. The model produces a representation of the moving figure region. The object representation uses a so-called star model (Crandall, Felzenszwalb, & Huttenlocher, 2005), which is a configuration of local regions surrounding a common center. Each region includes a description of its local appearance (Lowe, 2004), its segmentation mask, and its offset from the object center (Crandall et al., 2005, Dorfman et al., 2013, Karlinsky et al., 2010, Leibe et al., 2004). The object center is determined at the first familiarization video frame as the center of mass of object pixels. Object models were learned from videos where a grasping hand was moving the object. In order to separate the hand from the learned object model, we detect the object once in a reference image that contains the object without a hand.
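As a rough illustration of the gradient step described above, the following minimal Python sketch scores each pixel by the magnitude of the local flow gradients and thresholds the result. It is not the implementation in Supplementary Data S1, S2: the function names, the forward-difference operator, and the threshold value are assumptions made for this example, and the optical flow itself is taken as given.

```python
import math

def motion_discontinuity(flow_u, flow_v):
    """Per-pixel discontinuity score from a dense optical flow field.

    flow_u, flow_v: 2-D lists (rows x cols) with the horizontal and
    vertical flow components. The score is the magnitude of the
    forward-difference gradients of both components (an assumed,
    illustrative choice of gradient operator).
    """
    rows, cols = len(flow_u), len(flow_u[0])
    score = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Forward differences, clamped at the image border.
            x2 = min(x + 1, cols - 1)
            y2 = min(y + 1, rows - 1)
            du_dx = flow_u[y][x2] - flow_u[y][x]
            du_dy = flow_u[y2][x] - flow_u[y][x]
            dv_dx = flow_v[y][x2] - flow_v[y][x]
            dv_dy = flow_v[y2][x] - flow_v[y][x]
            score[y][x] = math.sqrt(du_dx**2 + du_dy**2 + dv_dx**2 + dv_dy**2)
    return score

def boundary_pixels(score, threshold):
    """Return (row, col) positions whose score exceeds the threshold."""
    return [(y, x)
            for y, row in enumerate(score)
            for x, s in enumerate(row)
            if s > threshold]
```

For a toy field in which the right half of the image moves and the left half is static, the pixels flagged by `boundary_pixels` form a vertical line at the border between the two halves.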
4.1.2. Detection of familiar regions in a static scene
The method used above for representing an image region (Section 4.1.1), is also used for detecting a similar region in a new image (Crandall et al., 2005, Dorfman et al., 2013, Leibe et al., 2004). The model is tolerant to occlusion and moderate scale changes, and can robustly detect partially occluded objects when located in-front, behind or inside other objects in the scene (Figs. 7A and 5B). Given a static image, local appearance descriptors (Lowe, 2004) are densely extracted throughout the image. For each descriptor, the model retrieves the k nearest neighbors (k-NN) from the learned object model. Each neighbor votes for the location of the object center with a relative weight proportional to its learned predictive accuracy. The image location with the highest total votes is detected as the object center (Crandall et al., 2005). To segment out the object region (and subsequently its boundaries), the model projects back the figure-ground masks associated with the image features at the corresponding offset from the detected center (Dorfman et al., 2013, Leibe et al., 2004). The model’s segmentation capability develops with the different stages starting with a single object foreground region in the first stage, and including the object external and internal boundaries, and ‘front’ and ‘back’ regions at later stages. The model is tolerant to object scale differences of about ±10% (which could be extended with additional training), with respect to the familiarization scale, and therefore can robustly detect objects that appear closer or further away relative to their distance from the camera during familiarization, when located in-front, behind or inside other objects in the scene.
Fig. 7.
Increasing detection abilities in the model. (A) Detection of familiar dynamic and static regions (Section 3.2, capacities C1, C2) - containers (red region) and simple objects (blue region). (B) Detection of boundaries (Section 3.2, capacities C4, C5) - external in simple objects (blue lines) and both external and internal (at the rim) in containers (red lines). (C) Detection of container’s sub-regions (Section 3.2, capacity C6) - ‘front’ (red region) and ‘back’ (orange region) separated by internal boundaries at the container’s rim (the simple object is in blue). Presented are the spatial relations “in-front”, “behind” and “inside” at the last three rows (top to bottom, respectively). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
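The center-voting step of Section 4.1.2 can be sketched as follows. This is a simplified illustration only: descriptors are short numeric tuples rather than the local appearance descriptors used in the model, votes are accumulated at exact pixel positions rather than in a smoothed map, and the names `vote_for_center`, `image_features` and `model_features` are hypothetical.

```python
from collections import defaultdict

def vote_for_center(image_features, model_features, k=2):
    """Detect an object center by weighted k-NN voting.

    image_features: list of (descriptor, (x, y)) extracted from the image.
    model_features: list of (descriptor, (dx, dy), weight), where (dx, dy)
        is the learned offset from the feature to the object center and
        weight stands in for the learned predictive accuracy.
    Returns ((cx, cy), total_vote), or None if there are no votes.
    """
    votes = defaultdict(float)
    for desc, (x, y) in image_features:
        # k nearest model descriptors by squared Euclidean distance.
        nearest = sorted(
            model_features,
            key=lambda m: sum((a - b) ** 2 for a, b in zip(desc, m[0])),
        )[:k]
        for _, (dx, dy), weight in nearest:
            votes[(x + dx, y + dy)] += weight
    if not votes:
        return None
    return max(votes.items(), key=lambda kv: kv[1])
```

Because every matched feature votes independently, features hidden by an occluder simply contribute no votes, which is one way to see why this kind of scheme tolerates partial occlusion.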
4.1.3. Detection of boundary ownership at motion discontinuities
We use motion cues to assign, for each pixel along a motion discontinuity, the direction of the region that ‘owns’ the boundary (Sundberg et al., 2011). The owner region is defined as the neighboring region that moves together with the boundary (Yonas et al., 1987). The algorithm computes the image motion on the two sides of the motion boundary, denoted by V1, V2. It also tracks the displacement of the motion boundary itself, Vb. Ideally, the velocity of the owning region should match the measured velocity of the boundary (since the boundary is the edge of the moving region, Fig. 8). The algorithm therefore computes ||V1-Vb||, ||V2-Vb||, and the owner is identified by the side that produces the smaller magnitude; the magnitude itself is used as a confidence score. The local confidence scores computed for determining the boundary owner are integrated along small boundary segments (Kovesi, 2000).
Fig. 8.
Boundary ownership computation. (A) A patch of pixels in the first frame, showing a segment of the motion boundary in black. (B) Motion fields on the two sides, along strips parallel to the boundary (bottom strip in red; top strip in green). Arrows show the optical flow V1 and V2 in these strips. (C) The motion boundary in the second frame (solid black) has been displaced from its previous position (shaded region) according to the motion Vb. This motion of the boundary is more similar to the motion V1 of the top region (green arrows) than to V2, and therefore, the boundary is owned by the top region (Sundberg et al., 2011). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
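The ownership rule of Section 4.1.3 compares ||V1-Vb|| with ||V2-Vb|| and assigns the boundary to the side with the smaller residual. A minimal sketch is given below; the integration along a segment is implemented here by summing residuals per side, which is an assumption about the aggregation rule, and the function names are hypothetical.

```python
import math

def residual(v_side, v_boundary):
    """Magnitude of the difference between a side's motion and the boundary's."""
    return math.hypot(v_side[0] - v_boundary[0], v_side[1] - v_boundary[1])

def boundary_owner(v1, v2, vb):
    """Return (owner, residual): owner is 1 or 2, the side whose motion
    better matches the boundary motion vb; a smaller residual means
    stronger evidence for that side."""
    r1, r2 = residual(v1, vb), residual(v2, vb)
    return (1, r1) if r1 <= r2 else (2, r2)

def segment_owner(samples):
    """Integrate noisy per-pixel evidence along a short boundary segment.

    samples: list of (v1, v2, vb) triples, one per boundary pixel.
    The side with the smaller summed residual owns the segment
    (assumed aggregation rule).
    """
    total1 = sum(residual(v1, vb) for v1, _, vb in samples)
    total2 = sum(residual(v2, vb) for _, v2, vb in samples)
    return 1 if total1 <= total2 else 2
```

For example, if side 1 moves with velocity (1, 0), side 2 is static, and the boundary is displaced by (1, 0), the boundary is assigned to side 1.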
4.1.4. Detection of internal boundaries of familiar moving regions
Based on the algorithms described in 4.1.1, 4.1.2, the object is separated from the background along its external boundaries. The internal boundary at a container’s rim is detected as a motion discontinuity boundary during the dynamic familiarization stage, similar to Section 4.1.3 above. However, the internal motion discontinuity is typically noisier and harder to detect compared with external boundaries (since the depth-difference across the boundary is smaller). This stage (Section 4.1.4) therefore reduces noise by integrating measurements along the detected discontinuities. In our implementation, the model segments the object from the background, using all the motion discontinuities detected (both external and internal). The segmentation follows (Najman and Schmitt, 1996, Shi and Malik, 2000), but where the boundaries are motion discontinuities rather than intensity edges. The method produces initially a so-called ‘over-segmentation’, which consists of a relatively large number of small regions that are compatible with the detected boundaries.
Next, regions separated by low boundary ownership measure (see Section 4.1.3) are iteratively merged until there are two regions left. A high discontinuity measure between the last two regions indicates that the object consists of ‘front’ and ‘back’ regions separated by an internal boundary. Finally, the model produces a representation of the internal boundary similar to the representation of external boundaries in Section 4.1.2 (the internal boundary separates between two sub-regions of the object rather than between the object and background; Fig. 4D).
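The iterative merging described above can be sketched as a greedy procedure on a region adjacency graph. The representation below (a dictionary of pairwise boundary scores) and the rule for combining scores after a merge (taking the maximum) are assumptions made for illustration; the text does not specify them.

```python
def merge_to_two_regions(boundary_scores):
    """Greedily merge an over-segmentation until two regions remain.

    boundary_scores: dict mapping frozenset({a, b}) to the boundary
        measure between adjacent regions a and b (labels must be sortable).
    Returns (groups, final_score): the merged label sets and the boundary
    measure between the last two regions. A high final_score suggests an
    internal boundary, i.e. a container with 'front' and 'back' regions.
    """
    scores = dict(boundary_scores)
    groups = {r: {r} for pair in scores for r in pair}
    while len(groups) > 2 and scores:
        weakest = min(scores, key=scores.get)
        keep, drop = sorted(weakest)
        groups[keep] |= groups.pop(drop)
        del scores[weakest]
        # Re-route the dropped region's remaining boundaries to the kept one.
        for pair in [p for p in scores if drop in p]:
            (other,) = pair - {drop}
            value = scores.pop(pair)
            new_pair = frozenset({keep, other})
            # Assumed rule: keep the stronger of two parallel boundaries.
            scores[new_pair] = max(value, scores.get(new_pair, 0.0))
    final_score = max(scores.values()) if scores else 0.0
    return list(groups.values()), final_score
```

With four regions in a chain and scores 0.1, 0.9 and 0.2, the two weak boundaries are merged away and the remaining score between the last two regions is 0.9.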
4.1.5. Detection of familiar internal boundaries in a static scene
Based on Section 4.1.4, the detection of the internal boundary separating two object sub-regions in a static scene uses the same algorithm that is applied to locate external boundaries, which separate figure from background regions (4.1.1, 4.1.2; Fig. 7B).
4.1.6. Detection of familiar internal regions in a static scene
The internal boundary along the rim naturally divides the container’s region into two sub-regions, ‘front’ and ‘back’ (Figs. 4F, 5D and 6D). Based on the algorithms described in 4.1.3, 4.1.4, during the familiarization phase, the model discriminates between the front and back regions of a container separated by an internal boundary at the container’s rim (Fig. 6D). The two sub-regions become a part of the object model. Using this representation, the model applies the same algorithm used to detect regions and boundaries in a static image (4.1.1, 4.1.2, 4.1.4, 4.1.5) to detect the front and back regions in a static scene as well.
4.2. Algorithmic implementation of the model's stages
Described below are the algorithms implementing the model’s recognition processes at the different stages, i.e. how the model uses the evolving capacities to make increasingly complex judgments about containment relations.
4.2.1. Recognizing dynamic containment
The algorithm tracks the moving boundary between two object regions in the image, R1, R2, which have been learned during the familiarization period. The algorithm computes boundary ownership at each pixel along the moving boundary (Figs. 2B, 4C and 5A); since the ownership measurements can be noisy, they are integrated over 200 ms, and the direction of ownership is determined by a majority vote. The model detects a paradoxical occlusion (a containment relation) when ownership direction switches from R1 to R2 between two consecutive time windows. Furthermore, this switch signals that the final owner, R2, is the container. To increase noise robustness, the final detection also requires a minimal difference between the fraction of the common border owned by each object (set empirically to 0.4), and a minimal length of assigned boundary (set to 30 pixels) on average per frame.
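The decision rule above can be sketched in a few lines of Python. The sketch assumes that each frame has already been reduced to a pair of counts (boundary pixels owned by R1 and by R2), that the frame rate is 25 frames per second, and that the 0.4 criterion is applied to the difference of ownership fractions within each window; these are assumptions for illustration, as are the function names.

```python
def window_owner(frames, min_diff=0.4, min_len=30):
    """Majority owner within one time window, or None if unreliable.

    frames: list of (n1, n2), the number of boundary pixels owned by
        R1 and R2 in each frame of the window.
    """
    n1 = sum(f[0] for f in frames)
    n2 = sum(f[1] for f in frames)
    total = n1 + n2
    if total / len(frames) < min_len:      # too little boundary per frame
        return None
    if abs(n1 - n2) / total < min_diff:    # ownership fractions too close
        return None
    return 1 if n1 > n2 else 2

def detect_containment(counts, fps=25, window_s=0.2):
    """Return {'contained': i, 'container': j} on an ownership switch
    between two consecutive windows, else None."""
    size = max(1, round(fps * window_s))
    windows = [counts[i:i + size]
               for i in range(0, len(counts) - size + 1, size)]
    owners = [window_owner(w) for w in windows]
    for prev, curr in zip(owners, owners[1:]):
        if prev is not None and curr is not None and prev != curr:
            # The final owner is taken to be the container.
            return {"contained": prev, "container": curr}
    return None
```

A sequence in which R1 owns most of the boundary in the first window and R2 owns most of it in the next window is reported as containment with R2 as the container, whereas a sequence with a constant owner returns `None`.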
4.2.2. Recognizing static containment
Static containment is detected by paradoxical occlusion: the simple object O occludes the container C, and it is also occluded by C at the internal boundary (Figs. 1B, 2C, 5B). Since the internal boundary is partially occluded on one side by the inserted object, the detection algorithm is applied separately to the two sides of the boundary. Paradoxical occlusion is detected based on the simultaneous detection of opposing occlusion relations along the common borders between O and C (4.1.2, 4.1.4, 4.1.5, 4.1.6). In our implementation, the decision was based on a measure of the paradoxical relation defined by:
where o, c are the number of border pixels owned by O and C respectively, along their common borders.
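The defining expression does not appear in this text, so the sketch below should be read as an assumption rather than as the measure used in the implementation. It takes one form that is consistent with the caption of Fig. 12 (the ratio between the number of border pixels owned by one object and the total number of common-border pixels) and with the reported threshold range of up to 0.5, namely the smaller of o and c divided by o + c. The function names are hypothetical; the default threshold of 0.16 is the value reported in Section 5.2.3.

```python
def paradox_score(o, c):
    """Assumed paradoxical-occlusion measure: share of the common border
    owned by the minority owner. Ranges from 0 (one object owns the whole
    border) to 0.5 (the border is split evenly)."""
    total = o + c
    return min(o, c) / total if total else 0.0

def is_contained(o, c, threshold=0.16):
    """Classify as containment when both ownerships are substantially present."""
    return paradox_score(o, c) > threshold
```

Under this assumed form, a border with 60 pixels owned by O and 40 owned by C scores 0.4 and is classified as containment, while a border with 100 and 5 pixels scores below 0.05 and is treated as a simple occlusion.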
4.2.3. Measuring containment ‘tightness’
In a containment event, the model extracts the entire length of the internal boundary of the container C and the length of the common part of the internal boundary shared between C and the inserted object O. The model then produces a continuous measure of the containment ‘tightness’ based on the proportion between the two lengths. A measure close to 1 (similar lengths) indicates a ‘tight’ fit, and a low measure (the common boundary between C and O is small compared with the full internal boundary of C) indicates a ‘loose’ fit.
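As a minimal sketch of this measure, the function below divides the length of the shared part of the internal boundary by the full rim length. The name `tightness` and the clipping of the result to 1 are assumptions for illustration.

```python
def tightness(shared_length, rim_length):
    """Continuous tightness measure in [0, 1].

    shared_length: length (in pixels) of the part of C's internal
        boundary shared with the inserted object O.
    rim_length: full length (in pixels) of C's internal boundary.
    """
    if rim_length <= 0:
        raise ValueError("rim_length must be positive")
    # Clip to 1 to guard against measurement noise (assumed safeguard).
    return min(1.0, shared_length / rim_length)
```

For a rim of 100 pixels, a shared boundary of 90 pixels gives 0.9 (a ‘tight’ fit) and a shared boundary of 20 pixels gives 0.2 (a ‘loose’ fit).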
4.2.4. Recognizing static containment from a high view
In a high view, the object is entirely surrounded by the container’s ‘back’ region (Figs. 2D and 5D). Using the capacities C1, C2, C4-C6 (Section 3.2), the model detects the objects’ regions and boundaries in the static scene, including the container’s front and back regions. The model assigns a boundary ownership to all the common borders between the simple object O and the container C. The model detects high-view containment when all the common borders between O and C are owned by O, and separate O from the ‘back’ region of C. By analyzing the local ownership along the common borders and the ‘front’/‘back’ regions, the model separates containment from ‘in-front’ and ‘behind’ relations. The model does not require supervision to learn high-view containment, but instead finds this new relation based on clustering of occlusion relations. To this end, the model uses the four possible local occlusion relations between the object O and the container’s (C) front and back regions: O occludes C’s front region, O occludes C’s back region, C’s front region occludes O and C’s back region occludes O. The model applies a normalized histogram of the number of boundary pixels over the four types of local occlusion relations to classify among four types of object-container spatial relations. The histogram creates four clearly distinct clusters, corresponding to: O is in front of C, O is behind C, O is inside C (and also occluded by C) as seen from a low view, or O is inside C (and is not occluded by C) as seen from a high view. In our experiments, we used K-means clustering with K = 4 to classify the spatial relations.
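The histogram and clustering steps can be sketched as follows. This is a small standard-library illustration, not the implementation used in the experiments: the relation names, the deterministic farthest-point initialization, and the fixed number of iterations are assumptions.

```python
RELATIONS = ("O_over_front", "O_over_back", "front_over_O", "back_over_O")

def occlusion_histogram(counts):
    """Normalized histogram of boundary pixels over the four local
    occlusion relations. counts: dict from relation name to pixel count."""
    values = [counts.get(r, 0) for r in RELATIONS]
    total = sum(values)
    return [v / total if total else 0.0 for v in values]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain K-means with farthest-point initialization (assumed choice).
    Returns one cluster label per point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(max(1, iters)):
        # Assignment step: nearest center for each point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(x) / len(members) for x in zip(*members)]
    return labels
```

With toy pixel counts for the four configurations (for instance, only ‘O occludes back’ for a high view, and a mix of ‘O occludes back’ and ‘front occludes O’ for a low view), calling `kmeans(histograms, 4)` groups the histograms by spatial relation without any labels.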
5. Experimental evaluation
Our model uses evolving capacities to make increasingly complex judgments about containment relationships. To demonstrate and evaluate the model's recognition capabilities at the different stages, we used a set of natural videos and images showing containers and simple objects at multiple spatial relations. The representations in the model are learned automatically and without supervision by introducing the objects to the model using unlabeled familiarization video sequences (Section 5.1). We demonstrate how the learned representations can be used to perform complex judgments about containment relations including discriminating between container and non-container (simple) objects, and classifying occlusion relations including ‘in-front’, ‘behind’ and ‘inside’ (Section 5.2). Source code of the computational experiments described in this section is included as Supplementary Data S1, S2.
5.1. Video and image datasets
Data for the model consisted of familiarization video sequences, and test data included both videos and static images. The familiarization video sequences were used to introduce the participating objects to the model. We used the test data to evaluate the model’s performance in detecting containment along with related relations (‘behind’, ‘in-front’, ‘on-top’; Fig. 9).
Fig. 9.
Examples from the test dataset. Examples from images used for testing the model, depicting different spatial relations (‘in front’, ‘behind’, ‘inside’, ‘on-top’) between various objects and various containers from multiple views.
The videos and images (640 × 360 pixels) were taken with a stationary camera from two viewpoints: a low viewing angle, where objects were partially occluded when placed inside a container, and a high viewing angle, where objects were fully visible inside a container (Figs. 1A, 1D and 9). Objects used in the experiments included seven containers (a wooden box, 3 baskets and 3 card boxes), and five non-container objects, termed ‘simple’ objects (stuffed animals). For support (‘on-top’) and cover relations, we used two containers, but with their open side facing down (Fig. 1E and 1F).
5.1.1. Data for familiarization sequences
Short video sequences were presented to the model prior to testing, and were used by the model to learn about the participating objects, by detecting object boundaries and separating the objects from the background (see Supplementary Data S3, S4 for sample familiarization videos). Similar familiarization episodes are used routinely in infant experiments for introducing the participating objects to the infants (Baillargeon, 1998, Casasola, 2008). Each video introduced a single object being moved by a hand against a static background. We also used an image of the object at rest, without a holding hand, to separate the hand from the object. The motion included some wiggling, which made the motion boundary at the container rim easier to detect (Fig. 6). For each object there were 2–4 sequences, with a total duration of 2 s for simple objects and 8 s for containers.
5.1.2. Data for dynamic test sequences
Short video sequences (total of 176 events) were used for testing the automatic detection of dynamic containment events (‘inside’, 59 events) and distinguishing them from occlusion events (‘in-front’ or ‘behind’, 57 and 60 events, respectively; see Supplementary Data S5-S7 for sample test videos). Each video depicted a stationary container and a moving object being placed inside, in-front of, or behind the container. The videos were taken from a low-viewing angle. Each test video was one second long.
5.1.3. Data for static test images
Single-frame images were used for testing the detection of different spatial relations between objects in a static setting. Each test image showed a simple object ‘inside’, ‘in-front’ or ‘behind’ a container. Test images included both low (176 images) and high viewing angles (175 images). There were 351 static test images in total, for different object-container pairs, different spatial relations and the two viewing angles (Fig. 9).
5.2. Experiments and results
5.2.1. Containers vs. non-containers
Containers have an opening or a cavity, through which other objects may be inserted into them. We tested whether it is possible to discriminate between container and non-container (simple) objects, based only on the initial model capacities (i.e. motion segmentation and boundary ownership at motion discontinuities), and without any labeled examples. We examined all objects in our dataset (both simple and container objects, Section 5.1). Each frame image of the familiarization videos (Section 5.1.1) is segmented by the model, separating moving regions from the background. An internal boundary within the object is detected using the border score (used to determine boundary ownership) between the last two segments in the segmentation process as a measure of depth discontinuity contours within the object (Section 4.1.4).
We compared the scores from all video frames for each object. These scores lead to a perfect distinction between container and non-container objects (1-tailed two-sample t-test, t(1048) = 14.18, p < 10⁻⁶; Fig. 10).
Fig. 10.
Discrimination between container and non-container objects. Container and non-container (simple) objects are discriminated without any labeled examples. In videos of moving objects, based on the initial capacities of the model, containers and simple objects are well-separated by the measured evidence of boundary ownership at motion discontinuities inside the moving object regions.
5.2.2. Recognizing containment in dynamic scenes
Following familiarization, we applied the model at its first, dynamic stage (Section 3.3.1) to the test video sequences (Section 5.1.2). The model correctly identified 91% (160 out of 176) of the test dynamic events (distinguishing ‘containment’ from ‘occlusion’), based on a measure of a boundary ownership switch (Section 3.3.1), which occurs in containment events but not in simple occlusion events (2-sample tailed t-test, t(174) = 20.14, p < 10⁻⁶, Fig. 11).
Fig. 11.
Recognizing containment in dynamic scenes. (A) Containment is detected by a switch in boundary ownership from the inserted object to the container. The boundary ownership score indicates the owner of the common border between a container and non-container (simple) objects. During a simple occlusion event (left), one of the objects maintains ownership throughout the event. However, during a containment event (right), the common border switches ownership between the simple object and the container, signaling a paradoxical occlusion. (B) The model computes an ‘inversion score’ for boundary ownership, and uses it to detect containment. The boundary ownership inversion score measures the confidence level of a boundary ownership switch during a dynamic event. Containment (blue triangles) is well separated from occlusion (red discs) by the detection threshold (dotted line). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
5.2.3. Recognizing containment in static scenes at a low-view
Following familiarization, we applied the model at the low-view static stage (Section 3.3.2) to the test images (Section 5.1.3, low-view images). At this stage, the model correctly identified 89% (157 out of 176) of the test static scenes (distinguishing between ‘contained’ and ‘not-contained’), based on a measure of conflicting occlusion cues along the common boundary between two objects (Section 3.3.2). These conflicting cues arise from a paradoxical occlusion, which is present in containment scenes but not in simple occlusion scenes (two-sample tailed t-test, t(174) = 13.27, p < 10⁻⁶, Fig. 12).
Fig. 12.
Recognizing containment in static scenes. In a static scene, paradoxical occlusion is detected by the model directly (applying a simple threshold) from the simultaneous detection of opposing occlusion relations along the common borders between a container and a non-container (simple) object, when some border parts are owned by one object, while other parts are owned by the other object. The opposing occlusion relations are measured by the ratio between the length (number of pixels) owned by one of the objects and the total number of pixels (along the common borders).
The classification threshold for this measure was set empirically to 0.16 and was fixed for all experiments (results were insensitive to the threshold in the range 0.1–0.5).
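The thresholded ratio described above can be sketched in a few lines. This is a minimal illustration, not the model's implementation: the label encoding (`"A"` and `"B"`), the function names, and the choice to use the smaller of the two ownership fractions are assumptions of the sketch. Only the threshold value of 0.16 comes from the text.

```python
THRESHOLD = 0.16  # classification threshold reported in the text


def conflict_ratio(owners):
    """Fraction of common-border pixels owned by the minority owner.

    owners: one label per pixel along the common border, "A" or "B",
    naming the object that owns the border at that pixel (assumed encoding).
    """
    if not owners:
        return 0.0
    frac_a = sum(1 for o in owners if o == "A") / len(owners)
    # Assumption: use the smaller fraction, so the measure lies in [0, 0.5]
    # and is 0 when a single object owns the entire border.
    return min(frac_a, 1.0 - frac_a)


def is_contained(owners, threshold=THRESHOLD):
    """Flag a paradoxical occlusion when both owners are well represented."""
    return conflict_ratio(owners) > threshold


simple_occlusion = ["A"] * 50                 # one owner along the border
paradoxical = ["A"] * 30 + ["B"] * 20         # opposing occlusion relations
print(is_contained(simple_occlusion), is_contained(paradoxical))
```

Under this reading, a simple occlusion gives a ratio near 0 and a containment scene gives a clearly positive ratio, which would be consistent with the reported insensitivity to the exact threshold.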
5.2.4. Tight vs. Loose fit
Since we did not have objective ‘ground truth’ for tight and loose fit, we evaluated the performance of the model in distinguishing 'tight' and 'loose' fit between objects and containers by comparing the tight/loose judgments produced by the model (Section 3.3.3) with judgments produced by five human adults for the same test images. (Subjects were asked to judge between ‘tight’ vs. ‘loose’ fit in the static test images, thereby producing scores in the range 1–5.) The judgments of ‘loose’ and ‘tight’ produced by the model (in the range [0, 1]) were found to be correlated with the human judgments (Pearson r = 0.71, p < 10⁻⁶). The results suggest that the manner in which the algorithm extracts and uses boundary information from visual cues is similar to human visual judgments of tight/loose fit (Fig. 13).
Fig. 13.
Tight vs. loose fit in containment events by the model compared with humans. (A) Containment 'tightness' is measured by the ratio between the length of the common border (along the internal boundary of the container) shared with the inserted object, and the length of the entire internal boundary. A ratio close to 1 (similar lengths) indicates a ‘tight’ fit, while a low ratio indicates a ‘loose’ fit. (B) The judgments of ‘loose’ and ‘tight’ fit produced by the model are correlated (R² = 0.51) with the judgments of five human subjects, suggesting that the manner in which the algorithm extracts and uses boundary information from visual cues is similar to humans’ 'tightness' judgments.
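The tightness ratio in panel (A) and the correlation in panel (B) can be sketched as below. The function names and all numeric data are hypothetical and serve only to show the computation; they are not the study's measurements.

```python
import math


def tightness(common_border_len, internal_boundary_len):
    """Ratio of the shared border length to the full internal boundary length."""
    if internal_boundary_len <= 0:
        raise ValueError("internal boundary length must be positive")
    return common_border_len / internal_boundary_len


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical border lengths in pixels: (shared border, internal boundary).
model_scores = [tightness(c, b) for c, b in [(90, 100), (20, 100), (55, 100), (75, 100)]]
# Hypothetical mean human ratings on the 1-5 scale for the same images.
human_scores = [4.6, 1.8, 3.1, 3.5]

r = pearson_r(model_scores, human_scores)
print(round(r, 2), round(r * r, 2))
```

For a simple linear fit, R² is the square of Pearson's r; squaring the reported r = 0.71 gives about 0.50, which agrees with the R² = 0.51 in the caption up to rounding of r.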
5.2.5. Recognizing containment in static scenes at a high-view
Following familiarization, we applied the model at the high-view static stage (Section 3.3.4) to the test images (Section 5.1.3). At this stage, the model correctly identified 82% (144 out of 175) of the high-view test images, as well as 82% (288 out of 351) of a mixture of both low- and high-view test images.
5.2.6. Extensions to related concepts: Support and cover relations
The detection of object boundaries learned at motion discontinuities along the container’s rim is essential for recognizing a containment relation. Similarly, the detection of object boundaries on the surface of a supporting object is essential for recognizing a support relation. These two types of object boundaries arise from different kinds of discontinuity (Marr, 1982): contours of depth discontinuity at the container rim (occlusion contours), vs. contours of discontinuity in surface orientation, e.g. along the internal edges of a box (or even no discontinuity, for a smooth supporting object).
We tested computationally which type of contour produces stronger motion discontinuities. For this purpose, we compared the detection confidence score (Sections 4.1.3, 4.1.4) of motion discontinuities along the edges of a box, which could be convex (solid boxes) or concave (open boxes; Fig. 14).
Fig. 14.
Containment vs. Support. (A) Object boundaries along a container rim are highly detectable by the model from motion flow at motion discontinuities. (B) In contrast, object boundaries on the surface of a supporting object are not detectable from motion flow. These two types of boundaries arise from different types of surface discontinuity: contours of depth discontinuity at the container rim (occlusion contours), vs. contours of discontinuity in surface orientation, e.g. along the internal edges of a box. Arrows indicate the objects’ rotation. Color maps at bottom show the computed gradient of the optical flow: bright colors show high gradients along contours of motion discontinuity. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
We simulated short video sequences of rotating synthetic 3D cubes (both concave and convex) with 10 different textures. The mean confidence score for the concave (container) cubes (M = 2.37, SD = 1.76) was significantly higher (two-sample tailed t-test: t(11,798) = 85.68, p < 10⁻⁶) than the mean confidence score for the convex (closed box) cubes (M = 0.28, SD = 0.64; Fig. 14). The relative difficulty in detecting convex internal boundaries makes the detection of support relations more difficult in the model than the detection of containment relations. Since the detection of internal object boundaries is crucial for identifying containment, cover and support relations, this difficulty is likely to contribute to the relative delay in acquiring support relations in human vision.
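As a rough consistency check on these summary statistics, the t statistic can be recomputed from the reported means and standard deviations. The group sizes are not stated in the text; the sketch below assumes two equal groups of 5,900 scores each, which is one split compatible with the reported 11,798 degrees of freedom, and it assumes a pooled-variance (Student) two-sample test.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student two-sample t statistic and degrees of freedom from summary stats."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return t, df


# Reported values: concave cubes M = 2.37, SD = 1.76; convex cubes M = 0.28, SD = 0.64.
# Assumed (not reported): equal group sizes of 5,900 scores each.
t, df = pooled_t(2.37, 1.76, 5900, 0.28, 0.64, 5900)
print(df, round(t, 1))
```

Under these assumptions the statistic comes out at roughly 85.7, close to the reported 85.68; the small difference is what one would expect from rounding of the reported means and standard deviations or from unequal group sizes.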
Taken together, the results show that, equipped with simple initial capacities, the model successfully acquires a range of containment-related concepts in an unsupervised manner and with relatively limited exposure to training stimuli.
5.2.7. Known failures of the model
Failures of the model arise primarily from two sources. The first is limitations of the early processing of motion and boundaries, for example a failure to detect motion discontinuities due to heavy shadowing or weak motion signals. The second is particular, non-generic configurations, such as a tight-fit containment in which the contained object completely occludes the back part of the container.
6. Discussion
Concepts related to spatial relations in general, and containment in particular, are a fundamental component of human cognition, and they play a useful role in reasoning about a broad range of physical phenomena (Davis et al., 2013, Spelke and Hespos, 2002). Their acquisition in early development raises a number of basic questions. How can they be acquired early and without supervision? Why is containment, which computationally appears abstract and complex (Liang et al., 2015, Yu et al., 2015), acquired before other relations, and what causes the particular time trajectory of its acquisition? The current model shows how containment concepts can emerge early, without explicit supervision, and in a predictable order. The main mechanism that allows this learning is the detection of paradoxical occlusion, and its use for guiding the learning process. The ability to detect and pay attention to paradoxical occlusion can be expected in early developmental stages, when infants rapidly learn to detect object boundaries and establish occlusion relations (Aguiar and Baillargeon, 1999, Luo and Baillargeon, 2005). The paradoxical occlusion signal then provides internal implicit supervision, and guides the system to gradually acquire a range of useful containment-related concepts.
The model makes predictions about the acquisition order (on high-view, Section 3.3.4 and cover, Section 3.3.5), which could be tested in future studies. High-view containment categorization comes later than low-view categorization in the development of containment relations (around 6 months; Casasola et al., 2003). In the model, this requires an extension of occlusion detection from paradoxical occlusion to the configuration of occluding an object’s back part. The model suggests that at an early age (prior to capacity C4, Section 3.2), high-view containment should be confusable with an in-front relation (since the object occludes the container, before a distinction into ‘front’ and ‘back’ regions is made). With respect to cover, the model predicts that a small change in the stimulus presentations used in the past (Wang et al., 2005) will lead to an inability to distinguish ‘cover’ from ‘in-front’ relations (Section 3.3.5).
At an early stage (prior to capacity C6, Section 3.2), when the container is represented in the model as a solid region rather than a set of separable front and back regions, the sensitivity to paradoxical occlusion incorporated in the model may be a special case of sensitivity to a violated expectation: unlike simple occlusion, paradoxical occlusion involves two opposing occlusion relations between the same two objects. Consistent with general developmental processes (Stahl & Feigenson, 2015), this unexpected configuration can enhance the learning of containment relations and their implications.
Detecting static paradoxical occlusion may be aided by depth information (Johnson and Aslin, 1996, Spelke et al., 1989), which was not used by the current model. However, binocular vision (Braddick, 1996) and pictorial depth perception (Kavšek et al., 2012) evolve gradually starting at a few months of age, and their contribution to early stages of containment learning is likely to be limited. The model focuses on early stages of learning to identify containers and containment; reaching a comprehensive understanding of concepts related to ‘containment’ at an adult level (Davis et al., 2013) is likely to develop over an extended period, and to incorporate non-visual components, including sensory-motor manipulation.
At a general level, the model uses internal implicit supervision to guide the learning process, rather than external guidance by labeled training examples. A similar strategy of using simple internal signals, typically motion based and consistent with infants’ early capacities, has been proposed for several other learning tasks that are mastered at a surprisingly early age, including general object segregation (Dorfman et al., 2013), and the recognition of hands and direction of gaze (Ullman, Harari, & Dorfman, 2012). Biologically, the automatic use of specific teaching signals to guide learning may be based on appropriate pre-existing general patterns of connectivity between cortical regions, which gradually develop to acquire specific functional specializations (Arcaro, Schade, Vincent, Ponce, & Livingstone, 2017). For example, in the containment case, the model suggests a pattern of connectivity from regions dealing with segregation and ordinal depth to regions involved with objects and their properties, such as the ability to predict the location of hidden objects behind or inside containers (Hespos & Baillargeon, 2001b).
The mechanism of using internal implicit supervision to guide learning is likely to have broader application in other cognitive domains, because it serves two highly useful and general roles. First, it alleviates the need for supplying extensive external supervision, and second, it can guide the learning process to extract concepts that are meaningful to the observer, even if they are not by themselves highly salient in the visual input. Such aspects of cognitive learning discovered in infants can conceivably be adapted for use by future machine learning systems, which currently often rely on large annotated data sets supplying external supervision (Krizhevsky et al., 2012, LeCun et al., 2015), and focus on image structures that are statistically salient (Le et al., 2012).
In the current model, the internal guiding signals were incorporated in the model prior to the learning stage. An intriguing alternative for future studies is to develop more extended learning methods, which cover both evolutionary and individual aspects. Such a process would use prolonged unsupervised training to discover useful guiding signals on its own, which can subsequently support fast unsupervised learning from experience.
Acknowledgments
The work was supported by European Research Council (ERC) Advanced Grant “Digital Baby”, Israeli Science Foundation (ISF) grant 320/16 and the German Research Foundation (DFG Grant ZO 349/1-1).
Author contributions
S.U., N.D. and D.H. contributed equally to this work. S.U., N.D. and D.H. designed research; D.H. and N.D. performed research; D.H. and N.D. analyzed data; S.U., N.D. and D.H. wrote the paper. All authors discussed the results and commented on the manuscript.
Footnotes
Supplementary data to this article can be found online at https://doi.org/10.1016/j.cognition.2018.11.001.
Appendix A. Computer vision algorithms
Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a dynamic visual scene caused by the relative motion between the observer and the scene. Computational optical flow methods calculate the motion vector between two successive video frames at every image (pixel) location (Fig. 6B; (Black and Anandan, 1996, Sun et al., 2013)).
Motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. Motion segmentation techniques provide labels to these coherent image regions, which undergo different and independent motion patterns, and correspond to moving objects in the scene (Fig. 6D; (Burt et al., 1991, Horn and Weldon, 1988, Irani et al., 1994, Peleg and Rom, 1990)).
Boundary detection at motion discontinuities uses motion difference between adjacent image regions to determine the boundary between the regions. The computation of the motion difference involves both spatial and temporal aggregation of the optical flow. These boundaries at motion discontinuities may be combined with static occlusion information to better fit object boundaries. Boundary ownership is assigned to adjacent regions by comparing the optical flow at boundary points and at the regions (Figs. 6D, 8; (Dorfman et al., 2013, Ogale et al., 2005, Sargin et al., 2009, Stein and Hebert, 2009, Sundberg et al., 2011, Beck et al., 2008, Verri et al., 1989)).
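To make the idea of a motion discontinuity concrete, the sketch below marks pixels where the optical flow changes sharply between neighbors. It is a simplified single-frame illustration written for this appendix, not the cited algorithms: it omits the spatial and temporal aggregation mentioned above, and the function names, the forward-difference scheme, and the threshold are assumptions.

```python
import math


def flow_gradient_magnitude(u, v):
    """Per-pixel magnitude of the spatial gradient of a flow field.

    u, v: 2D lists (rows of equal length) holding the horizontal and
    vertical flow components at each pixel.
    """
    rows, cols = len(u), len(u[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            x1 = min(x + 1, cols - 1)   # forward difference, clamped at border
            y1 = min(y + 1, rows - 1)
            du_dx = u[y][x1] - u[y][x]
            du_dy = u[y1][x] - u[y][x]
            dv_dx = v[y][x1] - v[y][x]
            dv_dy = v[y1][x] - v[y][x]
            grad[y][x] = math.sqrt(du_dx ** 2 + du_dy ** 2 + dv_dx ** 2 + dv_dy ** 2)
    return grad


def boundary_mask(u, v, threshold=0.5):
    """True where the flow gradient exceeds a (hypothetical) threshold."""
    return [[g > threshold for g in row] for row in flow_gradient_magnitude(u, v)]


# Toy field: the left half moves right by 1 pixel per frame, the right half is static.
u = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0] for _ in range(4)]
v = [[0.0] * 6 for _ in range(4)]
mask = boundary_mask(u, v)
print([x for x, flag in enumerate(mask[0]) if flag])
```

In the toy field the only marked column is the one where the moving and static regions meet, which is the kind of contour the figure legends describe as bright in the flow-gradient maps.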
Part-based object detection models represent an object category by a set of object parts under mutual geometric constraints. For simplicity, the “star-model” object representation consists of a configuration of local regions surrounding a common center. Each region is represented by a set of image features describing local appearance (e.g. the SIFT descriptor (Lowe, 2004)), a segmentation mask assigning a single label to all image pixels inside the region, and a geometric offset from the object center (Crandall et al., 2005, Dorfman et al., 2013, Karlinsky et al., 2010, Leibe et al., 2004).
To detect an object in an image, local appearance image features are densely extracted throughout the image. For each feature, the object model retrieves the k nearest neighbor (k-NN) features from the learned representation. Since in the model, features are associated with local segmentation masks and offsets from the object center, each neighbor votes for the location of the object center with a relative weight proportional to its learned predictive accuracy. The object center is detected at the image location with the highest total votes. The object region and its boundaries are segmented out by projecting back the local segmentation masks associated with the image features, at the corresponding offsets from the detected object center (Fig. 7).
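The voting step of this detection scheme can be sketched as follows. This is a minimal illustration under stated assumptions: descriptors are short numeric lists rather than SIFT vectors, each learned entry stores a descriptor, an offset of the feature from the object center, and a weight standing in for its learned predictive accuracy, and all names and data are hypothetical. The back-projection of segmentation masks is not shown.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def detect_center(image_features, model, k=1):
    """Return the location with the highest total vote, or None if no features.

    image_features: list of (descriptor, (x, y)) extracted from the image.
    model: list of (descriptor, (dx, dy), weight), where (dx, dy) is the
           learned offset of the feature from the object center.
    """
    votes = defaultdict(float)
    for desc, (x, y) in image_features:
        # Retrieve the k nearest learned features for this image feature.
        neighbors = sorted(model, key=lambda m: sq_dist(desc, m[0]))[:k]
        for _, (dx, dy), weight in neighbors:
            # Each neighbor votes for the center implied by its stored offset.
            votes[(x - dx, y - dy)] += weight
    if not votes:
        return None
    return max(votes, key=votes.get)


# Hypothetical learned model: a "left part" and a "right part" of an object.
model = [([1.0, 0.0], (-2, 0), 1.0), ([0.0, 1.0], (2, 0), 1.0)]
# Two consistent features plus one distractor that resembles the left part.
features = [([1.0, 0.0], (3, 5)), ([0.0, 1.0], (7, 5)), ([0.9, 0.1], (0, 0))]
print(detect_center(features, model))
```

Because consistent parts vote for the same center while distractors scatter their votes, the maximum of the accumulated votes tends to fall on the true object center, which is the rationale for the star-model representation described above.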
References
- Aguiar A., Baillargeon R. Eight-and-a-half-month-old infants’ reasoning about containment events. Child Development. 1998;69(3):636–653.
- Aguiar A., Baillargeon R. 2.5-month-old infants’ reasoning about when objects should and should not be occluded. Cognitive Psychology. 1999;39:116–157. doi: 10.1006/cogp.1999.0717.
- Aguiar, A., & Baillargeon, R. (2000). Perseveration and problem solving in infancy. In Reese, H. W. (Ed.), Advances in child development and behavior (Vol. 27, pp. 135–180). doi: 10.1016/S0065-2407(08)60138-X.
- Aguiar A., Baillargeon R. Perseverative responding in a violation-of-expectation task in 6.5-month-old infants. Cognition. 2003;88(3):277–316. doi: 10.1016/s0010-0277(03)00044-1.
- Arcaro M.J., Schade P.F., Vincent J.L., Ponce C.R., Livingstone M.S. Seeing faces is necessary for face-domain formation. Nature Neuroscience. 2017;20(10):1404–1412. doi: 10.1038/nn.4635.
- Arterberry M.E., Yonas A. Infants’ sensitivity to kinetic information for three-dimensional object shape. Perception and Psychophysics. 1988;44(1):1–6. doi: 10.3758/bf03207466.
- Arterberry M.E., Yonas A. Perception of three-dimensional shape specified by optic flow by 8-week-old infants. Perception & Psychophysics. 2000;62(3):550–556. doi: 10.3758/bf03212106.
- Baillargeon R. Infants’ understanding of the physical world. In: Sabourin M., Craik F., editors. Advances in psychological science Vol. 2, Biological and cognitive aspects. Psychology Press; 1998. pp. 503–529.
- Baillargeon R. Infants’ Physical World. Current Directions in Psychological Science. 2004;13(3):89–94.
- Baillargeon R., DeVos J. Object permanence in young infants: Further evidence. Child Development. 1991;62(6):1227–1246.
- Baillargeon R., Needham A., DeVos J. The development of young infants’ intuitions about support. Cognition. 1992;1(2):69–78.
- Beck C., Ognibeni T., Neumann H. Object segmentation from motion discontinuities and temporal occlusions-A biologically inspired model. PLoS ONE. 2008;3(11) doi: 10.1371/journal.pone.0003807.
- Bedny M., Saxe R. Insights into the origins of knowledge from the cognitive neuroscience of blindness. Cognitive Neuropsychology. 2012;29(December):56–84. doi: 10.1080/02643294.2012.713342.
- Black M.J., Anandan P. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding. 1996;63(1):75–104.
- Braddick O. Binocularity in Infancy. Eye. 1996;10:182–188. doi: 10.1038/eye.1996.45.
- Burt, P. J., Hingorani, R., & Kolczynski, R. J. (1991). Mechanisms for isolating component patterns in the sequential analysis of multiple motion. In IEEE Workshop on Visual Motion (pp. 187–193). doi: 10.1109/WVM.1991.212808.
- Carey S. Oxford University Press; New York: 2009. The origin of concepts.
- Casasola M. When less is more: How infants learn to form an abstract categorical representation of support. Child Development. 2005;76(1):279–290. doi: 10.1111/j.1467-8624.2005.00844.x.
- Casasola M. The development of infants’ spatial categories. Current Directions in Psychological Science. 2008;17(1):21–25.
- Casasola M., Bhagwat J., Burke A.S. Learning to form a spatial category of tight-fit relations: How experience with a label can give a boost. Developmental Psychology. 2009;45(3):711–723. doi: 10.1037/a0015475.
- Casasola M., Cohen L.B. Infant categorization of containment, support and tight-fit spatial relationships. Developmental Science. 2002;5(2):247–264.
- Casasola M., Cohen L.B., Chiarello E. Six-month-old infants’ categorization of containment spatial relations. Child Development. 2003;74(3):679–693. doi: 10.1111/1467-8624.00562.
- Collins, J. W., & Forbus, K. D. (1987). Reasoning About Fluids Via Molecular Collections. In Proc AAAI (pp. 590–594). doi: 10.1016/B978-1-4832-1447-4.50048-1.
- Crandall, D., Felzenszwalb, P., & Huttenlocher, D. (2005). Spatial priors for part-based recognition using statistical models. In Proc computer vision and pattern recognition (pp. 10–17). doi: 10.1109/CVPR.2005.329.
- Davis E., Marcus G., Chen A. Reasoning from radically incomplete information: The case of containers. Advances in Cognitive Systems. 2013;2:1–18.
- Dorfman, N., Harari, D., & Ullman, S. (2013). Learning to perceive coherent objects. In Proc annual meeting of the cognitive science society (pp. 394–399).
- Granrud C.E., Yonas A., Smith I.M., Arterberry M.E., Glicksman M.L., Sorknes A.C. Infants’ sensitivity to accretion and deletion of texture as information for depth at an edge. Child Development. 1984;55:1630–1636.
- Griffith S., Sinapov J., Sukhoy V., Stoytchev A. A behavior-grounded approach to forming object categories: Separating containers from noncontainers. IEEE Transactions on Autonomous Mental Development. 2012;4(1):54–69.
- Hespos S., Baillargeon R. Infants’ knowledge about occlusion and containment events: A surprising discrepancy. Psychological Science. 2001;12(2):141–147. doi: 10.1111/1467-9280.00324.
- Hespos S.J., Baillargeon R. Reasoning about containment events in very young infants. Cognition. 2001;78(3):207–245. doi: 10.1016/s0010-0277(00)00118-9.
- Hespos S.J., Baillargeon R. Décalage in infants’ knowledge about occlusion and containment events: Converging evidence from action tasks. Cognition. 2006;99(2):B31–41. doi: 10.1016/j.cognition.2005.01.010.
- Hespos S., Spelke E.S. Conceptual precursors to language. Nature. 2004;430(6998):453–456. doi: 10.1038/nature02634.
- Horn B.K.P., Weldon E.J. Direct methods for recovering motion. International Journal of Computer Vision. 1988;2(1):51–76.
- Irani M., Rousso B., Peleg S. Computing occluding and transparent motions. International Journal of Computer Vision. 1994;12(1):5–16.
- Johnson M.J., Dziurawiec S., Ellis H., Morton J. Newborns’ preferential tracking of face-like stimuli and its subsequent decline. Cognition. 1991;40:1–19. doi: 10.1016/0010-0277(91)90045-6.
- Johnson S., Mason U. Perception of kinetic illusory contours by two month old infants. Child Development. 2002;73(1):22–34. doi: 10.1111/1467-8624.00389.
- Johnson S.P., Aslin R.N. Perception of object unity in young infants: The roles of motion, depth, and orientation. Cognitive Development. 1996;11(2):161–180.
- Karlinsky, L., Dinerstein, M., Harari, D., & Ullman, S. (2010). The chains model for detecting parts by their context. In Proc computer vision and pattern recognition (pp. 25–32). doi: 10.1109/CVPR.2010.5540232.
- Kaufmann-Hayoz R., Kaufmann F., Stucki M. Kinetic contours in infants’ visual perception. Child Development. 1986;57(2):292–299. doi: 10.1111/j.1467-8624.1986.tb00028.x.
- Kavšek M., Yonas A., Granrud C.E. Infants’ sensitivity to pictorial depth cues: A review and meta-analysis of looking studies. Infant Behavior & Development. 2012;35(1):109–128. doi: 10.1016/j.infbeh.2011.08.003.
- Kellman P.J., Spelke E.S. Perception of partly occluded objects in infancy. Cognitive Psychology. 1983;15(4):483–524. doi: 10.1016/0010-0285(83)90017-8.
- Kestenbaum R., Termine N., Spelke E.S. Perception of objects and object boundaries by 3-months old infants. British Journal of Developmental Psychology. 1987;5(4):367–383.
- Kovesi, P. D. (2000). MATLAB and octave functions for computer vision and image processing.
- Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In Proc neural information processing systems (pp. 1–9).
- Landau B., Gleitman L.R. Harvard University Press; London: 1985. Language and experience: Evidence from the blind child.
- Le Q.V., Monga R., Devin M., Corrado G., Chen K., Ranzato M.A.…Ng A.Y. Building high-level features using large scale unsupervised learning. International Conference on Machine Learning. 2012;81–88
- LeCun Y., Bengio Y., Hinton G. Deep learning. Nature. 2015;521(7553):436–444. doi: 10.1038/nature14539.
- Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In Workshop on statistical learning in computer vision (pp. 1–16).
- Liang W., Zhao Y., Zhu Y., Zhu S.-C. Evaluating human cognition of containing relations with physical simulation. CogSci. 2015 doi: 10.1145/2992138.2992148.
- Liang, W., Zhao, Y., Zhu, Y., & Zhu, S. (2016). What is where: inferring containment relations from videos. In Proceedings of the 25th international joint conference on artificial intelligence (IJCAI 2016) (pp. 3418–3424).
- Lowe D.G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 2004;60(2):91–110.
- Luo Y., Baillargeon R. When the ordinary seems unexpected: Evidence for incremental physical knowledge in young infants. Cognition. 2005;95(3):297–328. doi: 10.1016/j.cognition.2004.01.010.
- Marr D. Freeman; New York: 1982. Vision: A computational investigation into the human representation and processing of visual information.
- Mou Y., Luo Y. Is it a container? Young infants’ understanding of containment events. Infancy. 2017;22(2):256–270. doi: 10.1111/infa.12148.
- Najman L., Schmitt M. Geodesic saliency of watershed contours and hierarchical segmentation. Pattern Analysis and Machine Intelligence. 1996;18(12):1163–1173.
- Needham A. Object recognition and object segregation in 4.5-month-old infants. Journal of Experimental Child Psychology. 2001;78(1):3–22. doi: 10.1006/jecp.2000.2598.
- Needham A., Baillargeon R. Intuitions about support in 4.5-month-old infants. Cognition. 1993;47(2):121–148. doi: 10.1016/0010-0277(93)90002-d.
- Needham A., Baillargeon R. Effects of prior experience on 4.5-month old infants’ object segregation. Infant Behavior and Development. 1998;21:1–24.
- Needham A., Dueker G., Lockhead G. Infants’ formation and use of categories to segregate objects. Cognition. 2005;94(3):215–240. doi: 10.1016/j.cognition.2004.02.002.
- Needham A., Modi A. Infants’ use of prior experiences with objects in object segregation: Implications for object recognition in infancy. Advances in Child Development and Behavior. 1999;27:99–133. doi: 10.1016/s0065-2407(08)60137-8.
- Needham, A., & Ormsbee, S. M. (2003). The development of object segregation during the first year of life. In Perceptual organization in vision: behavioral and neural perspectives (pp. 205–179).
- Ogale A.S., Fermuller C., Aloimonos Y. Motion segmentation using occlusions. IEEE Transactions On Pattern Analysis and Machine Intelligence. 2005;27(6):988–992. doi: 10.1109/TPAMI.2005.123.
- Peleg, S., & Rom, H. (1990). Motion based segmentation. In Proc international conference on pattern recognition (Vol. 1, pp. 109–113). doi: 10.1109/ICPR.1990.118074.
- Piaget J., Inhelder B. W.W. Norton & Company; 1967. The child’s conception of space.
- Quinn P.C., Adams A., Kennedy E., Shettler L., Wasnik A. Development of an abstract category representation for the spatial relation between in 6- to 10-month-old infants. Developmental Psychology. 2003;39(1):151–163. doi: 10.1037//0012-1649.39.1.151.
- Rosman B., Ramamoorthy S. Learning spatial relationships between objects. The International Journal of Robotics Research. 2011;30(11):1328–1342.
- Sargin, M. E., Bertelli, L., Manjunath, B. S., & Rose, K. (2009). Probabilistic occlusion boundary detection on spatio-temporal lattices. In Proc ICCV (pp. 560–567). doi: 10.1109/ICCV.2009.5459190.
- Shi J., Malik J. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence. 2000;22(8):888–905.
- Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor Segmentation and Support Inference from RGBD Images. In ECCV (pp. 1–14). doi: 10.1007/978-3-642-33715-4_54.
- Sitskoorn M., Smitsman A. Infants’ perception of dynamic relations between objects: Passing through or support? Developmental Psychology. 1995;31(3):437–447.
- Spelke E. Principles of object perception. Cognitive Science. 1990;14(1):29–56.
- Spelke E., Hespos S. Conceptual development in infancy: The case of containment. In: Stein N.L., Bauer P.J., Rabinowitch M., editors. Representation, memory, and development: Essays in honor of Jean Mandler. Erlbaum; Hillsdale, NJ: 2002.
- Spelke E., Kinzler K. Core knowledge. Developmental Science. 2007;10(1):89–96. doi: 10.1111/j.1467-7687.2007.00569.x.
- Spelke E., von Hofsten C., Kestenbaum R. Object perception in infancy: Interaction of spatial and kinetic information for object boundaries. Developmental Psychology. 1989
- Stahl A.E., Feigenson L. Observing the unexpected enhances infants’ learning and exploration. Science. 2015;348(6230):91–94. doi: 10.1126/science.aaa3799.
- Stein A.N., Hebert M. Occlusion boundaries from motion: Low-level detection and mid-level reasoning. International Journal of Computer Vision. 2009;82(3):325–357.
- Strickland B., Scholl B.J. Visual perception involves event-type representations: The case of containment versus occlusion. Journal of Experimental Psychology. General. 2015;144(3):570–580. doi: 10.1037/a0037750.
- Sun, D., Wulff, J., Sudderth, E., Pfister, H., & Black, M. J. (2013). A fully-connected layered model of foreground and background flow. In Proc computer vision and pattern recognition. doi: 10.1109/CVPR.2013.317.
- Sundberg, P., Brox, T., Maire, M., Arbelaez, P., & Malik, J. (2011). Occlusion boundary detection and figure/ground assignment from optical flow. In Proc computer vision and pattern recognition. doi: 10.1109/CVPR.2011.5995364.
- Ullman S., Harari D., Dorfman N. From simple innate biases to complex visual concepts. Proceedings of the National Academy of Sciences of the United States of America. 2012;109(44):18215–18220. doi: 10.1073/pnas.1207690109.
- Verri, A., Uras, S., & Micheli, E. D. (1989). Motion segmentation from optical flow. In Proceedings of the Alvey Vision conference 1989 (pp. 36.1–36.6). doi: 10.5244/C.3.36.
- von Hofsten C., Spelke E.S. Object perception and object-directed reaching in infancy. Journal of Experimental Psychology. General. 1985;114(2):198–212. doi: 10.1037//0096-3445.114.2.198.
- Wang, H., & Liang, W. (2017). Transferring objects : Joint inference of container and human pose. In ICCV. doi: 10.1109/ICCV.2017.319.
- Wang S., Baillargeon R., Paterson S. Detecting continuity violations in infancy: A new account and new evidence from covering and tube events. Cognition. 2005;95(2):129–173. doi: 10.1016/j.cognition.2002.11.001.
- Wang S.H., Baillargeon R., Brueckner L. Young infants’ reasoning about hidden objects: Evidence from violation-of-expectation tasks with test trials only. Cognition. 2004;93(3):167–198. doi: 10.1016/j.cognition.2003.09.012.
- Wilcox T., Nadel L., Rosser R. Location memory in healthy preterm and full-term infants. Infant Behavior and Development. 1996;19(3):309–323.
- Yonas A., Craton L.G., Thompson W.B. Relative motion: Kinetic information for the order of depth at an edge. Perception & Psychophysics. 1987;41(1):53–59. doi: 10.3758/bf03208213.
- Yonas A., Elieff C.A., Arterberry M.E. Emergence of sensitivity to pictorial depth cues: Charting development in individual infants. Infant Behavior and Development. 2002;25(4):495–514.
- Yu, L.-F., Duncan, N., & Yeung, S.-K. (2015). Fill and transfer: A simple physics-based approach for containability reasoning. In Proceedings of the IEEE international conference on computer vision (pp. 711–719). doi: 10.1109/ICCV.2015.88.