Abstract
The visual recognition of actions is an important function of the visual system that is critical for motor learning and social communication. Action-selective neurons have been found in different cortical regions, including the superior temporal sulcus, parietal and premotor cortex. Among those are mirror neurons, which link visual and motor representations of body movements. While numerous theoretical models for the mirror neuron system have been proposed, the computational basis of the visual processing of goal-directed actions remains largely unclear. Whereas most existing models focus on the possible role of motor representations in action recognition, we propose a model showing that many critical properties of action-selective visual neurons can be accounted for by well-established visual mechanisms. Our model accomplishes the recognition of hand actions from real video stimuli, exploiting exclusively mechanisms that can be implemented in a biologically plausible way by cortical neurons. We show that the model provides a unifying, quantitatively consistent account of a variety of electrophysiological results from action-selective visual neurons. In addition, it makes a number of predictions, some of which have been confirmed in recent electrophysiological experiments.
Introduction
Motor actions are often directed toward goal objects, such as the grasping of a piece of food. The recognition of such transitive, goal-directed actions is an important function of the visual system, with high relevance for motor learning and for the interpretation of the actions of others. The neural basis of this visual capability is only partially understood. Neurons with visual selectivity for goal-directed hand actions have been found in multiple regions of monkey cortex, including the superior temporal sulcus (STS) (Perrett et al., 1989; Jellema and Perrett, 2006; Barraclough et al., 2009), parietal cortex (Fogassi et al., 2005; Rozzi et al., 2008; Bonini et al., 2010), and premotor cortex (for a review, see Rizzolatti and Sinigaglia, 2010). A subgroup of these neurons that has received enormous interest in cognitive neuroscience is the “mirror neurons,” which combine visual selectivity for observed actions with selective motor tuning during action execution. (See Materials and Methods, Transitive action-selective neurons and view-independence.)
Most existing computational models for goal-directed action recognition have focused on the possible role of motor representations and of the “mirror neuron system” for action understanding (Wolpert et al., 2003; Oztop et al., 2006) (see Materials and Methods, Relationship to other models, for a more detailed review). Most of these models assume, implicitly, that action recognition occurs by a matching of observed and internally simulated motor behavior within a body-centered frame of reference, e.g., using joint angle representations. This computational approach has two implications: first, it predicts view-independence of the relevant neural representations; second, it requires a relatively accurate reconstruction of the three-dimensional effector geometry, even from monocular action stimuli.
The first point seems difficult to reconcile with the observation that many action-selective neurons in monkey cortex show view dependence, e.g., in the STS (Perrett et al., 1985; Oram and Perrett, 1996; Jellema and Perrett, 2003; Barraclough et al., 2009), and recently also in area F5 in premotor cortex (Caggiano et al., 2011). A transformation into a body-centered frame of reference might thus not occur until very late in the cortical processing hierarchy. View-dependent mechanisms have meanwhile become a standard explanation for the recognition of three-dimensional shapes in the ventral stream. (See Materials and Methods, Relationship to other models, for a more detailed discussion.) With respect to the second point, it is known from computer vision that the estimation of three-dimensional joint angles from monocular image sequences is a very challenging computational problem (Weinland et al., 2011), and one might ask whether the brain really solves this problem if action recognition can be accomplished by computationally less costly strategies, e.g., by bypassing the three-dimensional reconstruction of the effector configuration.
We present in the following a physiologically plausible model that reproduces visual properties of action-selective neurons in higher cortical areas of monkey cortex. The model accomplishes action recognition without an explicit reconstruction of the three-dimensional effector geometry, relying on well-established simple neural principles. The model is computationally powerful enough to recognize actions from real video sequences, accomplishing position- and view-invariance. It provides a unifying account for a variety of electrophysiological and imaging results from monkey cortex.
Materials and Methods
Overview of the model architecture
An overview of the model architecture is shown in Figure 1. The model consists of three main components: (1) A neural shape processing hierarchy that recognizes the moving effector (e.g., the hand) and goal objects, (2) a module that integrates the information about the relationship between effector and object, and (3) a module containing neurons that are selective for transitive actions and that establishes view-invariance of recognition.
The first model component closely follows well-known neural models for shape recognition in the ventral stream (Oram and Perrett, 1994; Riesenhuber and Poggio, 1999b; Rolls and Milward, 2000; Cadieu et al., 2007). Its core consists of view-specific shape detectors. Invariance and feature complexity increase along the hierarchy, where position and scale invariance are achieved by maximum-pooling. In contrast to the mentioned object recognition models, the shape-selective neurons in our model show only incomplete position invariance. These neural units have spatially localized receptive fields with a diameter of approximately 4° of visual angle, corresponding to electrophysiological results from area IT (Op De Beeck and Vogels, 2000; DiCarlo and Maunsell, 2003; Aggelopoulos and Rolls, 2005). This makes it possible to decode the two-dimensional retinal positions of recognized goal objects and effectors from the population activity of such shape detectors.
A second modification compared with standard object recognition models is that the neural detectors for effector shapes, such as hand postures, are selective for the temporal order with which such shapes occur in the stimulus. Such temporal sequence selectivity is compatible with neural data, e.g., from the superior temporal sulcus (Jellema and Perrett, 2003; Vangeneugden et al., 2009; Singer and Sheinberg, 2010), and it can be accounted for by recurrent connections between shape-selective neurons (Giese and Poggio, 2003).
The second model component substantially extends previous architectures and implements a physiologically plausible mechanism for the integration of the information about effector and goal object. This computational function is potentially associated with neurons in parietal cortex, and possibly also in the STS. The central component is a neural representation of the relative positions of effector and goal object, and of the matching between object type and grip [relative position map (RPM)].
The third model component contains neural detectors that are selective for goal-directed action stimuli. This component integrates the information from the previous modules. In addition, this component is critical for accomplishing view invariance of recognition, by pooling of the responses of a number of view-specific modules. The neural detectors in this model component reproduce properties of action-selective neurons in the STS and premotor cortex (e.g., area F5).
Relationship to other models
Many other biologically relevant computational models for goal-directed action recognition have focused on the role of motor representations (Haruno et al., 2001; Wolpert et al., 2003) and specifically of the mirror neuron system (Oztop and Arbib, 2002; Demiris and Simmons, 2006; Oztop et al., 2006; Kilner et al., 2007). These models typically assume a matching of visual input to internal representations of motor programs that are represented in terms of variables relevant for motor control, such as joint angles. Only very few implementations have demonstrated how such variables could be extracted from real image sequences (Oztop and Arbib, 2002; Metta et al., 2006; Tessitore et al., 2010). In this sense, the model presented here is complementary to approaches that mainly treat the relationship between visual and motor representations (Erlhagen et al., 2006; Kiebel et al., 2008; Bonaiuto and Arbib, 2010; Chersi et al., 2011).
The model presented in this paper represents actions in terms of learned sequences of learned example views of action stimuli. View-independence is accomplished by pooling over the output signals of neural classifiers that are specific for individual views. Such approaches are very common in computer vision (Weinland et al., 2011), demonstrating their computational feasibility. In addition, the representation of three-dimensional structures in terms of learned example views has become widely accepted as a fundamental mechanism for the cortical representation of object shape in the ventral stream (Poggio and Edelman, 1990; Oram and Perrett, 1994; Logothetis et al., 1995; Tarr and Bülthoff, 1998; Riesenhuber and Poggio, 1999a). This hypothesis seems consistent with electrophysiological data showing view-dependent and view-independent shape-selective neurons, and an experience-dependent modulation of the tuning properties of neurons in area IT (Kobatake et al., 1998; Sigala and Logothetis, 2002; Freedman et al., 2006; Suzuki and Tanaka, 2011). Also, biologically inspired computational and neural models based on learned example views have successfully reproduced a variety of properties of the recognition of non-transitive actions, sometimes even reaching benchmark performance compared with computer vision algorithms (Giese and Poggio, 2003; Lange and Lappe, 2006; Jhuang et al., 2007; Prevete et al., 2008; Escobar et al., 2009; Jhuang et al., 2010). However, many action-selective neurons in monkey cortex show a critical dependence of their response properties on the presence of goal objects and on their spatial relationship to the moving effector (like the grasping hand) (Perrett et al., 1989; Gallese et al., 1996; Umiltà et al., 2001; Barraclough et al., 2009). These previous models do not account for this property of action-selective neurons, which is likely essential for the decoding of the meaning of observed transitive actions. Our model proposes simple neural circuits that account for these neurophysiological observations, at the same time proposing a neural implementation of a computational step that might be essential for the realization of higher forms of action categorization. The following sections give a more detailed description of the individual components of the model. In parallel, we discuss different experimental results that support the core assumptions of the proposed architecture.
Shape recognition pathway
The recognition of effector and object shapes is accomplished by a hierarchical neural pathway whose structure is compatible with well-known models for visual object recognition (Perrett and Oram, 1993; Riesenhuber and Poggio, 1999b; Mel and Fiser, 2000; Rolls and Milward, 2000). It has been shown in previous work that such hierarchies can support action recognition by the recognition of shape sequences. For example, a body movement can be represented as a temporal sequence of body shapes (Giese and Poggio, 2003; Lange and Lappe, 2006; Prevete et al., 2008). Recent work in computer vision shows that neurally inspired hierarchical architectures that recognize sequences of body shapes, or optic flow patterns, can be computationally quite powerful, reaching state-of-the-art performance in computer vision (Jhuang et al., 2007; Serre et al., 2007b; Schindler and van Gool, 2008; Escobar et al., 2009).
The shape recognition pathway consists of a hierarchy of layers, where the complexity of the extracted features increases along the pathway. The tuning properties of these detectors are predefined at the lowest hierarchy level (Gabor filters) and learned at higher hierarchy levels. Following previous shape recognition models (Fukushima, 1980; Riesenhuber and Poggio, 1999b), the pathway is organized in terms of layers that correspond functionally to “simple” and “complex cells.” Assuming that the stimuli for the simulated experiments were typically foveated, we did not model the modulation of receptive field properties with eccentricity within the visual field. The simple cells increase feature complexity, while the complex cells pool responses of simple cells of the same type over neighboring spatial positions and scales, resulting in an increase of position and scale invariance along the hierarchy (cf. Rust and DiCarlo, 2010). The spatial resolution was down-sampled by a factor of two at each complex cell level. The output nonlinearity of the neural detectors was given by a linear threshold function. This nonlinearity provides a coarse approximation of the output nonlinearity of real cortical neurons (Movshon et al., 1978; Carandini et al., 1997) and results in a suppression of the responses of suboptimally stimulated neural detectors. The parameters of the model neurons were, as far as possible, constrained by physiological data, partially adopting parameters from related models in the literature (Serre and Riesenhuber, 2004; Serre et al., 2007b). If no experimental evidence was available, parameter values were optimized for shape recognition performance in a separate cross-validation experiment (see Video stimuli and simulation procedures). Figure 1A shows a coarse overview of the shape recognition pathway, where the approximate receptive field sizes of the neural detectors are indicated by the insets below. A more detailed description of the different hierarchy levels is given in the following.
Shape recognition hierarchy.
The first hierarchy level, which models simple cells in primary visual cortex, consists of local orientation detectors that are modeled by quadrature-phase pairs of Gabor filters with eight different preferred orientations and seven different spatial scales (Jones and Palmer, 1987). Receptive field sizes ranged from 0.35° to 0.99°, approximately matching the values observed in electrophysiological experiments (cf. Serre and Riesenhuber, 2004). The output signals of the Gabor filters were rectified and normalized (Heeger, 1993).
From the output signals of the Gabor filters, “complex cell” responses were computed by pooling the responses of orientation detectors with the same orientation preference and spatial scale using a maximum operation. The spatial receptive fields of these complex cells had diameters between 0.63° and 1.37°, consistent with data from monkey cortex (Schiller et al., 1976; De Valois et al., 1982).
The model neurons at intermediate hierarchy levels extract shape features of intermediate complexity, similar to neurons in area V4 (Gallant et al., 1993; Pasupathy and Connor, 1999). The responses of the simple cells at the intermediate layers were given by Gaussian radial basis functions (RBFs) with divisive lateral inhibition (Heeger, 1993). The responses of detector type κ at hierarchy level l with receptive-field center x were given by the function:
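Up to the precise normalization used in the implementation, this response has the form of a divisively inhibited Gaussian radial basis function,

fκl(x, t) = πκl N(hl−1(x, t) | dκl, Λκl) / [c + Σk πkl N(hl−1(x, t) | dkl, Λkl)],     (1)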
where N (x | μ, Σ) is the functional form of the multidimensional normal distribution with mean μ and covariance matrix Σ. The vector hl−1 (x, t) signifies the outputs from a local neighborhood of complex cells at the previous hierarchy layer that feed into the simple cell of type κ with receptive field center x, and the parameters πκ define the weighting of the input. The small constant c > 0 prevents the denominator from vanishing. Mathematically, the last equation approximates a Gaussian mixture model. The centers dkl of the individual Gaussians were determined by unsupervised learning, using k-means clustering, from the response vectors of complex cells sampled at random positions on the previous layer, computed from a set of training sequences. For each extracted cluster, the covariance matrix Λkl was also estimated from the training data, and the weights πkl were set to values proportional to the size of the cluster.
A fixed number of 200 Gaussian RBFs were learned from training data and shared for all subsequent computations. The responses of the learned feature detectors were pooled over local spatial neighborhoods (diameters 1.59° to 2.35°) using a maximum operation, defining the responses of the corresponding complex cells that were characterized by an increased level of position invariance. The same procedure was replicated to generate a further intermediate layer that extracts even more complex form features (spatial pooling ranges: 2.16° to 3.19°). These two intermediate layers turned out to be sufficient to accomplish robust performance for the simulations presented in this paper. More complex visual tasks, e.g., including massive clutter or substantial variations in size, might require the introduction of additional intermediate hierarchy layers (Serre et al., 2007a; Fidler et al., 2008).
The highest level of the shape recognition hierarchy is formed by neural detectors that are selective for complete views of objects and effectors. These detectors were also modeled as radial basis functions with the functional form N (x | dκ, 0.1 · I), where the RBF centers were sampled equidistantly in time from example frame sequences, and where I indicates the unit matrix. The receptive fields have diameters of approximately 3.9°, covering an area that contains whole object shapes (see insets Fig. 1A).
Only a subset of these shape detectors on the highest hierarchy level generalized robustly to novel instances of the same shape class. More robust detectors for the individual shape classes were constructed by learning of linear neural networks that map the response vectors fκ(x, t) of the shape detectors at position x onto a shape class-specific activation aγ(x, t). These linear networks were given by the equation
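In essence, this is a weighted linear readout of the shape detector responses at position x,

aγ(x, t) = ωγT f(x, t) = Σκ ωγκ fκ(x, t),     (2)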
where the weights ωγ were learned by linear regression with sparsification [Lasso method (Tibshirani, 1994)]. For the recognition of static shapes (i.e., the goal object) this linear network was trained using the actual input vectors fκ(x, t) and corresponding idealized binary class activities as training data (i.e., aγ(x, t) = 1 if the stimulus belonged to pattern class γ and aγ(x, t) = 0 otherwise). For the recognition of dynamic shapes the linear network was trained with the actual input vectors and idealized moving output distributions (moving activity peaks), where the details are given in Selectivity for the temporal order of effector shapes, below.
In the spatial continuum limit the function aγ(x, t) can be interpreted as a two-dimensional activation field with a peak that is located at the retinal position of the recognized shape. In contrast to typical object recognition models, this representation at the highest level of the shape recognition hierarchy in our model is not completely position-invariant and retains coarse information about the retinal coordinates of the recognized shapes. For each shape the model contains multiple replicas of the shape detectors with different preferred positions and highly overlapping receptive fields with a diameter of approximately 3.9°. This representation of shape position is crucial for the subsequent levels of the model that determine the relative locations of effector and goal object (see below).
Selectivity for the temporal order of effector shapes.
The recognition of hand actions depends strongly on the temporal order of the occurrence of hand shapes in the visual stimulus. This is immediately apparent if one observes a movie showing a hand action with random temporal order of the frames, or if such a movie is played in reverse temporal order. Reversing temporal order can sometimes even result in the perception of a completely different action (e.g., grasping vs placing).
There are multiple physiologically plausible mechanisms that can account for such temporal sequence selectivity. We used a mechanism that has been proposed before in the context of neural action recognition models (Giese and Poggio, 2003). The network mechanism consists of a single network layer with asymmetric lateral connections between neurons that encode individual snapshots from the hand motion sequence. The resulting network dynamics can be described by a neural field (Wilson and Cowan, 1972; Amari, 1977; Ben-Yishai et al., 1997; Giese, 1999; Erlhagen et al., 2006) with an asymmetric interaction kernel (Zhang, 1996; Xie and Giese, 2002). It has been shown that this type of network, if activated by a moving localized input distribution, supports a form-stable output activation distribution that propagates along the network with the same speed as the input. Using a procedure described by Zhang (1996), the functional form of this traveling pulse was adjusted, by learning of the shape of the lateral interaction kernel, to fit reported average firing rates of body action-selective neurons in the STS (Oram and Perrett, 1996). The moving activity pulse is a stable solution of the network only within a limited range of speeds of the input distribution. If the input pulse moves in the opposite direction along the field, or with inadequate speed, the stable solution of the network dynamics breaks down, and the output amplitude of the network is very small (Xie and Giese, 2002). Likewise, activation of the inputs of the network in random temporal order results in outputs with very small amplitude (Giese and Poggio, 2003). In addition, previous work shows that the lateral connections of such networks can be learned easily by time-dependent Hebbian plasticity (Brody and Hopfield, 2003; Jastorff and Giese, 2004).
The sequence-selective networks that encode the time course of individual hand actions (e.g., closing for grasping and opening for placing) in this model consist of 20 coupled neurons per action type. We signify by sν(ξ, t) the input current of the neuron ξ encoding action ν, where ξ can be interpreted as the position of the neuron in a one-dimensional neural field. The network dynamics is specified by the differential equations:
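Schematically, the dynamics follows an Amari-type neural field equation of the form

τu ∂uν(ξ, t)/∂t = −uν(ξ, t) + hu + sν(ξ, t) + ∫ wu(ξ − ξ′) g(uν(ξ′, t)) dξ′ − qν(t).     (3)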
In this equation, g(x) is a sigmoid activation function that behaves approximately linearly in the relevant input range for x > 0 and decays exponentially for x < 0 (Zhang, 1996), hu = 0.15 is a constant that specifies the resting activity level of the network, and τu (= 20 ms) is the time constant of the neural field. As a consequence of the asymmetric lateral interaction kernel wu(ξ), an active neuron in the field preactivates neurons that encode temporally subsequent hand shapes, while it inhibits the other neurons.
The term qν(t) > 0 specifies lateral inhibitory feedback from other neural fields that encode different action patterns. Such inhibition turned out to be critical for accomplishing robust behavior, especially for the discrimination between action patterns with different temporal order of the frames. Defining by Oν(t) = maxξ′ uν(ξ′, t) the maximum of the output activity of the field encoding pattern ν, the strength of this nonlinear feedback was given by the equation:
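A form consistent with this description is

qν(t) = q0 · 1(maxν′≠ν Oν′(t) − θ),     (4)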
with 1(x) = 1 for x > 0 and zero otherwise, and with θ = 0.7. This equation specifies an inhibition of fixed strength (q0 = 0.71) in the case that at least one other field ν′ is significantly activated by the same stimulus as field ν.
The input distributions of the individual neural fields were computed from the output vector f(x, t) of the shape recognition hierarchy by learning of linear mappings similar to Equation (2). We first computed position-selective input vectors sν(x, t) = [sν(1, x, t), sν(2, x, t), …], discretely sampling the position coordinate ξ of the neural field. These input vectors were approximated by the linear mapping sν(x, t) = Ων f(x, t), which was trained with pairs of input vectors f derived from (spatially centered) training patterns and corresponding idealized input peaks of the neural field, given by Gaussian functions with maximum amplitude 0.2 and width (variance) σs² = 4 that moved with an appropriate speed over the neural field. While in our model these linear mappings were constructed directly by supervised learning, other work shows that it is possible to exploit Hebbian plasticity mechanisms to learn such mappings by association of time-varying inputs with stable solutions in dynamic neural networks that represent the time course of actions (Zhang, 1996; Markram et al., 1997; Song et al., 2000). However, such unsupervised learning mechanisms were not the focus of the work presented in this paper.
The input distribution of the neural field defined by Equation (3) was given by the position-specific inputs with the maximum amplitude, effectively realizing a competition between the inputs with different retinal position specificities. Discrete sampling of the function sν(ξ, x, t) with respect to the variable ξ defines the position-selective input vectors sν(x, t). The input of the field encoding hand action type ν was then given by sν(t) = sν(x*, t) with x* = argmaxx sν(x, t). The model thus assumes complete position invariance of the encoding of sequences of hand shapes. This assumption has the advantage that it avoids a combinatorial explosion of neurons due to a replication of the competitive set of neural fields for each represented spatial position. However, it seems likely that in the brain this assumed perfect decoupling between a (position-invariant) encoding of hand shape sequences and the encoding of the relative position of hand and object is much less strict and potentially not clearly separated. Further quantitative physiological data will be necessary to clarify this point.
Consistent with previous models for the recognition of non-transitive actions (Giese and Poggio, 2003), the highest level of the sequence-selectivity circuit is given by neural detectors that integrate the output signals of the individual neural fields over time. These motion pattern neurons become activated during the occurrence of particular hand actions, but only if the corresponding image frames appear in the correct temporal order and specify an approximately natural speed of the action. However, their activity is strongly reduced if the corresponding hand shapes occur in wrong temporal order or with unnatural speeds. The motion pattern neurons are defined by the differential equation:
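Schematically, with Oν(t) denoting the output of the field encoding action ν (as defined above), the activity mν(t) of the motion pattern neuron follows the leaky-integrator dynamics

τm dmν(t)/dt = −mν(t) + hm + Oν(t).     (5)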
The constant hm determines the resting activity level, and τm = 40 ms is the time constant of the leaky integrator dynamics.
We assume further the existence of position-selective motion pattern neurons (Baker et al., 2000; Jellema et al., 2004; Vangeneugden et al., 2009) that integrate the outputs of the motion pattern neurons and of the corresponding position-selective hand shape neurons multiplicatively. These neurons form an activation map that shows an activity peak at the retinal position of the hand only if the hand shapes arise in the right temporal sequence. Mathematically, the activation of the corresponding detectors was defined by the equation
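In its simplest multiplicative form, this reads

mν(x, t) = mν(t) · sν(x, t),     (6)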
where sν(x, t) signifies the position-dependent input distribution for hand action type ν, evaluated at receptive field center x (see above). Again, a variety of mechanisms is suitable for the implementation of the same computational function, some of them assuming a less strict separation of position-selective and sequence-selective neural representations.
In general, it seems possible that local motion features, such as those extracted by neurons in the middle temporal area (MT), might also contribute to the recognition of the effector motion. Our present model does not contain a motion pathway that is suitable for the analysis of the complex optic flow patterns that are associated with hand deformations. It seems plausible that such patterns are exploited by the visual system, and it remains an interesting experimental question whether this is the case. In the domain of biological motion recognition (of non-transitive actions) from point-light displays a vivid discussion has emerged about how form and local motion features are integrated, where recent evidence points to a flexible integration of both cues (Giese and Poggio, 2003; Casile and Giese, 2005; Lange and Lappe, 2006; Vangeneugden et al., 2009; Thurman et al., 2010).
Representation of the hand-object interaction.
The recognition of goal-directed actions requires an association of the extracted information about the effector (hand) movement with the shape and position of the goal object. Our model proposes a simple physiologically inspired mechanism that accounts for this association. In contrast to other models that solve the same problem by an analysis of the three-dimensional structure of object and effector (Oztop and Arbib, 2002; Bonaiuto and Arbib, 2010), which is a computationally quite challenging problem especially for monocular stimuli, our model shows that action recognition can also be accomplished by view-specific mechanisms without an explicit reconstruction of three-dimensional structure.
As the central mechanism for the analysis of the hand-object interaction we postulate a relative position map (RPM), a two-dimensional neural activation map that represents the two-dimensional position of the hand relative to the goal object in an image frame of reference by a localized activity peak (Fig. 1B). Exploiting the fact that the shape-selective neurons in our model are tuned to the retinal position of the recognized shapes, this activation map can be computed by a simple feed-forward network from the responses of the detectors at the highest level of the shape recognition hierarchy.
We present in the following a mathematical formulation of this step in the spatial continuum limit, while in the real implementation the network was based on a discretization of the two-dimensional spatial position x using 1500 model neurons whose receptive field centers were arranged within a rectangular grid. Let aγ(x, t) signify the activation distribution of the shape recognition neurons for goal shape γ, and mν(x, t) the activation map that corresponds to the position-selective motion pattern neurons for hand action ν (Eq. 6). Assuming that we have learned relevant combinations of objects and actions, the relative position map representing combinations of hand action ν and goal shape γ was defined by a simple feed-forward network that combines both variables multiplicatively:
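Writing rνγ(d, t) for the activity of the RPM unit tuned to relative position d, a form consistent with this description is

rνγ(d, t) = ∫ mν(x + d, t)^αhand · aγ(x, t)^(1 − αhand) dx.     (7)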
The multiplication corresponds to a generalized weighted geometric mean. The integral implements a summation over the whole two-dimensional visual field. The vector d signifies the two-dimensional position of the hand relative to the object in the RPM (Fig. 2A). The parameter αhand determines to what extent the model neurons defining the RPM are selective for the shape of the hand compared with the shape of the object. Differences in the input selectivity of action-selective neurons for effector and object shapes have been reported in the literature (cf. Perrett et al., 1989).
The proposed mechanism is equivalent to a gain field (Zipser and Andersen, 1988; Salinas and Abbott, 1995; Pouget and Sejnowski, 1997), and realizes a coordinate transformation from retinal to object-centered coordinates. Gain fields are an established model for the neural realization of coordinate transformations in parietal cortex and have also been discussed elsewhere in the context of invariant object representations (Deneve and Pouget, 2003; Crowe et al., 2008). Physiological data suggest the existence of goal-centered neural representations in parietal cortex (Fogassi et al., 2005; Bonini et al., 2011) and the STS (Perrett et al., 1989), comparable to the object-centered representations in the ventral stream (Jellema and Perrett, 2006; Connor et al., 2007). However, some data also point to the existence of effector-centered representations (Buneo et al., 2002; Ochiai et al., 2005; Pesaran et al., 2006). We have successfully tested versions of the model with both types of transformations, showing that both relative position representations result in similar computational performance.
The RPM provides a useful neural representation that makes it possible to verify, with simple neural circuits, whether the spatial relationship between hand and object is compatible with a functional, successful grasping action. In our model we assume neural detectors for two types of features that can be easily computed from the RPM: The first feature is the position of the hand relative to the goal object. We assume the existence of affordance neurons whose receptive fields include all (relative) hand positions that are consistent with successful grips. Their spatial receptive fields were learned from training stimuli, and they were defined mathematically by regions Gνγ that included all hand positions that elicited at least 75% of the maximum activity in the RPM for a given combination of action and object. In addition, we assume in the model that affordance neurons integrate information over time. Their response dynamics is described by the following differential equation:
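Writing Aνγ(t) for the response of the affordance neuron tuned to the combination of hand action ν and goal shape γ, and using the RPM activity rνγ(d, t) introduced above, this temporal integration can be written as

τA dAνγ(t)/dt = −Aνγ(t) + ∫Gνγ rνγ(d, t) dd,     (8)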
where ν signifies the action, γ the goal shape, and where τA = 160 ms is the time constant of the temporal integration. The affordance neurons respond only if the stimulus shows the right combination of hand shape and object shape, combined with their correct spatial arrangement. Neurons with similar selectivity for interactions have been found in the STS and the ventral premotor cortex (Perrett et al., 1989; Gallese et al., 1996).
The second feature that we computed from the RPM is the relative motion of the hand in relationship to the object. This feature turned out to increase substantially the robustness of our recognition results. Local motion at each location of the RPM was computed by simple correlation-based detectors (Adelson and Bergen, 1985; cf. Bayerl and Neumann, 2004), referred to as relative speed neurons in the following. Our model contains 49 detectors (per position) detecting different combinations of horizontal and vertical speed components, covering a speed regime of approximately ±2.4° per second in both directions. In particular, our model includes detectors for zero relative speed (Bayerl and Neumann, 2004), which were important for detecting actions without relative movement between hand and object (e.g., placing of an object with the hand).
Detectors for meaningful relative motion events in the context of actions were constructed from the responses of these local detectors by weighted summation, where we assumed distinct classes of relative motion neurons (detectors for “moving apart,” “approaching,” and “moving together”). Figure 2A shows schematically how the detectors for abstract relative motion events can be constructed from the responses of the relative speed neurons. [Similar circuits have been proposed as models for optic-flow-selective neurons in area MST (Koenderink, 1986; Saito et al., 1986; Zemel and Sejnowski, 1998; Beardsley and Vaina, 2001)].
More precisely, the response of the relative motion neuron for motion type β was obtained by weighting the responses of the relative speed neurons eνγ(φ, v, d, t) with a function wβ and pooling them over the positions d in the RPM and the relative motion direction φ (by summation), and over the relative motion speed v and the position-tuning directions φp of the relative motion neurons (by maximum computation). The resulting sum activity is smoothed over time by a leaky integrator with time constant τM = 160 ms, resulting in the equation:
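Writing Mβ(t) for the response of the relative motion neuron of type β (the dependence on the action-object combination νγ is suppressed for readability), this computation can be summarized as

τM dMβ(t)/dt = −Mβ(t) + maxv,φp Σφ ∫ wβ(φ, v, φp, d) eνγ(φ, v, d, t) dd.     (9)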
For the approaching and moving apart detectors the weight function wβ(φ, v, φp, d) was proportional to the expression
(where ud, uφ, and up are unit vectors in the directions of the vector d, the preferred direction φ of the relative speed neuron, and a direction that defines the position selectivity of the relative motion neuron. The last term suppresses input signals from relative speed neurons with relative speed v = 0, δy signifying the Kronecker function that is one for y = 0 and zero otherwise.) The function wβ specifies direction templates (compare Fig. 2A), where η = 1 specifies a detector for expanding motion and η = −1 a detector for contracting motion. (Tuning width parameters: σd = 0.25 and σφ = π/2).
For the moving together detectors the function wβ(φ, v, φp, d) was proportional to the term
that specifies the same position selectivity, combined with a speed-dependent term that decays gradually for increasing speeds (with σv = 1).
The information of the affordance neurons and the relative motion neurons is finally combined by neural detectors for transitive action that are described in Transitive action-selective neurons and view-independence.
The position- and shape-based information processed by the affordance neurons, and the relative motion information encoded by the relative motion neurons provide two separate channels that represent critical information about goal-directed action stimuli. Which of these features contributes more reliable information depends on the stimulus class.
More detailed simulations show that both pathways are computationally beneficial for the processing of natural action stimuli. To demonstrate this we created two additional versions of the model, one containing only the channel realizing action analysis with affordance neurons, and another containing only the channel realizing relative motion analysis. Figure 2B shows that the model with only the affordance neuron pathway, in contrast to the model with only relative motion analysis, can distinguish successful and non-successful grasping actions, where the hand either correctly touches the object or grasps next to the object (“mimicked action”). Clearly, the distinction of these two action types is critical for the correct recognition of normal grasping. By contrast, the model version with only relative motion processing and no affordance neurons can distinguish different phases during pushing actions, such as the approach of the object by the hand, or the movement of the object after the pushing (Fig. 2C). The same distinction is not possible with the model that contains only the processing channel with the affordance neurons. Likewise, we have shown elsewhere that such a model can be used to derive judgments of “perceived causality” from abstract motion displays (Fleischer et al., 2012). This demonstrates that both pathways fulfill important computational functions.
In some sense this relevance of form and motion features parallels the integration of form versus local motion features for the recognition of non-transitive body motion, which has been extensively discussed in the field of biological motion processing (Giese and Poggio, 2003; Casile and Giese, 2005; Lange and Lappe, 2006). However, in contrast to that discussion, the relevant motion here is the relative motion between effector and object, not the local motion in the image.
Transitive action-selective neurons and view-independence.
Neurons with selectivity for transitive actions, whose responses are modulated by the exact relationship between the effector movement and goal objects, have been found in multiple regions of the monkey cortex, including the STS (Perrett et al., 1989; Jellema and Perrett, 2006; Barraclough et al., 2009), parietal areas (Fogassi et al., 2005; Rozzi et al., 2008; Bonini et al., 2010), and the premotor cortex (Rizzolatti and Sinigaglia, 2010). One subgroup of these neurons that has recently received particular interest in cognitive neuroscience is the mirror neurons, which also show selective motor tuning during action execution (Di Pellegrino et al., 1992; Gallese et al., 1996; Umiltà et al., 2001; Bonini et al., 2010; Caggiano et al., 2009; Kraskov et al., 2009). Functional imaging studies have suggested that action-selective regions also exist in human cortex (Iacoboni et al., 1999, 2005; Buccino et al., 2004; Chong et al., 2008; Kilner et al., 2009). While fMRI adaptation studies have revealed partially inconclusive results about the presence of mirror neurons in human cortex (Chong et al., 2008; Dinstein et al., 2008; Lingnau et al., 2009), single-cell recordings in humans demonstrate the existence of action-selective and mirror neurons in various areas of the human cortex, including the supplementary motor area (SMA) (Mukamel et al., 2010). Detailed fMRI studies on action recognition that compare human and monkey cortex suggest a partial homology between the relevant areas in both species (Buccino et al., 2004; Nelissen et al., 2005, 2006; Jastorff et al., 2011).
In our model such neurons are modeled by detectors for transitive actions that integrate the information from the previous processing levels. We assume that this integration is first accomplished in a view-specific manner, and that view invariance is finally accomplished by pooling at the highest level of the model. The second-highest layer of our model hierarchy is formed by view-dependent transitive action neurons that integrate the responses from the affordance neurons and the relative motion neurons in a multiplicative way. We assume a multiplicative integration according to the equation (where we assume that the relative motion type β is chosen in accordance with the recognized action-object combination, so that this index can be dropped in the output variables):
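With Aνγ(t) denoting the affordance neuron response, Mβ(t) the response of the associated relative motion neuron (β chosen as described above), and Tνγ(t) the resulting view-dependent transitive action neuron response, this reads in its simplest form

Tνγ(t) = Aνγ(t) · Mβ(t).     (12)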
The whole architecture described up to this level is based on learned example views of shapes. Correspondingly, the activity of the transitive action neurons is selective for the view from which a particular action has been observed during the training of the system. Many classical theories have assumed that visual parameters, such as the view, are not relevant at cortical processing levels that represent the relationship between effectors and objects for action. Recent electrophysiological data, however, show that the view angle of observed actions has a strong influence on the responses of the majority of the tested mirror neurons in premotor cortex (area F5), while only a minority is view-invariant (Caggiano et al., 2011). This strongly suggests that view parameters are cortically represented even in premotor cortex, and by neurons that have well-defined motor properties. Consistent with this physiological result, our model assumes an organization in terms of view-based modules whose outputs are integrated only at the highest level of the processing hierarchy (compare Fig. 1C). The responses of the view-independent transitive action detectors are obtained by pooling (again by maximum computation) the outputs of the view-dependent action detectors whose output signals are given by Equation 12. The view-independent transitive action detectors respond to transitive actions independently of the point of view. Properties consistent with the transitive action detectors in the model have been observed in neurons in the STS and area F5 of macaque cortex (Perrett et al., 1989; Jellema and Perrett, 2006; Caggiano et al., 2011).
Video stimuli and simulation procedures
Datasets.
For the evaluation of the model we recorded sets of video stimuli showing a hand grasping objects. Videos were recorded using a CANON XL1-S camera with a frame rate of 25 Hz. A subset of these stimuli was also used in physiological experiments with monkeys, partially testing hypotheses derived from the proposed model (Caggiano et al., 2011). All video frames were converted to gray-scale and preprocessed by removing low-intensity background noise using intensity thresholds. Typical example frames are shown in Figure 1.
The first data set (dataset A) consisted of 270 videos with a resolution of 360 × 176 pixels, depicting side views of grasps (view direction 90° relative to the facing direction of the actor, all actions being executed by the same actor). Videos showed a hand grasping balls with different diameters (4, 8, and 12 cm) with either a power or a precision grip. The stimulus set was derived from 50 original movies by video manipulation, where the original videos included power grips of large and middle-sized balls and precision grips of all tested ball sizes. For the original movies the hand started at a resting position 30 cm in front of the object on the table and moved naturally, grasping the object. The manipulated videos were generated by color segmentation of the hand, the object, and the background. The manipulated set included movies showing only the hand, or only the object. Another set of movies showed spatially shifted versions of the action scenes (9 different positions, displaced by at most ±4°, again for precision and power grip). Testing was based on tenfold cross-validation using a leave-one-out strategy: in each fold, the data from nine repetitions was used for fitting the model parameters, and the remaining repetition was used for validation. Results were averaged over all 10 possible partitions of the data into training and test sets. (Repetition refers to an independent execution of the same action by the same actor).
The second set (dataset B) contained 150 videos (resolution 405 × 364 pixels), showing different views of power grips, performed either from the top or from the side of a cylindrical goal object (height 10 cm, diameter 4 cm). This action was recorded from 19 different view angles, differing by ∼10°, with all grips executed by the same actor. This angle set specifically included the first-person perspective (0°) and the opposite view (“third-person perspective,” corresponding to 180°). Each grip was repeated three times. An additional data set contained examples of the same action shown from three view angles (0, 90, and 180°) performed by two additional actors, again with three repetitions. Evaluation was based on leave-one-out cross-validation over the repeated trials.
A third dataset (dataset C), created by video editing, was derived from a subset of the videos of dataset A. This dataset contained videos showing grasping and placing actions, similar to the stimuli used in the studies by Barraclough et al. (2009) and Nelissen et al. (2005). In these movies, the hand entered the scene, grasped a small ball with a precision grip, and moved out (grasping). A second set of sequences was generated by reversing the order of the frames of the original videos, so that the hand entered the scene with the ball and left it after releasing the ball (placing). Additional control stimuli showed only the hand (pantomimed action) and only the object. Additional views for the test of view dependence were generated by mirror-reflecting the grasping and placing stimuli along the vertical axis, resulting in movies showing the opposite hand interacting with the object from the opposite side (cf. Barraclough et al., 2009). This dataset was based on nine repetitions of each condition, and cross-validation was based on training of the relevant model parameters with the data from eight repetitions, testing on the remaining one, and averaging over all partitions into training and test sets.
Learning of the model parameters and simulation procedures.
The parameters of the model were learned from a set of 137 training stimuli, and generalization to novel stimuli was tested on at least nine independent cross-validation runs. The training set consisted of 85 sequences from dataset A and 52 sequences from dataset B, including different view angles that differed by 30°. The remaining sequences, and in particular all sequences from dataset C, were used for testing. From each training sequence we extracted images containing only the hand or only the object using color segmentation, as well as frames with typical hand-object interactions, presented in the center of the images. These data were specifically used to estimate the parameters of the model in Equations 1 and 3, and for the learning of the linear mappings according to Equation 2.
The results in the following section are all based on cross-validation data sets that were disjoint from the training stimuli. Model parameters, estimated by the previously described procedure, were identical for all simulations presented in this paper. The model thus provides a unifying quantitative account for the experimental results shown in Results.
To account for the fact that some electrophysiological and fMRI studies present results that average cell classes with different computational properties (Nelissen et al., 2005; Barraclough et al., 2009; Caggiano et al., 2011), we specified two additional parameters in the simulation of those results that account for the fractions of the different cell populations in these studies. The first parameter αtrans determines the contributions of neurons that are selective for transitive (goal-directed) and non-transitive actions to the population activity, which were different in the simulated studies investigating neurons in areas F5 and STS. In the relevant simulations the population activity including both types of action-selective neurons, for hand action ν and goal object γ, was modeled by the linear combination:
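Denoting by Tνγ(t) the response of the transitive action detector at the highest level of the model, by mν(t) the response of the corresponding non-transitive motion pattern neuron, and by Pνγ(t) the modeled population activity, a form consistent with this description is

Pνγ(t) = αtrans Tνγ(t) + (1 − αtrans) mν(t).     (13)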
Likewise, some reported experimental data mixed contributions of neurons with different degrees of selectivity for object shape and hand shape. For the simulation of these data we fitted the parameter αhand from Equation 7 using a least-squares procedure. For the other simulations, the values of the parameters were αtrans = 1 and αhand = 0.5.
The parameters of the model presented in this paper were largely determined by supervised learning from labeled example patterns. A biologically more plausible theory would require the acquisition of the relevant patterns by unsupervised or partially supervised learning. Unsupervised learning of hierarchies in object recognition, and recently also in action recognition, is an active topic of research in computer vision and machine learning. A variety of approaches has successfully realized learning of recognition hierarchies by applying sparseness constraints to multi-layer architectures, such as convolutional hierarchies (Kavukcuoglu et al., 2010) or compositional representations (Fidler et al., 2009). Unsupervised learning of spatiotemporal feature hierarchies has also been realized by independent subspace analysis (Le et al., 2011) and slow feature analysis (Nater et al., 2011). Another dominant approach has been the learning of deep architectures (Hinton, 2007; Bengio, 2009), e.g., using deep belief nets, with successful application to the recognition and modeling of human gait trajectories (Taylor et al., 2010). Another approach for the learning of hierarchical dynamical models that has been applied to the modeling and recognition of actions and other complex spatiotemporal patterns uses recurrent neural networks as generative models in combination with a variational Bayesian framework (dynamic expectation maximization) (Yildiz and Kiebel, 2011; Bitzer and Kiebel, 2012). For many of these approaches it is largely unclear how they relate to plasticity mechanisms in real neural systems. However, it has been discussed that unsupervised learning algorithms, such as PCA, independent component analysis (ICA), or sparse learning, might be realized in physiologically plausible ways by combining Hebbian and anti-Hebbian learning rules with intrinsic adaptation mechanisms within individual neurons (Földiák, 1990; Falconbridge et al., 2006; Gerhard et al., 2009).
Special simulation procedures for individual experiments.
For the simulation of the experimental data by Nelissen et al. (2005), we evaluated the response of the model to grasping stimuli and the corresponding control stimuli from dataset C. Following the experimental study, we also used static control stimuli, each presenting one single frame extracted from the middle of each test sequence (no hand-object contact). We approximated the changes of the BOLD signal in the relevant areas by summation of the activity of the model neurons on the corresponding hierarchy level over the whole stimulus duration. For comparison with the experimental data from the action-selective area F5a we used an average activity value that weighted contributions from neurons that are selective for transitive and for intransitive actions (compare Eq. 13).
To simulate the results of the study by Barraclough et al. (2009), we tested the model on the complete dataset C. The processing of the data closely follows the description in Barraclough et al. (2009). The sequence length was down-sampled to 800 ms (20 frames) to match the stimulus conditions in the study. We evaluated only model neurons showing a strong response to the action stimuli used in the experimental study, i.e., action detectors encoding precision grips of small objects. Responses of the model neurons were aligned with the time course of the stimulus following the alignment procedures described in Barraclough et al. (2009). For comparison with the experimental data the response of each model neuron was renormalized, setting the maximum response over all test sequences to one and the baseline activity to zero.
For comparison of the model's performance with the electrophysiological data reported by Perrett et al. (1989), we used the video stimuli from dataset A that showed power grips of a medium sized ball. We created additional similar control stimuli, following the physiological study, from the original videos using color segmentation (hand pantomiming the action, presentation of only the object, hand mimicking the action at a distance of 4 cm from the object). To model the results of the experiment, we evaluated only the responses of transitive action-selective model neurons that responded to power grip actions without motion of the hand relative to the object (zero relative speed).
Results
Neurons with visual selectivity for goal-directed actions have been described in the superior temporal sulcus (Perrett et al., 1989; Barraclough et al., 2009), the parietal cortex (Fogassi et al., 2005; Rozzi et al., 2008; Bonini et al., 2010, 2011), and in premotor cortex, especially in ventral area F5 (Di Pellegrino et al., 1992; Gallese et al., 1996; Umiltà et al., 2001; Caggiano et al., 2009), as well as in dorsal premotor cortex and even in area M1 (Tkach et al., 2007). It seems likely that computational levels of processing do not map exactly onto individual areas in cortex. Instead, it appears that neurons with quite similar computational properties sometimes exist in multiple cortical areas. For example, it has been described that neurons in the STS parallel many properties of mirror neurons in area F5 (Keysers and Perrett, 2004). We thus define here neuron classes according to their functional properties, being aware that the same class of neurons might simultaneously exist at multiple levels of the cortical processing hierarchy, e.g., in the STS and in area F5. A well-established dissociation between the STS and F5 is that the superior temporal sulcus does not contain motor neurons. Motor properties are not captured by the proposed model, which focuses on the visual processing mechanisms.
The proposed model provides a unifying account for a variety of visual properties of action-selective neurons that have been reported in single-cell recordings in the superior temporal sulcus and area F5 in macaques. We focus in the following on effects that highlight important computational properties.
Tuning for action type and critical relevance of the goal object
Many action-selective neurons in premotor cortex are selective for the type of the observed goal-directed action (e.g., precision vs power grip). This is illustrated in Figure 3, which shows data from mirror neurons in area F5 of the premotor cortex of macaque monkeys from the study by Gallese et al. (1996). These neurons show selective motor responses during the execution of hand actions (such as grasping, placing, or object manipulation) and, at the same time, they respond selectively to visually observed actions of other agents (monkeys or humans). Due to the simultaneous presence of visual and motor selectivity, these neurons have been termed mirror neurons. Approximately half of the mirror neurons in this area that were selective for grasping also showed selectivity for the grip type (precision vs power grip), as illustrated in the left and middle panels of Figure 3A. The neuron responds strongly to a precision grip and almost completely fails to respond to a power grip. The rightmost panel shows the response for a mimicked action without a goal object. While the hand performed the same movement, the neuron remained silent. Such selectivity for the presence of a goal object is typical for many action-selective neurons in premotor cortex (Gallese et al., 1996).
The recognition of action type from real video stimuli is a challenging vision problem, since the grip type depends on subtle variations of the finger positions, corresponding to changes of only a few pixels in the images. In addition, varying object shapes cause clutter and occlusion for the recognition of the finger configuration. Despite these computational challenges, the proposed model accomplishes this recognition task, reproducing the action-type selectivity of cortical neurons. Figure 3B shows the responses of the view-independent transitive action detectors at the highest level of the model, which were trained with different types of grips on goal objects of different sizes (dataset A; see Materials and Methods). The different line styles and colors indicate different types of grips. The thin gray curves indicate the neural activity for individual trials of the preferred action. The thick curves indicate the average activity over trials for the different action types. The action-selective model neurons show a robust selectivity for the different grip types, and the model robustly discriminates between precision- and power-grip stimuli. The time course of the activity is similar to the neural data, showing a weak initial response that increases when the hand approaches the goal object and grip-specific hand shapes become distinguishable. Consistent with the neural data, the action-selective model neurons respond only weakly if the same stimuli are presented without a goal object (Fig. 3C).
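The kind of shape computation assumed for the grip-type-selective detectors can be illustrated with a minimal sketch (our own simplification, not the original implementation; the feature vectors, template values, and bandwidth sigma are hypothetical): a Gaussian radial basis function compares the hand-shape feature vector extracted from the current frame with a stored template learned during training, so that small differences in finger configuration translate into large differences in response.

```python
import numpy as np

def shape_detector_response(features, template, sigma=0.5):
    """Gaussian RBF response of a hand-shape detector.

    features : feature vector extracted from the current frame
               (e.g., responses of mid-level form detectors)
    template : stored feature vector of the learned key shape
    sigma    : tuning width; smaller values give sharper selectivity,
               so subtle finger-configuration differences between
               precision and power grips can be discriminated.
    """
    d2 = np.sum((np.asarray(features) - np.asarray(template)) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

# Toy example (hypothetical feature values): the detector responds strongly
# to a frame close to its precision-grip template and weakly to a power-grip frame.
precision_template = np.array([0.9, 0.1, 0.4])
precision_frame = np.array([0.85, 0.15, 0.42])
power_frame = np.array([0.2, 0.8, 0.6])
print(shape_detector_response(precision_frame, precision_template))  # high (~0.99)
print(shape_detector_response(power_frame, precision_template))      # low  (~0.13)
```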
Similar selectivity for action type and the presence of the goal object has also been observed in monkey fMRI experiments (Nelissen et al., 2005). In general, the relationship between neural activity at the single-cell level (spikes and local field potentials) and the BOLD signal measured in fMRI experiments is quite involved and depends on the specific brain area (for review, see Logothetis, 2002, 2008; Logothetis and Wandell, 2004; Nir et al., 2007; Ekstrom, 2010). However, some studies have successfully linked functional imaging results to the behavior of groups of neurons at the single-cell level in higher visual areas (Op de Beeck et al., 2008) (for review, see Tsao and Livingstone, 2008). Here we make the strongly simplifying assumption that fMRI responses in the relevant areas might be modeled by the sum activity of the corresponding neural levels of the model. Consistent with the single-cell data, visual selectivity for transitive action stimuli was found in area F5 of the premotor cortex. While selective activation during the observation of transitive actions in the caudal part of the premotor cortex (area F5c) was found only for stimuli showing the whole upper body of the acting agent, more anterior regions (areas F5a,p) were also selectively activated by stimuli showing only the hand and the goal object. Since our model focuses on the recognition of hand actions, we modeled the activity in these subregions.
Figure 4A shows the BOLD activity relative to fixation baseline from two separate fMRI experiments. The first experiment contrasted the full action stimulus, a static picture of the action taken from the middle part of the stimulus sequence, and movies showing only the moving hand or only the static object. High activation emerged only for the full stimulus. Substantially reduced activation was observed for moving and static object stimuli, and almost no activation was observed for the static hand-alone stimuli. The second experiment contrasted dynamic and static stimuli, and compared the normal action, with correct contact between hand and object, to mimicked actions in which the hand executed the same movement in the absence of a goal object. Compared with the normal action, dynamic mimicked-action stimuli induced a reduced response. The static stimuli (derived from the normal and mimicked action movies) induced almost no response.
For the simulation of the BOLD responses in this study, we computed the sum activity over all neurons in the two highest levels of the recognition model (the view-dependent and the view-invariant transitive-action detectors). For the simulations we used a visual stimulus set that closely matched the properties of the stimuli in Nelissen et al. (2005) (see Materials and Methods, stimulus set C). Since the relevant premotor areas contain a mixture of neurons with selectivity for transitive and non-transitive actions, we optimized the parameter αtrans, which determines the relative influence of neurons with selectivity for transitive and non-transitive actions on the sum signal (choosing αtrans = 1/3). Likewise, the parameter αhand, which determines the relative contributions of hand- and object-selective shape detectors to the activity of the neurons in the RPM, was chosen as αhand = 4/5, since this resulted in the best fit of the BOLD data. The sum activities derived from the model were normalized and scaled by a constant factor to simplify comparison with the experimental data.
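To make the role of the two weighting parameters explicit, the following sketch (a simplification under our own assumptions; the array names, the linear mixing rule, and the normalization step are hypothetical and stand in for the model's actual combination rules) computes a BOLD proxy as a weighted population sum controlled by alpha_trans, and forms the RPM input as a mixture of hand- and object-driven signals controlled by alpha_hand.

```python
import numpy as np

def bold_proxy(transitive_act, nontransitive_act, alpha_trans=1/3):
    """Model BOLD signal as a weighted population sum over time.

    transitive_act    : (n_trans, T) activities of transitive-action detectors
    nontransitive_act : (n_nontrans, T) activities of motion pattern neurons
                        that also respond to non-transitive actions
    alpha_trans       : relative weight of the transitive population
                        (1/3 for the F5 fMRI fit; 2/3 and 0.9 for the STS fits)
    """
    s = (alpha_trans * transitive_act.sum(axis=0)
         + (1.0 - alpha_trans) * nontransitive_act.sum(axis=0))
    return s / (s.max() + 1e-12)   # normalize for comparison with the data

def rpm_input(hand_signal, object_signal, alpha_hand=4/5):
    """Mixture of hand- and object-selective shape signals driving the RPM
    (alpha_hand = 4/5 for the fMRI fit, 3/4 for the STS fits)."""
    return alpha_hand * hand_signal + (1.0 - alpha_hand) * object_signal
```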
The resulting normalized sum activities, shown in Figure 4, B and D, are qualitatively highly similar to the experimentally observed BOLD activities (A and C). As for the real fMRI data, the activations for static stimuli are strongly reduced. Mimicked actions induce a substantial response, which, however, does not reach the level of the normal actions. Presentation of the object alone induces a weak response that is larger for moving objects. The model thus reproduces, at least qualitatively, a variety of effects that have been observed in this fMRI experiment in monkeys.
Neurons that are selective for the visual observation of transitive actions have been found not only in the premotor cortex, but also at lower cortical levels, such as the STS. The STS projects, through parietal areas, to area F5 in the premotor cortex (Seltzer and Pandya, 1978; Matelli et al., 1986; Keysers and Perrett, 2004). The action-selective neurons in the STS show a number of properties that closely resemble those of the neurons in area F5 (Perrett et al., 1989; Barraclough et al., 2009). We tried to reproduce data from a study by Barraclough et al. (2009) that compared the responses of single cells in the STS to grasping and placing actions with a goal object (transitive) and without one (non-transitive), and to stimuli presenting the goal object alone.
Figure 5A shows the original data from the study by Barraclough et al. (2009), in which neural responses (spike density functions) were temporally aligned by the response latencies for the individual stimuli and are displayed with a default latency of 100 ms. Normal transitive actions induced, on average, substantially higher activity in the recorded hand action-selective neurons than stimuli showing the hand action without a goal object. However, actions without a goal object (intransitive) also induced significant activity. Stimuli showing the goal object alone resulted in rather weak activity, clearly below the level induced by stimuli presenting only the moving hand.
The corresponding simulation results are shown in Figure 5B and provide a good qualitative match to the experimental data. In the experimental study, neurons with selectivity for transitive and non-transitive actions had not been distinguished. For the simulation of these STS data, the sum activity was therefore a weighted sum of the responses of the motion pattern neurons, which also respond to non-transitive actions, and of the transitive action detectors. The parameter αtrans, which controls the contributions of these two detector populations to the sum activity, was set to 2/3, since this matched the approximate ratio of non-transitive and transitive action-selective neurons in the electrophysiological study. In addition, for the simulation of these STS data we chose the value αhand = 3/4 for the parameter that determines the strength of the influences of object and hand shape on the RPM activity.
A similar result was obtained in a classical study on visual neurons with selectivity for hand actions in the lower bank of the STS (area TEa) (Perrett et al., 1989). This study not only tested normal transitive action stimuli and stimuli showing only the hand or only the object. It also included a condition with mimicked actions, in which the hand did not touch the object correctly while moving in a very similar way as in the normal stimulus. As before, the responses to stimuli showing only the hand or only the object were strongly reduced (Fig. 6). The same applies to the mimicked condition, in which the hand failed to touch the object, the distance between hand and object in the image frames being <0.55°. This implies a high spatial selectivity of the underlying neural mechanisms that detect the hand-object contact.
The model nicely reproduces this high selectivity for the relationship between effector and goal object (Fig. 6B). This selectivity is a consequence of the receptive field properties of the affordance neurons, which are selective for the retinal position of the effector relative to the object (Fig. 2). The fact that, in this experiment, the activity of the action-selective neurons for stimuli without a goal object is lower than in the simulation results in Figure 5 is a consequence of the different fractions of transitive action-selective neurons included in the population averages, which we tried to match to the experimental data (see Materials and Methods).
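A minimal sketch of the assumed affordance-neuron tuning (our own simplification; the Gaussian form and the width parameter sigma_deg are illustrative assumptions) shows why a hand that misses the object by a fraction of a degree fails to drive the transitive-action detectors: the neuron responds only when the effector position, read out relative to the object, falls within a narrow receptive field centered on the grip-relevant part of the object.

```python
import numpy as np

def affordance_response(hand_pos, object_pos, preferred_offset, sigma_deg=0.3):
    """Response of an affordance neuron tuned to the position of the effector
    relative to the goal object (positions in degrees of visual angle).

    The relative position hand_pos - object_pos is compared with the neuron's
    preferred offset; the narrow Gaussian receptive field makes the response
    collapse when the hand stops short of the object.
    """
    rel = np.asarray(hand_pos) - np.asarray(object_pos)
    d2 = np.sum((rel - np.asarray(preferred_offset)) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma_deg ** 2)))

# Correct contact vs. a hand that misses the object by ~0.55 deg:
print(affordance_response([5.0, 3.0], [5.0, 3.0], [0.0, 0.0]))   # ~1.0
print(affordance_response([5.55, 3.0], [5.0, 3.0], [0.0, 0.0]))  # strongly reduced (~0.19)
```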
In summary, the model reproduces the high selectivity for different hand action types and the precise tuning for the spatial relationship between effector and object, as observed for action-selective single cells in multiple cortical areas. The high selectivity for the action type is explained by the shape selectivity of the hand detectors in the shape recognition pathway. The high selectivity for the relationship between effector and object is a consequence of the tuning properties of the affordance neurons, whose responses depend on the relative positions of effector and object.
Position invariance
Despite the high selectivity of cortical action-selective neurons discussed in the last section, such neurons show a remarkable degree of invariance with respect to the position of transitive action stimuli in the image. This is illustrated in Figure 7A, which shows the response of a mirror neuron in area F5 to grasping stimuli presented in the left hemifield, the center, or the right hemifield [adapted from the study by Gallese et al. (1996)]. The red disc illustrates the stimulus position in the visual field. The response of the neuron is largely unaffected by the retinal position of the stimulus. Since this physiological study did not include a control of eye movements, it seems likely that a substantial part of the observed invariance is due to foveation of the stimulus by the monkey. However, substantial amounts of invariance with respect to stimulus position, even with eye position controlled, have been shown for shape-selective neurons in the inferotemporal cortex as well as for shape-selective neurons in the dorsal stream (Op De Beeck and Vogels, 2000; Janssen et al., 2008).
Our model is able to reproduce a high degree of position invariance. This is illustrated in Figure 7B, which shows the responses of a view-invariant transitive motion detector (selective for power grip) in the model for nine different retinal positions of the action stimulus, where the distance between neighboring stimulus positions corresponds to 4° of visual angle. (The stimulus size was approximately 8°.) The responses for the different retinal positions are almost identical, demonstrating nearly perfect position invariance. The model can accomplish invariance over much larger spatial ranges; we successfully tested displacements of up to ±30°, matching the range observed in physiological experiments.
In the model, position invariance is accounted for by the combination of two mechanisms: (1) the maximum pooling operations in the form processing pathway, which make the shape detectors at the highest level of the form recognition pathway partially position invariant (Fukushima, 1980; Riesenhuber and Poggio, 1999b); and (2) the computation of the relative position of the effector and the goal object in the RPM, which explicitly implements a coordinate transformation.
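The contribution of the second mechanism can be illustrated with a small sketch (our own simplification; the function names are hypothetical): because the RPM encodes the position of the effector relative to the object, a common translation of the whole stimulus leaves the downstream responses unchanged, while the max-pooling stage provides the residual position tolerance of the shape detectors themselves.

```python
import numpy as np

def relative_position(hand_pos, object_pos):
    """Coordinate transformation assumed for the relative position map:
    the effector position is re-expressed relative to the object, so a
    common shift of hand and object cancels out."""
    return np.asarray(hand_pos) - np.asarray(object_pos)

def max_pool(responses):
    """Maximum pooling over detectors at different retinal positions, as used
    in the form pathway to obtain partial position invariance."""
    return float(np.max(responses))

shift = np.array([4.0, 0.0])                  # translate the whole stimulus by 4 deg
hand, obj = np.array([2.0, 1.0]), np.array([2.3, 1.1])
print(relative_position(hand, obj))                   # [-0.3, -0.1]
print(relative_position(hand + shift, obj + shift))   # identical: [-0.3, -0.1]
```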
View tuning
View-dependent coding is a well-known property of shape-selective neurons in the inferotemporal cortex (Logothetis et al., 1995; Tarr and Bülthoff, 1998), as well as of shape- and action-selective neurons in the STS (Perrett et al., 1982, 1989; Oram and Perrett, 1996). In a recent study, Barraclough et al. (2009) tested the view dependence of the responses of action-selective neurons. The tested neurons were selective for visually observed grasping and placing actions, and they were tested with views of hands interacting with an object either from the left or from the right side relative to the viewpoint of the monkey. The corresponding average responses, as a function of time, are shown in Figure 8A. Neurons showed a strong selectivity for the preferred view (black symbols). The presentation of the non-preferred action from the preferred view resulted in higher responses than the presentation of the preferred action from the non-preferred view. View preference thus modulated the (average) responses of the neurons more than the type of the action.
Figure 8B illustrates the corresponding simulation results obtained with our model. The figure shows the average responses of the view-selective transitive action detectors at the second-highest processing level. The modeling results are qualitatively quite similar to the real data from the STS. As for the other simulation of the STS data, we chose αhand = 3/4 for the parameter that determines the influence of hand versus object shape in the RPM. Since the population of cells underlying this study of STS neurons did not seem to be identical with the one underlying the data shown in Figure 5, we refitted the parameter αtrans = 0.9 (fraction of transitive action-selective neurons). However, fitting both simulations in Figures 5B and 8B with identical parameters leads to qualitatively very similar results.
Since our model, like other biologically inspired models of form and action recognition (Poggio and Edelman, 1990; Oram and Perrett, 1994; Riesenhuber and Poggio, 1999b; Giese and Poggio, 2003), is organized in terms of view-specific modules, it can reproduce the view selectivity of action-selective cortical neurons. However, it is not obvious that it should also reproduce the finding that the stimulus view has a stronger influence on the tuning of these neurons than the action type. In the model, this behavior is explained by the fact that stimulus views are processed separately up to a very high level of the processing hierarchy, whereas different actions observed from the same view share many low- and mid-level features. The presentation of a non-preferred action thus induces some rudimentary activity in neurons that encode different actions observed from the same view.
View dependence of action-selective neurons has not only been observed in primarily visual structures, such as the STS, but also at higher representation levels. A recent study, which was partly motivated by this model, tested the view dependence of mirror neurons in area F5 of the premotor cortex, exploiting well-controlled video stimuli instead of real actions executed in front of the monkey (Caggiano et al., 2011). Since area F5 is functionally very close to motor cortex, we expected to find a large number of neurons that encode visually observed actions in a body-centered frame of reference, independent of the stimulus view. However, presenting the same action from three different views, we found a rather large fraction (74%) of mirror neurons with clear view tuning. Only a smaller fraction (26%) showed view-independent responses. In addition, we failed to find a clear preference for the first-person view, as might be expected if the monkey had learned particularly well the relationship between its own actions and the associated visual feedback signals. The left panel in Figure 9A shows the normalized activity of the measured F5 mirror neurons for the three tested views, different line types referring to the subsets of neurons that showed a significant preference for the individual views. The right panel shows the corresponding simulation result (average responses computed with the same normalization procedure as for the electrophysiological data) for the view-dependent transitive action detectors, which form the second-highest hierarchy level of our model. Clearly, the simulation result nicely matches the experimental data from the view-dependent subset of mirror neurons.
Due to the limited recording time, only a small number of stimulus views could be tested in the physiological experiment. In simulations with the model, however, we could test how many stimulus views are required to accomplish robust view-independent recognition at the highest layer of the model for real video stimuli. This is an important question, since a mechanism that requires the storage of huge numbers of stimulus views would be computationally inefficient or even infeasible.
Quantitative simulations showed that with as few as seven view-specific modules we could accomplish robust view-independent recognition of goal-directed hand actions from real videos, at the same time achieving high selectivity for the distinction of different action types. This is illustrated in Figure 9B, which shows the responses of the view-dependent transitive-action detectors in gray and the resulting response of the corresponding view-invariant detector in black. The model was trained with seven views of one action (grasping a cylinder from the top) and was tested with 19 different views, spaced 10° apart (dataset B, see Materials and Methods). The tuning width of the view-dependent detectors was approximately 50°, roughly consistent with data on view-dependent neurons in the STS and area IT (Perrett et al., 1991; Logothetis et al., 1995). (Precise data about the view dependence of transitive action-selective neurons are presently still unavailable.) For the trained action, the responses of the view-dependent detectors degrade gradually with the distance between the training and the test view. For the distractor action, the responses of the view-dependent detectors remain weak for all tested views. The response of the view-independent detector remains high for all views. Even though its response still varies slightly with the stimulus view, it robustly discriminates, for all views, between the trained action (solid line) and the distractor action (dashed line).
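The underlying scheme can be sketched as follows (under simplifying assumptions of our own; the Gaussian view-tuning curve, the 30° spacing of the training views over the tested range, and the 50° bandwidth are idealizations of the trained detectors): view-specific detectors centered on the seven training views are combined by maximum pooling into a view-independent detector whose response stays high across all intermediate test views while remaining low for a distractor action.

```python
import numpy as np

training_views = np.arange(0.0, 181.0, 30.0)   # seven training views, 30 deg apart
tuning_width = 50.0                             # approximate tuning width (deg, FWHM)
sigma = tuning_width / 2.355                    # convert FWHM to Gaussian sigma

def view_dependent_responses(test_view, action_match=1.0):
    """Idealized responses of the seven view-specific detectors.
    action_match scales the response (1.0 preferred action, ~0.2 distractor)."""
    return action_match * np.exp(-(test_view - training_views) ** 2 / (2.0 * sigma ** 2))

def view_independent_response(test_view, action_match=1.0):
    """Maximum pooling over the view-specific modules."""
    return float(np.max(view_dependent_responses(test_view, action_match)))

# Responses remain high for test views lying between the trained views:
for v in [0, 15, 45, 95, 170]:
    print(v, round(view_independent_response(v), 2))
```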
Prediction: temporal sequence selectivity of transitive action-selective neurons
A central assumption of the proposed mechanism for the recognition of hand actions is the temporal sequence selectivity of the motion pattern neurons, which form the basis for the association of information about hand postures over time. Reversing the temporal order of the stimulus frames substantially reduces the responses of the motion pattern neurons and, as a consequence, also the responses of action-selective neurons at higher processing levels. The temporal sequence selectivity of action-selective neurons at lower levels is consistent with recent electrophysiological data from neurons in the STS (Vangeneugden et al., 2009, 2011; Singer and Sheinberg, 2010). The model predicts that sequence selectivity should also be observed at the highest level of the neural processing hierarchy, for neurons that are selective for transitive actions. This prediction can easily be tested in an electrophysiological experiment by showing the same action movie in normal and reverse temporal order.
Figure 10A illustrates the relevant stimulus set: two movies showing the frames of a grasping action in normal and reversed order. Played in reverse, grasping looks like the placing of an object (the hand entering and leaving the scene without the object). Following the conventions of Barraclough et al. (2009), we refer to reverse grasping as "placing" in the following. The activations of the transitive action detectors in our model (pooled over view-dependent and view-independent model neurons), after training with grasping and placing, respectively, are shown for both types of stimuli in Figure 10B. Clearly, the transitive action-selective neurons in the model show a very strong degree of temporal sequence selectivity. This selectivity is not only a consequence of the neural field dynamics discussed before. It is further augmented by the fact that stimuli played in reverse temporal order also reverse the relative motion vectors between effector and goal object, which are detected by the relative motion neurons. Both influences are combined multiplicatively by the transitive action detectors of the model.
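A minimal sketch of the two factors (our own simplification; the discrete snapshot chain stands in for the continuous neural field dynamics, and all parameter values are illustrative) shows how reversing the frame order suppresses the response: asymmetric forward coupling between successive snapshot neurons lets activity build up only for the trained temporal order, and this drive is multiplied with a relative-motion signal that prefers hand motion toward the object.

```python
import numpy as np

def snapshot_chain_response(frame_order, n_frames=10, w_fwd=0.8, decay=0.6):
    """Sequence-selective drive from a chain of snapshot neurons with
    asymmetric forward connections (a discrete stand-in for the neural field
    dynamics): activity accumulates only if the frames arrive in the trained order."""
    u = np.zeros(n_frames)
    total = 0.0
    for k in frame_order:          # index of the snapshot activated at each time step
        u = decay * u
        lateral = w_fwd * u[k - 1] if k > 0 else 0.0   # excitation from the predecessor
        u[k] = 1.0 + lateral
        total += u[k]
    return total / n_frames

def transitive_detector(frame_order, relative_motion_toward_object):
    """Multiplicative combination of sequence-selective form drive and
    relative-motion selectivity (1.0: hand approaches object, 0.2: hand recedes)."""
    return snapshot_chain_response(frame_order) * relative_motion_toward_object

forward = list(range(10))
reverse = forward[::-1]
print(transitive_detector(forward, 1.0))   # trained order, hand approaches: high (~1.7)
print(transitive_detector(reverse, 0.2))   # reversed movie: strongly reduced (~0.2)
```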
Motivated by this model prediction, these stimuli were subsequently tested in an electrophysiological experiment recording the activity of mirror neurons in area F5. Consistent with the prediction, a significant fraction of the measured neurons (63%) showed strong sequence selectivity, and the quantitative results look strikingly similar to the simulations shown in Figure 10B (V. Caggiano, J. Pomper, F. Fleischer, M. A. Giese, and P. Thier, unpublished observations).
Performance limitations of the model
While the proposed model successfully accomplishes recognition on real videos of hand-object interactions, we want to stress that the main purpose of this work was the reproduction of neural data and not the maximization of computational performance in the sense of computer vision. We acknowledge that many algorithms that are not biologically inspired have been developed in this field (for review, see Pavlovic et al., 1997; Mitra and Acharya, 2007; Weinland et al., 2011), and these would certainly outperform our model on challenging datasets that, for example, include substantial amounts of background clutter. The effective processing of scenes with complex clutter likely requires improved dictionaries of detectors for the intermediate-level features. The learning of such detector hierarchies has been a core problem in shape and action recognition in computer vision over the last decade (Moeslund et al., 2006; Serre et al., 2007b), and the principal architecture of our model would not change by the inclusion of such improved hierarchies of shape detectors.
Another major addition that seems necessary for the processing of complex realistic scenes with many objects and potentially multiple acting effectors (e.g., from multiple agents) is attentional control and the tracking of attended objects and effectors. Neural mechanisms supporting such computational functions have been studied extensively in the context of neural models of attention (Deco and Rolls, 2004; Hamker, 2006; Tsotsos, 2011). Such mechanisms could be integrated in our model by adding a network dynamics to all layers of the hierarchy and by introducing appropriate backward connections. In fact, first attempts to integrate such mechanisms into action processing models related to ours have been made (Layher et al., 2012).
In conclusion, the present model clearly has strong computational limits, some of which might be mitigated by including other physiologically plausible mechanisms. However, the performance limits of such neural architectures for the processing of complex real action scenes will have to be explored after adding such extensions to the present architecture.
Discussion
In this paper we have presented a physiologically inspired neural model for the visual recognition of transitive hand actions, defined by interactions between a moving hand and a goal object. The model is based largely on well-established neural principles, all of which can be implemented by physiologically plausible circuits. The model provides a unifying account for a variety of physiological results about action-selective neurons at the single-cell level, as well as for results about the population activity in relevant areas in macaque cortex. To our knowledge, this is the first model for the visual recognition of transitive actions that provides such detailed comparisons with neural data.
The proposed model has been shown to be computationally powerful enough to recognize actions from real video sequences. This lends credibility to the computational feasibility of the postulated neural principles as a basis for the processing of natural action stimuli. It also distinguishes our action recognition model from many others that assume abstract visual input signals without specifying exactly how they can be derived from real images by physiologically plausible mechanisms. In addition, this property made it possible to test the model with original stimuli that have been used in physiological experiments. However, the model would need several substantial extensions to deal, for example, with substantial amounts of clutter, or with scenes that include multiple possible goal objects or observed effectors. Some possible extensions and performance limitations of the model were discussed in Results, Performance limitations of the model.
Our model not only provides a unifying account for a number of physiological results from action-selective neurons in monkey cortex. It also leads to several important theoretical insights.
First, it shows that the recognition of goal-directed actions and the visual tuning properties of action-selective neurons can be accounted for by established mechanisms that are based on learned view-specific neural representations, without the necessity of an accurate reconstruction of the three-dimensional structure of the effector and the object. Since the estimation of joint angles, especially from monocular images, is a challenging computer vision problem (for review, see Wu and Huang, 1999; Erol et al., 2005), our model suggests that the brain might bypass this computational step using, at least to a substantial degree, representations that are based on two-dimensional views. Such a solution also seems theoretically attractive, since it postulates that the brain uses similar neuro-computational principles for the processing of static and dynamic three-dimensional stimuli (compare Materials and Methods, Relationship to other models). In addition, it is at least an interesting observation that the majority of robust algorithms for action detection and classification exploit example-based (view-specific) representations (Gavrila, 1999; Moeslund et al., 2006). The focus on visual processing mechanisms makes our model complementary to many other models for the visual recognition of hand actions that focus on the role of motor representations, making simplifying assumptions about the visual processing (see also Materials and Methods, Relationship to other models).
Second, the model proposes a set of concrete circuits for the integration of information about objects and dynamic effectors that could be implemented with real cortical neurons. At the same time, the model makes precise predictions about the behavior of such neurons that can be validated by single-unit recordings. Because of space limitations, only a few examples can be discussed here: (1) The model postulates the existence of neurons that encode the relative position of effector and object (relative position map), and a multiplicative integration of the relevant input signals from shape-selective representations. Neurons with such properties might be found in the superior temporal sulcus (Perrett et al., 1989) or the inferior parietal lobule (Fogassi et al., 2005; Chafee et al., 2007; Crowe et al., 2008; Rozzi et al., 2008). (2) The existence of affordance neurons, e.g., in parietal areas, with spatially organized receptive fields can be tested. (3) The model assumes a hierarchical architecture in which information is first processed in view-specific modules and then integrated by pooling at the highest level of the hierarchy. This predicts specific connections between view-specific and view-invariant action-selective neurons, e.g., in the premotor cortex or in the STS. Recent electrophysiological results prove the existence of view-specific representations very high up in the processing stream, even in premotor cortex (Caggiano et al., 2011). (4) The model postulates neurons that are selective for the relative motion between effector and object (relative speed and motion neurons). In contrast to regular motion detectors, e.g., in area MT, such neurons process motion in the RPM, and thus they should be characterized by a high degree of shape selectivity. Neurons of this type might be present in the STS or in parietal areas.
Many other specific predictions follow from the proposed architecture. Some of them, such as the view dependence and sequence selectivity of mirror neurons, have been confirmed by electrophysiological experiments that were partially motivated by this theory (Caggiano et al., 2011). In addition, the model also makes predictions about the population activity in cortical areas that are associated with the different postulated computational modules. Such predictions seem ideally suited for comparisons with fMRI data. Additional simulations addressing these aspects are in progress and might help to develop a more complete theory that links corresponding mechanisms in the brains of human and non-human primates.
Undoubtedly, our model makes a number of very strong simplifications, some of which violate known facts about the modeled cortical structures. In addition, many fundamental aspects of the model will have to be refined in future work. Again, only a few fundamental limitations can be discussed here: (1) The model focuses purely on the visual processing of actions and completely lacks interactions with motor representations. In particular, it does not account for the motor properties of action-selective neurons in parietal and premotor cortex, especially of mirror neurons. A large body of literature suggests, in addition, interactions between visual and motor representations, and the mirror neuron system might play a central role in establishing such interactions (Rizzolatti and Craighero, 2004; Kilner et al., 2007; Schütz-Bosbach and Prinz, 2007). The existence of feedback connections from motor to visual representations (e.g., between premotor areas, area PFG, and the STS) is strongly suggested by anatomical data (Rizzolatti and Craighero, 2004; Rizzolatti and Sinigaglia, 2010). An adequate theoretical framework for capturing such feedback influences is provided by hierarchies of predictive (neuro-)dynamical representations (Demiris and Simmons, 2006; Kiebel et al., 2008), such as neural fields. It seems straightforward, and has been successfully established in previous work in robotics, to couple such neural field representations for motor programs (Erlhagen and Schöner, 2002; Cisek, 2006) with ones for visual input sequences (Erlhagen et al., 2006). (2) Beyond the top-down connections from motor representations, the visual pathway is characterized by strong feedback connectivity (Felleman and Van Essen, 1991; Salin and Bullier, 1995) that is not captured by our model. In the context of action recognition, such connections might especially support the dynamic tracking of objects and effectors in the scene, and the attentional selection of individual objects in complex or cluttered scenes with multiple possible targets. (See Results, Performance limitations of the model, for further details.) (3) As in previous models for the recognition of non-transitive actions (Giese and Poggio, 2003; Jhuang et al., 2007; Escobar et al., 2009), one might consider a second, parallel visual pathway that processes local motion and optic flow features instead of form features. To what extent form versus motion features influence the visual recognition of goal-directed actions is, to our knowledge, largely unclear, and this seems to define an interesting question for future research. (4) A further important shortcoming of the proposed model is the complete lack of disparity features. Many neurons in the dorsal as well as in the ventral stream are disparity selective (Shikata et al., 1996; Janssen et al., 1999; Taira et al., 2000; Durand et al., 2007; Srivastava et al., 2009; Orban, 2011). Moreover, recent evidence shows the existence of disparity-selective neurons in cortical areas that are involved in action processing, such as the premotor area F5 (Joly et al., 2009; Theys et al., 2012). It seems possible to extend the chosen example-based approach by the inclusion of disparity-dependent features, such as relative disparity. Similar approaches have been proposed for object recognition from stereo images in computer vision (Helmer and Lowe, 2010).
Such extensions might provide interesting insights into the computational role of disparity features in the perception and control of actions, and into the internal representation of the geometry of external space during action execution (Làdavas, 2002).
Footnotes
This research was supported by the Deutsche Forschungsgemeinschaft (SFB 550-C10, GI 305/4-1), EU projects FP7-ICT-215866 SEARISE, FP7-249858-TP3 TANGO, FP7-ICT-248311 AMARSi; and BMBF MBF FKZ:01GQ1002. We thank L. Fogassi, G. Rizzolatti, D. Endres, and J. Pomper for helpful discussions. We are grateful to A. Christensen for help with the simulations, and to M. Angelovska for help with the editing of the figures.
The authors declare no competing financial interests.
References
- Adelson EH, Bergen JR. Spatiotemporal energy models for the perception of motion. J Opt Soc Am A. 1985;2:284–299. doi: 10.1364/JOSAA.2.000284. [DOI] [PubMed] [Google Scholar]
- Aggelopoulos NC, Rolls ET. Scene perception: inferior temporal cortex neurons encode the positions of different objects in the scene. Eur J Neurosci. 2005;22:2903–2916. doi: 10.1111/j.1460-9568.2005.04487.x. [DOI] [PubMed] [Google Scholar]
- Amari S. Dynamics of pattern formation in lateral-inhibition type neural fields. Biol Cybern. 1977;27:77–87. doi: 10.1007/BF00337259. [DOI] [PubMed] [Google Scholar]
- Baker CI, Keysers C, Jellema T, Wicker B, Perrett DI. Coding of spatial position in the superior temporal sulcus of the macaque. Curr Psychol Lett. 2000;1:71–87. [Google Scholar]
- Barraclough NE, Keith RH, Xiao D, Oram MW, Perrett DI. Visual adaptation to goal-directed hand actions. J Cogn Neurosci. 2009;21:1806–1820. doi: 10.1162/jocn.2008.21145. [DOI] [PubMed] [Google Scholar]
- Bayerl P, Neumann H. Disambiguating visual motion through contextual feedback modulation. Neural Comput. 2004;16:2041–2066. doi: 10.1162/0899766041732404. [DOI] [PubMed] [Google Scholar]
- Beardsley SA, Vaina LM. A laterally interconnected neural architecture in mst accounts for psychophysical discrimination of complex motion patterns. J Comput Neurosci. 2001;10:255–280. doi: 10.1023/A:1011264014799. [DOI] [PubMed] [Google Scholar]
- Bengio Y. Learning deep architectures for AI. Found Trends Mach Learn. 2009;2:1–127. doi: 10.1561/2200000006. [DOI] [Google Scholar]
- Ben-Yishai R, Hansel D, Sompolinsky H. Traveling waves and the processing of weakly tuned inputs in a cortical network module. J Comput Neurosci. 1997;4:57–77. doi: 10.1023/A:1008816611284. [DOI] [PubMed] [Google Scholar]
- Bitzer S, Kiebel SJ. Recognizing recurrent neural networks (rrnn): Bayesian inference for recurrent neural networks. Biol Cybern. 2012;106:201–217. doi: 10.1007/s00422-012-0490-x. [DOI] [PubMed] [Google Scholar]
- Bonaiuto J, Arbib MA. Extending the mirror neuron system model, ii: what did i just do? a new role for mirror neurons. Biol Cybern. 2010;102:341–359. doi: 10.1007/s00422-010-0371-0. [DOI] [PubMed] [Google Scholar]
- Bonini L, Rozzi S, Serventi FU, Simone L, Ferrari PF, Fogassi L. Ventral premotor and inferior parietal cortices make distinct contribution to action organization and intention understanding. Cereb Cortex. 2010;20:1372–1385. doi: 10.1093/cercor/bhp200. [DOI] [PubMed] [Google Scholar]
- Bonini L, Serventi FU, Simone L, Rozzi S, Ferrari PF, Fogassi L. Grasping neurons of monkey parietal and premotor cortices encode action goals at distinct levels of abstraction during complex action sequences. J Neurosci. 2011;31:5876–5886. doi: 10.1523/JNEUROSCI.5186-10.2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Brody CD, Hopfield JJ. Simple networks for spike-timing-based computation, with application to olfactory processing. Neuron. 2003;37:843–852. doi: 10.1016/S0896-6273(03)00120-X. [DOI] [PubMed] [Google Scholar]
- Buccino G, Lui F, Canessa N, Patteri I, Lagravinese G, Benuzzi F, Porro CA, Rizzolatti G. Neural circuits involved in the recognition of actions performed by nonconspecifics: an fmri study. J Cogn Neurosci. 2004;16:114–126. doi: 10.1162/089892904322755601. [DOI] [PubMed] [Google Scholar]
- Buneo CA, Jarvis MR, Batista AP, Andersen RA. Direct visuomotor transformations for reaching. Nature. 2002;416:632–636. doi: 10.1038/416632a. [DOI] [PubMed] [Google Scholar]
- Cadieu C, Kouh M, Pasupathy A, Connor CE, Riesenhuber M, Poggio T. A model of v4 shape selectivity and invariance. J Neurophysiol. 2007;98:1733–1750. doi: 10.1152/jn.01265.2006. [DOI] [PubMed] [Google Scholar]
- Caggiano V, Fogassi L, Rizzolatti G, Thier P, Casile A. Mirror neurons differentially encode the peripersonal and extrapersonal space of monkeys. Science. 2009;324:403–406. doi: 10.1126/science.1166818. [DOI] [PubMed] [Google Scholar]
- Caggiano V, Fogassi L, Rizzolatti G, Pomper JK, Thier P, Giese MA, Casile A. View-based encoding of actions in mirror neurons of area f5 in macaque premotor cortex. Curr Biol. 2011;21:144–148. doi: 10.1016/j.cub.2010.12.022. [DOI] [PubMed] [Google Scholar]
- Carandini M, Heeger DJ, Movshon JA. Linearity and normalization in simple cells of the macaque primary visual cortex. J Neurosci. 1997;17:8621–8644. doi: 10.1523/JNEUROSCI.17-21-08621.1997. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Casile A, Giese MA. Critical features for the recognition of biological motion. J Vis. 2005;5:348–360. doi: 10.1167/5.4.6. [DOI] [PubMed] [Google Scholar]
- Chafee MV, Averbeck BB, Crowe DA. Representing spatial relationships in posterior parietal cortex: single neurons code object-referenced position. Cereb Cortex. 2007;17:2914–2932. doi: 10.1093/cercor/bhm017. [DOI] [PubMed] [Google Scholar]
- Chersi F, Ferrari PF, Fogassi L. Neuronal chains for actions in the parietal lobe: a computational model. PLoS ONE. 2011;6:e27652. doi: 10.1371/journal.pone.0027652. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chong TT, Cunnington R, Williams MA, Kanwisher N, Mattingley JB. fMRI adaptation reveals mirror neurons in human inferior parietal cortex. Curr Biol. 2008;18:1576–1580. doi: 10.1016/j.cub.2008.08.068. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cisek P. Integrated neural processes for defining potential actions and deciding between them: a computational model. J Neurosci. 2006;26:9761–9770. doi: 10.1523/JNEUROSCI.5605-05.2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Connor CE, Brincat SL, Pasupathy A. Transformation of shape information in the ventral pathway. Curr Opin Neurobiol. 2007;17:140–147. doi: 10.1016/j.conb.2007.03.002. [DOI] [PubMed] [Google Scholar]
- Crowe DA, Averbeck BB, Chafee MV. Neural ensemble decoding reveals a correlate of viewer-to object-centered spatial transformation in monkey parietal cortex. J Neurosci. 2008;28:5218–5228. doi: 10.1523/JNEUROSCI.5105-07.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Deco G, Rolls ET. A neurodynamical cortical model of visual attention and invariant object recognition. Vision Res. 2004;44:621–642. doi: 10.1016/j.visres.2003.09.037. [DOI] [PubMed] [Google Scholar]
- Demiris Y, Simmons G. Perceiving the unusual: temporal properties of hierarchical motor representations for action perception. Neural Netw. 2006;19:272–284. doi: 10.1016/j.neunet.2006.02.005. [DOI] [PubMed] [Google Scholar]
- Deneve S, Pouget A. Basis functions for object-centered representations. Neuron. 2003;37:347–359. doi: 10.1016/S0896-6273(02)01184-4. [DOI] [PubMed] [Google Scholar]
- De Valois RL, Albrecht DG, Thorell LG. Spatial frequency selectivity of cells in macaque visual cortex. Vision Res. 1982;22:545–559. doi: 10.1016/0042-6989(82)90113-4. [DOI] [PubMed] [Google Scholar]
- DiCarlo JJ, Maunsell JH. Anterior inferotemporal neurons of monkeys engaged in object recognition can be highly sensitive to object retinal position. J Neurophysiol. 2003;89:3264–3278. doi: 10.1152/jn.00358.2002. [DOI] [PubMed] [Google Scholar]
- Dinstein I, Thomas C, Behrmann M, Heeger DJ. A mirror up to nature. Curr Biol. 2008;18:R13–R18. doi: 10.1016/j.cub.2007.11.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- di Pellegrino G, Fadiga L, Fogassi L, Gallese V, Rizzolatti G. Understanding motor events: a neurophysiological study. Exp Brain Res. 1992;91:176–180. doi: 10.1007/BF00230027. [DOI] [PubMed] [Google Scholar]
- Durand JB, Nelissen K, Joly O, Wardak C, Todd JT, Norman JF, Janssen P, Vanduffel W, Orban GA. Anterior regions of monkey parietal cortex process visual 3D shape. Neuron. 2007;55:493–505. doi: 10.1016/j.neuron.2007.06.040. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ekstrom A. How and when the fMRI BOLD signal relates to underlying neural activity: the danger in dissociation. Brain Res Rev. 2010;62:233–244. doi: 10.1016/j.brainresrev.2009.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Erlhagen W, Schöner G. Dynamic field theory of movement preparation. Psychol Rev. 2002;109:545–572. doi: 10.1037/0033-295X.109.3.545. [DOI] [PubMed] [Google Scholar]
- Erlhagen W, Mukovskiy A, Bicho E. A dynamic model for action understanding and goal-directed imitation. Brain Res. 2006;1083:174–188. doi: 10.1016/j.brainres.2006.01.114. [DOI] [PubMed] [Google Scholar]
- Erol A, Bebis G, Nicolescu M, Boyle RD, Twombly X. A review on vision-based full DOF hand motion estimation. Paper presented at IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR); June; San Diego, CA. 2005. [Google Scholar]
- Escobar MJ, Masson GS, Vieville T, Kornprobst P. Action recognition using a bio-inspired feed-forward spiking network. Int J Comput Vision. 2009;82:284–301. doi: 10.1007/s11263-008-0201-1. [DOI] [Google Scholar]
- Falconbridge MS, Stamps RL, Badcock DR. A simple hebbian/anti-hebbian network learns the sparse, independent components of natural images. Neural Comput. 2006;18:415–429. doi: 10.1162/089976606775093891. [DOI] [PubMed] [Google Scholar]
- Felleman DJ, Van Essen DC. Distributed hierarchical processing in the primate cerebral cortex. Cereb Cortex. 1991;1:1–47. doi: 10.1093/cercor/1.1.1-a. [DOI] [PubMed] [Google Scholar]
- Fidler S, Boben M, Leonardis A. Similarity-based cross-layered hierarchical representation for object categorization. Paper presented at IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR); June; Anchorage, AK. 2008. [Google Scholar]
- Fidler S, Boben M, Leonardis A. Optimization framework for learning a hierarchical shape vocabulary for object class detection. Paper presented at British Machine Vision Conference (BMVC); September; London, UK. 2009. [Google Scholar]
- Fleischer F, Christensen A, Caggiano V, Thier P, Giese MA. Neural theory for the perception of causal actions. Psychol Res. 2012;76:476–493. doi: 10.1007/s00426-012-0437-9. [DOI] [PubMed] [Google Scholar]
- Fogassi L, Ferrari PF, Gesierich B, Rozzi S, Chersi F, Rizzolatti G. Parietal lobe: from action organization to intention understanding. Science. 2005;308:662–667. doi: 10.1126/science.1106138. [DOI] [PubMed] [Google Scholar]
- Földiák P. Forming sparse representations by local anti-hebbian learning. Biol Cybern. 1990;64:165–170. doi: 10.1007/BF02331346. [DOI] [PubMed] [Google Scholar]
- Freedman DJ, Riesenhuber M, Poggio T, Miller EK. Experience-dependent sharpening of visual shape selectivity in inferior temporal cortex. Cereb Cortex. 2006;16:1631–1644. doi: 10.1093/cercor/bhj100. [DOI] [PubMed] [Google Scholar]
- Fukushima K. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern. 1980;36:193–202. doi: 10.1007/BF00344251. [DOI] [PubMed] [Google Scholar]
- Gallant JL, Braun J, Van Essen DC. Selectivity for polar, hyperbolic, and cartesian gratings in macaque visual cortex. Science. 1993;259:100–103. doi: 10.1126/science.8418487. [DOI] [PubMed] [Google Scholar]
- Gallese V, Fadiga L, Fogassi L, Rizzolatti G. Action recognition in the premotor cortex. Brain. 1996;119:593–609. doi: 10.1093/brain/119.2.593. [DOI] [PubMed] [Google Scholar]
- Gavrila D. The visual analysis of human movement: A survey. Comput Vis Image Underst. 1999;73:82–98. doi: 10.1006/cviu.1998.0716. [DOI] [Google Scholar]
- Gerhard F, Savin C, Triesch J. A robust biologically plausible implementation of ica-like learning. Paper presented at the 17th European Symposium on Artificial Neural Networks (ESANN); April; Bruges, Belgium. 2009. [Google Scholar]
- Giese MA. Neural field theory of motion perception. Dordrecht: Kluwer Academic; 1999. [Google Scholar]
- Giese MA, Poggio T. Neural mechanisms for the recognition of biological movements. Nat Rev Neurosci. 2003;4:179–192. doi: 10.1038/nrn1057. [DOI] [PubMed] [Google Scholar]
- Hamker FH. Modeling feature-based attention as an active top-down inference process. Biosystems. 2006;86:91–99. doi: 10.1016/j.biosystems.2006.03.010. [DOI] [PubMed] [Google Scholar]
- Haruno M, Wolpert DM, Kawato M. Mosaic model for sensorimotor learning and control. Neural Comput. 2001;13:2201–2220. doi: 10.1162/089976601750541778. [DOI] [PubMed] [Google Scholar]
- Heeger DJ. Modeling simple-cell direction selectivity with normalized, half-squared, linear operators. J Neurophysiol. 1993;70:1885–1898. doi: 10.1152/jn.1993.70.5.1885. [DOI] [PubMed] [Google Scholar]
- Helmer S, Lowe DG. Using stereo for object recognition. Paper presented at IEEE International Conference on Robotics and Automation (ICRA); May; Anchorage, AL. 2010. [Google Scholar]
- Hinton GE. Learning multiple layers of representation. Trends Cogn Sci. 2007;11:428–434. doi: 10.1016/j.tics.2007.09.004. [DOI] [PubMed] [Google Scholar]
- Iacoboni M, Woods RP, Brass M, Bekkering H, Mazziotta JC, Rizzolatti G. Cortical mechanisms of human imitation. Science. 1999;286:2526–2528. doi: 10.1126/science.286.5449.2526. [DOI] [PubMed] [Google Scholar]
- Iacoboni M, Molnar-Szakacs I, Gallese V, Buccino G, Mazziotta JC, Rizzolatti G. Grasping the intentions of others with one's own mirror neuron system. PLoS Biol. 2005;3:e79. doi: 10.1371/journal.pbio.0030079. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Janssen P, Vogels R, Orban GA. Macaque inferior temporal neurons are selective for disparity-defined three-dimensional shapes. Proc Natl Acad Sci U S A. 1999;96:8217–8222. doi: 10.1073/pnas.96.14.8217. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Janssen P, Srivastava S, Ombelet S, Orban GA. Coding of shape and position in macaque lateral intraparietal area. J Neurosci. 2008;28:6679–6690. doi: 10.1523/JNEUROSCI.0499-08.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jastorff J, Giese M. Time-dependent hebbian learning rules for the learning of templates for visual motion recognition. In: Ilg U, Bülthoff H, Mallot H, editors. Dynamic perception. 151–156. Amsterdam: IOS; 2004. [Google Scholar]
- Jastorff J, Clavagnier S, Gergely G, Orban GA. Neural mechanisms of understanding rational actions: Middle temporal gyrus activation by contextual violation. Cereb Cortex. 2011;21:318–329. doi: 10.1093/cercor/bhq098. [DOI] [PubMed] [Google Scholar]
- Jellema T, Perrett DI. Perceptual history influences neural responses to face and body postures. J Cogn Neurosci. 2003;15:961–971. doi: 10.1162/089892903770007353. [DOI] [PubMed] [Google Scholar]
- Jellema T, Perrett DI. Neural representations of perceived bodily actions using a categorical frame of reference. Neuropsychologia. 2006;44:1535–1546. doi: 10.1016/j.neuropsychologia.2006.01.020. [DOI] [PubMed] [Google Scholar]
- Jellema T, Maassen G, Perrett DI. Single cell integration of animate form, motion and location in the superior temporal cortex of the macaque monkey. Cereb Cortex. 2004;14:781–790. doi: 10.1093/cercor/bhh038. [DOI] [PubMed] [Google Scholar]
- Jhuang H, Serre T, Wolf L, Poggio T. A biologically inspired system for action recognition. Paper presented at IEEE International Conference on Computer Vision (ICCV); October; Rio de Janeiro, Brazil. 2007. [Google Scholar]
- Jhuang H, Garrote E, Mutch J, Poggio T, Steele A, Serre T. Automated home-cage behavioral phenotyping of mice. Nature Comm. 2010;1:1–9. doi: 10.1038/ncomms1064. [DOI] [PubMed] [Google Scholar]
- Joly O, Vanduffel W, Orban GA. The monkey ventral premotor cortex processes 3D shape from disparity. Neuroimage. 2009;47:262–272. doi: 10.1016/j.neuroimage.2009.04.043. [DOI] [PubMed] [Google Scholar]
- Jones JP, Palmer LA. An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex. J Neurophysiol. 1987;58:1233–1258. doi: 10.1152/jn.1987.58.6.1233. [DOI] [PubMed] [Google Scholar]
- Kavukcuoglu K, Sermanet P, Boureau YL, Gregor K, Mathieu M, LeCun Y. Learning convolutional feature hierarchies for visual recognition. In: Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, editors. Advances in neural information processing systems (NIPS) Curran Associates; 2010. pp. 1090–1098. [Google Scholar]
- Keysers C, Perrett DI. Demystifying social cognition: a hebbian perspective. Trends Cogn Sci. 2004;8:501–507. doi: 10.1016/j.tics.2004.09.005. [DOI] [PubMed] [Google Scholar]
- Kiebel SJ, Daunizeau J, Friston KJ. A hierarchy of time-scales and the brain. PLoS Comput Biol. 2008;4:e1000209. doi: 10.1371/journal.pcbi.1000209. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kilner JM, Friston KJ, Frith CD. Predictive coding: an account of the mirror neuron system. Cogn Process. 2007;8:159–166. doi: 10.1007/s10339-007-0170-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kilner J, Neal A, Weiskopf N, Friston KJ, Frith CD. Evidence of mirror neurons in human inferior frontal gyrus. J Neurosci. 2009;29:10153–10159. doi: 10.1523/JNEUROSCI.2668-09.2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kobatake E, Wang G, Tanaka K. Effects of shape-discrimination training on the selectivity of inferotemporal cells in adult monkeys. J Neurophysiol. 1998;80:324–330. doi: 10.1152/jn.1998.80.1.324. [DOI] [PubMed] [Google Scholar]
- Koenderink JJ. Optic flow. Vision Res. 1986;26:161–179. doi: 10.1016/0042-6989(86)90078-7. [DOI] [PubMed] [Google Scholar]
- Kraskov A, Dancause N, Quallo MM, Shepherd S, Lemon RN. Corticospinal neurons in macaque ventral premotor cortex with mirror properties: a potential mechanism for action suppression? Neuron. 2009;64:922–930. doi: 10.1016/j.neuron.2009.12.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- La'davas E. Functional and dynamic properties of visual peripersonal space. Trends Cogn Sci. 2002;6:17–22. doi: 10.1016/s1364-6613(00)01814-3. [DOI] [PubMed] [Google Scholar]
- Lange J, Lappe M. A model of biological motion perception from configural form cues. J Neurosci. 2006;26:2894–2906. doi: 10.1523/JNEUROSCI.4915-05.2006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Layher G, Giese MA, Neumann H. Learning representations for animated motion sequence and implied motion mecognition. Paper presented at IEEE International Conference on Neural Networks (ICANN); September; Lausanne, Switzerland. 2012. [Google Scholar]
- Le QV, Zou WY, Yeung SY, Ng AY. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); June; Colorado Springs, CO. 2011. [Google Scholar]
- Lingnau A, Gesierich B, Caramazza A. Asymmetric fMRI adaptation reveals no evidence for mirror neurons in humans. Proc Natl Acad Sci U S A. 2009;106:9925–9930. doi: 10.1073/pnas.0902262106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Logothetis NK. The neural basis of the blood-oxygen-level-dependent functional magnetic resonance imaging signal. Philos Trans R Soc Lond B Biol Sci. 2002;357:1003–1037. doi: 10.1098/rstb.2002.1114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Logothetis NK. What we can do and what we cannot do with fMRI. Nature. 2008;453:869–878. doi: 10.1038/nature06976. [DOI] [PubMed] [Google Scholar]
- Logothetis NK, Wandell BA. Interpreting the BOLD signal. Annu Rev Physiol. 2004;66:735–769. doi: 10.1146/annurev.physiol.66.082602.092845. [DOI] [PubMed] [Google Scholar]
- Logothetis NK, Pauls J, Poggio T. Shape representation in the inferior temporal cortex of monkeys. Curr Biol. 1995;5:552–563. doi: 10.1016/S0960-9822(95)00108-4. [DOI] [PubMed] [Google Scholar]
- Markram H, Lübke J, Frotscher M, Sakmann B. Regulation of synaptic efficacy by coincidence of postsynaptic aps and epsps. Science. 1997;275:213–215. doi: 10.1126/science.275.5297.213. [DOI] [PubMed] [Google Scholar]
- Matelli M, Camarda R, Glickstein M, Rizzolatti G. Afferent and efferent projections of the inferior area 6 in the macaque monkey. J Comp Neurol. 1986;251:281–298. doi: 10.1002/cne.902510302. [DOI] [PubMed] [Google Scholar]
- Mel BW, Fiser J. Minimizing binding errors using learned conjunctive features. Neural Comput. 2000;12:731–762. doi: 10.1162/089976600300015574. [DOI] [PubMed] [Google Scholar]
- Metta G, Sandini G, Natale L, Craighero L, Fadiga L. Understanding mirror neurons: a bio-robotic approach. Epigen Robot. 2006;7:197–232. [Google Scholar]
- Mitra S, Acharya T. Gesture recognition: a survey. IEEE Trans Syst Man Cybern C Applicat Rev. 2007;37:311–324. [Google Scholar]
- Moeslund TB, Hilton A, Krüger V. A survey of advances in vision-based human motion capture and analysis. Comput Vis. 2006;104:90–126. doi: 10.1016/j.cviu.2006.08.002. [DOI] [Google Scholar]
- Movshon JA, Thompson ID, Tolhurst DJ. Spatial summation in the receptive fields of simple cells in the cat's striate cortex. J Physiol. 1978;283:53–77. doi: 10.1113/jphysiol.1978.sp012488.
- Mukamel R, Ekstrom AD, Kaplan J, Iacoboni M, Fried I. Single-neuron responses in humans during execution and observation of actions. Curr Biol. 2010;20:750–756. doi: 10.1016/j.cub.2010.02.045.
- Nater F, Grabner H, Van Gool L. Temporal relations in videos for unsupervised activity analysis. Paper presented at British Machine Vision Conference (BMVC); August; Dundee, Scotland, UK. 2011.
- Nelissen K, Luppino G, Vanduffel W, Rizzolatti G, Orban GA. Observing others: multiple action representation in the frontal lobe. Science. 2005;310:332–336. doi: 10.1126/science.1115593.
- Nelissen K, Vanduffel W, Orban GA. Charting the lower superior temporal region, a new motion-sensitive region in monkey superior temporal sulcus. J Neurosci. 2006;26:5929–5947. doi: 10.1523/JNEUROSCI.0824-06.2006.
- Nir Y, Fisch L, Mukamel R, Gelbard-Sagiv H, Arieli A, Fried I, Malach R. Coupling between neuronal firing rate, gamma LFP, and BOLD fMRI is related to interneuronal correlations. Curr Biol. 2007;17:1275–1285. doi: 10.1016/j.cub.2007.06.066.
- Ochiai T, Mushiake H, Tanji J. Involvement of the ventral premotor cortex in controlling image motion of the hand during performance of a target-capturing task. Cereb Cortex. 2005;15:929–937. doi: 10.1093/cercor/bhh193.
- Op De Beeck H, Vogels R. Spatial sensitivity of macaque inferior temporal neurons. J Comp Neurol. 2000;426:505–518. doi: 10.1002/1096-9861(20001030)426:4<505::AID-CNE1>3.0.CO;2-M.
- Op de Beeck HP, Dicarlo JJ, Goense JB, Grill-Spector K, Papanastassiou A, Tanifuji M, Tsao DY. Fine-scale spatial organization of face and object selectivity in the temporal lobe: do functional magnetic resonance imaging, optical imaging, and electrophysiology agree? J Neurosci. 2008;28:11796–11801. doi: 10.1523/JNEUROSCI.3799-08.2008.
- Oram MW, Perrett DI. Modeling visual recognition from neurobiological constraints. Neural Netw. 1994;7:945–972. doi: 10.1016/S0893-6080(05)80153-4.
- Oram MW, Perrett DI. Integration of form and motion in the anterior superior temporal polysensory area (STPa) of the macaque monkey. J Neurophysiol. 1996;76:109–129. doi: 10.1152/jn.1996.76.1.109.
- Orban G. The extraction of 3D shape in the visual system of human and nonhuman primates. Annu Rev Neurosci. 2011;34:361–388. doi: 10.1146/annurev-neuro-061010-113819.
- Oztop E, Arbib MA. Schema design and implementation of the grasp-related mirror neuron system. Biol Cybern. 2002;87:116–140. doi: 10.1007/s00422-002-0318-1.
- Oztop E, Kawato M, Arbib M. Mirror neurons and imitation: a computationally guided review. Neural Netw. 2006;19:254–271. doi: 10.1016/j.neunet.2006.02.002.
- Pasupathy A, Connor CE. Responses to contour features in macaque area V4. J Neurophysiol. 1999;82:2490–2502. doi: 10.1152/jn.1999.82.5.2490.
- Pavlovic VI, Sharma R, Huang TS. Visual interpretation of hand gestures for human-computer interaction: a review. IEEE Trans Pattern Anal Mach Intell. 1997;19:677–695.
- Perrett D, Oram M. Neurophysiology of shape processing. Image Vis Comput. 1993;11:317–333. doi: 10.1016/0262-8856(93)90011-5.
- Perrett DI, Rolls ET, Caan W. Visual neurons responsive to faces in the monkey temporal cortex. Exp Brain Res. 1982;47:329–342. doi: 10.1007/BF00239352.
- Perrett DI, Smith PA, Potter DD, Mistlin AJ, Head AS, Milner AD, Jeeves MA. Visual cells in the temporal cortex sensitive to face view and gaze direction. Proc R Soc Lond B Biol Sci. 1985;223:293–317. doi: 10.1098/rspb.1985.0003.
- Perrett DI, Harries MH, Bevan R, Thomas S, Benson PJ, Mistlin AJ, Chitty AJ, Hietanen JK, Ortega JE. Frameworks of analysis for the neural representation of animate objects and actions. J Exp Biol. 1989;146:87–113. doi: 10.1242/jeb.146.1.87.
- Perrett DI, Oram MW, Harries MH, Bevan R, Hietanen JK, Benson PJ, Thomas S. Viewer-centred and object-centred coding of heads in the macaque temporal cortex. Exp Brain Res. 1991;86:159–173. doi: 10.1007/BF00231050.
- Pesaran B, Nelson MJ, Andersen RA. Dorsal premotor neurons encode the relative position of the hand, eye, and goal during reach planning. Neuron. 2006;51:125–134. doi: 10.1016/j.neuron.2006.05.025.
- Poggio T, Edelman S. A network that learns to recognize three-dimensional objects. Nature. 1990;343:263–266. doi: 10.1038/343263a0.
- Pouget A, Sejnowski T. Spatial transformations in the parietal cortex using basis functions. J Cogn Neurosci. 1997;9:222–237. doi: 10.1162/jocn.1997.9.2.222.
- Prevete R, Tessitore G, Santoro M, Catanzariti E. A connectionist architecture for view-independent grip-aperture computation. Brain Res. 2008;1225:133–145. doi: 10.1016/j.brainres.2008.04.076.
- Riesenhuber M, Poggio T. Are cortical models really bound by the “binding problem”? Neuron. 1999a;24:87–93, 111–125. doi: 10.1016/s0896-6273(00)80824-7.
- Riesenhuber M, Poggio T. Hierarchical models of object recognition in cortex. Nat Neurosci. 1999b;2:1019–1025. doi: 10.1038/14819.
- Rizzolatti G, Craighero L. The mirror-neuron system. Annu Rev Neurosci. 2004;27:169–192. doi: 10.1146/annurev.neuro.27.070203.144230.
- Rizzolatti G, Sinigaglia C. The functional role of the parieto-frontal mirror circuit: interpretations and misinterpretations. Nat Rev Neurosci. 2010;11:264–274. doi: 10.1038/nrn2805.
- Rolls ET, Milward T. A model of invariant object recognition in the visual system: learning rules, activation functions, lateral inhibition, and information-based performance measures. Neural Comput. 2000;12:2547–2572. doi: 10.1162/089976600300014845.
- Rozzi S, Ferrari PF, Bonini L, Rizzolatti G, Fogassi L. Functional organization of inferior parietal lobule convexity in the macaque monkey: electrophysiological characterization of motor, sensory and mirror responses and their correlation with cytoarchitectonic areas. Eur J Neurosci. 2008;28:1569–1588. doi: 10.1111/j.1460-9568.2008.06395.x.
- Rust NC, Dicarlo JJ. Selectivity and tolerance (“invariance”) both increase as visual information propagates from cortical area V4 to IT. J Neurosci. 2010;30:12978–12995. doi: 10.1523/JNEUROSCI.0179-10.2010.
- Saito H, Yukie M, Tanaka K, Hikosaka K, Fukada Y, Iwai E. Integration of direction signals of image motion in the superior temporal sulcus of the macaque monkey. J Neurosci. 1986;6:145–157. doi: 10.1523/JNEUROSCI.06-01-00145.1986.
- Salin PA, Bullier J. Corticocortical connections in the visual system: structure and function. Physiol Rev. 1995;75:107–154. doi: 10.1152/physrev.1995.75.1.107.
- Salinas E, Abbott LF. Transfer of coded information from sensory to motor networks. J Neurosci. 1995;15:6461–6474. doi: 10.1523/JNEUROSCI.15-10-06461.1995.
- Schiller PH, Finlay BL, Volman SF. Quantitative studies of single-cell properties in monkey striate cortex. III. Spatial frequency. J Neurophysiol. 1976;39:1334–1351. doi: 10.1152/jn.1976.39.6.1334.
- Schindler K, van Gool L. Combining densely sampled form and motion for human action recognition. Paper presented at DAGM Symposium; June; Munich, Germany. 2008.
- Schütz-Bosbach S, Prinz W. Perceptual resonance: action-induced modulation of perception. Trends Cogn Sci. 2007;11:349–355. doi: 10.1016/j.tics.2007.06.005.
- Seltzer B, Pandya DN. Afferent cortical connections and architectonics of superior temporal sulcus and surrounding cortex in the rhesus monkey. Brain Res. 1978;149:1–24. doi: 10.1016/0006-8993(78)90584-X.
- Serre T, Riesenhuber M. Realistic modeling of simple and complex cell tuning in the HMAX model, and implications for invariant object recognition in cortex. AI Memo 2004–017/CBCL Memo 23. Cambridge, MA: MIT; 2004.
- Serre T, Kreiman G, Kouh M, Cadieu C, Knoblich U, Poggio T. A quantitative theory of immediate visual recognition. Prog Brain Res. 2007a;165:33–56. doi: 10.1016/S0079-6123(06)65004-8.
- Serre T, Wolf L, Bileschi S, Riesenhuber M, Poggio T. Robust object recognition with cortex-like mechanisms. IEEE Trans Pattern Anal Mach Intell. 2007b;29:411–426. doi: 10.1109/TPAMI.2007.56.
- Shikata E, Tanaka Y, Nakamura H, Taira M, Sakata H. Selectivity of the parietal visual neurones in 3D orientation of surface of stereoscopic stimuli. Neuroreport. 1996;7:2389–2394. doi: 10.1097/00001756-199610020-00022.
- Sigala N, Logothetis NK. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature. 2002;415:318–320. doi: 10.1038/415318a.
- Singer JM, Sheinberg DL. Temporal cortex neurons encode articulated actions as slow sequences of integrated poses. J Neurosci. 2010;30:3133–3145. doi: 10.1523/JNEUROSCI.3211-09.2010.
- Song S, Miller KD, Abbott LF. Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat Neurosci. 2000;3:919–926. doi: 10.1038/78829.
- Srivastava S, Orban GA, De Mazière PA, Janssen P. A distinct representation of three-dimensional shape in macaque anterior intraparietal area: fast, metric, and coarse. J Neurosci. 2009;29:10613–10626. doi: 10.1523/JNEUROSCI.6016-08.2009.
- Suzuki W, Tanaka K. Development of monotonic neuronal tuning in the monkey inferotemporal cortex through long-term learning of fine shape discrimination. Eur J Neurosci. 2011;33:748–757. doi: 10.1111/j.1460-9568.2010.07539.x.
- Taira M, Tsutsui KI, Jiang M, Yara K, Sakata H. Parietal neurons represent surface orientation from the gradient of binocular disparity. J Neurophysiol. 2000;83:3140–3146. doi: 10.1152/jn.2000.83.5.3140.
- Tarr MJ, Bülthoff HH. Image-based object recognition in man, monkey and machine. Cognition. 1998;67:1–20. doi: 10.1016/S0010-0277(98)00026-2.
- Taylor GW, Sigal L, Fleet DJ, Hinton GE. Dynamical binary latent variable models for 3D human pose tracking. Paper presented at The 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR); June; San Francisco, CA. 2010.
- Tessitore G, Donnarumma F, Prevete R. An action-tuned neural network architecture for hand pose estimation. Paper presented at International Conference on Fuzzy Computation and International Conference on Neural Computation; October; Valencia, Spain. 2010.
- Theys T, Srivastava S, van Loon J, Goffin J, Janssen P. Selectivity for three-dimensional contours and surfaces in the anterior intraparietal area. J Neurophysiol. 2012;107:995–1008. doi: 10.1152/jn.00248.2011.
- Thurman SM, Giese MA, Grossman ED. Perceptual and computational analysis of critical features for biological motion. J Vis. 2010;10(12):15. doi: 10.1167/10.12.15.
- Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Statist Soc Ser B. 1994;58:267–288.
- Tkach D, Reimer J, Hatsopoulos NG. Congruent activity during action and action observation in motor cortex. J Neurosci. 2007;27:13241–13250. doi: 10.1523/JNEUROSCI.2895-07.2007.
- Tsao DY, Livingstone MS. Mechanisms of face perception. Annu Rev Neurosci. 2008;31:411–437. doi: 10.1146/annurev.neuro.30.051606.094238.
- Tsotsos JK. A computational perspective on visual attention. Cambridge, MA: MIT; 2011.
- Umiltà MA, Kohler E, Gallese V, Fogassi L, Fadiga L, Keysers C, Rizzolatti G. I know what you are doing: a neurophysiological study. Neuron. 2001;31:155–165. doi: 10.1016/S0896-6273(01)00337-3.
- Vangeneugden J, Pollick F, Vogels R. Functional differentiation of macaque visual temporal cortical neurons using a parametric action space. Cereb Cortex. 2009;19:593–611. doi: 10.1093/cercor/bhn109.
- Vangeneugden J, De Mazière PA, Van Hulle MM, Jaeggli T, Van Gool L, Vogels R. Distinct mechanisms for coding of visual actions in macaque temporal cortex. J Neurosci. 2011;31:385–401. doi: 10.1523/JNEUROSCI.2703-10.2011.
- Weinland D, Ronfard R, Boyer E. A survey of vision-based methods for action representation, segmentation and recognition. Comput Vis Image Underst. 2011;115:224–241. doi: 10.1016/j.cviu.2010.10.002.
- Wilson HR, Cowan JD. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys J. 1972;12:1–24. doi: 10.1016/S0006-3495(72)86068-5.
- Wolpert DM, Doya K, Kawato M. A unifying computational framework for motor control and social interaction. Philos Trans R Soc Lond B Biol Sci. 2003;358:593–602. doi: 10.1098/rstb.2002.1238.
- Wu Y, Huang TS. Vision-based gesture recognition: a review. In: Braffort A, Gherbi R, Gibet S, Richardson J, Teil E, editors. Proceedings of the international gesture workshop on gesture-based communication in human-computer interaction. London: Springer-Verlag; 1999. pp. 103–115.
- Xie X, Giese MA. Nonlinear dynamics of direction-selective recurrent neural media. Phys Rev E Stat Nonlin Soft Matter Phys. 2002;65:051904. doi: 10.1103/PhysRevE.65.051904.
- Yildiz IB, Kiebel SJ. A hierarchical neuronal model for generation and online recognition of birdsongs. PLoS Comput Biol. 2011;7:e1002303. doi: 10.1371/journal.pcbi.1002303.
- Zemel RS, Sejnowski TJ. A model for encoding multiple object motions and self-motion in area MST of primate visual cortex. J Neurosci. 1998;18:531–547. doi: 10.1523/JNEUROSCI.18-01-00531.1998.
- Zhang K. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. J Neurosci. 1996;16:2112–2126. doi: 10.1523/JNEUROSCI.16-06-02112.1996.
- Zipser D, Andersen RA. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature. 1988;331:679–684. doi: 10.1038/331679a0.