Skip to main content
PLOS One logoLink to PLOS One
. 2020 Dec 28;15(12):e0243829. doi: 10.1371/journal.pone.0243829

Using enriched semantic event chains to model human action prediction based on (minimal) spatial information

Fatemeh Ziaeetabar 1,*, Jennifer Pomp 2, Stefan Pfeiffer 1, Nadiya El-Sourani 2, Ricarda I Schubotz 2, Minija Tamosiunaite 1,3, Florentin Wörgötter 1
Editor: Chen Zonghai4
PMCID: PMC7769489  PMID: 33370343

Abstract

Predicting other people’s upcoming action is key to successful social interactions. Previous studies have started to disentangle the various sources of information that action observers exploit, including objects, movements, contextual cues and features regarding the acting person’s identity. We here focus on the role of static and dynamic inter-object spatial relations that change during an action. We designed a virtual reality setup and tested recognition speed for ten different manipulation actions. Importantly, all objects had been abstracted by emulating them with cubes such that participants could not infer an action using object information. Instead, participants had to rely only on the limited information that comes from the changes in the spatial relations between the cubes. In spite of these constraints, participants were able to predict actions in, on average, less than 64% of the action’s duration. Furthermore, we employed a computational model, the so-called enriched Semantic Event Chain (eSEC), which incorporates the information of different types of spatial relations: (a) objects’ touching/untouching, (b) static spatial relations between objects and (c) dynamic spatial relations between objects during an action. Assuming the eSEC as an underlying model, we show, using information theoretical analysis, that humans mostly rely on a mixed-cue strategy when predicting actions. Machine-based action prediction is able to produce faster decisions based on individual cues. We argue that human strategy, though slower, may be particularly beneficial for prediction of natural and more complex actions with more variable or partial sources of information. Our findings contribute to the understanding of how individuals afford inferring observed actions’ goals even before full goal accomplishment, and may open new avenues for building robots for conflict-free human-robot cooperation.

1 Introduction

Human beings excel at recognizing actions performed by others, and they do so even before the action goal has been effectively achieved [1, 2]. Thus, humans engage in action prediction. During this process, the brain activates a premotor-parietal network [3] that largely overlaps with the networks needed for action execution and action imagery [4]. Though in recent years, some progress has been made towards computationally more concrete models of the mechanisms and processes underlying action recognition [5], it still remains largely unresolved how the brain accomplishes this complex task. Prediction of actions can rely on different sources of information, including manipulated objects [69], contextual objects [10, 11], movements [12], context [13] and features regarding the actress or actor [14]. A major aim of ongoing research is to disentangle the respective contribution and relevance of these sources of information feeding human action prediction. Since these sources are largely confounded even in simple instances of natural action, the experimental approach has to fully control or to bluntly eliminate all potentially confounding sources that are not in the focus of empirical testing.

Against this backdrop, the present study addressed the relevance of spatial relations between the objects in an action scene. Previous studies comparing manipulation of appropriate objects (i.e., normal actions) with manipulations of inappropriate objects (i.e., pantomime) showed that brain activity during action observation was largely explained by processing of the actor’s movements [15]. As a caveat, this finding may be explained by the particular movement-focused strategy subjects selected in this study where normal and pantomime actions were presented in intermixed succession. Other studies show that motion features are used by the brain to segment observed actions into meaningful segments and to update internal predictive models of the observed action [16, 17]. Correspondingly, individuals segment actions into consistent, meaningful chunks [18, 19], and intra-individually, they do so in a highly consistent manner, albeit high inter-individual variability [16]. It has been argued that the objective quality of these chunks is that within the continuous sequence, breakpoints may convey a higher amount of information than the remainder of the event. Nevertheless, this suggestion remains speculative as long as we do not find a way to objectively quantify the flow of information that the continuous stream of input provides. This objectification is hampered by the fact that time-continuous information is highly variable with regard to spatial and temporal characteristics differing between action exemplars. Moreover, object information is a confounding factor in natural actions. As exemplars of object classes, individual objects provide information about possible types of manipulation the observer has learned these objects to be associated with [6, 20, 21]. For instance, knives are mostly used for cutting. Hence, objects can efficiently restrict the number of actions that an action observer expect to occur [6]. Speculatively, humans may use a mixed strategy exploiting object as well as spatial information, and this strategy may be adapted to current constraints. For instance, spatial information and, specifically, spatial relations that are in the center of the current study may become more relevant when objects are difficult to recognize, e.g. when observing actions from a distance, in dim light or in case when actions are performed with objects or in environments not familiar to an observer, or when objects are used in an unconventional way.

In the present study, we sought to precisely analyse and objectify the way that humans exploit information about spatial relations during action prediction. Eliminating object and contextual (i.e., room, scene) information as confounding factors, we tested the hypothesis that spatial relations between objects can be exploited to successfully predict the outcome of actions before the action aim is fully accomplished.

As the basis for spatial relation calculation we use extended semantic event chains (eSEC), introduced in our previous work for action recognition in computer vision [22]. This approach allows us to determine a sequence of discrete spatial relations between different objects in the scene throughout the manipulation. These sequences were shown to allow action prediction in computer vision applications [23]. Three types of spatial relations are calculated: (1) object touching vs. non-touching in the scene, (2) static spatial relations, like above or around and (3) dynamic spatial relations like moving together or moving apart.

The approach was developed based on previous assumptions on the importance of spatial relations in action recognition [2428] and stands in contrast to action recognition and prediction methods based on time continuous information, like trajectories [2932] or continuous action videos [3335]. It also stands in contrast to the methods exploiting rich contextual information [3640]. Here we consider that time continuous information is much disturbed by intra-class variability of the same action, e.g. see [41], thus it is not the best source for action prediction, while contextual information in the current study we consider as distractors as explained above.

The current study consisted of the following steps:

  • Creating a virtual reality database containing ten different manipulation actions with multiple scenarios each.

  • Conducting a behavioural experiment in which human participants engaged in action prediction in virtual reality for all scenarios, where prediction time and prediction accuracy were measured.

  • Calculating three types of spatial relations using the eSEC model: (1) touching vs. non-touching relations, (2) static spatial relations and (3) dynamic spatial relations.

  • Performing an information theoretical analysis to determine how participants used these three types of spatial relations for action prediction.

  • Training an optimal (up to the learning accuracy) machine algorithm to predict an action using the relational information provided by the eSEC model.

  • Comparing human to the optimal machine action prediction strategies based on spatial relations.

The paper is organized as follows: In Section 2 we describe both, the setup of the experiments as well as the main aspects of data analysis. Here we keep the description of machine methods intuitive to make the paper accessible to psychology-oriented readers; in Section 3 we provide and explain results of the current study, in Section 4 we evaluate our findings and define implications for future work. In the Appendix 5 we provide the details of the machine algorithms.

2 General experimental protocols and methods

A flow chart of our study is depicted in Fig 1. In the following we will briefly describe each box in the flow-chart.

Fig 1. Experimental schedule.

Fig 1

2.1 Virtual reality videos

We designed a set of ten actions and created multiple virtual reality videos for each action. The ten actions were: chop, cut, hide, uncover, put on top, take down, lay, push, shake, and stir. All objects, including hand and tools, were represented by cubes of variable size and color to serve object-agnostic (except the hand) action recognition. The hand was always shown as a red cube (Fig 2). Scene arrangements and object trajectories varied in order to generate a wide diversity in the samples of each manipulation action type. For each of the ten action types, 30 sample scenarios were recorded by human demonstration. All action scenes included different arrangements of several cubes (including distractor cubes) to ensure that videos were indistinguishable at the beginning. The Virtual reality system as well as the ten actions mentioned above are specified in the Appendix, Subsections 5.1 to 5.3.

Fig 2.

Fig 2

The VR experiment process, (a): experiment training stage for put on top action, (b): experiment testing stage: action scene playing and (c): experiment testing stage: selecting the action type.

2.2 Behavioural study on action prediction

Forty-nine right-handed participants (20-68 yrs, mean 31.69 yrs, SD = 9.86, 14 female) took part in the experiment. One additional participant completed the experiments, but was excluded from further analyses due to an error rate of 14.7%, classified as outlier. Prior to the testing, written informed consent was obtained from all participants.

The experiment was not harmful and no sensitive data had been recorded and experimental data has been treated anonymously and only the instructions explained below had been given to the participants.

The experiment was performed in accordance with the ethical standards laid down by the 1964 Declaration of Helsinki. We followed the relevant guidelines of the Germany Psychological Society (Document: 28.09.2004 DPG: “Revision der auf die Forschung bezogenen ethischen Richtlinien”) and also obtained official approval for these experiments by the Ethics Committee responsible at the University of Göttingen.

Participants were given a detailed explanation regarding the stimuli and the task of the experiment. They were then familiarized with the VR system and shown how to deliver their responses during the experiment. The participants’ task was to indicate as quickly as possible which action was currently presented.

Every experiment started with a short training phase in which one example of each action was presented. During this demo version, the name of the currently presented action was highlighted in green on the background board (see Fig 2(a)). After the training phase, we asked the participant if everything was clear and if he/she confirmed, we would start the test stage of the experiment.

During the test stage, a total of 30 × 10 action videos (trials) were shown to the participants in randomized order where the red hand-cube entered the scene and performed an action (Fig 2(b)). When the action was recognized and the participant pressed the motion controller’s button, the moment of this button press was recorded as response time. Concurrently, all cubes disappeared from the scene so that no post-decision cogitation about the action was possible. At the same time, the controller was marked with a red pointer added to its front. Hovering over the action of choice and pressing motion controller’s button again recorded the actual choice and advanced the experiment to the next trial (Fig 2(c)). Participants were allowed to rest during the experiment, and continued the experiment after resting. Since participants mostly proceeded quickly to the next trial, the overall duration of the experimental session usually did not exceed one hour. All experimental data were analysed using different statistical methods described in Subsections 2.5 and 2.6.

2.3 Extraction of spatial relations (eSEC)

The extended semantic event chain framework (eSEC) used as the underlying model in this study makes use of object-object relations. We defined three types of spatial relations in our framework: 1)“Touching” and “Non-touching” relations (TNR), 2) “Static Spatial Relation” (SSR) and 3)“Dynamic Spatial Relation” (DSR).

TNR between two objects were defined according to collision or “no collision” between their representative cubes.

SSR describe the relative position of two objects in space. We used the following SSRs: “Above”, “Below”, “Around”, “Top”, “Bottom”, “AroundTouching”, “Inside”, “Surrounding” and “Null” (no relation in case two objects are too far away from each other). For algorithmic definition of those relations see Appendix, Subsection 5.5.

DSRs describe relative movements of two objects. We used the following DSRs: “Moving Together”, “Halting Together” (describing the case where both objects are not moving), “Fixed-Moving Together” (describing the case when one object is moving across the other), “Getting Closer”, “Moving Apart”, “Stable” (describing the case when the distance between objects does not change), “No Relation” (describing the case when the distance between two objects exceeds a pre-defined threshold). For algorithmic definition of those relations see Appendix, Subsection 5.5.

Importantly, eSEC do not make use of any real object information. Objects remain abstracted (like in the VR experiments). We defined five abstract object types that play an essential role in any manipulation action and call them the fundamental objects (see Table 1). Fundamental objects 1, 2, and 3 obtain their role in the course of an action: they are numbered according to the order by which they encounter transitions between the relations N (non-touching) and T (touching). For example, ‘fundamental object “1”’ obtains its role given by “number 1” by being the first that encounters a change in touching (usually this is the object first touched by the hand).

Table 1. Definition of the fundamental objects during a manipulation action [23].

Object Definition Remarks
Hand The object that performs an action. Not touching anything at the beginning and at the end of the action. It touches at least one object during an action.
Ground The object that supports all other objects except the hand in the scene. It is extracted as a ground plane in a visual scene.
1 The object that is the first to obtain a change in its T/N relations. Trivially, the first transition will always be a touch by the hand.
2 The object that is the second to obtain a change in its T/N relations. Either T→N or N→T relational change can happen.
3 The object that is the third to obtain a change in its T/N relations. Either T→N or N→T relational change can happen.

Note that not all fundamental objects defined in Table 1 are always existing in a specific action. Only hand, ground and fundamental object 1 are necessarily present in all analysed actions. The action-driven “birth” of objects 1, 2, and 3 automatically leads to the fact that irrelevant (distractor) objects are always ignored by the eSEC analysis.

Thus, the maximal number of relations that had to be analysed for an action was set by defined relations between fundamental objects: Given five object roles, there were C(5, 2) = 10 possible combinations leading to ten relations for each type (NTR, SSR, DSR), resulting in 30 relations in total.

The Enriched Semantic Event Chain (eSEC) is a matrix-form representation of the change of the three types of spatial relations described above throughout the action for the pairs of fundamental objects defined in Table 1. Fig 3 shows the eSEC matrix for a put on top action and demonstrates how relations change throughout this action.

Fig 3. Description of a “put on top” action in the eSEC framework with relation graph between all objects.

Fig 3

Only hand and ground are pre-specified, object 1 is the one first touched by the hand, object 2 the next where a touching/un-touching (T/N) change happens and object 3 in this case remains undefined (U) in all rows as there are no more T/N changes. This leads to the graph on the top left that shows all relations. Abbreviations in the eSEC are: U: undefined, T: touching, N: non-touching, O: very far (static), Q: very far (dynamic), Ab: above, To: top, Ar: around, ArT: around with touch, S: stable, HT: halt together, MT: move together, MA: moving apart, GC: getting close. Note that the two leftmost columns are identical for all actions as they indicate the starting situation before any action. The top, middle and bottom ten rows of the matrix indicate TNR, SSR and DSR between each pair of fundamental objects in a “put on top” action, respectively.

2.4 Machine prediction

Machine prediction of a manipulation action was based on a learning procedure. For learning, we divided our data (eSEC tables) into train and test samples and performed a column-by-column comparison. That is, similarity values between the eSECs were derived by comparing each test action’s eSEC (up to prediction column) to the every member of the training sample. We defined an action as “predicted” when the average similarity for one class remained high, while similarity for all other classes was low in this column. The similarity measurement algorithm between two eSEC matrices is explained in the Appendix, Subsection 5.6. Note, that the machine prediction algorithm, defined above, makes optimal action predictions based on eSEC information, to the precision of the applied learning procedure.

2.5 Comparison of human and machine predictive performance

We assessed predictive performance (of human or machine) relative to the length of the action measured in eSEC columns. The eSEC column at which prediction happens is called “prediction column”. Predictive power is defined as:

P=(1-column(α)Total(α))*100% (1)

where column(α) is the “prediction column” and Total(α) is the total number of columns in the action α eSEC table. The earlier the action was predicted, the higher is the values of the measure P.

To compare human and machine predictive power, first, a repeated measures ANOVA on predictive power of humans were calculated with action (1—10) as within-subject factor. Then, human and machine performance was compared for each action separately using one-sample t-tests. As the machine data do not show variance, their predictive power value was used to compare it to human performance.

In addition, to inspect for the presence of learning effects in the human sample, correlations (Spearman Rho) were calculated for the number of trial (1—30) per action and predictive power as well as error rate.

Data were analysed using RStudio (Version 1.2.5001, RStudio Inc.) and SPSS 26 (IBM, New York, United States).

2.6 Information theoretical analysis

To model human action prediction based on eSEC matrices, we calculated the informational gain based on each eSEC column entry. More specifically, based on the eSEC descriptions of all ten actions, we derived a measurement of the amount of information presented in each column (or action step) of each action in comparison to all other actions. Each eSEC column, for a given sub-table (Touching = T, Static = S, Dynamic = D), contains ten coded descriptions of the spatial relations between hand, objects and ground. By stringing the eSEC codes of one column together, each column gets a new single code formally describing the action stage of a sub-table the participant observes at that moment. By taking the frequency of each action step or column-code across all 10 actions, we calculated the likelihood of a specific code in reference to the other actions in this column. So, if all eSEC descriptions are the same for one column, this column-code is assigned a likelihood of “1”. If only one action differs (from the remaining nine actions), it gets a likelihood of 0.1 and the column-code of the differing action receives a likelihood value of 0.9, and so forth. Because not every action has the same number of columns, the lack of eSEC descriptions is also treated as a possible event. That means, if for example seven out of ten actions already have stopped at one point in time, these seven actions would receive a likelihood of 0.7 for this specific column.

We conducted this likelihood assignment procedure for each of the three types of information (TNR, SSR, DSR) separately. Note that the likelihood also gives an estimate of the information about one action that is presented in a column. If the likelihood of an action code is low, only a few or just this single action has this particular action code. So, if this code appears, it powerfully constrains action prediction.

Based on the likelihood p of an action step x, we then calculated bit rates to quantify (self-)information I according to Shannon [42]:

I(x)=-log2(px) (2)

This transformation into information has two advantages over calculating with likelihoods. Firstly, it is more intuitive because more information is also displayed as a higher value, and secondly, we now were able to derive cumulated information by adding up the information values associated with successive columns. The transformation and cumulation were also done for all three information types separately. Thus, we obtained information values for each action step for each type of information separately. The additivity of the data also made it possible to combine multiple types of information by simply summing up the columns of the sub-tables.

Based on these information values, we modelled human performance. We employed the following models: one based only on TNR, one based only on SSR, one based only on DSR, three models adding two of the three types of information (T+S; T+D; S+D), one model adding all three types of information (T+S+D) and finally one model that ignores the three differing types of information and calculates the self-information based on all eSEC entries independent of the information type (Overall). For each model and for each action separately, a logistic regression was calculated using SPSS26. Each logistic regression included the absolute amount of information per action step according to the respective model, the accumulated information up to each action step, and the interaction term of these absolute and accumulated predictors. The logistic regressions’ dependent binary variable was the presence of a response during the respective action step, indicating whether the action was predicted during this action step or not. Since predictors were correlated, models were estimated using the stepwise forward method for variable entry. Note that we did not interpret the coefficients and therefore did not need to regularize the regression model due to coefficient’s correlation. Model fits were compared using the BIC (Bayesian-Information-Criterion) [43].

3 Results

In the human reaction time experiments, response times that exceeded the length of the action video were treated as time-outs and corresponding trials (13 out of 14700) were excluded from further analyses.

Participants’ mean prediction accuracy was very high with a mean of 97.6% (SD = 1.8%, n = 49), ranging from 93.0% to 100.0%. Participants’ mean predictive power ranged from 29.34 to 44.56 (M = 37.03, SD = 3.44, n = 49). Regarding learning effects, hence, possible trends in performance change along an experiment, correlation analyses showed a significant reduction effect in error rates (rs = −.72, p <.001, n = 30) and a significant enhancement effect for human predictive power (rs = .96, p <.001, n = 30). Over trials, the mean error rate ranged from 0.004 to 0.063 (M = 0.024, SD = 0.015, n = 30) and the mean predictive power ranged from 31.51 to 39.56 (M = 37.00, SD = 1.87, n = 30).

Human predictive power was further analysed using a repeated measures ANOVA with action as within-subjects factor and, due to the significant learning effect, trial as second within-subject factor. Therefore, we pooled each six trials and used trial as a factor with five levels. Mauchly’s test indicated that the assumption of sphericity was violated for action (χ2(44) = 302.02, p <.001), trial (χ2(9) = 109.20, p <.001) and for the interaction of action and trial (χ2(665) = 1226.79, p <.001), therefore degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (action: ϵ = 0.37, trial: ϵ = 0.43, action × trial: ϵ = 0.40). The main effect of action was significant (F(3.37,161.78)=427.96,p<.001,ηp2=.899) just as the main effect of trial (F(1.71,82.16)=43.07,p<.001,ηp2=.473) and the interaction effect (F(14.49,695.46)=2.95,p<.001,ηp2=.058). For further analysis each action was considered individually. As shown in Fig 4, predictive power varied strongly between actions. For instance, put and take actions were not correctly classified before most columns of the video (88% and 72%, respectively) were already presented, whereas cut, stir and uncover did only need about half (48%, 51%, and 52%, respectively) of the video time.

Fig 4. Mean predictive power of human and machine.

Fig 4

t-values and p-values according to the t-tests per action.

Separate one-sample t-tests per action for human vs. optimal (machine) predictive power consistently showed lower predictive power for the human (ts < −2, ps <.05). See details in Fig 4. Predictive power ranged from 14.3% to 62.5% for the machine, whereas humans predictive power ranged from 6.2% to 58.3%. On average, the machine spared observation of the remaining 45.6% of the video columns, humans the remaining 37%. In half of the actions (take, uncover, hide, push and put on top), this difference reached a very large effect size (ds > 1). Interestingly, most pronounced differences not in terms of effect size but in terms of overall sampling time emerged for actions that were most quickly classified by the algorithm (take, uncover, cut). For take actions, humans sampled twice as many columns (72%) as the optimal performing algorithm (38%).

Logistic regressions revealed significant results for the eight models for each action respectively. All models tested significantly against their null model (ps <.001). Fig 5 shows McFadden R2 and BIC per model per action. Shaded cells indicate which model fits best human action prediction behaviour based on the BIC. Deploying the AIC (Akaike-Information-Criterion) yielded similar results.

Fig 5. Fitting different models to the actions.

Fig 5

(Abbreviations are shortened to allow to encode combinations by a short “+” annotation. We have Touch = T = TNR, Static = S = SSR, Dynamic = D = DSR. This leads to different combinations: T+S, T+D, S+D, T+S+D, where “Overall” refers to treating all eSEC columns independently of their individual information contents (see Methods).

As to the type of information exploited for prediction, we found marked differences between human and machine strategies. The machine behaviour was perfectly predicted by the biggest local gain in information, i.e., by transition into the column where the action code became unique for the respective action (Fig 6). For instance, when dynamic information was the first to provide perfect disambiguation between competing action models, the algorithm always followed this cue immediately (this was the case for cut and hide). Likewise, static information ruled machine behaviour for push and lay, reflecting the earliest possible point of certain prediction in these actions. Human suboptimal behaviour was nicely reflected by the fact (see Fig 5) that for cut and hide, subjects considered a combination of both dynamic and static spatial information (where they should have focused on dynamic information); the same strategy was applied to push and lay, where subjects should have better followed static information only.

Fig 6. Comparison of human (red bars) and machine (green bars) predictive performance.

Fig 6

Blue bars indicate the relative amount (percentage) of action steps elapsed per action, before the TNR (light blue), DSR (blue) or SSR (dark blue) model provided maximal local informational gain, enabling a secure prediction of the respective action. For instance, the 5th eSEC-column of the overall 13 eSEC-columns describing the cut action provided a unique description in terms of DSR. That is, after around 38% of these action’s columns, the cut action could be predicted on the basis of DSR information, and this is what the algorithm did, as indicated by the green bar of equal length. In contrast, humans correctly predicted the cut action at the 6th (mean 6.26) column, corresponding to 48% of this action, exploiting both dynamic and static spatial information (cf. Fig 5 for this outcome).

Notably, when all three types of information (i.e., touching, static or dynamic information) were equally beneficial (this was the case fotake, uncover, shake, and put), human performance was best modelled by a combination of all three types of information (i.e., either T+S+D or Overall), with the exception of chop, where subjects followed static spatial information. A post-hoc paired-sample t-test showed a significant effect of informational difference (t(48) = 15.95, p <.001, dz = 2.3). The z-transformed difference between mean human and machine predictive power was explicitly larger for informationally indifferent actions (M = 2.1) than for informationally different actions (M = 1.1). Expressed in non-transformed values, humans showed 12% less predictive power than the algorithm for informational indifferent action categories, but only 5% for the informational different ones.

4 Discussion

Humans predict actions based on different sources of information, but we know only little about how flexible these sources can be exploited in case that others are noisy or unavailable. Also, to better understand the respective contribution of these different sources of information, one has to avoid confounds and to properly control or eliminate alternative sources when focusing on one of them. In the present study, we tested how optimal human action prediction is when only static and dynamic spatial information is available. To this end we used action videos which were highly abstracted dynamic displays containing cubic place holders for all objects including hands, so that any information about real-world objects, environment, context, situation or actor were completely eliminated. We modelled human action prediction by an algorithm called enriched semantic event chain (eSEC), which had been derived from older “grammatical” approaches towards action encoding [2426, 28, 44]. This algorithm is solely based on spatial information in terms of touching and untouching events between objects, their static and dynamic spatial relations.

Results show that participants performed strikingly well in predicting object-abstracted actions, i.e., by assigning the ongoing video to one out of ten basic action categories, before the video was completed. On average, they spared observation of the remaining 37% of each video. This finding suggests that humans engage in action prediction even on the basis of only static and dynamic spatial information, if other sources of information are missing.

Future studies have to examine how much real object or contextual information would further improve this performance level. Especially, object information provides an efficient restriction on to-be-expected manipulations [611]. It remains to be tested how non-spatial object information potentially interacts with the exploitation of static and dynamic spatial relations between objects involved in actions. Moreover, actions occur in certain contexts and environments that further restrict the observer’s expectation, for instance with regard of certain classes of actions [2, 13, 4547].

Albeit humans performed very well in action prediction, the machine algorithm, which was able to use the eSEC information optimally (up to the learning precision of that algorithm), consistently outperformed our participants, and this difference was significant for each single action category. On average, humans achieved about 91% of the predictive power of the machine. Based on an information theoretic approach, further analyses revealed that humans—in this particular setting—did not select the optimal strategy to disambiguate actions as fast as possible: While the machine reliably detected the earliest occurrence of disambiguation between the ongoing action and all other action categories, as indicated by the highest gain in information at the respective action step, human subjects did so in only half of the action categories. Instead, humans unswervingly applied a mixing strategy, concurrently relying on both dynamic and spatial information in 8 out of 10 action types. This strategy was particularly disadvantageous for actions that were equally well predictable based on either static (NTR, SSR) or dynamic (DSR) information. Particularly in these—one may say—informationally indifferent cases, humans were significantly biased towards prolonged decisions: here, they showed 12% less predictive power than the algorithm as compared to 5% for the informationally different actions.

Summarizing these effects, the human bias towards using mixing strategies, combining static and dynamic spatial information, and to prolonged decisions for informational indifferent action categories establish overall poorer human predictive power. In principle, these two effects may result from the same general heuristics of human action observers, to exploiting multiple sources of information rather than relying on the first available source only. As a consequence, individuals prioritize correct over fast classification of observed actions.

Let us also note that the eSEC, which here was used to model human action recognition, is an advanced approach in the field of machine vision. It is finer grained and more expressive than its predecessor Semantic Event Chain [24, 44], but does not use fine (and inter-personally variable) spatial details, as compared to the Hidden Markov Model (HMM) algorithm, which, though being a classical approach, still represents the current state of the art in spatial-information (e.g. trajectory) based action recognition [48, 49]. In a previous study [23], we compared the predictive power of the eSEC framework with an HMM [32, 50]. The study was done on two real data sets, and we found that the average predictive power of eSECs was 61.9% as compared to only 32.4% for the HMM-based approach. This is because the three types of spatial relations, comprising the eSEC columns, capture important spatial and temporal properties of an action.

Describing action with a grammatical structure [2527] such as eSEC [22, 23], renders a simple and fast framework for recognition and prediction in the presence of unknown objects and noise. This robustness lends itself to an intriguing hypothesis, which is asking to what degree such an event-based framework might help young infants to bootstrap action knowledge in view of the vast number of objects that they have never encountered before. In terms of spatial relations (as implemented in the current eSEC framework), the complexity of an action is far smaller than the complexity of the realm of objects with which an action can be performed, even when only considering a typical baby’s environment. Clearly, this approach has proven to be beneficial for robotic applications [51] and we plan to extend it to complex actions and interactions between several agents (humans and robots) to examine the exploitation and exploration of predictive information during cooperation and competition.

Limitations

Our approach did not take into account all dynamic and static spatial information provided by human action. For instance, we restricted dynamic spatial information to between-object change, whereas in natural action, we would also register dynamic within-hand change. Thus, actors shape their hands to fit the to-be-grasped object already when starting to reach out for it [52, 53], providing a valuable pointer to potentially upcoming manipulations and goals [54]. Likewise, gaze information plays a role in natural action observation [55], as the actors’ looking to an object draws the observer’s attention to the same object [56], and hence, potentially upcoming targets of the action.

Furthermore, our study was restricted to ten possible actions, whereas in everyday life, the number of potentially observable actions is much higher, resulting in higher uncertainty and higher competition among these potential actions. Speculatively, the human bias to employing mixed exploitation strategies may be better adapted to disambiguate actions among this broader range of action classes. Future studies have to enlarge the sample of concurrently investigated actions to test this assumption and to increase overall ecological validity.

5 Appendix: Detailed methods

5.1 Virtual reality system

The main components of our VR system include computing power (for 3D data processing), head mounted display (for showing the VR content) and motion controllers (as the input devices). A Vive VR headset and motion controller released by HTC in April 2016 with a resolution of 1080 x 1200 per eye, have been used as our VR system. The “roomscale” system, which provides a precise 3D motion tracking between two infrared base stations, is the main advantage of this headset, which creates the opportunity to record and review actions for experiments on a larger scale of up to 5 meters diagonally. The Unreal Engine 4 (UE4) is a high performance game engine developed by Epic Games and is chosen as the game engine basis of this project. It has built-in support for VR environments and the Vives motion controllers.

5.2 Scenario recording

In order to make VR-videos for the 10 different actions, 30 variants of each action were recorded by two members of BCCN team (a 23 year old undergraduate male and a 30 year old doctoral student female). They implemented a VR platform by using C++ code structure. The motion controller is the core input component of the VR environment and they provided a separate function for each button on that by C++ programming. The designed system included three different modes. First, a mode to record new actions for the experiment; second, a mode to review in, and last, the experiment itself. To keep the controls as simple as possible and to avoid a second motion controller without implementing a complex physics system, the recording mode was split into two sub-modes: A single-cube recording mode (for single, mostly static cubes) and a two-cubes recording mode (for object manipulation).

5.3 Stimuli

Actions were defined as follows:

Chop: The hand-object (hereafter: hand) touches an object (tool), picks up the object from the ground, puts it on another object (target) and starts chopping. When the target object has been divided into two parts, the tool object untouches the pieces of the target object. After that, the hand puts the tool object on the ground, untouches it, and leaves the scene.

Chop scenarios had a mean length of 17.86 s (SD = 3.56, range = 13-27).

Cut: The hand touches an object (tool), picks up the object from the ground, puts it on another object (target) and starts cutting. When the target object was divided into two parts, the tool object untouches the pieces of the target object. After that, the hand puts the tool object on the ground, untouches it, and leaves the scene.

Cut scenarios had a mean length of 19.50 s (SD = 3.13, range = 13-25).

Hide: The hand touches an object (tool), picks up the object from the ground, puts it on another object (target) and starts coming down on the target object until it covers that object thoroughly. Then the hand untouches the tool object and leaves the scene.

Hide scenarios had a mean length of 13.43 s (SD = 2.40, range = 9-20).

Uncover: The hand touches an object (tool), picks up the object from the ground. The second object (target) emerges as the tool object is raised from the ground, because the tool object had hidden the target object. After that, the hand puts the tool object on the ground, untouches it, and leaves the scene.

Uncover scenarios had a mean length of 12.66 s (SD = 3.20, range = 9-21).

Put on top: The hand touches an object, picks up the object from the ground and puts it on another object. After that, the hand untouches the first object and leaves the scene.

Put on top scenarios had a mean length of 10.90 s (SD = 2.006, range = 8-16).

Take down: The hand touches an object that is on another object, picks up the first object from the second object and puts it on the ground. After that, the hand untouches the first object and leaves the scene.

Take down scenarios had a mean length of 10.60 s (SD = 3.04, range = 6-18).

Lay: The hand touches an object on the ground and changes its direction (lays it down) while it remains touching the ground. After that, the hand untouches the object and leaves the scene.

Lay scenarios had a mean length of 11.23 s (SD = 1.79, range = 8-15).

Push: The hand touches an object on the ground and starts pushing it on the ground. After that, the hand untouches the object and leaves the scene.

Push scenarios had a mean length of 12.56 s (SD = 1.73, range = 9-17).

Shake: The hand touches an object, picks up the object from the ground and starts shaking it. Then, the hand puts it back on the ground, untouches it and leaves the scene.

Shake scenarios had a mean length of 12.10 s (SD = 2.05, range = 9-17).

Stir: The hand touches an object (tool), picks up the object from the ground, puts it on another object (target) and starts stirring. After that, the hand puts the tool object on the ground, untouches it, and leaves the scene.

Stir scenarios had a mean length of 20.23 s (SD = 4.67, range = 14-31).

5.4 Details of machine action prediction

Note that all methodological details concerning our spatial relations definition (section 5.4.1) and their computation (section 5.5) as well as details of the similarity measurement algorithm (section 5.6) were reported previously in [23] and [28]. Hence, the next three subsections are essentially a repetition from those two papers without many changes.

5.4.1 Spatial relations

The details on how to calculate static and dynamic spatial relations are provided below. Here we start first with a general description.

  1. Touching and non-touching relations (TNR) between two objects were defined according to collision or non-collision between their representative cubes.

  2. Static spatial relations (SSR) included: ‘Above” (Ab), “Below” (Be), “Right” (R), “Left” (L), “Front” (F), “Back” (Ba), “Inside” (In), “Surround” (Sa). Since “Right”, “Left”, “Front” and “Back” depend on the viewpoint and directions of the camera axes, we combined them into “Around” (Ar) and used it at times when one object was surrounded by another. Moreover, “Above” (Ab), “Below” (Be) and “Around” (Ar) relations in combination with “Touching” were converted to “Top” (To), “Bottom” (Bo) and “Touching Around” (ArT), respectively, which corresponded to the same cases with physical contact. Fig 7 (a1-a3) shows static spatial relations between two objects cubes. If two objects were far from each other or did not have any of the above-mentioned relations, their static relation was considered as Null (O). This led to a set of nine static relations in the eSECs: SSR = {Ab, Be, Ar, Top, Bottom, ArT, In, Sa, O}. The additional relations, mentioned above: R, L, F, Ba are only used to define the relation Ar = around, because the former four relations are not view-point invariant.

  3. Dynamic Spatial Relations (DSR) require to make use of the frame history in the video. We used a history of 0.5 seconds, which is an estimate for the time that a human hand takes to change the relations between objects in manipulation actions. DSRs included the following relations: “Moving Together” (MT), “Halting Together” (HT), “Fixed-Moving Together” (FMT), “Getting Close” (GC), “Moving Apart” (MA) and “Stable” (S). DSRs between two objects cubes are shown in Fig 7 (b1-b6). MT, HT and FMT denote situations when two objects are touching each other while: both of them are moving in a same direction (MT), are motionless (HT), or when one object is fixed and does not move while the other one is moving on or across it (FMT). Case S denotes that any distance-change between objects remained below a defined threshold of ξ = 1 cm during the entire action. All these dynamic relations cases are clarified in Fig 7(b). In addition, Q is used as a dynamic relation between two objects when their distance exceeded the defined threshold ξ or if they did not have any of the above-defined dynamic relations. Therefore, dynamic relations make a set of seven members: DSR = {MT, HT, FMT, GC, MA, S, Q}.

Fig 7.

Fig 7

(a) Static Spatial Relations: (a1) Above/Below, (a2) Around, (a3) Inside/Surround. (b) Dynamic Spatial Relations: (b1) Moving Together, (b2) Halting Together, (b3) Fixed-Moving Together, (b4) Getting Close, (b5) Moving Apart, (b6) Stable.

Finally, whenever one object became “Absent” or hidden during an action, the symbol (A) was used for annotating this condition. In addition, we use the symbol (X) whenever one object was destroyed or lost its primary shape (e.g. in cut or chop actions).

5.4.2 Object types

An exhaustive description of the five fundamental object types had been given in the main text and shall not be repeated here.

5.5 Mathematical definition of the spatial relations

As mentioned above, touching and non-touching relations between two objects are defined according to collision or non-collision between their representative cubes. 3D collision detection is a challenging topic which has been addressed in [57]. But, because the objects in our study are just cubes, we interpreted the contact of one of the six surfaces of one cube with one of the other cube’s surfaces (see Fig 8) as touching event and this can be detected easily.

Fig 8. Possible situations that two cubes touch each other.

Fig 8

For example, in the left second situation of Fig 8, which has been shown with more details in Fig 9, the following condition will lead to a touching relation from a side.

[x1β=x1α][(y1α<y2β<y2α)(y1α<y1β<y2α)][(z1α<z2β<z2α)(z1α<z1β<z2α)] (3)

Fig 9. Coordinate details of the two cubes that touch each other from side.

Fig 9

Moreover, all discussed static and dynamic relations are defined by a set of rules. We start with explaining the rule set for static spatial relations and then proceed to dynamic spatial relations. In general, xmin, xmax, ymin, ymax, zmin and zmax indicate the minimum and maximum values between the points of object cube αi in x, y and z axes, respectively.

Let us define the relation “Left”, SSR(αi, αj) = L (object αi is to the left of object αj) if:

xmax(αi)<xmax(αj) (4)

and the following exception condition holds

[¬(ymin(αi)>ymax(αj))][¬(ymin(αj)>ymax(αi))][¬(zmin(αi)>zmax(αj))][¬(zmin(αj)>zmax(αi))] (5)

The exception condition excludes from the relation “Left” those cases when two object cubes do not overlap in altitude (y direction) or front/back (z direction). Several examples of objects holding relation SSR(red, blue) = L, when the size and shift in y direction varies, are shown in Fig 10.

Fig 10. Possible states of Left relation between two objects cubes when size and y positions vary.

Fig 10

SSR(αi, αj) = R is defined by xmax(αi)>xmin(αj) and the identical set of exception conditions. The relations Ab, Be, F, Ba are defined in an analogous way. For Ab and Be the emphasis is on the “y” dimension, while for the F, Ba the emphasis is on the “z” dimension.

For the relation “inside” SSR(αi, αj) = In we use:

[xmin(αj)xmin(αi)][xmax(αi)xmax(αj)][zmin(αj)zmin(αi)][zmax(αi)zmax(αj)][ymin(αj)ymax(αi)ymax(αj)] (6)

The opposite holds for relation Sa (surrounding). For example, if SSR(αi, αj) = InSSR(αj, αi) = Sa.

In addition of computing spatial relations TNR between two objects based on the above rules, we also check the touching relation between those two objects. This is then used to define several other relations. For example, if one object is above the other object, while they are touching each other, their static relation will be To (top).

[SSR(αi,αj)=Ab][TNR(αi,αj)=T][SSR(αi,αj)=To] (7)

There can be more than one static spatial relations between two object cubes. For example, one object can be both to the left and in back of the other object. However, to fill the eSEC matrix elements we need only one relation per object pair. This problem is solved by definition of a new notion called shadow.

Each cube has six surfaces. We label them as top, bottom, right, left, front and back based on their positions in our scene coordinate system. Whenever object αi is to the left of object αj, one can make a projection from the right surface of object αi onto the left rectangle of object αj and consider only the rectangle intersection area, This area is represented by the newly defined parameter shadow. Suppose SSR(αi, αj) = {R1, …, Rk} while R1, …, RmSSR and we have calculated the shadow(αi, αj, R) for all relations R between the objects αi and αj. The relation with the biggest shadow is then selected as the main static relation between the two objects: (Fig 11 includes the above description in the image format.)

SSR(αi,αj)=Rn(1nk),if:nonumber (8)
shadow(αi,αj,Rn)=max1mk(Shadow(αi,αj,Rm)) (9)

Fig 11. Selection of one static spatial relation from several possible relations.

Fig 11

Dynamic spatial relations (DSR) are defined as follows. Suppose Oif shows the central point of the object cube αif (object αi in fth frame); we define δ(αif,αjf)=||Oif-Ojf|| to be a two argument function for measuring the Euclidean distance between the cubes αi and αj in fth frame.

DSR(αif,αjf)={GC,ifδ(αif,αjf)-δ(αif+θ,αjf+θ)>ξMA,ifδ(αif+θ,αjf+θ)-δ(αif,αjf)>ξ (10)

For this we use a time window of θ = 10 frames (image snapshots in VR) in our experiments (= 0.5s); the threshold ξ is kept at 0.1 m:

In the following we defined five conditions P1 to P5, which then will be used to characterize the remaining DSRs.

P1:[TNR(αif,αjf)=T][TNR(αif+θ,αjf+θ)=T]P2:[TNR(αif,αjf)=N][TNR(αif+θ,αjf+θ)=N]P3:OifOif+θP4:OjfOjf+θP5:δ(αif+θ,αjf+θ)-δ(αif,αjf)<ξ (11)

The dynamic relations MT, HT, FMT and S, based on the five conditions above are now defined in the following way:

DSR(αif,αjf)={MT,ifP1P3P4HT,ifP1¬P3¬P4FMT,ifP1(P3P4)S,ifP2P5 (12)

5.6 Similarity measure between eSECs

Suppose θ1 and θ2 are the names of two actions with eSECs that have n and m columns, respectively. We can concatenate the corresponding TNR, SSR and DSR of each fundamental object pair into a triple and make a 10-row matrix for θ1 and θ2 with ternary elements (TNR, SSR, DSR) instead of writing down a 30-row eSEC each:

θ1=((a1,1,a11,1,a21,1)(a1,2,a11,2,a21,2)(a1,n,a11,n,a21,n)(a2,1,a12,1,a22,1)(a2,2,a12,2,a22,2)(a2,n,a12,n,a22,n)(a10,1,a20,1,a30,1)(a10,2,a20,2,a30,2)(a10,n,a20,n,a30,n))
θ2=((b1,1,b11,1,b21,1)(b1,2,b11,2,b21,2)(b1,n,b11,n,b21,n)(b2,1,b12,1,b22,1)(b2,2,b12,2,b22,2)(b2,n,b12,n,b22,n)(b10,1,b20,1,b30,1)(b10,2,b20,2,b30,2)(b10,n,b20,n,b30,n))

We define the differences in the three different relation categories L1:3, by using the elements of both matrices:

Li,j1={0,ifai,j=bi,j1,otherwise
Li,j2={0,ifai+10,j=bi+10,j1,otherwise
Li,j3={0,ifai+20,j=bi+20,j1,otherwise

where 1 ≤ i ≤ 10, 1 ≤ jk, k = max(n, m).

Then the compound difference for the three categories is defined in the following way:

di,j=Li,j1+Li,j2+Li,j33. (13)

If one matrix had more columns than the other matrix. i.e., m < n or vice versa, the last column of the smaller matrix is repeated to match the number of columns of the bigger matrix. This leads to a consistent drop in similarity regardless of which two action are being compared.

Now we define D as the matrix, which contains all compound differences between the elements of the two eSECs.

D(10,k)=(d1,1d1,2d1,kd2,1d2,2d2,kd10,1d10,2d10,k)

where di,j denotes the dissimilarity of ith objects pair at the jth time stamp (column). Then, D, which is the total dissimilarity between eSECs of θ1 and θ2 is considered as the average across all elements of matrix D.

Dθ1,θ2=1k*10(j=1ki=110di,j) (14)

Accordingly, the similarity between these eSECs Simθ1,θ2, is measured as:

Simθ1,θ2=(1-Dθ1,θ2)*100% (15)

Supporting information

S1 Video

(MP4)

S1 Dataset

(RAR)

Data Availability

The dataset which includes human participants results in the VR experiment has been uploaded as Supporting Information files. The "Human participants dataset" folder includes a file named: "readme.txt" which has a brief explanation about this dataset. All these human analysis can be repeated with this data.

Funding Statement

The research leading to these results has received funding from the German Research Foundation (DFG) grant WO388/13-1 and SCHU1439/8-1 as well as the European Community’s H2020 Programme (Future and Emerging Technologies, FET) under grant agreement no. 732266, Plan4Act.

References

  • 1. Isik L, Tacchetti A, Poggio T. A fast, invariant representation for human action in the visual system. Journal of Neurophysiology. 2017;119(2):631–640. 10.1152/jn.00642.2017 [DOI] [PubMed] [Google Scholar]
  • 2. Wurm MF, Schubotz RI. Squeezing lemons in the bathroom: contextual information modulates action recognition. Neuroimage. 2012;59(2):1551–1559. 10.1016/j.neuroimage.2011.08.038 [DOI] [PubMed] [Google Scholar]
  • 3. Caspers S, Zilles K, Laird AR, Eickhoff SB. ALE meta-analysis of action observation and imitation in the human brain. Neuroimage. 2010;50(3):1148–1167. 10.1016/j.neuroimage.2009.12.112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Hardwick RM, Caspers S, Eickhoff SB, Swinnen SP. Neural correlates of action: Comparing meta-analyses of imagery, observation, and execution. Neuroscience & Biobehavioral Reviews. 2018. 10.1016/j.neubiorev.2018.08.003 [DOI] [PubMed] [Google Scholar]
  • 5. Giese MA, Rizzolatti G. Neural and computational mechanisms of action processing: Interaction between visual and motor representations. Neuron. 2015;88(1):167–180. 10.1016/j.neuron.2015.09.040 [DOI] [PubMed] [Google Scholar]
  • 6. Schubotz RI, Wurm MF, Wittmann MK, von Cramon DY. Objects tell us what action we can expect: dissociating brain areas for retrieval and exploitation of action knowledge during action observation in fMRI. Frontiers in Psychology. 2014;5:636 10.3389/fpsyg.2014.00636 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Ruddle RA, Savage JC, Jones DM. Symmetric and asymmetric action integration during cooperative object manipulation in virtual environments. ACM Transactions on Computer-Human Interaction (TOCHI). 2002;9(4):285–308. 10.1145/586081.586084 [DOI] [Google Scholar]
  • 8.Gupta A, Davis LS. Objects in action: An approach for combining action understanding and object perception. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2007. p. 1–8.
  • 9. Hrkać M, Wurm MF, Kühn AB, Schubotz RI. Objects Mediate Goal Integration in Ventrolateral Prefrontal Cortex during Action Observation. PLOS One. 2015;10(7):e0134316 10.1371/journal.pone.0134316 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. El-Sourani N, Wurm MF, Trempler I, Fink GR, Schubotz RI. Making sense of objects lying around: How contextual objects shape brain activity during action observation. NeuroImage. 2018;167:429–437. 10.1016/j.neuroimage.2017.11.047 [DOI] [PubMed] [Google Scholar]
  • 11. El-Sourani N, Trempler I, Wurm MF, Fink GR, Schubotz RI. Predictive Impact of Contextual Objects during Action Observation: Evidence from fMRI. Journal of Cognitive Neuroscience. 2019;32(2):326–337. 10.1162/jocn_a_01480 [DOI] [PubMed] [Google Scholar]
  • 12. Stadler W, Springer A, Parkinson J, Prinz W. Movement kinematics affect action prediction: comparing human to non-human point-light actions. Psychological research. 2012;76(4):395–406. 10.1007/s00426-012-0431-2 [DOI] [PubMed] [Google Scholar]
  • 13. Wurm MF, Cramon DY, Schubotz RI. The Context-Object-Manipulation triad: Cross talk during action perception revealed by fMRI. Journal of Cognitive Neuroscience. 2012;24(7):1548–1559. 10.1162/jocn_a_00232 [DOI] [PubMed] [Google Scholar]
  • 14. Wurm MF, Hrkać M, Morikawa Y, Schubotz RI. Predicting goals in action episodes attenuates BOLD response in inferior frontal and occipitotemporal cortex. Behavioural brain research. 2014;274:108–117. 10.1016/j.bbr.2014.07.053 [DOI] [PubMed] [Google Scholar]
  • 15. Schubotz RI, von Cramon DY. The case of pretense: Observing actions and inferring goals. Journal of Cognitive Neuroscience. 2009;21(4):642–653. 10.1162/jocn.2009.21049 [DOI] [PubMed] [Google Scholar]
  • 16. Schubotz RI, Korb FM, Schiffer AM, Stadler W, von Cramon DY. The fraction of an action is more than a movement: neural signatures of event segmentation in fMRI. NeuroImage. 2012;61(4):1195–1205. 10.1016/j.neuroimage.2012.04.008 [DOI] [PubMed] [Google Scholar]
  • 17. Kurby CA, Zacks JM. Segmentation in the perception and memory of events. Trends in Cognitive Sciences. 2008;12(2):72–79. 10.1016/j.tics.2007.11.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Newtson D. Attribution and the unit of perception of ongoing behavior. Journal of Personality and Social Psychology. 1973;28(1):28 10.1037/h0035584 [DOI] [Google Scholar]
  • 19. Newtson D, Engquist G. The perceptual organization of ongoing behavior. Journal of Experimental Social Psychology. 1976;12(5):436–450. 10.1016/0022-1031(76)90076-7 [DOI] [Google Scholar]
  • 20. Bach P, Nicholson T, Hudson M. The affordance-matching hypothesis: how objects guide action understanding and prediction. Frontiers in Human Neuroscience. 2014;8:254 10.3389/fnhum.2014.00254 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Nicholson T, Roser M, Bach P. Understanding the goals of everyday instrumental actions is primarily linked to object, not motor-kinematic, information: evidence from fMRI. PLOS One. 2017;12(1):e0169700 10.1371/journal.pone.0169700 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Ziaeetabar F, Aksoy EE, Wörgötter F, Tamosiunaite M. Semantic analysis of manipulation actions using spatial relations. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE; 2017. p. 4612–4619.
  • 23. Ziaeetabar F, Kulvicius T, Tamosiunaite M, Wörgötter F. Recognition and prediction of manipulation actions using Enriched Semantic Event Chains. Robotics and Autonomous Systems. 2018;110:173–188. 10.1016/j.robot.2018.10.005 [DOI] [Google Scholar]
  • 24. Aksoy EE, Abramov A, Dörr J, Ning K, Dellen B, Wörgötter F. Learning the semantics of object–action relations by observation. The International Journal of Robotics Research. 2011;30(10):1229–1249. 10.1177/0278364911410459 [DOI] [Google Scholar]
  • 25. Pastra K, Aloimonos Y. The minimalist grammar of action. Philosophical Transactions of the Royal Society B: Biological Sciences. 2012;367(1585):103–117. 10.1098/rstb.2011.0123 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Yang Y, Guha A, Fermüller C, Aloimonos Y. A cognitive system for understanding human manipulation actions. Advances in Cognitive Systems. 2014;3:67–86. [Google Scholar]
  • 27.Summers-Stay D, Teo CL, Yang Y, Fermüller C, Aloimonos Y. Using a minimal action grammar for activity understanding in the real world. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE; 2012. p. 4104–4111.
  • 28. Wörgötter F, Ziaeetabar F, Pfeiffer S, Kaya O, Kulvicius T, Tamosiunaite M. Humans Predict Action using Grammar-like Structures. Scientific reports. 2020;10(1):1–11. 10.1038/s41598-020-60923-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Ryoo MS. Human activity prediction: Early recognition of ongoing activities from streaming videos. In: 2011 International Conference on Computer Vision. IEEE; 2011. p. 1036–1043.
  • 30.Zhou B, Wang X, Tang X. Random field topic model for semantic region analysis in crowded scenes from tracklets. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2011. p. 3441–3448.
  • 31. Morris BT, Trivedi MM. Trajectory learning for activity understanding: Unsupervised, multilevel, and long-term adaptive approach. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2011;33(11):2287–2301. 10.1109/TPAMI.2011.64 [DOI] [PubMed] [Google Scholar]
  • 32. Elmezain M, Al-Hamadi A, Michaelis B. Hand gesture recognition based on combined features extraction. World Academy of Science, Engineering and Technology. 2009;60:395. [Google Scholar]
  • 33. Fermüller C, Wang F, Yang Y, Zampogiannis K, Zhang Y, Barranco F, et al. Prediction of manipulation actions. International Journal of Computer Vision. 2018;126(2-4):358–374. 10.1007/s11263-017-0992-z [DOI] [Google Scholar]
  • 34.Tanke J, Gall J. Human Motion Anticipation with Symbolic Label. arXiv preprint arXiv:191206079. 2019.
  • 35. Cheng K, Lubamba EK, Liu Q. Action Prediction Based on Partial Video Observation via Context and Temporal Sequential Network With Deformable Convolution. IEEE Access. 2020;8:133527–133540. 10.1109/ACCESS.2020.3008848 [DOI] [Google Scholar]
  • 36.Pei M, Jia Y, Zhu SC. Parsing video events with goal inference and intent prediction. In: 2011 International Conference on Computer Vision. IEEE; 2011. p. 487–494.
  • 37.Li K, Hu J, Fu Y. Modeling complex temporal composition of actionlets for activity prediction. In: 2011 International Conference on Computer Vision. IEEE; 2011. p. 487–494.
  • 38.Yang Y, Fermüller C, Aloimonos Y. Detection of manipulation action consequences (MAC). In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2013. p. 2563–2570.
  • 39. Seker MY, Tekden AE, Ugur E. Deep effect trajectory prediction in robot manipulation. Robotics and Autonomous Systems. 2019;119:173–184. 10.1016/j.robot.2019.07.003 [DOI] [Google Scholar]
  • 40. Ejdeholm D, Harsten J. Manipulation Action Recognition and Reconstruction using a Deep Scene Graph Network; 2020. [Google Scholar]
  • 41. Bulling A, Blanke U, Schiele B. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR). 2014;46(3):33 10.1145/2499621 [DOI] [Google Scholar]
  • 42. Shannon CE. A mathematical theory of communication. Bell System Technical Journal. 1948;27:379–423 and 623–656. 10.1002/j.1538-7305.1948.tb00917.x [DOI] [Google Scholar]
  • 43. Schwarz GE. Estimating the dimension of a model. Annals of Statistics. 1978;6(2):461–464. 10.1214/aos/1176344136 [DOI] [Google Scholar]
  • 44. Aksoy EE, Orhan A, Wörgötter F. Semantic decomposition and recognition of long and complex manipulation action sequences. International Journal of Computer Vision. 2017;122(1):84–115. 10.1007/s11263-016-0956-8 [DOI] [Google Scholar]
  • 45.Shapovalova N, Gong W, Pedersoli M, Roca FX, Gonzalez J. On importance of interactions and context in human action recognition. In: Iberian Conference on Pattern Recognition and Image Analysis. Springer; 2011. p. 58–66.
  • 46.Zheng Y, Zhang YJ, Li X, Liu BD. Action recognition in still images using a combination of human pose and context information. In: 2012 19th IEEE International Conference on Image Processing. IEEE; 2012. p. 785–788.
  • 47. Wurm MF, Artemenko C, Giuliani D, Schubotz RI. Action at its place: Contextual settings enhance action recognition in 4- to 8-year-old children. Developmental Psychology. 2017;53(4):662–670. 10.1037/dev0000273 [DOI] [PubMed] [Google Scholar]
  • 48. Barros P, Maciel-Junior NT, Fernandes BJ, Bezerra BL, Fernandes SM. A dynamic gesture recognition and prediction system using the convexity approach. Computer Vision and Image Understanding. 2017;155:139–149. 10.1016/j.cviu.2016.10.006 [DOI] [Google Scholar]
  • 49. Sun H, Lu Z, Chen CL, Cao J, Tan Z. Accurate human gesture sensing with coarse-grained RF signatures. IEEE Access. 2019;7:81227–81245. 10.1109/ACCESS.2019.2923574 [DOI] [Google Scholar]
  • 50.Elmezain M, Al-Hamadi A, Michaelis B. Hand trajectory-based gesture spotting and recognition using HMM. In: 2009 16th IEEE International Conference on Image Processing (ICIP). IEEE; 2009. p. 3577–3580.
  • 51. Aein MJ, Aksoy EE, Wörgötter F. Library of actions: Implementing a generic robot execution framework by using manipulation action semantics. The International Journal of Robotics Research. 2019;38(8):910–934. 10.1177/0278364919850295 [DOI] [Google Scholar]
  • 52. Ingram JN, Howard IS, Flanagan JR, Wolpert DM. Multiple grasp-specific representations of tool dynamics mediate skillful manipulation. Current Biology. 2010;20(7):618–623. 10.1016/j.cub.2010.01.054 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Jeannerod M, Arbib M, Rizzolatti G, Sakata H. Grasping objects: the cortical mechanisms. Trends Neurosci. 1995;18:314–32. 10.1016/0166-2236(95)93921-J [DOI] [PubMed] [Google Scholar]
  • 54. Heumer G, Amor HB, Jung B. Grasp recognition for uncalibrated data gloves: A machine learning approach. Presence: Teleoperators and Virtual Environments. 2008;17(2):121–142. 10.1162/pres.17.2.121 [DOI] [Google Scholar]
  • 55. Land MF. Vision, eye movements, and natural behavior. Visual Neuroscience. 2009;26(1):51–62. 10.1017/S0952523808080899 [DOI] [PubMed] [Google Scholar]
  • 56.Fathi A, Li Y, Rehg JM. Learning to recognize daily actions using gaze. In: European Conference on Computer Vision. Springer; 2012. p. 314–327.
  • 57. Jiménez P, Thomas F, Torras C. 3D collision detection: a survey. Computers & Graphics. 2001;25(2):269–285. 10.1016/S0097-8493(00)00130-8 [DOI] [Google Scholar]

Decision Letter 0

Chen Zonghai

24 Aug 2020

PONE-D-20-16126

Human and Machine Action Prediction Independent of Object Information

PLOS ONE

Dear Dr. Ziaeetabar,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 08 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Chen Zonghai

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

4. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

Additional Editor Comments (if provided):

Revise according to the reviewer's opinion.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

Reviewer #4: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript entitled “Human and Machine Action Prediction Independent of Object Information” explores action prediction algorithms under no context condition. A virtual reality setup is established to research action recognition mechanism differences between human and machine vision. In manipulation actions, all objects are emulated with cubes so that human participants cannot infer action through object context and use spatial relations instead. Results show that participants are able to predict actions in, on average, less than 64% of the action's duration. In comparison, a computational model, the so-called enriched Semantic Event Chain (eSEC), which incorporates the information of spatial relations is employed. After being trained by the same actions as those observed by participants, this model successfully predicted actions even better than humans. Using information theoretical analysis, eSECs are able to make optimal use of individual cues, whereas humans seem to mostly rely on a mixed-cue strategy, which takes longer until recognition.

The research work reveals interesting mechanism of action prediction in human through well-designed comparison experiments. Providing a better cognitive basis of action recognition may, on the one hand improve our understanding of related human pathologies and, on the other hand, also help in building robots for conflict-free human-robot cooperation.

However, it remains to be promoted in following aspects:

1. What’s the motivation of this research? It should be stated in the beginning.

2. From introduction section, the necessity of human action prediction research without context information is not explained. This may benefit human-computer interaction, but in most applications, context information is available and is effective for action prediction.

3. The purpose of the research work is unclear. To explain human’s action prediction mechanism without context information or to propose a better action prediction algorithm? Experiments setup varies for different research purpose.

4. Section 1.2 should focus more on action prediction as it is the topic of this research.

5. Some details on human experiments should be clarified. In the short training phase, how to determine the end of training? Is it decided by researchers or participants? As it’s not a routine scenario, to make a fair comparison with machine vision, it should be decided by participants and an additional test should be added to validate that participants have been well-trained.

6. Are participants informed that their response time will be recorded as an evaluation criterion, which may affect their prediction timing?

7. What about prediction accuracy? Are all prediction results correct? How to analyze wrong predictions?

8. A typo mistake in line 140: two “for example”.

Reviewer #2: This paper proposes a system about machine based action recognition system eSEC learning and designed a virtual reality setup and tested recognition speed for different manipulation actions. The authors introduce in details how the theoretical analysis is done and recognition speed is performed.

Paper is not well organized and has limited potential for acceptance in “PLOS ONE”, in current format though there are some observations, corrections and suggestions regarding this paper.

• Author MUST clearly describe their contribution. Put another section what is author contribution?

• Separate introduction and literature review.

• Proposed work section is quite weak and needs major improvement. It lacks any flow diagram, algorithm, pseudo code etc. Each step of proposed algorithm/work should be clearly depicted how your work is different from existing work.

• Diagrams and flow charts are not good need to redraw.

• Performance measures should be more. The proposed work should be evaluated with a number of performance measures to prove its validity.

• Abstract and Conclusion are poorly written need much revision.

• Add references that are more recent.

• Overall, the paper lies in the category of revision.

Overall the language is not very good; however, it MUST be proofread if again before submission again.

Reviewer #3: Action prediction independent of object information as an observation and hypothesis is validated through a psycho-physical experiments rigorously conducted by the authors. A set of 10 actions are considered over a VR based experimental system. The authors further validated an eSEC computational framework to show that with eSEC, machine could achieve action prediction capability. The machine prediction power vs human prediction performance as a comparison is provided by the authors and some speculative explanations are given and discussed.

The draft is fairly well written and flows well. I really enjoyed reading the draft.

The experiments are thorough enough to approach their conclusions in my view.

The eSECs as a formulation and representation is adopted as a computational tool in this draft, is an appropriate choice given its prior use in similar computational problem domains (in robotics and computer vision fields).

The relevant literature is also well presented and reviewed.

Some parts of the draft could be improved by making the description more clear to the readers. For example, line 396 "when all three types of information"

It is unclear to me what is the third type of information other than dynamic and static ones? Please clarify.

Also, an interesting future question and direction could be, as most of the action recognition dataset and benchmarks in computer vision research area come up with the set of actions in a kind of ad-hoc manner (especially for manipulation action dataset). I would be keen to see the authors based on their discoveries from this draft to provide some designing principles for future action recognition dataset and challenges, that could fully consider the types of information discussed here.

Reviewer #4: Following are some observations

• Abstract is too much lengthy.

• Abstract is not written according to the theme of abstract.

• Actual methodology/algorithms are not mentioned in the abstract.

• In introduction section, contributions should be mentioned in bullets for better understanding of the readers.

• The manuscript should be checked for typos. In some places the word Figure is written while in other places Fig is written, must be uniform throughout.

• Figures quality is not good, must be 300dpi.

• Authors employed so-called extended semantic event chains (eSEC) which is an existing work, what is their real contribution?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Yezhou Yang

Reviewer #4: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Dec 28;15(12):e0243829. doi: 10.1371/journal.pone.0243829.r003

Author response to Decision Letter 0


29 Oct 2020

First, we want to thank the reviewers for their helpful comments. Below, we provide our answers to these comments. Reviewer comments are written in Courier New font, our answers in Calibri font.

Please note that the figures in the merged submission PDF are not reproduced at good quality, but PLOS guarantees the download of the high-resolution figures; see https://everyone.plos.org/2011/02/11/ask-everyone-figure-files-in-your-merged-pdf/

*****Reviewer 1*****

The manuscript entitled “Human and Machine Action Prediction Independent of Object Information” explores action prediction algorithms under no context condition. A virtual reality setup is established to research action recognition mechanism differences between human and machine vision. In manipulation actions, all objects are emulated with cubes so that human participants cannot infer action through object context and use spatial relations instead. Results show that participants are able to predict actions in, on average, less than 64% of the action's duration. In comparison, a computational model, the so-called enriched Semantic Event Chain (eSEC), which incorporates the information of spatial relations is employed. After being trained by the same actions as those observed by participants, this model successfully predicted actions even better than humans. Using information theoretical analysis, eSECs are able to make optimal use of individual cues, whereas humans seem to mostly rely on a mixed-cue strategy, which takes longer until recognition. The research work reveals interesting mechanism of action prediction in human through well-designed comparison experiments. Providing a better cognitive basis of action recognition may, on the one hand improve our understanding of related human pathologies and, on the other hand, also help in building robots for conflict-free human-robot cooperation. However, it remains to be promoted in following aspects:

1) What’s the motivation of this research? It should be stated in the beginning.

The motivation of the study was to find out whether and how humans use spatial relations between objects in prediction of manipulation actions. In the revised manuscript, we write in the Introduction (lines 83-88):

“In the present study, we sought to precisely analyze and objectify the way that humans exploit information about spatial relations during action prediction. Eliminating object and contextual (i.e., room, scene) information as confounding factors, we tested the hypothesis that spatial relations between objects can be exploited to successfully predict the outcome of actions before the action aim is fully accomplished.”

2) From introduction section, the necessity of human action prediction research without context information is not explained. This may benefit human-computer interaction, but in most applications, context information is available and is effective for action prediction.

Our study is the first to specifically address how humans use relational spatial information for prediction. Furthermore, we write now (see lines 78-82): “For instance, spatial information and, specifically, spatial relations that are in the center of the current study may become more relevant when objects are difficult to recognize, e.g. when observing actions from a distance, in dim light or in case when actions are performed with objects or in environments not familiar to an observer, or when objects are used in an unconventional way.”

In addition, we had already suggested in the discussion of the first submission that infants could potentially bootstrap action understanding from spatial relations at developmental stages where their general world knowledge is still limited and, thus, contextual information cannot be interpreted by them in the same way as by an adult (see lines 458-465).

3) The purpose of the research work is unclear. To explain human’s action prediction mechanism without context information or to propose a better action prediction algorithm? Experiments setup varies for different research purpose.

The purpose of the research work is to investigate how humans use spatial relations to predict manipulation actions. Calculation of the spatial relations is based on the eSEC model, which was originally developed for computer vision. We have now substantially rewritten the introduction and method sections to make clear that we concentrate on explaining humans' action prediction mechanism.

We have also changed the title of the paper to make its purpose clearer. The new title is: “Using enriched Semantic Event Chains to model human action prediction based on (minimal) spatial information”.

4) Section 1.2 should focus more on action prediction as it is the topic of this research.

Having refocused the paper on human action prediction, the introduction is no longer divided into sections. We have rewritten and substantially shortened the text from section 1.2, focusing on prediction as requested by the reviewer (see lines 98-106).

5) Some details on human experiments should be clarified. In the short training phase, how to determine the end of training? Is it decided by researchers or participants? As it’s not a routine scenario, to make a fair comparison with machine vision, it should be decided by participants and an additional test should be added to validate that participants have been well-trained.

The procedure was as follows: We first verbally explained the types of actions to the participant with the help of wooden cubes, and then asked the participant to put on the VR headset and watch the explained actions in virtual reality. For each action, we showed a sample, and the name of the action appeared in a green box in front of the participant throughout the action. After showing one experimental sample for each of the ten actions, we asked the participant if everything was clear, and if he/she confirmed, we started the experiment. This is now described in the text as well (see lines 167-169).

Regarding the level of training, it is important to note that the ten actions we used were very simple and are an integral part of everyday object manipulation. Prediction accuracy during the experiment was very high with a mean of 97.6%, which underpins that the task was well understood.

6) Are participants informed that their response time will be recorded as an evaluation criterion, which may affect their prediction timing?

Yes, all the details, including the importance of response time, were explained to each participant before the experiment. The instruction was “indicate as quickly as possible which action was currently presented”, but we did not ask the participants to do this in any competitive way (e.g. competing against other participants or the machine), thereby avoiding time pressure. Thus, we used an absolutely conventional approach to instructing participants in a behavioral experiment in which error rates and reaction times are recorded.

7) What about prediction accuracy? Are all prediction results correct? How to analyze wrong predictions?

We thank the reviewer for this important inquiry. We added the prediction accuracy to the results section. Participants' mean prediction accuracy was very high (M = 97.6%, SD = 1.8, n = 49). Therefore, we did not further analyze wrong predictions (apart from the correlational analysis of the error rate to identify learning effects) (see lines 318-319 and 322-325).

Please note that we detected a coding mistake in our trial variable, which we have now corrected. This led to changes in the results section concerning the learning effects, which now show a significant reduction in error rate and an enhancement in predictive power over the course of the experiment, fully in line with what one would expect for human behavior (see lines 321-324).

8) A typo mistake in line 140: two “for example”.

Thank you, this is now corrected.

*****Reviewer 2*****

This paper proposes a system about machine based action recognition system eSEC learning and designed a virtual reality setup and tested recognition speed for different manipulation actions. The authors introduce in details how the theoretical analysis is done and recognition speed is performed.

Paper is not well organized and has limited potential for acceptance in “PLOS ONE”, in current format though there are some observations, corrections and suggestions regarding this paper.

1) Author MUST clearly describe their contribution. Put another section what is author contribution?

Now we bullet list our contributions in the introduction (see lines 107-122):

“The current study consisted of the following steps:

• Creating a virtual reality database containing ten different manipulation actions with multiple scenarios each.

• Conducting a behavioural experiment in which human participants engaged in action prediction in virtual reality for all scenarios, where prediction time and prediction accuracy were measured.

• Calculating three types of spatial relations using the eSEC model: (1) touching vs. non-touching relations, (2) static spatial relations and (3) dynamic spatial relations.

• Performing an information theoretical analysis to determine how participants used these three types of spatial relations for action prediction.

• Training an optimal (up to the learning accuracy) machine algorithm to predict an action using the relational information provided by the eSEC model.

• Comparing human to the optimal machine action prediction strategies based on spatial relations.”

2) Separate introduction and literature review.

Merging the introduction with the literature review is standard in psychology papers. Thus, in this multidisciplinary paper we leave it this way, especially since the paper focuses on the investigation of human action prediction, with machine prediction providing the point of comparison for the analysis of human performance.

3) Proposed work section is quite weak and needs major improvement. It lacks any flow diagram, algorithm, pseudo code etc. Each step of proposed algorithm/work should be clearly depicted how your work is different from existing work.

We have updated the flow diagram presented in Figure 1, which now better explains which parts our study is composed of. We provide descriptions for each block of that flow diagram in the Method section. However, in the main text we give only an essential (intuitive) description of the main algorithmic steps of the eSEC model and provide all finer computational details in the Appendix. This is done in order not to over-burden the main text with computational details and to keep it accessible to readers interested in the psychological aspects of the study, which are our core contribution.

The differences from existing work are described in the introduction and discussion sections. Essentially, we investigate for the first time how humans recognize actions based on relational spatial information between manipulated objects, when context is not available.

Here we summarize what is old and what is new with respect to the methods in our study. Old methods: we use the eSEC model developed in our previous work, and we use standard statistical methods to investigate the obtained human data. New are the virtual reality setup in which we perform our experiments and the virtual reality database with 300 manipulation action scenarios (10 actions, 30 scenarios each), which we demonstrate to the participants of our study.

4) Diagrams and flow charts are not good need to redraw.

Thank you for drawing our attention to this shortcoming. All figures were reproduced using higher resolution (600 dpi).

Please note that the figures in the merged submission PDF are not reproduced at good quality, but PLOS guarantees the download of the high-resolution figures; see https://everyone.plos.org/2011/02/11/ask-everyone-figure-files-in-your-merged-pdf/

5) Performance measures should be more. The proposed work should be evaluated with a number of performance measures to prove its validity.

Thank you for this remark. We added the human prediction accuracy as a performance measure.

All in all, in respect to methods, we now did the following:

• We performed a repeated measures ANOVA to analyze human predictive power and compared human and machine predictive power for different actions using t-tests.

• We calculated the information gain based on each eSEC column entry, and fitted logistic regression to the obtained series for eight different sets of spatial relations (Touching, Static and Dynamic alone, as well as all possible combinations of those types of relations plus one model which does not divide the relations into separate components).

• We analyzed learning effects in human error rates and predictive power.

We would argue that this set of methods is both representative and allows us to achieve trustworthy results, showing that humans successfully exploit relational spatial information for action prediction. Moreover, it allows us to determine which strategies humans deploy when provided only with dynamic and static spatial information for action prediction.

6) Abstract and Conclusion are poorly written need much revision.

The Abstract and Discussion sections have been rewritten. The Conclusion section was incorporated into the Discussion.

7) Add references that are more recent.

We added new references in the introduction covering more recent psychological findings and recent machine learning aspects (Stadler et al., 2012, Ref. No. [12]; Wurm et al., 2014, Ref. No. [14]; Cheng et al., 2020, Ref. No. [35]; Ejdeholm et al., 2020, Ref. No. [40]), as well as Barros et al., 2017, Ref. No. [48] and Sun et al., 2019, Ref. No. [49] in the discussion.

8) Overall, the paper lies in the category of revision.

Thanks for that opportunity! We hope that the revised manuscript is much clearer now.

9) Overall the language is not very good; however, it MUST be proofread if again before submission again.

We have proof-read the paper.

*****Reviewer 3*****

Action prediction independent of object information as an observation and hypothesis is validated through a psycho-physical experiments rigorously conducted by the authors. A set of 10 actions are considered over a VR based experimental system. The authors further validated an eSEC computational framework to show that with eSEC, machine could achieve action prediction capability. The machine prediction power vs human prediction performance as a comparison is provided by the authors and some speculative explanations are given and discussed.

1) Some parts of the draft could be improved by making the description more clear to the readers. For example, line 396 "when all three types of information", It is unclear to me what is the third type of information other than dynamic and static ones? Please clarify.

Now we write (see lines 376-377): “Notably, when all three types of information (i.e., touching, static or dynamic information) were equally beneficial (this was the case for take, uncover, shake, and put)…”

We also name the three types of information explicitly in the abstract, introduction, method and result sections to make the writing clearer.

2) Also, an interesting future question and direction could be, as most of the action recognition dataset and benchmarks in computer vision research area come up with the set of actions in a kind of ad-hoc manner (especially for manipulation action dataset). I would be keen to see the authors based on their discoveries from this draft to provide some designing principles for future action recognition dataset and challenges that could fully consider the types of information discussed here.

Possibly the most interesting aspect for any new data set relates to the finding that we can often predict actions very efficiently in situations where context is not interpretable (such as those introduced by our VR cube-world here). This suggests creating a data set that also includes actions where objects are used in unconventional ways (like cutting dough with a spoon, etc.), to avoid fully object-based action recognition strategies. Comparing results from such data with results from more conventional (object- or context-oriented) data should allow us to better understand the role and affordances of objects. This, however, can only be done in future work.

*****Reviewer 4*****

1) Abstract is too much lengthy.

The Abstract has been sharpened. However, to address the reviewer's next comment, the methods also had to be clearly pointed out in the Abstract, so its length could not be reduced by much.

2) Actual methodology/algorithms are not mentioned in the abstract.

The eSEC model (with some details) and the information theoretical analysis, the two stepping stones of our methodological approach, are mentioned in the abstract.

3) In introduction section, contributions should be mentioned in bullets for better understanding of the readers.

In the revised manuscript, we added a bullet list of contributions. Now we write (see lines 107-122):

“The current study consisted of the following steps:

• Creating a virtual reality database containing ten different manipulation actions with multiple scenarios each.

• Conducting a behavioural experiment in which human participants engaged in action prediction in virtual reality for all scenarios, where prediction time and prediction accuracy were measured.

• Calculating three types of spatial relations using the eSEC model: (1) touching vs. non-touching relations, (2) static spatial relations and (3) dynamic spatial relations.

• Performing an information theoretical analysis to determine how participants used these three types of spatial relations for action prediction.

• Training an optimal (up to the learning accuracy) machine algorithm to predict an action using the relational information provided by the eSEC model.

• Comparing human to the optimal machine action prediction strategies based on spatial relations.”

4) The manuscript should be checked for typos. In some places the word Figure is written while in other places Fig is written, must be uniform throughout.

We use only “Figure” now. We also have proof-read the entire manuscript.

5) Figures quality is not good, must be 300dpi.

We reproduced all figures at 600 dpi.

Please note that the figures in the merged submission PDF are not reproduced at good quality, but PLOS guarantees the download of the high-resolution figures; see https://everyone.plos.org/2011/02/11/ask-everyone-figure-files-in-your-merged-pdf/

6) Authors employed so-called extended semantic event chains (eSEC) which is an existing work, what is their real contribution?

The motivation of the study was to find out how humans use spatial relations between objects for the prediction of manipulation actions, using the eSEC predictions as a point of comparison. Our study is the first to investigate how humans recognize actions based on relational spatial information between manipulated objects when contextual information is not available.

For a full list of contributions see our bullet list in lines 108-122.

Attachment

Submitted filename: Response_to_reviewers.pdf

Decision Letter 1

Chen Zonghai

27 Nov 2020

Using enriched Semantic Event Chains to model human action prediction based on (minimal) spatial information

PONE-D-20-16126R1

Dear Dr. Fatemeh Ziaeetabar,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Chen Zonghai

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Based on the opinions of the reviewers, it is suggested that the manuscript be accepted.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The author made the necessary changes to the manuscript (PONE-D-20-16126) and answered the concerns of the reviewer.

Reviewer #4: Authors addressed all comments, it is accepted in its current form. It is also recommended for publication

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #4: No

Acceptance letter

Chen Zonghai

14 Dec 2020

PONE-D-20-16126R1

Using enriched Semantic Event Chains to model human action prediction based on (minimal) spatial information

Dear Dr. Ziaeetabar:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Chen Zonghai

Academic Editor

PLOS ONE
