PLOS Computational Biology. 2022 Nov 30;18(11):e1009866. doi: 10.1371/journal.pcbi.1009866

Tracking human skill learning with a hierarchical Bayesian sequence model

Noémi Éltető 1,*, Dezső Nemeth 2,3,4, Karolina Janacsek 3,5, Peter Dayan 1,6
Editor: Christoph Mathys
PMCID: PMC9744313  PMID: 36449550

Abstract

Humans can implicitly learn complex perceptuo-motor skills over the course of large numbers of trials. This likely depends on our becoming better able to take advantage of ever richer and temporally deeper predictive relationships in the environment. Here, we offer a novel characterization of this process, fitting a non-parametric, hierarchical Bayesian sequence model to the reaction times of human participants’ responses over ten sessions, each comprising thousands of trials, in a serial reaction time task involving higher-order dependencies. The model, adapted from the domain of language, forgetfully updates trial-by-trial, and seamlessly combines predictive information from shorter and longer windows onto past events, weighing the windows proportionally to their predictive power. As the model implies a posterior over window depths, we were able to determine how, and how many, previous sequence elements influenced individual participants’ internal predictions, and how this changed with practice.

Already in the first session, the model showed that participants had begun to rely on two previous elements (i.e., trigrams), thereby successfully adapting to the most prominent higher-order structure in the task. The extent to which local statistical fluctuations in trigram frequency influenced participants’ responses waned over subsequent sessions, as participants forgot the trigrams less and evidenced skilled performance. By the eighth session, a subset of participants shifted their prior further to consider a context deeper than two previous elements. Finally, participants showed resistance to interference and slow forgetting of the old sequence when it was changed in the final sessions. Model parameters for individual participants covaried appropriately with independent measures of working memory and error characteristics. In sum, the model offers the first principled account of the adaptive complexity and nuanced dynamics of humans’ internal sequence representations during long-term implicit skill learning.

Author summary

A central function of the brain is to predict. One challenge of prediction is that both external events and our own actions can depend on a variably deep temporal context of previous events or actions. For instance, in a short motor routine, like opening a door, our actions only depend on a few previous ones (e.g., push the handle if the key was turned). In longer routines such as coffee making, our actions require a deeper context (e.g., place the moka pot on the hob if coffee is ground, the pot is filled and closed, and the hob is on). We adopted a model from the natural language processing literature that matches humans’ ability to learn variable-length relationships in sequences. This model explained the gradual emergence of more complex sequence knowledge and individual differences in an experiment where humans practiced a perceptual-motor sequence over 10 weekly sessions.

Introduction

“[…] even intuition might be reduced to mathematics”.

Isaac Asimov

The fluency and accuracy of perception and action depend critically on our preternatural ability to predict. For instance, when learning a new action routine, the steps are generated independently of each other, yielding slow and error-prone performance. Routines that are practiced extensively, like opening the front door at home, become fast, accurate, and effortless. This is because the sequence of actions comprising the routine becomes predictable via the gradual learning of dependencies among sequence elements. This sequence learning mechanism that creates skills is ubiquitous: along with its role in the genesis of fluent motor performance [1], it operates in spatio-temporal vision, assisting scene perception [2], and in the auditory domain, underlying speech perception [3–5] and speech production [6, 7].

Skill production in the form of sequence learning has been most widely studied in serial reaction time (SRT) tasks [8] in which participants are instructed to follow a repeating pattern of key presses like ABAC. With practice, they become faster to produce the key presses obeying this sequence than those associated with a random sequence. Notably, this increase in fluency is not always accompanied by explicit knowledge of the sequence. Some participants who become faster to respond to the pattern are not able to verbally report or themselves generate the true pattern, suggesting that they learned it implicitly [9]. As such, this paradigm can capture non-intentional sequential behavior that does not require conscious awareness.

Conventional SRT tasks pose higher-order sequence learning problems. That is, the sequence elements depend on more than one previous element. In the example sequence ABAC, whether B or C follows A is uncertain; but C follows BA with certainty. That is, the first-order dependence of C on A is uncertain but its second-order dependence on BA is certain. Learning the second-order dependencies ensures predictability, and thus fluency, for all elements of this example sequence. However, if the order of the sequence generating process is not known or instructed a priori, the learner has to arrive at the second-order solution by themselves. The same is true in real-life sequence learning problems. Indeed, a central challenge of sequence prediction is to determine the exploitable predictive context, the depth of which can vary from sequence to sequence, and even from element to element. Humans spontaneously adopt the depth of context appropriate to the sequence statistics [10]. Furthermore, learners in the wild accommodate substantial noise. For instance, we might have to greet a neighbor while opening our front door. Such intervening elements should be flexibly ignored in our sequence input in order to condition only on the parts of the input that belong to the door opening routine and correctly generate the next step.

The Alternating Serial Response Time (ASRT) task [11] was developed to study higher-order sequence learning in the face of noise. The paradigm is identical to that of the SRT but the sequence of key presses is predictable only on every alternate trial. Participants gradually respond more quickly on predictable trials, presumably because they learn to exploit a sufficiently deep context—that is to say, they form larger context-action chunks. However, due to the probabilistic sequence, participants’ knowledge in the ASRT is completely implicit [12], as opposed to the mixed explicit-implicit knowledge that is typically exhibited in the SRT. Therefore, one can assume that the response times in the ASRT are predominantly influenced by the probability of the upcoming elements and not by other, explicit, strategies. We used the ASRT to study how humans adapt implicitly to higher-order structure in a noisy sequence—providing unique insight into the long-term learning of a complex skill.

Since Shannon [13], so-called n-gram models have been a conventional approach to modeling non-deterministic higher-order sequential dependencies. An n-gram model learns to predict the next element given the previous n − 1 elements. For instance, a 3-gram (or trigram) model predicts an element given two previous elements. In essence, an n-gram is a chunk of n adjacent elements, and we use the terms interchangeably. One major limitation of such models is that the number of n-grams grows exponentially as a function of their size n. Thus, acquiring or storing an n-gram table becomes statistically and computationally infeasible, respectively, even at moderate values of n. Critically, a simple n-gram model fails to exploit the typically hierarchical nature of chunks: i.e., that a chunk ‘inherits’ predictive power from its suffix. For instance, in the speech prediction example, given a context of ‘in California, San’, the most distant word ‘in’ is weakly predictive, while ‘California’ and ‘San’ are strongly predictive of ‘Francisco’. The entire context ‘in California, San’ inherits most of its predictive power from the shallower context ‘California, San’. Similarly, action chunks underlying our motor skills, like opening a door, are often embedded into, and interrupted by, previously unseen or irrelevant actions. Humans appear capable of exploiting the hierarchical statistical structure of sequences by down-weighting, or ignoring, parts of the context that have not convincingly been observed to be predictive.

Teh [14] suggested a Bayesian non-parametric extension of n-gram models as a principled machine learning solution to both the problem of complexity and hierarchical context weighting. This model builds structure on the fly as evidence accumulates, extending from (n − 1)- to n-gram dependencies according to the observed statistics. Thus, it flexibly reduces to a unigram model if no chunk is present, or builds the bigram, trigram, etc. levels if appropriate. For prediction, it smooths over all chunks that are consistent with the available context, proportional to their prior evidence. This model was originally suggested as a language model; here, we consider its use for a more general cognitive contextual sequence learning problem.

In our experiment, participants practiced the same visuo-motor second-order sequence in the ASRT task for 8 long sessions, each separated by a week. In two subsequent sessions, the sequence was changed in order to test participants’ resistance to interference. We tracked the evolution of sequence knowledge using the Bayesian non-parametric sequence model, capturing representational dynamics and sensitivity to local statistical fluctuations by adapting it to learn online and be suitably forgetful. We fitted the sequence model to participants’ response times assuming that faster responses reflect more certain expectations. We show how shifting their priors over the predictive contexts allowed participants to grow and refine their internal sequence representations week by week. Already in the first session, participants began to rely on two previous elements for prediction, thereby successfully adapting to the main task structure. However, at this early stage, trigram recency influenced their responses, as captured by the forgetting mechanism of our model. With training, trigram forgetting was reduced, giving rise to robustness against local statistical fluctuations. Thus, our model reduced to a simple, stationary trigram model. However, by the last training session, we observed that a subset of participants shifted their prior further to consider a context even deeper than two previous elements. The fitted parameter values guiding higher-order sequence learning were correlated with independently measured working memory scores. Finally, reduced chunk forgetting predicted the resistance to interference in the last two sessions.

1 Methods

1.1 Ethics statement

All participants provided written informed consent before enrollment and received course credits for taking part in the experiment. The study was approved by the United Ethical Review Committee for Research in Psychology (EPKEB) in Hungary (Approval number: 30/2012) and by the research ethics committee of Eötvös Loránd University, Budapest, Hungary. The study was conducted in accordance with the Declaration of Helsinki.

1.2 Experiment

A detailed description of the task, procedure, and participants can be found in [15] where this data was first published. In brief, we tested participants’ long-term sequence learning on a serial reaction time task with second-order dependence (the Alternating Serial Reaction Time task or ASRT [11]). On each trial, a cue appeared in one of four equally spaced, horizontally arranged locations. Participants had to press a corresponding key as accurately and quickly as possible. The next trial started 120 ms after the response (Fig 1a). An eight-element second-order sequence dictated the cue locations, i.e. the sequence elements. In this, four deterministic states, each associated with a unique element, were interleaved with four random states, which produced the four elements with equal probabilities.

Fig 1. The Alternating Serial Reaction Time (ASRT) task with second-order dependence structure.


(a) Participants had to press the key corresponding to the current sequence element (i.e. cue location) on the screen as accurately and quickly as possible, using the index and middle fingers of both hands. In the display, the possible locations were outlined in black and the cue always looked the same; fill color and saturation are used here only for explanatory purposes. (b) The structure of the example sequence segment in (a). Color saturation and outline indicate the element that was presented on a trial. The vertical arrow indicates the current trial. The task was generated from an eight-element second-order sequence where every second element was deterministic and the elements in between were random. The deterministic components in this example are: red-blue-yellow-green. The element on any random trial (including the current one) is unpredictable. However, this current trial happens to mimic the deterministic second-order dependence where green is followed by red after a gap of one trial, making it a high-probability trigram trial (H). The other random elements were associated with lower-probability trigrams (L). (c) Under the true generative model, when in a random state, high-probability trigrams (rH) and low-probability trigrams (rL) are equally unexpected. (d) A learner who can pick up second-order dependencies, but who is agnostic to the higher-order alternating state structure, would expect rH more than rL. (e) In the last training session (session 8; after more than 14,000 trials), participants responded faster to deterministic than random trials, suggesting that they learned to predict the upcoming element. They also responded quickly even on random trials if those happened to complete a high-probability trigram (rH). The y axis shows the standardised reaction time (RT) averaged over the different trial types on the last session of learning. The error bars indicate the 95% CI.

This second-order rule implies that a deterministic element is predictable from the element two time steps ago. If one ignores the deterministic/random state of the alternating sequence, this also means that some trigrams (i.e., sequences of three elements) have high probabilities. Such trigrams can also arise, by chance, in the random states (Fig 1b), and so allowed [11] a test of whether participants had learned the global alternating rule of the task, in which case any element in a random state would be unexpected (once the state had been inferred), or whether they had instead merely learned local dependencies or frequent chunks, in which case a random state that happened to complete a (so-called random) high-frequency trigram would also be expected. The excess speed of responses to the final elements of random high-frequency trigrams compared to random low-frequency trigrams, shown in Fig 1e, suggests chunk learning. Learning was purely implicit in this task, as none of the participants could verbalize or reproduce the sequence structure in the debriefing at the end of the experiment.

Participants completed nine sessions of 2125 trials each, and a tenth session of 1700 trials, each separated by a week. For each participant, one of the six unique permutations of the deterministic second-order sequences was selected in a pseudo-random manner and the same sequence was practiced for 8 (training) sessions. On session 9, unbeknownst to the participants, two elements in the deterministic second-order sequence were swapped and thus all but one of the second-order pairs were changed. Over four 425-trial epochs of session 10, old and new sequences alternated in the order old, new, old, new. We refer to sessions 9 and 10 as interference sessions. Of the 32 participants, we analysed data from the 25 (22 females and 3 males; M_age = 20.4 years, SD_age = 1.0 years) who completed all ten sessions.
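
To make the alternating structure concrete, the following minimal sketch (our illustration, not the authors' code; it assumes the deterministic states occupy the even trial positions) generates an ASRT stimulus stream from a four-element deterministic pattern:

    import random

    def generate_asrt(pattern, n_trials, elements=(0, 1, 2, 3)):
        """Alternate deterministic pattern elements with uniformly random
        elements, as in the ASRT task."""
        stream = []
        for t in range(n_trials):
            if t % 2 == 0:                       # deterministic state
                stream.append(pattern[(t // 2) % len(pattern)])
            else:                                # random state
                stream.append(random.choice(elements))
        return stream

    # e.g., one training session of 2125 trials with the pattern 0-1-2-3
    session_events = generate_asrt((0, 1, 2, 3), 2125)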

1.3 Modeling strategy

We assume that a learner predicts the probabilities of successive elements in the sequence, and that the response alacrity for pressing key k on trial t (τ_t) scales logarithmically with the probability the model awards to that k. Thus, high-probability responses are the fastest [16], being the most expected (for recent neural evidence, see [17]) (Fig 2 upper box). Note that fitting to the participants’ responses rather than the actual events allows us to make inferences about their internal sequence model from their errors as well as their correct responses. For this, we make the assumption that errors reflect not only less caution but also expectations (as captured by a lower response threshold and a biased starting point in evidence accumulation models, i.e. [18]). Indeed, individuals are prone to respond according to their internal predictions, even if these do not match the actual upcoming elements [19]. In the ASRT, where the current element is already presented at the time of the response, presumably a conflict arises between the instructed response and the error response. However, the nature of the conflict is not within the scope of this study.

Fig 2. Modeling strategy.


We adopted a model-based approach, fitting the hyperparameters θ of an internal sequence model (upper box), together with low-level effects (the spatial distance between subsequent response locations, response repetition, error and post-error trials; lower box) to participants’ response times. The contribution of the sequence model is the scaled log of the predictive probability of each key press k (one of the four keys, marked as a transparent square), given the context u (previous events, marked as a string of colored squares). The sequence model makes predictions by flexibly combining information from deepening windows onto the past, considering fewer or more previous stimuli.

The learner faces the problem of finding a model that considers a wide enough context of predictive past elements, whilst not suffering from combinatorial explosion and overfitting by considering too many past elements that are redundant for prediction. The solution we consider in this paper is a Bayesian nonparametric n-gram model [14]. In a nutshell, the model combines the predictive information from progressively deeper windows onto the past: no previous element; one previous element; two previous elements etc., corresponding to the unigram, bigram, trigram, etc., levels. The hierarchies in the model provide a principled way of combining information: a deeper window ‘inherits’ evidence strength from the shallower windows that it contains.

Teh [14] employed an offline algorithm to model a given static sequence such as a text corpus. However, the model can be fitted in an online sequential fashion instead, updating the beliefs at each observation and using the updated model for predicting the next observation. This captures representational dynamics: more complex models are employed as the data to justify them accumulates. We hypothesized that humans build their internal sequence representation in a similar way, starting with learning short-range dependencies and gradually adding long-range dependencies if they are indeed significantly present in the data. Therefore, we adopted this model to serve as an internal sequence model.

In order to isolate the effect of the internal sequence prediction on reaction times (RTs), we controlled for low-level effects that exert significant influence in serial reaction time tasks (S2 Table). These included the spatial distance between subsequent elements, repetition of the same location, errors, and post-error trials (Fig 2 lower box). Thus, we performed a single, integrated fit of the free parameters of the sequence predictor and the low-level effects to the recorded RTs.
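
To sketch how these components might combine (the regressor names and the linear form shown here are illustrative assumptions, not the authors' exact parameterization), the predicted log RT on each trial can be written as a weighted sum of the sequence-model surprise and the low-level regressors:

    def predict_log_rt(neg_log_p, spatial_distance, repetition, error, post_error, rho):
        """Linear response model sketch: predicted log RT as an intercept plus
        weighted regressors; rho = (b0, b_seq, b_dist, b_rep, b_err, b_posterr)."""
        b0, b_seq, b_dist, b_rep, b_err, b_posterr = rho
        return (b0
                + b_seq * neg_log_p           # surprise: less expected, slower
                + b_dist * spatial_distance   # farther from previous key, slower
                + b_rep * repetition          # response repetition, faster (b_rep < 0)
                + b_err * error               # erroneous response, faster (b_err < 0)
                + b_posterr * post_error)     # trial after an error, slower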

1.4 The internal sequence model

The model from [14] infers a nested predictive context of sequence elements, probabilistically choosing the level of the nesting for a particular prediction based on priors and data. Since we treat it as an internal and subjective rather than a normative and objective model (fitting parameters to participants’ reaction times rather than to the actual structure of the second order sequence), we can infer how much sequence information participants used for adapting their behavior. Using online model fitting, we capture the dynamic, trial-by-trial refinement of individuals’ sequence representation.

At the core of the model is the Dirichlet process (DP) mixture [20]. This places a prior on the unigram probabilities G(k) of key presses k:

k ∼ G,  G ∼ DP(α, H) (1)

where α is called a strength parameter, and H is a base probability distribution. The DP is a prior over how to cluster data points together, with the strength parameter determining the propensity for co-affiliation. In our case, each cluster is labeled by k. Thus, the DP expresses a prior over the probability of a future key press, given a history of key presses. The base distribution determines the prior probabilities of cluster labels k. In our case, H is uniform, expressing that we have no information about the probabilities of the key presses before observing the data.

A commonly used procedure for sampling from G is the Chinese restaurant process (CRP) [21]. The CRP is both a generative and a recognition procedure, as we explain below. In the CRP metaphor, a new customer either joins an occupied table with a probability proportional to the number of customers at that table (making the model affiliative) or opens up a new one (making the model infinite), with probability proportional to parameter α (Fig 3a). Each table corresponds to a cluster and, in our case, is labeled by a key press (e.g., ‘key 1’, marked as colors in Fig 3a). The customers correspond to observations of a given key press (see the summary of the terminology in Table 1). In the recognition procedure, we treat the response as given and the customer is bound to sit at or open a table with the corresponding label, i.e. the probability of the given key press belonging to each cluster is computed. The fact that the same key press can be the label of different clusters reflects that the same response could arise from different latent causes, e.g., different contexts. In the generative procedure, the probabilities of sitting at or opening tables that share a label are summed, i.e. the model predicts how likely, on average, each key press is. The strength parameter α controls the expected number of clusters: the higher α is, the more prone customers will be to open their own cluster. The resulting seating arrangement S is a sample of G. Since in the generative procedure the labels of new tables are sampled independently from H, a high strength value α will cause G to resemble H (hence, enhancing the ‘strength of H’).
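
For illustration, the predictive distribution of a labeled CRP can be sketched as follows (a simplification under our own naming; the full model also tracks which specific table each customer joins):

    from collections import Counter

    def crp_predictive(table_sizes, table_labels, alpha, labels=(0, 1, 2, 3)):
        """Predictive probability of each key-press label: join an occupied
        table with probability proportional to its size, or open a new table
        (labeled by a draw from the uniform base H) with probability
        proportional to alpha."""
        n = sum(table_sizes)
        mass = Counter()
        for size, label in zip(table_sizes, table_labels):
            mass[label] += size
        H = 1.0 / len(labels)                    # uniform base distribution
        return {k: (mass[k] + alpha * H) / (n + alpha) for k in labels}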

Fig 3. Treating the sequence learning problem as a hierarchical nonparametric clustering problem.


(a) The traditional, unforgetful Chinese restaurant process (CRP) is a nonparametric Bayesian model where the probability that a new observation belongs to an existing cluster or a new one is determined by the cluster sizes and the strength parameter α. In the metaphor, the new customer (new observation; see the terminology in Table 1; shown as black dots) sits at one of the existing tables (clusters labeled by key press identity, e.g., ‘response to left side of the screen’; shown as colored circles) or opens up a new table (shown as an open circle) with probabilities proportional to the number of customers sitting at the tables and α. Here, the most likely next response would be of the type pink. (b) The distance-dependent or ‘forgetful’ Chinese restaurant process (ddCRP) is governed by a distance metric, according to the ‘close together sit together’ principle. In our case, the customers are subject to exponential decay with rate λ, as shown in the inset (and illustrated by the grey colours of the customers). Even though the same number of customers sit at the tables as in (a), this time the predictive probability of a yellow response is highest because most of the recent responses were yellow. (c) In the distance-dependent hierarchical Chinese restaurant process (HCRP), restaurants are labeled by the context of some number of preceding events and are organized hierarchically such that restaurants with the longest context are on top. Thus, each restaurant models the key press of the participant at time point t, k_t, given a context of n events (e_{t−n}, …, e_{t−1}). A new customer arrives first to the topmost restaurant that corresponds to its context in the data (in the example, the customer is bound to visit the restaurant labeled by the context ‘yellow-blue’ when it arrives at level 2). If it opens up a new table, it also backs off to the restaurant corresponding to the context one element shorter (in the example, to the restaurant labeled by the context ‘blue’).

Table 1. Terminology of the hierarchical Chinese restaurant process mapped onto the experimental measures of the current study.

ddHCRP metaphor | Experimental measures
customer | observation at t
table | subset of observations up to t − 1
dish | label of the observation at t (key press k_t or event e_t)
restaurant on level n | context of n previous events (e_{t−n:t−1})

The affiliative process gives rise to the ‘rich gets richer’ property where clusters with more data points attract other data points more strongly. This captures, for instance, the fact that the more often we gave a response in the past, the more likely we are to repeat the same response in the future [22]. However, these clusters would ultimately grow without bound. Since our participants are forgetful and might be sensitive to local statistical fluctuations, we used a variant called the distance-dependent CRP (ddCRP) [23] (Fig 3b). Here, affiliation among customers decreases as a function of sequential distances D. D is the set of distances between all customers that form sequential pairs, measured in the number of past trials relative to the current customer. We set the affiliation decrease to be exponential, with rate λ, crudely modeling human forgetting curves.
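
In the forgetful variant, each past observation contributes exponentially decayed mass instead of unit mass. A simplified per-label sketch (ours; it collapses tables into label counts, which the full ddCRP does not do) might read:

    import numpy as np

    def ddcrp_predictive(past_labels, alpha, lam, labels=(0, 1, 2, 3)):
        """Distance-dependent CRP sketch: the observation made d trials ago
        contributes weight exp(-lam * d), so recent observations dominate
        when the forgetting rate lam is large."""
        past = np.asarray(past_labels)
        weights = np.exp(-lam * np.arange(len(past), 0, -1))  # oldest gets distance n
        H = 1.0 / len(labels)
        total = weights.sum() + alpha
        return {k: (weights[past == k].sum() + alpha * H) / total for k in labels}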

This yields the distance-dependent Dirichlet process (ddDP) prior:

k_t ∼ G_t,  G_t ∼ ddDP(α, λ, H, D) (2)

So far, we have a suitable model of (potentially nonstationary) marginal probabilities of key presses that corresponds to a unigram model. We can use this building block to represent the key press probabilities at time point t conditioned on a context of n preceding key press instructions or events u_t = (e_{t−n}, …, e_{t−1}) hierarchically. The Dirichlet process prior in Eq 2 can be modified to give a distance-dependent hierarchical Dirichlet process (ddHDP) prior:

k_t ∼ G_{u_t},  G_{u_t} ∼ ddHDP(α_{|u_t|}, λ_{|u_t|}, G_{π(u_t)}, D) (3)

where π(u_t) is the suffix of the context u_t containing all but the earliest event. Both the strength and decay constant are functions of the length |u_t| of the context. Crucially, instead of an uninformed base distribution, we have G_{π(u_t)}, the vector of probabilities of the current events given all but the earliest event in the context. Of course, G_{π(u_t)} also has to be estimated. We do this recursively, by applying Eq 3 to the distribution given the shallower context. We continue the recursion until the context is ‘emptied out’ and the probability of the next response given no previous elements is a DP with an uninformed base distribution. This hierarchical procedure ensures that context information is weighted proportionally to predictive power. Consider an example where participants are always instructed to respond ‘red’ after having seen ‘green—yellow’. Then, irrespective of the elements preceding ‘green’, the response should be unchanged. Now suppose that a novel element ‘cyan’ is inserted and the participant’s full context contains ‘cyan—green—yellow’. According to Eq 3, the probability of the next response being ‘red’ given the full context ‘cyan—green—yellow’ will depend on the probability of ‘red’ given the shallower context of the two previous elements ‘green—yellow’—the actually predictive context. The earliest element in the context—’cyan’—is redundant to prediction and the probability distribution over the next response given the longer context will strongly resemble the probability distribution given the shorter, useful context. Note that we use the completely novel element ‘cyan’ to illustrate the extreme case of an unpredictive element that should be ignored. However, the principle applies to weakly predictive contexts that should be proportionally down-weighted.

We represent the HDP with the distance-dependent hierarchical Chinese restaurant process (HCRP) (Fig 3c). Here, we have Chinese restaurants on N+1 levels, each level modeling the probability distribution over responses given a context of n previous events. At each level n, we have a potentially unbounded number of restaurants identified by a context u, meaning that a customer k_t can only ever access the restaurant if u_t is part of their context in the data. The ‘reliance’ of deeper contexts on shallower ones is realised by a back-off procedure (C Algorithm in S1 Appendix). A customer visits the topmost level first, particularly the restaurant corresponding to its context of length N in the data. With probabilities proportional to the recency of customers and α_N, the customer either joins previous customers or opens up a new table, creating a new cluster. In the latter case, the customer also ‘backs off’ to the restaurant below, identified with a context that is one event shallower, where the same seating evaluation takes place. This may be repeated until the customer reaches the level of empty context. This induces a process where the most predictive segment of a long context will contribute the most to the prediction of the next key press k_t, and superfluous context is organically overlooked. The most likely next k, given the context, is the k that the participant recently chose in the same context or a shallower segment of it.
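
The back-off recursion can be sketched as below. This is our simplification of the full seating-arrangement bookkeeping: it interpolates exponentially decayed context-specific counts with the prediction given the suffix context, with per-level parameter lists alpha and lam indexed by context length:

    import numpy as np

    def hcrp_predictive(k, history, context, alpha, lam, n_keys=4):
        """p(k | context): decayed counts of k following `context` are
        smoothed by the prediction given pi(u), the context minus its
        earliest element; the empty context backs off to the uniform H."""
        depth = len(context)
        if depth == 0:
            base = 1.0 / n_keys                     # uniform base at the root
        else:
            base = hcrp_predictive(k, history, context[1:], alpha, lam, n_keys)
        w_k, w_all = 0.0, 0.0
        for t in range(depth, len(history)):
            if tuple(history[t - depth:t]) == tuple(context):
                w = np.exp(-lam[depth] * (len(history) - t))  # decayed 'customer'
                w_all += w
                if history[t] == k:
                    w_k += w
        return (w_k + alpha[depth] * base) / (w_all + alpha[depth])

    # e.g., p(next key = 2 | two previous events), with levels 0, 1, and 2:
    # hcrp_predictive(2, seq, tuple(seq[-2:]), alpha=[1.0] * 3, lam=[0.0, 0.1, 0.2])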

A relevant property of the HCRP is that the ‘rich gets richer’ effect will generalise across contexts that share the same suffix. For instance, in case of many recent observations of the response ‘red’ in the context ‘green—yellow’, the longer, previously seen context ‘green—green—yellow’ and even a previously unseen context ‘cyan—green—yellow’ will also be associated with increased likelihoods of ‘red’. This is because ‘red’ is likely under the common suffix of both contexts. This property is desirable for the prediction of behavior, as individuals are expected to generalize their chunk inventory to previously unseen wider contexts [24, 25].

The α and λ parameters control memory over two timescales; the former controls short-term memory by tracking the use of the current context and the latter controls long-term memory by determining the longevity of previous chunks of context and response. Short-term memory acts as activated long-term memory [26]. That is, the context of a few previous events is a pointer to responses made in the same context, stored in long-term memory. Since the α and λ parameters are specific to the hierarchy levels and are inferred from the data, the sequence learning problem is cast as learning a prior over ‘What should I remember?’, as in other approaches to representation learning [27].

To a learner that knows the alternating structure and the current state of the ASRT, any context longer than two previous events is superfluous, due to the second-order dependencies (Fig 1c). However, if the learner is agnostic to the alternating structure (Fig 1d) then no context can be deemed superfluous, as longer contexts enable the implicit tracking of the sequence state. Indeed, pure sequence prediction performance increases with the number of levels in the HCRP hierarchy and with lower values of α (S2 Fig). Similarly, long-term memory is beneficial in the training sessions, as it allows for a better estimation of the stationary chunk distribution and provides resistance to local statistical fluctuations. However, human learners are solving the task under resource constraints, motivating them to increase the complexity of their representations only to a level that enables good enough performance [28]. Within our framework, a parsimonious sequence representation is ‘carved out’ by learning to ignore dispensable context and enhancing the memory of previous observations in the necessary context.

1.5 Parameter fitting

Given the sequence presented to the participants, their responses, and response times, we are interested in finding the parameter values of the low-level effects and the internal sequence model that most likely generated the behavior. We assumed that the likelihood of the log response times was a Gaussian distribution whose mean was the log response time predicted by the full model. We performed approximate Bayesian computation (ABC) to approximate the maximum a posteriori values of the parameters of interest:

argmax_{θ,ρ} P(θ, ρ | e, k, τ, σ) (4)

where θ is the parameter vector of the HCRP, comprising the strength parameters α and forgetting rate parameters λ; ρ is the vector of response parameters, including the weights of the low-level effects, the weight of the HCRP prediction, and the response noise; e is the sequence of events (mapping onto required responses); k is the sequence of key presses (actual responses); τ is the sequence of response times; and σ is the Gaussian noise of the response times.

As a first step of our ABC procedure, we parsed e and k chronologically, such that the probability of a key press at t was influenced by observations up to t − 1, modeling sequential information accumulation (A Algorithm in S1 Appendix). At each time step, the HCRP operated as both the generative and recognition model, based on the same hierarchical back-off scheme. On trial t, we evaluated the probability of seating a new customer to a table serving dish k_t, according to the generative process (B Algorithm in S1 Appendix). This corresponded to computing the predictive probability p(k_t) of the participant’s response. Then, the seating arrangement was updated by seating a customer to a table serving the dish e_t, according to the recognition process (C Algorithm in S1 Appendix). This corresponded to updating the participant’s internal model with the event e_t, that is, the required response (which, in the case of erroneous responses, was different from the actual response k_t). As such, we modeled the generative process of the actual responses k as a function of having learnt the required responses e. Note that in the parsing procedure, the seating arrangement was only updated with the current customer—backtracking (i.e. re-seating old customers) was not possible. This models online, trial-by-trial learning. We parsed the sequence five times to generate five seating arrangements. p(k_t) was averaged over the five seating arrangement samples (yielding a relatively high-precision estimate of p(k_t); see A Fig in S2 Appendix).

The log predictive probability log p(k_t) of the actual response was assumed to be linearly related to τ, with higher surprise causing slower responses [16]. We mapped log p(k_t) to τ_predicted using the response parameters ρ. Then we computed the Gaussian densities p(τ|τ_predicted, σ). The goal was to find θ and ρ that maximize the product of these densities, that is, the likelihood of the measured response latencies to the sequence elements. In order to approximate θ and ρ that maximize this likelihood, we performed random search in the space of θ for 1000 iterations (for the convergence of the random search procedure, see A Text in S2 Appendix). In each iteration, we fitted ρ using OLS (thus, ρ was a deterministic function of θ). We repeated the search procedure 10 times, yielding 10 samples from the posterior distribution, and the θ associated with the highest likelihood of participants’ responses was chosen as the MAP estimate of the hyperparameters.
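
The overall fitting loop might be sketched as follows, assuming a hypothetical helper design_matrix(theta) that parses the sequence with an HCRP parameterized by theta and returns the per-trial surprise alongside the low-level regressors:

    import numpy as np

    def fit_session(rts, sample_theta, design_matrix, n_iter=1000):
        """Random search over the HCRP hyperparameters theta; for each
        candidate, the response parameters rho are fitted by OLS and the
        Gaussian log likelihood of the measured log RTs is scored."""
        y = np.log(rts)
        best = (-np.inf, None, None)
        for _ in range(n_iter):
            theta = sample_theta()                       # draw from the prior
            X = design_matrix(theta)                     # surprise + low-level effects
            rho, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit of rho given theta
            resid = y - X @ rho
            sigma = resid.std()
            loglik = -len(y) * np.log(sigma) - 0.5 * np.sum((resid / sigma) ** 2)
            if loglik > best[0]:
                best = (loglik, theta, rho)
        return best                                      # approximate MAP estimate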

We reran the ABC on consecutive data bins (sessions or within-session epochs when higher resolution is justified) to track potential shifts in the posteriors over practice. In the first five bins (five epochs of session 1), the prior was uninformed (S1 Table, left). In each successive bin, the prior was informed by both the fitted hyperparameters and learned parameters from the previous bins. For the hyperparameters θ, we used a Gaussian prior with a mean of the MAP values of the hyperparameters from the previous bin and a fixed variance, truncated at the boundaries of the uniform prior for session 1 (S1 Table, right). The learned parameter values, that is, the seating arrangements S accumulated across all previous bins, were carried over. The ‘heredity’ of the seating arrangements modeled continual learning. Nevertheless, changes in θ caused the same sequence information to be weighted differently. For instance, if λ decreased, old instances of chunks that were previously uninfluential became more influential.

For later model evaluation, we held out the middle segment of reaction time data from a session (the central 255 trials) or epoch (the central 85 trials). Middle segments were chosen in order to ensure more representative test data, as the beginning and end of a session can be affected by warm-up [29] and fatigue [30] effects, respectively. The HCRP parsed the entire e in order to contain sequence knowledge that could explain τ on later segments. But, importantly, the middle segment of τ was not used for computing the posterior probabilities of θ and ρ. Therefore, all predictions reflected all previous observations and only predictions of the training data were used to optimise the hyperparameters. Holding out the responses but not the observations, instead of using completely held-out data as is typical in machine learning, was essential to provide the model with the right context for predicting behavior without contaminating it with test behavior.
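
The held-out split might be implemented as below (a sketch: the model still parses all events, but the central RTs are scored only at evaluation):

    def split_middle(n_trials, test_len):
        """Return train and held-out (central) trial indices, e.g.,
        test_len = 255 for a session or 85 for an epoch."""
        start = (n_trials - test_len) // 2
        test = list(range(start, start + test_len))
        train = [t for t in range(n_trials) if t < start or t >= start + test_len]
        return train, test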

2 Results

2.1 Practice-related changes in the low-level response model

In Fig 4 we show fitted values of both the response parameters ρ and the internal sequence model hyperparameters θ. In general, responses were faster if they were repetitions of the previous response and if they were erroneous (Fig 4a, negative coefficients for ‘repetition’ and ‘error’). On the other hand, responses slowed as a function of the spatial distance from the previous response location, and post-error slowing was observed (positive coefficients for ‘spatial distance’ and ‘post-error’). Since the sequence prediction coefficient (i.e. the weight of the HCRP) expresses the effect of surprise and the sequence was not completely predictable, the coefficient was generally above zero.

Fig 4. Fitted parameter values shown session by session, and at a subsession resolution in the initial and final sessions.


The grey band on the bottom of each plot shows the sequence that participants practiced: the old sequence in sessions 1–8 (dark grey), the new sequence in session 9 (light grey), and both sequences alternately in session 10. Point distance in (a) and cell width in (b) are proportional to data bin size—we fitted the model to 5 epochs within sessions 1, 9, and 10 to assess potentially fast shifts. (a) Fitted values of the response parameters in units of τ [ms]. The error bars indicate the 95% CI for the between-subjects mean. (b) Fitted values of the strength α (left) and forgetting rate λ (middle) parameters are shown, as well as their joint effect on prediction (right). A context of n previous events corresponds to level n in the HCRP. Lower values of α and λ imply a greater contribution from the context to the prediction of behavior. The context gain for context length n is the decrease in the KL divergence between the predictive distribution of the complete model and a partial model upon considering n previous elements, compared to considering only n − 1 previous elements. Note that the scale of the context gain is reversed and higher values signify more gain.

Parameter dynamics were already evident in the first session. To test practice-related changes, we conducted repeated measures ANOVAs with practice time unit (epochs or sessions) as within-subject predictor, allowing random intercepts for the participants. In the first session, while responses became faster in general (p < .01), pre-error speeding (p < .001) and post-error slowing were attenuated (p < .01). All three temporal trends persisted during sessions 2–8 (p < .001; p < .001; p < .01). At the same time, repetition facilitation became attenuated (p < .001) and the effect of the prediction from the internal sequence model (the HCRP) was increased (p < .001). Compared with the last training session, in the first epoch of the interference session (session 9), participants slowed down (p < .001) and the predictions from the HCRP were less expressed in their responses (p < .001).

2.2 Practice-related changes in the hyperparameters of the internal sequence model

The HCRP hyperparameters guide both learning and inference. Thus, the fitted HCRP hyperparameters reflect how participants use previous sequence knowledge as well as how they incorporate new evidence.

Remember that both α and λ (Fig 4b, left and middle) determine the influence of a given context, by down-weighting shallower contexts or by reducing the decay of old observations in the same context, respectively. In order to visualize the joint effect of the two parameters, we computed the KL divergence between the predictive probability distribution given the whole context and shallower contexts. A lower KL divergence indicated that the context weighed strongly into the overall prediction. Then, we computed the degree to which the KL divergence is reduced by adding more context, and averaged it across all trials and contexts of a given length. The resulting values reflect the average gain from the context windows to the prediction of the response (Fig 4b, right).
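
Concretely, the gain for context length n could be computed as in the sketch below, where pred_by_depth[d] denotes the predictive distribution over the four keys when only d previous elements are considered (the indexing convention is ours):

    import numpy as np

    def kl(p, q):
        """KL divergence; well defined here since the smoothed predictive
        distributions have full support."""
        p, q = np.asarray(p), np.asarray(q)
        return float(np.sum(p * np.log(p / q)))

    def context_gain(pred_by_depth, n):
        """Drop in KL divergence from the full (deepest-context) prediction
        when the partial model is extended from n-1 to n previous elements."""
        full = pred_by_depth[-1]
        return kl(full, pred_by_depth[n - 1]) - kl(full, pred_by_depth[n])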

During the first session, α_1 increased (p < .001), suggesting that the first-order dependency of the response on the one previous event was reduced (Fig 4b, left and right). From session 2 to 8, both λ_1 and λ_2 decreased prominently (p < .001 and p < .01). This suggests that participants adaptively increased their memory capacity not just for first-order but also second-order sequence information. Storing more instances of their previous responses following two events allowed them to be more robust against local statistical fluctuations due to the random nature of the task. That is, their behavior became gradually less influenced by the most recent trigrams and better reflected the true trigram statistics. Even though third-order sequence knowledge (i.e. conditioning the response on three previous events) would further improve performance, λ_3 did not decrease during training sessions 2 to 8 (p = .376). This suggests that participants carved out the minimal context that is predictive and learned to remember previous instances in these contexts, ignoring deeper contexts that would give diminishing returns.

During session 9, when 75% of the sequence was changed, participants resisted the interference by further enhancing the trigram statistics that they had accumulated in earlier sessions, as reflected by a further decrease in λ_2 (p < .01). With such mild trigram forgetting, the internal sequence model remains dominated by data from the old sequence throughout sessions 9 and 10. At the same time, the sequence prediction coefficient was reduced sharply from session 8 to 9 (p < .001), indicating that the internal sequence model was not governing the responses as strongly anymore when it became largely outdated. This implies that the internal sequence model accounted for less variability in the RTs during the interference sessions than the late training sessions.

2.3 Correlation of the sequence model hyperparameters and working memory test scores

α and λ capture how sequence memory is used. We conducted an exploratory analysis (Fig 5A and 5B) in order to relate the values of the HCRP hyperparameters that were inferred on the sequence learning task to participants’ performance on independent memory tests. Three tests were conducted prior to the sequence learning experiment: the digit span test to measure verbal working memory (WM); the Corsi blocks test to measure visuo-spatial WM; and the counting span test to measure complex WM. The former two, ‘simple’ WM tasks require the storage of items, while the latter complex WM task requires storage and processing at the same time.

Fig 5. Correlation between the fitted HCRP parameters and working memory.


(a)(b) Pearson correlation matrices of the working memory test scores and the strength parameters α and decay parameters λ of the HCRP model, respectively. Correlations that met the significance criterion of p < .05 are marked with black boxes. (c)(d) Scatter plots of the correlations that met the significance criterion of p < .05. Bands represent the 95% CI.

We found that the complex WM score was negatively correlated with α_3 (r = −.50, p = .01) such that higher complex WM was related to more reliance on longer contexts for prediction (Fig 5c). The spatial WM score was related to λ_3 (r = −.47, p = .02), reflecting that a higher spatial WM capacity allowed for better retention of previous responses in longer contexts (Fig 5d). Verbal WM was not related to any of the hyperparameters (all ps > .05), probably due to the fact that the sequence learning task itself was in the visuo-spatial and not the verbal domain. If we control for all 24 comparisons presented in the paper, the two significant correlations do not survive the Benjamini-Hochberg correction (their adjusted p values increase to .09 and .28, respectively). However, we note that this correction is too harsh, as we did not have prior hypotheses about all of these 24 relationships. In fact, it was expected that the hyperparameters governing the role of very short contexts are not related to WM, as the 1–2 item WM capacity variance is expected to be extremely low in healthy adults. However, we included all comparisons for completeness.

2.4 Trial-by-trial prediction of response times

Using the fitted HCRP parameter values for each participant and session or epoch, we generated predicted RTs and evaluated the goodness of fit of our model by computing the coefficient of determination (r²) between the predicted and measured RTs on held-out test segments. Fig 6a and 6b show how the predictions from the internal sequence model, as well as other, low-level effects jointly determine the predicted RTs on the first and the seventh training sessions of an example participant, respectively. In the first session (Fig 6a, Top), the predicted RTs (red line) were only determined by a slight effect of the spatial distance between subsequent events and of errors (pale green and purple bars). The internal sequence model was insufficiently mature to contribute to the responses yet. By session 7 (Fig 6b, Top), the sequence prediction effect (pale red bars) became the most prominent. Responses that previously were highly contingent on some part of the deepening event context were faster. This came from a well-developed internal sequence model whose predictions became more certain and more aligned to the sequence of events (Fig 6b, Middle). By virtue of the HCRP, the depth of the substantially predictive context changed trial by trial (Fig 6a, Bottom). On high-probability trigram trials (marked by ticks) a context of two previous events was used for prediction, whereas on other trials only one previous event had substantial weight. Overall, this participant’s responses in this late stage of learning became more influenced by the internal sequence predictions than the effects of spatial distance, error, and response repetition. On average across participants, the fraction of response time variance accounted for by the internal sequence prediction increased monotonically from session 1 to 7; it plateaued by session 8 and reduced in the interference sessions 9 and 10 (Fig 6c).

Fig 6. Trial-by-trial predictive check.


In (a) and (b) we show example segments of held-out data from sessions 1 and 7 of participant 102. (Top) Colored bars show the positive (slowing) and negative (speeding) effects predicted by the different components in our model relative to the intercept (horizontal black line). The overall predicted RT value (red line) is the sum of all effects. The color code of the event and the response are shown on the bottom. A mismatch between the two indicates an error. (Middle) Predictive probabilities of the four responses are shown for each trial. The cells’ hue indicates the response identity; saturation indicates the probability value. The sequence prediction effect (pale red bar in (Top)) is inversely proportional to the probability of the response, i.e. a higher probability yields a faster response. The ticks at the bottom indicate high-probability trigram trials. (Bottom) We show what proportion of the predictive probability comes from each context length. Higher saturation indicates a larger weight for a context length. (c) Test prediction performance of the full model and each model component in terms of unique variance explained, averaged across participants. Bands represent the 95% CI.

2.5 The internal sequence model predicts second-order effects during learning and interference

Our model accounted for participants’ sequence learning largely by enhancing the memory for trigrams of two predictive elements and a consequent response. This suggested that the HCRP, by virtue of its adaptive complexity, boiled down asymptotically to a stationary second-order sequence model (i.e. trigram model) with deterministic (d) and two sorts of random (r) trials, those following the deterministic scheme (random high; rH), and those not (random low; rL). Therefore, we tested how well calibrated the HCRP was to the second-order structure by analyzing the predicted RT differences on the held-out data, contingent on the sequence state and trigram probability. This follows the sort of descriptive analyses conventionally conducted in ASRT studies (e.g., [31]). We conducted two-way repeated measures ANOVAs with time unit (session or epoch) and trial type (state: d/r or P(trigram) in r states: rH/rL) as within-subject factors and with the measured or predicted RTs as outcome variable.

During training sessions 1–8, participants gradually became faster for d than r trials, as well as for rH than rL trials. These divergence patterns were matched by the HCRP predictions (Fig 7a and 7b; significant session*trial type interactions in Table 2). In the interference sessions 9 and 10, we tested the relationship between the RTs and the trigram probabilities for both old and new sequences. In order to study the effect of the old and new sequence statistics separately, we only included non-overlapping H trials (trials that are H in the old trigram model and L in the new one and vice versa) and contrasted them with overlapping L trials (trials that are L in both the old and new trigram models). In these analyses, we only consider r trials but we drop the r from the trial type notation for brevity.

Fig 7. Calibration of the HCRP model.


(a) RTs predicted by our HCRP model are shown against measured RTs for d versus r trials on held-out test data. (b) Same as (a) for rH versus rL trials. The two dashed lines mark the mean RTs for d and rH trials in session 8. The RT advantage of d over rH by session 8 marks (> 2)-order sequence learning. (c-d) rH versus rL trials are labelled according to the old trigram model (i.e. old sequence) or the new trigram model (i.e. new sequence). The grey band on the bottom shows the sequence that participants practiced: the old sequence in sessions 1–8 (dark grey), the new sequence in session 9 (light grey) and alternating the two sequences in session 10. (e) (> 2)-order sequence learning, quantified as the standardized RT difference between rH and d trials, shown for measured and predicted RTs. In session 1, rH trials are more expected because they reoccur sooner on average. By session 8, d trials are more expected because they are more predictable, given a > 2 context. This was predicted by the HCRP but not the trigram model. (f) Correlation of the measured and predicted (> 2)-order effect in session 1 and session 8. (g) Average predictive performance of the HCRP and the trigram models. (a-g) The error bands and bars represent the 95% CI.

Table 2. Repeated measures ANOVAs in sessions 1–8.

In the left set of columns, the trial type is defined as the state and in the right set of columns it is defined as P(trigram).

             | effect             | B [ms] | F      | p     | effect               | B [ms] | F      | p
measured RT  | session            | -10.11 | 133.38 | <.001 | session              | -9.33  | 97.81  | <.001
             | state              | -1.99  | 116.01 | <.001 | P(trigram)           | -13.68 | 272.66 | <.001
             | session*state      | -3.63  | 29.07  | <.001 | session*P(trigram)   | -2.72  | 9.25   | <.001
predicted RT | session            | -11.06 | 216.68 | <.001 | session              | -10.19 | 203.70 | <.001
             | state              | -3.16  | 171.44 | <.001 | P(trigram)           | -7.02  | 229.83 | <.001
             | session*state      | -2.76  | 44.92  | <.001 | session*P(trigram)   | -3.01  | 26.44  | <.001

In session 9, the effect of the old sequence statistics progressively waned, as evidenced by the coalescence of the curves for H_old and L_old trials (Fig 7c; H_old: diamond, L_old: circle). This temporal pattern reflecting unlearning was significant for the RTs predicted by our model, but, being noisier, was only a trend for the measured RTs (Table 3, top left). The gradual divergence pattern of H_new and L_new typically seen in naïve participants was not significant for the measured RTs, contrary to the clear relearning pattern predicted by our model (Fig 7d; Table 3, top right). Nevertheless, a slight overall speed advantage of H_new over L_new was also significant for the measured RTs, confirming overall learning in spite of the noisy learning curves. Indeed, by the last epoch of session 9, the mean speed advantage of H_new over L_new was not significantly lower than that of H_old over L_old (8.34 ms versus 13.91 ms, t = 1.42, p = .166), suggesting substantial learning but also resistance to interference. Surprisingly, the speed advantage of H_old in the last epoch of session 9 was positively correlated with that of H_new (r = .73, p < .001). That is, the more efficient participants were at acquiring the old sequence, the better they learned the new one, pointing to a common factor behind the parallel learning of new information and the maintenance of old information. Our HCRP model could not account for this parallel process, because the forgetting mechanism inherently traded off the retention of old statistics against adapting to new statistics (r = .16, p = .438).

Table 3. Repeated measures ANOVAs in sessions 9 and 10.

In the left set of columns, the trial type is defined as P_old(trigram) and in the right set of columns it is defined as P_new(trigram).

             | effect                | F      | p     | effect                | F      | p
session 9
measured RT  | epoch                 | .63    | .640  | epoch                 | .54    | .705
             | P_old(trigram)        | 91.48  | <.001 | P_new(trigram)        | 5.58   | .023
             | epoch*P_old(trigram)  | 2.23   | .071  | epoch*P_new(trigram)  | 1.02   | .396
predicted RT | epoch                 | 3.65   | .008  | epoch                 | 3.17   | .016
             | P_old(trigram)        | 121.31 | <.001 | P_new(trigram)        | 98.38  | <.001
             | epoch*P_old(trigram)  | 7.16   | <.001 | epoch*P_new(trigram)  | 5.09   | <.001
session 10
measured RT  | epoch                 | 2.29   | .085  | epoch                 | 2.50   | .036
             | P_old(trigram)        | 122.57 | <.001 | P_new(trigram)        | 21.71  | .002
             | epoch*P_old(trigram)  | 1.49   | .223  | epoch*P_new(trigram)  | 1.96   | .545
predicted RT | epoch                 | 18.45  | <.001 | epoch                 | 15.00  | <.001
             | P_old(trigram)        | 207.71 | <.001 | P_new(trigram)        | 152.93 | <.001
             | epoch*P_old(trigram)  | 5.13   | .003  | epoch*P_new(trigram)  | 1.31   | .276

Due to the resistance to interference, the old trigram statistics were reactivated upon experiencing the old sequence. In the first epoch of session 10, participants’ behavior was significantly influenced by the old sequence statistics, albeit to a lesser degree than prior to interference, in session 8 (mean RT difference between H and L was 32.87 ms and 21.43 ms, respectively; time unit*P_old(trigram) interaction: p = .045), and the amount of forgetting was closely estimated by our model (28.61 ms and 18.08 ms, respectively; p < .001). Throughout the four alternating epochs of session 10, the measured RTs reflected the parallel maintenance of the two sequence representations, as the main effects of both P_old(trigram) and P_new(trigram) were significant (Table 3, bottom). The two trigram effects were not temporally modulated in the measured RTs—overall, there was no change in the influence of the old versus new sequence statistics due to the alternation between the old and new sequences. Whereas our model could account for the main trigram effects, it could not account for their joint temporal stability. Since the maintenance of the new trigram statistics was reflected in the measured RTs in epoch 2, our HCRP model assumed that this new knowledge traded off against knowledge of the old sequence. Therefore, it incorrectly predicted weaker expression of the old sequence statistics in the epochs where the new sequence was practiced (notice the zig-zag pattern in the red lines in Fig 7c; epoch*P_old(trigram) interaction in Table 3). The participants may have been able to employ meta-learning and invoke either the representation of the old or the new sequence adaptively. Such a process could be captured by two independent HCRP sequence models and an arbiter that controls the influence of each, potentially based on their posterior probability. However, a perfect account of the resistance to interference and the parallel learning of two sequences is beyond the scope of the current paper.

2.6 The internal sequence model predicts (> 2)-order effects

Overall, the HCRP model captured the gradual emergence of second-order sequence knowledge and its resistance to interference. During sessions 1–8, it even captured higher-order effects, despite the fact that these are much weaker on average and more variable across participants. To assess this, we quantified a (> 2)-order effect as the (normalized) RT difference between rH and d trials, as is conventional in ASRT studies (e.g., [11, 32]); a sketch of this score is given below. The reason is that d trials are constrained by the sequence phase, thereby respecting the (> 2)-order dependencies of the sequence, whereas the rH trials are not constrained by the sequence phase and carry no (> 2)-order dependencies. Participants showed a reversal in the (> 2)-order effect: they were faster on rH than on d trials in session 1 but became faster on d than on rH trials by session 8 (session*trial type interaction: p = .007). This reversal could not be explained by a stationary trigram model, which is, by design, agnostic to (> 2)-order dependencies and recency (p = .939; Fig 7e), but could be explained by the HCRP (p = .025).
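
For concreteness, here is a minimal sketch of this conventional score. Treating the normalization as division by the participant's mean RT is our assumption for illustration; conventions vary across ASRT studies.

```python
import numpy as np

def higher_order_effect(rt, trial_type):
    """(>2)-order effect: normalized RT difference between random-high
    ('rH') and deterministic ('d') trials; positive values mean faster
    responses on d than on rH trials."""
    rt = np.asarray(rt, dtype=float)
    trial_type = np.asarray(trial_type)
    diff = rt[trial_type == 'rH'].mean() - rt[trial_type == 'd'].mean()
    return diff / rt.mean()  # normalization by mean RT is our assumption
```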

The explanation based on the HCRP depends on the change in trigram forgetting. Since the rH trials are not locked to the sequence phase, their average reoccurrence distance is shorter than that of the d trials. In other words, if a trigram reoccurs in an rH trial, it tends to reoccur sooner than it would have in a d trial, at the appropriate sequence phase (S3 Fig). Therefore, stronger forgetting induced a slight speed advantage for rH trials in session 1, although this effect was not significant across the whole sample either on the measured RTs or on those predicted by our model (p = .084 and p = .199, respectively). Individual differences in the initial recency bias were explained by the HCRP (r = .620, p < .001; Fig 7f, left). By session 8, trigram forgetting was reduced and the 4-gram statistics had a slight effect on behavior, as expressed by the advantage of d over rH trials. Due to heterogeneity among the participants, neither the measured nor the predicted average (> 2)-order effect was significantly different from zero in session 8 (p = .080 and p = .076, respectively). As in session 1, the individual variability in the (> 2)-order effect was captured by the HCRP (r = .766, p < .001; Fig 7f, right).

Nevertheless, across all sessions, the (> 2)-order effect was rather small compared to the second-order effect, even though learning the (> 2)-order dependencies allowed for more certain predictions. This can be viewed as resource-rational regularisation of participants’ internal sequence model. Therefore, overall, the HCRP approximated a stationary trigram model and these two models explained a similar total amount of variance in the RTs (Fig 7g).

2.7 Predicting the response time of errors

So far, we accounted for general pre-error speeding and post-error slowing in the linear response model. By doing so, we controlled for factors other than sequence prediction that influence error latency, for instance, a transiently lower evidence threshold (see e.g., [33]). Sequence prediction influences the latency of errors by providing what amounts to prior evidence for anticipated responses. As such, errors reflecting sequence prediction are predicted to be even faster than errors arising from other sources (e.g., within-trial noise in the evidence accumulation rate, not modeled here). To assess this, we categorised participants' errors based on the sequence prediction they reflected; the labelling rules are sketched in code below. 'Pattern errors' are responses that are consistent with the second-order dependencies, i.e. the global statistics of the task, instead of the actual stimulus. For instance, the following scenario was labelled as a pattern error: 'red'-X-'blue' was a second-order pair in the task, and on a random trial that was preceded by 'red' two time steps earlier, the participant incorrectly responded 'blue' when the event was 'yellow'. 'Recency errors' are repetitions of the most recent response given in the same context of two previous elements, i.e. they reflect sensitivity to local trigram statistics. Only responses that did not qualify as pattern errors could be labelled as recency errors. For instance, the following scenario was labelled as a recency error: on the most recent occurrence of the context 'red'-'red', the participant responded 'yellow'; in the current occurrence of that context, the participant incorrectly responded 'yellow' when the event was 'green'. Errors that fell into neither of these categories were labelled as 'other'. We only analyzed errors that were made on low-probability trigram trials. The proportion of pattern errors gradually increased due to learning, while the proportions of recency errors and other errors were reduced (Fig 8).
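
The labelling rules can be summarized in a few lines. This sketch assumes hypothetical inputs: `pattern_map`, mapping each element to its deterministic second-order successor, and `last_response`, the response given on the most recent occurrence of the same two-element context.

```python
def classify_error(context, response, last_response, pattern_map):
    """Label an incorrect response on a low-probability trigram trial.
    `context` holds the two previous events (e1, e2)."""
    first_element, _ = context
    if pattern_map.get(first_element) == response:
        return 'pattern'   # consistent with the global second-order pair
    if response == last_response:
        return 'recency'   # repeats the last response given in this context
    return 'other'
```

Note that the pattern check precedes the recency check, mirroring the rule that only responses not qualifying as pattern errors could be labelled as recency errors.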

Fig 8. Proportions of errors of different types across training sessions.


Paired t-tests revealed that the measured RTs were faster for pattern errors than for other errors (RTother − RTpattern = 27.55 ms, t = 8.58, p < .001), suggesting that expectations based on the global trigram statistics indeed contributed to making errors. Similarly, participants were faster to commit recency errors than other errors (RTother − RTrecency = 11.47 ms, t = 2.38, p = .025) (Fig 9). This indicated that participants' errors were also influenced by local statistics, although to a significantly lesser degree (t = 2.76, p = .011). While a stationary trigram model was able to explain the pattern error RTs (RTother − RTpattern = 25.15 ms, t = 11.04, p < .001), it lacked the distance-dependent inference mechanism that could explain the recency error RTs (RTother − RTrecency = -1.14 ms, t = -1.18, p = .249).

Fig 9. Predicting the latency of errors.


(a) Pattern errors. (b) Recency errors. In the case of HCRPf, the hyperparameter priors were adjusted to express more forgetfulness. The error bars represent the 95% CI.

The HCRP model correctly predicted fast pattern errors (RTother − RTpattern = 21.32 ms, t = 20.91, p < .001), but underestimated the speed of recency errors (RTother − RTrecency = 4.67 ms, t = 3.91, p < .001, Fig 9). The reason was that the HCRP fit the data by reducing trigram forgetting, explaining participants' overall robustness to the recent history of trigrams (Fig 4). However, as our error RT analysis revealed, sensitivity to recent history was more pronounced on error trials.

Explaining why error responses were more influenced by recent history than correct responses is beyond the scope of this paper. However, note that our HCRP model does have the flexibility to explain the effect in isolation. Therefore, we fitted the HCRP to the same data but using a different prior for the forgetting hyperparameters, thus projecting the model into a more forgetful regime (priors defined in S1 Table; posteriors shown in S4 Fig). As shown in Fig 9, the HCRP with a stronger forgetting prior, HCRPf, was able to explain the degree to which error responses were influenced by global and local trigram statistics, as the speed advantages of pattern errors (RTother − RTpattern = 21.79 ms, t = 11.10, p < .001) and recency errors (RTother − RTrecency = 10.92 ms, t = 6.33, p < .001) were predicted more accurately on average. In sum, while a less forgetful higher-order sequence learning model accounted best for participants' RTs due to their general robustness to local noise, a more forgetful model accounted for the slight sensitivity to local noise that was expressed in the speed of error responses.
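
As a sketch of how such a restricted prior could be constructed, the snippet below builds a truncated Gaussian of the kind described in S1 Table; the numerical values are placeholders, not the values actually used in the fits.

```python
from scipy.stats import truncnorm

def truncated_gaussian_prior(mean, sd, lo, hi):
    """Truncated Gaussian prior on [lo, hi]; when chaining sessions, `mean`
    would be the previous session's MAP estimate (cf. S1 Table)."""
    a, b = (lo - mean) / sd, (hi - mean) / sd  # standardized bounds
    return truncnorm(a, b, loc=mean, scale=sd)

# Hypothetical forgetful regime: confine lambda to a narrowed interval.
forgetful_lambda_prior = truncated_gaussian_prior(mean=0.3, sd=0.1, lo=0.0, hi=0.6)
draws = forgetful_lambda_prior.rvs(size=1000)  # all draws respect the bounds
```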

In order to elucidate the relationship between the inferred seating arrangements in the HCRPf model and participants' errors, we computed the average seating odds for each level across the three error types (the same measure as in Fig 6a and 6b, bottom). As shown in Fig 10, the weight of level 2 in the HCRPf, that is, the recency of trigrams, differed among the three error types (F(2, 72) = 39.90, p < .001). The weight of level 2 was stronger for recency errors than for other errors (t = -2.72, p = .009), and stronger still for pattern errors than for recency errors (t = -5.61, p < .001). The overall difference in the weight of level 0, that is, the recency of unigrams (F(2, 72) = 20.94, p < .001), was driven by the opposite trend. The weight of the unigram observations was stronger when participants committed other errors than recency errors (t = 2.35, p = .02), and stronger for recency errors than for pattern errors (t = 3.89, p < .001). There were significant differences among the error types in the weights of levels 1 and 3 as well, though these were more modest and not three-way (F(2, 72) = 10.26, p < .001; F(2, 72) = 8.02, p < .001, respectively). These results demonstrate the straightforward relationship between the learned parameters of the HCRPf, i.e. the proportions of observations contingent on successively deeper contexts, and the types of errors that participants made.
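
A sketch of this comparison is given below, assuming the per-trial level weights (seating odds) have been extracted from the fitted HCRPf into an array; scipy's one-way ANOVA stands in here for the repeated-measures analysis actually reported.

```python
import numpy as np
from scipy.stats import f_oneway

def compare_level_weights(weights, labels, level):
    """One-way ANOVA of one HCRP level's per-trial seating odds across the
    three error types. `weights`: (n_error_trials, n_levels) array;
    `labels`: array of 'pattern' / 'recency' / 'other' strings."""
    labels = np.asarray(labels)
    groups = [weights[labels == lab, level]
              for lab in ('pattern', 'recency', 'other')]
    return f_oneway(*groups)  # returns the F statistic and p value
```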

Fig 10. The HCRPf weighting of context depth is different among errors of different types.


(Left) Weights of all HCRPf levels. (Right) Zoomed in on HCRPf level 3 only.

2.8 Alternative models

The principled back-off mechanism of the HCRP model was essential to account for participants' smoothing behavior, namely, that they flexibly combined higher- and lower-order information for prediction. The ablated alternative models listed in A Table in S2 Appendix, even those containing higher-order sequence information, fell short of explaining how participants combined higher-order information locally while also maintaining stable knowledge of the global chunk statistics (D Fig, E Fig and F Fig in S2 Appendix).

3 Discussion

In this study, we elucidated how humans come to exploit intricate regularities in their sensori-motor environments. We used a hierarchical non-parametric Bayesian model to characterize the gradual, implicit learning of the higher-order sequential structure in a serial reaction time task over thousands of trials. Our model fitted the trajectory of response times, showing how participants refined their internal sequence models, adapting to larger and larger predictive chunks. Model parameters correlated with working memory capacity and error characteristics.

As a generative model for sequences, the hierarchical Bayesian sequence model [14] that we used is able to capture all these dependencies. More than that, it combines multi-order information in an adaptive way. As a recognition model accounting for the reaction times of our participants, judicious choices of priors per session, adjustment of the model to allow for forgetting [23], and augmentation with low-level effects, such as repetition and error effects, allowed it to fit behaviour rather well (and better than alternative models lacking an adaptive chunk-smoothing mechanism). Updating the model after every observation enabled it to track learning at trial-by-trial resolution. The hyperparameters of the model that govern the preference for longer contexts and for less forgetting correlated with participants' complex and spatial working memory scores, as measured by independent tests. [34] suggested that there is no substantial influence of working memory capacity on sequence learning if the task is implicit. However, their review included studies in which the analyses were based on aggregate sequence learning scores (computed as the response time difference between predictable and unpredictable trials). Here we show that the parameters of a mechanistically modeled sequence forgetting process are in fact related to working memory. This highlights that working memory does play a role in sequence processing, whether explicit or implicit, although relatively elaborate modeling may be needed to capture its subtle and/or complex contributions.

According to the HCRP, participants gradually enriched their internal sequence model so that it reflected zeroth- and second-order contingencies. However, in the first session, forgetfulness-induced volatility in the sequence model explained why high-probability trigram trials were more expected by the participants in random states. In previous ASRT studies, this effect was termed 'inverse learning' [17, 35]. Here we show that the effect is not counter-intuitive if we allow for forgetfulness. It arises from the specific distance structure of the ASRT, namely that the same trigram recurs at smaller distances, on average, in random trials than in deterministic trials (which enforce spacing among the elements; S3 Fig). As such, the 'random high' trials were more readily recalled.

As sessions progressed, trigram forgetfulness was reduced. This can be seen as a form of long-term skill learning, in which more importance and less forgetting are gradually assigned to predictive context-response chunks. Thus, after the first session, participants did not expect a globally frequent trigram less if that trigram happened to be locally rare. Previous work highlighted one key aspect of well-established skills: that the variability of self-generated actions is decreased, yielding smoother and more stable performance [36]. By contrast, here, skill is associated with more sophisticated methods of managing regular external variability. Simply put, learners can change the size of the chunks they choose to remember better. The context-depth-specific parametrisation of our model allows for the identification of other potential practice-related changes, which were not observed in the ASRT data but are often observed in other tasks. For instance, under higher memory demands, participants might improve their performance by shifting strength toward the chunk size required by the task while remaining forgetful about chunks of that size. This would result in behavior that is increasingly contingent on the correct chunk size but noisy due to over-reliance on recent observations.

By the last training session, some participants enriched their sequence representation further, with increased memory not only for trigrams but also for four-grams, implicitly incorporating knowledge of the sequence phase and enabling better performance. This shift was much milder than that in the memory for trigrams, and was not exhibited by all participants. While many previous studies have focused on the question of whether higher-order learning did, or did not, take place under certain conditions, special groups, etc., our method uncovers the degree to which information of different orders was learned.

Our model was able to account for the initial resistance to interference and slow relearning when a new sequence was introduced in session 9. Sequence interference not only reduced the response time variance explained by the internal sequence model (which was full of information about the old sequence), but also increased the variance explained by low-level effects, such as the response repetitions and the spatial distance between the current and previous cue. However, note that the low-level effect sizes did not increase—in fact, they mildly decreased as a consequence of the interference, as shown in Fig 4a. Therefore, our interpretation is that participants did not ‘fall back’ to rely on aspects of the data other than chunk statistics (in which case our labeling of these effects as ‘low-level’ might be brought into question), but rather that the low-level effects were less obscured by the effects of learning that became obsolete during interference.

Finally, our model class could reproduce error speeding that was specific to the type of error, that is, whether it reflected the global statistics, local statistics, or no apparent statistics of the task. However, since errors were too rare (∼10%) in our data to exert sufficient impact on the parameters, in order to examine a ddHCRP account of errors more precisely, we had to force the model into a more forgetful regime by adjusting the minimum forgetting rate to a lower level in the prior. Whilst acknowledging the artificiality of this procedure, we hope that the result is relevant for cases such as explicit sequence prediction tasks in which erroneous responses are more prevalent.

The HCRP model has distinct advantages over the two alternatives that have been suggested for characterizing or modeling the ASRT. Until recently, all papers employing the ASRT stuck to the kind of descriptive analysis that was used in the first ASRT paper [11]. This purely descriptive account focuses rather narrowly on the task itself, asking whether participants’ behaviour is appropriately contingent on the frequent and infrequent trigrams that arise from the structure of the task. Although such descriptions can show what sort of conformance there is, and how quickly it arises over sessions, they do not provide a trial-by-trial account of learning. In these terms, trigram dependencies arise more strongly in the HCRP as forgetfulness wanes—something that happens relatively quickly across sessions, underpinning the general success of the trigram model in explaining behaviour.

The other, contemporaneous alternative account [15] is mechanistic, and so is rather closer to ours. [15] fitted their model to the same data set that we present here, so the differences between our models and fitting procedures are straightforwardly understood, and we elaborate on them here. They use an infinite hidden Markov model (iHMM) [37, 38], a nonparametric Bayesian extension of the classical HMM that can flexibly increase its number of states given sufficient evidence in the data. As such, it can automatically 'morph' into the generative process of the ASRT, that is, into a model containing four states, each deterministically emitting one event, deterministically alternating with four random states, each emitting any of the four events with equal probability. The trouble with the iHMM for modeling variable-length contextual dependencies is that it has to use a proliferation of states to do so (typically combinatorial in the length of the context), posing some of the same severe statistical and computational difficulties as the n-gram models that we noted in the introduction. Indeed, unlike the case for HMMs, parts of the sequence that have not proved predictive or have proved superfluous (whether beyond or even intervening in the predictive window) are organically under-weighted in the HCRP. As another alternative, a sequence compression method was proposed for modeling compact internal representations [39]; however, this method is yet to be extended to non-binary sequences and to trial-by-trial dynamic compression. Our model, apart from capturing parsimony, also captured the nonstationarity of humans' expectations. Thus, it explained higher-order perseveration effects whereby recent chunks were more expected to reoccur.

Apart from the prior structure itself, our modeling approach differed from that employed by [15] in three ways. First, [15] assumed a stationary model structure for each session, while we sought to capture how participants update their internal model trial by trial. This is important for the initial training session and the interference sessions, where quick within-session model updating is expected. Second, instead of treating the learning sessions independently, we assumed that participants refine the same internal model session by session with new information, as well as by adjusting the hyperparameter set of that model (i.e., how the learned information is used). Third, we controlled for the rather strong low-level effects that are ubiquitous in sequence learning studies and are often controlled for in descriptive analyses: the effect of the spatial distance between response cues, repetition facilitation, pre-error speeding, and post-error slowing. Although some of these can, in principle, be characterized by the HCRP and the iHMM, they are likely generated by rather different mechanisms in the brain (e.g., repetition effects may arise from the modulation of tuning curves [40]), and so it can be confounding to corrupt the underlying probabilistic model learning with them. Of course, separating low- from high-level effects is not always so straightforward.

Serial reaction time tasks of various sorts have been extensively used to assess whether sequence learning can be implicit [1]; developmentally invariant or even superior in children compared to adults [41]; impaired [42], intact [43, 44] or even enhanced [45] in neurological and psychiatric conditions; persistent for months [31] and resistant to distraction [12] even after brief learning; (in)dependent on sleep [32, 46] and subjective sleep quality [47], etc. It would be possible to use the parameterization afforded by models such as the HCRP to ask what components differ significantly between conditions or populations.

We treat the ddHCRP as a computational-level description of multi-order sequence statistical learning and use, rather than as a process model that could be transparently implemented in neural hardware. Non-parametric Bayesian methods in this family are quite prevalent as computational-level models in cognitive science (e.g., the work of [48] on extinction in classical conditioning; and [49] in motor control), mostly also without suggested neural implementations. However, there has been at least one interesting attempt to link Chinese restaurant processes to cortico-striatal interactions by [50]. They developed a cognitive model of structure learning based on the CRP, akin to our model, along with a neurobiologically explicit network model approximation. Later, they demonstrated that EEG signals were predictive of participants' CRP-like clustering behavior [51], along with fMRI-based investigations of the prefrontal cortical regions involved in cluster creation and use [52]. One could perhaps imagine that the rich complexities of the expansive and contractive connections between the cortex, the basal ganglia and the striatum could implement some form of the hierarchy in the ddCRP. Alternatively, purely cortical mechanisms might be involved.

Given the utility of the HCRP model for capturing higher-order sequence learning in humans, it becomes compelling to use it to characterize sequential behavior in other animals too. For instance, the spectacular 'dance show' of a bird-of-paradise, the western parotia, comprising a series of different ballet-like dances, has been recorded extensively but is yet to be modeled. There has been more computational work on bird songs. Bengalese finches, birds domesticated from wild finches, have developed probabilistic, complex songs that a first-order Markov model of the syllables cannot capture [53]. [54] modeled the second-order structure of Bengalese finch songs using a partially observable Markov model in which states are mapped to syllable pairs. However, some individuals or species might use even higher-order internal models to generate songs, and this would be a natural target for the HCRP.

Structures in birds, such as area HVC, that apparently sequence song [55], or the hippocampus of rodents with its rich forms of on-line and off-line activation of neurons in behaviourally-relevant sequences [56, 57], or the temporal pole and posterior orbitofrontal cortex of humans [58], whose activity apparently increases with the depth of the predictive context, are all attractive targets for model-dependent investigations using the HCRP.

Of course, some behaviour that appears sequentially rich, like that of the sphex or the jeweled cockroach wasp, or the grooming of fruit flies [59] and indeed of rodents [60], may actually be rather more dynamically straightforward, proceeding in a simple, mandatory, unidirectional order that does not require the complexities of the HCRP to be either generated or recognized. Thus, fruit flies clean their eyes, antennae, head, abdomen, wings and thorax, in this particular order. [59] showed that this sequence arises from a suppression hierarchy among the self-cleaning steps: wing-cleaning suppresses thorax-cleaning, abdomen-cleaning suppresses both of these, and head-cleaning suppresses all the others. In a dust-covered fly, all of the cleaning steps are triggered at the same time, but each step can only be executed after the neural population underlying it is released from suppression, upon completion of the cleaning step higher in the hierarchy. As such, the suppression hierarchy ensures a strictly sequential behavior. Notably, behaviour of this sort is closed loop (responding to sensory input about success), whereas we consigned closed-loop aspects to the 'low level' of effects such as post-error slowing. It would be possible to tweak the model to accommodate sensory feedback more directly.
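
To make the suppression-hierarchy logic concrete, here is a toy simulation; the ordering follows the description above, but the implementation is ours, not from [59].

```python
def grooming_order(triggered):
    """All triggered steps compete, but each is suppressed until every
    step above it in the hierarchy has completed, so execution always
    follows the fixed order."""
    hierarchy = ['eyes', 'antennae', 'head', 'abdomen', 'wings', 'thorax']
    return [step for step in hierarchy if step in triggered]

# A fully dust-covered fly triggers every step simultaneously:
print(grooming_order({'thorax', 'wings', 'abdomen', 'head', 'antennae', 'eyes'}))
# -> ['eyes', 'antennae', 'head', 'abdomen', 'wings', 'thorax']
```

The point of the toy is that a strictly sequential output requires no memory for context at all, only a fixed priority ordering, in contrast to the variable-depth context inference that the HCRP performs.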

In conclusion, we offered a quantitative account of the long-run acquisition of a fluent perceptuo-motor skill, using the HCRP as a flexible and powerful model for characterising learning and performance on sequences that enjoy contextual dependencies of various lengths. We presented a range of insights that this model offers into the progressive performance of human subjects on a moderately complex alternating serial reaction time task. We explained various previously confusing aspects of learning, and showed that there is indeed a relationship between performance on this task and working memory capacity when viewed through the HCRP lens. The model has many further potential applications in cognitive science and behavioural neuroscience.

Supporting information

S1 Appendix. Supplementary algorithms.

(PDF)

S2 Appendix. Supplementary results.

(PDF)

S1 Fig. Sequence predictions of 3-level HCRP models fitted to 100 data points.

The models were trained with batch learning in order to clearly show how the pattern of predictions depends on the sequence structure, without online updates of the model parameters. In (a), the sequence was the concatenation of repeats of a 12-element deterministic pattern (Serial Reaction Time task, or SRT). In (b), the sequence was generated from the ASRT. (Top) Colors denote the sequence elements. The vertical bar marks the boundary between the two repeats in the SRT example segment. (Middle) Predictive probabilities of the four events are shown for each trial. The cells' hue indicates the event identity; saturation indicates the probability value. The Xs indicate the event with the highest predicted probability, i.e. the predicted event; Xs are green for correct predictions and red for incorrect predictions. The ticks at the bottom in (b) indicate high-probability trigram trials. Note that, once a context of at least two previous elements is available, all predictions are correct in the case of the deterministic SRT. In the ASRT, incorrect predictions occur for the low-probability trigrams. (Bottom) We show what proportion of the predictive probability comes from each context length. Higher saturation indicates a larger weight for a context length. Note that the context of two previous elements is invariably dominant in the SRT predictions, where every event is predictable from the previous two. In the ASRT, the context weights follow the largely alternating pattern of the high- and low-probability trigrams, the former being predictable from two previous events, the latter being unpredictable.

(TIFF)

S2 Fig. Negative log likelihood loss of HCRP models fitted to 10,000 ASRT data points.

(a) Negative log likelihood as a function of the maximum number of previous events considered. (b) Negative log likelihood as a function of the prior importance of two previous events, i.e. trigrams; lower values of α2 imply higher prior importance. The vertical dashed line in (a) marks the n that was used for fitting the human data in the manuscript.

(TIFF)

S3 Fig. Trigram reoccurrence distance in trials.

Vertical lines mark the medians. Note the marked periodicity in the case of d trials that imposes a spacing among the trigrams and increases the median reoccurrence distance.

(TIFF)

S4 Fig. Fitted values of the strength α (left) and forgetting rate λ (middle) parameters, as well as their joint effect on prediction (right), using the constrained prior that places the model in a forgetful regime, described in S1 Table.

A context of n previous events corresponds to level n in the HCRP. Lower values of α and λ imply a greater contribution from the context to the prediction of behavior. The context gain for context length n is the decrease in the KL divergence between the predictive distribution of the complete model and a partial model upon considering n previous elements, compared to considering only n-1 previous elements. Note that the scale of the context gain is reversed and higher values signify more gain.

(TIFF)

S1 Table. Hyperparameter prior sets for fitting the response times of all responses (sections 3.2–3.6) and errors only (section 3.7).

In session 1, the prior was uninformed. In all subsequent sessions, the prior was a truncated Gaussian N' with its mean at the MAP value from the previous session, a fixed variance, and the same interval as the uninformed distributions in session 1. For most of our results, the first, wider λ prior was used, to allow for extreme forgetfulness or unforgetfulness. For the prediction of the response times of errors, we restricted our model to a more forgetful regime by narrowing the λ prior.

(PDF)

S2 Table. Mixed effects model with random intercepts for participants and several low-level predictors, sorted by their absolute fitted slope B (in ms).

Due to the large data set, all factors are significant. However, because of the small effect sizes, we made an arbitrary cut-off, marked by the horizontal line, for which low-level effects to include in the response model.

(PDF)

Acknowledgments

We thank Sebastian Bruijns for helpful discussions on our model and Eric Schulz for providing useful feedback on the manuscript.

Data Availability

The raw data, computational modeling, and analysis code are publicly available at https://github.com/noemielteto/HCRP_sequence_learning. All other relevant data are within the manuscript and its Supporting information files.

Funding Statement

This research was supported by the National Brain Research Program (project 2017-1.2.1-NKP-2017-00002, PI: D.N.) and the Hungarian Scientific Research Fund (OTKA PD 124148, PI: KJ, OTKA K 128016, PI: DN). NE and PD were funded by the Max Planck Society. PD was also funded by the Alexander von Humboldt Foundation. DN was funded by the IDEXLYON Fellowship of the University of Lyon as part of the Programme Investissements d’Avenir (ANR-16-IDEX-0005). KJ was funded by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences.

References

  • 1. Robertson EM. The serial reaction time task: implicit motor skill learning? Journal of Neuroscience. 2007;27(38):10073–10075. doi: 10.1523/JNEUROSCI.2747-07.2007
  • 2. Fiser J, Aslin RN. Statistical learning of higher-order temporal structure from visual shape sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2002;28(3):458.
  • 3. Saffran JR, Aslin RN, Newport EL. Statistical learning by 8-month-old infants. Science. 1996;274(5294):1926–1928. doi: 10.1126/science.274.5294.1926
  • 4. Saffran JR, Wilson DP. From syllables to syntax: multilevel statistical learning by 12-month-old infants. Infancy. 2003;4(2):273–284. doi: 10.1207/S15327078IN0402_07
  • 5. Norris D. Word recognition: Context effects without priming. Cognition. 1986. doi: 10.1016/S0010-0277(86)90001-6
  • 6. Beattie GW, Butterworth BL. Contextual probability and word frequency as determinants of pauses and errors in spontaneous speech. Language and Speech. 1979;22(3):201–211. doi: 10.1177/002383097902200301
  • 7. Ten Oever S, Martin AE. An oscillating computational model can track pseudo-rhythmic speech by using linguistic predictions. eLife. 2021;10:e68066. doi: 10.7554/eLife.68066
  • 8. Nissen MJ, Bullemer P. Attentional requirements of learning: Evidence from performance measures. Cognitive Psychology. 1987;19(1):1–32. doi: 10.1016/0010-0285(87)90002-8
  • 9. Destrebecqz A, Cleeremans A. Can sequence learning be implicit? New evidence with the process dissociation procedure. Psychonomic Bulletin & Review. 2001;8(2):343–350. doi: 10.3758/BF03196171
  • 10. Remillard G, Clark JM. Implicit learning of first-, second-, and third-order transition probabilities. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2001;27(2):483.
  • 11. Howard JH Jr, Howard DV. Age differences in implicit learning of higher order dependencies in serial patterns. Psychology and Aging. 1997;12(4):634. doi: 10.1037/0882-7974.12.4.634
  • 12. Vékony T, Török L, Pedraza F, Schipper K, Pleche C, Tóth L, et al. Retrieval of a well-established skill is resistant to distraction: evidence from an implicit probabilistic sequence learning task. PLoS ONE. 2020;15(12):e0243541. doi: 10.1371/journal.pone.0243541
  • 13. Shannon CE. A mathematical theory of communication. The Bell System Technical Journal. 1948;27(3):379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x
  • 14. Teh YW. A hierarchical Bayesian language model based on Pitman-Yor processes. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics; 2006. p. 985–992.
  • 15. Török B, Nagy DG, Kiss M, Janacsek K, Németh D, Orbán G. Tracking the contribution of inductive bias to individualised internal models. PLoS Computational Biology. 2022;18(6):e1010182. doi: 10.1371/journal.pcbi.1010182
  • 16. Carpenter RH, Williams M. Neural computation of log likelihood in control of saccadic eye movements. Nature. 1995;377(6544):59–62. doi: 10.1038/377059a0
  • 17. Kóbor A, Horváth K, Kardos Z, Takács Á, Janacsek K, Csépe V, et al. Tracking the implicit acquisition of nonadjacent transitional probabilities by ERPs. Memory & Cognition. 2019;47(8):1546–1566. doi: 10.3758/s13421-019-00949-x
  • 18. Ratcliff R. A theory of memory retrieval. Psychological Review. 1978;85(2):59. doi: 10.1037/0033-295X.85.2.59
  • 19. Schvaneveldt RW, Gomez RL. Attention and probabilistic sequence learning. Psychological Research. 1998;61(3):175–190. doi: 10.1007/s004260050023
  • 20. Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973:209–230.
  • 21. Pitman J. Combinatorial stochastic processes. Technical Report 621, Dept. Statistics, UC Berkeley; 2002.
  • 22. Thorndike EL. Animal intelligence: An experimental study of the associative processes in animals. The Psychological Review: Monograph Supplements. 1898;2(4):i.
  • 23. Blei DM, Frazier PI. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research. 2011;12(8).
  • 24. Orbán G, Fiser J, Aslin RN, Lengyel M. Bayesian learning of visual chunks by human observers. Proceedings of the National Academy of Sciences. 2008;105(7):2745–2750. doi: 10.1073/pnas.0708424105
  • 25. Zhou Y, Lake BM. Flexible Compositional Learning of Structured Visual Concepts. arXiv preprint arXiv:2105.09848. 2021.
  • 26. Ruchkin DS, Grafman J, Cameron K, Berndt RS. Working memory retention systems: A state of activated long-term memory. Behavioral and Brain Sciences. 2003;26(6):709–728. doi: 10.1017/S0140525X03000165
  • 27. Radulescu A, Shin YS, Niv Y. Human Representation Learning. Annual Review of Neuroscience. 2021;44. doi: 10.1146/annurev-neuro-092920-120559
  • 28. Bhui R, Lai L, Gershman SJ. Resource-rational decision making. Current Opinion in Behavioral Sciences. 2021;41:15–21. doi: 10.1016/j.cobeha.2021.02.015
  • 29. Quentin R, Fanuel L, Kiss M, Vernet M, Vékony T, Janacsek K, et al. Statistical learning occurs during practice while high-order rule learning during rest period. NPJ Science of Learning. 2021;6(1):1–8. doi: 10.1038/s41539-021-00093-9
  • 30. Török B, Janacsek K, Nagy DG, Orbán G, Nemeth D. Measuring and filtering reactive inhibition is essential for assessing serial decision making and learning. Journal of Experimental Psychology: General. 2017;146(4):529. doi: 10.1037/xge0000288
  • 31. Kóbor A, Janacsek K, Takács Á, Nemeth D. Statistical learning leads to persistent memory: Evidence for one-year consolidation. Scientific Reports. 2017;7(1):1–10. doi: 10.1038/s41598-017-00807-3
  • 32. Simor P, Zavecz Z, Horváth K, Éltető N, Török C, Pesthy O, et al. Deconstructing procedural memory: Different learning trajectories and consolidation of sequence and statistical learning. Frontiers in Psychology. 2019;9:2708. doi: 10.3389/fpsyg.2018.02708
  • 33. Tillman G, Van Zandt T, Logan GD. Sequential sampling models without random between-trial variability: The racing diffusion model of speeded decision making. Psychonomic Bulletin & Review. 2020;27:911–936. doi: 10.3758/s13423-020-01719-6
  • 34. Janacsek K, Nemeth D. Implicit sequence learning and working memory: correlated or complicated? Cortex. 2013;49(8):2001–2006. doi: 10.1016/j.cortex.2013.02.012
  • 35. Song S, Howard JH, Howard DV. Implicit probabilistic sequence learning is independent of explicit awareness. Learning & Memory. 2007;14(3):167–176. doi: 10.1101/lm.437407
  • 36. Shmuelof L, Krakauer JW, Mazzoni P. How is a motor skill learned? Change and invariance at the levels of task success and trajectory control. Journal of Neurophysiology. 2012;108(2):578–594. doi: 10.1152/jn.00856.2011
  • 37. Beal MJ, Ghahramani Z, Rasmussen CE. The infinite hidden Markov model. Advances in Neural Information Processing Systems. 2002;1:577–584.
  • 38. Van Gael J, Saatci Y, Teh YW, Ghahramani Z. Beam sampling for the infinite hidden Markov model. In: Proceedings of the 25th International Conference on Machine Learning; 2008. p. 1088–1095.
  • 39. Planton S, van Kerkoerle T, Abbih L, Maheu M, Meyniel F, Sigman M, et al. A theory of memory for binary sequences: Evidence for a mental compression algorithm in humans. PLoS Computational Biology. 2021;17(1):e1008598. doi: 10.1371/journal.pcbi.1008598
  • 40. Grill-Spector K, Henson R, Martin A. Repetition and the brain: neural models of stimulus-specific effects. Trends in Cognitive Sciences. 2006;10(1):14–23. doi: 10.1016/j.tics.2005.11.006
  • 41. Janacsek K, Fiser J, Nemeth D. The best time to acquire new skills: Age-related differences in implicit sequence learning across the human lifespan. Developmental Science. 2012;15(4):496–505. doi: 10.1111/j.1467-7687.2012.01150.x
  • 42. Hsu HJ, Bishop DV. Sequence-specific procedural learning deficits in children with specific language impairment. Developmental Science. 2014;17(3):352–365. doi: 10.1111/desc.12125
  • 43. Nemeth D, Janacsek K, Balogh V, Londe Z, Mingesz R, Fazekas M, et al. Learning in autism: implicitly superb. PLoS ONE. 2010;5(7):e11731. doi: 10.1371/journal.pone.0011731
  • 44. Unoka Z, Vizin G, Bjelik A, Radics D, Nemeth D, Janacsek K. Intact implicit statistical learning in borderline personality disorder. Psychiatry Research. 2017;255:373–381. doi: 10.1016/j.psychres.2017.06.072
  • 45. Takács Á, Kóbor A, Chezan J, Éltető N, Tárnok Z, Nemeth D, et al. Is procedural memory enhanced in Tourette syndrome? Evidence from a sequence learning task. Cortex. 2018;100:84–94. doi: 10.1016/j.cortex.2017.08.037
  • 46. Nemeth D, Janacsek K, Londe Z, Ullman MT, Howard DV, Howard JH. Sleep has no critical role in implicit motor sequence learning in young and old adults. Experimental Brain Research. 2010;201(2):351–358. doi: 10.1007/s00221-009-2024-x
  • 47. Zavecz Z, Nagy T, Galkó A, Nemeth D, Janacsek K. The relationship between subjective sleep quality and cognitive performance in healthy young adults: Evidence from three empirical studies. Scientific Reports. 2020;10(1):1–12. doi: 10.1038/s41598-020-61627-6
  • 48. Gershman SJ, Jones CE, Norman KA, Monfils MH, Niv Y. Gradual extinction prevents the return of fear: implications for the discovery of state. Frontiers in Behavioral Neuroscience. 2013;7:164. doi: 10.3389/fnbeh.2013.00164
  • 49. Heald JB, Lengyel M, Wolpert DM. Contextual inference underlies the learning of sensorimotor repertoires. Nature. 2021;600(7889):489–493. doi: 10.1038/s41586-021-04129-3
  • 50. Collins AG, Frank MJ. Cognitive control over learning: creating, clustering, and generalizing task-set structure. Psychological Review. 2013;120(1):190. doi: 10.1037/a0030852
  • 51. Collins AGE, Frank MJ. Neural signature of hierarchically structured expectations predicts clustering and transfer of rule sets in reinforcement learning. Cognition. 2016;152:160–169. doi: 10.1016/j.cognition.2016.04.002
  • 52. Donoso M, Collins AG, Koechlin E. Foundations of human reasoning in the prefrontal cortex. Science. 2014;344(6191):1481–1486. doi: 10.1126/science.1252254
  • 53. Jin DZ, Kozhevnikov AA. A compact statistical model of the song syntax in Bengalese finch. PLoS Computational Biology. 2011;7(3):e1001108. doi: 10.1371/journal.pcbi.1001108
  • 54. Katahira K, Suzuki K, Okanoya K, Okada M. Complex sequencing rules of birdsong can be explained by simple hidden Markov processes. PLoS ONE. 2011;6(9):e24516. doi: 10.1371/journal.pone.0024516
  • 55. Pfaff JA, Zanette L, MacDougall-Shackleton SA, MacDougall-Shackleton EA. Song repertoire size varies with HVC volume and is indicative of male quality in song sparrows (Melospiza melodia). Proceedings of the Royal Society B: Biological Sciences. 2007;274(1621):2035–2040. doi: 10.1098/rspb.2007.0170
  • 56. Skaggs WE, McNaughton BL. Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science. 1996;271(5257):1870–1873. doi: 10.1126/science.271.5257.1870
  • 57. Nádasdy Z, Hirase H, Czurkó A, Csicsvari J, Buzsáki G. Replay and time compression of recurring spike sequences in the hippocampus. Journal of Neuroscience. 1999;19(21):9497–9507. doi: 10.1523/JNEUROSCI.19-21-09497.1999
  • 58. Klein-Flügge MC, Wittmann MK, Shpektor A, Jensen DE, Rushworth MF. Multiple associative structures created by reinforcement and incidental statistical learning mechanisms. Nature Communications. 2019;10(1):1–15. doi: 10.1038/s41467-019-12557-z
  • 59. Seeds AM, Ravbar P, Chung P, Hampel S, Midgley FM Jr, Mensh BD, et al. A suppression hierarchy among competing motor programs drives sequential grooming in Drosophila. eLife. 2014;3:e02951. doi: 10.7554/eLife.02951
  • 60. Berridge KC, Whishaw IQ. Cortex, striatum and cerebellum: control of serial order in a grooming sequence. Experimental Brain Research. 1992;90(2):275–290. doi: 10.1007/BF00227239
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009866.r001

Decision Letter 0

Samuel J Gershman, Christoph Mathys

20 Apr 2022

Dear Ms Éltető,

Thank you very much for submitting your manuscript "Tracking human skill learning with a hierarchical Bayesian sequence model" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Christoph Mathys

Associate Editor

PLOS Computational Biology

Samuel Gershman

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: This article proposes a novel computational model of sequence learning, and tests some predictions of the model using behavioural data acquired in a variant of the Alternating Serial Response Time (ASRT) task. On each trial, one out of four possible cues is shown and subjects (n=25) have to respond as quickly as possible by pressing a (cue-specific) key. Unbeknownst to the participants, half the cues follow a deterministic second-order sequence, such that the cue at trial t is exactly predictable from the cue at trial t-2. The other half (interleaved with the predictable cues) is random. Participants completed 8 weekly sessions of 2125 trials each (!) where the underlying deterministic sequence was kept fixed, plus 2 additional sessions that include covert changes in the deterministic sequence. People's implicit learning in the task was monitored using response times, which essentially increase when they are surprised. The main result is twofold: (i) after learning, people's RTs are quicker for predictable cues than for random cues, and (ii) they are also quicker for those random cues that conform (by chance) with the deterministic 2nd-order sequence than for the other random cues. Taken together, these results imply that participants have learned (some aspect of) the task contingencies, albeit entirely implicitly.

Overall, I found the paper interesting and timely. The modelling aspect of the work is novel and promising, and the methods consist of a balanced mixture of computational and experimental approaches, which I feel sympathetic to. Having said this, there are a number of issues that, in my opinion, slightly compromise the quality of the paper (see below). I believe that, if these are addressed in the revised manuscript, this paper would make a significant contribution to the field. Let me now set out the two main concerns I have with the current version of the manuscript.

First of all, I felt that the experimental paradigm was not perfectly appropriate for testing the model. This is essentially because the model is about the ability to flexibly adapt the sophistication of sequence learning to the (hidden) complexity of the sequence. In brief, the model starts with the premise that the brain should be equipped with some mechanism that enables it to gradually change its representational dynamics, sic: "We hypothesized that humans build their internal sequence representation in a similar way, starting with learning short-range dependencies and gradually adding long-range dependencies if they are indeed significantly present in the data" (l. 164-166, p. 6). Accordingly, the model focuses on an optimal (Bayesian) mechanism that can learn any dependency structure (this is the computational problem that the model is meant to solve). However, the dependency structure of the ASRT task is invariant. In other terms, one can solve the task without being able to flexibly adapt to the dependency structure. I would even argue that the main results would be expected under any form of sequence learning that captures 2nd-order dependencies. I note that this most likely extends to most (if not all) results reported in the manuscript as it stands. For example, a simple online regression based upon, e.g., truncated Volterra series would make qualitatively identical predictions. Now, I'm not saying people are doing this. Rather, I'm challenging the implicit assumption that the current experimental design offers direct empirical evidence for the model that the authors have proposed here. In my opinion, this implies that the authors should provide comparative evidence for their model. In other terms, they should try other (simpler?) models, and show that these are less likely explanations for people's behavior than the "distance-dependent Hierarchical Chinese Restaurant Process" (hereafter: ddHCRP) model. Importantly, the comparison should be fair, in that candidate models should be a priori able to learn the 2nd-order sequence structure.

Second, I have a few issues with the presentation of the model as it stands. In brief, if one is not already cognisant of Dirichlet processes, then one cannot understand how the model works. There are only 3 equations in the manuscript that relate to the model, and they are clearly not sufficient to describe it. This is PLoS Computational Biology: authors should not be reluctant to be explicit about the mathematics :) As I am sure the authors are well aware, the typical way of describing a Bayesian learning/inference algorithm is to start with the generative model, and then describe the model inversion procedure. Equations 1-3 summarize the main aspects of the generative model, but describing the full generative model requires more details. More importantly, no computational detail is given regarding model inversion. Does it rely on sampling, or on some variational approximation (the latter, I guess, since they cite Beal and colleagues)? The authors should insert a complete model/methods subsection on model inversion, with full details regarding the algorithmic approach. For example: how is it initialized? What summary statistics are used and how are they updated online? How does computational complexity grow (Dirichlet processes rely on some threshold to augment their state-space: what is it here?)? They should use this description to highlight the core computational properties of the ddHCRP learning algorithm. In relation to the first comment above, this may also serve to motivate the choice of candidate alternative models, and to discuss possible pros and cons from a computational perspective.

I have other more minor concerns, but I don't believe it is useful to discuss these unless authors are willing to address those two issues first. I hope authors understand my concerns, and take this review round as an opportunity to improve the paper (which I think deserves to be published in PLoS CB, provided the above concerns are adequately addressed)!

Jean Daunizeau.

Reviewer #2: Learning increasingly complex statistical regularities present in continuous sequences is a difficult problem affected by the curse of dimensionality. Yet, humans can take advantage of such regularities to speed up their responses during perceptuo-motor tasks. In this manuscript, the authors adapt an existing model of language processing to capture the learning process of participants practicing a visuo-motor task over multiple weeks.

Overall, the model elegantly captures many behavioural hallmarks of implicit sequence learning, and thus appears to be a promising quantification tool. The manuscript focuses on validating the predictions of the model, laying the groundwork for future studies in cognitive neurosciences.

I have two major comments and a few minor points I would like the authors to address before publication can be considered.

Major comments:

- I might have missed some supplementary material, but I could not find a formal description of the model. I understand the hierarchical Dirichlet process is relatively standard, but the details of its implementation, and in particular all the adaptations of the response mapping (low-level biases), the explicit rules for selecting and weighting levels, and the inversion routine, must be included, albeit in an annex, both for the sake of clarity and to ensure the self-sufficiency of the manuscript and therefore reproducibility. The absence of a complete set of equations describing the learning rules and the response function, combined with the lack of typographic distinction between scalar and vector parameters, makes the manuscript rather difficult to read. The authors should also make their code available in a public repository for completeness.

- The model is entirely fitted to the RTs of actual responses, which is perfectly understandable and well justified in the manuscript. However, the structure of the model should also allow predicting the "motor choices" and, more interestingly, making non-trivial predictions about errors (eg. generalisation errors). The authors should provide some hints about how this could be implemented or, better, provide some additional analyses addressing this point.

Having a look at such qualitative predictions might be particularly critical eg. in section 2.7 in which 'pattern errors' (coming from learned expectations) are opposed to 'recency' and 'other' errors: if the differential in RT between those cases is indeed coming from the expression of some learned >2-order contingencies, choices qualifying as 'pattern errors' should be aligned with the internal 'seating pattern' recovered by the learning model (but not in the case of other errors). Addressing this point would make a fair sanity check of the interpretation of the error speeding in pattern errors.

Another prediction would be that participants/phases characterised by a deeper representation should also exhibit specific types of error reflecting their higher order expectations. I understand such events might be rare and therefore hard to analyse, but they could provide some further insights into the behavioural variability across participants, which is a bit lacking in the current manuscript, especially for a model intended as a quantification tool for behavioural neurosciences.

Minor comments:

- I understand where the Chinese restaurant example comes from, but the back and forth between this terminology and the presented experimental design is a bit hard to follow. Sentences like 'the customers respond to certain key presses' or 'the probability of sitting at a table ... predict how likely each key is' are a bit nonsensical for a reader not familiar with Chinese restaurant processes. If the authors insist on keeping the CRP example, I would suggest doing it in one place, then mapping each term to both the equations (cf. my first point) and the respective experimental concepts (eg. what is a table in terms of sequence representation?).

- p.10: "We parsed the sequence five times..." Why five? This seems very low for a sampling procedure. Or is this meant at the trial level, thereby exhausting the possible seating arrangements?

- On the same topic, it would be helpful to have some measure of convergence of the inversion procedure (e.g. the variance of the estimates across different runs of the random search; see the second sketch below).

- Is there a correlation between the lambdas at the different levels? What is the rationale for not having identical values across levels?

- Table 1: effect sizes (in ms) should be provided to allow comparison between actual and predicted RT effects.
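
To illustrate the term-by-term mapping I am asking for, here is a toy Python sketch of a single CRP 'restaurant' in sequence terms (my own simplification, not the authors' implementation; table bookkeeping is collapsed into per-key counts for brevity):

from collections import defaultdict

# 'restaurant' = a context (the previous sequence elements);
# 'customer'   = one observation of the next key press in that context;
# 'dish'       = a candidate next key.
class ContextRestaurant:
    def __init__(self, alpha):
        self.alpha = alpha              # concentration: weight of the back-off
        self.counts = defaultdict(int)  # customers (observations) per next key
        self.total = 0

    def predict(self, key, backoff_prob):
        # mix observed continuations with the shorter-context back-off
        return (self.counts[key] + self.alpha * backoff_prob) / (self.total + self.alpha)

    def observe(self, key):
        self.counts[key] += 1
        self.total += 1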
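
And regarding convergence of the inversion, a crude diagnostic could be the following (hypothetical sketch; sample_seating_and_predict stands for whatever routine resamples a seating arrangement and returns a predictive distribution):

import numpy as np

def seating_variance(model, context, n_samples, n_runs=20):
    # variance, across independent runs, of the prediction obtained by
    # averaging over n_samples resampled seating arrangements; this
    # should shrink as n_samples grows if the procedure has converged
    preds = []
    for _ in range(n_runs):
        samples = [model.sample_seating_and_predict(context)
                   for _ in range(n_samples)]
        preds.append(np.mean(samples, axis=0))
    return np.var(preds, axis=0).mean()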

Reviewer #3: In “Tracking human skill learning with a hierarchical Bayesian sequence model”, Éltető and colleagues apply a hierarchical sequence model to reaction time data from an alternating serial response time task. The hierarchical Chinese restaurant process (HCRP), with each level of the hierarchy coding for sequences of a particular length, is enriched with a process of forgetting (forgetful HCRP). The model is then applied to data from 25 participants who completed 10 runs with a total of more than 20000 trials. The first 8 runs were generated by the same statistical trigram structure, while sessions 9 and 10 were used to probe the stability of the learned sequences against novel trigrams. In combination with low-level features such as repetition and error-related effects, the model was able to capture how participants learned the sequence. The success of the fitting is illustrated by the correlation of predicted traces with left-out data. The fitted parameters capture the nature of the task (over sessions) and, importantly, correlate with working memory indices acquired with classical working memory tasks.

This is a very nice application of a model for implicit learning, tested on a rich dataset with enough trials (25 participants, >20000 trials each) to actually allow fitting the model and investigating effects of changing structure in the task. Both task and model are well explained. I have a few comments and questions on the model and model fitting for the authors (see below).

Jakob Heinzle

Major:

Model fitting: It was not entirely clear to me how ABC was applied here. You describe that you simulated five instances of new seatings in every trial and then averaged over them. Is this enough to get stable fits? How could you ensure this, other than by looking at the correlation to the held-out data? In addition, it was not clear to me how you dealt with the held-out data. Did you fit the model up to the time of the held-out data and then model the held-out data to check the posterior predictive value? Or was the model fit on the entire data/session and then tested on the held-out data? Did you restart the sequence for learning after the held-out data, or how did you include the sequence dependencies across the border of the held-out data?

Discussion: While you elaborate in the discussion on possible applications, e.g. in sequence learning in songbirds, I was missing a discussion of how you think the fHCRP in combination with ABC could be implemented in the brain. It would be interesting to read your thoughts on this, as a generative model of behavior should be neuronally implementable as well.

Availability of data and model: While you say that the relevant data will be made available, it is not clear whether this includes the raw data and the full analysis/modeling code. You have acquired a unique data set which could serve other groups as a basis for their modeling. Please mention the availability of the data within the manuscript.

Minor:

Figure 1: The 95% CIs are not visible. Are these confidence intervals of the mean? You could show standard deviations instead to increase visibility.

Figure 5: It was not clear to me whether you applied any correction for multiple comparisons across the multiple WM tests and levels of alpha and lambda. Given that you suggest that the detailed model is necessary to extract parameters that relate implicit sequence learning to WM performance, it would be good to state this clearly, in particular because other studies have suggested there is no relation, as you discuss.

Figure 6c: It would be interesting to discuss not only the trace of sequence learning but also the others. Repetition, for example, shows an increase over the last two sessions. What is the interpretation of this? It is also not clear what exactly the plotted curves for individual regressors show. Unique variance?

Priors: When fitting the model with the forgetful prior (Figure 8), was the posterior value of lambda always at the lowest end of the uniform prior? If so, why did you choose exactly that value? Where were the Gaussian priors in sessions 2-10 truncated? At the borders of the uniform prior in session 1? This was not clear to me. My reading of the session-to-session prior is sketched below.
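
A sketch of that reading (hypothetical names; whether lo and hi are indeed the session-1 uniform bounds is exactly what I am asking):

from scipy.stats import truncnorm

def session_prior(map_prev, sigma, lo, hi):
    # truncated Gaussian centred on the previous session's MAP estimate,
    # with bounds (lo, hi) assumed to be the session-1 uniform bounds
    a, b = (lo - map_prev) / sigma, (hi - map_prev) / sigma
    return truncnorm(a, b, loc=map_prev, scale=sigma)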

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No: Not only were the code and data not provided at this stage, but the manuscript also lacks a detailed description of the model. As mentioned in my comments, this omission needs to be addressed in future revisions.

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: JEAN DAUNIZEAU

Reviewer #2: No

Reviewer #3: Yes: Jakob Heinzle

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009866.r003

Decision Letter 1

Samuel J Gershman, Christoph Mathys

10 Oct 2022

Dear Ms Éltető,

Thank you very much for submitting your manuscript "Tracking human skill learning with a hierarchical Bayesian sequence model" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note, while forming your response, that if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Christoph Mathys

Academic Editor

PLOS Computational Biology

Samuel Gershman

Section Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: I would like to congratulate the authors for their thorough responses and revisions (additional minor comments I had were also answered when addressing the comments of the other reviewer).

I think this work will provide a significant contribution to the field!

Jean Daunizeau.

Reviewer #2: The authors answered all my concerns. I appreciate in particular the additional model comparison and algorithm explanation. I still found the provided code a bit lacking (I expected a README and comments structuring the code: the repo is so far quite bare and unusable without some serious guesswork). I trust the authors to address this last point before publication. I otherwise have no further comments.

Reviewer #3: Comments on Revised manuscript “Tracking human skill learning with a hierarchical Bayesian sequence model”.

In this revision, the authors have addressed my concerns in a satisfactory manner. I have two comments remaining directly related to their answers.

Jakob Heinzle

Major:

I was a bit surprised by your statement that the data were already published in the recently accepted paper by Török et al. (PLoS Comput Biol 18(6):e1010182. https://doi.org/10.1371/journal.pcbi.1010182). I understand this is exactly the same data set that you analyse here. I had not realized this when reading the initial manuscript, and I think it should be made more transparent. While you cited the Török study (a preprint version of it) for additional reference on the methods, I could not find a statement that it is indeed the same data that you are using. I consider it fundamentally important to mention in your paper that this is not the first time these data have been published. Readers need to understand that you present a novel modeling analysis, but of an existing data set. Also, in the discussion, you should mention that the results of Török et al. rest on exactly the same data. This is important for the questions regarding model comparison that the other reviewers have brought up and which I think are highly relevant. Note: I added this point as major not because I think it will entail a lot of work, but because of its importance.

Minor:

There are still some things unclear about ABC. E.g., what stopping criterion do you use? Also, it is clear that using 5 instantiations of the CRP does not reach a plateau, so why is 5 still a good number? In addition, I think you need to mention more clearly that your held-out dataset is not fully independent. I understand that you need to include the trials of the middle segment (the held-out data) in your CRP updates to give the model the right context for the later trials. However, this means that your parameter estimates are conditional on the inputs (“features”) of the held-out set, even if you do not use the RT measurements of that period to fit the hyperparameters. In machine learning, one would usually not include the features of the test set in the training, even if the test labels are not used. I realize that this is a tricky point and, hence, I think it is important that you explain this deviation from an ideal held-out set to the reader.
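
In pseudo-Python, the distinction I have in mind would look roughly like this (all names hypothetical, not the authors' code):

def fit_session(model, stimuli, rts, heldout_mask):
    # update the sequence statistics on *every* trial so that later
    # contexts are correct, but accumulate the fitting loss only outside
    # the held-out segment; the held-out stimuli therefore still enter
    # the model's context, which is the dependence described above
    loss = 0.0
    for stim, rt, held_out in zip(stimuli, rts, heldout_mask):
        pred = model.predict_next()
        if not held_out:
            loss += model.rt_loss(pred, rt)
        model.update(stim)
    return loss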

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jean Daunizeau

Reviewer #2: No

Reviewer #3: Yes: Jakob Heinzle

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009866.r005

Decision Letter 2

Samuel J Gershman, Christoph Mathys

31 Oct 2022

Dear Ms Éltető,

We are pleased to inform you that your manuscript 'Tracking human skill learning with a hierarchical Bayesian sequence model' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Christoph Mathys

Academic Editor

PLOS Computational Biology

Samuel Gershman

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #3: The authors answered all my questions and I have no further concerns.

Congratulations on this nice piece of work.

Jakob Heinzle

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: Yes: Jakob Heinzle

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009866.r006

Acceptance letter

Samuel J Gershman, Christoph Mathys

16 Nov 2022

PCOMPBIOL-D-22-00132R2

Tracking human skill learning with a hierarchical Bayesian sequence model

Dear Dr Éltető,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Supplementary algorithms.

    (PDF)

    S2 Appendix. Supplementary results.

    (PDF)

    S1 Fig. Sequence predictions of 3-level HCRP models fitted to 100 data points.

    The models were trained with batch learning in order to clearly show how the pattern of predictions depends on the sequence structure without online updates of the model parameters. In (a), the sequence was the concatenation of repeats of a 12-element deterministic pattern (Serial Reaction Time Task or SRT). In (b), the sequence was generated from the ASRT. (Top) Colors denote the sequence elements. The vertical bar marks the boundary between the two repeats in the SRT example segment. (Middle) Predictive probabilities of the four events are shown for each trial. The cells’ hue indicates the event identity; saturation indicates the probability value. The Xs indicate the event with the highest predicted probability, i.e. the predicted event; Xs are green for correct predictions and red for incorrect predictions. The ticks at the bottom in (b) indicate high-probability trigram trials. Note that, after having a context of at least two previous elements, all predictions are correct in the case of the deterministic SRT. In the ASRT, incorrect predictions occur for the low-probability trigrams. (Bottom) We show what proportion of the predictive probability comes from each context length. Higher saturation indicates a larger weight for a context length. Note that the context of two previous elements is invariably dominant in the SRT predictions, where every event is predictable from the previous two. In the ASRT, the context weights follow the largely alternating pattern of the high- and low-probability trigrams, the former being predictable from two previous events and the latter being unpredictable.

    (TIFF)

    S2 Fig. Negative log likelihood loss of HCRP models fitted to 10,000 ASRT data points.

    (a) Negative log likelihood as a function of the maximum number of previous events considered. (b) Negative log likelihood as a function of the prior importance of two previous events, i.e. trigrams. In (b), lower values of α2 imply higher prior importance. The vertical dashed line in (a) marks the n that was used for fitting the human data in the Manuscript.

    (TIFF)

    S3 Fig. Trigram reoccurrence distance in trials.

    Vertical lines mark the medians. Note the marked periodicity in the case of d trials that imposes a spacing among the trigrams and increases the median reoccurrence distance.

    (TIFF)

    S4 Fig. Fitted values of the strength α (left) and forgetting rate λ (middle) parameters, as well as their joint effect on prediction (right), using the constrained prior that places the model in a forgetful regime, described in S1 Table.

    A context of n previous events corresponds to level n in the HCRP. Lower values of α and λ imply a greater contribution from the context to the prediction of behavior. The context gain for context length n is the decrease in the KL divergence between the predictive distribution of the complete model and a partial model upon considering n previous elements, compared to considering only n-1 previous elements. Note that the scale of the context gain is reversed and higher values signify more gain.

    (TIFF)

    S1 Table. Hyperparameter prior sets for fitting the response times of all responses (sections 3.2-3.6) and errors only (section 3.7).

    In session 1, the prior was uninformed. In all subsequent sessions, the prior was a truncated Gaussian N’ with its mean at the MAP value of the previous session, a fixed variance, and the same interval as the uninformed distributions in session 1. For most of our results, the first, wider λ prior was used to allow for extreme forgetfulness or unforgetfulness. For the prediction of response times of errors, we restricted our model to a more forgetful regime by narrowing the λ prior.

    (PDF)

    S2 Table. Mixed effects model with random intercepts for participants and several low-level predictors, sorted by their absolute fitted slope B (in ms).

    Due to the large data set, all factors are significant. However, we made an arbitrary cut-off at the horizontal line for the low-level effects included in the response model because of the small effect sizes.

    (PDF)

    Attachment

    Submitted filename: response_letter_1.pdf

    Attachment

    Submitted filename: response_letter_2.pdf

    Data Availability Statement

    The raw data, computational modeling and analysis code is publicly available at https://github.com/noemielteto/HCRP_sequence_learning. All other relevant data are within the manuscript and its Supporting information files.

