PLOS One. 2024 Jun 10;19(6):e0297917. doi: 10.1371/journal.pone.0297917

Individual differences in working memory impact the trajectory of non-native speech category learning

Casey L Roark 1,2,*, Giorgio Paulon 3, Giovanni Rebaudo 3, Jacie R McHaney 1, Abhra Sarkar 3, Bharath Chandrasekaran 1,2,*
Editor: Alessandra S Souza 4
PMCID: PMC11164376  PMID: 38857268

Abstract

What is the role of working memory over the course of non-native speech category learning? Prior work has predominantly focused on how working memory might influence learning assessed at a single timepoint. Here, we substantially extend this prior work by examining the role of working memory on speech learning performance over time (i.e., over several months) and leverage a multifaceted approach that provides key insights into how working memory influences learning accuracy, maintenance of knowledge over time, generalization ability, and decision processes. We found that the role of working memory in non-native speech learning depends on the timepoint of learning and whether individuals learned the categories at all. Among learners, across all stages of learning, working memory was associated with higher accuracy as well as faster and slightly more cautious decision making. Further, while learners and non-learners did not have substantially different working memory performance, learners had faster evidence accumulation and more cautious decision thresholds throughout all sessions. Working memory may enhance learning by facilitating rapid category acquisition in initial stages and enabling faster and slightly more careful decision-making strategies that may reduce the overall effort needed to learn. Our results have important implications for developing interventions to improve learning in naturalistic language contexts.

Introduction

Categorization involves mapping variable inputs to discrete labels and is an important process that supports complex cognitive processes, such as object recognition [1] and speech perception [2]. Humans can learn novel categories throughout the lifespan across different perceptual modalities. However, there are also large individual differences in the underlying learning processes and outcomes [3, 4]. As such, there is a need to better understand what contributes to successful or less successful learning. In this study, we systematically examine the contributions of an ability that has been linked to category learning in prior work–working memory capacity.

Working memory (WM) reflects the resources available for the temporary storage and manipulation of information relevant for a given task [5, 6]. Category learning involves many processes that are dependent on WM. Learners need to attend to task-relevant features and ignore task-irrelevant features, maintain features of a stimulus in mind as relevant or irrelevant for a decision, hold hypotheses in mind about stimulus-category-response mapping, compare representations of the stimulus to previous stimuli or rules, and incorporate feedback to update existing category representations and hypotheses about category identity. The ability to learn categories across sensory modalities has generally been found to be positively associated with WM [7–12]. WM is thought to support faster initial category learning [7] by allowing learners to hold multiple hypotheses about category identity in mind, test these hypotheses, and rapidly and efficiently find a useful one [13].

Importantly, prior studies have primarily focused on the role of WM in initial learning, and, as a result, it is unclear how WM may play a role in maintenance of performance or learning patterns over time. In the earliest stages of learning, learners must be highly flexible with their behavior and search a large pool of potential hypotheses about category identity. As performance improves and becomes more stable over time, WM processes may be less relevant because learners may be making small refinements to existing rules rather than keeping many competing hypotheses in mind. As a result, it is necessary to examine learning beyond the initial stage, especially for categories that are difficult to acquire within a single session.

In the current study, we examine a specific case of category learning that is an important skill in second language acquisition–non-native speech category learning. The ability to learn a new language has been positively associated with individual abilities like WM capacity [14–17]. Assessed in a single session, the ability to learn sounds of a non-native language in adulthood has been positively linked to WM capacity [11, 12]. However, other studies examining learning across longer training periods (e.g., multiple sessions across many days) have found that WM ability does not predict the ability to learn non-native speech categories [18, 19]. The role of WM across the trajectory of non-native speech category learning is not yet clear. It is possible that WM supports initial, but not later speech learning.

In the current study, we train participants on non-native Mandarin tone categories. In Mandarin, distinct pitch patterns are lexically contrastive–the same syllable produced with four different pitch patterns (e.g., high-flat, low-rising, low-dipping, and high-falling) alters the meaning. Learning to distinguish sounds based on these pitch patterns can be difficult for non-native listeners and there are large individual differences in learning [3, 20–24].

For both speech and artificial perceptual categories, training beyond one session can be very successful, leading to significant learning and retention over time. In studies not focused on WM, participants learn through extensive training over several weeks [24–29] and then sometimes are brought back for a test of retention months later (e.g., 3 months [27]; 8 weeks [24]). Neural representations of categories rapidly emerge within a single session of initial learning [30, 31], but continue developing over time with more experience [24].

The role of WM beyond initial category acquisition is not well understood. Whereas initial learning involves testing a large range of possible hypotheses about stimulus-response mapping and using feedback to update these hypotheses, learning beyond the novice stage involves refining existing hypotheses, learning about idiosyncratic stimuli, and continuing to develop and refine representations. Additionally, after a delay in experience, learners must reactivate existing representations and hypotheses to continue refining their category knowledge. It is possible that these processes rely less on WM than the testing of multiple hypotheses that is necessary during initial learning. In the current study, we examine the role of WM in both an initial learning session and learning sessions after one and three months from the initial session.

Our approach involves inviting participants who previously completed a single session of training on Mandarin tone categories [12] back for additional training sessions. McHaney et al. [12] demonstrated that WM abilities were related to success in initial non-native speech category learning across two experiments–one behavioral (revisited here) and one with pupillometry. Specifically, individuals with higher WM capacity were better at learning, better at finding task-appropriate strategies, and had pupil responses that reflected better stimulus-related attention. Based on this, McHaney et al. [12] concluded that WM may support learning by enhancing attention to task-relevant information. Critically, because this prior study tested only a single session of learning, it is possible this conclusion may only apply to initial learning. In the current study, we invite participants from Experiment 1 of McHaney et al. [12] back for two additional sessions–one session one month after their initial training and another session two months after the second session. We follow up with the same sample from McHaney et al. [12] to understand how individual differences in WM relate to individual differences in learning beyond initial acquisition.

An important aspect of understanding individual differences in learning is the acknowledgement that many individuals perform at chance levels even after extensive training. We see two main possibilities that could explain this pattern: (1) these participants are actively engaged and trying but persistently fail to learn and/or (2) these participants are actively disengaged and are not trying to learn, so they fail to learn. Disentangling these two possibilities is challenging. Prior work on category learning takes one of two approaches regarding participants performing at chance levels. Some studies remove these participants entirely, typically by removing participants who perform at or below chance levels by the end of learning [32–35]. Other studies retain these participants in the sample as it is impossible to know if their performance reflects a true inability to learn or whether they are disengaged [36–39]. The lack of consistency in these approaches across studies makes it difficult to understand this poorly performing subset of the population. In the current study, we take a hybrid version of these approaches to better understand the underlying challenges facing less successful performers. We examine both the entire set of participants and participants who perform at above-chance levels (i.e., learners vs. non-learners who do not perform at above-chance levels). By examining the patterns while considering if participants eventually learned or not, we can better understand behaviors and abilities that lead to success.

We employ a multifaceted approach to understand what WM does or does not do for initial and later learning of speech categories. Specifically, we assess if WM is related to (1) performance in initial and later learning sessions, (2) maintenance of category knowledge over time, (3) generalization of category knowledge to different talkers, (4) rate of evidence accumulation and (5) response caution during decision making (Table 1).

Table 1. Hypothesized role of working memory across measures.

| | Hypothesized role of working memory | Relevant measure(s) in current study |
|---|---|---|
| Initial learning | Hold multiple hypotheses in mind, better and faster learning | Accuracy in session 1 |
| Later learning | Enhanced attention and motivation | Accuracy in sessions 2 and 3 |
| Maintenance of category knowledge | Quickly reactivate and flexibly use existing representations | Accuracy in first block of sessions 2 and 3 compared to final block of sessions 1 and 2 |
| Generalization | Flexibly apply rules to new contexts | Accuracy in generalization test with different talkers and no feedback |
| Evidence accumulation | More efficient processing, mobilization of attentional resources | Evidence accumulation (drift) rate parameter from drift diffusion modeling |
| Response caution | More cautious, gather more information and test against multiple hypotheses before a decision is made | Decision threshold (boundary) parameter from drift diffusion modeling |

Working memory capacity is operationalized by the operation span score (OSPAN).

Initial and later learning

Based on prior work, we expect that higher WM will be beneficial to initial acquisition (i.e., session 1) of non-native speech categories [12]. This prediction stems from prior work that has demonstrated that higher WM is associated with faster and better initial artificial category learning [7–9, 11, 40–42]. We also expect to observe this pattern given that the first session of training was published in McHaney et al. [12] where among all 195 participants, WM was positively related to learning. A subset of these participants (107/195) returned for the current study.

Building on the prior study, we will probe the extent to which WM is related to performance in subsequent learning sessions. It is possible that WM provides benefits only in initial learning by quickly allowing learners to test many different hypotheses and find the ones that maximize their performance (e.g., [13]) and that WM is unrelated to learning beyond this novice stage. This prediction would be consistent with the observation that WM is not related to speech category learning when assessed after eight days of training [18, 19]. However, it is also possible that WM provides benefits to learning beyond initial acquisition, allowing for enhanced further refinement of category representations.

Maintenance of category knowledge

By probing performance after one and two months of no additional exposure or training, we will examine the maintenance of performance over time. One possibility is that higher WM may allow learners to quickly reactivate and flexibly use their category representations developed in prior session(s). However, it is possible that maintenance of performance over time may be independent of WM and could reflect long-term memory abilities instead.

Generalization

The ability to accurately identify novel category exemplars is a hallmark of categorization. We will assess generalization in each session by presenting learners with novel stimuli spoken by novel talkers that they do not encounter during training, without providing feedback about the correct category. To successfully generalize to these novel talkers, they will need to apply their existing knowledge flexibly to the new context. It is possible that generalization relies on WM, as the ability to flexibly apply rules (e.g., cognitive flexibility) is correlated with WM capacity [43, 44] and generalization to novel contexts is related to individual differences in WM capacity [45, 46].

Decision processes during learning

Using a drift diffusion modeling (DDM) approach [47, 48], we will examine whether different components of the decision process (e.g., rate of evidence accumulation and response caution) are related to WM. DDMs are popular tools to understand decision-making processes from accuracy and response time measures [49–53]. DDMs assume that during decision making, sensory evidence for multiple decision alternatives is accumulated in the human brain at varying rates, and a decision is made when such evidence reaches a particular boundary [47, 48]. In the case of learning non-native speech sound categories like Mandarin tone categories, as a participant hears a stimulus, they begin accumulating evidence towards all four response options (e.g., high-flat, low-rising, low-dipping, high-falling). Each of the four response options has its own decision threshold, with higher thresholds requiring more evidence to be accumulated before the decision will be made, reflecting more cautious responding. Evidence is also accumulated toward each threshold at its own rate, with faster rates reflecting higher quality of evidence extracted from the stimulus. Below we consider the possibility that WM relates to these two components of the decision process.
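The race between response options described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the model fitted in the study: the drift values, thresholds, noise level, time step, and the function names `simulate_trial` and `simulate_block` are all hypothetical, and a simple Euler scheme stands in for the first-passage-time distributions a full model would use.

```python
import random

def simulate_trial(drifts, thresholds, rng, noise_sd=1.0, dt=0.01, max_time=20.0):
    """Simulate one trial of a race between independent accumulators.

    Returns (index of the winning response, decision time in seconds).
    """
    evidence = [0.0] * len(drifts)
    step_sd = noise_sd * dt ** 0.5
    t = 0.0
    while t < max_time:
        t += dt
        for i, drift in enumerate(drifts):
            evidence[i] += drift * dt + rng.gauss(0.0, step_sd)
        # Accumulators that have reached their own threshold on this step.
        finished = [i for i, e in enumerate(evidence) if e >= thresholds[i]]
        if finished:
            # If several cross in the same step, take the one furthest past its threshold.
            winner = max(finished, key=lambda i: evidence[i] - thresholds[i])
            return winner, t
    # Nothing finished in time: fall back to the accumulator closest to its threshold.
    winner = max(range(len(drifts)), key=lambda i: evidence[i] - thresholds[i])
    return winner, t

def simulate_block(drifts, thresholds, correct, n_trials=300, seed=0):
    """Return (proportion correct, mean decision time) over simulated trials."""
    rng = random.Random(seed)
    results = [simulate_trial(drifts, thresholds, rng) for _ in range(n_trials)]
    accuracy = sum(1 for winner, _ in results if winner == correct) / n_trials
    mean_time = sum(t for _, t in results) / n_trials
    return accuracy, mean_time

# Hypothetical values: a stimulus from category 0 with a faster drift toward response 0.
learner_acc, learner_time = simulate_block([2.0, 0.5, 0.5, 0.5], [2.0] * 4, correct=0)
# Equal drifts toward all four responses, i.e., no category knowledge.
guess_acc, guess_time = simulate_block([0.5] * 4, [2.0] * 4, correct=0)
# Same drifts as the first case but higher thresholds, i.e., more cautious responding.
cautious_acc, cautious_time = simulate_block([2.0, 0.5, 0.5, 0.5], [3.0] * 4, correct=0)
```

In this toy setting, a faster drift toward the matching response raises accuracy above the four-choice guessing level, and raising all thresholds lengthens decision times, which is the qualitative pattern the two parameters are meant to capture.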

The classical literature on DDMs has focused almost exclusively on binary decision-making in static settings and typically focuses on group-level analyses rather than heterogeneity across individuals. Recently, Paulon et al. [53] extended these models significantly, accounting for situations with more than two decision alternatives, heterogeneity across individuals, and longitudinal evolution of the decision-making processes by considering individual-specific and time-varying accumulators of evidence. As such, we will examine decision processes over time with estimates at both the group and individual subject level.

Rate of evidence accumulation

We predict that more WM resources may enable learners to acquire information from the stimulus more quickly, thereby reducing the perceived difficulty of the task and effort needed to learn. The rate of evidence accumulation reflects the quality of information extracted from the stimulus, with faster rates reflecting a faster evidence accumulation process. The evidence accumulation process may also reflect efficiency of retrieval or access to categorization exemplars or other representations in memory. Faster evidence accumulation rates are associated with motivation and better task performance [54]. Prior work has demonstrated that evidence accumulation rates are related to WM abilities, with faster evidence accumulation associated with higher WM capacity [55, 56].

Response caution

We predict that more WM resources may allow learners to be more cautious and less impulsive in their responses and to collect more evidence for a particular category response before making a decision. Response caution is reflected in the decision threshold. Higher thresholds reflect more cautious responses that need more evidence before a decision is made, whereas lower thresholds reflect more impulsive responses based on less evidence [57]. More difficult tasks result in more cautious response patterns, requiring that participants gather more information to make decisions [48, 58]. Individuals with higher WM capacity may have sufficient resources to gather and consult more information during decision making. As such, they may be more cautious in their responses, gathering more information to hold in WM as they learn to make more accurate decisions. This may ensure that the learner builds up enough of a representation of the stimulus before they make a response and, thus, enhance learning. Alternatively, individuals with higher WM capacity may have sufficient resources to maintain similar decision thresholds as individuals with lower WM capacity, enabling them to respond faster without making sacrifices in accuracy.

Summary

To summarize, we examine the relationship between WM capacity and non-native Mandarin tone speech category learning in an extended training task with three sessions separated by one and two months, respectively. To gain mechanistic insights on the putative relationship between WM and individual differences in category learning over time, we assess behavior from multiple angles. Specifically, we examine how initial and later learning performance, maintenance of performance across delays, generalization to novel talkers, rate of evidence accumulation, and response caution are related to WM capacity (Table 1).

Methods

Participants completed three sessions of Mandarin tone category learning separated by at least one and two months (Session 1 to 2: M = 32.1 days, SD = 0.68, range 31.7–35.6 days; Session 2 to 3: M = 61.4 days, SD = 2.56, range 56.6–70.9 days). Data from the first session appeared in a previously published study [12], and the second and third sessions have not appeared elsewhere.

Participants

Participants were adults ages 18–35 recruited from Prolific (prolific.co) and participated via Gorilla Experiment Builder [59]. A total of 198 participants completed session 1 (99 Female (F), 99 Male (M), M = 25.0 years, SD = 4.97). Three participants were excluded because they did not follow instructions on the WM task, leaving a total of 195 participants in session 1 (98 F, 97 M, M = 24.9 years, SD = 4.89). There was substantial attrition from session to session, and we excluded participants who did not complete all sessions: 153 completed session 2 (70 F, 83 M, M = 24.9 years, SD = 5.05), and 107 completed session 3 (47 F, 60 M, M = 24.8 years, SD = 5.07). Participants who completed only one or two sessions did not differ in WM or categorization accuracy compared to those who completed all sessions (Fig A in S1 File).

Participants completed a language history questionnaire prior to participating. All participants were native speakers of non-tonal languages and reported no prior experience with any tonal languages, including Mandarin. Participants were given a sound check before the start of each session to ensure they could hear the sounds and were wearing headphones. Participants received $10/session for their participation (total up to $30 across three sessions). Informed consent was obtained from all participants. The study protocol was approved by the Institutional Review Board at the University of Pittsburgh.

Stimuli

The stimuli were natural speech productions recorded from four native speakers (2 M, 2 F) of Mandarin Chinese (Fig 1A). Each tone category (e.g., high-flat, low-rising, low-dipping, and high-falling) was produced by each speaker in five syllable contexts (/bu/, /di/, /lu/, /ma/, and /mi/) for a total of 80 stimuli (20/category). The stimuli from two speakers (1 F, 1 M) were used during the training blocks and the stimuli from the other speakers (1 F, 1 M) were withheld for the generalization block. The same 40 generalization stimuli were presented in the generalization block of each session and participants never received feedback about these stimuli. To reduce incidental differences in duration across categories, the stimuli were duration-normalized to 440 ms and RMS-amplitude normalized to 70 dB. The stimuli are shown in Fig 1A in a two-dimensional space (relative pitch, pitch change) that can be used to separate the stimuli into categories and is linked to neural representations of these categories [60, 61].
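The stimulus counts above follow from crossing speakers, tones, and syllables. The sketch below enumerates that design to make the arithmetic explicit; the speaker labels (`F1`, `M1`, `F2`, `M2`) and the function name `build_stimuli` are hypothetical, since the text only states that there were two male and two female speakers.

```python
from itertools import product

tones = ["high-flat", "low-rising", "low-dipping", "high-falling"]
syllables = ["bu", "di", "lu", "ma", "mi"]
# Hypothetical speaker labels: one female and one male speaker per set.
training_speakers = ["F1", "M1"]
generalization_speakers = ["F2", "M2"]

def build_stimuli(speakers):
    """Cross every speaker with every tone and syllable."""
    return [
        {"speaker": sp, "tone": tone, "syllable": syl}
        for sp, tone, syl in product(speakers, tones, syllables)
    ]

training = build_stimuli(training_speakers)              # 2 x 4 x 5 = 40 stimuli
generalization = build_stimuli(generalization_speakers)  # 2 x 4 x 5 = 40 stimuli
per_category = sum(
    1 for s in training + generalization if s["tone"] == "high-flat"
)  # 4 speakers x 5 syllables = 20 per tone category
```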

Fig 1. Stimuli and procedure.

A. Two-dimensional representation of stimuli used during category learning and generalization with colors reflecting different tone categories. B. Session procedure. C. Task procedure.

Procedure

Category learning

Participants completed three separate sessions of category learning (Fig 1B). Sessions 1 and 2 were separated by one month. Sessions 2 and 3 were separated by two months. In each session, participants completed six blocks of an identical category learning task and an additional generalization block with different stimuli and no feedback. The stimuli were the same across sessions. Participants never received feedback about the generalization stimuli. At the beginning of the experiment, participants were told that they would be grouping sounds into different categories based on corrective feedback. They were not given any specific instructions about the stimuli or what might differentiate the categories from one another.

In the category learning task, there were six blocks of 40 trials each. In the generalization task, there was one block of 40 trials. Participants heard the 440 ms duration sound, followed by a prompt about the category identity (“Which category?”) (Fig 1C). They pressed the 1, 2, 3, or 4 key on the keyboard to respond. Participants received trial-by-trial feedback in the category learning task where they were informed about whether their decision was ‘Correct’ or ‘Incorrect.’ The feedback was presented immediately for 750 ms. Participants did not receive feedback in the generalization task. In both tasks, there was an intertrial interval of 1 s.

Working memory capacity

In the first session, participants first completed the category learning and generalization blocks and then completed an operation span task [62] as a measure of WM capacity. Participants were shown simple arithmetic problems and reported whether the presented solutions were correct or incorrect (e.g., (1 + 7) x 2 = 16) and were then shown a letter on the screen (e.g., A). A sequence of these arithmetic problems and letters from three to seven items in length made up a trial. After a full sequence was presented, participants were instructed to recall the letters presented in order. There were 15 trials. Participants’ WM capacity was calculated based on the OSPAN score–the sum of the length of all correctly recalled spans. For example, if a participant correctly recalled a sequence of four letters (e.g., A, I, D, F), four points were added to their score. The minimum possible OSPAN score is 0 and the maximum possible OSPAN score is 75. We did not filter scores based on accuracy on the arithmetic problems [63] and participants were generally very accurate (M = 85%, SD = 14%; Fig B in S1 File).
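The scoring rule described above can be written out as a short function. This is a sketch of the stated rule only: the function name `ospan_score` and the example letters are hypothetical, and the assumption of three trials at each set size from three to seven is inferred from the stated 15 trials and maximum score of 75 rather than given explicitly.

```python
def ospan_score(trials):
    """Sum the lengths of all letter sequences recalled perfectly and in order.

    `trials` is a list of (presented_letters, recalled_letters) pairs.
    """
    return sum(
        len(presented)
        for presented, recalled in trials
        if list(recalled) == list(presented)
    )

# Assumed design: set sizes 3 to 7, three trials each (15 trials in total).
set_sizes = [3, 4, 5, 6, 7] * 3
max_score = sum(set_sizes)

example = [
    (["A", "I", "D", "F"], ["A", "I", "D", "F"]),  # fully correct: adds 4 points
    (["K", "L", "M"], ["K", "M", "L"]),            # wrong order: adds 0 points
]
example_score = ospan_score(example)
```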

Drift diffusion modeling

We applied a variant of the DDMs developed in Paulon et al. [53]. The model estimates the evidence accumulation rate (i.e., drift) $\mu_{d,s}$ for each combination of decision response $d$ and stimulus category $s$, and decision thresholds (i.e., boundaries) $b_d$ for each decision response $d$. The model also fits offset parameters $\delta_s$ for each stimulus category, which characterize the time taken by actions that are not directly relevant to the decision-making process itself (e.g., the time required to encode the $s$-th stimulus before evidence accumulation begins, to press a computer key, to record a response after a decision is reached, etc.). The model lets the parameters $\mu_{d,s}$, $b_d$, and $\delta_s$ vary between participants, which accommodates the substantial variability across participants. Importantly, the model also allows $\mu_{d,s}$ and $b_d$ to evolve smoothly over time (across training blocks), capturing changes in the decision-making processes as participants learn. We allowed the drift rates to vary across both stimulus category and response, assuming that participants gather evidence towards each of the four possible response options at different rates depending on the true identity of the stimulus category. The decisions participants make in this task are tied directly to the sound category. Exemplars from within a sound category share characteristics and differ from exemplars from other sound categories. Due to the stimulus characteristics, participants may accumulate evidence at different rates for the different stimulus-response combinations. Boundaries varied only across responses; that is, response caution did not depend on the true stimulus category.

The data were filtered to exclude very fast and very slow responses by removing the top and bottom 1% of all trials across all participants based on reaction time. The remaining data, comprising both correct and incorrect trials, were used to estimate the parameters. Since gradual improvements in making correct decisions characterize learning, in our discussions below, we focus on the drift rates associated with successful identification of the stimulus ($\mu_{d,s}$ for correct responses, where $s = d$). Consideration of all responses does not change the overall results (see Fig D in S1 File, Table D in S1 File).
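The reaction-time filter described above can be sketched as follows. The function name `trim_reaction_times`, the way cutoffs are picked from the sorted values, and the example data are assumptions made for illustration; the source does not state which quantile convention was used.

```python
def trim_reaction_times(rts, proportion=0.01):
    """Drop the fastest and slowest `proportion` of trials, keeping trial order.

    Cutoffs come from the sorted values; ties at a cutoff are also dropped.
    """
    ordered = sorted(rts)
    k = int(len(ordered) * proportion)  # number of trials to cut from each tail
    if k == 0:
        return list(rts)
    low, high = ordered[k - 1], ordered[len(ordered) - k]
    return [rt for rt in rts if low < rt < high]

# Hypothetical pooled reaction times (arbitrary units) from all participants.
pooled = [float(i) for i in range(1, 201)]
kept = trim_reaction_times(pooled)
```

With 200 pooled trials, two are removed from each tail, leaving 196.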

We adopted a Bayesian framework for these analyses, assigning priors to the parameters and relying on samples drawn from the posterior using a Markov chain Monte Carlo (MCMC) algorithm for estimation and inference. The algorithm was run for 6,000 iterations with the initial 2,000 iterations discarded as burn-in. The remaining samples were further thinned by an interval of 5 to reduce autocorrelation. MCMC diagnostics such as trace-plots of the parameters, Geweke test for stationarity of the chains, etc. indicated no convergence or mixing-related issues. Posterior predictive checks indicated good model fit. Finally, posterior means are reported as point estimates and pointwise credible intervals are used to assess uncertainty. For more details on the implementation of these models, see S1 File.
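The burn-in and thinning settings above determine how many posterior draws are retained. The sketch below works through that arithmetic and shows one common way to compute a posterior mean and an equal-tailed interval from draws. The 95% level, the function name `summarize`, and the simulated chain are assumptions for illustration; the source does not state the interval level, and the chain here is random noise standing in for real sampler output.

```python
import random

n_iter, burn_in, thin = 6000, 2000, 5
# Indices of the draws kept after discarding burn-in and thinning by 5.
kept = list(range(burn_in, n_iter, thin))

def summarize(draws, level=0.95):
    """Posterior mean and an equal-tailed credible interval from MCMC draws."""
    ordered = sorted(draws)
    n = len(ordered)
    lo = ordered[int((1 - level) / 2 * n)]
    hi = ordered[min(n - 1, int((1 + level) / 2 * n))]
    return sum(ordered) / n, (lo, hi)

# Stand-in chain: independent Gaussian draws, not output from a real sampler.
rng = random.Random(1)
chain = [rng.gauss(1.0, 0.2) for _ in range(n_iter)]
post_mean, (ci_low, ci_high) = summarize([chain[i] for i in kept])
```

Under these settings, (6,000 - 2,000) / 5 = 800 draws remain per parameter.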

Data were visualized and analyzed using R, version 4.3.1 [64] and the following R packages: tidyverse, version 1.3.2 [65], ggplot2, version 3.4.3 [66], ggthemes, version 4.2.4 [67], lddmm, version 0.4.2 [68], lme4, version 1.1.34 [69], lmerTest, version 3.1.3 [70], rstatix, version 0.7.2 [71].

Results

Learning performance

On average, participants learned the Mandarin tone categories with substantial individual variability in performance (Fig 2A). For context, we also plot the reaction times (Fig 2B). We note that for visualization of performance across blocks, we grouped participants by their WM scores based on a median split (Mdn = 46), with values equal to or higher than the median defined as high WM and values lower than the median being defined as low WM. The analyses were conducted using raw OSPAN scores as a continuous variable with linear mixed effects models using the lme4 package in R [69] and are also shown (Fig 2C).
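The median split used for visualization can be sketched as below, with scores equal to the median assigned to the high group as described. The function name `median_split` and the example scores are hypothetical.

```python
from statistics import median

def median_split(scores):
    """Label each score 'high' if it is at or above the median, else 'low'."""
    cutoff = median(scores)
    return cutoff, ["high" if s >= cutoff else "low" for s in scores]

# Hypothetical OSPAN scores chosen so that the median is 46.
scores = [12, 30, 46, 46, 58, 75]
cutoff, groups = median_split(scores)
```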

Fig 2. Working memory and learning performance across all participants.

A. Accuracy and B. Reaction times after removing the shortest and longest 1% of responses. Error bars reflect SEM. For purposes of illustration, high and low working memory groups are defined based on a median split of working memory (OSPAN) scores. C. Relation between OSPAN score and proportion correct across blocks and sessions for all participants.

We examined the extent to which WM capacity, indexed by the OSPAN score, was associated with performance in the category learning task. We used linear mixed effects models with session (as categorical variable), block, WM capacity, all possible interactions as fixed effects, participant (intercept) as a random effect, and average accuracy across a block as the continuous outcome variable. Session 1 was treated as the baseline session. Full results are presented in Table 2.

Table 2. Summary of results on WM capacity and category learning performance.

| | β | SE | p |
|---|---|---|---|
| Intercept | 26.0 | 5.09 | < .0001 |
| OSPAN | -0.011 | 0.11 | .92 |
| Block | 1.84 | 0.62 | .0032 |
| Session 2 | 6.51 | 3.43 | .058 |
| Session 3 | 14.5 | 3.43 | < .0001 |
| OSPAN * Block | 0.055 | 0.013 | < .0001 |
| OSPAN * Session 2 | 0.31 | 0.074 | < .0001 |
| OSPAN * Session 3 | 0.22 | 0.074 | .0026 |
| Block * Session 2 | -0.28 | 0.88 | .75 |
| Block * Session 3 | -1.12 | 0.88 | .20 |
| OSPAN * Block * Session 2 | -0.053 | 0.019 | .0057 |
| OSPAN * Block * Session 3 | -0.032 | 0.019 | .095 |

β, estimate. SE, standard error of estimate. p, p-value. OSPAN, operation span score.

Overall, accuracy improved linearly across blocks in all sessions (βBlock = 1.84, SE = 0.62, p = .0032; βBlock*Session2 = -0.28, SE = 0.88, p = .75; βBlock*Session3 = -1.12, SE = 0.88, p = .20) and improved marginally in session 2 from session 1 (βSession2 = 6.51, SE = 3.43, p = .058) and significantly in session 3 from session 1 (βSession3 = 14.5, SE = 3.43, p < .0001).

Collapsing across blocks, the relationship between WM score and accuracy was not significant in session 1 (βOSPAN = -0.011, SE = 0.11, p = .92), but was significantly stronger in sessions 2 and 3 (βOSPAN*Session2 = 0.31, SE = 0.074, p < .0001; βOSPAN*Session3 = 0.22, SE = 0.074, p = .0026). Importantly, the relationship between WM score and accuracy interacted with both block and session. In session 1, there was a positive relationship between WM and accuracy that became stronger across blocks (βOSPAN*Block = 0.055, SE = 0.013, p < .0001). A one unit increase in WM score was associated with an additional 0.055% increase in accuracy in each block. While in the first block, the relationship between WM score and accuracy was very weak (0.044%), by the final block, the relationship was clearly positive (0.32%). As a reminder, WM scores could range from 0 to 75, so even a relatively modest increase in WM score of 10 points would be associated with an additional increase in accuracy of 3.2% in the final block of session 1. A larger difference in WM score of 30 points would be associated with an additional increase in accuracy of 9.6% in this block.

One month later, in session 2, there was a positive relationship between WM and accuracy. While the relationship between WM and accuracy became stronger across blocks, the relative change was significantly smaller than in session 1 (βOSPAN*Block*Session2 = -0.053, SE = 0.019, p = .0057). In session 2, a one unit increase in WM score was associated with an additional 0.002% increase in accuracy in each block. Across blocks, the relationship between WM score and accuracy was similar to session 1 (range 0.30% - 0.31%).

Two months after session 2, in session 3, there was a positive relationship between WM and accuracy that became stronger across blocks in a way that was not significantly different from session 1 (βOSPAN*Block*Session3 = -0.032, SE = 0.019, p = .095). In session 3, a one unit increase in WM score was associated with an additional 0.023% increase in accuracy in each block. In the first block, the relationship between WM score and accuracy was 0.24% and by the final block, the relationship was similar to the final blocks of the other sessions (0.35%).
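The per-block relationships quoted for each session can be recovered by combining the OSPAN terms in Table 2. The sketch below does this with the rounded coefficients as printed, assuming blocks are coded 1 to 6 and session 1 is the baseline; the dictionary layout and the function name `ospan_slope` are hypothetical. Because the inputs are rounded, results may differ from the quoted values in the last digit.

```python
# Rounded fixed-effect estimates copied from Table 2 (accuracy in percent).
B = {
    "ospan": -0.011,
    "ospan:block": 0.055,
    "ospan:session": {1: 0.0, 2: 0.31, 3: 0.22},
    "ospan:block:session": {1: 0.0, 2: -0.053, 3: -0.032},
}

def ospan_slope(session, block):
    """Change in predicted accuracy per one-point increase in OSPAN score."""
    per_block = B["ospan:block"] + B["ospan:block:session"][session]
    return B["ospan"] + B["ospan:session"][session] + per_block * block

# Slopes in the first and final block of each session.
slopes = {
    session: (round(ospan_slope(session, 1), 3), round(ospan_slope(session, 6), 3))
    for session in (1, 2, 3)
}
# Example: a 10-point OSPAN difference in the final block of session 1.
ten_point_effect = 10 * ospan_slope(1, 6)
```

For session 1 this gives roughly 0.044 in block 1 and 0.32 in block 6, so a 10-point difference in OSPAN corresponds to about 3.2 points of accuracy in the final block.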

Taken together, we found that working memory ability was positively associated with speech category learning accuracy across training sessions, with the relationship becoming stronger across blocks in sessions 1 and 3 and remaining stable in session 2. While in the very initial stages of learning, WM score was not significantly related to accuracy (0.044% in first block of session 1), by the end of session 1 and persisting through the other sessions, WM score was positively related to accuracy (range 0.24% to 0.35%). The positive relationship between WM ability and performance emerged within the first session and remained relatively stable throughout follow up sessions 2 and 3.
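The block-wise slopes quoted above can be recovered from the reported coefficients. The following is a minimal sketch, not the authors' analysis code: it assumes block is coded 1 to 6 and that session 1 is the reference level, and the function and variable names are illustrative only. Because it uses the rounded coefficients printed in the text, small differences from the reported slopes (for example in the first block of session 3) are expected.

```python
# Fixed-effect estimates for the all-participant model, as reported in the text.
# Assumptions: block is coded 1-6 and session 1 is the reference level.
B_OSPAN = -0.011
B_OSPAN_BLOCK = 0.055
B_OSPAN_SESSION = {1: 0.0, 2: 0.31, 3: 0.22}
B_OSPAN_BLOCK_SESSION = {1: 0.0, 2: -0.053, 3: -0.032}


def wm_slope(session: int, block: int) -> float:
    """Change in accuracy (%) per one-unit increase in OSPAN score."""
    per_block = B_OSPAN_BLOCK + B_OSPAN_BLOCK_SESSION[session]
    return B_OSPAN + B_OSPAN_SESSION[session] + block * per_block


for session in (1, 2, 3):
    first, last = wm_slope(session, 1), wm_slope(session, 6)
    print(f"session {session}: block 1 = {first:.3f}, block 6 = {last:.3f}")
```

Multiplying a slope by a difference in OSPAN score gives the associated difference in accuracy, which is how the 10-point and 30-point examples in the text are obtained.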

Learners and non-learners

Importantly, we also aimed to understand if the relationship between WM capacity and accuracy was present when considering only participants who learned the categories. We identified participants who performed at or below chance levels in the final block of session 3 (defined by 95% cumulative binomial probability, 40 trials, 0.25 probability of correct response = 25% +/- 10%) as ‘non-learners’ and those who performed better than chance as ‘learners’ (Fig 3A). Even though the non-learners were defined based on their accuracy in the final block of session 3, non-learners had significantly lower accuracy throughout all blocks (Bonferroni-corrected pairwise comparisons, p < .001), except for the first block of session 1 (p = .078). This underlines the necessity of considering learners separately from non-learners.
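To make the chance criterion concrete, the sketch below computes the cumulative binomial probability of scoring at or below the upper bound of the chance range (35%, i.e., 14 of 40 trials) when guessing among four categories. It is an illustration under the values stated in the text, not the authors' code; the helper names are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


N_TRIALS = 40            # trials in the final block (from the text)
P_GUESS = 0.25           # four categories, so chance is 25%
MAX_CHANCE_CORRECT = 14  # 35% of 40 trials, the upper bound of 25% +/- 10%


def is_learner(n_correct: int) -> bool:
    """Illustrative rule: above the chance range counts as a learner."""
    return n_correct > MAX_CHANCE_CORRECT


prob = binom_cdf(MAX_CHANCE_CORRECT, N_TRIALS, P_GUESS)
print(f"P(correct <= {MAX_CHANCE_CORRECT} of {N_TRIALS} | guessing) = {prob:.3f}")
```

Under these assumptions, a participant who is purely guessing falls at or below 35% accuracy with a probability close to .95, so scores above that bound are unlikely to arise from guessing alone.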

Fig 3. Working memory and learning performance across learners and non-learners.


A. Accuracy and B. Reaction times after removing the shortest and longest 1% of responses. Error bars reflect SEM. For purposes of illustration, high and low working memory groups are defined based on a median split of working memory (OSPAN) scores. Groups are additionally separated into learners and non-learners based on session 3 block 6 accuracy and whether it was greater (learners) or less than (non-learners) chance performance. C. Relation between OSPAN score and proportion correct across blocks for learners only.

A total of 32% (34/107) of participants were classified as non-learners. WM scores for learners (M = 44.1) were marginally higher than non-learners (M = 36.7; t(65.6) = 1.79, p = .078, 95% CI [-0.84, 15.6]). This may indicate that individuals with lower WM may be more likely to be non-learners. It is important to note that we cannot completely rule out that non-learners with seemingly lower WM may have been generally disengaged in the experiment, leading to poorer performance in both the WM task and the category learning task. If this is the case, WM scores for these individuals may not reflect their true WM abilities. As post-hoc evidence that some participants may have been disengaged across tasks, we found that learners (M = 90%) performed better than non-learners (M = 79%) at identifying the arithmetic equations as correct or incorrect in the WM task (t(44) = 3.49, p = .0011, 95% CI [4.58, 17.1]; Fig B in S1 File). In the following analyses, we focus on the remaining 68% (73/107) of participants who are operationally defined as ‘learners’ in the category learning task. Because the accuracies of non-learners were within a low and highly restricted range by definition, we examined the relationship between WM score and accuracy for learners only.

To understand if the relationship between WM and category learning performance was present when examining learners only, we ran the same linear model analysis with learners only (Fig 3C; Table 3). Session 1 was treated as a baseline.

Table 3. Summary of results on WM capacity and category learning performance across groups.

Term | β | SE | p
Intercept | 27.5 | 4.94 | < .0001
OSPAN | -0.029 | 0.10 | .78
Block | 3.46 | 0.71 | < .0001
Session 2 | 13.0 | 3.90 | .00090
Session 3 | 25.0 | 3.90 | < .0001
OSPAN * Block | 0.050 | 0.015 | .00059
OSPAN * Session 2 | 0.30 | 0.081 | .00018
OSPAN * Session 3 | 0.18 | 0.081 | .023
Block * Session 2 | -0.56 | 1.00 | .58
Block * Session 3 | -1.12 | 1.00 | .26
OSPAN * Block * Session 2 | -0.058 | 0.021 | .0048
OSPAN * Block * Session 3 | -0.043 | 0.021 | .039

β, estimate. SE, standard error of estimate. p, p-value. OSPAN, operation span score.

Of critical interest is whether WM score and accuracy were still positively related when examining only those who learned the categories. In session 1 ignoring block, the relationship between WM and accuracy was not significant (βOSPAN = -0.029, SE = 0.10, p = .78). However, this relationship became stronger across blocks (βOSPAN*Block = 0.050, SE = 0.015, p = .00059; βOSPAN*Block*NonLearners = -0.043, SE = 0.026, p = .11). A one unit increase in WM score was associated with an additional 0.050% increase in accuracy in each block for learners. By the final block of session 1, a one unit increase in WM score was associated with a 0.27% increase in accuracy for learners.

In session 2, the relationship between WM and accuracy was positive and significantly stronger than session 1 (βOSPAN*Session2 = 0.30, SE = 0.081, p = .00018). Ignoring block, a one unit increase in WM was associated with an increase in accuracy of 0.27%. This relationship was relatively stable, becoming mildly weaker across blocks. The relationship between WM score and accuracy across blocks was significantly different from session 1 (βOSPAN*Block*Session2 = -0.058, SE = 0.021, p = .0048). A one unit increase in WM score was associated with an additional 0.008% decrease in accuracy in each block. By the final block of session 2, a one unit increase in WM score was associated with a 0.23% increase in accuracy for learners.

In session 3, the relationship between WM and accuracy was positive and significantly stronger than session 1 (βOSPAN*Session3 = 0.18, SE = 0.081, p = .023). Ignoring block, a one unit increase in WM was associated with an increase in accuracy of 0.15%. The relationship was relatively stable, becoming mildly stronger across blocks. The relationship between WM score and accuracy across blocks was significantly different from session 1 (βOSPAN*Block*Session3 = -0.043, SE = 0.021, p = .039). A one unit increase in WM score was associated with an additional 0.007% increase in accuracy in each block for learners. By the final block of session 3, a one unit increase in WM score was associated with a 0.20% increase in accuracy for learners.

Among learners only, higher WM ability was associated with better non-native speech category learning performance. This relationship emerged within the first session and was persistent across sessions 2 and 3 and, unsurprisingly, was slightly weaker than the relationship including all participants. The slope of the relationship between WM score and accuracy was 0.27% in the final block of session 1, 0.23% in the final block of session 2, and 0.20% in the final block of session 3.
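For a sense of scale, the Table 3 estimates can be combined into model-implied accuracies for individual learners. The sketch below is an illustration only: it uses the rounded fixed-effect estimates from Table 3, assumes block is coded 1 to 6 with session 1 as the baseline, and the OSPAN scores of 30 and 60 are hypothetical values chosen for the example.

```python
# Fixed-effect estimates copied from Table 3 (learners only; session 1 is the baseline).
COEF = {
    "intercept": 27.5,
    "ospan": -0.029,
    "block": 3.46,
    "session": {2: 13.0, 3: 25.0},
    "ospan:block": 0.050,
    "ospan:session": {2: 0.30, 3: 0.18},
    "block:session": {2: -0.56, 3: -1.12},
    "ospan:block:session": {2: -0.058, 3: -0.043},
}


def predicted_accuracy(ospan: float, block: int, session: int) -> float:
    """Model-implied accuracy (%) from the fixed effects, assuming block is coded 1-6."""
    c = COEF
    pred = c["intercept"] + c["ospan"] * ospan + c["block"] * block
    pred += c["ospan:block"] * ospan * block
    if session != 1:
        pred += c["session"][session]
        pred += c["ospan:session"][session] * ospan
        pred += c["block:session"][session] * block
        pred += c["ospan:block:session"][session] * ospan * block
    return pred


# Hypothetical learners with OSPAN scores of 30 and 60, final block of each session.
for session in (1, 2, 3):
    low = predicted_accuracy(30, 6, session)
    high = predicted_accuracy(60, 6, session)
    print(f"session {session}: OSPAN 30 -> {low:.1f}%, OSPAN 60 -> {high:.1f}%")
```

The gap between the two hypothetical learners in a given block equals 30 times the corresponding slope, so it narrows slightly from session 1 to session 3 while both predicted accuracies rise.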

Maintenance of category knowledge over time

By examining learning across several sessions separated by one and two months, respectively, we can assess the maintenance of categorization performance and category knowledge over time. We assessed category knowledge maintenance by comparing adjacent training blocks that were either separated by no delay (i.e., blocks 5 and 6 of the same session) or a delay of one or two months (i.e., block 6 of one session and block 1 of the next session). Performance across these blocks and sessions for learners and non-learners separately is shown in Fig 4A. Because we are interested in how knowledge is retained over time, we focus our analyses only on learners.

Fig 4. Working memory and performance maintenance.


A. Error bars reflect SEM. For purposes of illustration, high and low working memory groups are defined based on a median split of working memory (OSPAN) scores. Groups are additionally separated into learners and non-learners based on session 3 block 6 accuracy and whether it was greater (learners) or less than (non-learners) chance performance. B. Relation between OSPAN score and percent difference from block 5 to 6 within a session (No Delay) and block 6 to block 1 (Delay) for learners only.

Learners were somewhat able to maintain their category knowledge after a month or more of no additional training. Between sessions 1 and 2, accuracy fell an average of 7.2% (58.0% in block 6 to 50.9% in block 1) and between sessions 2 and 3, accuracy fell an average of 7.0% (65.6% in block 6 to 58.6% in block 1). In contrast, accuracy was relatively stable at the end of the sessions, with accuracy increasing by 1.8% in session 1 (56.0% in block 5 to 57.8% in block 6) and by 0.3% in session 2 (65.2% in block 5 to 65.5% in block 6).

The ability to maintain category performance in adjacent blocks both with no delay (i.e., block 5 vs block 6) and after a one- or two-month delay (i.e., block 6 and block 1 of the next session) was unrelated to learners’ WM capacity (Fig 4B, Table B in S1 File). We examined the percent difference between adjacent blocks across sessions using a linear mixed effects model with time (session 1 to 2 as baseline), delay (delay as baseline), WM score (OSPAN), and all interactions as fixed effects and participant as a random effect. WM was unrelated to the retention of performance across sessions 1 to 2 (βOSPAN = 0.032, SE = 0.068, p = .64) and sessions 2 to 3 (βOSPAN*Sessions 2 to 3 = 0.020, SE = 0.094, p = .83). The relationship between WM and retention did not depend on whether there was a delay of a month (βOSPAN*Delay = 0.012, SE = 0.094, p = .90) or two months (βOSPAN*Delay*Sessions 2 to 3 = -0.14, SE = 0.13, p = .29).

Generalization to novel speakers

By examining how participants respond to new speakers about which they never receive feedback, we can assess the generalizability of their category knowledge. We first calculated a generalization score by subtracting the final training block accuracy from the test accuracy. Overall, learners were successful at generalizing their knowledge to the new speakers (Fig 5A). Once again, we focus our analyses on learners as there is no clear category knowledge for non-learners to generalize. We examined whether generalization performance across sessions was related to WM capacity by examining session (session 1 as baseline), WM score (OSPAN), and the interaction between session and WM score as fixed effects and participant as a random effect (Fig 5B, Table C in S1 File).

Fig 5. Working memory and category generalization.


A. Error bars reflect SEM. For purposes of illustration, high and low working memory groups are defined based on a median split of working memory (OSPAN) scores. Groups are additionally separated into learners and non-learners based on session 3 block 6 accuracy and whether it was greater (learners) or less than (non-learners) chance performance. B. Relation between OSPAN score and generalization test score (mean generalization accuracy–mean block 6 accuracy) across sessions for learners only.

WM ability was not significantly related to learners’ generalization ability in session 1 (βOSPAN = 0.080, SE = 0.055, p = .14). There were no significant differences in the relationship between WM and generalization accuracy in sessions 1 and 2 (βOSPAN*Session2 = -0.049, SE = 0.076, p = .52) or sessions 1 and 3 (βOSPAN*Session3 = -0.085, SE = 0.076, p = .26). Overall, these results demonstrate that, among learners, WM ability is not significantly related to the ability to generalize Mandarin tone category knowledge to novel speakers.

Decision processes

We examined participants’ decision processes based on the parameters from the drift diffusion models. We focus on the evidence accumulation rate (i.e., drift rate; Fig 6A) and decision threshold (i.e., boundary; Fig 6C) parameters. As these are Bayesian analyses, we interpret differences between groups where there is no overlap in the 95% credible intervals. We estimated the parameters for each individual and block, separately across sessions, with all subjects together (i.e., both learners and non-learners). As in prior work, we focus on the results for drift rates for accumulators where the stimulus category is the same as the response category (i.e., correct responses) [51]. This allows for examination of decision processes at play on trials where participants made correct responses. The overall pattern of results does not change when examining responses from all accumulators (Fig D in S1 File, Table D in S1 File).
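The non-overlap criterion for credible intervals can be illustrated with a short sketch. It is not the authors' code: the posterior draws are simulated, hypothetical values rather than estimates from this study, and an equal-tailed interval is assumed here because the text does not state which type of interval was used.

```python
import random


def credible_interval(samples, mass=0.95):
    """Equal-tailed interval from posterior draws, using simple sorted-index percentiles."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - mass) / 2.0
    lower = ordered[int(tail * (n - 1))]
    upper = ordered[int((1.0 - tail) * (n - 1))]
    return lower, upper


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


rng = random.Random(0)
# Hypothetical posterior draws of a drift rate for two groups (not the study's estimates).
group_a = [rng.gauss(1.2, 0.10) for _ in range(4000)]
group_b = [rng.gauss(0.6, 0.10) for _ in range(4000)]

ci_a = credible_interval(group_a)
ci_b = credible_interval(group_b)
print("group A 95% interval:", ci_a)
print("group B 95% interval:", ci_b)
print("treated as different:", not intervals_overlap(ci_a, ci_b))
```

Requiring no overlap at all is a conservative rule: two groups whose intervals overlap slightly may still differ, so the criterion is more likely to miss a difference than to report a spurious one.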

Fig 6. Working memory and decision processes.


A and C: error bars reflect 95% credible intervals. For purposes of illustration, high and low working memory groups are defined based on a median split of working memory (OSPAN) scores. Groups are additionally separated into learners and non-learners based on session 3 block 6 accuracy and whether it was greater (learners) or less than (non-learners) chance performance. B and D: relation between OSPAN score and evidence accumulation rate and decision threshold for learners only.

First, we note the difference between learners and non-learners. Learners had higher evidence accumulation rates and higher decision thresholds than non-learners. In learners, the evidence accumulation rates increased over time, indicating that they became faster at accumulating evidence towards the correct decision. In contrast, the evidence accumulation rates in non-learners were low and flat throughout training, providing evidence of their general disengagement from the task. The decision thresholds were lower in non-learners than in learners throughout the sessions, with the two groups diverging after the very first block of training, indicating that non-learners needed less evidence to make their decision. This pattern may indicate that non-learners’ decisions were based on optimizing speed rather than categorization accuracy.
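The qualitative link between these two parameters and behavior can be illustrated with a toy simulation. The sketch below is a deliberately simplified race between four independent noisy accumulators with a shared threshold; it is not the Bayesian drift diffusion model fitted in this study, and every parameter value is hypothetical and chosen only to make the pattern visible.

```python
import random


def simulate_trial(rng, drift_correct, drift_other, threshold,
                   n_choices=4, dt=0.01, noise_sd=1.0, max_time=20.0):
    """Race n_choices noisy accumulators to a shared threshold.

    Accumulator 0 stands for the correct category.
    Returns (is_correct, decision_time).
    """
    evidence = [0.0] * n_choices
    drifts = [drift_correct] + [drift_other] * (n_choices - 1)
    step_sd = noise_sd * dt ** 0.5
    t = 0.0
    while t < max_time:
        t += dt
        for i in range(n_choices):
            evidence[i] += drifts[i] * dt + rng.gauss(0.0, step_sd)
        best = max(range(n_choices), key=evidence.__getitem__)
        if evidence[best] >= threshold:
            return best == 0, t
    return False, max_time  # no accumulator finished in time; counted as an error


def summarize(threshold, drift_correct, drift_other=0.5, n_trials=300, seed=1):
    """Return (accuracy, mean decision time) over simulated trials."""
    rng = random.Random(seed)
    results = [simulate_trial(rng, drift_correct, drift_other, threshold)
               for _ in range(n_trials)]
    accuracy = sum(ok for ok, _ in results) / n_trials
    mean_time = sum(t for _, t in results) / n_trials
    return accuracy, mean_time


# Hypothetical settings: high vs low threshold, and a flat (uninformative) drift rate.
cautious = summarize(threshold=3.0, drift_correct=2.0)
hasty = summarize(threshold=0.5, drift_correct=2.0)
disengaged = summarize(threshold=0.5, drift_correct=0.5)
for name, (acc, rt) in [("cautious", cautious), ("hasty", hasty), ("disengaged", disengaged)]:
    print(f"{name}: accuracy = {acc:.2f}, mean decision time = {rt:.2f}")
```

In this toy setting, lowering the threshold shortens decisions at the cost of accuracy, and a drift rate for the correct category that is no higher than the others leaves accuracy near the 25% chance level, which mirrors the pattern described for non-learners.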

Critically, our modeling approach enables estimation of the decision parameters at the individual participant level, allowing for examination of how these parameters relate to WM capacity. To understand how decision parameters differed based on WM in learners, we ran separate linear mixed effects models on the two parameters with block, session, WM score (OSPAN), and all interactions as fixed effects and participant as a random effect. Session 1 was treated as a baseline. Full results are shown in Tables 4 and 5. We focus on the results on the relationship between WM capacity and evidence accumulation rates and decision thresholds.

Table 4. Summary of results on WM capacity and evidence accumulation rate.

Term | β | SE | p
Intercept | 0.15 | 0.17 | .38
OSPAN | 0.00075 | 0.0035 | .83
Block | 0.11 | 0.021 | < .0001
Session 2 | 0.47 | 0.12 | < .0001
Session 3 | 0.64 | 0.12 | < .0001
OSPAN * Block | 0.0013 | 0.00044 | .0030
OSPAN * Session 2 | 0.0064 | 0.0024 | .0084
OSPAN * Session 3 | 0.0072 | 0.0024 | .0030
Block * Session 2 | -0.066 | 0.030 | .029
Block * Session 3 | -0.051 | 0.030 | .087
OSPAN * Block * Session 2 | -0.00059 | 0.00062 | .34
OSPAN * Block * Session 3 | -0.00065 | 0.00062 | .29

β, estimate. SE, standard error of estimate. p, p-value. OSPAN, operation span score.

Table 5. Summary of results on WM capacity and decision threshold.

Term | β | SE | p
Intercept | 1.35 | 0.085 | < .0001
OSPAN | -0.0024 | 0.0018 | .17
Block | -0.014 | 0.013 | .31
Session 2 | 0.0023 | 0.074 | .98
Session 3 | -0.049 | 0.074 | .51
OSPAN * Block | 0.00063 | 0.00028 | .022
OSPAN * Session 2 | 0.0047 | 0.0015 | .0022
OSPAN * Session 3 | 0.0035 | 0.0015 | .021
Block * Session 2 | -0.0091 | 0.019 | .63
Block * Session 3 | 0.035 | 0.019 | .068
OSPAN * Block * Session 2 | -0.00074 | 0.00039 | .058
OSPAN * Block * Session 3 | -0.00066 | 0.00039 | .090

β, estimate. SE, standard error of estimate. p, p-value. OSPAN, operation span score.

Overall, learners with higher WM capacity accumulated evidence more quickly towards the correct decision in each session (Fig 6B). In session 1, there was not a significant relationship between WM and evidence accumulation rate (βOSPAN = 0.00075, SE = 0.0035, p = .83). However, the relationship became significantly stronger across blocks (βOSPAN*Block = 0.0013, SE = 0.00044, p = .0030). A one unit increase in WM score was associated with an increase in evidence accumulation rate of 0.0021 units for learners in the first block of session 1 and 0.0086 units for learners in the final block of session 1.

The strength of the relationship between WM score and evidence accumulation rate also increased across sessions (βOSPAN*Session2 = 0.0064, SE = 0.0024, p = .0084; βOSPAN*Session3 = 0.0072, SE = 0.0024, p = .0030). In session 2, a one unit increase in WM score was associated with an increase in evidence accumulation rate of 0.0071 units for learners and this relationship was not significantly different across blocks (βOSPAN*Block*Session2 = -0.00059, SE = 0.00062, p = .34). In session 3, a one unit increase in WM score was associated with an increase in evidence accumulation rate of 0.0079 units for learners and this relationship was not significantly different across blocks (βOSPAN*Block*Session3 = -0.00065, SE = 0.00062, p = .29).

In contrast, there was no clear relationship between WM capacity and decision thresholds in any session (Fig 6D). In session 1, a one unit increase in WM score was associated with a non-significant decrease in threshold of 0.0024 units for learners (βOSPAN = -0.0024, SE = 0.0018, p = .17). The relationship between WM and threshold became slightly less negative across blocks in session 1 (βOSPAN*Block = 0.00063, SE = 0.00028, p = .022). A one unit increase in WM score was associated with a decrease in threshold of 0.0018 units for learners in the first block but an increase of 0.0014 units in the final block of session 1. Overall, in session 1, there was no clear relationship between WM score and decision threshold.
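As a worked check on the session 1 values above, and assuming block is coded 1 to 6 (an assumption that is consistent with the reported first-block and final-block slopes), the slope of decision threshold on WM score at block b follows directly from the Table 5 estimates:

```latex
\begin{aligned}
\text{slope}(b) &= \beta_{\text{OSPAN}} + b\,\beta_{\text{OSPAN}\times\text{Block}}
                 = -0.0024 + 0.00063\,b \\
\text{slope}(1) &= -0.0024 + 0.00063 = -0.00177 \approx -0.0018 \\
\text{slope}(6) &= -0.0024 + 6(0.00063) = 0.00138 \approx 0.0014 \\
\text{slope}(b) &= 0 \;\Rightarrow\; b = \frac{0.0024}{0.00063} \approx 3.8
\end{aligned}
```

Under this coding, the estimated slope changes sign between blocks 3 and 4 of session 1, and its magnitude stays below 0.002 units per OSPAN point in both the first and the final block.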

The relationship between WM and threshold differed in sessions 2 and 3 compared to session 1 (βOSPAN*Session2 = 0.0047, SE = 0.0015, p = .0022; βOSPAN*Session3 = 0.0035, SE = 0.0015, p = .021). However, this difference appears to stem from changing from a negligible negative relationship in session 1 to a negligible positive relationship in sessions 2 and 3. In session 2, a one unit increase in WM score was associated with an increase in threshold of 0.0023 units for learners, which did not significantly differ across blocks (βOSPAN*Block*Session2 = -0.00074, SE = 0.00039, p = .058). In session 3, a one unit increase in WM score was associated with an increase in threshold of 0.0011 units for learners, which did not significantly differ across blocks (βOSPAN*Block*Session3 = -0.00066, SE = 0.00039, p = .090). In sum, decision thresholds did not strongly relate to WM capacity in any session.

Overall, learners with higher WM capacity had faster evidence accumulation rates. The relationship began to emerge in the first session and was clearly present in the second and third sessions. In contrast, learners’ decision thresholds did not depend on WM capacity. Together, these results indicate that WM capacity impacts specific elements of decision-making differently across the trajectory of learning.

Discussion

We investigated non-native speech category learning in an initial learning session and in two follow up sessions, held one month and a further two months later, respectively. We examined the extent to which WM capacity was related to initial and later learning sessions and in which ways (Fig 7). Considering all participants, higher WM was associated with better speech category learning across learning stages. Participants with higher WM may also have been more likely to learn the categories than participants with lower WM. When considering only individuals who performed at above-chance levels (i.e., learners), WM was associated with better performance by later blocks of initial acquisition (session 1) and in intermediate and later sessions (sessions 2–3), with the relationship becoming somewhat weaker over time. WM ability was generally unrelated to maintenance of category knowledge over delays or generalization of category knowledge to new talkers. Finally, among learners, higher WM capacity was associated with faster evidence accumulation rates across learning sessions and was not associated with decision thresholds in any session.

Fig 7. Role of working memory in different stages of category learning.


Visualization of relationship between behavioral measures and working memory for learners based on the regression model coefficients. Error bars reflect SEM.

Learners and non-learners

Our results demonstrate that simply grouping all participants together does not tell a complete story because some participants clearly do not demonstrate learning, performing at chance levels even after extensive training. However, swiftly removing these non-learners as is common practice in the field [32–35] may obscure parts of the story as well. Participants who performed at or below chance levels at the end of three sessions of training were consistently poor performers across all blocks and sessions and had marginally lower WM scores than learners. Importantly, it is possible that non-learners with lower WM scores may have been generally disengaged in the experiment, performing poorly across all measures (Fig B in S1 File). In support of the interpretation that non-learners were generally disengaged in the task, they had very low and flat evidence accumulation rates across learning, which may be indicative of general task disengagement [54, 72].

Regardless of WM ability, we found that a substantial number of participants (32%) were classified as non-learners. These individuals returned for three separate sessions of the same task, yet never consistently performed above chance levels. It is important to consider participants’ goals and motivation for completing the task and compare these with experimenter-defined goals. Whereas we instructed them to respond as accurately as possible, their goal seemed to be to respond as quickly as possible regardless of accuracy, as evidenced by non-learners’ much lower decision thresholds relative to learners. Decision thresholds (i.e., response caution) are related to the speed-accuracy tradeoff [57], with lower decision thresholds reflecting a preference for speed over accuracy. As such, we interpret these low decision thresholds as a mark of these participants’ disengagement in the category learning task. Importantly, favoring speed over accuracy is an adaptive strategy if the goal is not to learn the categories, but instead to complete the experiment as quickly as possible [73].

It is necessary to understand and adapt to the goals of our participants. This study was conducted using an online population, rather than a more typical convenience sample of college students leveraged in prior studies. This approach presents challenges, but also highlights that the goals and motivations to perform a simple experimental task may be different among a broader population than in student populations often examined in experimental psychology research.

It is important to understand how task disengagement is related to WM ability to understand potential interventions to improve learning. It is unclear whether some non-learners want to learn but are unable to, or whether they are actively deciding to disengage from the task. Future work should include dynamic measures of task engagement, such as pupil dilation, to better understand how task engagement is related to WM and contributes to differences in learning outcomes. If task disengagement is truly related to WM and we want to improve learning for individuals with lower WM, a first step should be ensuring that they are engaged in the task in the first place.

Together, these results highlight the importance of considering individual differences in learning. In particular, these results call for special consideration of individuals who may be disengaged from the task. It is possible that one role that WM plays in learning is ensuring that resources are available for engagement in complex tasks.

Initial learning and learning over time

The main goal of the current study was to understand the role of working memory in learning beyond initial acquisition. In line with prior work, we found that WM was positively related to learning by the end of the first session [7–12]. The benefit of higher WM in initial learning may stem from the ability to hold in mind many possible hypotheses which helps learners home in on the best one and use it faster and more efficiently [7, 12, 13]. Our results are in line with this prior work and suggest a role for WM in initial non-native speech category acquisition.

As a novel contribution, our results extend these findings and demonstrate that among participants who eventually learn the speech categories, WM was related to learning performance starting at the end of session 1 and persisting in sessions 2 and 3. This pattern of results conflicts with other work on speech category learning that demonstrates that given multiple days of training, there is no clear link between WM and performance [18, 19]. However, these prior studies trained participants across days separated by very short delays, rather than delays of a month or more without additional training. Our results indicate that WM helps in initial acquisition of category knowledge, but individuals with lower WM may be able to ‘catch up’ given more time. Specifically, our results provide some preliminary evidence that the relationship between WM and non-native speech category learning may become weaker over time. Lower WM is not a sentence to poor learning forever. As long as participants remain engaged, they are able to learn.

This work also connects with prior investigations of learning from initial acquisition in novices to overtrained performance in experts in both language (e.g., [24]) and other perceptual contexts (e.g., [74]). While category representations start to emerge within a single session of training [30, 75], it is clear that further learning continues to shape representations and the networks supporting learning. For example, as individuals move from initial acquisition to highly experienced experts, there is a decrease in activation in sensory and frontal brain regions [76, 77], potentially reflecting increased neural efficiency with learning. Research from visual category learning demonstrates that similar neural networks support initial and well-learned categorization behavior, but that these networks become more coordinated with extensive practice [78]. Together, these results highlight the need to understand how learning and the cognitive abilities and processes that support categorization change from the very initial novice stages of learning to behavior in overtrained experts. This is particularly relevant for speech and language learning contexts, where expert or even genuinely stable levels of performance are unlikely to emerge in a single training session.

Task difficulty and effort

We found that WM was consistently related to faster evidence accumulation among learners. These results are in line with prior work that demonstrates that evidence accumulation rates are linked to individual differences in WM [55, 56]. Faster evidence accumulation rates reflect higher motivation [54], faster mobilization of attentional resources [79], and lower task difficulty [80–84].

We then might interpret the persistently higher evidence accumulation rate in learners with higher WM as reflective of heightened motivation, rapid mobilization of available attentional resources, and perhaps lower perceived difficulty of the task. That is, even when accuracies were similar, learners with higher WM may have achieved that level of performance with lower perceived difficulty and lower perceived or exerted effort. Conversely, lower evidence accumulation rates observed in learners with lower WM may be associated with slower mobilization of motivational or attentional resources and more perceived difficulty in the task. Future research should clarify how WM relates to perceived difficulty and perceived and exerted effort during learning.

In summary, these results indicate that higher WM capacity is not a guarantee of better learning. Rather, it reflects better initial acquisition and general performance due to the ability to hold multiple hypotheses in mind and more rapid decision-making processes throughout learning. Lower WM also does not doom one to poor performance and, instead, lower WM may be linked to more time and resource-dependent decision processes which may be more effortful for the learner. Future work should address the perceived and exerted effort in learning and how this is related to WM.

Limitations

We note that there was significant attrition across sessions. Whereas 195 individuals completed the first session, only 107 returned for both follow up sessions. This is a challenge for longitudinal designs using online samples but is a necessary challenge to overcome to understand learning beyond initial acquisition. While we considered non-learners who completed all three sessions, it is also important to consider participants who failed to complete all parts of the experiment. In future work, it will be important to understand participants’ reasons for returning or not returning to better understand what is motivating their performance in the task. Importantly, we found that WM did not differ based on how many sessions participants completed (Fig A-a in S1 File). This indicates that it was not just lower or higher WM individuals who failed to return for follow up sessions. There was also no difference in categorization accuracy based on the number of sessions participants completed. That is, within the same session, those who completed one, two, or all three sessions did not differ in their accuracy (Fig A-b in S1 File).

Another limitation of the current work is that we used a single measure of WM, measured at a single timepoint [85]. Specifically, we used an operation span measure based on ability to manipulate and remember a sequence of letters given a mathematical task interference. Operation span is extensively used and is a highly reliable measure of WM [85, 86]. Even still, one measure likely does not reflect the true complexity of WM. Further, because of the nature of the complex span task we used to assess WM capacity, it is possible that performance was influenced by some combination of WM and long-term memory [87]. As a result, the observed relationship between WM score and speech category learning performance may reflect the ability to hold onto and manipulate information in WM as well as retrieve exemplars or rules from long-term memory. However, it is important to note that measures that should theoretically be related to long-term memory or activation of exemplars stored in memory (e.g., maintenance, generalization) were not significantly related to WM score. Future studies should collect multiple measures of WM including visuospatial and auditory WM as well as measures of long-term memory to better understand how speech category learning relies on WM and long-term memory abilities.

Finally, participants learned four difficult categories with minimal feedback (e.g., “correct” or “incorrect”). Because this kind of feedback is ambiguous when the response is incorrect, it is possible that performance may have improved if we had provided full feedback (e.g., “correct, that was category 1”). However, prior work has demonstrated that Mandarin tone learning, as we examined here, is better with minimal feedback relative to full feedback [88]. Future studies will need to address the role of WM in learning with full and minimal feedback.

Conclusion

We examined the relationship between WM and non-native speech category learning, maintenance of category knowledge across sessions, generalization to novel talkers, and decision processes involved in learning. The results demonstrate that higher WM is not a guarantee of learning, nor is lower WM a sentence to long-term learning difficulties. WM is one important ability in supervised category learning. Here, we highlight the need for a nuanced approach that considers the stage of learning and whether participants eventually learn. By leveraging a drift diffusion modeling approach and examining behavior from several angles over time, we conclude that WM may help learners by facilitating rapid category acquisition in initial stages and enhanced performance during subsequent stages of learning due to rapid evidence accumulation that may reduce the effort needed to learn. These results have important implications for developing interventions to improve learning in naturalistic language contexts and understanding what it means to be engaged in a task.

Supporting information

S1 File

(DOCX)

Acknowledgments

All data, stimulus materials, and analysis code are publicly available at the Open Science Framework and can be accessed at https://doi.org/10.17605/OSF.IO/WDPYU. Data from the first session appeared in McHaney et al. (2021). Data from the second and third sessions have not appeared previously. Casey L. Roark is now at the University of New Hampshire, Department of Psychology. Jacie R. McHaney and Bharath Chandrasekaran are now at Northwestern University, Roxelyn and Richard Pepper Department of Communication Sciences and Disorders.

Data Availability

The stimulus materials, data, and analysis code are publicly available through the Open Science Framework repository and can be accessed online at https://doi.org/10.17605/OSF.IO/WDPYU.

Funding Statement

This research was supported by the National Institute on Deafness and Other Communication Disorders [R01DC013315A1 to B.C., F32DC018979 to C.L.R., and T32DC011499 to K. Kandler and B. Yates (trainee: J.R.M.)] and the National Science Foundation [NSF-1953712 to B.C. & A.S.]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Richler JJ, Palmeri TJ. Visual category learning. Wiley Interdiscip Rev Cognitive Sci. 2014;5: 75–94. doi: 10.1002/wcs.1268
  • 2. Holt LL, Lotto AJ. Speech perception as categorization. Atten Percept Psychophys. 2010;72: 1218–1227. doi: 10.3758/APP.72.5.1218
  • 3. Llanos F, McHaney JR, Schuerman WL, Yi HG, Leonard MK, Chandrasekaran B. Non-invasive peripheral nerve stimulation selectively enhances speech category learning in adults. Npj Sci Learn. 2020;5: 12. doi: 10.1038/s41539-020-0070-0
  • 4. Baese-Berk MM, Chandrasekaran B, Roark CL. The nature of non-native speech sound representations. J Acoust Soc Am. 2022;152: 3025–3034. doi: 10.1121/10.0015230
  • 5. Baddeley A. Working memory. Science. 1992;255: 556–559. doi: 10.1126/science.1736359
  • 6. Baddeley AD, Hitch G. Working memory. Exploring working memory: selected works of Alan Baddeley. Taylor & Francis Group; 2017. pp. 43–79.
  • 7. Sewell DK, Lewandowsky S. Attention and Working Memory Capacity: Insights From Blocking, Highlighting, and Knowledge Restructuring. J Exp Psychology Gen. 2012;141: 444–469. doi: 10.1037/a0026560
  • 8. Craig S, Lewandowsky S. Whichever way you Choose to Categorize, Working Memory Helps you Learn. Q J Exp Psychol. 2011;65: 439–464. doi: 10.1080/17470218.2011.608854
  • 9. Lewandowsky S. Working Memory Capacity and Categorization: Individual Differences and Modeling. J Exp Psychology Learn Mem Cognition. 2011;37: 720–738. doi: 10.1037/a0022639
  • 10. Lewandowsky S, Yang L-X, Newell BR, Kalish ML. Working memory does not dissociate between different perceptual categorization tasks. J Exp Psychology Learn Mem Cognition. 2012;38: 881–904. doi: 10.1037/a0027298
  • 11. Maddox WT, Chandrasekaran B, Smayda K, Yi H-G. Dual systems of speech category learning across the lifespan. Psychol Aging. 2013;28: 1042–56. doi: 10.1037/a0034969
  • 12. McHaney JR, Tessmer R, Roark CL, Chandrasekaran B. Working memory relates to individual differences in speech category learning: Insights from computational modeling and pupillometry. Brain Lang. 2021;222: 105010. doi: 10.1016/j.bandl.2021.105010
  • 13. Lloyd K, Sanborn A, Leslie D, Lewandowsky S. Why Higher Working Memory Capacity May Help You Learn: Sampling, Search, and Degrees of Approximation. Cognitive Sci. 2019;43: e12805. doi: 10.1111/cogs.12805
  • 14. Miyake A, Friedman NP. Individual Differences in Second Language Proficiency: Working Memory as Language Aptitude. In: Healy AF, Bourne LE, editors. Foreign Language Learning: Psycholinguistic Studies on Training and Retention. Mahwah, NJ: Lawrence Erlbaum Associates; 1998. pp. 339–364.
  • 15. Baddeley A, Gathercole S, Papagno C. The phonological loop as a language learning device. Psychol Rev. 1998;105: 158–173. doi: 10.1037/0033-295x.105.1.158
  • 16. Cheung H. Nonword Span as a Unique Predictor of Second-Language Vocabulary Learning. Dev Psychol. 1996;32: 867–873. doi: 10.1037/0012-1649.32.5.867
  • 17. Speciale G, Ellis NC, Bywater T. Phonological sequence learning and short-term store capacity determine second language vocabulary acquisition. Appl Psycholinguist. 2004;25: 293–321. doi: 10.1017/s0142716404001146
  • 18. Perrachione TK, Lee J, Ha LYY, Wong PCM. Learning a novel phonological contrast depends on interactions between individual differences and training paradigm design. J Acoust Soc Am. 2011;130: 461–472. doi: 10.1121/1.3593366
  • 19. Ingvalson EM, Nowicki C, Zong A, Wong PCM. Non-native Speech Learning in Older Adults. Front Psychol. 2017;8: 148. doi: 10.3389/fpsyg.2017.00148
  • 20. Wang Y, Jongman A, Sereno JA. Acoustic and perceptual evaluation of Mandarin tone productions before and after perceptual training. J Acoust Soc Am. 2003;113: 1033–1043. doi: 10.1121/1.1531176
  • 21. Wang Y, Spence MM, Jongman A, Sereno JA. Training American listeners to perceive Mandarin tones. J Acoust Soc Am. 1999;106: 3649–3658. doi: 10.1121/1.428217
  • 22. Wong PCM, Perrachione TK. Learning pitch patterns in lexical identification by native English-speaking adults. Appl Psycholinguist. 2007;28: 565–585. doi: 10.1017/s0142716407070312
  • 23. Xu Y, Gandour JT, Francis AL. Effects of language experience and stimulus complexity on the categorical perception of pitch direction. J Acoust Soc Am. 2006;120: 1063–1074. doi: 10.1121/1.2213572
  • 24. Reetzke R, Xie Z, Llanos F, Chandrasekaran B. Tracing the Trajectory of Sensory Plasticity across Different Stages of Speech Learning in Adulthood. Curr Biol. 2018;28: 1419–1427.e4. doi: 10.1016/j.cub.2018.03.026
  • 25. Roeder JL, Ashby FG. What is automatized during perceptual categorization? Cognition. 2016;154: 22–33. doi: 10.1016/j.cognition.2016.04.005
  • 26. Waldschmidt JG, Ashby FG. Cortical and striatal contributions to automaticity in information-integration categorization. Neuroimage. 2011;56: 1791–1802. doi: 10.1016/j.neuroimage.2011.02.011
  • 27. Bradlow AR, Akahane-Yamada R, Pisoni DB, Tohkura Y. Training Japanese listeners to identify English /r/ and /l/: Long-term retention of learning in perception and production. Percept Psychophys. 1999;61: 977–985. doi: 10.3758/bf03206911
  • 28. Hélie S, Waldschmidt JG, Ashby FG. Automaticity in rule-based and information-integration categorization. Atten Percept Psychophys. 2010;72: 1013–1031. doi: 10.3758/APP.72.4.1013
  • 29. Hélie S, Roeder JL, Ashby FG. Evidence for Cortical Automaticity in Rule-Based Categorization. J Neurosci. 2010;30: 14225–14234. doi: 10.1523/JNEUROSCI.2393-10.2010
  • 30. Ester EF, Sprague TC, Serences JT. Categorical Biases in Human Occipitoparietal Cortex. J Neurosci. 2020;40: 917–931. doi: 10.1523/JNEUROSCI.2700-19.2019
  • 31. Feng G, Li Y, Hsu S-M, Wong PCM, Chou T-L, Chandrasekaran B. Emerging Native-Similar Neural Representations Underlie Non-Native Speech Category Learning Success. Neurobiology Lang. 2021;2: 280–307. doi: 10.1162/nol_a_00035
  • 32. Folstein JR, Palmeri TJ, Gauthier I. Category Learning Increases Discriminability of Relevant Object Dimensions in Visual Cortex. Cereb Cortex. 2013;23: 814–823. doi: 10.1093/cercor/bhs067
  • 33. Hammer R, Sloutsky V, Grill-Spector K. Feature saliency and feedback information interactively impact visual category learning. Front Psychol. 2015;6: 74. doi: 10.3389/fpsyg.2015.00074
  • 34. Maddox WT, Chandrasekaran B. Tests of a dual-system model of speech category learning. Biling Lang Cognition. 2014;17: 709–728. doi: 10.1017/s1366728913000783
  • 35. Rosedahl LA, Eckstein MP, Ashby FG. Retinal-specific category learning. Nat Hum Behav. 2018;2: 500–506. doi: 10.1038/s41562-018-0370-z
  • 36. Ell SW, Ashby FG. The effects of category overlap on information-integration and rule-based category learning. Percept Psychophys. 2006;68: 1013–1026. doi: 10.3758/bf03193362
  • 37. Kaplan AS, Murphy GL. Category learning with minimal prior knowledge. J Exp Psychology Learn Mem Cognition. 2000;26: 829–846. doi: 10.1037//0278-7393.26.4.829
  • 38. Roark CL, Chandrasekaran B. Individual variability in strategies and learning outcomes in auditory category learning. In: Fitch T, Lamm C, Leder H, Tessmar K, editors. Proceedings of the 43rd Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society; 2021. pp. 154–160.
  • 39. Roark CL, Holt LL. Perceptual dimensions influence auditory category learning. Atten Percept Psychophys. 2019;81: 912–926. doi: 10.3758/s13414-019-01688-6
  • 40. DeCaro MS, Thomas RD, Beilock SL. Individual differences in category learning: Sometimes less working memory capacity is better than more. Cognition. 2008;107: 284–294. doi: 10.1016/j.cognition.2007.07.001
  • 41. DeCaro MS, Carlson KD, Thomas RD, Beilock SL. When and how less is more: reply to Tharp and Pickering. Cognition. 2009;111: 397–403. doi: 10.1016/j.cognition.2009.03.001
  • 42. Tharp IJ, Pickering AD. A note on DeCaro, Thomas, and Beilock (2008): Further data demonstrate complexities in the assessment of information-integration category learning. Cognition. 2009;111: 411–415. doi: 10.1016/j.cognition.2008.10.003
  • 43. Diamond A. Executive Functions. Annu Rev Psychol. 2013;64: 135–168. doi: 10.1146/annurev-psych-113011-143750
  • 44. McCabe DP, Roediger HL, McDaniel MA, Balota DA, Hambrick DZ. The Relationship Between Working Memory Capacity and Executive Functioning: Evidence for a Common Executive Attention Construct. Neuropsychology. 2010;24: 222–243. doi: 10.1037/a0017619
  • 45. Lenaert B, Ven V van de, Kaas AL, Vlaeyen JWS. Generalization on the Basis of Prior Experience Is Predicted by Individual Differences in Working Memory. Behav Ther. 2016;47: 130–140. doi: 10.1016/j.beth.2015.10.001
  • 46. Wills AJ, Barrasin TJ, McLaren IPL. Working Memory Capacity and Generalization in Predictive Learning. Proceedings of the Annual Meeting of the Cognitive Science Society. 2011. pp. 3205–3210. doi: 10.1016/j.cogsys.2005.02.002
  • 47. Ratcliff R. A Theory of Memory Retrieval. Psychological Review. 1978;85: 59–108. doi: 10.1037/0033-295x.85.2.59
  • 48. Ratcliff R, Smith PL, Brown SD, McKoon G. Diffusion Decision Model: Current Issues and History. Trends Cogn Sci. 2016;20: 260–281. doi: 10.1016/j.tics.2016.01.007
  • 49. Nosofsky RM. Attention, Similarity, and the Identification-Categorization Relationship. Journal of Experimental Psychology: General. 1986;115: 39–57. doi: 10.1037//0096-3445.115.1.39
  • 50. Nosofsky RM, Palmeri TJ. An Exemplar-Based Random Walk Model of Speeded Classification. Psychol Rev. 1997;104: 266–300. doi: 10.1037/0033-295x.104.2.266
  • 51. Roark CL, Paulon G, Sarkar A, Chandrasekaran B. Comparing perceptual category learning across modalities in the same individuals. Psychon B Rev. 2021;28: 898–909. doi: 10.3758/s13423-021-01878-0
  • 52. Heffernan EM, Adema JD, Mack ML. Identifying the neural dynamics of category decisions with computational model-based functional magnetic resonance imaging. Psychon B Rev. 2021; 1–10. doi: 10.3758/s13423-021-01939-4
  • 53. Paulon G, Llanos F, Chandrasekaran B, Sarkar A. Bayesian Semiparametric Longitudinal Drift-Diffusion Mixed Models for Tone Learning in Adults. J Am Stat Assoc. 2021; 1–14. doi: 10.1080/01621459.2020.1801448
  • 54. Hübner R, Schlösser J. Monetary reward increases attentional effort in the flanker task. Psychon B Rev. 2010;17: 821–826. doi: 10.3758/PBR.17.6.821
  • 55. Schmiedek F, Oberauer K, Wilhelm O, Süß H-M, Wittmann WW. Individual Differences in Components of Reaction Time Distributions and Their Relations to Working Memory and Intelligence. J Exp Psychology Gen. 2007;136: 414–429. doi: 10.1037/0096-3445.136.3.414
  • 56. Ester EF, Ho TC, Brown SD, Serences JT. Variability in visual working memory ability limits the efficiency of perceptual decision making. J Vision. 2014;14: 2. doi: 10.1167/14.4.2
  • 57. Bogacz R, Wagenmakers E-J, Forstmann BU, Nieuwenhuis S. The neural basis of the speed–accuracy tradeoff. Trends Neurosci. 2010;33: 10–16. doi: 10.1016/j.tins.2009.09.002
  • 58. Philiastides MG, Ratcliff R, Sajda P. Neural Representation of Task Difficulty and Decision Making during Perceptual Categorization: A Timing Diagram. J Neurosci. 2006;26: 8965–8975. doi: 10.1523/JNEUROSCI.1655-06.2006
  • 59. Anwyl-Irvine A, Massonnié J, Flitton A, Kirkham N, Evershed J. Gorilla in our Midst: An online behavioral experiment builder. bioRxiv. 2020; 438242. doi: 10.3758/s13428-019-01237-x
  • 60. Feng G, Gan Z, Wang S, Wong PCM, Chandrasekaran B. Task-General and Acoustic-Invariant Neural Representation of Speech Categories in the Human Brain. Cereb Cortex. 2017;28: 3241–3254. doi: 10.1093/cercor/bhx195
  • 61. Li Y, Tang C, Lu J, Wu J, Chang EF. Human cortical encoding of pitch in tonal and non-tonal languages. Nat Commun. 2021;12: 1161. doi: 10.1038/s41467-021-21430-x
  • 62. Turner ML, Engle RW. Is working memory capacity task dependent? J Mem Lang. 1989;28: 127–154. doi: 10.1016/0749-596x(89)90040-5
  • 63. Đokić R, Koso-Drljević M, Đapo N. Working memory span tasks: Group administration and omitting accuracy criterion do not change metric characteristics. Plos One. 2018;13: e0205169. doi: 10.1371/journal.pone.0205169
  • 64. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria; 2022. Available: https://www.R-project.org/
  • 65. Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, et al. Welcome to the Tidyverse. J Open Source Softw. 2019;4: 1686. doi: 10.21105/joss.01686
  • 66. Wickham H. ggplot2: Elegant Graphics for Data Analysis. 2016.
  • 67. Arnold JB. ggthemes: Extra Themes, Scales and Geoms for “ggplot2.” 2018. Available: https://CRAN.R-project.org/package=ggthemes
  • 68. Paulon G, Sarkar A. lddmm: Longitudinal Drift-Diffusion Mixed Models (LDDMM). 2021. Available: https://CRAN.R-project.org/package=lddmm
  • 69. Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. J Stat Softw. 2015;67. doi: 10.18637/jss.v067.i01
  • 70. Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software. 2017;82: 1–26. doi: 10.18637/jss.v082.i13
  • 71. Kassambara A. rstatix: Pipe-Friendly Framework for Basic Statistical Tests. 2021. Available: https://CRAN.R-project.org/package=rstatix
  • 72. Balsdon T, Wyart V, Mamassian P. Confidence controls perceptual evidence accumulation. Nat Commun. 2020;11: 1753. doi: 10.1038/s41467-020-15561-w
  • 73. Hawkins GE, Brown SD, Steyvers M, Wagenmakers E-J. An optimal adjustment procedure to minimize experiment time in decisions with multiple alternatives. Psychon B Rev. 2012;19: 339–348. doi: 10.3758/s13423-012-0216-z
  • 74. Diamond R, Carey S. Why Faces Are and Are Not Special: An Effect of Expertise. J Exp Psychology Gen. 1986;115: 107–117. doi: 10.1037//0096-3445.115.2.107
  • 75. Feng G, Gan Z, Yi HG, Ell SW, Roark CL, Wang S, et al. Neural dynamics underlying the acquisition of distinct auditory category structures. Neuroimage. 2021; 118565. doi: 10.1016/j.neuroimage.2021.118565
  • 76. Souza ACS de, Yehia HC, Sato M, Callan D. Brain activity underlying auditory perceptual learning during short period training: simultaneous fMRI and EEG recording. Bmc Neurosci. 2013;14: 8. doi: 10.1186/1471-2202-14-8
  • 77. Little DM, Thulborn KR. Correlations of cortical activation and behavior during the application of newly learned categories. Cognitive Brain Res. 2005;25: 33–47. doi: 10.1016/j.cogbrainres.2005.04.015
  • 78. DeGutis J, D’Esposito M. Network Changes in the Transition from Initial Learning to Well-Practiced Visual Categorization. Front Hum Neurosci. 2009;3: 44. doi: 10.3389/neuro.09.044.2009
  • 79. Sarter M, Gehring WJ, Kozak R. More attention must be paid: The neurobiology of attentional effort. Brain Res Rev. 2006;51: 145–160. doi: 10.1016/j.brainresrev.2005.11.002
  • 80. Gold JI, Shadlen MN. The Neural Basis of Decision Making. Annu Rev Neurosci. 2007;30: 535–574. doi: 10.1146/annurev.neuro.29.051605.113038
  • 81. Mulder MJ, Maanen L van, Forstmann BU. Perceptual decision neurosciences–A model-based review. Neuroscience. 2014;277: 872–884. doi: 10.1016/j.neuroscience.2014.07.031
  • 82. O’Connell RG, Dockree PM, Kelly SP. A supramodal accumulation-to-bound signal that determines perceptual decisions in humans. Nat Neurosci. 2012;15: 1729–1735. doi: 10.1038/nn.3248
  • 83. van-Maanen L, Portoles O, Borst JP. The Discovery and Interpretation of Evidence Accumulation Stages. Comput Brain Behav. 2021;4: 395–415. doi: 10.1007/s42113-021-00105-2
  • 84. van-Maanen L, Forstmann BU, Keuken MC, Wagenmakers E-J, Heathcote A. The impact of MRI scanner environment on perceptual decision-making. Behav Res Methods. 2016;48: 184–200. doi: 10.3758/s13428-015-0563-6
  • 85. Conway ARA, Kane MJ, et al. Working memory span tasks: A methodological review and user’s guide. Psychonomic Bulletin & Review. 2005;12: 769–786. doi: 10.3758/bf03196772
  • 86. Unsworth N, Heitz RP, Schrock JC, Engle RW. An automated version of the operation span task. Behavior Research Methods. 2005;37: 498–505. doi: 10.3758/bf03192720
  • 87. Unsworth N, Engle RW. The Nature of Individual Differences in Working Memory Capacity: Active Maintenance in Primary Memory and Controlled Search From Secondary Memory. Psychol Rev. 2007;114: 104–132. doi: 10.1037/0033-295X.114.1.104
  • 88. Chandrasekaran B, Yi H-G, Maddox WT. Dual-learning systems during speech category learning. Psychon B Rev. 2014;21: 488–95. doi: 10.3758/s13423-013-0501-5

Decision Letter 0

Alessandra S Souza

2 Mar 2023

PONE-D-23-00404

Individual differences in working memory impact the trajectory of non-native speech category learning

PLOS ONE

Dear Dr. Roark,

Thank you for submitting your manuscript to PLOS ONE. I have sent your MS for evaluation by three experts on category learning, diffusion modeling, and working memory, and I have read your manuscript as well. First of all, I would like to thank the reviewers for their very insightful and constructive feedback, which you can find appended below. As you will see from their comments, they all appreciated your manuscript, the rigor of the methods, and the general approach. I agree with their assessment. The most substantial comment is related to the choices in implementing the drift-diffusion modeling, as well as the use of a linear regression with categorical predictors. Reviewer 1 would also appreciate if you could try to make more clear the main take home message of the manuscript communicating more clearly what is learned from the current analyses at the end. I believe that addressing these points will strengthen the MS and increase the impact of your work. Therefore I am inviting a Major Revision (which should also give you more time to address the raised points). Please consider all points raised by the reviewers one-by-one. At this stage, I do not have any further comments to add.

Just as final remark: thank you for the very comprehensive sharing of the data, materials and analysis code on the OSF. This is the correct spirit to allow for the complete reproducibility of your results.

Please submit your revised manuscript by Apr 16 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Alessandra S. Souza, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This research was supported by the National Institute on Deafness and Other Communication Disorders [R01DC013315A1 to B.C., F32DC018979 to C.L.R., and T32DC011499 to K. Kandler and B. Yates (trainee: J.R.M.)] and the National Science Foundation [NSF-1953712 to B.C. & A.S.]. All data, stimulus materials, and analysis code are publicly available at the Open Science Framework and can be accessed at https://doi.org/10.17605/OSF.IO/WDPYU. Data from the first session appeared in McHaney et al. (2021). Data from the second and third sessions have not appeared previously. Corresponding authors: Casey L. Roark, casey.l.roark@gmail.com and Bharath Chandrasekaran, b.chandra@pitt.edu”

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This research was supported by the National Institute on Deafness and Other Communication Disorders [R01DC013315A1 to B.C., F32DC018979 to C.L.R., and T32DC011499 to K. Kandler and B. Yates (trainee: J.R.M.)] and the National Science Foundation [NSF-1953712 to B.C. & A.S.]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. We noted in your submission details that a portion of your manuscript may have been presented or published elsewhere. [DETAILS AS NEEDED] Please clarify whether this [conference proceeding or publication] was peer-reviewed and formally published. If this work was previously peer-reviewed and published, in the cover letter please provide the reason that this work does not constitute dual publication and should be included in the current manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

Reviewer #3: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: No

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Roark et al. present the results of research aimed at assessing the role of working memory in the learning and retention of categorical knowledge about non-native language stimuli. Participants completed a task online in which they learned Mandarin tone categories across three sessions, separated by months. The authors assessed the relationships between various aspects of task performance and participants' performance on an operation span task. In particular, they focused on relationships between OSPAN task performance and initial and later learning, maintenance of category knowledge across sessions, generalisation between trained and non-trained stimuli, and drift rate and threshold parameters from a multi-alternative diffusion model used to fit the data. Results showed positive relationships between OSPAN scores and categorisation accuracy in later learning (e.g., later blocks of session 1 and most blocks of subsequent sessions); however, when the sample was split into learners and non-learners, this relationship was not evident in either group. Maintenance of category knowledge across sessions and across the final blocks within a session was also unrelated to OSPAN scores, as was generalisation from training to test stimuli. Diffusion modelling showed positive relationships between OSPAN scores and both drift rate and threshold in later blocks of session 1, and in subsequent sessions, for learners. The authors conclude that greater WM capacity may aid category learning in language contexts, particularly in the early stages of acquisition; and that higher WM capacity allows for greater caution in decision-making owing to more rapid evidence accumulation.

I should mention at the outset that whereas my knowledge of the WM literature and evidence accumulation models of decision-making is fairly good, I am far from an expert when it comes to research into language learning. My comments thus largely focus on these former aspects of the work. I should also note that this makes it difficult for me to judge the theoretical value of the contribution this research makes to the language learning area.

I have four major comments concerning the paper (not presented in any particular order), and a few more minor ones.

Major comments:

1. My understanding is that theories of performance on complex span tasks (such as the OSPAN task) often assume a contribution from retrieval from LTM/episodic memory (e.g., Unsworth & Engle, 2007). If so, then I think a relevant question is, to what extent do the relationships between OSPAN performance and category learning shown here reflect the influence of LTM retrieval in both tasks? In other words, are the relationships evident in this work a result of the fact that both tasks to some extent tap how well people can retrieve stuff from LTM? Or is there something more to it than that? The meaning of the relationship depends on the answer to this question, and therefore I think it's something that would be worth addressing.

2. I was fairly confused by some aspects of the way the diffusion modelling was used in the paper. I had not come across this particular model before, but---if I understand it correctly---the basic idea seems straightforward: Assume Wald distributions for each response with densities that are multiplied by the product of the survival functions for the other responses. Nonetheless, there were two points that confused me:

a) At one point (p. 24) you mention focusing on "correct trials only". I followed the reference used here (Roark et al., 2021, PB&R) to try to make sense of this, and unfortunately became even more confused. The source of my confusion is this: I don't understand how you can isolate a drift rate and threshold for correct trials only when all the responses must surely influence the estimated parameter values. The only thing I could think of was that you were only fitting the data from trials with correct responses; but this would lead to heavily biased parameter estimates (with the bias stronger when the proportion of errors increases). As such, I think you either need to clarify what it was you actually did, so that other readers don't end up as confused as I am; and/or re-do the modelling to fit all the data, rather than just those from correct trials. (On about my fifth read through the section, I think what you were actually doing here was fitting the model to all the data, but just reporting the parameters associated with the accumulator corresponding to the correct response. Right? This is more sensible than what I initially thought you were doing, but if this is correct, you really ought to clarify it so other readers don't go down the same garden path I did.)

b) It seems to me that you use the estimated parameters for individual participants, obtained from the diffusion modelling, as the outcome variable in a regular linear regression with OSPAN score as the predictor. As I understand it though, the hierarchical structure of the diffusion model means these participant-level parameters are not independent, and therefore are not appropriate for use in the subsequent regression. Perhaps a better tack would be to include OSPAN scores as a predictor in the diffusion model itself.

3. Related to the previous point: If what you were doing with the diffusion modelling was indeed fitting all the data but just reporting parameters for the accumulator associated with the correct responses, I'd recommend also reporting the parameters for the other accumulators, at least in supplemental materials. In particular, when it comes to the effects of WM on thresholds (e.g., increasing with practice for higher-WM individuals), it's important to know whether this is unique to the accumulator for the correct response or consistent across all accumulators. If the effect is only evident on the accumulator for the correct response, this may indicate the model is not providing a good account of the data (since it is psychologically implausible that a participant could selectively adjust the threshold on the accumulator for the correct response before they know it to be the correct response). The solution to this would be to fix the accumulators to have the same thresholds.

4. I found it a little difficult to get a clear picture of the primary message you're intending the reader to take from the research. As I mentioned above, I am not very familiar with the language learning literature, so this may be an artefact of that unfamiliarity (e.g., findings with an importance that would be obvious to someone actively working in the area were not obvious to me). Even so, if you want the work to be more accessible to a general audience, it might be worth thinking about how you can better emphasise this message.

Minor comments:

p. 3, line 53: I think "temporal storage" should be "temporary storage".

pp. 3--4, lines 67--68: Really pedantic point here, but I think it's more precise to characterise performance as improving rather than increasing.

p. 4: In the discussion of the role of WM capacity in language learning (or elsewhere), I wonder whether it might be worthwhile bringing in recent work by Smalle (e.g., Smalle et al., 2021, 2022; see references below) suggesting that interfering with cognitive control mechanisms (an important part of WM) can improve statistical learning (an important part of language development)?

p. 8, lines 164--165: The phrasing is a bit awkward here.

p. 12, line 247: "were produced" should be "was produced".

p. 14, lines 290--291: There are a few differences between the study you cite here in support of not using an accuracy filter, and your own research (e.g., in-lab student vs. online sample) that make me wonder whether including an accuracy threshold would be equally unnecessary in your case. I'd be interested to at least know how well participants performed in the arithmetic task (e.g., was the proportion of participants who did poorly similar to that in Ðokic et al.'s sample?). Actually I see from looking at the supplemental materials that you do provide arithmetic task performance data; perhaps good to mention this here.

p. 14, line 296: Did you consider allowing δ_s (i.e., the offset parameter) to vary across participants as well (and assess its relationship with OSPAN scores)? It seems plausible to me that part of learning stimuli in a novel language is developing the ability to more rapidly encode them, so that the decision (/categorisation) process receives good information more rapidly.

p. 15, line 307: What priors did you assign to the parameters?

p. 16, lines 338--346: Would a logistic model (e.g., via glmer) not be a better choice than a linear one here, given accuracy is restricted between 0 and 1 (/100)?

p. 17, lines 354--360: To enable the reader to better grasp the meaning of these effects, it might be useful to provide some information about the distribution of WM scores (e.g., to emphasise that a one-unit difference in WM scores is a fairly small increment); otherwise they may seem trivial.

p. 24, lines 490--491: When you say here that you "focus on the results for correct trials only", what exactly do you mean? Surely the model gives you parameter estimates that combine information from correct and error responses, no? In that case, how can you isolate the correct trials?

pp. 24--25, lines 509 on: Is there any reason for not just including WM capacity as a predictor in the diffusion modelling? Given the hierarchical nature of the diffusion model fitting, your individual participant parameter estimates are not independent, are they? Therefore the independence assumption of the regression analysis is violated.

p. 29, lines 620--625: The first thing that springs to my mind when I read this is Logan's instance theory of automatisation (e.g., Logan, 1988). If there is an initial stage of learning where a task is completed algorithmically, followed by expert responses resulting from LTM retrieval once enough instances are accumulated in memory, we would probably expect a fair bit of WM involvement in the first, but perhaps not so much in the second.

References:

Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95(4), 492–527. https://doi.org/10.1037/0033-295X.95.4.492

Smalle, E. H. M., Muylle, M., Duyck, W., & Szmalec, A. (2021). Less is more: Depleting cognitive resources enhances language learning abilities in adults. Journal of Experimental Psychology: General, 150(12), 2423–2434. https://doi.org/10.1037/xge0001058

Smalle, E. H. M., Daikoku, T., Szmalec, A., Duyck, W., & Onen, R. M. (2022). Unlocking adults’ implicit statistical learning by cognitive depletion. Proceedings of the National Academy of Sciences of the United States of America, 119(2), 1–9. https://doi.org/10.1073/pnas.2026011119

Unsworth, N., & Engle, R. W. (2007). The nature of individual differences in working memory capacity: Active maintenance in primary memory and controlled search from secondary memory. Psychological Review, 114(1), 104–132. https://doi.org/10.1037/0033-295X.114.1.104

Reviewer #2: The authors examined the influence of working memory on category learning performance. Unlike the other studies, which either only studied the effect of WMC on immediate category learning or the effect after the categories were fully learned, the authors examined the continuum between the categories that were first formed to the time when the categories were fully acquired. Thus, the authors bridged the gap between the past studies and provided new insight into category learning. 

Overall, the experiment design was straightforward and logical. The flaws of doing an online study were carefully sidestepped and discussed. A large portion of chance-level participants might be addressed with a reward incentive. However, as a sufficient number of participants remained after excluding non-learners, running a new experiment might not be worth it. While the data are mostly sufficient to support the claims, the analysis leaves something to be desired.

There are three major concerns with the presented statistics:

1. The blocks were treated as categories instead of time series. This stops the author from looking at the IV change across blocks and can only look at the difference between specific blocks. Most importantly, supporting the "stable" performance claim is difficult without looking at the performance across the blocks. 

2. The statistic comes with vastly inflated alpha. The inflation might be the consequence of 1), as the interaction between the WMC and each block would have to be tested.

3. The authors claim the effect of WM changes over sessions (P.29, 619). However, the only statistic associated with the claim was the gradually (but not wholly) increasing p-value which happened to become insignificant at the later session (again, but not all blocks). An actual test of WMC effect across sessions would be a direct test of this claim. 
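The alpha-inflation concern in point 2 can be made concrete with a short calculation. The sketch below assumes the tests are independent, which is a simplification because block-wise tests on the same participants are correlated, and the count of six tests is a hypothetical value rather than the number of blocks in the study. The function name `familywise_error_rate` is introduced here for illustration only.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Hypothetical example: one WM-by-block test per block, six blocks.
single = familywise_error_rate(0.05, 1)
several = familywise_error_rate(0.05, 6)
print(f"one test: {single:.3f}, six tests: {several:.3f}")
```

Treating block as a continuous predictor replaces the per-block tests with a single WM-by-block slope, so only one test of the interaction is needed per session.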

Drift diffusion model

The construction of the drift diffusion model also raised a question, why did the decision boundary depend on both the stimulus and the response? The authors used separate decision boundaries for four possible stimuli and four possible responses, i.e., 16 decision boundaries. Why? I can understand the difference in drift rate, as different stimuli would give different evidence for all possible responses. However, why would the decision boundaries change depending on the stimuli?

As the authors pointed out, the analysis only focused on the boundary that leads to the correct response. Thus, implementing multiple boundaries based on the stimuli seems unnecessary, and further explanation might be needed. If the extra boundaries were needed for mathematical reasons, the authors need to communicate this.

Some minor points:

P.7, Table 1: I think it might be dangerous to call the drift rate the "efficiency" of evidence accumulation. A stronger signal would also lead to a larger drift rate and is different from the "efficiency".

P.7, Line 146: "A subset of these participants (107/195) return for the current study." This line doesn't fit into the previous sentences; maybe a misplacement?

P.10, Line 195: "The rate of evidence accumulation reflects the quality of information extracted from the stimulus, with faster rates reflecting a more efficient evidence accumulation process." This could also reflect increased efficiency in retrieving and comparing exemplars from memory. It would be difficult to identify whether the increase in drift rate was due to better information extraction or other categorization processes.

P.10, 200: Why can't participants with higher WM have the same decision threshold and respond faster without sacrificing accuracy?

Reviewer #3: The authors investigate the relationship between working memory (WM; as measured by Operation Span) and learning Mandarin tone categories that were novel to the participants using a longitudinal, multi-session design. The authors examined both early learning as well as later retention/learning with follow-up sessions scheduled one month after the first and two months after the second. The authors also looked at whether there were differences for participants who successfully learned the tone categories vs. those who did not. For the latter, it appears that non-learners are generally disengaged from the task and may be seeking to minimize time spent in the experiment.

The key findings were that WM was positively associated with early learning, replicating several previous findings. Results for knowledge retention across sessions, however, showed no link, suggesting that WM is mainly involved in early, rather than later stages of learning. This is possibly due to the requirement of having to consider many different hypotheses early on in learning vs. only having to refine a single hypothesis (or a smaller set of viable ones) after performance has stabilized.

The authors also employ diffusion modeling to examine links between quality of evidence (drift rates, or accumulation rates) and response caution (decision thresholds). For learners, there was evidence that higher WM individuals showed not only more efficient evidence accumulation, but also greater response caution.

I found the manuscript to be very well written and exceptionally clear. The study was well-motivated, and the conclusions generally followed from the results of the analyses. My overall disposition is quite positive. I do have a few comments, mostly minor, that I would like to be addressed in a revision. I detail these below.

Major Comments

1 - Feedback was only provided as correct/error, but with a 4-category structure, this is not especially informative. Better to provide category feedback (i.e., information about response accuracy along with what the correct category was). This should be acknowledged as a limitation, as it may have contributed to poor performance leading to task disengagement. Because error feedback is highly ambiguous, it is that much more difficult to learn effectively.

2 - The racing diffusion model used here is set up in a way that differs from typical applications. The current model is configured in a way that is extremely flexible, raising questions about the viability of simpler, potentially more theoretically plausible, configurations. More seriously, the model configuration raises falsifiability concerns about the authors’ application of the model (see Jones & Dzhafarov, 2014).

My main concern is with the way decision thresholds are set. Based on my reading of the description on page 14, it seems that decision thresholds for the four categories are conditioned on the category of the stimulus (i.e., if a Category 1 stimulus is presented, the thresholds for the categories might be set to A, B, C, and D, but if a Category 2 stimulus is presented, they might be set at E, F, G, and H). If this is correct (apologies if I have misunderstood), the assumption is psychologically implausible, as it allows the system to adaptively configure itself in response to the stimulus category, but prior to identifying the stimulus category. A plausible configuration could still allow thresholds to differ across categories, but independent of the current stimulus.

Jones, M., & Dzhafarov, E. N. (2014). Unfalsifiability and mutual translatability of major modeling schemes for choice reaction time. Psychological Review, 121(1), 1–32. https://doi.org/10.1037/a0034190

3 - Related to this is whether a simpler model might provide a satisfactory account of the data. Given the unfamiliarity of the categories to the learners, is there an a priori reason to assume differences in decision thresholds? How well does a model with a single threshold common across all categories (and all stimuli) perform? The threshold(s) could still vary with learning, but testing a common threshold model against one that allows category-specific thresholds to vary would be prudent.

Minor Comments

4 – In Table 1, “enhanced learning and motivation” is listed as the hypothesized role of WM in later learning. Should this be “maintenance of knowledge”?

5 - Line 53 – Should “temporal” be “temporary”?

6 - Line 129 – The description of the two analyses considered could be explained more clearly in the parenthetical comment (i.e., learners only vs. all participants, including both learners and non-learners).

7 - Is the “non-decision time” offset, δs, allowed to vary across participants (and potentially learning)? If not, is there a reason why it is fixed across individuals/blocks?

8 - There have recently been a number of models that combine learning assumptions with diffusion decision mechanisms (e.g., Fontanesi et al., 2019, Miletic et al., 2021; Pedersen et al., 2017; Sewell et al., 2019). Do the authors think that their conclusions would be any different if one of these other models were used instead?

Fontanesi, L., Gluth, S., Spektor, M.S. et al. A reinforcement learning diffusion decision model for value-based decisions. Psychon Bull Rev 26, 1099–1121 (2019). https://doi.org/10.3758/s13423-018-1554-2

Steven Miletić, Russell J Boag, Anne C Trutti, Niek Stevenson, Birte U Forstmann, Andrew Heathcote (2021) A new model of decision processing in instrumental learning tasks eLife 10:e63055. https://doi.org/10.7554/eLife.63055

Pedersen, M.L., Frank, M.J. & Biele, G. The drift diffusion model as the choice rule in reinforcement learning. Psychon Bull Rev 24, 1234–1251 (2017). https://doi.org/10.3758/s13423-016-1199-y

Sewell, D.K., Jach, H.K., Boag, R.J. et al. Combining error-driven models of associative learning with evidence accumulation models of decision-making. Psychon Bull Rev 26, 868–893 (2019). https://doi.org/10.3758/s13423-019-01570-4

9 - Were fast responses filtered out of the data? Responses faster than 150-200 ms are usually regarded as anticipatory rather than stimulus-driven.

10 - Page 26 – The interpretation of decision thresholds in non-learners should be phrased more cautiously. Since most of the WM correlations are non-significant, there are concerns about false positives. Given that only one of the correlations outside of Session 1 was significant, I think the authors should be clear that the evidence that non-learners with higher WM have lower decision thresholds is relatively weak/inconsistent. That said, the more systematic trend in Session 1 is perhaps indicative of general task disengagement (or optimizing performance to minimize time in the experiment). Why this would only manifest in Session 1 is unclear, however, hence my general view that these results should be interpreted with greater caution.

The idea of non-learners trying to minimize time in the experiment is related to previous discussion along these lines by Hawkins, Brown, Steyvers, & Wagenmakers (2012).

Hawkins, G.E., Brown, S.D., Steyvers, M. et al. An optimal adjustment procedure to minimize experiment time in decisions with multiple alternatives. Psychon Bull Rev 19, 339–348 (2012). https://doi.org/10.3758/s13423-012-0216-z

11 - In the abstract, it is mentioned that WM is positively associated with generalization, but this doesn’t seem to be the case given the analyses reported on pp. 22-23.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Hsuan-Yu Lin

Reviewer #3: No

**********


PLoS One. 2024 Jun 10;19(6):e0297917. doi: 10.1371/journal.pone.0297917.r002

Author response to Decision Letter 0


29 Apr 2023

We thank the editor and three reviewers for consideration of our manuscript. Below, we briefly summarize the major changes we have made to the manuscript in response to reviewer comments. We then respond to each point raised by the reviewers with our responses beneath. We believe these changes significantly strengthen the manuscript.

Summary:

Drift diffusion model details and analyses. To address comments from all three reviewers (R1: C2, 3, 11, 12, 15, 16, R2: C4, R3: C2, 3, 7, 8, 9), we have provided more information about the drift diffusion model parameters and analyses. We have also provided justification of our choice to allow boundaries to differ across stimulus and response categories and included a direct comparison of this full model with a sub-model that only allowed boundaries to differ across response categories. We demonstrated that the full model is psychologically plausible and provides a better statistical fit to the data than the sub-model. We believe these revisions substantially improve the manuscript and better contextualize the results.

Block as a continuous/categorical variable. Reviewer 2 (C1-3) raised valid concerns about our application of the regression models with block as a categorical variable. We agree with these concerns and have now re-run the models using block as a continuous variable within sessions. We also directly compare performance across sessions. This enables us to better compare across blocks and sessions without an overly inflated alpha.

Takeaway messages. Reviewer 1 (C4) suggested that we edit the manuscript to make the takeaway messages clearer for readers. We have now drawn readers’ attention to the main takeaways throughout the manuscript and believe that this strengthens the manuscript as a whole.

Reviewer comments and responses

Reviewer #1

Roark et al. present the results of research aimed at assessing the role of working memory in the learning and retention of categorical knowledge about non-native language stimuli. Participants completed a task online in which they learned Mandarin tone categories across three sessions, separated by months. The authors assessed the relationships between various aspects of task performance and participants' performance on an operation span task. In particular, they focused on relationships between OSPAN task performance and initial and later learning, maintenance of category knowledge across sessions, generalisation between trained and non-trained stimuli, and drift rate and threshold parameters from a multi-alternative diffusion model used to fit the data. Results showed positive relationships between OSPAN scores and categorisation accuracy in later learning (e.g., later blocks of session 1 and most blocks of subsequent sessions); however, when the sample was split into learners and non-learners, this relationship was not evident in either group. Maintenance of category knowledge across sessions and across the final blocks within a session was also unrelated to OSPAN scores, as was generalisation from training to test stimuli. Diffusion modelling showed positive relationships between OSPAN scores and both drift rate and threshold in later blocks of session 1, and in subsequent sessions, for learners. The authors conclude that greater WM capacity may aid category learning in language contexts, particularly in the early stages of acquisition; and that higher WM capacity allows for greater caution in decision-making owing to more rapid evidence accumulation.

I should mention at the outset that whereas my knowledge of the WM literature and evidence accumulation models of decision-making is fairly good, I am far from an expert when it comes to research into language learning. My comments thus largely focus on these former aspects of the work. I should also note that this makes it difficult for me to judge the theoretical value of the contribution this research makes to the language learning area.

I have four major comments concerning the paper (not presented in any particular order), and a few more minor ones.

Major comments:

R1.C1: 1. My understanding is that theories of performance on complex span tasks (such as the OSPAN task) often assume a contribution from retrieval from LTM/episodic memory (e.g., Unsworth & Engle, 2007). If so, then I think a relevant question is, to what extent do the relationships between OSPAN performance and category learning shown here reflect the influence of LTM retrieval in both tasks? In other words, are the relationships evident in this work a result of the fact that both tasks to some extent tap how well people can retrieve stuff from LTM? Or is there something more to it than that? The meaning of the relationship depends on the answer to this question, and therefore I think it's something that would be worth addressing.

Thank you for raising this point. We now directly address the potential contribution of LTM/episodic memory to OSPAN task performance when interpreting our results in the Discussion on pages 37-38 (lines 796-806): “Further, because of the nature of the complex span task we used to assess WM capacity, it is possible that performance was influenced by some combination of WM and long-term memory [88]. As a result, the observed relationship between WM score and speech category learning performance may reflect the ability to hold onto and manipulate information in WM as well as retrieve exemplars or rules from long-term memory. However, it is important to note that measures that should theoretically be related to long-term memory or activation of exemplars stored in memory (e.g., maintenance, generalization) were not significantly related to WM score. Future studies should collect multiple measures of WM including visuospatial and auditory WM as well as measures of long-term memory to better understand how speech category learning relies on WM and long-term memory abilities.”

R1.C2: 2. I was fairly confused by some aspects of the way the diffusion modelling was used in the paper. I had not come across this particular model before, but---if I understand it correctly---the basic idea seems straightforward: Assume Wald distributions for each response with densities that are multiplied by the product of the survival functions for the other responses. Nonetheless, there were two points that confused me:

a) At one point (p. 24) you mention focusing on "correct trials only". I followed the reference used here (Roark et al., 2021, PB&R) to try to make sense of this, and unfortunately became even more confused. The source of my confusion is this: I don't understand how you can isolate a drift rate and threshold for correct trials only when all the responses must surely influence the estimated parameter values. The only thing I could think of was that you were only fitting the data from trials with correct responses; but this would lead to heavily biased parameter estimates (with the bias stronger when the proportion of errors increases). As such, I think you either need to clarify what it was you actually did, so that other readers don't end up as confused as I am; and/or re-do the modelling to fit all the data, rather than just those from correct trials. (On about my fifth read through the section, I think what you were actually doing here was fitting the model to all the data, but just reporting the parameters associated with the accumulator corresponding to the correct response. Right? This is more sensible than what I initially thought you were doing, but if this is correct, you really ought to clarify it so other readers don't go down the same garden path I did.)

We indeed fit the model to data from all trials, comprising both correct and incorrect responses, to estimate the parameters. However, because gradual improvement in making correct decisions characterizes learning, our discussion of the results focused on the parameters associated with successful identification of the input stimulus. We have edited the Methods section (page 16, lines 340-344) to explain these points more clearly: “The remaining data, comprising both correct and incorrect trials, were used to estimate the parameters. Since gradual improvements in making correct decisions characterize learning, in our discussions below, we emphasize heavily on inferring the parameters associated with successful identification of the stimulus, that is, µ_(d,s) and b_(d,s) for correct responses with s=d.”
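To illustrate why every trial informs every accumulator's parameters, the following is a minimal sketch of the race likelihood as the reviewer describes it: a Wald density for the chosen response multiplied by the survival functions of the other responses. It is not the authors' implementation. It assumes unit diffusion variance, a single fixed offset, and made-up drift and threshold values; all function names are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def wald_pdf(t, drift, bound):
    """First-passage density of a unit-variance diffusion to `bound`."""
    if t <= 0.0:
        return 0.0
    return (bound / math.sqrt(2.0 * math.pi * t ** 3)
            * math.exp(-((bound - drift * t) ** 2) / (2.0 * t)))


def wald_cdf(t, drift, bound):
    """Probability that the accumulator has finished by time t.
    Note: exp(2 * drift * bound) can overflow for very large values."""
    if t <= 0.0:
        return 0.0
    root = math.sqrt(t)
    return (norm_cdf((drift * t - bound) / root)
            + math.exp(2.0 * drift * bound)
            * norm_cdf(-(drift * t + bound) / root))


def race_density(t, response, drifts, bounds, offset=0.0):
    """Density of choosing `response` at time t: its Wald density times
    the survival functions of all competing accumulators."""
    tau = t - offset  # remove non-decision time
    if tau <= 0.0:
        return 0.0
    dens = wald_pdf(tau, drifts[response], bounds[response])
    for k in range(len(drifts)):
        if k != response:
            dens *= max(0.0, 1.0 - wald_cdf(tau, drifts[k], bounds[k]))
    return dens


def response_probability(response, drifts, bounds, offset=0.0,
                         t_max=40.0, step=0.002):
    """Midpoint-rule integral of the race density over decision time."""
    n_steps = int(t_max / step)
    return step * sum(
        race_density(offset + (i + 0.5) * step, response,
                     drifts, bounds, offset)
        for i in range(n_steps))


# Illustrative values only: accumulator 0 matches the presented stimulus.
drifts = [2.0, 1.0, 1.0, 1.0]
bounds = [1.5, 1.5, 1.5, 1.5]
print(race_density(0.9, 0, drifts, bounds, offset=0.2))
```

Because an error trial contributes the survival function of the correct accumulator to the likelihood, the drift rate and threshold reported for correct responses are still estimated from all trials.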

b) It seems to me that you use the estimated parameters for individual participants, obtained from the diffusion modelling, as the outcome variable in a regular linear regression with OSPAN score as the predictor. As I understand it though, the hierarchical structure of the diffusion model means these participant-level parameters are not independent, and therefore are not appropriate for use in the subsequent regression. Perhaps a better tack would be to include OSPAN scores as a predictor in the diffusion model itself.

We used the estimated parameters for individual participants from the diffusion modeling as the outcome variable. We agree that a single-stage estimation method with OSPAN scores as a predictor in the diffusion model would be statistically more appropriate. However, although the multi-category, multi-subject, time-varying longitudinal drift-diffusion mixed model we have used here is very sophisticated, it is not straightforward to incorporate external continuous covariates into it: we would need to carefully design such models, develop the associated computational machinery, rigorously test them, etc. This is beyond the scope of the current article, but we do plan to pursue this statistical methodology problem as the topic of a separate project. We also note that multi-stage methods are quite widely used in the scientific literature when single-stage methods are not available. Based on our experience (in other contexts), multi-stage models, especially when they involve many latent variables (as with latent diffusion processes in drift-diffusion models), often produce more efficient and numerically stable inference. Finally, for this particular problem, we do not think a single-stage implementation will change the scientific conclusions we arrived at using our two-stage procedure.
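The second stage of such a two-stage procedure can be sketched as an ordinary least-squares fit of participant-level parameter estimates on OSPAN scores. The sketch below is illustrative only: the numbers are synthetic, the function name is hypothetical, and the fit treats the stage-one estimates as if they were directly observed, which is the simplification the reviewer raises.

```python
def ols_slope_intercept(x, y):
    """Simple linear regression of y on x by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic stage-one output: one drift-rate estimate per participant
# (for example, a posterior mean), paired with that participant's score.
ospan_scores = [10.0, 20.0, 30.0, 40.0, 50.0]
drift_estimates = [1.0, 1.2, 1.4, 1.6, 1.8]

slope, intercept = ols_slope_intercept(ospan_scores, drift_estimates)
print(f"slope per OSPAN unit: {slope:.3f}, intercept: {intercept:.3f}")
```

A single-stage alternative would place the OSPAN score inside the hierarchical model so that uncertainty in each participant's parameters carries through to the estimated slope.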

R1.C3: 3. Related to the previous point: If what you were doing with the diffusion modelling was indeed fitting all the data but just reporting parameters for the accumulator associated with the correct responses, I'd recommend also reporting the parameters for the other accumulators, at least in supplemental materials. In particular, when it comes to the effects of WM on thresholds (e.g., increasing with practice for higher-WM individuals), it's important to know whether this is unique to the accumulator for the correct response or consistent across all accumulators. If the effect is only evident on the accumulator for the correct response, this may indicate the model is not providing a good account of the data (since it is psychologically implausible that a participant could selectively adjust the threshold on the accumulator for the correct response before they know it to be the correct response). The solution to this would be to fix the accumulators to have the same thresholds.

We now report the parameters for all accumulators in the Supporting Information – specifically, we present the relationship between OSPAN score and evidence accumulation rate/decision threshold for the average of all stimuli and responses rather than just correct responses. We note that this does not change the overall pattern of results.

R1.C4: 4. I found it a little difficult to get a clear picture of the primary message you're intending the reader to take from the research. As I mentioned above, I am not very familiar with the language learning literature, so this may be an artefact of that unfamiliarity (e.g., findings with an importance that would be obvious to someone actively working in the area were not obvious to me). Even so, if you want the work to be more accessible to a general audience, it might be worth thinking about how you can better emphasise this message.

We thank the reviewer for this comment. We have now better summarized the primary takeaway messages for readers in the Abstract, Introduction, and Discussion.

Minor comments:

R1.C5: p. 3, line 53: I think "temporal storage" should be "temporary storage".

We have changed this to “temporary storage”

R1.C6: pp. 3--4, lines 67--68: Really pedantic point here, but I think it's more precise to characterise performance as improving rather than increasing.

We have changed this to “improving”

R1.C7: p. 4: In the discussion of the role of WM capacity in language learning (or elsewhere), I wonder whether it might be worthwhile bringing in recent work by Smalle (e.g., Smalle et al., 2021, 2022; see references below) suggesting that interfering with cognitive control mechanisms (an important part of WM) can improve statistical learning (an important part of language development)?

Thank you for pointing us to these references. After reviewing these references, we have decided not to include them in our Discussion of our results because the methods are potentially too different from one another (e.g., unsupervised statistical learning of sequences vs. supervised learning of speech categories). We do not believe that we would be able to incorporate these references in the manuscript without providing substantial additional context and that is out of scope for the current article.

R1.C8: p. 8, lines 164--165: The phrasing is a bit awkward here.

We have updated this to: “The ability to accurately identify novel exemplars is a hallmark of categorization” (page 8, line 166-167)

R1.C9: p. 12, line 247: "were produced" should be "was produced".

We have updated this to “was produced”

R1.C10: p. 14, lines 290--291: There are a few differences between the study you cite here in support of not using an accuracy filter, and your own research (e.g., in-lab student vs. online sample) that make me wonder whether including an accuracy threshold would be equally unnecessary in your case. I'd be interested to at least know how well participants performed in the arithmetic task (e.g., was the proportion of participants who did poorly similar to that in Ðokic et al.'s sample?). Actually I see from looking at the supplemental materials that you do provide arithmetic task performance data; perhaps good to mention this here.

We now directly report average performance on the arithmetic task in the main text (page 14, lines 298-300) and point interested readers to the Supporting Information for more information.

R1.C11: p. 14, line 296: Did you consider allowing \delta_s (i.e., the offset parameter) to vary across participants as well (and assess its relationship with OSPAN scores)? It seems plausible to me that part of learning stimuli in a novel language is developing the ability to more rapidly encode them, so that the decision (/categorisation) process receives good information more rapidly.

We did allow the offset parameter to also vary across participants. We have now clarified this explicitly on page 15, lines 308-310: “The model lets the parameters µ_(d,s), b_(d,s), and δ_s vary between participants, which accommodates the substantial variability across participants.”

R1.C12: p. 15, line 307: What priors did you assign to the parameters?

The full details of our drift-diffusion model, including the choice of the priors, the resulting posterior, the associated computational machinery for model fitting and posterior inference, simulated and real data illustrations, etc., can be found in Paulon et al. (2021). A full discussion of these details is outside the scope of this article. Following your comments, however, we now present a non-technical summary of the model in the Supporting Information.

R1.C13: p. 16, lines 338--346: Would a logistic model (e.g., via glmer) not be a better choice than a linear one here, given accuracy is restricted between 0 and 1 (/100)?

Even though accuracy is restricted between 0 and 1, the mean accuracy values are continuous and normally distributed and are therefore appropriate for a linear analysis. Because our foc

Attachment

Submitted filename: 2023-04-25_ResponsetoReviewers.pdf

pone.0297917.s002.pdf (250.6KB, pdf)

Decision Letter 1

Alessandra S Souza

21 Jun 2023

PONE-D-23-00404R1

Individual differences in working memory impact the trajectory of non-native speech category learning

PLOS ONE

Dear Dr. Roark,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

I have sent your paper back to the original three reviewers of your prior submission. All reviewers agree that your paper is substantially improved. Yet, two of the reviewers feel that you have not sufficiently addressed their concerns about the plausibility of the model specified. This is a relatively big concern because model fit alone cannot be a basis to decide which model to use - when a model has several parameters, reasoning about the underlying model structure and meaningful implementation also needs to be used to constrain the model. Here, both reviewers feel that it makes no sense to allow response boundary to vary with the stimulus category. The reasoning for this is standard in the literature. Hence I agree with Reviewer 1 that you need to use a more constrained version, even if it slightly misfits the data. On that note, it would be important to present evidence of model fit. I think this should be presented as supplementary materials to not make the paper too long and cumbersome to read. Given that this is a major change in the results section, I am therefore inviting a major revision for the paper. Please include a point-by-point response to the reviewers' comments (note also that Reviewer 2 suggested that you may want to condense the results section a bit).

Please submit your revised manuscript by Aug 05 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Alessandra S. Souza, Ph.D.

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have effectively responded to most of my concerns, and I think the revised manuscript is an improvement on its predecessor in terms of its accessibility and readability. That said, I still have serious concerns regarding the psychological plausibility of a model of this type in which decision thresholds depend on knowledge of the stimuli being categorized, and the authors' response on this point has not convinced me otherwise. (I also read the paper by Paulon et al. very carefully to see whether I was missing something that had been described there, but did not come across any such thing.) A model that is psychologically implausible is problematic, irrespective of how well it might fit the data, because it is unclear how its parameters can be interpreted.

It is possible that people's decision boundaries really do change in a sort of cascaded manner, with processed low-level stimulus information feeding forward into the decision processes used for higher-level categorization. However, as far as I can see this would be an entirely different model to the one applied here.

I don't want to be totally un-constructive with my comments, so let me suggest three potential paths forward on this point. First, the authors could focus on the more restricted (but psychologically plausible) model instead, and revise the paper accordingly. Second, the authors could retain the current model, explicitly discuss its psychological implausibility, and explain in detail what sort of underlying processes might lead such a model to provide a better fit than the more restricted but theoretically sounder alternative. Third, the authors could do a bit of both: Cover the results from the restricted model AND the results from the flexible one, and discuss what it is that allows this model to fit the data better despite it being implausible, and what this might imply for the underlying processes. (Of course, I leave the possibility open that there are additional paths forward that I have neglected.)

In addition to this, I have just a few very minor, language-related comments:

p. 29, lines 610--612: This sentence is a little confusing. Do you mean that the increase across blocks was not significantly different from session 1? Or the rate itself?

p. 29, lines 619--620: "changed across blocks, sessions"---a word missing here?

p. 30, lines 634--635: I suggest changing "associated with an additional decrease in threshold" to "associated with a relative decrease in threshold" (or similar), since the baseline relationship for the session was positive.

p. 34, line 720: Who is the "them" in this sentence?

Reviewer #2: After the revision, the authors addressed all my previous concerns. Hence, I recommend accepting the manuscript.

The only minor gripe is that the statistic results became too long and cumbersome to read. Moving part of the results to the supplementary material might be better for readability. However, I would be okay with the paper staying the same.

Reviewer #3: The authors have addressed most of my comments from their initial submission. However, there remain two outstanding issues that have not been adequately addressed. These map onto my 2nd and 3rd points (R3.C2 and R3.C3, respectively) raised in my initial review.

Regarding R3.C2, I still have serious concerns about the decision to allow both drift rates and decision thresholds to vary as a function of stimulus-response combination. For drift rates, this is perfectly fine, and in keeping with conventional practice with regards to fitting data. For decision thresholds, however, this level of flexibility introduces a theoretical circularity that renders the model psychologically implausible.

It is fine for decision thresholds to vary as a function of response—participants may be more or less cautious about making particular responses. Indeed, this is how racing accumulator models (such as the racing diffusion model the authors use here) address response bias issues. Where things become problematic is when decision threshold is allowed to vary as a function of the stimulus. The core objection is this: If the outcome of the decision process is to identify the stimulus as belonging to a particular category, the system cannot be configured in a stimulus-specific manner because this requires foreknowledge of the outcome of the categorization process. Put another way, conditioning the threshold on the identity of the stimulus requires the system to have already categorized the stimulus for the purposes of threshold setting. In which case, there is no need for a subsequent evidence accumulation stage because the stimulus has already been categorized implicitly.

I appreciate that the authors have presented statistical support for allowing threshold to vary according to the stimulus, but this does not address the theoretical/psychological objection to configuring the model in this way. That is, the WAIC comparisons deal with a quality of fit issue, but they do not deal with the more substantive theoretical interpretation of the model.

This raises a point I had overlooked in my original review—there is no visualization of model fit. Is it the case that good fits can only be achieved by allowing the model to estimate thresholds in a stimulus-dependent way? Plotting either quantile-averaged data against quantile-averaged model predictions would be one way to check the correspondence between theory and data. Another way—perhaps more appropriate given the large sample nature of this study—would be to generate scatterplots showing predicted vs. observed RT quantiles for correct responses and errors as well as choice probabilities.

The other issue that I still think needs addressing, regarding R3.C3, extends on the above point. Are response-dependent thresholds required to achieve good fits to data? The authors correctly note in their original response that an accumulator framework does not require a common threshold setting for each accumulator, but my question was whether a common threshold value suffices to explain the data. It may well be the case that participants have response biases that lead them to be more reluctant/cautious about making some category responses over others, but demonstrating that this flexibility is needed is important for ensuring explanatory parsimony in terms of the preferred model.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jun 10;19(6):e0297917. doi: 10.1371/journal.pone.0297917.r004

Author response to Decision Letter 1


16 Nov 2023

We thank the editor and three reviewers for consideration of our revised manuscript. Below, we briefly summarize the additional changes we have made to the manuscript in response to reviewer comments. We then respond to each point raised by the reviewers with our responses directly underneath.

Summary:

- We now use a more constrained version of the model, in which drift rates vary across stimulus and response categories but boundaries vary only across response categories. We arrived at this model by selecting among different versions of a more constrained model, which are now further detailed in the Supporting Information.

- We include more information about model fit in the Supporting Information.

- We made minor changes for readability and simplified the Results section.

Editor

Thank you for submitting your manuscript to PLOS ONE. I have sent your paper back to the original three reviewers of your prior submission. All reviewers agree that your paper is substantially improved. Yet, two of the reviewers feel that you have not addressed their concerns enough about the plausibility of the model specified. This is a relatively big concern because model fit alone cannot be a basis to decide which model to use - when a model has several parameters, reasoning about the underlying model structure and meaningful implementation also needs to be used to constrain the model. Here, both reviewers feel that it makes no sense to allow response boundary to vary with the stimulus category. The reasoning for this is standard in the literature. Hence I agree with Reviewer 1 that you need to use a more constrained version, even if it slightly misfits the data. On that note, it would be important to present evidence of model fit. I think this should be presented as supplementary materials to not make the paper too long and cumbersome to read.

RESPONSE:

In direct response to comments from the Editor as well as Reviewers 1 and 3, we have used a more constrained version of the model, which does not allow boundaries to vary across different stimulus categories. Additionally, we have added more evidence of model fit to the Supporting Information.

We would also like to reiterate that the version of the model that allows both drift rates and boundaries to vary with the stimulus category is relatively novel to the field and is the basis of an NSF award to Dr. Sarkar and Dr. Chandrasekaran, and that it provides a significantly better fit to the data. We understand that there is concern about the psychological plausibility of this version of the model (hence we have used a more constrained version here). While the rationale is not fully elaborated in Paulon et al. (2021), we can assure the reviewers (three of that paper's authors are co-authors here as well) that the original decision to allow this flexibility was a well-thought-out one, made after significant deliberation regarding biological plausibility, and not a mere oversight.

Since the psychological plausibility of this model is outside of the scope of the current article, we plan to address this question in a separate future work. Our general rationale for the original model is as follows. (1) If we can assume (as is typical in conventional race models) that when an input stimulus category is presented, only the drift-diffusion processes corresponding to that specific input category (via drift rates varying by stimulus category) come into play in accumulating evidence in favor of different alternatives (while the other accumulators corresponding to other stimulus categories remain completely absent), we should be open to allowing the input category information to influence other aspects of the underlying processes (including the boundaries) as well. (2) The most important timepoint to consider while interpreting the decision boundaries is at the time of decision, not at the beginning of or during the evidence accumulation process. Allowing the boundaries to be flexible based on incoming information from the stimulus reflects the ability to dynamically adjust behavior based on stimulus information. This ability may not be used in tasks and approaches that commonly use DDMs (e.g., detection, go/no-go), but is arguably very important for perceptual categorization. (3) There is strong statistical evidence in favor of the fully flexible model (drift rates and boundaries allowed to vary over time and across responses and stimulus categories). If a reasonable argument for psychological plausibility can be made (see point 2 above), then we should take seriously the better statistical fit of this fully flexible model.

Here (for the purpose of the response to reviewers only), we provide an assessment of model fit for four different models that differ only in how they treat the boundaries. They are:

flexible: the original model of Paulon et al. (2021), which allows the boundaries $b_{d,s}^{(i)}(t)$ to vary with both the stimulus $s$ and the response $d$, as well as with time $t$;

fixed: a sub-model that allows the boundaries $b_{d}^{(i)}(t)$ to vary with the response $d$ and with time $t$, but not with the stimulus $s$;

constant: a sub-model that allows the boundaries $b_{d,s}^{(i)}$ to vary with both the stimulus $s$ and the response $d$, but not with time $t$;

fixed-constant: a sub-model that allows the boundaries $b_{d}^{(i)}$ to vary with the response $d$, but not with the stimulus $s$ or with time $t$.

All models accommodate subject heterogeneity by allowing the boundaries to vary with the subject index i. As anticipated, the original ‘flexible’ model provides the best model fit, followed by ‘fixed’, then ‘constant’, and finally the most restrictive ‘fixed-constant’.

Table R1. Model fit assessed by -2*WAIC for the original model of Paulon et al. (2021) and three different sub-models. The smaller the reported model assessment value, the better.

| Model | Session 1 | Session 2 | Session 3 |
|---|---|---|---|
| Model with $b_{d,s}^{(i)}(t)$ (flexible) | 92691.86 | 82792.24 | 76261.58 |
| Sub-model with $b_{d}^{(i)}(t)$ (fixed) | 93409.11 | 83795.52 | 77647.32 |
| Sub-model with $b_{d,s}^{(i)}$ (constant) | 95203.27 | 84444.15 | 78203.95 |
| Sub-model with $b_{d}^{(i)}$ (fixed-constant) | 95697.74 | 85671.44 | 79635.96 |
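As a quick check on the ordering reported in Table R1, the short sketch below ranks the four models within each session and computes each model's gap to the best-fitting one. The numbers are copied from Table R1; the dictionary layout and the helper names `ranking` and `gap_to_best` are illustrative choices, not part of the original analysis.

```python
# Values copied from Table R1 (smaller is better), one entry per session.
fit = {
    "flexible":       [92691.86, 82792.24, 76261.58],
    "fixed":          [93409.11, 83795.52, 77647.32],
    "constant":       [95203.27, 84444.15, 78203.95],
    "fixed-constant": [95697.74, 85671.44, 79635.96],
}

def ranking(session):
    """Model names ordered from best (smallest value) to worst; session is 0, 1, or 2."""
    return sorted(fit, key=lambda name: fit[name][session])

def gap_to_best(session):
    """Difference between each model's value and the best value in that session."""
    best = min(values[session] for values in fit.values())
    return {name: round(values[session] - best, 2) for name, values in fit.items()}

for s in range(3):
    print(f"Session {s + 1}: {ranking(s)} {gap_to_best(s)}")
```

The same ordering (flexible, fixed, constant, fixed-constant) holds in all three sessions, and within each session the 'fixed' sub-model is the closest of the three sub-models to the flexible model.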

Nevertheless, in the revised version of the current article and the Supporting Information document, we have decided to follow Reviewer 1’s first suggestion (see R1.C1 below) and use the more constrained ‘fixed’ version of the model, which does not allow boundaries to vary across different stimulus categories. We have edited the manuscript accordingly, and the results for decision threshold have changed from previous versions. The results for the evidence accumulation rate are nearly identical to previous versions. Overall, the results have not changed substantially and the major conclusions remain the same. We hope this comprehensive revision, which aligns with the reviewers’ perspectives, addresses the concerns.
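To make the structure of the ‘fixed’ model concrete, the following is a minimal simulation sketch of a race between independent diffusion accumulators in which drift rates depend on both stimulus and response, while boundaries depend on the response only. It is not the fitting code of Paulon et al. (2021): the number of categories, all parameter values, and the function names are assumptions chosen for illustration, and time variation and subject effects are omitted. It relies on the standard result that the first-passage time of a unit-variance Wiener process with positive drift mu to a boundary b is inverse Gaussian with mean b/mu and shape b^2.

```python
import numpy as np

def simulate_trial(drifts, bounds, offset, rng):
    """Simulate one trial of a race between independent diffusion accumulators.

    drifts : drift rates for the presented stimulus, one per response
    bounds : decision boundaries, one per response (no stimulus dependence)
    offset : non-decision time added to the winning finishing time
    Returns (winning response index, response time).
    """
    drifts = np.asarray(drifts, dtype=float)
    bounds = np.asarray(bounds, dtype=float)
    # Inverse Gaussian (Wald) first-passage times: mean b/mu, shape b**2.
    finish = rng.wald(bounds / drifts, bounds ** 2)
    winner = int(np.argmin(finish))
    return winner, offset + float(finish[winner])

rng = np.random.default_rng(0)
# Hypothetical values: drift[s, d] is largest when response d matches stimulus s.
drift = np.full((3, 3), 1.0) + 1.5 * np.eye(3)
bound = np.array([2.0, 2.0, 2.5])  # response-specific, shared across stimuli
offset = 0.3

def accuracy(stimulus, n_trials=2000):
    """Proportion of simulated trials on which the matching response wins."""
    hits = sum(
        simulate_trial(drift[stimulus], bound, offset, rng)[0] == stimulus
        for _ in range(n_trials)
    )
    return hits / n_trials

print([round(accuracy(s), 3) for s in range(3)])
```

In this sketch the boundary vector is set before the stimulus is known, and only the row of drift rates changes with the presented stimulus, which is the constraint requested by the reviewers. A higher boundary for one response (here the third) makes that response slower and less frequent, which is how response bias is expressed in this kind of model.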

Reviewer #1

R1.C1: The authors have effectively responded to most of my concerns, and I think the revised manuscript is an improvement on its predecessor in terms of its accessibility and readability. That said, I still have serious concerns regarding the psychological plausibility of a model of this type in which decision thresholds depend on knowledge of the stimuli being categorized, and the authors' response on this point has not convinced me otherwise. (I also read the paper by Paulon et al. very carefully to see whether I was missing something that had been described there, but did not come across any such thing.) A model that is psychologically implausible is problematic, irrespective of how well it might fit the data, because it is unclear how its parameters can be interpreted.

It is possible that people's decision boundaries really do change in a sort of cascaded manner, with processed low-level stimulus information feeding forward into the decision processes used for higher-level categorization. However, as far as I can see this would be an entirely different model to the one applied here.

I don't want to be totally un-constructive with my comments, so let me suggest three potential paths forward on this point. First, the authors could focus on the more restricted (but psychologically plausible) model instead, and revise the paper accordingly. Second, the authors could retain the current model, explicitly discuss its psychological implausibility, and explain in detail what sort of underlying processes might lead such a model to provide a better fit than the more restricted but theoretically sounder alternative. Third, the authors could do a bit of both: Cover the results from the restricted model AND the results from the flexible one, and discuss what it is that allows this model to fit the data better despite it being implausible, and what this might imply for the underlying processes. (Of course, I leave the possibility open that there are additional paths forward that I have neglected.)

RESPONSE:

In direct response to this comment (as well as similar comments made by Reviewer 3 and the Editor), we have used a more constrained version of the model, which does not allow boundaries to vary across different stimulus categories (Reviewer 1’s first suggestion here). Please see our detailed response to the Editor above for additional information.

In addition to this, I have just a few very minor, language-related comments:

R1.C2: p. 29, lines 610--612: This sentence is a little confusing. Do you mean that the increase across blocks was not significantly different from session 1? Or the rate itself?

RESPONSE:

Based on overall edits to the results, this sentence is no longer in the manuscript.

R1.C3: p. 29, lines 619--620: "changed across blocks, sessions"---a word missing here?

RESPONSE:

Based on overall edits to the results, this sentence is no longer in the manuscript.

R1.C4: p. 30, lines 634--635: I suggest changing "associated with an additional decrease in threshold" to "associated with a relative decrease in threshold" (or similar), since the baseline relationship for the session was positive.

RESPONSE:

Based on overall edits to the results, this sentence is no longer in the manuscript.

R1.C5: p. 34, line 720: Who is the "them" in this sentence?

RESPONSE:

We have changed this to clarify: “helps learners”

Reviewer #2

R2.C1: After the revision, the authors addressed all my previous concerns. Hence, I recommend accepting the manuscript.

The only minor gripe is that the statistic results became too long and cumbersome to read. Moving part of the results to the supplementary material might be better for readability. However, I would be okay with the paper staying the same.

RESPONSE:

Based on overall edits to the results to accommodate the more constrained model, the results are much shorter. We believe that these changes enhance readability.

Reviewer #3

The authors have addressed most of my comments from their initial submission. However, there remain two outstanding issues that have not been adequately addressed. These map onto my 2nd and 3rd points (R3.C2 and R3.C3, respectively) raised in my initial review.

R3.C1: Regarding R3.C2, I still have serious concerns about the decision to allow both drift rates and decision thresholds to vary as a function of stimulus-response combination. For drift rates, this is perfectly fine, and in keeping with conventional practice with regards to fitting data. For decision thresholds, however, this level of flexibility introduces a theoretical circularity that renders the model psychologically implausible.

It is fine for decision thresholds to vary as a function of response—participants may be more or less cautious about making particular responses. Indeed, this is how racing accumulator models (such as the racing diffusion model the authors use here) address response bias issues. Where things become problematic is when decision threshold is allowed to vary as a function of the stimulus. The core objection is this: If the outcome of the decision process is to identify the stimulus as belonging to a particular category, the system cannot be configured in a stimulus-specific manner because this requires foreknowledge of the outcome of the categorization process. Put another way, conditioning the threshold on the identity of the stimulus requires the system to have already categorized the stimulus for the purposes of threshold setting. In which case, there is no need for a subsequent evidence accumulation stage because the stimulus has already been categorized implicitly.

I appreciate that the authors have presented statistical support for allowing threshold to vary according to the stimulus, but this does not address the theoretical/psychological objection to configuring the model in this way. That is, the WAIC comparisons deal with a quality of fit issue, but they do not deal with the more substantive theoretical interpretation of the model.

This raises a point I had overlooked in my original review: there is no visualization of model fit. Is it the case that good fits can only be achieved by allowing the model to estimate thresholds in a stimulus-dependent way? Plotting quantile-averaged data against quantile-averaged model predictions would be one way to check the correspondence between theory and data. Another way, perhaps more appropriate given the large sample in this study, would be to generate scatterplots showing predicted vs. observed RT quantiles for correct responses and errors, as well as choice probabilities.

RESPONSE:

In direct response to the first point in this comment (as well as similar comments made by Reviewer 1 and the Editor), we have used a more constrained version of the model, which does not allow boundaries to vary across different stimulus categories (Reviewer 1’s first suggestion here). Please see our detailed response to the Editor above for additional information.

RESPONSE:

To the second point, regarding the visualization of model fit: we have now added plots of the predicted versus observed reaction times (and responses) to the Supporting Information (Figure S3).
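The quantile-based check the reviewer describes can be sketched as follows. This is a minimal illustration and not the code behind Figure S3; the function names, the choice of deciles, and the "correct"/"error" labels are assumptions made for the example. It pairs observed RT quantiles with quantiles from model-simulated RTs, giving the points for a predicted-versus-observed scatterplot.

```python
from statistics import quantiles

def rt_quantiles(rts, decile_ranks=(1, 3, 5, 7, 9)):
    """Return the RT quantiles at the requested deciles (1 = 0.1, ..., 9 = 0.9)."""
    cut_points = quantiles(rts, n=10)  # nine cut points, 0.1 through 0.9
    return [cut_points[rank - 1] for rank in decile_ranks]

def quantile_pairs(observed, predicted):
    """Pair observed with predicted RT quantiles, separately for each outcome.

    Both arguments map an outcome label (hypothetically "correct" or "error")
    to a list of RTs. Each returned pair is one point on a
    predicted-versus-observed scatterplot.
    """
    return {
        outcome: list(zip(rt_quantiles(observed[outcome]),
                          rt_quantiles(predicted[outcome])))
        for outcome in observed
    }
```

Points lying close to the identity line would indicate that the model reproduces the shape of the RT distributions, not only their means.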

R3.C2: The other issue that I still think needs addressing, regarding R3.C3, extends on the above point. Are response-dependent thresholds required to achieve good fits to data? The authors correctly note in their original response that an accumulator framework does not require a common threshold setting for each accumulator, but my question was whether a common threshold value suffices to explain the data. It may well be the case that participants have response biases that lead them to be more reluctant/cautious about making some category responses over others, but demonstrating that this flexibility is needed is important for ensuring explanatory parsimony in terms of the preferred model.

RESPONSE:

Please see our detailed response to the Editor above.
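The difference between a common threshold and response-dependent thresholds can be illustrated with a small simulation. This is a generic racing-accumulator sketch under stated assumptions, not the authors' Bayesian model: each accumulator is a Wiener process with unit diffusion coefficient, so its finishing time is inverse-Gaussian with mean threshold/drift and shape threshold squared; non-decision time is omitted; and all names and parameter values are hypothetical.

```python
import math
import random

def sample_wald(mean, shape, rng):
    """Draw one inverse-Gaussian (Wald) variate (Michael, Schucany & Haas method)."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (mean + (mean * mean * y) / (2 * shape)
         - (mean / (2 * shape)) * math.sqrt(4 * mean * shape * y + (mean * y) ** 2))
    return x if rng.random() <= mean / (mean + x) else mean * mean / x

def race_trial(drifts, thresholds, rng):
    """Return (winning response, decision time) for one simulated trial."""
    finish = {
        r: sample_wald(thresholds[r] / drifts[r], thresholds[r] ** 2, rng)
        for r in drifts
    }
    winner = min(finish, key=finish.get)  # first accumulator to reach threshold
    return winner, finish[winner]

def choice_share(drifts, thresholds, response, n_trials=4000, seed=1):
    """Proportion of simulated trials on which `response` wins the race."""
    rng = random.Random(seed)
    wins = sum(race_trial(drifts, thresholds, rng)[0] == response
               for _ in range(n_trials))
    return wins / n_trials
```

With equal drift rates, lowering one response's threshold (for example `{"A": 0.8, "B": 1.2}`) shifts choices toward that response, which is how such models express response bias. With a common threshold, choice differences can come only from the drift rates.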

Attachment

Submitted filename: 2023-11-15_ResponsetoReviewers.docx

pone.0297917.s003.docx (33.1KB, docx)

Decision Letter 2

Alessandra S Souza

9 Jan 2024

PONE-D-23-00404R2

Individual differences in working memory impact the trajectory of non-native speech category learning

PLOS ONE

Dear Dr. Roark,

Thank you for submitting your manuscript to PLOS ONE. I sent your paper back to the more critical reviewers from the prior submission. They all believe you have done a great job in revising your paper. They have only made very small suggestions of clarifications for the text, which I am inviting you to correct and resubmit for final acceptance of the paper (see comments below). Thank you for considering PLOS ONE as the outlet for your work.

Please submit your revised manuscript by Feb 23 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Alessandra S. Souza, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors' current version of the manuscript has successfully addressed all the major concerns I previously had. There are only a few minor language issues I noticed while re-reading, which I detail below:

p. 6: At the top of the page, the reference to the McHaney et al. paper is numbered as 10, but should be 12.

p. 8: "We will assess generalization in each session by presenting learners with novel stimuli spoken by novel talkers that they do not encounter during training and never receive feedback about the correct category." This seems ungrammatical at the end. Maybe "…that they do not encounter during training, without providing feedback about the correct category" would be better?

p. 11: "…without making sacrifices in performance" should perhaps be "without making sacrifices in accuracy" (since response time is also part of performance).

p. 11: The reference to the McHaney et al. paper is numbered wrongly again.

p. 15: Line 319 starts with "Click or tap here to enter text."

In addition, I thank the authors for attempting to explain the validity of their original model. I'm genuinely interested in how this would work, and though I don't really understand the argument made in the response letter, I'll look forward to seeing their future work on this topic. To that end, if I were a hypothetical reviewer of that work, one thing I'd definitely want to see would be model recovery simulations that show that the more flexible model is selectively preferred when data are generated by a model with boundaries that change within a trial (but not when they are generated by a model without such a feature).

Reviewer #3: The authors have addressed my comments and concerns in their revision. I have one small suggestion for a further analysis of drift rates that the authors might wish to consider (though I don’t think it is essential). It may provide some further insight into the extent to which WM relates to extraction of evidence in general vs. more focused extraction of categorically diagnostic information.

1 - I understand the focus on drift rates for the correct response option (i.e., when stimulus and response match). This indexes absolute evidence accumulation for the one option. It would be interesting to know whether there are discriminability differences as well, however. If correct drift rates were normalized by the sum across accumulators, you could further consider discriminability of the response options. This is potentially important because it pertains to whether higher WM is related to better extraction of evidence overall vs. extraction of diagnostic evidence that differentiates the categories.

Line 319 – Editing issue, “Click or tap here to enter text”.

Lines 319-320 – Was RT filtering based on slowest/fastest 1% of responses in the entire data set, or was filtering done separately for each participant?

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jun 10;19(6):e0297917. doi: 10.1371/journal.pone.0297917.r006

Author response to Decision Letter 2


12 Jan 2024

We thank the editor and reviewers for consideration of our revised manuscript. We respond to each point raised by the reviewers with our responses in blue italics.

Reviewer #1

The authors' current version of the manuscript has successfully addressed all the major concerns I previously had. There are only a few minor language issues I noticed while re-reading, which I detail below:

R1.C1: p. 6: At the top of the page, the reference to the McHaney et al. paper is numbered as 10, but should be 12.

We have fixed this.

R1.C2: p. 8: "We will assess generalization in each session by presenting learners with novel stimuli spoken by novel talkers that they do not encounter during training and never receive feedback about the correct category." This seems ungrammatical at the end. Maybe "…that they do not encounter during training, without providing feedback about the correct category" would be better?

We have made the suggested change.

R1.C3: p. 11: "…without making sacrifices in performance" should perhaps be "without making sacrifices in accuracy" (since response time is also part of performance).

We have made the suggested change.

R1.C4: p. 11: The reference to the McHaney et al. paper is numbered wrongly again.

We have fixed this.

R1.C5: p. 15: Line 319 starts with "Click or tap here to enter text."

We have fixed this.

R1.C6: In addition, I thank the authors for attempting to explain the validity of their original model. I'm genuinely interested in how this would work, and though I don't really understand the argument made in the response letter, I'll look forward to seeing their future work on this topic. To that end, if I were a hypothetical reviewer of that work, one thing I'd definitely want to see would be model recovery simulations that show that the more flexible model is selectively preferred when data are generated by a model with boundaries that change within a trial (but not when they are generated by a model without such a feature).

We thank the reviewer for their comments – this will surely help us in preparing this future work for publication.

Reviewer #3

The authors have addressed my comments and concerns in their revision. I have one small suggestion for a further analysis of drift rates that the authors might wish to consider (though I don’t think it is essential). It may provide some further insight into the extent to which WM relates to extraction of evidence in general vs. more focused extraction of categorically diagnostic information.

R3.C1: 1 - I understand the focus on drift rates for the correct response option (i.e., when stimulus and response match). This indexes absolute evidence accumulation for the one option. It would be interesting to know whether there are discriminability differences as well, however. If correct drift rates were normalized by the sum across accumulators, you could further consider discriminability of the response options. This is potentially important because it pertains to whether higher WM is related to better extraction of evidence overall vs. extraction of diagnostic evidence that differentiates the categories.

We appreciate this comment but see this analysis as outside of the scope of the current investigation. We have made a note about pursuing this possibility in future work.
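For readers unfamiliar with the suggested measure, one reading of the reviewer's normalization is the correct accumulator's drift rate divided by the summed drift rates of all accumulators. The sketch below assumes that reading; the function name, category labels, and values are hypothetical and are not taken from the manuscript.

```python
def relative_drift(drift_rates, correct):
    """Share of total drift carried by the correct accumulator.

    `drift_rates` maps each response option to its (positive) drift rate for
    a given stimulus; `correct` is the option matching that stimulus.
    """
    total = sum(drift_rates.values())
    if total <= 0:
        raise ValueError("drift rates must sum to a positive value")
    return drift_rates[correct] / total

# Two hypothetical participants with the same absolute correct drift (2.0):
focused = relative_drift({"A": 2.0, "B": 1.0, "C": 0.5, "D": 0.5}, "A")
diffuse = relative_drift({"A": 2.0, "B": 2.0, "C": 2.0, "D": 2.0}, "A")
```

Here `focused` is 0.5 and `diffuse` is 0.25: the absolute correct drift is identical, but only the first participant accumulates evidence that separates the correct category from the alternatives.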

R3.C2: Line 319 – Editing issue, “Click or tap here to enter text”.

We have fixed this.

R3.C3: Lines 319-320 – Was RT filtering based on slowest/fastest 1% of responses in the entire data set, or was filtering done separately for each participant?

We have clarified this (lines 323-324, page 15): The data were filtered to exclude very fast and very slow responses by removing the top and bottom 1% of all trials across all participants based on reaction time.
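The pooled filtering rule stated above can be sketched as follows. This is an illustration under assumptions, not the authors' analysis code (which is available through the OSF repository): the function name, the tuple layout, rounding the cut count down, and breaking ties by original order are all choices made for the example.

```python
import math

def trim_rt_pooled(trials, proportion=0.01):
    """Drop the fastest and slowest `proportion` of trials, pooled over participants.

    `trials` is a list of (participant_id, rt_seconds) tuples. Cutoffs are
    computed on the whole data set, not separately for each participant.
    """
    n = len(trials)
    k = math.floor(n * proportion)  # number of trials cut from each tail
    # Rank trial indices by RT; ties keep their original order.
    order = sorted(range(n), key=lambda i: trials[i][1])
    keep = set(order[k:n - k])
    return [t for i, t in enumerate(trials) if i in keep]
```

Because the cutoffs are pooled, a participant who is slow overall can lose more than 1% of their own trials, which is the distinction the reviewer's question draws out.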

Attachment

Submitted filename: 2024-01-11_ResponsetoReviewers.docx

pone.0297917.s004.docx (24KB, docx)

Decision Letter 3

Alessandra S Souza

16 Jan 2024

Individual differences in working memory impact the trajectory of non-native speech category learning

PONE-D-23-00404R3

Dear Dr. Roark,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Alessandra S. Souza, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (DOCX)

    pone.0297917.s001.docx (11MB, docx)
    Attachment

    Submitted filename: 2023-04-25_ResponsetoReviewers.pdf

    pone.0297917.s002.pdf (250.6KB, pdf)
    Attachment

    Submitted filename: 2023-11-15_ResponsetoReviewers.docx

    pone.0297917.s003.docx (33.1KB, docx)
    Attachment

    Submitted filename: 2024-01-11_ResponsetoReviewers.docx

    pone.0297917.s004.docx (24KB, docx)

    Data Availability Statement

    The stimulus materials, data, and analysis code are publicly available through the Open Science Framework repository and can be accessed online at https://doi.org/10.17605/OSF.IO/WDPYU.


    Articles from PLOS ONE are provided here courtesy of PLOS
