The Role of Sleep in Learning New Meanings for Familiar Words through Stories

Rachael C Hulme; Jennifer M Rodd

doi:10.5334/joc.282

. 2023 Jun 15;6(1):27. doi: 10.5334/joc.282

The Role of Sleep in Learning New Meanings for Familiar Words through Stories

Rachael C Hulme ¹, Jennifer M Rodd ¹

PMCID: PMC10275344 PMID: 37334107

Abstract

Adults often learn new meanings for familiar words, and in doing so they must integrate information about the newly-acquired meanings with existing knowledge about the prior meanings of the words in their mental lexicon. Numerous studies have confirmed the importance of sleep for learning novel word forms (e.g., “cathedruke”) either with or without associated meanings. By teaching participants new meanings for familiar word forms, this is the first study to focus exclusively on the specific role of sleep on learning word meanings. In two experiments participants were trained on new meanings for familiar words through a naturalistic story reading paradigm to minimize explicit learning strategies. Experiment 1 confirmed the benefit of sleep for recall and recognition of word meanings, with better retention after 12 hours including overnight sleep than 12 hours awake. Experiment 2, which was preregistered, further explored this sleep benefit. Recall performance was best in the condition in which participants slept immediately after exposure and were tested soon after they woke up, compared with three conditions which all included an extended period of wake during which they would encounter their normal language environment. The results are consistent with the view that, at least under these learning conditions, a benefit of sleep arises due to passive protection from linguistic interference while asleep, rather than being due to active consolidation.

Keywords: word learning, sleep, homonyms, story reading, lexical ambiguity, semantic ambiguity

Vocabulary knowledge continues to develop throughout life as we learn new words and also new meanings for words that are already familiar to us. Due to technological advancement triggering changes in language, adults must often learn new meanings for words that they already know (e.g., the internet-related meaning of “troll” as a person who posts deliberately antagonizing comments online; Hulme et al., 2019; Rodd, 2020). Adults also learn new meanings for familiar words when learning a new subject or hobby (e.g., in gaming “farming” is repetitive gameplay to gain more items/experience; Eligio & Kaschak, 2021; Rodd et al., 2012, 2016), or when joining a new social/geographical community (e.g., the Scots dialect word “piece” meaning a sandwich).

As adults learn new meanings for familiar words, they must integrate information about the newly-learned meanings with their existing knowledge about the prior meanings of words. With such an abundance of information stored in the mental lexicon, it is a challenge for the learner to acquire these new meanings whilst preserving their knowledge of the pre-existing meanings. The present experiments investigate the potential role of overnight consolidation during sleep in adults’ learning of new meanings for familiar words.

The Complementary Learning Systems (CLS) theory of word learning (Davis & Gaskell, 2009) provides an explanation for how new vocabulary may be integrated with pre-existing knowledge. Davis and Gaskell (2009) combined the CLS model of learning and memory (McClelland et al., 1995) with behavioral and neural findings from studies on spoken word form learning, to develop an account of how adults process and learn new words. They suggest that knowledge about new words is initially encoded into episodic memory in the hippocampus, but only becomes integrated into semantic memory in the neocortex after an active period of offline, sleep-related consolidation. It is only once words have become assimilated into the mental lexicon that they are more rapidly recognized and able to compete with existing similar word forms during word recognition (Tamminen & Gaskell, 2013; Leach & Samuel, 2007). The CLS theory of word learning follows a long tradition of systems consolidation theories proposing an active role for sleep in memory consolidation (Dudai, 1996, 2004; Marr, 1970; McClelland et al., 1995). The CLS model is also closely related to the active systems consolidation (ASC) model (Born & Wilhelm, 2012; Diekelmann & Born, 2010), but differs in terms of the precise mechanisms that are involved in the transfer of memories from the hippocampus to the neocortex.

The CLS model therefore proposes an active role for sleep in the consolidation of novel vocabulary items into semantic memory. In this view, sleep actively protects new vocabulary items from interference by strengthening and stabilizing memory traces. The active account predicts a benefit for words learned immediately prior to a period of sleep because these items are likely to still be available in episodic memory and so can be effectively consolidated during sleep, compared with items learned earlier in the day, which may not be available for effective consolidation. However, an alternative account is that sleep offers a passive benefit to memory: words are better retained when they are encountered immediately before a period of sleep because they are protected against interference from subsequent linguistic input that usually occurs during a period of wake (Jenkins & Dallenbach, 1924). There is a longstanding debate surrounding the nature of the benefit of sleep for memory (for a review see Ellenbogen, Payne, et al., 2006), however more recent research has provided strong evidence for an active role for sleep. For example, manipulating brain oscillations has been shown to enhance memory of word pairs (Ngo et al., 2013), and targeted memory reactivation has been found to facilitate consolidation of picture-location associations (Cairney et al., 2014). However, the importance of passive benefits of sleep on word learning should not be discounted.

Evidence for the CLS theory largely comes from studies of spoken word form learning. For example, Gaskell & Dumay (2003), taught participants novel word forms that were phonological neighbors to existing words (e.g., cathedruke for cathedral). Participants’ knowledge of these new words was then tested using a recognition test of the new word form, and lexical decision or pause detection tests of the existing words as measures of competition with the new words due to their lexicalization (Takashima et al., 2014), immediately and eight days later. Gaskell and Dumay (2003) found inhibited access for the existing words (due to competition for access from the new word forms) at the delayed test but not immediately after training, suggesting that offline consolidation is required for words to become integrated into the mental lexicon (Tamminen & Gaskell, 2013). Further studies of this nature have replicated these findings (e.g., Davis et al., 2009; Dumay & Gaskell, 2007; Tamminen et al., 2010; Tamminen & Gaskell, 2008). Several of these studies have also specifically shown the importance of sleep for the consolidation process, by dissociating sleep from the simple passage of time (Dumay & Gaskell, 2007), and showing associations between specific components of sleep, such as sleep spindles, and lexical integration (Tamminen et al., 2010). This body of work therefore provides evidence supporting the CLS account of word learning, and suggests that sleep may be critical for lexical consolidation.

However, several more recent studies have provided evidence for the lexicalization of new words without sleep. For example, Kapnoula et al. (2015) trained participants on short non-words that differed from real words on the final phoneme (e.g., jod and job), and tested for inhibition effects from these novel word forms on the existing words using the visual world eye-tracking paradigm (Tanenhaus et al., 1995). In the experiment they used a phoneme splicing manipulation to amplify competition between the new words and their pre-existing competitors. This competition was measured by examining participants’ eye movements towards a picture representing the existing word which was presenting alongside pictures denoting three unrelated filler words (one with some phonological overlap). They found that the new words were able to compete with the existing words for access immediately after training, without any consolidation (Kapnoula et al., 2015). Kapnoula and McMurray (2016) later provided further evidence that this competition seems to derive from lexicalized representations of the novel words, rather than from episodic memories, in contrast to CLS predictions. In a replication of the previously mentioned cathedruke study, Lindsay and Gaskell (2013) also found that words could be integrated immediately, without a period of sleep-based consolidation, when the novel words were repetitively trained using spaced learning alongside exposure to their existing competitor words. Therefore, under certain circumstances immediate integration of novel word forms seems to be possible without consolidation during sleep.

However, studies of learning word forms in isolation are not ecologically valid as word forms are not learned without meanings in everyday life, and this may lead participants to engage in more deliberate memorization than would occur in natural lexical learning. Most new vocabulary is learned incidentally from natural linguistic environments (e.g., conversations, books, TV) in everyday life (Nagy et al., 1985, 1987); however, most studies of vocabulary learning in adults have used rather artificial stimuli, tasks, and learning conditions. One study by (Henderson et al., 2015) examined adults’ and children’s vocabulary learning from naturalistic story reading; they found that novel words only became integrated with existing lexical knowledge after offline consolidation. Importantly, increasing the richness of information about a novel word by adding semantic information has also been shown to differentially engage the complementary memory systems during lexicalization (Takashima et al., 2014), compared to word form learning alone. It is therefore necessary to consider the implications of the CLS account for the learning of novel word meanings. Existing research has combined learning of new word forms with corresponding semantic information, for example by training participants on pronounceable non-words with invented picturable meanings (Clay et al., 2007), meaningful affixes attached to existing words (Tamminen et al., 2012), or low-frequency existing words (van der Ven et al., 2015). Each of these studies found that effects of semantic integration only arose following a period of overnight consolidation, which suggests that information about a word’s meaning also requires time to become integrated into semantic memory (van der Ven et al., 2015), and is consistent with the CLS theory.

Nevertheless, as with word form learning alone, evidence has also been found for the semantic integration of novel words and their meanings without sleep-dependent consolidation, such as in the area of language production (Oppenheim, 2015). Others have suggested that the method of encoding of novel words and their meanings can greatly impact upon subsequent semantic integration (Coutanche & Thompson-Schill, 2014; Himmer et al., 2017). Coutanche and Thompson-Schill (2014) showed that using the fast-mapping learning procedure (whereby participants are forced to infer meaning by process of elimination) enabled immediate lexical integration, while the more traditional explicit encoding procedure produced integration only after consolidation. Offline consolidation may therefore not be a prerequisite for the integration of new knowledge when the learning conditions encourage connections to be formed online between the new information and existing knowledge (Fang et al., 2017; Himmer et al., 2017), as is the case with fast mapping (Coutanche & Thompson-Schill, 2014; Himmer et al., 2017), spaced learning (Lindsay & Gaskell, 2009), and test-enhanced learning (Antony et al., 2017).

However, all of the aforementioned studies of word meaning learning and consolidation combined the acquisition of a new meaning with simultaneous acquisition of a novel word form, which is different to learning a new meaning for an existing word form that already has semantic information attached to it. This is an important distinction, because if both are learned together it is hard to disentangle whether the consolidation effects reflect learning of the form or the meaning, or both. So far only a few studies (Fang et al., 2017; Fang & Perfetti, 2017; Hulme et al., 2019; Hulme & Rodd, 2021; Maciejewski et al., 2020; Rodd et al., 2012) have looked at the effects of making an unambiguous word into an ambiguous one by assigning it a novel invented meaning. These studies did not examine any potential effects of overnight consolidation of the new meanings directly; while Fang and Perfetti (2017) did measure memory immediately after training and one week later, they did not examine the role of sleep and their study cannot disentangle the role of sleep from the passage of time.

The present experiments investigated the impact of sleep on learning new meanings for familiar words acquired incidentally through story reading. Experiment 1 compared participants’ memory of new meanings for familiar words after 12 hours including sleep to 12 hours of wake. Building on this, Experiment 2 compared participants’ memory of new word meanings trained either 24 hours or 12 hours prior to test, with participants tested either in the morning or in the evening to try to tease apart active and passive benefits of overnight sleep.

Experiment 1

Experiment 1 examined whether overnight consolidation is beneficial for learning new meanings for familiar words. The experiment used a between-groups design in which participants were trained on new meanings for familiar words incidentally through reading stories (as in Hulme et al., 2019; Hulme & Rodd, 2021). Relatively few studies have previously used such naturalistic, incidental training paradigms to investigate offline consolidation effects on vocabulary learning (Henderson et al., 2015). Participants were trained either in the evening or the morning (see Figure 1). This was then followed by a delay of 12 hours of either sleep or wake, and then by a test session.

Participants’ explicit knowledge of the new meanings was assessed using a cued recall test in which they were required to generate the word’s new meaning in response to its printed form, and a multiple-choice meaning-to-word matching test in which participants were presented with a short definition of each new word meaning and required to select the correct word form from all eight trained word forms. We included both of these explicit memory measures, as productive measures of vocabulary knowledge like cued recall and receptive measures of vocabulary knowledge like multiple-choice recognition can be differentially sensitive depending on the overall level of learning. Both measures are therefore useful together to ensure sufficient sensitivity to detect potential differences regardless of overall performance level. In addition, we included an implicit measure of reaction time (RT) on a semantic relatedness judgement task in which participants made speeded judgements about the pre-existing meaning of the target words. This test was intended to provide a measure of semantic integration by measuring the extent to which learning a new meaning for a familiar word form like “foam” would impair participants’ ability access this word’s familiar meaning and judge whether it was related to the probe word (e.g., foam-soap/foam-belt) (Fang et al., 2017; Fang & Perfetti, 2017; Maciejewski et al., 2020; Rodd et al., 2012). This task was designed to be a semantic analogue of tasks such as pause detection, which has been used in word-form learning studies to measure the impact of newly acquired knowledge about a novel word form (e.g., “cathedruke”) on online processing of a familiar word form (“cathedral”) (Gaskell & Dumay, 2003; Lindsay & Gaskell, 2013). The logic underlying this task fits within the framework of a distributed connectionist account of ambiguous word processing (Rodd, 2022; Rodd et al., 2002, 2004), whereby recognition of words with multiple unrelated meanings is delayed by the competition between the words’ mutually incompatible semantic representations (Rodd et al., 2012).

Based on previous studies of word form learning (Dumay et al., 2005; Henderson et al., 2015; Takashima et al., 2014; Tamminen et al., 2010) we predicted that participants who had slept between training and test would have better explicit memory of the new word meanings than those who had not slept. In addition, we predicted that participants who had slept would be slower to make a semantic relatedness decision to the trained words, due to increased competition arising between the new and old meanings. Any such inhibition effect would be taken as evidence that the new meanings had not only been better retained after sleep, but were sufficiently well integrated into the lexicon to compete during lexical access.

Method

Participants

We aimed to recruit 80 participants for Experiment 1 (40 participants per group, ten participants per version) in which participants were trained on eight items in a single training session. The sample size was established in consideration of previous word learning studies that have used the same or similar items (Hulme et al., 2019; Hulme & Rodd, 2021; Maciejewski et al., 2020; Rodd et al., 2012).

Eighty-four participants were included in the experiment (age: M = 31.4 years, SD = 8.6; 39 female). We over-recruited by four participants when assigning participants to the experiment versions and kept these participants. Participants were recruited through Prolific (www.prolific.co) using pre-screening criteria. Participants were eligible to take part if they were a current UK resident, monolingual native speaker of British English, and had no diagnosis of reading or language impairments. None of the participants had been diagnosed with a sleep disorder or were taking medication that could affect their sleep. They gave their informed consent before taking part and were paid £8 for their participation upon completion of all sessions.

An additional 39 participants began the experiment but dropped out before completing all sessions (12 from the wake group, 27 from the sleep group), and 17 participants failed to complete all sessions due to a technical issue; these participants were excluded. Thirteen further participants were excluded for having a sleep disorder or taking medication that could disrupt their sleep. A further 15 participants were excluded for getting more than one of the multiple-choice comprehension questions wrong when reading either of the stories. Four additional participants were excluded for being outliers in their mean reading speed (faster than 624.0 words per minute, two SD above the mean). Finally, 11 participants were excluded for misunderstanding the instructions of the cued recall test (e.g., providing a definition for the existing rather than the new meaning of a word), and one participant was excluded for low accuracy on the semantic relatedness judgement task (less than 87.8%, three SD below the mean). Excluded participants were replaced during recruitment.

Materials

Novel word meanings and short stories

The stimuli were 16 previously unambiguous words that were assigned new invented meanings, taken from Hulme et al. (2019). The novel meanings (describing fictional innovations, discoveries, and inventions) were semantically unrelated to the real existing meanings of the words. Single sentence definitions were initially constructed (matched for length: M = 32.9 words, SD = 3.7) to describe the new meanings for each of the words. These single sentence definitions were not presented to participants, but were provided to the story authors to incorporate the novel words into story narratives. For example, the definition sentence for “foam” was: “A foam is a safe that is incorporated into a piece of furniture with a wooden panel concealing the key lock, and each is individually handcrafted so that no intruders are able to recognize the chief use of the furniture.” (see Table S1 for the list of words and their definitions: https://osf.io/hykq9). The words with new meanings were incorporated into four separate short stories (2307–2446 words in length), taken from Hulme et al. (2019) (see Supplementary Materials: https://osf.io/hykq9). The stories were written for an adult audience by a professional children’s author (Story 1: Pink Candy Dream), and an unpublished author (Story 2: Prisons, Story 3: Reflections upon a Tribe, and Story 4: The Island and Elsewhere). Four of the words and their new meanings were included in each story, with eight occurrences of each item throughout the narrative. The first occurrence of an item gave sufficient information to allow participants to derive the new meaning from the context, for example, “‘Yes,’ I murmured, breathing again. ‘I knew it! It’s a foam.’ The ornate chaise longue was no ordinary piece of furniture but concealed a built-in safe with an intricate key-operated locking system.” The amount of information about each new meaning in subsequent occurrences varied naturally according to the narratives.

Additionally, paraphrased definition sentences that were used in the cued recall of word forms test in Hulme et al.’s (2019) study were used in the multiple-choice meaning-to-word matching test in this experiment (see Table S2: https://osf.io/hykq9). These short, paraphrased definition sentences were presented to participants in the multiple-choice meaning-to-word matching test and participants were asked to select the appropriate word for each definition.

Semantic relatedness judgement task

Each of the 16 stimulus words was paired with one probe word that was semantically related to the existing meaning of the word (e.g., foam-soap) and one that was semantically unrelated (e.g., foam-belt; see Table S3 for the full list of target words and probes: https://osf.io/hykq9). For each participant, the eight items that they were not trained on (in the two stories they did not read) served as untrained control items (with the items in each training condition counterbalanced across participants). The majority of the semantically related probes (n = 10) and unrelated probes (n = 12) were selected from those used by Maciejewski et al. (2020) paired with the same target words in their semantic relatedness judgement task. The remaining semantically related probes (n = 6) were selected from the Edinburgh Association Thesaurus (EAT; Kiss, Armstrong, Milroy, & Piper, 1973), and the remaining unrelated probes (n = 4) were selected from the remaining set of probe words from Maciejewski et al. (2020) that had been used with other target words. The degree of semantic relatedness between each target word and its corresponding related and unrelated probes was determined using Latent Semantic Analysis (LSA; Landauer, Foltz, & Laham, 1998). The mean LSA score comparing semantically related probes and targets was 0.4 (SD = 0.2), and for semantically unrelated probes and targets it was 0.06 (SD = 0.1). All of the probe words were nouns, and the related and unrelated probes were matched as closely as possible to both the target words and to each other in terms of frequency and word length, and were also similar in terms of number of senses and number of semantic associates (see Table S4 for the properties of the probe words: https://osf.io/hykq9). The probe words were not semantically related to the novel meanings of the words, and none of the probe words appeared in any of the stories.

Additionally, 16 fillers were selected from the control words used by Maciejewski et al. (2020) that were also matched to the stimuli by frequency and word length. Half of these fillers (selected at random) were paired with two semantically related probes, and the other half were paired with two semantically unrelated probes. This was to prevent a predictable pattern whereby each target word seen would appear once with a related probe followed by an unrelated probe, or vice versa. The probes for the fillers were selected as above. An extra eight unmatched fillers served as a practice block before the start of the main experimental task. These fillers had the same distribution as the trials in the main experiment: half were paired with both a related and unrelated probe, and half were paired with either two related or two unrelated probes. Another eight extra unmatched fillers, with the same distribution of target-probe pairings were selected to appear at the beginning of the experimental task blocks to accustom participants to the speed and rhythm of the task.

Design

Sleep was manipulated between-participants (wake group vs. sleep group) and within-items. Training condition in the semantic relatedness judgement task (trained items vs. untrained items) was manipulated within-participants and within-items. Each participant was trained on only half the total number of stimuli (eight items) through two of the four stories (Stories 1 and 4, or 2 and 3), as this was deemed to be a feasible number of new meanings to learn in a single session. Furthermore, to ensure that each new word meaning appeared roughly an equal number of times in each sleep condition, and that the order of the two blocks in the semantic relatedness judgement task was counterbalanced across participants (to minimize any order effects), there were eight versions of the experiment. Participants were pseudorandomly assigned to one of the eight versions of the experiment, there were 10–11 participants per version, and half of the participants in each the wake group and the sleep group (42 participants per group). The dependent measures were accuracy in cued recall of meanings and the multiple-choice test, and RT and accuracy in semantic relatedness judgement task.

Procedure

The experiment was run online using Qualtrics (Qualtrics, 2015) with the Qualtrics Reaction Time Engine (QRTE; Barnhoorn et al., 2015). Figure 1 shows a schematic of the experiment schedule.

Recruitment phase

Participants provided demographic details and confirmed their availability to take part in sessions at all of the possible times (although they would only be required to complete sessions at some of those times). Participants were then pseudorandomly and evenly assigned by the experimental software to one of the eight versions of the experiment, which determined whether they were in the wake group or the sleep group. They were then given the times for their two subsequent sessions starting the following day at either 8.30–10am and 8.30–10pm for the wake group, or 8.30–10pm and 8.30–10am the following morning for the sleep group. Participants were not told that the purpose of the experiment was to learn new word meanings, and were not aware that their memory would be tested. Instead they were told that the experiment investigated reading ability and comprehension at different times of day.

Training phase

Participants each read two stories (stories 1 and 4, or 2 and 3). The procedure for reading the stories was the same as described in Hulme et al. (2019). Each story was displayed on-screen across five pages of approximately equal length. After each page, a four-alternative multiple-choice comprehension question on a separate screen asked about details of the story’s plot from the preceding page (without probing details of the novel word meanings). Participants were instructed to read the story carefully, without skim-reading, and to answer the multiple-choice comprehension questions; participants were not able to re-read previous pages. After completing the first story, participants were given a brief break of 20 seconds before they could continue to the second story.

Testing phase

At the start of the testing phase participants rated their alertness on the Stanford Sleepiness Scale (SSS; Hoddes et al., 1973).

Semantic relatedness judgement

The first test was the semantic relatedness judgement task. Participants were presented with the stimuli and filler words one at a time on the screen. Each trial began with a 500 ms fixation cross, followed by the target word for 500 ms, then a second 500 ms fixation cross, and finally the probe word was presented until a response was given. Participants’ task was to decide whether the target and probe word were semantically related (e.g., foam-soap) or not (e.g., foam-belt). Participants were not told which meaning of the target word they should attend to, and were instructed to respond as quickly and accurately as possible. They indicated their choice with a “yes” response (“j” key), or a “no” response (“f” key).

Participants first completed a practice block of 16 trials. Following each practice trial, a feedback screen informed them whether their response had been correct or incorrect, and for responses slower than 2000 ms, a feedback message told participants that their response was too slow and they should respond more quickly. Participants then proceeded to the main experimental task. Trials were split into two blocks of 40 trials each, which each began with eight fillers, followed by 16 experimental trials intermixed with 16 matched filler trials. Each target word (stimulus or filler) appeared once in each of the two blocks, paired with one of its two probe words in the first block, and the other in the second block. For the stimulus items, half of the targets (selected at random) appeared with their semantically related probe in Block 1 and then with their semantically unrelated probe in Block 2; and vice versa for the other half. The order of the blocks was counterbalanced across participants, and the order of trials within each block (and the fillers at the beginning of each block) was randomised for each participant. The same feedback message was displayed when participants responded slower than 2000 ms; there was no accuracy feedback in the main task. There was a break of at least ten seconds between the two blocks.

Cued recall of meanings

Participants were then given a cued recall test for the newly learned word meanings. They were presented one at a time with the eight words they had been trained on and instructed to type the new meaning for the word into a text box. They were asked to give as much detail as they could and to try to answer in full sentences even if they were unsure of their answer. If they could not remember anything about the new meaning, they were instructed to type “don’t know”. The order of presentation of the word cues was randomized for each participant.

Multiple-choice meaning-to-word matching

Finally, participants completed the multiple-choice meaning-to-word matching test. Participants were presented one at a time with short sentences giving definitions of the each of the new word meanings that they had been trained on. For each novel meaning participants were asked to select the word that matched the definition from a list of all eight of the words they had been trained on. The order of these eight words was randomized for each test item, and the order of the new meaning definitions was randomized for each participant.

Results

Stanford Sleepiness Scale

Results for the SSS (Hoddes et al., 1973) measured at the test session were analyzed using a Wilcoxon-Mann-Whitney test. There was no significant difference in SSS score between the wake group (Mdn = 3) who were tested in the evening, and the sleep group (Mdn = 3) who were tested in the morning [W = 798, p = .437].

Analysis procedure

Responses for the cued recall test were coded by the experimenter blind to condition as “1” for correctly recalled items or “0” for incorrect. Responses were leniently coded as correct if at least one correct feature of a new meaning was recalled, and any ambiguous or partially correct responses were resolved on a case-by-case basis.1 Responses from the multiple-choice test were coded as “1” if the correct word had been selected, or “0” for incorrect.

Data from the semantic relatedness judgement task were pre-processed prior to analysis. Accuracy was very high overall for all items in the semantic relatedness judgement task.2 For analysis of the RT data incorrect trials were removed (2.6% of trials), and RTs faster than 300 ms or slower than 2500 ms were trimmed from the data (0.2% of remaining trials). The RT analysis was only for correct related trials, for which participants had correctly responded that the target and probe words were semantically related. The accuracy analysis was also only for related trials, with RTs faster than 300 ms or slower than 2500 ms trimmed from the data.

Data from the three test tasks were analyzed separately using linear mixed effects (LME) models using the lme4 package (version 1.1-12; Bates et al., 2015) and R (version 3.3.1; R Core Team, 2017). The binary accuracy data from the cued recall and multiple-choice measures were analyzed using logistic LME models. The contrast for the fixed effect of group was defined using deviation coding (sleep group: 0.5, wake group: –0.5). These two models included by-participant and by-items random intercepts, and a by-items random slope for group. The random effects structure was determined by identifying the maximal random effects structure justified by the design (Barr et al., 2013).

The RT and accuracy data from the semantic relatedness judgement task were analyzed using an LME model and logistic LME model respectively. Contrasts for the fixed effects were defined using deviation coding for group (sleep group: 0.5, wake group: –0.5), training condition (trained items: 0.5, untrained items: –0.5), and block (first block: –0.5, second block: 0.5),3 with the interactions coded by multiplying the contrasts for the appropriate factors. Both models contained fixed effects for group, training condition, and block, as well as all of the two-way interactions, and the three-way interaction. The maximal random effects structure was first fitted for both models, consisting of by-participant and by-items random intercepts, by-items random slopes for group, training condition, block, all two-way interactions, and the three-way interaction, and by-participant random slopes for training condition, block, and the interaction. This model converged and was used as the final model for the RT analysis; the model for analysis of the accuracy data was simplified by removing the correlations between the by-participant and by-item random slopes and random intercepts (as recommended by Barr et al., 2013). The assumptions of homoscedasticity and normality were violated in the raw RT data, so the data were log₁₀- and inverse-transformed (invRT = 1000/rawRT) and compared with the raw RT data. The inverse-transformed RTs met the assumptions of homoscedasticity and normality most closely and were therefore used for the analysis. Significance of the fixed effects and interactions was assessed using likelihood ratio tests comparing the full model to models with only each fixed factor/interaction of interest removed in turn. Significance for follow-up analyses was assessed using the same method. Data and analysis scripts for this experiment are available via the Open Science Framework (OSF; https://osf.io/m3pj6).

Cued recall of meanings

Accuracy on the cued recall test (Figure 2) was low overall: participants in both groups could generate the meaning for less than 50% of the words. The sleep group correctly recalled significantly more of the novel word meanings (47.9%) than the wake group (36.3%), [χ²(1) = 4.13, p = .042].

Experiment 1. Mean percentage of correct responses on the cued recall test (meanings correctly recalled for the appropriate word) for participants in each of the two groups. Error bars show standard errors for the means.

Multiple-choice meaning-to-word matching

Data for the multiple-choice test (Figure 3) showed higher overall accuracy than for cued recall, but it was not near ceiling. The pattern of the data was the same as for the cued recall test: accuracy was significantly higher for the sleep group (74.1%) than for the wake group (60.1%), [χ²(1) = 7.01, p = .008].

Semantic relatedness judgement

Mean percentage error rates for the semantic relatedness judgement task are illustrated in Figure 4; accuracy was very high overall and error rates differed little between the different conditions. There was no significant overall difference in accuracy between the two groups [χ²(1) = 0.03, p = .868], and no significant overall difference between accuracy for trained and untrained items [χ²(1) = 0.001, p = .972]. The interaction between group and training condition was non-significant [χ²(1) = 0.80, p = .370]. There was also no significant effect of block [χ²(1) = 0.69, p = .408], interaction between group and block [χ²(1) = 0.83, p = .361], interaction between training condition and block [χ²(1) = 0.08, p = .778], or three-way interaction [χ²(1) = 2.69, p = .101].

The RT data from the semantic relatedness judgement task are shown in Figure 5. There was a significant main effect of group [χ²(1) = 3.90, p = .048], whereby the sleep group, who were tested in the morning, were faster overall than the wake group, who were tested in the evening. Although there was a trend in the data for participants in both groups responding slightly slower to trained items than untrained items, the main effect of training was non-significant [χ²(1) = 1.60, p = .205]. Furthermore, the predicted interaction between group and training condition was not significant [χ²(1) = 0.09, p = .769]. There was also no significant effect of block [χ²(1) = 0.06, p = .810], interaction between group and block [χ²(1) = 0.92, p = .338], interaction between training condition and block [χ²(1) = 0.06, p = .809], or three-way interaction [χ²(1) = 0.23, p = .632].

Experiment 1. Mean reaction time on the semantic relatedness judgement task for participants in each of the two groups for untrained and trained items. The data shown are correct related trials only (trials to which the participants correctly responded ‘yes’ that the target and probe were semantically related). Error bars show standard errors for the means, corrected for the within-participants factor of training condition (Cousineau, 2005).

To obtain a full impression of the data from the semantic relatedness judgement task, we carried out additional exploratory analyses with a subset of trained items for which participants had correctly recalled the item in the cued recall test (similar to Rodd et al., 2012; Experiment 3). Results were largely similar to the main analyses (see Supplementary Results: https://osf.io/p3t49).

Discussion

Experiment 1 investigated how the contribution from a delay including sleep compares to a delay without sleep for learning new word meanings. Participants were trained on new meanings for familiar words incidentally through story reading. The sleep group were trained in the evening and tested the next morning after a 12-hour delay. The wake group were trained in the morning and tested that evening, also after a 12-hour delay. Results for the two explicit memory measures showed that participants in the sleep group remembered significantly more of the new word meanings than those in the wake group. Mean accuracy in cued recall of new meanings was 11.6% higher for the sleep group than the wake group, and accuracy on the multiple-choice test was 14.0% higher for the sleep group than the wake group.

These findings for word meanings are broadly consistent with other similar studies of word-form learning that show better retention after 12 hours including sleep than after 12 hours of wake (Dumay & Gaskell, 2007; Tamminen et al., 2010). For example, Tamminen et al. (2010) found significantly improved recall of new word forms for a group tested after 12 hours including sleep, but no improvement for a group tested after 12 hours of wake.

In contrast to these clear effects in the explicit measures of word meaning learning, the semantic relatedness judgement task showed no significant effect of training and no significant interaction between training and sleep group. It is unclear whether these null effects reflect the true absence of integration/consolidation of these new word meanings into the lexicon, or a lack of task sensitivity in the current paradigm to detect such consolidation effects.

A possible limitation of Experiment 1 lies in participant drop-out from the experiment. Although participants were randomly assigned to the sleep and wake groups during the recruitment phase of the experiment, there was some selective attrition. More participants dropped out from the sleep group, for whom the two experimental sessions were on two separate days, than the wake group for whom both experimental sessions were on the same day. This is potentially problematic as selective attrition could introduce confounds (e.g., participant motivation) of the experimental manipulation (Zhou & Fishbach, 2016). The pattern of results for Experiment 1 could potentially be partially explained by non-random participant drop-out from the sleep group if, for example, the higher attrition rate in the two-day version of the task selectively excluded participants who were less motivated to learn.

In summary, participants had better explicit memory of new meanings for familiar words after 12 hours that included a period of overnight sleep, as compared with participants tested after a 12-hour period of wake. However, in the absence of significant effects on the implicit measure of consolidation of the new meanings, it is unclear whether this sleep benefit was due to active sleep-related consolidation or due to passive protection from interference from the linguistic input that participants in the wake group will have encountered during their day. Furthermore, the sleep effect seen in recall and recognition of new word meanings may be an effect of time of day, for example due to general cognitive enhancement improving recall function (Schmidt et al., 2007), as time of day was confounded with sleep group at both encoding and test. It is possible that due to circadian differences participants may have learned better in the evening, or remembered better in the morning. However, there was no significant difference in participants’ SSS ratings (Hoddes et al., 1973) between the sleep group and the wake group at the test session.

Experiment 2

The aim of Experiment 2 was to further investigate the sleep benefit seen in Experiment 1 using a preregistered mixed “12:12” design similar to that used by Dumay and Gaskell (2007). Participants were divided into two groups who began the experiment either in the morning or the evening. Both groups were immediately trained on half the target words and then returned after 12 hours for training on the other half of the target words. After a second 12-hour delay all participants were tested on all items (see Figure 6). The two groups therefore had the same lengths of time delay between the two training sessions and the test (24 hours and 12 hours) and the same amounts of time spent asleep and awake, with the only difference being when the period of sleep occurred in relation to the test. This design allowed for exploration of the effects of sleep without repeated testing of newly learned information (Hulme & Rodd, 2021).

Experiment 2 therefore included the two 12-hour delay conditions from Experiment 1. For these conditions we expected to replicate the sleep benefit seen in Experiment 1, such that the new meanings should be better remembered by the AM-test group who experienced overnight sleep during the 12-hour delay period, compared with the PM-test group who remained awake throughout the delay. Experiment 2 also included two additional conditions in which items are tested 24 hours after learning. Critically, for the PM-test group these items were learned immediately before overnight sleep and then tested 24 hours later such that participants had the opportunity for active sleep-related consolidation that would protect them from forgetting during their subsequent day awake. In contrast, for the AM-test group learning was immediately followed by a period of wake, such that participants would have already (partially) forgotten much of what they learned during the 12 hours that they spent awake after training (as shown in Experiment 1) before they had the opportunity to consolidate this knowledge overnight.

Therefore, if the sleep benefit observed in Experiment 1 was driven by active sleep-related consolidation (Davis & Gaskell, 2009) we should observe an interaction between training condition and group: in the 12-hour delay condition performance should be better for the AM-test group, whereas in the 24-hour delay condition performance should be better for the PM-test group. This account also predicts a similar pattern of results in the semantic relatedness judgement task: competition between the new and old meanings of the words should only arise for items learned immediately before overnight sleep, and not for items trained before a 12-hour period of wake.

In contrast, the passive account of the benefit of sleep (Ellenbogen, Payne, et al., 2006) predicts a somewhat different pattern of results. Specifically, this account predicts that memory performance should be best for participants who learned the new word meanings in the evening and were tested the following morning (AM-test group: 12-hour delay), because this is the only experimental condition that does not include an extended period of daytime linguistic experience between training and test. Therefore, this condition minimizes the amount of potentially interfering linguistic information that participants encounter between training and test. In contrast, the other three experimental conditions all contain a single period of daytime awake, which is likely to reduce retention of the newly learned word meanings. This account is somewhat underspecified, and it is currently unclear whether these three conditions might also differ from each other due to differences in exactly when this interfering linguistic information is encountered relative to learning and test. Thus although this account also predicts a group by training condition interaction, it predicts that this interaction should take a somewhat different form, such that the two groups should differ in their performance in the 12-hour delay condition, but that this difference should not be present in the 24-hour delay condition. Furthermore, the passive account of the role of sleep is also underspecified as to whether we would expect to observe competition effects between new and old meanings on the semantic relatedness task as it is unclear whether any of these conditions would result in learning that was sufficiently strong to drive competition effects (Rodd et al., 2012). Finally, if the results of Experiment 1 are due to time-of-day effects, then the results of this experiment should show a straightforward pattern whereby the strength of learning depends solely on the time at which participants were trained or tested. This experiment was preregistered through the Open Science Framework: https://osf.io/uvgp4 (Hulme & Rodd, 2017, August 9). Where applicable any deviations from the preregistration have been noted in the Method and Results sections.