Published in final edited form as: Lang Cogn Neurosci. 2018 Mar 27;33(8):1063–1082. doi: 10.1080/23273798.2018.1455985

An electrophysiological megastudy of spoken word recognition

Kurt Winsler a, Katherine J Midgley a, Jonathan Grainger b, Phillip J Holcomb a

Abstract

This study used electrophysiological recordings in response to a large sample of spoken words to track the time-course of word frequency, phonological neighbourhood density, concreteness and stimulus duration effects in two experiments. Fifty subjects were presented with more than a thousand spoken words during either a go/no-go lexical decision task (Experiment 1) or a go/no-go semantic categorisation task (Experiment 2) while EEG was collected. Linear mixed effects modelling was used to analyze the data. Effects of word frequency were found on the N400 and also as early as 100 ms in Experiment 1 but not Experiment 2. Phonological neighbourhood density produced an early effect around 250 ms as well as the typical N400 effect. Concreteness elicited effects on the N400 in later epochs. Stimulus duration affected all epochs, and its influence reflected changes in the timing of the ERP components. Overall, the results support cascaded interactive models of spoken word recognition.

Keywords: Spoken word recognition, ERP, frequency, phonological neighborhood density

Introduction

Our ability to recognise spoken words is one of the most frequently used and important of our cognitive skills. So, it is perhaps somewhat surprising that there is much we still do not know about the underlying neuro-cognitive processes that are involved in mapping sound onto meaning. Though perceived as effortless, the ability to decode continuous, transient auditory information into a single word from tens of thousands of candidates within a fraction of a second involves a highly complex set of neuro-cognitive processes. This task is further complicated by the fact that many words are acoustically quite similar to each other and that human speech is extremely variable, owing both to idiosyncratic speaker characteristics and to phonological context, which alters the acoustic properties of phonemes depending on their neighbours. Clearly, semantic and syntactic context have important roles in spoken language comprehension in real world contexts, but there is a general consensus that such higher-level processing is driven primarily by mechanisms operating at the level of individual words. Models of spoken word recognition generally agree that this involves multiple hierarchical levels, which begin operating on partial information, activating representations of multiple word candidates in parallel that then compete for recognition.

One approach to untangling the array of underlying mechanisms involved in word recognition is to examine the impact of various linguistic factors on this process. The bulk of the work using this approach has employed behavioural dependent variables such as reaction time, although these measures largely occur after the processes of interest and therefore do not directly reflect the brain activity of the underlying neuro-cognitive processes. Moreover, such behavioural measures are generally unitary, offering a limited perspective on the dynamic nature of word processing. This latter issue might be particularly important in the case of spoken language where words unfold over time. Because they continuously reflect information processing in real time, event-related brain potentials (ERPs) have proven to be an excellent choice for studying the temporal dynamics of spoken word processing. However, while many studies have used ERPs to track the time course of visual word recognition (e.g. Grainger & Holcomb, 2009; Hauk, Davis, Ford, Pulvermüller, & Marslen-Wilson, 2006), there are comparatively fewer studies of spoken word recognition (see Hagoort & Brown, 2000, for one example) and there are none that have looked at the influence of a wide array of linguistic variables. Here we report on a study in which fifty participants listened to over a thousand single spoken words while EEG was recorded.

Perhaps the most studied lexical variable in studies of word recognition is word frequency, which is typically measured as the number of occurrences of a word in a given corpus of written or spoken language. The basic finding, which has been widely replicated, is that listeners are more accurate and have faster reaction times to high frequency compared to low frequency words in a variety of tasks (see Rubenstein, Garfield, & Millikan, 1970, for an early demonstration, and Ferrand et al., 2017, for a recent megastudy). In spoken word recognition models, such word frequency effects can be accounted for in a number of ways. In activation models, higher frequency lexical units can have lower activation thresholds (Marslen-Wilson, 1990), higher resting states of activation (McClelland & Elman, 1986), or stronger connections between units (Dahan, Magnuson, & Tanenhaus, 2001); in a Bayesian modelling framework, frequency effects can be attributed to the higher prior probability of high frequency words (Norris & McQueen, 2008). Alternatively, models such as the Neighbourhood Activation Model (NAM) suggest that word frequency does not affect processing at a lexical level, but rather acts as a post-lexical decision bias (Luce & Pisoni, 1998). Of course, it is entirely possible that a complex variable such as word frequency exerts an influence on spoken word comprehension across multiple processing levels and that the pattern of its influence may be sensitive to the task demands placed on the listener.

In ERP research on visual word recognition, word frequency manipulations have been shown to modulate the amplitude of the N400 (e.g. Smith & Halgren, 1987), a component usually associated with lexico-semantic processing. Thus, larger N400s for lower frequency words may reflect the increased processing necessary to map lexical representations of lower frequency words onto their meanings (Kutas & Federmeier, 2011). Consistent with this view is the finding that word frequency effects on the N400 decline as other factors that facilitate word processing (e.g. context) increase (Van Petten & Kutas, 1990). In addition to the N400, there is conflicting evidence that earlier ERP components (as early as the N1) are also sensitive to manipulations of word frequency, at least in the visual modality (e.g. Chen, Davis, Pulvermüller, & Hauk, 2015; Hauk et al., 2006; Sereno, Rayner, & Posner, 1998). In their ERP megastudy using the same stimuli and a similar experimental design to the current study (see below), Dufau, Grainger, Midgley, and Holcomb (2015) found a very small effect of word frequency emerging at posterior (occipital) sites between 200 and 300 ms, with the largest effects of frequency occurring on the N400.

Effects of word frequency on spoken word ERPs have also been reported. In one study Dufour, Brunelliere, and Frauenfelder (2013) found that low frequency spoken words produced a larger anterior positivity and posterior negativity at 350 ms compared to high frequency words. Similar to written words, they also found larger late N400 activity (550 to 650 ms) for low compared to high frequency spoken words. The lateness of the second effect (550 to 650 ms) could be consistent with a post-lexical locus of word frequency for spoken words, while the bipolar effect across the scalp at 350 ms might reflect a pre-recognition influence of word frequency.

Another variable that has been shown to influence spoken word processing is phonological neighbourhood density (PND). PND is a measure of the number of other words that are phonologically similar to a given word. Behaviourally, spoken words with dense neighbourhoods (i.e. words that share phonological characteristics with many other words) tend to be recognised more slowly and with less accuracy than words with fewer neighbours (Goldinger, Luce, & Pisoni, 1989). This pattern has been suggested to indicate the influence of interference or competition from the other similar words in a target word’s phonological neighbourhood (e.g. Vitevitch & Luce, 1999). This competition between similar sounding words is assumed by many models of spoken word recognition; however, it is incorporated in different ways. In the Cohort model (Marslen-Wilson, 1987), words which share initial phonemes are co-activated and compete for recognition as more information becomes available. This predicts neighbourhood effects, but only among words which initially resemble each other (cohorts) and not with other sorts of phonological neighbours such as rhymes, which nevertheless have also been found to produce neighbourhood effects (e.g. Connine, Blasko, & Titone, 1993). The Neighbourhood Activation Model (NAM: Luce & Pisoni, 1998) provides a relatively simple and effective mathematical account of PND effects, although it does not incorporate the dynamic nature of speech (input that unfolds over time), and thus has difficulty explaining why cohorts produce more competition than rhymes (Allopenna, Magnuson, & Tanenhaus, 1998). Other models such as TRACE (McClelland & Elman, 1986) and Shortlist (Norris, 1994) better account for these dynamic neighbourhood effects, though TRACE includes feedback connections while Shortlist is purely feedforward. It should also be noted that facilitatory effects of PND have been found with the auditory lexical decision task (Ernestus & Cutler, 2015; Ferrand et al., 2017; Goh, Yap, Lau, Ng, & Tan, 2016), again suggesting that task demands might differentially alter the influence PND has on underlying mechanisms.

Effects of neighbourhood density have also been reported in ERP research. Most of this work has involved visual word recognition, where words from large orthographic neighbourhoods (the visual equivalent of phonological neighbourhoods) have been shown to generate larger N400s than words from small orthographic neighbourhoods (e.g. Holcomb, Grainger, & O’Rourke, 2002; Laszlo & Federmeier, 2011). This greater N400 to high density words is thought to reflect the additional activation of a target word’s neighbours (Holcomb et al., 2002). In their megastudy, Dufau et al. (2015) found that orthographic neighbourhood effects were largely restricted to the early phase of the N400 window (300 to 400 ms). Two studies have used ERPs to examine PND effects in spoken word recognition. Dufour et al. (2013) found smaller early positivities (250–330 ms) and larger N400s to French words with more phonological neighbours. A second study in English (Hunter, 2013) found larger P2 amplitudes to words with more neighbours but did not report any effects on the N400.

Relative concreteness is another variable that has been shown to affect word recognition. Concrete words are responded to faster than abstract words in a variety of tasks (e.g. lexical decision; Goh et al., 2016; Whaley, 1978). This effect is usually explained by concrete words having greater semantic richness (Kieras, 1978), a greater tendency to induce the use of mental imagery (Paivio, 1986), or some combination of both (Holcomb et al., 1999). In the visual domain, research with ERPs has identified two components that are both more negative to concrete than to abstract words: the N400 (Kounios & Holcomb, 1994) and a later component around 700 ms (West & Holcomb, 2000). While the effect of concreteness on the N400 is thought to reflect greater activation of lexical-semantic networks as mentioned above, the later effect at 700 ms is thought to represent a process related to mental imagery (West & Holcomb, 2000). In the visual megastudy by Dufau et al. (2015), concreteness effects paralleled those from previous ERP studies, with larger late negativities for more concrete words starting around 300 ms and continuing on through the N400 epoch (Dufau et al. did not report effects beyond 500 ms). It is worth noting that in the case of concreteness, N400 amplitude seems to be negatively correlated with reaction time, thus indicating a facilitative role, yet in the case of word frequency (and some, but not all, neighbourhood effects), larger N400s are usually associated with longer reaction times, consistent with competition or more effortful processing. To date, we are unaware of any ERP studies that have manipulated concreteness with auditory words.

Another variable relevant to lexical processing is the length of words being comprehended. In the case of visual words, the number of letters determines length. In the case of spoken words, it is the temporal duration that is associated with length. And while these two indices are correlated (e.g. number of letters and number of phonemes), there is reason to predict that the influence of these two indices of length might operate differently during word processing. In visual word recognition there is strong evidence of parallel letter processing within a single fixation (e.g. Grainger, 2008). However, because length for spoken words translates to the temporal domain and thus necessitates some degree of serial processing, the duration of a spoken word is likely to play a more important role during spoken word recognition than number of letters does in visual word recognition. In the case of spoken words, measures such as word duration, number of phonemes and uniqueness point are temporal variables that have been shown to influence word recognition. Although not as frequently examined, a few behavioural studies have looked for effects of spoken word duration (e.g. Pitt & Samuel, 2006; Strauss & Magnuson, 2008). In these studies, duration has been suggested to have a somewhat counter-intuitive effect on the dynamics of word processing. Although longer spoken words take longer to recognise, they also result in greater lexical activation, presumably because they carry additional acoustic information to influence processing. In their auditory lexical decision megastudy, Ferrand et al. (2017) reported that stimulus duration was the variable that accounted for the most variance (46%, followed by word frequency at 4% of additional variance), with a strong positive correlation between stimulus duration and RT (see also Ernestus & Cutler, 2015; Goh et al., 2016).

In the ERP literature on word length, the number of letters in a visually presented word has been shown to influence ERPs both quite early as well as later during word processing. For example, Hauk and Pulvermüller (2004) reported that longer visual words produced increased activity as early as 80 ms, while shorter words elicited greater negativity in epochs up to 400 ms. The Dufau et al. (2015) megastudy found effects of word length emerging at around 200 ms and continuing into later epochs. Longer words tended to produce more negative-going waves than shorter words at 200 ms, and during the N400 epoch shorter words produced greater negativities. Of course, one confound for such visual effects is that longer words also tend to be larger stimuli, and increases in the size of any stimulus tend to produce larger early ERP effects (e.g. Luck, 2005). To our knowledge no study has looked at ERPs to spoken words as a function of word duration. One prediction based on the results of Pitt and Samuel (2006) is that while the time-course of ERP effects might be delayed for longer words, it might also be the case that longer words generate larger N400s than shorter words due to their activation of additional phonemic information. Note this prediction is the opposite of what Dufau et al. reported for visual word length effects.

The current study

The purpose of the current study was to use ERPs to provide a better understanding of how the above variables (word frequency, phonological neighbourhood density, concreteness and duration) affect the temporal dynamics of spoken word recognition. In all previous auditory ERP studies, variables such as these have been factorially manipulated and measures of processing have been obtained. However, this approach, which is arguably arbitrary in terms of where boundaries are placed for categorising what is a continuous variable, may oversimplify, or take away from, the complexity and variability that is inherent in language. Recently, researchers have begun conducting “megastudies” which seek to better understand these complexities by gathering data with large samples of participants and items. This method has a number of advantages including reduced experimenter bias towards item selection and the ability to run more advanced types of analyses (see Balota, Yap, Hutchison, & Cortese, 2012 for a review of advantages). This has been fruitfully applied to study visual word recognition with behavioural data (e.g. Balota et al., 2007; Ferrand et al., 2017) and ERP data (e.g. Hauk et al., 2006; Laszlo & Federmeier, 2014).

One such ERP study conducted by Dufau et al. (2015) presented over 1000 written words to a large sample of participants (n = 75). Their study allowed for precise item-level partial regression analyses of the contributions of a number of orthographic, lexical, and semantic variables to the ERPs of written words. Importantly, this method controlled for the effects of other variables so that results could be more clearly attributed to each variable of interest. The current study used the same stimuli and general statistical approach as Dufau et al. However, rather than using visually presented stimuli we instead used the equivalent spoken word stimuli, and we did so in two separate experiments with 50 participants. Also, instead of using traditional regression techniques we used a comparatively new approach to analyzing ERP data: linear mixed effects regression (LMER).

Experiment 1 (lexical decision)

In Experiment 1 we used the same approach as Dufau et al. (2015), employing a go/no-go Lexical Decision (LD) task, but using spoken versions of the same stimulus set. Making word/non-word decisions about each item should arguably focus participants on the lower-level lexical properties of the stimuli and, we predict, should have a comparatively larger impact on ERP components that are sensitive to earlier, pre-lexical features of the stimuli. As mentioned earlier, we also used the LMER approach rather than partial correlations to analyze the data. When applying LMER to EEG data, rather than averaging across items or participants, the raw single-trial EEG for each stimulus is used as input to the statistical algorithm. While such ERP data sets are substantially larger than those used in typical LME behavioural studies, several recent reports have demonstrated that the technique can be successfully applied to ERP data sets (e.g. Emmorey, Midgley, Kohen, Sehyr & Holcomb, 2017; Laszlo & Sacchi, 2015; Payne, Lee, & Federmeier, 2015). One advantage of LME models is that they allow both subject and item variance to be taken into account in the same analysis, thus providing a solution to the problems inherent in approaches using separate analyses (e.g. F1 and F2; Baayen, Davidson, & Bates, 2008; Clark, 1973). An additional advantage for studies such as the current one, where the influence of multiple variables is being explored but factorial manipulation is difficult, is the possibility of including all of the variables in the model, thus controlling for potential collinearity between variables (see Payne et al., 2015). And finally, as mentioned above, LME modelling can be readily applied to continuous independent variables, eliminating the need for forming arbitrary boundaries with such variables.

Method

Participants

A total of 61 participants were run in this study. However, 11 were eliminated from the final analysis because too many of their trials exceeded muscular or ocular artifact rejection criteria (>20% of critical trials). The 50 remaining participants ranged in age from 18 to 29 years (mean age = 22.54 years [SD = 2.79]) and 50% were female. Most were students at San Diego State University, compensated with $15 per hour of participation. All participants reported being right-handed, native English speakers with normal hearing, normal or corrected-to-normal vision, and no neurological impairment.

Materials

The critical stimuli consisted of the same 960 words used in the parallel visual word study and were originally selected to represent an assortment of word frequencies (1 to 1094 per million) and word lengths (4 to 8 letters; Dufau et al., 2015). An additional 140 probe stimuli were also used. In Experiment 1, probe items were pseudowords formed by changing one or two phonemes of real words (none of which were critical items used in the analyses presented below). All 1100 stimuli were digitally recorded at a sampling rate of 44 kHz by a female speaker with a standard American accent in a sound-proofed room using an SM57 microphone (Shure). Audio files were processed using Cool Edit 2000 software and were trimmed so that the onset of each word’s initial phoneme was at the beginning of the digital file for that word. This allowed for precise alignment of word onset and the time-locking of ERP recording. The end of each file was trimmed to a point approximately eight ms into the silence after the offset of the word to ensure that no critical acoustic information in the words was clipped. Prior to analysis, four critical items were eliminated because of perceptual ambiguities reported by several participants, which left 956 critical items for analysis.

The current study focused on four word-based variables: Word Frequency, Phonological Neighbourhood Density, Concreteness, and Duration. For Frequency we used “Zipf” frequency (see Van Heuven, Mandera, Keuleers, & Brysbaert, 2014), which is a logarithmically normed frequency measure ranging between 1 and 7 based on American English subtitle frequencies (i.e. SUBTLEX-US frequency; Brysbaert & New, 2009). In our sample of words, this measure ranged between 1.59 and 6.09 with a mean of 4.03 (SD = 0.83). Phonological neighbourhood density (PND) was quantified using phonological Levenshtein distance (PLD) obtained from the English Lexicon Project (Balota et al., 2007). Phonological Levenshtein distance is a measure of how many phoneme changes are required to change one word into another (see Yarkoni, Balota, & Yap, 2008, for a discussion of the measure). PLD represents the phonological distance between a word and every other word, so a high PLD means that the word does not have many neighbours, while a low PLD indicates that it has many neighbours. The particular measure we used from the English Lexicon Project represents phonological neighbourhood density by taking the mean PLD between a word and 20 of its closest neighbours. Here, PLD ranged from 1 to 4.5 with a mean of 2.02 (SD = 0.71). Concreteness ratings were obtained from a separate group of 24 undergraduate students who rated all 960 items on a seven-point scale from very abstract (one) to very concrete (seven). This was the same measure used by Dufau et al. (2015) and was shown to correlate highly with other samples of concreteness ratings. Concreteness ratings ranged from 1.7 to 6.9 with a mean of 4.37 (SD = 1.14). Word length was quantified by the duration of the audio files, which ranged from 280 to 892 ms with an average duration of 611 ms (SD = 94 ms).
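To illustrate how a PLD-style neighbourhood measure can be computed, the short R sketch below takes the mean Levenshtein distance from a word to its k closest neighbours using base R's adist() function. The toy phonemic transcriptions and lexicon are hypothetical, and the values analysed in this study were taken directly from the English Lexicon Project rather than computed this way.

```r
# Minimal sketch (not the English Lexicon Project pipeline): mean phonological
# Levenshtein distance from a word to its k closest neighbours, computed over
# hypothetical one-symbol-per-phoneme transcriptions.
pld_k <- function(target, lexicon, k = 20) {
  others <- setdiff(lexicon, target)
  d <- drop(adist(target, others))            # Levenshtein edit distances (utils::adist)
  mean(sort(d)[seq_len(min(k, length(d)))])   # average over the k nearest words
}

# Toy transcriptions: "kat" has many one-edit neighbours, so its mean PLD is low
toy_lexicon <- c("kat", "bat", "hat", "rat", "kap", "kot", "dag", "fish")
pld_k("kat", toy_lexicon, k = 5)
```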

Procedure

Participants were seated in a comfortable chair, 150 cm from a stimulus monitor in a soundproofed, darkened room. The testing session began with a short practice block, followed by four experimental blocks. Auditory stimuli were presented via stereo headphones (Sennheiser model PC 151) placed around the EEG cap and set to the same normal listening level (~65 dB) for each participant. Each experimental block contained 240 critical target words and 35 randomly intermixed probe items presented one at a time with an SOA of 1100 ms between word onsets (see Figure 1). Concurrent with the onset of each word, a visual fixation stimulus was presented in order to keep the participant’s eyes fixed in one location. On average, every 10 trials a visual “blink” stimulus replaced the fixation stimulus for four seconds, indicating that the participant could blink or rest their eyes, thus reducing the tendency for participants to blink during the critical word ERP epochs.

Figure 1.

Procedure for both of the experimental tasks with an example of a probe item for each. Items were identical between the two tasks except for the probe items, which were either animal names for semantic categorisation or non-words for lexical decision made from transposed versions of animal names.

For this experiment, each participant completed two blocks of a go/no-go lexical decision task that alternated with two blocks of a go/no-go semantic categorisation task (see Experiment 2; note the probe items were changed for Experiment 2). The order of blocks was counterbalanced across participants and every critical word was presented in each task across participants. In the current experiment, participants were instructed to press a button on a game controller as soon as they heard a stimulus that was not a legal English word (pseudoword probes). The non-word probes made up approximately 13% of trials. The critical words made up the other 87% of trials and did not require a behavioural response.

EEG recording

The electroencephalogram (EEG) was collected using a 29-channel electrode cap containing tin electrodes (Electro-Cap International, Inc., Eaton, OH), arranged according to the International 10–20 system (see Figure 2). Electrodes were also placed next to the right eye to monitor horizontal eye movements and below the left eye to monitor vertical eye movements and blinks. Finally, two electrodes were placed behind the ears over the mastoid bones. The left mastoid site was used as an online reference for the other electrodes and the right mastoid site was used to evaluate differential mastoid activity. Impedance was kept below 2.5 kΩ for all scalp and mastoid electrode sites and below 5 kΩ for the two eye channels. The EEG signal was amplified by a SynAmpsRT amplifier (Neuroscan-Compumedics, Charlotte, NC) with a bandpass of DC to 200 Hz and was continuously sampled at 500 Hz.

Figure 2.

Electrode montage used for EEG recordings.

Data analysis

While a traditional factorial approach to analyzing these data would have substantial power due to the high number of subjects and items, as mentioned previously this approach is highly susceptible to confounds due to the uneven distribution of values across variables. The typical approach to dealing with such confounds is to arrange the stimuli in a factorial design such that the effects of each variable are controlled across the levels of the other variables. The problem here is that with four factors, each with several levels, even with almost a thousand items there would be comparatively few items per cell in the design, and this still assumes that enough items can be found to meet the rigid criteria of each such cell. To help overcome this problem, the data were analyzed by constructing linear mixed effects regression models using the lme4 package (Bates, Maechler, Bolker, & Walker, 2015) written in the statistical computing language R (R Core Team, 2014). Rather than averaged ERP data, for these analyses we used the single-trial EEG data (after artifact rejection) from 50 participants, 956 items, and 29 electrode sites as input to the analyses. The structure of the models used was based on the approach recommended by Payne et al. (2015).

A set of eight identical LME models was fit for eight consecutive 100 ms time windows spanning 100 through 900 ms post word onset. The main effects included in the models were the word variables: Lexical Frequency, PLD, Concreteness and Duration. The word variable measures were normalised (z-scores) prior to fitting the LME models (Payne et al., 2015). Distributional effects were modelled using the relative position of each electrode in three-dimensional space, with three continuous variables corresponding to the X, Y, and Z coordinate positions of each of the 29 scalp sites. For the X-position variable, the left- and rightmost electrode sites (T3 and T4) had the most extreme values, and interactions with this variable would indicate differences in the laterality of an effect. Conversely, for the Y-position variable, the most anterior and posterior electrodes (FPz and Oz) had the most extreme values, and interactions here indicate a difference in how anterior/posterior an effect is distributed. The Z-position variable was maximal at the central electrode Cz at the top of the head and decreased toward the outer ring of peripheral sites (from FPz to T3 to Oz to T4 and back to FPz), which marked the other extreme. Interactions involving the Z-position variable indicate differences in the elevation of an effect. The two-way interactions were structured so that each word variable had three possible two-way interactions, one with each of the three distributional variables (X, Y, and Z position). The overall effects of these distributional variables were also added into the models as covariates.
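To make the shape of the input to these models concrete, the R sketch below builds a toy version of the single-trial data set described above: one row per participant, item, and electrode, with simulated amplitudes, illustrative word variables, and illustrative X, Y, and Z electrode coordinates. All names and values here are ours for illustration only; the actual data comprised 50 participants, 956 items, and 29 sites.

```r
set.seed(1)

# Toy stand-in for the single-trial input: one row per participant x item x
# electrode, with the mean amplitude ("amp") for one 100 ms window.
eeg <- expand.grid(subject = paste0("s", 1:8),
                   item    = paste0("w", 1:40),
                   channel = c("Cz", "T3", "Oz"))
eeg$amp <- rnorm(nrow(eeg))

# Illustrative word variables (in the study these came from the stimulus norms)
words <- data.frame(item         = paste0("w", 1:40),
                    frequency    = runif(40, 1.59, 6.09),
                    pld          = runif(40, 1.0, 4.5),
                    concreteness = runif(40, 1.7, 6.9),
                    duration     = runif(40, 280, 892))

# Illustrative electrode coordinates: x = left/right, y = posterior/anterior,
# z = elevation (the real montage supplies coordinates for all 29 sites)
coords <- data.frame(channel = c("Cz", "T3", "Oz"),
                     x = c(0, -1, 0),
                     y = c(0,  0, -1),
                     z = c(1,  0,  0))

eeg <- merge(merge(eeg, words, by = "item"), coords, by = "channel")

# z-score the four word variables prior to model fitting, as described above
for (v in c("frequency", "pld", "concreteness", "duration"))
  eeg[[v]] <- as.numeric(scale(eeg[[v]]))
```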

It should be noted that the distributions of actual ERP effects are non-linear, and this limits the ability of linear models to appropriately analyze scalp distribution. ANOVA approaches generally model distribution by assigning electrode sites to separate levels of discrete distributional variables (e.g. “laterality”). This allows for non-linearity, but introduces a number of issues that come with discretizing a continuous variable and using ANOVAs to analyze effect distributions (e.g. MacCallum, Zhang, Preacher, & Rucker, 2002; McCarthy & Wood, 1985). The current approach allows us to approximate the distribution of an effect as the extent to which it fits one of the three spatial dimensions. This encompasses some general ERP distributions (e.g. the typically centralised N400 distribution) but results in a greater degree of model misfit (and thus an inflated Type II error rate) for effects which have smaller or more complex distributions (see Tremblay & Newman, 2015). Nonetheless, any further specification of distributional factors would not be justifiable without stronger predictions.

Because of the complexity of the design, a maximal random effect structure was not possible due to convergence failures (Barr, Levy, Scheepers, & Tily, 2013). Instead, based on a model selection approach recommended by Matuschek, Kliegl, Vasishth, Baayen, and Bates (2017), random effects were structured to be conservative, yet still allow every model to converge. The resulting random effect structure for each model included random intercepts for participants, items, and electrode channel as well as by-subject random slopes for the effect of each of the four experimental variables (see appendix for model code).
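The authors' exact model code appears in their appendix (not reproduced here); the lme4 formula below is a sketch consistent with the description above, with fixed main effects of the four word variables and the three coordinate covariates, two-way interactions between each word variable and each coordinate, random intercepts for participants, items, and channels, and by-subject random slopes for the word variables. It is fit here to the toy data frame constructed above.

```r
library(lme4)

# Sketch of one window-wise model (fit separately for each 100 ms epoch).
# In R, (a + b) * (x + y) expands to the main effects plus only the
# cross-group interactions (a:x, a:y, b:x, b:y), matching the design above.
m <- lmer(amp ~ (frequency + pld + concreteness + duration) * (x + y + z) +
            (1 + frequency + pld + concreteness + duration | subject) +
            (1 | item) + (1 | channel),
          data = eeg)
summary(m)   # t values for each fixed effect in this time window
```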

The 95% profile likelihood confidence intervals and t-values were calculated for each comparison (Cumming, 2014). Because of the large number of comparisons, p-values from each model were also obtained using the “Anova” function in the car package (CRAN), which were then false discovery rate (FDR) corrected using the MATLAB “Mass Univariate ERP Toolbox” (Groppe, Urbach, & Kutas, 2011). To add a further level of conservatism, effects were only interpreted as significant if the comparison was significant for both the confidence intervals and the FDR-corrected ANOVA p-values.
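These inferential steps can be sketched in R as follows. Note that the paper used profile likelihood intervals and the MATLAB Mass Univariate ERP Toolbox for the FDR correction; the sketch substitutes Wald intervals (faster on the toy model above) and base R's Benjamini-Hochberg adjustment.

```r
library(car)

# Fixed-effect 95% confidence intervals (the paper used method = "profile";
# "Wald" is a quicker approximation for this sketch)
ci <- confint(m, parm = "beta_", method = "Wald")

# Chi-square p-values from car::Anova, then FDR correction; p.adjust() with
# method = "fdr" (Benjamini-Hochberg) stands in for the MATLAB toolbox
p_raw <- Anova(m)[, "Pr(>Chisq)"]
p_fdr <- p.adjust(p_raw, method = "fdr")

# An effect is treated as reliable only if its CI excludes zero AND its
# FDR-corrected p-value is below .05
```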

Data visualization

We also used LME models to compute the equivalent of scalp voltage maps to help in visualising the various effects in each model. For these analyses we used the same approach as above but instead of including distributional variables in each model, we computed separate LME solutions for each of the 29 scalp sites in each 100 ms time epoch and plotted the resulting LME t-statistics across the scalp using interpolated topographic maps (see appendix for individual site model code).
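Under the same illustrative assumptions as the earlier sketches, these per-site models can be approximated as below: the coordinate terms are dropped, one model is fit per electrode (and per window), and the t statistic for each word variable is kept for interpolation onto the scalp.

```r
# One LME model per electrode (per 100 ms window), without coordinate terms;
# the resulting t statistics are what the topographic maps display
site_t <- lapply(split(eeg, eeg$channel), function(d) {
  m_site <- lmer(amp ~ frequency + pld + concreteness + duration +
                   (1 + frequency + pld + concreteness + duration | subject) +
                   (1 | item),
                 data = d)
  coef(summary(m_site))[c("frequency", "pld", "concreteness", "duration"), "t value"]
})
t_map <- do.call(rbind, site_t)   # rows = electrodes, columns = word variables
```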

Additionally, ERPs were used to aid in the interpretation of results. Similar to a traditional factorial approach, averaged ERPs time-locked to stimulus onset were created off-line as a function of each of the variables of interest (Frequency, PND, Concreteness and Duration). For each of these variables the data were sorted into four equally spaced levels which resulted in 239 trials per level. Trials with muscular or ocular artifact were rejected prior to averaging. The left mastoid was used as the reference electrode and averaged data were baselined using the mean voltage between −100 and 0 ms at each site. The averaged ERPs plotted in Figures 3–6 show the highest and lowest quartiles for each variable of interest per experiment. Note that these comparisons are only for visual reference and do not control for the influence of the other variables or random effects.
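For reference, the visualisation-only quartile split can be sketched as follows, continuing with the toy objects defined earlier and binning by duration at a single electrode; the real plots were built from artifact-rejected, baselined averages.

```r
# Visualisation only: bin items into four equal-sized groups by duration and
# compare mean single-trial amplitude for the lowest vs. highest quartile
words$dur_bin <- cut(rank(words$duration, ties.method = "first"),
                     breaks = 4, labels = c("Q1", "Q2", "Q3", "Q4"))
cz <- merge(eeg[eeg$channel == "Cz", ],
            words[, c("item", "dur_bin")], by = "item")
tapply(cz$amp, cz$dur_bin, mean)[c("Q1", "Q4")]   # extreme quartiles at Cz
```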

Figure 3.

LME t statistics, confidence intervals, topographical LME t-statistic maps, and ERPs representing the Frequency effects for Experiment 1 (a) and Experiment 2 (b). Effects are only highlighted if significant with both confidence intervals and FDR-corrected p-values. ERP plots were made using the top and bottom quartiles of items sorted by frequency. (c) Statistics from task comparisons using separate LME models including task.

Figure 6.

LME t statistics, confidence intervals, topographical LME t-statistic maps, and ERPs representing the Duration effects for Experiment 1 (a) and Experiment 2 (b). Effects are only highlighted if significant with both confidence intervals and FDR-corrected p-values. ERP plots were made using the top and bottom quartiles of items sorted by Duration. (c) Statistics from task comparisons using separate LME models including task.

Results

Linear mixed effect model results

Due to the number of results, the confidence intervals and t statistics for each comparison are presented in a series of tables for each variable (Frequency, PND, Concreteness, and Duration). Effects are highlighted only if the comparison was significant for both the confidence intervals and the FDR-corrected ANOVA p-values. Included in each table are statistical topographic maps created by single-site LME effect estimates calculated per electrode and task, using the same models as described above with the removal of the distributional variables. Also included for visual comparison are averaged ERPs comparing the highest and lowest quartile of each variable for a central (Cz) and a lateral (T3) electrode site.

Frequency effects

Starting in the first epoch, from 100 to 200 ms, there was a Frequency by Z-position interaction. This pattern indicates that along the continuum of word frequencies, items towards the lower end of the scale tended to produce greater ERP negativity than items towards the higher end, and that this effect was larger towards the top of the head (see Figure 3a). In the following 200–300 ms epoch, the previous interaction remained and there was also a Frequency by Y-position interaction, suggesting the frequency effect was now more concentrated over posterior electrode sites. In the 300–400 ms epoch these two two-way interactions remained; however, in the following 400–500 ms epoch there were no effects of Frequency. In the 500–600 ms epoch the frequency effect re-emerged and was significant as both a main effect and a Frequency by Z-position interaction. This indicated a strong, widespread frequency effect that was largest at central electrode sites. In the last three epochs (600–900 ms), these two effects remained significant, but in the 700–800 ms and 800–900 ms epochs there were also Frequency by Y-position interactions indicating that the distribution shifted towards the front of the head in later epochs.

Phonological neighbourhood density effects

In the initial 100–200 ms epoch there were no effects of PND. Starting in the 200–300 ms epoch there was a main effect of PND such that words with larger phonological neighbourhoods tended to produce greater negativity than words with smaller phonological neighbourhoods. This neighbourhood effect interacted with all three distributional variables and appeared to reflect a wide distribution across the central line of electrodes which was larger on the right of the montage (see Figure 4a). In the following 300–400 ms epoch there was a PND by Z-position interaction, reflecting a small reversal of the effect in central sites, with more negativity now for words from low density neighbourhoods (see map in Figure 4a). In the 400–500 ms epoch, there were no effects of PND or any interactions. Then, beginning in the 500–600 ms epoch and continuing through the rest of the epochs, there were PND by Z-position interactions showing greater negativity to words with dense phonological neighbourhoods, especially in central sites. Additionally, in only the 500–600 ms time window there was a PND by Y-position interaction, reflecting a more anterior distribution of the effect and perhaps indicating that this epoch is where the later PND effect was strongest.

Figure 4.

LME t statistics, confidence intervals, topographical LME t-statistic maps, and ERPs representing the PLD (PND) effects for Experiment 1 (a) and Experiment 2 (b). Effects are only highlighted if significant with both confidence intervals and FDR-corrected p-values. ERP plots were made using the top and bottom quartiles of items sorted by PND. (c) Statistics from task comparisons using separate LME models including task.

Concreteness effects

In the initial three epochs there were no effects of Concreteness. In the 400–500 ms epoch there was a Concreteness by X-position and a Concreteness by Z-position interaction, demonstrating greater negativities to higher concreteness words, with the effect more concentrated on the central-left side of the montage (see Figure 5a). These effects continued into the 500–600 ms epoch, with the addition of a main effect of Concreteness. In the following 600–700 ms epoch, the interaction between Concreteness and X-position switched to a Concreteness by Y-position interaction, indicating that the effect now had a more posterior distribution. In the last two epochs, from 700 to 900 ms, the effect remained in the form of Concreteness by Z-position interactions and was still centrally distributed around the top of the head.

Figure 5.

LME t statistics, confidence intervals, topographical LME t-statistic maps, and ERPs representing the Concreteness effects for Experiment 1 (a) and Experiment 2 (b). Effects are only highlighted if significant with both confidence intervals and FDR-corrected p-values. ERP plots were made using the top and bottom quartiles of items sorted by Concreteness. (c) Statistics from task comparisons using separate LME models including task.

Duration effects

The effects of word duration started in the 200–300 ms epoch, where Duration interacted with Z-position. Here the effect showed greater negativity for longer than shorter duration words (see Figure 6a). During the next two epochs, 300–400 ms and 400–500 ms, there were main effects of Duration as well as distributional interactions showing that the direction of the effect had reversed, with shorter words now producing more negativity. This effect was distributed perpendicular to the midline, especially in the rightmost sites. In the following 500–600 ms epoch, there was no main effect of Duration; however, distributional interactions suggested that lateral right sites still showed remnants of the previous effect, while in posterior sites the effect reversed once more, such that longer words elicited more negativity. In the remaining three epochs, this posterior effect grew in extent and magnitude and became significant as a main effect in the 700–800 ms epoch, with longer words producing greater negativities than shorter words. This pattern is especially apparent in the ERP plots in Figure 6a, and appears to be due to a shift in the latency of the N400, with shorter duration words producing an earlier N400 time-course.

Behavioural results

During Experiment 1, participants correctly detected, on average, 77% of non-word probes with false alarms on 3% of critical trials. Reaction times for correct lexical decision judgments averaged 967 ms (SD = 188 ms).

Discussion

In Experiment 1, we found independent effects of four different word-based variables on the continuous processing of spoken words during a go/no-go lexical decision task. This included temporally and spatially widespread effects of word frequency, phonological neighbourhood density, concreteness and word duration. For word frequency, there was an increase in ERP negativity associated with decreases in word frequency. These effects began in the 100–200 ms epoch and became more exaggerated after 500 ms, near the peak of the auditory N400 (see Figure 3a). A somewhat different picture emerged for the phonological neighbourhood density (PND) variable. Here we found an early effect between 200 and 300 ms with greater negativity associated with increases in phonological neighbourhood density, and then a small reversal (dense neighbourhoods eliciting more positivity) in the 300–400 ms epoch. However, in the timeframe of the N400 (500–900 ms), the pattern reverted to greater negativity for denser phonological neighbourhood words (see Figure 4a). For concreteness, we found a widely distributed pattern of larger negativities associated with words rated as more concrete, and this pattern started in the 400–500 ms epoch and persisted through 900 ms (see Figure 5a). Finally, there were also widespread effects of the duration of the spoken words. As can be seen in Figure 6a, this pattern appears to result mostly from a shift in the temporal distribution of the N400, with words of shorter duration resulting in an N400 that starts and ends earlier than the comparable effect for longer words. The one departure from this pattern is the centrally distributed larger negativity for long words in the 200 to 300 ms epoch. This effect could be due to a larger P2 component for the shorter words.

Experiment 2 (semantic categorization)

Experiment 2 contains data from the same words and participants as Experiment 1, but instead of the lexical decision task, here we used a go/no-go semantic categorisation task (SC) which required subjects to determine whether words were members of a specific semantic category (animals). Prior research has shown that experimental task can impact word processing in a variety of contexts. For instance, semantic priming has more of an effect on the auditory N400 during a memorisation task compared to a counting task, indicating that the N400 is not impervious to top-down influences (Bentin, Kutas, & Hillyard, 1993). Relevant to the current variables of interest, recent studies with written words have shown that the ERP effects of word frequency (Strijkers, Bertrand, & Grainger, 2015) and concreteness (Chen et al., 2015) are modulated by experimental task. Compared to lexical decision, a task like semantic categorisation that focuses participants’ attention on the semantic attributes of each word may have a larger impact on later meaning-sensitive ERP components such as the N400.

Method

The methods for the second experiment were identical to those of Experiment 1. The data were collected from the same subjects, in the same recording session as Experiment 1. The same set of 960 critical words and data collection procedures were also used. The only procedural difference was that the task during the two blocks of trials in this experiment was changed from lexical decision to semantic categorisation. This necessitated changing the 140 pseudoword items from Experiment 1 to 140 animal names in Experiment 2. These items were digitally recorded and edited using the same parameters as the critical word stimuli. Participants were told to press a designated button whenever they heard an animal name and to withhold responding to all other (critical) words (go/no-go semantic categorisation).

The resulting data were analyzed from the same eight time windows as in Experiment 1 and the structures of the LME models were identical to those constructed for the lexical decision task, with the only difference being that they were fit using the semantic categorisation data.

Results

As with Experiment 1, the confidence intervals and t statistics for each comparison are presented in a series of tables for each variable (Frequency, PLD, Concreteness, and Duration) below the results for Experiment 1. Effects are highlighted only if the comparison is significant for both the confidence intervals and the FDR-corrected ANOVA p-values. Included in each table are topographic statistical maps created by single-site LME model t statistics and averaged ERPs comparing the highest and lowest quartile of each variable for a central (Cz) and a lateral (T3) electrode site.

Frequency effects

In Experiment 2, there were no effects of Frequency in the first four epochs. Beginning in the 500–600 ms epoch, there was a main effect of Frequency as well as a Frequency by Z-position and a Frequency by Y-position interaction which showed that greater ERP negativities were associated with lower frequency words primarily in central and frontal electrode sites (see Figure 3b). These effects persisted through the final epoch, with the addition of a Frequency by X-position interaction in the 800–900 ms epoch, indicating an increasingly strong and widespread frequency effect.

Phonological neighborhood density effects

In the first epoch there was an interaction between PND and Y-position, probably due to a small negativity to low density words at frontal sites (see Figure 4b). In the following 200–300 ms epoch there was a PND by X-position interaction and a PND by Z-position interaction, resulting from greater negativities associated with increases in neighbourhood density, especially over the central line and right hemisphere electrodes. In the next epoch (300–400 ms) there were no significant effects of PND. In the 400–500 ms epoch there was again a PND by Z-position interaction in the same direction as the previous effect, though now distributed more centrally. This effect remained significant through the rest of the epochs, with the addition of a PND by Y-position interaction in the 700–800 ms and 800–900 ms epochs as the effect became more focused in posterior sites.

Concreteness effects

There were small Concreteness by Y-position interactions through the first three epochs (100–400 ms), with more concrete words producing greater ERP negativities in posterior sites. Starting in the 400–500 ms epoch, larger and more wide-spread concreteness effects emerged as Concreteness by Z-position interactions. This pattern lasted for the rest of the measured epochs and reflected a distribution focused on the top of the head. From 500 to 800 ms there were also main effects of Concreteness indicating these were the epochs with the strongest and most widespread concreteness effects (see Figure 5b). Additionally, in the 500–600 ms and 600–700 ms epochs there was a Concreteness by Y-position interaction indicating the effect was stronger in posterior sites.

Duration effects

The effects of Duration started in the 200–300 ms epoch, where there was a main effect and distributional interactions reflecting greater negativity to longer duration words in all but the most posterior sites (see Figure 6b). Between 300 and 500 ms there were Duration by Z-position interactions, although importantly in these epochs the direction of the effect switched polarity: shorter duration words produced more negativity. In the following 500–600 ms epoch there was a Duration by Z-position and a Duration by Y-position interaction showing that the pattern of Duration effects again flipped, such that longer words produced more negative ERPs, especially in posterior-central sites. The Duration effect remained significant through 900 ms in the form of distributional interactions and, in the final two epochs, main effects, indicating a widespread, central-posterior distribution of the later Duration effect.

Behavioural results

During Experiment 2, participants correctly detected, on average, 84% of animal probes with false alarms on approximately 1% of critical trials. Reaction times for correct semantic categorisation judgments averaged 847 ms (SD = 186 ms).

Discussion

Experiment 2 used the same critical items, procedure, and model structure as Experiment 1 except for the experimental task, which was semantic categorisation rather than lexical decision. Overall, the results were similar to Experiment 1, with frequency, PND, concreteness, and duration all producing effects, although there were a few notable differences. In Experiment 2, there were no early frequency effects. Frequency only became significant after 500 ms, where high frequency words elicited less negativity than low frequency words (see Figure 3b). For PND, effects were observed early, in the first two epochs, and again later, after 400 ms, with words with larger neighbourhoods generating greater negativities than words with smaller neighbourhoods (see Figure 4b). There were small early effects of concreteness, but larger and more robust effects after about 400 ms, where concrete words elicited larger negativities than abstract words (see Figure 5b). Effects of duration were found across all epochs after 200 ms, similar to Experiment 1. Shorter words produced more positivity in the 200–300 ms epoch, reflecting larger P2s. In the next two epochs shorter words produced more negativity than longer words, followed by the reversed pattern in the final four epochs, reflecting an earlier onset of the N400 for shorter words compared to longer words (see Figure 6b).

Task comparisons

To compare the results from each task, a set of simplified task models was constructed and fit using the data from both Experiments 1 and 2. These models were structured such that they contained the same random effects and main fixed effects as the previous individual task models of Experiments 1 and 2. However, we also added two-way interactions between task and each experimental variable. To keep the structure of the models manageable we did not include interactions with distributional variables (see appendix for task model code). Thus, these task models analyze differences in the experimental variable main effects between the two tasks across all electrode sites, but they do not analyze how the distributions of these effects might differ across tasks. The results of these models are included in the figure for each variable, below the two tasks (Figures 3–6).
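Continuing the illustrative R sketches from the Experiment 1 methods, a simplified task model of this kind might look as follows; the pooled data frame with a task column is a toy stand-in, not the authors' data, and their actual code appears in their appendix.

```r
# Toy pooled data set: duplicate the illustrative single-trial data with a
# hypothetical task code (LD vs. SC) and jittered amplitudes
eeg_both <- rbind(transform(eeg, task = "LD"),
                  transform(eeg, task = "SC",
                            amp = amp + rnorm(nrow(eeg), sd = 0.5)))

# Same main fixed effects and random structure as before, no coordinate
# terms, plus Task and Task-by-variable interactions
m_task <- lmer(amp ~ task * (frequency + pld + concreteness + duration) +
                 (1 + frequency + pld + concreteness + duration | subject) +
                 (1 | item) + (1 | channel),
               data = eeg_both)
summary(m_task)$coefficients   # task:variable rows carry the interaction t values
```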

Task interacted with Frequency in the first four epochs (100–500 ms), suggesting that there was a greater effect of Frequency during LD than SC. Indeed, inspection of the maps in Figure 3 shows a significant early effect of Frequency in Experiment 1, but no such effect in Experiment 2. There were no interactions between Task and Frequency in the following two epochs (500–700 ms), although in the final two epochs (700–900 ms), there was again an interaction between Task and Frequency. However, the interactions in the last two epochs suggest that the Frequency effect was stronger or more widespread during SC than LD (note the flipped t statistic).

In the 100–200 ms, 200–300 ms and 400–500 ms epochs there were interactions between Task and PND showing a larger overall effect of PND in these epochs during the LD task compared to SC task (see Figure 4). There was not a Task by PND interaction in the 300–400 or 500–600 ms epochs. In the final three epochs (600–900 ms) there were also Task by PND interactions, however now these interactions indicated it was the SC task which produced the larger PND effect.

Task and Concreteness interacted in the initial epoch (100–200 ms) due to a greater overall positive effect in SC combined with an overall negative effect in LD (see Figure 5; the concreteness main effect t-values). For the following two epochs (200–400 ms) there were no interactions between Task and Concreteness. Then, for the next four epochs (400–800 ms), there was a Task by Concreteness interaction indicating a larger effect of Concreteness during SC compared to LD.

For Duration, there were widespread interactions with Task starting in the 100–200 ms epoch and extending to the 700–800 ms epoch. These interactions followed a pattern in which there were greater effects of Duration during SC when the effect was in the negative direction (lower duration words producing less negativity). When the effect was in the positive direction (lower duration words producing more negativity), the effect was larger for LD than SC.

General discussion

In this study 50 participants were presented with approximately a thousand spoken words in two experiments that differed only in the task participants engaged in. In Experiment 1 participants made go/no-go lexical decisions to each item, pressing a button to occasional (13% of items) non-word probes. In Experiment 2 participants made go/no-go semantic categorizations to each item, pressing a button to occasional (13% of items) animal name probes. The remaining 960 trials in both experiments contained the same critical word items that did not require a behavioural response and therefore the ERPs to these reflect auditory word processing unbiased by overt motor responses. We measured ERPs in eight consecutive temporal epochs starting at 100 ms and ending 900 ms after word onset for each of the 956 critical words in the two experiments. LMER analyses were used to examine the independent effects of four variables (word frequency, phonological neighbourhood density, concreteness, and word duration) on the time-course of ERP measures of spoken word processing. Consistent with a variety of previous studies, including one that used the same basic approach and materials with written words (Dufau et al., 2015), we found that all four variables produced robust effects on mean EEG amplitude measures across the range of latencies examined. While the two experiments produced a similar overall pattern of effects there were also subtle differences in the precise time course of effects in the two experiments.

Word frequency

In both the lexical decision (Experiment 1) and semantic categorisation (Experiment 2) tasks there were robust effects of word frequency across a range of ERP latency windows. In both experiments the relationship between word frequency and ERP mean amplitude was the same, with less frequent words tending to produce greater ERP negativities than more frequent words (see Figure 3). Remarkably, Experiment 1 found effects of frequency as early as the first measured epoch, 100–200 ms, with lower frequency items generating larger negativity around the vertex of the scalp. To our knowledge this early effect has not been observed in other studies of auditory word recognition, although a similar pattern has been reported for visual word recognition (e.g. Chen et al., 2015; Hauk et al., 2006). Dufour et al.’s auditory study did find an effect in a 330–400 ms epoch, although of opposite polarity to typical ERP frequency effects. The current study used a much larger sample of words with greater variance than Dufour et al., perhaps explaining why effects were found much earlier.

Though perhaps useful for the interpretation of later frequency effects, the early onset of the frequency effect is clearly incompatible with certain early explanations of word frequency effects, such as the idea that it represents only a response bias (Balota & Chumbley, 1984) or a post-lexical selection bias (Luce & Pisoni, 1998). These earliest effects of frequency indicate that word frequency impacts initial phonological processing, even before the entire word has been heard. In connectionist models such as TRACE (McClelland & Elman, 1986), this effect could be explained by the greater activation level of higher frequency lexical representations sending more feedback to sub-lexical phonological representations, and thus reinforcing the activation of these units (see also Gaskell & Marslen-Wilson, 1997). In these models, as well as in Bayesian models (e.g. Norris & McQueen, 2008), the early effect of frequency could reflect the higher phonetic probability of the initial phonemes of high frequency words, affecting the amount of necessary activation at the level of sublexical nodes or connections. In any case, these frequency effects add to the body of research suggesting that word frequency effects can occur prior to N400 onset (e.g. Assadollahi & Pulvermüller, 2003; Hauk et al., 2006; Hauk & Pulvermüller, 2004; Sereno et al., 1998) and extend this research to the spoken word processing domain.

Interestingly, these early frequency effects were only present during the lexical decision task (Experiment 1) and not the semantic categorisation task (Experiment 2). A similar dissociation was recently obtained for written words by Strijkers et al. (2015), who reported effects of word frequency that emerged earlier during a semantic task (150 ms) than during a colour discrimination task. They attributed this discrepancy to differences in the depth of processing, but this explanation may not be sufficient for the current findings, since the words in both lexical decision (LD) and semantic categorisation (SC) still need to be fully recognised. Here, the interaction may instead reflect the influence of task demands on early sublexical or lexical processing. Lexical decision may focus participants’ attention on the sublexical/lexical properties of the items in order to quickly perform the task, while during semantic categorisation attention might be focused more on the conceptual properties of words, which presumably become available later in the time course of spoken word processing. Broadly speaking, this task difference in the word frequency effect provides strong evidence that even early stages of word recognition are subject to top-down influences.

The most prominent effect of word frequency was found on the auditory N400 component, where higher frequency words elicited less negativity than lower frequency words in both Experiments 1 and 2. This has been widely found in the visual domain (e.g. Dufau et al., 2015; Van Petten & Kutas, 1990), though in the auditory domain ERP studies of word frequency are limited. As mentioned above, one study (Dufour et al., 2013) has looked at word frequency effects on spoken word ERPs. While they did not find early effects of frequency like the current study, they did report an N400 effect in a 550–650 ms epoch for auditory words. This is about the time frame in which the word frequency effect in the current study dramatically increases in size (see Figure 3). One possible interpretation of the pattern seen here is that the N400 frequency effect reflects similar processes in both written and spoken word recognition, but tends to have a stronger, later impact on spoken word processing due to the temporal dynamics of spoken word recognition. For example, if the N400 reflects the mapping of lexical onto semantic representations, as proposed by Grainger and Holcomb (2009), then because this mapping likely extends over much of the temporal extent of a spoken word it has a longer timeframe in which to exert its influence. Another possibility is that the extended spoken word N400 effects reflect, in part, the greater temporal variability of the individual N400s to the different items used (see Holcomb & Neville, 1990, for a similar explanation).

Some accounts of N400 frequency effects suggest that they reflect changes in the activity of lexical representations as a function of word frequency, with greater activity for more frequent words (for instance in models like TRACE or Cohort and their descendants). Other explanations focus on the semantic nature of the N400, suggesting that the greater N400s to low frequency words represent the greater activation necessary to access their semantic networks (Kutas & Federmeier, 2011), potentially because these words have fewer or weaker connections within their semantic networks. In the current study, the latter half of the N400 frequency effect (after 700 ms) was larger, or at least more widespread, during semantic categorisation. Since semantic categorisation likely necessitates additional semantic processing, this pattern seems to favour a semantic explanation of the later frequency effect, or at least provides evidence that part of the N400 frequency effect is due to differences in processing within semantic systems.

Phonological neighborhood density

Across experiments, effects of PND were found in an early 200–300 ms epoch as well as in later epochs starting at 500–600 ms. In both cases, words with many lexical neighbours tended to produce more negative-going ERPs than words with fewer lexical neighbours. The earlier effect was largely isolated to one epoch, roughly corresponding to the auditory P2, and was focused on right lateral sites, extending across the central line of electrodes to left lateral sites (see Figure 4). Neighbourhood effects in similar time frames have been observed in several studies using visually presented words (e.g. Midgley, Holcomb, Walter, & Grainger, 2008; Vergara-Martínez & Swaab, 2012). However, the two ERP studies investigating PND effects in the auditory domain have not shown the effect seen in the current study. Both Dufour et al. (2013) and Hunter (2013) found early PND effects surrounding the P2 component; however, in both cases the patterns they reported were in the opposite direction (larger positivities to larger phonological neighbourhoods) to those found here. The discrepancy between these two studies and the present study could be accounted for by differing methods. Besides concerns inherent to factorial designs (e.g. smaller numbers of items per condition, covariance between variables) and methodological choices (e.g. the use of an average reference in Dufour et al., 2013), there were differences in how PND was measured. Both of these prior studies used short words with the traditional measure of neighbourhood density (Vitevitch & Luce, 1998), and the PND conditions in these studies co-varied with phonotactic probability. These variables are correlated because the phonemes and phoneme sequences of words with many neighbours tend to be more frequent in the language, but interestingly the effects these two variables have on word recognition and ERPs may be opposite. While higher PND is generally thought to interfere with word recognition due to increased competition between words, higher phonotactic probability may facilitate processing via more frequent sublexical phonology, in terms of phoneme frequencies or transitional probabilities between phonemes. Thus the early effect in the two prior studies may reflect an effect of phonotactic probability, and so share an explanation similar to that of the early frequency effects, such as increased connection strength between more frequent sublexical units. Phonotactic probability and PND are still correlated in the present study, but less so than in the other two, since we included a wide range of word lengths and used PLD20 as the measure of neighbourhood density, which encompasses larger-scale neighbourhoods than the traditional neighbourhood density measure (Yarkoni et al., 2008). Additionally, there was a much greater range of PND values in the current study, and effects of other variables like word frequency were controlled for. Hence the early effect in the current study could reflect co-activation of phonological neighbours’ sublexical or lexical networks, perhaps driven by words with many cohorts.
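As a rough illustration of how a PLD20-style measure is computed (the mean edit distance from a word to its 20 closest phonological neighbours), the sketch below uses base R’s adist() on a tiny, hypothetical phoneme-coded lexicon; it is not the lexicon or coding scheme used in the present study.

```r
# PLD20-style measure: mean Levenshtein (edit) distance from a target's phoneme
# string to the n closest entries in a phoneme-coded lexicon.
# The lexicon and one-character-per-phoneme coding are illustrative assumptions.
pld_n <- function(target, lexicon, n = 20) {
  others <- setdiff(lexicon, target)
  d <- as.numeric(adist(target, others))        # edit distances to all other entries
  mean(sort(d)[seq_len(min(n, length(others)))])
}

lexicon <- c("k@t", "k@p", "k@n", "b@t", "kot", "k@tl", "st0k")  # toy phoneme codes
pld_n("k@t", lexicon, n = 5)   # smaller values indicate a denser neighbourhood
```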

This early effect (200–300 ms) was larger in Experiment 1 (LD) than Experiment 2 (SC). Similar to the interactions between frequency and task in this timeframe, this could indicate the ability of task demands to emphasise certain levels of processing. If this early effect reflects sublexical or lexical co-activation, it could be amplified by attentional processes scrutinising every phoneme of the incoming input in order to better complete the more difficult lexical decision task, whereas participants doing semantic categorisation might withhold such scrutiny until later processing stages. Further, during lexical decision the input needs to be compared with many more potential targets (every real word) than during semantic categorisation, where it only needs to be compared with one semantic category. Thus a lexical-level effect might have a smaller early component during semantic categorisation because the task constrains the number of words against which the input needs to be compared. Regardless, this pattern again indicates that task demands affect relatively early phonological processing, perhaps through some pattern of constraining feedback activity during semantic categorisation, or through increased overall attention during lexical decision.

The later effect beginning at 500 ms likely reflects an influence of PND on the auditory N400. A comparable effect has been found in visual word recognition, where words from large orthographic neighbourhoods generate larger N400s than words from smaller neighbourhoods (e.g. Holcomb et al., 2002; Laszlo & Federmeier, 2011). In the auditory domain, one prior study found this effect of PND in a 550–650 ms epoch (Dufour et al., 2013), whereas another found no effect of PND on the N400 (Hunter, 2013). For the later N400 effect, it seems likely that neighbourhood effects for visual and auditory words share similar explanations. That is, high density words cause greater co-activation of phonological or orthographic neighbours than low density words. With more activation in lexical-semantic networks, the target word requires more activation to be recognised and has more neighbours to inhibit. This increases the time and effort required for the word recognition system to arrive at the correct word, leading to increased negativity on the N400 (Holcomb et al., 2002). Given the dynamic nature of spoken input, neighbours may exert a longer-lasting influence during spoken word recognition than during written word recognition. High density spoken words, especially those with many cohorts, likely partially activate their neighbours all the way to their semantic representations before the entire word has been heard.

In the current study, this later effect of PND lasts until 900 ms and was larger for SC (Experiment 2) than for LD (Experiment 1), providing further evidence that phonological neighbours in spoken word recognition are co-activated to the point of their semantic representations. This also supports the interpretation that the earlier Task by PND interaction (200–300 ms) was due to participants (covertly or overtly) allocating greater attention to early processing stages during LD and to later stages during SC. Overall, the finding of two separate PND effects, each of which interacts with Task in the opposite direction, suggests that in spoken word recognition neighbourhood density impacts processing in at least two stages: a pre-recognition “first-pass” stage, perhaps driven by greater numbers of cohorts or other predicted words, and a later, longer lasting, lexical-semantic stage, possibly driven by the inhibition of lexical competitors and the co-activation of semantic information.

Concreteness

Across both experiments there were robust effects of concreteness between 400 and 900 ms after word onset, and this effect was widely distributed around central sites. Although there are no published studies of ERPs to spoken words as a function of concreteness, the pattern seen here is similar to the ERP effects reported in the written word recognition literature. For visual words, N400 concreteness effects have been shown across a variety of tasks and language contexts (e.g. Kounios & Holcomb, 1994; Holcomb et al., 1999), with larger N400s for words rated as more concrete and smaller N400s for more abstract words. This pattern is usually interpreted as reflecting the richer semantic neighbourhoods, including those involving imagistic representations, engaged by words representing concrete concepts (Holcomb et al., 1999). That spoken words access a similar set of semantic representations as written words is consistent with the common semantic system architecture assumed by most models of word recognition (e.g. the BIAM of Grainger & Holcomb, 2009). Interestingly, this effect of concreteness interacted with task such that it was stronger during SC than LD in the epochs between 400 and 800 ms. This follows the same pattern as the later effects of frequency and PND, which were also larger for SC than LD. Thus, it might share a similar explanation having to do with additional, or deeper, processing at later semantic levels during SC, which affords more opportunity for concreteness effects.

One peculiarity of the concreteness effects was that in Experiment 2 there were significant distributional interactions with Concreteness as early as the 100–200 ms epoch. Examination of the statistical maps in Figure 5 suggests these interactions may be due to a graded effect of concreteness across the midline, originating in occipital sites where more concrete words produced more negativity. The same comparisons in Experiment 1 were not significant. Given the timeframe, these earlier and weaker concreteness effects seem at first blush unlikely to reflect an actual semantic concreteness effect (like the N400 effect); rather, they may have to do with some physical property of more concrete or abstract words that was not included in the LME models. That said, the presence and distribution of this early effect seems to be influenced by experimental task, suggesting that whatever the explanation, it is not independent of top-down influences. Consistent with the top-down task explanation is the possibility that the SC task focused participants on the semantic level of analysis and therefore, like the early frequency and PND effects for LD, might have resulted in an early difference due to concreteness. This could in theory happen in a design with a lot of power if even a relatively small subset of items carried an early cue to their semantic attributes.

Duration

In both Experiments 1 and 2, effects of duration were found across all epochs after 200 ms and are best understood by examining the duration ERP plots in Figure 6. The initial effect (200–300 ms) appears to be due to larger P2s to shorter words compared to longer words. This is the opposite of what has been found in visual word recognition, where longer words produce increased activity relative to shorter words in early components (e.g. Dufau et al., 2015; Hauk & Pulvermüller, 2004). This discrepancy between the two modalities makes sense if the amplitude of early ERP components scales with the amount of information available at that point. Visually, longer words present more information at once, which may explain the larger early components. For spoken words, however, the information from longer items is not simultaneously present. Here, the larger positivity to shorter words may be due to a relatively larger, or more quickly revealed, amount of information compared to longer words, especially during earlier epochs.

In the 300–400 and 400–500 ms epochs, the direction of the effect switched such that shorter words produced larger negativities than longer words. This is likely due to the earlier onset of N400 activity to shorter words, given their faster temporal unfolding. For longer words, meanwhile, the word recognition system is still waiting for additional information to fully process these items, leading to later and more spread-out N400 activity. Consistent with this interpretation, in the 500–600 ms epoch there is another switch in the direction of the duration effect. In these later epochs, there are greater negativities for longer compared to shorter words because the N400s of shorter words offset earlier than the N400s of longer words.

Interestingly, there were widespread interactions between Task and Duration through 800 ms. Overall this is due to a larger duration effect during LD between 100 and 300 ms and between 500 and 800 ms, separated by a larger effect during SC between 300 and 500 ms. This pattern seems to be related to differences in the ERPs to longer duration words, which are more negative during SC, while the ERPs to shorter words appear similar (see ERPs in Figure 6). Though difficult to interpret, one possibility is that if the recognition system works faster during lexical decision to deal with the more involved task, this would further increase the difference in component timing between shorter and longer words for LD compared to SC. Shorter words in the LD task would transition faster from early processing phases to N400 processing, leading to a smaller duration effect on the P2 and a larger effect in early N400 epochs, because the effects in these epochs are of opposite polarity.

Conclusions

Methodologically, this study further demonstrates the effectiveness of large-scale, item-based analysis strategies for understanding word recognition processes. The LMER models revealed intricacies in the time course of the effects which would normally be obscured by the averaging inherent to factorial designs. Further, for multi-dimensional stimuli such as words, the ability to control for collinearity between variables as well as subject-, item-, and electrode-level random effects is critical for understanding the effects of separate variables. Moreover, the use of site-by-site LME t-statistic maps proved to be a very useful tool for the visualisation and interpretation of LME results from ERP data. Including all electrode sites in the analysis, coded by their relative coordinate locations in space, appeared to capture the spatial distribution of the effects in these experiments appropriately, but it is important to emphasise the exploratory nature of these analyses and the fact that the LME approach assumes a linear relationship for the distributional variables, which might not be appropriate for EEG scalp data. Therefore, we have attempted to be cautious in our interpretation of distribution-by-variable interactions, particularly in terms of attributing differences to distinct neural generators (Urbach & Kutas, 2002). Future studies with stronger predictions may further improve ERP distribution modelling by specifying more complex linear functions as distributional interactions with an effect, or by adopting another modelling framework such as generalised additive models, which better accommodate non-linear effects (see Tremblay & Newman, 2015).
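A sketch of the generalised additive alternative mentioned above, using the mgcv package with hypothetical column names (amp, freq, x/y electrode coordinates, subject, item), is given below; it is not the analysis reported here.

```r
# Sketch of a generalised additive mixed model as an alternative to linear
# distributional terms: a non-linear scalp topography and a topographically
# varying frequency effect, with random intercepts for subjects and items.
# Data frame and column names are illustrative assumptions.
library(mgcv)

m <- bam(amp ~ freq +
           s(x, y, bs = "tp") +             # smooth scalp topography of the ERP
           s(x, y, by = freq, bs = "tp") +  # how the frequency effect varies over the scalp
           s(subject, bs = "re") +          # random intercepts (subject must be a factor)
           s(item, bs = "re"),
         data = epoch_data)                 # hypothetical per-epoch data frame
summary(m)
```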

Theoretically, this study provides crucial evidence about the time frame in which a number of important word-level variables affect the recognition of a large, diverse sample of spoken words. Results showed that frequency, PND, and concreteness affected auditory N400 amplitudes in a pattern similar to that found for visual words, supporting the idea that the N400 reflects largely amodal or multimodal processing. However, frequency and PND also produced convincing early effects, even before a word had been completely presented, reflecting the highly incremental, online nature of spoken word recognition and suggesting that these variables affect sub-lexical processing. Further, early frequency effects were only present in Experiment 1 (lexical decision), demonstrating the flexibility of even the earliest stages of word recognition. Duration was shown to affect P2 amplitude as well as to modulate the timing of the N400 component. Overall, the results support interactive models of spoken word recognition and indicate the presence of either feedback mechanisms or some other mechanism that can account for the early frequency and task effects.

Acknowledgments

Funding

This work was supported by Foundation for the National Institutes of Health [grant number HD25889].

Footnotes

Disclosure statement

No potential conflict of interest was reported by the authors.

References

1. Allopenna PD, Magnuson JS, & Tanenhaus MK (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38(4), 419–439.
2. Assadollahi R, & Pulvermüller F. (2003). Early influences of word length and frequency: A group study using MEG. Neuroreport, 14(8), 1183–1187.
3. Baayen RH, Davidson DJ, & Bates DM (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.
4. Balota DA, & Chumbley JI (1984). Are lexical decisions a good measure of lexical access? The role of word frequency in the neglected decision stage. Journal of Experimental Psychology: Human Perception and Performance, 10(3), 340–357.
5. Balota DA, Yap MJ, Cortese MJ, Hutchison KA, Kessler B, Loftis B, … Treiman R. (2007). The English lexicon project. Behavior Research Methods, 39, 445–459.
6. Balota DA, Yap MJ, Hutchison KA, & Cortese MJ (2012). Megastudies. Visual word recognition volume 1: Models and methods, orthography and phonology, 90.
7. Barr DJ, Levy R, Scheepers C, & Tily HJ (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3). doi: 10.1016/j.jml.2012.11.001
8. Bates D, Maechler M, Bolker B, & Walker S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
9. Bentin S, Kutas M, & Hillyard SA (1993). Electrophysiological evidence for task effects on semantic priming in auditory word processing. Psychophysiology, 30(2), 161–169.
10. Brysbaert M, & New B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.
11. Chen Y, Davis MH, Pulvermüller F, & Hauk O. (2015). Early visual word processing is flexible: Evidence from spatiotemporal brain dynamics. Journal of Cognitive Neuroscience, 27(9), 1738–1751.
12. Clark HH (1973). The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4), 335–359.
13. Connine CM, Blasko DG, & Titone D. (1993). Do the beginnings of spoken words have a special status in auditory word recognition? Journal of Memory and Language, 32(2), 193–210.
14. Cumming G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29.
15. Dahan D, Magnuson JS, & Tanenhaus MK (2001). Time course of frequency effects in spoken-word recognition: Evidence from eye movements. Cognitive Psychology, 42(4), 317–367.
16. Dufau S, Grainger J, Midgley KJ, & Holcomb PJ (2015). A thousand words are worth a picture: Snapshots of printed-word processing in an event-related potential megastudy. Psychological Science, 26(12), 1887–1897.
17. Dufour S, Brunelliere A, & Frauenfelder UH (2013). Tracking the time course of word-frequency effects in auditory word recognition with event-related potentials. Cognitive Science, 37(3), 489–507.
18. Emmorey K, Midgley KJ, Kohen CB, Sehyr ZS, & Holcomb PJ (2017). The N170 ERP component differs in laterality, distribution, and association with continuous reading measures for deaf and hearing readers. Neuropsychologia, 106, 298–309.
19. Ernestus M, & Cutler A. (2015). BALDEY: A database of auditory lexical decisions. Quarterly Journal of Experimental Psychology, 68(8), 1469–1488.
20. Ferrand L, Méot A, Spinelli E, New B, Pallier C, Bonin P, & Grainger J. (2017). MEGALEX: A megastudy of visual and auditory word recognition. Behavior Research Methods, 1–23. doi: 10.3758/s13428-017-0943-1
21. Gaskell MG, & Marslen-Wilson WD (1997). Integrating form and meaning: A distributed model of speech perception. Language and Cognitive Processes, 12(5–6), 613–656.
22. Goh WD, Yap MJ, Lau MC, Ng MM, & Tan LC (2016). Semantic richness effects in spoken word recognition: A lexical decision and semantic categorization megastudy. Frontiers in Psychology, 7, 976. doi: 10.3389/fpsyg.2016.00976
23. Goldinger SD, Luce PA, & Pisoni DB (1989). Priming lexical neighbors of spoken words: Effects of competition and inhibition. Journal of Memory and Language, 28(5), 501–518.
24. Grainger J. (2008). Cracking the orthographic code: An introduction. Language and Cognitive Processes, 23(1), 1–35.
25. Grainger J, & Holcomb PJ (2009). Watching the word go by: On the time-course of component processes in visual word recognition. Language and Linguistics Compass, 3(1), 128–156.
26. Groppe DM, Urbach TP, & Kutas M. (2011). Mass univariate analysis of event-related brain potentials/fields I: A critical tutorial review. Psychophysiology, 48(12), 1711–1725.
27. Hagoort P, & Brown CM (2000). ERP effects of listening to speech: Semantic ERP effects. Neuropsychologia, 38(11), 1518–1530.
28. Hauk O, Davis MH, Ford M, Pulvermüller F, & Marslen-Wilson WD (2006). The time course of visual word recognition as revealed by linear regression analysis of ERP data. Neuroimage, 30(4), 1383–1400.
29. Hauk O, & Pulvermüller F. (2004). Effects of word length and frequency on the human event-related potential. Clinical Neurophysiology, 115(5), 1090–1103.
30. Holcomb PJ, Grainger J, & O’Rourke T. (2002). An electrophysiological study of the effects of orthographic neighborhood size on printed word perception. Journal of Cognitive Neuroscience, 14(6), 938–950.
31. Holcomb PJ, Kounios J, Anderson JE, & West WC (1999). Dual-coding, context-availability, and concreteness effects in sentence comprehension: An electrophysiological investigation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(3), 721–742.
32. Holcomb PJ, & Neville HJ (1990). Auditory and visual semantic priming in lexical decision: A comparison using event-related brain potentials. Language and Cognitive Processes, 5(4), 281–312.
33. Hunter CR (2013). Early effects of neighborhood density and phonotactic probability of spoken words on event-related potentials. Brain and Language, 127(3), 463–474.
34. Kieras D. (1978). Beyond pictures and words: Alternative information-processing models for imagery effect in verbal memory. Psychological Bulletin, 85(3), 532–554.
35. Kounios J, & Holcomb PJ (1994). Concreteness effects in semantic processing: ERP evidence supporting dual-coding theory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20(4), 804–823.
36. Kutas M, & Federmeier KD (2011). Thirty years and counting: Finding meaning in the N400 component of the event-related brain potential (ERP). Annual Review of Psychology, 62, 621–647.
37. Laszlo S, & Federmeier KD (2011). The N400 as a snapshot of interactive processing: Evidence from regression analyses of orthographic neighbor and lexical associate effects. Psychophysiology, 48(2), 176–186.
38. Laszlo S, & Federmeier KD (2014). Never seem to find the time: Evaluating the physiological time course of visual word recognition with regression analysis of single-item event-related potentials. Language, Cognition and Neuroscience, 29(5), 642–661.
39. Laszlo S, & Sacchi E. (2015). Individual differences in involvement of the visual object recognition system during visual word recognition. Brain and Language, 145–146, 42–52.
40. Luce PA, & Pisoni DB (1998). Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19(1), 1–36.
41. Luck SJ (2005). An introduction to the event-related potential technique. Cambridge, MA: MIT Press.
42. MacCallum RC, Zhang S, Preacher KJ, & Rucker DD (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7(1), 19–40.
43. Marslen-Wilson WD (1987). Functional parallelism in spoken word-recognition. Cognition, 25(1), 71–102.
44. Marslen-Wilson WD (1990). Activation, competition, and frequency in lexical access. In Altmann GTM (Ed.), Cognitive models of speech processing: Psycholinguistic and computational perspectives (pp. 148–172). Cambridge, MA: MIT Press.
45. Matuschek H, Kliegl R, Vasishth S, Baayen H, & Bates D. (2017). Balancing type I error and power in linear mixed models. Journal of Memory and Language, 94, 305–315.
46. McCarthy G, & Wood CC (1985). Scalp distributions of event-related potentials: An ambiguity associated with analysis of variance models. Electroencephalography and Clinical Neurophysiology, 62, 203–208.
47. McClelland JL, & Elman JL (1986). The TRACE model of speech perception. Cognitive Psychology, 18(1), 1–86.
48. Midgley KJ, Holcomb PJ, Walter JB, & Grainger J. (2008). An electrophysiological investigation of cross-language effects of orthographic neighborhood. Brain Research, 1246, 123–135.
49. Norris D. (1994). Shortlist: A connectionist model of continuous speech recognition. Cognition, 52(3), 189–234.
50. Norris D, & McQueen JM (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115(2), 357–395.
51. Paivio A. (1986). Mental representations: A dual coding approach. New York, NY: Oxford University Press.
52. Payne BR, Lee CL, & Federmeier KD (2015). Revisiting the incremental effects of context on word processing: Evidence from single-word event-related brain potentials. Psychophysiology, 52(11), 1456–1469.
53. Pitt MA, & Samuel AG (2006). Word length and lexical activation: Longer is better. Journal of Experimental Psychology: Human Perception and Performance, 32(5), 1120–1135.
54. Rubenstein H, Garfield L, & Millikan JA (1970). Homographic entries in the internal lexicon. Journal of Verbal Learning and Verbal Behavior, 9(5), 487–494.
55. Sereno SC, Rayner K, & Posner MI (1998). Establishing a time-line of word recognition: Evidence from eye movements and event-related potentials. Neuroreport, 9(10), 2195–2200.
56. Smith ME, & Halgren E. (1987). Event-related potentials during lexical decision: Effects of repetition, word frequency, pronounceability, and concreteness. Electroencephalography and Clinical Neurophysiology, Supplement 40, 417–421.
57. Strauss T, & Magnuson JS (2008). Beyond monosyllables: Word length and spoken word recognition. Proceedings of the 30th annual conference of the cognitive science society (pp. 1306–1311).
58. Strijkers K, Bertrand D, & Grainger J. (2015). Seeing the same words differently: The time course of automaticity and top-down intention in reading. Journal of Cognitive Neuroscience, 27(8), 1542–1551.
59. Tremblay A, & Newman AJ (2015). Modeling nonlinear relationships in ERP data using mixed-effects regression with R examples. Psychophysiology, 52(1), 124–139.
60. Urbach T, & Kutas M. (2002). The intractability of scaling scalp distributions to infer neuroelectric sources. Psychophysiology, 39, 791–808.
61. Van Heuven WJ, Mandera P, Keuleers E, & Brysbaert M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.
62. Van Petten C, & Kutas M. (1990). Interactions between sentence context and word frequency in event-related brain potentials. Memory & Cognition, 18(4), 380–393.
63. Vergara-Martínez M, & Swaab TY (2012). Orthographic neighborhood effects as a function of word frequency: An event-related potential study. Psychophysiology, 49(9), 1277–1289.
64. Vitevitch MS, & Luce PA (1998). When words compete: Levels of processing in perception of spoken words. Psychological Science, 9(4), 325–329.
65. Vitevitch MS, & Luce PA (1999). Probabilistic phonotactics and neighborhood activation in spoken word recognition. Journal of Memory and Language, 40(3), 374–408.
66. West WC, & Holcomb PJ (2000). Imaginal, semantic, and surface-level processing of concrete and abstract words: An electrophysiological investigation. Journal of Cognitive Neuroscience, 12(6), 1024–1037.
67. Whaley CP (1978). Word—nonword classification time. Journal of Verbal Learning and Verbal Behavior, 17(2), 143–154.
68. Yarkoni T, Balota D, & Yap M. (2008). Moving beyond Coltheart’s N: A new measure of orthographic similarity. Psychonomic Bulletin & Review, 15(5), 971–979.
