A Neurally Plausible Parallel Distributed Processing Model of Event-Related Potential Word Reading Data

Sarah Laszlo; David C Plaut

doi:10.1016/j.bandl.2011.09.001

Author manuscript; available in PMC: 2013 Mar 1.

Published in final edited form as: Brain Lang. 2011 Sep 25;120(3):271–281. doi: 10.1016/j.bandl.2011.09.001

PMCID: PMC3328138 NIHMSID: NIHMS327952 PMID: 21945392

Abstract

The Parallel Distributed Processing (PDP) framework has significant potential for producing models of cognitive tasks that approximate how the brain performs the same tasks. To date, however, there has been relatively little contact between PDP modeling and data from cognitive neuroscience. In an attempt to advance the relationship between explicit, computational models and physiological data collected during the performance of cognitive tasks, we developed a PDP model of visual word recognition which simulates key results from the ERP reading literature, while simultaneously being able to successfully perform lexical decision—a benchmark task for reading models. Simulations reveal that the model’s success depends on the implementation of several neurally plausible features in its architecture which are sufficiently domain-general to be relevant to cognitive modeling more generally.

Keywords: Computational Modeling, Parallel Distributed Processing, Event-Related Potentials, N400, Visual Word Recognition

Introduction

Comprehending meaning from text—visual word recognition—is a pervasive and fundamental cognitive process that is studied by researchers using a wide variety of methodologies. In broad strokes, cognitive scientists seek to characterize the component processes involved, cognitive neuroscientists seek to map those processes onto neural signatures, and computational modelers seek to make explicit the interactions that occur between the representations involved. Each of these methodologies has strengths that can supplement the weaknesses of others, and often important discoveries are made when two or more of them are combined—for example, when psychophysiology provides a time course for proposed cognitive processes or when a computational model shows that a particular cognitive architecture can in fact produce the pattern of results it has been formulated to explain.

Interplay between cognitive science and computational modeling in the domain of visual word recognition has involved the parallel development of two prominent but very different modeling frameworks. One, the parallel distributed processing (PDP) approach, utilizes learned representations and a uniform set of computational principles (e.g., Seidenberg & McClelland, 1989; Plaut, McClelland, Seidenberg, & Patterson, 1996). The other, the so-called “dual-route” or “dual-process” approach, de-emphasizes learning and relies on different types of computations in different functional pathways (e.g., Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Perry, Ziegler, & Zorzi, 2007). Each of these approaches has its own strengths and weaknesses, but in aggregate both of them are highly successful in simulating a number of results from behavior and neuropsychology. For example, one compilation of effects that recent models have been successful in simulating (Perry et al., 2007) includes 13 items, from diverse tasks such as lexical decision, reading aloud, and many variants of priming, as well as several items pertaining to performance in dyslexia. However, there is one area in which even the most sophisticated of current models is lacking, as agreed upon by proponents of both the PDP and dual-process frameworks (e.g., Harm & Seidenberg, 2004; Perry et al., 2007), as well as advocates of other modeling techniques in other domains (e.g., Bayesian modeling; see Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010). That area is contact with data from cognitive neuroscience and neurophysiology. It is widely hoped that more contact with cognitive neuroscience can provide constraining data on appropriate internal dynamics for models, and that more contact with data from neuroscience can improve the neural plausibility of models largely based on behavior.

Interestingly, this need for more contact with cognitive neuroscience in computational investigations of visual word recognition has coincided with a need for more contact with computational models in similar investigations conducted using the Event-Related Potential (ERP) methodology. It is now commonly noted that theories about the representations and computations involved in reading stemming from ERP data have become specific enough that it would be desirable to test them by instantiating them as computational models (e.g., Barber & Kutas, 2007; van Berkum, 2008). For example, a recent series of ERP studies pertaining to the “obligatory semantics” view of visual word recognition has presented data cast as strongly consonant with PDP models, while less supportive of dual-process models (Laszlo & Federmeier, 2007, 2008, 2009, 2011). These studies have focused on the N400 ERP component, which, as discussed in more detail below, is thought to be a functionally specific marker of attempted semantic access (see Kutas & Federmeier, in press, for review). It has now been shown several times that even meaningless items with little resemblance to lexically represented items can engage the semantic access thought to be indexed by the N400, both in sentences (Laszlo & Federmeier, 2009) and in unconnected streams of text (Laszlo, Stites, & Federmeier, in press). That is, an attempt to access semantics appears to be obligatory for all orthographic inputs, even consonant strings like XFQ. Further, the N400s elicited by meaningless illegal strings respond to manipulation of lexical characteristics such as orthographic neighborhood size (i.e., Coltheart’s N, the number of words that can be created by changing one letter of a target item; Coltheart, Davelaar, Jonasson, & Besner, 1977) and neighbor frequency in a manner both quantitatively and qualitatively similar to that demonstrated by words (Laszlo & Federmeier, 2011).
These data have been taken as supportive of PDP models in that they seem to reveal a language processing system which 1) does not require an item to have a lexical representation, or even be similar to an item with a lexical representation, in order to make some contact with semantics and 2) performs what appear to be indistinguishable computations on different input types, regardless of factors like lexicality or the regularity/consistency of spelling-sound correspondences. Further, the degree to which an attempt at semantic access occurs for meaningless items appears to be strongly related to their similarity to items with associated semantics (i.e., words, acronyms), a result which is consonant with the fact that the distributed representations preferred by PDP models tend to associate similar inputs with similar outputs, to a degree determined by the amount of overlap between representations.

In contrast, the ERP results seem to be less supportive of dual-process models, insofar as such models include lexical mediation between orthographic input and semantics (e.g., Perry et al., 2007), making it difficult or impossible for items such as consonant strings, which are neither lexically represented nor similar to items that are, to contact semantics. Note that a lexically mediated system could potentially be made to allow illegal strings contact with semantics by lowering the threshold of lexical activation that needs to be met in order for semantics to be activated. That is, the many lexical entries that overlap slightly with illegal strings could be activated weakly, and the aggregation of this weak activity over many units could be allowed to be passed forward to semantics. However, such a system is no longer strongly lexicalized, in that the internal representations that mediate between orthography and semantics are now essentially distributed—that is, many units participate in the representation of each input, and the strength of activation in those units is proportional to the degree of overlap with the input. This will be true not just for nonwords but also for words as, of course, words overlap with other words to differing degrees.

Another potential mismatch between the ERP results pertaining to meaningless, illegal strings and dual-process models arises from one of the core properties of dual-process models: orthographic inputs tend to differentially engage separable processing streams depending on the regularity of their spelling-sound correspondence. This characteristic seems incongruent with the repeated finding that items with irregular spelling-sound correspondences (acronyms, consonant strings) elicit waveforms that are qualitatively and quantitatively quite similar to those elicited by items with regular spelling-sound correspondences (words, pseudowords) up to and including the N400 portion of the ERP (Laszlo & Federmeier, 2007).

The fact that these ERP data have been explicitly cast as supportive of one particular theoretical framework invites an attempt to test the obligatory semantics view by trying to simulate key ERP data in a PDP model of the type they are claimed to support. An attempt to test the obligatory semantics view by instantiating its assumptions in an explicit computational model would be useful not only in advancing a theoretical position present in the ERP literature—it would also provide new information about the degree to which the internal dynamics of a reading model constructed with PDP principles match the internal dynamics of the groups of neurons that are actually performing the task in the brain. Currently, there is limited constraint on the internal dynamics of cognitive reading models, as they are all based almost entirely on behavioral data, which is fundamentally end state data. That is, while strong inferences about internal processing can and have been made on the basis of, for example, RT or naming latency data, these data do not provide direct evidence about the processes occurring between when an item is presented and when a response is made—only data about the final consequences of those processes. ERPs, in contrast, can be collected continuously between when an item is presented and when a response is made, and can, in fact, be collected even when no overt response is made. Further, ERPs can be divided into well-specified components, which have been robustly replicated as reflecting particular cognitive functions.

The N400 component, for a particularly relevant example, is strongly tied with attempted semantic access. The designation of the N400 as a semantic component is based on a variety of converging results, including its functional properties, its neural generators, and the functional anatomy of components which precede it. The N400 is known to respond to a wide variety of semantic manipulations such as congruity with sentence and discourse context (Kutas & Hillyard, 1984; van Berkum, Hagoort, & Brown, 1999), semantic association (Nobre & McCarthy, 1995), and item concreteness (Kounios & Holcomb, 1994), to name only a few, while not being sensitive to other types of linguistic manipulations, such as those of syntactic constraint (Kutas & Hillyard, 1983), or font size (Kutas & Hillyard, 1980). Converging evidence from intracranial EEG (Nobre & McCarthy, 1995), MEG (Halgren, Dhond, Christensen, Van Petten, Marinkovic, & Lewine, 2002), and the Event-Related Optical Signal (EROS; Tse, Lee, Sullivan, Garnsey, Dell, Fabiani, & Gratton, 2007), as well as patterns of diminished N400 in brain damage (Hagoort, Brown, & Swaab, 1996) all point to a primary source of the N400 in the left anterior temporal lobe, a region strongly linked with semantic processing (e.g., Nobre & McCarthy, 1995; McCarthy, Nobre, Bentin, & Spencer, 1995). Finally, the N400 has been argued to occur not only in the correct brain areas, but also in the correct temporal window, to subserve semantic access, based on both the neural generators and functional properties of the sensory and form-based components that precede it (see Grainger & Holcomb, 2009, for extensive review). In sum, the functional specificity of the N400 component is a particularly useful property for model-building, as its clear link with semantic processing permits a direct comparison with semantic representations and processes in a model.

The goal of the present work is to test the assumptions of the obligatory semantics view of N400 processing in a PDP model that continuously simulates N400 amplitude. Three particular considerations are of importance. First, can such a model produce N400-like dynamics at all? That is, can we produce a PDP model whose semantic activation resembles the morphology of the N400 component? To our knowledge there are no other implemented computational models of N400 processing, so this is not assured. We chose to link N400 amplitude with amount of activation in the semantic layer of representation in our model, on the basis of the N400’s strong link in the literature with semantic access (as just discussed) as well as findings that, at least in the context of reading unconnected text, N400 amplitude represents the number of semantic features being activated in response to a particular input (e.g., Laszlo & Federmeier, 2007, 2011). Larger (more negative) N400s are elicited by items which might be expected to activate more semantic features, such as items higher in concreteness (e.g., West & Holcomb, 2000), or items with larger orthographic neighborhood sizes. The morphology of the N400 is well-characterized as essentially a curve which rises monotonically to a single peak, and then decreases monotonically throughout the remainder of its time course; Figure 1 displays several N400 potentials representative of those we sought to simulate. To be successful, the mean amount of activity in the model’s semantic layer must develop similarly, without, for example, additional oscillations. In this fashion, the model is constrained not only to reach some end state in a manner consistent with the data (as is the case in behavioral models), but also to perform in a manner consistent with the data throughout its evolution over time.
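The single-peak criterion described above can be written as a simple check on a simulated time course. The following is a minimal sketch rather than the procedure used in the simulations: the function name `is_single_peaked` and the example activation values are hypothetical and are not taken from the model.

```python
def is_single_peaked(series):
    """Return True if `series` rises (weakly) to one peak and then falls (weakly).

    `series` stands in for mean semantic-layer activation at successive
    time steps; the name and the toy values below are hypothetical.
    """
    if not series:
        return False
    peak = series.index(max(series))  # first occurrence of the maximum
    rising = all(a <= b for a, b in zip(series[:peak], series[1:peak + 1]))
    falling = all(a >= b for a, b in zip(series[peak:], series[peak + 1:]))
    return rising and falling


n400_like = [0.0, 0.2, 0.5, 0.4, 0.1]    # one rise followed by one fall
oscillating = [0.0, 0.4, 0.2, 0.5, 0.1]  # a dip and second rise before the peak
print(is_single_peaked(n400_like), is_single_peaked(oscillating))
```

A check of this kind only captures the qualitative shape (no additional oscillations); it says nothing about peak latency or amplitude.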

Figure 1. Representative single-item ERPs averaged over 120 participants, but not over items. The middle parietal electrode site, where N400 effects are most prominent, is displayed. Typical N400 morphology is visible in the 300–500 ms N400 window (boxed), for the words DOG, BUS, and FISH. In this figure, as in all ERP figures, negative is plotted up.

The second consideration is: will the dynamics of the semantic layer in the model further mirror critical results supportive of the obligatory semantics view? In seeking to answer this question, we chose to focus our simulations on data from the single-item ERP corpus (Laszlo & Federmeier, 2011), as it is both uniquely appropriate for computational modeling and also representative of the key data in support of obligatory semantics. The availability of single-item ERPs enables items analysis (e.g., items multiple regression), in addition to the more typical parametric analysis available from ERP reading studies. This makes the single-item ERP corpus a particularly appropriate target for computational modeling, as it is advantageous to model items effects, not just item aggregated, factorial effects, whenever possible. For the model to be successful, in addition to showing the broad characteristics of the N400, it must also produce simulated N400s that are consistent with the critical findings from the single-item ERP corpus (described in detail below).

Finally, it is important that the model also be able to perform the behavioral task of lexical decision, as lexical decision is among the most common benchmark tasks for computational reading models. Literate adults, though they do not receive extensive training on performing lexical decisions while learning to read, are able to make them quite easily in a lab setting. In imitation of this situation, the model is never explicitly trained on lexical decision but is asked to make lexical decisions on the basis of a simple thresholding procedure after training is complete. Attempting to implement this additional capability in the model helps to ensure that, insofar as it is able to simulate results from the domain of ERPs that previous models do not, it is also able to simulate the fundamental behavioral data that decades of visual word recognition modeling have been built on. Without this additional ability, the ERP model would not truly be tied to its thematic predecessors (e.g., Harm & Seidenberg, 2004; Plaut et al., 1996; Seidenberg & McClelland, 1989), which would be unfortunate given the significant insights those models have provided into visual word recognition. Simulating both electrophysiological and behavioral data is a more challenging task than simulating the ERP data alone, but a worthwhile one: it lays a foundation for a much more complete, holistic model than ignoring the behavioral data would. Further, challenging the model to perform lexical decision instantiates an incremental approach to computational modeling (Perry et al., 2007) by extending a preliminary ERP model that focused on the ERP data alone (Laszlo & Plaut, 2011). A criterion for model success was that, by the end of processing each input, the model be able to produce a signal that could reliably differentiate meaningful items (words and acronyms) from non-meaningful items (pseudowords and illegal strings).

In developing a model of ERP data, we considered it critical to incorporate some of the most general properties of the neurons which produce the ERP signal. The vast majority of the brain-generated electrical potential measured at the scalp is produced by synchronous excitatory and inhibitory postsynaptic potentials in cortical neurons arranged in an open-field configuration (see Fabiani, Gratton, & Federmeier, 2007, for review). Thus, we departed from previous reading models by trying to handle excitation and inhibition in the model in a manner more true to what is understood about the neural configuration of excitation and inhibition (see, e.g., Crick & Asanuma, 1986). This was accomplished in three ways. First, we separated excitation and inhibition in the model, such that individual units could have excitatory outgoing projections or inhibitory outgoing projections, but never both, as is true of cortical neurons. This arrangement can be observed in Figure 2, which presents a schematic of the ERP model. Second, we limited the distribution of inhibitory connections, such that they could occur only within, but never between, levels of representation in the model. This decision was motivated by the fact that connections between cortical areas are largely excitatory, with inhibitory connections occurring largely within a given cortical area. This feature of the model is also visible in Figure 2. Finally, we severely limited the number of inhibitory units in the model (each excitatory layer has only a single associated inhibitory unit), in accordance with the finding that the large majority of neurons in the cortex are excitatory (e.g., White, 1989).
Each of these neurally plausible adjustments to the way excitation and inhibition are handled in the model represents a departure from previous reading models (e.g., Harm & Seidenberg, 2004), in that inhibition is typically unconstrained in such models, with individual units able to have both positive and negative outgoing connections, inhibitory connections allowed between levels of representation, and, because excitation and inhibition are not separated, essentially equal numbers of excitatory and inhibitory units.

Figure 2. Schematic of the ERP model. Lines with empty circles indicate excitatory connections, lines with filled circles indicate inhibitory connections. INH stands for “inhibitory,” and each INH bank consists of only 1 unit. Note that no units have both excitatory and inhibitory outgoing connections, and that inhibition is always within, never between, levels of representation.

In what follows, we first present the relevant phenomena from the single-item ERP corpus in some detail, in order to directly motivate the simulations that follow. Then, in two simulations, we explore a number of questions pertaining to the ability of a PDP system to successfully simulate the ERP data. First, we attempt to determine whether a PDP system can produce internal dynamics which resemble ERP morphology at all. If this is accomplished, we then seek to determine whether such a system can produce the results thought to be supportive of the obligatory semantics view of N400 processing: namely a strong effect of orthographic neighborhood size which acts similarly for lexical and non-lexical items. Importantly, if the model is able to correctly simulate the key ERP findings, its ability to perform lexical decisions is assessed as an additional metric of success. Finally, the contribution of the separation of excitation and inhibition in the model to the model’s ability to simulate the ERP data is examined.

Target Phenomena: Event-Related Potentials

A detailed report of the methods and results of the single-item ERP corpus is available elsewhere (Laszlo & Federmeier, 2011). However, for clarity, we describe here the nature of that data set and the key results that will act as target phenomena for the simulations presented below. The 120 participants in the single-item ERP study viewed an unconnected list of words (e.g., HAT), pseudowords (e.g., KOF), acronyms (e.g., DVD), and meaningless illegal strings (e.g., NHK), while monitoring the stream for English proper names (e.g., SARA, DAVE). No response was required for the critical item types, in order to keep the critical ERPs free from response-related components. This task, as well as the item types presented, replicated Laszlo and Federmeier (2007). Acronyms were backsorted on the basis of a post-test such that only acronym items that individual participants were familiar with were included in that participant’s averaged waveforms. Event-Related Potentials were formed by averaging at each of the scalp electrodes time-locked to the onset of each of the critical items. In the case of single-item ERPs, averaging was done over participants only, not over items. More typical, item-aggregated ERPs (representing, for example, the response to all words) were formed by averaging over both items and participants.
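The difference between single-item and item-aggregated averaging can be made concrete with a small sketch. The data layout (`epochs[participant][item]`), the function names, and the voltage values below are all hypothetical, chosen only to illustrate the two averaging schemes; they are not the corpus data or its processing pipeline.

```python
def mean_waveform(waveforms):
    """Average equal-length waveforms sample by sample."""
    n = len(waveforms)
    return [sum(samples) / n for samples in zip(*waveforms)]


def single_item_erp(epochs, item):
    """Average one item's epochs over participants only."""
    return mean_waveform(
        [by_item[item] for by_item in epochs.values() if item in by_item]
    )


def item_aggregated_erp(epochs, items):
    """Average over both participants and the listed items."""
    return mean_waveform(
        [by_item[item]
         for by_item in epochs.values()
         for item in items
         if item in by_item]
    )


# Hypothetical toy data: epochs[participant][item] -> voltage samples
epochs = {
    "p1": {"HAT": [0.0, -2.0, -4.0], "KOF": [0.0, -1.0, -3.0]},
    "p2": {"HAT": [0.0, -4.0, -6.0], "KOF": [0.0, -3.0, -5.0]},
}
hat_erp = single_item_erp(epochs, "HAT")
aggregate_erp = item_aggregated_erp(epochs, ["HAT", "KOF"])
```

The `if item in by_item` filter is one way to accommodate backsorting, where an acronym unfamiliar to a given participant is simply absent from that participant's epochs.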

One of the most striking findings in the single-item data is that individual lexical characteristics (e.g., orthographic neighborhood size, neighbor frequency) tend to be much stronger predictors of N400 amplitude than lexical type (e.g., word or pseudoword). This is demonstrated in Figure 3, in the case of orthographic neighborhood size. As is evident in Figure 3, items with high N (words, pseudowords) elicit larger N400s than items with low N (acronyms, illegal strings), and this is true regardless of lexicality. That is, though pseudowords are presumably not semantically represented, they elicit similar N400s to words, because of their similarity on N; the same is true when comparing acronyms and illegal strings. This can be quantified as a main effect of N on N400 mean amplitude, but no effect of lexicality and no interaction between the two (see Laszlo & Federmeier, 2011, for details of statistical analysis).

Figure 3. Orthographic neighborhood size effect in item-aggregated ERPs. Item types with high N (words, pseudowords) elicited larger N400s than item types with lower N (acronyms, illegal strings).

The second critical finding we consider in the simulations below is that, at an items level, the slopes relating N400 mean amplitude to orthographic neighborhood size are qualitatively and quantitatively quite similar for lexical and non-lexical items; this is, of course, reflected as the lack of interaction between N and lexicality in the factorial analysis. This result is visible in Figure 4 (reproduced from Laszlo & Federmeier, 2011), which displays a scatter plot of items N400 mean amplitude versus orthographic neighborhood size along with single regression trendlines for lexical and non-lexical items. Note that the distributions of N400 mean amplitudes for lexical and non-lexical items are almost completely overlapping, as are the trendlines depicting the relationship between orthographic neighborhood size and N400 mean amplitude for the two lexical types. Thus, the N effect is quite similar for lexical and non-lexical items. In addition, automated stepwise regression analysis of the single-item ERP corpus revealed that N is, by far, the strongest predictor of unique N400 variance of those lexical variables considered (length, N, neighbor frequency, number of lexical associates, and frequency of top associate were all considered in Laszlo & Federmeier, 2011; subsequent analysis has extended the list to include bigram frequency, concreteness, imageability, number of senses, and noun-verb ambiguity; Laszlo, unpublished data). The prominence of the N effect, combined with other findings indicating that, unlike effects of variables such as concreteness or written frequency, it is maintained both with repetition and in sentence context (Laszlo & Federmeier, 2007, 2008, 2009), makes it particularly relevant for simulations exploring the obligatory semantics view of the N400.

Figure 4. Orthographic neighborhood size effect in single-item ERPs. N400 mean amplitude is computed over the middle parietal electrode site in the 300–500 ms post-stimulus-onset epoch. Lexical items (words and acronyms) are represented by filled dots, non-lexical items (pseudowords and illegal strings) are represented by empty dots. Note that the slopes representing the relationship between orthographic neighborhood size and N400 mean amplitude are quite similar. Reproduced from Laszlo & Federmeier (2011).

Simulation 1

Methods

The architecture of the ERP model is depicted in Figure 2. A 15-unit visual input layer represents the visual features of each of 3 letters in 3 non-overlapping slots of 5 units each. The visual input layer feeds into a 20-unit orthographic autoencoder, which was pre-trained to reproduce the visual input on a copy of the 15 input units. The autoencoder feeds through a 50-unit hidden layer to a 50-unit semantic layer with an associated 30-unit semantic cleanup layer. At the semantic layer, relatively sparse, arbitrary semantic representations were trained to be associated with the visual inputs, in accordance with the fact that, for morphologically simple words in English, orthography-semantics mappings are largely arbitrary. Semantic targets consisted of random bit patterns over the 50 semantic units; that is, semantic features were not learned but were arbitrarily assigned, with the constraint that each unit be active in at least one semantic target. Either 3 or 7 features were active in semantics for each target. The numbers 3 and 7 were chosen simply so that semantic representations would be fairly sparse (i.e., 6% of features active for a representation with 3 features, 14% active for a representation with 7 features). Two different numbers of features were chosen so that effects of semantic concreteness could be explored in future versions of the model using the same materials: the N400 is known to be sensitive to semantic concreteness (Kounios & Holcomb, 1994). Weights on connections between levels of representation were constrained to be positive-only. Each layer of representation (except for the cleanup and input layers) has one associated inhibitory unit, connected as depicted in Figure 2.
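One way to generate semantic targets meeting the constraints just described is sketched below. This is an illustration under stated assumptions, not the authors' procedure: the function name `make_semantic_targets`, the equal-probability choice between 3 and 7 features, the resample-until-covered strategy, and the seed are all hypothetical. The count of 77 items corresponds to the 62 words and 15 acronyms in the model's vocabulary.

```python
import random


def make_semantic_targets(n_items=77, n_units=50, feature_counts=(3, 7), seed=0):
    """Draw sparse random bit patterns until every unit is used at least once.

    Assumptions (not specified in the text): each item gets 3 or 7 active
    features with equal probability, and the whole set is redrawn if any
    semantic unit ends up inactive in all targets.
    """
    rng = random.Random(seed)
    while True:
        targets = []
        for _ in range(n_items):
            k = rng.choice(feature_counts)
            active = set(rng.sample(range(n_units), k))
            targets.append([1 if u in active else 0 for u in range(n_units)])
        # Constraint from the text: each unit is active in at least one target.
        if all(any(t[u] for t in targets) for u in range(n_units)):
            return targets


targets = make_semantic_targets()
sparsity = sorted({sum(t) / len(t) for t in targets})  # fractions of active units
```

With 3 or 7 of 50 units active, each pattern has 6% or 14% of its features on, matching the sparsity figures given above.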

For excitatory units, the standard logistic (sigmoid) function was used to compute unit activations. For the inhibitory units, a multi-linear activation function was used, with a slope of 1 from inputs of zero through an inflection point, and a slope of 2 from the inflection point onward (see Figure 5). The multi-linear activation function was used in order to approximate the presence in the brain of separate populations of inhibitory neurons with varying temporal response properties; that is, the fact that some inhibitory neurons respond more quickly to stimulation than others (e.g., Traub, Miles, & Wong, 1989; Benardo, 1994). As is visible in Figure 5, the multi-linear activation function is formally identical to the sum of 1) a linear activation function that begins immediately with even small amounts of input and 2) an identical linear activation function that does not begin until some threshold of activation is passed (the “elbow”). Because it takes time for activation to build up in the network, the result is that the steeper portion of the inhibitory function is not used until later in network time than the shallower portion. In this way, even though the network only has one inhibitory unit at each level of representation, that one unit is able to approximate the function of separate units with different temporal properties. The inhibitory activation function is unbounded, allowing the single inhibitory unit associated with each level of representation to produce significant inhibition, and the location of the inflection point in activation space for each inhibitory unit is a fixed parameter in the model. Output units (i.e., units in the semantic layer or the orthographic output layer in the autoencoder) are additionally constrained such that their activation decays towards zero as the inverse square root of their raw, logistic activation.
Thus, units that are strongly activated tend to stay strongly activated, while units that are weakly activated tend to decay towards zero activation. This procedure is reminiscent of a k-winners-take-all function (O’Reilly, 1996a), in that it allows only the units with the strongest activations to remain active, and quiets all the rest, but differs in that the number of units that are able to remain active is dynamic.

Figure 5. The sum of an immediate, linear inhibition function (left) and a delayed, linear inhibition function (center) is a multi-linear (“elbowed”) function (right). In the model, inhibition is a function of input activation, not time, so its relationship to truly time dependent inhibition is only approximate. The slopes displayed here are the actual slopes used in the simulation (i.e., a slope of 1 to the inflection point, and a slope of 2 after). The inflection point in the model’s inhibition function is a fixed parameter. Both “Amount of Inhibition” and “Time” are in arbitrary units.
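The equivalence between the elbowed inhibitory function and the sum of two linear functions can be checked directly. The sketch below uses the slopes given in the text (1 before the inflection point, 2 after); the function names, the treatment of negative inputs as zero, and the elbow value of 1.0 are assumptions made for illustration, since the model's actual inflection point is not given here.

```python
def inhibitory_activation(net_input, elbow):
    """Multi-linear activation: slope 1 up to `elbow`, slope 2 beyond it.

    Inputs below zero are mapped to zero here, which is an assumption;
    the text only specifies the function from zero upward.
    """
    if net_input <= 0.0:
        return 0.0
    if net_input <= elbow:
        return net_input
    return elbow + 2.0 * (net_input - elbow)


def sum_of_two_linear(net_input, elbow):
    """Immediate linear component plus a delayed one that starts at `elbow`."""
    immediate = max(0.0, net_input)
    delayed = max(0.0, net_input - elbow)
    return immediate + delayed


ELBOW = 1.0  # hypothetical value for the fixed inflection-point parameter
for x in (0.5, 1.0, 3.0):
    assert inhibitory_activation(x, ELBOW) == sum_of_two_linear(x, ELBOW)
```

Because the function has no upper bound, a single unit can keep increasing its inhibition as input grows, which is the property the text relies on.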

Training was accomplished by back-propagating cross-entropy error through time (Rumelhart, Hinton, & Williams, 1986; Hinton, 1989). Additional constraints were added to the back-propagation procedure to assure that excitatory weights were always positive and inhibitory weights were always negative. First, the minimum outgoing weight of excitatory units is a fixed parameter in the model such that in the present implementation of the back-propagation algorithm, no weight change is made that would cause a weight to be smaller than its fixed minimum. Second, inhibitory weights were fixed to random, negative values at the beginning of training and were not updated subsequently. Thus, it was impossible for connections designated as excitatory to have negative values, or for connections designated as inhibitory to have positive values.
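The two sign constraints can be illustrated with a small sketch of the weight bookkeeping alone, leaving out back-propagation itself. It follows one reading of the text, in which an update that would push an excitatory weight below its minimum is skipped entirely (clipping to the minimum would be another reading). The function names, the minimum of 0.01, and the range of inhibitory magnitudes are hypothetical.

```python
import random


def update_excitatory_weight(weight, delta, w_min):
    """Apply `delta` unless the result would fall below `w_min`."""
    proposed = weight + delta
    return proposed if proposed >= w_min else weight


def init_inhibitory_weights(n, rng, low=0.1, high=1.0):
    """Draw fixed negative weights; the magnitude range is hypothetical."""
    return [-rng.uniform(low, high) for _ in range(n)]


W_MIN = 0.01  # hypothetical fixed minimum for excitatory weights
w = 0.05
w = update_excitatory_weight(w, -0.02, W_MIN)  # accepted: result stays above W_MIN
w = update_excitatory_weight(w, -0.10, W_MIN)  # rejected: result would be negative

# Inhibitory weights are set once and never passed to an update function.
inhibitory = init_inhibitory_weights(4, random.Random(0))
```

Under either reading, excitatory weights can never become negative, and because inhibitory weights are never updated they can never become positive.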

In order to keep the scale of the model small, there are only 10 letters in its vocabulary: seven consonants (SNCBDPT) and three vowels (OIU). Of the 1000 possible strings of letters that could be formed with 10 letters in three slots (10^3), we designated 62 as “words” and 15 as “acronyms.” Words were constrained to have a CVC structure, and acronyms could have any letters in the 1st or 3rd position, but were constrained to have a consonant in the 2nd position. This was done to create a structural difference between the representations of words and acronyms and also to ensure that the orthographic neighborhood sizes of acronyms would be smaller than those of words, as is true in the single-item ERP corpus. Within the limited vocabulary of the model, words had a within-set N of 6.83, and acronyms had a within-set N of 0.8.
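The string space and the neighborhood measure can be reproduced in a few lines. The letter inventory and structural constraints below come from the text; the function name `neighborhood_size` and the five-item `toy_lexicon` are hypothetical and are not the model's actual word and acronym lists, which are not given here.

```python
from itertools import product

CONSONANTS = "SNCBDPT"
VOWELS = "OIU"
LETTERS = CONSONANTS + VOWELS

# All 10^3 three-letter strings over the model's alphabet.
all_strings = ["".join(p) for p in product(LETTERS, repeat=3)]
# Candidate word structures: consonant-vowel-consonant.
cvc_strings = [s for s in all_strings
               if s[0] in CONSONANTS and s[1] in VOWELS and s[2] in CONSONANTS]
# Candidate acronym structures: any letter, consonant, any letter.
interior_consonant = [s for s in all_strings if s[1] in CONSONANTS]


def neighborhood_size(item, lexicon):
    """Coltheart's N: lexicon entries differing from `item` in exactly one position."""
    return sum(
        1 for other in lexicon
        if len(other) == len(item)
        and sum(a != b for a, b in zip(item, other)) == 1
    )


toy_lexicon = {"DOT", "COT", "DOC", "BUS", "PDT"}  # hypothetical, for illustration only
n_dot = neighborhood_size("DOT", toy_lexicon)  # neighbors: COT, DOC
n_pdt = neighborhood_size("PDT", toy_lexicon)  # no one-letter neighbors in the toy set
```

There are 147 possible CVC strings (7 × 3 × 7) and 700 strings with an interior consonant (10 × 7 × 10), so the 62 words and 15 acronyms occupy only part of each structural class.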

Before semantic training commenced, the autoencoder was trained to reproduce the orthography of each of the 77 semantically represented (i.e., word and acronym) items (see the Results section, below). This was done to ensure that, even before semantic training began, the network had some knowledge about the orthographic structure of input items. By forcing the network to condense, and reconstruct, orthographic representations prior to the onset of semantic training, the autoencoder ensures that orthographic structure will be emphasized in subsequent processing. The model learns, during autoencoder training, that inputs with interior consonants are dispreferred, through the simple fact that more words (items with interior vowels) are presented than acronyms (items with interior consonants). This information is important to the model, as without it illegal strings tend to produce too much semantic activation, by virtue of their structural similarity with acronyms. A related consequence of pre-training particular orthographic structures is that acronyms form strong internal representations in the model, without which they would be unable to activate semantics sufficiently because they are dissimilar to and less frequent than words. In essence, what the model learns by pre-training on orthographic structure is that internal consonants are dispreferred, and thus should generally not pass much activation forward, except in the specific cases with robust representations in the autoencoder, that is, except for acronyms.

After autoencoder training was complete, the semantic training phase began, during which the network was trained to activate the correct (although arbitrary) semantic features for each of the 77 semantically represented items, while simultaneously being trained to keep all features in semantics “off” for a large set of “wordlike” nonwords. The wordlike nonwords consisted of the 1155 (77 items × 15 input features) items that could be formed by flipping one bit in the input representations of the 77 semantically represented items, that is, by changing a single one in an input representation to a zero, or vice versa. Although there were more wordlike nonwords than semantically represented items in the training corpus, each word was presented to the network during training 50 times more frequently than each nonword. One way to think about training on wordlike nonwords is that it approximates training the network not to link semantics with “mistakes,” much like teaching a learning reader that a word misspelled by one letter is not the same as the word itself.
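The construction of wordlike nonwords by single bit flips can be sketched as follows. The function name and the example bit pattern are hypothetical; only the 15 input features and the 77 × 15 = 1155 count come from the text.

```python
def one_bit_flips(pattern):
    """Return every pattern that differs from the input in exactly one bit."""
    flips = []
    for i, bit in enumerate(pattern):
        flipped = list(pattern)
        flipped[i] = 1 - bit  # change a one to a zero, or vice versa
        flips.append(tuple(flipped))
    return flips


# Hypothetical 15-feature input pattern for one item (not taken from the model).
example_item = (1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1)

if __name__ == "__main__":
    nonwords = one_bit_flips(example_item)
    print(len(nonwords))       # one nonword per input feature
    print(77 * len(nonwords))  # size of the full wordlike-nonword set
```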

On each training trial, the visual input for one of the items in the training corpus was clamped on, and activation was allowed to propagate through the network for 12 time steps with no accumulation of error. Targets continued to be presented for a subsequent four time steps, during which time error was accumulated. At the end of 16 time steps, the trial ended, the network was reset to its initial state, and the next trial began. Words and acronyms were 50 times more likely to be selected as the input for each trial than wordlike nonwords. A single training epoch consisted of 1232 (77 + 1155) trials; however, not every item was necessarily trained in each epoch, because words and acronyms were more likely to be selected than nonwords (e.g., a single word could be selected 50 times, meaning that not every item would be selected in every 1232-trial epoch). After 9000 epochs of training in this fashion, the network was tested on 441 items: the 62 words and 15 acronyms it was trained on, in addition to 279 illegal strings (nonwords with central consonants) and 85 pseudowords (nonwords with central vowels) to which the network was not exposed during training. The target for all illegal strings and pseudowords was for all semantic units to remain off.
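The trial schedule can be sketched as weighted sampling plus an error window. This is an illustration under assumptions: the function names are hypothetical, selection is assumed to be independent sampling with replacement, and "50 times more likely" is read as a 50:1 per-item weight ratio.

```python
import random
from fractions import Fraction

N_REPRESENTED = 77   # words and acronyms
N_NONWORDS = 1155    # wordlike nonwords
RATIO = 50           # per-item selection weight of represented items


def sample_epoch(seed=0):
    """Draw the item index for each of the 1232 trials in one epoch."""
    rng = random.Random(seed)
    items = list(range(N_REPRESENTED + N_NONWORDS))
    weights = [RATIO] * N_REPRESENTED + [1] * N_NONWORDS
    return rng.choices(items, weights=weights, k=len(items))


def error_steps(total_steps=16, silent_steps=12):
    """Time steps (1-indexed) on which error is accumulated."""
    return [t for t in range(1, total_steps + 1) if t > silent_steps]


# Chance that any single trial presents a word or acronym under these weights.
p_represented = Fraction(N_REPRESENTED * RATIO,
                         N_REPRESENTED * RATIO + N_NONWORDS)

if __name__ == "__main__":
    epoch = sample_epoch()
    print(len(epoch), len(set(epoch)))  # trials vs. distinct items seen
    print(error_steps())
    print(p_represented)
```

Under these assumptions roughly 10 of every 13 trials present a word or acronym, which is why many nonwords go unseen in any given epoch.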

Results

Autoencoder

The orthographic autoencoder was trained to reproduce the visual inputs corresponding to the 62 words and 15 acronyms on a copy of the input units. For the autoencoder, as for subsequent analyses pertaining to the model’s semantic performance, an output is considered correct if the Euclidean distance between the output representation produced for an item and the target representation for that item is lower than the distance between the output and the target for any other item. After 3000 epochs of training, the autoencoder’s performance was perfect (100%).
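The correctness criterion described here is a nearest-target rule, which can be sketched as follows. The function name and the three toy target vectors are hypothetical and serve only to illustrate the rule.

```python
import math


def is_correct(output, item, targets):
    """True if output is strictly closer to item's own target than to any other.

    targets maps item names to target vectors; distances are Euclidean.
    """
    own_distance = math.dist(output, targets[item])
    return all(
        own_distance < math.dist(output, target)
        for name, target in targets.items()
        if name != item
    )


# Hypothetical targets, for illustration only.
toy_targets = {
    "a": (1.0, 0.0, 0.0),
    "b": (0.0, 1.0, 0.0),
    "c": (0.0, 0.0, 1.0),
}

if __name__ == "__main__":
    output = (0.6, 0.3, 0.1)
    print(is_correct(output, "a", toy_targets))
    print(is_correct(output, "b", toy_targets))
```

Note that this criterion does not require the output to match the target exactly, only to be closer to it than to every competing target.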

Semantics

After 9000 epochs of training, the network was 93% (411/441 items) accurate in producing either correct semantics (in the case of words and acronyms) or silence in the semantic layer (in the case of pseudowords and illegal strings). Of the 30 errors, 15 occurred for pseudowords, and 15 occurred for illegal strings—all items that were actually trained (words and acronyms) were correctly linked with semantics. Figure 6 displays the mean activation in semantics over time for words, acronyms, pseudowords and illegal strings—that is, the data corresponding to the item aggregated ERPs displayed in Figure 3. Two important features of the data are visible in Figure 6.

Orthographic neighborhood size effect in item-aggregated model output. Item types with high N (words, pseudowords) elicit larger simulated N400s than item types with lower N (acronyms, illegal strings). Notice also that, by the end of the epoch, semantically represented items (words, acronyms) are separated from non-represented items (pseudowords, illegal strings), meaning that the model can accurately make lexical decisions. N for items in the model is computed only on the basis of the model’s vocabulary. Units of mean semantic activation are arbitrary. Though there is no formal relationship between model time and real time, the first tick of model time does correspond to stimulus onset, and the end of the model’s processing epoch corresponds roughly to the end of the N400 and onset of the LPC in the ERPs.

First, by the end of the processing epoch, the model has successfully separated words and acronyms from pseudowords and illegal strings, meaning that a simple threshold on mean semantic activation is sufficient for separating semantically represented items from non-represented items in 90% of cases. That is, the model can accurately make lexical decisions based on a single, set activation threshold, despite having never been explicitly trained on lexical decision. In particular, with an activation threshold of 0.0579, 100% of words and acronyms are correctly accepted, and 88% (319/364) of pseudowords and illegal strings are correctly rejected.
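The threshold rule and the accuracy arithmetic behind these figures can be sketched briefly. The threshold of 0.0579 and the item counts are from the text; the function name is hypothetical, and treating "above threshold" as a strict inequality is an assumption.

```python
THRESHOLD = 0.0579  # activation threshold reported in the text


def lexical_decision(mean_semantic_activation, threshold=THRESHOLD):
    """Accept an item as lexical if its mean semantic activation exceeds the threshold."""
    return mean_semantic_activation > threshold


# Counts reported in the text.
represented_accepted = 62 + 15   # all words and acronyms accepted
nonrepresented_total = 279 + 85  # illegal strings and pseudowords
nonrepresented_rejected = 319

rejection_rate = nonrepresented_rejected / nonrepresented_total
overall_accuracy = (represented_accepted + nonrepresented_rejected) / (
    represented_accepted + nonrepresented_total
)

if __name__ == "__main__":
    print(round(rejection_rate, 2), round(overall_accuracy, 2))
```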

Second, as in the N400 data, words and pseudowords tend to elicit more activity in semantics than do acronyms and illegal strings. To investigate the relationships between N, lexicality, and mean semantic activation in the model, we conducted a simultaneous multiple regression on mean semantic activation with N, lexicality, and the N × lexicality interaction as predictors. Mean semantic activation for an item in the model was computed as the average amount of activation elicited by that item across all 16 time steps. This analysis revealed that, just as in the ERPs, there is a large main effect of N on mean semantic activation in the model (β = .0085, 95% confidence interval .0069 < β < .0101), and no interaction between N and lexicality (β = −.0031, 95% confidence interval −.0065 < β < .0002). Unlike the ERP data, however, in the model there was a reliable main effect of lexicality (β = .05, 95% confidence interval .0332 < β < .0712). This is a direct result of aiming to produce a model capable of performing lexical decision: if words and pseudowords (or acronyms and illegal strings) elicited identical mean amounts of semantic activation, they would be impossible to tell apart on that signal.
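The structure of a simultaneous regression with an interaction term can be sketched with a small least-squares solver. Everything here is illustrative: the helper names are hypothetical, lexicality is assumed to be coded 0/1, and the data are synthetic and noise-free (generated from arbitrary coefficients 1, 2, 3, and 0), not the model's output.

```python
def design_row(n, lexical):
    """Predictors: intercept, N, lexicality (0/1), and their product."""
    return [1.0, float(n), float(lexical), float(n * lexical)]


def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic items: N from 0 to 5, crossed with lexicality.
X = [design_row(n, lex) for n in range(6) for lex in (0, 1)]
y = [1.0 + 2.0 * row[1] + 3.0 * row[2] + 0.0 * row[3] for row in X]

if __name__ == "__main__":
    print(ols(X, y))
```

An interaction coefficient near zero, as in the synthetic data, corresponds to parallel regression lines for lexical and nonlexical items with different intercepts.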

We followed up the multiple regression with a focused analysis of N effects in the model, as these effects are particularly prominent in the ERPs. In the model, the single regression of N on mean semantic activation is strongly reliable for both represented items (words and acronyms: r = .40, r² = .16, p < .0001) and non-represented items (pseudowords and illegal strings: r = .48, r² = .23, p < .0001). If the regression is computed over all items (i.e., collapsed over lexicality), the amount of variance explained is comparable to the 30.6% of variance uniquely explained by N in the ERPs (r = .61, r² = .37, p < .0001). Figure 7 presents the model regression data comparable to the ERP regression data presented in Figure 4. Note that, just as in the ERP data, the slopes of the trendlines representing the relationship between N and mean semantic activation in the model are very similar for represented vs. non-represented items (.005 vs. .008, respectively), though the intercepts are different, representing the model’s ability to perform lexical decision.

Orthographic neighborhood size effect in single item model mean semantic activations. Lexical items (words and acronyms) are represented by filled dots, non-lexical items (pseudowords and illegal strings) are represented by empty dots. Note that the slopes representing the relationship between orthographic neighborhood size and mean semantic activation in the model are quite similar, though the intercepts differ. N for items in the model is computed only on the basis of the model’s vocabulary. Units of mean semantic activation are arbitrary.

Discussion

Simulation 1 served several goals. First, it helped to determine whether a PDP reading model with neurally plausible architecture could produce dynamics on its semantic output layer that resembled the N400 ERP component. In this the model was successful: the time course of mean semantic activation for words, pseudowords, acronyms, and illegal strings in the trained model strongly resembled N400 morphology in several critical ways. Namely, semantic output was delayed slightly from the onset of stimulus presentation (i.e., from the time when input was clamped on in the model)—just as the N400 does not onset immediately when a stimulus is presented. When activation began to arise in semantics, it did so in a way consistent with N400 morphology, by monotonically rising and falling into a stable state which was predictive of lexicality. This characteristic in the model is, in fact, not only consistent with N400 morphology but also with the morphology of subsequent components: the Late Positive Complex (LPC), which follows the N400, often displays a relatively tonic level of activation which has been shown to be predictive of the lexicality of the item which elicited it (e.g., Laszlo & Federmeier, 2009; Laszlo, Stites, & Federmeier, in press).

Since the model was successful in producing N400-like dynamics in its output, the second goal was to explore the degree to which its simulated N400 activity resembled N400 activity in the single-item ERP corpus. Here, there were both similarities and differences between the model and the physiological data. Both the model and the physiological data displayed a strong effect of orthographic neighborhood size, with N in fact explaining similar amounts of variance in model and ERPs, and with no interaction between N and lexicality. Additionally, in both the model and the ERPs, the slopes of the regressions of N on mean semantic activation (in the case of the model) or N400 mean amplitude (in the case of the ERPs) were highly similar for lexical and nonlexical items. However, one difference between model and ERPs emerged in these analyses: there was a main effect of lexicality in the model, with words and acronyms eliciting more semantic activity than pseudowords and illegal strings. This was not the case in the ERPs. Based on comparison with previous simulations, it is clear that this difference is largely the result of the model’s ability to perform lexical decision solely on the basis of semantic output: nearly identical models that do not perform the lexical decision task (LDT) show exactly the same pattern as the ERPs (Laszlo & Plaut, 2011). The fact that the model was able to make accurate lexical decisions on the basis of a simple fixed activation threshold, even without being explicitly trained on lexical decision, is important, as it demonstrates that the ERP model, which makes use of several neurally plausible architectural features and was primarily designed to simulate ERP data, is not completely divorced from the vast cognitive modeling literature on reading. Further, it suggests that a neurally plausible model is also a cognitively plausible one.

In sum, the model was largely successful in simulating the phenomena it aimed to simulate, and in demonstrating that a PDP model can simulate not only general properties of ERPs, but also specific, key results pertaining to the obligatory semantics view of N400 processing. In the model, even meaningless, illegal consonant strings elicited activation in semantics, graded by the similarity of those strings to represented items in the training corpus. Additionally, item-level regression analysis indicated that the relationship between N and mean semantic activation was quite similar for semantically represented items and items without semantics. Each of these phenomena, when observed in the ERPs, has been interpreted as being consistent with PDP models, and the present simulations indicate that such an interpretation is warranted.

In light of the model’s successes in Simulation 1, a clear question for additional exploration is: To what degree did the neurally plausible architecture of the ERP model contribute to its success? We investigate this question in Simulation 2, in which the neurally plausible features of the ERP model are removed. Specifically, the constraints on the separation of excitation and inhibition are removed: units in the second set of simulations have no constraints on the sign of their outgoing weights, or on the distribution of inhibitory connections. In what follows, we will refer to this model as the unconstrained model, while the version with excitation and inhibition separated will be referred to as the constrained model. The critical issue to be determined by the unconstrained model is this: will the model still display activation dynamics in semantics that resemble N400 morphology without its neurally plausible features?