Investigating the effects of phonological neighbors on word retrieval and phonetic variation in word naming and picture naming paradigms

Haoyun Zhang; Matthew T Carlson; Michele T Diaz

doi:10.1080/23273798.2019.1686529

Author manuscript; available in PMC: 2021 Jan 1.

Published in final edited form as: Lang Cogn Neurosci. 2019 Nov 5;35(8):980–991. doi: 10.1080/23273798.2019.1686529

PMCID: PMC7540183 NIHMSID: NIHMS1541765 PMID: 33043066

Abstract

Phonological neighbors have been shown to affect word processing. Prior work has shown that when a word with an initial voiceless stop has a contrasting initial voiced stop neighbor, Voice Onset Times (VOTs) are longer. Higher phonological neighborhood density (PND) has also been shown to speed word retrieval and to be associated with longer VOTs. However, these effects have rarely been investigated with picture naming, which is thought to be a more semantically driven task. The current study examined the effects of phonological neighbors on word retrieval times and phonetic variation, and how these effects differed in word naming and picture naming paradigms. Results showed that higher PND was associated with longer VOTs in both paradigms. In contrast, the effect of initial stop neighbors on VOTs was significant only in word naming. These results highlight the influence of phonological neighbors on word production in different paradigms, support interactive models of word production, and suggest that hyper-articulation in speech does not solely depend on communicative context.

Keywords: Language production, Interactive effects, Phonological neighborhood density, Minimal Pair, Voice Onset Time

Introduction

Speaking, or language production, is a fundamental aspect of communication that involves several processes: activating semantic information, selecting the correct lexical entry from the mental lexicon, retrieving phonological information, phonetic encoding, and articulation (Burke & Shafto, 2008; Dell & O’Seaghdha, 1992; Levelt, 1999; Levelt, Roelofs, & Meyer, 1999; Martin, 2003; Schwartz, Dell, Martin, Gahl, & Sobel, 2006). Although the above-mentioned processes are distinct, many word production models suggest that these stages are highly interactive (e.g., Dell, 1986; Dell, Schwartz, Martin, Saffran, & Gagnon, 1997; Goldrick, 2006; Rapp & Goldrick, 2000).

One of the most well-established models of language production is Dell and colleagues’ (1997) two-step interactive activation model. The first step is lemma access, which involves both semantic processing and mapping concepts to the mental lexicon (also referred to as lexical processing). The second step is phonological processing, which involves retrieving the phonological frame of a word and articulation (also referred to as postlexical processing). In interactive models, the activation of any one process can spread to, and in turn influence, the activation of other processes. By contrast, feed-forward models of language production (Levelt, 1999) consist of similar processes, but activation flows only from earlier to later processes. In other words, feed-forward models argue that activation of phonological information cannot spread back to the activation of word forms, which cannot spread back to lemma level activation.

Abundant research has provided evidence for models of word production, by investigating the effects of different word characteristics on word retrieval. For instance, studies have shown that semantic variables (e.g., imageability) affect word naming speed, suggesting feed-back activation from word forms to conceptual information, then back to lexical processing (Shibahara et al., 2003; Strain, Patterson, & Seidenberg, 1995). Likewise, lexical characteristics such as word frequency and naming agreement can also affect word retrieval times (e.g., Barry, Morrison, & Ellis, 1997; Carroll & White, 1973).

Among various word characteristics that modulate word retrieval, the current study focuses on the effects of phonological neighbors. Phonological neighbors are words that can be formed from a given word by substituting, adding, or deleting one phoneme. Phonological aspects of production are of interest as these processes undergo age-related decline (Burke, MacKay, Worthley, & Wade, 1991; Burke & Shafto, 2008; Diaz, Johnson, Burke, & Madden, 2014; Rizio, Moyer, & Diaz, 2017). Moreover, in younger adults, phonological neighborhood density (PND; i.e., the number of phonological neighbors) has been shown to significantly affect word retrieval latency and accuracy in most word naming and some picture naming paradigms, displaying either inhibitory effects (Sadat, Martin, Costa, & Alario, 2014) or more often facilitation effects (Adelman & Brown, 2007; Baus, Costa, & Carreiras, 2008; Mirman, Kittredge, & Dell, 2010; Vitevitch, 2002), which might be subject to the particularities of word formation in specific languages (Vitevitch & Stamer, 2006). The effect of phonological neighbors on word retrieval supports interactive models of language production. Specifically, in interactive models, the activation of the target word’s phonological units spreads to phonological neighbors of the target word, which in turn spreads among neighbors and back to the target word’s phonological units. Because these phonological neighbors are similar to the target word’s phonological representations, target word retrieval will be affected by the activation of its phonological neighbors. These effects cannot be accounted for by feed-forward models of language production, as they do not allow any backward influence from phonological segments to word forms. 
Additionally, other research has shown that higher phonological neighborhood density produces lexically conditioned phonetic variation such as longer voice onset times (VOTs, i.e., the length of time that passes between the release of a stop consonant and the onset of voicing; Fox, Reilly, & Blumstein, 2015), more coarticulation (Scarborough, 2013; Scarborough & Zellou, 2012) and more expanded vowel spaces (Munson & Solomon, 2004; Wright, 2004), which has been suggested to reflect production-internal interactions (i.e., the structure of interactions among processes within the production system, Baese-Berk & Goldrick, 2009) or increased contextual confusability (Buz, Tanenhaus, & Jaeger, 2016).
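The neighbor definition given above (one phoneme substituted, added, or deleted) can be illustrated with a minimal sketch. The phoneme transcriptions, the toy lexicon, and the function names `is_neighbor` and `neighborhood_density` are assumptions made for illustration; they are not the lexical database or procedure used in the study.

```python
from typing import Iterable, Sequence


def is_neighbor(a: Sequence[str], b: Sequence[str]) -> bool:
    """True if b differs from a by exactly one phoneme
    (one substitution, one addition, or one deletion)."""
    a, b = tuple(a), tuple(b)
    if a == b:
        return False  # a word is not its own neighbor
    if len(a) == len(b):
        # Substitution: same length, exactly one mismatching position.
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    # Addition/deletion: removing one phoneme from the longer form
    # must yield the shorter form.
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def neighborhood_density(target: Sequence[str],
                         lexicon: Iterable[Sequence[str]]) -> int:
    """Count the phonological neighbors of target in lexicon."""
    return sum(is_neighbor(target, word) for word in lexicon)


# Hypothetical toy lexicon with illustrative ARPAbet-style transcriptions.
TOY_LEXICON = {
    "cape": ("K", "EY", "P"),
    "gape": ("G", "EY", "P"),        # substitution
    "cake": ("K", "EY", "K"),        # substitution
    "ape": ("EY", "P"),              # deletion
    "capes": ("K", "EY", "P", "S"),  # addition
    "dog": ("D", "AO", "G"),         # not a neighbor
}

if __name__ == "__main__":
    print(neighborhood_density(TOY_LEXICON["cape"], TOY_LEXICON.values()))
```

In a real lexicon the count depends entirely on which word list and transcription scheme are used, so densities from this sketch are not comparable to published PND norms.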

Although phonological neighbors are generally considered to be words differing from each other by one phoneme (addition, deletion, or substitution), the difference can be as small as a single phonetic unit, such as the voicing of the initial consonant (e.g., cape – gape, which begin with voiceless and voiced velar stops, respectively). We will distinguish between such close minimal pairs (henceforth “minimal pairs”), and phonological neighbors more generally, because the existence of a close minimal pair has been linked to phonetic variation in naming words. For instance, two recent studies (Baese-Berk & Goldrick, 2009; Peramunage, Blumstein, Myers, Goldrick, & Baese-Berk, 2011) asked participants to overtly read words with initial voiceless stop consonants to investigate how the presence of a phonetic minimal pair neighbor with a contrasting initial voiced stop affects voice onset time. These two studies reported that the VOTs of words with initial voiceless stop consonants were longer in words that had a contrasting initial voiced stop neighbor than words that did not have such a neighbor (e.g., cake does not have a neighbor *gake). It was suggested that this effect may arise from spreading activation from a close voiced stop neighbor which affected the articulation of the target word that had an initial voiceless stop. Furthermore, Fricke and colleagues (2016) re-analyzed the dataset from Baese-Berk and Goldrick (2009) to investigate the effect of phonological neighbors on the VOTs of minimal pair and non-minimal pair words. They found that both the location of the overlap between neighbors and target words and the total number of phonological neighbors contributed significantly to the VOTs of the target words.
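The minimal pair distinction described above (cape has the voiced-onset neighbor gape, whereas cake has no neighbor *gake) can be sketched as a simple lookup. The transcriptions, the toy lexicon, and the helper name `has_voiced_minimal_pair` are hypothetical and serve only to make the definition concrete.

```python
from typing import Sequence, Set, Tuple

# Voiceless stops and their voiced counterparts at the same place of articulation.
VOICED_COUNTERPART = {"P": "B", "T": "D", "K": "G"}


def has_voiced_minimal_pair(target: Sequence[str],
                            lexicon: Set[Tuple[str, ...]]) -> bool:
    """True if swapping the initial voiceless stop of target for its
    voiced counterpart yields another word in lexicon."""
    onset = target[0]
    if onset not in VOICED_COUNTERPART:
        return False  # only words with initial /p/, /t/, /k/ qualify
    candidate = (VOICED_COUNTERPART[onset],) + tuple(target[1:])
    return candidate in lexicon


# Hypothetical toy lexicon with illustrative ARPAbet-style transcriptions.
LEXICON = {
    ("K", "EY", "P"),  # cape
    ("G", "EY", "P"),  # gape
    ("K", "EY", "K"),  # cake (no entry for *gake)
}

if __name__ == "__main__":
    print(has_voiced_minimal_pair(("K", "EY", "P"), LEXICON))  # cape
    print(has_voiced_minimal_pair(("K", "EY", "K"), LEXICON))  # cake
```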

Although there is considerable evidence supporting the influence of phonological attributes on phonetic variation, there has been debate about the underlying mechanisms. For example, a number of studies have suggested that the hyper-articulation effect (e.g., increased VOTs for voiceless stops) that occurs when a close competitor exists may also be a function of communication context (Buz et al., 2016; Scarborough & Zellou, 2013). Specifically, speakers might produce hyper-speech when factors in a communicative environment place extra demands on listeners. For instance, researchers have found that when listeners misunderstood speech, the size of the hyper-articulation effect significantly increased when a phonetic competitor was presented (Buz et al., 2016; Schertz, 2013). These results suggest that the hyper-articulation effect may serve as a way to clarify speech for the listeners’ benefit. However, studies focusing on natural speech also showed that the existence of a voiced-stop minimal pair predicted significantly longer VOTs, even when no listener was involved (Nelson & Wedel, 2017; Wedel, Nelson, & Sharp, 2018). It may be the case that long-term exposure to hyper-articulated VOTs from speech with listeners could lead to differences in the target pronunciations of those words, even when there is no listener present. Therefore, it is still unclear whether the hyper-articulation effect in speech is for the listeners’ benefit or just a by-product of speech.

Although there is debate about the nature of these effects, most of the evidence reviewed above supports interactive models of language production through the effect of phonological neighbors on word retrieval times and lexically conditioned phonetic variation. This is because, in strictly feed-forward models, the activation of phonology proceeds automatically after a word’s lexical information is selected. Therefore, phonological neighbors of a word cannot be activated or further affect word production. On the other hand, interactive models of language production allow the activation of phonological segments of the target word to feed back to activate other words that share these phonological segments, further affecting the production of the target word.

When exploring the effect of phonological attributes on word retrieval, most previous studies have used either word naming or picture naming paradigms. While both paradigms examine word production, the influence of various processes differs across the paradigms. Specifically, picture naming involves much more feed-forward activation from the semantic level to the lexical level than word naming does. On the other hand, word naming is a more orthographically driven paradigm than picture naming, because the word form is provided. In other words, word naming explicitly provides the orthographic information, providing a route to phonology without necessarily activating semantics. Therefore, a direct comparison between the two paradigms on the effects of phonological neighbors on word retrieval would inform both task-driven influences on phonological processes and theoretical accounts of language production.

In the current study, we systematically examined the effects of phonological neighborhood density and minimal pair status on word retrieval times (i.e., reaction times) and phonetic variations (i.e., VOTs), and how these effects differed in a picture naming paradigm (Experiment 1) and a word naming paradigm (Experiment 2). Moreover, we controlled for several lexical and phonetic characteristics (including word frequency, number of syllables, name agreement in picture naming, average biphone probability, and first vowel height). We hypothesized that phonological neighborhood density and minimal pair status would affect both word retrieval times and phonetic variation, which would support interactive accounts of language production. Additionally, the effect of minimal pair status on word production should be stronger in word naming compared to picture naming, considering that picture naming is a more semantically driven task. In particular, although both picture and word naming involve similar processing steps (i.e., semantic activation, lexical retrieval, phonological encoding), the relative emphasis on each process varies across paradigms. In the case of a semantically driven process, such as picture naming, the effect of feed-forward activation from semantics to lexical selection would be much stronger than it would be in word naming, where a direct orthography-phonology route is available. Additionally, in the case of word naming, where the word form is presented, the activation of a contrasting neighbor with a very similar form and its feed-back activation should be very strong.

Finally, in addition to helping us understand the different processes involved in picture naming and word naming and to evaluate different models of language production, a direct comparison between the two paradigms would also speak to the relationship between hyper-articulation and communication contexts. If hyper-articulation occurs for the purpose of clarifying speech for the listeners’ benefit, we should not see any difference between the two paradigms given that the communication contexts of the two paradigms were the same (i.e., no listener present or feedback provided). On the other hand, if the relationship between phonological neighbors and VOT differs between picture naming and word naming, it would indicate that hyper-articulation in speech does not depend solely on communication contexts.

Experiment 1: Picture Naming

Methods

Participants

Fifty college students participated in this experiment. One was excluded from the analysis because the microphone did not pick up most of the responses due to a soft voice, leaving 49 data sets for subsequent analyses. All participants had normal or corrected-to-normal vision and reported no psychiatric or neurological illnesses. They were all native American English speakers with little knowledge of other languages. All participants gave written, informed consent, and all procedures were approved by the Institutional Review Board at the Pennsylvania State University.

Stimuli and Procedure

Participants completed a picture naming task. Photographs were presented and participants were instructed to overtly name the photograph as quickly and accurately as possible. Target names of photographs began with a voiceless stop consonant. Because VOTs needed to be measured, only target words starting with /p/, /t/, and /k/ were used as critical stimuli. There were two conditions: minimal pair (MP) and non-minimal pair (Non-MP). The MP condition consisted of pictures with target names with voiceless initial stops that have a neighbor with a voiced initial consonant (e.g., target word cape has a voiced neighbor gape). The Non-MP condition was created by pairing every MP word with a non-minimal pair word that has the same stop consonant and a similar first vowel¹, which lacked such a neighbor (e.g., target word cake does not have a voiced neighbor *gake). There were 24 items in each condition and all words started with a CVC format (Consonant-Vowel-Consonant, e.g., cape vs. cake). Thirty filler pictures whose primary names started with other consonants were also included to obscure the experimental hypotheses and to provide a richer phonological set of picture names for participants to produce. For each trial, a fixation cross first appeared on a white background for 1000 ms, followed by a color photograph of an object or action. Participants were instructed to respond with the photograph’s name, using either a noun or a verb. The photograph disappeared immediately after participants made a response or when the maximum response time of 3000 ms was reached. This was followed by a blank screen (duration = 1000 ms). Before the critical trials, participants underwent a practice run consisting of 10 pictures. Stimuli were not repeated across the practice run or experimental conditions. Participants’ reaction times were measured and their responses were recorded using a microphone and a digital recorder.

Photographs were taken from normed databases (Brodeur, Guérard, & Bouras, 2014; Moreno-Martínez & Montoro, 2012) and online resources, and depicted a broad range of common objects and actions. Additionally, we normed the photographs with an initial set of 71 MP and Non-MP words with an independent group of 21 healthy, native American English-speaking adults. We then selected 24 pairs (48 words) which had naming consistencies of 61% or higher. The linguistic characteristics (e.g., word frequency, number of syllables, heights of the first vowel, phonological neighborhood density) of the photograph names were obtained from the International Phonetic Alphabet (IPA) chart and English Lexicon Project (ELP, Balota et al., 2007). The average biphone probability was obtained using the Phonotactic Probability Calculator (Vitevitch & Luce, 2004). For each item, an H-index ($H = \sum_{i=1}^{k} p_i \log_2(1/p_i)$, where $k$ is the number of different names produced to a picture, and $p_i$ is the proportion of participants producing the $i$th name), a measure of naming consistency or agreement (Snodgrass & Vanderwart, 1980), was calculated based on the responses from the 49 participants who participated in Experiment 1.
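As a concrete illustration of the H-index, the following minimal sketch computes it from a list of the names that participants gave to one picture. The function name `h_index` and the example responses are hypothetical, and how non-responses or unacceptable names were handled in the study is not modeled here.

```python
import math
from collections import Counter
from typing import Sequence


def h_index(responses: Sequence[str]) -> float:
    """H = sum over distinct names of p_i * log2(1 / p_i), where p_i is
    the proportion of responses that used the i-th name."""
    n = len(responses)
    if n == 0:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    # log2(n / c) equals log2(1 / p_i) with p_i = c / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


if __name__ == "__main__":
    # Perfect agreement: every participant gives the same name.
    print(h_index(["cap"] * 10))
    # Two names, each produced by half of the participants.
    print(h_index(["cap"] * 5 + ["hat"] * 5))
```

With a single name the sum has one term with p = 1, so H is 0; splitting responses evenly over two names gives H = 1, and H grows as responses spread over more names.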

Data Analyses

Stimuli in the two critical minimal pair conditions (MP and Non-MP) were included in the analysis (i.e., 24 words in each condition). Item-level H-index was calculated based on the number of acceptable alternatives for each item and the proportion of participants who produced each alternative. An H-index of 0 reflects perfect name agreement, and a larger H-index indicates lower name agreement (Snodgrass & Vanderwart, 1980). Response accuracy was coded based on the recordings from the session. Responses were marked as correct only if the participant provided the exact target name (e.g., cap for cap) or a plural form of the same word (e.g., pears for pear). Other responses, hesitations, or omissions were coded as incorrect and comprised 14.13% of trials. Because this criterion was very strict, accuracy was low for some items, but all items had an accuracy higher than 40% (two words had an accuracy lower than 50%). Only correct trials were included in the analyses of reaction time (RT) and voice onset time (VOT).

Prior to analyses, RTs were trimmed: RTs more than 2.5 standard deviations above or below the individual’s overall mean, or shorter than 200 ms, were excluded (2.49% of trials were thus considered outliers and excluded). For each MP and Non-MP stimulus, the VOTs of the initial voiceless stop consonant (i.e., /p/, /t/, /k/) were coded by four independent coders using PRAAT (Boersma & Weenink, 2002). The VOT of a word was calculated as the duration from the onset of the burst to the onset of the first vowel². To ensure reliability in data coding, 10% of the data across the two experiments was randomly selected and coded by all four coders. Inter-coder agreement on VOTs was very high (ICC = .96; according to Koo & Li, 2016, ICC values greater than 0.9 indicate excellent reliability).
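The RT trimming rule can be sketched as follows for a single participant. Two details are assumptions rather than facts from the text: the mean and standard deviation are computed over all of that participant's RTs before the 200 ms floor is applied, and the sample standard deviation is used. The function name `trim_rts` and the example values are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def trim_rts(rts_ms: Sequence[float],
             sd_cutoff: float = 2.5,
             floor_ms: float = 200.0) -> List[float]:
    """Keep RTs that are at least floor_ms and lie within sd_cutoff
    standard deviations of this participant's mean RT."""
    m = mean(rts_ms)
    s = stdev(rts_ms)  # sample SD (an assumption; see lead-in)
    low, high = m - sd_cutoff * s, m + sd_cutoff * s
    return [rt for rt in rts_ms if rt >= floor_ms and low <= rt <= high]


if __name__ == "__main__":
    # Hypothetical RTs in ms: one anticipatory response and one very slow one.
    rts = [150, 580, 600, 620, 640, 660, 700, 610, 590, 630, 650, 605, 3000]
    kept = trim_rts(rts)
    print(len(rts) - len(kept), "trials excluded")
```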

RTs, VOTs, and accuracies were analyzed with (generalized) linear mixed-effects modeling: RTs and VOTs with the lmer function and accuracies with the glmer function of the lme4 package (Bates, Mächler, Bolker, & Walker, 2014) in the R environment (R Core Team, 2014). Unlike ANOVAs, this approach has the advantage of considering individual data points and controlling for variation across participants and items simultaneously, producing more generalizable results. For each dependent variable, we began with a basic model that included fixed slopes of control variables (i.e., H-index, word frequency, number of syllables, and average biphone probability in all models, and reaction time³ and first vowel height in VOT models), random intercepts by participant and by word, and random slopes (by participant) of phonological neighborhood density and minimal pair condition (MP vs. Non-MP)⁴. Next, we followed a stepwise procedure, adding the fixed effect of either phonological neighborhood density or minimal pair condition, and then the other of these two variables. The analysis was performed using both stepwise orders because Condition (MP vs. Non-MP) and PND were related: words in the MP condition had significantly greater phonological neighborhood density than words in the Non-MP condition (p < .001). This analysis allowed us to see whether either variable accounted for additional variance, above that shared by both. We used the anova function to compare models and decide whether the added independent variable significantly improved the model log-likelihood (Barr et al., 2013; R Core Team, 2014). In terms of variable distribution, a general rule of thumb is that data are considered fairly symmetrically distributed if the skewness is between −0.5 and 0.5. Because the distribution of RTs was very skewed (Supplemental Figure 1a; skewness = 1.50), they were log-transformed (skewness = 0.57 after transformation). 
VOTs were not transformed because their distribution was not skewed (Supplemental Figure 1c; skewness = 0.24). Minimal pair condition (MP vs. Non-MP), and first vowel height (low vs. high, with no mid vowels) were contrast coded (−0.5 vs. 0.5). Continuous variables included H-index, number of syllables of the target word, target word log frequency, and target word phonological neighborhood density. Continuous variables were z-scored.
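The preprocessing steps described above (checking skewness, log-transforming RTs, z-scoring continuous predictors, and contrast coding two-level factors) can be sketched as follows. The skewness estimator shown is the moment-based coefficient and the z-scoring uses the sample standard deviation; both are assumptions, since the text does not say which estimators were used. The assignment of −0.5 to MP and 0.5 to Non-MP follows the order in which the levels are listed and is likewise an assumption. All names and data are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def skewness(xs: Sequence[float]) -> float:
    """Moment-based skewness: m3 / m2**1.5 (0 for symmetric data)."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def z_score(xs: Sequence[float]) -> List[float]:
    """Center on the mean and scale by the sample standard deviation."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]


# Contrast codes for the two-level factors (direction is an assumption).
CONDITION_CODE = {"MP": -0.5, "Non-MP": 0.5}
VOWEL_HEIGHT_CODE = {"low": -0.5, "high": 0.5}

if __name__ == "__main__":
    # Hypothetical right-skewed RTs in ms.
    rts = [400, 500, 550, 600, 650, 700, 900, 1500, 2500]
    log_rts = [math.log(rt) for rt in rts]
    print("raw skewness:", round(skewness(rts), 2))
    print("log skewness:", round(skewness(log_rts), 2))
    print(z_score([1, 2, 3]))
```

Because the logarithm compresses large values more than small ones, it pulls in the long right tail that is typical of RT distributions, which is why the skewness drops after transformation.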

Results

Four figures were plotted to demonstrate the effects of the two critical phonological variables on both reaction time and voice onset time (See Figure 1 for the effect of PND on RT, Figure 2 for the effect of MP on RT, Figure 3 for the effect of PND on VOT, and Figure 4 for the effect of MP on VOT). To facilitate comparison, each plot included the results of both Experiment 1 (Panel a) and Experiment 2 (Panel b). Values shown in the figures were observed values of dependent variables. In short, we found that higher phonological neighborhood density was associated with longer VOTs in both experiments, and MP words had longer VOTs than Non-MP words in word naming.

Figure 1. a) represents the relationship between phonological neighborhood density and reaction time in Picture Naming (Experiment 1); b) represents the relationship between phonological neighborhood density and reaction time in Word Naming (Experiment 2).

Figure 2. Effects of minimal pair condition on RTs in a) Picture Naming (Experiment 1); b) Word Naming (Experiment 2). Means and error bars were calculated based on participant level data.

Figure 3. a) represents the relationship between phonological neighborhood density and VOT in Picture Naming (Experiment 1); b) represents the relationship between phonological neighborhood density and VOT in Word Naming (Experiment 2).

Figure 4. Effects of minimal pair condition on VOTs in a) Picture Naming (Experiment 1); b) Word Naming (Experiment 2). Means and error bars were calculated based on participant level data.

Reaction Times

For the picture naming task, the basic model of reaction time included fixed slopes of control variables (i.e., H-index, word frequency, number of syllables, and average biphone probability), random intercepts by participant and by word, and random slopes of phonological neighborhood density and minimal pair condition (MP vs. Non-MP). The final fitted basic models can be found in Supplemental Table 1A. The model was not significantly improved either by adding phonological neighborhood density to the basic model (χ² = .05, df = 1, p = .82), or by adding minimal pair condition in addition to phonological neighborhood density (χ² = 1.42, df = 1, p = .23). In addition, adding minimal pair condition to the basic model did not significantly improve the model fit (χ² = 1.50, df = 1, p = .22), and adding phonological neighborhood density in addition to minimal pair condition did not significantly improve the model fit either (χ² = .08, df = 1, p = .78). In summary, neither phonological neighborhood density (Figure 1a) nor minimal pair condition (Figure 2a) significantly predicted reaction times in the picture naming task.
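The χ² statistics reported here come from comparing nested models that differ by one fixed effect. Assuming these are standard likelihood-ratio tests with df = 1, the statistic is twice the difference in log-likelihood, and its p value can be obtained from the χ² distribution. The sketch below uses only the standard library; the function names are hypothetical, and the closed form for the p value holds only for df = 1.

```python
import math


def lrt_chi2(loglik_reduced: float, loglik_full: float) -> float:
    """Likelihood-ratio statistic for two nested models:
    2 * (log-likelihood of full model - log-likelihood of reduced model)."""
    return 2.0 * (loglik_full - loglik_reduced)


def p_value_df1(chi2: float) -> float:
    """Upper-tail p value of a chi-square statistic with 1 degree of
    freedom, using P(X > x) = erfc(sqrt(x / 2))."""
    if chi2 < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Chi-square values taken from the reaction time model comparisons.
    for chi2 in (0.05, 1.42, 1.50, 0.08):
        print(chi2, round(p_value_df1(chi2), 2))
```

Running the chi-square values from the text through `p_value_df1` is a quick consistency check on the reported p values; it does not refit the models, which would require the trial-level data.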

Accuracy

The mean accuracy across all items and participants was 88.01%. A mixed logistic regression was conducted on the number of response errors to explore the effect of phonological neighborhood density and minimal pair condition. A basic model of accuracy included the same variables as the reaction time model (see Supplemental Table 1A for full fitted model details). Adding phonological neighborhood density (χ² = .25, df = 1, p = .62) or the minimal pair condition (χ² = 1.12, df = 1, p = .29) to the basic model did not significantly improve the model fit. Adding either variable in addition to the other did not improve the model fit either (adding MP condition to PND: χ² = 1.65, df = 1, p = .20; adding PND to MP condition: χ² = .78, df = 1, p = .38). In summary, similar to reaction time models, neither phonological neighborhood density nor minimal pair condition significantly predicted picture naming accuracy.

VOT

A basic linear mixed-effect model on VOTs included fixed slopes of control variables (i.e., H-index, word frequency, number of syllables, average biphone probability, first vowel height, and log-transformed reaction time), random intercepts by participant and by word, and random slopes of phonological neighborhood density and minimal pair condition (MP vs. Non-MP). The final fitted basic model can be found in Supplemental Table 1A. The log-transformed RT was included in the model to account for the potential carry-over effect of word retrieval on VOTs⁵. Adding phonological neighborhood density to the basic model significantly improved the model fit (χ² = 6.55, df = 1, p = .01). This result indicated that phonological neighborhood density was a significant predictor of VOTs in picture naming. Adding minimal pair condition in addition to PND did not significantly improve the model fit (χ² = .001, df = 1, p = .97). In the reverse order, adding minimal pair condition to the basic model did not significantly improve the model fit (χ² = 1.17, df = 1, p = .28), but adding phonological neighborhood density in addition to MP condition again significantly improved the model fit (χ² = 5.38, df = 1, p = .02). In summary, higher phonological neighborhood density was associated with longer VOTs (Figure 3a), while minimal pair condition did not significantly predict the VOTs in picture naming (Figure 4a).