Abstract
Language learning requires that listeners discover acoustically variable functional units like phonetic categories and words from an unfamiliar, continuous acoustic stream. Although many category learning studies have examined how listeners learn to generalize across the acoustic variability inherent in the signals that convey the functional units of language, these studies have tended to focus upon category learning across isolated sound exemplars. However, continuous input presents many additional learning challenges that may impact category learning. Listeners may not know the timescale of the functional unit, its relative position in the continuous input, or its relationship to other evolving input regularities. Moving laboratory-based studies of isolated category exemplars toward more natural input is important to modeling language learning, but very little is known about how listeners discover categories embedded in continuous sound. In 3 experiments, adult participants heard acoustically variable sound category instances embedded in acoustically variable and unfamiliar sound streams within a video game task. This task was inherently rich in multisensory regularities with the to-be-learned categories and likely to engage procedural learning without requiring explicit categorization, segmentation, or even attention to the sounds. After 100 min of game play, participants categorized familiar sound streams in which target words were embedded and generalized this learning to novel streams as well as isolated instances of the target words. The findings demonstrate that even without a priori knowledge, listeners can discover input regularities that have the best predictive control over the environment for both non-native speech and nonspeech signals, emphasizing the generality of the learning.
Keywords: language learning, speech perception, speech categorization, auditory categorization, segmentation
A radio news show broadcast in an unfamiliar language seems to race by, giving the impression that the language uses very long words and that the broadcaster barely pauses for breath. This impression arises, in part, because the acoustic speech signal does not consistently highlight linguistically significant units with pauses like the spaces that mark words in text (Cole & Jakimik, 1980). Through experience, listeners must discover a constellation of diagnostic acoustic and statistical cues such as prosody, stress patterns, allophonic variation, phonotactic regularities, and distributional properties that support word segmentation (see Jusczyk, 1999). Complicating matters for adults listening to speech in a non-native language, native-language segmentation cues influence adults’ evaluation of non-native speech and may lead to inaccurate segmentation when the languages’ cues do not align (e.g., Altenberg, 2005; Barcroft & Sommers, 2005; Cutler, 2000; Cutler, Mehler, Norris, & Segui, 1986; Cutler & Otake, 1994; Flege & Wang, 1990; Weber & Cutler, 2006).
Making sense of an unfamiliar, continuous acoustic stream like a foreign news broadcast is further complicated by the fact that the acoustics of neighboring speech sounds, syllables, and words do not stack neatly like adjacent pearls on a string (Hockett, 1955). Rather, they intermingle so that there is substantial acoustic variability in the realization of speech produced in different contexts (Fougeron & Keating, 1997; Moon & Lindblom, 1994). Additional acoustic variability arises from inherent differences across talkers (Johnson, Ladefoged, & Lindau, 1993; Peterson & Barney, 1952) and even from outside sources like room acoustics (Watkins, 2005; Watkins & Makin, 2007). As a result, the functional “units” of language (such as phonetic categories or words) that must be discovered from the continuous sound stream vary considerably in their physical acoustic realization. To communicate effectively, listeners must come to treat these variable instances as functionally equivalent. To do so, they must discover linguistically relevant variability, generalize across linguistically insignificant variability, and relate this learning to new instances; they must learn to categorize speech (Holt & Lotto, 2010). The category learning that begins in infancy for the native language (e.g., Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992; Kuhl et al., 2006; Werker & Tees, 1983) may complicate listening to speech in a non-native language in adulthood because well-learned native categories may not align with categories of the non-native language (Best, 1995; Best, McRoberts, & Goodell, 2001; Flege, 1995). Such is the case in the classic example of native Japanese adults’ difficulty with the English /r/ and /l/ (Goto, 1971; Iverson et al., 2003; Miyawaki et al., 1975).
But how do listeners discover the acoustic variability that is linguistically relevant while also discovering the cues that support segmenting what is linguistically relevant from continuous sound? These two learning challenges are inherently concurrent; in natural spoken language, listeners must discover functional units relevant to language from a fairly continuous spoken sound stream without a priori knowledge of the temporal window that characterizes the units (e.g., phoneme, syllable, word). Since the detailed acoustics of the units vary across instances as a function of context and other factors, learners must generalize beyond highly variable experienced acoustics to new instances and, ultimately, relate these units to referents in the environment. Klein (1986) refers to this as the adult language learner’s “problem of analysis” (p. 59). We know very little about speech category learning in this richer context because laboratory studies typically investigate speech category learning across isolated, individuated sounds (e.g., syllables or words) that are not embedded in fluent, continuous sound (e.g., Grieser & Kuhl, 1989; Ingvalson, Holt, & McClelland, 2012; Kuhl et al., 1992; Lim & Holt, 2011; Lively, Logan, & Pisoni, 1993; Werker & Tees, 1983, 1984).
The present research addresses how adults contend with the problem of analysis by placing listeners in a toy model of the language-learning environment—an immersive video game in which novel, continuous sound embedded with functionally relevant, though acoustically variable, category instances serves to support adaptive behavior through its relationship with visual referents. In this way, we examine auditory category learning in the context of continuous sound.
It is important that the sounds experienced in the video game be as unfamiliar as possible in order to control and manipulate listeners’ histories of experience. In Experiment 1, this was accomplished using a natural language (Korean) unfamiliar to listeners. In Experiments 2 and 3, we exerted even stronger control over listeners’ familiarity with the sounds by creating a completely novel soundscape. To do so, we applied an extreme acoustic manipulation, spectral rotation, to English sentences (Blesser, 1972). This rendered the speech wholly unintelligible while preserving the spectrotemporal acoustic complexities that characterize the multiple levels of regularity (and variability) present in natural speech. Specifically, we spectrally rotated each utterance so that the acoustic frequencies below 4 kHz were spectrally inverted. In contrast to natural speech (including the Korean speech in Experiment 1), these spectrally rotated sounds had no acoustic energy above 4 kHz. Although spectral rotation preserves some of the acoustic regularities present in natural speech, listeners do not readily map rotated speech to existing language representations (Blesser, 1972). Using these highly unusual acoustic signals that nonetheless capture the spectrotemporal regularities and complexities of speech, we investigated the extent to which learning in the context of simultaneous category learning and segmentation challenges generalizes to nonspeech auditory signals that are impossible productions for a human vocal tract (see Scott, Blank, Rosen, & Wise, 2000).
In the present study, our aim is to determine whether listeners discover a particular functional unit—embedded target words—within continuous sound streams. We use naturally spoken sentences, each with one of four target words embedded. The sentences are recorded multiple times so that, across recordings, the acoustics are variable, with considerable coarticulation and natural variation in rate and amplitude. Critically, even the target words are acoustically variable across utterances. Experiment 1 listeners experience unfamiliar non-native Korean sentences recorded by a native Korean speaker. For Experiments 2 and 3, we begin with English sentences spoken by a native English talker. We then spectrally rotate these sentences, rendering them unintelligible. This approach to creating a novel soundscape of continuous sound embedded with to-be-learned auditory categories preserves the variability and regularity of natural spoken language across sounds that are novel and unintelligible to naïve listeners. Across all three experiments, the challenge for listeners is to discover the functional equivalence of the acoustically variable target words from the continuous stream of unfamiliar, non-native, and unintelligible nonspeech sounds without a priori knowledge of the temporal window that characterizes this unit; that is, they must discover the new categories from continuous sound.
One means by which learners may do so is via the relationship of the sound categories to visual referents in the environment. Visual referents co-occurring with sound regularities are known to support both speech categorization and segmentation (Thiessen, 2010; Yeung & Werker, 2009), but it is as yet unclear the extent to which they may support discovery of auditory categories from continuous sound streams because studies have typically paired a single presentation of a visual referent with the onset of a corresponding sound (sometimes with a slight temporal jitter) or systematically presented referents with an isolated word or syllable (Cunillera, Laine, Càmara, & Rodríguez-Fornells, 2010; Thiessen, 2010; Yeung & Werker, 2009). Nonetheless, it may be hypothesized that the presence of co-occurring visual referents may support category learning in the context of continuous sound by signaling the distinctiveness of acoustically similar items across referents (Thiessen, 2010; Wade & Holt, 2005; Yeung & Werker, 2009) and/or the similarity of acoustically distinct items paired with the same referent.
Unlike previous studies that have temporally synchronized audiovisual presentation, in the present study, the appearance of a visual referent instead coincides with what can be thought of as a short “paragraph” of speech (or nonspeech, in the case of Experiments 2 and 3), with a target word appearing at different positions within the constituent sentences. To illustrate, imagine a speaker holding a book and saying, “Hey there, have you seen my book? It is a book I checked out from the library. It is the book with a red cover. ” The visual referent, present throughout the speech, serves as a correlated visual signal for the acoustically variable instances of book peppering the continuous acoustic stream. The approach we take in the present experiments is similar. The present stimulus paradigm is distinct from previous research in its approach to auditory–visual correspondence and moves a step beyond studies of how learners track consistent audiovisual mapping across multiple encounters with established sound units (e.g., Smith & Yu, 2008; Yu & Smith, 2007) because the units of sound, their temporal extent, and their position within the continuous sound stream are entirely unknown to learners.
We present the continuous sounds and their visual referents to adult learners in the context of an immersive video game (Leech, Holt, Devlin, & Dick, 2009; Lim & Holt, 2011; Liu & Holt, 2011; Wade & Holt, 2005). This environment allows for strict experimental control while more closely modeling the natural learning environment’s converging multimodal information sources, the need to use this auditory information to guide action, and the internal feedback present from successfully making predictions about upcoming events. It is a considerable departure from category learning tasks that make use of overt perceptual decisions and/or explicit performance feedback (e.g., Bradlow, Pisoni, Akahane-Yamada, & Tohkura, 1997; Chandrasekaran, Yi, & Maddox, 2014; Lively et al., 1993; Logan, Lively, & Pisoni, 1991) and also from entirely passive exposure paradigms in which listeners hear instances or streams of sounds without any overt behavior directed at the stimuli (e.g., Maye & Gerken, 2000, 2001; Saffran, Aslin, & Newport, 1996). To succeed within the video game, listeners must use sound to predict events and navigate the environment. The extent to which their predictions about the sounds are met with success or failure in meeting the goals of the game provides a form of internal feedback that may be supportive of learning (Lim & Holt, 2011; see Lim, Fiez, & Holt, 2014, for a review). This approach provides a means of investigating incidental, active learning that may more closely model some aspects of learning in the natural environment (see Lim & Holt, 2011; Lim et al., 2014).
Previous research has demonstrated that, within this game, adults readily learn to categorize nonspeech sounds with a complex acoustic structure (Leech et al., 2009; Liu & Holt, 2011; Wade & Holt, 2005) that are not learned readily through passive exposure (Emberson, Liu, & Zevin, 2013; Wade & Holt, 2005). Moreover, adults improve in second-language speech categorization when non-native syllables are presented in the game environment (Lim & Holt, 2011). However, as is typical of category learning experiments, each of these prior studies has simplified the category learning challenge by presenting listeners with isolated, segmented category exemplars in training.
In marked contrast, the present study provides no individuation; the visual referents are presented along with continuous streams of sound. Embedded within the continuous sound stream, exemplars from one of four acoustically variable sound categories (the target words) are associated with each visual referent. The temporal duration of the targets is unknown to participants and variable, as is the position of the targets in the continuous sound stream. However, the stimulus inventory is constructed such that the target word, although acoustically variable, is the portion of the acoustic stream most reliably related to a particular visual referent. Though acoustically variable, in a relative sense the target words are islands of reliability within the highly variable continuous acoustic signal because they are the only segment of the continuous acoustic stream that co-occurs consistently with a particular visual referent.
We predict that listeners will be sensitive to input regularities at the largest possible grain size that gives them predictive control over their environment (Ahissar & Hochstein, 2002). In the case of the artificial sound inventories exploited in the present experiments, these are the acoustically variable, yet relatively more reliable, target words. Despite the lack of an overt categorization or segmentation task and corresponding explicit feedback, we hypothesize that experience with the acoustic regularities characterizing targets will lead listeners to learn to categorize familiar sentences possessing the target words, to generalize to unfamiliar sentences that include the target words, and to generalize this knowledge to isolated instances of the target words never heard previously, thus demonstrating a beginning ability to segment the relevant acoustics from continuous sound and generalize category learning to new instances.
Experiment 1
In this experiment, we examined whether adult native English listeners can learn to segment target words appearing within a natural but unfamiliar spoken language, Korean. English and Korean are quite distinct phonologically, morphologically, and grammatically. Like English, Korean is a nontonal language, but unlike English it has a three-way consonant contrast (often described as lax, tense, and aspirated) and a vowel inventory that includes monophthongs and two distinct kinds of diphthongs. In addition, Korean is considerably more morphologically complex than English, and its canonical word order is subject–object–verb rather than English’s subject–verb–object (Grayson, 2006). There are large phonotactic differences between the two languages, particularly in the treatment of consonant clusters and the marking of word boundaries (Kabak & Idsardi, 2007). The two languages also differ in rhythmic structure: Korean is often described as syllable-timed, whereas English is a stress-timed language. Thus, engaging monolingual English listeners with unfamiliar, linguistically quite distinct Korean provides an ecologically realistic instantiation of the challenge of learning speech categories from continuous sound in adult language learning.
Method
Participants
Thirty-two monolingual English participants from Carnegie Mellon University were recruited for $30 compensation. All participants were unfamiliar with the Korean language; they had neither studied nor been exposed to spoken Korean. All reported normal hearing. An additional 11 participants with the same characteristics served as naïve listeners in a brief test of the stimulus materials.
Stimuli
There were four to-be-learned Korean target words, translated as English red (/p*alkan/), blue (/pharan/), green (/t∫horok/), and white (/hayan/),1 each uttered in six different sentence contexts in a natural, coarticulated manner by a native Korean female speaker (Sung-Joo Lim) in a sound-isolated room (16 bit, 22.05 kHz). Each target word–sentence pairing was recorded four times to further increase acoustic variability. Across this 96-sentence training stimulus set (6 sentences × 4 words × 4 recordings), the four target words defined the to-be-learned categories and, because of their within-category acoustic variability, modeled the challenge of learning functional equivalence classes. Moreover, because the target words appeared in sentence-initial, -medial, and -final positions in fluent speech and were never presented to listeners in isolation, the stimuli modeled the challenge of learning categories from continuous sound. Three additional sentences with the target words embedded and an isolated utterance of each target word were recorded to test generalization at posttest. Table 1 lists the sentences. On average, sentence stimuli were 2.25 s in duration (minimum = 1.69 s, maximum = 3.20 s), and isolated target words were 0.94 s in duration (minimum = 0.80 s, maximum = 1.05 s).
Table 1.
Note. The brackets denote the placement of Korean target words, translated as blue, white, green, and red. There were four unique recordings of each sentence to increase acoustic variability.
Eleven native English participants were recruited separately and asked to freely sort the stimuli into four groups in a consistent manner, without familiarization or instructions about how to base their decisions. This provided a baseline measure of naïve English listeners’ categorization of the Korean sentences according to the target word. After sorting the sentences, participants were familiarized with the sentences through 5 min of passive listening (a total of 60 sentence exposure trials) and then tested again. The naïve listeners exhibited above-chance (25%) consistency in their sorting, M = 36.2%, SD = 7.2%, t(10) = 5.14, p ≤ .001, Cohen’s d = 1.55, even before the familiarization phase. However, there was no change in sorting consistency following familiarization, Mchange ≤ 1%, F(1, 10) = 0.345, p > .5. Thus, naïve native English listeners could discover some regularities in the continuous stream of Korean sentences even without training, but brief passive listening did not further boost sorting performance. These data provide a baseline against which to compare learning within the video game paradigm.
Procedure
Training was accomplished using the Wade and Holt (2005) video game paradigm. In the game, subjects navigated through a pseudo-three-dimensional space, encountering four animated “alien” creatures, each with a unique shape, color, and movement pattern. Each alien originated from a particular quadrant of the virtual environment (random jitter was added to the starting position, and quadrant assignment was counterbalanced across subjects). Participants’ task was to capture two “friendly” aliens and destroy two “enemy” aliens; an alien’s identity as friend or enemy was conveyed via the shape and color of a shooting mechanism on the screen (see Figure 1) and was counterbalanced across participants.
Each alien was associated with one target word (randomized in the assignment across participants) such that each time it appeared, multiple randomly selected exemplars (from the set of 24 sentences with the target word embedded; see Table 1) were presented in a random order through the duration of the alien’s appearance until the participant completed a capturing or destroying action. Thus, subjects heard continuous speech, with target words associated with the visual image of the alien, the spatial quadrant from which the alien originated, and the motor–tactile patterns involved in capturing or destroying the alien. This exposure was fairly incidental; the sounds were of no apparent consequence to performance, and participants were instructed only to navigate the game. Early in game play, aliens appeared near the center of the screen and approached slowly, with alien appearance synchronized to the onset of the entire utterance of a randomly drawn sentence. This provided participants time to experience the rich and consistent regularities between the visual referents and the sound categories. There was a great deal of visual and spatial information with which to succeed in the task, independent of the sound categories. Participants were not informed of the nature of the sounds nor their significance in the game. Other sound effects (including continuous, synthetic background music) were also present.
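To make this presentation scheme concrete, the sketch below illustrates one way the per-alien sound stream could be generated. It is a minimal sketch of the logic described above, not the game’s actual implementation; the helper callbacks (alien_is_on_screen, play) and the representation of the sentence pool are hypothetical stand-ins.

```python
# Minimal sketch of the Experiment 1 sound-presentation scheme (assumed logic):
# while an alien is visible, sentence recordings containing its associated
# target word play back-to-back in random order.
import random

def present_alien_sounds(alien, sentence_pool, alien_is_on_screen, play):
    """sentence_pool: the 24 recordings (6 sentences x 4 recordings) embedding
    the alien's target word. `alien_is_on_screen` and `play` are hypothetical
    game callbacks; `play` blocks until a recording finishes."""
    while alien_is_on_screen(alien):
        for recording in random.sample(sentence_pool, len(sentence_pool)):
            if not alien_is_on_screen(alien):
                break
            play(recording)  # continuous speech tied to the visual referent
```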
As the game progressed, the speed and difficulty of the required tasks increased so that quick identification of approaching aliens by means of their characteristic sounds was, while never required or explicitly encouraged, of gradually increasing benefit to the player. As the game difficulty increased, continued progress was only possible with reliance upon auditory cues. At higher levels of the game, players could hear the aliens before they could see them and, thus, if they had learned about the relationship of the acoustically variable sounds to the visuospatial characteristics of the alien, they could use the acoustic information to orient behavior more quickly to succeed in the goals of the game. At the highest game levels, targeting became nearly impossible without rapid sound categorization, which predicted the alien and its quadrant of origin and provided participants a head start on navigating and orienting action in the right direction. In this way, sound served as a cue to predict appropriate action, thus encouraging sound category learning through its utility for functioning in the environment, without requiring overt categorization responses. Of special note with regard to our research aims, each appearance of an alien triggered continuous presentation of sound(s) from the category. There was no indication of the relevant functional units (or that there were relevant units to be discovered) or the temporal window across which they unfold and no explicit feedback about sound segmentation or categorization.
Participants played the video game for two 50-min sessions separated by a 10-min break. An explicit posttest assessed learning and generalization immediately after training. Participants responded to 10 stimuli for each target word (four repetitions each): one randomly chosen utterance of each of the six familiar sentences experienced in training and the four novel test stimuli given in Table 1. Participants saw a game screen with the four aliens, each positioned in its typical quadrant. On each trial, a single randomly drawn stimulus was played repeatedly for as long as 5.5 s. Participants used the arrow keys as a means of classifying the given sound stimulus. It is of note that while most categorization learning studies use relatively comparable training and testing experiences (e.g., highly similar explicit categorization tasks with the same category mapping labels or response keys), the posttest in the current study is highly distinct from the incidental nature of sound categorization in the video game training experience. Therefore, we analyzed listeners’ response patterns across consistent response–sound stimulus mappings to measure the overall correct categorization performance.
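One plausible way to implement the consistency-based scoring described above is sketched below: each participant’s posttest accuracy is computed under the response-key-to-word mapping that best explains that participant’s responses. This is an illustrative reading of the analysis, not the authors’ published procedure, and the integer coding of words and responses is assumed.

```python
# Hedged sketch: score posttest accuracy under the most consistent
# response-to-word mapping for each participant (assumed scoring scheme).
from itertools import permutations
import numpy as np

def best_mapping_accuracy(target_words, responses, n_categories=4):
    """target_words, responses: equal-length arrays coded 0..3 (assumed format)."""
    target_words = np.asarray(target_words)
    responses = np.asarray(responses)
    best = 0.0
    for mapping in permutations(range(n_categories)):   # 4! = 24 candidate mappings
        predicted = np.take(mapping, responses)          # remap each response key to a word
        best = max(best, float(np.mean(predicted == target_words)))
    return best

# Example: a participant who responds perfectly consistently but with "swapped" keys
print(best_mapping_accuracy([0, 1, 2, 3, 0, 1], [1, 0, 3, 2, 1, 0]))  # -> 1.0
```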
Both the video game training and explicit posttest categorization tasks were presented on the center of a computer monitor (600 × 600 pixels) mounted on the wall of a sound-attenuated booth. Participants used a keyboard to interact with the game. All sounds were presented through headphones at a comfortable listening level (approximately 70 dB). The mapping between the alien creatures and the target word categories was randomized across participants, thus destroying the color match between the visual appearance of the alien and the target word meaning (in Korean) across participants.
Data from additional naïve native English participants served as a baseline accuracy measure in categorizing the Korean sentences without training. Comparing baseline performance across, rather than within, participants ensured that the trained participants entered into training without an explicit indication of the significance of the sounds to the game, or the learning questions under investigation.
Results and Discussion
Despite the presence of acoustic variability introduced by the stimulus materials, the lack of a consistent temporal window across which to predict the target word’s appearance in a sentence, and the absence of performance feedback or an explicit categorization or segmentation task, participants categorized Korean target words above chance (25%) at posttest. A one-sample t test revealed that Experiment 1 listeners’ posttest categorization performance was significantly different from chance [M = 53.5%, SD = 27.0%, t(31) = 5.98, p ≤ .001, d = 1.06].2 Moreover, Experiment 1 listeners were significantly more accurate than the naïve participants who sorted the Korean stimuli without training. An independent-samples t test comparing Experiment 1 participants’ categorization performance with naïve participants’ baseline sorting performance revealed a significant group difference in categorizing Korean stimuli, Mdiff = 17.2%, t(41) = 2.09, p = .043, d = 0.73. We further investigated whether learning was observed for familiar and novel generalization sentences as well as isolated instances. One-sample t tests against chance (25%) revealed that Experiment 1 participants reliably categorized familiar training sentences experienced during the video game training [M = 52.4%, SD = 26.9%, t(31) = 5.76, p ≤ .001, d = 1.02], as well as novel generalization sentences [M = 53.9%, SD = 28.1%, t(31) = 5.84, p ≤ .001, d = 1.03] and isolated Korean target words [M = 59.6%, SD = 29.6%, t(31) = 6.61, p ≤ .001, d = 1.17] that were never experienced during training. This indicates that participants were able to generalize what they learned about the acoustically variable functional units (the target words) from continuous speech (see Figure 2).
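For readers who want to reproduce this style of analysis, the sketch below runs a one-sample t test of proportion-correct scores against the 25% chance level and computes Cohen’s d from the sample mean and standard deviation. The accuracy values here are random placeholders, not the reported data.

```python
# Minimal sketch of the chance-level comparison (placeholder data, not the
# actual Experiment 1 scores): one-sample t test against 25% chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
posttest_acc = rng.uniform(0.2, 0.9, size=32)   # stand-in proportion-correct scores, one per participant
chance = 0.25                                    # four response alternatives

t, p = stats.ttest_1samp(posttest_acc, popmean=chance)
d = (posttest_acc.mean() - chance) / posttest_acc.std(ddof=1)   # Cohen's d for a one-sample test
print(f"t({posttest_acc.size - 1}) = {t:.2f}, p = {p:.3g}, d = {d:.2f}")
```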
We further examined whether listeners’ categorization performance differed across the placement of the target words in sentences as well as for isolated instances. A repeated-measures analysis of variance (ANOVA) on participants’ posttest categorization of sentence-initial, -medial, and -final, as well as isolated, target words revealed a significant effect of target word placement [F(3, 93) = 6.74, p ≤ .001]. Post hoc comparisons revealed that categorization performance for sentence-initial target words was equivalent to that for isolated target words [Mdiff = 1.3%, t(31) = 0.66, p > .5, d = 0.12] and significantly more accurate than categorization of target words in the sentence-medial [Minitial–medial = 7.5%, t(31) = 2.58, p = .015, d = 0.46] or -final positions [Minitial–final = 8.8%, t(31) = 3.16, p = .003, d = 0.56], which did not differ from one another [Mmedial–final = 1.3%, t(31) = 0.83, p > .4, d = 0.15]. One factor that may have influenced this pattern of results is the degree of acoustic variability across utterances; sentence-initial and isolated target words may have been uttered with somewhat less coarticulation than sentence-medial or -final targets. In particular, because grammatically valid Korean sentences cannot end with the target word (see Training 2 and 5 and Test 2 in Table 1), the nominally sentence-final targets were followed by additional material, which may have exaggerated their coarticulation. It is also possible that recognition was facilitated by the brief, natural pause that would precede target words in the sentence-initial position. Although the repeated presentation of sentences reduced demands on short-term auditory memory, a general primacy bias may have facilitated categorization of target words located at utterance boundaries (Aslin, Woodward, LaMendola, & Bever, 1996).
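As an illustration of this kind of analysis, the sketch below runs a one-way repeated-measures ANOVA over word placement using statsmodels’ AnovaRM. The data frame is filled with random placeholder cell values rather than the reported accuracies.

```python
# Hedged sketch: repeated-measures ANOVA on categorization accuracy by target
# word placement (placeholder data, not the reported Experiment 1 values).
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)
placements = ["initial", "medial", "final", "isolated"]
df = pd.DataFrame(
    [{"subject": s, "placement": p, "accuracy": rng.uniform(0.25, 0.9)}
     for s in range(32) for p in placements]   # one cell value per subject x placement
)
print(AnovaRM(df, depvar="accuracy", subject="subject", within=["placement"]).fit())
```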
It is of note that performance varied across target words. A repeated-measures ANOVA revealed a main effect of target word, F(3, 93) = 2.986, p ≤ .035, with about a 5% disadvantage in categorizing Korean blue (/pharan/) and red (/p*alkan/) relative to the other targets. This may be due to the relatively greater acoustic–phonetic similarity of these two targets.1 Alternatively, it may indicate that listeners relied solely on the initial consonant rather than learning to segment the two-syllable word unit from speech. However, arguing strongly against this interpretation, there was no difference in listeners’ categorization performance across isolated target words [F(3, 93) = 1.152, p = .332].
In sum, incidental learning within the video game paradigm was sufficient to induce reliable categorization of acoustically variable functional units from continuous sound, even without explicit categorization or feedback, and without exact temporal synchrony of auditory category exemplars and visual referents. Based on listeners’ ability to generalize learning to novel sentences with target words embedded and also to isolated instances of the target word, we conclude that listeners began to discover the temporal grain size in the continuous acoustic input that granted them predictive control over the environment—the target words. Though acoustically variable, the target words were the window of acoustic information within the continuous sound that best correlated with the appearance of specific visual referents.
Experiments 2 and 3
Although listeners in Experiment 1 were unfamiliar with Korean, it is nonetheless possible that they were able to exploit commonalities between Korean and English (e.g., overlapping phonetic information) for category learning. This possibility is supported by the above-chance baseline ability of naïve listeners to sort the Korean sentences. Thus, in Experiments 2 and 3, we eliminated the potentially buttressing effects of language similarities by training adults with English speech signals radically manipulated using spectral rotation (Blesser, 1972). This signal processing technique preserves much of the acoustic regularity and variability of speech but renders the sentences wholly unintelligible. Since these signals are not perceived as speech, another aim of the experiments was to examine whether the learning observed in Experiment 1 generalizes to nonspeech acoustic signals.
Method
Participants
Forty native Swedish adults from Stockholm University participated in Experiment 2, and 37 participated in Experiment 3. Participants volunteered in return for a light meal and two cinema tickets. All reported normal hearing.
Stimuli
Training and generalization sentences similar to those of Experiment 1 were used, preserving the variation of the target word position within the sentences (sentence-initial, -medial, and -final positions). There were four to-be-learned target words (red, green, blue, and white), each spoken in English in a natural, coarticulated manner and uttered in six different digitally recorded sentences (16 bit, 22.05 kHz) by a monolingual English female talker (Lori L. Holt) in a sound-isolated room (see Table 2). The average duration of sentence stimuli was 1.31 s [minimum = 0.93 s, maximum = 1.78 s], and that of the isolated target words was 0.54 s [minimum = 0.46 s, maximum = 0.64 s]. Measures of the acoustic variability present across the multiple stimulus contexts are presented in the Appendix.
Table 2.
| Training | Test |
|---|---|
| 1. Shoot the [ ] one. | 1. [ ] |
| 2. [ ] is an enemy. | 2. Which guy is [ ]? |
| 3. Look out for the [ ]. | 3. Choose [ ] now. |
| 4. Watch for [ ] aliens. | 4. [ ] targets now. |
| 5. The bad guy is [ ]. | |
| 6. [ ] invaders are coming. | |
Note. The brackets denote the placement of keywords (blue, white, green, and red).
Beginning from these natural utterances, we applied an extreme acoustic manipulation to render the speech unintelligible while preserving the spectrotemporal acoustic complexities that characterize the spoken language-learning environment. Each utterance was spectrally rotated using a digital version of the technique described by Blesser (1972). The sounds were low-pass filtered to remove acoustic energy above 4 kHz, and the remaining frequencies below 4 kHz were inverted in the spectrum.3 The result is much like an unintelligible “alien” language that possesses the temporal and spectral complexity of ordinary speech. Blesser (1972) reported that it took intensive training over a period of weeks for listeners to extract meaning from rotated speech. Thus, listeners cannot readily map the acoustic patterns of spectrally rotated speech to existing language representations.
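The sketch below shows one standard digital implementation of this kind of rotation: low-pass filtering, ring modulation with a 4 kHz carrier (which mirrors each frequency f to 4 kHz − f), and a second low-pass filter to remove the upper sideband. The filter order, file names, and level normalization are illustrative assumptions, not the exact processing pipeline used for these stimuli.

```python
# Minimal sketch of spectral rotation about 2 kHz for 0-4 kHz content
# (assumed implementation details; cf. Blesser, 1972).
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

def spectrally_rotate(x, fs, f_edge=4000.0):
    """Low-pass at f_edge, modulate with an f_edge carrier, low-pass again."""
    b, a = butter(8, f_edge / (fs / 2), btype="low")   # 8th-order Butterworth (assumed)
    x_lp = filtfilt(b, a, x)                           # keep energy below 4 kHz
    t = np.arange(len(x_lp)) / fs
    modulated = x_lp * np.cos(2 * np.pi * f_edge * t)  # maps f to f_edge - f (and f_edge + f)
    rotated = filtfilt(b, a, modulated)                # remove the upper sideband
    return rotated / np.max(np.abs(rotated)) * np.max(np.abs(x))  # match peak level

fs, speech = wavfile.read("sentence.wav")              # hypothetical 22.05 kHz recording
rotated = spectrally_rotate(speech.astype(float), fs)
wavfile.write("sentence_rotated.wav", fs, rotated.astype(np.int16))
```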
To test this explicitly with our own materials, eight naïve native English participants with normal hearing attempted to categorize the rotated speech target words used in training and testing. Target words were presented in isolation, segmented from the sentences.4 Even with instructions that the sounds originated from the English target words red, green, blue, and white, participants’ mean accuracy in this four-choice categorization test was no different from chance [M = 27.0%, SD = 8.4%, t(7) = 0.659, p ≥ .5, d = 0.23]. Thus, spectral rotation sufficiently distorts the acoustic signal to prevent listeners from hearing it as speech or from using it to access speech categories or words, even when the target words are extracted from the continuous sound stream and presented in isolation. This is consistent with previous research reporting that passive listening to rotated speech does not engage left anterior superior temporal sulcus (STS) regions thought to be drawn online by intelligible speech (Scott et al., 2000). To further limit the possibility that the spectrally rotated English speech was mapped to existing English language representations, we conducted the experiment in Sweden with native Swedish-speaking participants. During the experiment, the experimenter never indicated the sounds’ origin as English or as speech. Figure 3 shows representative stimulus spectrograms and waveforms of an original utterance and its spectrally rotated counterpart. The sounds are available online as supplemental material.
Procedure
The video game procedure was identical to that of Experiment 1 with a slight variation in stimulus presentation. In Experiment 2, a randomly selected utterance of a single sentence possessing the associated target word was presented repeatedly on each appearance of an alien. In Experiment 3, multiple, randomly selected utterances of different sentences were presented in a random order on each alien appearance, as in Experiment 1. Thus, the input in Experiment 2 modeled repeated instances of the exact same sentence (e.g., Shoot the blue one! Shoot the blue one! Shoot the blue one!), with target words in the same position and possessing identical sentence acoustics within an event, i.e., the appearance of a particular alien. Across events, however, the sentences varied and the acoustics of the target words were highly variable. Experiment 3 input was characterized by acoustic variability of target words and variability in target word position within an event, as well as acoustic variability across events (e.g., Shoot the blue one! Blue invaders are coming! Look out for the blue!).
In all, these materials presented a challenging learning environment. As in Experiment 1, there was considerable acoustic variability across target words, and they occurred embedded in continuous sound streams at unpredictable positions. The temporal window (word) defining the target was unknown to participants, who had to discover the predictive temporal window from continuous acoustics. Adding to the learning challenge in Experiments 2 and 3, there was no overlap with English language knowledge (overlap that was potentially available in Experiment 1’s Korean materials); the lower-frequency acoustics informative to target word identity were remapped into higher-frequency bands by the spectral rotation manipulation. Thus, the reversed frequency mapping of spectrally rotated speech deviated significantly from the acoustic regularities of speech, and of natural sounds more generally.
Participants were instructed briefly (in Swedish) on how to play the video game, but no mention was made about the significance of the sounds to the game or their relationship to English, speech, or the learning questions under investigation.
Results and Discussion
As in Experiment 1, listeners performed significantly above chance (25%) in the posttest categorization of the rotated speech after just a total of 100 min of game playing within a single experimental session. Experiment 2 participants, for whom a single spectrally rotated sentence repeated within an event, performed significantly above chance on posttest categorization of the familiar sentences [M = 34.8%, SD = 11.0%, t(39) = 5.67, p ≤ .001, d = 1.26] and generalized to novel never-before-heard spectrally rotated sentences possessing the same target word [M = 33.5%, SD = 9.9%, t(39) = 5.42, p ≤ .001, d = 0.86]. Perhaps most striking, performance on isolated target words never heard in training was significantly above chance [M = 37.4%, SD = 18.4%, t(39) = 4.26, p ≤ .001, d = 0.93] (see Figure 2). Above-chance categorization of the isolated target words demonstrates that listeners were able to extract the equivalence class categories associated with target words from the continuous sound. Moreover, since the instances of isolated targets were acoustically different from those experienced in training, their above-chance performance indicates that listeners could relate the acoustically variable segments within the continuous sentences presented during learning to the acquired equivalence classes—a step toward successful segmentation of category exemplars.
Regardless of whether the target word appeared in the initial, medial, or final position in the sentence, categorization was above chance [initial: M = 35.0%, SD = 11.2%, t(39) = 5.62, p ≤ .001, d = 0.89; medial: M = 30.5%, SD = 9.4%, t(39) = 3.68, p ≤ .001, d = 0.58; final: M = 36.2%, SD = 12.5%, t(39) = 5.68, p ≤ .001, d = 0.90]. However, a repeated-measures ANOVA revealed a significant main effect of target word position in the sentence (initial, medial, and final), F(2, 78) = 6.17, p = .003; categorization of target words appearing in either sentence-initial or -final positions was equivalent to categorization of the isolated target words [F(2, 78) = 0.64, p > .5], all of which were significantly better than categorization of words in the sentence-medial position [F(3, 117) = 4.52, p = .005]. Thus, there was a slight disadvantage for learning targets in the medial position, perhaps due to greater acoustic variability from coarticulation in sentence-medial productions or because, unlike medial targets, targets at the edges of an utterance benefit from relatively privileged segmentation, as speculated for Experiment 1.
Similarly, Experiment 3 listeners’ overall posttraining categorization performance was significantly greater than chance [M = 32.9%, SD = 7.6%, t(36) = 6.31, p ≤ .001, d = 1.04]. This was true for familiar stimuli heard in training [M = 33.1%, SD = 8.1%, t(36) = 6.09, p ≤ .001, d = 1.00], novel sentences [M = 32.5%, SD = 8.2%, t(36) = 5.60, p ≤ .001, d = 0.92], and novel isolated target words [M = 36.8%, SD = 15.6%, t(36) = 4.60, p ≤ .001, d = 0.76] (see Figure 2). Experiment 3 participants’ performance was statistically indistinguishable from that of Experiment 2 listeners, t(75) = 0.69, p = 0.49, d < 0.20, indicating that the additional acoustic variability present in Experiment 3 neither helped nor hurt learning.5
Experiment 3 listeners categorized target words in the sentence-final position as accurately as those presented in isolation [t(36) = 1.37, p ≥ .18] and significantly more accurately than sentence-medial words [F(2, 72) = 5.23, p = .008]. Sentence-initial word categorization, although indistinguishable from sentence-final and isolated word categorization [F(2, 72) = 1.70, p = .190], was also not significantly different from sentence-medial word categorization [t(36) = 1.74, p = .091, d = 0.29]. It is possible that the additional within-trial acoustic variability blurred the boundaries of each sentence within a trial, mitigating the word-initial advantage observed in the other experiments.
The chance-level performance of naïve participants in matching the rotated target words to their English counterparts strongly suggests that spectral rotation distorted the acoustics of English speech sufficiently to prevent lexical access. In keeping with this, posttest categorization performance was similar across participants regardless of the counterbalanced mapping between the target words and visual aliens. That is, categorization performance of participants who experienced target words that mismatched the alien colors (e.g., a red alien paired with spectrally rotated sentences embedded with the target word blue) did not differ from that of listeners who experienced target words that matched the alien colors [Mdiff = 0.9%, t(75) = 0.45, p > .5]. There were also no differences in performance across target words [Experiment 2: F(3, 117) = 1.101, p = .352; Experiment 3: F(3, 108) = 0.690, p = .560].
However, in comparison to Experiment 1 listeners’ learning of unfamiliar Korean speech, learning of spectrally rotated English materials in Experiments 2 and 3 was significantly worse [MKorean = 53.5%; MRotated = 33.6%; t(107) = 5.82, p ≤ .001, d = 1.07; see Figure 2]. Also, listeners who heard spectrally rotated speech responded more slowly to posttest categorization items than those learning Korean targets [MKorean = 2,003 ms; MRotated = 2,420 ms; t(107) = 2.97, p = .004, d = 0.61]. This pattern of results may indicate a greater advantage for learning speech materials compared to nonspeech sounds created by spectrally rotating speech. Issues regarding the domain generality of learning are discussed in more detail below.
General Discussion
Perceptual systems must discover behaviorally relevant regularities associated with objects and events embedded in highly variable, and continuous, sensory input. Spoken language is an example. Natural speech possesses considerable acoustic variability such that no unique acoustic signature distinctly defines linguistically relevant units like individual phonemes, syllables, or words. Further complicating speech category learning, acoustic information for these units must be drawn from a nearly continuous acoustic stream without strong physical markers, like silence, to signal unit boundaries in time. There is a rich literature addressing speech category learning across development and into adulthood, and there is a growing body of research on auditory category learning of nonspeech signals (see Holt, 2011, for a review). However, how listeners learn categories from imperfect sources of acoustic information simultaneously available at multiple temporal granularities in continuous sound input is an important, but unresolved, issue.
The present results demonstrate that listeners can recognize acoustically variable word-length segments embedded in continuous sound without prior knowledge of the functional units, their temporal grain size, or their position in the continuous input. Moreover, they can learn these categories in the context of a largely incidental task in which visual referents support learning. In the present study, the target words possessed great variability across spectral and temporal acoustic dimensions (see the Appendix), their position within sentences was unpredictable, and there were no a priori cues that the temporal grain size of “two-syllable” (Experiment 1) or “one-syllable” (Experiments 2 and 3) words was significant (regularity at the whole-sentence, phrase, syllable, or phoneme level might have defined the sound category inventory). Our results indicate that participants capitalized on the temporal granularity that best supported behavior within the video game (Ahissar & Hochstein, 2002); only the target words reliably covaried with the visual referents and thus had predictive power in guiding relevant behaviors in the game.
Under these conditions, we observed category learning for non-native speech and even for unintelligible nonspeech stimuli created by warping English speech with spectral rotation. Spectral rotation eliminated similarities between English and Korean that may have supported learning non-native speech in Experiment 1 and created a soundscape in which a priori presumptions of the critical acoustic features or of the temporal granularity that best relates to language were minimized. Yet, learning was observed. This hints that domain-general processes may be involved.
Domain-General Learning?
Regardless of whether the to-be-learned materials were speech or nonspeech, reliable learning was observed. Nevertheless, there was a clear learning advantage for the natural Korean speech materials compared to the nonspeech signals created by spectrally rotating speech. It should be noted, however, that because these comparisons were between groups, a priori between-groups differences unrelated to learning cannot be ruled out. The different extent of learning for speech and nonspeech might be interpreted as an advantage for learning speech over nonspeech signals, with theoretical implications for the hypothesized advantage for learning relationships between speech and visual referents (e.g., Ferry, Hespos, & Waxman, 2010; Fulkerson & Waxman, 2007) and for a bias toward speech over nonspeech in general (e.g., Vouloumanos & Werker, 2007).
However, in interpreting this speech versus nonspeech difference, it is important to note that native English listeners naïve to Korean were significantly above chance in sorting sentences with acoustically variable Korean target words, but naïve listeners were at chance in sorting spectrally rotated target words (even when informed about the target words’ identity in unrotated English); thus, the baseline sorting ability for Korean was significantly higher than that of spectrally rotated speech, Mdiff = 9.3%, t(17) = 2.58, p ≤ .020, d = 1.05. Although the pilot tests of the stimuli were not conducted such that they invite a formal statistical comparison with participants who trained within the video game, it is informative to note that the baseline performance of naïve listeners sorting Korean stimuli (M = 36.2% correct) was on par with the ultimate achievement of listeners who trained with spectrally rotated speech. Participants began with an advantage in discovering the Korean categories in Experiment 1 relative to listeners who learned spectrally rotated speech categories in Experiments 2 and 3. But what was the nature of this advantage?
One critical factor may relate to the acoustic differences between speech and its spectrally rotated complement. Spectral rotation preserves the natural spectrotemporal complexity of speech but inverts its frequencies, transforming the highly informative lower frequencies (where most phonetic information is carried across languages) into higher-frequency bands. Although the frequency range of sounds spectrally rotated at 4 kHz remains well within the “sweet spot” (1,000–3,000 Hz) where the human audiogram has its lowest thresholds (Fant, 1949), the relationship between the intensity of acoustic energy across frequencies in natural speech is reversed in rotated speech. Rotated speech therefore violates natural sound-signal statistics, creating signals that are quite distinct from those the auditory system evolved to decode and that are inconsistent with listeners’ long-term auditory experience. As such, it is reasonable to hypothesize that spectrally rotated speech may present greater learning challenges simply because of its highly unnatural acoustics and their potential impact on stimulus discriminability. Cunillera, Laine, et al. (2010), for example, report an effect of visual discriminability on cross-modally supported learning.
If the acoustic characteristics of spectrally rotated speech handicap learning due to greater demands on perceptual processing and not due to inherent learning differences for speech and nonspeech, then additional experience may push nonspeech category learning toward speech-like levels. We conducted a small follow-up study to address this possibility. Our aim was to determine whether the modest target word categorization accuracy observed for the nonspeech spectrally rotated signals represents an upper limit on nonspeech category learning. Six native English adult listeners played the video game with Experiment 3 input across six daily 1-hr sessions. Following this extended training, participants’ categorization was significantly more accurate than that observed in Experiment 3 [M6-hr = 51.6%, SD = 20.5%, t(5) = 3.17, p ≤ .025, d = 1.30] and, impressively, one participant achieved 83.3% accuracy. These data indicate that greater experience can lead to highly accurate categorization of spectrally rotated speech despite the complex learning challenges presented by the signals. High levels of category learning can be achieved even with complex, continuous sounds that systematically correlate with referents in the environment (Colunga & Smith, 2002), even if they are not speech-like.
Another critical factor in considering the learning advantage we observed for speech versus nonspeech is listeners’ history of experience. Although Korean was unfamiliar to our native English listeners, it shares common signal statistics identifying it as spoken language, and English and Korean overlap in some aspects of their phonology and phonotactics. It is easy to imagine how such sources of information may serve to guide and constrain listeners’ hypotheses about the relevant functional units residing in continuous sound. Even recognition of the Korean sentences as speech in the broadest sense may support valid inferences about the temporal windows across which meaningful spoken language events take place (Poeppel, 2003). This may constrain the search for reliable windows of acoustic information in continuous sound. In contrast, in Experiments 2 and 3, listeners encountered the unusual signal statistics of spectrally rotated speech for the first time within the task. Thus, our speech and nonspeech stimuli also varied in the degree to which listeners had expertise relevant to parsing these sound streams. For this reason, it is necessary to be cautious in interpreting the present speech versus nonspeech learning differences as fundamental differences in learning because expertise with sound classes can modulate differences in speech and nonspeech processing. Leech et al. (2009), for example, found that a region of the left STS commonly considered to be speech selective (e.g., Belin, Zatorre, Lafaille, Ahad, & Pike, 2000; Binder et al., 2000; Scott et al., 2000) is recruited during passive listening to complex nonspeech sounds, but only among listeners who had learned to categorize the nonspeech sounds. Although unintelligible, spectrally rotated speech is known to engage different patterns of cortical activation than speech signals among naïve listeners (Scott et al., 2000), the present results and those of Leech et al. (2009) suggest the possibility that spectrally rotated speech sounds might engage more speech-like cortical networks as listeners develop expertise in categorizing these signals. Speech and nonspeech signals can differ in more than their status as human-produced sounds or in their acoustics; consideration of listeners’ expertise with the sounds is also a significant factor in interpreting domain generality versus specificity (Leech et al., 2009; see also Gauthier & Tarr, 1997).
Sensitivity to Input Regularities and Variability
Whether speech or nonspeech, we found that category learning generalized to novel utterances and to instances presented in isolation. Under the current approach, it is not possible to determine the specific acoustic dimensions that listeners relied on for categorization because, by design, acoustic dimensions varied freely in order to examine whether categories can be learned in the context of the natural acoustic variability present in fluent speech. Nevertheless, the results of Experiment 1 hint that learners were sensitive to acoustic similarity across target words; there was about a 5% disadvantage in categorizing Korean blue (/pharan/) and red (/p*alkan/) compared to the other targets. This disadvantage might be attributable to greater acoustic–phonetic similarity between these two words relative to the other targets in the inventory. Likewise, the consistent advantage across experiments for segmenting words in the sentence-initial and -final positions over the -medial position invites explanations related to regularities in acoustics, all of which favor the use of utterance edges for locating target words (Seidl & Johnson, 2006). Brief pauses at utterance boundaries may have facilitated categorization of words that aligned with utterance edges. Also, increased coarticulation may have created greater acoustic variability in target words at the sentence-medial position relative to the -initial and -final positions. These possibilities suggest that acoustic similarity and variability play an important role in category learning.
Another level of regularity worth mentioning is the manipulation of sentence structure in Experiments 2 and 3. Variability in speech input is beneficial for increasing sensitivity to phonemic distinctions in infants (Thiessen, 2007) and appears to promote learning and generalization of non-native speech categories in adults (e.g., Bradlow et al., 1997; Lively et al., 1993; Logan et al., 1991), at least in tasks with individuated phoneme classes. Experiments 2 and 3 used the same spectrally rotated speech stimulus inventory but varied in the presentation of sentences within an event (repeated sentence vs. multiple sentences). From a purely statistical stance, it might be argued that the additional variability introduced in the non-target-word acoustics of Experiment 3 would serve to highlight the relative reliability of the target word acoustics within an event, thus promoting discovery of the target word. However, this was not the case; learning was equivalent in Experiments 2 and 3. In understanding this, it may be useful to consider the task from the perspective of an optimal Bayesian model that would evaluate various possible input structures as candidate acoustic segments relevant to predicting the visual referent (Goldwater, Griffiths, & Johnson, 2009). When sentences repeated (Experiment 2), there was less information to process within an event. There were fewer possible sequential chunks to be evaluated, perhaps advantaging learning. However, in order to evaluate evidence for hypotheses, it was necessary to encounter multiple events. When multiple sentences occurred within each event (Experiment 3), there were perceptual processing demands in evaluating the possible sequential chunks, but it would have been possible to evaluate hypotheses even within an event. As such, it is possible that competing task demands masked effects of variability in the present approach. Future work might explicitly manipulate the information present within versus across appearances of a visual referent to determine these factors’ independent contributions.
Learning Mechanisms and Implications
The category learning we observed occurred within the context of a challenging, immersive video game. Participants were not informed about the existence of auditory categories or even of the significance of sound to success within the game. However, the structure of the game was such that the to-be-learned categories embedded in continuous sound were the best predictors of specific aliens and the appropriate action. Thus, learning to treat sound category members as functionally equivalent served to support effective predictions about upcoming actions within the game. In this way, the task was not entirely passive or unsupervised. Feedback arrived in the form of success or failure in achieving goals, and there were multiple, correlated multimodal events and objects that covaried with category membership.
The success or failure of behaviors stemming from such predictions may engage intrinsically generated learning signals to a greater extent than passive, unsupervised training (see Lim & Holt, 2011; Lim et al., 2014). Passive exposure paradigms, often used to test statistical learning of speech and other sounds in adults and infants (see Kuhl, 2004, for a review), may be limited in the extent to which they scale to learning challenges that incorporate more natural variability (Pierrehumbert, 2003). For example, with passive exposure alone, infants fail to segment words of variable length from fluent speech streams (Johnson & Tyler, 2010; but see Thiessen, Hill, & Saffran, 2005, and Pelucchi, Hay, & Saffran, 2009, for evidence that additional speech cues aid learning), and adults fail to learn functional equivalence classes for spectrally complex novel sounds that are learned readily in the present video game paradigm (Emberson et al., 2013; Wade & Holt, 2005).
In a recent study that also investigated how listeners learn from multiple sources of regularity simultaneously available at different temporal granularities in auditory signals, listeners were passively exposed to a continuous stream of unfamiliar, acoustically variable nonspeech "word"-level units comprising two highly discriminable categories and two additional categories that overlapped substantially in perceptual similarity (Emberson et al., 2013). In this way, as in the current study, listeners were confronted with simultaneous segmentation and categorization learning challenges. Although listeners were able to use the perceptually discriminable categories to discover the word-like units, passively listening to a continuous stream of these sounds for 7 min did not lead participants to discover the two experimenter-defined categories composed of perceptually similar stimuli. These results suggest that there may be limits on the power of passive exposure to drive learning at multiple levels simultaneously. This is particularly interesting because these very same nonspeech stimulus categories were readily learned with the video game paradigm of the present study (Leech et al., 2009; Liu & Holt, 2011; Wade & Holt, 2005). The intrinsic reward of success in the game, of accurately predicting and acting upon upcoming events, and the rich multimodal correlations among actions, objects, and events present in the video game but absent in passive paradigms may be powerful signals that drive learning (Lim & Holt, 2011; Lim et al., 2014; Wade & Holt, 2005). In line with this possibility, a different type of incidental task (Seitz & Watanabe, 2009), in which subthreshold, task-irrelevant (and thus unattended) sounds are presented in sync with task-relevant goals, has been found effective in inducing perceptual learning of nonspeech sounds (Seitz et al., 2010) and non-native contrasts (Vlahou, Protopapas, & Seitz, 2012).
This type of learning can be described computationally as reinforcement learning, whereby learning is driven by the outcome of feedback (e.g., reward or punishment) relative to a response. More specifically, learning emerges as one builds and updates predictions about the receipt of future reward (Sutton & Barto, 1998), thereby reducing the reward prediction error (RPE) on subsequent trials. Incidental learning tasks like the video game may generate an RPE signal that propagates to the multiple perceptual domains that support task success. In incidental learning tasks, learners' goals are directed not toward sound categorization but toward other aspects of the task that incidentally promote sound category learning. Outcome is linked to task success, and learners may not be aware of the relationship between outcome and sound categorization. Therefore, the RPE signal generated during learning may modulate representations in the auditory domain indirectly (see Lim et al., 2014).
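As a concrete illustration of the RPE computation referenced above (Sutton & Barto, 1998), the minimal sketch below updates a value estimate for an arbitrary cue from trial outcomes. The cue, reward scheme, and learning rate are illustrative assumptions, not parameters of the present task.

```python
# Minimal sketch of a reward-prediction-error (RPE) update in the spirit of
# Sutton & Barto (1998). The cue, reward scheme, and learning rate are
# illustrative only; they are not variables from the video game paradigm.

def rpe_update(value, reward, learning_rate=0.1):
    """Return the prediction error and the updated value estimate."""
    rpe = reward - value          # how wrong the current prediction was
    value += learning_rate * rpe  # move the prediction toward the outcome
    return rpe, value

# Example: a cue (e.g., a sound category paired with a particular alien) that
# reliably predicts success drives its value estimate toward 1 while the RPE
# shrinks toward 0 across trials.
v = 0.0
for trial in range(50):
    delta, v = rpe_update(v, reward=1.0)
print(round(v, 3), round(delta, 3))  # value near 1, RPE near 0
```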
In line with this possibility, language learning is supported by modulation from intrinsically generated attentional and motivational factors (Kuhl, Tsao, & Liu, 2003), contingent social cues (Goldstein, King, & West, 2003; Kuhl et al., 2003), and co-occurrence with multimodal information (Cunillera, Càmara, Laine, & Rodríguez-Fornells, 2010; Cunillera, Laine, et al., 2010; Medina, Snedeker, Trueswell, & Gleitman, 2011; Roy & Pentland, 2002; Thiessen, 2010; Yeung & Werker, 2009). Even more directly, in a study investigating nonspeech auditory category learning in the same video game paradigm used in the present experiments, Lim (2013) reported learning-dependent recruitment of auditory regions supporting general auditory–phonemic category representations (i.e., the left posterior superior temporal sulcus; Desai, Liebenthal, Waldron, & Binder, 2008; Leech et al., 2009; Liebenthal, Binder, Spitzer, Possing, & Medler, 2005; Liebenthal et al., 2010), as well as the striatum, which is implicated in RPE-based learning (e.g., Elliott, Frith, & Dolan, 1997; Poldrack et al., 2001; Schultz, Apicella, Scarnati, & Ljungberg, 1992; Tricomi, Delgado, McCandliss, McClelland, & Fiez, 2006; see Schultz, 2000, for a review). These results implicate reinforcement-related learning within incidental learning tasks and are consistent with the possibility that similar mechanisms supported the learning observed in the present study.
Conclusion
In the present work, we observed that participants can discover novel, acoustically variable functional units from continuous sound within an active video game task that does not involve overt categorization, segmentation, or, in fact, explicit directed attention to the sounds at all. The sounds of the present experiments model the variability and regularity of natural language input because they were derived from natural, continuous spoken language. Significantly, the learning observed exhibited the hallmark of category learning: generalization to novel stimuli. This generalization extended to isolated target words never experienced during training with continuous sound. The present experiments provide evidence that segmentation from continuous acoustic input can occur in a relatively short time, for the kinds of variability and regularity present in spoken language, without knowledge of the functional units, their temporal granularity, or the spectrotemporal acoustic dimensions relevant to defining them. Moreover, the results suggest that learning about the multiple levels of regularity that simultaneously unfold in natural spoken language may be supported by intrinsically generated learning signals evoked by more active tasks that include supportive multimodal associations.
Acknowledgments
This work was supported by awards to Francisco Lacerda and Lori L. Holt from the Riksbankens Jubileumsfond (K2003-0867); grants to Lori L. Holt from the National Science Foundation (BCS-0746067) and the National Institutes of Health (R01DC004674); and training grants to Sung-Joo Lim from the National Science Foundation (DGE0549352), the National Institute of General Medical Sciences (T32GM081760), and the National Institute on Drug Abuse (5T90DA022761-07). The authors thank Frederic Dick, Jason Zevin, and Natasha Kirkham for stimulating discussions regarding the project.
Appendix
Variability of Target Words Across Sentence Contexts
To quantify the substantial acoustic variability across utterances of the target words, we segmented the target words from all training and test sentence stimuli (40 stimuli per word) and measured acoustic features (pitch and formants) using Praat (version 5.3). Mean values were computed by averaging the acoustic measurements over the target-word portion of each sentence. In Table A1, the acoustic variability of the target words is compared against the average acoustic values of 12 vowels spoken by 48 native English female speakers from Hillenbrand, Getty, Clark, and Wheeler (1995).
Table A1.
| Target word | F0 mean (Hz) | F0 min (Hz) | F0 max (Hz) | F1 mean (Hz) | F1 min (Hz) | F1 max (Hz) | F2 mean (Hz) | F2 min (Hz) | F2 max (Hz) |
|---|---|---|---|---|---|---|---|---|---|
| Blue | 268 | 201 | 333 | 383 | 322 | 494 | 1,368 | 1,263 | 1,492 |
| White | 224 | 181 | 286 | 545 | 445 | 671 | 1,767 | 1,476 | 1,978 |
| Green | 259 | 202 | 350 | 442 | 356 | 519 | 2,125 | 1,924 | 2,309 |
| Red | 223 | 166 | 281 | 539 | 465 | 616 | 1,780 | 1,348 | 1,977 |
| Average female (Hillenbrand et al., 1995) | 220 | 161 | 270 | 617 | 502 | 766 | 1,762 | 1,470 | 2,105 |
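For readers who wish to reproduce this style of measurement, the sketch below uses the praat-parselmouth Python interface to Praat. The sound file name, word boundaries, and analysis defaults are placeholders, not the settings used for the measurements reported above.

```python
# Rough sketch of the measurement pipeline described above, via the
# praat-parselmouth interface to Praat. The file name, word boundaries, and
# analysis defaults are placeholders, not the settings used in this study.
import numpy as np
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("sentence_01.wav")              # hypothetical stimulus
word = snd.extract_part(from_time=0.42, to_time=0.81)   # hand-labeled word span

pitch = word.to_pitch()
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                         # keep voiced frames only
print("F0 mean/min/max (Hz):", f0.mean(), f0.min(), f0.max())

formants = word.to_formant_burg()
times = np.linspace(word.xmin, word.xmax, 20)           # sample 20 time points
f1 = [call(formants, "Get value at time", 1, t, "Hertz", "Linear") for t in times]
f2 = [call(formants, "Get value at time", 2, t, "Hertz", "Linear") for t in times]
print("F1 mean (Hz):", np.nanmean(f1), "F2 mean (Hz):", np.nanmean(f2))
```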
Footnotes
Supplemental materials: http://dx.doi.org/10.1037/xhp0000067.supp
Korean has a three-way contrast among voiceless stop consonants, differing from English's two-way contrast. Red spoken in Korean (/p*alkan/) begins with a Korean fortis stop, whereas blue (/pharan/) and green (/t∫horok/) begin with Korean aspirated stops in word-initial position (see Cho, Jun, & Ladefoged, 2002, for phonetic notation).
Sixteen of the participants heard Korean speech spoken in a more exaggerated manner in order to model infant-directed speech input. The remainder of the participants heard the same Korean sentences spoken by the same talker in adult-directed speech, without exaggeration. The exaggeration had no impact on learning [Mdiff = 3.79%, t(30) = 0.39, p > .5] or latency of posttest responses [Mdiff = 241 ms, t(30) = 1.40, p = .17], so all results are collapsed across groups.
The original frequencies f between 0 and 4,000 Hz were remapped linearly according to f_rotated = 4,000 Hz − f.
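To illustrate this remapping, the sketch below mirrors a signal's spectrum about 2 kHz so that energy at frequency f moves to 4,000 Hz − f. It is only a demonstration of the mapping, not the processing chain actually used to create the spectrally rotated stimuli.

```python
# Minimal FFT-based illustration of the remapping f_rotated = 4000 Hz - f.
# This only demonstrates the frequency mapping; it is not the processing
# chain used to generate the spectrally rotated stimuli in this study.
import numpy as np

def rotate_below_4k(signal, sample_rate, pivot_hz=4000.0):
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    rotated = np.zeros_like(spectrum)
    band = freqs <= pivot_hz
    # Reverse the bins below the pivot: energy near 0 Hz moves toward 4 kHz
    # and energy near 4 kHz moves toward 0 Hz (content above 4 kHz is dropped).
    rotated[band] = spectrum[band][::-1]
    return np.fft.irfft(rotated, n=len(signal))

# Example: a 1 kHz tone sampled at 16 kHz comes out with its energy near 3 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
rotated = rotate_below_4k(tone, sr)
peak_bin = np.argmax(np.abs(np.fft.rfft(rotated)))
print(np.fft.rfftfreq(len(rotated), 1 / sr)[peak_bin])  # ~3000 Hz
```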
Note that since we only tested performance on the isolated target words, the pilot listeners may have been disadvantaged by the lack of potentially supportive information from context, which is known to affect the intelligibility of fluent speech (e.g., Pickett & Pollack, 1963; Pollack & Pickett, 1964).
Although there was no benefit of Experiment 3's additional variability on categorization, Experiment 3 listeners made their categorization responses faster than Experiment 2 participants [Mdiff = 335.9 ms, t(75) = 2.08, p = .041]. This pattern was similar for correct and incorrect trials [correct: Mdiff = 284.9 ms, t(75) = 1.83, p = .071; incorrect: Mdiff = 374.9 ms, t(75) = 2.23, p = .029]. Given the modest level of categorization performance in the two experiments (about 34%), the small number of correct trials reduced statistical power, which may explain why the response time difference for correct trials was only marginally significant. Owing to the between-subjects posttest design, it is not possible to determine whether listeners in Experiment 3 were generally faster to respond or whether Experiment 3 training conferred a greater benefit for recognizing and segmenting target words than Experiment 2 training.
Contributor Information
Sung-Joo Lim, Department of Psychology, Carnegie Mellon University.
Francisco Lacerda, Department of Linguistics, Stockholm University.
Lori L. Holt, Department of Psychology, Carnegie Mellon University.
References
- Ahissar M, Hochstein S. The role of attention in learning simple visual tasks. In: Fahle M, Poggio T, editors. Perceptual learning. Cambridge, MA: MIT Press; 2002. pp. 253–272. [Google Scholar]
- Altenberg EP. The perception of word boundaries in a second language. Second Language Research. 2005;21:325–358. http://dx.doi.org/10.1191/0267658305sr250oa. [Google Scholar]
- Aslin RN, Woodward J, LaMendola N, Bever T. Models of word segmentation in fluent maternal speech to infants. In: Morgan J, Demuth K, editors. Signal to syntax: Bootstrapping from speech to grammar in early acquisition. Mahwah, NJ: Erlbaum; 1996. pp. 117–134. [Google Scholar]
- Barcroft J, Sommers MS. Effects of acoustic variability on second language vocabulary learning. Studies in Second Language Acquisition. 2005;27:387–414. http://dx.doi.org/10.1017/S0272263105050175. [Google Scholar]
- Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B. Voice-selective areas in human auditory cortex. Nature. 2000;403:309–312. doi: 10.1038/35002078. http://dx.doi.org/10.1038/35002078. [DOI] [PubMed] [Google Scholar]
- Best CT. A direct realist view of cross-language speech perception. In: Strange W, editor. Speech perception and linguistic experience: Issues in cross-language research. Timonium, MD: York Press; 1995. pp. 171–204. [Google Scholar]
- Best CT, McRoberts GW, Goodell E. Discrimination of non-native consonant contrasts varying in perceptual assimilation to the listener’s native phonological system. The Journal of the Acoustical Society of America. 2001;109:775–794. doi: 10.1121/1.1332378. http://dx.doi.org/10.1121/1.1332378. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Binder JR, Frost JA, Hammeke TA, Bellgowan PS, Springer JA, Kaufman JN, Possing ET. Human temporal lobe activation by speech and nonspeech sounds. Cerebral Cortex. 2000;10:512–528. doi: 10.1093/cercor/10.5.512. http://dx.doi.org/10.1093/cercor/10.5.512. [DOI] [PubMed] [Google Scholar]
- Blesser B. Speech perception under conditions of spectral transformation. I. Phonetic characteristics. Journal of Speech and Hearing Research. 1972;15:5–41. doi: 10.1044/jshr.1501.05. http://dx.doi.org/10.1044/jshr.1501.05. [DOI] [PubMed] [Google Scholar]
- Bradlow AR, Pisoni DB, Akahane-Yamada R, Tohkura Y. Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production. The Journal of the Acoustical Society of America. 1997;101:2299–2310. doi: 10.1121/1.418276. http://dx.doi.org/10.1121/1.418276. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chandrasekaran B, Yi HG, Maddox WT. Dual-learning systems during speech category learning. Psychonomic Bulletin & Review. 2014;21:488–495. doi: 10.3758/s13423-013-0501-5. http://dx.doi.org/10.3758/s13423-013-0501-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cho T, Jun S-A, Ladefoged P. Acoustic and aerodynamic correlates to Korean stops and fricatives. Journal of Phonetics. 2002;30:193–228. http://dx.doi.org/10.1006/jpho.2001.0153. [Google Scholar]
- Cole RA, Jakimik J. How are syllables used to recognize words? The Journal of the Acoustical Society of America. 1980;67:965–970. doi: 10.1121/1.383939. http://dx.doi.org/10.1121/1.383939. [DOI] [PubMed] [Google Scholar]
- Colunga E, Smith LB. What makes a word? Proceedings of the Annual Conference of the Cognitive Science Society. 2002;24:214–219. [Google Scholar]
- Cunillera T, Càmara E, Laine M, Rodríguez-Fornells A. Speech segmentation is facilitated by visual cues. The Quarterly Journal of Experimental Psychology. 2010;63:260–274. doi: 10.1080/17470210902888809. http://dx.doi.org/10.1080/17470210902888809. [DOI] [PubMed] [Google Scholar]
- Cunillera T, Laine M, Càmara E, Rodríguez-Fornells A. Bridging the gap between speech segmentation and word-to-world mappings: Evidence from an audiovisual statistical learning task. Journal of Memory and Language. 2010;63:295–305. http://dx.doi.org/10.1016/j.jml.2010.05.003. [Google Scholar]
- Cutler A. Listening to a second language through the ears of a first. Interpreting. 2000;5:1–23. http://dx.doi.org/10.1075/intp.5.1.02cut. [Google Scholar]
- Cutler A, Mehler J, Norris D, Segui J. The syllable’s differing role in the segmentation of French and English. Journal of Memory and Language. 1986;25:385–400. http://dx.doi.org/10.1016/0749-596X(86)90033-1. [Google Scholar]
- Cutler A, Otake T. Mora or phoneme? Further evidence for language-specific listening. Journal of Memory and Language. 1994;33:824–844. http://dx.doi.org/10.1006/jmla.1994.1039. [Google Scholar]
- Desai R, Liebenthal E, Waldron E, Binder JR. Left posterior temporal regions are sensitive to auditory categorization. Journal of Cognitive Neuroscience. 2008;20:1174–1188. doi: 10.1162/jocn.2008.20081. http://dx.doi.org/10.1162/jocn.2008.20081. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Elliott R, Frith CD, Dolan RJ. Differential neural response to positive and negative feedback in planning and guessing tasks. Neuropsychologia. 1997;35:1395–1404. doi: 10.1016/s0028-3932(97)00055-9. http://dx.doi.org/10.1016/S0028-3932(97)00055-9. [DOI] [PubMed] [Google Scholar]
- Emberson LL, Liu R, Zevin JD. Is statistical learning constrained by lower level perceptual organization? Cognition. 2013;128:82–102. doi: 10.1016/j.cognition.2012.12.006. http://dx.doi.org/10.1016/j.cognition.2012.12.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fant G. Analys av de svenska konsonantljuden [Analysis of the Swedish consonant sounds]. L. M. Ericsson protokoll H/P. 1949:1064. [Google Scholar]
- Ferry AL, Hespos SJ, Waxman SR. Categorization in 3- and 4-month-old infants: An advantage of words over tones. Child Development. 2010;81:472–479. doi: 10.1111/j.1467-8624.2009.01408.x. http://dx.doi.org/10.1111/j.1467-8624.2009.01408.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Flege JE. Second-language speech learning: Theory, findings, and problems. In: Strange W, editor. Speech perception and linguistic experience: Issues in cross-language research. Timonium, MD: York Press; 1995. pp. 229–273. [Google Scholar]
- Flege J, Wang C. Native-language phonotactic constraints affect how well Chinese subjects perceive the word-final English /t/-/d/ contrast. Journal of Phonetics. 1990;17:299–315. [Google Scholar]
- Fougeron C, Keating PA. Articulatory strengthening at edges of prosodic domains. The Journal of the Acoustical Society of America. 1997;101:3728–3740. doi: 10.1121/1.418332. http://dx.doi.org/10.1121/1.418332. [DOI] [PubMed] [Google Scholar]
- Fulkerson AL, Waxman SR. Words (but not tones) facilitate object categorization: Evidence from 6- and 12-month-olds. Cognition. 2007;105:218–228. doi: 10.1016/j.cognition.2006.09.005. http://dx.doi.org/10.1016/j.cognition.2006.09.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gauthier I, Tarr MJ. Becoming a “Greeble” expert: Exploring mechanisms for face recognition. Vision Research. 1997;37:1673–1682. doi: 10.1016/s0042-6989(96)00286-6. http://dx.doi.org/10.1016/S0042-6989(96)00286-6. [DOI] [PubMed] [Google Scholar]
- Goldstein MH, King AP, West MJ. Social interaction shapes babbling: Testing parallels between birdsong and speech. Proceedings of the National Academy of Sciences of the United States of America. 2003;100:8030–8035. doi: 10.1073/pnas.1332441100. http://dx.doi.org/10.1073/pnas.1332441100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goldwater S, Griffiths TL, Johnson M. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition. 2009;112:21–54. doi: 10.1016/j.cognition.2009.03.008. http://dx.doi.org/10.1016/j.cognition.2009.03.008. [DOI] [PubMed] [Google Scholar]
- Goto H. Auditory perception by normal Japanese adults of the sounds “L” and “R”. Neuropsychologia. 1971;9:317–323. doi: 10.1016/0028-3932(71)90027-3. http://dx.doi.org/10.1016/0028-3932(71)90027-3. [DOI] [PubMed] [Google Scholar]
- Grayson JH. Korean. In: Brown K, editor. Encyclopedia of language & linguistics. 2nd ed. Oxford, UK: Elsevier; 2006. pp. 236–238. [Google Scholar]
- Grieser D, Kuhl PK. Categorization of speech by infants: Support for speech-sound prototypes. Developmental Psychology. 1989;25:577–588. http://dx.doi.org/10.1037/0012-1649.25.4.577. [Google Scholar]
- Hillenbrand J, Getty LA, Clark MJ, Wheeler K. Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America. 1995;97:3099–3111. doi: 10.1121/1.411872. http://dx.doi.org/10.1121/1.411872. [DOI] [PubMed] [Google Scholar]
- Hockett CF. A manual of phonology. Baltimore, MD: Waverly Press; 1955. [Google Scholar]
- Holt LL. How perceptual and cognitive constraints affect learning of speech categories. In: Cohn A, Fougeron C, Huffman M, editors. Handbook of laboratory phonology. New York, NY: Oxford University Press; 2011. pp. 348–358. [Google Scholar]
- Holt LL, Lotto AJ. Speech perception as categorization. Attention, Perception, & Psychophysics. 2010;72:1218–1227. doi: 10.3758/APP.72.5.1218. http://dx.doi.org/10.3758/APP.72.5.1218. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ingvalson EM, Holt LL, McClelland JL. Can native Japanese listeners learn to differentiate /r–l/ on the basis of F3 onset frequency? Bilingualism: Language and Cognition. 2012;15:434–435. doi: 10.1017/S1366728912000041. http://dx.doi.org/10.1017/S1366728912000041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Iverson P, Kuhl PK, Akahane-Yamada R, Diesch E, Tohkura Y, Kettermann A, Siebert C. A perceptual interference account of acquisition difficulties for non-native phonemes. Cognition. 2003;87:B47–B57. doi: 10.1016/s0010-0277(02)00198-1. http://dx.doi.org/10.1016/S0010-0277(02)00198-1. [DOI] [PubMed] [Google Scholar]
- Johnson EK, Tyler MD. Testing the limits of statistical learning for word segmentation. Developmental Science. 2010;13:339–345. doi: 10.1111/j.1467-7687.2009.00886.x. http://dx.doi.org/10.1111/j.1467-7687.2009.00886.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Johnson K, Ladefoged P, Lindau M. Individual differences in vowel production. The Journal of the Acoustical Society of America. 1993;94:701–714. doi: 10.1121/1.406887. http://dx.doi.org/10.1121/1.406887. [DOI] [PubMed] [Google Scholar]
- Jusczyk PW. How infants begin to extract words from speech. Trends in Cognitive Sciences. 1999;3:323–328. doi: 10.1016/s1364-6613(99)01363-7. http://dx.doi.org/10.1016/S1364-6613(99)01363-7. [DOI] [PubMed] [Google Scholar]
- Kabak B, Idsardi WJ. Perceptual distortions in the adaptation of English consonant clusters: Syllable structure or consonantal contact constraints? Language and Speech. 2007;50:23–52. doi: 10.1177/00238309070500010201. http://dx.doi.org/10.1177/00238309070500010201. [DOI] [PubMed] [Google Scholar]
- Klein W. Second language acquisition. Cambridge, UK: Cambridge University Press; 1986. http://dx.doi.org/10.1017/CBO9780511815058. [Google Scholar]
- Kuhl PK. Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience. 2004;5:831–843. doi: 10.1038/nrn1533. http://dx.doi.org/10.1038/nrn1533. [DOI] [PubMed] [Google Scholar]
- Kuhl PK, Stevens E, Hayashi A, Deguchi T, Kiritani S, Iverson P. Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science. 2006;9:F13–F21. doi: 10.1111/j.1467-7687.2006.00468.x. http://dx.doi.org/10.1111/j.1467-7687.2006.00468.x. [DOI] [PubMed] [Google Scholar]
- Kuhl PK, Williams KA, Lacerda F, Stevens KN, Lindblom B. Linguistic experience alters phonetic perception in infants by 6 months of age. Science. 1992;255:606–608. doi: 10.1126/science.1736364. http://dx.doi.org/10.1126/science.1736364. [DOI] [PubMed] [Google Scholar]
- Kuhl PK, Tsao F-M, Liu H-M. Foreign-language experience in infancy: Effects of short-term exposure and social interaction on phonetic learning. Proceedings of the National Academy of Sciences of the United States of America. 2003;100:9096–9101. doi: 10.1073/pnas.1532872100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Leech R, Holt LL, Devlin JT, Dick F. Expertise with artificial nonspeech sounds recruits speech-sensitive cortical regions. The Journal of Neuroscience. 2009;29:5234–5239. doi: 10.1523/JNEUROSCI.5758-08.2009. http://dx.doi.org/10.1523/JNEUROSCI.5758-08.2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liebenthal E, Binder JR, Spitzer SM, Possing ET, Medler DA. Neural substrates of phonemic perception. Cerebral Cortex. 2005;15:1621–1631. doi: 10.1093/cercor/bhi040. http://dx.doi.org/10.1093/cercor/bhi040. [DOI] [PubMed] [Google Scholar]
- Liebenthal E, Desai R, Ellingson MM, Ramachandran B, Desai A, Binder JR. Specialization along the left superior temporal sulcus for auditory categorization. Cerebral Cortex. 2010;20:2958–2970. doi: 10.1093/cercor/bhq045. http://dx.doi.org/10.1093/cercor/bhq045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lim S-J. Unpublished doctoral dissertation. Pittsburgh, Pennsylvania: Carnegie Mellon University; 2013. Investigating the neural basis of sound category learning within a naturalistic incidental task. [Google Scholar]
- Lim S-J, Fiez JA, Holt LL. How may the basal ganglia contribute to auditory categorization and speech perception? Frontiers in Neuroscience. 2014;8:230. doi: 10.3389/fnins.2014.00230. http://dx.doi.org/10.3389/fnins.2014.00230. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lim S-J, Holt LL. Learning foreign sounds in an alien world: Videogame training improves non-native speech categorization. Cognitive Science. 2011;35:1390–1405. doi: 10.1111/j.1551-6709.2011.01192.x. http://dx.doi.org/10.1111/j.1551-6709.2011.01192.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu R, Holt LL. Neural changes associated with nonspeech auditory category learning parallel those of speech category acquisition. Journal of Cognitive Neuroscience. 2011;23:683–698. doi: 10.1162/jocn.2009.21392. http://dx.doi.org/10.1162/jocn.2009.21392. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lively SE, Logan JS, Pisoni DB. Training Japanese listeners to identify English /r/ and /l/. II: The role of phonetic environment and talker variability in learning new perceptual categories. The Journal of the Acoustical Society of America. 1993;94:1242–1255. doi: 10.1121/1.408177. http://dx.doi.org/10.1121/1.408177. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Logan JS, Lively SE, Pisoni DB. Training Japanese listeners to identify English /r/ and /l/: A first report. The Journal of the Acoustical Society of America. 1991;89:874–886. doi: 10.1121/1.1894649. http://dx.doi.org/10.1121/1.1894649. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maye J, Gerken L. Learning phonemes without minimal pairs. In: Howell SC, Fish SA, Keith-Lucas T, editors. Proceedings of the 24th Annual Boston University Conference on Language Development. Somerville, MA: Cascadilla Press; 2000. pp. 522–533. [Google Scholar]
- Maye J, Gerken L. Learning phonemes: How far can the input take us? In: Do AH-J, Dominguez L, Johansen A, editors. Proceedings of the 25th Annual Boston University Conference on Language Development. Somerville, MA: Cascadilla Press; 2001. pp. 480–490. [Google Scholar]
- Medina TN, Snedeker J, Trueswell JC, Gleitman LR. How words can and cannot be learned by observation. Proceedings of the National Academy of Sciences of the United States of America. 2011;108:9014–9019. doi: 10.1073/pnas.1105040108. http://dx.doi.org/10.1073/pnas.1105040108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Miyawaki K, Jenkins JJ, Strange W, Liberman AM, Verbrugge R, Fujimura O. An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception & Psychophysics. 1975;18:331–340. http://dx.doi.org/10.3758/BF03211209. [Google Scholar]
- Moon S-J, Lindblom B. Interaction between duration, context, and speaking style in English stressed vowels. The Journal of the Acoustical Society of America. 1994;96:40–55. http://dx.doi.org/10.1121/1.410492. [Google Scholar]
- Pelucchi B, Hay JF, Saffran JR. Statistical learning in a natural language by 8-month-old infants. Child Development. 2009;80:674–685. doi: 10.1111/j.1467-8624.2009.01290.x. http://dx.doi.org/10.1111/j.1467-8624.2009.01290.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Peterson GE, Barney HL. Control methods used in a study of the vowels. The Journal of the Acoustical Society of America. 1952;24:175–184. http://dx.doi.org/10.1121/1.1906875. [Google Scholar]
- Pickett JM, Pollack I. Intelligibility of excerpts from fluent speech: Effects of rate of utterance and duration of excerpt. Language and Speech. 1963;6:151–164. [Google Scholar]
- Pierrehumbert JB. Phonetic diversity, statistical learning, and acquisition of phonology. Language and Speech. 2003;46:115–154. doi: 10.1177/00238309030460020501. http://dx.doi.org/10.1177/00238309030460020501. [DOI] [PubMed] [Google Scholar]
- Poeppel D. The analysis of speech in different temporal integration windows: Cerebral lateralization as “asymmetric sampling in time”. Speech Communication. 2003;41:245–255. http://dx.doi.org/10.1016/S0167-6393(02)00107-3. [Google Scholar]
- Poldrack RA, Clark J, Paré-Blagoev EJ, Shohamy D, Creso Moyano J, Myers C, Gluck MA. Interactive memory systems in the human brain. Nature. 2001;414:546–550. doi: 10.1038/35107080. http://dx.doi.org/10.1038/35107080. [DOI] [PubMed] [Google Scholar]
- Pollack I, Pickett JM. Intelligibility of excerpts from fluent speech: Auditory vs. structural context. Journal of Verbal Learning and Verbal Behavior. 1964;3:79–84. http://dx.doi.org/10.1016/S0022-5371(64)80062-1. [Google Scholar]
- Roy D, Pentland A. Learning words from sights and sounds: A computational model. Cognitive Science. 2002;26:113–146. http://dx.doi.org/10.1207/s15516709cog2601_4. [Google Scholar]
- Saffran JR, Aslin RN, Newport EL. Statistical learning by 8-month-old infants. Science. 1996;274:1926–1928. doi: 10.1126/science.274.5294.1926. http://dx.doi.org/10.1126/science.274.5294.1926. [DOI] [PubMed] [Google Scholar]
- Schultz W. Multiple reward signals in the brain. Nature Reviews Neuroscience. 2000;1:199–207. doi: 10.1038/35044563. http://dx.doi.org/10.1038/35044563. [DOI] [PubMed] [Google Scholar]
- Schultz W, Apicella P, Scarnati E, Ljungberg T. Neuronal activity in monkey ventral striatum related to the expectation of reward. The Journal of Neuroscience. 1992;12:4595–4610. doi: 10.1523/JNEUROSCI.12-12-04595.1992. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Scott SK, Blank CC, Rosen S, Wise RJ. Identification of a pathway for intelligible speech in the left temporal lobe. Brain. 2000;123:2400–2406. doi: 10.1093/brain/123.12.2400. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Seidl A, Johnson EK. Infant word segmentation revisited: Edge alignment facilitates target extraction. Developmental Science. 2006;9:565–573. doi: 10.1111/j.1467-7687.2006.00534.x. http://dx.doi.org/10.1111/j.1467-7687.2006.00534.x. [DOI] [PubMed] [Google Scholar]
- Seitz AR, Protopapas A, Tsushima Y, Vlahou EL, Gori S, Grossberg S, Watanabe T. Unattended exposure to components of speech sounds yields same benefits as explicit auditory training. Cognition. 2010;115:435–443. doi: 10.1016/j.cognition.2010.03.004. http://dx.doi.org/10.1016/j.cognition.2010.03.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Seitz AR, Watanabe T. The phenomenon of task-irrelevant perceptual learning. Vision Research. 2009;49:2604–2610. doi: 10.1016/j.visres.2009.08.003. http://dx.doi.org/10.1016/j.visres.2009.08.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Smith L, Yu C. Infants rapidly learn word-referent mappings via cross-situational statistics. Cognition. 2008;106:1558–1568. doi: 10.1016/j.cognition.2007.06.010. http://dx.doi.org/10.1016/j.cognition.2007.06.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sutton RS, Barto AG. Reinforcement learning: An introduction. Cambridge, MA: MIT Press; 1998. [Google Scholar]
- Thiessen ED. The effect of distributional information on children’s use of phonemic contrasts. Journal of Memory and Language. 2007;56:16–34. http://dx.doi.org/10.1016/j.jml.2006.07.002. [Google Scholar]
- Thiessen ED. Effects of visual information on adults’ and infants’ auditory statistical learning. Cognitive Science. 2010;34:1093–1106. doi: 10.1111/j.1551-6709.2010.01118.x. http://dx.doi.org/10.1111/j.1551-6709.2010.01118.x. [DOI] [PubMed] [Google Scholar]
- Thiessen ED, Hill EA, Saffran JR. Infant-directed speech facilitates word segmentation. Infancy. 2005;7:53–71. doi: 10.1207/s15327078in0701_5. http://dx.doi.org/10.1207/s15327078in0701_5. [DOI] [PubMed] [Google Scholar]
- Tricomi E, Delgado MR, McCandliss BD, McClelland JL, Fiez JA. Performance feedback drives caudate activation in a phonological learning task. Journal of Cognitive Neuroscience. 2006;18:1029–1043. doi: 10.1162/jocn.2006.18.6.1029. http://dx.doi.org/10.1162/jocn.2006.18.6.1029. [DOI] [PubMed] [Google Scholar]
- Vlahou EL, Protopapas A, Seitz AR. Implicit training of nonnative speech stimuli. Journal of Experimental Psychology: General. 2012;141:363–381. doi: 10.1037/a0025014. http://dx.doi.org/10.1037/a0025014. [DOI] [PubMed] [Google Scholar]
- Vouloumanos A, Werker JF. Listening to language at birth: Evidence for a bias for speech in neonates. Developmental Science. 2007;10:159–164. doi: 10.1111/j.1467-7687.2007.00549.x. http://dx.doi.org/10.1111/j.1467-7687.2007.00549.x. [DOI] [PubMed] [Google Scholar]
- Wade T, Holt LL. Incidental categorization of spectrally complex non-invariant auditory stimuli in a computer game task. The Journal of the Acoustical Society of America. 2005;118:2618–2633. doi: 10.1121/1.2011156. http://dx.doi.org/10.1121/1.2011156. [DOI] [PubMed] [Google Scholar]
- Watkins AJ. Perceptual compensation for effects of reverberation in speech identification. The Journal of the Acoustical Society of America. 2005;118:249–262. doi: 10.1121/1.1923369. http://dx.doi.org/10.1121/1.1923369. [DOI] [PubMed] [Google Scholar]
- Watkins AJ, Makin SJ. Steady-spectrum contexts and perceptual compensation for reverberation in speech identification. The Journal of the Acoustical Society of America. 2007;121:257–266. doi: 10.1121/1.2387134. http://dx.doi.org/10.1121/1.2387134. [DOI] [PubMed] [Google Scholar]
- Weber A, Cutler A. First-language phonotactics in second-language listening. The Journal of the Acoustical Society of America. 2006;119:597–607. doi: 10.1121/1.2141003. http://dx.doi.org/10.1121/1.2141003. [DOI] [PubMed] [Google Scholar]
- Werker J, Tees R. Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development. 1984;7:49–63. http://dx.doi.org/10.1016/S0163-6383(84)80022-3. [Google Scholar]
- Werker JF, Tees RC. Developmental changes across childhood in the perception of non-native speech sounds. Canadian Journal of Psychology. 1983;37:278–286. doi: 10.1037/h0080725. http://dx.doi.org/10.1037/h0080725. [DOI] [PubMed] [Google Scholar]
- Yeung HH, Werker JF. Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition. 2009;113:234–243. doi: 10.1016/j.cognition.2009.08.010. http://dx.doi.org/10.1016/j.cognition.2009.08.010. [DOI] [PubMed] [Google Scholar]
- Yu C, Smith LB. Rapid word learning under uncertainty via cross-situational statistics. Psychological Science. 2007;18:414–420. doi: 10.1111/j.1467-9280.2007.01915.x. http://dx.doi.org/10.1111/j.1467-9280.2007.01915.x. [DOI] [PubMed] [Google Scholar]