Supervised and unsupervised learning of multidimensional acoustic categories

Martijn Goudbeek; Daniel Swingley; Roel Smits

doi:10.1037/a0015781

Author manuscript; available in PMC: 2020 May 14.

Published in final edited form as: J Exp Psychol Hum Percept Perform. 2009 Dec;35(6):1913–1933. doi: 10.1037/a0015781

PMCID: PMC7224412 NIHMSID: NIHMS1585248 PMID: 19968443

Abstract

Learning to recognize the contrasts of a language-specific phonemic repertoire can be viewed as forming categories in a multidimensional psychophysical space. Research on the learning of distributionally defined visual categories has shown that categories defined over one dimension are easy to learn and that learning multidimensional categories is more difficult but tractable under specific task conditions. In two experiments, adult participants learned either a unidimensional or a multidimensional category distinction with or without supervision (feedback) during learning. The unidimensional distinctions were readily learned and supervision proved beneficial, especially in maintaining category learning beyond the learning phase. Learning the multidimensional category distinction proved to be much more difficult and supervision was not nearly as beneficial as with unidimensionally defined categories. Maintaining a learned multidimensional category distinction was only possible when the distributional information that identified the categories remained present throughout the testing phase. We conclude that listeners are sensitive to both trial-by-trial feedback and the distributional information in the stimuli. Even given limited exposure, listeners learned to use two relevant dimensions, albeit with considerable difficulty.

Keywords: auditory categories, supervised learning, unsupervised learning, nonspeech

Introduction

Infants acquiring a first language and learners of a second language must learn to categorize the sounds of the language’s phonetic system. To succeed, the learner must use phonetic information in the speech signal to determine how many categories there are, and to categorize additional tokens of sounds as they are heard. Despite a consensus that this process should be conceptualized as a distributional learning problem (e.g., Guenther & Gjaja, 1996; Kuhl et al., 1992; Werker, Pons, Dietrich, Kajikawa, Fais, & Amano, 2007), little is known about the mechanisms by which category learning proceeds, or about what constraints on category learning are present (McCandliss, Fiez, Protopapas, Conway, & McClelland, 2002). The experiments presented here are first steps in a larger attempt to lay out general principles of auditory category learning, with particular reference to problems posed by phonetic categories (Francis & Nusbaum, 2002; Francis, Nusbaum, & Fenn, 2007; Holt & Lotto, 2006; McCandliss et al., 2002).

Our approach is similar to that taken in studies of visual category learning (Ashby & Maddox, 1993; Nosofsky, 1990), in which perceptual categories are defined as existing in a psychophysical space with continuous dimensions. We assume that when listeners hear a sound, this sound is evaluated on a number of dimensions and mapped onto a point in a multidimensional space. Repeated exposure to sounds originating from distributionally distinct categories leads to the formation of “clouds” of points. If, after a period of exposure, distinct clouds emerge, listeners can start to associate each cloud with a different category.

Most research on the learning of categories defined as clusters in perceptual space has investigated simple visual dimensions: the length and orientation of line segments, the slope of a line bisecting a circle and the size of the circle, the horizontal and vertical position of dots relative to a midline and so forth. Here, we focus on the learning of similarly constructed auditory categories that are defined over simple auditory dimensions. Determining whether similar processes underlie category learning in different sensory modalities is itself of interest (e.g., Maddox, Ing, & Lauritzen, 2006). In addition, it is hoped that a better understanding of auditory category formation in tightly controlled experimental situations will inform theories of speech perception and language acquisition.

We assume that recognition of the statistical patterns in the emerging clouds of points in multidimensional space is equivalent to category acquisition. The human capacity for resolving the categories of spoken language provides a particularly interesting example of perceptual learning, because the acquisition of language-specific categories begins in infancy (Aslin, Jusczyk, & Pisoni, 1998; Jusczyk, 1997) and because this learning is necessarily unsupervised in nature. This last observation motivates the manipulation of the presence or absence of supervision (trial-by-trial feedback) in our experiments.

The distinction between supervised and unsupervised category learning has been explored extensively in adults. Human adults have proven adept at acquiring perceptual categories when given regular and immediate feedback about the validity of their judgments (Ashby & Alfonso-Reese, 1995; Ashby, Maddox, & Bohil, 2002; Gureckis & Love, 2003; Francis, Baldwin, & Nusbaum, 2000), but such feedback is not always required (Fried & Holyoak, 1984; Fiser & Aslin, 2001; Wade & Holt, 2006) and is seldom provided by everyday experience. When confronted with complex multidimensionally varying stimuli, learners must rely on the distributional structure of the objects and events they perceive. In successful perceptual categorization, those things that occupy nearby regions of perceptual space come to be regarded as the same, and as distinct from things that occupy different regions of this space. If an observer can detect the correlated structure of category members, he or she has a basis for forming a category without external feedback.

Unsupervised category learning studies have revealed characteristic limits in observers’ abilities. Ashby, Queller, and Beretty (1999) showed that participants initially opt for unidimensional solutions (ignoring every dimension of variation but one) but can be brought to entertain multidimensional solutions with the aid of supervision. Several other studies also show the preference for the use of one dimension (Love, 2002) or of category structures with minor prototype distortions (Homa & Cultice, 1984). As stated previously, most of the evidence supporting these generalizations derives from experiments testing simple visual categories in which the dimensions of variation are readily identifiable to participants. Artificial categories involving distributions of more complex stimulus patterns whose dimensions of variation are less obvious have rarely been used in unsupervised learning experiments, and, as suggested previously, few studies have used these methods to test the learning of auditory categories (but see Holt & Lotto, 2006; McCandliss et al., 2000; McClelland, Fiez, & McCandliss, 2002).

The literature on visual category formation suggests that in all likelihood, speech sound categories should be extremely difficult to learn. Not only do speech stimuli vary on many relevant dimensions, there is also considerable overlap between categories and variability within categories (e.g., Peterson & Barney, 1952; Hillenbrand, Getty, Clark, & Wheeler, 1995). Yet it is now well known that infants are well on their way to learning the phonetic categories of their native language within the first year of life. Numerous experiments demonstrate the ability of infants to discriminate a broad range of speech sound contrasts early in development. Over the course of the first year infants start to conflate similar sounds if those sounds are not phonologically contrastive in the infant’s native language (see, e.g., Aslin, Jusczyk, & Pisoni, 1998, or Jusczyk, 1997, for reviews). Several studies have found decrements in non-native consonant discrimination by the age of 12 months (e.g., Werker & Tees, 1984) and analogous decrements in non-native vowel perception even earlier (Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992; Polka & Werker, 1994). These changes in discrimination ability are seen as adaptive for native language understanding because the failure to discriminate non-native speech contrasts is taken to imply an improved understanding of the available speech categories in the native language (see Kuhl et al., 2006).

Thus, the improved recognition of speech categories of the native language may explain the loss of the infant’s ability to discriminate non-native phonemes, possibly because of changes in infants’ attention to different phonetic cues. Once two non-native sounds have become part of the same native category, it becomes more difficult to differentiate them from each other and their category co-members (Best, 1995). Within-category discrimination is more difficult than between-category discrimination, because within category sounds are heard as more similar to each other than between category sounds (Cameron Marean, Werner, & Kuhl, 1992; Kuhl, 1985). Given that infants show evidence of perceptual knowledge of their native language before they can articulate any words (indeed, before many infants begin to babble), corrective feedback cannot be responsible for this learning. Retention of linguistically relevant phonetic contrasts based on semantically contrasting minimal pairs (words phonologically matching in all but one feature or segment) is also excluded for infants because infants’ lexical knowledge is almost certainly too meagre for language-specific phonological tuning to be driven by semantic contrast in phonologically similar words (Swingley, 2003). As a result, it is generally assumed that infants acquire their knowledge about phonetic categories via an unsupervised bottom-up distributional analysis of the speech they hear (e.g., Pierrehumbert, 2003).

A demonstration of such learning in a laboratory setting was provided in a study of 6- and 8-month-olds by Maye, Werker, and Gerken (2002). In their study, two groups of infants were exposed to stimuli varying in formant trajectories, with prevoicing as a secondary cue on one end of the continuum. This led to a continuum extending from [da] to unaspirated [ta], a distinction not made in English. One group listened to stimuli in which the trajectories followed a unimodal distribution (most sounds were from the middle of the continuum) whereas the other group was presented with stimuli following a bimodal distribution (most sounds were from near the edges). Following this familiarization, infants were given the opportunity to listen to alternating stimulus sets (both of the endpoint stimuli) or non-alternating sets (the same stimulus repeated). Only the infants in the Bimodal familiarization group evidenced a preference for non-alternating over alternating stimuli at test, revealing discrimination; infants in the Unimodal group showed no such preference. Maye and Gerken (2000, 2001) found a similar sensitivity to distributional characteristics for adults with similar stimuli. However, the generality of this extremely rapid distributional learning is not clear at present (Peperkamp, Pettinato, & Dupoux, 2003; Pierrehumbert, 2003; Tyler & Johnson, 2006).

In the present contribution we describe experiments in which adult listeners were tested on their ability to learn auditory categories. The categories comprised novel sounds with speechlike properties, to simulate processes of phonetic category learning while minimizing effects of native-language phonological knowledge.

Our use of artificial categories exemplified by sampling from a distribution of variants of category prototypes ultimately descends from the pioneering studies of Attneave (1957) and Posner and Keele (1968), who laid out a range of hypotheses that are still of empirical interest. Among these are whether categories are abstracted as prototypes or stored as sets of experienced exemplars (or something in between), and when verbal descriptions of categories guide learners’ decisions (see e.g., Goldstone & Kersten, 2003). Here, we focused on two issues: first, how well listeners can learn two similar, distributionally-defined auditory categories given limited supervised or unsupervised exposure; and second, how this learning is influenced by whether the category structures demand attention to one versus two dimensions of variation.

To generate our experimental stimuli, we specified a psychophysical space spanned by two acoustical dimensions known to be relevant in vowel perception, namely frequency and duration. Categories were defined as two-dimensional probability density functions in this space. Exemplars generated from these functions formed “clouds” in perceptual space. The statistical properties of the probability density functions (their means and covariance matrices) governed the relevance of each dimension for making category judgments (see Figure 1). For example, exposure to the structure in the top left cell in Figure 1 should encourage subjects to categorize using only dimension 1, and exposure to the structure in the bottom left cell should encourage subjects to use only dimension 2. In these “unidimensional” situations, the dimension that does not differentiate the categories is irrelevant to category assignment, although it contributes just as much to the variance of the probability density functions.

Figure 1. — Four possible category structures in a two-dimensional perceptual space. Lines represent the optimal solution to the categorization problem.

Exposure to the structures in the right-hand column should encourage the use of both dimensions in categorizing, because the use of only one dimension would lead to many incorrect categorizations (Goudbeek, Swingley, & Kluender, 2007). Experiments in visual category learning have shown that subjects initially prefer a unidimensional solution (Feldman, 2000) and only with the help of feedback start using a two dimensional strategy (Ashby, Alfonso-Reese, Turken, & Waldron, 1998). Ashby et al. (1998) distinguish between verbal and procedural-based category learning. In their model, the verbal system has initial priority, and tries to categorize using a relatively simple (unidimensional) rule (e.g., long sounds in category A, short sounds in category B). Rules that are more complex and more difficult to verbalize like “all long and high frequency sounds go into category A” only enter the verbal system after the unidimensional rules have failed. The other category learning system in their model is an implicit or procedural learning system (Ashby & Waldron, 1999) that is based on the learning of actual skills or procedures (in this case, for categorization). This system does not have such a preference for unidimensional solutions, but learns more slowly.

The notion that learning categories defined over multiple dimensions could be more difficult than learning unidimensional categories may seem counterintuitive. Indeed, category learning is sometimes facilitated by the presence of multiple dimensions of variation. When multiple cues are available to aid in the identification of a category member, or when nominally distinct dimensions’ values are interpreted holistically, redundancy gain may be observed (e.g., Egeth & Mordkoff, 1991; Garner, 1974; Pomerantz & Lockhead, 1991). In addition, the presence of correlated attributes among some members of a set of objects can lead observers to form a category that includes those members and excludes the rest—an effect that has been demonstrated even in 10-month-olds (Younger, 1985). However, these advantages of correlations among stimuli depend upon redundancy. Note that in the “diagonal” categories in the right-hand column of Figure 1, the value of only one dimension is not a reliable predictor of category membership; good performance requires use of both dimensions. Relative to unidimensional “filtering” tasks (left-hand column), any advantage due to correlations among the dimensions may be outweighed by the fact that listeners must attend to two dimensions rather than one. Thus, the multidimensional-categorization task (sometimes referred to as a condensation task) is more difficult than analogous unidimensional tasks (Posner & Keele, 1970; Gottwald & Garner, 1972).

Distinguishing “diagonal” and non-“diagonal” category distributions presupposes the psychological reality of the axes and a particular interpretation of the axes’ orientation. This notion has been studied in attempts to understand the separability or integrality of pairs of dimensions. Broadly speaking, two separable dimensions can be attended to exclusively without mutual interference, while integral dimensions cannot (Garner, 1974). This leads to the prediction that if two category sets defined along separable dimensions are rotated in stimulus space (converting the left column of Figure 1 to the right column), categorization should become substantially more difficult, because observers are deprived of the effective strategy of ignoring the irrelevant dimension (or, conversely, because any tendency to rely on a single dimension leads to many errors). This prediction has been upheld in a number of studies, although the situation is complicated by the fact that classification of dimension pairs as separable or integral is not always maintained consistently over tasks (more thorough discussion of these issues may be found in Grau and Kemler Nelson, 1988; Kemler Nelson, 1993; Melara and Marks, 1990; Shepard, 1991). To anticipate our results, the present experiments reveal a large axis rotation effect, revealing that the speechlike dimensions under study are “psychologically real” in Grau and Kemler Nelson’s sense.

In our experiments adult listeners were exposed to categories of non-speech sounds. These were inharmonic tone complexes filtered by a single resonance. The two dimensions of variation were the frequency of the spectral peak at which the sound complex was filtered (formant frequency) and the duration of the stimulus (duration). These dimensions are important in the perception of vowel sounds (e.g., Ainsworth, 1972; Peterson & Barney, 1952).

Although in principle models of language acquisition might best be developed using novel speech categories (such as phonetic categories not present in the language of the participants), it is well known that users of a given language tend to interpret sounds from non-native languages in terms of the perceptual categories of their native language (Best, McRoberts, & Sithole, 1988; Best & Strange, 1992; Flege, 1995; Polivanov, 1931) especially after being trained to identify these stimuli (Francis, Nusbaum, & Fenn, 2007). This complicates efforts to model category acquisition in naïve listeners, and motivated our choice to use non-speech sounds as stimuli. However, because these dimensions (or closely related ones) are necessary for speech interpretation, there is no reason to expect that success in the task would require the development of genuinely novel features or stimulus dimensions (see Francis and Nusbaum, 2002, for discussion and evidence bearing on this point for speech sounds, and Schyns, Goldstone, and Thibaut, 1998, regarding feature creation more generally). For example, given that the native language of the participants was Dutch (Booij, 1995), all subjects were fully accustomed to distinguishing the vowels in words like maan (“moon”), man (“man”), and men (“people”). The first two words’ vowels may differ primarily in their duration (Nooteboom & Doodeman, 1972), while the last two words’ vowels differ in their formant frequencies. Thus, although the inharmonic tone complexes did not sound like spoken words, the dimensions of variation themselves were not new.

Listeners’ exposure to the category structures was given through experience with category exemplars, in a forced-choice decision task with feedback on each trial in Experiment 1, and without trial-by-trial feedback in Experiment 2. The supervised learning procedure in Experiment 1 was thus comparable to the typical procedure used in visual category learning studies and in speech-contrast training studies (e.g., Bradlow, Akahane-Yamada, Pisoni, & Tohkura, 1997; Greenspan, Nusbaum, & Pisoni, 1988; Lively, Logan, & Pisoni, 1993). The unsupervised learning procedure in Experiment 2 was more comparable to the situation of infants learning their first language. Learning of multidimensionally varying categories with relevant variation in one dimension was tested in Conditions 1 and 2 of each experiment, whereas learning of multidimensional categories with relevant variation in two dimensions was tested in Condition 3.

All experiments used the same basic procedure, with a learning phase and a maintenance phase. In the learning phase, listeners were presented with stimuli drawn from two probability density functions. They were faced with the problem of partitioning the psychophysical space by using a criterion based on one or more dimensions. Listeners’ use of a unidimensional criterion would be reflected in their assignment of all stimuli below a criterion value on that dimension to one category, and all stimuli above it to another (Ashby & Maddox, 1990). The use of a multidimensional criterion would be reflected by listeners’ allowing dimensions to trade off: for example, a low value on one dimension might be compensated by a low value on the other (or a high value on the other, depending on the orientation of the category’s “diagonal” in perceptual space). This compensation entails interpretation of one dimension relative to the value of the other in assigning category membership, a process that is a hallmark of speech perception (e.g., Repp, 1982). In Conditions 1 and 2 the categorization problems could be solved completely (no miscategorized stimuli) by using one dimension, while the categorization problem of Condition 3 (and Experiment 1B) required the use of both dimensions for good categorization.

After the learning phase, subjects entered a maintenance phase intended to characterize their division of psychophysical space. The stimuli of all maintenance phases except those of Experiment 1B were drawn from an equidistantly spaced grid that was intended to “scan” the subjects’ psychophysical space in a neutral way, without continued distributional information (see the lower right panel of Figure 2). This change in stimulus properties permitted more accurate assessment of listeners’ use of each dimension of variation, and also allowed evaluation of whether participants would maintain their category identification criteria once the distributional cues to category membership were no longer supported in the input. In Experiment 1B, we compared maintenance performance on this grid with maintenance of the learned category identification criteria on the same stimuli as in the learning phase. In none of the maintenance phases did the listeners receive trial-by-trial feedback.

Figure 2. — Learning (upper panels) and maintenance (lower panels) conditions of Experiments 1 and 2 and the learning and maintenance conditions of Experiment 1B (rightmost panels).

Experiment 1: Supervised learning

Method

Subjects

Thirty-six subjects (twelve in each condition), all students from the University of Nijmegen, were drawn from the Max Planck Institute subject pool and participated in return for a small payment. None of the subjects reported any history of hearing problems.

Stimuli

The stimuli were inharmonic sound complexes, 112 in each category. All stimuli were created by modifying a base signal. This base signal was an inharmonic sound complex made by adding several sinusoids with exponentially spaced frequencies. The base signal was defined by the following formula:

B(t) = A \sum_{n=0}^{N-1} \sin\left(2 \pi f_{0} F^{n} t\right) \qquad (1)

In this formula, A represents the amplitude of the signal, f₀ is the frequency of the lowest sinusoid (500 Hz), t is time in seconds, and F is the frequency ratio between two successive sinusoids (1.15). Thus, the frequencies of the base signal were not spaced linearly, as they are in harmonic (e.g., speech) sounds. Finally, N is the total number of sinusoids that were added together; this was set to 17.
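As a minimal sketch of Equation 1, the base signal can be synthesized by summing the 17 sinusoids directly. The sampling rate (44.1 kHz), the unit amplitude, and the function names are assumptions made for illustration; the text does not state them.

```python
import numpy as np

def partial_frequencies(f0=500.0, ratio=1.15, n_partials=17):
    """Frequencies f0 * F**n for n = 0 .. N-1 (exponentially spaced)."""
    return f0 * ratio ** np.arange(n_partials)

def base_signal(duration_s, sample_rate=44100, amplitude=1.0):
    """Inharmonic complex B(t) = A * sum_n sin(2*pi*f0*F**n*t).

    sample_rate and amplitude are illustrative assumptions.
    """
    n_samples = int(round(duration_s * sample_rate))
    t = np.arange(n_samples) / sample_rate
    freqs = partial_frequencies()
    # One row per partial, one column per sample; sum over partials.
    return amplitude * np.sin(2 * np.pi * np.outer(freqs, t)).sum(axis=0)
```

With these values the highest partial lies at 500 × 1.15^16, roughly 4.7 kHz, which is well below the Nyquist frequency of the assumed sampling rate.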

After the base signal was constructed, it was filtered with a single resonance peak, implemented as a second order Infinite Impulse Response (IIR) filter. The filter’s bandwidth was 0.2 times its resonance frequency. Each sound was truncated at the desired duration, applying linear onset and offset ramps of 5 ms to avoid the perception of clicks. In all experiments, the stimuli varied in two dimensions: the frequency of the spectral peak at which the sound complex was filtered (our non-speech analogue of formant frequency) and the duration of the sound. To ensure that both dimensions would be equally salient and discriminable, they were converted to psychophysical scales and normalized using their respective just noticeable differences (JNDs). The psychophysical scale commonly accepted for the perception of frequency is the Equivalent Rectangular Bandwidth scale (Glasberg & Moore, 1990). With this scale, physical frequency f expressed in Hertz is transformed to “psychological frequency” e expressed in ERB units as follows:

e = 21.4 \log_{10}(0.00437 f + 1) \qquad (2)
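The ERB transformation in Equation 2 and its inverse can be written as a short sketch. The function names are hypothetical; the constants are those of the equation. As a check, 1501 Hz maps to about 18.80 ERB, which matches the paired values in Table 1.

```python
import math

def hz_to_erb(f_hz):
    """Equation 2: e = 21.4 * log10(0.00437 * f + 1)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

def erb_to_hz(e):
    """Algebraic inverse of Equation 2."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437
```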

Psychological duration D (measured in DUR) is converted from stimulus duration t (expressed in s) according to the following transformation:

D = 10 \log(t) \qquad (3)

This transformation was proposed by Smits, Sereno, and Jongman (2006) based on data published by Abel (1972). The relevant JND in this frequency region for formant frequency is 0.12 ERB (Kewley-Port & Watson, 1994). For duration, experiments by Smits et al. (2006) and subsequent piloting with multidimensional stimuli varying in duration and frequency indicated that a JND of 0.25 DUR resulted in a discriminability comparable to 0.12 ERB. We used these values to equalize the range of variation between the stimulus dimensions, so that the difference between the category means in the training distributions and between the highest and the lowest stimulus value in the grid used in the maintenance phase was 20 JNDs for both frequency and duration.
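The duration scale and the JND-based equalization can be sketched as follows. One assumption needs flagging: the text gives t in seconds, but the paired values in the tables (for example 150 ms and 50.1 DUR in Table 2) are reproduced when the logarithm is taken as the natural logarithm and t is entered in milliseconds. The sketch adopts that reading, which is an inference from the tables rather than something the text states. The function names are hypothetical.

```python
import math

JND_DUR = 0.25  # size of one JND in DUR units (from the text)
JND_ERB = 0.12  # size of one JND in ERB units (from the text)

def ms_to_dur(t_ms):
    """D = 10 * ln(t), with t in ms (assumed reading of Equation 3)."""
    return 10.0 * math.log(t_ms)

def dur_to_ms(d):
    """Inverse of ms_to_dur under the same assumption."""
    return math.exp(d / 10.0)

def span_in_scale_units(n_jnds, jnd_size):
    """Width of a range that covers n_jnds just noticeable differences."""
    return n_jnds * jnd_size
```

Under these values a span of 20 JNDs corresponds to 5.0 DUR and 2.4 ERB, which agrees with the minimum and maximum grid values listed in Table 2.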

Our stimuli were constructed in the same way as those used by Smits, Sereno, and Jongman (2006). The participants in their experiment, who were drawn from the same MPI subject pool, typically described the stimuli as sounding like computer sounds, organs, or horns (Smits, Sereno, & Jongman, 2006). Figure 3 contains spectrograms of four stimuli used in the experiment. The spectrograms in Figure 3 depict stimuli of short and long duration and of high and low frequency, spanning the whole range of stimuli used in our experiment. As the spectrograms imply, the stimuli varied in dimensions relevant for speech sound identification, but would not be mistaken for or interpreted as actual speech sounds.

Figure 3. — Spectrograms of four stimuli used in the experiment. Note the different time scales due to differences in stimulus duration. Listeners reported stimuli as being similar to speech, but definitely nonspeech (Smits, Sereno, & Jongman, 2006).
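The stimulus construction described in this section (a single resonance applied to the base signal, followed by 5 ms linear onset and offset ramps) might be sketched as below. The exact filter design is not given in the text, so this sketch substitutes a standard two-pole digital resonator with the bandwidth set to 0.2 times the centre frequency. The sampling rate and all function names are likewise assumptions.

```python
import numpy as np

def resonate(signal, center_hz, sample_rate, bandwidth_ratio=0.2):
    """Two-pole resonator (assumed design; the source only says
    'second order IIR' with bandwidth = 0.2 * resonance frequency)."""
    bandwidth_hz = bandwidth_ratio * center_hz
    step = 1.0 / sample_rate
    c = -np.exp(-2.0 * np.pi * bandwidth_hz * step)
    b = 2.0 * np.exp(-np.pi * bandwidth_hz * step) * np.cos(2.0 * np.pi * center_hz * step)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    out = np.zeros(len(signal))
    y1 = y2 = 0.0
    for i, x in enumerate(signal):
        y = a * x + b * y1 + c * y2
        out[i] = y
        y2, y1 = y1, y
    return out

def apply_ramps(signal, sample_rate, ramp_s=0.005):
    """Linear onset and offset ramps of ramp_s seconds."""
    n_ramp = int(round(ramp_s * sample_rate))
    envelope = np.ones(len(signal))
    ramp = np.linspace(0.0, 1.0, n_ramp)
    envelope[:n_ramp] = ramp
    envelope[-n_ramp:] = ramp[::-1]
    return signal * envelope

def make_stimulus(center_hz, duration_s, sample_rate=44100):
    """Base signal of Equation 1, filtered and ramped.

    The signal is generated at its final duration; for a causal filter
    this equals filtering a longer signal and truncating afterwards.
    """
    t = np.arange(int(round(duration_s * sample_rate))) / sample_rate
    partials = 500.0 * 1.15 ** np.arange(17)
    base = np.sin(2.0 * np.pi * np.outer(partials, t)).sum(axis=0)
    return apply_ramps(resonate(base, center_hz, sample_rate), sample_rate)
```

Because the partials are fixed and only the resonance moves, changing `center_hz` shifts which partials dominate the spectrum, which is the sense in which the frequency dimension is an analogue of formant frequency.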

Solving the categorization problem in Conditions 1 and 2 required the use of only one dimension, whereas solving the problem in Condition 3 required the use of both dimensions. In Condition 1, the stimuli manifested relevant variation in duration and irrelevant variation in formant frequency. In Condition 2, the stimuli manifested relevant variation in formant frequency and irrelevant variation in duration. In Condition 3, the stimuli manifested relevant variation in both dimensions (see the first three upper panels of Figure 2). To ensure a large enough incentive for participants to actually use both dimensions in Condition 3 (Goudbeek, Swingley, & Kluender, 2007), we chose the mean and covariance matrices of the two distributions such that using a unidimensional solution to the categorization problem resulted in a much lower optimal percentage of correctly categorized stimuli (70%) than using the optimal two-dimensional solution (100%). Table 1 shows the perceptual and physical characteristics of the distributions of the learning stimuli of each condition.

Table 1.

Distributional characteristics of the learning stimuli with relevant variation in one dimension (Conditions 1 and 2) or relevant variation in two dimensions (Condition 3).

Condition	Dimension	Category A mean	Category A σ	Category A ρ	Category B mean	Category B σ	Category B ρ
Condition 1 (Duration relevant)	Duration	47.7 DUR	0.65 DUR	−0.05	52.53 DUR	0.65 DUR	−0.10
	Duration	117 ms	1.07 ms		205.0 ms	1.07 ms
	Formant frequency	18.80 ERB	1.88 ERB		18.90 ERB	1.88 ERB
	Formant frequency	1501 Hz	51.3 Hz		1520 Hz	51.3 Hz
Condition 2 (Frequency relevant)	Duration	50.1 DUR	6.45 DUR	0.05	49.73 DUR	6.46 DUR	0.10
	Duration	149.6 ms	1.91 ms		144.5 ms	1.91 ms
	Formant frequency	17.6 ERB	0.31 ERB		20.0 ERB	0.31 ERB
	Formant frequency	1295 Hz	7.76 Hz		1737 Hz	7.76 Hz
Condition 3 (multidimensional)	Duration	48.38 DUR	2.80 DUR	−0.98	51.66 DUR	2.82 DUR	−0.98
	Duration	126.2 ms	1.32 ms		175.2 ms	1.33 ms
	Formant frequency	17.79 ERB	1.34 ERB		19.70 ERB	1.33 ERB
	Formant frequency	1322 Hz	35.5 Hz		1977 Hz	35.2 Hz
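To illustrate why Condition 3 rewards a two-dimensional rule, the following sketch draws exemplars from bivariate normal distributions with the Condition 3 means, standard deviations, and correlations from Table 1 (in DUR and ERB units), and compares a single-dimension criterion with a linear boundary that uses both dimensions. It is a simulation under stated assumptions, not the authors' stimulus-generation procedure: the experiment used 112 fixed stimuli per category, whereas this sketch samples many points, and treating ρ as the correlation between the two dimensions is an interpretation of the table.

```python
import numpy as np

def make_cov(sd_dur, sd_freq, rho):
    """Covariance matrix from two standard deviations and a correlation."""
    off_diag = rho * sd_dur * sd_freq
    return np.array([[sd_dur ** 2, off_diag], [off_diag, sd_freq ** 2]])

def simulate_condition3(n_per_category=20000, seed=0):
    """Return accuracies of duration-only, frequency-only, and 2-D rules."""
    rng = np.random.default_rng(seed)
    mean_a = np.array([48.38, 17.79])  # (DUR, ERB), Category A
    mean_b = np.array([51.66, 19.70])  # (DUR, ERB), Category B
    cov_a = make_cov(2.80, 1.34, -0.98)
    cov_b = make_cov(2.82, 1.33, -0.98)
    a = rng.multivariate_normal(mean_a, cov_a, size=n_per_category)
    b = rng.multivariate_normal(mean_b, cov_b, size=n_per_category)
    x = np.vstack([a, b])
    is_b = np.repeat([False, True], n_per_category)
    midpoint = (mean_a + mean_b) / 2.0
    # Unidimensional rules: criterion halfway between the category means.
    acc_dur = np.mean((x[:, 0] > midpoint[0]) == is_b)
    acc_freq = np.mean((x[:, 1] > midpoint[1]) == is_b)
    # Two-dimensional rule: linear discriminant with averaged covariance.
    weights = np.linalg.solve((cov_a + cov_b) / 2.0, mean_b - mean_a)
    acc_2d = np.mean(((x - midpoint) @ weights > 0) == is_b)
    return acc_dur, acc_freq, acc_2d
```

In this simulation the two-dimensional boundary separates the categories almost perfectly, while either single-dimension criterion misclassifies a substantial share of exemplars, which is the pattern the design was chosen to produce.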

The maintenance stimuli were the same for all conditions, with items taken from an equidistantly spaced grid (see the lower left panels of Figure 2 and Table 2).

Table 2.

Distributional characteristics of the maintenance phase (equidistantly spaced grid).

Dimension	Mean	Min	Max	Step size
Duration	50.1 DUR	47.6 DUR	52.6 DUR	0.84 DUR/step
Duration	150 ms	117 ms	193 ms	12.7 ms/step
Formant frequency	18.8 ERB	17.6 ERB	20.00 ERB	0.4 ERB/step
Formant frequency	1499 Hz	1288 Hz	1739 Hz	75.17 Hz/step
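A sketch of the equidistant maintenance grid follows. The 7 × 7 layout is inferred from the 49 maintenance stimuli reported in the Procedure together with the ranges and step sizes in Table 2; it is not stated explicitly, and the function name is hypothetical.

```python
import numpy as np

def maintenance_grid(n_steps=7):
    """All combinations of n_steps equally spaced duration and frequency values."""
    durations = np.linspace(47.6, 52.6, n_steps)    # DUR, from Table 2
    frequencies = np.linspace(17.6, 20.0, n_steps)  # ERB, from Table 2
    dd, ff = np.meshgrid(durations, frequencies, indexing="ij")
    return np.column_stack([dd.ravel(), ff.ravel()])
```

Seven points per dimension give six steps, so the step sizes come out as 5.0 / 6 ≈ 0.83 DUR and 2.4 / 6 = 0.4 ERB, close to the tabled values.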

Procedure

Subjects were seated in a soundproof booth in front of a computer screen and a two-button response box. In the learning phase, they listened to 448 stimuli (2 categories times 112 stimuli per category times 2 presentations) through Sennheiser closed-ear headphones. The stimuli from the two categories were presented in a random order in two blocks separated by a brief rest period. All 112 stimuli from each category were presented once in each block.

The listeners’ task was to assign each stimulus to group A or B, using the button box. When their categorization was correct, the monitor displayed (the Dutch equivalent of) “right” in green letters for 700 ms; when the categorization was incorrect, the monitor displayed (the Dutch equivalent of) “wrong” in red letters for 700 ms immediately following the response. After the visual feedback disappeared, a 200 ms blank screen preceded the next stimulus.

In the maintenance phase, subjects categorized sounds from the test continuum as belonging to group A or B. There were 49 maintenance stimuli that were randomly ordered in four blocks, totalling 196 presentations. Once a participant had selected a category label on a trial, the monitor would display (the Dutch equivalent of) “next” for 700 ms and the next stimulus was played after a 200 ms delay. No feedback was given on maintenance trials.

Results and discussion

The results were analyzed using percentage correct, d’ and logistic regression. Both d’ and percentage correct are familiar measures of performance. A disadvantage is that they are based on category membership and not on the coordinates of each individual stimulus in the duration / formant-frequency plane and consequently they yield less fine-grained information about participants’ strategies. In addition, they cannot be applied to the data of the maintenance phase, because “correctness” of a response does not apply straightforwardly in the region between the trained category exemplars. Logistic regression, on the other hand, is sensitive to the coordinates of the stimuli, and can be applied to the data of the maintenance phase (Agresti, 1990).

In regression analysis, linear and interaction terms can be entered into the analysis. For the present kind of analysis, the interpretation of an interaction term is often problematic, and such terms are usually left out in studies of this type. Here, the results were analyzed both with and without the interaction term. Of the 144 analyses in Experiments 1 and 1B (12 subjects times 4 analysis conditions times 3 experimental parts), only 12 had a significant interaction term. Furthermore, the models with the interaction term fit hardly better than those without. Based on these results, we present here only the model without the interaction term.

Signal detection analysis (percent correct and d’)

The data of the learning phases were analyzed first using percentage correct and d’. To probe for learning, the first and second halves of the learning phase were analyzed separately. Listeners’ performance was fairly good. The three upper rows of Table 3 show the percentages correct and d’ values of the first and second part of the learning phase of Condition 1 (Duration relevant), Condition 2 (Frequency relevant) and Condition 3 (multidimensional learning). Recall that percentage correct and d’ were only computed for the learning phase because it is there that “right” and “wrong” can be clearly assigned. In all conditions and both learning phases, percentages correct and d’s were significantly above chance (all p < 0.05) in t-tests with correction for multiple comparisons.
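The two measures can be illustrated with a minimal sketch. The text does not state the exact d’ formula that was used, so the sketch assumes the standard definition d’ = z(hit rate) − z(false-alarm rate), treats an “A” response to a category-A stimulus as a hit, and applies a log-linear correction to avoid infinite z-scores. The function name `signal_detection` and the correction are assumptions, not part of the original analysis.

```python
from statistics import NormalDist


def signal_detection(responses, categories, target="A"):
    """Return (proportion correct, d') for one listener.

    responses and categories are equal-length sequences of "A"/"B" labels.
    A hit is a target response to a target-category stimulus; a false
    alarm is a target response to a stimulus from the other category.
    """
    pairs = list(zip(responses, categories))
    n_target = sum(1 for _, c in pairs if c == target)
    n_other = len(pairs) - n_target
    hits = sum(1 for r, c in pairs if c == target and r == target)
    false_alarms = sum(1 for r, c in pairs if c != target and r == target)
    correct = sum(1 for r, c in pairs if r == c)

    # Log-linear correction (an assumption): keeps both rates strictly
    # between 0 and 1 so that the inverse normal CDF stays finite.
    hit_rate = (hits + 0.5) / (n_target + 1)
    fa_rate = (false_alarms + 0.5) / (n_other + 1)

    z = NormalDist().inv_cdf
    return correct / len(pairs), z(hit_rate) - z(fa_rate)
```

A listener who responds “A” on every trial obtains equal hit and false-alarm rates and therefore d’ = 0, which is the chance level against which the reported values were tested.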

Table 3.

Signal detection results (mean percentage correct (“pc”) and d’) with their standard deviations, for Experiments 1 and 1B.

	Learning phase 1				Learning phase 2
	pc	σ	d’	σ	pc	σ	d’	σ
Experiment 1, Condition 1	0.81	0.04	1.39	0.21	0.93	0.02	2.59	0.27
Experiment 1, Condition 2	0.80	0.03	1.32	0.17	0.89	0.03	2.07	0.25
Experiment 1, Condition 3	0.59	0.01	0.33	0.05	0.63	0.01	0.50	0.05
Experiment 1B	0.58	0.02	0.28	0.08	0.62	0.03	0.45	0.11

An ANOVA with Part of the experiment (Learning phase 1 versus 2) as a within-subjects variable and Condition (Duration relevant versus Frequency relevant versus Multidimensional) as a between-subjects variable revealed significant improvements in performance from the first phase to the second, for the percent correct measure (F [1,33] = 29.27, p < 0.05, η_p²= 0.47) and the d’ measure (F [1,33] = 33.29, p < 0.05, η_p²= 0.50). Both analyses showed a significant difference between Conditions (F [1,2] = 43.10, p < 0.05, η_p²= 0.63 and F [1,2] = 28.36, p < 0.05, η_p²= 0.72 for percent correct and d’ respectively). Post hoc multiple comparisons (Tukey HSD) showed no significant differences between the unidimensional conditions, while Condition 3 differed significantly from both Condition 1 and 2, indicating the advantage of unidimensional learning over multidimensional learning. Follow up analyses conducted for each condition separately revealed significant differences between the first and second parts of the experiment for both percentage correct (F_min [1,11] = 6.23, p < 0.05, η_p²= 0.36) and d’ (F_min [1,11] = 8.78, p < 0.05, η_p²= 0.44) for all conditions. The signal detection measures thus indicated that learning a multidimensional distinction was feasible, but significantly more difficult than learning a unidimensional one.

Logistic regression

Logistic regression yields two β-weights, similar to the weights in a linear regression, that reflect the influence of the independent variables (here, the perceptual dimensions) on the dependent variable (the listener’s choice). A β-weight of large magnitude indicates a strong influence of the associated dimension on the dependent variable. The β-weights were calculated separately for each subject. Comparing the effects of β-weights for unidimensional (Conditions 1 and 2) and multidimensional (Condition 3) learning problems is problematic because of conflicting predictions for successful unidimensional versus multidimensional performance. For this reason, Conditions 1 and 2 are analyzed separately from Condition 3.
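As an illustration of how per-subject β-weights of this kind can be obtained, the following is a minimal sketch of a two-predictor logistic regression (without interaction term) fitted by Newton-Raphson iteration. The fitting software and stimulus scaling used in the study are not described here, so the function names `fit_logistic` and `solve3`, the coding of an “A” response as 1, and the fixed number of iterations are all assumptions.

```python
import math


def solve3(m, v):
    """Solve the 3x3 linear system m x = v by Gauss-Jordan elimination."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for i in range(3):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(i, 3), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(3):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [x - f * y for x, y in zip(a[r], a[i])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_logistic(dur, freq, chose_a, n_iter=25):
    """Return [intercept, beta_dur, beta_freq] for one listener.

    dur and freq hold the stimulus coordinates of each trial; chose_a
    holds 1 for an "A" response and 0 for a "B" response.
    """
    rows = [(1.0, d, f) for d, f in zip(dur, freq)]
    beta = [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]
        for x, y in zip(rows, chose_a):
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(3):
                grad[i] += (y - p) * x[i]
                for j in range(3):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        # Newton-Raphson update: beta <- beta + H^{-1} g
        beta = [b + s for b, s in zip(beta, solve3(hess, grad))]
    return beta
```

The sketch assumes that the responses are not perfectly separable by the two dimensions; with perfectly separable data the maximum-likelihood weights are unbounded and the iteration does not converge.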

Table 4 and Figure 4 display the mean β-weights for the relevant and irrelevant dimension of Condition 1 and 2 for the first half of the learning phase (“Learning phase 1”), the second half of the learning phase (“Learning phase 2”) and the maintenance phase (“Maintenance phase”).

Table 4.

Logistic regression results of Experiment 1 for Conditions 1 and 2. Mean β-weights are shown for both dimensions and the number of subjects out of 12 using one (Uni) or both (Multi) dimensions significantly.

	Condition 1 (Duration relevant)				Condition 2 (Frequency relevant)
	Learning phase 1
	μ (β)	σ (β)	Uni	Multi	μ (β)	σ (β)	Uni	Multi
Relevant	0.65	0.13	10	0	1.37	0.73	11	1
Irrelevant	0.05	0.04	0	0	0.02	0.03	0	1
	Learning phase 2
	μ (β)	σ (β)	Uni	Multi	μ (β)	σ (β)	Uni	Multi
Relevant	1.50	0.27	11	0	2.28	1.11	11	1
Irrelevant	0.10	0.10	0	0	0.02	0.04	0	1
	Maintenance phase
	μ (β)	σ (β)	Uni	Multi	μ (β)	σ (β)	Uni	Multi
Relevant	1.54	0.14	12	0	0.20	0.18	9	1
Irrelevant	0.10	0.06	0	0	0.07	0.06	0	1

In addition to β-weights, the logistic regression gives significance levels of the hypothesis that each β-weight differs from zero. If a β-weight did not differ significantly from zero at the p = .05 level, we concluded that subjects did not make use of that dimension. The columns of Table 4 labelled “Uni” and “Multi” show how many subjects used either one or both dimensions significantly. Numbers of subjects who did not use any dimension significantly are not shown (note that the number of subjects in each group was always 12).
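The classification into “Uni” and “Multi” can be sketched as follows. The text does not say which test produced the significance levels; the sketch assumes a two-sided Wald test based on each β-weight and its standard error, which is a common choice for logistic regression. The function names and the standard-error arguments are hypothetical.

```python
import math


def wald_p(beta, se):
    """Two-sided p-value for H0: beta = 0 (normal approximation)."""
    z = abs(beta / se)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


def classify(b_dur, se_dur, b_freq, se_freq, alpha=0.05):
    """Label a listener by the number of dimensions used significantly."""
    dur_used = wald_p(b_dur, se_dur) < alpha
    freq_used = wald_p(b_freq, se_freq) < alpha
    if dur_used and freq_used:
        return "Multi"
    if dur_used or freq_used:
        return "Uni"
    return "None"
```

Listeners labelled "None" by such a procedure correspond to the subjects who are not shown in Table 4.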

Table 4 and Figure 4 confirm that in both conditions subjects learned to use the relevant dimension. Both the mean β-weights and the number of subjects using that dimension were higher than those of the irrelevant dimension. This also shows that subjects did not make systematic use of the irrelevant dimension of variation in making their judgments, as the values of the irrelevant dimensions remained close to zero throughout the experiment. The higher mean β-weights and number of listeners using the relevant dimension in Condition 2 compared to Condition 1 suggest that formant frequency was an easier dimension to learn to attend to than duration. In the maintenance phase, when feedback was no longer given and the stimulus grid was used, listeners persisted in their use of the relevant dimensions. However, although formant frequency was easier to learn, it also appeared easier to unlearn, as was evidenced by the large drop in the average β-weight for formant frequency in the maintenance phase.

To statistically test these effects, we carried out an ANOVA with Part of the experiment (Learning phase 1, Learning phase 2, and Maintenance phase) and Dimension (Relevant versus Irrelevant) as within-subjects variables, and Condition (Duration relevant versus Formant frequency relevant) as between-subjects variable and the β-weights as dependent measures.

Because of a significant three-way interaction between Dimension, Part of the experiment and Condition, the results were further analyzed for each condition separately¹. For Condition 1 (Duration relevant), the β-weight for the relevant dimension was higher than that for the irrelevant dimension (F [1,11] = 61.06, p < 0.05, η_p²= 0.85), which confirmed that listeners learned to attend to the relevant dimension. The significant main effect for Part of the experiment (F [2,22] = 12.83, p < 0.05, η_p²= 0.54) shows that subjects improved over the course of the training. The interaction between Part of the experiment and Dimension (F [2,22] = 14.40, p < 0.05, η_p²= 0.57) indicates that the learning effect depended on whether a dimension was relevant or irrelevant: the effect for Part of the experiment was present for the relevant dimension (F [2,22] = 13.78, p < 0.05, η_p²= 0.56), but not the irrelevant dimension (F [2,22] = 1.69, n.s., η_p²= 0.13).

In Condition 2, the same main effects and interactions as in Condition 1 were present. The β-weight for the relevant dimension (frequency) was higher than that of the irrelevant dimension (F [1,11] = 175.04, p < 0.05, η_p²= 0.94) and this advantage for the relevant dimension increased during the learning phase (Part of experiment effect, F [2,22] = 15.61, p < 0.05, η_p²= 0.59). The interaction between Part of the experiment and Dimension was also present; post-hoc analysis showed a significant effect of Part of the experiment for the relevant dimension (F [2,22] = 17.34, p < 0.05, η_p²= 0.61), and a much smaller though significant effect for the irrelevant dimension (F [2,22] = 3.54, p < 0.05, η_p²= 0.24). This difference between the conditions is caused by differences in their Maintenance phases. In Condition 1, when duration was the relevant dimension, its β-weight remained high in the Maintenance phase and the β-weight for frequency remained small. In Condition 2, however, the β-weight for frequency dropped in the Maintenance phase and that of duration rose. Thus, even when they had previously correctly used formant frequency, listeners had a tendency to start using duration again when presented with an evenly spaced stimulus grid and without feedback.

The difference between learning to use and maintaining the use of duration and frequency was unexpected, particularly given our attempt to equalize the tested dimensions by scaling the variability of the stimuli to empirically determined just noticeable differences (JNDs). Apparently, the similar JNDs obtained using same/different experiments varying one dimension in a two-dimensional formant-frequency × duration space did not guarantee equal categorization behaviour. Smits et al. (2006) found a similar difference and hypothesized that it may be due to a difference in stimulus dimensions introduced by Stevens and Galanter (1957). Stevens and Galanter argued that dimensions like duration are prothetic dimensions, for which an increase in value means adding more of the same, while dimensions like formant frequency are metathetic dimensions, where an increase does not necessarily mean more of the same. According to the model proposed by Smits et al., storing a category representation or comparing a stimulus with a stored category based on a prothetic dimension is noisier than storing a category representation or comparing a stimulus with a stored category based on a metathetic dimension and thus more difficult in the absence of feedback. This description is consistent with our (unidimensional) results.

Another possibility is that duration and frequency were differentially available to the subjects in these stimuli. That is, to a first approximation the duration of a signal bounded by silence may be measured in a similar way regardless of the spectral characteristics of the signal; but extracting the peak frequency of these tone complexes may have been intrinsically more difficult, or may have profited less from subjects’ background experience in processing auditory signals. Although speech makes use of frequency peaks broadly similar to those tested here (and listeners are exquisitely sensitive to variations in these speech features), the present stimuli were not speech signals. If the participants’ estimation of frequency was noisier than their estimation of duration, this could have led to their relative disregard for frequency in the maintenance phase (see, for example, Zwicker & Fastl, 1990, pp. 265–271). We will return to this issue in Experiment 1B, where the effect of the distributional information in the maintenance phase on the use of these dimensions will be investigated.

In summary, these data show that listeners can, relatively quickly, learn a unidimensional categorization in a two-dimensional space and generalize this learning to new exemplars, though this learning is not always robustly maintained.

Condition 3 addressed learning of multidimensional categories with two relevant dimensions of variation. Instead of what was effectively a unidimensional distinction in Conditions 1 and 2, subjects in Condition 3 had to learn a truly multidimensional distinction: both duration and formant frequency had to be used in order to obtain a high level of correct responding. Given that our interest is in whether individual participants used both dimensions (and not, say, half using one and half using the other), we present the results of Condition 3 as a set of scatterplots in which each point corresponds to one participant. The left-hand side of Figure 5 presents the β-weights for duration (abscissa) and formant frequency (ordinate) for each listener in each part of the experiment. The data points are divided into four groups: listeners who used both dimensions significantly (identified by asterisks), listeners who used only formant frequency (plus-signs), listeners who used only duration (Xs), and listeners who did not use any dimension significantly (circles). Optimal performance corresponds to a point in the upper right-hand corner of the square, at an angle of 45° (when both dimensions are given equal weight) and far away from the origin (reflecting high β-weights and thus consistent behaviour).

The two upper panels of the left-hand column of Figure 5 show performance in the first and second learning phase of Condition 3. Judging by the number of asterisks, several listeners picked up on the information provided by the shapes of the categories’ distributions and the feedback. Improvement in the second part is evident in the higher β-values (i.e., asterisks closer to the upper right corner). However, the third panel shows that listeners had trouble maintaining their learned categorization strategy (only four asterisks remain in the maintenance phase) and started using a unidimensional rule with duration as the relevant dimension (the Xs).

Most subjects succeeded in using one or more dimensions above chance levels, whereas some failed to use any dimension significantly. For the purpose of comparing the performance of the successful subjects across conditions and experiments, it would be desirable to have a measure of these subjects’ central tendency and variability. Note that simply computing the across-subjects average β-weights for each of the dimensions would not be an effective way to characterize overall performance. For example, if half of these subjects used duration exclusively, and the others formant frequency, the average β-weights for duration and frequency might both exceed chance even though none of the individuals actually used both dimensions. These considerations suggest that a measure that integrates performance on both dimensions would be useful.
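The averaging problem described above can be made concrete with a small numerical sketch. The β-weights below are hypothetical and chosen only for illustration; they are not data from the experiment.

```python
from statistics import mean

# Hypothetical (beta_dur, beta_freq) pairs: three listeners use only
# duration and three use only formant frequency. None uses both.
betas = [(1.2, 0.0), (1.0, 0.0), (1.4, 0.0),
         (0.0, 1.1), (0.0, 1.3), (0.0, 0.9)]

mean_dur = mean(b[0] for b in betas)
mean_freq = mean(b[1] for b in betas)

# Number of listeners with a non-zero weight on both dimensions.
n_both = sum(1 for d, f in betas if d > 0 and f > 0)

print(mean_dur, mean_freq, n_both)
```

Both group means are well above zero, which would suggest multidimensional categorization, although no individual listener in this example combines the two dimensions.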

Here, we derive such a measure by computing the angle formed by the line connecting each subject’s β-weights to the origin, and also computing the length of this line. These computations were done by transforming the Cartesian coordinates of the β-weights for duration and formant frequency into the polar coordinates Φ (the angle with the horizontal axis in radians) and A (the distance to the origin) by the following transformations:

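A = \sqrt{\beta_{dur}^{2} + \beta_{freq}^{2}} \tag{4}

\Phi = \arctan\left(\beta_{freq} / \beta_{dur}\right) \quad \text{if } \beta_{dur} > 0 \tag{5a}

\Phi = \arctan\left(\beta_{freq} / \beta_{dur}\right) + \pi \quad \text{if } \beta_{dur} \leq 0 \tag{5b}

\Phi := \Phi - 2\pi \quad \text{if } \Phi > \pi \tag{5c}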

In our analysis, Φ ranges between −π and π radians. When Φ equals ½π, listeners use formant frequency only; when Φ equals 0, listeners use duration only; and when Φ is close to ¼π, listeners use duration as well as formant frequency. As can be seen from Figure 5, listeners who used both dimensions fall in the upper right-hand quadrant, somewhere between 0 and ½π.

The other polar coordinate, A, ranges between zero and infinity. A large A indicates that a subject was internally consistent (though a large average A over subjects need not reflect consistent weights of each dimension across subjects); while a small A indicates that listeners’ categorizations tend not to be internally consistent. In Figure 5, the listeners that categorized using both dimensions (indicated by the asterisks) are farther removed from the origin, while listeners that do not use any dimension significantly (the circles) are all very close to the origin. The left column of Table 5 lists the mean values of Φ for each phase of Condition 3 for all subjects who in a given phase used one or more dimensions above chance levels.
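The transformation in Equations 4 and 5a–5c can be sketched in a few lines. The function name `to_polar` is hypothetical. Equation 5b as printed includes the case of a duration weight of exactly zero, for which the ratio is undefined; the sketch assumes the limiting value of ±½π in that case.

```python
import math


def to_polar(b_dur, b_freq):
    """Return (phi, a) for one listener's pair of beta-weights."""
    a = math.sqrt(b_dur ** 2 + b_freq ** 2)            # Equation 4
    if b_dur > 0:
        phi = math.atan(b_freq / b_dur)                # Equation 5a
    elif b_dur < 0:
        phi = math.atan(b_freq / b_dur) + math.pi      # Equation 5b
    elif b_freq != 0:
        # b_dur == 0: the ratio is undefined, so use the limiting angle
        # (an assumption, not stated in the text).
        phi = math.copysign(math.pi / 2, b_freq)
    else:
        phi = 0.0                                      # origin: angle arbitrary
    if phi > math.pi:
        phi -= 2 * math.pi                             # Equation 5c
    return phi, a
```

For non-zero weights the resulting angle agrees with the two-argument arctangent `math.atan2(b_freq, b_dur)`, which could be used as a more compact alternative.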

Table 5.

Mean values and standard deviations of the polar coordinates Φ and A of the β-weights for duration and formant frequency in the three phases of Condition 3 and Experiment 1B, as well as the numbers of subjects using only duration (D), only formant frequency (F) or both (Multi). Subjects using no dimension are not shown.

Condition 3 (Maintenance with equidistant grid)					Experiment 1B (Maintenance with learning stimuli)
Learning phase 1
		N = 6					N = 7
Φ (σ)	A (σ)	D	F	Multi	Φ (σ)	A (σ)	D	F	Multi
0.26 (0.12)	0.21 (0.10)	3	0	3	0.30 (0.09)	0.29 (0.14)	2	4	1
Learning phase 2
		N = 8					N = 8
Φ (σ)	A (σ)	D	F	Multi	Φ (σ)	A (σ)	D	F	Multi
0.32 (0.18)	0.34 (0.13)	1	1	6	0.37 (0.03)	0.18 (0.21)	0	1	7
Maintenance phase
		N = 12					N = 7
Φ (σ)	A (σ)	D	F	Multi	Φ (σ)	A (σ)	D	F	Multi
−0.22 (0.31)	0.76 (0.29)	8	0	4	0.24 (0.34)	0.42 (0.18)	0	0	7

The mean Φ of the first learning phase differed significantly from 0 (t [5] = 5.12, p < 0.05) as well as from ½π (t [5] = −4.73, p < 0.05). In the second learning phase, mean Φ was again significantly different from both 0 (t [7] = 4.96, p < 0.01) and ½π (t [7] = −2.88, p < 0.05). Mean Φ values exceeded ¼π (the value that would reflect an unbiased use of duration and formant frequency), indicating a somewhat stronger use of the frequency dimension than the duration dimension. As a group, subjects used only duration in the maintenance phase of Condition 3. The mean Φ for subjects using any dimension was not significantly different from 0 (t [11] = −0.243, n.s.), but did differ significantly from ½π (t [11] = −5.850, p < 0.01)².
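Comparisons of this kind can be sketched as one-sample t-tests of the per-subject angles against the reference values 0 (duration only) and ½π (formant frequency only). The function name `one_sample_t` and the Φ values below are hypothetical; they are not the data of the experiment.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, mu0):
    """Return (t, df) for H0: the population mean equals mu0."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    return (mean(values) - mu0) / standard_error, n - 1


# Hypothetical angles (in radians) for six listeners.
phis = [0.70, 0.85, 0.95, 1.00, 0.80, 0.90]

t_zero, df = one_sample_t(phis, 0.0)             # against pure duration use
t_half_pi, _ = one_sample_t(phis, math.pi / 2)   # against pure frequency use
```

A positive t against 0 together with a negative t against ½π, both significant, is the pattern that indicates use of both dimensions.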

An ANOVA with A as the dependent variable and Part of the experiment as a within-subjects variable showed a significant effect of Part of the Experiment (F [2,10] = 5.863, p < 0.05, η_p²= 0.54). Pairwise comparisons showed this effect to be due to a significant difference between the second³ learning phase and the maintenance phase (p < 0.05). Thus, subjects did become more internally consistent in their categorization (higher β-weights) in the maintenance phase, but many were becoming consistent in a unidimensional way.

In sum, while our listeners certainly learned to use both dimensions, they did so with considerable difficulty. Also, they tended to use formant frequency more strongly than duration, as indicated by the higher β-weights for formant frequency. This is shown in Figure 5 by the strong tendency of the listeners to fall along a line steeper than 45°. Why might listeners rely more on the dimension that is then often abandoned when the maintenance phase is uniformly distributed? As described previously, it may be that duration is more salient or easier to encode than formant frequency and that successful learners actively direct their attention to the less salient dimension, overcorrecting for the salience of duration. Recall that a similar pattern was found between subjects in the two unidimensional conditions: subjects learned to use formant frequency (when it was relevant) more reliably than duration (when it was relevant), but tended to shift toward using duration in the maintenance phase (see Table 4).

Learning a multidimensional category distinction with supervision was difficult but possible, with about half of the participants learning successfully. The analysis of percentage correct and d’ data did show a learning effect as did the development in Φ. The consistency measure A did not increase significantly from the first learning phase to the second. The change in both Φ and A in the maintenance phase showed that learning was fragile. Confronted with the equidistantly spaced grid, most listeners opted for a unidimensional solution instead of the multidimensional solution suggested by their prior experience; half of the subjects used both dimensions significantly during the last learning phase but only four of them retained this ability in the maintenance phase, and the remainder began using duration exclusively.

Experiment 1B addressed two possible explanations for participants’ change in categorization strategies when they reached the maintenance phase in Condition 3: the absence of feedback in the maintenance phase and the absence of distributional information. If exposure to a uniform distribution of category exemplars, such as that in Condition 3, is responsible for the altered performance in the maintenance phase, performance in this phase should be better when the training distributions are not replaced by the equidistantly spaced grid but instead are maintained. Alternatively, if the absence of trial-by-trial feedback is in itself enough to disturb the previously learned category boundaries, maintenance-phase performance may be degraded both in Condition 3 and in Experiment 1B.