Individual differences in cue weights are stable across time: The case of Japanese stop lengths

Kaori Idemaru; Lori L Holt; Howard Seltman

doi:10.1121/1.4765076

2012 Dec;132(6):3950–3964. doi: 10.1121/1.4765076

PMCID: PMC3528741 PMID: 23231125

Abstract

Speech categories are defined by multiple acoustic dimensions, and listeners give differential weighting to dimensions in phonetic categorization. The informativeness (predictive strength) of dimensions for categorization is considered an important factor in determining perceptual weighting. However, it is unknown how the perceptual system weighs acoustic dimensions with similar informativeness. This study investigates perceptual weighting of two acoustic dimensions with similar informativeness, exploiting the absolute and relative durations that are nearly equivalent in signaling Japanese singleton and geminate stop categories. In the perception experiments, listeners showed strong individual differences in their perceptual weighting of absolute and relative durations. Furthermore, these individual patterns were stable over repeated testing across as long as 2 months and were resistant to perturbation through short-term manipulation of speech input. Listeners' own speech productions were not predictive of how they weighted relative and absolute duration. Despite the theoretical advantage of relative (as opposed to absolute) duration cues across contexts, relative cues are not utilized by all listeners. Moreover, examination of individual differences in cue weighting is a useful tool in exposing the complex relationship between perceptual cue weighting and language regularities.

INTRODUCTION

Speech perception is complex. One of the reasons is that multiple acoustic dimensions define speech categories, requiring integration of information across dimensions. For example, the voicing distinction in English stops (e.g., [b] versus [p]) may appear to be a simple phonetic contrast. However, examined acoustically, as many as 16 dimensions covary with this distinction (Lisker, 1986). This is emblematic of speech categories; multiple acoustic dimensions typically covary with phonetic category distinctions (Coleman, 2003; Dorman et al., 1977, for stop place of articulation; Jongman et al., 2000, for fricative place of articulation; Hillenbrand et al., 2000, for tense and lax vowels; Kluender and Walsh, 1992, for fricative/affricate distinction; Lisker, 1986, for stop voicing; Polka and Strange, 1985, for liquids).

Whereas any of these dimensions may inform phonetic categorization, they are not necessarily perceptually equivalent. Some acoustic dimensions play a more important role in determining category membership of a sound than do others. To distinguish [b], [d], and [g], for example, English listeners make greater use of differences in formant transitions than of frequency information in the noise burst that precedes the transitions, although each dimension reliably covaries with these consonant categories (Francis et al., 2000). In the voicing distinction between stop consonants at the syllable initial position such as [ba] versus [pa], English listeners rely primarily on the duration of voice onset time (VOT) and use fundamental frequency (F0) of the following vowel as a secondary source of information (Abramson and Lisker, 1985; Francis et al., 2008; Idemaru and Holt, 2011). In a vowel distinction, tense [i] and lax [ɪ] are differentiated acoustically by both spectral and temporal acoustic dimensions. However, English listeners rely much more on the spectral dimension than the temporal dimension in categorizing these vowels (e.g., Hillenbrand et al., 2000). Thus while exploiting multiple acoustic dimensions to inform phonetic categorization, listeners give greater perceptual weight to some dimensions. This has been referred to as perceptual cue weighting (e.g., Holt and Lotto, 2006; Francis et al., 2008). Understanding what determines perceptual cue weighting is fundamental to understanding speech perception.

Cue weighting has been proposed to arise, at least in part, from distributional characteristics of the input (Holt and Lotto, 2006; Francis et al., 2008; Toscano and McMurray, 2010). Holt and Lotto (2006) argue that adaptive listeners will tune perceptual weighting of acoustic dimensions to the distributional regularities of the input to maximize categorization accuracy but that constraints from sensory processing, cognition, and previous experience may interact to influence the extent to which listeners achieve idealized perceptual weighting. For example, all other things being equal, dimensions very well correlated with category identity ought to be more strongly perceptually weighted than those less predictive of category identity. Such differences in informativeness might occur as a consequence of categories' distributional regularities. If, for example, category distributions do not overlap much along a particular acoustic dimension, then the dimension is highly informative about category identity and is likely to be strongly perceptually weighted. Holt and Lotto (2006) investigated this question by studying the extent to which distributional informativeness affected perceptual cue weights as listeners learned novel, arbitrary nonspeech auditory categories. They observed a role for informativeness but found that it interacted with other factors such as the inherent perceptual salience of a dimension and the task in which listeners were engaged.

In the domain of speech categorization, studies of perceptual cue weighting typically have examined phonetic categories for which there is a robust difference in the informativeness of the acoustic dimensions under investigation. Take the distinction of English [l] and [ɹ], for example. Acoustically, the onset frequency of the third formant (F3) is the single best predictor of English talkers' intended [l] and [ɹ] productions (Yamada and Tohkura, 1992; Iverson et al., 2003; Ingvalson et al., 2011; Lotto et al., 2004). The onset frequency of the second formant (F2) is also a predictor, but a substantially weaker one (Lotto et al., 2004). This is mirrored in perception in that F3 is given most perceptual weight by native English listeners (Yamada and Tohkura, 1992). When listeners categorize sounds that span from [l] to [ɹ] varying along the dimensions of F3 and F2, responses are best correlated with the stimulus value along the F3 dimension and are weakly correlated along the F2 dimension (Ingvalson et al., 2011). Similarly, in production of syllable-initial English stop voicing (e.g., [ba] versus [pa]), VOT is the single best predictor and F0 of the following vowel is a secondary, weaker one (Lehiste and Peterson, 1961; Raphael, 2005; Holt and Wade, 2004). In perception, VOT is more heavily weighted as the primary cue (Abramson and Lisker, 1985; Francis et al., 2008; Idemaru and Holt, 2011).

These cases demonstrate that when there is a robust differentiation among acoustic dimensions as a function of informativeness of category membership, the more informative dimension is most heavily weighted in speech categorization. It is equally important to investigate phonetic categories for which acoustic dimensions' informativeness is relatively equivalent. Without a robust bias in acoustic informativeness, perceptual cue weight may be balanced across dimensions because of the parity in informativeness, it may be variable across listeners as either cue will lead to accurate categorization, or it may be biased for reasons other than the informativeness of acoustic dimensions. Thus by studying such cases, it may be possible to unmask other factors contributing to perceptual cue weighting for speech categories such as how subtle distributional regularities in spoken language relate to perceptual cue weighting, how the computational demands introduced by different cues may influence perceptual weighting, and how resilient or flexible cue weight may be to short-term acoustic variability in the signal. This study investigates perceptual cue weighting when there is parity among acoustic dimensions in terms of their informativeness, a situation that has received little attention (Holt and Lotto, 2006; Francis et al., 2008; Toscano and McMurray, 2010). We further aim to investigate the extent to which there are significant individual differences in perceptual cue weights among native listeners.

To this end, the singleton and geminate distinction in Japanese stop consonants presents an excellent example. Singleton and geminate stops (e.g., [t] and [tt]) in Japanese, and in many other languages, are distinguished primarily by the duration of stop closure. However, segmental duration is heavily influenced by speaking rate.1 As a result, absolute stop closure duration provides imperfect information because across different speaking rates, the stop closure durations corresponding to the singleton and geminate categories overlap considerably. Said another way, the informativeness of absolute duration for singleton/geminate categorization is reduced due to variability from speaking rate. This has been observed for durational contrasts in consonants and vowels in languages including English, Italian, and Icelandic (Miller and Baer, 1983; Miller and Liberman, 1979; Pickett et al., 1999; Pind, 1999; Port and Dalby, 1982; Boucher, 2002) and Japanese (Fujisaki, 1979; Hirata and Whiton, 2005; Idemaru and Guion-Anderson, 2010).

Given that speaking rate variability undermines the informativeness of absolute duration in signaling durationally differentiated phonetic categories across speaking rates, relative duration has been proposed as a higher-order dimension that is more stable across variable speaking rate (Kohler, 1979; Pickett et al., 1999; Pind, 1999; Port and Dalby, 1982). Relative duration is typically expressed in the form of durational ratios between a target speech segment such as the absolute stop duration and the duration of a neighboring segment(s), reflecting a kind of inherent context-dependent normalization for rate changes. Studies have demonstrated that relative duration does better predict rate-dependent phonetic category membership for speech productions than absolute duration (Kohler, 1979; Pickett et al., 1999; Pind, 1986, 1999; Port and Dalby, 1982).

However, there is also evidence that the extent to which absolute duration differentiates (or fails to differentiate) phonetic categories across speaking rate varies across languages. For example, the absolute duration of geminate stops is reported to be three times as long as the duration of singleton stops in Japanese (Han, 1994; Idemaru and Guion, 2008), whereas geminate stops in Italian are only about two times as long as singleton stops (Ham, 2001). The robust absolute duration difference in Japanese may mean that both relative duration and absolute duration are adequate at categorizing singleton and geminate stop productions in Japanese across speaking rates. In support of this, Idemaru and Guion-Anderson (2010) showed that absolute duration (stop duration) was sufficient to categorize 87% of native Japanese singleton and geminate stops, whereas relative duration (durational ratio of stop to the previous syllable) categorized 93% of singleton and geminate stops produced by six speakers across three distinct speech rates. Thus although the informativeness of the cues, measured as their classification accuracy, is uniformly high, relative duration is slightly more informative; however, it is unclear whether this small difference in informativeness is perceptually significant.

In categorizing Japanese singleton and geminate stops, an ideal observer using the full extent of the information available in the input would rely somewhat more on relative duration due to its slight advantage in informativeness. However, it is unclear whether listeners behave as ideal observers. It is possible, for example, that listeners exploit absolute duration despite its lower informativeness. As a unidimensional acoustic cue not requiring integration of information across the utterance, it may confer a computational processing advantage. Or, instead, listeners may be promiscuous in their cue use, committing to neither dimension and exhibiting high variability in perceptual cue weighting given the parity in informativeness.

The issue of individual differences in cue weighting was noted in earlier studies (e.g., Haggard et al., 1970) and has gained attention in more recent research (Kong and Edwards, 2011; Allen et al., 2003; Shultz et al., 2012; Raizada et al., 2010). In particular, Shultz et al. (2012) and Kong and Edwards (2011) showed that whereas listeners consistently weighted VOT more than F0 in categorizing stop voicing, there was considerable individual variation in the extent to which listeners used F0. This seems to reflect F0's secondary status, relative to VOT, in informativeness for categorization (e.g., Abramson and Lisker, 1985). That is, the relatively small difference that F0 makes for category informativeness when VOT is available may lead to individual variation in the weighting of F0 as a perceptual cue. Along the same lines, we predict that in categorizing Japanese singleton and geminate stops, listeners will exhibit individual variation in their use of relative and absolute durations, which do not differ greatly in informativeness. Examining individual differences in perceptual cue weighting in a situation where two acoustic dimensions provide similar informativeness provides an opportunity to better understand listeners' sensitivity to distributional statistics of fine-grained acoustic dimensions defining speech categories.

In the experiments that follow, we have adopted methods and approaches recently applied to studies of perceptual cue weighting (Holt and Lotto, 2006) and Japanese geminates (Idemaru and Guion-Anderson, 2010) to investigate these issues. We explore the strength of perceptual cue weights (Experiment 1), investigating whether individual listeners' weights are relatively stable across time (Experiment 2) and whether they are resistant to perturbation (Experiment 3). Finally, we investigate whether the individual patterns of perceptual cue weights are related to the listeners' own speech production patterns (Experiment 4).

EXPERIMENT 1—PERCEPTION

In Experiment 1, listeners categorized synthesized Japanese words spanning from seta (with a singleton) to setta (with a geminate) along the dimensions of absolute and relative duration. Durational parameters were manipulated so that the absolute and relative durations varied from singleton to geminate values, allowing us to assess listeners' relative use of each dimension in categorizing the stops.

Methods

Participants

Thirty-five (19 females; ages, 21–35 yr, mean = 30 yr) native Japanese listeners participated for a small payment. Participants were born in various regions of Japan (with the largest group, N = 11, from Tokyo or its surrounding areas). All listeners resided in the U.S. at the time of testing. Length of residency in the U.S. ranged from 1 month to 9 yr (mean = 2 yr, 2 months). All listeners reported normal hearing. The data from two female participants were excluded from subsequent analyses due to substantial early exposure to a foreign language.2 In addition, a technical problem occurred while testing one of the participants, and he could not complete the experiment; his data were excluded from the analysis.

Stimuli

The experiment used Japanese words seta and setta (Idemaru and Guion-Anderson, 2010; Idemaru and Guion, 2008) and methods from auditory category-learning experiments (Holt and Lotto, 2006). The stimulus space was defined by absolute duration (the duration of stop closure) and relative duration of stop closure (the durational ratio of stop closure to the previous CV syllable, [se]) to investigate the effect of these dimensions in categorization [Fig. 1a]. Mora (instead of syllable) is a term more consistent with the Japanese phonology (Vance, 1987). However, the term syllable will be used here for convenience and simplicity.

One of the endpoints, setta, is a lexical item in Japanese meaning “hurried,” whereas the other, seta, is a non-lexical item. Although endpoint tokens mismatched on their lexical status are known to introduce a lexical bias (Ganong, 1980) on speech categorization, Idemaru and Guion-Anderson (2010) used the same stimulus pair with native Japanese listeners and reported a very small bias (about 10%) in the direction of the non-lexical item. The acoustic structure of these tokens provides for clean acoustic analyses and segmentation and has the benefit of being directly relatable to the previous research of Idemaru and Guion-Anderson (2010). Therefore this pair was selected for the perceptual targets.3

In a strict sense, absolute duration and relative duration defined here are not independent. Absolute duration is simply duration of stop closure, whereas relative duration is the duration of the stop closure relative to a context duration (here defined as the previous syllable duration). Thus relative duration likewise depends upon stop closure duration. The claim that has been made previously is that relative perception of this duration with respect to the rate of adjacent speech provides an acoustic correlate more robust to variability in speaking rate (Kohler, 1979; Port and Dalby, 1982; Pickett et al., 1999; Pind, 1999). Reliance on relative versus absolute acoustic cues, although not independent, has been proposed as two distinct perceptual strategies.

The landmarks for measuring component durations are illustrated in Fig. 1b. Absolute duration (stop closure duration) took five values from 50 to 250 ms, spaced 50 ms apart. The endpoints, 50 and 250 ms, spanned exaggerated values of stop closure duration outside those of the typical Japanese voiceless singleton and geminate stops (mean singleton = 78 ms, mean geminate = 225 ms, in Idemaru and Guion-Anderson, 2010). Relative duration (the durational ratio of stop closure to previous syllable) took nine values from 0.20 to 1.40, spaced 0.15 apart. These ratio values also exaggerated typical Japanese singleton and geminate stop values (mean singleton = 0.42, mean geminate = 1.08, in Idemaru and Guion-Anderson, 2010). The dots in Fig. 1a illustrate the 45 stimuli defined across absolute duration and relative duration.

The stimuli were synthesized using klattworks (McMurray, 2000). The two-dimensional (2-d) acoustic space [Fig. 1a] determined the durations for [se] (previous syllable) and [t] (stop). The [a] duration was determined by the stop-to-vowel durational ratio (2.00) reported by Idemaru and Guion-Anderson (2010) as a value unbiased either for singleton or geminate. The duration of [s] within [se] was determined to be 68% of the [se] duration, and the duration of [e] to be 32% of the [se] duration based on the production data reported by Idemaru and Guion-Anderson (2010).
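The way the two-dimensional design fixes each segment duration can be illustrated with a short sketch. It is not the authors' synthesis script: the function and constant names are hypothetical, and it assumes that the stop-to-vowel ratio of 2.00 means closure duration divided by [a] duration. All numeric values come from the description above.

```python
# Illustrative sketch only; names are hypothetical, values are from the text.
CLOSURE_MS = [50 + 50 * i for i in range(5)]            # 50, 100, ..., 250 ms
RATIOS = [round(0.20 + 0.15 * j, 2) for j in range(9)]  # 0.20, 0.35, ..., 1.40

S_SHARE, E_SHARE = 0.68, 0.32  # split of the [se] duration between [s] and [e]
STOP_TO_VOWEL = 2.00           # ASSUMPTION: closure duration / [a] duration


def segment_durations(closure_ms, ratio):
    """Return segment durations (ms) for one stimulus.

    Relative duration is closure / [se], so the [se] duration follows
    from the closure duration and the target ratio.
    """
    se_ms = closure_ms / ratio
    return {
        "s": S_SHARE * se_ms,
        "e": E_SHARE * se_ms,
        "t": float(closure_ms),
        "a": closure_ms / STOP_TO_VOWEL,
    }


# One entry per point in the 5 x 9 stimulus space.
stimuli = [segment_durations(c, r) for c in CLOSURE_MS for r in RATIOS]
```

Under these assumptions, a short closure paired with a large ratio yields an [e] well under 40 ms, which is the situation in which the vowel transitions described below had to be shortened.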

The frication noise for [s] was synthesized using parameter values proposed by Klatt (1979). The F1 through F6 frequencies were 320, 1390, 2530, 3250, 3700, and 4900 Hz with the parallel tract amplitude (A1–A6) set as zero for the first five formants and 52 dB for F6. Amplitude of frication noise (AF) was set as 70 dB for the duration of the [s].

To synthesize the vowels [e] and [a], the steady state F1, F2, and F3 frequencies were taken from the acoustic study of Japanese vowels by Keating and Huffman (1984). In each stimulus, the F1 and F2 frequencies varied across the first 20 ms, rising from 276 to 476 Hz and 1515 to 1715 Hz for [e], respectively. For [a], F1 increased from 432 to 632 Hz and F2 decreased from 1663 to 1374 Hz, characteristic of vowels following [t]. This formant transition was determined using the locus equation of Sussman, McCaffrey, and Matthew (1991). The F3 frequencies, 2500 for [e] and 2383 for [a], were steady-state across the vowel. Amplitude was 40 dB at the onset of [e], then increased linearly to 60 dB across the first 20 ms of [e] and decreased to 40 dB in the last 20 ms of the [e]. Amplitude then transitioned to 0 dB where it remained for the duration of the stop, after which it increased linearly to 60 dB across the first 20 ms of [a] and decreased to 40 dB in the last 20 ms of the [a]. It was not possible to maintain these transitions for vowels with durations less than 40 ms. For these vowels, duration of the transitions was shortened (e.g., 10 ms) and the duration of the steady state was also shortened. Fundamental frequency (F0) was 160 Hz for [e] and 100 Hz for [a] within the typical range of male values (Idemaru and Guion, 2008). Amplitude and F0 correlate with Japanese stop length production (Idemaru and Guion, 2008); however, ambiguous values were chosen so that there was no acoustic bias. A 10-ms stop burst was excised from a natural production of seta by a male native Japanese speaker and was inserted before [a].

Procedure

Seated in individual sound-attenuated booths and wearing headphones (Beyer DT-150), listeners categorized 20 repetitions of each of the 45 stimuli (900 trials) by pressing response buttons labeled “seta” and “setta” in Japanese orthography. Stimulus presentation and response collection were under the control of e-prime (Psychology Software Tools, Inc.).

Statistical analysis

Although logistic regression analysis has been proposed for analyzing speech perception response data (Nearey, 1990; Benkí, 2001; Morrison, 2007) and provides a statistically rigorous and promising method, the approach is ruled out here by the fact that our data exhibit within-subject correlation and response asymptotes other than zero and 100. Therefore local polynomial nonparametric regression (LPNR; Loader, 1999), a standard statistical tool applied to data that does not conform to a known parametric shape, was used.

Nonparametric regression uses techniques to fit a smoothed curve to the data scatter plot. Unlike logistic regression, nonparametric regression does not assume the shape of the regression line. Rather, it derives the shape from the data. In the case of LPNR, instead of attempting to fit the curve to the data points all at once, a small window of analysis is applied across the independent variable(s) obtaining a local regression fit for the corresponding local dependent values. LPNR further uses kernel density estimation, a smoothing technique, so that local averaging is done with weighting such that observations closer to the center of the analysis window are weighted more. An important advantage of this smoothing technique is that a large number of observations (this could be the entire set of observations) is used to make a prediction of the dependent variable. However, this is done so that the observations closer to the center of the analysis window contribute more to the prediction.

To understand the application of this statistical technique to perceptual cue weighting in speech categorization, it is useful to consider how the categorization data fall within an acoustic space (Fig. 1). For each of the points marking a stimulus in the 2-d acoustic space, there is a percentage of geminate responses for each listener that is thought of as coming out of the plane of the plot toward the reader in the “z axis,” thus forming a 3-d data scatter plot. Here, a Gaussian kernel was applied around each x-y coordinate (where x and y were absolute duration and relative duration, respectively) in the data scatter plot. The outcome values (percentage geminate responses) associated with all the x-y observations within the analysis window were averaged with kernel smoothing, producing a fitted value of the outcome. The analysis window was moved across the x and y dimensions obtaining the locally fitted values across the entire acoustic stimulus space defined by the ranges of x and y dimensions. The entire set of observations was used in this case to make a prediction regarding the dependent variable, percent geminate responses. Furthermore, this was done so that the observations closer to the center contributed more to the prediction. Technically, this process is repeated for a range of variances (bandwidths) of the kernel, and the one with the best cross-validation score is used.4
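The core idea of Gaussian-weighted local averaging with a cross-validated bandwidth can be sketched in a few lines. This is a deliberately simplified stand-in, not the LPNR procedure used in the study: it fits a local constant (a kernel-weighted mean) rather than a local polynomial, it uses leave-one-out squared error as the cross-validation score, and it assumes both stimulus dimensions have been rescaled to a common range so that a single bandwidth applies. All function names are hypothetical.

```python
import math


def gaussian_weight(p, q, h):
    """Gaussian kernel weight between 2-d points p and q for bandwidth h."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-0.5 * d2 / (h * h))


def smooth_at(query, points, values, h, skip=None):
    """Kernel-weighted mean of `values` at `query`.

    Every observation contributes, but those nearer the query count more.
    `skip` optionally leaves one observation out (used for cross-validation).
    """
    num = den = 0.0
    for i, (p, v) in enumerate(zip(points, values)):
        if i == skip:
            continue
        w = gaussian_weight(query, p, h)
        num += w * v
        den += w
    return num / den


def loo_cv_error(points, values, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    errs = [
        (v - smooth_at(p, points, values, h, skip=i)) ** 2
        for i, (p, v) in enumerate(zip(points, values))
    ]
    return sum(errs) / len(errs)


def select_bandwidth(points, values, candidates):
    """Pick the candidate bandwidth with the lowest cross-validation error."""
    return min(candidates, key=lambda h: loo_cv_error(points, values, h))
```

Evaluating `smooth_at` over a dense grid of query points with the selected bandwidth gives a smoothed response surface of the kind described above.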

Thus the resulting outcome was a predicted percent geminate response across the entire stimulus space for each listener, which allowed us to estimate the perceptual geminate-singleton category space in relation to the acoustic space defined by relative and absolute duration for each listener. We defined the geminate area as a region in the perceptual space where geminate responses were greater than 80% and the singleton area as an area in which geminate responses were fewer than 20%. The centers of geminate and singleton categories were then defined as the centroid of these areas, i.e., for both absolute and relative duration, the mean duration for all test points in the region was calculated. The line connecting the geminate category center to the singleton category center for each listener describes the positional relationship between the category centers within the stimulus space. The angle of this line (where a horizontal line pointing to the right is zero and angles clockwise from this origin are negative and counterclockwise angles are positive) provides a single value (angle) reflecting the relative weights that each listener placed on absolute duration and relative duration in perception of the geminate and singleton sounds.
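The category centers and the angle index can be illustrated as follows. The sketch rests on assumptions not stated explicitly in the text: each point is a (relative duration, absolute duration) pair with relative duration on the horizontal axis, which is the orientation under which −180 corresponds to reliance on relative duration and −90 to reliance on absolute duration; both dimensions have been rescaled to a common range; and positive angles are wrapped by subtracting 360 so that values below −180 can occur. Function names are hypothetical.

```python
import math


def centroid(points):
    """Mean coordinate of a non-empty list of 2-d points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def category_centers(points, pct_geminate, hi=80.0, lo=20.0):
    """Centers of the geminate (> hi %) and singleton (< lo %) regions."""
    gem = [p for p, z in zip(points, pct_geminate) if z > hi]
    sing = [p for p, z in zip(points, pct_geminate) if z < lo]
    return centroid(gem), centroid(sing)


def cue_angle(gem_center, sing_center):
    """Angle (degrees) of the line from the geminate to the singleton center.

    Zero points right; clockwise angles are negative. ASSUMPTION: positive
    results are wrapped by -360 so that the index stays non-positive.
    """
    dx = sing_center[0] - gem_center[0]
    dy = sing_center[1] - gem_center[1]
    deg = math.degrees(math.atan2(dy, dx))
    return deg - 360.0 if deg > 0 else deg
```

For a hypothetical listener whose responses depend only on relative duration, the two centers differ only horizontally and the index is −180; for one who responds only to absolute duration, it is −90.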

There have been a few other methods for computing the influence of multiple acoustic dimensions on speech perception (Escudero and Boersma, 2004; Holt and Lotto, 2006). Our preliminary analysis found strong, statistically significant correlations among the cue weight indices provided by LPNR and the methods proposed by Escudero and Boersma (2004) and Holt and Lotto (2006) (Pearson's r > 0.9), indicating that LPNR, as well as the other two methods, captures some features of perceptual cue weighting in speech categorization.

Results

As discussed earlier, LPNR provides predicted percent geminate responses for the entire stimulus space for each listener. Figure 2 shows obtained predicted percent geminate responses for two listeners [Figs. 2a, 2b] as well as predicted pattern summarized for all listeners [Fig. 2c]. As the legend [Fig. 2d] indicates, darker areas show more singleton responses and lighter areas indicate more geminate responses.

Figure 2. Categorization responses for two individual listeners [(a) = Listener 1; (b) = Listener 2], and categorization responses summarized for all listeners (c). The white area indicates the predicted geminate category area, and the black area is the singleton category area. The line connects the center of geminate category and the center of singleton category. The last panel (d) shows the mapping between the gray-scale and percent geminate response.

As exemplified by Fig. 2, visual inspection of the predicted percent geminate response patterns revealed evidence of both consistent patterns and strong individual differences in perceptual weighting. For example, the two listeners in Fig. 2 perceptually divided the acoustic space for geminate and singleton categories in very different ways. Listener 1 [Fig. 2a] categorized geminate and singleton stops primarily on the dimension of relative duration, whereas Listener 2 [Fig. 2b] categorized the two sounds primarily on the dimension of absolute duration. The lines (angle) in the figure connect the center of their geminate category and the center of singleton category, providing an intuitive means of understanding which acoustic dimension most affected speech categorization. An examination of angle lines of individual listeners in Fig. 2c shows that although the location of the singleton center varied substantially, the location of the geminate center was relatively consistent across listeners. This may be due to the lexical status of the word including the geminate.

Figure 3a shows the distribution of the angle values (ranging from −78 to −201), with Fig. 3b illustrating the meaning of angle values. There is a small peak in the frequency distribution around −170 and a larger peak around −140. The rest were scattered between −70 and −120. This suggests that some listeners primarily used relative duration (those whose angle values were around −170), some primarily used absolute duration (those whose angle values were between −70 and −110), and yet others used both dimensions fairly equally (those whose angle values were around −140).

Figure 3. (a) Distribution of angle values. The dotted lines separate values that indicate primary use of relative duration (from −190 to −160), mixed use of the two durations (from −150 to −120), and primary use of absolute duration (from −110 to −70). (b) Assignment of angle values. The angle value of −180 indicates that singleton and geminate stops were categorized along the dimension of relative duration, whereas the angle −90 indicates categorization along the dimension of absolute duration.

Discussion

In categorizing Japanese singleton and geminate stops, native Japanese listeners showed considerable variability in their use of absolute versus relative duration. Some listeners relied primarily on relative duration, others used mostly absolute duration, and yet others used the two dimensions fairly equally. The results here demonstrated that a slight advantage in informativeness for relative duration did not translate into this dimension being weighted more heavily across the board. Cue weighting, thus, is not dictated solely by informativeness. The present results can be interpreted in the context of another study of Japanese geminate and singleton stops. In Idemaru and Guion-Anderson (2010), participants categorized stimuli for which context duration varied while stop duration was constant and perceptually ambiguous. In other words, relative duration varied across values typical of singletons and geminates, whereas absolute duration remained perceptually ambiguous. In that study, all listeners used relative duration: Perception of stop length changed between singleton and geminate as a function of the context duration preceding the stop. Thus when relative duration is the only reliable acoustic information, Japanese listeners can use it for categorization. The results of that study thus mirror previous research demonstrating listeners' perceptual reliance on relative durations (Kohler, 1979; Port and Dalby, 1982; Pind, 1986, 1999; Pickett et al., 1999). However, the present results demonstrate that when both relative duration and absolute duration information are available, listeners show large individual differences in perception. It is perhaps unsurprising that both relative duration and absolute duration are used by Japanese listeners with similar frequency, given the parity in informativeness of the two dimensions (87% accurate prediction by absolute duration, and 93% by relative duration, Idemaru and Guion-Anderson, 2010). These results suggest that the small difference in informativeness of relative duration and absolute duration is not highly significant for perception. The slightly better informativeness of relative duration (Hirata and Whiton, 2005; Idemaru and Guion-Anderson, 2010) did not translate into the across-the-board primacy of relative duration in perception.

It has been widely assumed that due to the acoustic variability introduced by different speaking rates, relative duration is better than absolute duration for sound categorization (e.g., Pind, 1986). However, this expectation must be conditioned by the extent to which absolute duration is undermined by increased variability in the critical segmental duration for a particular language. We have demonstrated that in the case of the Japanese stop length contrast, in which absolute duration approximates the informativeness of relative duration, the perceptual role of absolute duration does not diminish among many listeners.

Furthermore, the finer-grain LPNR analyses employed here demonstrate that there were listeners all across the spectrum with some listeners primarily relying on relative duration, others using mostly absolute duration, and yet others using the two dimensions fairly equally. It is important to note that if only the group data were considered, it would be concluded that relative and absolute durations are weighted almost equivalently (mean angle = −143.5, indicating nearly the mid-point between strong reliance on relative duration and strong reliance on absolute duration), failing to expose the extensive individual differences in listeners' relative perceptual weighting of absolute and relative duration. The current findings stress the importance of examining perceptual patterns at the individual level.

The informativeness of both absolute and relative duration across rate variability in Japanese (Idemaru and Guion-Anderson, 2010), and the extensive individual differences we observe in listeners' reliance on the two sources of information for singleton and geminate stop categorization provide an opportunity to investigate the stability of the perceptual weight listeners give to acoustic cues. The fact that many listeners used both relative and absolute duration with differential weight in the current study may simply reflect promiscuous, unsystematic use of the two dimensions. The parity in informativeness may allow listeners to switch readily between the two in perception without strong stable individual patterns across time. Or, perhaps because of long-term regularities in their own speech production or listening experience, listeners may exhibit relatively more stable individual differences across time. Experiment 2 re-tested some of the Experiment 1 participants to investigate this issue.

EXPERIMENT 2—PERCEPTUAL STABILITY

Of the 35 listeners of Experiment 1, 23 returned for the second test. At least a 3-wk interval (mean = 59 days, range = 27–140 days) separated the two testing sessions. The experimental stimuli and procedure were identical to Experiment 1.

Results and discussion

LPNR was applied to the geminate responses. The angle values characterizing singleton versus geminate perceptual cue weights were calculated. To examine whether the listeners' perceptual cue weight was consistent between the initial test and the retest, the angle values from Experiments 1 and 2 were examined for their relationship. If perceptual cue weight is relatively consistent across time, we would expect a positive correlation between the angle values of Experiments 1 and 2, whereas if differential weighting of absolute and relative duration in Experiment 1 reflects unsystematic use of the two sources of information, there should be no relationship. In fact, there was a strong and statistically significant correlation between the two sets of angle values, r = 0.69, P < 0.001 (Fig. 4). There was one listener (indicated by a * symbol in Fig. 4) who showed substantially different angle values across the two tests (−178 in Experiment 1 and −69 in Experiment 2), showing strong reliance on relative duration in Experiment 1 and strong reliance on absolute duration in Experiment 2. When this listener was excluded from analyses, the correlation improved, r = 0.88, P < 0.001. These results indicate that most listeners make consistent use of absolute and relative information in categorizing Japanese singleton and geminate stops.
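The test-retest comparison described above can be illustrated with a minimal sketch. The angle values below are invented for illustration and are not the study's data; the helper `pearson_r` is a hypothetical name, and treating the angles as ordinary linear values is an assumption made here for simplicity. The sketch computes the correlation between two sessions and then recomputes it after excluding one listener whose weighting reversed.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical angle values in degrees (illustration only, NOT the study's data).
# Index 0 mimics a listener who switched from one dimension to the other.
angles_exp1 = [-178.0, -170.0, -160.0, -150.0, -140.0, -120.0, -100.0, -80.0]
angles_exp2 = [-69.0, -165.0, -158.0, -145.0, -150.0, -115.0, -95.0, -85.0]

# Correlation across all listeners.
r_all = pearson_r(angles_exp1, angles_exp2)

# Correlation after excluding the listener at index 0.
keep = [i for i in range(len(angles_exp1)) if i != 0]
r_trimmed = pearson_r([angles_exp1[i] for i in keep],
                      [angles_exp2[i] for i in keep])

print(f"r (all listeners) = {r_all:.2f}")
print(f"r (one listener excluded) = {r_trimmed:.2f}")
```

With data of this shape, a single listener who reverses weighting lowers the overall correlation noticeably, which is the same qualitative pattern as the change from r = 0.69 to r = 0.88 reported above.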

Fig. 4. Scatter plot showing the angle values of each listener obtained in Experiments 1 and 2. The starred data point indicates a listener who switched cue weighting across the two experiments.

To ensure that the response pattern did not emerge as a result of learning during the perceptual task, geminate responses to the first presentation of the 49 stimuli were correlated with relative duration and absolute duration in the stimuli (Holt and Lotto, 2006). The relative correlation coefficients of the two dimensions based on the very first presentation of the stimuli were highly consistent with the overall response pattern (21 of 23 listeners showed a consistent preference for the relative or absolute dimension).
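One way to read the first-presentation check is sketched below. It assumes that "relative correlation coefficients" means the absolute correlation of the responses with each dimension, normalized so that the two values sum to one; this normalization, the 7 x 7 stimulus grid, the step indices standing in for durations, and the simulated listener are all assumptions made for illustration, not details confirmed by the text.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def relative_cue_weights(responses, relative_dur, absolute_dur):
    """Normalize |r| for each dimension so the two weights sum to 1 (assumed scheme)."""
    r_rel = abs(pearson_r(responses, relative_dur))
    r_abs = abs(pearson_r(responses, absolute_dur))
    total = r_rel + r_abs
    return r_rel / total, r_abs / total

# Hypothetical 7 x 7 grid (49 stimuli); step indices stand in for durations.
steps = range(1, 8)
relative_dur = [rel for rel in steps for _ in steps]
absolute_dur = [ab for _ in steps for ab in steps]

# Simulated listener (illustration only): responds "geminate" (1) using both
# dimensions, but with relative duration counting twice as much.
responses = [1 if 2 * rel + ab > 12 else 0
             for rel, ab in zip(relative_dur, absolute_dur)]

w_rel, w_abs = relative_cue_weights(responses, relative_dur, absolute_dur)
print(f"relative-duration weight = {w_rel:.2f}, absolute-duration weight = {w_abs:.2f}")
```

Under this reading, a listener's preferred dimension on the first presentation is simply the one with the larger normalized weight, which can then be compared with the preference shown in the full response set.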

In Japanese, relative and absolute durations are similarly informative of singleton and geminate category membership across rate variability (Idemaru and Guion-Anderson, 2010), presenting a situation in which listeners potentially could categorize with high accuracy using either dimension or using the dimensions unsystematically. Experiment 1 evidenced considerable individual differences in listeners' relative reliance on the two dimensions. Experiment 2 demonstrated that these individual differences observed in the categorization task were stable across time. Experiment 3 investigated this further in a new task by examining the extent to which these perceptual patterns were resistant to short-term perturbation.

EXPERIMENT 3—RESISTANCE TO PERTURBATION

Holt and Lotto (2006) conducted a series of experiments in which listeners learned to categorize non-speech sounds varying in a 2-d acoustic space. One of the experiments demonstrated that passive exposure to variability along an acoustic dimension led listeners to make greater use of this dimension in a later categorization task than another group of listeners without such pre-exposure. Thus exposure to variability across an acoustic dimension was sufficient to shift listeners' perceptual cue weights toward the dimension.

We exploited this method to examine whether Japanese listeners' pattern of cue weighting was resistant to perturbation. If the individual patterns of perceptual cue weighting are robust, they may remain stable after exposure to variability across the less-preferred acoustic dimension. If the patterns are flexible, exposure to acoustic variability across the less-preferred dimension will increase the perceptual weight of the less-preferred acoustic dimension in subsequent categorization responses.

In this experiment, Japanese listeners who weighted absolute duration more in Experiments 1 and 2 were exposed to stimuli varying from seta to setta only in the relative duration (their less-preferred dimension) prior to a categorization test; those who weighted relative duration more in Experiments 1 and 2 were exposed to stimuli varying only in the absolute duration (their less-preferred dimension) prior to the test. The value of the other dimension, the dimension that the listeners relied on more in Experiments 1 and 2, was held constant in the stimuli at a value acoustically ambiguous for category membership.