Adaptive bandwidth measurements of importance functions for speech intelligibility prediction

Nathaniel A Whitmal, III; Kristina DeRoy

doi:10.1121/1.3641453

2011 Dec;130(6):4032–4043. doi: 10.1121/1.3641453


PMCID: PMC3253602 PMID: 22225057

Abstract

The Articulation Index (AI) and Speech Intelligibility Index (SII) predict intelligibility scores from measurements of speech and hearing parameters. One component in the prediction is the “importance function,” a weighting function that characterizes contributions of particular spectral regions of speech to speech intelligibility. Previous work with SII predictions for hearing-impaired subjects suggests that prediction accuracy might improve if importance functions for individual subjects were available. Unfortunately, previous importance function measurements have required extensive intelligibility testing with groups of subjects, using speech processed by various fixed-bandwidth low-pass and high-pass filters. A more efficient approach appropriate to individual subjects is desired. The purpose of this study was to evaluate the feasibility of measuring importance functions for individual subjects with adaptive-bandwidth filters. In two experiments, ten subjects with normal hearing listened to vowel-consonant-vowel (VCV) nonsense words processed by low-pass and high-pass filters whose bandwidths were varied adaptively to produce specified performance levels in accordance with the transformed up-down rules of Levitt [(1971). J. Acoust. Soc. Am. 49, 467–477]. Local linear psychometric functions were fit to resulting data and used to generate an importance function for VCV words. Results indicate that the adaptive method is reliable and efficient, and produces importance function data consistent with that of the corresponding AI/SII importance function.

INTRODUCTION

Theoretical models of speech perception can provide insight for researchers and clinicians concerned with the consequences of hearing impairment on speech intelligibility. Two widely used models are the Articulation Index, or “AI” (ANSI, 1969) and its successor, the Speech Intelligibility Index, or “SII” (ANSI, 1997). The underlying principle of the AI is that disjoint frequency regions make independent and additive contributions to intelligibility, as expressed in the equation

AI = \int_{0}^{\infty} I(f)\, W(f)\, df. \qquad (1)

Here, W(f) is a measure of the audible speech peak energy at frequency f, and I(f) is a weighting function denoting the “importance” of frequency f. Values of W(f) and I(f) range between 0 and 1 (no contribution and maximum contribution to intelligibility, respectively). Measuring I(f) has typically required time-intensive intelligibility testing with groups of subjects. The present work focuses on efficient measurement of importance functions (IFs) for modeling intelligibility in individuals with either normal or impaired hearing.
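
In practice the integral in Eq. 1 is evaluated as a sum over a finite set of frequency bands. The following is a minimal sketch of that discrete form, assuming hypothetical band importances that sum to one and band audibility values between 0 and 1; the function name and the four-band numbers are illustrative only and are not taken from the ANSI standards.

```python
def articulation_index(importance, audibility):
    """Discrete form of Eq. 1: sum of I_k * W_k over frequency bands."""
    if len(importance) != len(audibility):
        raise ValueError("need one audibility value per band")
    if any(not 0.0 <= w <= 1.0 for w in audibility):
        raise ValueError("audibility values must lie in [0, 1]")
    return sum(i * w for i, w in zip(importance, audibility))


# Hypothetical four-band example (illustrative values, not ANSI data).
band_importance = [0.1, 0.3, 0.4, 0.2]  # assumed to sum to 1
band_audibility = [1.0, 1.0, 0.5, 0.0]  # upper bands partly or fully inaudible
ai = articulation_index(band_importance, band_audibility)
```

With these assumed values, the two fully audible bands contribute their whole importance, the half-audible band contributes half of its importance, and the inaudible band contributes nothing.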

Early AI applications (Kryter, 1962; ANSI, 1969) used the Bell Telephone Laboratories IFs, which were derived from testing with nonsense syllable stimuli that were either high-pass or low-pass filtered with a range of cutoff frequencies. Covariations in intelligibility and filter cutoff-frequency were converted to IFs, using either graphical curve bisection methods (French and Steinberg, 1947) or differential calculus (Fletcher and Galt, 1950). In both cases, the IF was deemed applicable to all types of test materials (e.g., word or sentence lists) and all subjects with presumably normal hearing. The AI values computed with these IFs were then input to empirically-derived “transfer functions” (TFs) that converted them into intelligibility score predictions for other types of test materials.

Later work (Pavlovic, 1984; Kamm et al., 1985; Pavlovic et al., 1986) showed that AI predictions for hearing-impaired subjects were inaccurate. The SII (ANSI, 1997) attempted to correct these inaccuracies through use of test-specific IFs that were multiplied by “speech desensitization factors” ranging in value from 0 to 1 to account for individual hearing losses (Pavlovic et al., 1986; Studebaker and Sherbecoe, 1993; Pavlovic, 1994). Ching et al. (1998) showed that remaining inaccuracies in the SII’s predictions for hearing-impaired subjects could be reduced by augmenting the individual terms of Eq. 1 with subject-specific correction factors. Algebraically, these approaches are similar to using subject-specific IFs, even though all computations used the ANSI IFs and no intention to use subject-specific IFs was reported.

One likely reason that Ching et al. (1998) did not try to measure subject-specific IFs is that IF measurement can be time-consuming and tedious, as has been shown by investigators measuring IFs and TFs for non-standard materials. The measurements of Henry et al. (1998) for CVC word stimuli required 12 subjects to listen to 50-word lists for each combination of sixteen filter settings and four signal-to-noise ratios (SNRs). Eisenberg et al. (1998) required data from 20 subjects listening to 24-sentence lists in seven filter conditions simply to confirm that existing IFs could be used with HINT sentences (Nilsson et al., 1994). Derivation of a HINT sentence transfer function required data from ten extra subjects listening to 25-sentence lists at six SNRs. Wong et al. (2007) required two stages of testing to obtain an IF and TF for Cantonese HINT sentences (Wong and Soli, 2005): a pilot study with six subjects who listened to 13 filter/SNR conditions, and a follow-up study with 78 subjects who listened to random selections of 115 different filter/SNR conditions. Further extension of the SII to new test materials will require similar measurements. A more efficient approach would encourage investigators to extend the SII to new materials and, if used with individual subjects as done indirectly by Ching et al. (1998), could improve the accuracy of the SII. Such improvements would also support short-time adaptations of the SII algorithm (Rhebergen and Versfeld, 2005; Rhebergen et al., 2006), which improve SII predictions for speech masked by fluctuating noises. Currently, these extensions (which use ANSI IFs) work well for normal-hearing listeners (Rhebergen et al., 2008) but make inaccurate predictions for individual hearing-impaired listeners (Rhebergen et al., 2010). Other short-time methods using importance functions built from instantaneous signal values (Ma et al., 2009) have also been shown to improve the accuracy of speech quality and intelligibility prediction measures.

The inefficiency of IF measurement is common to psychophysical procedures that use a constant-stimulus paradigm. One possible alternative is an adaptive psychophysical procedure that uses information from previous responses to steer subject responses towards desired performance levels, rather than levels far above or far below them, which contain superfluous information (Treutwein, 1995; Leek et al., 2001). In principle, an adaptive procedure that varied the bandwidth of speech to target specific intelligibility performance levels could efficiently measure psychometric functions relating intelligibility to filter bandwidth. These psychometric functions could in turn be used to produce the IF using the methods of French and Steinberg (1947) or Fletcher and Galt (1950).

Noordhoek et al. (1999) demonstrated the utility of adaptive bandwidth measurements for constructing psychometric functions. Their test subjects heard Dutch sentence recordings (Plomp and Mimpen, 1979) processed by a 1 kHz bandpass filter with bandwidth varied adaptively in 0.45-octave steps to produce 50%-correct sentence recognition. The resulting bandwidth was called the speech reception bandwidth threshold, or “SRBT.” The percent-correct data for each subject’s sentences were then plotted versus the sentences’ “relative bandwidths” (the bandwidth at which the sentence was presented, divided by the subject’s SRBT) to produce a psychometric curve relating bandwidth to intelligibility. This result, while encouraging, was not explored further by those authors. Subsequent applications of SRBT measurements focused on suprathreshold deficits in hearing-impaired listeners (Noordhoek et al., 2000, 2001; van Schijndel et al., 2001a, 2001b) or perceptual integration of speech bands (Hall et al., 2008), rather than psychometric functions. Moreover, the authors only targeted the 50%-correct performance level with their stimuli; as a result, their results contain little reliable data for accurately determining importance at the ends of their usable frequency range.

The purpose of the present study is to see whether an adaptive-bandwidth approach would be capable of measuring IFs in individual subjects. Two experiments were conducted. Experiment 1 measured recognition of vowel-consonant-vowel (VCV) words processed by low-pass filters (or, “LPFs”) whose cutoff frequencies were adapted using transformed up-down rules (Levitt, 1971) to produce five levels of performance. Measured bandwidths for each level were consistent across subjects and showed high test-retest reliability. The VCV recognition tests were repeated with non-adaptive filters using the average cutoff frequencies of the adaptive measurements. Statistical analyses comparing values and psychometric functions obtained in adaptive and fixed measurement modes show that adaptive-mode results are not significantly different from (and less variable than) fixed-mode results. Experiment 2 repeated the tasks of Experiment 1 with high-pass filters (or, “HPFs”). Those data revealed only minor differences between adaptive and fixed modes, with higher variability inherent to HPF speech perception. Comparisons of derived IF data with that of French and Steinberg (1947) support the viability of the proposed method.

EXPERIMENT 1 – IMPORTANCE MEASUREMENT WITH LOW-PASS FILTERS

Methods

Subjects

Seven subjects between the ages of 18 and 22 years (mean age: 20.55 years) participated in this experiment. All of the subjects were native speakers of English and passed a hearing screening for thresholds at or below 20 dB HL at 500, 1000, 2000, 4000, and 6000 Hz. Partial course credit and/or monetary compensation were provided to subjects in exchange for their participation.

Materials

Stimuli consisted of the 23 consonants /b d g p t k f θ v ð h s ∫ z ʒ t∫ ʤ m n w l j r/, recorded in /a/C/a/ format for a study by Whitmal et al. (2007). The consonants were spoken by a female talker with an American English dialect and digitally recorded in a sound-treated booth (IAC 1604, Bronx, NY) with 16-bit resolution at a 22050 Hz sampling rate.

Processing

Stimuli were processed by 2047th-order digital FIR LPFs produced by the fir1 command (version 1.15.4.4) implemented in MATLAB software (The Mathworks, Natick, MA, version 7.4.0.287). Filter cutoff frequencies ranged between 125 Hz and 8000 Hz and were either fixed at the beginning of a trial or varied adaptively as described below. Signal attenuation exceeded 80 dB for frequencies more than 200 Hz from the filter’s specified cutoff frequency. As in Noordhoek et al. (1999), the RMS level of the filtered waveform was adjusted to match the RMS level of the unfiltered waveform. (For large reductions in bandwidth, this operation results in a substantial gain increase.) The filtered syllables were output from a computer’s sound card (SigmaTel Digital Audio, Austin, TX) to a headphone amplifier (Behringer ProXL HA4700, Bothell, WA) driving a pair of Sennheiser HD580 circumaural headphones at 65 dB SPL (flat weighting) loaded by a free-standing flat-plate coupler (Bruel and Kjaer DB0843, Norcross, GA) secured with a coupling force of 2 N. The presentation level was calibrated daily using repeated loops of level-matched speech-spectrum noise, developed by creating a 110250-sample white-noise signal (i.e., 5 s at a 22050 Hz sampling rate), passing it through a 50th-order all-pole filter matched (via Levinson’s recursion) to the average autocorrelation function for the 23 VCV words, and scaling the filter’s output to the average RMS level of the VCV words. The noise was played by the COOL EDIT PRO software package (Syntrillium Software, Phoenix, AZ) through the signal chain and measured with a Class 1 sound level meter (Quest SoundPro SE/DL, Oconomowoc, WI) prior to the first testing session of each test day.
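
The RMS-matching step described above can be sketched as follows. This is a minimal illustration written in Python rather than the MATLAB used in the study; the function names are hypothetical, and the toy signals stand in for real filtered and unfiltered waveforms.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def match_rms(filtered, reference):
    """Scale `filtered` so that its RMS equals that of `reference`.

    Returns the scaled samples and the linear gain that was applied.
    """
    filtered_rms = rms(filtered)
    if filtered_rms == 0.0:
        raise ValueError("filtered signal is silent; gain is undefined")
    gain = rms(reference) / filtered_rms
    return [gain * x for x in filtered], gain


# Toy example: filtering is assumed to have removed most of the energy.
unfiltered = [1.0, -1.0, 1.0, -1.0]
filtered = [0.1, -0.1, 0.1, -0.1]
scaled, gain = match_rms(filtered, unfiltered)
gain_db = 20.0 * math.log10(gain)
```

In this toy case the filtered signal has one tenth of the original RMS, so the matching gain is about 20 dB, which illustrates why narrow bandwidths lead to a substantial gain increase.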

Procedures

Subjects were tested in a double-walled sound-treated booth (IAC 1604, Bronx, NY) in two separate test sessions. Each session lasted two to two-and-a-half hours, with breaks provided as needed. The filtered syllables were presented to subjects by custom MATLAB software running on a laptop computer inside the booth. After hearing each syllable, subjects selected the perceived VCV from a list of 23 candidates provided by the software’s visual interface (described in detail by Whitmal et al., 2007). Before experimental data were recorded, a practice set consisting of 23 unfiltered tokens (one presentation per individual token) was presented to each subject to familiarize the listener with the use of the interface. These data were not included in the results.

The first test session for each subject consisted of adaptive runs using transformed up-down rules (Levitt, 1971) to target performance levels of 15.9, 29.3, 50.0, 70.7, and 84.1 percent-correct syllable recognition. The filter cutoff frequency for each run was set initially at 1000 Hz and then varied (based on the subject’s responses) in accordance with one of the five response rules (see Table I). The step sizes for frequency changes were one octave for the first two reversals, 1/2 octave for the next two reversals, and 1/4 octave thereafter. It should be noted that the less-common “best-of-three” 50%-correct response rule used here was adopted following pilot trials in which the more common “1-up/1-down” rule produced its first four reversals before closely approaching the 50%-correct level.
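
The non-50% target levels follow from the equilibrium condition of the transformed up-down method: a track settles where the response sequence that triggers a change in one direction occurs with probability 0.5. The sketch below works through that arithmetic and the step-size schedule for a low-pass cutoff. The function names are hypothetical, and the reading that the step shrinks once two and then four reversals have occurred is an assumption about how the schedule was applied.

```python
def target_after_k_correct(k):
    """Level reached when bandwidth decreases after k consecutive
    correct responses: p**k = 0.5."""
    return 0.5 ** (1.0 / k)


def target_after_k_misses(k):
    """Level reached when bandwidth increases after k consecutive
    misses: (1 - p)**k = 0.5."""
    return 1.0 - 0.5 ** (1.0 / k)


def step_size_octaves(reversals_so_far):
    """Assumed schedule: 1 octave until two reversals have occurred,
    1/2 octave until four have occurred, 1/4 octave thereafter."""
    if reversals_so_far < 2:
        return 1.0
    if reversals_so_far < 4:
        return 0.5
    return 0.25


def next_cutoff(cutoff_hz, widen, reversals_so_far):
    """Raise (widen=True) or lower a low-pass cutoff by the current step."""
    step = step_size_octaves(reversals_so_far)
    return cutoff_hz * 2.0 ** (step if widen else -step)


# Target levels (in percent) for the four non-50% rules.
targets = [
    100 * target_after_k_misses(4),   # four consecutive misses
    100 * target_after_k_misses(2),   # two consecutive misses
    100 * target_after_k_correct(2),  # two consecutive correct guesses
    100 * target_after_k_correct(4),  # four consecutive correct guesses
]
```

Rounded to one decimal place, these values reproduce the 15.9, 29.3, 70.7, and 84.1 percent targets; the 50% target comes from the symmetric best-of-three rule.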

Table 1.

Adaptation strategies for performance targets (after Levitt, 1971).

Rule	Target level (%-correct)	Increase bandwidth after observing…	Decrease bandwidth after observing…
1	15.9	Four consecutive misses	Three or fewer misses followed by a correct guess
2	29.3	Two consecutive misses	One correct guess or one miss followed by a correct guess
3	50.0	Two misses in two or three consecutive trials	Two correct guesses in two or three consecutive trials
4	70.7	One miss or one correct guess followed by a miss	Two consecutive correct guesses
5	84.1	Three or fewer correct guesses followed by a miss	Four consecutive correct guesses


Two runs were conducted for each performance level, resulting in ten runs per session (i.e., two runs for each of five target percent-correct scores). All five target performance levels were presented once in random order, and then presented again in the same order. This was done so that subjects would not be presented with the same percent-correct target in two consecutive runs, and so that no particular rule would benefit disproportionately from learning effects. Previous studies (Leek, 2001) have addressed concerns about learning by interleaving adaptive tracks; nonetheless, results (shown below) indicate that the simple ordering used here was effective. Each run presented 115 tokens (i.e., five repetitions of each VCV), for a total of 1150 tokens per subject session.

In the second session, subjects performed the same task for VCVs filtered with constant LPF cutoff frequencies. The constant cutoff frequencies for each subject were calculated by averaging that subject’s cutoff frequencies for all reversals from the fifth reversal onward for each adaptive rule’s two runs. As in the adaptive runs, two runs were completed for each target level, for a total of 1150 tokens per session. Two of the low-pass subjects were unable to return for the second session; hence, data from only five of the subjects are analyzed below.
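
The averaging used to set the fixed cutoff frequencies can be sketched as below. This is an illustration under stated assumptions: the text does not say whether reversals were pooled across the two runs or averaged per run first, nor whether the mean was taken in Hz or in log units, so the sketch pools the reversals and averages in Hz. The function name and the example values are hypothetical.

```python
def fixed_cutoff_hz(runs_reversal_cutoffs, first_reversal=5):
    """Mean cutoff (Hz) over all reversals from the `first_reversal`-th
    onward, pooled across the runs for one rule (assumed pooling)."""
    pooled = []
    for reversal_cutoffs in runs_reversal_cutoffs:
        # Reversals are counted from 1, so skip the first four entries.
        pooled.extend(reversal_cutoffs[first_reversal - 1:])
    if not pooled:
        raise ValueError("no run reached the required number of reversals")
    return sum(pooled) / len(pooled)


# Hypothetical reversal cutoffs (Hz) for the two runs of one rule.
run_1 = [2000.0, 1000.0, 1414.0, 1000.0, 1200.0, 1000.0]
run_2 = [2000.0, 1000.0, 1414.0, 1000.0, 1100.0]
cutoff = fixed_cutoff_hz([run_1, run_2])
```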

Results

Comparison of adaptive and fixed-mode scores

Percent-correct scores were computed for each 115-trial fixed-mode run by dividing the number of correct guesses by the total number of trials. For the adaptive-mode trials, where all runs began with a 1000 Hz cutoff frequency, only the trials from the fifth reversal onward were used. The minimum number of trials used was 86; the maximum number used was 108. This restriction removed bias from initial parts of each run where the cutoff frequency was either too low or too high to approximate the target performance level well. Percent-correct scores for both adaptive-mode and fixed-mode trials are shown in Fig. 1 as open and filled symbols (respectively); dashed lines are also shown depicting each target performance level.
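
The scoring rule for adaptive-mode runs can be sketched as follows. The sketch assumes that the trial on which the fifth reversal occurs is itself included in the score, which the text does not state; the function name and the toy data are hypothetical.

```python
def percent_correct_after_reversal(correct, reversal_trials, first_reversal=5):
    """Score only the trials from the `first_reversal`-th reversal onward.

    correct:         list of booleans, one per trial
    reversal_trials: 0-based trial indices at which reversals occurred
    Returns (percent correct, number of trials used).
    """
    if len(reversal_trials) < first_reversal:
        raise ValueError("run did not reach the required number of reversals")
    start = reversal_trials[first_reversal - 1]
    kept = correct[start:]
    return 100.0 * sum(kept) / len(kept), len(kept)


# Toy run of 20 trials whose fifth reversal falls on trial index 10.
responses = [False] * 10 + [True, False] * 5
score, n_used = percent_correct_after_reversal(responses, [1, 3, 5, 7, 10])
```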

(Color online) Recognition scores (in percent) for VCV syllables, low-pass filtered in either adaptive-bandwidth mode (open symbols) or fixed-bandwidth mode (filled symbols) with respect to filter cutoff frequency. Dashed lines indicate target performance levels specified in Table I.

Average percent-correct values and filter cutoff frequencies (computed as described above in the Methods section) are presented in Table II. Inspection of Fig. 1 and Table II shows that the adaptive trials approximate the target levels very well; the largest deviation from any target is only 2.28 percentage points. It should be noted that the percentages for rules 1, 2, 4, and 5 are biased away from the target values and toward the 50% level. This type of bias is consistent with theoretical predictions for reversal averages (Oron, 2007) and has been seen in simulations of adaptive trials using rules 4 and 5 (Kollmeier et al., 1988; Schlauch and Rose, 1990; Saberi and Green, 1996; García-Pérez, 1998; García-Pérez, 2001). Saberi and Green attributed such biases to inherent imbalances in the transformed up-down rules; however, the bias predicted using their approach for rule 5 is (unlike our data) further away from the median level. To test the significance of these differences, subject percent-correct scores were converted to rationalized arcsine units, or “rau” (Studebaker, 1985) and compared with their corresponding rau-converted target percentage levels in a series of Wilcoxon signed-ranks tests. Test statistics and significance levels (shown in Table II) indicate that only the differences for rules 1 and 5 are significant. Similar comparisons (also shown in Table II) were made for the fixed trials, whose scores are (on average) 3.58 percentage points higher than their adaptive counterparts, more variable, and (in all but one case) further away from the target values. This difference has also been noted in previous adaptive-mode/fixed-mode comparisons (Kollmeier et al., 1988). The largest adaptive/fixed difference of 8.86 percentage points was measured for the 70.7% target, where three of the subjects produced scores of 80%-correct or higher for the fixed-mode trials. These results notwithstanding, there is considerable overlap between scores for adaptive-mode and fixed-mode trials at each target level.
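
The rau conversion mentioned above can be sketched as follows, using the rationalized arcsine transform as commonly stated from Studebaker (1985). The function name is hypothetical, and how the study chose the trial count when converting the target percentages is not specified in the text.

```python
import math


def rau(num_correct, num_trials):
    """Rationalized arcsine units for `num_correct` out of `num_trials`.

    theta = asin(sqrt(X / (N + 1))) + asin(sqrt((X + 1) / (N + 1)))
    rau   = (146 / pi) * theta - 23
    """
    theta = (math.asin(math.sqrt(num_correct / (num_trials + 1)))
             + math.asin(math.sqrt((num_correct + 1) / (num_trials + 1))))
    return (146.0 / math.pi) * theta - 23.0


# Example: 50 correct responses out of 100 trials.
mid_score = rau(50, 100)
```

The transform is close to the percentage scale near the middle of the range and stretches the extremes, which stabilizes the variance of scores near 0% and 100%.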

Table 2.

Statistical measures for adaptive-mode and fixed-mode trials with low-pass filtered consonants in Experiment 1.

Target %	Bandwidth mean (kHz)	Bandwidth SD (kHz)	Adaptive mean %-correct	Adaptive SD	Adaptive T	Adaptive p	Fixed mean %-correct	Fixed SD	Fixed T	Fixed p
15.9	0.416	0.088	18.18	1.9	25.5	0.006	19.39	4.3	18.5	0.065
29.3	0.759	0.121	30.12	2.1	8.5	0.416	34.78	2.7	27.5	0.002
50.0	1.462	0.227	49.05	3.9	−8.5	0.436	51.57	5.9	8.5	0.416
70.7	2.259	0.404	69.05	1.6	−18.5	0.061	77.91	8.3	21.5	0.027
84.1	2.868	0.468	82.39	1.5	−25.5	0.006	83.04	4.8	−3.5	0.754


T: Wilcoxon signed-ranks statistic comparing measurement to target. p: false-alarm probability for T.

The rau-converted scores were also input to a repeated-measures mixed-model analysis of variance (ANOVA) of intelligibility scores, with subject identity assumed to be a random factor. Within-subject main factors for the ANOVA included the filtering mode (adaptive or fixed), target percent-correct level, and order (i.e., first or second presentation of a mode/level combination). Among these main factors, only level (F[4,16] = 478.04, p < 0.0001) was significant at the 5% level. Filtering mode was not significant; the interaction between mode and level was, however (F[4,60] = 5.03, p = 0.0015), presumably reflecting the 8.86 percentage-point difference observed for the 70.7%-correct level. No other interactions were significant.

Test-retest reliability

The lack of significance of presentation order in the previous ANOVA suggests both high test-retest reliability and the absence of a learning effect for the two filtering conditions. These prospects were further explored through paired t-test comparisons and Pearson product-moment correlations between intelligibility scores of first and second runs. Results for adaptive-mode trials showed no significant difference (t(24) = 0.637, p = 0.51) and high test-retest reliability (r(23)= 0.986, p < 0.0001). Similar results were observed for fixed-mode trials (t(24) = −0.753, p = 0.49; r(23) = 0.984, p < 0.0001).
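
The two test-retest statistics can be sketched in a few lines. This is a generic illustration, not the authors' analysis code; the function names and the example scores are hypothetical. With 25 first-run/second-run pairs (presumably five subjects by five target levels), the paired t statistic has 24 degrees of freedom and the correlation has 23, matching the values reported above.

```python
import math


def paired_t(first, second):
    """Paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical first-run and second-run scores (illustrative only).
run_1_scores = [1.0, 2.0, 3.0, 4.0]
run_2_scores = [0.0, 2.0, 2.0, 5.0]
t_stat, dof = paired_t(run_1_scores, run_2_scores)
r_value = pearson_r(run_1_scores, run_2_scores)
```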

Comparisons of cutoff frequency values measured on first and second adaptive runs were also conducted. Results showed no significant differences between paired cutoff frequency values (t(24) = 0.076, p = 0.94) with high test-retest reliability (r(23) = 0.936, p < 0.0001).

Comparative feature analysis

Feature analyses were conducted by determining the percentage of correct voicing, manner, and place-of-articulation consonant classifications produced in the subjects’ identification tasks. (Categories are shown in Table III of Whitmal et al., 2007.) Results are shown in Fig. 2. Among the three features, voicing was least sensitive to cutoff-frequency changes; 90%-correct classification was achieved for all cutoff frequencies above 758.7 Hz (i.e., the cutoff frequency for the 29.3%-correct level). Manner decreased gradually from 94.5%-correct to 80%-correct (on average) as cutoff frequency decreased from 2868.4 Hz to 758.7 Hz, and dropped sharply as cutoff frequency decreased to 416.4 Hz. The most common manner error was misidentification of non-strident fricatives and affricates as stops. This error mode decreased as bandwidth was increased and high-frequency spectral cues were restored. Changes in place (the most vulnerable feature) resembled changes in intelligibility. In nearly all conditions, feature reception accuracy for the fixed-mode trials was slightly higher (2.5 percentage points, on average) than reception accuracy for adaptive-mode trials; the largest differences of (approximately) eight to nine percentage points were observed for manner at 758.7 Hz, and for place at 2259.4 Hz.

Table 3.

Statistical measures for adaptive-mode and fixed-mode trials with high-pass filtered consonants in Experiment 2.

Target %	Bandwidth mean (kHz)	Bandwidth SD (kHz)	Adaptive mean %-correct	Adaptive SD	Adaptive T	Adaptive p	Fixed mean %-correct	Fixed SD	Fixed T	Fixed p
15.9	4.889	1.092	20.34	4.1	25.5	0.006	18.78	4.2	15.5	0.131
29.3	3.675	0.689	32.51	2.6	25.5	0.006	40.78	10.2	23.5	0.014
50.0	2.671	0.440	54.46	3.1	26.5	0.004	61.83	8.6	24.5	0.010
70.7	1.906	0.488	73.62	2.3	25.5	0.006	76.78	4.9	26.5	0.004
84.1	1.259	0.434	85.18	1.9	13.5	0.184	87.83	3.6	23.5	0.014


T: Wilcoxon signed-ranks statistic comparing measurement to target. p: false-alarm probability for T.

(Color online) Reception scores for phonetic features of VCV consonants filtered in either adaptive-bandwidth mode (dashed line and open symbols) or fixed-bandwidth mode (solid line and filled symbols) vs filter cutoff frequency. Error bars denote ± SE.

Percent-correct scores for all features were converted to rau and analyzed in a mixed-model repeated-measures ANOVA; within-subject main factors included filtering mode and target percent-correct level. Results showed no significant differences between filtering modes for voicing (F[1,4] = 0.01, p = 0.92), manner (F[1,4] = 5.26, p = 0.08), or place (F[1,4] = 4.16, p = 0.11). First-order interactions between mode and target level were significant for both manner and place; these presumably reflect the two inter-mode differences described above.

Fitting of psychometric functions

The recognition and cutoff-frequency data of Fig. 1 were used to derive psychometric curve estimates for each subject. Initial approaches with conventional logistic and Weibull functions (Wichmann and Hill, 2001a) provided poor fits to the data. Subsequently, the nonparametric local linear approach of Zychaluk and Foster (2009) was adopted. The functions produced by this approach are weighted summations of smooth kernel functions, the arguments of which are locally-fitted linear functions of the stimulus variable. The stimulus range for fitting is determined by a smoothing parameter whose value is selected by a “cross-validation” method to minimize deviance for psychometric curve estimates derived from a subset of the data. Note that deviance, defined as twice the difference between the log-likelihood of a theoretical curve estimate with no prediction error and the log-likelihood of the given curve estimate, is given as

D = 2 \sum_{i=1}^{M} \left[ r_{i} \log \frac{r_{i}}{n_{i} P(f_{i})} + (n_{i} - r_{i}) \log \frac{n_{i} - r_{i}}{n_{i} - n_{i} P(f_{i})} \right], \qquad (2)

where M is the number of performance levels investigated, P(f_i) is the given curve estimate for performance level i at cutoff-frequency f_i, n_i is the number of trials at that performance level, and r_i is the number of correctly identified syllables. Deviance is asymptotically distributed as χ²(M) and is useful as a measure of goodness-of-fit. The reader is referred to Wichmann and Hill (2001a) for a discussion of its application to psychometric functions.
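
Eq. 2 can be evaluated directly, as in the following sketch. The function names are hypothetical; the code assumes natural logarithms, predicted probabilities strictly between 0 and 1, and the usual convention that a term with a zero count contributes nothing.

```python
import math


def _x_log_ratio(x, y):
    """x * log(x / y), with the convention that 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x / y)


def deviance(num_correct, num_trials, predicted):
    """Deviance of Eq. 2 for counts r_i out of n_i and predictions P(f_i)."""
    total = 0.0
    for r, n, p in zip(num_correct, num_trials, predicted):
        total += _x_log_ratio(r, n * p) + _x_log_ratio(n - r, n - n * p)
    return 2.0 * total


# A curve that predicts the observed proportion exactly has zero deviance;
# a 60% score against a 50% prediction over 100 trials gives about 4.03.
perfect = deviance([50], [100], [0.5])
mismatch = deviance([60], [100], [0.5])
```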

Custom software developed for local linear fits of psychometric functions (Zychaluk and Foster, 2009) was used to fit curves to each subject’s data for both adaptive-mode and fixed-mode runs. To facilitate convergence for all five subjects, the base-2 logarithm of the cutoff-frequency was used as the stimulus variable, and the cross-validation smoothing parameter was set to a minimum value of 0.6. Measured data and fitted curves for each subject are shown in the five panels of Fig. 3 marked with initials; a sixth panel shows the local linear curve fit to the log of the average cutoff-frequency and percent-correct values, with all sets of adaptive-mode data and fitted curves included for comparison. This average curve is also plotted in the other panels for reference. Inspection of Fig. 3 suggests good agreement in all cases between the curves and the measured data. Agreement between fixed-mode and adaptive-mode data is also good. Small but noticeable differences are visible between the fixed- and adaptive-mode curves of subjects JB and NH; the other subjects’ fixed- and adaptive-mode curves overlap substantially.
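
The idea of a locally fitted curve on a base-2 logarithmic frequency axis can be illustrated with a simplified stand-in. The sketch below is a Gaussian-kernel, weighted least-squares local linear smoother. It is not the local likelihood procedure of Zychaluk and Foster (2009), and using 0.6 as the kernel width only borrows the number quoted above as an assumption. The example data are the mean adaptive-mode values of Table 2.

```python
import math


def local_linear(x, y, x0, bandwidth):
    """Kernel-weighted straight-line fit around x0; returns the fitted
    value at x0, clipped to the range [0, 1]."""
    w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
    s0 = sum(w)
    s1 = sum(wi * (xi - x0) for wi, xi in zip(w, x))
    s2 = sum(wi * (xi - x0) ** 2 for wi, xi in zip(w, x))
    t0 = sum(wi * yi for wi, yi in zip(w, y))
    t1 = sum(wi * (xi - x0) * yi for wi, xi, yi in zip(w, x, y))
    # Solve the 2x2 weighted normal equations for the intercept at x0.
    intercept = (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)
    return min(1.0, max(0.0, intercept))


# Mean adaptive-mode cutoffs (Hz) and proportions correct from Table 2.
cutoffs_hz = [416.0, 759.0, 1462.0, 2259.0, 2868.0]
proportions = [0.1818, 0.3012, 0.4905, 0.6905, 0.8239]
log2_cutoffs = [math.log2(f) for f in cutoffs_hz]
estimate_at_1462 = local_linear(log2_cutoffs, proportions,
                                math.log2(1462.0), bandwidth=0.6)
```

A local linear fit reproduces straight-line data exactly, which makes it less prone to edge bias than a plain kernel average.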

(Color online) Local linear psychometric curves fit to percent-correct recognition vs cutoff frequency data for VCV syllables, low-pass filtered in adaptive-bandwidth mode runs for five subjects. Panels marked “Subject” show scores and curves for a subject’s two trials. Run 1 data are denoted by triangles and dash-dotted lines; run 2 data are denoted by squares and dashed lines. The lower right-hand panel plots the average data measured for each target level (filled symbols), along with their best-fit local linear psychometric curve (solid line); individual subject data (open symbols) and fitted curves (dashed lines) are also shown for reference. The best-fit line is also shown in each of the five subjects’ panels.

Goodness-of-fit for each of the curve estimates of Fig. 3 was assessed using the simulation approach of Wichmann and Hill (2001a). These tests utilize simulated distributions of deviance that closely resemble χ²(5) distributions. The subjects’ curve estimates were used to simulate the number of correct answers the subject would produce at the individual performance levels in each of 10000 experiments. Deviance values computed for each simulation were compiled into simulated deviance distributions against which the deviance values of the measured performance data could be compared. Deviance values for the subjects’ adaptive-mode runs ranged between 0.35 and 2.61, well within the 95%-confidence interval (χ²(5) ≤ 11.07). Deviance values for the fixed-mode runs were higher, ranging between 0.86 and 3.74 for four subjects and reaching a statistically significant value of 15.41 (p = 0.009) for subject NH. This unusually high value is a result of the subject’s unusually high score for rule four trials, which is inconsistent with the other data points and the best-fit curve.
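
The simulation-based goodness-of-fit test can be sketched as a parametric bootstrap. This is a minimal illustration of the general approach rather than the authors' software: the function names, the seed, and the example counts are hypothetical, and the predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math
import random


def _x_log_ratio(x, y):
    """x * log(x / y), with the convention that 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x / y)


def deviance(num_correct, num_trials, predicted):
    """Deviance of Eq. 2."""
    return 2.0 * sum(_x_log_ratio(r, n * p) + _x_log_ratio(n - r, n - n * p)
                     for r, n, p in zip(num_correct, num_trials, predicted))


def simulated_deviance_test(num_correct, num_trials, predicted,
                            n_sim=10000, seed=0):
    """Return the observed deviance and the fraction of simulated
    experiments whose deviance is at least as large."""
    rng = random.Random(seed)
    observed = deviance(num_correct, num_trials, predicted)
    exceed = 0
    for _ in range(n_sim):
        # Draw binomial counts at each level from the fitted curve.
        simulated = [sum(rng.random() < p for _ in range(n))
                     for n, p in zip(num_trials, predicted)]
        if deviance(simulated, num_trials, predicted) >= observed:
            exceed += 1
    return observed, exceed / n_sim


# Hypothetical two-level example with a clearly poor fit at the first level.
d_obs, p_value = simulated_deviance_test([80, 25], [100, 100], [0.5, 0.25],
                                         n_sim=500)
```

A small fraction indicates that the measured data deviate from the fitted curve more than binomial variability alone would explain.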

The reliability of the curve estimates of Fig. 3 was assessed using the pfcmp procedure of Wichmann and Hill (2001a, 2001b) to see whether the first- and second-run curves for each subject were significantly different. pfcmp uses Monte Carlo simulations to construct bivariate normal distributions of inter-curve threshold and slope differences; these are then used to test the null hypothesis that the threshold and slope parameters of the curves are identical. The original pfcmp procedure was modified to use the local linear fitting software of Zychaluk and Foster (2009). Hypothesis tests for each subject (using 2000 Monte Carlo trials) produced p-values ranging between 0.17 and 0.75, supporting the null hypothesis of identical curve parameters.

Convergence properties

Measures of cutoff frequency estimate (CFE) bias and variance were computed to illustrate the adaptive-mode runs’ convergence properties. Bias for the CFE values is shown in the left half of Fig. 4. There, reversal data for each run were used to compute running CFE values in base-2 logarithm units as a function of trial number. (CFE values for trials prior to the fifth reversal were set by default to 9.97, the base-2 log of 1000.) CFE values for subject runs at each target level were averaged to produce average CFE functions; the final CFE for that level was then subtracted from the running CFE data to produce the final result. Note that these “relative bias” functions do not account for the biases observed above in subsection 1. A typical bias function may be divided into three regions:

(1) an early region where none of the runs have achieved four reversals, the CFE is still equal to 9.97, and bias magnitude is maximized;
(2) a transition region where at least one run has achieved four reversals and bias decreases; and,
(3) an asymptotic region where all runs have achieved at least four reversals and the remaining residual bias moves slowly towards zero.
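
The running CFE and relative bias described above can be sketched as follows. The sketch assumes that the running estimate is the mean of the base-2 logarithms of the reversal cutoffs from the fifth reversal onward; the text does not say whether the logarithm was taken before or after averaging. The function names and the toy run are hypothetical.

```python
import math

START_LOG2 = math.log2(1000.0)  # default CFE before the fifth reversal


def running_cfe(reversal_trials, reversal_cutoffs_hz, n_trials,
                first_reversal=5):
    """Running cutoff-frequency estimate, in log2(Hz), after each trial."""
    cfe = []
    for trial in range(n_trials):
        seen = [math.log2(f)
                for t, f in zip(reversal_trials, reversal_cutoffs_hz)
                if t <= trial]
        used = seen[first_reversal - 1:]
        cfe.append(sum(used) / len(used) if used else START_LOG2)
    return cfe


def average_cfe(cfe_per_run):
    """Trial-by-trial mean of several runs' running CFE functions."""
    return [sum(values) / len(values) for values in zip(*cfe_per_run)]


def relative_bias(cfe):
    """Running CFE minus the final CFE, in octaves."""
    return [value - cfe[-1] for value in cfe]


# Toy run: six reversals at the given 0-based trial indices.
cfe = running_cfe([0, 1, 2, 3, 4, 6],
                  [2000.0, 1000.0, 2000.0, 1000.0, 1024.0, 4096.0],
                  n_trials=8)
bias = relative_bias(average_cfe([cfe]))
```

In this toy run the estimate stays at the default value through trial index 3, which corresponds to the early region of the bias function.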

(Color online) Convergence statistics for cutoff-frequency estimates for VCV syllables, low-pass filtered in adaptive-bandwidth mode trials and presented in order of target level. Left panels: relative cutoff-frequency estimate bias (in octaves) vs trial number. Dashed horizontal lines denote a range of ±0.25 octaves (one adaptive step) from zero bias. Dashed vertical lines denote the trial boundary between transition and asymptotic regions: numbers to the right of the line denote the average number of reversals recorded at the trial boundary. Right panels: standard deviation (in octaves) of cutoff-frequency estimates vs trial number. Dashed horizontal lines denote deviations of 0.25 and 0.50 octaves (i.e., one or two adaptive steps) from zero. Numbers in the upper-right-hand corner of the panel denote the average number of reversals occurring in each 115-trial run.

The distinction between transition and asymptotic regions is most evident for the 15.9, 70.7, and 84.1%-correct runs where the initial bias magnitude was greater than one octave. The boundary between regions appears to occur at the first trial for which (a) bias is less than or equal in size to the 1/4-octave step and (b) more than one 1/4-octave-step reversal is available for averaging. These boundary values are denoted by vertical dash-dot lines in Fig. 4. The number of reversals completed at the boundary trials (averaged across all five levels) was 6.06.

The standard deviations (SDs) of the CFE functions for each level are plotted versus trial number in the right half of Fig. 4. Regions of behavior like those of the bias functions are visible, albeit with different boundaries and behavior. The early region (which has constant bias) shows no variation; the transition region shows a transient increase in variation due to inter-track differences in reversal location. The asymptotic region is marked by the gradual approach of the SD to the asymptotic value of 1/4 octave (i.e., one frequency step). For the 15.9, 29.3, and 50%-correct runs, the boundary between the transition and asymptotic regions appears at the trial where the SD equals 1/2 octave (i.e., two frequency steps). For all five runs, the boundary between transition and asymptotic behavior appears before the 40th trial. The SD panels also display the average number of reversals completed during each run. Reversal counts for the 15.9, 29.3, 70.7, and 84.1%-correct runs were consistent with those observed by García-Pérez (2001) in simulations of k-up/1-down adaptive runs with small step sizes. (Data for the 1-up/1-down 50%-correct rule used in that paper could not be compared directly with data for the best-of-3 rule used here.)